Q-Learning
A model-free reinforcement learning algorithm that learns the optimal action-selection policy using a Q-table of state-action values.
What It Is
Q-Learning learns a Q-table where each entry Q(state, action) represents the expected cumulative reward of taking that action in that state and following the optimal policy afterward. The agent updates this table through experience until it converges to the optimal values.
Q-Learning is a foundation of modern value-based RL. It is off-policy (it learns the optimal policy regardless of the behavior used to explore) and it converges to the optimal Q-values provided every state-action pair is visited sufficiently often and the learning rate is suitably decayed.
The Q-Learning Formula
Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))
Where:
- s = current state
- a = action taken
- s' = next state after taking action a
- alpha = learning rate (0 to 1) - how fast it learns
- gamma = discount factor (0 to 1) - importance of future rewards
- reward = immediate reward received
- max(Q(s', a')) = best possible Q-value from the next state
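To make the formula concrete, the single update below plugs in illustrative numbers (alpha = 0.1, gamma = 0.9, a current estimate of 2.0, a step reward of -1, and a best next-state value of 5.0). The values are made up for the arithmetic and are not tied to the grid world later in this article.

```python
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

old_q = 2.0      # current estimate Q(s, a)
reward = -1.0    # immediate reward for the step
next_max = 5.0   # max over a' of Q(s', a')

# TD target: what this one experience says the value "should" be
td_target = reward + gamma * next_max   # -1 + 0.9 * 5.0 = 3.5
# TD error: gap between the target and the current estimate
td_error = td_target - old_q            # 3.5 - 2.0 = 1.5
# Move the estimate a fraction alpha toward the target
new_q = old_q + alpha * td_error        # 2.0 + 0.1 * 1.5 = 2.15

print(new_q)  # 2.15
```

Note that the estimate moves only 10% of the way toward the target; averaging over many noisy experiences is what makes the table converge.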
Key Parameters
alpha (Learning Rate)
Controls how much new information overrides old. High = fast learning but unstable. Low = slow but stable. Typical: 0.1
gamma (Discount Factor)
How much future rewards matter. 0 = greedy (only immediate). 1 = far-sighted. Typical: 0.9
epsilon (Exploration Rate)
Probability of choosing a random action instead of the best known. Balances exploration vs exploitation. Typical: 0.1-0.2
episodes
Number of complete training runs. More episodes = better learning. Typical: 500-10000
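A common refinement, not used in the fixed-epsilon code below, is to decay epsilon across episodes so the agent explores heavily early and exploits later. A minimal sketch, where the starting value, floor, and decay rate are all arbitrary choices:

```python
# Hypothetical epsilon-decay schedule: explore a lot early, exploit later.
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.05   # never stop exploring entirely
decay = 0.995        # multiplicative decay per episode (arbitrary choice)

schedule = []
for episode in range(1000):
    schedule.append(epsilon)
    epsilon = max(epsilon_min, epsilon * decay)

print(schedule[0], schedule[-1])  # 1.0 at the start, clamped to 0.05 by the end
```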
How It Works
Algorithm Steps
- Initialize Q-table to zeros for all (state, action) pairs
- For each episode:
- Start at the initial state
- Choose action using epsilon-greedy: random with probability epsilon, best Q-value otherwise
- Take action, observe new state and reward
- Update Q-table using the Q-learning formula
- Repeat until the episode ends (goal reached or max steps)
- After training: the Q-table contains optimal action values. Always pick the action with the highest Q-value for each state.
Code: Grid World (4x4)
The agent starts at (0,0) and must reach (3,3). Each move costs -1 reward. Reaching the goal gives +10.
```python
import numpy as np
import random

# Grid size (4x4 matrix)
n_rows = 4
n_cols = 4

# Actions
actions = ['up', 'down', 'left', 'right']
action_dict = {'up': 0, 'down': 1, 'left': 2, 'right': 3}

# Q-table [state_row][state_col][action]
q_table = np.zeros((n_rows, n_cols, len(actions)))

# Parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration rate
episodes = 500

# Reward function
def get_reward(state):
    if state == (3, 3):
        return 10
    else:
        return -1

# Environment transition (moves off the grid leave the agent in place)
def take_action(state, action):
    row, col = state
    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, n_rows - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, n_cols - 1)
    return (row, col)

# Training loop
for episode in range(episodes):
    state = (0, 0)
    while state != (3, 3):  # until it reaches the goal
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            # Pick the best action from the Q-table
            action = actions[np.argmax(q_table[state[0], state[1]])]
        new_state = take_action(state, action)
        reward = get_reward(new_state)
        old_q = q_table[state[0], state[1], action_dict[action]]
        next_max = np.max(q_table[new_state[0], new_state[1]])
        # Q-learning formula
        new_q = old_q + alpha * (reward + gamma * next_max - old_q)
        q_table[state[0], state[1], action_dict[action]] = new_q
        state = new_state

print("Training complete!")
```
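One way to sanity-check a trained Q-table is to print the greedy action for every cell as an arrow. The `policy_grid` helper below is an illustrative sketch (it is not part of the training code); it works on any table shaped (rows, cols, actions) with the same action ordering, demonstrated here on a tiny hand-made 2x2 table rather than the trained one.

```python
import numpy as np

# Map action index -> arrow, matching the ordering ['up', 'down', 'left', 'right']
arrows = ['^', 'v', '<', '>']

def policy_grid(q_table):
    """Return the greedy action (as an arrow) for every cell of a Q-table
    shaped (rows, cols, n_actions)."""
    rows, cols, _ = q_table.shape
    return [[arrows[int(np.argmax(q_table[r, c]))] for c in range(cols)]
            for r in range(rows)]

# Hand-made 2x2 example: 'right' is best on the left, 'down' on the right.
demo = np.zeros((2, 2, 4))
demo[0, 0, 3] = 1.0   # (0,0): right
demo[0, 1, 1] = 1.0   # (0,1): down
demo[1, 0, 3] = 1.0   # (1,0): right
demo[1, 1, 1] = 1.0   # (1,1): down (terminal in practice; value arbitrary here)

for row in policy_grid(demo):
    print(' '.join(row))
# > v
# > v
```

Calling `policy_grid(q_table)` on the trained 4x4 table shows the learned route to the goal at a glance.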
Code: Test the Learned Policy
```python
# Show the path taken by the agent from (0,0) to (3,3)
state = (0, 0)
path = [state]
max_steps = n_rows * n_cols  # safety cap: an optimal path never needs more steps

while state != (3, 3) and len(path) <= max_steps:
    # Choose the best action (greedy - no exploration)
    best_action_idx = np.argmax(q_table[state[0], state[1]])
    best_action = actions[best_action_idx]
    # Move to the next state
    new_state = take_action(state, best_action)
    path.append(new_state)
    # Break if stuck against a wall (safety condition)
    if new_state == state:
        print("Agent is stuck!")
        break
    state = new_state

# Print the learned path
print("Path from (0,0) to (3,3):")
for step in path:
    print(step)

# With a converged Q-table the path takes 6 moves, for example:
# (0,0) -> (1,0) -> (2,0) -> (3,0) -> (3,1) -> (3,2) -> (3,3)
# Ties between equally good actions can produce a different, equally short route.
```
Q-Learning vs SARSA
| Feature | Q-Learning | SARSA |
| --- | --- | --- |
| Policy type | Off-policy (learns the optimal policy regardless of behavior) | On-policy (learns from actions actually taken) |
| Update rule | Uses max Q(s', a') for the next state | Uses Q(s', a') where a' is the actual next action |
| Exploration effect | Ignores exploration in Q-value updates | Exploration affects Q-value updates |
| Convergence | Converges to the optimal Q-values | Converges to the values of the policy being followed |
| Risk tolerance | Can learn riskier but optimal paths | Learns safer paths that avoid penalties |
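The difference in the update rule comes down to one line. The sketch below contrasts the two targets on made-up numbers: Q-learning bootstraps from the best next action, SARSA from the action the (possibly exploratory) policy actually picked next. All values here are illustrative.

```python
alpha, gamma = 0.1, 0.9
reward = -1.0
old_q = 2.0
q_next = [0.5, 3.0, -2.0, 1.0]  # illustrative Q(s', a') for the four actions
a_next = 3   # suppose the epsilon-greedy policy actually chose action 3 next

# Q-learning: bootstrap from the BEST next action (off-policy)
q_learning_target = reward + gamma * max(q_next)   # -1 + 0.9 * 3.0 = 1.7
# SARSA: bootstrap from the action ACTUALLY taken next (on-policy)
sarsa_target = reward + gamma * q_next[a_next]     # -1 + 0.9 * 1.0 = -0.1

print(old_q + alpha * (q_learning_target - old_q))
print(old_q + alpha * (sarsa_target - old_q))
```

When exploration occasionally picks bad actions, SARSA's target reflects that risk, which is why it tends to learn safer paths.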
When to Use Q-Learning
| Good For | Not Ideal For |
| --- | --- |
| Small, discrete state/action spaces | Continuous state spaces (use DQN instead) |
| Grid worlds, simple games, routing | Very large state spaces (Q-table becomes huge) |
| Learning optimal policies off-policy | When you need the agent to be safe during training (use SARSA) |
| Foundation for understanding Deep RL | Complex environments like Atari, robotics (use DQN/PPO) |
Q-Learning stores Q-values in a table, which only works for small, discrete state spaces. For large or continuous spaces (images, continuous control), use Deep Q-Networks (DQN) which replace the table with a neural network.
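As a bridge from the table to function approximation, the sketch below replaces the table lookup with a linear model Q(s, a) = w_a · features(s), updated by the same TD rule (a semi-gradient update). Everything here is illustrative made-up data, not a full DQN; a DQN additionally uses a neural network, experience replay, and a target network.

```python
import numpy as np

n_features, n_actions = 4, 2
rng = np.random.default_rng(0)
w = np.zeros((n_actions, n_features))   # one weight vector per action
alpha, gamma = 0.1, 0.9

def q_values(features, w):
    # Q(s, a) = w_a . features(s): the table lookup becomes a dot product
    return w @ features

# One semi-gradient TD update on made-up transition data
s = rng.normal(size=n_features)        # features of the current state
s_next = rng.normal(size=n_features)   # features of the next state
a, reward = 0, -1.0

td_target = reward + gamma * np.max(q_values(s_next, w))
td_error = td_target - q_values(s, w)[a]
w[a] += alpha * td_error * s           # gradient of Q(s, a) w.r.t. w_a is features(s)

print(q_values(s, w))
```

The update touches only the weights of the action taken, mirroring how tabular Q-learning updates a single table cell.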
Tags: Reinforcement Learning, Q-Learning, Model-Free, Value-Based