Q-Learning
A model-free reinforcement learning algorithm that learns the optimal action-selection policy using a Q-table of state-action values.
What It Is
Q-Learning learns a Q-table where each entry Q(state, action) represents the expected cumulative reward of taking that action in that state and following the optimal policy afterward. The agent updates this table through experience until it converges to the optimal values.
Q-Learning is a foundation of modern value-based RL. It is off-policy (it learns the optimal policy regardless of the behavior used to explore) and it converges to the optimal Q-values provided every state-action pair is visited sufficiently often and the learning rate is suitably decayed.
The Q-Learning Formula
Q(s, a) = Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))
Where:
- s = current state
- a = action taken
- s' = next state after taking action a
- alpha = learning rate (0 to 1) - how fast it learns
- gamma = discount factor (0 to 1) - importance of future rewards
- reward = immediate reward received
- max(Q(s', a')) = best possible Q-value from the next state
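To make the formula concrete, the single update below plugs in illustrative numbers (alpha = 0.1, gamma = 0.9, a current estimate of 2.0, a step reward of -1, and a best next-state value of 5.0). The values are made up for the arithmetic and are not tied to the grid world later in this article.

```python
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor

old_q = 2.0      # current estimate Q(s, a)
reward = -1.0    # immediate reward for the step
next_max = 5.0   # max over a' of Q(s', a')

# TD target: what this one experience says the value "should" be
td_target = reward + gamma * next_max   # -1 + 0.9 * 5.0 = 3.5
# TD error: gap between the target and the current estimate
td_error = td_target - old_q            # 3.5 - 2.0 = 1.5
# Move the estimate a fraction alpha toward the target
new_q = old_q + alpha * td_error        # 2.0 + 0.1 * 1.5 = 2.15

print(new_q)  # 2.15
```

Note that the estimate moves only 10% of the way toward the target; averaging over many noisy experiences is what makes the table converge.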
Key Parameters
alpha (Learning Rate)
Controls how much new information overrides old. High = fast learning but unstable. Low = slow but stable. Typical: 0.1
gamma (Discount Factor)
How much future rewards matter. 0 = greedy (only immediate). 1 = far-sighted. Typical: 0.9
epsilon (Exploration Rate)
Probability of choosing a random action instead of the best known. Balances exploration vs exploitation. Typical: 0.1-0.2
episodes
Number of complete training runs. More episodes = better learning. Typical: 500-10000
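A common refinement, not used in the fixed-epsilon code below, is to decay epsilon across episodes so the agent explores heavily early and exploits later. A minimal sketch, where the starting value, floor, and decay rate are all arbitrary choices:

```python
# Hypothetical epsilon-decay schedule: explore a lot early, exploit later.
epsilon = 1.0        # start fully exploratory
epsilon_min = 0.05   # never stop exploring entirely
decay = 0.995        # multiplicative decay per episode (arbitrary choice)

schedule = []
for episode in range(1000):
    schedule.append(epsilon)
    epsilon = max(epsilon_min, epsilon * decay)

print(schedule[0], schedule[-1])  # 1.0 at the start, clamped to 0.05 by the end
```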
How It Works
Algorithm Steps
- Initialize Q-table to zeros for all (state, action) pairs
- For each episode:
- Start at the initial state
- Choose action using epsilon-greedy: random with probability epsilon, best Q-value otherwise
- Take action, observe new state and reward
- Update Q-table using the Q-learning formula
- Repeat until the episode ends (goal reached or max steps)
- After training: the Q-table contains optimal action values. Always pick the action with the highest Q-value for each state.
Code: Grid World (4x4)
The agent starts at (0,0) and must reach (3,3). Each move costs -1 reward. Reaching the goal gives +10.
```python
import numpy as np
import random

# Grid size (4x4 matrix)
n_rows = 4
n_cols = 4

# Actions
actions = ['up', 'down', 'left', 'right']
action_dict = {'up': 0, 'down': 1, 'left': 2, 'right': 3}

# Q-table [state_row][state_col][action]
q_table = np.zeros((n_rows, n_cols, len(actions)))

# Parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration rate
episodes = 500

# Reward function
def get_reward(state):
    if state == (3, 3):
        return 10
    else:
        return -1

# Environment transition (moves off the grid leave the agent in place)
def take_action(state, action):
    row, col = state
    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, n_rows - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, n_cols - 1)
    return (row, col)

# Training loop
for episode in range(episodes):
    state = (0, 0)
    while state != (3, 3):  # until it reaches the goal
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            # Pick the best action from the Q-table
            action = actions[np.argmax(q_table[state[0], state[1]])]
        new_state = take_action(state, action)
        reward = get_reward(new_state)
        old_q = q_table[state[0], state[1], action_dict[action]]
        next_max = np.max(q_table[new_state[0], new_state[1]])
        # Q-learning formula
        new_q = old_q + alpha * (reward + gamma * next_max - old_q)
        q_table[state[0], state[1], action_dict[action]] = new_q
        state = new_state

print("Training complete!")
```
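One way to sanity-check a trained Q-table is to print the greedy action for every cell as an arrow. The `policy_grid` helper below is an illustrative sketch (it is not part of the training code); it works on any table shaped (rows, cols, actions) with the same action ordering, demonstrated here on a tiny hand-made 2x2 table rather than the trained one.

```python
import numpy as np

# Map action index -> arrow, matching the ordering ['up', 'down', 'left', 'right']
arrows = ['^', 'v', '<', '>']

def policy_grid(q_table):
    """Return the greedy action (as an arrow) for every cell of a Q-table
    shaped (rows, cols, n_actions)."""
    rows, cols, _ = q_table.shape
    return [[arrows[int(np.argmax(q_table[r, c]))] for c in range(cols)]
            for r in range(rows)]

# Hand-made 2x2 example: 'right' is best on the left, 'down' on the right.
demo = np.zeros((2, 2, 4))
demo[0, 0, 3] = 1.0   # (0,0): right
demo[0, 1, 1] = 1.0   # (0,1): down
demo[1, 0, 3] = 1.0   # (1,0): right
demo[1, 1, 1] = 1.0   # (1,1): down (terminal in practice; value arbitrary here)

for row in policy_grid(demo):
    print(' '.join(row))
# > v
# > v
```

Calling `policy_grid(q_table)` on the trained 4x4 table shows the learned route to the goal at a glance.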
Code: Test the Learned Policy
```python
# Show the path taken by the agent from (0,0) to (3,3)
state = (0, 0)
path = [state]
max_steps = n_rows * n_cols  # safety cap: an optimal path never needs more steps

while state != (3, 3) and len(path) <= max_steps:
    # Choose the best action (greedy - no exploration)
    best_action_idx = np.argmax(q_table[state[0], state[1]])
    best_action = actions[best_action_idx]
    # Move to the next state
    new_state = take_action(state, best_action)
    path.append(new_state)
    # Break if stuck against a wall (safety condition)
    if new_state == state:
        print("Agent is stuck!")
        break
    state = new_state

# Print the learned path
print("Path from (0,0) to (3,3):")
for step in path:
    print(step)

# With a converged Q-table the path takes 6 moves, for example:
# (0,0) -> (1,0) -> (2,0) -> (3,0) -> (3,1) -> (3,2) -> (3,3)
# Ties between equally good actions can produce a different, equally short route.
```
Q-Learning vs SARSA
| Feature | Q-Learning | SARSA |
| --- | --- | --- |
| Policy type | Off-policy (learns the optimal policy regardless of behavior) | On-policy (learns from actions actually taken) |
| Update rule | Uses max Q(s', a') for the next state | Uses Q(s', a') where a' is the actual next action |
| Exploration effect | Ignores exploration in Q-value updates | Exploration affects Q-value updates |
| Convergence | Converges to the optimal Q-values | Converges to the values of the policy being followed |
| Risk tolerance | Can learn riskier but optimal paths | Learns safer paths that avoid penalties |
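The difference in the update rule comes down to one line. The sketch below contrasts the two targets on made-up numbers: Q-learning bootstraps from the best next action, SARSA from the action the (possibly exploratory) policy actually picked next. All values here are illustrative.

```python
alpha, gamma = 0.1, 0.9
reward = -1.0
old_q = 2.0
q_next = [0.5, 3.0, -2.0, 1.0]  # illustrative Q(s', a') for the four actions
a_next = 3   # suppose the epsilon-greedy policy actually chose action 3 next

# Q-learning: bootstrap from the BEST next action (off-policy)
q_learning_target = reward + gamma * max(q_next)   # -1 + 0.9 * 3.0 = 1.7
# SARSA: bootstrap from the action ACTUALLY taken next (on-policy)
sarsa_target = reward + gamma * q_next[a_next]     # -1 + 0.9 * 1.0 = -0.1

print(old_q + alpha * (q_learning_target - old_q))
print(old_q + alpha * (sarsa_target - old_q))
```

When exploration occasionally picks bad actions, SARSA's target reflects that risk, which is why it tends to learn safer paths.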
When to Use Q-Learning
| Good For | Not Ideal For |
| --- | --- |
| Small, discrete state/action spaces | Continuous state spaces (use DQN instead) |
| Grid worlds, simple games, routing | Very large state spaces (Q-table becomes huge) |
| Learning optimal policies off-policy | When you need the agent to be safe during training (use SARSA) |
| Foundation for understanding Deep RL | Complex environments like Atari, robotics (use DQN/PPO) |
Q-Learning stores Q-values in a table, which only works for small, discrete state spaces. For large or continuous spaces (images, continuous control), use Deep Q-Networks (DQN) which replace the table with a neural network.
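As a bridge from the table to function approximation, the sketch below replaces the table lookup with a linear model Q(s, a) = w_a · features(s), updated by the same TD rule (a semi-gradient update). Everything here is illustrative made-up data, not a full DQN; a DQN additionally uses a neural network, experience replay, and a target network.

```python
import numpy as np

n_features, n_actions = 4, 2
rng = np.random.default_rng(0)
w = np.zeros((n_actions, n_features))   # one weight vector per action
alpha, gamma = 0.1, 0.9

def q_values(features, w):
    # Q(s, a) = w_a . features(s): the table lookup becomes a dot product
    return w @ features

# One semi-gradient TD update on made-up transition data
s = rng.normal(size=n_features)        # features of the current state
s_next = rng.normal(size=n_features)   # features of the next state
a, reward = 0, -1.0

td_target = reward + gamma * np.max(q_values(s_next, w))
td_error = td_target - q_values(s, w)[a]
w[a] += alpha * td_error * s           # gradient of Q(s, a) w.r.t. w_a is features(s)

print(q_values(s, w))
```

The update touches only the weights of the action taken, mirroring how tabular Q-learning updates a single table cell.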
Tags: Reinforcement Learning, Q-Learning, Model-Free, Value-Based