
Q-Learning

A model-free reinforcement learning algorithm that learns the optimal action-selection policy using a Q-table of state-action values.

What It Is

Q-Learning learns a Q-table where each entry Q(state, action) represents the expected cumulative reward of taking that action in that state and following the optimal policy afterward. The agent updates this table through experience until it converges to the optimal values.

Q-Learning is a foundation of modern RL. It is off-policy (it learns the optimal policy regardless of the agent's current behavior) and converges to the optimal Q-values under standard conditions: every state-action pair is visited sufficiently often and the learning rate decays appropriately.

The Q-Learning Formula

Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max(Q(s', a')) - Q(s, a))

Where:
  s      = current state
  a      = action taken
  s'     = next state after taking action a
  alpha  = learning rate (0 to 1) - how fast it learns
  gamma  = discount factor (0 to 1) - importance of future rewards
  reward = immediate reward received
  max(Q(s', a')) = best possible Q-value from the next state
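To see the update in action, here is a single application with made-up numbers (all values below are assumptions chosen for illustration, not outputs of a real run):

```python
# One Q-learning update with illustrative numbers (all values are assumptions)
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor

old_q = 2.0    # current estimate Q(s, a)
reward = -1    # immediate reward received
next_max = 5.0 # max over a' of Q(s', a')

# new_q = 2.0 + 0.1 * (-1 + 0.9 * 5.0 - 2.0) = 2.0 + 0.1 * 1.5 = 2.15
new_q = old_q + alpha * (reward + gamma * next_max - old_q)
print(round(new_q, 2))
```

Note how the estimate moves only a fraction (alpha) of the way toward the target `reward + gamma * next_max`.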

Key Parameters

alpha (Learning Rate)

Controls how much new information overrides old. High = fast learning but unstable. Low = slow but stable. Typical: 0.1

gamma (Discount Factor)

How much future rewards matter. 0 = greedy (only immediate). 1 = far-sighted. Typical: 0.9

epsilon (Exploration Rate)

Probability of choosing a random action instead of the best known. Balances exploration vs exploitation. Typical: 0.1-0.2
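Epsilon-greedy selection is simple to implement. A minimal sketch, assuming a NumPy array `q_values` holding the Q-table row for the current state (names are illustrative):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return int(np.argmax(q_values))             # exploit

q_values = np.array([0.1, 0.5, 0.2, 0.3])
# With epsilon = 0 the choice is always the greedy action (index 1 here)
print(epsilon_greedy(q_values, 0.0))
```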

episodes

Number of complete training runs. More episodes = better learning. Typical: 500-10000

How It Works

Algorithm Steps
  1. Initialize Q-table to zeros for all (state, action) pairs
  2. For each episode:
     a. Start at the initial state
     b. Choose an action using epsilon-greedy: random with probability epsilon, best Q-value otherwise
     c. Take the action, observe the new state and reward
     d. Update the Q-table using the Q-learning formula
     e. Repeat until the episode ends (goal reached or max steps)
  3. After training, the Q-table contains the learned action values. Always pick the action with the highest Q-value for each state.

Code: Grid World (4x4)

The agent starts at (0,0) and must reach (3,3). Each move costs -1 reward. Reaching the goal gives +10.

```python
import numpy as np
import random

# Grid size (4x4)
n_rows = 4
n_cols = 4

# Actions
actions = ['up', 'down', 'left', 'right']
action_dict = {'up': 0, 'down': 1, 'left': 2, 'right': 3}

# Q-table indexed as [state_row][state_col][action]
q_table = np.zeros((n_rows, n_cols, len(actions)))

# Parameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.2  # exploration rate
episodes = 500

# Reward function
def get_reward(state):
    if state == (3, 3):
        return 10
    return -1

# Environment transition (moves are clipped at the grid edges)
def take_action(state, action):
    row, col = state
    if action == 'up':
        row = max(row - 1, 0)
    elif action == 'down':
        row = min(row + 1, n_rows - 1)
    elif action == 'left':
        col = max(col - 1, 0)
    elif action == 'right':
        col = min(col + 1, n_cols - 1)
    return (row, col)

# Training loop
for episode in range(episodes):
    state = (0, 0)
    while state != (3, 3):  # until the goal is reached
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.choice(actions)
        else:
            action = actions[np.argmax(q_table[state[0], state[1]])]

        new_state = take_action(state, action)
        reward = get_reward(new_state)

        old_q = q_table[state[0], state[1], action_dict[action]]
        next_max = np.max(q_table[new_state[0], new_state[1]])

        # Q-learning update
        new_q = old_q + alpha * (reward + gamma * next_max - old_q)
        q_table[state[0], state[1], action_dict[action]] = new_q

        state = new_state

print("Training complete!")
```

Code: Test the Learned Policy

```python
# Show the path taken by the agent from (0,0) to (3,3)
state = (0, 0)
path = [state]

while state != (3, 3):
    # Choose the best action (greedy - no exploration)
    best_action_idx = np.argmax(q_table[state[0], state[1]])
    best_action = actions[best_action_idx]

    # Move to the next state
    new_state = take_action(state, best_action)
    path.append(new_state)

    # Break if stuck (safety condition)
    if new_state == state:
        print("Agent is stuck!")
        break
    state = new_state

# Print the learned path
print("Optimal path from (0,0) to (3,3):")
for step in path:
    print(step)

# Example output: a shortest 6-move path such as
# (0,0) -> (1,0) -> (2,0) -> (3,0) -> (3,1) -> (3,2) -> (3,3)
# (ties in the Q-table can produce a different, equally short path)
```

Q-Learning vs SARSA

| Feature | Q-Learning | SARSA |
| --- | --- | --- |
| Policy type | Off-policy (learns optimal regardless of behavior) | On-policy (learns from actions actually taken) |
| Update rule | Uses max Q(s', a') for the next state | Uses Q(s', a') where a' is the actual next action |
| Exploration effect | Ignores exploration in Q-value updates | Exploration affects Q-value updates |
| Convergence | Converges to optimal Q-values | Converges to the policy being followed |
| Risk tolerance | Can learn riskier but optimal paths | Learns safer paths that avoid penalties |
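The update-rule difference can be made concrete with a short side-by-side sketch. The numbers and names below (`q_row_next`, `a_next`) are illustrative assumptions, not part of the grid-world code above:

```python
import numpy as np

alpha, gamma = 0.1, 0.9                  # illustrative hyperparameters
q_row_next = np.array([1.0, 3.0, 2.0])   # Q(s', .) for the next state
reward = -1

# Q-learning (off-policy): bootstrap from the BEST next action
q_learning_target = reward + gamma * np.max(q_row_next)   # -1 + 0.9 * 3.0 = 1.7

# SARSA (on-policy): bootstrap from the action ACTUALLY taken next,
# e.g. an exploratory action with index 0
a_next = 0
sarsa_target = reward + gamma * q_row_next[a_next]        # -1 + 0.9 * 1.0 = -0.1

print(round(q_learning_target, 2), round(sarsa_target, 2))
```

Because SARSA's target reflects the exploratory action it really took, penalties incurred while exploring push it toward safer paths.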

When to Use Q-Learning

| Good For | Not Ideal For |
| --- | --- |
| Small, discrete state/action spaces | Continuous state spaces (use DQN instead) |
| Grid worlds, simple games, routing | Very large state spaces (Q-table becomes huge) |
| Learning optimal policies off-policy | When the agent must behave safely during training (use SARSA) |
| Foundation for understanding Deep RL | Complex environments like Atari, robotics (use DQN/PPO) |

Q-Learning stores Q-values in a table, which only works for small, discrete state spaces. For large or continuous spaces (images, continuous control), use a Deep Q-Network (DQN), which replaces the table with a neural network.
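As a rough sketch of that idea (not a full DQN; layer sizes, weights, and the example state are all arbitrary assumptions), the table lookup `q_table[state]` becomes a function approximator that maps a state vector to one Q-value per action:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny Q-network sketch: state vector in, one Q-value per action out.
# All dimensions here are arbitrary choices for illustration.
state_dim, hidden_dim, n_actions = 4, 16, 3

W1 = rng.normal(0, 0.1, (state_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0, 0.1, (hidden_dim, n_actions))
b2 = np.zeros(n_actions)

def q_network(state):
    """Replaces the Q-table lookup with a forward pass."""
    h = np.maximum(0, state @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2                  # one Q-value per action

state = np.array([0.0, 0.25, 0.5, 0.75])  # e.g. normalized coordinates
q_values = q_network(state)
print(q_values.shape)  # one estimate per action
```

A real DQN also needs training machinery (experience replay, a target network, gradient descent on the TD error), but the interface is the same: feed in a state, read out Q-values, act greedily or epsilon-greedily over them.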
