
Deep Q-Network (DQN)

Combining Q-Learning with deep neural networks to handle environments with large or continuous state spaces.

From Q-Learning to DQN

Q-Learning stores Q-values in a table. This works for small state spaces (like a 4x4 grid), but an Atari game has an astronomically large number of possible screen images; no table could hold them all. DQN replaces the Q-table with a neural network that approximates Q-values for any state.

Q-Table: lookup table mapping (state, action) → value. Works for small spaces.
DQN: neural network that takes a state as input and outputs Q-values for all actions. Works for any state space.

Key Idea

Q-Learning table:   Q[state][action] = value                      (lookup from table)
DQN neural network: Q(state; θ) → [Q(a1), Q(a2), ..., Q(an)]      (predict with neural net)

The network learns parameters θ to approximate the Q-function.
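To make the interface difference concrete, here is a minimal sketch in NumPy. The single linear layer is a hypothetical stand-in for the neural network; the shapes (state_size=4, action_size=2) are illustrative, not from the original.

```python
import numpy as np

# Tabular Q-Learning: one entry per (state, action) pair.
q_table = np.zeros((16, 4))          # e.g. a 4x4 grid world with 4 actions
q_table[5, 2] = 0.7                  # lookup/update is just array indexing

# DQN: a parametric function Q(state; θ) that outputs one value per action.
# A single linear layer stands in for the neural network here.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))      # state_size=4, action_size=2

def q_values(state, theta):
    """Approximate Q-values for all actions from one state vector."""
    return state @ theta

state = np.array([0.1, -0.2, 0.05, 0.3])
print(q_values(state, theta).shape)  # one Q-value per action: (2,)
```

The key property: nearby state vectors produce nearby Q-values, which is exactly the generalization a lookup table cannot provide.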

Two Key Innovations

1. Experience Replay

Instead of training on experiences sequentially (which is correlated and unstable), store experiences in a replay buffer and sample random mini-batches for training. This breaks correlations and reuses data efficiently.

Replay Buffer: [(s, a, r, s'), (s, a, r, s'), ...]  (up to 1M transitions)

Each step:
  1. Store (state, action, reward, next_state) in the buffer
  2. Sample a random batch of 32-64 experiences
  3. Train the network on this batch
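The buffer itself is just a bounded queue plus uniform sampling. A minimal sketch (the dummy transition values are placeholders):

```python
import random
from collections import deque

# Bounded buffer: once full, the oldest transitions are dropped automatically.
buffer = deque(maxlen=1_000_000)

# Store transitions as the agent acts (dummy numbers here).
for t in range(100):
    transition = (t, t % 4, 1.0, t + 1)   # (s, a, r, s')
    buffer.append(transition)

# Sample a random mini-batch; random.sample draws without replacement,
# which breaks the temporal correlation between consecutive transitions.
batch = random.sample(buffer, 32)
print(len(batch))   # 32
```

Uniform sampling is the original DQN choice; prioritized replay (sampling high-error transitions more often) is a common later refinement.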

2. Target Network

Using the same network for both predictions and targets causes instability. DQN uses two networks: a "policy" network (updated every step) and a "target" network (updated less frequently by copying weights).

Loss = (r + γ · max_a' Q_target(s', a'; θ⁻) − Q(s, a; θ))²

θ  = policy network weights (updated every step)
θ⁻ = target network weights (copied from θ every C steps)
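The target term of this loss can be computed for a whole batch in one vectorized step. A sketch with made-up Q-value outputs (batch of 3 transitions, 2 actions); note that terminal transitions do not bootstrap, so the γ term is zeroed when done is true:

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(1)

# Hypothetical target-network outputs Q_target(s', ·; θ⁻) for 3 next-states.
q_target_next = rng.normal(size=(3, 2))
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([False, True, False])   # no bootstrapping at terminal states

# y = r + γ · max_a' Q_target(s', a'; θ⁻), zeroing the bootstrap term when done
targets = rewards + gamma * q_target_next.max(axis=1) * (~dones)
print(targets.shape)   # one target per transition: (3,)
```

Because θ⁻ is held fixed between copies, the targets y stay stable for C steps, which prevents the network from chasing its own moving predictions.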

Algorithm

DQN Training Loop
  1. Initialize replay buffer D with capacity N
  2. Initialize policy network Q with random weights θ
  3. Initialize target network Q_target with weights θ⁻ = θ
  4. For each episode:
    1. Get initial state s
    2. For each step:
      1. With probability ε select random action (explore), otherwise a = argmax_a Q(s; θ) (exploit)
      2. Execute action a, observe reward r and next state s'
      3. Store (s, a, r, s') in replay buffer D
      4. Sample random mini-batch from D
      5. Compute target: y = r + γ * max_a' Q_target(s', a'; θ⁻)
      6. Update θ by minimizing (y - Q(s, a; θ))²
      7. Every C steps: copy θ → θ⁻

Code Implementation

import numpy as np
import random
from collections import deque


class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)        # Replay buffer
        self.gamma = 0.95                        # Discount factor
        self.epsilon = 1.0                       # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()         # Policy network
        self.target_model = self._build_model()  # Target network
        self.update_target()

    def _build_model(self):
        from tensorflow.keras import Sequential
        from tensorflow.keras.layers import Dense
        from tensorflow.keras.optimizers import Adam
        model = Sequential([
            Dense(64, activation='relu', input_dim=self.state_size),
            Dense(64, activation='relu'),
            Dense(self.action_size, activation='linear')  # Q-value per action
        ])
        model.compile(optimizer=Adam(learning_rate=self.learning_rate), loss='mse')
        return model

    def update_target(self):
        # Copy policy network weights to the target network
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection; state must have shape (1, state_size)
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        # Train on a random batch from the replay buffer
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                # Use the TARGET network for stability
                target += self.gamma * np.max(
                    self.target_model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Usage:
# agent = DQNAgent(state_size=4, action_size=2)
# for episode in range(500):
#     state = env.reset()
#     for step in range(200):
#         action = agent.act(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.remember(state, action, reward, next_state, done)
#         agent.replay()
#         state = next_state
#     agent.update_target()  # Update target network every episode

DQN vs Q-Learning

Feature              | Q-Learning                            | DQN
---------------------|---------------------------------------|------------------------------
State representation | Discrete (table lookup)               | Any (image, continuous)
Scalability          | Small state spaces                    | Millions of states
Generalization       | None (exact states only)              | Generalizes to similar states
Stability            | Converges (under standard conditions) | Needs replay + target network
Compute              | Very fast                             | GPU recommended

DQN only works for discrete action spaces (choose from N actions). For continuous actions (like robot joint angles), use DDPG or SAC instead.