
Deep Q-Network (DQN)

Combining Q-Learning with deep neural networks to handle environments with large or continuous state spaces.

From Q-Learning to DQN

Q-Learning stores Q-values in a table. This works for small state spaces (like a 4x4 grid), but an Atari game has an astronomically large number of possible screen images; no table could hold them all. DQN replaces the Q-table with a neural network that approximates Q-values for any state.

Q-Table: lookup table mapping (state, action) → value. Works for small spaces.
DQN: neural network that takes a state as input and outputs Q-values for all actions. Works for any state space.

Key Idea

Q-Learning table:   Q[state][action] = value                      (lookup from table)
DQN neural network: Q(state; θ) → [Q(a1), Q(a2), ..., Q(an)]      (predict with neural net)

The network learns parameters θ to approximate the Q-function.
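To make the interface difference concrete, here is a minimal sketch in NumPy. The single linear layer is a hypothetical stand-in for the neural network; the shapes (state_size=4, action_size=2) are illustrative, not from the original.

```python
import numpy as np

# Tabular Q-Learning: one entry per (state, action) pair.
q_table = np.zeros((16, 4))          # e.g. a 4x4 grid world with 4 actions
q_table[5, 2] = 0.7                  # lookup/update is just array indexing

# DQN: a parametric function Q(state; θ) that outputs one value per action.
# A single linear layer stands in for the neural network here.
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))      # state_size=4, action_size=2

def q_values(state, theta):
    """Approximate Q-values for all actions from one state vector."""
    return state @ theta

state = np.array([0.1, -0.2, 0.05, 0.3])
print(q_values(state, theta).shape)  # one Q-value per action: (2,)
```

The key property: nearby state vectors produce nearby Q-values, which is exactly the generalization a lookup table cannot provide.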

Two Key Innovations

1. Experience Replay

Instead of training on experiences sequentially (which is correlated and unstable), store experiences in a replay buffer and sample random mini-batches for training. This breaks correlations and reuses data efficiently.

Replay Buffer: [(s, a, r, s'), (s, a, r, s'), ...]  (up to 1M transitions)

Each step:
  1. Store (state, action, reward, next_state) in the buffer
  2. Sample a random batch of 32-64 experiences
  3. Train the network on this batch
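The buffer itself is just a bounded queue plus uniform sampling. A minimal sketch (the dummy transition values are placeholders):

```python
import random
from collections import deque

# Bounded buffer: once full, the oldest transitions are dropped automatically.
buffer = deque(maxlen=1_000_000)

# Store transitions as the agent acts (dummy numbers here).
for t in range(100):
    transition = (t, t % 4, 1.0, t + 1)   # (s, a, r, s')
    buffer.append(transition)

# Sample a random mini-batch; random.sample draws without replacement,
# which breaks the temporal correlation between consecutive transitions.
batch = random.sample(buffer, 32)
print(len(batch))   # 32
```

Uniform sampling is the original DQN choice; prioritized replay (sampling high-error transitions more often) is a common later refinement.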

2. Target Network

Using the same network for both predictions and targets causes instability. DQN uses two networks: a "policy" network (updated every step) and a "target" network (updated less frequently by copying weights).

Loss = (r + γ · max_a' Q_target(s', a'; θ⁻) − Q(s, a; θ))²

θ  = policy network weights (updated every step)
θ⁻ = target network weights (copied from θ every C steps)
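The target term of this loss can be computed for a whole batch in one vectorized step. A sketch with made-up Q-value outputs (batch of 3 transitions, 2 actions); note that terminal transitions do not bootstrap, so the γ term is zeroed when done is true:

```python
import numpy as np

gamma = 0.99
rng = np.random.default_rng(1)

# Hypothetical target-network outputs Q_target(s', ·; θ⁻) for 3 next-states.
q_target_next = rng.normal(size=(3, 2))
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([False, True, False])   # no bootstrapping at terminal states

# y = r + γ · max_a' Q_target(s', a'; θ⁻), zeroing the bootstrap term when done
targets = rewards + gamma * q_target_next.max(axis=1) * (~dones)
print(targets.shape)   # one target per transition: (3,)
```

Because θ⁻ is held fixed between copies, the targets y stay stable for C steps, which prevents the network from chasing its own moving predictions.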

Algorithm

DQN Training Loop
  1. Initialize replay buffer D with capacity N
  2. Initialize policy network Q with random weights θ
  3. Initialize target network Q_target with weights θ⁻ = θ
  4. For each episode:
    1. Get initial state s
    2. For each step:
      1. With probability ε select random action (explore), otherwise a = argmax_a Q(s; θ) (exploit)
      2. Execute action a, observe reward r and next state s'
      3. Store (s, a, r, s') in replay buffer D
      4. Sample random mini-batch from D
      5. Compute target: y = r + γ * max_a' Q_target(s', a'; θ⁻)
      6. Update θ by minimizing (y - Q(s, a; θ))²
      7. Every C steps: copy θ → θ⁻

Code Implementation

import numpy as np
import random
from collections import deque


class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)        # Replay buffer
        self.gamma = 0.95                        # Discount factor
        self.epsilon = 1.0                       # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()         # Policy network
        self.target_model = self._build_model()  # Target network
        self.update_target()

    def _build_model(self):
        from tensorflow.keras import Sequential
        from tensorflow.keras.layers import Dense
        from tensorflow.keras.optimizers import Adam
        model = Sequential([
            Dense(64, activation='relu', input_dim=self.state_size),
            Dense(64, activation='relu'),
            Dense(self.action_size, activation='linear')  # Q-value per action
        ])
        model.compile(optimizer=Adam(learning_rate=self.learning_rate), loss='mse')
        return model

    def update_target(self):
        # Copy policy network weights to the target network
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection; state must have shape (1, state_size)
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        # Train on a random batch from the replay buffer
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                # Use the TARGET network for stability
                target += self.gamma * np.max(
                    self.target_model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Usage:
# agent = DQNAgent(state_size=4, action_size=2)
# for episode in range(500):
#     state = env.reset()
#     for step in range(200):
#         action = agent.act(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.remember(state, action, reward, next_state, done)
#         agent.replay()
#         state = next_state
#     agent.update_target()  # Update target network every episode

DQN vs Q-Learning

Feature              | Q-Learning                            | DQN
---------------------|---------------------------------------|------------------------------
State representation | Discrete (table lookup)               | Any (image, continuous)
Scalability          | Small state spaces                    | Millions of states
Generalization       | None (exact states only)              | Generalizes to similar states
Stability            | Converges (under standard conditions) | Needs replay + target network
Compute              | Very fast                             | GPU recommended

DQN only works for discrete action spaces (choose from N actions). For continuous actions (like robot joint angles), use DDPG or SAC instead.