Deep Q-Network (DQN)
Combining Q-Learning with deep neural networks to handle environments with large or continuous state spaces.
From Q-Learning to DQN
Q-Learning stores Q-values in a table. This works for small state spaces (like a 4x4 grid), but what about an Atari game, where the state is a 210x160-pixel screen with astronomically many possible configurations? You can't enumerate a table that big. DQN replaces the Q-table with a neural network that approximates Q-values for any state.
Q-Table: lookup table mapping (state, action) → value. Works for small spaces.
DQN: neural network that takes a state as input and outputs Q-values for all actions. Works for any state space.
Key Idea
Q-Learning Table:
Q[state][action] = value (lookup from table)
DQN Neural Network:
Q(state; θ) → [Q(a1), Q(a2), ..., Q(an)] (predict with neural net)
The network learns parameters θ to approximate the Q-function.
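The idea can be sketched with plain numpy: a small network with parameters θ maps a state vector to one Q-value per action. The layer sizes and the random initialization here are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Q-network: parameters theta = (W1, b1, W2, b2).
# Input: a state vector; output: one Q-value per action.
state_size, hidden, n_actions = 4, 16, 2
W1 = rng.normal(0, 0.1, (state_size, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, n_actions)); b2 = np.zeros(n_actions)

def q_values(state):
    h = np.maximum(0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                   # linear head: Q(s, a1), ..., Q(s, an)

s = np.array([0.1, -0.2, 0.05, 0.3])
q = q_values(s)                          # shape (n_actions,)
greedy_action = int(np.argmax(q))        # exploit: pick the highest Q-value
```

Training then adjusts θ so these outputs approach the TD targets, instead of overwriting a table cell.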
Two Key Innovations
1. Experience Replay
Instead of training on experiences sequentially (which is correlated and unstable), store experiences in a replay buffer and sample random mini-batches for training. This breaks correlations and reuses data efficiently.
Replay Buffer: [(s, a, r, s'), (s, a, r, s'), ...] (up to 1M transitions)
Each step:
1. Store (state, action, reward, next_state) in buffer
2. Sample random batch of 32-64 experiences
3. Train network on this batch
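The buffer itself is simple; a `deque` with a maximum length gives the store/sample behavior above (the dummy transitions are just for demonstration):

```python
import random
from collections import deque

# Replay buffer: store transitions, sample uncorrelated mini-batches.
buffer = deque(maxlen=1_000_000)  # oldest transitions are evicted automatically

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample(batch_size=32):
    return random.sample(buffer, batch_size)  # uniform, without replacement

# Fill with dummy transitions, then draw a training batch.
for t in range(100):
    store(t, t % 2, 1.0, t + 1, False)
batch = sample(32)
```

Sampling uniformly at random is what breaks the temporal correlation between consecutive experiences.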
2. Target Network
Using the same network for both predictions and targets causes instability. DQN uses two networks: a "policy" network (updated every step) and a "target" network (updated less frequently by copying weights).
Loss = (r + γ * max_a' Q_target(s', a'; θ⁻) - Q(s, a; θ))²
θ = policy network weights (updated every step)
θ⁻ = target network weights (copied from θ every C steps)
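A worked example of this loss for a single transition, with stand-in Q-value outputs (the numbers are made up for illustration):

```python
import numpy as np

gamma = 0.99

# Stand-in network outputs for one transition (s, a, r, s'):
q_policy_s = np.array([1.0, 2.0])       # Q(s, ·; θ)   from the policy net
q_target_s_next = np.array([0.5, 1.5])  # Q(s', ·; θ⁻) from the target net

a, r = 0, 1.0
# TD target uses the FROZEN target weights θ⁻:
y = r + gamma * np.max(q_target_s_next)  # 1.0 + 0.99 * 1.5 = 2.485
td_error = y - q_policy_s[a]             # 2.485 - 1.0 = 1.485
loss = td_error ** 2                     # squared TD error
```

Because θ⁻ stays fixed between copies, the target y does not shift every gradient step, which is what stabilizes training.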
Algorithm
DQN Training Loop
- Initialize replay buffer D with capacity N
- Initialize policy network Q with random weights θ
- Initialize target network Q_target with weights θ⁻ = θ
- For each episode:
- Get initial state s
- For each step:
- With probability ε select random action (explore), otherwise a = argmax_a Q(s; θ) (exploit)
- Execute action a, observe reward r and next state s'
- Store (s, a, r, s') in replay buffer D
- Sample random mini-batch from D
- Compute target: y = r + γ * max_a' Q_target(s', a'; θ⁻)
- Update θ by minimizing (y - Q(s, a; θ))²
- Every C steps: copy θ → θ⁻
Code Implementation
import random
from collections import deque

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=10000)  # Replay buffer
        self.gamma = 0.95                  # Discount factor
        self.epsilon = 1.0                 # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()         # Policy network
        self.target_model = self._build_model()  # Target network
        self.update_target()

    def _build_model(self):
        model = Sequential([
            Dense(64, activation='relu', input_dim=self.state_size),
            Dense(64, activation='relu'),
            Dense(self.action_size, activation='linear')  # One Q-value per action
        ])
        model.compile(optimizer=Adam(learning_rate=self.learning_rate), loss='mse')
        return model

    def update_target(self):
        # Copy policy network weights to the target network
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Epsilon-greedy action selection; state has shape (1, state_size)
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        q_values = self.model.predict(state, verbose=0)
        return np.argmax(q_values[0])

    def replay(self, batch_size=32):
        # Train on a random batch from the replay buffer
        if len(self.memory) < batch_size:
            return
        batch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in batch:
            target = reward
            if not done:
                # Use the TARGET network for stability
                target += self.gamma * np.max(
                    self.target_model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Usage (states must be reshaped to (1, state_size) before calling act):
# agent = DQNAgent(state_size=4, action_size=2)
# for episode in range(500):
#     state = env.reset()
#     for step in range(200):
#         action = agent.act(state)
#         next_state, reward, done, _ = env.step(action)
#         agent.remember(state, action, reward, next_state, done)
#         agent.replay()
#         state = next_state
#         if done:
#             break
#     agent.update_target()  # Update target network every episode
DQN vs Q-Learning
| Feature | Q-Learning | DQN |
|---|---|---|
| State representation | Discrete (table lookup) | Any (image, continuous) |
| Scalability | Small state spaces | Millions of states |
| Generalization | None (exact states only) | Generalizes to similar states |
| Stability | Converges (tabular, under standard conditions) | Needs replay + target network |
| Compute | Very fast | GPU recommended |
DQN only works for discrete action spaces (choose from N actions). For continuous actions (like robot joint angles), use DDPG or SAC instead.