A paradigm where an agent learns optimal behavior by interacting with an environment, receiving rewards for good actions and penalties for bad ones.
Reinforcement Learning is a type of machine learning where an agent learns to make decisions by trial and error. It interacts with an environment, takes actions, observes outcomes, and receives rewards or penalties. The goal is to learn a policy (strategy) that maximizes cumulative reward over time.
RL is fundamentally different from supervised learning: there are no labeled examples. The agent must explore and discover through experience which actions yield the highest rewards.
- **Agent**: The learner/decision-maker (e.g., an AI bot, robot, or game player).
- **Environment**: The world the agent interacts with (e.g., a game, a maze, a road).
- **State**: The current situation the agent observes (e.g., position on a grid).
- **Action**: What the agent does (e.g., move left, accelerate, jump).
- **Reward**: A feedback signal, positive for good actions and negative for bad ones.
- **Policy**: The strategy mapping states to actions. The goal is to find the optimal policy.
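The components above come together in the agent-environment interaction loop: observe a state, take an action, receive a reward and the next state, and repeat until the episode ends. A minimal sketch, using a hypothetical 1-D "walk" environment (states 0 to 4, goal at state 4) and a random policy:

```python
import random

# Hypothetical environment: a 1-D walk over states 0..4.
# Reaching state 4 yields reward +1 and ends the episode.
def step(state, action):
    """Environment dynamics: action -1 moves left, +1 moves right."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state = 0
total_reward = 0.0
while True:
    action = random.choice([-1, 1])       # a random policy, for illustration
    state, reward, done = step(state, action)
    total_reward += reward
    if done:
        break
print(total_reward)  # 1.0: the episode ends exactly when the goal is reached
```

A learning algorithm replaces the random `action = ...` line with a policy that improves from the observed rewards.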
- **Model-Free RL**: No knowledge of environment dynamics; learns purely from experience. Most common in practice. Algorithms: Q-Learning, DQN, SARSA, Policy Gradient.
- **Model-Based RL**: Learns or is given a model of the environment, and can plan ahead by simulating future states. Algorithms: Dyna-Q, Monte Carlo Tree Search (AlphaGo).
| Algorithm | Type | Key Idea |
|---|---|---|
| Q-Learning | Model-Free, Value-based | Learns a table of (state, action) values. Classic and foundational. |
| SARSA | Model-Free, Value-based | Like Q-Learning but updates using the action actually taken (on-policy). |
| DQN | Model-Free, Value-based | Q-Learning with neural networks. Handles complex state spaces (images, games). |
| Policy Gradient | Model-Free, Policy-based | Directly learns the policy function instead of value estimates. |
| Actor-Critic | Model-Free, Hybrid | Combines value-based (Critic) and policy-based (Actor) for faster learning. |
| Monte Carlo | Model-Free | Learns from complete episodes by averaging observed rewards. |
| Dynamic Programming | Model-Based | Uses Bellman equations. Requires full knowledge of environment. |
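To make the table concrete, here is a minimal sketch of tabular Q-Learning on a hypothetical 1-D walk task (states 0 to 4, reward +1 for reaching state 4; the environment and hyperparameters are illustrative assumptions, not from the source):

```python
import random
from collections import defaultdict

# Hypothetical environment: move left (-1) or right (+1) over states 0..4.
def step(state, action):
    next_state = max(0, min(4, state + action))
    return next_state, (1.0 if next_state == 4 else 0.0), next_state == 4

Q = defaultdict(float)              # Q[(state, action)] table, defaults to 0
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
actions = [-1, 1]

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-Learning update: bootstrap from the best next action (off-policy)
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy should prefer moving right (+1) in every state.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(4)])
```

Changing one line, `best_next = Q[(next_state, action_actually_taken)]`, would turn this into SARSA, the on-policy variant from the table.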
The fundamental tradeoff in RL is exploration versus exploitation: trying new actions to discover their rewards, versus choosing the best-known action to collect reward now.
The epsilon-greedy strategy balances both: with probability epsilon, explore randomly; otherwise, exploit the best known action.
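The epsilon-greedy rule can be sketched in a few lines (the action-value numbers below are hypothetical example data):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: best known

random.seed(1)
picks = [epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1) for _ in range(1000)]
print(picks.count(1) / 1000)  # roughly 0.93: mostly exploits action 1
```

Decaying epsilon over time, exploring heavily early and exploiting more as estimates improve, is a common refinement.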
RL requires many episodes of interaction to learn well. Training can be unstable and sample-inefficient. Start with simple environments (like GridWorld) before tackling complex problems.