A paradigm where an agent learns optimal behavior by interacting with an environment, receiving rewards for good actions and penalties for bad ones.
Reinforcement Learning is a type of machine learning where an agent learns to make decisions by trial and error. It interacts with an environment, takes actions, observes outcomes, and receives rewards or penalties. The goal is to learn a policy (strategy) that maximizes cumulative reward over time.
RL is fundamentally different from supervised learning: there are no labeled examples. The agent must explore and discover through experience which actions yield the highest rewards.
- **Agent**: The learner/decision-maker (e.g., an AI bot, robot, or game player).
- **Environment**: The world the agent interacts with (e.g., a game, a maze, a road).
- **State**: The current situation the agent observes (e.g., position on a grid).
- **Action**: What the agent does (e.g., move left, accelerate, jump).
- **Reward**: A feedback signal, positive for good actions and negative for bad ones.
- **Policy**: The strategy mapping states to actions. The goal is to find the optimal policy.
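The components above come together in the agent-environment interaction loop: observe a state, take an action, receive a reward and the next state, and repeat until the episode ends. A minimal sketch, using a hypothetical 1-D "walk" environment (states 0 to 4, goal at state 4) and a random policy:

```python
import random

# Hypothetical environment: a 1-D walk over states 0..4.
# Reaching state 4 yields reward +1 and ends the episode.
def step(state, action):
    """Environment dynamics: action -1 moves left, +1 moves right."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state = 0
total_reward = 0.0
while True:
    action = random.choice([-1, 1])       # a random policy, for illustration
    state, reward, done = step(state, action)
    total_reward += reward
    if done:
        break
print(total_reward)  # 1.0: the episode ends exactly when the goal is reached
```

A learning algorithm replaces the random `action = ...` line with a policy that improves from the observed rewards.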
- **Model-Free RL**: No knowledge of environment dynamics; learns purely from experience. Most common in practice. Algorithms: Q-Learning, DQN, SARSA, Policy Gradient.
- **Model-Based RL**: Learns or is given a model of the environment, and can plan ahead by simulating future states. Algorithms: Dyna-Q, Monte Carlo Tree Search (AlphaGo).
| Algorithm | Type | Key Idea |
|---|---|---|
| Q-Learning | Model-Free, Value-based | Learns a table of (state, action) values. Classic and foundational. |
| SARSA | Model-Free, Value-based | Like Q-Learning but updates using the action actually taken (on-policy). |
| DQN | Model-Free, Value-based | Q-Learning with neural networks. Handles complex state spaces (images, games). |
| Policy Gradient | Model-Free, Policy-based | Directly learns the policy function instead of value estimates. |
| Actor-Critic | Model-Free, Hybrid | Combines value-based (Critic) and policy-based (Actor) for faster learning. |
| Monte Carlo | Model-Free | Learns from complete episodes by averaging observed rewards. |
| Dynamic Programming | Model-Based | Uses Bellman equations. Requires full knowledge of environment. |
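To make the table concrete, here is a minimal sketch of tabular Q-Learning on a hypothetical 1-D walk task (states 0 to 4, reward +1 for reaching state 4; the environment and hyperparameters are illustrative assumptions, not from the source):

```python
import random
from collections import defaultdict

# Hypothetical environment: move left (-1) or right (+1) over states 0..4.
def step(state, action):
    next_state = max(0, min(4, state + action))
    return next_state, (1.0 if next_state == 4 else 0.0), next_state == 4

Q = defaultdict(float)              # Q[(state, action)] table, defaults to 0
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
actions = [-1, 1]

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-Learning update: bootstrap from the best next action (off-policy)
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy should prefer moving right (+1) in every state.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(4)])
```

Changing one line, `best_next = Q[(next_state, action_actually_taken)]`, would turn this into SARSA, the on-policy variant from the table.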
The fundamental tradeoff in RL is exploration versus exploitation: trying new actions to discover their rewards, versus choosing the best-known action to collect reward now.
The epsilon-greedy strategy balances both: with probability epsilon, explore randomly; otherwise, exploit the best known action.
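The epsilon-greedy rule can be sketched in a few lines (the action-value numbers below are hypothetical example data):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: best known

random.seed(1)
picks = [epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1) for _ in range(1000)]
print(picks.count(1) / 1000)  # roughly 0.93: mostly exploits action 1
```

Decaying epsilon over time, exploring heavily early and exploiting more as estimates improve, is a common refinement.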
RL requires many episodes of interaction to learn well. Training can be unstable and sample-inefficient. Start with simple environments (like GridWorld) before tackling complex problems.