
Reinforcement Learning (RL)

A paradigm where an agent learns optimal behavior by interacting with an environment, receiving rewards for good actions and penalties for bad ones.

What It Is

Reinforcement Learning is a type of machine learning where an agent learns to make decisions by trial and error. It interacts with an environment, takes actions, observes outcomes, and receives rewards or penalties. The goal is to learn a policy (strategy) that maximizes cumulative reward over time.
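"Cumulative reward over time" is usually computed as a discounted return, where future rewards are down-weighted by a factor gamma. A minimal sketch (the gamma value here is an illustrative default, not from the text):

```python
# Discounted return: cumulative reward where rewards t steps in the future
# are weighted by gamma**t, for a discount factor gamma in (0, 1].
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Three rewards of 1.0 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
print(round(discounted_return([1.0, 1.0, 1.0], gamma=0.9), 6))  # 2.71
```

With gamma close to 1 the agent values long-term reward; with gamma close to 0 it becomes short-sighted.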

RL is fundamentally different from supervised learning. There are no labeled examples. The agent must explore and discover which actions yield the best rewards through experience.

Core Components

Agent

The learner/decision-maker (e.g., an AI bot, robot, or game player).

Environment

The world the agent interacts with (e.g., a game, a maze, a road).

State

The current situation the agent observes (e.g., position on a grid).

Action

What the agent does (e.g., move left, accelerate, jump).

Reward

Feedback signal: positive for good actions, negative for bad ones.

Policy

The strategy mapping states to actions. The goal is to find the optimal policy.
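In its simplest form a policy is just a lookup table from states to actions. The grid states and action names below are invented for illustration:

```python
# Minimal tabular policy: map each observed state to an action.
# States are (row, col) grid positions; the actions are illustrative.
policy = {
    (0, 0): "right",
    (0, 1): "right",
    (0, 2): "down",
}

def act(state):
    return policy[state]

print(act((0, 1)))  # right
```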

How It Works

The RL Loop
  1. Observe the current state of the environment
  2. Choose an action based on the current policy
  3. Execute the action — environment transitions to a new state
  4. Receive a reward (positive or negative feedback)
  5. Update the policy based on the experience
  6. Repeat for many episodes until the agent learns the optimal behavior
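The loop above can be sketched against a toy environment. The ToyWalk class, its states, and its reward values are invented for illustration; real projects typically use an environment API such as Gymnasium, which has the same observe/act/reward shape:

```python
import random

# Toy one-dimensional "walk" environment: start at cell 0, goal is the
# last cell. Rewards (-1 per step, +10 at the goal) are illustrative.
class ToyWalk:
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state == self.size - 1
        reward = 10.0 if done else -1.0
        return self.state, reward, done

env = ToyWalk()
state = env.reset()                         # 1. observe the current state
total = 0.0
for _ in range(20):
    action = random.choice([-1, 1])         # 2. choose an action (random policy)
    state, reward, done = env.step(action)  # 3-4. act; get new state and reward
    total += reward                         # 5. a learner would update here
    if done:
        break                               # 6. episode ends; repeat for many episodes
print("episode return:", total)
```

Here the policy is random; a learning algorithm would use each (state, action, reward, next state) experience at step 5 to improve it.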

Example: Game AI

Car racing game:
  - Stay on track: +1 point
  - Go off road: -5 points
  - Complete a lap: +10 points

The AI tries different strategies, observes which actions lead to higher cumulative reward, and improves over time.
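The scoring above can be written directly as a reward function. The boolean flags are an assumed encoding of the game events:

```python
# Reward function for the racing example: +1 on track, -5 off road,
# +10 for completing a lap. The boolean inputs are an assumed encoding.
def racing_reward(on_track: bool, completed_lap: bool) -> float:
    if completed_lap:
        return 10.0
    return 1.0 if on_track else -5.0

print(racing_reward(on_track=True, completed_lap=False))   # 1.0
print(racing_reward(on_track=False, completed_lap=False))  # -5.0
```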

Types of RL

Model-Free

No knowledge of environment dynamics. Learns purely from experience. Most common in practice. Algorithms: Q-Learning, DQN, SARSA, Policy Gradient.
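As a sketch of the model-free, value-based idea, here is the classic tabular Q-Learning update; the learning rate and discount are illustrative defaults:

```python
from collections import defaultdict

# One tabular Q-Learning update:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# alpha (learning rate) and gamma (discount) are illustrative defaults.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)            # unseen (state, action) pairs start at 0
actions = ["left", "right"]
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
print(Q[(0, "right")])  # 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

Note that the update needs only sampled experience, never the environment's transition probabilities, which is what "model-free" means.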

Model-Based

Learns or is given a model of the environment. Can plan ahead by simulating future states. Algorithms: Dyna-Q, Monte Carlo Tree Search (AlphaGo).

Popular RL Algorithms

Algorithm | Type | Key Idea
Q-Learning | Model-Free, Value-based | Learns a table of (state, action) values. Classic and foundational.
SARSA | Model-Free, Value-based | Like Q-Learning but updates using the action actually taken (on-policy).
DQN | Model-Free, Value-based | Q-Learning with neural networks. Handles complex state spaces (images, games).
Policy Gradient | Model-Free, Policy-based | Directly learns the policy function instead of value estimates.
Actor-Critic | Model-Free, Hybrid | Combines value-based (Critic) and policy-based (Actor) for faster learning.
Monte Carlo | Model-Free | Learns from complete episodes by averaging observed rewards.
Dynamic Programming | Model-Based | Uses Bellman equations. Requires full knowledge of the environment.
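The on-policy/off-policy distinction between SARSA and Q-Learning shows up in the one-step target each method bootstraps from; gamma here is an illustrative discount:

```python
# SARSA (on-policy) bootstraps from the value of the next action actually
# taken; Q-Learning (off-policy) bootstraps from the greedy (max) action.
def sarsa_target(r, q_next_taken, gamma=0.99):
    return r + gamma * q_next_taken

def q_learning_target(r, q_next_all, gamma=0.99):
    return r + gamma * max(q_next_all)

q_next = [0.5, 2.0]                      # values of the two next actions
print(sarsa_target(1.0, q_next[0]))      # suppose action 0 was actually taken
print(q_learning_target(1.0, q_next))    # always uses the max, action 1 here
```

When the behavior policy happens to take the greedy action, the two targets coincide; otherwise Q-Learning learns about the greedy policy while following a different one.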

Exploration vs Exploitation

The fundamental tradeoff in RL:

Exploration means trying new actions to discover whether they yield better rewards. Exploitation means choosing the best-known action to collect reward now. Too much exploration wastes time on poor actions; too much exploitation can lock the agent into a suboptimal strategy.

The epsilon-greedy strategy balances both: with probability epsilon, explore randomly; otherwise, exploit the best known action.
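Epsilon-greedy selection, as described, is a few lines of code:

```python
import random

# Epsilon-greedy: with probability epsilon pick a random action (explore);
# otherwise pick the action with the highest estimated value (exploit).
def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploit

print(epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.0))  # 1 (pure exploitation)
```

In practice epsilon often starts near 1 (mostly exploring) and is decayed toward a small value as the agent's estimates improve.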

Real-World Applications

RL powers game-playing agents (such as AlphaGo's use of Monte Carlo Tree Search, noted above), robot control, and autonomous driving. It requires many episodes of interaction to learn well, and training can be unstable and sample-inefficient, so start with simple environments (like GridWorld) before tackling complex problems.
