Learning by Doing: Monte Carlo Methods in Reinforcement Learning

By CamelEdge
Updated on Thu Apr 17 2025
Introduction
In our previous post, we introduced the concept of a Markov Decision Process (MDP) — the mathematical framework underlying decision-making in uncertain environments.
To quickly recap, an MDP is defined by a 5-tuple $(S, A, P, R, \gamma)$, where:
- $S$ is the set of all possible states.
- $A(s)$ is the set of actions available in a given state $s$.
- $P(s' \mid s, a)$ is the transition probability, the likelihood of landing in state $s'$ after taking action $a$ in state $s$.
- $R(s, a, s')$ is the reward function, giving the immediate payoff of a transition.
- $\gamma \in [0, 1]$ is the discount factor, balancing immediate and future rewards.
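To make the recap concrete, here is a minimal sketch of how such an MDP could be stored in Python. The dictionary layout and the tiny two-state example are illustrative assumptions, not part of any particular library.

```python
# A minimal, illustrative MDP representation (hypothetical two-state example).
# P[s][a] maps to a list of (next_state, probability, reward) tuples.
mdp = {
    "states": ["s0", "s1"],
    "actions": {"s0": ["stay", "go"], "s1": ["stay"]},
    "P": {
        "s0": {
            "stay": [("s0", 1.0, 0.0)],
            "go":   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],  # stochastic transition
        },
        "s1": {"stay": [("s1", 1.0, 0.0)]},
    },
    "gamma": 0.9,  # discount factor
}
```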
In that setting, we assume full knowledge of how the environment behaves — the transition function and reward function are known. This allows us to use offline planning algorithms (like Value Iteration or Policy Iteration) that simulate the world internally without needing to interact with the real environment.
But what happens when we don't know how the world works?
Reinforcement Learning: Learning Through Interaction
This is where Reinforcement Learning (RL) steps in. In RL, the agent no longer has access to the underlying transition dynamics or reward function. Instead, it must learn by doing — by interacting with the environment, taking actions, and observing what happens.
You can think of it like learning to play a video game without reading the manual. You explore, make mistakes, see what works, and improve your strategy over time.
In RL, we typically tackle two main tasks:
1. Prediction
In the prediction problem, we aim to estimate the value function $V^\pi(s)$ — the expected return when starting from state $s$ and following policy $\pi$. This helps the agent understand how good each state is under a given strategy.
2. Control
In the control problem, we go a step further: we want to find the optimal policy — the strategy that maximizes expected return over time.
Two Families of RL Algorithms
Depending on how we learn from experience, RL methods fall into two major families:
🔹 Monte Carlo Methods
Monte Carlo (MC) methods learn after each complete episode. The agent plays out an entire sequence of states, actions, and rewards, and only then updates its estimates based on the total return observed.
- These methods are online in the sense that they collect data from the environment.
- But they are also batch learners — updates happen at the end of an episode.
- Monte Carlo methods are particularly well-suited when episodes are well-defined (like games with clear end points).
🔸 Temporal-Difference (TD) Methods
Temporal-Difference methods, on the other hand, learn on the fly — they update value estimates after every step, using a combination of observed reward and current value estimates.
- These methods blend ideas from Monte Carlo and dynamic programming.
- TD methods are more flexible, especially in continuous or long-horizon tasks where waiting for an episode to end would be inefficient.
What Are Monte Carlo Methods?
Monte Carlo (MC) methods are a family of reinforcement learning algorithms that learn from complete episodes of experience. That is, they wait until an episode ends before making any updates.
To use Monte Carlo methods, we assume:
- Episodes eventually terminate (i.e., we’re in an episodic setting).
- We can sample episodes by following some policy in the environment.
- We can observe the total return that follows each state or state-action pair.
Monte Carlo for Prediction
Let’s say we’re given a policy $\pi$, and we want to estimate its value function $V^\pi(s)$.
First-Visit MC Prediction
- For each state $s$, keep a list of the returns received the first time $s$ is visited in each episode.
- Then estimate the value as the average of those returns: $V^\pi(s) \approx \operatorname{average}(\text{Returns}(s))$.
Every-Visit MC Prediction
- Instead of just the first visit, consider every time a state is visited in an episode.
- Still average the returns following each occurrence.
Both approaches converge to the true value function as the number of episodes increases — they differ in how they use episode data.
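To make this concrete, here is a minimal sketch of first-visit MC prediction in Python. The `sample_episode(policy)` helper is an assumption of this sketch: it is taken to return one complete episode as a list of `(state, reward)` pairs, where the reward is the one received on the transition out of that state.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, sample_episode, num_episodes, gamma=0.5):
    """Estimate V^pi by averaging first-visit returns over sampled episodes."""
    returns = defaultdict(list)   # state -> list of observed first-visit returns
    V = {}                        # state -> current value estimate

    for _ in range(num_episodes):
        # One complete episode: [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        episode = sample_episode(policy)

        # Earliest time step at which each state appears (its first visit).
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G

        # First-visit update: average the return from the earliest occurrence only.
        for state, t in first_visit.items():
            returns[state].append(returns_at[t])
            V[state] = sum(returns[state]) / len(returns[state])

    return V
```

Switching to every-visit MC only changes the final loop: append the return observed at every occurrence of a state instead of just its first one.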
Example

Let’s consider an example using a 3x4 grid world:
- Entering a terminal state yields a reward of +1 or –1.
- Moving to any non-terminal state results in a small penalty of –0.1.
- The discount factor is set to $\gamma = 0.5$.
When we follow a policy $\pi$ in this environment, it generates a trajectory — a sequence of states, actions, and rewards:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The return at time $t$, denoted as $G_t$, is the sum of discounted rewards starting from that time step:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

This return can also be computed recursively using the relationship:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
Let’s say we have an episode in the 3x4 gridworld that follows this sequence of states:

$$(4,1) \rightarrow (3,1) \rightarrow (3,2) \rightarrow (3,3) \rightarrow \text{terminal}$$

The rewards received after each transition are:

$$R_1 = -0.1, \quad R_2 = -0.1, \quad R_3 = -0.1, \quad R_4 = +1$$

Let’s calculate the returns backward using $G_t = R_{t+1} + \gamma G_{t+1}$ with $\gamma = 0.5$:

- $G_3 = R_4 = +1$
- $G_2 = R_3 + 0.5 \cdot G_3 = -0.1 + 0.5 = +0.4$
- $G_1 = R_2 + 0.5 \cdot G_2 = -0.1 + 0.2 = +0.1$
- $G_0 = R_1 + 0.5 \cdot G_1 = -0.1 + 0.05 = -0.05$

We use first-visit Monte Carlo to update the value of each visited state by averaging its returns. After this episode, we append each $G_t$ to the return list of the corresponding state:
| State | Return $G_t$ | New Estimate |
|---|---|---|
| (4,1) | –0.05 | Avg of all returns for (4,1) |
| (3,1) | +0.1 | Avg of all returns for (3,1) |
| (3,2) | +0.4 | Avg of all returns for (3,2) |
| (3,3) | +1 | Avg of all returns for (3,3) |
The actual new value depends on the number of episodes seen so far — we update by averaging over all observed returns for each state. Over time, this leads to a more accurate estimate of the value function under the current policy.
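If you want to verify these numbers yourself, the backward computation can be reproduced in a few lines of Python; the states and rewards below simply restate the episode above.

```python
# Episode from the 3x4 gridworld example: each state is paired with the
# reward received on the transition out of that state.
states  = [(4, 1), (3, 1), (3, 2), (3, 3)]
rewards = [-0.1, -0.1, -0.1, +1.0]
gamma = 0.5

# Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
returns = {}
for state, reward in zip(reversed(states), reversed(rewards)):
    G = reward + gamma * G
    returns[state] = G

print(returns)
# {(3, 3): 1.0, (3, 2): 0.4, (3, 1): 0.1, (4, 1): -0.05}
```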
Monte Carlo for Control
So far, we’ve seen how Monte Carlo Prediction helps us evaluate a fixed policy by estimating its state-value function — that is, how good it is to be in each state when following .
But what if our goal is not just to evaluate a policy — what if we want to improve it and ultimately learn the optimal policy?
To do that, we need more than just state values. We need to consider different actions in each state — not just the ones the policy currently recommends. This means we need exploration, and a way to evaluate the outcomes of all possible actions.
Enter the Action-Value Function: The action-value function $Q^\pi(s, a)$ gives us the expected return of taking action $a$ in state $s$, and then following policy $\pi$ afterward:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$
This is the foundation of Monte Carlo Control:
By estimating $Q^\pi(s, a)$ for all state-action pairs, we can improve the policy by choosing actions with higher estimated returns.
Imagine an agent navigating a gridworld.
To ensure it explores all possible actions, we begin each episode by randomly sampling the first action (shown in red). After that, the agent follows its current policy (indicated by black arrows).
This setup allows the agent to gather experience not just for the actions the policy recommends, but for other possible actions too — helping it gradually discover which actions lead to better long-term outcomes.
By combining policy evaluation (estimating from experience) with policy improvement (choosing better actions based on those estimates), Monte Carlo Control can learn optimal behavior over time — even in complex environments.
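As a small illustration of the improvement step, here is one way the greedy update could look in Python; the dictionary layout for `Q` and the `actions` mapping are assumptions of this sketch.

```python
def greedy_policy_from_q(Q, actions):
    """For each state, pick the action with the highest estimated Q-value.

    Q       : dict mapping (state, action) -> estimated return
    actions : dict mapping state -> list of available actions
    """
    policy = {}
    for state, available in actions.items():
        policy[state] = max(available, key=lambda a: Q.get((state, a), 0.0))
    return policy
```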
Exploring Starts
One way to ensure all state-action pairs are tried over time is to start each episode from a randomly chosen state-action pair (this is called exploring starts). Here's the idea:
- Generate many episodes using the current policy.
- Estimate the action-value function $Q(s, a)$ by averaging the returns observed after each $(s, a)$ pair.
- Improve the policy: at each state, choose the action with the highest estimated value (greedy policy).
- Repeat until convergence.
But exploring starts isn’t always practical. So we need something better...
ε-Greedy Monte Carlo Control
Instead of relying on random starts, we can encourage exploration directly in the policy by using a stochastic policy — one that assigns non-zero probability to every action in every state.
A common approach is the ε-greedy policy. In an ε-greedy policy, the agent mostly follows the current best-known action (exploitation), but with a small probability $\varepsilon$, it chooses a random action (exploration):
- With probability $1 - \varepsilon$: choose the action with the highest estimated value.
- With probability $\varepsilon$: choose a random action uniformly from all possible actions.
Formally, the policy at state $s$ is:

$$\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\varepsilon}{|A(s)|} & \text{otherwise}
\end{cases}$$
This ensures continued exploration, which is crucial for convergence to the optimal policy, especially in early episodes.
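In code, ε-greedy action selection takes only a few lines; the function and argument names below are illustrative rather than part of any library.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Greedy with probability 1 - epsilon, uniformly random with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))     # exploit
```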
Now that we’re exploring more systematically, we can apply the Generalized Policy Iteration loop:
- Initialize the action-value function $Q(s, a)$ arbitrarily.
- Repeat for each episode:
  - Generate an episode using the current ε-greedy policy.
  - For each state-action pair $(s, a)$ in the episode:
    - Compute the return $G_t$ that follows it.
    - Update $Q(s, a)$ by averaging the observed returns (first-visit or every-visit).
  - Improve the policy by making it ε-greedy with respect to the updated $Q$.
Over time, this process drives the policy toward the optimal policy, as exploration and exploitation work hand in hand.
This approach avoids the need for artificial constructs like exploring starts, and it’s easy to implement — making it one of the most widely used techniques in reinforcement learning.
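Putting the pieces together, here is a compact sketch of first-visit, ε-greedy Monte Carlo control. It assumes a hypothetical `env` object with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, plus a fixed list of `actions`; these interfaces are assumptions of the sketch, not a specific library's API.

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, actions, num_episodes, gamma=0.5, epsilon=0.1):
    """First-visit, epsilon-greedy Monte Carlo control (a sketch)."""
    Q = defaultdict(float)        # (state, action) -> estimated return
    returns = defaultdict(list)   # (state, action) -> list of observed returns

    def choose_action(state):
        # Epsilon-greedy behavior policy with respect to the current Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # 1. Generate an episode with the current epsilon-greedy policy.
        episode = []              # list of (state, action, reward)
        state, done = env.reset(), False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns_at = []
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns_at.append((state, action, G))

        # 3. First-visit update of Q by averaging observed returns.
        visited = set()
        for state, action, G in reversed(returns_at):   # forward order again
            if (state, action) not in visited:
                visited.add((state, action))
                returns[(state, action)].append(G)
                Q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])

    return Q
```

Because the behavior policy itself is ε-greedy, the improvement step is folded into `choose_action`: as soon as `Q` changes, subsequent action choices are greedy with respect to the new estimates, apart from the ε share of exploration.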
Pros and Cons of Monte Carlo Methods
✅ Advantages:
- Simple to implement.
- No need to know or estimate environment dynamics.
- Great for problems with clear episodic structure.
❌ Limitations:
- Need complete episodes to update — not ideal for ongoing tasks.
- High variance in returns can slow down learning.
- Can be inefficient if episodes are long or sparse in rewards.
Summary
Monte Carlo methods offer a natural, intuitive approach to learning from experience. They shine in settings where episodes are well-defined and rewards are meaningful at the end of a sequence.
To recap, MC methods:
- Estimate value functions by averaging observed returns.
- Can be used for both prediction and control.
- Rely on complete episodes and exploration to ensure convergence.
Next up: we’ll look at Temporal-Difference (TD) learning — a different approach that learns from incomplete episodes, updates at each step, and tends to converge faster in many real-world settings.
Stay tuned! 🧠⚡