Learning by Doing: Monte Carlo Methods in Reinforcement Learning

By CamelEdge
Updated on Thu Apr 17 2025
Introduction
In our previous post, we introduced the concept of a Markov Decision Process (MDP) — the mathematical framework underlying decision-making in uncertain environments.
To quickly recap, an MDP is defined by a 5-tuple $(S, A, P, R, \gamma)$, where:
- $S$ is the set of all possible states.
- $A(s)$ is the set of actions available in a given state $s$.
- $P(s' \mid s, a)$ is the transition probability, the likelihood of landing in state $s'$ after taking action $a$ in state $s$.
- $R(s, a, s')$ is the reward function, giving the immediate payoff of a transition.
- $\gamma \in [0, 1]$ is the discount factor, balancing immediate and future rewards.
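To make the recap concrete, here is a minimal sketch of how such an MDP could be stored in Python. The dictionary layout and the tiny two-state example are illustrative assumptions, not part of any particular library.

```python
# A minimal, illustrative MDP representation (hypothetical two-state example).
# P[s][a] maps to a list of (next_state, probability, reward) tuples.
mdp = {
    "states": ["s0", "s1"],
    "actions": {"s0": ["stay", "go"], "s1": ["stay"]},
    "P": {
        "s0": {
            "stay": [("s0", 1.0, 0.0)],
            "go":   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)],  # stochastic transition
        },
        "s1": {"stay": [("s1", 1.0, 0.0)]},
    },
    "gamma": 0.9,  # discount factor
}
```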
In that setting, we assume full knowledge of how the environment behaves — the transition function and reward function are known. This allows us to use offline planning algorithms (like Value Iteration or Policy Iteration) that simulate the world internally without needing to interact with the real environment.
But what happens when we don't know how the world works?
Reinforcement Learning: Learning Through Interaction
This is where Reinforcement Learning (RL) steps in. In RL, the agent no longer has access to the underlying transition dynamics or reward function. Instead, it must learn by doing — by interacting with the environment, taking actions, and observing what happens.
You can think of it like learning to play a video game without reading the manual. You explore, make mistakes, see what works, and improve your strategy over time.
In RL, we typically tackle two main tasks:
1. Prediction
In the prediction problem, we aim to estimate the value function $V^\pi(s)$ — the expected return when starting from state $s$ and following policy $\pi$. This helps the agent understand how good each state is under a given strategy.
2. Control
In the control problem, we go a step further: we want to find the optimal policy — the strategy that maximizes expected return over time.
Two Families of RL Algorithms
Depending on how we learn from experience, RL methods fall into two major families:
🔹 Monte Carlo Methods
Monte Carlo (MC) methods learn after each complete episode. The agent plays out an entire sequence of states, actions, and rewards, and only then updates its estimates based on the total return observed.
- These methods are online in the sense that they collect data from the environment.
- But they are also batch learners — updates happen at the end of an episode.
- Monte Carlo methods are particularly well-suited when episodes are well-defined (like games with clear end points).
🔸 Temporal-Difference (TD) Methods
Temporal-Difference methods, on the other hand, learn on the fly — they update value estimates after every step, using a combination of observed reward and current value estimates.
- These methods blend ideas from Monte Carlo and dynamic programming.
- TD methods are more flexible, especially in continuous or long-horizon tasks where waiting for an episode to end would be inefficient.
What Are Monte Carlo Methods?
Monte Carlo (MC) methods are a family of reinforcement learning algorithms that learn from complete episodes of experience. That is, they wait until an episode ends before making any updates.
To use Monte Carlo methods, we assume:
- Episodes eventually terminate (i.e., we’re in an episodic setting).
- We can sample episodes by following some policy in the environment.
- We can observe the total return that follows each state or state-action pair.
Monte Carlo for Prediction
Let’s say we’re given a policy $\pi$, and we want to estimate its value function $V^\pi(s)$.
First-Visit MC Prediction
- For each state $s$, keep a list of the returns received the first time $s$ is visited in each episode.
- Then estimate the value as the average of those returns: $V^\pi(s) \approx \operatorname{average}(\text{Returns}(s))$.
Every-Visit MC Prediction
- Instead of just the first visit, consider every time a state is visited in an episode.
- Still average the returns following each occurrence.
Both approaches converge to the true value function as the number of episodes increases — they differ in how they use episode data.
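To make this concrete, here is a minimal sketch of first-visit MC prediction in Python. The `sample_episode(policy)` helper is an assumption of this sketch: it is taken to return one complete episode as a list of `(state, reward)` pairs, where the reward is the one received on the transition out of that state.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, sample_episode, num_episodes, gamma=0.5):
    """Estimate V^pi by averaging first-visit returns over sampled episodes."""
    returns = defaultdict(list)   # state -> list of observed first-visit returns
    V = {}                        # state -> current value estimate

    for _ in range(num_episodes):
        # One complete episode: [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
        episode = sample_episode(policy)

        # Earliest time step at which each state appears (its first visit).
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns_at = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns_at[t] = G

        # First-visit update: average the return from the earliest occurrence only.
        for state, t in first_visit.items():
            returns[state].append(returns_at[t])
            V[state] = sum(returns[state]) / len(returns[state])

    return V
```

Switching to every-visit MC only changes the final loop: append the return observed at every occurrence of a state instead of just its first one.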
Example

Let’s consider an example using a 3x4 grid world:
- Entering a terminal state yields a reward of +1 or –1.
- Moving to any non-terminal state results in a small penalty of –0.1.
- The discount factor is set to $\gamma = 0.5$.
When we follow a policy $\pi$ in this environment, it generates a trajectory — a sequence of states, actions, and rewards:

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots$$

The return at time $t$, denoted as $G_t$, is the sum of discounted rewards starting from that time step:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

This return can also be computed recursively using the relationship:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
Let’s say we have an episode in the 3x4 gridworld that follows this sequence of states:

$$(4,1) \rightarrow (3,1) \rightarrow (3,2) \rightarrow (3,3) \rightarrow \text{terminal}$$

The rewards received after each transition are:

$$R_1 = -0.1, \quad R_2 = -0.1, \quad R_3 = -0.1, \quad R_4 = +1$$

Let’s calculate the returns backward using $G_t = R_{t+1} + \gamma G_{t+1}$ with $\gamma = 0.5$:

- $G_3 = R_4 = +1$
- $G_2 = R_3 + 0.5 \cdot G_3 = -0.1 + 0.5 = +0.4$
- $G_1 = R_2 + 0.5 \cdot G_2 = -0.1 + 0.2 = +0.1$
- $G_0 = R_1 + 0.5 \cdot G_1 = -0.1 + 0.05 = -0.05$

We use first-visit Monte Carlo to update the value of each visited state by averaging its returns. After this episode, we append each $G_t$ to the return list of the corresponding state:
| State | Return $G_t$ | New Estimate |
|---|---|---|
| (4,1) | –0.05 | Avg of all returns for (4,1) |
| (3,1) | +0.1 | Avg of all returns for (3,1) |
| (3,2) | +0.4 | Avg of all returns for (3,2) |
| (3,3) | +1 | Avg of all returns for (3,3) |
The actual new value depends on the number of episodes seen so far — we update by averaging over all observed returns for each state. Over time, this leads to a more accurate estimate of the value function under the current policy.
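If you want to verify these numbers yourself, the backward computation can be reproduced in a few lines of Python; the states and rewards below simply restate the episode above.

```python
# Episode from the 3x4 gridworld example: each state is paired with the
# reward received on the transition out of that state.
states  = [(4, 1), (3, 1), (3, 2), (3, 3)]
rewards = [-0.1, -0.1, -0.1, +1.0]
gamma = 0.5

# Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
returns = {}
for state, reward in zip(reversed(states), reversed(rewards)):
    G = reward + gamma * G
    returns[state] = G

print(returns)
# {(3, 3): 1.0, (3, 2): 0.4, (3, 1): 0.1, (4, 1): -0.05}
```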
Monte Carlo for Control
So far, we’ve seen how Monte Carlo Prediction helps us evaluate a fixed policy by estimating its state-value function — that is, how good it is to be in each state when following .
But what if our goal is not just to evaluate a policy — what if we want to improve it and ultimately learn the optimal policy?
To do that, we need more than just state values. We need to consider different actions in each state — not just the ones the policy currently recommends. This means we need exploration, and a way to evaluate the outcomes of all possible actions.
Enter the Action-Value Function: The action-value function $Q^\pi(s, a)$ gives us the expected return of taking action $a$ in state $s$, and then following policy $\pi$ afterward:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$
This is the foundation of Monte Carlo Control:
By estimating $Q^\pi(s, a)$ for all state-action pairs, we can improve the policy by choosing actions with higher estimated returns.
Imagine an agent navigating a gridworld.
To ensure it explores all possible actions, we begin each episode by randomly sampling the first action (shown in red). After that, the agent follows its current policy (indicated by black arrows).
This setup allows the agent to gather experience not just for the actions the policy recommends, but for other possible actions too — helping it gradually discover which actions lead to better long-term outcomes.
By combining policy evaluation (estimating from experience) with policy improvement (choosing better actions based on those estimates), Monte Carlo Control can learn optimal behavior over time — even in complex environments.
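As a small illustration of the improvement step, here is one way the greedy update could look in Python; the dictionary layout for `Q` and the `actions` mapping are assumptions of this sketch.

```python
def greedy_policy_from_q(Q, actions):
    """For each state, pick the action with the highest estimated Q-value.

    Q       : dict mapping (state, action) -> estimated return
    actions : dict mapping state -> list of available actions
    """
    policy = {}
    for state, available in actions.items():
        policy[state] = max(available, key=lambda a: Q.get((state, a), 0.0))
    return policy
```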
Exploring Starts
One way to ensure all state-action pairs are tried over time is to start each episode from a randomly chosen state-action pair (this is called exploring starts). Here's the idea:
- Generate many episodes using the current policy.
- Estimate the action-value function $Q(s, a)$ by averaging the returns observed after each $(s, a)$ pair.
- Improve the policy: at each state, choose the action with the highest estimated value (greedy policy).
- Repeat until convergence.
But exploring starts isn’t always practical. So we need something better...
ε-Greedy Monte Carlo Control
Instead of relying on random starts, we can encourage exploration directly in the policy by using a stochastic policy — one that assigns non-zero probability to every action in every state.
A common approach is the ε-greedy policy. In an ε-greedy policy, the agent mostly follows the current best-known action (exploitation), but with a small probability $\varepsilon$, it chooses a random action (exploration):
- With probability $1 - \varepsilon$: choose the action with the highest estimated value.
- With probability $\varepsilon$: choose a random action uniformly from all possible actions.
Formally, the policy at state $s$ is:

$$\pi(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\varepsilon}{|A(s)|} & \text{otherwise}
\end{cases}$$
This ensures continued exploration, which is crucial for convergence to the optimal policy, especially in early episodes.
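In code, ε-greedy action selection takes only a few lines; the function and argument names below are illustrative rather than part of any library.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Greedy with probability 1 - epsilon, uniformly random with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                             # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))     # exploit
```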
Now that we’re exploring more systematically, we can apply the Generalized Policy Iteration loop:
- Initialize the action-value function $Q(s, a)$ arbitrarily.
- Repeat for each episode:
  - Generate an episode using the current ε-greedy policy.
  - For each state-action pair $(s, a)$ in the episode:
    - Compute the return $G_t$ that follows it.
    - Update $Q(s, a)$ by averaging the observed returns (first-visit or every-visit).
  - Improve the policy by making it ε-greedy with respect to the updated $Q$.
Over time, this process drives the policy toward the optimal policy, as exploration and exploitation work hand in hand.
This approach avoids the need for artificial constructs like exploring starts, and it’s easy to implement — making it one of the most widely used techniques in reinforcement learning.
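Putting the pieces together, here is a compact sketch of first-visit, ε-greedy Monte Carlo control. It assumes a hypothetical `env` object with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)`, plus a fixed list of `actions`; these interfaces are assumptions of the sketch, not a specific library's API.

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, actions, num_episodes, gamma=0.5, epsilon=0.1):
    """First-visit, epsilon-greedy Monte Carlo control (a sketch)."""
    Q = defaultdict(float)        # (state, action) -> estimated return
    returns = defaultdict(list)   # (state, action) -> list of observed returns

    def choose_action(state):
        # Epsilon-greedy behavior policy with respect to the current Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # 1. Generate an episode with the current epsilon-greedy policy.
        episode = []              # list of (state, action, reward)
        state, done = env.reset(), False
        while not done:
            action = choose_action(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2. Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        returns_at = []
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns_at.append((state, action, G))

        # 3. First-visit update of Q by averaging observed returns.
        visited = set()
        for state, action, G in reversed(returns_at):   # forward order again
            if (state, action) not in visited:
                visited.add((state, action))
                returns[(state, action)].append(G)
                Q[(state, action)] = sum(returns[(state, action)]) / len(returns[(state, action)])

    return Q
```

Because the behavior policy itself is ε-greedy, the improvement step is folded into `choose_action`: as soon as `Q` changes, subsequent action choices are greedy with respect to the new estimates, apart from the ε share of exploration.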
Pros and Cons of Monte Carlo Methods
✅ Advantages:
- Simple to implement.
- No need to know or estimate environment dynamics.
- Great for problems with clear episodic structure.
❌ Limitations:
- Need complete episodes to update — not ideal for ongoing tasks.
- High variance in returns can slow down learning.
- Can be inefficient if episodes are long or sparse in rewards.
Summary
Monte Carlo methods offer a natural, intuitive approach to learning from experience. They shine in settings where episodes are well-defined and rewards are meaningful at the end of a sequence.
To recap, MC methods:
- Estimate value functions by averaging observed returns.
- Can be used for both prediction and control.
- Rely on complete episodes and exploration to ensure convergence.
Next up: we’ll look at Temporal-Difference (TD) learning — a different approach that learns from incomplete episodes, updates at each step, and tends to converge faster in many real-world settings.
Stay tuned! 🧠⚡