
Learning On the Go: Temporal-Difference Learning in Reinforcement Learning


    By CamelEdge

    Updated on Tue Apr 29 2025


    In our last post, we looked at Monte Carlo methods, which estimate value functions V_\pi(s) by averaging returns after complete episodes. Simple and effective — but they require waiting until the end of an episode.

    But what if we want to learn during an episode, one step at a time?

    That’s where Temporal-Difference (TD) learning shines. ✨
    It blends the strengths of Monte Carlo methods (learning from experience) and Dynamic Programming (bootstrapping from estimates) — enabling faster, more flexible learning.

    🧠 “If one had to identify one idea as central and novel to reinforcement learning, it would be temporal-difference learning.”
    Sutton & Barto, Reinforcement Learning: An Introduction

    In the next section, we’ll dive into how TD methods work, starting with TD(0) for prediction and then moving on to control algorithms such as SARSA and Q-learning.


    What Is Temporal-Difference Learning?

    TD methods blend the strengths of Monte Carlo and Dynamic Programming:

    • Like Monte Carlo: They learn directly from experience, without a model of the environment.
    • Like Dynamic Programming: They bootstrap, meaning they update estimates using other estimates (not just complete returns).

    TD(0) — The Simplest TD Method

    Let’s start with TD(0), used for value prediction.

    Suppose the agent experiences a transition:

    S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, \dots

    The return at time t, denoted G_t, is the sum of discounted rewards from that time step onward:

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots

    The return can also be written recursively. Factoring out the first reward and recognizing the remaining terms as the discounted return from the next time step, we get:

    G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots)

    The expression in parentheses is the return from the next time step, denoted G_{t+1}. Thus, we have:

    G_t = R_{t+1} + \gamma G_{t+1}
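
    As a quick sanity check, suppose the remaining rewards are R_{t+1} = 1, R_{t+2} = 2, R_{t+3} = 3 (with nothing afterwards) and \gamma = 0.5. The direct sum gives G_t = 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75, and the recursion agrees: G_{t+1} = 2 + 0.5 \cdot 3 = 3.5, so G_t = 1 + 0.5 \cdot 3.5 = 2.75.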

    The value function, denoted V_\pi(s), is the expected return when starting from state s and following policy \pi. It is mathematically expressed as:

    V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]

    where G_t is the return at time t.

    Substituting the recursive form of the return into this expectation gives:

    V_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]

    Because V_\pi(S_{t+1}) is itself the expected value of G_{t+1}, the law of iterated expectations lets us replace G_{t+1} with the value of the subsequent state, V(S_{t+1}). This gives us:

    V_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right]

    TD(0) turns this recursive relationship into an incremental update: it uses the current estimate of the next state's value in place of the unknown expected return. The update rule for the value of state S_t is:

    V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]

    Here, the term inside the brackets is known as the TD error:

    \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

    The TD error measures the difference between the predicted value and the "bootstrapped" target, which is the sum of the immediate reward and the discounted value of the next state. The agent uses this error to adjust its current estimate, moving it closer to the target.
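
    To make this concrete, here is a minimal tabular TD(0) prediction sketch in Python. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done)) and the policy function are illustrative assumptions, not the API of any particular library.

    ```python
    import collections

    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
        """Estimate V_pi with tabular TD(0).

        Assumed interfaces (for illustration only):
        env.reset() -> state, env.step(action) -> (next_state, reward, done),
        policy(state) -> action.
        """
        V = collections.defaultdict(float)  # all value estimates start at 0
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # Bootstrapped target: immediate reward plus the discounted
                # estimate of the next state's value (terminal states add nothing).
                target = reward + (0.0 if done else gamma * V[next_state])
                td_error = target - V[state]   # delta_t
                V[state] += alpha * td_error   # nudge the estimate toward the target
                state = next_state
        return V
    ```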


    Why Use TD Learning?

    TD methods bring some major advantages:

    • ✅ Learn before the episode ends — immediate feedback.
    • ✅ Lower variance than Monte Carlo, since the target depends only on one reward and the next state's estimate.
    • ✅ Work well in continuing tasks (no need for complete episodes).
    • ✅ Faster convergence in many practical problems.

    The trade-off? TD estimates are biased, since they rely on current value approximations.

    But often, this bias is outweighed by reduced variance and faster learning.


    TD Learning for Control

    Just like Monte Carlo methods, TD learning can also be used for control — learning optimal policies.

    The most famous TD control method is:

    🔹 SARSA (On-Policy TD Control)

    SARSA stands for:
    State – Action – Reward – next State – next Action

    It updates the action-value function Q(s, a) using the actual action taken under the current (usually ε-greedy) policy:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]

    It’s on-policy, meaning the policy used to select actions is the same as the one being improved.
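
    Below is a minimal SARSA sketch in Python, again assuming a hypothetical env interface (env.reset(), env.step(action) -> (next_state, reward, done)) and a finite list of actions.

    ```python
    import random
    import collections

    def epsilon_greedy(Q, state, actions, epsilon):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = collections.defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            action = epsilon_greedy(Q, state, actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                # On-policy target: uses the action the agent will actually take next.
                target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q
    ```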


    🔸 Q-Learning (Off-Policy TD Control)

    Q-learning is similar, but more aggressive. It updates Q(s, a) assuming we act optimally from the next state:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]

    It’s off-policy, because it learns the optimal policy regardless of the behavior policy used to explore.

    In practice, Q-learning is a powerhouse — it’s the basis for deep Q-networks (DQNs) and many real-world RL systems.
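
    For comparison, here is a Q-learning sketch under the same assumed env interface. The only substantive difference from the SARSA sketch above is the target: it maximizes over next actions instead of using the action the behavior policy will actually take.

    ```python
    import random
    import collections

    def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = collections.defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                # Behavior policy: epsilon-greedy exploration.
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Off-policy target: assumes the greedy action from the next state,
                # regardless of what the behavior policy will do there.
                best_next = max(Q[(next_state, a)] for a in actions)
                target = reward + (0.0 if done else gamma * best_next)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q
    ```

    Because the target ignores the behavior policy, the agent can keep exploring ε-greedily while learning values for the greedy policy.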


    Example

    [Figure: TD grid example]

    Imagine we have observed the following (S_t, A_t, S_{t+1}, R_{t+1}) transitions:

    1. (B, East, C, 2)
    2. (C, South, E, 4)
    3. (C, East, A, 6)
    4. (B, East, C, 2)

    Assume the initial value of every state is 0, with a discount factor \gamma = 1 and a learning rate \alpha = 0.5.

    Let's apply the TD(0) update rule to these four transitions, one at a time, and track how the state values change.

    | Transition       | A | B   | C | D | E |
    |------------------|---|-----|---|---|---|
    | (initial)        | 0 | 0   | 0 | 0 | 0 |
    | (B, East, C, 2)  | 0 | 1   | 0 | 0 | 0 |
    | (C, South, E, 4) | 0 | 1   | 2 | 0 | 0 |
    | (C, East, A, 6)  | 0 | 1   | 4 | 0 | 0 |
    | (B, East, C, 2)  | 0 | 3.5 | 4 | 0 | 0 |
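
    The last row, for example, comes from applying the update to the final transition (B, East, C, 2) with the values from the previous row: V(B) \leftarrow 1 + 0.5\,[\,2 + 1 \cdot 4 - 1\,] = 1 + 0.5 \cdot 5 = 3.5. V(C) = 4 is unchanged, since only the value of the visited state S_t = B is updated.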

    Applying the Q-learning update to the same transitions (same \alpha and \gamma), the learned Q values are as follows:

    | Transition       | Q(B, East) | Q(C, South) | Q(C, East) |
    |------------------|------------|-------------|------------|
    | (initial)        | 0          | 0           | 0          |
    | (B, East, C, 2)  | 1          | 0           | 0          |
    | (C, South, E, 4) | 1          | 2           | 0          |
    | (C, East, A, 6)  | 1          | 2           | 3          |
    | (B, East, C, 2)  | 3          | 2           | 3          |
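
    For example, the final update is Q(B, East) \leftarrow 1 + 0.5\,[\,2 + \max\big(Q(C, South), Q(C, East)\big) - 1\,] = 1 + 0.5\,[\,2 + 3 - 1\,] = 3.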

    Monte Carlo vs. TD Learning — A Quick Comparison

    | Feature                      | Monte Carlo    | Temporal-Difference (TD) |
    |------------------------------|----------------|--------------------------|
    | Updates when?                | End of episode | After every step         |
    | Requires complete episodes?  | Yes            | No                       |
    | Bias                         | Unbiased       | Biased (uses estimates)  |
    | Variance                     | High           | Lower                    |
    | Bootstrapping                | ❌ No          | ✅ Yes                   |
    | Works with continuing tasks  | ❌ Not ideal   | ✅ Yes                   |

    Summary

    Temporal-Difference learning is one of the most important tools in the RL toolbox:

    • Learns from incomplete episodes.
    • Updates value estimates using bootstrapped targets.
    • Includes algorithms like TD(0), SARSA, and Q-learning.
    • Suitable for both prediction and control tasks.

    If Monte Carlo is like learning from reflection at the end of a journey, TD is like learning as you go — adjusting course at every step.