Learning On the Go: Temporal-Difference Learning in Reinforcement Learning

By CamelEdge
Updated on Tue Apr 29 2025
Introduction
In our last post, we looked at Monte Carlo methods, which estimate value-functions by averaging returns after complete episodes. Simple and effective — but they require waiting until the end of an episode.
But what if we want to learn during an episode, one step at a time?
That’s where Temporal-Difference (TD) learning shines. ✨
It blends the strengths of Monte Carlo methods (learning from experience) and Dynamic Programming (bootstrapping from estimates) — enabling faster, more flexible learning.
🧠 “If one had to identify one idea as central and novel to reinforcement learning, it would be temporal-difference learning.”
— Sutton & Barto, Reinforcement Learning: An Introduction
In the next section, we’ll dive into how TD methods work, starting with TD(0) for prediction and then looking at control algorithms such as SARSA and Q-learning.
What Is Temporal-Difference Learning?
TD methods blend the strengths of Monte Carlo and Dynamic Programming:
- Like Monte Carlo: They learn directly from experience, without a model of the environment.
- Like Dynamic Programming: They bootstrap, meaning they update estimates using other estimates (not just complete returns).
TD(0) — The Simplest TD Method
Let’s start with TD(0), used for value prediction.
Suppose the agent experiences a transition $(S_t, A_t, R_{t+1}, S_{t+1})$.
The return at time $t$, denoted $G_t$, represents the sum of discounted rewards from that time step onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
To compute this return efficiently, we can use a recursive approach known as bootstrapping. This involves expressing the return in terms of the immediate reward and the return from the next time step. By factoring out the first reward and recognizing the remaining terms as the discounted return from the next time step, we get:

$$G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right)$$

The expression in parentheses is the return from the next time step, denoted $G_{t+1}$. Thus, we have:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
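As a quick sanity check with made-up numbers: suppose a hypothetical $\gamma = 0.5$ and rewards $2, 4, 8$ from time $t+1$ onward, after which the episode ends. Then

$$G_t = 2 + 0.5 \cdot 4 + 0.25 \cdot 8 = 6, \qquad G_{t+1} = 4 + 0.5 \cdot 8 = 8, \qquad R_{t+1} + \gamma G_{t+1} = 2 + 0.5 \cdot 8 = 6 = G_t$$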
The value function, denoted $v_\pi(s)$, represents the expected return when starting from state $s$ and following policy $\pi$. It is mathematically expressed as:

$$v_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]$$

where $G_t$ is the return at time $t$.
Substituting the bootstrapped form of the return, the value function becomes:

$$v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]$$

To further refine this, we apply the linearity of expectation, allowing us to replace the expected return $G_{t+1}$ with the value function of the subsequent state, $v_\pi(S_{t+1})$. This gives us:

$$v_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s \right]$$
This recursive relationship enables incremental updates to our value estimates, using the current estimate of future returns. The TD(0) update rule for the value of state $S_t$ is:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Here, the term inside the brackets is known as the TD error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
The TD error measures the difference between the predicted value and the "bootstrapped" target, which is the sum of the immediate reward and the discounted value of the next state. The agent uses this error to adjust its current estimate, moving it closer to the target.
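To make the update concrete, here is a minimal TD(0) prediction sketch in Python. The environment interface (`env.reset()` returning a state, `env.step(action)` returning `(next_state, reward, done)`), the `policy` callable, and the hyperparameter values are assumptions for illustration, not part of any specific library.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy with TD(0), one step at a time."""
    V = defaultdict(float)  # value estimates, default 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bootstrapped target: immediate reward plus discounted estimate
            # of the next state (zero if the episode has ended).
            target = reward + gamma * V[next_state] * (not done)
            # Move V(s) a fraction alpha toward the target; the bracketed
            # quantity is the TD error.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Notice that the estimate is updated after every single step, without waiting for the episode to finish.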
Why Use TD Learning?
TD methods bring some major advantages:
- ✅ Learn before the episode ends — immediate feedback.
- ✅ Lower variance than Monte Carlo (since they use estimates).
- ✅ Work well in continuing tasks (no need for complete episodes).
- ✅ Faster convergence in many practical problems.
The trade-off? TD estimates are biased, since they rely on current value approximations.
But often, this bias is outweighed by reduced variance and faster learning.
TD Learning for Control
Just like Monte Carlo methods, TD learning can also be used for control — learning optimal policies.
The two best-known TD control methods are:
🔹 SARSA (On-Policy TD Control)
SARSA stands for:
State – Action – Reward – next State – next Action
It updates the action-value function using the actual action taken under the current (usually ε-greedy) policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
It’s on-policy, meaning the policy used to select actions is the same as the one being improved.
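A minimal SARSA sketch, again assuming a hypothetical `env.reset()`/`env.step()` interface, a finite `actions` list, and illustrative hyperparameters:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: the target uses the action actually taken next."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # SARSA target: bootstrap from the (next state, next action) pair
            # chosen by the same ε-greedy policy being improved.
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```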
🔸 Q-Learning (Off-Policy TD Control)
Q-learning is similar, but more aggressive. It updates assuming we act optimally from the next state:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
It’s off-policy, because it learns the optimal policy regardless of the behavior policy used to explore.
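For comparison, here is a Q-learning sketch under the same assumed interface. The only substantive change from SARSA is the max over next-state actions in the target:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: explore ε-greedily, learn about the greedy policy."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy: ε-greedy exploration.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning target: bootstrap from the best next action,
            # regardless of which action will actually be taken.
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```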
In practice, Q-learning is a powerhouse — it’s the basis for deep Q-networks (DQNs) and many real-world RL systems.
Example

Imagine we have observed the following transitions, each given as (state, action, next state, reward):
- (B, East, C, 2)
- (C, South, E, 4)
- (C, East, A, 6)
- (B, East, C, 2)
Assume the initial value for each state is $0$, with a discount factor $\gamma = 1$ and a learning rate $\alpha = 0.5$.
Let's determine the updated state values using Temporal-Difference (TD) learning after processing these four transitions. We will apply the TD update rule iteratively to refine our value estimates based on the observed transitions.
Transition | V(A) | V(B) | V(C) | V(D) | V(E) |
---|---|---|---|---|---|
(initial) | 0 | 0 | 0 | 0 | 0 |
(B, East, C, 2) | 0 | 1 | 0 | 0 | 0 |
(C, South, E, 4) | 0 | 1 | 2 | 0 | 0 |
(C, East, A, 6) | 0 | 1 | 4 | 0 | 0 |
(B, East, C, 2) | 0 | 3.5 | 4 | 0 | 0 |
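The state-value table above can be reproduced with a few lines of Python, replaying the four transitions through the TD(0) update with the example's parameters ($\gamma = 1$, $\alpha = 0.5$):

```python
alpha, gamma = 0.5, 1.0
V = {s: 0.0 for s in "ABCDE"}
transitions = [("B", "East", "C", 2), ("C", "South", "E", 4),
               ("C", "East", "A", 6), ("B", "East", "C", 2)]

for s, _, s_next, r in transitions:
    # TD(0) update toward the bootstrapped target r + gamma * V(s').
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    print(V)  # after the 4th transition: V(B)=3.5, V(C)=4.0
```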
The Q values learned are as follows:
Transition | Q(B, East) | Q(C, South) | Q(C, East) |
---|---|---|---|
(initial) | 0 | 0 | 0 |
(B, East, C, 2) | 1 | 0 | 0 |
(C, South, E, 4) | 1 | 2 | 0 |
(C, East, A, 6) | 1 | 2 | 3 |
(B, East, C, 2) | 3 | 2 | 3 |
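These Q-values are consistent with the Q-learning update (bootstrapping from the max over next-state actions). A small sketch that reproduces them, with a hypothetical action set assumed for illustration:

```python
from collections import defaultdict

alpha, gamma = 0.5, 1.0
actions = ["North", "South", "East", "West"]  # assumed action set
Q = defaultdict(float)
transitions = [("B", "East", "C", 2), ("C", "South", "E", 4),
               ("C", "East", "A", 6), ("B", "East", "C", 2)]

for s, a, s_next, r in transitions:
    # Q-learning update: target uses the best available next action.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Final values: Q(B, East)=3.0, Q(C, South)=2.0, Q(C, East)=3.0
```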
Monte Carlo vs. TD Learning — A Quick Comparison
Feature | Monte Carlo | Temporal-Difference (TD) |
---|---|---|
Updates when? | End of episode | After every step |
Requires complete episodes? | Yes | No |
Bias | Unbiased | Biased (uses estimates) |
Variance | High | Lower |
Bootstrapping | ❌ No | ✅ Yes |
Works with continuing tasks | ❌ Not ideal | ✅ Yes |
Summary
Temporal-Difference learning is one of the most important tools in the RL toolbox:
- Learns from incomplete episodes.
- Updates value estimates using bootstrapped targets.
- Includes algorithms like TD(0), SARSA, and Q-learning.
- Suitable for both prediction and control tasks.
If Monte Carlo is like learning from reflection at the end of a journey, TD is like learning as you go — adjusting course at every step.