
Learning On the Go: Temporal-Difference Learning in Reinforcement Learning


    By CamelEdge

    Updated on Tue Apr 29 2025


    In our last post, we looked at Monte Carlo methods, which estimate value functions V_\pi(s) by averaging returns after complete episodes. Simple and effective — but they require waiting until the end of an episode.

    But what if we want to learn during an episode, one step at a time?

    That’s where Temporal-Difference (TD) learning shines. ✨
    It blends the strengths of Monte Carlo methods (learning from experience) and Dynamic Programming (bootstrapping from estimates) — enabling faster, more flexible learning.

    🧠 “If one had to identify one idea as central and novel to reinforcement learning, it would be temporal-difference learning.”
    Sutton & Barto, Reinforcement Learning: An Introduction

    In the next section, we’ll dive into how TD methods work, starting with TD(0) for prediction and then moving on to control algorithms such as SARSA and Q-learning.


    What Is Temporal-Difference Learning?

    TD methods blend the strengths of Monte Carlo and Dynamic Programming:

    • Like Monte Carlo: They learn directly from experience, without a model of the environment.
    • Like Dynamic Programming: They bootstrap, meaning they update estimates using other estimates (not just complete returns).

    TD(0) — The Simplest TD Method

    Let’s start with TD(0), used for value prediction.

    Suppose the agent experiences a transition:

    S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, \dots

    The return at time t, denoted G_t, is the sum of discounted rewards from that time step onward:

    G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \dots

    The return can also be written recursively. Factoring out the first reward and recognizing the remaining terms as the discounted return from the next time step, we get:

    G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots)

    The expression in parentheses is the return from the next time step, denoted G_{t+1}. Thus, we have:

    G_t = R_{t+1} + \gamma G_{t+1}
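
    As a quick sanity check, suppose the remaining rewards are R_{t+1} = 1, R_{t+2} = 2, R_{t+3} = 3 (with nothing afterwards) and \gamma = 0.5. The direct sum gives G_t = 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75, and the recursion agrees: G_{t+1} = 2 + 0.5 \cdot 3 = 3.5, so G_t = 1 + 0.5 \cdot 3.5 = 2.75.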

    The value function, denoted V_\pi(s), is the expected return when starting from state s and following policy \pi. It is mathematically expressed as:

    V_\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right]

    where G_t is the return at time t.

    Substituting the recursive form of the return into this expectation gives:

    V_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]

    Because V_\pi(S_{t+1}) is itself the expected value of G_{t+1}, the law of iterated expectations lets us replace G_{t+1} with the value of the subsequent state, V(S_{t+1}). This gives us:

    V_\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s \right]

    TD(0) turns this recursive relationship into an incremental update: it uses the current estimate of the next state's value in place of the unknown expected return. The update rule for the value of state S_t is:

    V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]

    Here, the term inside the brackets is known as the TD error:

    \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

    The TD error measures the difference between the predicted value and the "bootstrapped" target, which is the sum of the immediate reward and the discounted value of the next state. The agent uses this error to adjust its current estimate, moving it closer to the target.
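
    To make this concrete, here is a minimal tabular TD(0) prediction sketch in Python. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done)) and the policy function are illustrative assumptions, not the API of any particular library.

    ```python
    import collections

    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
        """Estimate V_pi with tabular TD(0).

        Assumed interfaces (for illustration only):
        env.reset() -> state, env.step(action) -> (next_state, reward, done),
        policy(state) -> action.
        """
        V = collections.defaultdict(float)  # all value estimates start at 0
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                action = policy(state)
                next_state, reward, done = env.step(action)
                # Bootstrapped target: immediate reward plus the discounted
                # estimate of the next state's value (terminal states add nothing).
                target = reward + (0.0 if done else gamma * V[next_state])
                td_error = target - V[state]   # delta_t
                V[state] += alpha * td_error   # nudge the estimate toward the target
                state = next_state
        return V
    ```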


    Why Use TD Learning?

    TD methods bring some major advantages:

    • ✅ Learn before the episode ends — immediate feedback.
    • ✅ Lower variance than Monte Carlo, since the target depends only on one reward and the next state's estimate.
    • ✅ Work well in continuing tasks (no need for complete episodes).
    • ✅ Faster convergence in many practical problems.

    The trade-off? TD estimates are biased, since they rely on current value approximations.

    But often, this bias is outweighed by reduced variance and faster learning.


    TD Learning for Control

    Just like Monte Carlo methods, TD learning can also be used for control — learning optimal policies.

    The most famous TD control method is:

    🔹 SARSA (On-Policy TD Control)

    SARSA stands for:
    State – Action – Reward – next State – next Action

    It updates the action-value function Q(s, a) using the actual action taken under the current (usually ε-greedy) policy:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]

    It’s on-policy, meaning the policy used to select actions is the same as the one being improved.
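
    Below is a minimal SARSA sketch in Python, again assuming a hypothetical env interface (env.reset(), env.step(action) -> (next_state, reward, done)) and a finite list of actions.

    ```python
    import random
    import collections

    def epsilon_greedy(Q, state, actions, epsilon):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = collections.defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            action = epsilon_greedy(Q, state, actions, epsilon)
            done = False
            while not done:
                next_state, reward, done = env.step(action)
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                # On-policy target: uses the action the agent will actually take next.
                target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state, action = next_state, next_action
        return Q
    ```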


    🔸 Q-Learning (Off-Policy TD Control)

    Q-learning is similar, but more aggressive. It updates Q(s, a) assuming we act optimally from the next state:

    Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big]

    It’s off-policy, because it learns the optimal policy regardless of the behavior policy used to explore.

    In practice, Q-learning is a powerhouse — it’s the basis for deep Q-networks (DQNs) and many real-world RL systems.
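
    For comparison, here is a Q-learning sketch under the same assumed env interface. The only substantive difference from the SARSA sketch above is the target: it maximizes over next actions instead of using the action the behavior policy will actually take.

    ```python
    import random
    import collections

    def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = collections.defaultdict(float)
        for _ in range(num_episodes):
            state = env.reset()
            done = False
            while not done:
                # Behavior policy: epsilon-greedy exploration.
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Off-policy target: assumes the greedy action from the next state,
                # regardless of what the behavior policy will do there.
                best_next = max(Q[(next_state, a)] for a in actions)
                target = reward + (0.0 if done else gamma * best_next)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q
    ```

    Because the target ignores the behavior policy, the agent can keep exploring ε-greedily while learning values for the greedy policy.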


    Example

    [Figure: TD grid example]

    Imagine we have observed the following (S_t, A_t, S_{t+1}, R_{t+1}) transitions:

    1. (B, East, C, 2)
    2. (C, South, E, 4)
    3. (C, East, A, 6)
    4. (B, East, C, 2)

    Assume the initial value of every state is 0, with a discount factor \gamma = 1 and a learning rate \alpha = 0.5.

    Let's apply the TD(0) update rule to these four transitions, one at a time, and track how the state values change.

    | Transition       | A | B   | C | D | E |
    |------------------|---|-----|---|---|---|
    | (initial)        | 0 | 0   | 0 | 0 | 0 |
    | (B, East, C, 2)  | 0 | 1   | 0 | 0 | 0 |
    | (C, South, E, 4) | 0 | 1   | 2 | 0 | 0 |
    | (C, East, A, 6)  | 0 | 1   | 4 | 0 | 0 |
    | (B, East, C, 2)  | 0 | 3.5 | 4 | 0 | 0 |
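
    The last row, for example, comes from applying the update to the final transition (B, East, C, 2) with the values from the previous row: V(B) \leftarrow 1 + 0.5\,[\,2 + 1 \cdot 4 - 1\,] = 1 + 0.5 \cdot 5 = 3.5. V(C) = 4 is unchanged, since only the value of the visited state S_t = B is updated.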

    Applying the Q-learning update to the same transitions (same \alpha and \gamma), the learned Q values are as follows:

    | Transition       | Q(B, East) | Q(C, South) | Q(C, East) |
    |------------------|------------|-------------|------------|
    | (initial)        | 0          | 0           | 0          |
    | (B, East, C, 2)  | 1          | 0           | 0          |
    | (C, South, E, 4) | 1          | 2           | 0          |
    | (C, East, A, 6)  | 1          | 2           | 3          |
    | (B, East, C, 2)  | 3          | 2           | 3          |
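
    For example, the final update is Q(B, East) \leftarrow 1 + 0.5\,[\,2 + \max\big(Q(C, South), Q(C, East)\big) - 1\,] = 1 + 0.5\,[\,2 + 3 - 1\,] = 3.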

    Monte Carlo vs. TD Learning — A Quick Comparison

    | Feature                      | Monte Carlo    | Temporal-Difference (TD) |
    |------------------------------|----------------|--------------------------|
    | Updates when?                | End of episode | After every step         |
    | Requires complete episodes?  | Yes            | No                       |
    | Bias                         | Unbiased       | Biased (uses estimates)  |
    | Variance                     | High           | Lower                    |
    | Bootstrapping                | ❌ No          | ✅ Yes                   |
    | Works with continuing tasks  | ❌ Not ideal   | ✅ Yes                   |

    Summary

    Temporal-Difference learning is one of the most important tools in the RL toolbox:

    • Learns from incomplete episodes.
    • Updates value estimates using bootstrapped targets.
    • Includes algorithms like TD(0), SARSA, and Q-learning.
    • Suitable for both prediction and control tasks.

    If Monte Carlo is like learning from reflection at the end of a journey, TD is like learning as you go — adjusting course at every step.