2026-06-07-RLClassic
Overview: From Planning to Learning
In the previous post, we introduced the basic formulation of reinforcement learning, including Markov Decision Processes, policies, value functions, and Bellman equations. Once value functions are defined, the next question is how to compute or estimate them.
In this post, we focus on the value prediction problem:
Given a fixed policy \(\pi\), how can we estimate its state value function \(V^\pi\)?
This is different from the control problem, where the goal is to improve the policy and eventually find an optimal policy \(\pi^*\). In the Dynamic Programming section, we briefly discuss policy iteration and value iteration because they naturally follow from Bellman equations. However, for Monte Carlo and Temporal-Difference methods, the discussion focuses on estimating \(V^\pi\) for a fixed policy.
The three major approaches discussed in this post are:
- Dynamic Programming
- Monte Carlo methods
- Temporal-Difference learning
These methods can be understood as a gradual transition from model-based planning to model-free learning.
Dynamic Programming assumes that the environment model is known. In other words, we have access to the transition dynamics \(P(s’ \mid s,a)\) and the reward function \(r(s,a)\). With this model, we can compute Bellman backups by taking expectations over all possible next states. Therefore, Dynamic Programming is usually considered a planning method rather than a learning method.
Monte Carlo methods remove the need for a known model. Instead of computing expectations using \(P(s’ \mid s,a)\), they estimate value functions from sampled episodes. After an episode ends, the complete return \(G_t\) observed from a state is used as a sample target for updating the value estimate. In this sense, Monte Carlo methods are model-free learning methods.
Temporal-Difference learning also learns from sampled experience without requiring the environment model. However, unlike Monte Carlo, TD methods do not need to wait until the end of an episode. They update value estimates after each transition using a bootstrapped target, combining the immediate reward with the current estimate of the next state value.
Therefore, the progression from DP to MC to TD can be summarized as follows:
| Method | Model required? | Learns from samples? | Uses bootstrapping? | Main idea |
|---|---|---|---|---|
| Dynamic Programming | Yes | No | Yes | Compute Bellman backups using the known model |
| Monte Carlo | No | Yes | No | Estimate values from complete sampled returns |
| Temporal-Difference | No | Yes | Yes | Learn from one-step samples using bootstrapped targets |
This gives a natural path:
1 | Dynamic Programming: |
The main theme of this post is:
DP, MC, and TD can be viewed as three different ways to estimate \(V^\pi\): DP uses the known model, MC uses complete sampled returns, and TD uses one-step bootstrapped targets.
Dynamic Programming
Dynamic Programming (DP) methods solve reinforcement learning problems by using a known MDP model. In other words, we assume that the transition probability \(P(s’ \mid s,a)\) and reward function \(r(s,a,s’)\) are available.
The key idea is:
If the model is known, we can update value estimates by looking one step ahead and averaging over all possible next states.
In code, this usually means that for each state \(s\) and action \(a\), we loop over all possible transitions:
1 | for prob, next_state, reward, done in P[s][a]: |
This is different from Monte Carlo or TD methods, where we only observe one sampled transition at a time.
Policy Iteration
Policy iteration alternates between two steps:
- Policy evaluation: estimate the value function of the current policy.
- Policy improvement: improve the policy by acting greedily with respect to the current value function.
The process repeats until the policy no longer changes.
Policy Evaluation
Suppose we have a fixed deterministic policy \(\pi\). Policy evaluation tries to compute \(V^\pi(s)\), the value of every state under this policy.
The Bellman expectation equation is:
Since solving this equation analytically can be expensive, we usually use an iterative update starting with a guess \(V^0(s)\) for each state:
$$ V^{k+1}(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s)) V^k(s') $$We repeat this update until the value estimates change by less than a small threshold. At convergence, the resulting value function approximates \(V^\pi\), the value of the current policy.
In pseudocode:
1 | repeat: |
Policy Improvement
Once we have \(V^\pi\), we can improve the policy.
For each state \(s\), we evaluate every possible action \(a\) by computing its one-step lookahead value:
$$ Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') $$Then for every state \(s\) we choose the action with the largest value to update the policy:
$$ \pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') \right] $$If the policy does not change for any state, then the policy is stable and becomes optimal.
The overall policy iteration procedure is:
1 | initialize policy arbitrarily |
The important point is that policy iteration separates evaluation and improvement. It first estimates \(V^\pi\) for the current policy, and then improves the policy using the estimated values.
Value Iteration
Value iteration combines policy evaluation and policy improvement into a single update.
Instead of fully evaluating one fixed policy, value iteration directly applies the Bellman optimality equation:
$$ V^*(s) = \max_a\left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^*(s') \right] $$This update is different from policy evaluation.
In policy evaluation, the action is fixed by the current policy \(\pi\). In value iteration, the action is chosen greedily at every update:
So policy evaluation estimates \(V^\pi\), the state value of a fixed policy, while value iteration directly approximates \(V^*\), the optimal state value function.
In code, value iteration looks like this:
1 | initialize V arbitrarily |
After value iteration converges, we obtain an approximation of the optimal value function \(V^*\).
Then we can recover the optimal policy by choosing the greedy action with respect to \(V^*\):
$$ \pi^*(s) = \arg\max_a \left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^*(s') \right] $$In code:
1 | policy = {} |
The key difference is:
| Method | Main update | What it estimates |
|---|---|---|
| Policy Evaluation | Bellman expectation backup under fixed \(\pi\) | \(V^\pi\) |
| Policy Iteration | Policy evaluation + Policy improvement | \(\pi^*\) |
| Value Iteration | Bellman optimality backup | \(V^*\), then extract \(\pi^*\) |
Policy iteration keeps an explicit policy throughout the algorithm. Value iteration mainly updates the value function first and extracts the policy at the end.
Both methods require the model \(P(s’ \mid s,a)\) and \(r(s,a)\) and obtains \(\pi^*\) without interacting with the actual environment, so they are model-based planning methods.
Monte Carlo Methods
Dynamic Programming requires a known environment model. In particular, DP policy evaluation needs access to \(P(s’ \mid s,a)\) and \(r(s,a)\) so that it can compute the Bellman expectation backup.
Monte Carlo methods remove this requirement. Instead of computing expectations using the model, Monte Carlo methods estimate value functions from sampled episodes.
In this section, we focus on Monte Carlo prediction:
Given a fixed policy \(\pi\), estimate its state value function \(V^\pi\) from sampled episodes.
Monte Carlo methods are model-free because they do not require the transition dynamics or reward function. They only require trajectories generated by following the policy \(\pi\).
Learning from Complete Episodes
Suppose the agent follows a fixed policy \(\pi\) and generates an episode:
1 | S_0, A_0, R_0, S_1, A_1, R_1, ..., S_{T-1}, A_{T-1}, R_{T-1}, S_T |
For each time step (t), the return is the discounted sum of rewards from that point onward:
$$ G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{T-t-1}R_{T-1} = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k} $$The state value function is defined as:
$$ V^\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t=s ] \approx \frac{1}{N}\sum_{i=1}^{N}G_t^{(i)} $$Monte Carlo methods estimate this expectation by averaging sampled returns. Every time we visit a state \(s\), we can compute the return \(G_t\) from that time step and use it as a sample of \(V^\pi(s)\).
The Monte Carlo target is: \(G_t\). This target is unbiased, because:
$$ \mathbb{E}_\pi [ G_t \mid S_t=s ] = V^\pi(s) $$However, Monte Carlo methods usually have high variance because \(G_t\) depends on all future rewards in the episode. A single sampled return may be much larger or smaller than the true expected value.
Another important feature is that Monte Carlo methods need complete episodes. We cannot compute \(G_t\) until we know all future rewards after time \(t\).
Incremental Monte Carlo Update
Suppose we have observed returns \(G_1, G_2, \dots, G_n\) for a state \(s\). The empirical mean is:
$$ V(s) = \frac{1}{n} \sum_{i=1}^{n} G^{(i)} $$When a new return \(G^{(n+1)}\) is observed, the average can be updated incrementally:
$$ V^n(s) = \frac{1}{n} \sum_{i=1}^{n} G^{(i)} = \frac{1}{n} \left[ G^{(n)} + \sum_{i=1}^{n-1} G^{(i)} \right] = \frac{1}{n} \left[ G^{(n)} + (n-1)V^{n-1} \right] = \frac{1}{n} \left[ G^{(n)} + nV^{n-1} - V^{n-1} \right] = V^{n-1}(s) + \frac{1}{n} \left[ G_n - V^{n-1}(s) \right] $$This has the same form as many RL update rules:
$$ \text{new estimate} = \text{old estimate} + \text{step size} \cdot ( \text{target} - \text{old estimate} ) $$For Monte Carlo prediction, the target is the complete return (G_t):
$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] $$If we use \(\alpha = \frac{1}{N(S_t)}\), this update is equivalent to computing the sample average of all observed returns for state \(S_t\).
The main idea of Monte Carlo prediction is therefore:
To estimate \(V^\pi(s)\), run episodes under policy \(\pi\), compute complete returns \(G_t\), and move \(V(S_t)\) toward those returns.
Monte Carlo prediction is model-free and unbiased, but it requires complete episodes and can have high variance.
Temporal-Difference Learning
Monte Carlo methods estimate \(V^\pi\) using complete returns from full episodes. This removes the need for a known model, but it also means that Monte Carlo methods have to wait until the end of an episode before updating the value estimate.
Temporal-Difference (TD) learning removes this requirement. TD methods learn from sampled experience like Monte Carlo methods, but they update the value function after each transition instead of waiting for a full episode.
In this section, we focus on TD prediction:
Given a fixed policy \(\pi\), estimate its state value function \(V^\pi\) from sampled transitions.
TD learning is model-free because it does not require \(P(s’ \mid s,a)\) or \(r(s,a)\). It only needs sampled transitions generated by following policy \(\pi\).
A sampled transition has the form:
1 | S_t, A_t, R_t, S_{t+1} |
One-Step TD Prediction
The Monte Carlo update is:
$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] $$TD prediction replaces the complete return \(G_t\) with a one-step bootstrapped target: \(R_t + \gamma V(S_{t+1})\)
Therefore, the TD(0) update is:
$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ R_t + \gamma V(S_{t+1}) - V(S_t) \right] $$This target contains two parts:
- the sampled immediate reward \(R_t\);
- the current estimate of the next state’s value \(V(S_{t+1})\).
This is called bootstrapping because the update target depends on the current value estimate itself.
The important difference from Monte Carlo is that TD updates immediately after observing one transition.
1 | Monte Carlo: |
TD Error
The difference between the TD target and the current value estimate is called the TD error.
For TD(0), the TD error is:
$$ \delta_t = R_t + \gamma V(S_{t+1}) - V(S_t) $$Then the TD update can be written more compactly as:
$$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$The TD error measures how surprising the transition is under the current value estimate.
If \(\delta_t > 0\), then the observed reward and next-state value are better than expected, so \(V(S_t)\) should increase.
If \(\delta_t < 0\), then the observed reward and next-state value are worse than expected, so \(V(S_t)\) should decrease.
If \(\delta_t = 0\), then the current estimate is consistent with this one-step transition.
TD learning can also be viewed as a sample-based approximation to the Bellman expectation backup. For a fixed policy \(\pi\), the Bellman expectation operator is:
$$ (T^\pi V)(s) = \mathbb{E}_\pi \left[ R_t + \gamma V(S_{t+1}) \mid S_t=s \right] $$Which is eaxctly the expected value of the TD target
$$ R_t + \gamma V(S_{t+1}) $$TD as a Bridge Between DP and MC
Temporal-Difference learning is often described as a bridge between Dynamic Programming and Monte Carlo methods.
Like Monte Carlo, TD learns from sampled experience. It does not need the transition model \(P(s’ \mid s,a)\), the reward model \(r(s,a)\), or a full enumeration of possible next states.
Like Dynamic Programming, TD uses bootstrapping. It updates the value of a state using the current estimate of the next state’s value.
This gives the following comparison:
| Method | Requires model? | Update target | Uses bootstrapping? | Waits until episode ends? |
|---|---|---|---|---|
| Dynamic Programming | Yes | Expected Bellman backup | Yes | No |
| Monte Carlo | No | Complete return \(G_t\) | No | Yes |
| Temporal-Difference | No | \(R_t + \gamma V(S_{t+1})\) | Yes | No |
The TD target is generally biased because it depends on the current estimate \(V(S_{t+1})\), which may not yet be accurate. In contrast, the Monte Carlo return \(G_t\) is an unbiased sample of \(V^\pi(S_t)\).
However, TD usually has lower variance than Monte Carlo because it only uses one sampled reward and one next-state value estimate, rather than the entire future trajectory.
So the main tradeoff is:
1 | Monte Carlo: |
The main idea of TD prediction is therefore:
To estimate \(V^\pi(s)\), observe one transition at a time and move \(V(S_t)\) toward the one-step bootstrapped target \(R_t + \gamma V(S_{t+1})\).
This makes TD learning more incremental than Monte Carlo methods and more practical for long or continuing tasks.
2026-06-07-RLClassic