2026-06-07-RLClassic

Overview: From Planning to Learning

In the previous post, we introduced the basic formulation of reinforcement learning, including Markov Decision Processes, policies, value functions, and Bellman equations. Once value functions are defined, the next question is how to compute or estimate them.

In this post, we focus on the value prediction problem:

Given a fixed policy \(\pi\), how can we estimate its state value function \(V^\pi\)?

This is different from the control problem, where the goal is to improve the policy and eventually find an optimal policy \(\pi^*\). In the Dynamic Programming section, we briefly discuss policy iteration and value iteration because they naturally follow from Bellman equations. However, for Monte Carlo and Temporal-Difference methods, the discussion focuses on estimating \(V^\pi\) for a fixed policy.

The three major approaches discussed in this post are:

  1. Dynamic Programming
  2. Monte Carlo methods
  3. Temporal-Difference learning

These methods can be understood as a gradual transition from model-based planning to model-free learning.

Dynamic Programming assumes that the environment model is known. In other words, we have access to the transition dynamics \(P(s’ \mid s,a)\) and the reward function \(r(s,a)\). With this model, we can compute Bellman backups by taking expectations over all possible next states. Therefore, Dynamic Programming is usually considered a planning method rather than a learning method.

Monte Carlo methods remove the need for a known model. Instead of computing expectations using \(P(s’ \mid s,a)\), they estimate value functions from sampled episodes. After an episode ends, the complete return \(G_t\) observed from a state is used as a sample target for updating the value estimate. In this sense, Monte Carlo methods are model-free learning methods.

Temporal-Difference learning also learns from sampled experience without requiring the environment model. However, unlike Monte Carlo, TD methods do not need to wait until the end of an episode. They update value estimates after each transition using a bootstrapped target, combining the immediate reward with the current estimate of the next state value.

Therefore, the progression from DP to MC to TD can be summarized as follows:

Method Model required? Learns from samples? Uses bootstrapping? Main idea
Dynamic Programming Yes No Yes Compute Bellman backups using the known model
Monte Carlo No Yes No Estimate values from complete sampled returns
Temporal-Difference No Yes Yes Learn from one-step samples using bootstrapped targets

This gives a natural path:

1
2
3
4
5
6
7
8
Dynamic Programming:
known model -> expected Bellman backup -> planning

Monte Carlo:
unknown model -> complete sampled return -> model-free learning

Temporal-Difference:
unknown model -> sampled transition + bootstrapping -> incremental model-free learning

The main theme of this post is:

DP, MC, and TD can be viewed as three different ways to estimate \(V^\pi\): DP uses the known model, MC uses complete sampled returns, and TD uses one-step bootstrapped targets.

Dynamic Programming

Dynamic Programming (DP) methods solve reinforcement learning problems by using a known MDP model. In other words, we assume that the transition probability \(P(s’ \mid s,a)\) and reward function \(r(s,a,s’)\) are available.

The key idea is:

If the model is known, we can update value estimates by looking one step ahead and averaging over all possible next states.

In code, this usually means that for each state \(s\) and action \(a\), we loop over all possible transitions:

1
2
for prob, next_state, reward, done in P[s][a]:
...

This is different from Monte Carlo or TD methods, where we only observe one sampled transition at a time.


Policy Iteration

Policy iteration alternates between two steps:

  1. Policy evaluation: estimate the value function of the current policy.
  2. Policy improvement: improve the policy by acting greedily with respect to the current value function.

The process repeats until the policy no longer changes.


Policy Evaluation

Suppose we have a fixed deterministic policy \(\pi\). Policy evaluation tries to compute \(V^\pi(s)\), the value of every state under this policy.
The Bellman expectation equation is:

$$ V^\pi(s)= r(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s)) V^\pi(s') $$

Since solving this equation analytically can be expensive, we usually use an iterative update starting with a guess \(V^0(s)\) for each state:

$$ V^{k+1}(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s)) V^k(s') $$

We repeat this update until the value estimates change by less than a small threshold. At convergence, the resulting value function approximates \(V^\pi\), the value of the current policy.

In pseudocode:

1
2
3
4
5
6
7
8
9
10
repeat:
for each state s:
V_new(s) =
r(s, π(s))
+ γ Σ_{s'} P(s'|s, π(s)) V(s')

if max_s |V_new(s) - V(s)| < θ:
stop

V ← V_new

Policy Improvement

Once we have \(V^\pi\), we can improve the policy.

For each state \(s\), we evaluate every possible action \(a\) by computing its one-step lookahead value:

$$ Q^\pi(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') $$

Then for every state \(s\) we choose the action with the largest value to update the policy:

$$ \pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^\pi(s') \right] $$

If the policy does not change for any state, then the policy is stable and becomes optimal.

The overall policy iteration procedure is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
initialize policy arbitrarily
initialize V arbitrarily

while True:
# 1. Policy Evaluation
evaluate V under current policy

# 2. Policy Improvement
policy_stable = True

for s in states:
old_action = policy[s]
policy[s] = greedy_action_with_respect_to(V)

if policy[s] != old_action:
policy_stable = False

if policy_stable:
break

The important point is that policy iteration separates evaluation and improvement. It first estimates \(V^\pi\) for the current policy, and then improves the policy using the estimated values.


Value Iteration

Value iteration combines policy evaluation and policy improvement into a single update.

Instead of fully evaluating one fixed policy, value iteration directly applies the Bellman optimality equation:

$$ V^*(s) = \max_a\left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^*(s') \right] $$

This update is different from policy evaluation.
In policy evaluation, the action is fixed by the current policy \(\pi\). In value iteration, the action is chosen greedily at every update:

$$ V_{k+1}(s) = \max_a \left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V_k(s') \right] $$

So policy evaluation estimates \(V^\pi\), the state value of a fixed policy, while value iteration directly approximates \(V^*\), the optimal state value function.

In code, value iteration looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
initialize V arbitrarily

while True:
delta = 0

for s in states:
old_v = V[s]

action_values = []

for a in actions:
q_sa = 0

for prob, next_state, reward, done in P[s][a]:
if done:
q_sa += prob * reward
else:
q_sa += prob * (reward + gamma * V[next_state])

action_values.append(q_sa)

V[s] = max(action_values)

delta = max(delta, abs(old_v - V[s]))

if delta < theta:
break

After value iteration converges, we obtain an approximation of the optimal value function \(V^*\).

Then we can recover the optimal policy by choosing the greedy action with respect to \(V^*\):

$$ \pi^*(s) = \arg\max_a \left[ r(s,a)+ \gamma \sum_{s'} P(s' \mid s,a) V^*(s') \right] $$

In code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
policy = {}

for s in states:
action_values = []

for a in actions:
q_sa = 0

for prob, next_state, reward, done in P[s][a]:
if done:
q_sa += prob * reward
else:
q_sa += prob * (reward + gamma * V[next_state])

action_values.append(q_sa)

policy[s] = argmax(action_values)

The key difference is:

Method Main update What it estimates
Policy Evaluation Bellman expectation backup under fixed \(\pi\) \(V^\pi\)
Policy Iteration Policy evaluation + Policy improvement \(\pi^*\)
Value Iteration Bellman optimality backup \(V^*\), then extract \(\pi^*\)

Policy iteration keeps an explicit policy throughout the algorithm. Value iteration mainly updates the value function first and extracts the policy at the end.

Both methods require the model \(P(s’ \mid s,a)\) and \(r(s,a)\) and obtains \(\pi^*\) without interacting with the actual environment, so they are model-based planning methods.

Monte Carlo Methods

Dynamic Programming requires a known environment model. In particular, DP policy evaluation needs access to \(P(s’ \mid s,a)\) and \(r(s,a)\) so that it can compute the Bellman expectation backup.

Monte Carlo methods remove this requirement. Instead of computing expectations using the model, Monte Carlo methods estimate value functions from sampled episodes.

In this section, we focus on Monte Carlo prediction:

Given a fixed policy \(\pi\), estimate its state value function \(V^\pi\) from sampled episodes.

Monte Carlo methods are model-free because they do not require the transition dynamics or reward function. They only require trajectories generated by following the policy \(\pi\).


Learning from Complete Episodes

Suppose the agent follows a fixed policy \(\pi\) and generates an episode:

1
S_0, A_0, R_0, S_1, A_1, R_1, ..., S_{T-1}, A_{T-1}, R_{T-1}, S_T

For each time step (t), the return is the discounted sum of rewards from that point onward:

$$ G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots + \gamma^{T-t-1}R_{T-1} = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k} $$

The state value function is defined as:

$$ V^\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t=s ] \approx \frac{1}{N}\sum_{i=1}^{N}G_t^{(i)} $$

Monte Carlo methods estimate this expectation by averaging sampled returns. Every time we visit a state \(s\), we can compute the return \(G_t\) from that time step and use it as a sample of \(V^\pi(s)\).

The Monte Carlo target is: \(G_t\). This target is unbiased, because:

$$ \mathbb{E}_\pi [ G_t \mid S_t=s ] = V^\pi(s) $$

However, Monte Carlo methods usually have high variance because \(G_t\) depends on all future rewards in the episode. A single sampled return may be much larger or smaller than the true expected value.

Another important feature is that Monte Carlo methods need complete episodes. We cannot compute \(G_t\) until we know all future rewards after time \(t\).


Incremental Monte Carlo Update

Suppose we have observed returns \(G_1, G_2, \dots, G_n\) for a state \(s\). The empirical mean is:

$$ V(s) = \frac{1}{n} \sum_{i=1}^{n} G^{(i)} $$

When a new return \(G^{(n+1)}\) is observed, the average can be updated incrementally:

$$ V^n(s) = \frac{1}{n} \sum_{i=1}^{n} G^{(i)} = \frac{1}{n} \left[ G^{(n)} + \sum_{i=1}^{n-1} G^{(i)} \right] = \frac{1}{n} \left[ G^{(n)} + (n-1)V^{n-1} \right] = \frac{1}{n} \left[ G^{(n)} + nV^{n-1} - V^{n-1} \right] = V^{n-1}(s) + \frac{1}{n} \left[ G_n - V^{n-1}(s) \right] $$

This has the same form as many RL update rules:

$$ \text{new estimate} = \text{old estimate} + \text{step size} \cdot ( \text{target} - \text{old estimate} ) $$

For Monte Carlo prediction, the target is the complete return (G_t):

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] $$

If we use \(\alpha = \frac{1}{N(S_t)}\), this update is equivalent to computing the sample average of all observed returns for state \(S_t\).

The main idea of Monte Carlo prediction is therefore:

To estimate \(V^\pi(s)\), run episodes under policy \(\pi\), compute complete returns \(G_t\), and move \(V(S_t)\) toward those returns.

Monte Carlo prediction is model-free and unbiased, but it requires complete episodes and can have high variance.

Temporal-Difference Learning

Monte Carlo methods estimate \(V^\pi\) using complete returns from full episodes. This removes the need for a known model, but it also means that Monte Carlo methods have to wait until the end of an episode before updating the value estimate.

Temporal-Difference (TD) learning removes this requirement. TD methods learn from sampled experience like Monte Carlo methods, but they update the value function after each transition instead of waiting for a full episode.

In this section, we focus on TD prediction:

Given a fixed policy \(\pi\), estimate its state value function \(V^\pi\) from sampled transitions.

TD learning is model-free because it does not require \(P(s’ \mid s,a)\) or \(r(s,a)\). It only needs sampled transitions generated by following policy \(\pi\).

A sampled transition has the form:

id
1
S_t, A_t, R_t, S_{t+1}

One-Step TD Prediction

The Monte Carlo update is:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] $$

TD prediction replaces the complete return \(G_t\) with a one-step bootstrapped target: \(R_t + \gamma V(S_{t+1})\)

Therefore, the TD(0) update is:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ R_t + \gamma V(S_{t+1}) - V(S_t) \right] $$

This target contains two parts:

  1. the sampled immediate reward \(R_t\);
  2. the current estimate of the next state’s value \(V(S_{t+1})\).

This is called bootstrapping because the update target depends on the current value estimate itself.

The important difference from Monte Carlo is that TD updates immediately after observing one transition.

1
2
3
4
5
Monte Carlo:
wait until the episode ends -> compute G_t -> update V(S_t)

Temporal Difference:
observe one transition -> compute R_t + gamma V(S_{t+1}) -> update V(S_t)

TD Error

The difference between the TD target and the current value estimate is called the TD error.

For TD(0), the TD error is:

$$ \delta_t = R_t + \gamma V(S_{t+1}) - V(S_t) $$

Then the TD update can be written more compactly as:

$$ V(S_t) \leftarrow V(S_t) + \alpha \delta_t $$

The TD error measures how surprising the transition is under the current value estimate.

If \(\delta_t > 0\), then the observed reward and next-state value are better than expected, so \(V(S_t)\) should increase.

If \(\delta_t < 0\), then the observed reward and next-state value are worse than expected, so \(V(S_t)\) should decrease.

If \(\delta_t = 0\), then the current estimate is consistent with this one-step transition.

TD learning can also be viewed as a sample-based approximation to the Bellman expectation backup. For a fixed policy \(\pi\), the Bellman expectation operator is:

$$ (T^\pi V)(s) = \mathbb{E}_\pi \left[ R_t + \gamma V(S_{t+1}) \mid S_t=s \right] $$

Which is eaxctly the expected value of the TD target

$$ R_t + \gamma V(S_{t+1}) $$

TD as a Bridge Between DP and MC

Temporal-Difference learning is often described as a bridge between Dynamic Programming and Monte Carlo methods.

Like Monte Carlo, TD learns from sampled experience. It does not need the transition model \(P(s’ \mid s,a)\), the reward model \(r(s,a)\), or a full enumeration of possible next states.

Like Dynamic Programming, TD uses bootstrapping. It updates the value of a state using the current estimate of the next state’s value.

This gives the following comparison:

Method Requires model? Update target Uses bootstrapping? Waits until episode ends?
Dynamic Programming Yes Expected Bellman backup Yes No
Monte Carlo No Complete return \(G_t\) No Yes
Temporal-Difference No \(R_t + \gamma V(S_{t+1})\) Yes No

The TD target is generally biased because it depends on the current estimate \(V(S_{t+1})\), which may not yet be accurate. In contrast, the Monte Carlo return \(G_t\) is an unbiased sample of \(V^\pi(S_t)\).

However, TD usually has lower variance than Monte Carlo because it only uses one sampled reward and one next-state value estimate, rather than the entire future trajectory.

So the main tradeoff is:

id
1
2
3
4
5
Monte Carlo:
unbiased target, higher variance, must wait until episode ends

Temporal Difference:
biased bootstrapped target, lower variance, can update after every transition

The main idea of TD prediction is therefore:

To estimate \(V^\pi(s)\), observe one transition at a time and move \(V(S_t)\) toward the one-step bootstrapped target \(R_t + \gamma V(S_{t+1})\).

This makes TD learning more incremental than Monte Carlo methods and more practical for long or continuing tasks.

Author

Jiangshan Gong

Posted on

2026-06-07

Updated on

2026-06-17

Licensed under

Comments