2026-06-06-RLPre

RL Problem Settings

Reinforcement Learning studies how an agent learns to make sequential decisions through interaction with an environment.

At each time step, the agent observes the current state, chooses an action, receives a reward, and moves to the next state. The goal is to learn a decision-making rule (policy) that maximizes cumulative future reward.

A standard formalization of this problem is the Markov Decision Process (MDP).

An MDP is usually written as:

$$ \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma) $$

where:

  • \( \mathcal{S} \) is the state space.
  • \( \mathcal{A} \) is the action space.
  • \( P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S}) \) is the transition function, where \( \Delta(\mathcal{S}) \) is the space of all probability distributions over \( \mathcal{S} \).
  • \( R: \mathcal{S} \times \mathcal{A} \to \mathbb{R} \) is the reward function.
  • \( \gamma \in [0,1) \) is the discount factor.

The state (s) represents the information available to the agent at a given time.
The action (a) is the choice made by the agent.
The transition function describes the environment dynamics:

$$ P(s' \mid s,a) = \Pr(S_{t+1}=s' \mid S_t=s, A_t=a) $$

The reward function gives immediate feedback after taking an action:

$$ r(s,a) = \mathbb{E}[R_{t} \mid S_t=s, A_t=a] $$

The discount factor \(\gamma\) controls how much the agent cares about future rewards. A larger \(\gamma\) means the agent cares more about long-term rewards, while a smaller \(\gamma\) makes the agent focus more on immediate rewards.

The Markov property means that the future depends only on the current state and action, not on the full history:

$$ P(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots) = P(S_{t+1} \mid S_t, A_t) $$

This assumption is important because it allows us to describe the environment dynamics using only the current state and action.


What RL Is Learning: Policy

Policy, Trajectory, and Return

The central object that RL tries to learn is a policy.

A policy describes how the agent chooses an action \(a \in \mathcal{A} \) given a state \(s\). It can be deterministic or stochastic.

A deterministic policy maps each state to one action \(a = \pi(s)\), while a stochastic policy gives a probability distribution over actions \(\pi(a \mid s)=\Pr(A_t=a \mid S_t=s)\)

When the agent follows a policy \(\pi\), it generates a trajectory:

$$ S_0, A_0, R_0, S_1, A_1, R_1, S_2, \dots $$

In some textbooks, the reward received after taking action \(A_t\) in state \(S_t\) is written as \(R_{t+1}\), because it is observed together with the next state \(S_{t+1}\).
In this note, I use the convention that \(R_t\) is the reward produced by taking action \(A_t\) at state \(S_t\).

An episode is one complete trajectory from an initial state to a terminal state.

The discounted return from time step \(t\) is defined as:

$$ G_t = R_{t} + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k} $$

The policy is good if it tends to produce large returns.

The objective of RL is to find a policy that maximizes expected cumulative reward.


State Visitation Distribution

A policy not only determines what action the agent takes in a state, but also determines what states the agent is likely to visit.

For a given policy \(\pi\), the state visitation distribution describes how frequently different states are visited when the agent follows \(\pi\).

The time-dependent state distribution is:

$$ \Pr(S_t=s \mid \pi) $$

This is the probability that the agent is in state \(s\) at time step \(t\), assuming it follows policy \(\pi\).

In infinite-horizon discounted RL, we often define the discounted state visitation distribution as:

$$ d^\pi(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t=s \mid \pi) $$

This distribution weights earlier time steps more heavily and later time steps less heavily, matching the discounted-return objective (Imagine with a policy, a state only appears frequently in the late stages of an episode, its reward will be weighted less due to the discounting and vice versa).

The factor \((1-\gamma)\) normalizes the distribution such that \(\sum_s d^\pi(s) = 1\).

Therefore, \(d^\pi(s)\) can be interpreted as a probability distribution over states.

The important idea is:

Different policies induce different state distributions.

This is one key difference between reinforcement learning and supervised learning. In supervised learning, the data distribution is usually fixed. In RL, the policy affects the data distribution because the policy determines where the agent goes.


Value Functions

State Value Function and Action Value Function

To evaluate a policy, we need to measure how good it is to be in a state or to take an action.

The state value function of a policy \(\pi\) is defined as the expected return starting from state \(s\) and then following \(\pi\) for all the following timesteps:

$$ V^\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t=s ] $$

Equivalently:

$$ V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t=s \right] $$

The action value function of a policy \(\pi\) is defined as the expected return after taking action \(a\) in state \(s\), and then following \(\pi\) for all the following timesteps:

$$ Q^\pi(s,a) = \mathbb{E}_\pi [ G_t \mid S_t=s, A_t=a ] $$

Equivalently:

$$ Q^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k} \mid S_t=s, A_t=a \right] $$

The relationship between \(V^\pi\) and \(Q^\pi\) is:

$$ V^\pi(s) = \mathbb{E}_{a\sim\pi(\cdot \mid s)} \left[ Q^\pi(s,a) \right] = \sum_{a\in\mathcal{A}} \pi(a \mid s) Q^\pi(s,a) $$ $$ Q^\pi(s, a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \left[ V^\pi(s') \right] $$

Bellman Expectation Equations

The key observation behind Bellman equations is that the return can be decomposed recursively:

$$ G_t = R_{t} + \gamma G_{t+1} $$

Therefore, the value of a state can be written as immediate reward plus discounted value of the next state.

The Bellman expectation equation for \(Q^\pi\) is:

$$ Q^\pi(s,a) = r(s,a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \left[ V^\pi(s') \right] = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s' \mid s,a) \sum_{a\in\mathcal{A}} \pi(a \mid s) Q^\pi(s,a) $$

Intuitively, the value of taking action \(a\) in state \(s\) equals:

  1. the immediate reward from taking action \(a\), plus
  2. the discounted future value after reaching the next state and continuing to follow \(\pi\).

For a fixed policy \(\pi\), the Bellman expectation equation for \(V^\pi\) is:

$$ V^\pi(s) = \mathbb{E}_\pi \left[R_t + \gamma V^\pi(s') \mid S_t=s \right] = \sum_{a\in\mathcal{A}} \pi(a \mid s) \left[ r(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s' \mid s,a) V^\pi(s') \right] $$

Unlike \(Q^{\pi}(s,a)\), we cannot determine \(a\) for \(r(s,a)\), so we need to take the expectation over actions under the policy.

The Bellman expectation equation is for policy evaluation: given a policy \(\pi\), compute how good it is.


Solving Value Functions

For a finite MDP, we can write the Bellman expectation equation in matrix form.

Let \(V^\pi\) be a vector containing the value of every state. Let \(R^\pi\) be the expected immediate reward under policy \(\pi\), and let \(P^\pi\) be the state transition matrix induced by policy \(\pi\).

The policy-induced transition matrix is:

$$ P^\pi(s' \mid s) = \sum_a \pi(a \mid s) P(s' \mid s,a) $$

The probability of transitioning from state \(s\) to state \(s’\) under policy \(\pi\) is the sum over all actions of the probability of taking that action under \(\pi\) times the probability of transitioning to \(s’\) given state \(s\) and action \(a\).

The policy-induced reward is:

$$ R^\pi(s) = \sum_a \pi(a \mid s) R(s,a) $$

Then the Bellman expectation equation can be written as:

$$ V^\pi = R^\pi + \gamma P^\pi V^\pi $$

Note that \(V^\pi\) is a vector of values for all states. The scalar form comes from the previous section that \(V^\pi(s) = R^\pi(s) + \gamma \sum_{s’} P^\pi(s’ \mid s) V^\pi(s’)\).

Rearranging gives:

$$ (I - \gamma P^\pi)V^\pi = R^\pi $$

Therefore, the analytical solution is:

$$ V^\pi = (I - \gamma P^\pi)^{-1} R^\pi $$

This gives a closed-form solution for the value function.

However, in practice, this analytical solution is usually not feasible. There are several reasons:

  1. The state space may be very large.
  2. The transition matrix \(P^\pi\) may be unknown.
  3. Computing the inverse of a large matrix is expensive.
  4. In many RL problems, the agent only has sampled experience from interacting with the environment.

Therefore, we need practical methods for estimating value functions.

Three major families of methods are:

  • Dynamic Programming (DP)
  • Monte Carlo (MC) methods
  • Temporal-Difference (TD) learning

Dynamic Programming assumes the transition dynamics are known and uses Bellman equations directly.

Monte Carlo methods estimate value functions from complete sampled episodes.

Temporal-Difference learning combines ideas from both DP and MC: it learns from sampled experience, but updates estimates using bootstrapping from current value estimates.


Optimal Policy

Bellman Optimality Equations

So far, we have discussed how to evaluate a fixed policy \(\pi\). The next question is how to find the best policy.

An optimal policy \(\pi^*\) is a policy that that simultaneously maximizes \(V^\pi(s)\) for all \(s \in \mathcal{S}\) and maximizes \(Q^\pi(s,a)\) for all \(s \in \mathcal{S}, a \in \mathcal{A}\):

$$ V^*(s) = \max_\pi V^\pi(s), \quad \forall s \in \mathcal{S} $$ $$ Q^*(s,a) = \max_\pi Q^\pi(s,a), \quad \forall s \in \mathcal{S}, a \in \mathcal{A} $$

To maximize \(Q^\pi(s,a)\), after taking action \(a\) in state \(s\), the agent should follow the optimal policy for all subsequent steps, written as

$$ Q^*(s,a) = r(s,a) + \gamma \sum_{s'\in\mathcal{S}} P(s' \mid s,a) V^*(s') $$

And in return, the optimal value function can be written as:

$$ V^*(s) = \max_a Q^*(s,a) $$

Note how this is the same as in the Bellman expectation equation

The optimal value function satisfies the Bellman optimality equation.

For \(V^*\), the Bellman optimality equation is:

$$ V^*(s) = \max_a \left[ r(s,a') + \gamma \sum_{s'\in\mathcal{S}} P(s' \mid s,a') V^*(s') \right] $$

For \(Q^*\), the Bellman optimality equation is:

$$ Q^*(s,a) = r(s,a)+\gamma \sum_{s'\in\mathcal{S}} P(s' \mid s,a) \max_{a'} Q^*(s',a') $$

The optimal policy can be recovered from the optimal action value function by choosing the action with the largest \(Q^*\):

$$ \pi^*(s) = \arg\max_a Q^*(s,a) $$

Bellman Optimality Operator and Fixed Point

The Bellman optimality equation can also be viewed through the Bellman optimality operator.

Define the Bellman optimality operator \(\mathcal{T}: \mathbb{R}^{|\mathcal{S}\times \mathcal{A}|} \to \mathbb{R}^{|\mathcal{S} \times \mathcal{A}|}\) as follows: when applied to some vector \(f \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A}|}\):

$$ (\mathcal{T}f)(s,a) = r(s,a) + \gamma \left\langle P(\cdot \mid s,a), V_f \right\rangle $$

where \(V_f(\cdot) = \max_{a\in\mathcal{A}} f(\cdot,a)\).

Then the optimal value function is a fixed point of this operator:

$$ Q^* = \mathcal{T}Q^* $$

This means that if we apply the Bellman optimality operator to the optimal value function, it does not change.

This fixed-point view is important because many RL algorithms can be understood as repeatedly applying or approximating Bellman updates.

For example:

  • Value Iteration repeatedly applies the Bellman optimality operator.
  • Policy Iteration alternates between policy evaluation and policy improvement.
  • TD methods approximate Bellman updates using sampled transitions.
  • Q-learning estimates \(Q^*\) using samples and the Bellman optimality target.

References

  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

  2. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

  3. Jiang, N. (2026). CS443: Reinforcement Learning. University of Illinois Urbana-Champaign. Course website: https://nanjiang.cs.illinois.edu/cs443/

Author

Jiangshan Gong

Posted on

2026-06-08

Updated on

2026-06-11

Licensed under

Comments