
E.2.5 Bellman Expectation Equation and Action Values

Prerequisites: E.2.1 Probability Basics and E.2.2 State Values -- you should know conditional probability and the linearity of expectation.


Earlier we interpreted the value function as "the average return from a state":

$$v_\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s].$$

This definition is correct, but it hides all details inside the expectation symbol $\mathbb{E}_\pi$. If we do not know what is being summed inside the expectation or what weights are being used, we cannot actually compute the value. It is like knowing that "the average grade is 85" without knowing how many courses there are or how each course is weighted. This article expands the expectation step by step and shows which sums live behind the notation. The recursive relationship that appears after this expansion is the Bellman expectation equation. Its purpose is to compress "the value of all future steps" into the recursive form "one-step reward plus next-state value," so we can compute values without running a full trajectory.

First split $G_t$ into "one-step reward plus future return":

$$G_t=R_{t+1}+\gamma G_{t+1}.$$
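The split can be checked numerically. The sketch below assumes the usual discounted-sum definition of the return and a finite, made-up reward sequence; the function names `discounted_return` and `recursive_return` are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Direct sum: G_t = sum over k of gamma**k * R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def recursive_return(rewards, gamma):
    """Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, with G = 0 after the last reward."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * recursive_return(rewards[1:], gamma)


# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... (not taken from the text).
rewards = [2.0, 1.0, 2.0, 1.0]
gamma = 0.5

direct = discounted_return(rewards, gamma)
recursive = recursive_return(rewards, gamma)
print(direct, recursive)  # both forms give the same number
```

Both functions return the same value because the recursion is just the discounted sum with the first reward peeled off.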

Substitute this into the value function:

$$v_\pi(s)=\mathbb{E}_\pi[R_{t+1}+\gamma G_{t+1}\mid S_t=s].$$

Expectation has a basic property: linearity. The expectation of a sum is the sum of expectations. Therefore:

$$v_\pi(s)=\mathbb{E}_\pi[R_{t+1}\mid S_t=s]+\gamma\mathbb{E}_\pi[G_{t+1}\mid S_t=s].$$

The left term, $\mathbb{E}_\pi[R_{t+1}\mid S_t=s]$, is the average immediate reward. It expands as:

$$\mathbb{E}_\pi[R_{t+1}\mid S_t=s]=\sum_a \pi(a\mid s)\sum_r p(r\mid s,a)\,r.$$

The right term, $\mathbb{E}_\pi[G_{t+1}\mid S_t=s]$, is the average next-state value:

$$\mathbb{E}_\pi[G_{t+1}\mid S_t=s]=\sum_a \pi(a\mid s)\sum_{s'}p(s'\mid s,a)\,v_\pi(s').$$
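The expansion of the right term skips one step. A sketch of the missing reasoning, assuming the Markov property (the future depends on the past only through the current state) and using the law of total expectation:

```latex
\begin{aligned}
\mathbb{E}_\pi[G_{t+1}\mid S_t=s]
&=\sum_a \pi(a\mid s)\sum_{s'} p(s'\mid s,a)\,
  \mathbb{E}_\pi[G_{t+1}\mid S_t=s,\,A_t=a,\,S_{t+1}=s'] \\
&=\sum_a \pi(a\mid s)\sum_{s'} p(s'\mid s,a)\,
  \mathbb{E}_\pi[G_{t+1}\mid S_{t+1}=s'] \\
&=\sum_a \pi(a\mid s)\sum_{s'} p(s'\mid s,a)\,v_\pi(s').
\end{aligned}
```

The first line conditions on the action and the next state, the second drops $S_t$ and $A_t$ by the Markov property, and the third recognizes the definition of $v_\pi$ applied at time $t+1$.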

Combining the two terms gives the Bellman expectation equation:

$$v_\pi(s)=\sum_a\pi(a\mid s)\left[\sum_r p(r\mid s,a)\,r+\gamma\sum_{s'}p(s'\mid s,a)\,v_\pi(s')\right].$$

The formula looks complex, but each layer is only a probability-weighted average:

  • $\sum_a\pi(a\mid s)$: average over all possible actions.
  • $\sum_r p(r\mid s,a)\,r$: average over all possible rewards.
  • $\sum_{s'}p(s'\mid s,a)\,v_\pi(s')$: average over the values of all possible next states.
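The three layers map directly onto three loops. The following is a minimal sketch of one Bellman expectation backup for a single state; the dictionary layout, the function name `bellman_backup`, and the one-state example are all assumptions made for illustration, not part of the original text.

```python
def bellman_backup(state, policy, reward_dist, transition, values, gamma):
    """Apply the Bellman expectation equation once for a single state.

    policy[state][action]          -> pi(a | s)
    reward_dist[(state, action)]   -> {reward: p(r | s, a)}
    transition[(state, action)]    -> {next_state: p(s' | s, a)}
    values[next_state]             -> current estimate of v_pi(s')
    """
    total = 0.0
    for action, pi_a in policy[state].items():  # average over actions
        expected_reward = sum(
            p * r for r, p in reward_dist[(state, action)].items()
        )  # average over rewards
        expected_next = sum(
            p * values[s_next] for s_next, p in transition[(state, action)].items()
        )  # average over next-state values
        total += pi_a * (expected_reward + gamma * expected_next)
    return total


# Hypothetical one-state problem, chosen only so the arithmetic is easy to follow.
policy = {"s": {"a1": 0.5, "a2": 0.5}}
reward_dist = {("s", "a1"): {0.0: 0.5, 2.0: 0.5}, ("s", "a2"): {4.0: 1.0}}
transition = {("s", "a1"): {"s": 1.0}, ("s", "a2"): {"s": 1.0}}
values = {"s": 10.0}

backup = bellman_backup("s", policy, reward_dist, transition, values, gamma=0.5)
print(backup)  # 0.5 * (1 + 5) + 0.5 * (4 + 5) = 7.5
```

Note that `values` is only a current estimate here: one backup computes the right-hand side of the equation, and the true $v_\pi$ is the value that a backup leaves unchanged.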

Checking With the Two-State Example

Return to the two-state example from the introduction. Suppose the policy is deterministic. In $s_1$, it chooses one action, which always moves to $s_2$ and gives reward $2$. In $s_2$, it chooses one action, which always moves to $s_1$ and gives reward $1$. Let $\gamma=0.5$.

Because the policy is deterministic, $\pi(a\mid s)$ is $1$ for the chosen action and $0$ for all others, so the "sum over actions" layer disappears. Expanding $s_1$:

$$v(s_1)=\underbrace{2}_{\text{immediate reward}}+\gamma\underbrace{(1\times v(s_2))}_{\text{next state }s_2\text{ with probability }1}=2+0.5\,v(s_2).$$

Expanding $s_2$:

$$v(s_2)=\underbrace{1}_{\text{immediate reward}}+\gamma\underbrace{(1\times v(s_1))}_{\text{next state }s_1\text{ with probability }1}=1+0.5\,v(s_1).$$

These are the two equations from the introduction. This verifies that the Bellman expectation equation is doing exactly "one-step reward plus probability-weighted future value."
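Because there are only two unknowns, the pair of equations can be solved exactly by substitution. The sketch below uses Python's `fractions.Fraction` to keep the arithmetic exact; the variable names are illustrative.

```python
from fractions import Fraction

gamma = Fraction(1, 2)
r1, r2 = Fraction(2), Fraction(1)  # rewards received in s1 and s2

# Substitute v(s2) = r2 + gamma * v(s1) into v(s1) = r1 + gamma * v(s2):
#   v(s1) = r1 + gamma * r2 + gamma**2 * v(s1)
v1 = (r1 + gamma * r2) / (1 - gamma ** 2)
v2 = r2 + gamma * v1

print(v1, v2)  # 10/3 8/3

# Both Bellman equations hold exactly for the solution.
assert v1 == r1 + gamma * v2
assert v2 == r2 + gamma * v1
```

So $v(s_1)=10/3\approx3.33$ and $v(s_2)=8/3\approx2.67$, and plugging either value back into the other equation reproduces it.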

What If the Policy Is Stochastic?

Suppose the policy has two possible actions in $s_1$, with probabilities $\pi(\text{left}\mid s_1)=0.3$ and $\pi(\text{right}\mid s_1)=0.7$:

  • Choosing left gives reward $1$ and leads back to $s_1$.
  • Choosing right gives reward $3$ and leads to $s_2$.

Then the Bellman expectation equation for $s_1$ is:

$$v(s_1)=0.3\times[1+0.5\,v(s_1)]+0.7\times[3+0.5\,v(s_2)].$$

Each bracket contains "the immediate reward plus future value after choosing one action," and the coefficient in front is the probability of choosing that action. Expanding layer by layer means taking probability-weighted averages layer by layer.
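The equation for $s_1$ can be solved by repeatedly applying it as an update. The text does not restate the dynamics of $s_2$ for this example, so the sketch below assumes $s_2$ behaves as in the deterministic example (reward $1$, always moving to $s_1$); that assumption is marked in the code.

```python
gamma = 0.5
v = {"s1": 0.0, "s2": 0.0}  # arbitrary starting guess

for _ in range(100):
    v = {
        # s1: left (prob 0.3) gives reward 1 and returns to s1;
        #     right (prob 0.7) gives reward 3 and moves to s2.
        "s1": 0.3 * (1 + gamma * v["s1"]) + 0.7 * (3 + gamma * v["s2"]),
        # s2: ASSUMPTION, same as the deterministic example (reward 1, moves to s1).
        "s2": 1 + gamma * v["s1"],
    }

print(v["s1"], v["s2"])
```

Under that assumption the iteration settles at $v(s_1)=110/27\approx4.07$ and $v(s_2)=82/27\approx3.04$. The update shrinks the error by a factor of $\gamma$ each sweep, which is why a fixed number of sweeps is enough here.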


Action Values and Advantage Functions

The state value function $v_\pi(s)$ answers: "starting from state $s$, what average return can we get?" But when training a policy, we often face a more specific question: in state $s$, is action left better or action right better? Knowing only the average return cannot answer this, because the average may mix one good action and one bad action. We need to evaluate each action separately. That is why we need the action value function.

$$q_\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a].$$

The action value function $q_\pi(s,a)$ is defined as: after choosing action $a$ in state $s$, and then continuing under policy $\pi$, what average return do we get?

Its Bellman form is:

$$q_\pi(s,a)=\sum_r p(r\mid s,a)\,r+\gamma\sum_{s'}p(s'\mid s,a)\,v_\pi(s').$$

Notice the relationship between $q$ and $v$: the state value is the weighted average of action values:

$$v_\pi(s)=\sum_a\pi(a\mid s)\,q_\pi(s,a).$$

Intuitively, "the average return from $s$" equals "the sum of each action's return multiplied by the probability of choosing that action."

The advantage function is the difference between action value and state value:

$$A_\pi(s,a)=q_\pi(s,a)-v_\pi(s).$$

It measures how much better action $a$ is than the average action level in state $s$. The earlier example "return 10, average 8, so the advantage is 2" is a numerical version of this formula.

The Relationship Between q, v, and A in Numbers

Suppose that in a state $s$, the policy chooses left with probability $0.4$ and right with probability $0.6$. We know:

| Action | Action value $q(s,a)$ |
| --- | --- |
| left | $5$ |
| right | $8$ |

The state value is the weighted average of action values:

$$v(s)=0.4\times5+0.6\times8=2+4.8=6.8.$$

The advantage values are:

$$A(s,\text{left})=5-6.8=-1.8, \qquad A(s,\text{right})=8-6.8=1.2.$$

Left is $1.8$ below average, so it has negative advantage. Right is $1.2$ above average, so it has positive advantage. Policy gradients use this sign to decide that the probability of right should go up and the probability of left should go down.
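The same numbers can be reproduced in a few lines. This sketch stores the policy and action values in plain dictionaries (an illustrative layout) and also checks a property that follows from the definitions: the advantages, weighted by the policy, average to zero.

```python
policy = {"left": 0.4, "right": 0.6}   # pi(a | s) from the example
q = {"left": 5.0, "right": 8.0}        # action values q(s, a) from the table

# State value: policy-weighted average of the action values.
v = sum(policy[a] * q[a] for a in policy)

# Advantage: how far each action value sits above or below that average.
advantage = {a: q[a] - v for a in q}

# Weighted by the policy, the advantages cancel out.
expected_advantage = sum(policy[a] * advantage[a] for a in policy)

print(v, advantage, expected_advantage)
```

The zero mean is why the sign of the advantage is informative: some actions must be above the average and others below it, unless all actions are equally good.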


Trajectory-Form Importance Sampling

We have already seen the one-step importance weight:

$$\rho_t=\frac{\pi(a_t\mid s_t)}{b(a_t\mid s_t)}.$$

If a full trajectory is sampled by a behavior policy $b$, and we want to estimate the return of a target policy $\pi$, then we multiply the probability ratios across all steps:

$$\rho_{0:T}=\prod_{t=0}^{T}\frac{\pi(a_t\mid s_t)}{b(a_t\mid s_t)}.$$

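Here $\prod$ means "multiply all terms together," just as $\sum$ means "add all terms together." The off-policy Monte Carlo estimate becomes:

$$\hat{v}_\pi(s)=\frac{1}{N}\sum_{i=1}^N \rho^{(i)}G^{(i)},$$

where $\rho^{(i)}$ and $G^{(i)}$ are the trajectory importance weight and the return of the $i$-th trajectory sampled from state $s$ under $b$.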

This method works, but it is risky. If many probability ratios are multiplied, the weight can become extremely large and cause variance to explode. Practical algorithms therefore often use truncation, weighted importance sampling, or other more stable methods.
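To make the contrast concrete, the sketch below compares the ordinary estimator from the formula above with weighted importance sampling, which divides by the sum of the weights instead of by $N$. The function names and the three sample trajectories are made up for illustration.

```python
def ordinary_is(weights, returns):
    """Ordinary importance sampling: (1/N) * sum_i rho_i * G_i."""
    return sum(w * g for w, g in zip(weights, returns)) / len(returns)


def weighted_is(weights, returns):
    """Weighted importance sampling: sum_i rho_i * G_i / sum_i rho_i."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # no trajectory is possible under the target policy
    return sum(w * g for w, g in zip(weights, returns)) / total_weight


# Hypothetical trajectory weights and returns (not from the text).
weights = [4.0, 0.5, 0.25]
returns = [5.0, 2.0, 1.0]

ordinary = ordinary_is(weights, returns)
weighted = weighted_is(weights, returns)
print(ordinary, weighted)
```

With these numbers the ordinary estimate is larger than every observed return, because one large weight dominates the sum. The weighted estimate is a weighted average of the returns, so it always stays between the smallest and largest return; the price is a bias that shrinks as more trajectories are collected.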

Numerical Example: Trajectory Importance Weights

Suppose a two-step trajectory has the following target-policy and behavior-policy probabilities at each step:

| Step | $\pi(a_t\mid s_t)$ | $b(a_t\mid s_t)$ | One-step weight |
| --- | --- | --- | --- |
| $t=0$ | $0.6$ | $0.3$ | $0.6/0.3=2$ |
| $t=1$ | $0.8$ | $0.4$ | $0.8/0.4=2$ |

The trajectory importance weight is the product of the two steps:

$$\rho_{0:1}=2\times2=4.$$

If this trajectory has return $G=5$, its weighted contribution is $4\times5=20$. This may still look acceptable. But if every step has weight $3$, then after $10$ steps the weight becomes $3^{10}=59049$. This is what "variance explosion" means. PPO clipping and normalization are, in essence, methods for controlling such extreme weights.
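The product and its growth are easy to reproduce. A minimal sketch, where `trajectory_weight` is an illustrative helper name:

```python
import math


def trajectory_weight(target_probs, behavior_probs):
    """Product of per-step ratios pi(a_t | s_t) / b(a_t | s_t)."""
    return math.prod(p / b for p, b in zip(target_probs, behavior_probs))


# The two-step trajectory from the table.
rho = trajectory_weight([0.6, 0.8], [0.3, 0.4])
contribution = rho * 5.0  # weighted contribution for return G = 5

# A constant per-step weight of 3 grows geometrically with trajectory length.
growth = {steps: 3 ** steps for steps in (1, 5, 10)}

print(rho, contribution, growth)
```

The growth table makes the risk visible: the weight depends exponentially on the number of steps, so even moderately long trajectories can produce a handful of samples that dominate the estimate.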


Covariance and Policy Gradient Variance

The tendency of two random variables to change together is measured by covariance:

$$\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])].$$

The correlation coefficient further normalizes covariance to $[-1,1]$:

$$\rho_{X,Y}=\frac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}.$$

In reinforcement learning, gradient estimates are often random variables. For example, a policy gradient sample is:

$$g_t=\hat{A}_t\nabla_\theta\log\pi_\theta(a_t\mid s_t).$$

If $\hat{A}_t$ fluctuates heavily, the variance of $g_t$ also grows. Baselines, advantage normalization, and GAE are all ways of controlling the variance of this stochastic gradient.
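The effect of a baseline can be computed exactly in a tiny case. The sketch below is built on assumptions that are not in the text: a single state with two actions, a one-parameter policy $\pi(\text{right})=\sigma(\theta)$ with $\sigma(\theta)=0.6$, and per-action returns of $5$ and $8$ borrowed from the earlier table. It compares the raw estimator $G\,\nabla_\theta\log\pi$ with the baselined estimator $(G-v)\,\nabla_\theta\log\pi$.

```python
def estimator_stats(policy, action_returns, score, baseline):
    """Exact mean and variance of g = (G - baseline) * score(a) for a ~ policy."""
    samples = {a: (action_returns[a] - baseline) * score[a] for a in policy}
    mean = sum(policy[a] * samples[a] for a in policy)
    second_moment = sum(policy[a] * samples[a] ** 2 for a in policy)
    return mean, second_moment - mean ** 2


p_right = 0.6  # ASSUMED: pi(right) = sigmoid(theta) = 0.6
policy = {"left": 1 - p_right, "right": p_right}
action_returns = {"left": 5.0, "right": 8.0}  # ASSUMED: reused from the q table

# Score function d/dtheta log pi(a) for the sigmoid parameterization:
# right -> 1 - sigmoid(theta), left -> -sigmoid(theta).
score = {"left": -p_right, "right": 1 - p_right}

v = sum(policy[a] * action_returns[a] for a in policy)  # baseline = state value

mean_raw, var_raw = estimator_stats(policy, action_returns, score, baseline=0.0)
mean_base, var_base = estimator_stats(policy, action_returns, score, baseline=v)

print(mean_raw, var_raw)
print(mean_base, var_base)
```

In this toy setting both estimators have the same mean, about $0.72$, but the variance drops from roughly $9.23$ to roughly $0.09$ once the state value is subtracted. The baseline leaves the expected gradient unchanged because the score function has zero mean under the policy.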

Numerical Example: Covariance

Suppose two policy gradient samples have advantages $\hat{A}=[2,-1]$, and the corresponding gradient norms are $[4,1]$.

Means: $\bar{A}=0.5$, $\bar{g}=2.5$.

Covariance:

$$\mathrm{Cov}=\frac{(2-0.5)(4-2.5)+(-1-0.5)(1-2.5)}{2}=\frac{1.5\times1.5+(-1.5)\times(-1.5)}{2}=\frac{4.5}{2}=2.25.$$

The covariance is positive, which means the gradient tends to be larger when the advantage is larger. The two signals are positively correlated. If the covariance is close to $0$, the advantage size and gradient size have no stable relationship, so the gradient estimate becomes noisier.
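The calculation can be reproduced with a short helper. This sketch divides by $n$ (population covariance), matching the worked example; the function names are illustrative.

```python
import math


def covariance(xs, ys):
    """Population covariance: divides by n, as in the worked example."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


advantages = [2.0, -1.0]
grad_norms = [4.0, 1.0]

cov = covariance(advantages, grad_norms)
corr = correlation(advantages, grad_norms)
print(cov, corr)  # 2.25 1.0
```

The correlation comes out as exactly $1$ only because there are two data points, which always lie on a straight line. With more samples the correlation would normally fall strictly between $-1$ and $1$.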


Summary

This article expanded the value function from a "black-box expectation" into the computable Bellman expectation equation, then introduced action values and advantage functions:

| Concept | Formula | Role |
| --- | --- | --- |
| Bellman expectation equation | $v_\pi(s)=\sum_a\pi(a\mid s)\left[\sum_r p(r\mid s,a)\,r+\gamma\sum_{s'}p(s'\mid s,a)\,v_\pi(s')\right]$ | Compresses "all future steps" into "one step plus recursion" |
| Action value | $q_\pi(s,a)=\mathbb{E}_\pi[G_t\mid S_t=s,A_t=a]$ | Evaluates the average return after choosing a specific action |
| Advantage function | $A_\pi(s,a)=q_\pi(s,a)-v_\pi(s)$ | Measures how much better this action is than the average |
| Trajectory importance sampling | $\rho_{0:T}=\prod_t\frac{\pi(a_t\mid s_t)}{b(a_t\mid s_t)}$ | Uses old-policy data to evaluate a new policy; watch for variance explosion |
| Covariance | $\mathrm{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]$ | Measures joint fluctuation of two random variables and helps reason about gradient variance |

The core idea of the Bellman expectation equation is recursion: we do not need to run a full trajectory if we know "one-step reward plus next-state value." Action values let us compare actions, and the advantage function turns that comparison into a number. Trajectory importance sampling lets us reuse old data, but it carries the risk of variance explosion.

Next: E.2.6 Probability and Statistics Formula Reference and Exercises -- a formula summary for this module, with exercises.

Modern Reinforcement Learning: A Hands-On Course