
6.1 The Advantage Function

At the end of Chapter 5, we found that subtracting a baseline $V(s)$ reduces the variance of policy gradients without changing the gradient direction in expectation. This section deepens that insight and introduces the advantage function -- the bridge connecting the Actor and the Critic.

Prerequisites

  • REINFORCE policy gradient: $\nabla_\theta J \approx \nabla_\theta \log \pi(a|s) \cdot G_t$ -- where to insert the baseline
  • State value $V(s)$: what makes a good baseline
  • Action value $Q(s,a)$: the advantage is defined as the difference between $Q$ and $V$
  • TD error: $\delta = r + \gamma V(s') - V(s)$ -- a practical estimator of the advantage

From Baseline to Advantage Function

Recall the REINFORCE policy gradient:

$$\nabla_\theta J \approx \nabla_\theta \log \pi(a|s) \cdot G_t$$

$G_t$ is the total discounted return from the current step to the end of the episode (review: discounted return). The problem is that $G_t$ fluctuates wildly -- under the same policy, from the same state, two rollouts can yield completely different $G_t$ values.

After subtracting the baseline $V(s)$:

$$\nabla_\theta J \approx \nabla_\theta \log \pi(a|s) \cdot (G_t - V(s))$$

The quantity in parentheses, $G_t - V(s)$, is already an estimate of the advantage function. The formal definition is:

$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s) \tag{6.1}$$

| Symbol | Meaning |
| --- | --- |
| $A^\pi(s,a)$ | Advantage function: how much better taking action $a$ in state $s$ is compared to "average." |
| $Q^\pi(s,a)$ | Action-value function: expected discounted return starting from state $s$, taking action $a$ first, then following policy $\pi$. |
| $V^\pi(s)$ | State-value function: expected discounted return starting from state $s$ and following policy $\pi$. |
| $\pi$ | The current policy, determining the probability of each action in each state. |

Their difference captures exactly "how many extra points were earned because action aa was chosen."

In words, the advantage says:

How much better is this action than what we would typically get in this state?

  • $A > 0$: the action is better than expected; we should choose it more often
  • $A < 0$: the action is worse than expected; we should choose it less often
  • $A \approx 0$: the action is about as good as expected

A chess analogy: $V(s)$ is "this position has a 60% win rate overall," while $Q(s, \text{play rook})$ is "after playing the rook move, the win rate becomes 75%." The advantage is $A = 75\% - 60\% = 15\%$, meaning the rook move is 15 percentage points better than the average outcome for the position -- a strong choice.
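To make definition (6.1) concrete, here is a minimal Python sketch for a single state with three candidate moves. The move names, win rates, and policy probabilities are illustrative assumptions, chosen so that the numbers line up with the chess analogy. The only fact relied on is that the state value is the policy-weighted average of the action values.

```python
# Hypothetical action values Q(s, a) for one chess position (win rates).
q_values = {"rook": 0.75, "bishop": 0.60, "pawn": 0.45}
# Hypothetical policy pi(a|s); the probabilities sum to 1.
policy = {"rook": 0.25, "bishop": 0.50, "pawn": 0.25}


def state_value(q, pi):
    """V(s) = sum over actions of pi(a|s) * Q(s, a)."""
    return sum(pi[a] * q[a] for a in q)


def advantages(q, pi):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(q, pi)
    return {a: q[a] - v for a in q}


v = state_value(q_values, policy)
adv = advantages(q_values, policy)
# Weighting the advantages by the policy should give (numerically) zero.
expected_adv = sum(policy[a] * adv[a] for a in adv)

print(f"V(s) = {v:.2f}")
for action, a_value in adv.items():
    print(f"A(s, {action}) = {a_value:+.2f}")
print("advantages average to zero under pi:", abs(expected_adv) < 1e-12)
```

Because the state value is an average of the action values under the policy, the advantages always average to zero under that same policy: some actions must sit above "typical" and others below it.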

Let us work through a concrete 3-step episode to see how the advantage is computed. Suppose the discount factor is $\gamma = 0.9$, and a sampled trajectory yields:

$$s_0 \xrightarrow{r=+2} s_1 \xrightarrow{r=+3} s_2 \xrightarrow{r=+1} s_3\ (\text{terminal})$$

Computing the discounted return $G_t$ from each time step:

$$G_0 = r_1 + \gamma r_2 + \gamma^2 r_3 = 2 + 0.9 \times 3 + 0.9^2 \times 1 = 2 + 2.7 + 0.81 = 5.51$$

$$G_1 = r_2 + \gamma r_3 = 3 + 0.9 \times 1 = 3.9$$

$$G_2 = r_3 = 1$$

Now suppose the Critic provides value estimates for each state:

| State | $V(s)$ |
| --- | --- |
| $s_0$ | 3.0 |
| $s_1$ | 2.5 |
| $s_2$ | 0.8 |

Substituting $G_t$ and $V(s)$ into $A \approx G_t - V(s)$ yields the advantage estimate at each time step:

| Step $t$ | State | $G_t$ | $V(s_t)$ | $A = G_t - V(s_t)$ | Meaning |
| --- | --- | --- | --- | --- | --- |
| 0 | $s_0$ | $5.51$ | $3.0$ | $5.51 - 3.0 = 2.51$ | $2.51$ better than expected |
| 1 | $s_1$ | $3.9$ | $2.5$ | $3.9 - 2.5 = 1.4$ | $1.4$ better than expected |
| 2 | $s_2$ | $1$ | $0.8$ | $1 - 0.8 = 0.2$ | $0.2$ better than expected |

All three advantage estimates are positive, meaning that at every step this sampled trajectory returned more than the Critic expected. $G_t - V(s)$ is an MC-return-based estimate of the advantage; it is unbiased but high-variance (different trajectories produce very different $G_t$ values).
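The hand calculation above can be reproduced in a few lines. The following is a minimal sketch, not a reference implementation: the helper name `discounted_returns` is an assumption introduced here, while the rewards, discount factor, and Critic values are taken from the worked example.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] by accumulating rewards backwards in time."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_{t+1} + gamma * G_{t+1}
        returns.append(g)
    returns.reverse()
    return returns


gamma = 0.9
rewards = [2.0, 3.0, 1.0]        # rewards along the sampled trajectory
critic_values = [3.0, 2.5, 0.8]  # V(s_0), V(s_1), V(s_2) from the table

returns = discounted_returns(rewards, gamma)
# MC advantage estimate: sampled return minus the Critic's baseline.
mc_advantages = [g - v for g, v in zip(returns, critic_values)]

for t, (g, v, a) in enumerate(zip(returns, critic_values, mc_advantages)):
    print(f"t={t}: G_t={g:.2f}, V={v:.2f}, A={a:+.2f}")
```

Working backwards from the last reward avoids recomputing the discounted sum from scratch at every time step.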

Advantage Versus Cumulative Return

The advantage reduces variance because it subtracts the reward you would have gotten anyway, retaining only the portion attributable to the specific action.

Consider a more complete example. Suppose that in some state $s$, the policy's average return is $V(s) = 10$. Four trajectories are sampled with returns $G_t^{(1)} = 18$, $G_t^{(2)} = 15$, $G_t^{(3)} = 7$, and $G_t^{(4)} = 4$.

First, using $G_t$ as the gradient signal:

| Episode | $G_t$ | Gradient signal | Meaning |
| --- | --- | --- | --- |
| 1 | 18 | $\nabla \times 18$ | Large positive, strongly pushes action |
| 2 | 15 | $\nabla \times 15$ | Positive, pushes action |
| 3 | 7 | $\nabla \times 7$ | Positive, pushes action |
| 4 | 4 | $\nabla \times 4$ | Positive, pushes action |

All four are positive. The policy would conclude that "in this state, no matter what, this action is good" -- yet episodes 3 and 4 actually returned below average.

Now using $A = G_t - V(s)$:

| Episode | $G_t$ | $V(s)$ | $A = G_t - V(s)$ | Gradient signal | Meaning |
| --- | --- | --- | --- | --- | --- |
| 1 | 18 | 10 | $18 - 10 = +8$ | $\nabla \times (+8)$ | Far above average, strongly push |
| 2 | 15 | 10 | $15 - 10 = +5$ | $\nabla \times (+5)$ | Above average, push |
| 3 | 7 | 10 | $7 - 10 = -3$ | $\nabla \times (-3)$ | Below average, suppress |
| 4 | 4 | 10 | $4 - 10 = -6$ | $\nabla \times (-6)$ | Far below average, strongly suppress |

With $G_t$, all four episodes produce positive gradient signals -- the policy cannot distinguish "truly good" from "lucky high return." With $A$, the signal is calibrated: above-average returns get positive signals, below-average returns get negative signals.

To see the variance reduction quantitatively: using $G_t$, the four signals have mean $\frac{18+15+7+4}{4} = 11$ and variance $\frac{(18-11)^2+(15-11)^2+(7-11)^2+(4-11)^2}{4} = \frac{49+16+16+49}{4} = 32.5$. Using $A$, the four signals have mean $\frac{8+5-3-6}{4} = 1$ and variance $\frac{(8-1)^2+(5-1)^2+(-3-1)^2+(-6-1)^2}{4} = \frac{49+16+16+49}{4} = 32.5$.

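Subtracting a constant shifts the signals without changing their spread, so the four-sample variance is identical; what changes is that $A$ is centered near zero. That matters because the gradient estimate is the product $\nabla_\theta \log \pi(a|s) \times \text{signal}$, and the variance of that product grows roughly with the squared magnitude of the signal, not with its spread alone. Here the mean squared signal falls from $\frac{18^2+15^2+7^2+4^2}{4} = 153.5$ for $G_t$ to $\frac{8^2+5^2+(-3)^2+(-6)^2}{4} = 33.5$ for $A$. Put differently, $G_t$'s range is determined by the randomness of the entire trajectory (potentially spanning from 0 to dozens), while $A$ is centered by $V(s)$, so positive and negative values cancel and the expected gradient direction is estimated more stably. This is the mechanism by which "subtracting a baseline reduces variance."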
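The comparison between the two signals can be checked numerically. The sketch below recomputes the mean and population variance of both signals and also reports the mean squared magnitude of each one. The helper names are assumptions introduced for illustration; the four returns and the baseline of 10 come from the example.

```python
def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance (divide by N), matching the hand calculation."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def mean_square(xs):
    """Second moment: the average squared magnitude of the signal."""
    return sum(x * x for x in xs) / len(xs)


baseline = 10.0
returns = [18.0, 15.0, 7.0, 4.0]
advantages = [g - baseline for g in returns]

for name, signal in [("G_t", returns), ("A", advantages)]:
    print(f"{name}: mean={mean(signal):.1f}, "
          f"variance={variance(signal):.1f}, "
          f"mean square={mean_square(signal):.1f}")
```

The variance is unchanged by the shift, but the mean square drops sharply once the signal is centered, which is the part that affects how noisy the scaled gradient is.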

Estimating the Advantage with the TD Error

The theoretical definition of the advantage is $A = Q - V$, but in practice we rarely compute $Q$ directly. Starting from the definition and performing a one-step expansion yields a more practical form.

Begin with $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. The action-value function is defined as:

$$Q^\pi(s,a) = \mathbb{E}\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s, A_t = a\right]$$

This expectation represents: after taking action $a$ in state $s$, the immediate reward plus the value of the next state. If we take a single sample (without completing the entire episode or averaging over all possible transitions), we obtain a one-step estimate of $Q$:

$$Q(s,a) \approx r + \gamma V(s')$$

where $r$ is the actual reward received in this step and $s'$ is the actual next state reached. Substituting this approximation into the advantage definition:

$$A(s,a) = Q(s,a) - V(s) \approx r + \gamma V(s') - V(s)$$

The right-hand side is the TD error:

$$A(s,a) \approx r + \gamma V(s') - V(s) = \delta \tag{6.2}$$

| Symbol | Meaning |
| --- | --- |
| $r$ | The actual immediate reward received in this step. |
| $\gamma$ | Discount factor, controlling how much future value is discounted. |
| $V(s')$ | The Critic's value estimate for the next state $s'$. |
| $V(s)$ | The Critic's value estimate for the current state $s$. |
| $\delta$ | TD error: how much better (or worse) the actual outcome was after one step. |
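Why a single-sample TD error is a reasonable stand-in for the advantage can be seen by taking its expectation. The short derivation below assumes the Critic is exact, $V = V^\pi$ (an idealization; a learned Critic only approximates it), and uses the definition of $Q^\pi$ given above:

$$
\begin{aligned}
\mathbb{E}\left[\delta \mid S_t = s, A_t = a\right]
&= \mathbb{E}\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s, A_t = a\right] - V^\pi(s) \\
&= Q^\pi(s,a) - V^\pi(s) \\
&= A^\pi(s,a)
\end{aligned}
$$

Under that assumption, $\delta$ is an unbiased one-sample estimate of the advantage. Any error in the learned $V$ carries over into the estimate as bias.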

Replacing $G_t$ with the TD error as the policy gradient signal has two benefits:

  1. No need to wait for the episode to end -- updates can happen after every step ($G_t$ requires a full episode, a limitation of MC methods)
  2. Lower variance -- $\delta$ involves randomness from only a single step ($G_t$ accumulates randomness over the entire trajectory)

Let us walk through a concrete numerical example. Suppose $\gamma = 0.9$, and at some step:

  • Current state $s$, Critic estimates $V(s) = 5.0$
  • The agent takes some action and receives immediate reward $r = +2$
  • The next state is $s'$, Critic estimates $V(s') = 4.0$

Substituting into the TD error formula:

$$\delta = r + \gamma V(s') - V(s) = 2 + 0.9 \times 4.0 - 5.0 = 2 + 3.6 - 5.0 = +0.6$$

$\delta = +0.6$ means this step was $0.6$ better than the Critic predicted. Using this $\delta$ as the advantage estimate, the policy gradient will slightly increase the probability of this action.

Try different numbers. Suppose the same transition yields $r = -1$ instead:

$$\delta = -1 + 0.9 \times 4.0 - 5.0 = -1 + 3.6 - 5.0 = -2.4$$

$\delta = -2.4$ means this step performed far worse than predicted. The policy gradient will decrease the probability of this action.

Now consider the case $\delta = 0$. If $r = +1$, $V(s') = 5.0$, $V(s) = 5.5$:

$$\delta = 1 + 0.9 \times 5.0 - 5.5 = 1 + 4.5 - 5.5 = 0$$

$\delta = 0$: the actual outcome matches the Critic's prediction exactly. The policy gradient signal is zero, and the action's probability remains unchanged.
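The three cases above all apply the same formula, which fits in a one-line function. This is a minimal sketch: the function name `td_error` and its argument order are assumptions made for illustration, and the inputs are the numbers from the worked cases.

```python
def td_error(reward, value_next, value_current, gamma=0.9):
    """One-step TD error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_current


# (reward, V(s'), V(s)) for the three cases worked through above.
cases = [
    (2.0, 4.0, 5.0),   # outcome better than the Critic predicted
    (-1.0, 4.0, 5.0),  # outcome worse than predicted
    (1.0, 5.0, 5.5),   # outcome exactly as predicted
]

deltas = [td_error(r, v_next, v) for r, v_next, v in cases]
for (r, v_next, v), delta in zip(cases, deltas):
    print(f"r={r:+.1f}, V(s')={v_next}, V(s)={v} -> delta={delta:+.2f}")
```

When the next state is terminal there is nothing left to bootstrap from, so the caller would pass `value_next=0.0`.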

Now let us connect three time steps. Consider a 3-step episode with $\gamma = 0.9$:

| Step | State | Action | $r$ | Next state | $V(s)$ | $V(s')$ | $\delta = r + \gamma V(s') - V(s)$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | $s_0$ | $a_0$ | $+3$ | $s_1$ | 2.0 | 4.0 | $3 + 0.9 \times 4.0 - 2.0 = 3 + 3.6 - 2.0 = +4.6$ |
| 1 | $s_1$ | $a_1$ | $+1$ | $s_2$ | 4.0 | 1.0 | $1 + 0.9 \times 1.0 - 4.0 = 1 + 0.9 - 4.0 = -2.1$ |
| 2 | $s_2$ | $a_2$ | $+2$ | $s_3$ | 1.0 | 0.0 | $2 + 0.9 \times 0.0 - 1.0 = 2 + 0.0 - 1.0 = +1.0$ |

The three $\delta$ values are $+4.6$, $-2.1$, and $+1.0$. Step 0's action far exceeded expectations, so the policy should increase $a_0$'s probability; step 1's action fell short, so the policy should decrease $a_1$'s probability; step 2 slightly exceeded expectations, mildly encouraging $a_2$.

For comparison, the MC returns $G_t$ for the same trajectory are:

$$G_0 = 3 + 0.9 \times 1 + 0.9^2 \times 2 = 3 + 0.9 + 1.62 = 5.52$$

$$G_1 = 1 + 0.9 \times 2 = 2.8$$

$$G_2 = 2$$

The corresponding MC advantage estimates:

| Step | $G_t$ | $V(s)$ | $A_{\text{MC}} = G_t - V(s)$ |
| --- | --- | --- | --- |
| 0 | 5.52 | 2.0 | $5.52 - 2.0 = +3.52$ |
| 1 | 2.8 | 4.0 | $2.8 - 4.0 = -1.2$ |
| 2 | 2 | 1.0 | $2 - 1.0 = +1.0$ |

Both estimates give the same directional signals (positive, negative, positive), but different magnitudes. The TD advantage $\delta$ looks only one step ahead, while the MC advantage $G_t - V(s)$ sees to the end of the episode. $\delta$ has lower variance (only one step of randomness) but is biased (depends on the accuracy of $V(s')$); $G_t - V(s)$ is unbiased but high-variance (incorporating randomness from the entire trajectory).
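Both estimators can be computed side by side from the same trajectory. The sketch below is illustrative only: the variable names are assumptions, and the rewards and Critic values are those of the 3-step episode, with the terminal state's value set to 0.

```python
gamma = 0.9
rewards = [3.0, 1.0, 2.0]
values = [2.0, 4.0, 1.0, 0.0]  # V(s_0)..V(s_3); terminal s_3 has value 0

# TD advantage: one real reward, then bootstrap from the Critic's V(s').
td_adv = [rewards[t] + gamma * values[t + 1] - values[t]
          for t in range(len(rewards))]

# MC advantage: full discounted return minus the Critic's estimate V(s).
mc_adv = []
g = 0.0
for t in reversed(range(len(rewards))):
    g = rewards[t] + gamma * g
    mc_adv.append(g - values[t])
mc_adv.reverse()

for t, (td, mc) in enumerate(zip(td_adv, mc_adv)):
    same_sign = (td > 0) == (mc > 0)
    print(f"t={t}: TD={td:+.2f}, MC={mc:+.2f}, same sign={same_sign}")
```

The gap at step 0 comes entirely from bootstrapping: the Critic valued $s_1$ at 4.0 while the sampled return from $s_1$ was 2.8, so the two estimates differ by $0.9 \times (4.0 - 2.8) = 1.08$.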

This is the MC-to-TD transition replayed in the policy optimization setting: REINFORCE uses $G_t$ (MC), while Actor-Critic uses $\delta$ (TD).

| | REINFORCE (MC) | Actor-Critic (TD) |
| --- | --- | --- |
| Advantage estimate | $G_t - V(s)$ (requires full trajectory) | $r + \gamma V(s') - V(s) = \delta$ (update after one step) |
| Update timing | after the episode ends | every step |
| Variance | high | low |
| Cost | none | requires training a Critic |

Implementing the Critic Network

To compute $\delta = r + \gamma V(s') - V(s)$, you need $V(s)$ and $V(s')$. In real problems $V$ is unknown -- a network is needed to approximate it. This network is the Critic.

```text
Actor (policy network)             Critic (value network)
  input:  state s                   input:  state s
  output: π_θ(a|s) distribution      output: V_φ(s) scalar
  role:   choose actions             role:   evaluate state value
  params: θ                          params: φ
```

The Actor and the Critic share the same input (the state $s$) but produce different outputs: the Actor outputs a probability distribution over actions, while the Critic outputs a scalar value estimate. They cooperate through the advantage estimate $A \approx \delta$: the Critic provides an evaluation signal, and the Actor adjusts its behavior based on that evaluation.
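The division of labor can be sketched with the simplest possible function approximators. In the Python below, linear models stand in for the two networks; this is an assumption made to keep the example dependency-free, not the architecture used later. The class names, state features, and hand-picked Critic weights are all hypothetical, and Critic training is deliberately left out.

```python
import math


class LinearActor:
    """Softmax policy over actions with linear logits (parameters theta)."""

    def __init__(self, n_features, n_actions):
        self.theta = [[0.0] * n_features for _ in range(n_actions)]

    def probs(self, state):
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.theta]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, state, action, advantage, lr=0.1):
        """Gradient ascent on log pi(action|state), scaled by the advantage."""
        p = self.probs(state)
        for k, row in enumerate(self.theta):
            indicator = 1.0 if k == action else 0.0
            for i, x in enumerate(state):
                # d log pi(a|s) / d theta[k][i] = (1[k == a] - pi(k|s)) * x_i
                row[i] += lr * advantage * (indicator - p[k]) * x


class LinearCritic:
    """Scalar state-value estimate V_phi(s) = phi . s (parameters phi)."""

    def __init__(self, phi):
        self.phi = list(phi)

    def value(self, state):
        return sum(w * x for w, x in zip(self.phi, state))


gamma = 0.9
actor = LinearActor(n_features=2, n_actions=2)
critic = LinearCritic(phi=[4.0, 2.0])  # hand-picked weights, not trained

s, s_next = [1.0, 0.5], [0.5, 1.0]     # hypothetical state features
action, reward = 0, 2.0

p_before = actor.probs(s)[action]
# The Critic evaluates both states; delta is the Actor's learning signal.
delta = reward + gamma * critic.value(s_next) - critic.value(s)
actor.update(s, action, advantage=delta)
p_after = actor.probs(s)[action]

print(f"V(s)={critic.value(s)}, V(s')={critic.value(s_next)}, delta={delta:+.2f}")
print(f"pi(a|s) before={p_before:.3f}, after={p_after:.3f}")
```

The weights were chosen so the Critic returns 5.0 and 4.0 for the two states, reproducing the earlier TD error of +0.6. Since that signal is positive, the Actor's update raises the probability of the action it took.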

But how is the Critic trained? How does it learn to estimate $V(s)$ accurately? The next section expands on the three methods -- DP, MC, and TD -- briefly surveyed in Chapter 3, showing how they are applied concretely in Critic training. See: Critic training methods
