6.1 The Advantage Function
At the end of Chapter 5, we found that subtracting a baseline reduces the variance of the policy gradient estimate without changing its expected value, so the gradient direction is preserved on average. This section deepens that insight and introduces the advantage function -- the bridge connecting the Actor and the Critic.
Prerequisites
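- REINFORCE policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$ -- where to insert the baseline
- State value $V^\pi(s)$: what makes a good baseline
- Action value $Q^\pi(s, a)$: the advantage is defined as the difference between $Q^\pi(s, a)$ and $V^\pi(s)$
- TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ -- a practical estimator of the advantage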
From Baseline to Advantage Function
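Recall the REINFORCE policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

$G_t$ is the total discounted return from the current step to the end of the episode (review: discounted return). The problem is that $G_t$ fluctuates wildly -- under the same policy, from the same state, two rollouts can yield completely different values.

After subtracting the baseline $V(s_t)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \left(G_t - V(s_t)\right)\right]$$

The quantity in parentheses, $G_t - V(s_t)$, is already an estimate of the advantage function. The formal definition is:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$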
| Symbol | Meaning |
|---|---|
| $A^\pi(s, a)$ | Advantage function: how much better taking action $a$ in state $s$ is compared to "average." |
| $Q^\pi(s, a)$ | Action-value function: expected discounted return starting from state $s$, taking action $a$ first, then following policy $\pi$. |
| $V^\pi(s)$ | State-value function: expected discounted return starting from state $s$ and following policy $\pi$. |
| $\pi$ | The current policy, determining the probability of each action in each state. |

Their difference captures exactly "how many extra points were earned because action $a$ was chosen."
In words, the advantage says:
How much better is this action than what we would typically get in this state?
- $A^\pi(s, a) > 0$: the action is better than expected; we should choose it more often
- $A^\pi(s, a) < 0$: the action is worse than expected; we should choose it less often
- $A^\pi(s, a) \approx 0$: the action is about as good as expected
A chess analogy: $V(s) = 0.60$ is "this position has a 60% win rate overall," while $Q(s, a) = 0.75$ is "after playing the rook move, the win rate becomes 75%." The advantage is $A(s, a) = 0.75 - 0.60 = 0.15$, meaning the rook move is 15 percentage points better than the average outcome for the position -- a strong choice.
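The definition can be sketched in a few lines of Python. This is a minimal illustration for a single state with three actions; the action names, the `q_values`, and the uniform `policy` are hypothetical numbers chosen so that the rook move mirrors the analogy, not output from a real chess engine.

```python
# Hypothetical action values Q(s, a) for one position (win probabilities).
q_values = {"rook_move": 0.75, "knight_move": 0.60, "pawn_move": 0.45}
# Hypothetical policy pi(a | s); the probabilities must sum to 1.
policy = {"rook_move": 1 / 3, "knight_move": 1 / 3, "pawn_move": 1 / 3}


def state_value(q, pi):
    """V(s) = sum_a pi(a|s) * Q(s, a): the policy-weighted average of Q."""
    return sum(pi[a] * q[a] for a in q)


def advantages(q, pi):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(q, pi)
    return {a: q[a] - v for a in q}


v = state_value(q_values, policy)
adv = advantages(q_values, policy)
# Weighting the advantages by the policy gives zero (up to rounding).
expected_adv = sum(policy[a] * adv[a] for a in adv)

print(f"V(s) = {v:.2f}")
for a, value in adv.items():
    print(f"A(s, {a}) = {value:+.2f}")
```

Because $V^\pi(s)$ is the policy-weighted average of $Q^\pi(s, a)$, the advantages always average to zero under the policy: some actions must be above average and others below.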
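Let us work through a 3-step episode to see how the advantage is computed. Suppose the discount factor is $\gamma$, and a sampled trajectory yields the rewards $r_0$, $r_1$, and $r_2$ at steps 0, 1, and 2.

Computing the discounted return from each time step, working backward from the end of the episode:

$$G_2 = r_2, \qquad G_1 = r_1 + \gamma G_2, \qquad G_0 = r_0 + \gamma G_1$$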
Now suppose the Critic provides value estimates for each state:
| State | $V(s)$ |
|---|---|
| $s_0$ | 3.0 |
| $s_1$ | 2.5 |
| $s_2$ | 0.8 |
Substituting $G_t$ and $V(s_t)$ into $\hat{A}_t = G_t - V(s_t)$ yields the advantage estimate at each time step:

| Step | State | $G_t$ | $V(s_t)$ | $\hat{A}_t = G_t - V(s_t)$ | Meaning |
|---|---|---|---|---|---|
| 0 | $s_0$ | $G_0$ | 3.0 | $G_0 - 3.0 > 0$ | better than expected |
| 1 | $s_1$ | $G_1$ | 2.5 | $G_1 - 2.5 > 0$ | better than expected |
| 2 | $s_2$ | $G_2$ | 0.8 | $G_2 - 0.8 > 0$ | better than expected |

All three advantages are positive, meaning every action along this trajectory performed better than average. $\hat{A}_t = G_t - V(s_t)$ is an MC-return-based estimate of the advantage; it is unbiased but high-variance (different trajectories produce very different values).
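The computation can be sketched in Python as follows. The Critic values 3.0, 2.5, and 0.8 come from the table above; the discount factor `gamma = 0.9` and the rewards `[3.0, 1.0, 2.0]` are illustrative assumptions, chosen only so that the sketch has concrete numbers to work with.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):  # accumulate from the last step backward
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def mc_advantages(rewards, values, gamma):
    """MC advantage estimate: A_t = G_t - V(s_t) for every step."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


gamma = 0.9                # assumed discount factor
rewards = [3.0, 1.0, 2.0]  # assumed rewards r_0, r_1, r_2
values = [3.0, 2.5, 0.8]   # Critic estimates V(s_0), V(s_1), V(s_2)

returns = discounted_returns(rewards, gamma)
adv = mc_advantages(rewards, values, gamma)
for t, (g, a) in enumerate(zip(returns, adv)):
    print(f"t={t}: G_t={g:.2f}, A_t={a:+.2f}")
```

With these assumed rewards, all three advantage estimates come out positive, matching the qualitative conclusion of the example.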
Advantage Versus Cumulative Return
The advantage reduces variance because it subtracts the reward you would have gotten anyway, retaining only the portion attributable to the specific action.
Consider a more complete example. Suppose that in some state $s$, the policy's average return is $V(s) = 10$. Four trajectories are sampled with returns $G_t = 18$, $15$, $7$, and $4$.

First, using $G_t$ as the gradient signal:

| Episode | $G_t$ | Gradient signal | Meaning |
|---|---|---|---|
| 1 | 18 | $+18$ | Large positive, strongly pushes action |
| 2 | 15 | $+15$ | Positive, pushes action |
| 3 | 7 | $+7$ | Positive, pushes action |
| 4 | 4 | $+4$ | Positive, pushes action |
All four are positive. The policy would conclude that "in this state, no matter what, this action is good" -- yet episodes 3 and 4 actually returned below average.
Now using $G_t - V(s)$:

| Episode | $G_t$ | $V(s)$ | Gradient signal $G_t - V(s)$ | Meaning |
|---|---|---|---|---|
| 1 | 18 | 10 | $+8$ | Far above average, strongly push |
| 2 | 15 | 10 | $+5$ | Above average, push |
| 3 | 7 | 10 | $-3$ | Below average, suppress |
| 4 | 4 | 10 | $-6$ | Far below average, strongly suppress |

With $G_t$, all four episodes produce positive gradient signals -- the policy cannot distinguish "truly good" from "lucky high return." With $G_t - V(s)$, the signal is calibrated: above-average returns get positive signals, below-average returns get negative signals.
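To see the effect quantitatively: using $G_t$, the four signals have mean $11$ and (population) variance $32.5$. Using $G_t - V(s)$, the four signals have mean $1$ and variance $32.5$.

The four-sample variance is the same, because subtracting a constant shifts the signals without changing their spread. What changes is their size: $G_t - V(s)$ has a mean much closer to zero, and its mean square drops from $153.5$ to $33.5$. In the policy gradient, each signal multiplies $\nabla_\theta \log \pi_\theta(a \mid s)$, so the magnitude of the signal, not only its spread, determines how noisy the gradient estimate is. With raw $G_t$, every sample pushes in the same direction with a large weight set by the randomness of the entire trajectory (potentially spanning from 0 to dozens). With $G_t - V(s)$, the weights are centered by $V(s)$, positive and negative contributions partially cancel, and the estimated gradient direction is more stable. This is the mechanism by which "subtracting a baseline reduces variance."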
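The arithmetic for the four episodes can be checked with a short script. It uses only the returns and the baseline given in the example, and Python's standard `statistics` module; the helper `mean_square` is a name introduced here for illustration. Besides the mean and population variance, it computes the mean square of each signal, which is the quantity that actually shrinks when the baseline is subtracted.

```python
from statistics import fmean, pvariance

returns = [18.0, 15.0, 7.0, 4.0]  # G_t for the four sampled episodes
baseline = 10.0                   # V(s), the policy's average return
centered = [g - baseline for g in returns]  # G_t - V(s)


def mean_square(xs):
    """Average of the squared signals (second moment about zero)."""
    return fmean([x * x for x in xs])


for name, signal in (("G_t", returns), ("G_t - V(s)", centered)):
    print(
        f"{name:>10}: mean={fmean(signal):.1f}, "
        f"variance={pvariance(signal):.1f}, "
        f"mean square={mean_square(signal):.1f}"
    )
```

The population variance is identical for both signals, while the mean and the mean square are much smaller after centering.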
Estimating the Advantage with the TD Error
The theoretical definition of the advantage is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, but in practice we rarely compute $Q^\pi$ directly. Starting from the definition and performing a one-step expansion yields a more practical form.
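Begin with $Q^\pi(s_t, a_t)$. The action-value function is defined as:

$$Q^\pi(s_t, a_t) = \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t\right]$$

This expectation represents: after taking action $a_t$ in state $s_t$, the immediate reward plus the discounted value of the next state. If we take a single sample (without completing the entire episode or averaging over all possible transitions), we obtain a one-step estimate of $Q^\pi$:

$$Q^\pi(s_t, a_t) \approx r_t + \gamma V^\pi(s_{t+1})$$

where $r_t$ is the actual reward received in this step and $s_{t+1}$ is the actual next state reached. Substituting this approximation into the advantage definition:

$$A^\pi(s_t, a_t) \approx r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$

With the Critic's estimate $V$ standing in for the unknown $V^\pi$, the right-hand side is the TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$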
| Symbol | Meaning |
|---|---|
| $r_t$ | The actual immediate reward received in this step. |
| $\gamma$ | Discount factor, controlling how much future value is discounted. |
| $V(s_{t+1})$ | The Critic's value estimate for the next state $s_{t+1}$. |
| $V(s_t)$ | The Critic's value estimate for the current state $s_t$. |
| $\delta_t$ | TD error: how much better (or worse) the actual outcome was after one step. |
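A short derivation shows why the TD error is a sensible stand-in for the advantage. Assume, for the moment, that the Critic is exact ($V = V^\pi$). Taking the expectation of $\delta_t$ over the reward and next state, conditioned on the current state and action, gives:

$$
\begin{aligned}
\mathbb{E}\left[\delta_t \mid s_t, a_t\right]
&= \mathbb{E}\left[r_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t\right] - V^\pi(s_t) \\
&= Q^\pi(s_t, a_t) - V^\pi(s_t) \\
&= A^\pi(s_t, a_t)
\end{aligned}
$$

Under that assumption, a single TD error is a noisy but unbiased sample of the advantage. When the Critic is imperfect, the same expectation picks up the Critic's error, which is where the bias of the TD estimate comes from.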
Replacing $G_t - V(s_t)$ with the TD error $\delta_t$ as the policy gradient signal has two benefits:
- No need to wait for the episode to end -- updates can happen after every step ($G_t$ requires a full episode, a limitation of MC methods)
- Lower variance -- $\delta_t$ involves randomness from only a single step ($G_t$ accumulates randomness over the entire trajectory)
Let us walk through an example. Suppose the discount factor is $\gamma$, and at some step:
- Current state $s_t$, Critic estimates $V(s_t)$
- The agent takes some action $a_t$ and receives immediate reward $r_t$
- The next state is $s_{t+1}$, Critic estimates $V(s_{t+1})$

Substituting into the TD error formula:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

$\delta_t > 0$ means this step was better than the Critic predicted. Using this as the advantage estimate, the policy gradient will increase the probability of this action, and only slightly when $\delta_t$ is small.

Try different numbers. If the same transition instead yields a one-step target $r_t + \gamma V(s_{t+1})$ far below $V(s_t)$, then $\delta_t < 0$: this step performed far worse than predicted, and the policy gradient will decrease the probability of this action.

Now consider the case $\delta_t = 0$, which occurs when $r_t + \gamma V(s_{t+1}) = V(s_t)$. The actual outcome matches the Critic's prediction exactly. The policy gradient signal is zero, and the action's probability remains unchanged.
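The three cases can be made concrete with a small helper. This is a sketch with hypothetical numbers: the function name `td_error`, its `done` flag, the discount factor, and every reward and value estimate below are assumptions chosen for illustration.

```python
def td_error(reward, value, next_value, gamma, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    If the transition ends the episode, V(s') is treated as 0.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


gamma = 0.9  # assumed discount factor

# Case 1: outcome slightly better than predicted (delta > 0).
better = td_error(reward=1.0, value=2.0, next_value=1.5, gamma=gamma)

# Case 2: same transition, much lower reward (delta < 0).
worse = td_error(reward=-3.0, value=2.0, next_value=1.5, gamma=gamma)

# Case 3: reward plus discounted next value equals V(s) (delta = 0).
as_expected = td_error(reward=0.2, value=2.0, next_value=2.0, gamma=gamma)

print(f"better: {better:+.2f}, worse: {worse:+.2f}, as expected: {as_expected:+.2f}")
```

The sign of the result tells the Actor whether to make the action more or less likely, and the magnitude tells it by how much.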
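Now let us connect three time steps. Consider a 3-step episode with $\gamma = 0.9$:

| Step | State | Action | $r_t$ | $V(s_t)$ | Next state | $V(s_{t+1})$ | $\delta_t$ |
|---|---|---|---|---|---|---|---|
| 0 | $s_0$ | $a_0$ | 3 | 2.0 | $s_1$ | 4.0 | $3 + 0.9 \times 4.0 - 2.0 = 4.6$ |
| 1 | $s_1$ | $a_1$ | 1 | 4.0 | $s_2$ | 1.0 | $1 + 0.9 \times 1.0 - 4.0 = -2.1$ |
| 2 | $s_2$ | $a_2$ | 2 | 1.0 | terminal | 0.0 | $2 + 0.9 \times 0.0 - 1.0 = 1.0$ |

The three $\delta_t$ values are $4.6$, $-2.1$, and $1.0$. Step 0's action far exceeded expectations, so the policy should increase $a_0$'s probability; step 1's action fell short, so the policy should decrease $a_1$'s probability; step 2 slightly exceeded expectations, mildly encouraging $a_2$.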
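For comparison, the MC returns for the same trajectory are:

$$G_0 = 5.52, \qquad G_1 = 2.8, \qquad G_2 = 2$$

The corresponding MC advantage estimates:

| Step | $G_t$ | $V(s_t)$ | $G_t - V(s_t)$ |
|---|---|---|---|
| 0 | 5.52 | 2.0 | $3.52$ |
| 1 | 2.8 | 4.0 | $-1.2$ |
| 2 | 2 | 1.0 | $1.0$ |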
Both estimates give the same directional signals (positive, negative, positive), but different magnitudes. The TD advantage looks only one step ahead, while the MC advantage sees to the end of the episode. $\delta_t$ has lower variance (only one step of randomness) but is biased (it depends on the accuracy of $V$); $G_t - V(s_t)$ is unbiased but high-variance (it incorporates randomness from the entire trajectory).

This is the MC-to-TD transition replayed in the policy optimization setting: REINFORCE uses $G_t - V(s_t)$ (MC), while Actor-Critic uses $\delta_t$ (TD).
| | REINFORCE (MC) | Actor-Critic (TD) |
|---|---|---|
| Advantage estimate | $G_t - V(s_t)$ (requires full trajectory) | $\delta_t$ (update after one step) |
| Update timing | after the episode ends | every step |
| Variance | high | low |
| Cost | none | requires training a Critic |
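The two estimators can be computed side by side for the 3-step episode used in this section. The Critic values 2.0, 4.0, 1.0 and the terminal value 0.0 come from the example; the rewards 3, 1, 2 and $\gamma = 0.9$ are treated as assumptions here, being the values consistent with the MC returns 5.52, 2.8, and 2 listed for this trajectory. The helper name `discounted_sum` is introduced for illustration.

```python
gamma = 0.9                       # assumed discount factor
rewards = [3.0, 1.0, 2.0]         # assumed rewards r_0, r_1, r_2
values = [2.0, 4.0, 1.0]          # Critic estimates V(s_0), V(s_1), V(s_2)
next_values = values[1:] + [0.0]  # V(s_{t+1}); 0.0 after the final step


def discounted_sum(xs, gamma):
    """For each t, return sum_k gamma**k * xs[t + k], computed backward."""
    out, acc = [], 0.0
    for x in reversed(xs):
        acc = x + gamma * acc
        out.append(acc)
    out.reverse()
    return out


# TD advantage: one-step error at every step.
td = [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]

# MC advantage: full discounted return minus the Critic's estimate.
returns = discounted_sum(rewards, gamma)
mc = [g - v for g, v in zip(returns, values)]

# The MC advantage equals the discounted sum of the remaining TD errors.
td_accumulated = discounted_sum(td, gamma)

for t in range(len(rewards)):
    print(f"t={t}: delta={td[t]:+.2f}, G-V={mc[t]:+.2f}")
```

The last computation illustrates how the two estimates relate: when the value after the final step is 0, $G_t - V(s_t)$ equals the discounted sum of the TD errors from step $t$ onward, so the MC estimate accumulates the randomness of every remaining step while $\delta_t$ keeps only the first term.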
Implementing the Critic Network
To compute $\delta_t$, you need $V(s_t)$ and $V(s_{t+1})$. In real problems $V^\pi$ is unknown -- a network $V_\phi$ is needed to approximate it. This network is the Critic.
| | Actor (policy network) | Critic (value network) |
|---|---|---|
| Input | state $s$ | state $s$ |
| Output | distribution $\pi_\theta(a \mid s)$ | scalar $V_\phi(s)$ |
| Role | choose actions | evaluate state value |
| Parameters | $\theta$ | $\phi$ |

The Actor and the Critic share the same input (the state $s$) but produce different outputs: the Actor outputs a probability distribution over actions, while the Critic outputs a scalar value estimate. They cooperate through the advantage estimate $\delta_t$: the Critic provides an evaluation signal, and the Actor adjusts its behavior based on that evaluation.
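The division of labor can be sketched without any deep learning library. In the sketch below, linear models stand in for the two networks; the class names `LinearActor` and `LinearCritic`, the state dimension, the example states, the reward, and the discount factor are all hypothetical choices for illustration, not a prescribed architecture.

```python
import math
import random


class LinearActor:
    """Softmax policy pi_theta(a | s) with linear logits (stand-in for a policy network)."""

    def __init__(self, state_dim, num_actions, rng):
        self.theta = [[rng.uniform(-0.1, 0.1) for _ in range(state_dim)]
                      for _ in range(num_actions)]

    def action_probs(self, state):
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.theta]
        m = max(logits)  # subtract the max logit for numerical stability
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


class LinearCritic:
    """Scalar estimate V_phi(s) with linear weights (stand-in for a value network)."""

    def __init__(self, state_dim, rng):
        self.phi = [rng.uniform(-0.1, 0.1) for _ in range(state_dim)]

    def value(self, state):
        return sum(w * x for w, x in zip(self.phi, state))


rng = random.Random(0)
actor = LinearActor(state_dim=4, num_actions=2, rng=rng)
critic = LinearCritic(state_dim=4, rng=rng)

state = [0.1, -0.2, 0.05, 0.3]         # hypothetical current state
next_state = [0.12, -0.1, 0.04, 0.25]  # hypothetical next state
reward, gamma = 1.0, 0.9               # hypothetical reward and discount

probs = actor.action_probs(state)      # Actor: distribution over actions
action = rng.choices(range(len(probs)), weights=probs)[0]

# Critic: scalar values, combined into the TD error that the Actor learns from.
delta = reward + gamma * critic.value(next_state) - critic.value(state)
print(f"probs={probs}, action={action}, delta={delta:+.3f}")
```

Both models read the same state vector, but only the Actor's output is used to pick an action; the Critic's output enters solely through $\delta_t$.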
But how is the Critic trained? How does it learn to estimate $V^\pi$ accurately? The next section expands on the three methods -- DP, MC, and TD -- briefly surveyed in Chapter 3, showing how they are applied concretely in Critic training. See: Critic training methods