5.4 Variance Reduction and Baselines
In the previous section, we ran REINFORCE on CartPole and saw the most direct symptom of high variance: the learning curve shakes violently, and the policy gets dragged around by luck. This section answers a key question:
Can we reduce the variance of the gradient estimate without changing the direction of the gradient in expectation?
Yes. The policy gradient theorem has an important property: in the gradient estimator, we are allowed to subtract a baseline that does not depend on the action.
A Baseline Does Not Change the Expectation
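Recall the policy gradient theorem:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right]$$

Now replace $G_t$ with $G_t - b(s_t)$, where $b(s_t)$ is any function that depends only on the state and not on the action:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \big(G_t - b(s_t)\big)\right]$$

Why does this not change the expectation? Because the baseline term contributes zero. For a fixed state $s_t$, the baseline $b(s_t)$ does not depend on $a_t$, so it can be pulled out of the expectation over actions:

$$\mathbb{E}_{a_t \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t)\big] = b(s_t)\, \mathbb{E}_{a_t \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\big] = b(s_t) \cdot 0 = 0$$

The last step uses a key identity: the expectation of the score function (the gradient of the log-probability) is zero. Intuitively, $\nabla_\theta \log \pi_\theta(a|s)$ measures “how should we adjust parameters to increase the probability of a particular action.” If we take a probability-weighted average over all actions, the increases and decreases cancel exactly.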
Proof: $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = 0$
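The normalization condition of a probability distribution is $\sum_a \pi_\theta(a|s) = 1$. Taking the gradient with respect to $\theta$ on both sides gives:

$$\sum_a \nabla_\theta \pi_\theta(a|s) = 0$$

Using $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \pi_\theta(a|s) / \pi_\theta(a|s)$, rewrite $\nabla_\theta \pi_\theta(a|s)$ as $\pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)$:

$$\sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = 0$$

The left-hand side is exactly $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]$.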
So a baseline does not change the expectation (and therefore the expected direction) of the gradient. What it changes in practice is the variance of the gradient estimator.
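The zero-mean property of the score function can be checked numerically. The sketch below assumes a softmax policy over three actions whose parameters are the logits themselves (an illustrative choice, not something this section prescribes); the helper names `softmax`, `score`, and `expected_score` are hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def score(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def expected_score(logits):
    """Probability-weighted average of the score over all actions."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = score(logits, a)
        for i in range(len(logits)):
            total[i] += p * g[i]
    return total

logits = [0.3, -1.2, 2.0]  # arbitrary illustrative parameters
print(expected_score(logits))  # each component is zero up to rounding error
```

Multiplying every term by the same constant $b(s)$ before averaging would still give zero, which is the reason the baseline term vanishes in expectation.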
Intuition: Why a Baseline Reduces Variance
After subtracting a baseline, the update signal changes from “how many points did this rollout get” to “how much better was this rollout than what we expected.”
Consider an example in CartPole. Suppose the current policy is already reasonably good: starting from state $s$, it lasts about 100 steps on average ($b(s) = 100$):
| Case | $G_t$ | $G_t - b(s)$ | Update Direction (No Baseline) | Update Direction (With Baseline) |
|---|---|---|---|---|
| Good luck, lasted 150 steps | 150 | +50 | Strongly reinforce | Moderately reinforce |
| Typical, lasted 100 steps | 100 | 0 | Moderately reinforce | No update |
| Bad luck, lasted only 50 steps | 50 | -50 | Slightly reinforce | Decrease probability |
Without a baseline, all three cases produce a positive $G_t$, so the policy gets reinforced even when that particular outcome is worse than average (the “bad luck” case). With a baseline, the typical case produces no update, and the bad-luck case is correctly penalized.
What the baseline does is build a per-state “passing line”: if the outcome is above the line, reinforce; if it is below the line, suppress. The line is not constant: different states have different $b(s)$, because what counts as “normal performance” depends on where you are in the episode.
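To make the effect on variance concrete, here is a minimal sketch that computes the exact mean and variance of one gradient component, with and without a baseline. It assumes a deliberately simplified setting that is not taken from the CartPole experiment: a single state, two equally likely actions under a softmax policy, and fixed returns of 150 and 50. The function name `estimator_stats` is hypothetical.

```python
import math

def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimator_stats(logits, returns, baseline):
    """Exact mean and variance of the gradient estimate for logit 0:
    d/d(logit_0) log pi(a) * (R(a) - baseline), with a drawn from pi."""
    probs = softmax(logits)
    samples = []
    for a, p in enumerate(probs):
        score0 = (1.0 if a == 0 else 0.0) - probs[0]  # score for logit 0
        samples.append((p, score0 * (returns[a] - baseline)))
    mean = sum(p * g for p, g in samples)
    var = sum(p * (g - mean) ** 2 for p, g in samples)
    return mean, var

logits = [0.0, 0.0]        # both actions equally likely (assumed)
returns = [150.0, 50.0]    # assumed return of each action
for b in (0.0, 100.0):
    mean, var = estimator_stats(logits, returns, b)
    print(f"baseline={b:5.1f}  mean={mean:.1f}  variance={var:.1f}")
```

In this toy setup the mean is 25 for both baselines, while the variance falls from 2500 to 0. The drop all the way to zero is an artifact of the symmetric two-action example; in general a well-chosen baseline lowers the variance without eliminating it.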
The Best Baseline Is $V(s)$
The baseline can be any function that does not depend on the action. The simplest choice is a constant (for example, the average return across episodes). A constant baseline is already useful in stateless bandits, but it cannot distinguish between different states.
A better choice is a state-dependent baseline $b(s)$. Theory shows that when $b(s) = V(s)$, the variance reduction is close to optimal [1]. Look at it from another angle: $V(s)$ answers exactly the question “starting from this state, and following the current policy, how many points do we get on average.” Using it as a baseline turns the update signal into “how much better was the actual outcome than the average.”
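We call $A(s_t, a_t)$ the advantage:

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

In REINFORCE, $G_t$ is a Monte Carlo estimate of $Q(s_t, a_t)$, so the advantage estimate takes the form:

$$\hat{A}_t = G_t - V(s_t)$$

- $\hat{A}_t > 0$: this action is better than the average at this state; increase its probability
- $\hat{A}_t < 0$: this action is worse than the average; decrease its probability
- $\hat{A}_t \approx 0$: about as expected; no strong update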
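The advantage estimate and its three sign cases can be sketched in plain Python. The function names `discounted_returns` and `advantage_estimates`, the discount factor `gamma`, and the baseline predictions in `values` are all assumptions made for illustration.

```python
def discounted_returns(rewards, gamma=1.0):
    """Compute G_t = r_t + gamma * G_{t+1} by scanning the episode backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def advantage_estimates(rewards, values, gamma=1.0):
    """Return A_hat_t = G_t - V(s_t) for every step of one episode."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

rewards = [1.0, 1.0, 1.0]   # CartPole gives +1 per surviving step
values = [2.0, 2.0, 2.0]    # hypothetical baseline predictions V(s_t)
print(advantage_estimates(rewards, values))  # [1.0, 0.0, -1.0]
```

The three outputs line up with the three cases: the first step did better than predicted, the second matched the prediction, and the third fell short.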
What the Advantage Function Means
The advantage function is one of the most important ideas in policy gradient methods. It does not ask “how good is this action,” but rather “how much better is this action than average.” This relative signal is far more stable than the absolute return signal $G_t$.
We will use the advantage function repeatedly in later chapters:
- Chapter 6 Actor-Critic: use a critic network to estimate $V(s)$ directly, enabling per-step updates (no need to wait for the episode to end)
- Chapter 7 PPO: use GAE (Generalized Advantage Estimation) to trade off bias and variance
- Chapter 9 RLHF: the signal produced by a reward model is, in essence, also a kind of advantage estimate
Implementation: Adding a Value Network
In practice, we estimate $V(s)$ with an additional neural network (a value network):

```python
import torch
import torch.nn as nn

# The value network learns V(s), using G_t as the regression target.
# squeeze(-1) assumes value_net returns shape (T, 1); returns_t has shape (T,).
values = value_net(states_t).squeeze(-1)
value_loss = nn.MSELoss()(values, returns_t)

# Update the policy using the advantage.
# no_grad keeps the policy loss from backpropagating into the value network.
with torch.no_grad():
    values_pred = value_net(states_t).squeeze(-1)
advantages = returns_t - values_pred  # Â_t = G_t - V(s_t)
policy_loss = -(log_probs * advantages).mean()
```

The value network is trained so that $V(s_t)$ is as close to $G_t$ as possible. In other words, it is learning “starting from this state, what score do we get on average.” The policy network no longer uses $G_t$ directly, but uses the advantage $\hat{A}_t = G_t - V(s_t)$.
This is REINFORCE with a Value Baseline. It is still REINFORCE (you still have to wait until the episode ends and use Monte Carlo returns), but the update signal changes from $G_t$ to $G_t - V(s_t)$.
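Why does regressing onto $G_t$ produce an average? A minimal sketch, using a single scalar in place of the value network (a tabular stand-in, not the implementation above): gradient descent on the squared error pulls the estimate toward the mean of the returns observed from that state. The function name `fit_state_value`, the learning rate, and the sampled returns are assumptions for illustration.

```python
def fit_state_value(sampled_returns, lr=0.01, epochs=2000):
    """Minimize the squared error (v - G)^2 / 2 by stochastic gradient descent."""
    v = 0.0
    for _ in range(epochs):
        for g in sampled_returns:
            v -= lr * (v - g)  # gradient of (v - g)^2 / 2 with respect to v
    return v

sampled_returns = [150.0, 100.0, 50.0]  # returns observed from the same state
v = fit_state_value(sampled_returns)
print(round(v, 1))
```

With these settings the estimate settles close to 100, the mean of the three sampled returns. A neural value network does the same thing, while also generalizing across states.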
In the next section, we will compare vanilla REINFORCE and REINFORCE + Value Baseline on CartPole: Hands-on: CartPole Comparison Experiment.
[1] Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471-1530.