5.2 The Policy Gradient Theorem and REINFORCE
The previous section explained why we need policy-based methods: DQN's does not work in continuous action spaces, so learning the policy directly is the more natural approach. This section answers two questions: what metric should we use to measure "how good" a policy is, and how do we optimize that metric?
The Policy Objective
Chapter 3 introduced the policy objective — a measure of "how good this policy is overall." The answer is natural: across all possible starting points, how much discounted return does policy accumulate on average?
| Symbol | Role | Meaning |
|---|---|---|
| Policy parameters | Neural network weights: changing them changes the policy's behavior | |
| Policy | Given a state, outputs a probability distribution over actions | |
| Objective | The policy's "report card" — the average score achieved by the policy with parameters | |
| $\mathbb{E}{\pi\theta}$ | Expectation | Run the policy many times and average |
| Discounted reward | Reward at time ; farther-future rewards are worth less |
is our north star. The goal is simple: find parameters that maximize .
Numerical Example: for a 3-Step Episode
Suppose an episode has only 3 steps, with rewards , , and discount factor . The discounted return of this trajectory is
This is the return of one trajectory. is the expectation over all possible trajectory returns — run the policy infinitely many times and take the average. Different policies produce different trajectory distributions, hence different . Suppose policy A tends to choose high-reward paths and policy B tends to choose low-reward paths:
| Policy | Average return of possible trajectories | |
|---|---|---|
| A | Around | Larger |
| B | Around | Smaller |
The optimization objective is to find parameters that maximize (i.e., the highest average return).
Gradient Ascent
How do we make larger? The classic move in deep learning: walk in the direction of the gradient.
| Symbol | Role | Meaning |
|---|---|---|
| Gradient | Which direction should we change parameters to improve the policy most? | |
| Learning rate | How big is one step? Too large oscillates; too small is slow | |
| Ascent | Notice the plus sign: we maximize, not minimize |
Numerical Example: One Parameter Update
Suppose parameters , gradient , learning rate . Writing out the update component by component:
After the update, . Each component has moved a small step in the direction indicated by the gradient. controls the step size: with , each step moves only a few thousandths, but the effect accumulates over many rounds.
But how do we compute ? The objective contains an expectation , which in principle requires averaging over all possible trajectories. The number of possible trajectories is astronomical; we cannot enumerate them all. It is like trying to compute the average height of every student in a school — you cannot measure everyone, but you can randomly sample 100 students and estimate. The sample mean computed from those 100 students is a sampling estimate of the true mean. Policy gradients use the same idea: run a few trajectories and use the average of their gradients to estimate the true .
The Policy Gradient Theorem
This is exactly where the policy gradient theorem enters. In 1992, Williams showed in the REINFORCE paper that the seemingly intractable gradient can be rewritten into a form that can be estimated by sampling. [1] Sutton and colleagues later generalized and systematized this result in 2000. [2]
Let's unpack it term by term:
| Symbol | Role | Meaning |
|---|---|---|
| Take gradient | Which direction should we change the parameters? | |
| Log probability | Under state , the log-probability of choosing action | |
| Gradient of log-prob | How do parameters change the probability of choosing this action? | |
| Return | Total reward from time to the end — "how many points did we get after this action?" | |
| Outer | Expectation | "Run many episodes and average" — approximated by sampling |
In one sentence: if an action leads to a good outcome (large ), increase the probability of taking it again; if it leads to a bad outcome (small ), decrease that probability.
Numerical Example: Gradient Computation in a Bandit Setting
Let's walk through a bandit scenario. Two-armed bandit: arm A wins with probability 30%, arm B with probability 70%. The policy has a single parameter (assume larger means higher probability of choosing B). Currently , .
Trajectory 1: chose B, reward = 1.0
First step, compute the log probability:
Second step, compute the gradient of the log probability. Suppose that under the current parameters (this value depends on the specific parameterization; we pick a concrete number here).
Third step, multiply by the return. The bandit has only one step, so :
The gradient is positive, so the parameter updates as . The parameter increases, and increases accordingly — the policy learns "choosing B led to a good result, so choose B more often."
Trajectory 2: chose A, reward = 0
Suppose (the gradient direction for choosing A is opposite to choosing B). Multiply by the return :
The gradient is zero — the parameters do not update. The policy chose A but got zero reward, so it neither encourages nor discourages this action.
Trajectory 3: chose A, reward = 0.5
The gradient is negative, so the parameter updates as . The parameter decreases, which means decreases and increases — the policy learns "A got some reward this time, so we can choose A slightly more often." But since A's average reward is much lower than B's, over many samples the positive gradient from B will accumulate and dominate, and the policy will eventually converge to preferring B.
Numerical Example: Gradient Computation for a 3-Step Episode
Consider a CartPole-like scenario: 3-step episode, .
| Step | State | Action | Reward |
|---|---|---|---|
| 0 | = right | ||
| 1 | = left | ||
| 2 | = right |
First, compute the return at each step. includes only the last step:
accumulates from step 1 to the end:
accumulates from step 0 to the end:
Suppose the log-probability gradient values at each step are:
| Step | |||
|---|---|---|---|
| 0 | |||
| 1 | |||
| 2 |
The gradient contribution from each step:
The gradient estimate from this single trajectory is the sum of the three:
If , the parameter update is . Step 1 contributes the most (with and a relatively large gradient component of ), indicating that the decision "choose left in " has a prominent contribution to the final return, so the parameters will move in the direction of increasing the probability of choosing left in .
The Log-Derivative Trick
Why don't we write something like , and instead introduce a ?
This is a mathematical technique known as the log-derivative trick. By the chain rule:
The division by cancels the factor hidden inside the expectation, making the entire expression clean and computable. From an engineering point of view, since probabilities lie in , directly computing gradients of probabilities can produce extremely small values, destabilizing training. The maps to , yielding more numerically stable gradients.
Numerical Example: The Effect of the Log-Derivative Trick
Take a concrete example: . Suppose a small perturbation of the parameter changes to . Then
Using the raw probability gradient:
Using the log-derivative trick:
Now consider another action with , which becomes after the same perturbation:
The two actions have the same probability gradient (), but the low-probability action's log-derivative gradient is times that of the high-probability action. This makes intuitive sense: raising a probability to is a relative increase; raising a probability to is only a relative increase. Dividing by converts absolute changes into relative changes, giving actions at different probability scales comparable magnitudes in gradient space.
Math derivation: from the objective to the policy gradient theorem
The gradient of the objective differentiates through the trajectory distribution:
Here denotes a trajectory, and is the probability that policy generates trajectory . The gradient can only act on , since the reward does not depend on :
The key step is the identity :
The trajectory probability factorizes as . Taking logs and differentiating with respect to , the environment transition probability disappears because it does not depend on , leaving only the policy terms:
Substituting back into the expectation yields the policy gradient theorem. The most elegant aspect of this derivation is that the environment dynamics — the state transition probabilities — cancel out during differentiation. This means policy gradients do not need a model of the environment, which is the fundamental reason they are much more flexible than dynamic programming approaches.
The REINFORCE Algorithm
The policy gradient theorem gives us the form of the gradient. REINFORCE is the most straightforward implementation of the theorem: it uses Monte Carlo sampling to estimate the expectation. The algorithm flow is:
- Run one full episode with the current policy , recording state, action, and reward at each step.
- For each time step, compute the return from that step until the end of the episode: .
- Estimate the gradient with samples: .
- Update parameters along the gradient: .
In PyTorch, the core update can be written in one line:
loss = -log_prob * G_t # minus sign because PyTorch does gradient descent (minimize) by default, while we want ascent (maximize)A full multi-step version:
# REINFORCE core (multi-step version)
for t in range(len(rewards)):
G_t = sum(gamma ** k * rewards[t + k] for k in range(len(rewards) - t))
loss += -log_probs[t] * G_t
optimizer.zero_grad()
loss.backward()
optimizer.step()Numerical Example: A Complete Episode's REINFORCE Update
Continuing with the 3-step episode from earlier, :
| Step | State | Action | Reward | |
|---|---|---|---|---|
| 0 | right | |||
| 1 | left | |||
| 2 | right |
Step 1: Compute the return at each step.
from step 2 to the end (only 1 step):
from step 1 to the end (2 steps):
from step 0 to the end (3 steps):
Summary:
| Step | ||
|---|---|---|
| 0 | 1 | 5.23 |
| 1 | 2 | 4.7 |
| 2 | 3 | 3 |
Step 2: Compute the gradient estimate.
Each step's gradient term is . In PyTorch, log_prob is , and autograd handles the gradient. Here we only look at the scalar loss construction:
Step 3: Backpropagation and parameter update.
loss.backward() computes . Since , we have
optimizer.step() executes (PyTorch defaults to gradient descent). The two negatives cancel:
This is exactly gradient ascent. This round of updates moves the parameters in the direction of increasing the probability of high-return actions.
A Minimal Example: Multi-Armed Bandits
Before we dive into CartPole, let's use a minimal setting to understand what loss = -log_prob * reward is really doing.
Imagine a bandit with two arms: arm A wins with probability 30%, arm B wins with probability 70%. The policy network is just a Softmax layer that outputs the probability of choosing A and B. The core training code:
probs = policy(state)
dist = torch.distributions.Categorical(probs)
action = dist.sample() # sample an action according to probability
log_prob = dist.log_prob(action) # log π(a|s)
reward = pull_arm(action.item()) # take the action
loss = -log_prob * reward # REINFORCE coreAfter 300 episodes, the probability of choosing B will typically climb from around 0.5 to 0.85–0.95: the policy learns to prefer the arm with the higher win rate. But the curve will not be smooth — it will be jagged and noisy. If you increase the learning rate from 0.01 to 0.1, the policy may swing violently between A and B.
This is the core pain point of REINFORCE: high variance.
The Variance Problem in REINFORCE
is the cumulative return from time until the episode ends, which includes all randomness along that future trajectory. For the same action, different sampled trajectories can produce very different values:
| Case | What actually happened | |
|---|---|---|
| Good luck | Subsequent steps all yielded high rewards | Large |
| Bad luck | Subsequent steps all yielded low rewards | Small |
Policy gradients use to decide whether "this action was good." But when fluctuates heavily, the same good action can be penalized due to bad luck, and the same bad action can be rewarded due to good luck. It is like judging a student's ability by a single exam — a bad score does not necessarily mean poor mastery; it may just be an off day.
Numerical Example: Comparing Gradient Signals from Two Trajectories
Back to the bandit scenario: , . Both times we chose B (the same action), but the subsequent luck differed.
Episode 1: chose B, reward 1.0 (good outcome)
The gradient pushes up.
Episode 2: chose B, reward 0.0 (bad outcome, bad luck)
The gradient signal is zero — the parameters do not change.
Now consider a multi-step example. Same 3-step episode structure, , where the policy chose the same action "right" at :
| Episode | Gradient signal direction | ||||
|---|---|---|---|---|---|
| 1 | 1 | 2 | 3 | Strongly positive | |
| 2 | 1 | 0 | 0 | Weakly positive |
Both episodes took the same action "right" at , yet differs by more than a factor of 4. Episode 1 would push up significantly, while Episode 2 would only push it up weakly. The problem is that does not only reflect whether "choosing right at is good" — it also includes the randomness of and . The rewards from the subsequent two steps are not controlled by the current action, yet they are all counted into .
In the bandit experiment, this manifests as jagged and oscillating training curves. In more complex environments (such as CartPole), high variance makes training even more unstable — sometimes the policy improves nicely, then suddenly gets derailed by a run of bad luck.
Discrete vs. Continuous Action Spaces
In this chapter's experiments we use discrete action spaces (choose A vs. B, CartPole left vs. right), but the policy gradient theorem applies equally to continuous action spaces:
| Discrete action space | Continuous action space | |
|---|---|---|
| Example | CartPole left/right, LLM token choice | Robot joint angle, steering wheel angle |
| Output layer | Softmax, probability for each action | Gaussian parameters, mean and standard deviation |
| Sampling | Sample according to Softmax | Sample from |
| Compute | log_softmax | Log-density formula of a Gaussian distribution |
With the same policy gradient formula, changing only the output layer lets us switch from "left/right" to "continuous torque." This is where policy gradients are more flexible than value-based methods: DQN's is not computable in continuous spaces, while policy gradients differentiate through a probability density, which is naturally compatible with continuous actions.
Thinking question: what is the essential difference between REINFORCE and Q-Learning updates?
Q-Learning updates the value function — "how many points is this action worth" — and the policy is obtained implicitly via . REINFORCE updates the policy parameters directly, skipping the intermediate step of learning values.
This difference leads to two key consequences: Q-Learning is off-policy (it can reuse old data for training), while REINFORCE is on-policy (it must use fresh data generated by the current policy); Q-Learning can only handle discrete actions (it needs to enumerate all actions to take the max), while REINFORCE can handle continuous actions (it directly differentiates the log probability density).
REINFORCE can work, but its high variance makes it nearly unusable in practice. Fortunately, the policy gradient theorem has a remarkable property: we can subtract a baseline that does not depend on the action from the gradient estimator, without changing the expected direction of the gradient, while significantly reducing variance. We will develop this in Section 5.4. In the next section, we will first run vanilla REINFORCE on CartPole: Hands-on: CartPole.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. DOI ↩︎
Sutton, R. S., et al. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12. ↩︎