Skip to content

5.2 The Policy Gradient Theorem and REINFORCE

The previous section explained why we need policy-based methods: DQN's argmax\arg\max does not work in continuous action spaces, so learning the policy πθ(as)\pi_\theta(a|s) directly is the more natural approach. This section answers two questions: what metric should we use to measure "how good" a policy is, and how do we optimize that metric?

The Policy Objective

Chapter 3 introduced the policy objective J(θ)J(\theta) — a measure of "how good this policy is overall." The answer is natural: across all possible starting points, how much discounted return does policy πθ\pi_\theta accumulate on average?

J(θ)=Eπθ[t=0γtrt]J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]

SymbolRoleMeaning
θ\thetaPolicy parametersNeural network weights: changing them changes the policy's behavior
πθ\pi_\thetaPolicyGiven a state, outputs a probability distribution over actions
J(θ)J(\theta)ObjectiveThe policy's "report card" — the average score achieved by the policy with parameters θ\theta
$\mathbb{E}{\pi\theta}$ExpectationRun the policy many times and average
γtrt\gamma^t r_tDiscounted rewardReward at time tt; farther-future rewards are worth less

J(θ)J(\theta) is our north star. The goal is simple: find parameters θ\theta that maximize J(θ)J(\theta).

Numerical Example: J(θ)J(\theta) for a 3-Step Episode

Suppose an episode has only 3 steps, with rewards r0=1r_0=1, r1=2r_1=2, r2=3r_2=3 and discount factor γ=0.9\gamma=0.9. The discounted return of this trajectory is

t=02γtrt=γ0r0+γ1r1+γ2r2=0.90×1+0.91×2+0.92×3=1×1+0.9×2+0.81×3=1+1.8+2.43=5.23.\begin{aligned} \sum_{t=0}^{2} \gamma^t r_t &= \gamma^0 r_0 + \gamma^1 r_1 + \gamma^2 r_2 \\ &= 0.9^0 \times 1 + 0.9^1 \times 2 + 0.9^2 \times 3 \\ &= 1 \times 1 + 0.9 \times 2 + 0.81 \times 3 \\ &= 1 + 1.8 + 2.43 \\ &= 5.23. \end{aligned}

This is the return of one trajectory. J(θ)J(\theta) is the expectation over all possible trajectory returns — run the policy πθ\pi_\theta infinitely many times and take the average. Different policies πθ\pi_\theta produce different trajectory distributions, hence different J(θ)J(\theta). Suppose policy A tends to choose high-reward paths and policy B tends to choose low-reward paths:

PolicyAverage return of possible trajectoriesJ(θ)J(\theta)
AAround 5.235.23Larger
BAround 2.102.10Smaller

The optimization objective is to find parameters θ\theta that maximize J(θ)J(\theta) (i.e., the highest average return).

Gradient Ascent

How do we make J(θ)J(\theta) larger? The classic move in deep learning: walk in the direction of the gradient.

θθ+αθJ(θ)\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)

SymbolRoleMeaning
θJ(θ)\nabla_\theta J(\theta)GradientWhich direction should we change parameters to improve the policy most?
α\alphaLearning rateHow big is one step? Too large oscillates; too small is slow
++AscentNotice the plus sign: we maximize, not minimize

Numerical Example: One Parameter Update

Suppose parameters θ=[0.5, 0.3, 0.1]\theta = [0.5,\ 0.3,\ -0.1], gradient θJ(θ)=[0.1, 0.2, 0.05]\nabla_\theta J(\theta) = [0.1,\ -0.2,\ 0.05], learning rate α=0.01\alpha = 0.01. Writing out the update component by component:

θ00.5+0.01×0.1=0.5+0.001=0.501,θ10.3+0.01×(0.2)=0.30.002=0.298,θ20.1+0.01×0.05=0.1+0.0005=0.0995.\begin{aligned} \theta_0 &\leftarrow 0.5 + 0.01 \times 0.1 = 0.5 + 0.001 = 0.501, \\ \theta_1 &\leftarrow 0.3 + 0.01 \times (-0.2) = 0.3 - 0.002 = 0.298, \\ \theta_2 &\leftarrow -0.1 + 0.01 \times 0.05 = -0.1 + 0.0005 = -0.0995. \end{aligned}

After the update, θ=[0.501, 0.298, 0.0995]\theta = [0.501,\ 0.298,\ -0.0995]. Each component has moved a small step in the direction indicated by the gradient. α\alpha controls the step size: with α=0.01\alpha=0.01, each step moves only a few thousandths, but the effect accumulates over many rounds.

But how do we compute θJ(θ)\nabla_\theta J(\theta)? The objective contains an expectation E\mathbb{E}, which in principle requires averaging over all possible trajectories. The number of possible trajectories is astronomical; we cannot enumerate them all. It is like trying to compute the average height of every student in a school — you cannot measure everyone, but you can randomly sample 100 students and estimate. The sample mean computed from those 100 students is a sampling estimate of the true mean. Policy gradients use the same idea: run a few trajectories and use the average of their gradients to estimate the true θJ(θ)\nabla_\theta J(\theta).

The Policy Gradient Theorem

This is exactly where the policy gradient theorem enters. In 1992, Williams showed in the REINFORCE paper that the seemingly intractable gradient θJ(θ)\nabla_\theta J(\theta) can be rewritten into a form that can be estimated by sampling. [1] Sutton and colleagues later generalized and systematized this result in 2000. [2]

θJ(θ)=Eπθ[tθlogπθ(atst)Gt]\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G_t \right]

Let's unpack it term by term:

SymbolRoleMeaning
θ\nabla_\thetaTake gradientWhich direction should we change the parameters?
logπθ(atst)\log \pi_\theta(a_t | s_t)Log probabilityUnder state sts_t, the log-probability of choosing action ata_t
θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t | s_t)Gradient of log-probHow do parameters change the probability of choosing this action?
GtG_tReturnTotal reward from time tt to the end — "how many points did we get after this action?"
Outer E\mathbb{E}Expectation"Run many episodes and average" — approximated by sampling

In one sentence: if an action leads to a good outcome (large GtG_t), increase the probability of taking it again; if it leads to a bad outcome (small GtG_t), decrease that probability.

Numerical Example: Gradient Computation in a Bandit Setting

Let's walk through a bandit scenario. Two-armed bandit: arm A wins with probability 30%, arm B with probability 70%. The policy has a single parameter θ\theta (assume larger θ\theta means higher probability of choosing B). Currently πθ(Bs)=0.7\pi_\theta(B|s) = 0.7, πθ(As)=0.3\pi_\theta(A|s) = 0.3.

Trajectory 1: chose B, reward = 1.0

First step, compute the log probability:

logπθ(Bs)=log(0.7)=0.357.\log \pi_\theta(B|s) = \log(0.7) = -0.357.

Second step, compute the gradient of the log probability. Suppose that under the current parameters θlogπθ(Bs)=0.5\nabla_\theta \log \pi_\theta(B|s) = 0.5 (this value depends on the specific parameterization; we pick a concrete number here).

Third step, multiply by the return. The bandit has only one step, so G0=r0=1.0G_0 = r_0 = 1.0:

θlogπθ(Bs)G0=0.5×1.0=0.5.\nabla_\theta \log \pi_\theta(B|s) \cdot G_0 = 0.5 \times 1.0 = 0.5.

The gradient is positive, so the parameter updates as θθ+α×0.5\theta \leftarrow \theta + \alpha \times 0.5. The parameter increases, and πθ(Bs)\pi_\theta(B|s) increases accordingly — the policy learns "choosing B led to a good result, so choose B more often."

Trajectory 2: chose A, reward = 0

logπθ(As)=log(0.3)=1.204.\log \pi_\theta(A|s) = \log(0.3) = -1.204.

Suppose θlogπθ(As)=0.5\nabla_\theta \log \pi_\theta(A|s) = -0.5 (the gradient direction for choosing A is opposite to choosing B). Multiply by the return G0=0G_0 = 0:

θlogπθ(As)G0=(0.5)×0=0.\nabla_\theta \log \pi_\theta(A|s) \cdot G_0 = (-0.5) \times 0 = 0.

The gradient is zero — the parameters do not update. The policy chose A but got zero reward, so it neither encourages nor discourages this action.

Trajectory 3: chose A, reward = 0.5

θlogπθ(As)G0=(0.5)×0.5=0.25.\nabla_\theta \log \pi_\theta(A|s) \cdot G_0 = (-0.5) \times 0.5 = -0.25.

The gradient is negative, so the parameter updates as θθ+α×(0.25)\theta \leftarrow \theta + \alpha \times (-0.25). The parameter decreases, which means πθ(Bs)\pi_\theta(B|s) decreases and πθ(As)\pi_\theta(A|s) increases — the policy learns "A got some reward this time, so we can choose A slightly more often." But since A's average reward is much lower than B's, over many samples the positive gradient from B will accumulate and dominate, and the policy will eventually converge to preferring B.

Numerical Example: Gradient Computation for a 3-Step Episode

Consider a CartPole-like scenario: 3-step episode, γ=0.9\gamma=0.9.

StepStateActionReward
0s0s_0a0a_0 = rightr0=1r_0 = 1
1s1s_1a1a_1 = leftr1=2r_1 = 2
2s2s_2a2a_2 = rightr2=3r_2 = 3

First, compute the return GtG_t at each step. G2G_2 includes only the last step:

G2=r2=3.G_2 = r_2 = 3.

G1G_1 accumulates from step 1 to the end:

G1=r1+γr2=2+0.9×3=2+2.7=4.7.G_1 = r_1 + \gamma r_2 = 2 + 0.9 \times 3 = 2 + 2.7 = 4.7.

G0G_0 accumulates from step 0 to the end:

G0=r0+γr1+γ2r2=1+0.9×2+0.81×3=1+1.8+2.43=5.23.G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 = 1 + 0.9 \times 2 + 0.81 \times 3 = 1 + 1.8 + 2.43 = 5.23.

Suppose the log-probability gradient values at each step are:

Steplogπθ(atst)\log \pi_\theta(a_t | s_t)θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t | s_t)GtG_t
0logπθ(rights0)=0.4\log \pi_\theta(\text{right}\mid s_0) = -0.4[0.3,0.1][0.3, -0.1]5.235.23
1logπθ(lefts1)=0.7\log \pi_\theta(\text{left}\mid s_1) = -0.7[0.2,0.4][-0.2, 0.4]4.74.7
2logπθ(rights2)=0.3\log \pi_\theta(\text{right}\mid s_2) = -0.3[0.1,0.2][0.1, 0.2]33

The gradient contribution from each step:

Step 0:θlogπθ(a0s0)G0=[0.3,0.1]×5.23=[1.569,0.523],Step 1:θlogπθ(a1s1)G1=[0.2,0.4]×4.7=[0.94,1.88],Step 2:θlogπθ(a2s2)G2=[0.1,0.2]×3=[0.3,0.6].\begin{aligned} \text{Step 0:} \quad \nabla_\theta \log \pi_\theta(a_0|s_0) \cdot G_0 &= [0.3, -0.1] \times 5.23 = [1.569, -0.523], \\ \text{Step 1:} \quad \nabla_\theta \log \pi_\theta(a_1|s_1) \cdot G_1 &= [-0.2, 0.4] \times 4.7 = [-0.94, 1.88], \\ \text{Step 2:} \quad \nabla_\theta \log \pi_\theta(a_2|s_2) \cdot G_2 &= [0.1, 0.2] \times 3 = [0.3, 0.6]. \end{aligned}

The gradient estimate from this single trajectory is the sum of the three:

tθlogπθ(atst)Gt=[1.569,0.523]+[0.94,1.88]+[0.3,0.6]=[1.5690.94+0.3, 0.523+1.88+0.6]=[0.929, 1.957].\begin{aligned} \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t &= [1.569, -0.523] + [-0.94, 1.88] + [0.3, 0.6] \\ &= [1.569 - 0.94 + 0.3,\ -0.523 + 1.88 + 0.6] \\ &= [0.929,\ 1.957]. \end{aligned}

If α=0.01\alpha = 0.01, the parameter update is θθ+0.01×[0.929, 1.957]\theta \leftarrow \theta + 0.01 \times [0.929,\ 1.957]. Step 1 contributes the most (with G1=4.7G_1=4.7 and a relatively large gradient component of 0.40.4), indicating that the decision "choose left in s1s_1" has a prominent contribution to the final return, so the parameters will move in the direction of increasing the probability of choosing left in s1s_1.

The Log-Derivative Trick

Why don't we write something like θπθ(atst)Gt\nabla_\theta \pi_\theta(a_t|s_t) \cdot G_t, and instead introduce a log\log?

This is a mathematical technique known as the log-derivative trick. By the chain rule:

θlogπ=θππ\nabla_\theta \log \pi = \frac{\nabla_\theta \pi}{\pi}

The division by π\pi cancels the π\pi factor hidden inside the expectation, making the entire expression clean and computable. From an engineering point of view, since probabilities π\pi lie in (0,1)(0, 1), directly computing gradients of probabilities can produce extremely small values, destabilizing training. The log\log maps (0,1)(0, 1) to (,0)(-\infty, 0), yielding more numerically stable gradients.

Numerical Example: The Effect of the Log-Derivative Trick

Take a concrete example: πθ(as)=0.7\pi_\theta(a|s) = 0.7. Suppose a small perturbation of the parameter changes πθ(as)\pi_\theta(a|s) to 0.710.71. Then

θπθ(as)0.710.7=0.01.\nabla_\theta \pi_\theta(a|s) \approx 0.71 - 0.7 = 0.01.

Using the raw probability gradient:

θπθ(as)=0.01.\nabla_\theta \pi_\theta(a|s) = 0.01.

Using the log-derivative trick:

θlogπθ(as)=θπθ(as)πθ(as)=0.010.7=0.0143.\nabla_\theta \log \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \frac{0.01}{0.7} = 0.0143.

Now consider another action with πθ(as)=0.05\pi_\theta(a'|s) = 0.05, which becomes 0.060.06 after the same perturbation:

θπθ(as)=0.01,θlogπθ(as)=0.010.05=0.2.\nabla_\theta \pi_\theta(a'|s) = 0.01, \quad \nabla_\theta \log \pi_\theta(a'|s) = \frac{0.01}{0.05} = 0.2.

The two actions have the same probability gradient (0.010.01), but the low-probability action's log-derivative gradient is 0.2/0.0143140.2/0.0143 \approx 14 times that of the high-probability action. This makes intuitive sense: raising a 5%5\% probability to 6%6\% is a 20%20\% relative increase; raising a 70%70\% probability to 71%71\% is only a 1.4%1.4\% relative increase. Dividing by π\pi converts absolute changes into relative changes, giving actions at different probability scales comparable magnitudes in gradient space.

Math derivation: from the objective to the policy gradient theorem

The gradient of the objective differentiates through the trajectory distribution:

θJ(θ)=θτP(τ;θ)trt(τ)\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau} P(\tau; \theta) \sum_t r_t(\tau)

Here τ=(s0,a0,s1,a1,)\tau = (s_0, a_0, s_1, a_1, \ldots) denotes a trajectory, and P(τ;θ)P(\tau; \theta) is the probability that policy πθ\pi_\theta generates trajectory τ\tau. The gradient can only act on P(τ;θ)P(\tau; \theta), since the reward does not depend on θ\theta:

θJ(θ)=τθP(τ;θ)trt(τ)\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta) \sum_t r_t(\tau)

The key step is the identity θP=PθlogP\nabla_\theta P = P \cdot \nabla_\theta \log P:

θJ(θ)=τP(τ;θ)(θlogP(τ;θ))trt(τ)\nabla_\theta J(\theta) = \sum_{\tau} P(\tau; \theta) \left( \nabla_\theta \log P(\tau; \theta) \right) \sum_t r_t(\tau)

The trajectory probability factorizes as P(τ;θ)=tπθ(atst)P(st+1st,at)P(\tau; \theta) = \prod_t \pi_\theta(a_t|s_t) \cdot P(s_{t+1}|s_t, a_t). Taking logs and differentiating with respect to θ\theta, the environment transition probability P(ss,a)P(s'|s,a) disappears because it does not depend on θ\theta, leaving only the policy terms:

θlogP(τ;θ)=tθlogπθ(atst)\nabla_\theta \log P(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)

Substituting back into the expectation yields the policy gradient theorem. The most elegant aspect of this derivation is that the environment dynamics — the state transition probabilities — cancel out during differentiation. This means policy gradients do not need a model of the environment, which is the fundamental reason they are much more flexible than dynamic programming approaches.

The REINFORCE Algorithm

The policy gradient theorem gives us the form of the gradient. REINFORCE is the most straightforward implementation of the theorem: it uses Monte Carlo sampling to estimate the expectation. The algorithm flow is:

  1. Run one full episode with the current policy πθ\pi_\theta, recording state, action, and reward at each step.
  2. For each time step, compute the return from that step until the end of the episode: Gt=k=tTγktrkG_t = \sum_{k=t}^{T} \gamma^{k-t} r_k.
  3. Estimate the gradient with samples: θJtθlogπθ(atst)Gt\nabla_\theta J \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t.
  4. Update parameters along the gradient: θθ+αθJ\theta \leftarrow \theta + \alpha \nabla_\theta J.

In PyTorch, the core update can be written in one line:

python
loss = -log_prob * G_t  # minus sign because PyTorch does gradient descent (minimize) by default, while we want ascent (maximize)

A full multi-step version:

python
# REINFORCE core (multi-step version)
for t in range(len(rewards)):
    G_t = sum(gamma ** k * rewards[t + k] for k in range(len(rewards) - t))
    loss += -log_probs[t] * G_t

optimizer.zero_grad()
loss.backward()
optimizer.step()

Numerical Example: A Complete Episode's REINFORCE Update

Continuing with the 3-step episode from earlier, γ=0.9\gamma=0.9:

StepStateActionRewardlogπθ(atst)\log \pi_\theta(a_t | s_t)
0s0s_0rightr0=1r_0=10.4-0.4
1s1s_1leftr1=2r_1=20.7-0.7
2s2s_2rightr2=3r_2=30.3-0.3

Step 1: Compute the return GtG_t at each step.

G2G_2 from step 2 to the end (only 1 step):

G2=r2=3.G_2 = r_2 = 3.

G1G_1 from step 1 to the end (2 steps):

G1=r1+γr2=2+0.9×3=2+2.7=4.7.G_1 = r_1 + \gamma r_2 = 2 + 0.9 \times 3 = 2 + 2.7 = 4.7.

G0G_0 from step 0 to the end (3 steps):

G0=r0+γr1+γ2r2=1+0.9×2+0.81×3=1+1.8+2.43=5.23.G_0 = r_0 + \gamma r_1 + \gamma^2 r_2 = 1 + 0.9 \times 2 + 0.81 \times 3 = 1 + 1.8 + 2.43 = 5.23.

Summary:

Steprtr_tGtG_t
015.23
124.7
233

Step 2: Compute the gradient estimate.

Each step's gradient term is θlogπθ(atst)Gt\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t. In PyTorch, log_prob is logπθ(atst)\log \pi_\theta(a_t|s_t), and autograd handles the gradient. Here we only look at the scalar loss construction:

loss=t=02logπθ(atst)Gt=(logπθ(rights0)G0+logπθ(lefts1)G1+logπθ(rights2)G2)=((0.4)×5.23+(0.7)×4.7+(0.3)×3)=(2.092+(3.29)+(0.9))=(6.282)=6.282.\begin{aligned} \text{loss} &= -\sum_{t=0}^{2} \log \pi_\theta(a_t|s_t) \cdot G_t \\ &= -(\log \pi_\theta(\text{right}|s_0) \cdot G_0 + \log \pi_\theta(\text{left}|s_1) \cdot G_1 + \log \pi_\theta(\text{right}|s_2) \cdot G_2) \\ &= -((-0.4) \times 5.23 + (-0.7) \times 4.7 + (-0.3) \times 3) \\ &= -(−2.092 + (−3.29) + (−0.9)) \\ &= -(-6.282) \\ &= 6.282. \end{aligned}

Step 3: Backpropagation and parameter update.

loss.backward() computes θloss\nabla_\theta \text{loss}. Since loss=tlogπθ(atst)Gt\text{loss} = -\sum_t \log \pi_\theta(a_t|s_t) \cdot G_t, we have

θloss=tθlogπθ(atst)Gt.\nabla_\theta \text{loss} = -\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t.

optimizer.step() executes θθαθloss\theta \leftarrow \theta - \alpha \cdot \nabla_\theta \text{loss} (PyTorch defaults to gradient descent). The two negatives cancel:

θθα(tθlogπθ(atst)Gt)=θ+αtθlogπθ(atst)Gt.\theta \leftarrow \theta - \alpha \cdot \left(-\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right) = \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t.

This is exactly gradient ascent. This round of updates moves the parameters in the direction of increasing the probability of high-return actions.

A Minimal Example: Multi-Armed Bandits

Before we dive into CartPole, let's use a minimal setting to understand what loss = -log_prob * reward is really doing.

Imagine a bandit with two arms: arm A wins with probability 30%, arm B wins with probability 70%. The policy network is just a Softmax layer that outputs the probability of choosing A and B. The core training code:

python
probs = policy(state)
dist = torch.distributions.Categorical(probs)
action = dist.sample()          # sample an action according to probability
log_prob = dist.log_prob(action)  # log π(a|s)

reward = pull_arm(action.item())  # take the action

loss = -log_prob * reward         # REINFORCE core

After 300 episodes, the probability of choosing B will typically climb from around 0.5 to 0.85–0.95: the policy learns to prefer the arm with the higher win rate. But the curve will not be smooth — it will be jagged and noisy. If you increase the learning rate from 0.01 to 0.1, the policy may swing violently between A and B.

This is the core pain point of REINFORCE: high variance.

The Variance Problem in REINFORCE

GtG_t is the cumulative return from time tt until the episode ends, which includes all randomness along that future trajectory. For the same action, different sampled trajectories can produce very different GtG_t values:

CaseWhat actually happenedGtG_t
Good luckSubsequent steps all yielded high rewardsLarge
Bad luckSubsequent steps all yielded low rewardsSmall

Policy gradients use GtG_t to decide whether "this action was good." But when GtG_t fluctuates heavily, the same good action can be penalized due to bad luck, and the same bad action can be rewarded due to good luck. It is like judging a student's ability by a single exam — a bad score does not necessarily mean poor mastery; it may just be an off day.

Numerical Example: Comparing Gradient Signals from Two Trajectories

Back to the bandit scenario: πθ(Bs)=0.7\pi_\theta(B|s) = 0.7, πθ(As)=0.3\pi_\theta(A|s) = 0.3. Both times we chose B (the same action), but the subsequent luck differed.

Episode 1: chose B, reward 1.0 (good outcome)

gradient term=θlogπθ(Bs)×1.0.\text{gradient term} = \nabla_\theta \log \pi_\theta(B|s) \times 1.0.

The gradient pushes πθ(Bs)\pi_\theta(B|s) up.

Episode 2: chose B, reward 0.0 (bad outcome, bad luck)

gradient term=θlogπθ(Bs)×0.0=0.\text{gradient term} = \nabla_\theta \log \pi_\theta(B|s) \times 0.0 = 0.

The gradient signal is zero — the parameters do not change.

Now consider a multi-step example. Same 3-step episode structure, γ=0.9\gamma=0.9, where the policy chose the same action "right" at s0s_0:

Episoder0r_0r1r_1r2r_2G0G_0Gradient signal direction
11231+1.8+2.43=5.231 + 1.8 + 2.43 = 5.23Strongly positive
21001+0+0=11 + 0 + 0 = 1Weakly positive

Both episodes took the same action "right" at s0s_0, yet G0G_0 differs by more than a factor of 4. Episode 1 would push π(rights0)\pi(\text{right}|s_0) up significantly, while Episode 2 would only push it up weakly. The problem is that G0G_0 does not only reflect whether "choosing right at s0s_0 is good" — it also includes the randomness of r1r_1 and r2r_2. The rewards from the subsequent two steps are not controlled by the current action, yet they are all counted into G0G_0.

In the bandit experiment, this manifests as jagged and oscillating training curves. In more complex environments (such as CartPole), high variance makes training even more unstable — sometimes the policy improves nicely, then suddenly gets derailed by a run of bad luck.

Discrete vs. Continuous Action Spaces

In this chapter's experiments we use discrete action spaces (choose A vs. B, CartPole left vs. right), but the policy gradient theorem applies equally to continuous action spaces:

Discrete action spaceContinuous action space
ExampleCartPole left/right, LLM token choiceRobot joint angle, steering wheel angle
Output layerSoftmax, probability for each actionGaussian parameters, mean μ\mu and standard deviation σ\sigma
SamplingSample according to SoftmaxSample from N(μ,σ2)\mathcal{N}(\mu, \sigma^2)
Compute logπ\log \pilog_softmaxLog-density formula of a Gaussian distribution

With the same policy gradient formula, changing only the output layer lets us switch from "left/right" to "continuous torque." This is where policy gradients are more flexible than value-based methods: DQN's argmax\arg\max is not computable in continuous spaces, while policy gradients differentiate through a probability density, which is naturally compatible with continuous actions.

Thinking question: what is the essential difference between REINFORCE and Q-Learning updates?

Q-Learning updates the value function Q(s,a)Q(s,a) — "how many points is this action worth" — and the policy is obtained implicitly via argmaxQ\arg\max Q. REINFORCE updates the policy parameters θ\theta directly, skipping the intermediate step of learning QQ values.

This difference leads to two key consequences: Q-Learning is off-policy (it can reuse old data for training), while REINFORCE is on-policy (it must use fresh data generated by the current policy); Q-Learning can only handle discrete actions (it needs to enumerate all actions to take the max), while REINFORCE can handle continuous actions (it directly differentiates the log probability density).

REINFORCE can work, but its high variance makes it nearly unusable in practice. Fortunately, the policy gradient theorem has a remarkable property: we can subtract a baseline that does not depend on the action from the gradient estimator, without changing the expected direction of the gradient, while significantly reducing variance. We will develop this in Section 5.4. In the next section, we will first run vanilla REINFORCE on CartPole: Hands-on: CartPole.



  1. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. DOI ↩︎

  2. Sutton, R. S., et al. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12. ↩︎

现代强化学习实战课程