
7.2 PPO Mathematical Derivation

In the previous section, we trained LunarLander with SB3's PPO and looked at curves such as reward, entropy, and clip fraction. Now we should answer a more basic question:

What exactly is PPO, and why does it eventually become a single loss function?

Prerequisites

This section integrates and extends the material from Chapters 5 and 6. The following ideas will appear repeatedly in the derivation:

PPO stands for Proximal Policy Optimization. The name is worth unpacking:

  • Policy: the model that chooses actions.
  • Optimization: training, i.e., improving that policy.
  • Proximal: "nearby" updates; the new policy should not move too far from the old one.

So here is the headline conclusion:

PPO is not a policy, and it is not merely a loss. PPO is a method for training a policy network.

In reinforcement learning, the policy is the object we truly train. It is usually written as:

$$\pi_\theta(a \mid s)$$

This means: "under state $s$, the policy network parameterized by $\theta$ assigns probability to action $a$." In code, this policy is typically the Actor network. For example, the Actor takes a game frame or a robot state as input, and outputs a probability distribution over actions.

What PPO provides is a recipe for training this Actor. It does not hard-code actions, and it does not replace the policy network. Instead, it specifies an update rule: use the current Actor to collect a batch of experience, then adjust the Actor using that batch, while preventing each update from being too aggressive.

It helps to separate three closely-related concepts:

| Name | What It Is | Roughly What It Corresponds To In Code |
| --- | --- | --- |
| Policy | the object being trained; chooses actions given states | actor / model output action_probs |
| PPO | the training method: sampling, advantage estimation, constrained updates, backprop | the full training loop |
| PPO loss | a differentiable objective used to update network parameters in PPO | policy_loss + value_loss - entropy_bonus |

Why will we keep talking about a loss? Because neural networks cannot directly interpret the instruction "make the policy more stable; do not change too fast." An optimizer understands a very specific interface: give it a scalar loss, it computes gradients via loss.backward(), then updates parameters via optimizer.step().

So PPO's ideas must eventually become a loss in order to update the Actor and Critic.

Put differently: PPO is a method, the policy is the model being trained, and the loss is the training signal that makes the method real in code. We derive the PPO loss not because PPO is only a loss, but because the loss is the point at which PPO touches neural-network parameters.

PPO Code Skeleton

To keep the formulas grounded, we first show what PPO "looks like in code." The code below is not an engineering-optimized implementation. It is a learning-oriented minimal PyTorch PPO skeleton: policy network, sampling, advantage estimation, PPO-Clip loss, value-function loss, entropy bonus, and multiple epochs of updates.

Every time we derive a new formula, we will come back to a corresponding part of this code. The highlighted lines are the ones we will repeatedly unpack. You do not need to fully understand every line now; just remember the big picture:

PPO ultimately links "collect experience, estimate advantages, constrain policy changes, backpropagate updates" into a single training loop.

# Excerpt from the full listing. Omitted here: the imports (torch, torch.nn.functional as F,
# torch.distributions.Categorical, numpy as np, and a Gymnasium-style gym module) and the
# ActorCritic constructor that defines self.backbone, self.actor, and self.critic.

    # [A] Policy and value function: the Actor outputs action probabilities, the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action from the policy distribution and keep the old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy

# [D] Compute advantages: how much better this action is than the state's average level
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # simplification: no bootstrap value after the final rollout step

    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # 0 at episode ends, so nothing leaks across episodes
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns

            # [E] PPO update (excerpt from the inner minibatch loop of ppo_update)
            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]
            clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
            surr2 = clipped_ratio * advantages[mb]
            policy_loss = -torch.min(surr1, surr2).mean()

            value_loss = F.mse_loss(new_values, returns[mb])
            entropy_bonus = entropy.mean()
            loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# [F] Training loop: collect a batch of data, then update on that batch for several epochs
device = "cuda" if torch.cuda.is_available() else "cpu"
env = gym.make("CartPole-v1")
model = ActorCritic(env.observation_space.shape[0], env.action_space.n).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for update in range(100):
    batch = collect_rollout(env, model, steps=2048, device=device)
    advantages, returns = compute_gae(batch["rewards"], batch["values"], batch["dones"])
    ppo_update(model, optimizer, batch, advantages, returns)

You can roughly split the code into six parts:

| Tag | Code Block | What We Will Explain Later |
| --- | --- | --- |
| [A] | forward | what the policy $\pi_\theta(a\mid s)$ and value function $V_\theta(s)$ are |
| [B] | act / evaluate | why we construct dist, and why we store log_prob |
| [C] | collect_rollout | what on-policy data is, and why we record the old policy probabilities |
| [D] | compute_gae | how returns, value functions, and advantages relate |
| [E] | ppo_update | PPO-Clip's ratio, clamp, min, and the total loss |
| [F] | training loop | why we update multiple epochs on the same batch |

When key variables appear later, we will repeatedly refer back to this mapping table:

| Symbol | Meaning | Typical Code Variable |
| --- | --- | --- |
| $s_t$ | state at time $t$ | states |
| $a_t$ | action taken at time $t$ | actions |
| $r_t$ | reward at time $t$ | rewards |
| $G_t$ | discounted return starting from time $t$ | returns |
| $V_\theta(s)$ | Critic's estimate of future return from state $s$ | value / new_values |
| $A_t$ or $\hat{A}_t$ | advantage estimate: how much better this action is than the state's baseline | advantages |
| $\pi_{\text{old}}(a \mid s)$ | the old policy that collected this batch | stored old_logprobs |
| $r_t(\theta)$ | ratio $\pi_\theta / \pi_{\text{old}}$ | ratio = exp(new_logprobs - old_logprobs) |
| $\varepsilon$ | PPO clipping range, often 0.1 or 0.2 | clip_eps / clip_range |
| $H[\pi_\theta]$ | policy entropy (how random the action distribution is) | entropy |

PPO Core Idea: learn multiple epochs on the same batch, while clipping prevents the new policy from drifting too far from the old one

Step 1: A Probabilistic View of Reinforcement Learning

The most basic reinforcement-learning loop is:

[Diagram: the agent-environment interaction loop]

Here $t$ is the time step. $s_t$ is the state observed at step $t$, $a_t$ is the action taken, and $r_t$ is the immediate feedback from the environment. Reinforcement learning is not about a single reward; it is about the long-term result produced by a sequence of decisions.

We typically formalize the environment as a Markov Decision Process (MDP):

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

Each symbol means (review: the MDP 5-tuple):

  • $\mathcal{S}$: state space, the set of all possible states.
  • $\mathcal{A}$: action space, the set of all possible actions.
  • $P(s_{t+1}\mid s_t,a_t)$: transition probability, the probability of moving to $s_{t+1}$ after taking action $a_t$ at state $s_t$.
  • $R(s_t,a_t)$: reward function, the immediate payoff of the action.
  • $\gamma$: discount factor, how much we value future rewards today.

The policy is what we train. In symbols:

$$\pi_\theta(a_t \mid s_t)$$

This means: "the policy network with parameters $\theta$ assigns probability to action $a_t$ under state $s_t$." In code, the Actor typically outputs action probabilities, then wraps them into a distribution object dist:

    # [A] Policy and value function: the Actor outputs action probabilities, the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action from the policy distribution and keep the old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy

This excerpt shows [A] policy outputs and [B] action sampling. In the full code, network definitions appear before it, and rollout collection appears after it.

action_probs corresponds to $\pi_\theta(\cdot \mid s_t)$, the probability distribution over all actions. For example, in a discrete environment with 3 actions, action_probs = [0.1, 0.7, 0.2] means action 0 has 10% probability, action 1 has 70%, and action 2 has 20%.

dist is short for distribution: a distribution object. Categorical(action_probs) wraps the probabilities into a discrete distribution. It is not an action, and it is not a parameter; it is better viewed as a "lottery box with tools" where each action has its own probability.

This object provides methods that show up everywhere in RL code:

| Code | Meaning | Math Counterpart |
| --- | --- | --- |
| dist.sample() | sample an action according to action_probs instead of always taking argmax | $a_t \sim \pi_\theta(\cdot \mid s_t)$ |
| dist.log_prob(action) | the log probability of the sampled action | $\log \pi_\theta(a_t \mid s_t)$ |
| dist.entropy() | how random the action distribution is (later used to encourage exploration) | $H[\pi_\theta]$ |

So action is sampled from dist, and log_prob is the log probability of that action under the current policy. We will need it for policy gradients and PPO ratios. Here we use Categorical because tasks like CartPole and LunarLander have discrete actions. For continuous actions, one often uses a continuous distribution such as Normal, but the pattern is the same:

construct a distribution, sample an action, and record the log_prob.
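To make the three operations less opaque, here is a minimal standard-library sketch of what a categorical distribution object does. The helper names (softmax, sample_action, log_prob, entropy) are illustrative and are not the PyTorch API; the probabilities reuse the [0.1, 0.7, 0.2] example above.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to probs (the role of dist.sample())."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def log_prob(probs, action):
    """Log probability of one action (the role of dist.log_prob(action))."""
    return math.log(probs[action])


def entropy(probs):
    """Shannon entropy in nats (the role of dist.entropy())."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
action_probs = [0.1, 0.7, 0.2]
action = sample_action(action_probs, rng)
lp = log_prob(action_probs, action)
H = entropy(action_probs)
print(action, lp, H)
```

The entropy is largest for a uniform distribution (here $\log 3$) and shrinks as the policy becomes more deterministic, which is why it can serve as an exploration signal.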

If we run from the initial state until termination, we obtain a trajectory:

$$\tau = (s_0,a_0,r_0,s_1,a_1,r_1,\ldots,s_T)$$

$\tau$ is short for trajectory. It is not a single sample, but a full interaction history. Given a policy $\pi_\theta$, the probability of seeing trajectory $\tau$ can be written as:

$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t\mid s_t) P(s_{t+1}\mid s_t,a_t)$$

This expression is long, but it says only three things:

  • $\rho_0(s_0)$: where the initial state comes from.
  • $\pi_\theta(a_t\mid s_t)$: how the agent selects actions at each state.
  • $P(s_{t+1}\mid s_t,a_t)$: how the environment transitions after actions.

The crucial observation is:

In this product, only $\pi_\theta(a_t\mid s_t)$ contains the trainable parameters $\theta$.

The environment dynamics $P$ are usually unknown, non-differentiable, and not directly modifiable. This is why policy-gradient methods only need the action log_prob: the root cause is that only the policy term depends on $\theta$.
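The separation between policy terms and environment terms can be checked on a toy problem. The sketch below assumes a made-up two-state, two-action MDP (the tables rho0, transition, policy_a, and policy_b are invented for illustration) and splits $\log p_\theta(\tau)$ into its two parts.

```python
import math


def trajectory_log_prob(states, actions, rho0, policy, transition):
    """Split log p(tau) into (policy part, environment part)."""
    env_term = math.log(rho0[states[0]])
    policy_term = 0.0
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        policy_term += math.log(policy[s][a])              # depends on theta
        env_term += math.log(transition[(s, a)][s_next])   # does not depend on theta
    return policy_term, env_term


# Hypothetical environment: always starts in state 0.
rho0 = {0: 1.0}
transition = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.3, 1: 0.7},
}

# Two different policies, given as tables policy[state][action].
policy_a = {0: [0.5, 0.5], 1: [0.5, 0.5]}
policy_b = {0: [0.2, 0.8], 1: [0.6, 0.4]}

states = [0, 1, 1, 0]   # s_0 .. s_3
actions = [1, 1, 0]     # a_0 .. a_2

pol_a, env_a = trajectory_log_prob(states, actions, rho0, policy_a, transition)
pol_b, env_b = trajectory_log_prob(states, actions, rho0, policy_b, transition)
print(pol_a, pol_b, env_a, env_b)
```

Changing the policy changes only the first number; the environment part is identical for both policies, so it contributes nothing when we differentiate with respect to $\theta$.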

Step 2: Discounted Return

If we maximize the immediate reward $r_t$ only, the agent becomes myopic. For example, in LunarLander, firing the engine hard might change attitude immediately but can cause a crash later. Reinforcement learning is about maximizing a sequence of future rewards:

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

More compactly:

$$G_t = \sum_{k=0}^{T-t-1}\gamma^k r_{t+k}$$

The symbols mean:

  • $G_t$: the return, the cumulative reward starting from time $t$.
  • $k$: offset into the future. $k=0$ is the current reward $r_t$, $k=1$ is the next-step reward $r_{t+1}$.
  • $\gamma^k$: discount weight for future rewards. The farther the reward, the more it is discounted.
  • $T$: trajectory length. For continuing tasks, one often writes $\sum_{k=0}^{\infty}\gamma^k r_{t+k}$.

Why introduce $\gamma$? Three reasons.

First, $\gamma$ expresses that "the future matters, but is usually less certain than the present." When $\gamma=0$, the agent only cares about immediate rewards; as $\gamma$ approaches $1$, it cares more about long-term outcomes. For CartPole and LunarLander, a typical choice is $\gamma=0.99$.

Second, in infinite-horizon tasks, if every step has positive reward then an undiscounted sum can diverge. With $0\le\gamma<1$ and rewards bounded by $|r_t|\le r_{\max}$, the discounted sum is guaranteed to stay finite: $|G_t| \le r_{\max}/(1-\gamma)$.

Third, discounted return has a very implementation-friendly recursion:

$$G_t = r_t + \gamma G_{t+1}$$

This means: total return from now equals "reward now" plus "discounted return from the next step." In code, we typically compute returns backward in time:

python
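gamma = 0.99
rewards = [1.0, 1.0, 1.0]  # example per-step rewards r_0, r_1, r_2
G = 0.0
returns = []
for reward in reversed(rewards):
    G = reward + gamma * G  # G_t = r_t + gamma * G_{t+1}
    returns.append(G)
returns.reverse()  # back to chronological order: returns[t] == G_t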

Here G corresponds to $G_t$, reward is $r_t$, gamma is $\gamma$, and returns stores the discounted return for each time step.
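As a sanity check, the backward recursion can be compared with the definition $G_t = \sum_k \gamma^k r_{t+k}$ on a tiny hand-computable case. The function names below are illustrative, and $\gamma=0.5$ is chosen only to keep the arithmetic easy.

```python
def returns_by_recursion(rewards, gamma):
    """Backward pass using G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    out = []
    for reward in reversed(rewards):
        G = reward + gamma * G
        out.append(G)
    out.reverse()
    return out


def returns_by_definition(rewards, gamma):
    """Direct evaluation of G_t = sum_k gamma^k * r_{t+k}."""
    T = len(rewards)
    return [
        sum(gamma ** k * rewards[t + k] for k in range(T - t))
        for t in range(T)
    ]


rewards = [1.0, 1.0, 1.0]
gamma = 0.5
recursive = returns_by_recursion(rewards, gamma)
direct = returns_by_definition(rewards, gamma)
print(recursive)  # [1.75, 1.5, 1.0]
```

By hand: $G_2 = 1$, $G_1 = 1 + 0.5\cdot 1 = 1.5$, and $G_0 = 1 + 0.5\cdot 1.5 = 1.75$. The recursion needs one pass over the rewards, while the direct sum recomputes the tail for every $t$.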

With this, the objective of a policy can be written as:

$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[G_0] = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T-1}\gamma^t r_t \right]$$

$J(\theta)$ reads as: "how good is the policy with parameters $\theta$?" The $\mathbb{E}$ denotes expectation, because even the same policy can yield different trajectories across runs. The environment may be stochastic, and the policy samples actions stochastically. So we maximize not the reward from a single run, but the long-run return in expectation.

Step 3: From Objective to Policy Gradient

Now the question becomes: how do we adjust $\theta$ to increase $J(\theta)$?

Write the objective as a sum over all possible trajectories:

$$J(\theta) = \sum_{\tau} p_\theta(\tau)R(\tau)$$

where $R(\tau)$ is the discounted return of the full trajectory. Differentiate with respect to $\theta$:

$$\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta p_\theta(\tau)R(\tau)$$

Differentiating trajectory probability $p_\theta(\tau)$ directly is hard. The key trick in policy gradients is the identity:

$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\nabla_\theta \log p_\theta(\tau)$$

which follows from $\nabla \log x = \frac{1}{x}\nabla x$. Substitute it back:

$$\nabla_\theta J(\theta) = \sum_{\tau} p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) R(\tau) = \mathbb{E}_{\tau\sim p_\theta} \left[ \nabla_\theta \log p_\theta(\tau)R(\tau) \right]$$

Now expand $\log p_\theta(\tau)$:

$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t) + \sum_{t=0}^{T-1}\log P(s_{t+1}\mid s_t,a_t)$$

When differentiating with respect to $\theta$, $\rho_0$ and $P$ vanish since they do not depend on $\theta$:

$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)$$

This yields the classic REINFORCE gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta} \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)G_t \right]$$

Why use $G_t$ instead of the full-trajectory return $G_0$? Because the action at time $t$ cannot affect rewards that happened before time $t$. Using the return from the current time onward respects causality and reduces noise.
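The log-derivative trick can be verified numerically in the simplest possible setting: a one-step problem with a softmax policy over three actions. This is a sketch under those assumptions; the logits and rewards are arbitrary illustrative numbers, and the expectation is summed exactly rather than sampled.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """J(theta) = sum_a pi_theta(a) * R(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def score_function_gradient(logits, rewards):
    """E_{a ~ pi}[grad log pi(a) * R(a)], summed exactly over all actions.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            dlogp = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += probs[a] * dlogp * rewards[a]
    return grad


def finite_difference_gradient(logits, rewards, h=1e-6):
    """Central differences on J(theta), used only as an independent check."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += h
        down[i] -= h
        grad.append(
            (expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * h)
        )
    return grad


logits = [0.2, -0.4, 1.0]
rewards = [1.0, 3.0, -2.0]
grad_score = score_function_gradient(logits, rewards)
grad_fd = finite_difference_gradient(logits, rewards)
print(grad_score)
print(grad_fd)
```

The two gradients agree up to finite-difference error, even though the score-function version never differentiates the reward itself.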

In implementations, we usually do not hand-write this gradient. Instead, we write an equivalent loss and let autodiff compute gradients:

python
optimizer.zero_grad()  # clear gradients accumulated by the previous step
policy_loss = -(log_probs * returns).mean()
policy_loss.backward()
optimizer.step()

Why the minus sign? Mathematically we want to maximize $\log \pi_\theta(a_t\mid s_t)G_t$, but PyTorch optimizers minimize losses. So we negate it.

If $G_t$ is large and positive, gradient descent on this loss strongly increases the log probability of the action; if $G_t$ is negative, it decreases that action's probability. A small positive $G_t$ still increases the probability, only weakly, which is one reason the next step introduces a baseline.

Step 4: Value Functions, Baselines, and Advantages

Vanilla REINFORCE can work, but its variance is large (review: the fatal flaw of REINFORCE). The reason is that $G_t$ only tells us "how much reward came after this step," but does not say "is that good for this particular state."

For example, suppose after some step in LunarLander we see $G_t=80$. That sounds good, but if in the same state a typical policy averages $120$, then this action is below average. We need a reference point, and that reference is the state-value function:

$$V^\pi(s_t) = \mathbb{E}_{\pi}[G_t \mid s_t]$$

$V^\pi(s_t)$ means: if we are at state $s_t$ now and continue following policy $\pi$, what return do we get on average?

The action-value function additionally conditions on the action:

$$Q^\pi(s_t,a_t) = \mathbb{E}_{\pi}[G_t \mid s_t,a_t]$$

$Q^\pi(s_t,a_t)$ means: at state $s_t$, first take action $a_t$, then follow policy $\pi$ afterward; what return do we get on average?

Subtract the two to obtain the advantage function:

$$A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V^\pi(s_t)$$

The meaning is simple:

How much better is this action than an average action at this state?

If $A_t>0$, the action is better than average and its probability should increase. If $A_t<0$, it is worse than average and its probability should decrease. If $A_t=0$, it is roughly average.
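A small worked example makes the three cases concrete. The numbers below are hypothetical: one state with three actions whose $Q$ values are 80, 120, and 160, under a policy that picks them with probabilities 0.25, 0.5, and 0.25.

```python
def state_value(probs, q_values):
    """V(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(p * q for p, q in zip(probs, q_values))


def advantages(probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]


probs = [0.25, 0.5, 0.25]
q_values = [80.0, 120.0, 160.0]

v = state_value(probs, q_values)
adv = advantages(probs, q_values)
expected_adv = sum(p * a for p, a in zip(probs, adv))
print(v)    # 120.0
print(adv)  # [-40.0, 0.0, 40.0]
```

The action with $Q=80$ looks fine in isolation but has advantage $-40$ because the state is worth 120 on average. The policy-weighted average of the advantages is zero by construction, which is what "relative to the average" means.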

In practice we do not know the true $V^\pi$ and $Q^\pi$. We estimate $V_\theta(s_t)$ with a Critic network, then approximate the advantage using returns or GAE:

$$\hat{A}_t \approx G_t - V_\theta(s_t)$$

In code, this corresponds to:

# [D] Compute advantages: how much better this action is than the state's average level
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # simplification: no bootstrap value after the final rollout step

    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # 0 at episode ends, so nothing leaks across episodes
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns

            # [E] (from ppo_update) the Critic regresses onto returns
            value_loss = F.mse_loss(new_values, returns[mb])

This excerpt shows [D] advantage estimation together with the value-loss line from [E]: advantages tells the Actor how to change, while returns provides supervision targets for the Critic's value_loss.

Without GAE, the simplest approximation is advantages = returns - values. In this chapter's code we compute advantages using GAE; the next section derives GAE in detail. For now, interpret it as "the part that is better or worse than what the Critic expected."
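The relationship between GAE and the simple estimate can be checked on a three-step episode. The sketch below is a list-based re-implementation of the same recursion as compute_gae, without the final normalization; the function names and the toy numbers are illustrative.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Same backward recursion as compute_gae, on plain Python lists."""
    advantages = [0.0] * len(rewards)
    last_advantage = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]
    return advantages


def discounted_returns(rewards, gamma):
    G = 0.0
    out = []
    for reward in reversed(rewards):
        G = reward + gamma * G
        out.append(G)
    out.reverse()
    return out


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]   # hypothetical Critic predictions
dones = [0.0, 0.0, 1.0]    # the episode ends at the last step
gamma = 0.9

adv_lam1 = gae(rewards, values, dones, gamma=gamma, lam=1.0)
adv_lam0 = gae(rewards, values, dones, gamma=gamma, lam=0.0)
simple = [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
print(adv_lam1)  # approximately [2.12, 1.4, 1.7]
print(adv_lam0)  # approximately [0.86, -0.13, 1.7]
```

With lam=1.0 and an episode that terminates inside the batch, the recursion reproduces returns - values. With lam=0.0 it reduces to the one-step TD error $r_t + \gamma V(s_{t+1}) - V(s_t)$. Values of lam in between interpolate between these two extremes.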

Why can we replace $G_t$ with $A_t$? Because subtracting a baseline $b(s_t)$ that depends only on the state does not change the expected gradient (review: baseline variance reduction):

$$\mathbb{E}_{a_t\sim\pi_\theta} \left[ \nabla_\theta\log\pi_\theta(a_t\mid s_t)\,b(s_t) \right] = b(s_t)\sum_{a_t}\nabla_\theta\pi_\theta(a_t\mid s_t) = b(s_t)\nabla_\theta \sum_{a_t}\pi_\theta(a_t\mid s_t) = b(s_t)\nabla_\theta 1 = 0$$

The first equality uses $\pi_\theta\nabla_\theta\log\pi_\theta = \nabla_\theta\pi_\theta$. This derivation shows that subtracting a baseline leaves the expected gradient unchanged; a well-chosen baseline such as $V^\pi(s_t)$ reduces its variance. Therefore the policy gradient is often written in the Actor-Critic form:

$$\nabla_\theta J(\theta) = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t\mid s_t)\hat{A}_t \right]$$

This is the division of labor between Actor and Critic: the Critic estimates $V_\theta(s_t)$ to provide the "average level" of the current state, and the Actor adjusts action probabilities according to the advantage $\hat{A}_t$.
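The baseline identity can also be confirmed numerically. The sketch below assumes a softmax policy at a single state and computes the expectations exactly by summing over actions; the logits, returns, and the baseline value 100 are arbitrary illustrative numbers.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score_times(logits, weights):
    """Exact E_{a ~ pi}[grad_logits log pi(a) * weights[a]] for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            dlogp = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += probs[a] * dlogp * weights[a]
    return grad


logits = [0.2, -0.4, 1.0]
returns = [80.0, 120.0, 160.0]   # hypothetical return for each action
baseline = 100.0                 # any number that does not depend on the action

grad_raw = expected_score_times(logits, returns)
grad_centered = expected_score_times(logits, [g - baseline for g in returns])
baseline_term = expected_score_times(logits, [baseline] * len(logits))
print(grad_raw)
print(grad_centered)
print(baseline_term)
```

The baseline term is zero up to rounding, so the gradients with and without the baseline coincide in expectation. The benefit shows up only in sampled estimates, where centered weights fluctuate less.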

Step 5: The Limits of Vanilla Policy Gradients

At this point we have an algorithm that looks complete:

[Diagram: the vanilla policy-gradient training loop]

The problem is that vanilla policy gradients have a requirement:

the data used to update the policy should ideally be collected by that same policy.

This property is called on-policy. In the formula, the expectation is:

$$\mathbb{E}_{\tau\sim\pi_\theta}[\cdots]$$

which means the data should come from the current policy $\pi_\theta$. But after one gradient update, parameters change from $\theta_{\text{old}}$ to $\theta$. The trajectories we just collected no longer come from the new policy; they come from the old policy $\pi_{\text{old}}$.

If we use each batch only once, training becomes extremely wasteful. Collecting, say, 2048 steps of environment interaction can be expensive, especially for robotics, game simulators, and LLM answer generation. Naturally, we ask:

Can we reuse data collected by the old policy to update the new policy for multiple epochs?

This is PPO's core tension:

We want to reuse old data to improve sample efficiency, but we must not let the new policy drift too far from the old one, otherwise old data will mislead the update.

In the learning-oriented PPO skeleton, collect_rollout deliberately stores the log probability at sampling time:

# [C] Collect a batch of on-policy data: it comes from the "current policy"
def collect_rollout(env, model, steps=2048, device="cpu"):
    obs, _ = env.reset()
    batch = {k: [] for k in ["states", "actions", "rewards", "dones", "old_logprobs", "values"]}

    for _ in range(steps):
        obs_tensor = torch.as_tensor(obs, dtype=torch.float32, device=device)
        with torch.no_grad():
            action, old_logprob, value = model.act(obs_tensor)

        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        batch["states"].append(obs)
        batch["actions"].append(action.item())
        batch["rewards"].append(reward)
        batch["dones"].append(done)
        batch["old_logprobs"].append(old_logprob.item())
        batch["values"].append(value.item())

        obs = next_obs if not done else env.reset()[0]

    return {
        "states": torch.as_tensor(np.array(batch["states"]), dtype=torch.float32, device=device),
        "actions": torch.as_tensor(batch["actions"], dtype=torch.long, device=device),
        "rewards": torch.as_tensor(batch["rewards"], dtype=torch.float32, device=device),
        "dones": torch.as_tensor(batch["dones"], dtype=torch.float32, device=device),
        "old_logprobs": torch.as_tensor(batch["old_logprobs"], dtype=torch.float32, device=device),
        "values": torch.as_tensor(batch["values"], dtype=torch.float32, device=device),
    }

This old_logprobs is $\log \pi_{\text{old}}(a_t\mid s_t)$. During updates, we recompute the same state-action pairs under the new policy to get new_logprobs. Comparing them tells us how far the policy has moved. Importance sampling is the tool that answers whether "old data can still be used."

Step 6: Importance Sampling

The previous issue is that vanilla policy gradients want data collected by $\pi_\theta$. Can we use data collected by $\pi_{\text{old}}$ to evaluate and improve a new policy? Yes, via importance sampling.

6.1 The Importance Sampling Identity

The core identity is: for any function $f$,

$$\mathbb{E}_{a \sim \pi_\theta} [f(a)] = \mathbb{E}_{a \sim \pi_{\text{old}}} \left[ \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)} \cdot f(a) \right]$$

Why is this true? Expand the left side:

$$\mathbb{E}_{a \sim \pi_\theta} [f(a)] = \sum_a \pi_\theta(a\mid s) \cdot f(a)$$

Rewrite $\pi_\theta(a\mid s)$ as $\pi_{\text{old}}(a\mid s)\cdot \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)}$:

$$= \sum_a \pi_{\text{old}}(a\mid s) \cdot \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)} \cdot f(a) = \mathbb{E}_{a \sim \pi_{\text{old}}} \left[ \frac{\pi_\theta(a\mid s)}{\pi_{\text{old}}(a\mid s)} \cdot f(a) \right]$$

The identity holds. The intuition is: we want the expectation of $f$ under the "new world" $\pi_\theta$, but we only have samples from the "old world" $\pi_{\text{old}}$. The fix is to reweight each sample. If the new world is more likely to produce this action than the old world, the weight is greater than 1; otherwise it is less than 1. The weight is exactly $\frac{\pi_\theta}{\pi_{\text{old}}}$.
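The identity is easy to test with a three-action example. In the sketch below, pi_old, pi_new, and f are made-up numbers; the first check sums exactly, and the second draws samples only from pi_old and reweights them.

```python
import random

pi_old = [0.5, 0.3, 0.2]   # distribution that generated the data
pi_new = [0.3, 0.3, 0.4]   # distribution we actually care about
f = [1.0, 2.0, 4.0]        # any per-action quantity

# Exact expectation under the new policy.
exact_new = sum(p * v for p, v in zip(pi_new, f))

# The same expectation written as a reweighted sum under the old policy.
reweighted = sum(q * (p / q) * v for q, p, v in zip(pi_old, pi_new, f))

# Monte Carlo version: sample from pi_old only, then apply the weights.
rng = random.Random(0)
samples = rng.choices(range(3), weights=pi_old, k=20000)
mc_estimate = sum((pi_new[a] / pi_old[a]) * f[a] for a in samples) / len(samples)

print(exact_new, reweighted, mc_estimate)
```

The exact and reweighted sums are both 2.5, and the sampled estimate lands close to that value. The estimate gets noisier as the weights move away from 1, which is the practical reason the new policy should stay near the old one.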

6.2 Policy Ratio

Define the policy ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$$

In code, we compute it using the exponential of the log-prob difference, which is numerically more stable than direct division:

# [E] PPO update: ratio, clip, min, and the total loss all live here
def ppo_update(model, optimizer, batch, advantages, returns,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01,
               epochs=10, minibatch_size=64):
    states = batch["states"]
    actions = batch["actions"]
    old_logprobs = batch["old_logprobs"]
    batch_size = states.size(0)

    for _ in range(epochs):
        indices = torch.randperm(batch_size, device=states.device)
        for start in range(0, batch_size, minibatch_size):
            mb = indices[start:start + minibatch_size]

            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]

$r_t=1$ means the new and old policies assign the same probability to this action. $r_t>1$ means the new policy is more inclined to take this action; $r_t<1$ means the opposite.
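A short sketch shows why the ratio is computed from log-probabilities. The helper name ratio_from_log_probs is illustrative, and the log-probabilities of about -800 are an assumed extreme case (for example, the summed log-probability of a long action sequence) chosen so that the raw probabilities underflow to zero in double precision.

```python
import math


def ratio_from_log_probs(new_logprob, old_logprob):
    """r = pi_new / pi_old, computed as exp(log pi_new - log pi_old)."""
    return math.exp(new_logprob - old_logprob)


# Ordinary case: both routes agree.
r = ratio_from_log_probs(math.log(0.6), math.log(0.5))
print(r)  # approximately 1.2

# Extreme case: the individual probabilities underflow to 0.0.
new_lp, old_lp = -800.0, -800.5
stable = ratio_from_log_probs(new_lp, old_lp)

try:
    naive = math.exp(new_lp) / math.exp(old_lp)
except ZeroDivisionError:
    naive = None  # 0.0 / 0.0: the direct division has lost all information

print(stable, naive)
```

The log-difference route still returns $e^{0.5}$, while the direct division fails because both probabilities have already become 0.0.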

6.3 Surrogate Objective

Apply importance sampling to the policy-gradient objective to get the surrogate objective:

$$L^{\text{IS}}(\theta) = \mathbb{E}_t \left[ r_t(\theta) \cdot A_t \right]$$

Expanded:

$$L^{\text{IS}}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \cdot A_t \right]$$

In code this is surr1 = ratio * advantages, right after ratio in the PPO update:

            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]
            clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
            surr2 = clipped_ratio * advantages[mb]
            policy_loss = -torch.min(surr1, surr2).mean()

This objective has an important property:

at $\theta=\theta_{\text{old}}$, its first-order gradient matches the vanilla policy gradient.

$$\nabla_\theta L^{\text{IS}}(\theta) \bigg|_{\theta = \theta_{\text{old}}} = \nabla_\theta J(\theta) \bigg|_{\theta = \theta_{\text{old}}}$$

The check is straightforward: when $\theta=\theta_{\text{old}}$, we have $r_t=1$. Also $\nabla_\theta r_t = \nabla_\theta \frac{\pi_\theta}{\pi_{\text{old}}} = \frac{\nabla_\theta \pi_\theta}{\pi_{\text{old}}}$, which at $\theta=\theta_{\text{old}}$ equals $\frac{\nabla_\theta \pi_\theta}{\pi_\theta} = \nabla_\theta \log \pi_\theta$. Substituting gives $\mathbb{E}_t[\nabla_\theta \log \pi_\theta(a_t\mid s_t)A_t]$, which is the policy-gradient form.

But once $\theta$ moves away from $\theta_{\text{old}}$, the two objectives diverge. The farther away the new policy is, the less reliable the surrogate becomes. That is the next problem to solve.
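The gradient-matching property can be checked on a small sampled batch. The sketch below assumes a softmax policy at a single state and a hypothetical batch of four sampled actions with made-up advantages; the surrogate is differentiated by finite differences, while the policy-gradient estimate uses the analytic score function.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate(logits, old_probs, actions, advantages):
    """Batch mean of r_t(theta) * A_t."""
    probs = softmax(logits)
    terms = [probs[a] / old_probs[a] * adv for a, adv in zip(actions, advantages)]
    return sum(terms) / len(terms)


def policy_gradient_estimate(logits, actions, advantages):
    """Batch mean of grad log pi(a_t) * A_t for a softmax policy."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a, adv in zip(actions, advantages):
        for i in range(n):
            grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * adv
    return [g / len(actions) for g in grad]


def finite_difference(fn, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((fn(up) - fn(down)) / (2 * h))
    return grad


old_logits = [0.1, 0.5, -0.3]
old_probs = softmax(old_logits)
actions = [0, 1, 1, 2]              # hypothetical sampled actions
advantages = [1.0, -0.5, 2.0, 0.3]  # hypothetical advantage estimates

grad_surrogate = finite_difference(
    lambda th: surrogate(th, old_probs, actions, advantages), old_logits
)
grad_pg = policy_gradient_estimate(old_logits, actions, advantages)
print(grad_surrogate)
print(grad_pg)
```

At $\theta=\theta_{\text{old}}$ the two gradients agree up to finite-difference error. Evaluating the surrogate gradient at other logits would no longer match, which is the divergence described above.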

Step 7: From the Surrogate Objective to PPO-Clip

We now have a key expression:

$$L^{\text{IS}}(\theta) = \mathbb{E}_t[r_t(\theta)A_t]$$

Do not rush to TRPO yet. If we only look at this expression, it already reveals PPO's two core inputs:

| Name | Symbol | Code Variable | What Question It Answers |
| --- | --- | --- | --- |
| policy ratio | $r_t(\theta)$ | ratio | does the new policy prefer this action more than the old policy? |
| advantage | $A_t$ or $\hat{A}_t$ | advantages | is this action better than average at this state? |

If we impose no constraints, we would simply maximize:

$$r_t(\theta)A_t$$

In code, this is surr1 = ratio * advantages:

            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]
            clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
            surr2 = clipped_ratio * advantages[mb]
            policy_loss = -torch.min(surr1, surr2).mean()

You can interpret surr1 as the raw policy-improvement objective. Its rule is:

  • If $A_t>0$, this is a good action; we want to increase its probability, i.e. make $r_t$ larger.
  • If $A_t<0$, this is a bad action; we want to decrease its probability, i.e. make $r_t$ smaller.

But this objective is too greedy. Suppose $A_t=+2$ and the current ratio is $r_t=5$, then $r_tA_t=10$. If we keep pushing this action up so that $r_t$ becomes 10, 50, 100, the objective keeps increasing. The optimizer would think "bigger is always better," but at that point the new policy is far from the old one, and the old data is no longer reliable.

PPO does not introduce a complicated new algorithm. It adds a very direct conservative rule on top of this objective:

You may increase the probability of good actions and decrease the probability of bad actions, but do not let the new policy move too far relative to the old policy.

So we restrict the policy ratio to a small interval:

$$\overline{r}_t(\theta) = \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)$$

If $\varepsilon=0.2$, the interval is $[0.8, 1.2]$. This means: for an action that appears in the old batch, the new policy's probability should ideally not be below $0.8$ times the old policy's probability, and not be above $1.2$ times it.

In code, the unclipped objective surr1, the clipped objective surr2, and the final policy_loss are computed together:

            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]
            clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
            surr2 = clipped_ratio * advantages[mb]
            policy_loss = -torch.min(surr1, surr2).mean()

            value_loss = F.mse_loss(new_values, returns[mb])
            entropy_bonus = entropy.mean()
            loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

Now we have two objectives:

CodeMathMeaning
surr1rt(θ)Atr_t(\theta)A_twhat the policy would like to do without constraints
surr2clip(rt(θ),1ε,1+ε)At\text{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)A_thow far we allow it to change under the update constraint

PPO takes the smaller of the two:

$$J^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta)A_t,\; \text{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)A_t \right) \right]$$

This is PPO-Clip. It is not derived by mechanically transforming the TRPO constraint into algebra. Instead, it starts from the importance-sampling surrogate objective and adds a conservative rule: "do not let the ratio drift too far." TRPO is one historical source of this conservative mindset, but it is not required to understand the PPO code.

In code this is torch.min(surr1, surr2).mean(). Why the minus sign? Because we want to maximize policy_objective, while PyTorch minimizes losses. So we write policy_loss = -policy_objective.
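
To make the sign convention concrete, here is a minimal pure-Python sketch of the same computation on a three-sample mini-batch. The function name `ppo_clip_policy_loss` and the numbers are hypothetical; the logic follows the `ratio`, `clamp`, `min` lines above, written with plain floats so it runs without PyTorch.

```python
import math

def ppo_clip_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Return (policy_objective, policy_loss) for one mini-batch."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)                      # r_t(theta)
        surr1 = ratio * adv                                    # unclipped term
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        surr2 = clipped * adv                                  # clipped term
        terms.append(min(surr1, surr2))                        # pessimistic choice
    objective = sum(terms) / len(terms)                        # what we maximize
    return objective, -objective                               # loss is its negative

# Hypothetical mini-batch: ratios 1.5, 0.5 and 1.0 with advantages +2, -2, +1.
old_lp = [math.log(0.2), math.log(0.4), math.log(0.3)]
new_lp = [math.log(0.3), math.log(0.2), math.log(0.3)]
adv = [2.0, -2.0, 1.0]

objective, loss = ppo_clip_policy_loss(new_lp, old_lp, adv)
print(f"objective = {objective:.4f}, policy_loss = {loss:.4f}")
```

The three per-sample terms are $1.2 \times 2 = 2.4$, $0.8 \times (-2) = -1.6$, and $1.0$, so the objective is $0.6$ and the loss is $-0.6$.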

What Clipping Does

Case 1: $A_t > 0$ (good action; probability should increase)

When $A_t>0$, we want $r_t$ to increase (the new policy assigns higher probability to the action). The unclipped term $r_t\cdot A_t$ grows linearly with $r_t$ with no upper bound. The clipped term $\overline{r}_t\cdot A_t$ becomes a constant $(1+\varepsilon)\cdot A_t$ once $r_t>1+\varepsilon$.

| Range of $r_t$ | Unclipped $r_t \cdot A_t$ | Clipped $\overline{r}_t \cdot A_t$ | Which One $\min$ Picks |
| --- | --- | --- | --- |
| $r_t < 1-\varepsilon$ | $r_t\cdot A_t$ (smaller) | $(1-\varepsilon)\cdot A_t$ (constant) | unclipped term; gradient pushes $r_t$ back up |
| $1-\varepsilon \le r_t \le 1+\varepsilon$ | $r_t\cdot A_t$ | $r_t\cdot A_t$ | equal; normal optimization |
| $r_t > 1+\varepsilon$ | $r_t\cdot A_t$ (larger) | $(1+\varepsilon)\cdot A_t$ (constant) | clipped term; zero gradient |

So the probability of good actions can increase, but only up to about $(1+\varepsilon)$ times that of the old policy. Beyond that, the objective becomes "flat": it stops rewarding further increases, so the gradient becomes zero.

Case 2: $A_t < 0$ (bad action; probability should decrease)

When $A_t<0$, we want $r_t$ to decrease (the new policy assigns lower probability). But if $r_t$ has already dropped below $1-\varepsilon$, the new policy has already pushed that action probability down too much; PPO no longer rewards further suppression.

This is easy to misread because $A_t$ is negative. Consider a numeric example: $A_t=-2$, $\varepsilon=0.2$. If $r_t=0.7$, the unclipped term is $0.7\times(-2)=-1.4$, while the clipped term is $0.8\times(-2)=-1.6$. The $\min$ picks the smaller value, i.e. $-1.6$, which is the clipped term. Since the clipped term is constant, the gradient is zero.

| Range of $r_t$ | Unclipped $r_t \cdot A_t$ | Clipped $\overline{r}_t \cdot A_t$ | Which One $\min$ Picks |
| --- | --- | --- | --- |
| $r_t < 1-\varepsilon$ | larger (e.g. $-1.4$) | $(1-\varepsilon)\cdot A_t$ (constant) | clipped term; zero gradient |
| $r_t \ge 1-\varepsilon$ | $r_t\cdot A_t$ | equal inside the interval; clipped becomes larger above $1+\varepsilon$ | unclipped term; keep optimizing |

So the probability of bad actions can decrease, but only down to about $(1-\varepsilon)$ times the old probability. Beyond that the objective goes flat and stops providing further incentive. If the bad action probability increases instead, the unclipped term makes the objective worse, and the gradient pulls it back.

Case 3: $A_t = 0$ (neutral action)

Then $r_t\cdot A_t=0$. No matter how $r_t$ changes, the objective is always 0, so PPO does not adjust that action.

Putting these cases together, the meaning of PPO-Clip becomes clear:

it does not forbid learning; it simply stops rewarding the part of the change that has already gone too far.
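
The three cases can also be checked numerically. The sketch below estimates the slope of the per-sample objective with respect to $r_t$ by central finite differences; the helper names, the step size `h`, and the sample points are assumptions chosen for illustration.

```python
def clip_objective(r, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r) * A)."""
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * adv, r_clipped * adv)

def slope(r, adv, h=1e-6):
    """Central finite-difference estimate of d(objective)/dr (illustrative)."""
    return (clip_objective(r + h, adv) - clip_objective(r - h, adv)) / (2 * h)

# (label, ratio, advantage) triples covering the cases discussed above.
cases = [
    ("A>0, r inside", 1.1, 2.0),
    ("A>0, r too high", 1.5, 2.0),
    ("A<0, r too low", 0.7, -2.0),
    ("A<0, r too high", 1.5, -2.0),
    ("A=0", 1.5, 0.0),
]
slopes = {name: slope(r, adv) for name, r, adv in cases}
for name, s in slopes.items():
    print(f"{name:<16} slope = {s:+.3f}")
```

A slope of zero means the sample no longer influences the update. The one nonzero slope outside the interval ($A_t<0$, $r_t=1.5$) is negative, which is the "pull back" effect: increasing the probability of a bad action keeps lowering the objective.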

```python
import numpy as np
import matplotlib.pyplot as plt

# ==========================================
# Geometric intuition for the PPO-Clip objective
# ==========================================
epsilon = 0.2
r = np.linspace(0.0, 2.0, 500)

def clip_objective(r, A, eps=0.2):
    r_clipped = np.clip(r, 1 - eps, 1 + eps)
    return np.minimum(r * A, r_clipped * A)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (A_val, title) in zip(
    axes,
    [(1.0, "A > 0 (good action)"), (-1.0, "A < 0 (bad action)"), (0.0, "A = 0 (neutral)")],
):
    obj = clip_objective(r, A_val, eps=epsilon)
    ax.plot(r, r * A_val, "b--", alpha=0.4, label="unclipped r·A")
    ax.plot(r, obj, "r-", linewidth=2, label="PPO-Clip min(...)")
    ax.axvspan(1 - epsilon, 1 + epsilon, alpha=0.1, color="green", label="safe interval")
    ax.set_title(title)
    ax.set_xlabel("policy ratio r_t(θ)")
    ax.set_ylabel("objective value")
    ax.legend(fontsize=8)

plt.suptitle("Three cases of the PPO-Clip objective (ε=0.2)", fontsize=13)
plt.tight_layout()
plt.savefig("ppo_clip_three_cases.png", dpi=150)
print("Saved visualization")
```

Clipping Intuition

If you look at the three cases together, PPO-Clip's design intention becomes very clear:

Mermaid diagram

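With $\varepsilon=0.2$, once an action's ratio leaves $[0.8, 1.2]$ in the rewarded direction, the objective stops rewarding further movement, so the policy has little incentive to drift far from the old one. This "safety rail" is a soft one: it does not strictly bound the ratio, but it makes a large jump within a single update much less likely, even when gradient estimates are noisy.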
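
One way to see how often the rail is active is the clip fraction from the previous section: the share of samples whose ratio lies outside $[1-\varepsilon, 1+\varepsilon]$. The sketch below uses hypothetical ratios and a hypothetical helper name; it follows the common definition (mean of $|r_t - 1| > \varepsilon$), and individual libraries may differ in detail.

```python
def clip_fraction(ratios, eps=0.2):
    """Fraction of ratios that fall outside [1 - eps, 1 + eps]."""
    outside = [abs(r - 1.0) > eps for r in ratios]
    return sum(outside) / len(ratios)

# Hypothetical ratios from one mini-batch after a few gradient steps.
ratios = [0.95, 1.02, 1.30, 0.70, 1.10, 1.00, 1.25, 0.99]
frac = clip_fraction(ratios)
print(f"clip fraction = {frac:.3f}")
```

Here three of the eight ratios (1.30, 0.70, 1.25) are outside the interval, so the clip fraction is 0.375. A value that stays near zero suggests the clip rarely matters; a persistently high value suggests updates are hitting the rail often.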

Step 8: PPO Is Not Only a Loss Function

At this point it is easy to form a misconception: does understanding PPO mean understanding the PPO loss? The answer is: no.

PPO is a policy-optimization algorithm. More concretely, it is a training procedure that answers:

Given a policy network that already acts, how do we use newly collected experience to make it reliably better?

So PPO is not a single formula, and it is not just one loss.backward() call. A complete PPO method includes at least these pieces:

| Component in PPO | What It Does | Where It Appears in Code |
| --- | --- | --- |
| sampling with the current policy | interact with the environment to collect a new batch | `collect_trajectories(...)` |
| old policy record | store action probabilities at sampling time for later comparison | `old_logprobs` |
| advantage estimation | judge whether each action is above/below average | `advantages` / `compute_gae(...)` |
| clipped policy update | update the Actor while constraining drift from the old policy | `ppo_clip_loss(...)` |
| value-function training | train the Critic to estimate state values accurately | `value_loss` |
| entropy bonus | maintain exploration; avoid becoming too confident too early | `entropy_bonus` |
| multi-epoch mini-batch updates | reuse the same batch for multiple epochs to improve sample use | `n_epochs` / mini-batch |

Therefore, the PPO loss is not the entirety of PPO, but it is the most important "policy update rule" within PPO. It tells the Actor which action probabilities to increase, which to decrease, and the maximum allowed change.

You can think of PPO as a training protocol:

Mermaid diagram

The reason "loss" matters is that neural networks update parameters through backpropagation. To affect parameters, PPO's ideas must become a differentiable objective. That is why we emphasize PPO loss, but you should not shrink PPO into the loss alone.

Step 9: How PPO Appears in Code

If we keep only PPO's core policy update, the landing point is the following lines:

```python
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```

This piece of code needs three main inputs:

| Input | Where It Comes From | What It Does |
| --- | --- | --- |
| `old_logprobs` | stored during rollout collection | records the old policy's probability for the action |
| `new_logprobs` | recomputed during update | the new policy's probability for the same action |
| `advantages` | computed from returns, Critic, or GAE | tells whether the action should be encouraged or suppressed |

It outputs a scalar policy_loss. That scalar enters the total loss, which is what backpropagation consumes:

```python
value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Of course, real PPO does not only train the Actor; it also trains the Critic, and usually includes an entropy bonus to encourage exploration. So we combine policy_loss into a full loss:

loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus.

If you have derived PPO on paper and want to implement it, you only need to connect the data in this order:

Mermaid diagram

This is the minimal closed loop that turns PPO formulas into a training program.

Supplement: TRPO Is Historical Context, Not a Required Derivation

TRPO (Trust Region Policy Optimization) and PPO solve the same issue: policy updates must not be too large. TRPO is written as:

$$\max_\theta L^{\text{IS}}(\theta) \quad \text{s.t.} \quad \bar{D}_{\text{KL}}(\theta_{\text{old}}, \theta) \leq \delta$$

This means: optimize the surrogate objective, but keep the average KL divergence between old and new policies below a small threshold $\delta$.
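
To give the constraint a concrete shape, the sketch below computes the average KL divergence between an old and a new categorical policy over two states and compares it with a threshold. The distributions, the helper name `kl_categorical`, and the value $\delta = 0.01$ are illustrative assumptions; this only evaluates the constraint and is not TRPO's optimization procedure.

```python
import math

def kl_categorical(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

# Hypothetical action distributions at two states, before and after an update.
old_policy = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
new_policy = [[0.55, 0.27, 0.18], [0.1, 0.6, 0.3]]

per_state = [kl_categorical(po, pn) for po, pn in zip(old_policy, new_policy)]
mean_kl = sum(per_state) / len(per_state)
delta = 0.01  # illustrative trust-region threshold

print(f"per-state KL = {[round(k, 5) for k in per_state]}")
print(f"mean KL = {mean_kl:.5f}, within delta: {mean_kl <= delta}")
```

The second state is unchanged, so its KL is zero; the first state moves slightly, and the average stays below the threshold.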

This path is theoretically elegant, but in practice it requires constrained optimization, conjugate gradients, approximate second-order information, and more. For a chapter whose goal is "derive PPO loss from formulas," TRPO is not a necessary prerequisite. Treat it as a historical note:

TRPO limits policy change via a KL constraint; PPO approximates a similar effect by clipping the policy ratio.

So the main line should be:

Mermaid diagram

TRPO simply reminds us: PPO's "Proximal" comes from trust-region thinking, but the concrete code you need is ratio, clamp, min, and the total loss.

Step 10: The Full PPO Loss

In real training, PPO does not only optimize the clipped surrogate; it trains the Critic and preserves exploration. To avoid symbol confusion, separate two things:

  • $J^{\text{PPO}}(\theta)$: the mathematical objective we want to maximize.
  • loss: the training loss we minimize in code.

The maximization objective can be written as:

$$J^{\text{PPO}}(\theta) = J^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2 H[\pi_\theta]$$

Here $J^{\text{CLIP}}$ is the policy-improvement objective, $L^{\text{VF}}$ is the Critic's value error, and $H[\pi_\theta]$ is the policy entropy. Since code minimizes a loss, we minimize $-J^{\text{PPO}}$: the policy objective and the entropy term pick up a minus sign, and the value error becomes a plus term.
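
The relation between the two can be checked with plain numbers. In the sketch below, the function names and the three scalar inputs are hypothetical; the coefficients 0.5 and 0.01 are the typical values used in this chapter.

```python
def ppo_objective(j_clip, l_vf, entropy, c1=0.5, c2=0.01):
    """J^PPO = J^CLIP - c1 * L^VF + c2 * H (the quantity to maximize)."""
    return j_clip - c1 * l_vf + c2 * entropy

def ppo_loss(j_clip, l_vf, entropy, vf_coef=0.5, ent_coef=0.01):
    """Code-side loss: policy_loss + vf_coef * value_loss - ent_coef * entropy."""
    policy_loss = -j_clip
    return policy_loss + vf_coef * l_vf - ent_coef * entropy

# Hypothetical scalar values for one mini-batch.
j_clip, l_vf, entropy = 0.6, 1.2, 0.9
J = ppo_objective(j_clip, l_vf, entropy)
loss = ppo_loss(j_clip, l_vf, entropy)
print(f"J_PPO = {J:.3f}, loss = {loss:.3f}")
```

With matching coefficients the two are exact negatives of each other, so minimizing `loss` and maximizing $J^{\text{PPO}}$ are the same optimization problem.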

In code, the total loss is composed here:

```python
value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Policy Loss

The policy maximization objective is the clipped surrogate:

$$J^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot A_t,\; \overline{r}_t(\theta) \cdot A_t \right) \right]$$

The policy_loss in code is its negative:

$$L^{\text{policy}}(\theta) = -J^{\text{CLIP}}(\theta)$$

This term updates the Actor: increase probabilities of good actions, decrease probabilities of bad actions, while clipping constrains the magnitude of change within a safe range.

Value-Function Loss

The Critic should estimate state values accurately. The value loss is the mean squared error between the Critic prediction $V_\theta(s_t)$ and a target return $V_t^{\text{targ}}$:

$$L^{\text{VF}}(\theta) = \mathbb{E}_t \left[ \left( V_\theta(s_t) - V_t^{\text{targ}} \right)^2 \right]$$

Here $V_t^{\text{targ}}$ is computed via GAE (derived in detail in the next section).

Why do we need a separate value loss? Because the Critic's accuracy directly determines the quality of the advantage estimate $A_t$. If the Critic is inaccurate, $A_t$ will have large bias and can mislead the Actor. The MSE loss continuously corrects the Critic so its predictions track true returns.

In code: value_loss = F.mse_loss(new_values, returns[mb]). It is backpropagated together with policy_loss in the same update function.

Entropy Bonus

Policy entropy encourages exploration and prevents premature collapse to a deterministic policy:

$$H[\pi_\theta] = -\mathbb{E}_t \left[ \sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) \right]$$

Higher entropy means the policy is more "hesitant" (more uniform action distribution), which encourages exploration; lower entropy means the policy is more "certain" (always choosing one action), which reduces exploration. The coefficient $c_2$ is often around 0.01.

Why include entropy? Clipping stabilizes training, but it can also cause a side effect: the policy may "lock onto" a suboptimal action too early. The entropy bonus rewards uncertainty inside the loss, ensuring the policy retains ongoing exploration pressure.

In code: entropy_bonus = entropy.mean(). Note the minus sign in the total loss: - ent_coef * entropy_bonus, because we want to maximize entropy, which is equivalent to subtracting it when minimizing loss.
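
For intuition about the scale of this term, the sketch below evaluates the entropy formula for a uniform and a nearly deterministic distribution over four actions. The helper name and both distributions are illustrative assumptions; in the chapter's code the same quantity comes from `dist.entropy()`.

```python
import math

def categorical_entropy(probs):
    """H = -sum_a p(a) * log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally "hesitant" over 4 actions
peaked = [0.97, 0.01, 0.01, 0.01]    # almost deterministic

h_uniform = categorical_entropy(uniform)
h_peaked = categorical_entropy(peaked)
print(f"uniform: {h_uniform:.4f} (log 4 = {math.log(4):.4f})")
print(f"peaked:  {h_peaked:.4f}")
```

The uniform distribution reaches the maximum $\log 4$, while the peaked one is much lower. Subtracting `ent_coef * entropy_bonus` from the loss therefore favors the first kind of distribution slightly, all else being equal.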

How the Three Terms Work Together

Mermaid diagram

Each term does a different job:

policy loss drives Actor improvement, value loss ensures the Critic provides accurate advantage signals, and entropy bonus preserves exploration.

They collaborate through the shared Actor-Critic network. In ppo_from_scratch.py, the Actor and Critic share the same backbone network (self.backbone in the forward method), so one backpropagation updates both.

Hyperparameter Summary

| Symbol | Name | Typical Value | Role | Code Parameter |
| --- | --- | --- | --- | --- |
| $\varepsilon$ | clip range | 0.1-0.2 | limits how far ratios may move | `clip_range` |
| $c_1$ | value-loss coefficient | 0.5 | balances policy update vs value fitting | `vf_coef` |
| $c_2$ | entropy coefficient | 0.01 | encourages exploration | `ent_coef` |
| $\gamma$ | discount factor | 0.99 | decay of future rewards | `gamma` |
| $\lambda$ | GAE parameter | 0.95 | bias-variance tradeoff in advantage estimation | `gae_lambda` |
| $T$ | rollout length | 2048 | how many steps to collect per rollout | `n_steps` |
| $K$ | number of epochs | 10 | how many passes over the same data batch | `n_epochs` |

Step 11: The Complete PPO Algorithm

Putting everything together, the PPO training loop is:

Mermaid diagram

If you compare against the code, each step can be traced to a specific piece:

```python
    # [A] Policy and value function: the Actor outputs action probabilities,
    #     the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action: draw from the policy distribution and keep the
    #     old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy
```

```python
# [D] Compute advantages: how much better this action is than the average
#     for the current state
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # value after the final step; 0.0 means no bootstrapping

    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```

```python
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

```python
# [F] Training loop: collect a batch, then update on it for several epochs
device = "cuda" if torch.cuda.is_available() else "cpu"
env = gym.make("CartPole-v1")
model = ActorCritic(env.observation_space.shape[0], env.action_space.n).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for update in range(100):
    batch = collect_rollout(env, model, steps=2048, device=device)
    advantages, returns = compute_gae(batch["rewards"], batch["values"], batch["dones"])
    ppo_update(model, optimizer, batch, advantages, returns)
```

Some key design decisions and their intuition:

  • Reuse the same data for $K$ epochs: collecting data is expensive (requires running the environment), so we update multiple times on the same batch. Clipping prevents multi-epoch updates from drifting too far.
  • Mini-batch updates: split $T$ steps into several mini-batches; compute gradients per mini-batch to improve training efficiency.
  • Recompute $r_t$ each epoch: even though the data batch is the same, $\theta$ changes after each epoch, so $r_t$ changes too; clipping continues to take effect dynamically.
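
The epoch and mini-batch structure described in these bullets can be sketched with indices alone. The generator name `minibatch_indices`, the mini-batch size of 64, and the fixed seed are assumptions for illustration; the body of the loop only counts updates, where real code would recompute the ratio, build the loss, and step the optimizer.

```python
import random

def minibatch_indices(n_steps, batch_size, n_epochs, seed=0):
    """Yield (epoch, index_list) pairs; every epoch reshuffles the same rollout."""
    rng = random.Random(seed)
    indices = list(range(n_steps))
    for epoch in range(n_epochs):
        rng.shuffle(indices)
        for start in range(0, n_steps, batch_size):
            yield epoch, indices[start:start + batch_size]

# Hypothetical sizes: T = 2048 collected steps, 64-step mini-batches, K = 10 epochs.
n_updates = 0
seen_per_epoch = {}
for epoch, mb in minibatch_indices(n_steps=2048, batch_size=64, n_epochs=10):
    # Real code would evaluate the current policy on mb here and take one gradient step.
    seen_per_epoch.setdefault(epoch, set()).update(mb)
    n_updates += 1

print(f"gradient updates per rollout: {n_updates}")
```

With these sizes, one rollout of 2048 steps yields $2048 / 64 \times 10 = 320$ gradient updates, and every epoch visits each collected step exactly once.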

Derivation Note: PPO-Penalty Variant

The PPO paper actually proposes two variants. Besides PPO-Clip, it proposes PPO-Penalty (also called PPO-KL), which directly adds a KL penalty term:

$$L^{\text{KL}}(\theta) = \mathbb{E}_t \left[ r_t(\theta) \cdot A_t - \beta \cdot D_{\text{KL}}(\pi_{\text{old}}, \pi_\theta) \right]$$

$\beta$ is an adaptive coefficient: if current KL is too large, increase $\beta$ to penalize more; if KL is too small, decrease $\beta$ to loosen the constraint.

PPO-Penalty can be better in some settings (especially when you need precise control of policy change), but it is more complex to implement and introduces an additional adaptive mechanism to tune. In practice, PPO-Clip is more common.
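
The adaptive rule for $\beta$ can be sketched in a few lines. The function name `update_beta` and the measured KL values are hypothetical; the constants 1.5 and 2 follow the heuristic described in the PPO paper and should be treated as tunable rather than fixed.

```python
def update_beta(beta, kl, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the KL penalty coefficient after a policy update."""
    if kl > kl_target * tolerance:      # policy moved too far: penalize more
        return beta * factor
    if kl < kl_target / tolerance:      # policy barely moved: loosen the penalty
        return beta / factor
    return beta                         # KL is close to the target: keep beta

# Hypothetical measured KL values over five consecutive updates.
beta, kl_target = 1.0, 0.01
history = []
for kl in [0.030, 0.020, 0.011, 0.004, 0.009]:
    beta = update_beta(beta, kl, kl_target)
    history.append(beta)
print(history)
```

In this run $\beta$ doubles twice while the KL is above $0.015$, holds when the KL is near the target, and halves once the KL falls below about $0.0067$.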

Thought Question 1: If we set ε to 0, what does PPO-Clip degenerate into?

When $\varepsilon=0$, the clipping interval collapses to $[1,1]$, so $\overline{r}_t(\theta)=1$. The PPO-Clip objective becomes:

$$J^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot A_t, \; 1 \cdot A_t \right) \right]$$

For $A_t>0$, $\min(r_t \cdot A_t, A_t)$: when $r_t>1$, the objective is the constant $A_t$, so further increasing a good action's probability no longer improves the objective; when $r_t<1$, the objective is $r_t\cdot A_t$ and the gradient only pushes it back toward 1. This means good actions cannot be meaningfully increased above the old policy.

For $A_t < 0$, $\min(r_t \cdot A_t, A_t)$: when $r_t < 1$, the objective is the constant $A_t$, so further decreasing a bad action's probability no longer improves the objective; when $r_t > 1$, the objective is $r_t\cdot A_t$ and the gradient only pushes it back toward 1. This means bad actions cannot be meaningfully decreased below the old policy either.

In short, $\varepsilon=0$ almost freezes the policy near the old one: whether advantages are positive or negative, the policy cannot make meaningful improvements. This shows $\varepsilon$ controls both "allowed change magnitude" and "learning capacity."
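
A small grid search makes the same point numerically. The sketch below, with hypothetical advantages of $\pm 2$ and a grid of ratios from 0.01 to 3.00, compares the best achievable per-sample objective with the value at $r_t=1$.

```python
def clip_objective(r, adv, eps):
    """Per-sample PPO-Clip objective: min(r * A, clip(r) * A)."""
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * adv, r_clipped * adv)

grid = [k / 100 for k in range(1, 301)]   # ratios 0.01, 0.02, ..., 3.00
adv_good, adv_bad = 2.0, -2.0

# With eps = 0, no ratio on the grid beats staying at r = 1.
best_good_0 = max(clip_objective(r, adv_good, eps=0.0) for r in grid)
best_bad_0 = max(clip_objective(r, adv_bad, eps=0.0) for r in grid)
at_one_good = clip_objective(1.0, adv_good, eps=0.0)
at_one_bad = clip_objective(1.0, adv_bad, eps=0.0)

# With eps = 0.2, moving the ratio does pay off, up to the clip boundary.
best_good_02 = max(clip_objective(r, adv_good, eps=0.2) for r in grid)
best_bad_02 = max(clip_objective(r, adv_bad, eps=0.2) for r in grid)

print(f"eps=0.0: best {best_good_0}, {best_bad_0}; at r=1 {at_one_good}, {at_one_bad}")
print(f"eps=0.2: best {best_good_02}, {best_bad_02}")
```

With $\varepsilon=0$ the best values equal the values at $r_t=1$ ($2$ and $-2$), so there is nothing to gain from changing the policy. With $\varepsilon=0.2$ they improve to $2.4$ and $-1.6$.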

Thought Question 2: Can clipping fully replace a KL constraint? Can clipping fail?

Clipping effectively limits policy change in most situations, but it has a theoretical weakness: it constrains the ratio $r_t$ for each individual action, rather than directly constraining the overall distribution distance (KL divergence) between two policies.

Consider an extreme case: a policy has 100 actions, and clipping allows each action probability to change by $\pm 20\%$. If all actions are pushed to the boundary simultaneously, the overall distribution change can exceed a KL constraint such as $\delta=0.01$. In practice, this is rare because advantage estimates are noisy and usually do not push all actions in extreme directions simultaneously. But for settings where policy-change control must be strict (e.g., LLM alignment), practitioners often monitor KL as an additional safety metric. This is why in Chapter 8's RLHF training you will see both clip_fraction and approx_kl logged.
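
The extreme case can be worked out directly. The sketch below assumes a uniform old policy over 100 actions and a new policy that scales half of the probabilities by 1.2 and the other half by 0.8, which is one hypothetical way of pushing every action to the clip boundary while still summing to 1.

```python
import math

n_actions = 100
old = [1.0 / n_actions] * n_actions
# Half of the actions scaled by 1.2, the other half by 0.8 (still sums to 1).
new = [p * 1.2 for p in old[:50]] + [p * 0.8 for p in old[50:]]

kl = sum(po * math.log(po / pn) for po, pn in zip(old, new))
max_ratio_dev = max(abs(pn / po - 1.0) for po, pn in zip(old, new))

print(f"sum(new) = {sum(new):.6f}")
print(f"max |r - 1| = {max_ratio_dev:.2f}, KL(old || new) = {kl:.4f}")
```

Every ratio stays within $[0.8, 1.2]$, yet the KL divergence is about $0.0204$, roughly twice the example threshold $\delta=0.01$.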

Thought Question 3: Why does PPO update K epochs on the same batch, instead of collecting K batches and updating once each?

Both strategies run the same number of gradient passes ($K$ passes over $T$ steps each), but they differ in data freshness and in how much environment interaction they need: $K \times T$ collected steps versus only $T$.

"Collect K batches, update once each" uses fresh data from the current policy every time, so the gradient estimate is unbiased. But collecting data requires environment simulation, which is often far more expensive than parameter updates. In LLM settings, generating a batch of responses can take minutes, while a gradient update can take seconds.

"Collect one batch, update K epochs" reuses old data for multiple updates. From the importance-sampling viewpoint, only the first epoch is unbiased; later epochs introduce bias as $\theta$ drifts away from $\theta_{\text{old}}$. Clipping is designed to mitigate this: when the drift becomes too large, clipping drives gradients toward zero and effectively stops unsafe updates. This is an engineering tradeoff: accept "small bias" in exchange for "large compute savings."

In practice, $K$ is often 3-10, and clipping can keep the bias within an acceptable range.


At this point, you have the complete mathematical picture of PPO: from policy gradients, to the importance-sampling surrogate objective, to the PPO-Clip policy loss formed by ratio, clamp, and min, and finally to the total loss that can be backpropagated directly.

The next two sections each go deeper into a key detail:
