7.2 PPO Mathematical Derivation
In the previous section, we trained LunarLander with SB3's PPO and looked at curves such as reward, entropy, and clip fraction. Now we should answer a more basic question:
What exactly is PPO, and why does it eventually become a single loss function?
Prerequisites
This section integrates and extends the material from Chapters 5 and 6. The following ideas will appear repeatedly in the derivation:
- Policy objective - what PPO tries to maximize
- Policy Gradient Theorem - where PPO's derivation starts
- Advantage function - a lower-variance substitute for the raw return $G_t$
- Critic training - the theoretical source of the value-function loss
- Discounted return - from one-step rewards to long-horizon objectives
PPO stands for Proximal Policy Optimization. The name is worth unpacking:
- Policy: the model that chooses actions.
- Optimization: training, i.e., improving that policy.
- Proximal: "nearby" updates; the new policy should not move too far from the old one.
So here is the headline conclusion:
PPO is not a policy, and it is not merely a loss. PPO is a method for training a policy network.
In reinforcement learning, the policy is the object we truly train. It is usually written as:
$$\pi_\theta(a \mid s)$$
This means: "under state $s$, the policy network parameterized by $\theta$ assigns probability $\pi_\theta(a \mid s)$ to action $a$." In code, this policy is typically the Actor network. For example, the Actor takes a game frame or a robot state as input, and outputs a probability distribution over actions.
What PPO provides is a recipe for training this Actor. It does not hard-code actions, and it does not replace the policy network. Instead, it specifies an update rule: use the current Actor to collect a batch of experience, then adjust the Actor using that batch, while preventing each update from being too aggressive.
It helps to separate three closely-related concepts:
| Name | What It Is | Roughly What It Corresponds To In Code |
|---|---|---|
| Policy | the object being trained; chooses actions given states | actor / model output action_probs |
| PPO | the training method: sampling, advantage estimation, constrained updates, backprop | the full training loop |
| PPO loss | a differentiable objective used to update network parameters in PPO | policy_loss + value_loss - entropy_bonus |
Why will we keep talking about a loss? Because neural networks cannot directly interpret the instruction "make the policy more stable; do not change too fast." An optimizer understands a very specific interface: give it a scalar loss, it computes gradients via loss.backward(), then updates parameters via optimizer.step().
So PPO's ideas must eventually become a loss in order to update the Actor and Critic.
Put differently: PPO is a method, the policy is the model being trained, and the loss is the training signal that makes the method real in code. We derive the PPO loss not because PPO is only a loss, but because the loss is the point at which PPO touches neural-network parameters.
PPO Code Skeleton
To keep the formulas grounded, we first show what PPO "looks like in code." The code below is not an engineering-optimized implementation. It is a learning-oriented minimal PyTorch PPO skeleton: policy network, sampling, advantage estimation, PPO-Clip loss, value-function loss, entropy bonus, and multiple epochs of updates.
Every time we derive a new formula, we will come back to a corresponding part of this code. The highlighted lines are the ones we will repeatedly unpack. You do not need to fully understand every line now; just remember the big picture:
PPO ultimately links "collect experience, estimate advantages, constrain policy changes, backpropagate updates" into a single training loop.
```python
# (methods of the ActorCritic class; the network definition above them is omitted)

    # [A] Policy and value function: the Actor outputs action probabilities,
    #     the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action from the policy distribution and keep the old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy

# ⋮

# [D] Compute advantages: how much better this action is than the state's average level
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # simplification: no bootstrap value after the last step of the rollout

    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns

# ⋮

# [E] Excerpt from inside the minibatch loop of ppo_update
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()

# ⋮

# [F] Training loop: collect a batch of data, then update on that batch for several epochs
device = "cuda" if torch.cuda.is_available() else "cpu"
env = gym.make("CartPole-v1")
model = ActorCritic(env.observation_space.shape[0], env.action_space.n).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for update in range(100):
    batch = collect_rollout(env, model, steps=2048, device=device)
    advantages, returns = compute_gae(batch["rewards"], batch["values"], batch["dones"])
    ppo_update(model, optimizer, batch, advantages, returns)
```
You can roughly split the code into six parts:
| Tag | Code Block | What We Will Explain Later |
|---|---|---|
| [A] | forward | what the policy and value function are |
| [B] | act / evaluate | why we construct dist, and why we store log_prob |
| [C] | collect_rollout | what on-policy data is, and why we record the old policy probabilities |
| [D] | compute_gae | how returns, value functions, and advantages relate |
| [E] | ppo_update | PPO-Clip's ratio, clamp, min, and the total loss |
| [F] | training loop | why we update multiple epochs on the same batch |
When key variables appear later, we will repeatedly refer back to this mapping table:
| Symbol | Meaning | Typical Code Variable |
|---|---|---|
| $s_t$ | state at time $t$ | states |
| $a_t$ | action taken at time $t$ | actions |
| $r_t$ | reward at time $t$ | rewards |
| $G_t$ | discounted return starting from time $t$ | returns |
| $V_\phi(s_t)$ | Critic's estimate of future return from state $s_t$ | value / new_values |
| $A_t$ or $\hat{A}_t$ | advantage estimate: how much better this action is than the state's baseline | advantages |
| $\pi_{\theta_{\text{old}}}$ | the old policy that collected this batch | stored old_logprobs |
| $r_t(\theta)$ | probability ratio between the new and the old policy | ratio = exp(new_logprobs - old_logprobs) |
| $\epsilon$ | PPO clipping range, often 0.1 or 0.2 | clip_eps / clip_range |
| $\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ | policy entropy (how random the action distribution is) | entropy |
Step 1: A Probabilistic View of Reinforcement Learning
The most basic reinforcement-learning loop is:
$$s_t \rightarrow a_t \rightarrow r_t \rightarrow s_{t+1} \rightarrow \cdots$$
Here $t$ is the time step. $s_t$ is the state observed at step $t$, $a_t$ is the action taken, and $r_t$ is the immediate feedback from the environment. Reinforcement learning is not about a single reward; it is about the long-term result produced by a sequence of decisions.
We typically formalize the environment as a Markov Decision Process (MDP):
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$
Each symbol means (review: the MDP 5-tuple):
- $\mathcal{S}$: state space, the set of all possible states.
- $\mathcal{A}$: action space, the set of all possible actions.
- $P(s' \mid s, a)$: transition probability, the probability of moving to $s'$ after taking action $a$ at state $s$.
- $R(s, a)$: reward function, the immediate payoff of the action.
- $\gamma \in [0, 1]$: discount factor, how much we value future rewards today.
The policy is what we train. In symbols:
$$\pi_\theta(a \mid s)$$
This means: "the policy network with parameters $\theta$ assigns probability $\pi_\theta(a \mid s)$ to action $a$ under state $s$." In code, the Actor typically outputs action probabilities, then wraps them into a distribution object dist:
```python
# (methods of the ActorCritic class; the network definition above them is omitted)

    # [A] Policy and value function: the Actor outputs action probabilities,
    #     the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action from the policy distribution and keep the old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy
```
This excerpt shows [A] policy outputs and [B] action sampling. In the full code, the network definitions appear before it, and rollout collection appears after it.
action_probs corresponds to $\pi_\theta(\cdot \mid s)$, the probability distribution over all actions. For example, in a discrete environment with 3 actions, action_probs = [0.1, 0.7, 0.2] means action 0 has 10% probability, action 1 has 70%, and action 2 has 20%.
dist is short for distribution: a distribution object. Categorical(action_probs) wraps the probabilities into a discrete distribution. It is not an action, and it is not a parameter; it is better viewed as a "lottery box with tools" where each action has its own probability.
This object provides methods that show up everywhere in RL code:
| Code | Meaning | Math Counterpart |
|---|---|---|
| dist.sample() | sample an action according to action_probs instead of always taking argmax | $a_t \sim \pi_\theta(\cdot \mid s_t)$ |
| dist.log_prob(action) | the log probability of the sampled action | $\log \pi_\theta(a_t \mid s_t)$ |
| dist.entropy() | how random the action distribution is (later used to encourage exploration) | $\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ |
So action is sampled from dist, and log_prob is the log probability of that action under the current policy. We will need it for policy gradients and PPO ratios. Here we use Categorical because tasks like CartPole and LunarLander have discrete actions. For continuous actions, one often uses a continuous distribution such as Normal, but the pattern is the same:
construct a distribution, sample an action, and record the log_prob.
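For continuous actions, the same three steps might look like the sketch below. It assumes a diagonal Gaussian policy; `mean` and `log_std` are hypothetical stand-ins for what an Actor would output, not names from this chapter's skeleton. Because the action dimensions are treated as independent, the per-dimension log-probabilities are summed to get the log-probability of the whole action vector.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical Actor outputs for one state with a 2-dimensional action.
mean = torch.tensor([0.3, -0.1])
log_std = torch.tensor([-0.5, -0.5])

# 1) construct the distribution (one independent Gaussian per action dimension)
dist = Normal(mean, log_std.exp())

# 2) sample an action
action = dist.sample()

# 3) record the log_prob; independent dimensions add up in log space
log_prob = dist.log_prob(action).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)

print(action.shape, log_prob.item(), entropy.item())
```

The only structural difference from the Categorical case is the `.sum(dim=-1)`: a discrete action has one log-probability, while a vector-valued action has one per dimension.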
If we run from the initial state until termination, we obtain a trajectory:
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$$
$\tau$ is short for trajectory. It is not a single sample, but a full interaction history. Given a policy $\pi_\theta$, the probability of seeing trajectory $\tau$ can be written as:
$$p_\theta(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$
This expression is long, but it says only three things:
- $\rho_0(s_0)$: where the initial state comes from.
- $\pi_\theta(a_t \mid s_t)$: how the agent selects actions at each state.
- $P(s_{t+1} \mid s_t, a_t)$: how the environment transitions after actions.
The crucial observation is:
In this product, only $\pi_\theta(a_t \mid s_t)$ contains the trainable parameters $\theta$.
The environment dynamics $P$ are usually unknown, non-differentiable, and not directly modifiable. This is why policy-gradient methods only need the action log_prob: it is the only term in the trajectory probability that depends on $\theta$.
Step 2: Discounted Return
If we maximize the immediate reward only, the agent becomes myopic. For example, in LunarLander, firing the engine hard might change attitude immediately but can cause a crash later. Reinforcement learning is about maximizing a sequence of future rewards:
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$
More compactly:
$$G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$$
The symbols mean:
- $G_t$: the return, the cumulative reward starting from time $t$.
- $k$: offset into the future. $k = 0$ is the current reward $r_t$, $k = 1$ is the next-step reward $r_{t+1}$.
- $\gamma^k$: discount weight for future rewards. The farther the reward, the more it is discounted.
- $T$: trajectory length. For continuing tasks, one often writes $T = \infty$.
Why introduce $\gamma$? Three reasons.
First, $\gamma$ expresses that "the future matters, but is usually less certain than the present." When $\gamma = 0$, the agent only cares about immediate rewards; as $\gamma$ approaches $1$, it cares more about long-term outcomes. For CartPole and LunarLander, a typical choice is $\gamma = 0.99$.
Second, in infinite-horizon tasks, if every step has positive reward then a direct sum can diverge. With $\gamma < 1$ and rewards bounded by $r_{\max}$, the discounted sum is bounded by $r_{\max} / (1 - \gamma)$ and therefore stays finite.
Third, discounted return has a very implementation-friendly recursion:
$$G_t = r_t + \gamma G_{t+1}$$
This means: total return from now equals "reward now" plus "discounted return from the next step." In code, we typically compute returns backward in time:
```python
G = 0
returns = []
for reward in reversed(rewards):
    G = reward + gamma * G
    returns.insert(0, G)
```
Here G corresponds to $G_t$, reward is $r_t$, gamma is $\gamma$, and returns stores the discounted return for each time step.
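As a sanity check, the backward recursion can be compared against the direct sum from the definition. The sketch below uses made-up rewards and two hypothetical helper names (`discounted_returns`, `direct_return`); it also illustrates the finiteness argument: with a reward of 1 at every step and a discount of 0.99, the return approaches 1 / (1 - 0.99) = 100.

```python
def discounted_returns(rewards, gamma):
    """Backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()  # one reverse instead of insert(0, ...) on every step
    return returns


def direct_return(rewards, gamma, t):
    """Definition G_t = sum over k of gamma^k * r_{t+k}."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))


rewards = [1.0, 0.0, -2.0, 3.0]  # made-up rewards for illustration
gamma = 0.9
returns = discounted_returns(rewards, gamma)

# Both computations agree at every time step.
for t in range(len(rewards)):
    assert abs(returns[t] - direct_return(rewards, gamma, t)) < 1e-12

# With reward 1 at every step, the discounted sum approaches 1 / (1 - gamma).
long_return = discounted_returns([1.0] * 5000, 0.99)[0]
print(returns, long_return)
```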
With this, the objective of a policy can be written as:
$$J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ G_0 \right] = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \gamma^t r_t \right]$$
$J(\theta)$ reads as: "how good is the policy with parameters $\theta$?" The $\mathbb{E}$ denotes expectation, because even the same policy can yield different trajectories across runs. The environment may be stochastic, and the policy samples actions stochastically. So we maximize not the reward from a single run, but the long-run return in expectation.
Step 3: From Objective to Policy Gradient
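Now the question becomes: how do we adjust $\theta$ to increase $J(\theta)$?
Write the objective as a sum over all possible trajectories:
$$J(\theta) = \sum_\tau p_\theta(\tau)\, G(\tau)$$
where $G(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t$ is the discounted return of the full trajectory. Differentiate with respect to $\theta$:
$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta p_\theta(\tau)\, G(\tau)$$
Differentiating trajectory probability directly is hard. The key trick in policy gradients is the identity:
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
which follows from $\nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau) / p_\theta(\tau)$. Substitute it back:
$$\nabla_\theta J(\theta) = \sum_\tau p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau) = \mathbb{E}_{\tau \sim p_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, G(\tau) \right]$$
Now expand $\log p_\theta(\tau)$:
$$\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} \mid s_t, a_t)$$
When differentiating with respect to $\theta$, $\log \rho_0(s_0)$ and $\log P(s_{t+1} \mid s_t, a_t)$ vanish since they do not depend on $\theta$:
$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
This yields the classic REINFORCE gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \right]$$
Why use $G_t$ instead of the full-trajectory return $G(\tau)$? Because the action at time $t$ cannot affect rewards that happened before time $t$. Using the return from the current time onward respects causality and reduces noise.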
In implementations, we usually do not hand-write this gradient. Instead, we write an equivalent loss and let autodiff compute gradients:
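```python
policy_loss = -(log_probs * returns).mean()
optimizer.zero_grad()  # clear gradients left over from the previous update
policy_loss.backward()
optimizer.step()
```
Why the minus sign? Mathematically we want to maximize $J(\theta)$, but PyTorch optimizers minimize losses. So we negate it.
If $G_t$ is large and positive, the update increases the log probability of the action; if $G_t$ is negative, it decreases that action's probability. A small positive $G_t$ still pushes the probability up, only weakly.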
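To see this loss act on probabilities, here is a minimal sketch with a hypothetical toy policy: three raw logits for a single state, no network, and hand-picked actions and returns. It is only meant to show the direction of one update, not a realistic training setup.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy policy: 3 logits for a single state, no network.
logits = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([logits], lr=0.5)

# Pretend we sampled these actions and observed these returns.
actions = torch.tensor([0, 1, 2, 1])
returns = torch.tensor([-1.0, 2.0, -1.0, 2.0])

probs_before = torch.softmax(logits, dim=-1).detach()

# log pi(a_t | s) for each sampled action
log_probs = torch.log_softmax(logits, dim=-1)[actions]
policy_loss = -(log_probs * returns).mean()

optimizer.zero_grad()
policy_loss.backward()
optimizer.step()

probs_after = torch.softmax(logits, dim=-1).detach()
print(probs_before, probs_after)
```

After one step, action 1 (positive returns) gains probability, while actions 0 and 2 (negative returns) lose it.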
Step 4: Value Functions, Baselines, and Advantages
Vanilla REINFORCE can work, but its variance is large (review: the fatal flaw of REINFORCE). The reason is that $G_t$ only tells us "how much reward came after this step," but does not say "is that good for this particular state."
For example, suppose after some step in LunarLander we see a return $G_t$ that looks high. That sounds good, but if in the same state a typical policy averages an even higher return, then this action is below average. We need a reference point, and that reference is the state-value function:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid s_t = s \right]$$
$V^\pi(s)$ means: if we are at state $s$ now and continue following policy $\pi$, what return do we get on average?
The action-value function additionally conditions on the action:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid s_t = s, a_t = a \right]$$
$Q^\pi(s, a)$ means: at state $s$, first take action $a$, then follow policy $\pi$ afterward; what return do we get on average?
Subtract the two to obtain the advantage function:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
The meaning is simple:
How much better is this action than an average action at this state?
If $A^\pi(s, a) > 0$, the action is better than average and its probability should increase. If $A^\pi(s, a) < 0$, it is worse than average and its probability should decrease. If $A^\pi(s, a) \approx 0$, it is roughly average.
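A tiny numeric sketch may make the three quantities concrete. All numbers below are invented for illustration; the only fact used is that the state value is the policy-weighted average of the action values.

```python
# Hypothetical numbers for a single state with three actions.
policy = [0.2, 0.5, 0.3]       # pi(a | s)
q_values = [10.0, 4.0, 6.0]    # Q(s, a)

# V(s) is the policy-weighted average of Q(s, a).
v = sum(p * q for p, q in zip(policy, q_values))

# A(s, a) = Q(s, a) - V(s): positive means "better than average".
advantages = [q - v for q in q_values]

# Under the policy itself, the advantages average out to zero.
mean_advantage = sum(p * a for p, a in zip(policy, advantages))
print(v, advantages, mean_advantage)
```

Action 0 has a positive advantage and action 1 a negative one, even though every action has a positive Q-value. This is exactly the information that raw returns do not provide.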
In practice we do not know the true $Q^\pi$ and $V^\pi$. We estimate $V^\pi$ with a Critic network $V_\phi$, then approximate the advantage using returns or GAE:
$$\hat{A}_t \approx G_t - V_\phi(s_t)$$
In code, this corresponds to:
```python
# [D] Compute advantages: how much better this action is than the state's average level
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # simplification: no bootstrap value after the last step of the rollout

    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns

# ⋮

# [E] Excerpt from inside the minibatch loop of ppo_update
value_loss = F.mse_loss(new_values, returns[mb])
```
This excerpt shows [D] advantage estimation and [E] value-function training together: advantages tells the Actor how to change, while returns provides supervision targets for the Critic's value_loss.
Without GAE, the simplest approximation is advantages = returns - values. In this chapter's code we compute advantages using GAE; the next section derives GAE in detail. For now, interpret it as "the part that is better or worse than what the Critic expected."
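Before the full derivation, the link between GAE and the simple estimate can be checked numerically. The sketch below re-implements the same backward recursion on plain Python lists for one episode that ends after the last reward (the helper name `gae` and all numbers are made up). Under that assumption, setting lam = 1 reproduces returns - values, and lam = 0 reproduces the one-step TD error.

```python
def gae(rewards, values, gamma, lam):
    """GAE over one episode that terminates after the last reward."""
    advantages = [0.0] * len(rewards)
    last_advantage, next_value = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        last_advantage = delta + gamma * lam * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]  # hypothetical Critic outputs
gamma = 0.9

# lam = 1: GAE collapses to "discounted return minus value".
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
mc_advantages = [g - v for g, v in zip(returns, values)]
gae_lam1 = gae(rewards, values, gamma, lam=1.0)

# lam = 0: GAE collapses to the one-step TD error.
next_values = values[1:] + [0.0]
td_errors = [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]
gae_lam0 = gae(rewards, values, gamma, lam=0.0)

print(gae_lam1, mc_advantages)
print(gae_lam0, td_errors)
```

Values of lam between 0 and 1 interpolate between these two extremes.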
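Why can we replace $G_t$ with $\hat{A}_t$? Because subtracting a baseline $b(s)$ that depends only on the state does not change the expected gradient (review: baseline variance reduction):
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right] = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$
This derivation shows that subtracting a baseline leaves the expected gradient unchanged; a well-chosen baseline such as $V^\pi(s)$ only reduces its variance. Therefore the policy gradient is often written in the Actor-Critic form:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]$$
This is the division of labor between Actor and Critic: the Critic estimates $V^\pi(s)$ to provide the "average level" of the current state, and the Actor adjusts action probabilities according to the advantage $\hat{A}_t$.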
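The zero-expectation property of a state-only baseline can be verified exactly for a small softmax policy. The sketch below assumes a policy parameterized directly by its logits (hypothetical numbers) and uses the standard softmax identity that the derivative of log pi(a) with respect to logit j is 1[a == j] - pi(j).

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, -1.0, 0.7]  # hypothetical policy logits for one state
probs = softmax(logits)
baseline = 5.0             # any number that does not depend on the action


def grad_log_prob(a):
    """Gradient of log pi(a) with respect to each logit."""
    return [(1.0 if a == j else 0.0) - probs[j] for j in range(len(probs))]


# Exact expectation over actions of grad_log_prob(a) * baseline.
expected = [0.0] * len(probs)
for a, p in enumerate(probs):
    g = grad_log_prob(a)
    for j in range(len(probs)):
        expected[j] += p * g[j] * baseline

print(expected)
```

Every component of `expected` is zero up to floating-point error, whatever value `baseline` takes.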
Step 5: The Limits of Vanilla Policy Gradients
At this point we have an algorithm that looks complete:
The problem is that vanilla policy gradients have a requirement:
the data used to update the policy should ideally be collected by that same policy.
This property is called on-policy. In the formula, the expectation is:
$$\mathbb{E}_{\tau \sim p_\theta}\left[\, \cdot \,\right]$$
which means the data should come from the current policy $\pi_\theta$. But after one gradient update, parameters change from $\theta_{\text{old}}$ to $\theta$. The trajectories we just collected no longer come from the new policy; they come from the old policy $\pi_{\theta_{\text{old}}}$.
If we use each batch only once, training becomes extremely wasteful. Collecting, say, 2048 steps of environment interaction can be expensive, especially for robotics, game simulators, and LLM answer generation. Naturally, we ask:
Can we reuse data collected by the old policy to update the new policy for multiple epochs?
This is PPO's core tension:
We want to reuse old data to improve sample efficiency, but we must not let the new policy drift too far from the old one, otherwise old data will mislead the update.
In the learning-oriented PPO skeleton, collect_rollout deliberately stores the log probability at sampling time:
```python
# [C] Collect a batch of on-policy data: this data comes from the "current policy"
def collect_rollout(env, model, steps=2048, device="cpu"):
    obs, _ = env.reset()
    batch = {k: [] for k in ["states", "actions", "rewards", "dones", "old_logprobs", "values"]}

    for _ in range(steps):
        obs_tensor = torch.as_tensor(obs, dtype=torch.float32, device=device)
        with torch.no_grad():
            action, old_logprob, value = model.act(obs_tensor)

        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated

        batch["states"].append(obs)
        batch["actions"].append(action.item())
        batch["rewards"].append(reward)
        batch["dones"].append(done)
        batch["old_logprobs"].append(old_logprob.item())
        batch["values"].append(value.item())

        # start a new episode when the current one has ended
        obs = next_obs if not done else env.reset()[0]

    return {
        "states": torch.as_tensor(np.array(batch["states"]), dtype=torch.float32, device=device),
        "actions": torch.as_tensor(batch["actions"], dtype=torch.long, device=device),
        "rewards": torch.as_tensor(batch["rewards"], dtype=torch.float32, device=device),
        "dones": torch.as_tensor(batch["dones"], dtype=torch.float32, device=device),
        "old_logprobs": torch.as_tensor(batch["old_logprobs"], dtype=torch.float32, device=device),
        "values": torch.as_tensor(batch["values"], dtype=torch.float32, device=device),
    }
```
This old_logprobs is $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. During updates, we recompute the same state-action pairs under the new policy to get new_logprobs. Comparing them tells us how far the policy has moved. Importance sampling is the tool that answers whether "old data can still be used."
Step 6: Importance Sampling
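The previous issue is that vanilla policy gradients want data collected by $\pi_\theta$. Can we use data collected by $\pi_{\theta_{\text{old}}}$ to evaluate and improve a new policy? Yes, via importance sampling.
6.1 The Importance Sampling Identity
The core identity is: for any function $f(a)$,
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ f(a) \right] = \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}(\cdot \mid s)}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, f(a) \right]$$
Why is this true? Expand the left side:
$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ f(a) \right] = \sum_a \pi_\theta(a \mid s)\, f(a)$$
Rewrite $\pi_\theta(a \mid s)$ as $\pi_{\theta_{\text{old}}}(a \mid s) \cdot \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}$:
$$\sum_a \pi_{\theta_{\text{old}}}(a \mid s)\, \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, f(a) = \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}(\cdot \mid s)}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, f(a) \right]$$
The identity holds, provided the old policy assigns nonzero probability to every action the new policy can take. The intuition is: we want the expectation of $f$ under the "new world" $\pi_\theta$, but we only have samples from the "old world" $\pi_{\theta_{\text{old}}}$. The fix is to reweight each sample. If the new world is more likely to produce this action than the old world, the weight is greater than 1; otherwise it is less than 1. The weight is exactly $\pi_\theta(a \mid s) / \pi_{\theta_{\text{old}}}(a \mid s)$.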
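The identity is easy to check by simulation. The sketch below uses two hand-picked discrete distributions and an arbitrary function of the action (all values hypothetical): samples are drawn only from the old distribution, yet the reweighted average recovers the expectation under the new one.

```python
import random

old_probs = [0.5, 0.3, 0.2]  # hypothetical old-policy probabilities for one state
new_probs = [0.2, 0.3, 0.5]  # hypothetical new-policy probabilities for the same state
f = [1.0, 4.0, 10.0]         # any function of the action

# Target: expectation of f under the new policy (computed exactly).
target = sum(p * v for p, v in zip(new_probs, f))

# Estimate it using only samples drawn from the old policy.
rng = random.Random(0)
n = 200_000
samples = rng.choices(range(3), weights=old_probs, k=n)

# Reweight each sample by new_prob / old_prob.
weighted = sum(new_probs[a] / old_probs[a] * f[a] for a in samples) / n

# Ignoring the mismatch estimates the expectation under the OLD policy instead.
unweighted = sum(f[a] for a in samples) / n

print(target, weighted, unweighted)
```

The weighted estimate lands close to the target, while the unweighted one stays near the old policy's expectation. The price is variance: the rarer an action is under the old policy, the larger its weight.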
6.2 Policy Ratio
Define the policy ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
In code, we compute it using the exponential of the log-prob difference, which is numerically more stable than direct division:
```python
# [E] PPO update: ratio, clip, min, and the total loss all live here
def ppo_update(model, optimizer, batch, advantages, returns,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01,
               epochs=10, minibatch_size=64):
    states = batch["states"]
    actions = batch["actions"]
    old_logprobs = batch["old_logprobs"]
    batch_size = states.size(0)

    for _ in range(epochs):
        indices = torch.randperm(batch_size, device=states.device)
        for start in range(0, batch_size, minibatch_size):
            mb = indices[start:start + minibatch_size]

            new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
            ratio = torch.exp(new_logprobs - old_logprobs[mb])

            surr1 = ratio * advantages[mb]
```
$r_t(\theta) = 1$ means the new and old policies assign the same probability to this action. $r_t(\theta) > 1$ means the new policy is more inclined to take this action; $r_t(\theta) < 1$ means the opposite.
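The stability argument for computing the ratio in log space can be illustrated with an extreme, made-up case. Log-probabilities this negative are not typical for CartPole; they are assumed here only to trigger floating-point underflow, as can happen when many log-probabilities are summed.

```python
import math

# Hypothetical log-probabilities of one action under the old and new policy.
old_logprob = -800.0
new_logprob = -799.9

# Log space: subtract first, exponentiate once.
ratio = math.exp(new_logprob - old_logprob)

# Probability space: both probabilities underflow to 0.0 in float64.
old_prob = math.exp(old_logprob)
new_prob = math.exp(new_logprob)
try:
    naive_ratio = new_prob / old_prob
except ZeroDivisionError:
    naive_ratio = float("nan")  # 0.0 / 0.0 is undefined

print(ratio, naive_ratio)
```

The log-space ratio is a perfectly ordinary number slightly above 1, while the direct division has nothing left to divide.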
6.3 Surrogate Objective
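Apply importance sampling to the policy-gradient objective to get the surrogate objective:
$$L^{\text{surr}}(\theta) = \mathbb{E}_t\left[ r_t(\theta)\, \hat{A}_t \right]$$
Expanded:
$$L^{\text{surr}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right]$$
In code this is surr1 = ratio * advantages, right after ratio in the PPO update: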
```python
# [E] Excerpt from inside the minibatch loop of ppo_update
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()
```
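This objective has an important property:
at $\theta = \theta_{\text{old}}$, its first-order gradient matches the vanilla policy gradient.
The check is straightforward: when $\theta = \theta_{\text{old}}$, we have $r_t(\theta) = 1$. Also $\nabla_\theta r_t(\theta) = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which reduces to $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ at that point. Substituting restores the policy-gradient form.
But once $\theta$ moves away from $\theta_{\text{old}}$, the two objectives diverge. The farther away the new policy is, the less reliable the surrogate becomes. That is the next problem to solve.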
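The gradient-matching property can also be confirmed with autograd. The sketch below assumes a hypothetical tabular policy (a logits tensor for 4 states and 3 actions) and made-up advantages; the old log-probabilities are simply the current ones detached from the graph, which is the situation at the start of every PPO update.

```python
import torch

torch.manual_seed(0)

# Hypothetical tabular policy: 4 states, 3 actions.
logits = torch.randn(4, 3, requires_grad=True)
actions = torch.tensor([0, 2, 1, 2])
advantages = torch.tensor([1.0, -0.5, 2.0, 0.3])


def log_probs_of(actions):
    """log pi_theta(a_t | s_t) for the chosen action in each state."""
    return torch.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)


# Old log-probs are constants: same values, detached from the graph.
old_logprobs = log_probs_of(actions).detach()

# Surrogate objective evaluated at theta = theta_old (ratio is 1 everywhere).
ratio = torch.exp(log_probs_of(actions) - old_logprobs)
surrogate = (ratio * advantages).mean()
(grad_surrogate,) = torch.autograd.grad(surrogate, logits)

# Vanilla policy-gradient objective with the same advantages.
pg_objective = (log_probs_of(actions) * advantages).mean()
(grad_pg,) = torch.autograd.grad(pg_objective, logits)

print(torch.allclose(grad_surrogate, grad_pg, atol=1e-6))
```

The two gradients coincide only at this starting point; after the first optimizer step the ratio is no longer 1 and the objectives begin to differ.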
Step 7: From the Surrogate Objective to PPO-Clip
We now have a key expression:
$$r_t(\theta)\, \hat{A}_t$$
Do not rush to TRPO yet. If we only look at this expression, it already reveals PPO's two core inputs:
| Name | Symbol | Code Variable | What Question It Answers |
|---|---|---|---|
| policy ratio | $r_t(\theta)$ | ratio | does the new policy prefer this action more than the old policy? |
| advantage | $A_t$ or $\hat{A}_t$ | advantages | is this action better than average at this state? |
If we impose no constraints, we would simply maximize:
$$L^{\text{surr}}(\theta) = \mathbb{E}_t\left[ r_t(\theta)\, \hat{A}_t \right]$$
In code, this is surr1 = ratio * advantages:
```python
# [E] Excerpt from inside the minibatch loop of ppo_update
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()
```
You can interpret surr1 as the raw policy-improvement objective. Its rule is:
- If $\hat{A}_t > 0$, this is a good action; we want to increase its probability, i.e. make $r_t(\theta)$ larger.
- If $\hat{A}_t < 0$, this is a bad action; we want to decrease its probability, i.e. make $r_t(\theta)$ smaller.
But this objective is too greedy. Suppose $\hat{A}_t > 0$ and the current ratio is $r_t(\theta)$, so this sample contributes $r_t(\theta)\, \hat{A}_t$ to the objective. If we keep pushing this action up so that $r_t(\theta)$ becomes 10, 50, 100, the objective keeps increasing. The optimizer would think "bigger is always better," but at that point the new policy is far from the old one, and the old data is no longer reliable.
PPO does not introduce a complicated new algorithm. It adds a very direct conservative rule on top of this objective:
You may increase the probability of good actions and decrease the probability of bad actions, but do not let the new policy move too far relative to the old policy.
So we restrict the policy ratio to a small interval:
$$1 - \epsilon \le r_t(\theta) \le 1 + \epsilon$$
If $\epsilon = 0.2$, the interval is $[0.8, 1.2]$. This means: for an action that appears in the old batch, the new policy's probability should ideally not be below $0.8$ times the old policy's probability, and not be above $1.2$ times it.
In code, the unclipped objective surr1, the clipped objective surr2, and the final policy_loss are computed together:
```python
# [E] Excerpt from inside the minibatch loop of ppo_update
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```
Now we have two objectives:
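| Code | Math | Meaning |
|---|---|---|
| surr1 | $r_t(\theta)\, \hat{A}_t$ | what the policy would like to do without constraints |
| surr2 | $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t$ | how far we allow it to change under the update constraint |
PPO takes the smaller of the two:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t \right) \right]$$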
This is PPO-Clip. It is not derived by mechanically transforming the TRPO constraint into algebra. Instead, it starts from the importance-sampling surrogate objective and adds a conservative rule: "do not let the ratio drift too far." TRPO is one historical source of this conservative mindset, but it is not required to understand the PPO code.
In code this is torch.min(surr1, surr2).mean(). Why the minus sign? Because we want to maximize policy_objective, while PyTorch minimizes losses. So we write policy_loss = -policy_objective.
What Clipping Does
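Case 1: $\hat{A}_t > 0$ (good action; probability should increase)
When $\hat{A}_t > 0$, we want to increase $r_t(\theta)$ (the new policy assigns higher probability to the action). The unclipped term $r_t(\theta)\, \hat{A}_t$ grows linearly with $r_t(\theta)$ with no upper bound. The clipped term becomes the constant $(1 + \epsilon)\, \hat{A}_t$ once $r_t(\theta) > 1 + \epsilon$.
| Range of $r_t(\theta)$ | Unclipped | Clipped | Which One $\min$ Picks |
|---|---|---|---|
| $r_t(\theta) \le 1 + \epsilon$ | $r_t(\theta)\, \hat{A}_t$ | equal inside the interval; larger below $1 - \epsilon$ | unclipped term; normal optimization |
| $r_t(\theta) > 1 + \epsilon$ | $r_t(\theta)\, \hat{A}_t$ (larger) | $(1 + \epsilon)\, \hat{A}_t$ (constant) | clipped term; zero gradient |
So the probability of good actions can increase, but only up to about $1 + \epsilon$ times that of the old policy. Beyond that, the objective becomes "flat": it stops rewarding further increases, so the gradient becomes zero.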
Case 2: $\hat{A}_t < 0$ (bad action; probability should decrease)
When $\hat{A}_t < 0$, we want to decrease $r_t(\theta)$ (the new policy assigns lower probability). But if $r_t(\theta)$ has already dropped below $1 - \epsilon$, the new policy has already pushed that action probability down too much; PPO no longer rewards further suppression.
This is easy to misread because $\hat{A}_t$ is negative. Consider a numeric example: $\hat{A}_t = -1$, $\epsilon = 0.2$. If $r_t(\theta) = 0.5$, the unclipped term is $0.5 \times (-1) = -0.5$, while the clipped term is $0.8 \times (-1) = -0.8$. The min picks the smaller value, i.e. $-0.8$, which is the clipped term. Since the clipped term is constant, the gradient is zero.
| Range of $r_t(\theta)$ | Unclipped | Clipped | Which One $\min$ Picks |
|---|---|---|---|
| $r_t(\theta) < 1 - \epsilon$ | $r_t(\theta)\, \hat{A}_t$, larger (e.g. $-0.5$) | $(1 - \epsilon)\, \hat{A}_t$ (constant, e.g. $-0.8$) | clipped term; zero gradient |
| $r_t(\theta) \ge 1 - \epsilon$ | $r_t(\theta)\, \hat{A}_t$ | equal inside interval; clipped can become larger above $1 + \epsilon$ | unclipped term; keep optimizing |
So the probability of bad actions can decrease, but only down to about $1 - \epsilon$ times the old probability. Beyond that the objective goes flat and stops providing further incentive. If the bad action probability increases instead, the unclipped term makes the objective worse, and the gradient pulls it back.
Case 3: $\hat{A}_t = 0$ (neutral action)
Then $r_t(\theta)\, \hat{A}_t = 0$. No matter how $r_t(\theta)$ changes, the objective is always 0, so PPO does not adjust that action.
Putting these cases together, the meaning of PPO-Clip becomes clear:
it does not forbid learning; it simply stops rewarding the part of the change that has already gone too far.
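The three cases can be checked numerically without any plotting. The sketch below evaluates the per-sample objective at a few hand-picked ratios and estimates its slope with a finite difference; `clip_objective` mirrors the min/clip formula on plain floats, and `slope` is a hypothetical helper added for this check.

```python
def clip_objective(r, A, eps=0.2):
    """Per-sample PPO-Clip objective min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return min(r * A, r_clipped * A)


def slope(r, A, h=1e-6):
    """Central finite-difference estimate of d objective / d r."""
    return (clip_objective(r + h, A) - clip_objective(r - h, A)) / (2 * h)


cases = [
    ("A > 0, r inside interval", 1.0, 1.0),
    ("A > 0, r above 1 + eps", 1.5, 1.0),
    ("A > 0, r below 1 - eps", 0.5, 1.0),
    ("A < 0, r below 1 - eps", 0.5, -1.0),
    ("A < 0, r above 1 + eps", 1.5, -1.0),
    ("A = 0", 1.5, 0.0),
]
for name, r, A in cases:
    print(f"{name}: objective={clip_objective(r, A):+.2f}, slope={slope(r, A):+.2f}")
```

The slope is zero exactly where the change has already gone too far in the rewarded direction (good action above the interval, bad action below it). In the other out-of-range cases the slope is nonzero, so the gradient can still pull the ratio back.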
```python
import numpy as np
import matplotlib.pyplot as plt

# ==========================================
# Geometric intuition for the PPO-Clip objective
# ==========================================
epsilon = 0.2
r = np.linspace(0.0, 2.0, 500)


def clip_objective(r, A, eps=0.2):
    r_clipped = np.clip(r, 1 - eps, 1 + eps)
    return np.minimum(r * A, r_clipped * A)


fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (A_val, title) in zip(
    axes,
    [(1.0, "A > 0 (good action)"), (-1.0, "A < 0 (bad action)"), (0.0, "A = 0 (neutral)")],
):
    obj = clip_objective(r, A_val, eps=epsilon)
    ax.plot(r, r * A_val, "b--", alpha=0.4, label="unclipped r·A")
    ax.plot(r, obj, "r-", linewidth=2, label="PPO-Clip min(...)")
    ax.axvspan(1 - epsilon, 1 + epsilon, alpha=0.1, color="green", label="safe interval")
    ax.set_title(title)
    ax.set_xlabel("policy ratio r_t(θ)")
    ax.set_ylabel("objective value")
    ax.legend(fontsize=8)

plt.suptitle("Three cases of the PPO-Clip objective (ε=0.2)", fontsize=13)
plt.tight_layout()
plt.savefig("ppo_clip_three_cases.png", dpi=150)
print("Saved visualization")
```
Clipping Intuition
If you look at the three cases together, PPO-Clip's design intention becomes very clear:
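With $\epsilon = 0.2$, the objective stops rewarding changes once an action's probability ratio leaves $[0.8, 1.2]$, so each round of updates has little incentive to move far from the old policy. This acts as a "safety rail": even if gradient estimates are noisy, the policy is discouraged from jumping too far, although clipping by itself is not a hard guarantee on the ratio.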
Step 8: PPO Is Not Only a Loss Function
At this point it is easy to form a misconception: does understanding PPO mean understanding the PPO loss? The answer is: no.
PPO is a policy-optimization algorithm. More concretely, it is a training procedure that answers:
Given a policy network that already acts, how do we use newly collected experience to make it reliably better?
So PPO is not a single formula, and it is not just one loss.backward() call. A complete PPO method includes at least these pieces:
| Component in PPO | What It Does | Where It Appears in Code |
|---|---|---|
| sampling with the current policy | interact with the environment to collect a new batch | collect_rollout(...) |
| old policy record | store action probabilities at sampling time for later comparison | old_logprobs |
| advantage estimation | judge whether each action is above/below average | advantages / compute_gae(...) |
| clipped policy update | update the Actor while constraining drift from the old policy | policy_loss in ppo_update(...) |
| value-function training | train the Critic to estimate state values accurately | value_loss |
| entropy bonus | maintain exploration; avoid becoming too confident too early | entropy_bonus |
| multi-epoch mini-batch updates | reuse the same batch for multiple epochs to improve sample use | epochs / minibatch_size |
Therefore, the PPO loss is not the entirety of PPO, but it is the most important "policy update rule" within PPO. It tells the Actor which action probabilities to increase, which to decrease, and the maximum allowed change.
You can think of PPO as a training protocol:
The reason "loss" matters is that neural networks update parameters through backpropagation. To affect parameters, PPO's ideas must become a differentiable objective. That is why we emphasize PPO loss, but you should not shrink PPO into the loss alone.
Step 9: How PPO Appears in Code
If we keep only PPO's core policy update, the landing point is the following lines:
```python
# [E] Excerpt from inside the minibatch loop of ppo_update
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
```
This piece of code needs three main inputs:
| Input | Where It Comes From | What It Does |
|---|---|---|
| old_logprobs | stored during rollout collection | records the old policy's probability for the action |
| new_logprobs | recomputed during update | the new policy's probability for the same action |
| advantages | computed from returns, Critic, or GAE | tells whether the action should be encouraged or suppressed |
It outputs a scalar policy_loss. Once folded into loss together with the value and entropy terms, this scalar is what backpropagation consumes:
```python
value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
Of course, real PPO does not only train the Actor; it also trains the Critic, and usually includes an entropy bonus to encourage exploration. So we combine policy_loss into a full loss:
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus.
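To see the arithmetic behind that line without any tensor library, here is a minimal plain-Python sketch. The helper name `ppo_total_loss` and the two sample data points are illustrative assumptions, not part of ppo_from_scratch.py; the default coefficients simply reuse typical values.

```python
import math

def ppo_total_loss(new_logprobs, old_logprobs, advantages, new_values, returns,
                   entropies, clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Plain-Python mirror of the loss lines above (hypothetical helper)."""
    n = len(advantages)
    surrogate = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)                       # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate += min(ratio * adv, clipped * adv)            # pessimistic bound
    policy_loss = -surrogate / n                                # negate: we minimize
    value_loss = sum((v - r) ** 2 for v, r in zip(new_values, returns)) / n
    entropy_bonus = sum(entropies) / n
    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus
    return loss, policy_loss, value_loss, entropy_bonus

# Two made-up samples: one good action (A > 0) and one bad action (A < 0).
loss, policy_loss, value_loss, entropy_bonus = ppo_total_loss(
    new_logprobs=[math.log(0.6), math.log(0.2)],
    old_logprobs=[math.log(0.4), math.log(0.25)],
    advantages=[1.0, -1.0],
    new_values=[0.5, 0.0],
    returns=[1.0, -0.5],
    entropies=[0.6, 0.6],
)
print(f"policy_loss={policy_loss:.3f} value_loss={value_loss:.3f} loss={loss:.3f}")
```

In the first sample the ratio is 1.5, so the clipped branch (1.2) is the one that counts; in the second the ratio is 0.8, right at the lower boundary.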
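If you have derived PPO on paper and want to implement it, you only need to connect the data in this order: rollout (states, actions, old_logprobs) → advantages and returns → new_logprobs → ratio → policy_loss → loss → loss.backward().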
This is the minimal closed loop that turns PPO formulas into a training program.
Supplement: TRPO Is Historical Context, Not a Required Derivation
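TRPO (Trust Region Policy Optimization) and PPO address the same issue: policy updates must not be too large. TRPO is written as:
$$\max_\theta\ \mathbb{E}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\left[D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\right)\right] \le \delta$$
This means: optimize the surrogate objective, but constrain the average KL divergence between old and new policies by a small threshold $\delta$.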
This path is theoretically elegant, but in practice it requires constrained optimization, conjugate gradients, approximate second-order information, and more. For a chapter whose goal is "derive PPO loss from formulas," TRPO is not a necessary prerequisite. Treat it as a historical note:
TRPO limits policy change via a KL constraint; PPO approximates a similar effect by clipping the policy ratio.
So the main line should be: policy gradient → importance-sampling surrogate → PPO-Clip policy loss → total loss.
TRPO simply reminds us: PPO's "Proximal" comes from trust-region thinking, but the concrete code you need is ratio, clamp, min, and the total loss.
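To make the trust-region idea concrete, the following sketch computes the average KL divergence between an old and a new discrete policy over a small batch of states and compares it with a threshold. The function names, the two-state batch, and the threshold 0.01 are illustrative assumptions; TRPO itself solves a constrained optimization problem rather than just checking the value.

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete distributions given as lists."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def mean_kl(old_policies, new_policies):
    """Average KL over a batch of states (one action distribution per state)."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_policies, new_policies)]
    return sum(kls) / len(kls)

# Hypothetical action distributions for two states, before and after an update.
old = [[0.5, 0.5], [0.9, 0.1]]
new = [[0.55, 0.45], [0.85, 0.15]]
delta = 0.01  # illustrative trust-region threshold

kl = mean_kl(old, new)
print(f"mean KL = {kl:.5f}, inside trust region: {kl <= delta}")
```

PPO-Clip never computes this quantity inside its loss; it relies on the per-sample ratio clip instead, which is why KL is usually only logged as a diagnostic.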
Step 10: The Full PPO Loss
In real training, PPO does not only optimize the clipped surrogate; it trains the Critic and preserves exploration. To avoid symbol confusion, separate two things:
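- $L^{\text{PPO}}(\theta)$: the mathematical objective we want to maximize.
- loss: the training loss we minimize in code.
The maximization objective can be written as:
$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[L_t^{\text{CLIP}}(\theta) - c_1 L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$
Here $L^{\text{CLIP}}$ is the policy-improvement objective, $L^{\text{VF}}$ is the Critic's value error, and $S[\pi_\theta]$ is the policy entropy. Since code minimizes loss, we negate the whole objective: the policy objective and the entropy term flip sign, while the value error enters with a plus sign. The coefficients $c_1$ and $c_2$ correspond to vf_coef and ent_coef.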
In code, the total loss is composed here:
```python
value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
10.1 Policy Loss
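The policy maximization objective is the clipped surrogate, where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ is the probability ratio:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
The policy_loss in code is its negative:
$$\text{policy\_loss} = -L^{\text{CLIP}}(\theta)$$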
This term updates the Actor: increase probabilities of good actions, decrease probabilities of bad actions, while clipping constrains the magnitude of change within a safe range.
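The phrase "constrains the magnitude of change" can be checked numerically. The sketch below evaluates the per-sample clipped objective and estimates its slope with respect to the ratio by finite differences, so no autograd library is needed. The function names and the probe points are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def slope(ratio, advantage, h=1e-6):
    """Central finite-difference slope of the objective with respect to the ratio."""
    up = clipped_surrogate(ratio + h, advantage)
    down = clipped_surrogate(ratio - h, advantage)
    return (up - down) / (2.0 * h)

# Probe a few (ratio, advantage) pairs on both sides of the clip range.
for ratio, adv in [(1.0, 1.0), (1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    value = clipped_surrogate(ratio, adv)
    print(f"ratio={ratio:.1f} A={adv:+.0f} objective={value:+.2f} slope={slope(ratio, adv):+.2f}")
```

The slope vanishes only when the ratio has already moved past the boundary in the direction the advantage favors (ratio 1.5 with A > 0, ratio 0.5 with A < 0). When the ratio has moved the wrong way, the unclipped branch stays active and the gradient can still pull it back.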
10.2 Value-Function Loss
The Critic should estimate state values accurately. The value loss is the mean squared error between the Critic prediction $V_\theta(s_t)$ and a target return $\hat{R}_t$:
$$L^{\text{VF}}(\theta) = \mathbb{E}_t\left[\left(V_\theta(s_t) - \hat{R}_t\right)^2\right]$$
Here $\hat{R}_t = \hat{A}_t + V_{\theta_{\text{old}}}(s_t)$ is computed via GAE (derived in detail in the later GAE section).
Why do we need a separate value loss? Because the Critic's accuracy directly determines the quality of the advantage estimate $\hat{A}_t$. If the Critic is inaccurate, $\hat{A}_t$ will have large bias and can mislead the Actor. The MSE loss continuously corrects the Critic so its predictions track true returns.
In code: value_loss = F.mse_loss(new_values, returns[mb]). It is backpropagated together with policy_loss in the same update function.
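A small worked example may help connect the target to the advantages. It assumes, as in the compute_gae listing, that the target is the unnormalized advantage plus the value stored at rollout time; the helper names and numbers are made up for illustration.

```python
def value_targets(raw_advantages, old_values):
    """Target returns: unnormalized advantage plus the value stored at rollout time."""
    return [a + v for a, v in zip(raw_advantages, old_values)]

def mse(predictions, targets):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

old_values = [1.0, 0.5, 0.0]        # Critic predictions recorded during the rollout
raw_advantages = [0.5, -0.5, 1.0]   # advantage estimates before normalization
returns = value_targets(raw_advantages, old_values)

loss_before = mse(old_values, returns)        # Critic error at rollout time
loss_after = mse([1.4, 0.1, 0.9], returns)    # hypothetical improved predictions
print(returns, loss_before, loss_after)
```

Because the target minus the stored value is exactly the raw advantage, the Critic's error at rollout time equals the mean squared raw advantage. This is also why the targets must be built before the advantages are normalized.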
10.3 Entropy Bonus
Policy entropy encourages exploration and prevents premature collapse to a deterministic policy:
$$S[\pi_\theta](s_t) = -\sum_a \pi_\theta(a|s_t)\log \pi_\theta(a|s_t)$$
Higher entropy means the policy is more "hesitant" (more uniform action distribution), which encourages exploration; lower entropy means the policy is more "certain" (always choosing one action), which reduces exploration. The coefficient $c_2$ (ent_coef) is often around 0.01.
Why include entropy? Clipping stabilizes training, but it can also cause a side effect: the policy may "lock onto" a suboptimal action too early. The entropy bonus rewards uncertainty inside the loss, ensuring the policy retains ongoing exploration pressure.
In code: entropy_bonus = entropy.mean(). Note the minus sign in the total loss: - ent_coef * entropy_bonus, because we want to maximize entropy, which is equivalent to subtracting it when minimizing loss.
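As a quick numerical illustration of "hesitant" versus "certain", the sketch below computes the entropy of three hand-picked four-action distributions. The distributions and the helper name are assumptions chosen for clarity.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; 0 * log 0 is treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 actions
peaked = [0.97, 0.01, 0.01, 0.01]    # almost committed to one action
one_hot = [1.0, 0.0, 0.0, 0.0]       # fully deterministic

ent_coef = 0.01
for name, probs in [("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)]:
    h = entropy(probs)
    # The term subtracted from the loss is ent_coef * entropy.
    print(f"{name}: entropy={h:.4f}, loss reduction={ent_coef * h:.5f}")
```

The uniform policy reaches the maximum entropy log(4), and the deterministic policy has entropy zero, so the bonus only lowers the loss while some randomness remains.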
10.4 How the Three Terms Work Together
Each term does a different job:
policy loss drives Actor improvement, value loss ensures the Critic provides accurate advantage signals, and entropy bonus preserves exploration.
They collaborate through the shared Actor-Critic network. In ppo_from_scratch.py, the Actor and Critic share the same backbone network (self.backbone in the forward method), so one backpropagation updates both.
10.5 Hyperparameter Summary
| Symbol | Name | Typical Value | Role | Code Parameter |
|---|---|---|---|---|
| $\epsilon$ | clip range | 0.1-0.2 | limits how far ratios may move | clip_range |
| $c_1$ | value-loss coefficient | 0.5 | balances policy update vs value fitting | vf_coef |
| $c_2$ | entropy coefficient | 0.01 | encourages exploration | ent_coef |
| $\gamma$ | discount factor | 0.99 | decay of future rewards | gamma |
| $\lambda$ | GAE parameter | 0.95 | bias-variance tradeoff in advantage estimation | gae_lambda |
| $T$ | rollout length | 2048 | how many steps to collect per rollout | n_steps |
| $K$ | number of epochs | 10 | how many passes over the same data batch | n_epochs |
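One way to keep these values together in a from-scratch script is a small configuration object. The class below is a hypothetical sketch: the field names follow the Code Parameter column, while `batch_size` and its default of 64 are an added assumption that does not appear in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the hyperparameters in the table above."""
    clip_range: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    gamma: float = 0.99
    gae_lambda: float = 0.95
    n_steps: int = 2048
    n_epochs: int = 10
    batch_size: int = 64  # assumption: mini-batch size, not listed in the table

    def updates_per_rollout(self) -> int:
        """Gradient steps taken on one rollout: epochs times mini-batches per epoch."""
        return self.n_epochs * (self.n_steps // self.batch_size)

cfg = PPOConfig()
print(cfg.updates_per_rollout())
```

Under these assumed defaults a single rollout is reused for several hundred gradient steps, which is the data-reuse pattern that clipping is meant to keep safe.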
Step 11: The Complete PPO Algorithm
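Putting everything together, the PPO training loop is: (1) collect a rollout of $T$ steps with the current policy, storing states, actions, rewards, values, and old log-probabilities; (2) compute advantages and returns with GAE; (3) for $K$ epochs, split the rollout into mini-batches and take one gradient step on the total loss per mini-batch; (4) discard the rollout and repeat with the updated policy.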
If you compare against the code, each step can be traced to a specific piece:
```python
    # (methods of the ActorCritic class)
    # [A] Policy and value function: the Actor outputs action probabilities, the Critic outputs the state value
    def forward(self, obs):
        h = self.backbone(obs)
        logits = self.actor(h)
        action_probs = F.softmax(logits, dim=-1)
        value = self.critic(h).squeeze(-1)
        return action_probs, value

    # [B] Sample an action: draw from the policy distribution and keep the old policy's log_prob
    def act(self, obs):
        action_probs, value = self.forward(obs)
        dist = Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action, log_prob, value

    # Re-score stored actions under the current parameters (used during the update)
    def evaluate(self, obs, actions):
        action_probs, values = self.forward(obs)
        dist = Categorical(action_probs)
        new_logprobs = dist.log_prob(actions)
        entropy = dist.entropy()
        return new_logprobs, values, entropy
```
⋮
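```python
# [D] Compute advantages: how much better this action is than the average for the current state
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = torch.zeros_like(rewards)
    last_advantage = 0.0
    next_value = 0.0  # no bootstrap value after the final step of the rollout

    # Walk backwards so each step can reuse the advantage of the step after it
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]  # 0 at episode ends, which cuts the recursion
        delta = rewards[t] + gamma * next_value * mask - values[t]
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]

    # Value targets use the raw advantages; normalization is for the policy loss only
    returns = advantages + values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```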
⋮
```python
new_logprobs, new_values, entropy = model.evaluate(states[mb], actions[mb])
ratio = torch.exp(new_logprobs - old_logprobs[mb])

surr1 = ratio * advantages[mb]
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
surr2 = clipped_ratio * advantages[mb]
policy_loss = -torch.min(surr1, surr2).mean()

value_loss = F.mse_loss(new_values, returns[mb])
entropy_bonus = entropy.mean()
loss = policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

optimizer.zero_grad()
loss.backward()
optimizer.step()
```
⋮
```python
# [F] Training loop: collect a batch of data, then update on that batch for several epochs
device = "cuda" if torch.cuda.is_available() else "cpu"
env = gym.make("CartPole-v1")
model = ActorCritic(env.observation_space.shape[0], env.action_space.n).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

for update in range(100):
    batch = collect_rollout(env, model, steps=2048, device=device)
    advantages, returns = compute_gae(batch["rewards"], batch["values"], batch["dones"])
    ppo_update(model, optimizer, batch, advantages, returns)
```
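The backward recursion in compute_gae is easier to trust after tracing it by hand. The sketch below is a plain-Python mirror of that function that omits the final normalization; the name `gae` and the three-step toy episode are illustrative assumptions. With gamma = 1, setting lam = 1 should reproduce Monte Carlo returns, and lam = 0 should reproduce one-step TD errors.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """List-based mirror of compute_gae, without advantage normalization."""
    advantages = [0.0] * len(rewards)
    last_advantage = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * mask - values[t]   # TD error
        last_advantage = delta + gamma * lam * mask * last_advantage
        advantages[t] = last_advantage
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

# Toy episode: reward 1 per step, Critic always predicts 0.5, episode ends at step 3.
rewards, values, dones = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [0.0, 0.0, 1.0]

mc_adv, mc_ret = gae(rewards, values, dones, gamma=1.0, lam=1.0)
td_adv, td_ret = gae(rewards, values, dones, gamma=1.0, lam=0.0)
print("lam=1:", mc_adv, mc_ret)
print("lam=0:", td_adv, td_ret)
```

With lam = 1 the targets are the true remaining rewards (3, 2, 1), independent of the Critic; with lam = 0 each advantage depends only on one reward and the Critic's next prediction. Intermediate values of lam interpolate between these two extremes.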
Some key design decisions and their intuition:
- Reuse the same data for $K$ epochs: collecting data is expensive (requires running the environment), so we update multiple times on the same batch. Clipping prevents multi-epoch updates from drifting too far.
- Mini-batch updates: split the $T$ collected steps into several mini-batches; compute gradients per mini-batch to improve training efficiency.
- Recompute $r_t(\theta)$ each epoch: even though the data batch is the same, $\theta$ changes after each epoch, so $r_t(\theta)$ changes too; clipping continues to take effect dynamically.
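The reuse pattern in these bullets can be sketched as an index generator. The body of ppo_update is not shown in the listing, so this is only one plausible structure under the assumption that each epoch reshuffles the rollout; the function name and the tiny sizes are hypothetical.

```python
import random

def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield (epoch, indices) so every sample is visited once per epoch."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)                      # new random order each epoch
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]

# Tiny example: 8 stored steps, mini-batches of 4, 3 epochs over the same rollout.
batches = list(minibatch_indices(n_samples=8, batch_size=4, n_epochs=3))
for epoch, mb in batches:
    # In a real update, new_logprobs and the ratio would be recomputed here for `mb`,
    # because the parameters changed after the previous optimizer step.
    print(epoch, mb)
```

Each yielded index list would play the role of `mb` in the update excerpt, while old_logprobs stays fixed for the whole rollout.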
Derivation Note: PPO-Penalty Variant
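The PPO paper actually proposes two variants. Besides PPO-Clip, it proposes PPO-Penalty (also called PPO-KL), which directly adds a KL penalty term:
$$L^{\text{KLPEN}}(\theta) = \mathbb{E}_t\left[r_t(\theta)\hat{A}_t - \beta\, D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot|s_t)\,\|\,\pi_\theta(\cdot|s_t)\right)\right]$$
$\beta$ is an adaptive coefficient: if current KL is too large, increase $\beta$ to penalize more; if KL is too small, decrease $\beta$ to loosen the constraint.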
PPO-Penalty can be better in some settings (especially when you need precise control of policy change), but it is more complex to implement and introduces an additional adaptive mechanism to tune. In practice, PPO-Clip is more common.
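The adaptive mechanism can be sketched in a few lines. The thresholds (1.5 times the target) and the doubling/halving factor follow the heuristic described in the PPO paper; the function names and the example numbers are illustrative assumptions.

```python
def penalized_objective(ratio, advantage, kl, beta):
    """Per-sample PPO-Penalty objective: surrogate minus a KL penalty."""
    return ratio * advantage - beta * kl

def update_beta(beta, kl, kl_target, tolerance=1.5, factor=2.0):
    """Adapt the penalty coefficient after a policy update.

    Grow beta when the measured KL overshoots the target, shrink it when
    the KL undershoots, and leave it alone inside the tolerance band.
    """
    if kl > kl_target * tolerance:
        return beta * factor
    if kl < kl_target / tolerance:
        return beta / factor
    return beta

beta = 1.0
for measured_kl in [0.03, 0.03, 0.01, 0.001]:   # hypothetical KL after each update
    beta = update_beta(beta, measured_kl, kl_target=0.01)
    print(f"KL={measured_kl:.3f} -> beta={beta}")
```

Compared with clipping, this adds one more quantity to measure and one more state variable to carry between updates, which is part of the extra complexity mentioned above.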
Thought Question 1: If we set ε to 0, what does PPO-Clip degenerate into?
When $\epsilon = 0$, the clipping interval collapses to $[1, 1]$, so $\text{clip}(r_t, 1, 1) = 1$. The PPO-Clip objective becomes:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \hat{A}_t\right)\right]$$
For $\hat{A}_t > 0$, the per-sample term is $\min(r_t\hat{A}_t, \hat{A}_t)$: when $r_t > 1$, the objective is the constant $\hat{A}_t$, so further increasing a good action's probability no longer improves the objective; when $r_t < 1$, the objective is $r_t\hat{A}_t$ and the gradient only pushes $r_t$ back toward 1. This means good actions cannot be meaningfully increased above the old policy.
For $\hat{A}_t < 0$, the same term applies: when $r_t < 1$, the objective is the constant $\hat{A}_t$, so further decreasing a bad action's probability no longer improves the objective; when $r_t > 1$, the objective is $r_t\hat{A}_t$ and the gradient only pushes $r_t$ back toward 1. This means bad actions cannot be meaningfully decreased below the old policy either.
In short, $\epsilon = 0$ almost freezes the policy near the old one: whether advantages are positive or negative, the policy cannot make meaningful improvements. This shows $\epsilon$ controls both "allowed change magnitude" and "learning capacity."
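The same conclusion can be checked by brute force: scan a few ratio values and ask how much the objective can gain over staying at ratio 1. The helper names and the grid of ratios are illustrative assumptions.

```python
def clip_objective(ratio, advantage, clip_eps):
    """Per-sample PPO-Clip objective for a given clip range."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)

def best_gain(advantage, clip_eps, ratios=(0.5, 0.8, 1.0, 1.2, 1.5)):
    """Largest improvement over the old policy (ratio = 1) across the probed ratios."""
    baseline = clip_objective(1.0, advantage, clip_eps)
    return max(clip_objective(r, advantage, clip_eps) for r in ratios) - baseline

for adv in (1.0, -1.0):
    print(f"A={adv:+.0f}: gain with eps=0.0 is {best_gain(adv, 0.0):.2f}, "
          f"gain with eps=0.2 is {best_gain(adv, 0.2):.2f}")
```

With a clip range of zero no ratio beats the old policy for either sign of the advantage, whereas a clip range of 0.2 leaves room for a gain of 0.2 times the advantage magnitude.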
Thought Question 2: Can clipping fully replace a KL constraint? Can clipping fail?
Clipping effectively limits policy change in most situations, but it has a theoretical weakness: it constrains the ratio for each individual action, rather than directly constraining the overall distribution distance (KL divergence) between two policies.
Consider an extreme case: a policy has 100 actions, and clipping allows each action's probability ratio to move within $[1-\epsilon, 1+\epsilon]$. If all actions are pushed to the boundary simultaneously, the overall distribution change can exceed a KL constraint such as $D_{\text{KL}} \le \delta$. In practice, this is rare because advantage estimates are noisy and usually do not push all actions in extreme directions simultaneously. But for settings where policy-change control must be strict (e.g., LLM alignment), practitioners often monitor KL as an additional safety metric. This is why in Chapter 8's RLHF training you will see both clip_fraction and approx_kl logged.
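The extreme case can be worked through numerically. The sketch assumes a uniform old policy over 100 actions, a clip range of 0.2, and a KL threshold of 0.01; all three numbers are illustrative choices rather than values taken from the text.

```python
import math

n_actions = 100
eps = 0.2
kl_threshold = 0.01  # illustrative KL budget

# Old policy: uniform. New policy: half the actions scaled up to the clip
# boundary (1 + eps), the other half scaled down to (1 - eps).
old = [1.0 / n_actions] * n_actions
half = n_actions // 2
new = [p * (1.0 + eps) for p in old[:half]] + [p * (1.0 - eps) for p in old[half:]]

ratios = [pn / po for po, pn in zip(old, new)]
kl = sum(po * math.log(po / pn) for po, pn in zip(old, new))

print(f"sum(new) = {sum(new):.6f}")
print(f"ratio range = [{min(ratios):.2f}, {max(ratios):.2f}]")
print(f"KL(old || new) = {kl:.4f}, exceeds threshold: {kl > kl_threshold}")
```

Every individual ratio sits inside the clip range, yet the distribution-level KL is roughly twice the assumed threshold, which is the gap between per-action clipping and a true KL constraint.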
Thought Question 3: Why does PPO update K epochs on the same batch, instead of collecting K batches and updating once each?
The two strategies process the same total number of samples ($K \times T$ steps, where $T$ is the rollout length), but differ in data quality and compute cost.
"Collect K batches, update once each" uses fresh data from the current policy every time, so the gradient estimate is unbiased. But collecting data requires environment simulation, which is often far more expensive than parameter updates. In LLM settings, generating a batch of responses can take minutes, while a gradient update can take seconds.
"Collect one batch, update K epochs" reuses old data for multiple updates. From the importance-sampling viewpoint, only the first epoch is unbiased; later epochs introduce bias as $\pi_\theta$ drifts away from $\pi_{\theta_{\text{old}}}$. Clipping is designed to mitigate this: when the drift becomes too large, clipping drives gradients toward zero and effectively stops unsafe updates. This is an engineering tradeoff: accept "small bias" in exchange for "large compute savings."
In practice, $K$ is often 3-10, and clipping can keep the bias within an acceptable range.
At this point, you have the complete mathematical picture of PPO: from policy gradients, to the importance-sampling surrogate objective, to the PPO-Clip policy loss formed by ratio, clamp, and min, and finally to the total loss that can be backpropagated directly.
The next two sections each go deeper into a key detail:
- Intuition and experiments for clipping: Trust Region and Clipping
- GAE derivation and its use in reward models and LLM alignment: GAE, Reward Models, and LLM Alignment