Chapter 7: PPO — The Art of Stable Training

In the previous chapter, we built the Actor-Critic architecture: the Actor is responsible for choosing actions, and the Critic is responsible for judging whether those actions are good or bad. The two cooperate through the advantage function $A(s,a)$. On CartPole, Actor-Critic performs quite well. But when you move the same architecture to more complex environments (for example, LunarLander) or to much larger models (for example, language models with billions of parameters), a serious issue starts to surface: training instability.

Policy gradient methods have a notorious weakness: if a single update step is too large, the policy can "collapse." Imagine learning to ride a bicycle. If you shift your center of gravity too aggressively in one attempt, you do not ride better; you simply crash. The TD Error signal in Actor-Critic does reduce variance, but it does not fundamentally solve this problem. What we need is a mechanism that constrains how much the policy is allowed to change at each step, so the policy makes steady progress through many small steps rather than trying to "leap to the finish in one jump." This is the core problem that PPO (Proximal Policy Optimization) is designed to solve.
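
To make "constraining how much the policy is allowed to change" concrete before the full derivation, here is a minimal sketch of the clipped surrogate term for a single sample. The function name `clipped_surrogate`, its parameters, and the clip range of 0.2 are illustrative assumptions, not the code used in the later sections.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate value (hypothetical helper).

    logp_new / logp_old: log-probability of the taken action under the
    current policy and under the policy that collected the data.
    clip_eps: assumed clip range; 0.2 is a common choice, not a rule.
    """
    # Probability ratio between the new and the old policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the two candidate objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0), policy already moved far toward it:
# the objective is capped, so there is no reward for moving further.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))

# Bad action (advantage < 0), policy already moved far away from it:
# the objective is capped here as well.
print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

In both calls the ratio lies outside the clip range in the direction the advantage favors, so the returned value no longer depends on the ratio and the gradient through it is zero. That is the "small steps" behavior described above: once the policy has changed enough on a sample, that sample stops pushing it further.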

Prerequisites (Quick Review)

This chapter will frequently use the following concepts:

Policy gradient $\nabla_\theta J$: PPO adds constraints on top of policy gradients
The high-variance issue in REINFORCE: why we need a series of improvements
Advantage function $A(s,a)$: PPO's policy update depends on advantage signals
TD Error and Critic training: how the Critic is trained in PPO
Actor-Critic architecture: PPO is a variant of Actor-Critic
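
As a compact reminder, the first, third, and fourth items can be written in their standard textbook form. The notation here (discount factor $\gamma$, value estimate $V$, time index $t$) is assumed and may differ slightly from the symbols used in earlier chapters.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s,a)\right]
$$

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

The Critic is trained to shrink $\delta_t$, and the same $\delta_t$ serves as a one-step estimate of the advantage $A(s_t, a_t)$ that weights the Actor's update.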

This chapter follows the path "hands-on → theory → constraints → estimation." We will first run a continuous-control experiment on BipedalWalker and see the training curves, policy entropy, clipping fraction, and KL divergence with our own eyes. Then we will unpack the mathematics behind PPO: the derivation, the clipping mechanism, and methods for advantage estimation. LunarLander has already served as the introductory task in earlier chapters, so we will not repeat it here. Instead, we will move directly into continuous action spaces, where PPO's characteristics become more visible.

Section	The Question You Will Answer
Hands-on: BipedalWalker continuous control	What does PPO training look like in practice? How does it handle continuous action spaces? How should we read Reward, Entropy, Clip Fraction, and KL?
PPO: Mathematical derivation	Where do PPO's formulas come from? What is the full chain from policy gradients to the clipped surrogate objective? What terms are in the complete loss?
Trust regions and clipping	Why does a too-large update step cause collapse? How do TRPO's KL constraint and PPO's clipping work, respectively?
GAE, reward models, and LLM alignment	How does GAE interpolate between bias and variance? In PPO for LLM alignment, how many models need to run at the same time?
PPO game projects	Which games have already been trained with PPO? Where are the player entry points, training environments, and reproduction evidence?
RL exploration in long-horizon tasks	How does classical RL deal with long-horizon tasks? How do hierarchical RL, HER, world models, and reward shaping work?

Let's start by running PPO and looking at the results: Hands-on: BipedalWalker continuous control.
