Part 2: Theory and Methods - Knowledge Summary
What Did We Learn in This Part?
These four chapters form the theoretical core of the book. We started from the most basic question, "How do we describe decision-making mathematically?", and progressed all the way to PPO, one of the most widely used RL algorithms in industry. Mastering this part gives you the key to reading essentially all of the later LLM alignment algorithms.
After these four chapters, you should understand:
- The MDP 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: a mathematical language for "an agent makes decisions in an environment."
- Value functions and Bellman equations: $V^\pi(s)$ and $Q^\pi(s, a)$ measure "how valuable a state is" and "how valuable an action at a state is." Bellman equations tell us: current value = immediate reward + discounted next-step value.
- TD error $\delta_t$: measures the gap between prediction and reality and is the learning signal behind almost all RL algorithms.
- The three key components of DQN: Q-network (approximate $Q(s, a)$ with a neural network), experience replay (break sample correlations), target network (stabilize training targets).
- The policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]$, which differentiates the policy directly and naturally supports continuous actions.
- Actor-Critic: the Actor learns the policy, the Critic learns values; they cooperate through the advantage function $A(s, a) = Q(s, a) - V(s)$.
- PPO clipping: use $\text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ to constrain changes in the policy ratio and prevent unstable, overly large updates.
- GAE: $\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, interpolating between low-variance, high-bias and high-variance, low-bias advantage estimates.
Now let us review the content chapter by chapter.
Chapter 3: MDP - A Mathematical Description of Decision Problems
Markov Decision Process
To discuss reinforcement learning rigorously, we need a mathematical framework for "an agent makes decisions in an environment." This framework is the Markov Decision Process (MDP), defined by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- $\mathcal{S}$ is the set of states. In CartPole, $\mathcal{S} \subseteq \mathbb{R}^4$.
- $\mathcal{A}$ is the set of actions. In the discrete case, $\mathcal{A} = \{a_1, \dots, a_k\}$ is a finite set; in the continuous case, $\mathcal{A}$ is an interval of real values.
- $P(s' \mid s, a)$ is the transition probability: after taking action $a$ at state $s$, the probability of moving to $s'$. "Markov" means the future depends only on the current state, not on the past. In chess, you only need the current board position; you do not need the full move history.
- $R(s, a)$ is the reward function: the immediate reward after taking action $a$ at state $s$.
- $\gamma \in [0, 1)$ is the discount factor: it controls the tradeoff between "immediate payoff" and "long-term payoff." $\gamma$ near 1 emphasizes the long term; $\gamma$ near 0 emphasizes the present.
The agent's goal is to find a policy $\pi(a \mid s)$ that maximizes the discounted cumulative return from any starting state:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
Here $r_{t+k}$ is the reward received for the action taken at step $t + k$.
Why do we need a discount factor $\gamma$? Mathematically, an infinite series must converge; with bounded rewards, $\gamma < 1$ guarantees $G_t$ is finite. Intuitively, future rewards are less certain than immediate ones: "100 dollars today" is typically more attractive than "100 dollars next year."
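To make the definition of the return concrete, here is a minimal sketch that computes the discounted return of a finite reward sequence. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the book's code.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_k gamma**k * rewards[k] for a finite episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]  # illustrative rewards, not from a real environment
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, gamma=1.0))  # 3.0
```

The backward loop is the same recursion that the Bellman equations express in expectation.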
Value Functions and Bellman Equations
With return defined, the next question is: how many "points" is a state worth? How many "points" is a state-action pair worth? This leads to two core concepts.
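The state-value function $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\big[G_t \mid s_t = s\big]$$
The action-value function $Q^\pi(s, a)$ is the expected return when we take action $a$ at state $s$ and then follow policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid s_t = s, a_t = a\big]$$
They are related by:
$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$
In words: state value is the policy-weighted average of action values.
The Bellman equation further reveals their recursive structure: you do not need to "see the future," only one step:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \Big]$$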
Intuitively: the value of the current state equals the expected value over actions (weighted by the policy), where each action value equals immediate reward plus discounted value of the next state. This self-consistent equation is the basis for computing value functions.
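As an illustration of how the self-consistent equation can be used to compute values, the sketch below repeatedly applies the Bellman expectation equation to a hypothetical two-state MDP until the values settle. The MDP, the uniform policy, and the name `evaluate_policy` are assumptions made up for this example.

```python
def evaluate_policy(policy, P, R, gamma, sweeps=500):
    """Iterate the Bellman expectation equation for a fixed policy.

    policy[s][a] -> probability of action a in state s
    P[s][a]      -> dict mapping next state s' to its probability
    R[s][a]      -> immediate reward
    """
    n_states = len(policy)
    V = [0.0] * n_states
    for _ in range(sweeps):
        # Build the new value list from the old one (synchronous sweep).
        V = [
            sum(
                policy[s][a]
                * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
                for a in range(len(policy[s]))
            )
            for s in range(n_states)
        ]
    return V


# Hypothetical MDP: action 0 stays put, action 1 moves to the other state.
P = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}, {0: 1.0}]]
R = [[0.0, 1.0], [2.0, 0.0]]
policy = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

V = evaluate_policy(policy, P, R, gamma=0.9)
print(V)  # approximately [7.25, 7.75]
```

Solving the two linear Bellman equations by hand gives the same values, which is a useful sanity check on the iteration.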
TD Error: The Learning Signal Across RL Algorithms
In realistic settings, we usually do not know the transition probabilities or reward function (we do not know the environment model). We can only interact with the environment and obtain samples. At this point, a key learning signal appears: the temporal-difference error (TD error):
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
TD error measures the gap between prediction and reality. $V(s_t)$ is our prediction of the current state's value, and $r_t + \gamma V(s_{t+1})$ is the one-step reward plus our estimate of the next step. If our prediction is perfectly correct, $\delta_t = 0$ in expectation. If reality is better than predicted, $\delta_t > 0$, and we should increase our estimate of $V(s_t)$.
This simple formula runs through the entire RL landscape. From Q-learning to DQN, and from Actor-Critic to PPO, the learning signals are TD errors or close relatives of them, such as Monte Carlo returns and advantage estimates.
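A single tabular TD(0) update can be sketched as follows. The state names, step size `alpha`, and the helper name `td0_update` are illustrative assumptions; the update rule itself is the standard "move the estimate a fraction of the way toward the one-step target."

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to V[s] in place and return the TD error."""
    bootstrap = 0.0 if done else V[s_next]  # terminal states have value 0
    delta = r + gamma * bootstrap - V[s]
    V[s] += alpha * delta
    return delta


V = {"A": 0.0, "B": 1.0}  # hypothetical value estimates
delta = td0_update(V, "A", r=0.5, s_next="B", done=False, alpha=0.1, gamma=0.9)
# delta = 0.5 + 0.9 * 1.0 - 0.0, roughly 1.4, so V["A"] moves from 0.0 to about 0.14
print(delta, V["A"])
```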
From Slot Machines to GridWorld: Understanding Q-Learning in Code
In the two-armed bandit experiment, we first experienced the "exploration vs exploitation" tension: you want to choose the arm with higher win rate (exploit), but you worry the other arm might actually be better (explore). In 4x4 GridWorld, we ran the full Q-learning loop:
    import numpy as np

    # env, n_states, n_actions, alpha, gamma, epsilon and epsilon_greedy are defined elsewhere
    Q = np.zeros((n_states, n_actions))  # initialize Q-table
    for episode in range(1000):
        state = env.reset()
        done = False  # reset the termination flag at the start of each episode
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit the current best action
            action = epsilon_greedy(Q[state], epsilon)
            next_state, reward, done = env.step(action)
            # update Q-values using TD error
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state

This snippet contains all essential RL elements: store value estimates in a table $Q(s, a)$, balance exploration and exploitation with $\epsilon$-greedy, and update using TD error. When the state space becomes large (for example, from a 16-cell grid to Atari frames with $210 \times 160$ pixels), the table no longer fits. This is exactly the problem DQN solves.
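The Q-learning loop calls `epsilon_greedy`, which this summary does not define. One possible implementation is sketched below; the random tie-breaking and the optional `rng` argument are assumptions of this sketch rather than the book's definition.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    n_actions = len(q_values)
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(q_values)
    # Break ties at random so an all-zero Q-table does not always pick action 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # always the greedy action, 1
```

Random tie-breaking matters early in training, when every entry of the Q-table is still zero.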
Chapter 4: DQN - The Leap from Tables to Neural Networks
The Curse of Dimensionality and Function Approximation
CartPole has continuous states, but only 4 dimensions. Atari frames, by contrast, are pixel tensors of size $210 \times 160 \times 3$ with 256 possible values per entry, so the state space has on the order of $256^{210 \times 160 \times 3}$ possibilities, a number far larger than the number of atoms in the observable universe. It is impossible to store Q-values for every state in a table.
DQN solves this by approximating the Q-function with a neural network. The network takes state $s$ as input and outputs Q-values $Q(s, a; \theta)$ for each action $a$. The training objective is to minimize squared TD error:
$$L(\theta) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\Big]$$
Here $\theta$ are the parameters of the online network, and $\theta^-$ are the parameters of the target network. This loss means: make the network's prediction match "one-step reward plus the next-step maximum Q-value."
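The target inside the loss can be written out for a small batch without any deep learning library. The sketch below assumes plain Python lists in place of network outputs, and it adds the usual convention (an assumption here) that terminal transitions do not bootstrap.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute y = r + gamma * max_a' Q_target(s', a') for a batch.

    next_q_values[i] holds the target network's Q-values for next state i.
    Terminal transitions (done=True) do not bootstrap.
    """
    return [
        r + (0.0 if done else gamma * max(q_next))
        for r, q_next, done in zip(rewards, next_q_values, dones)
    ]


def squared_td_loss(q_taken, targets):
    """Mean squared TD error between Q(s, a; theta) and the fixed targets."""
    return sum((y - q) ** 2 for q, y in zip(q_taken, targets)) / len(targets)


targets = dqn_targets([1.0, 0.0], [[0.5, 2.0], [3.0, 1.0]], [False, True], gamma=0.5)
print(targets)  # [2.0, 0.0]
print(squared_td_loss([1.0, 1.0], targets))  # ((2 - 1)^2 + (0 - 1)^2) / 2 = 1.0
```

In a real implementation the targets are treated as constants: gradients flow only through the online network's prediction.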
The Three Key Components of DQN
A neural network alone is not enough. If we train directly on each new interaction step, consecutive samples are highly correlated (they come from adjacent time steps), and training becomes unstable. DQN introduces three important design choices:
Experience replay stores each transition $(s, a, r, s', \text{done})$ in a large buffer, and then samples random mini-batches for training. This breaks temporal correlations and brings the training data closer to the i.i.d. assumption behind stochastic gradient descent.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            # deque with maxlen drops the oldest transition once the buffer is full
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # uniform sampling without replacement; needs len(buffer) >= batch_size
            return random.sample(self.buffer, batch_size)

Target networks keep a delayed copy of the online network. Every so often, we hard-copy the online parameters into the target network. This keeps TD targets stable for a while and avoids the difficulty of "chasing a moving target." Think of learning to shoot a basketball: if the hoop moves every second, it is hard to learn; if it moves only occasionally, you have time to adapt.
$\epsilon$-greedy exploration gradually reduces the probability of random exploration over training, transitioning from broad exploration early to fine-grained exploitation later.
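The two schedules described above, decaying exploration and periodic target synchronization, can be sketched as small helper functions. The linear decay shape and every default value here (`eps_start`, `eps_end`, `decay_steps`, `sync_every`) are illustrative assumptions, not settings taken from the book.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def should_sync_target(step, sync_every=1_000):
    """True on the steps where the online weights are hard-copied to the target."""
    return step > 0 and step % sync_every == 0


for step in (0, 5_000, 10_000, 20_000):
    print(step, round(linear_epsilon(step), 3), should_sync_target(step))
```

Exponential decay is an equally common choice; what matters is that exploration is high early and low late.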
The DQN Family
The original DQN became famous in the 2015 Atari paper, and many improvements followed. Double DQN decouples "action selection" and "action evaluation": use the online network to pick $a^* = \arg\max_{a'} Q(s', a'; \theta)$, then use the target network to evaluate that action, $Q(s', a^*; \theta^-)$. This reduces overestimation bias in original DQN, analogous to not letting the same person both set an exam and grade it.
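The difference between the two targets is easiest to see on concrete numbers. In the sketch below the Q-value lists are made-up stand-ins for network outputs, and the function names are hypothetical.

```python
def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, q_target_next, gamma):
    # Standard DQN: the target network both selects and evaluates the action.
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, gamma):
    # Double DQN: the online network selects, the target network evaluates.
    best_action = argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 3.0, 2.0]  # online network prefers action 1
q_target_next = [4.0, 1.5, 2.0]  # target network's (noisy) estimates
print(dqn_target(0.0, q_target_next, gamma=1.0))                        # 4.0
print(double_dqn_target(0.0, q_online_next, q_target_next, gamma=1.0))  # 1.5
```

When the target network's largest value is a noisy outlier, the plain `max` picks it up, while the decoupled version only uses it if the online network agrees on the action.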
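Dueling DQN decomposes Q-values into state value $V(s)$ and advantage $A(s, a)$:
$$Q(s, a) = V(s) + \Big(A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')\Big)$$
Subtracting the mean advantage makes the decomposition unique. When all actions in a state are similarly good, $A(s, a) \approx 0$, and the network mainly needs to learn $V(s)$. This improves efficiency.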
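A minimal sketch of the dueling aggregation step follows, assuming the mean-subtracted form used in the Dueling DQN paper. The inputs are plain numbers standing in for the outputs of the value and advantage heads.

```python
def dueling_q(value, advantages):
    """Combine V(s) and A(s, .) into Q(s, .) with mean-centred advantages."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


print(dueling_q(2.0, [1.0, 0.0, -1.0]))  # [3.0, 2.0, 1.0]
print(dueling_q(2.0, [0.5, 0.5, 0.5]))   # [2.0, 2.0, 2.0]: equal advantages leave Q = V
```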
Chapter 5: Policy Gradients - Learning the Policy Directly
From Value-Based to Policy-Based
DQN follows an indirect route: learn $Q(s, a)$, then choose actions via $\arg\max_a Q(s, a)$. This route has a fundamental limitation: the $\arg\max$ is only practical for small, discrete action spaces. A robot arm's torques are continuous values, e.g. each joint torque lies in a range such as $[-1, 1]$, and you cannot assign a Q-value to every combination. Text generation in LLMs is extreme in a different way: at each step you choose from tens of thousands of tokens.
Policy gradient methods take a different route: instead of learning value functions, they parameterize the policy directly as $\pi_\theta(a \mid s)$, then optimize $\theta$ to maximize expected return $J(\theta)$. The policy gradient theorem states that the gradient of $J(\theta)$ can be estimated as:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]$$
This formula has a clean intuition: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction that makes action $a_t$ more likely in state $s_t$, while $G_t$ tells you whether that direction is good. If $G_t > 0$, increase the action probability; if $G_t < 0$, decrease it. That is the whole REINFORCE algorithm.
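The "increase if the return is positive, decrease if it is negative" intuition can be checked on a tiny softmax policy whose parameters are just the logits. This is a hand-rolled sketch with hypothetical names; it uses the known identity that the gradient of the log-softmax with respect to the logits is the one-hot vector of the action minus the probabilities.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """d/d logits of log softmax(logits)[action] = one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, G, lr=0.1):
    """One gradient-ascent step: logits += lr * G * grad log pi(action)."""
    grad = grad_log_prob(logits, action)
    return [z + lr * G * g for z, g in zip(logits, grad)]


logits = [0.0, 0.0, 0.0]
up = reinforce_step(logits, action=1, G=+2.0)
down = reinforce_step(logits, action=1, G=-2.0)
print(softmax(up)[1] > softmax(logits)[1])    # True: positive return raises pi(a|s)
print(softmax(down)[1] < softmax(logits)[1])  # True: negative return lowers it
```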
    from torch.distributions import Categorical

    def reinforce_update(policy, optimizer, states, actions, returns):
        # log pi(a_t | s_t) for every step; policy(s) must return action probabilities
        log_probs = []
        for s, a in zip(states, actions):
            dist = Categorical(policy(s))
            log_probs.append(dist.log_prob(a))
        loss = 0
        for log_prob, G in zip(log_probs, returns):
            loss += -log_prob * G  # minus sign: minimizing this loss maximizes expected return
        loss /= len(returns)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Baselines and Variance
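REINFORCE has a serious weakness: high variance. If every episode return is around 100, then even a "bad" action receives a large positive signal, just slightly smaller than the signal for good actions. The useful information is buried in small differences between large numbers, which makes gradient estimates unstable.
The standard fix is to introduce a baseline $b(s_t)$ and modify the gradient to:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\Big]$$
As long as $b(s_t)$ does not depend on the action $a_t$, this modification does not change the expected gradient (because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$), but it can greatly reduce variance. The most common baseline is the state-value function $V^\pi(s)$, i.e., the Critic.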
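The claim that an action-independent baseline leaves the expected gradient unchanged can be verified numerically for a softmax policy. The sketch below computes the expectation of the baseline term exactly by summing over actions; the logits and the baseline value are arbitrary illustrative numbers.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_baseline_term(logits, baseline):
    """E_{a~pi}[ grad log pi(a) * b ] with respect to the logits."""
    probs = softmax(logits)
    n = len(logits)
    expectation = [0.0] * n
    for a in range(n):
        for i in range(n):
            # i-th component of grad log pi(a): one_hot(a)[i] - probs[i]
            score_i = (1.0 if i == a else 0.0) - probs[i]
            expectation[i] += probs[a] * score_i * baseline
    return expectation


print(expected_baseline_term([0.2, -1.0, 3.0], baseline=100.0))  # every entry is ~0
```

Even with a baseline as large as 100, the term averages out to zero up to floating-point error, so only the variance of the estimator changes.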
The Actor-Critic Architecture
Combining the Actor (policy network) and the Critic (value network) yields the Actor-Critic architecture. The Actor selects actions, and the Critic evaluates "how much better this action is than average." That quantity is the advantage function:
$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$
    import torch.nn as nn

    class ActorCritic(nn.Module):
        def __init__(self, state_dim, action_dim):
            super().__init__()
            # shared feature extractor used by both heads
            self.shared = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
            )
            # Actor head: action probabilities
            self.actor = nn.Sequential(
                nn.Linear(128, action_dim), nn.Softmax(dim=-1)
            )
            # Critic head: scalar state value V(s)
            self.critic = nn.Linear(128, 1)

        def forward(self, x):
            features = self.shared(x)
            return self.actor(features), self.critic(features)

During training, the Actor updates the policy using an advantage estimate $\hat{A}_t$ (here the one-step TD error $\delta_t$), while the Critic updates the value estimate using squared TD error:
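    import torch
    from torch.distributions import Categorical

    # update after one environment step
    # (model, state, action, reward, next_state, done and gamma come from the training loop)
    probs, value = model(state)
    log_prob = Categorical(probs).log_prob(action)
    with torch.no_grad():  # the TD target is treated as a constant
        _, next_value = model(next_state)
        td_target = reward + gamma * next_value * (1 - done)
    td_error = td_target - value
    actor_loss = -log_prob * td_error.detach()  # Actor: update policy using advantage
    critic_loss = td_error ** 2                 # Critic: update value using TD error
    loss = (actor_loss + critic_loss).mean()

Chapter 6: PPO - Making Policy Updates More Stable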
Trust Regions and Clipping
Actor-Critic is more stable than REINFORCE, but policy updates can still be too large. If a single update changes the policy dramatically, previously collected data becomes irrelevant, and training can oscillate violently.
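PPO (Proximal Policy Optimization) uses an elegant clipping mechanism to address this. It defines the policy ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
When $r_t = 1$, the new and old policies match. When $r_t > 1$, the new policy is more likely to choose the action; when $r_t < 1$, it is less likely. PPO maximizes:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t\big)\Big]$$
Typically $\epsilon = 0.2$. When $\hat{A}_t > 0$ (good actions), the objective stops rewarding increases of $r_t$ beyond $1 + \epsilon$, which removes the incentive for overly aggressive probability increases. When $\hat{A}_t < 0$ (bad actions), it likewise stops rewarding decreases of $r_t$ below $1 - \epsilon$. This "safety rail" keeps each update close to the old policy, approximating a trust region.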
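The per-sample clipped term is short enough to evaluate by hand. The following sketch uses scalar inputs and a hypothetical function name, and assumes the common default clip range of 0.2.

```python
def clip(x, low, high):
    return max(low, min(x, high))


def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(ratio * advantage, clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)


print(ppo_clip_term(1.5, advantage=1.0))   # 1.2: gains are capped at (1 + eps) * A
print(ppo_clip_term(0.5, advantage=-1.0))  # -0.8: capped at (1 - eps) * A
print(ppo_clip_term(0.5, advantage=1.0))   # 0.5: moving the wrong way is not clipped
```

The third case shows why the outer `min` matters: when the ratio has moved in the harmful direction, the unclipped term is kept so the gradient can still correct it.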
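The full PPO objective also includes two additional terms:
$$L(\theta) = \mathbb{E}_t\Big[L^{\text{CLIP}}_t(\theta) - c_1\, L^{\text{VF}}_t(\theta) + c_2\, H[\pi_\theta](s_t)\Big]$$
where $L^{\text{VF}}$ is the Critic's value-fitting loss (MSE), $H[\pi_\theta]$ is the policy entropy bonus, and $c_1, c_2$ are weighting coefficients. This objective is maximized; implementations usually minimize its negative. The entropy term encourages exploration and prevents premature collapse to suboptimal deterministic behavior.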
GAE: Balancing Bias and Variance
How we estimate advantage matters greatly for PPO. Two naive extremes are:
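- use one-step TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$: high bias but low variance;
- use full-trajectory return $G_t - V(s_t)$: unbiased but high variance.
GAE (Generalized Advantage Estimation) interpolates between them:
$$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
Here $\lambda \in [0, 1]$ controls the interpolation. $\lambda = 0$ reduces to one-step TD (large bias), and $\lambda = 1$ reduces to full returns (large variance). In practice, $\lambda = 0.95$ is common.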
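The two limiting cases can be checked numerically on a single short episode. The sketch below uses its own small `gae` helper that ignores episode boundaries inside the sequence (an assumption made to keep it short), together with made-up rewards and value estimates.

```python
def gae(rewards, values, gamma, lam):
    """values has len(rewards) + 1 entries; the last is the bootstrap value."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # final 0.0: the episode ends after the last reward
gamma = 0.9

deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]
returns = [sum(gamma**k * r for k, r in enumerate(rewards[t:])) for t in range(3)]

print(gae(rewards, values, gamma, lam=0.0))  # equals the one-step TD errors
print(gae(rewards, values, gamma, lam=1.0))  # equals returns[t] - values[t]
print(deltas, [g - v for g, v in zip(returns, values)])
```

Values of `lam` between 0 and 1 produce estimates that lie between these two extremes in how much they trust the Critic.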
    def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
        # values may carry one extra bootstrap entry; otherwise the value after the last step is 0
        advantages = [0.0] * len(rewards)
        gae = 0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0
            # one-step TD error; (1 - dones[t]) stops bootstrapping across episode ends
            delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + gamma * lam * (1 - dones[t]) * gae
            advantages[t] = gae
        return advantages

From LunarLander to LLMs
PPO was first validated on classic RL environments such as LunarLander, but its real power shows up in LLM alignment. In RLHF, PPO simultaneously manages four models: Actor (the language model being trained), Critic (value network), Reference (the frozen original model used for KL regularization), and Reward Model (the judge scoring responses). The Bradley-Terry preference model defines the reward-model training objective:
$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
Here $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected responses, $r_\phi$ is the reward model, and $\sigma$ is the sigmoid function.
This framework is the starting point of Chapter 8. DPO's key observation is that the same preference objective can be optimized without explicitly training a reward model.
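For a single preference pair, the Bradley-Terry loss is the negative log-sigmoid of the reward margin. The sketch below takes two scalar rewards as stand-ins for reward-model outputs; the function name and the numerically stable formulation are choices made for this example.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_w - r_l), written stably as softplus(-(r_w - r_l))."""
    margin = reward_chosen - reward_rejected
    # softplus(-m) = log(1 + exp(-m)); split on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(bradley_terry_loss(0.0, 0.0))  # log(2), about 0.693: the model cannot tell them apart
print(bradley_terry_loss(3.0, 0.0))  # small loss: the preferred answer scores higher
print(bradley_terry_loss(0.0, 3.0))  # large loss: the ranking is inverted
```

Only the difference between the two rewards enters the loss, which is why a reward model trained this way is defined up to an additive constant.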
Summary
Part 2 followed a full theoretical arc: MDP provides the language of decision problems -> Bellman equations provide a recursive method to compute values -> DQN approximates $Q$ with neural networks and resolves the curse of dimensionality -> policy gradients optimize policies directly and support continuous actions -> Actor-Critic introduces a Critic to reduce variance -> PPO uses clipping and GAE for stable training.
Every concept on this path will reappear in later LLM alignment chapters. Understanding PPO clipping helps you understand GRPO's within-group normalization; understanding the Actor-Critic division of labor helps you understand the roles of the four models in RLHF.
Next stop: Part 3: The LLM Era