Skip to content

Part 2: Theory and Methods - Knowledge Summary

What Did We Learn in This Part?

These four chapters form the theoretical core of the book. We started from the most basic question, "How do we describe decision-making in mathematics?", and progressed all the way to PPO, the most widely used algorithm in modern industry. Mastering this part gives you the key to reading essentially all later LLM alignment algorithms.

After these four chapters, you should understand:

  • The MDP 5-tuple (S,A,P,R,γ)(S, A, P, R, \gamma): a mathematical language for "an agent makes decisions in an environment."
  • Value functions and Bellman equations: Vπ(s)V^\pi(s) and Qπ(s,a)Q^\pi(s,a) measure "how valuable a state is" and "how valuable an action at a state is." Bellman equations tell us: current value = immediate reward + discounted next-step value.
  • TD error: δ=r+γV(s)V(s)\delta = r + \gamma V(s') - V(s) measures the gap between prediction and reality and is the learning signal behind almost all RL algorithms.
  • The three key components of DQN: Q-network (approximate QQ with a neural network), experience replay (break sample correlations), target network (stabilize training targets).
  • The policy gradient theorem: θJ=E[logπθ(as)Gt]\nabla_\theta J = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot G_t], directly differentiating the policy, naturally supporting continuous actions.
  • Actor-Critic: the Actor learns the policy, the Critic learns values; they cooperate through the advantage function A(s,a)=Q(s,a)V(s)A(s,a) = Q(s,a) - V(s).
  • PPO clipping: use clip(rt,1ε,1+ε)\text{clip}(r_t, 1-\varepsilon, 1+\varepsilon) to constrain changes in the policy ratio and prevent unstable, overly large updates.
  • GAE: A^t=k=0(γλ)kδt+k\hat{A}_t = \sum_{k=0}^{\infty}(\gamma\lambda)^k \delta_{t+k}, interpolating between bias and variance.

Now let us review the content chapter by chapter.

Chapter 3: MDP - A Mathematical Description of Decision Problems

Markov Decision Process

To discuss reinforcement learning rigorously, we need a mathematical framework for "an agent makes decisions in an environment." This framework is the Markov Decision Process (MDP), defined by a 5-tuple (S,A,P,R,γ)(S, A, P, R, \gamma):

  • SS is the set of states. In CartPole, s=(position,velocity,angle,angular velocity)R4s = (\text{position}, \text{velocity}, \text{angle}, \text{angular velocity}) \in \mathbb{R}^4.
  • AA is the set of actions. In the discrete case, A={a1,a2,}A=\{a_1, a_2, \ldots\}; in the continuous case, AA is an interval of real values.
  • P(ss,a)P(s'|s,a) is the transition probability: after taking action aa at state ss, the probability of moving to ss'. "Markov" means the future depends only on the current state, not on the past. In chess, you only need the current board position; you do not need the full move history.
  • R(s,a)R(s,a) is the reward function: the immediate reward after taking action aa at state ss.
  • γ[0,1)\gamma \in [0, 1) is the discount factor: it controls the tradeoff between "immediate payoff" and "long-term payoff." γ\gamma near 1 emphasizes the long term; γ\gamma near 0 emphasizes the present.

The agent's goal is to find a policy π(as)\pi(a|s) that maximizes the discounted cumulative return from any starting state:

Gt=rt+γrt+1+γ2rt+2+=k=0γkrt+kG_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

Why do we need a discount factor γ\gamma? Mathematically, an infinite series must converge; γ<1\gamma < 1 guarantees GtG_t is finite. Intuitively, future rewards are less certain than immediate ones: "100 dollars today" is typically more attractive than "100 dollars next year."

Value Functions and Bellman Equations

With return defined, the next question is: how many "points" is a state worth? How many "points" is a state-action pair worth? This leads to two core concepts.

The state-value function Vπ(s)V^\pi(s) is the expected return starting from state ss and following policy π\pi:

Vπ(s)=Eπ[k=0γkrt+k  |  st=s]V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s\right]

The action-value function Qπ(s,a)Q^\pi(s, a) is the expected return when we take action aa at state ss and then follow policy π\pi:

Qπ(s,a)=Eπ[k=0γkrt+k  |  st=s,at=a]Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \;\middle|\; s_t = s, a_t = a\right]

They are related by:

Vπ(s)=aπ(as)Qπ(s,a)V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)

In words: state value is the policy-weighted average of action values.

The Bellman equation further reveals their recursive structure: you do not need to "see the future," only one step:

Vπ(s)=aπ(as)[R(s,a)+γsP(ss,a)Vπ(s)]V^\pi(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right]

Intuitively: the value of the current state equals the expected value over actions (weighted by the policy), where each action value equals immediate reward plus discounted value of the next state. This self-consistent equation is the basis for computing value functions.

TD Error: The Learning Signal Across RL Algorithms

In realistic settings, we usually do not know the transition probabilities PP or reward function RR (we do not know the environment model). We can only interact with the environment and obtain samples. At this point, a key learning signal appears: the temporal-difference error (TD error):

δ=r+γV(s)V(s)\delta = r + \gamma V(s') - V(s)

TD error measures the gap between prediction and reality. V(s)V(s) is our prediction of the current state's value, and r+γV(s)r + \gamma V(s') is the one-step reward plus our estimate of the next step. If our prediction is perfectly correct, δ=0\delta = 0. If reality is better than predicted, δ>0\delta > 0, and we should increase our estimate of V(s)V(s).

This simple formula runs through the entire RL landscape. From Q-learning to DQN, from REINFORCE to PPO, the learning signals are essentially TD error or its variants.

From Slot Machines to GridWorld: Understanding Q-Learning in Code

In the two-armed bandit experiment, we first experienced the "exploration vs exploitation" tension: you want to choose the arm with higher win rate (exploit), but you worry the other arm might actually be better (explore). In 4x4 GridWorld, we ran the full Q-learning loop:

python
Q = np.zeros((n_states, n_actions))  # initialize Q-table

for episode in range(1000):
    state = env.reset()
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit the current best action
        action = epsilon_greedy(Q[state], epsilon)
        next_state, reward, done = env.step(action)
        # update Q-values using TD error
        td_target = reward + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (td_target - Q[state, action])
        state = next_state

This snippet contains all essential RL elements: store value estimates in a table Q(s,a)Q(s,a), balance exploration and exploitation with ε\varepsilon-greedy, and update using TD error. When the state space becomes large (for example, from a 16-cell grid to Atari frames with 210×160210 \times 160 pixels), the table no longer fits. This is exactly the problem DQN solves.

Chapter 4: DQN - The Leap from Tables to Neural Networks

The Curse of Dimensionality and Function Approximation

CartPole has continuous states, but only 4 dimensions. Atari frames, by contrast, are pixel tensors of size 84×84×484 \times 84 \times 4, and the state space has on the order of 25628224256^{28224} possibilities, a number larger than the number of atoms in the observable universe. It is impossible to store Q-values for every state in a table.

DQN solves this by approximating the Q-function with a neural network. The network takes state ss as input and outputs Q-values for each action Q(s,a1),Q(s,a2),Q(s, a_1), Q(s, a_2), \ldots. The training objective is to minimize squared TD error:

L(θ)=E[(r+γmaxaQ(s,a;θ)Q(s,a;θ))2]\mathcal{L}(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]

Here θ\theta are the parameters of the online network, and θ\theta^- are the parameters of the target network. This loss means: make the network's prediction Q(s,a)Q(s,a) match "one-step reward plus the next-step maximum Q-value."

The Three Key Components of DQN

A neural network alone is not enough. If we train directly on each new interaction step, consecutive samples are highly correlated (they come from adjacent time steps), and training becomes unstable. DQN introduces three important design choices:

Experience replay stores each step (s,a,r,s)(s, a, r, s') into a large buffer, and then samples random mini-batches for training. This breaks temporal correlations and makes gradient updates closer to i.i.d. assumptions.

python
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

Target networks keep a delayed copy of the online network. Every so often, we hard-copy the online parameters into the target network. This keeps TD targets stable for a while and avoids the difficulty of "chasing a moving target." Think of learning to shoot: if the hoop moves every second, it is hard to learn; if it moves only occasionally, you have time to adapt.

ε\varepsilon-greedy exploration gradually reduces the probability of random exploration over training, transitioning from broad exploration early to fine-grained exploitation later.

The DQN Family

The original DQN became famous in the 2015 Atari paper, and many improvements followed. Double DQN decouples "action selection" and "action evaluation": use the online network θ\theta to pick argmaxaQ(s,a;θ)\arg\max_{a'} Q(s', a'; \theta), then use the target network θ\theta^- to evaluate that action. This reduces overestimation bias in original DQN, analogous to not letting the same person both set an exam and grade it.

Dueling DQN decomposes Q-values into state value V(s)V(s) and advantage A(s,a)A(s,a):

Q(s,a)=V(s)+A(s,a)1AaA(s,a)Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')

When all actions in a state are similarly good, A(s,a)0A(s,a) \approx 0, and the network mainly needs to learn V(s)V(s). This improves efficiency.

Chapter 5: Policy Gradients - Learning the Policy Directly

From Value-Based to Policy-Based

DQN follows an indirect route: learn Q(s,a)Q(s,a), then choose actions via argmax\arg\max. This route has a fundamental limitation: it naturally handles only discrete and finite action spaces. A robot arm's torques are continuous values, e.g. [10,10]6[-10, 10]^6, and you cannot assign a Q-value to every combination. Text generation in LLMs is even more extreme: at each step you choose from tens of thousands of tokens.

Policy gradient methods take a different route: instead of learning value functions, they parameterize the policy directly as πθ(as)\pi_\theta(a|s), then optimize θ\theta to maximize expected return. The policy gradient theorem states that the gradient of J(θ)=Eπθ[Gt]J(\theta) = \mathbb{E}_{\pi_\theta}[G_t] can be estimated as:

θJ(θ)=Eπθ[tθlogπθ(atst)Gt]\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]

This formula has a clean intuition: θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t|s_t) points in the direction that makes action ata_t more likely under state sts_t, while GtG_t tells you whether that direction is good. If Gt>0G_t > 0, increase the action probability; if Gt<0G_t < 0, decrease it. That is the whole REINFORCE algorithm.

python
def reinforce_update(policy, optimizer, states, actions, returns):
    log_probs = []
    for s, a in zip(states, actions):
        dist = Categorical(policy(s))
        log_probs.append(dist.log_prob(a))

    loss = 0
    for log_prob, G in zip(log_probs, returns):
        loss += -log_prob * G  # minus sign: gradient descent = maximize return
    loss /= len(returns)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Baselines and Variance

REINFORCE has a fatal weakness: high variance. If every episode return GtG_t is around 100, then even a "good" action receives a large positive signal, just slightly less large than others. This makes gradient estimates unstable.

The standard fix is to introduce a baseline b(s)b(s) and modify the gradient to:

θJ(θ)=Eπθ[θlogπθ(atst)(Gtb(st))]\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]

As long as b(s)b(s) does not depend on the action aa, this modification does not change the expected gradient (because Eaπ[logπ(as)]=0\mathbb{E}_{a \sim \pi}[\nabla \log \pi(a|s)] = 0), but it can greatly reduce variance. The most common baseline is the state-value function V(s)V(s), i.e., the Critic.

The Actor-Critic Architecture

Combining the Actor (policy network) and the Critic (value network) yields the Actor-Critic architecture. The Actor selects actions, and the Critic evaluates "how much better this action is than average." That quantity is the advantage function:

A(s,a)=Q(s,a)V(s)A(s, a) = Q(s, a) - V(s)

python
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
        )
        self.actor = nn.Sequential(
            nn.Linear(128, action_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        features = self.shared(x)
        return self.actor(features), self.critic(features)

During training, the Actor updates the policy using an advantage estimate A=r+γV(s)V(s)A = r + \gamma V(s') - V(s), while the Critic updates the value estimate using squared TD error:

python
# update after one environment step
_, next_value = model(next_state)
td_target = reward + gamma * next_value * (1 - done)
td_error = td_target - value

actor_loss = -log_prob * td_error.detach()  # Actor: update policy using advantage
critic_loss = td_error ** 2                 # Critic: update value using TD error
loss = actor_loss + critic_loss

Chapter 6: PPO - Making Policy Updates More Stable

Trust Regions and Clipping

Actor-Critic is more stable than REINFORCE, but policy updates can still be too large. If a single update changes the policy dramatically, previously collected data becomes irrelevant, and training can oscillate violently.

PPO (Proximal Policy Optimization) uses an elegant clipping mechanism to address this. It defines the policy ratio:

rt(θ)=πθ(atst)πold(atst)r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}

When rt=1r_t = 1, the new and old policies match. When rt>1r_t > 1, the new policy is more likely to choose the action; when rt<1r_t < 1, it is less likely. PPO maximizes:

LCLIP(θ)=Et[min(rt(θ)A^t,  clip(rt(θ),1ε,1+ε)A^t)]L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t,\;\text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t\right)\right]

Typically ε=0.2\varepsilon=0.2. When A^t>0\hat{A}_t > 0 (good actions), PPO allows rtr_t to increase at most to 1+ε1+\varepsilon, preventing overly aggressive probability increases. When A^t<0\hat{A}_t < 0 (bad actions), PPO allows rtr_t to decrease at most to 1ε1-\varepsilon. This "safety rail" keeps updates within a trust region.

The full PPO loss also includes two additional terms:

L(θ)=LCLIP(θ)c1LVF(θ)+c2H[πθ]L(\theta) = L^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2 \mathcal{H}[\pi_\theta]

where LVFL^{\text{VF}} is the Critic's value-fitting loss (MSE) and H\mathcal{H} is the policy entropy bonus. The entropy term encourages exploration and prevents premature collapse to suboptimal deterministic behavior.

GAE: Balancing Bias and Variance

How we estimate advantage A^t\hat{A}_t matters greatly for PPO. Two naive extremes are:

  • use one-step TD error δt=rt+γV(st+1)V(st)\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t): high bias but low variance;
  • use full-trajectory return GtV(st)G_t - V(s_t): unbiased but high variance.

GAE (Generalized Advantage Estimation) interpolates between them:

A^tGAE=k=0(γλ)kδt+k\hat{A}_t^{\text{GAE}} = \sum_{k=0}^{\infty} (\gamma \lambda)^k \delta_{t+k}

Here λ[0,1]\lambda \in [0, 1] controls the interpolation. λ=0\lambda = 0 reduces to one-step TD (large bias), and λ=1\lambda = 1 reduces to full returns (large variance). In practice, λ=0.95\lambda = 0.95 is common.

python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0
        delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages.insert(0, gae)
    return advantages

From LunarLander to LLMs

PPO was first validated on classic RL environments such as LunarLander, but its real power shows up in LLM alignment. In RLHF, PPO simultaneously manages four models: Actor (the language model being trained), Critic (value network), Reference (the frozen original model used for KL regularization), and Reward Model (the judge scoring responses). The Bradley-Terry preference model defines the reward-model training objective:

P(ywylx)=σ(r(x,yw)r(x,yl))P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))

This framework is the starting point of Chapter 8, and DPO's key observation is that this framework can be optimized without explicitly training a reward model.

Summary

Part 2 followed a full theoretical arc: MDP provides the language of decision problems -> Bellman equations provide a recursive method to compute values -> DQN approximates QQ with neural networks and resolves the curse of dimensionality -> policy gradients optimize policies directly and support continuous actions -> Actor-Critic introduces a Critic to reduce variance -> PPO uses clipping and GAE for stable training.

Every concept on this path will reappear in later LLM alignment chapters. Understanding PPO clipping helps you understand GRPO's within-group normalization; understanding the Actor-Critic division of labor helps you understand the roles of the four models in RLHF.

Next stop: Part 3: The LLM Era

现代强化学习实战课程