Chapter 5: Policy-Based Methods: Policy Gradients and REINFORCE

In Chapter 4, we followed Route 1: learn $Q(s,a)$ to score each action, then pick the action with the highest score (review: Q(s,a) and the greedy policy). DQN performs well on CartPole and Atari, but it has a fundamental limitation:

it can only handle a finite set of discrete actions.

CartPole has only two choices, "push left" and "push right". DQN computes a $Q$ value for each action and takes the maximum. But what if we want to control a robotic arm? The shoulder, elbow, and wrist joints each have multiple degrees of freedom, and each degree can apply a continuous torque. The set of possible action combinations is infinite, so it is impossible to compute a $Q$ value for every combination. The situation is even more obvious in large language model text generation: at every step we sample from tens of thousands of tokens. The policy itself is a continuous probability distribution, so the $\arg\max$ mindset does not really apply.

Learning The Policy Directly

If "score first, then choose" is not viable, we take a different route:

skip $Q$ values and learn the policy $\pi_\theta(a|s)$ directly.

Instead of asking "how many points is each action worth?", we learn "what to do in what situation".

This is exactly the core idea of Chapter 3's Route 2: the policy objective $J(\theta)$ : define a policy objective function $J(\theta)$ , then directly optimize the parameters $\theta$ to maximize $J(\theta)$ . In this chapter, we will go deeper along this route, moving from the policy gradient theorem to the REINFORCE algorithm, and then to variance reduction via baselines.

Prerequisites (Quick Review)

We will repeatedly use the following concepts in this chapter. If any of them feels fuzzy, click through for a quick refresh before continuing:

Policy $\pi_\theta$ and objective $J(\theta)$ : how do we represent a parameterized policy, and how do we define "how good" a policy is?
Monte Carlo (MC) methods: "finish an entire episode, then look back", the sampling foundation behind REINFORCE
State value $V(s)$ : "how many points on average can I get?", the source of baselines

Main Thread Of This Chapter

We will develop the chapter along two parallel threads. The first is theory: from the policy gradient theorem to the REINFORCE algorithm, then to baseline variance reduction and the advantage function. The second is practice: we will first get vanilla REINFORCE running on CartPole, observe the high-variance behavior, and then add a value baseline to compare the results.

Section	Core Question
Why Policy Gradients Are Necessary	Where does DQN's $\arg\max$ break down? Why learn the policy directly?
The REINFORCE Algorithm	What is the mathematical form of the policy gradient theorem? How implement REINFORCE?
Hands-on: CartPole	How does REINFORCE perform on a real control task? What does high variance look like?
Improving Policy Gradients	Why do baselines reduce variance? Why is $V(s)$ an optimal baseline?
Hands-on: CartPole Ablation	What is the practical effect of a value baseline? Look from reward and variance.

Let's begin with the motivation for policy gradients: from DQN to policy gradients.

1. CartPole Balancing

2. DPO Preference Tuning

3. MDP and Value Functions

4. Deep Q-Networks

5. Policy-Based Methods

6. Actor-Critic

7. PPO

8. The RLHF Pipeline

9. Post-Training Alignment

10. Agentic RL

11. VLM Reinforcement Learning

12. Future Trends

B. RL Engineering Practice

C. Code Cheatsheet

E. Math Foundations for RL

E.1 Linear Algebra

E.2 Probability & Estimation

E.3 Calculus & Optimization

E.4 Information Theory

Chapter 5: Policy-Based Methods: Policy Gradients and REINFORCE

Learning The Policy Directly

Main Thread Of This Chapter

E.1 Linear Algebra

E.2 Probability & Estimation

E.3 Calculus & Optimization

E.4 Information Theory

Chapter 5: Policy-Based Methods: Policy Gradients and REINFORCE ​

Learning The Policy Directly ​

Main Thread Of This Chapter ​

Chapter 5: Policy-Based Methods: Policy Gradients and REINFORCE

Learning The Policy Directly

Main Thread Of This Chapter