Chapter 6: Actor-Critic, Where Two Lines of Thought Converge

Chapter 4 followed Line 1 (Value-Based): learn $Q(s,a)$ and pick the action with the highest score (review: Q(s,a) and the Greedy Policy). This tends to score actions accurately, but it explores poorly, and because it selects actions by taking a maximum over them, it is practical mainly for discrete action spaces. Chapter 5 followed Line 2 (Policy-Based): directly optimize $J(\theta)$ (review: Policy Objective). This explores well and supports continuous actions, but its gradient estimates have high variance: run the same policy twice, and the two estimates can be wildly different.

At the end of the previous chapter, we found a key clue: subtracting a baseline reduces variance without biasing the gradient (review: Policy Gradient Improvements), and a natural choice of baseline is $V(s)$ (review: State-Value Function). But $V(s)$ is not given to us. It must be learned, which means we need a dedicated network to estimate it. This network is the Critic.
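To make the variance claim concrete, here is a minimal sketch on a made-up two-armed bandit (a single state, so $V(s)$ is just the average reward under the policy). The reward means, the noise level, and the helper name `grad_samples` are all assumptions chosen for illustration, not part of the earlier chapters. It draws many single-sample policy-gradient estimates for one softmax parameter, once with no baseline and once with $V(s)$ subtracted.

```python
import math
import random
import statistics


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_samples(baseline, n_samples=5000, seed=0):
    """Single-sample gradient estimates for theta_0 of a softmax policy.

    Hypothetical bandit: action 0 pays about 10, action 1 pays about 11.
    """
    rng = random.Random(seed)
    probs = softmax([0.0, 0.0])      # uniform policy over two actions
    mean_reward = [10.0, 11.0]       # assumed reward means
    samples = []
    for _ in range(n_samples):
        action = 0 if rng.random() < probs[0] else 1
        reward = rng.gauss(mean_reward[action], 1.0)
        # d/d theta_0 of log pi(action) for softmax: 1[action == 0] - pi(0)
        score = (1.0 if action == 0 else 0.0) - probs[0]
        samples.append(score * (reward - baseline))
    return samples


# V(s) under the uniform policy: the average of the two reward means.
v = 0.5 * 10.0 + 0.5 * 11.0

no_baseline = grad_samples(baseline=0.0)
with_baseline = grad_samples(baseline=v)

mean_no = statistics.fmean(no_baseline)
mean_with = statistics.fmean(with_baseline)
var_no = statistics.pvariance(no_baseline)
var_with = statistics.pvariance(with_baseline)

print(f"mean without baseline: {mean_no:.3f}, variance: {var_no:.3f}")
print(f"mean with baseline:    {mean_with:.3f}, variance: {var_with:.3f}")
```

Both runs estimate the same gradient on average, but the version with the baseline should show a much smaller spread, because the baseline removes the large constant part of the reward that carries no information about which action was better.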

In this chapter, we will stitch the two lines together: train a Critic using the methods from Line 1 to evaluate how good an action is, and train an Actor using the methods from Line 2 to choose actions. This is the Actor-Critic architecture.
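As a preview of how the two parts fit together, the following is a minimal tabular sketch, not the implementation developed later in the chapter. It assumes a hypothetical four-state corridor (start at state 0, reward 1 on reaching state 3), uses a table instead of a network for both Actor and Critic, and omits the discount-power factor that some textbook formulations apply to the Actor update. The Critic learns $V(s)$ from the TD error, and the Actor nudges its action preferences in the direction the TD error suggests.

```python
import math
import random

N_STATES = 4            # states 0..3; state 3 is the terminal goal (toy task)
LEFT, RIGHT = 0, 1
GAMMA = 0.9


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move along the corridor; reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=2000, alpha_critic=0.1, alpha_actor=0.1, seed=0):
    rng = random.Random(seed)
    values = [0.0] * N_STATES                      # Critic: table of V(s)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # Actor: action preferences
    for _episode in range(episodes):
        state = 0
        for _t in range(50):                       # cap on episode length
            probs = softmax(prefs[state])
            action = LEFT if rng.random() < probs[LEFT] else RIGHT
            next_state, reward, done = step(state, action)

            # Critic (Line 1): TD error delta = r + gamma * V(s') - V(s)
            bootstrap = 0.0 if done else GAMMA * values[next_state]
            delta = reward + bootstrap - values[state]
            values[state] += alpha_critic * delta

            # Actor (Line 2): step along grad log pi(a|s), scaled by delta
            for a in (LEFT, RIGHT):
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                prefs[state][a] += alpha_actor * delta * grad_log_pi

            if done:
                break
            state = next_state
    return values, prefs


values, prefs = train()
for s in range(N_STATES - 1):
    p_right = softmax(prefs[s])[RIGHT]
    print(f"state {s}: V = {values[s]:.2f}, pi(right) = {p_right:.2f}")
```

The only thing passed from Critic to Actor is the scalar `delta`: a positive value makes the action just taken more likely, a negative value makes it less likely.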

Prerequisites (Quick Review)

This chapter is a synthesis of everything we have built so far. The following concepts will appear repeatedly:

State-value $V(s)$ and the Bellman equation: the theoretical foundation of the Critic. $V^\pi(s)$ measures "starting from state $s$ and following policy $\pi$, how many points do we get on average?"
Action-value $Q(s,a)$: the difference between $Q$ and $V$ is the advantage function, $A(s,a) = Q(s,a) - V(s)$
DP / MC / TD: three ways to estimate values, which become three concrete strategies for training the Critic
TD Error $\delta = r + \gamma V(s') - V(s)$: the Critic's core training signal
Policy objective $J(\theta)$ and policy gradients: the Actor's optimization target
REINFORCE and baselines: why we need $V(s)$ as a baseline
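A small worked example can help connect two items in the list above: the TD error and the advantage. The numbers below are invented for illustration (one state `s`, one action with two equally likely outcomes, and assumed values for the next states). The sketch computes the TD error for each possible transition and checks that their average equals $Q(s,a) - V(s)$, where $Q(s,a)$ is taken to be the expected reward plus the discounted expected next-state value under the same $V$.

```python
import math

GAMMA = 0.9

# Assumed state values for a toy example (not taken from any environment).
V = {"s": 5.0, "s_good": 10.0, "s_bad": 0.0}

# Taking the action in state "s": (probability, reward, next_state).
transitions = [
    (0.5, 1.0, "s_good"),
    (0.5, 1.0, "s_bad"),
]


def td_error(reward, next_state, state="s"):
    """delta = r + gamma * V(s') - V(s) for a single observed transition."""
    return reward + GAMMA * V[next_state] - V[state]


# TD error of each individual transition.
deltas = [td_error(r, s_next) for _, r, s_next in transitions]

# Expected TD error, weighting each transition by its probability.
expected_delta = sum(p * d for (p, _, _), d in zip(transitions, deltas))

# Q(s, a) = E[r + gamma * V(s')], and the advantage A(s, a) = Q(s, a) - V(s).
q_value = sum(p * (r + GAMMA * V[s_next]) for p, r, s_next in transitions)
advantage = q_value - V["s"]

print("TD errors per transition:", deltas)
print("expected TD error:", expected_delta)
print("advantage Q - V:", advantage)
```

A single TD error is a noisy, one-sample reading (here 5 for the good outcome and -4 for the bad one), but on average it matches the advantage of 0.5. That is the sense in which the Critic's training signal can also serve as the Actor's learning signal.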

Chapter Roadmap

Section	Core Question
Advantage Function	What is the advantage function $A(s,a)$? Why is it better than using $G_t$ directly?
Training the Critic	How do we train a Critic to estimate $V(s)$? Concrete implementations of DP/MC/TD
Actor-Critic Architecture	How do the Actor and Critic collaborate? How does the TD Error replace $G_t$?
Frontier-Scale Applications of Actor-Critic	AlphaStar, SAC robots, Isaac Lab: how Actor-Critic is used in industrial-scale practice
Hands-on: Pendulum Balancing	How does Actor-Critic handle continuous action spaces?
Hands-on: BipedalWalker	Can Actor-Critic learn complex continuous control?

Let’s begin with the advantage function, the bridge that connects the Actor and the Critic (next: Advantage Function).
