Chapter 6: Actor-Critic, Where Two Lines of Thought Converge
Chapter 4 followed Line 1 (Value-Based): learn $Q(s,a)$ and pick the action with the highest score (review: Q(s,a) and the Greedy Policy). This tends to produce accurate scoring, but it explores poorly and can only handle discrete actions. Chapter 5 followed Line 2 (Policy-Based): directly optimize the policy $\pi_\theta$ (review: Policy Objective). This explores well and supports continuous actions, but its gradient estimates have high variance: run the same policy twice, and the estimates can differ wildly.
At the end of the previous chapter, we found a key clue: subtracting a baseline reduces variance (review: Policy Gradient Improvements), and the natural baseline is the state value $V^\pi(s)$ (review: State-Value Function). But $V^\pi(s)$ itself must be learned, which means we need a dedicated network to estimate it. This network is the Critic.
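To make the variance claim concrete, here is a minimal sketch on a two-armed bandit. The bandit, its reward means (10 and 11), and the uniform softmax policy are assumptions chosen for illustration and do not come from earlier chapters. The sketch draws many single-sample policy-gradient estimates, once without a baseline and once with the baseline set to the expected reward under the policy (10.5), which plays the role of $V$ here.

```python
import numpy as np

def gradient_samples(baseline, n_samples=20_000, seed=0):
    """Single-sample policy-gradient estimates for a toy two-armed bandit.

    Hypothetical setup: softmax policy with equal logits, so pi = [0.5, 0.5];
    arm rewards are N(10, 1) and N(11, 1).
    """
    rng = np.random.default_rng(seed)
    probs = np.array([0.5, 0.5])
    mean_reward = np.array([10.0, 11.0])

    actions = rng.choice(2, size=n_samples, p=probs)
    rewards = mean_reward[actions] + rng.normal(size=n_samples)

    # For a softmax policy, grad of log pi(a) w.r.t. the logits is one_hot(a) - pi.
    grad_log_pi = np.eye(2)[actions] - probs

    # REINFORCE-style estimate: grad log pi(a) * (reward - baseline).
    return grad_log_pi * (rewards - baseline)[:, None]

plain = gradient_samples(baseline=0.0)
centered = gradient_samples(baseline=10.5)  # 10.5 = expected reward under pi

print("mean without baseline:", plain.mean(axis=0))
print("mean with baseline:   ", centered.mean(axis=0))
print("variance without baseline:", plain.var(axis=0))
print("variance with baseline:   ", centered.var(axis=0))
```

Both estimators point in roughly the same average direction, since the baseline does not change the expected gradient, while the per-sample variance with the baseline is far smaller. In a real problem the expected return is not known in advance, which is why it has to be learned.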
In this chapter, we will stitch the two lines together: train a Critic using the methods from Line 1 to evaluate how good an action is, and train an Actor using the methods from Line 2 to choose actions. This is the Actor-Critic architecture.
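As a preview, the division of labor can be sketched in tabular form. This is a minimal one-step actor-critic under stated assumptions, not the implementation developed later in the chapter: the five-state corridor environment, the `step` function, and all hyperparameters are hypothetical and chosen only to keep the example small. The Critic is a table `V` updated with a TD target (Line 1), and the Actor is a table of softmax preferences `theta` updated along the policy gradient, scaled by the Critic's TD error (Line 2).

```python
import numpy as np

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is terminal
GAMMA = 0.9         # discount factor (assumed)
ALPHA_CRITIC = 0.1  # Critic step size (assumed)
ALPHA_ACTOR = 0.2   # Actor step size (assumed)

def step(state, action):
    """Toy environment for illustration. Action 0 moves left, 1 moves right."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def train(n_episodes=2000, max_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))  # Actor: action preferences per state
    V = np.zeros(N_STATES)           # Critic: state-value estimates
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            pi = softmax(theta[s])
            a = rng.choice(2, p=pi)
            s_next, r, done = step(s, a)

            # Critic: one-step TD target and TD error.
            target = r if done else r + GAMMA * V[s_next]
            delta = target - V[s]
            V[s] += ALPHA_CRITIC * delta

            # Actor: grad of log pi(a|s) for softmax is one_hot(a) - pi,
            # scaled by the Critic's TD error.
            grad_log_pi = -pi
            grad_log_pi[a] += 1.0
            theta[s] += ALPHA_ACTOR * delta * grad_log_pi

            if done:
                break
            s = s_next
    return theta, V

theta, V = train()
p_right = np.array([softmax(theta[s])[1] for s in range(N_STATES - 1)])
print("P(right) per state:", np.round(p_right, 3))
print("Critic values:", np.round(V[:-1], 3))
```

The Actor never sees the reward directly. Every policy update is driven by the Critic's TD error, and that coupling is the idea the rest of the chapter develops.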
Prerequisites (Quick Review)
This chapter is a synthesis of everything we have built so far. The following concepts will appear repeatedly:
- State-value $V^\pi(s)$ and the Bellman equation: the theoretical foundation of the Critic. $V^\pi(s)$ measures "starting from state $s$, how many points do we get on average?"
- Action-value $Q^\pi(s,a)$: the difference between $Q^\pi(s,a)$ and $V^\pi(s)$ is the advantage function $A^\pi(s,a)$
- DP / MC / TD: three ways to estimate values, and therefore three concrete strategies for training the Critic
- TD Error $\delta$: the Critic's core training signal
- Policy objective and policy gradients: the Actor's optimization target
- REINFORCE and baselines: why we need $V^\pi(s)$ as a baseline
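For reference, the quantities in the list above are usually written as follows. This is standard textbook notation and is assumed here rather than taken from the earlier chapters: $G_t$ is the discounted return from step $t$, $\gamma$ is the discount factor, and $r_t$ is the reward received after taking $a_t$ in $s_t$ (some texts index this reward as $r_{t+1}$).

$$
\begin{aligned}
V^\pi(s) &= \mathbb{E}_\pi\left[\, G_t \mid s_t = s \,\right] \\
Q^\pi(s,a) &= \mathbb{E}_\pi\left[\, G_t \mid s_t = s,\ a_t = a \,\right] \\
A^\pi(s,a) &= Q^\pi(s,a) - V^\pi(s) \\
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t)
\end{aligned}
$$

In the last line, $V$ without a superscript denotes the Critic's current estimate of $V^\pi$.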
Chapter Roadmap
| Section | Core Question |
|---|---|
| Advantage Function | What is the advantage function $A(s,a)$? Why is it better than using $Q(s,a)$ directly? |
| Training the Critic | How do we train a Critic to estimate $V(s)$? Concrete implementations of DP/MC/TD |
| Actor-Critic Architecture | How do the Actor and Critic collaborate? How does the TD Error replace $A(s,a)$? |
| Frontier-Scale Applications of Actor-Critic | AlphaStar, SAC robots, Isaac Lab: how AC lands in industrial-scale practice |
| Hands-on: Pendulum Balancing | How does Actor-Critic handle continuous action spaces? |
| Hands-on: BipedalWalker | Can Actor-Critic learn complex continuous control? |
Let's begin with the advantage function, the bridge that connects the Actor and the Critic (next: Advantage Function).