
Chapter 6: Actor-Critic, Where Two Lines of Thought Converge

Chapter 4 followed Line 1 (Value-Based): learn $Q(s,a)$ and pick the action with the highest score (review: Q(s,a) and the Greedy Policy). This approach tends to score actions accurately, but it explores poorly and, in the form covered there, handles only discrete actions. Chapter 5 followed Line 2 (Policy-Based): directly optimize $J(\theta)$ (review: Policy Objective). This approach explores well and supports continuous actions, but its gradient estimates have high variance: run the same policy twice, and the two estimates can differ wildly.

At the end of the previous chapter, we found a key clue: subtracting a baseline reduces variance (review: Policy Gradient Improvements), and the best baseline is $V(s)$ (review: State-Value Function). But $V(s)$ itself must be learned, which means we need a dedicated network to estimate it. This network is the Critic.
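To make the variance claim concrete, here is a minimal sketch, not the course's own code. It assumes a hypothetical one-state problem with two actions, made-up rewards of about 10 and 11, and a uniform softmax policy. It samples the policy-gradient estimate for one logit twice: once with raw rewards, and once after subtracting a baseline equal to the state value under that policy (10.5 here).

```python
import random
import statistics

def grad_samples(baseline, n=20000, seed=0):
    """Sample policy-gradient estimates (w.r.t. the logit of action 0)
    for a hypothetical one-state, two-action problem."""
    rng = random.Random(seed)
    p0 = 0.5                      # pi(action 0) under a uniform softmax policy
    mean_reward = [10.0, 11.0]    # made-up rewards, for illustration only
    samples = []
    for _ in range(n):
        a = 0 if rng.random() < p0 else 1
        r = mean_reward[a] + rng.gauss(0.0, 1.0)
        # Softmax score function: d log pi(a) / d logit_0 = 1[a == 0] - pi(0)
        grad_log_pi = (1.0 if a == 0 else 0.0) - p0
        samples.append(grad_log_pi * (r - baseline))
    return samples

no_baseline = grad_samples(baseline=0.0)
with_baseline = grad_samples(baseline=10.5)  # 10.5 is V(s) under the uniform policy

mean_no = statistics.fmean(no_baseline)
mean_b = statistics.fmean(with_baseline)
var_no = statistics.pvariance(no_baseline)
var_b = statistics.pvariance(with_baseline)
print(f"mean without baseline: {mean_no:.3f}, with baseline: {mean_b:.3f}")
print(f"variance without baseline: {var_no:.3f}, with baseline: {var_b:.3f}")
```

Both versions should agree on the mean up to sampling noise, since the baseline does not bias the gradient, while the variance with the baseline should be far smaller. In this toy setting the baseline is known in closed form. In a real task it is not, which is why a learned Critic is needed.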

In this chapter, we will stitch the two lines together: train a Critic using the methods from Line 1 to evaluate how good an action is, and train an Actor using the methods from Line 2 to choose actions. This is the Actor-Critic architecture.
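As a preview of how the two pieces fit together, the following is a minimal tabular sketch under stated assumptions, not the implementation developed later in the chapter. The environment is a hypothetical four-cell corridor with the goal at the right end and a reward of 1 for reaching it. The Critic is a table `V`, the Actor is a table of softmax logits, and all hyperparameters are illustrative. The Critic is updated with a one-step TD error, and the same TD error scales the Actor's policy-gradient step.

```python
import math
import random

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(episodes=3000, gamma=0.9, alpha_v=0.1, alpha_pi=0.1, seed=0):
    rng = random.Random(seed)
    n_states, goal = 3, 3                 # states 0..2 are non-terminal; 3 is the goal
    V = [0.0] * n_states                  # Critic: state-value estimates
    logits = [[0.0, 0.0] for _ in range(n_states)]  # Actor: [left, right] per state
    for _episode in range(episodes):
        s = 0
        for _step in range(100):          # cap episode length
            pi = softmax(logits[s])
            a = 0 if rng.random() < pi[0] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            done = s_next == goal
            r = 1.0 if done else 0.0
            v_next = 0.0 if done else V[s_next]
            # One-step TD error: how much better the outcome was than V predicted
            delta = r + gamma * v_next - V[s]
            # Critic update (Line 1): move V(s) toward the TD target
            V[s] += alpha_v * delta
            # Actor update (Line 2): policy-gradient step weighted by the TD error
            for b in range(2):
                grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
                logits[s][b] += alpha_pi * delta * grad_log_pi
            if done:
                break
            s = s_next
    return V, logits

V, logits = train()
p_right = [softmax(row)[1] for row in logits]
print("P(right) per state:", [round(p, 3) for p in p_right])
print("V per state:", [round(v, 3) for v in V])
```

After training, the Actor should prefer moving right in every state, and the Critic's values should increase toward the goal. Tables are used here only to keep the sketch dependency-free; the chapter replaces both with neural networks.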

Prerequisites (Quick Review)

This chapter is a synthesis of everything we have built so far. The following concepts will appear repeatedly:

Chapter Roadmap

| Section | Core Question |
| --- | --- |
| Advantage Function | What is the advantage function $A(s,a)$? Why is it better than using $G_t$ directly? |
| Training the Critic | How do we train a Critic to estimate $V(s)$? Concrete implementations of DP/MC/TD |
| Actor-Critic Architecture | How do the Actor and Critic collaborate? How does TD Error replace $G_t$? |
| Frontier-Scale Applications of Actor-Critic | AlphaStar, SAC robots, Isaac Lab: how AC lands in industrial-scale practice |
| Hands-on: Pendulum Balancing | How does Actor-Critic handle continuous action spaces? |
| Hands-on: BipedalWalker | Can Actor-Critic learn complex continuous control? |

Let’s begin with the advantage function. It is the bridge that connects the Actor and the Critic: Advantage Function.

Modern Reinforcement Learning: A Hands-on Course