6.3 Actor-Critic Architecture
In the previous two sections we met the advantage function and the training method for the Critic. Now let's assemble all the parts and see how the Actor and the Critic collaborate.
Prerequisites for This Section
- Advantage function -- "How much better is this action than the average?"
- TD Error -- a practical estimate of the advantage
- Policy gradient -- the Actor's update formula
- REINFORCE and baselines -- motivation for going from $G_t$ to $G_t - b(s_t)$
From REINFORCE to Actor-Critic
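Recall the gradient formula of REINFORCE from Chapter 5 (review: policy gradient theorem):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

$G_t$ is the cumulative return over the full trajectory -- this is precisely why REINFORCE has high variance. The baseline analysis in Chapter 5 showed that subtracting a baseline $b(s_t)$ reduces variance. In the previous section we also found that we need not wait for the episode to end -- the TD Error $\delta_t$ can replace $G_t - b(s_t)$ as an advantage estimate:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t\right], \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$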
This substitution changes the method along several dimensions at once:
| | REINFORCE | Actor-Critic |
|---|---|---|
| Advantage estimate | $G_t$ (MC, needs full trajectory) | $\delta_t$ (TD, update after one step) |
| Update timing | after the episode ends | after every step |
| Variance | high | lower |
| Bias | unbiased | biased (bias introduced by bootstrapping) |
| Cost | none | must train a Critic |
Numerical Comparison: Both Updates on the Same Scenario
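Consider CartPole. At time step $t$ the agent is in state $s_t$, chooses action "right" ($a_t = 1$), and the episode ends after 5 steps (counting this one). The trajectory is:

| Time | State | Action | Reward |
|---|---|---|---|
| $t$ | $s_t$ | right | 1.0 |
| $t+1$ | $s_{t+1}$ | right | 1.0 |
| $t+2$ | $s_{t+2}$ | left | 1.0 |
| $t+3$ | $s_{t+3}$ | right | 1.0 |
| $t+4$ | $s_{t+4}$ | right | 1.0 |

Take discount factor $\gamma$ (the training code later in this section uses $\gamma = 0.99$).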
REINFORCE computation. REINFORCE must wait until the episode ends before updating. It computes the full return starting from time $t$:

$$G_t = \sum_{k=0}^{4} \gamma^k r_{t+k} = 1 + \gamma + \gamma^2 + \gamma^3 + \gamma^4$$

This serves as the weight in the policy gradient. Suppose the current policy assigns probability $\pi_\theta(a_t \mid s_t)$ to the right action in state $s_t$. The log-probability is $\log \pi_\theta(a_t \mid s_t)$.

The policy gradient update becomes

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

The problem: on a different trajectory, $G_t$ could be 1.0 (the pole fell after one step) or 10.0 (the agent survived for a long time). The fluctuation in $G_t$ propagates directly into the gradient -- this is the source of REINFORCE's high variance.
REINFORCE Formula Symbol Table

| Symbol | Meaning |
|---|---|
| $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | Log-probability gradient w.r.t. policy parameters $\theta$; indicates which direction to adjust $\theta$ |
| $G_t$ | Full discounted return from time $t$ to the end of the episode |
| $r_t$ | Immediate reward received at time $t$ |
| $\gamma$ | Discount factor, controlling how fast future rewards decay |
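As a quick check of the return computation above, the sketch below sums the five rewards from the trajectory table. The value `gamma = 0.99` is an assumption borrowed from the training code later in this section, and `discounted_return` is an illustrative helper name, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return G_t = sum_k gamma^k * r_{t+k} for a finished trajectory."""
    g = 0.0
    # Walk backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # the five rewards in the trajectory table
gamma = 0.99                          # assumed discount factor
G_t = discounted_return(rewards, gamma)
print(round(G_t, 4))  # 4.901
```

The backward recursion is the same identity, $G_t = r_t + \gamma G_{t+1}$, that the Critic later approximates by replacing $G_{t+1}$ with $V(s_{t+1})$.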
Actor-Critic computation. Actor-Critic does not wait for the episode to end. Suppose the Critic estimates $V(s_t)$ and $V(s_{t+1})$ for the current and next states. After one step, the immediate reward $r_t = 1.0$ is received, and the TD Error can be computed immediately:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

When $\delta_t > 0$, the outcome of this step is better than the Critic originally expected. A positive TD Error then serves directly as the advantage estimate:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta_t$$

Using the same $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, the gradient is of comparable magnitude to REINFORCE's, but the weight is no longer the cumulative return over the entire trajectory -- it is a single-step TD Error. The range of fluctuation in $\delta_t$ is far smaller than that of $G_t$, because it contains only the randomness of one step rather than the accumulated randomness of an entire trajectory.
Actor-Critic Formula Symbol Table

| Symbol | Meaning |
|---|---|
| $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | Log-probability gradient w.r.t. policy parameters $\theta$ |
| $\delta_t$ | TD Error, serving as a one-step estimate of the advantage |
| $r_t$ | Immediate reward received in this step |
| $\gamma V(s_{t+1})$ | Discounted value estimate of the next state (the Critic's prediction of the future) |
| $V(s_t)$ | Critic's value estimate for the current state (used as the baseline) |
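The one-step TD Error is small enough to write out directly. In the minimal sketch below, the Critic estimates `v_s` and `v_next` are hypothetical values chosen only for illustration, and `td_error` is an illustrative helper, not a library function.

```python
def td_error(reward, value, next_value, gamma, terminated=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    When the episode has terminated there is no future to bootstrap from,
    so the gamma * V(s') term is dropped.
    """
    bootstrap = 0.0 if terminated else gamma * next_value
    return reward + bootstrap - value


# Hypothetical Critic estimates (assumptions, not values from the text).
v_s, v_next = 4.0, 4.5
delta = td_error(reward=1.0, value=v_s, next_value=v_next, gamma=0.99)
print(round(delta, 3))  # 1.455
```

Only quantities available after a single environment step appear in the function signature, which is what lets Actor-Critic update without waiting for the episode to finish.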
The core differences between the two methods can be summarized in a single comparison table:
| Computation step | REINFORCE | Actor-Critic |
|---|---|---|
| Update precondition | episode ends, full trajectory available | one step taken, $r_t$ and $s_{t+1}$ obtained |
| Advantage estimate | $G_t$ (5-step cumulative return) | $\delta_t$ (one-step TD Error) |
| Gradient weight | affected by randomness of the entire trajectory | affected by randomness of a single step only |
| Additional components needed | none | Critic providing $V(s_t)$ and $V(s_{t+1})$ |
| Per-step computation | small (no network forward pass) | larger (Critic requires an extra forward pass) |
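The variance row of this comparison can be illustrated with a toy simulation. The sketch below is not CartPole: it assumes a single-state chain that pays reward 1 per step and continues with probability `p_continue`, and it assumes a perfect Critic (the exact value `v`). All names and parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate(num_episodes=20000, p_continue=0.9, gamma=0.99, seed=0):
    """Compare the spread of G_t and delta_t for the first step of each episode."""
    rng = random.Random(seed)
    # Exact state value from V = 1 + gamma * p_continue * V.
    v = 1.0 / (1.0 - gamma * p_continue)
    returns, deltas = [], []
    for _ in range(num_episodes):
        alive = rng.random() < p_continue
        # One-step TD error: only this single transition's randomness enters.
        deltas.append(1.0 + (gamma * v if alive else 0.0) - v)
        # Monte Carlo return: keeps accumulating until the episode ends.
        g, discount = 1.0, gamma
        while alive:
            g += discount
            discount *= gamma
            alive = rng.random() < p_continue
        returns.append(g)
    return statistics.pvariance(returns), statistics.pvariance(deltas)


var_g, var_delta = simulate()
print(f"Var[G_t] = {var_g:.2f}, Var[delta_t] = {var_delta:.2f}")
```

With an imperfect Critic the TD Error also carries bias, which this toy setup deliberately leaves out; it isolates the variance effect only.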
Actor-Critic Architecture
Integrating the advantage function with Critic training yields one of the most classic architectures in reinforcement learning. The Actor is responsible for selecting actions, the Critic for evaluating how good they are. The two collaborate through the advantage function $A(s, a)$:
Actor-Critic Data Flow
state s
  |
  +--> Actor (policy network)
  |      pi(a|s) -> choose action a
  |                      |
  |               execute action a
  |                      |
  |                      v
  |        environment -> returns r, s'
  |                      |
  +--> Critic (value network)
  |      V(s)  ----------+
  |      V(s') ----------+
  |                      |
  |      delta = r + gamma*V(s') - V(s)
  |                      |
  |                      v
  |      Actor update:  theta <- theta + alpha * grad log pi(a|s) * delta
  |      Critic update: V(s) <- V(s) + alpha * delta
  |
  +--> next step, repeat the above process

Both networks share the same input (state $s$) but perform different tasks:
| Network | Role | Input | Output | Learning objective |
|---|---|---|---|---|
| Actor | select actions | state $s$ | action probabilities $\pi_\theta(a \mid s)$ | maximize cumulative reward |
| Critic | evaluate states | state $s$ | value estimate $V(s)$ | predict future return accurately |
If you look carefully at the Critic's update rule, $V(s) \leftarrow V(s) + \alpha\,\delta$ -- isn't this exactly TD learning from Chapter 3? The Critic is, in essence, a neural-network implementation of the value function $V(s)$ from Chapter 3, independently learning "how many points each state is worth." The Actor is a neural-network implementation of the policy $\pi_\theta(a \mid s)$, adjusting its behavior based on the evaluation provided by the Critic.
Two function approximators work in concert -- the Critic helps the Actor judge "how much better this action is than average," the Actor adjusts its policy accordingly, and the new policy generates new data that helps the Critic learn better. This is where the name Actor-Critic comes from.
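The loop described above can be sketched without neural networks. The example below is a tabular stand-in under stated assumptions: the Actor is a table of softmax logits `theta`, the Critic is a table `V`, and the state names, learning rates, and the helper `actor_critic_step` are all hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, terminated,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.1):
    """Apply one tabular Actor-Critic update in place and return delta."""
    bootstrap = 0.0 if terminated else gamma * V[s_next]
    delta = r + bootstrap - V[s]
    probs = softmax(theta[s])
    # For a softmax policy, d log pi(a|s) / d logit_b = 1[b == a] - pi(b|s).
    for b, p in enumerate(probs):
        grad_log_pi = (1.0 if b == a else 0.0) - p
        theta[s][b] += alpha_actor * grad_log_pi * delta  # Actor update
    V[s] += alpha_critic * delta                          # Critic update
    return delta


theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}  # logits for (left, right)
V = {"s0": 0.0, "s1": 0.0}
delta = actor_critic_step(theta, V, s="s0", a=1, r=1.0,
                          s_next="s1", terminated=False)
print(delta, softmax(theta["s0"]), V["s0"])
```

After this single positive-delta step, the probability of the chosen action in `s0` rises above 0.5 and `V["s0"]` moves toward the TD target, mirroring the two update arrows in the data-flow diagram.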
Complete Numerical Derivation of a Single Update Step
Let us walk through one complete Actor-Critic update step with a concrete scenario. In CartPole, suppose the state vector at some time step is $s_t$. The current model parameters are $\theta$. After a forward pass, the Actor and Critic produce:

| Component | Output | Value |
|---|---|---|
| Actor | action probabilities | $\pi_\theta(\cdot \mid s_t) = [\pi_\theta(a{=}0 \mid s_t),\ \pi_\theta(a{=}1 \mid s_t)]$ |
| Critic | state value | $V(s_t)$ |

Here $a = 0$ is "left" and $a = 1$ is "right".
Step 1: Sample an action. Sample from the distribution $\pi_\theta(\cdot \mid s_t)$ to obtain $a_t = 1$ (the second action). The corresponding log-probability is $\log \pi_\theta(a_t \mid s_t)$.

Step 2: Execute the action and observe the transition. The environment returns immediate reward $r_t$ and next state $s_{t+1}$.

Step 3: Critic evaluates the next state. Feed $s_{t+1}$ into the Critic to obtain $V(s_{t+1})$ (note: no gradient is computed here).

Step 4: Compute the TD target and TD Error.

$$y_t = r_t + \gamma V(s_{t+1}), \qquad \delta_t = y_t - V(s_t)$$

When $\delta_t > 0$, the actual outcome of this step exceeded the Critic's expectation, indicating that "choosing right in state $s_t$" was a better-than-average decision.
Step 5: Compute the Actor Loss.

$$L_{\text{actor}} = -\log \pi_\theta(a_t \mid s_t) \cdot \delta_t$$

Note that $\delta_t$ is marked as .detach() -- it participates in the Actor Loss as a constant and does not propagate gradients back through the Critic.

Actor Loss Formula Symbol Table

| Symbol | Meaning |
|---|---|
| $L_{\text{actor}}$ | Actor's loss function; taking its gradient is equivalent to the policy gradient |
| $\log \pi_\theta(a_t \mid s_t)$ | Log-probability of the chosen action; a differentiable function of $\theta$ |
| $\delta_t$ | TD Error, serving as the advantage estimate; does not participate in gradient computation for the Actor |
| negative sign | Converts gradient ascent into gradient descent: minimizing $L_{\text{actor}}$ is equivalent to maximizing $\log \pi_\theta(a_t \mid s_t) \cdot \delta_t$ |
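Step 6: Compute the Critic Loss.

$$L_{\text{critic}} = \delta_t^2 = \left(r_t + \gamma V(s_{t+1}) - V(s_t)\right)^2$$

This is the mean-squared-error form (here over a single sample) -- it drives $V(s_t)$ toward the TD target $r_t + \gamma V(s_{t+1})$.

Critic Loss Formula Symbol Table

| Symbol | Meaning |
|---|---|
| $L_{\text{critic}}$ | Critic's loss function, driving $V(s_t)$ toward the TD target |
| $\delta_t$ | TD Error, where $V(s_t)$ participates in the Critic's gradient computation |
| $(\cdot)^2$ | Squaring ensures that both positive and negative errors produce positive loss, with larger errors penalized more heavily |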
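Step 7: Total Loss and Backpropagation.

$$L = L_{\text{actor}} + L_{\text{critic}}$$

During backpropagation, gradients flow along two paths:
- Actor path: $\nabla_\theta L_{\text{actor}} = -\delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Since $\delta_t$ is treated as a constant, it only scales and directs the gradient -- when $\delta_t > 0$, the probability of right increases; when $\delta_t < 0$, it decreases.
- Critic path: $\nabla_\theta L_{\text{critic}} = -2\,\delta_t\, \nabla_\theta V(s_t)$. Since $V(s_t)$ is a differentiable function of $\theta$, the gradient directly adjusts the Critic's prediction to bring it closer to the TD target.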
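The two gradient paths can be checked by hand in a simplified setting. The sketch below assumes the Actor is a softmax over raw logits and the Critic's output $V(s_t)$ is itself the parameter being adjusted; it reuses the numbers from the code trace later in this section (probabilities 0.6/0.4, TD Error 1.78). Both helper functions are illustrative, not library calls.

```python
def actor_grad_wrt_logits(probs, action, delta):
    """Gradient of L_actor = -log pi(a|s) * delta w.r.t. the softmax logits.

    delta is treated as a constant, matching the .detach() in the text.
    """
    return [-delta * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]


def critic_grad_wrt_value(delta):
    """Gradient of L_critic = delta^2 w.r.t. V(s), TD target held constant."""
    return -2.0 * delta


probs, action, delta = [0.6, 0.4], 1, 1.78
g_logits = actor_grad_wrt_logits(probs, action, delta)
g_value = critic_grad_wrt_value(delta)
print(g_logits, g_value)
```

A gradient-descent step subtracts these gradients: the logit of "right" (index 1) has a negative gradient and therefore increases, and $V(s_t)$ has a negative gradient and therefore moves up toward the TD target.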
The complete computation chain for one update step:
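| Step | Input | Computation | Output |
|---|---|---|---|
| Forward | $s_t$ | Actor and Critic forward pass | $\pi_\theta(\cdot \mid s_t)$, $V(s_t)$ |
| Sample | $\pi_\theta(\cdot \mid s_t)$ | $a_t \sim \pi_\theta(\cdot \mid s_t)$ | $a_t$, $\log \pi_\theta(a_t \mid s_t)$ |
| Env | $a_t$ | environment step | $r_t$, $s_{t+1}$ |
| Evaluate | $s_{t+1}$ | Critic forward pass (no gradient) | $V(s_{t+1})$ |
| TD | $r_t$, $V(s_{t+1})$, $V(s_t)$ | $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ | $\delta_t$ |
| Loss | $\log \pi_\theta(a_t \mid s_t)$, $\delta_t$ | $L = -\log \pi_\theta(a_t \mid s_t) \cdot \delta_t + \delta_t^2$ | $L$ |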
Implementing Actor-Critic in PyTorch
Compared with REINFORCE, Actor-Critic adds a Critic network, but the overall structure remains clean:
import torch
import torch.nn as nn
import torch.optim as optim
import gymnasium as gym
import numpy as np

# ==========================================
# 1. Actor-Critic network (shared feature extractor)
# ==========================================
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # shared feature extraction layer
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
        )
        # Actor head: outputs action probabilities
        self.actor = nn.Sequential(
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )
        # Critic head: outputs state value
        self.critic = nn.Linear(128, 1)

    def forward(self, x):
        features = self.shared(x)
        action_probs = self.actor(features)
        state_value = self.critic(features)
        return action_probs, state_value

# ==========================================
# 2. Training loop (update every step; no need to wait for episode end)
# ==========================================
env = gym.make("CartPole-v1")
model = ActorCritic(state_dim=4, action_dim=2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
gamma = 0.99
reward_history = []

for episode in range(500):
    state, _ = env.reset()
    total_reward = 0

    while True:
        state_t = torch.FloatTensor(state)

        # Actor chooses action; Critic evaluates state
        probs, value = model(state_t)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)

        # Execute action
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        total_reward += reward

        # Critic evaluates the next state (no gradient through the TD target)
        with torch.no_grad():
            _, next_value = model(torch.FloatTensor(next_state))
            # Drop the bootstrap term only on true termination; after a
            # time-limit truncation V(s') is still a meaningful estimate.
            next_value = torch.zeros(1) if terminated else next_value

        # TD Error = advantage estimate (review: Section 6.1 A ~ delta)
        td_target = reward + gamma * next_value
        td_error = td_target - value

        # Actor loss: policy gradient x advantage (delta treated as a constant)
        actor_loss = -log_prob * td_error.detach()

        # Critic loss: make V(s) close to TD target (review: Section 6.2 L = delta^2)
        critic_loss = td_error.pow(2)

        # Total loss (a one-element tensor, so backward() needs no arguments)
        loss = actor_loss + critic_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
        if done:
            break

    reward_history.append(total_reward)
    if (episode + 1) % 50 == 0:
        avg = np.mean(reward_history[-50:])
        print(f"Episode {episode+1} | Avg Reward: {avg:.1f}")

Compared with the REINFORCE code in Chapter 5, the key differences are: there is an additional Critic network (outputting $V(s)$); TD Error (td_target - value) replaces $G_t$; the Critic has its own loss function (MSE); and updates happen every step rather than waiting for the episode to end.
Code Trace: A Complete Training Step
Below we assume the model is at some point during training and trace through one complete loop. Let the current state be $s = [0.1, 0.2, -0.3, 0.4]$ with discount factor $\gamma = 0.99$.
Forward pass. Feed state_t = torch.FloatTensor([0.1, 0.2, -0.3, 0.4]) into the model:
probs, value = model(state_t)
# probs = tensor([0.6000, 0.4000])   <- Actor output: left prob 0.6, right prob 0.4
# value = tensor(1.2000)             <- Critic output: V(s) = 1.2

Sample action and log-probability.

dist = torch.distributions.Categorical(probs)
action = dist.sample()               # action = tensor(1), i.e. right
log_prob = dist.log_prob(action)     # log_prob = log(0.4) = tensor(-0.9163)

Environment interaction. Execute action.item() = 1 (right):

next_state, reward, terminated, truncated, _ = env.step(action.item())
# reward = 1.0
# terminated = False, truncated = False

Evaluate the next state.

with torch.no_grad():
    _, next_value = model(torch.FloatTensor(next_state))
# next_value = tensor(2.0000)        <- V(s') = 2.0
# the episode has not ended, so next_value is not zeroed out

Compute TD target and TD Error.

td_target = reward + gamma * next_value   # = 1.0 + 0.99 * 2.0 = tensor(2.9800)
td_error = td_target - value              # = 2.98 - 1.2 = tensor(1.7800)

Compute both losses.

Actor Loss ($\delta$ is detached, participating as a constant):

actor_loss = -log_prob * td_error.detach()   # = -(-0.9163) * 1.78 = tensor(1.6310)

Critic Loss ($\delta$ contains $V(s)$; gradients propagate through $V(s)$ back to Critic parameters):

critic_loss = td_error.pow(2)   # = 1.78^2 = tensor(3.1684)

Total loss.

loss = actor_loss + critic_loss   # = tensor(4.7994)

Backpropagation and parameter update. After loss.backward() computes the gradients, optimizer.step() updates parameters with learning rate $\alpha = 10^{-3}$ (lr=1e-3 in the code). The effect of this update:
- Actor direction: $\delta = 1.78 > 0$, indicating that choosing right was better than expected. Gradient ascent increases $\pi_\theta(\text{right} \mid s)$ -- next time a similar state is encountered, the agent will be more inclined to choose right.
- Critic direction: $V(s) = 1.2$ is below the TD target of $2.98$. The gradient from $L_{\text{critic}}$ pulls $V(s)$ upward, bringing it closer to $2.98$.
Summary of key values across the entire computation chain:
| Variable | Value | Meaning |
|---|---|---|
| probs | [0.6, 0.4] | Actor's probability distribution over two actions |
| value | 1.2 | Critic's estimate of the current state |
| log_prob | -0.9163 | Log-probability of the chosen action (right) |
| reward | 1.0 | Immediate reward returned by the environment |
| next_value | 2.0 | Critic's estimate of the next state |
| td_target | 2.98 | $r + \gamma V(s')$ |
| td_error | 1.78 | $\delta = \text{td\_target} - V(s)$ |
| actor_loss | 1.6310 | $-\log \pi_\theta(a \mid s) \cdot \delta$ (after .detach) |
| critic_loss | 3.1684 | $\delta^2$ |
| loss | 4.7994 | actor_loss + critic_loss |
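The arithmetic in this trace can be reproduced without PyTorch. The sketch below plugs the traced values into the same formulas using plain floats; it checks the numbers only and does not compute any gradients.

```python
import math

gamma, reward = 0.99, 1.0
prob_right, value, next_value = 0.4, 1.2, 2.0  # values taken from the trace

log_prob = math.log(prob_right)
td_target = reward + gamma * next_value
td_error = td_target - value
actor_loss = -log_prob * td_error   # td_error acts as a constant here
critic_loss = td_error ** 2
loss = actor_loss + critic_loss

for name, x in [("log_prob", log_prob), ("td_target", td_target),
                ("td_error", td_error), ("actor_loss", actor_loss),
                ("critic_loss", critic_loss), ("loss", loss)]:
    print(f"{name:12s} {x:.4f}")
```

The printed values match the table row by row, which is a useful sanity check before debugging a real training loop.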
Actor-Critic Training Curve on CartPole
[Figure: Training Curve of Actor-Critic on CartPole. x-axis: Episode (0 to 500); y-axis ticks from 100 to 500. The curve climbs from about 100 and plateaus at 500. Compare with the typical curve of REINFORCE (more jagged, slower convergence).]

On CartPole, Actor-Critic typically stabilizes at 500 points (the maximum) within 200-300 episodes, whereas REINFORCE may need 500+ episodes and exhibits a visibly jagged curve. This is the payoff of "trading bias for variance" -- every step provides a more stable gradient signal, and policy updates are no longer driven by luck.
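When comparing curves like these, raw per-episode rewards are usually too noisy to read directly. A minimal sketch of a trailing moving average is shown below; `moving_average` is an illustrative helper, and the sample rewards are made up rather than taken from a real run.

```python
def moving_average(values, window):
    """Trailing mean over the last `window` entries (fewer at the start)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the entry leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


rewards = [10.0, 20.0, 30.0, 40.0]  # made-up episode returns
print(moving_average(rewards, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

Applied to `reward_history` from the training loop with a window of, say, 50 episodes, this gives the same kind of smoothed signal that the periodic `Avg Reward` printout reports.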
Further Evolution of Actor-Critic
Actor-Critic is not the destination; it is a skeleton. In later chapters you will encounter various extensions:
| Chapter | Variant | Key improvement |
|---|---|---|
| Chapter 7 PPO | PPO-Clip | Limit the size of policy updates to avoid "taking steps that are too big" |
| Chapter 7 GAE | Generalized Advantage Estimation | Exponentially weighted sum of multi-step TD errors; precisely control the bias-variance tradeoff |
| Chapter 9 DPO | Implicit Actor-Critic | Replace the Critic with preference data; remove the on-policy constraint |
| Chapter 9 GRPO | Remove the Critic | Replace $V(s)$ with an in-group mean; save one network |
All variants share the same skeleton: one network responsible for choosing, plus one signal responsible for evaluating. What changes is only "where the evaluation signal comes from" and "how the selection network is updated."
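To make "where the evaluation signal comes from" concrete, the sketch below shows the in-group-mean idea from the last table row in its simplest form: the baseline is the mean reward of a group of samples instead of a learned $V(s)$. The function name and reward values are hypothetical, and practical variants often also normalize by the group standard deviation, which is omitted here.

```python
def group_mean_advantages(rewards):
    """Advantage of each sample = its reward minus the group's mean reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
group_rewards = [1.0, 0.0, 0.5, 0.5]
print(group_mean_advantages(group_rewards))  # [0.5, -0.5, 0.0, 0.0]
```

The advantages sum to zero within the group, so samples are only ever judged relative to their siblings; no Critic network is trained.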
Question to think about: if Actor-Critic is better than REINFORCE, why not use a pure Critic (only V)?
Because with only a Critic, there is no way to directly output a policy. The Critic learns $V(s)$ or $Q(s, a)$, and deriving a policy from it requires $\arg\max_a Q(s, a)$ (review: greedy optimal policy). But in continuous action spaces, this has no closed-form solution -- you cannot compare infinitely many continuous values one by one.
The Actor's value lies in directly outputting action probabilities, which naturally handles continuous action spaces. This is why two networks are needed -- the Critic provides "evaluation" and the Actor provides "selection." Neither can be omitted.
Question to think about: where does the "bias" in Actor-Critic come from, and is it harmful?
The bias comes from the Critic's bootstrapping -- the Critic uses its own estimate $V(s_{t+1})$ to update $V(s_t)$. If $V(s_{t+1})$ is itself inaccurate, the error propagates backward. It is like calibrating one ruler with another inaccurate ruler -- the errors accumulate.
But this bias is not necessarily harmful. A moderate amount of bias can buy much lower variance, and overall convergence may be faster than the unbiased but high-variance REINFORCE. In Chapter 7, GAE is precisely about controlling this "bias-variance tradeoff" -- using a parameter $\lambda$ to smoothly interpolate between pure TD (high bias, low variance) and pure MC (unbiased, high variance).
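The interpolation can be previewed with a small sketch of the standard GAE recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The rewards and Critic values below are hypothetical, and the function is an illustrative implementation rather than the one Chapter 7 develops.

```python
def gae(rewards, values, gamma, lam):
    """Generalized Advantage Estimation over one finished trajectory.

    `values` has len(rewards) + 1 entries; the last one is V of the final
    state (0.0 when the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [2.0, 1.5, 1.0, 0.0]   # hypothetical Critic outputs
td_only = gae(rewards, values, gamma=0.99, lam=0.0)   # one-step TD errors
mc_like = gae(rewards, values, gamma=0.99, lam=1.0)   # G_t - V(s_t)
print([round(a, 4) for a in td_only])  # [0.485, 0.49, 0.0]
print([round(a, 4) for a in mc_like])  # [0.9701, 0.49, 0.0]
```

With `lam=0.0` each advantage is exactly the one-step TD Error used in this section; with `lam=1.0` the first entry equals the Monte Carlo return minus $V(s_t)$, which is the REINFORCE-with-baseline weight.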
Now let's look at how the Actor-Critic architecture performs in large-scale applications: Frontiers of Large-Scale Actor-Critic Applications.