6.2 Training the Critic
In the previous section, we defined the advantage function and introduced the Critic network as an estimator of the state-value function $V^\pi(s)$. This section revisits the three classic value-estimation methods from Chapter 3 (DP, MC, and TD) and shows how each one trains the Critic in practice.
Prerequisites
- DP/MC/TD value-estimation methods -- principles and comparisons
- Bellman expectation equation -- theoretical basis for DP updates
- TD Error -- the core signal of TD methods
We continue with the three-cell corridor from Chapter 3 (states $s_1$, $s_2$, and the terminal state $s_3$), using a fixed policy $\pi$: at both $s_1$ and $s_2$, move right with probability 0.8 and left with probability 0.2. The transitions and rewards are:
| Current state | Action | Policy prob | Next state | Reward |
|---|---|---|---|---|
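| $s_1$ | left | 0.2 | $s_1$ (hits the wall) | $-2$ |
| $s_1$ | right | 0.8 | $s_2$ | $-1$ |
| $s_2$ | left | 0.2 | $s_1$ | $-2$ |
| $s_2$ | right | 0.8 | $s_3$ | $-1$ |
| $s_3$ | end | 1.0 | -- | $0$ |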
We set $\gamma = 1$. All three methods estimate the same value table; they differ only in where the update targets come from.
DP: A Theoretical Baseline
If we knew the full transition probabilities $P$ and reward function $R$ (recall the MDP 5-tuple), we could iterate the Critic directly using the Bellman expectation equation:

$$V_\phi(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a) + \gamma V_\phi(s')\bigr]$$

Each symbol in this equation:
| Symbol | Meaning |
|---|---|
| $V_\phi(s)$ | Critic's current value estimate for state $s$, with parameters $\phi$ |
| $a$ | An action available at state $s$ (e.g., left, right) |
| $\pi(a \mid s)$ | Probability that the current policy selects action $a$ at state $s$ |
| $R(s, a)$ | Immediate reward received after taking action $a$ in state $s$ |
| $s'$ | A possible next state after taking action $a$ |
| $P(s' \mid s, a)$ | Probability of transitioning to $s'$ from state $s$ and action $a$ |
| $V_\phi(s')$ | Critic's current value estimate for next state $s'$ |
| $\gamma$ | Discount factor |
Expanding for state $s_1$ in the corridor: the outer sum weights actions by the policy, and the inner sum weights next states by the transition probabilities. Because transitions are deterministic (moving right always goes right; moving left either hits the wall or goes back), the inner sum puts probability 1 on the actual destination:

$$V(s_1) \leftarrow 0.2\,\bigl[-2 + \gamma V(s_1)\bigr] + 0.8\,\bigl[-1 + \gamma V(s_2)\bigr]$$

Similarly for $s_2$, where moving right reaches the terminal state $s_3$ and moving left returns to $s_1$:

$$V(s_2) \leftarrow 0.2\,\bigl[-2 + \gamma V(s_1)\bigr] + 0.8\,\bigl[-1 + \gamma V(s_3)\bigr]$$
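By repeatedly applying this update to all states, $V$ converges to the exact $V^\pi$. Starting from an all-zero table with $\gamma = 1$, we substitute numbers round by round.
Round 1 -- the old table is all zeros, so the target reduces to the average immediate cost of each action:

$$V(s_1) = 0.2\,(-2 + 0) + 0.8\,(-1 + 0) = -1.2$$

$$V(s_2) = 0.2\,(-2 + 0) + 0.8\,(-1 + 0) = -1.2$$

Round 2 -- using the round-1 results as the old table:

$$V(s_1) = 0.2\,(-2 - 1.2) + 0.8\,(-1 - 1.2) = -0.64 - 1.76 = -2.4$$

$$V(s_2) = 0.2\,(-2 - 1.2) + 0.8\,(-1 + 0) = -0.64 - 0.8 = -1.44$$

Round 3 -- using the round-2 results as the old table:

$$V(s_1) = 0.2\,(-2 - 2.4) + 0.8\,(-1 - 1.44) = -0.88 - 1.952 = -2.832$$

$$V(s_2) = 0.2\,(-2 - 2.4) + 0.8\,(-1 + 0) = -0.88 - 0.8 = -1.68$$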
Summary of each round:
| Round | $V(s_1)$ | $V(s_2)$ | $V(s_3)$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | -1.2 | -1.2 | 0 |
| 2 | -2.4 | -1.44 | 0 |
| 3 | -2.832 | -1.68 | 0 |
| converged | -3.375 | -1.875 | 0 |
In each round, the values for $s_1$ and $s_2$ encode the "average consequence of acting under the current policy" -- moving right is generally better, but the policy occasionally moves left, and the costs of detours and wall bumps must also enter the value table.
On this basis, we can also perform policy improvement -- at state $s$, choose the action that maximizes the one-step lookahead value $R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\,V(s')$ (recall the greedy optimal policy). The loop "evaluate the policy → improve the policy → evaluate again" is exactly Policy Iteration, which is guaranteed to converge to an optimal policy in a finite MDP.
In real-world problems, however, it is almost never feasible to know the complete $P$ and $R$. DP's role in Actor-Critic is primarily a theoretical baseline -- it tells you what the Critic should converge to when the model is fully known.
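The round-by-round table above can be reproduced with a short script. This is a minimal sketch, assuming the corridor model implied by the numbers in this section (left costs $-2$, right costs $-1$, $\gamma = 1$); the names `MODEL`, `POLICY`, `bellman_sweep`, and `evaluate_policy` are illustrative, not from the original text. Because transitions are deterministic, the inner sum over next states collapses to a single term.

```python
# Assumed corridor model: each action maps to (next_state, reward).
GAMMA = 1.0
POLICY = {"left": 0.2, "right": 0.8}
MODEL = {
    "s1": {"left": ("s1", -2.0), "right": ("s2", -1.0)},
    "s2": {"left": ("s1", -2.0), "right": ("s3", -1.0)},
}


def bellman_sweep(values):
    """Apply the Bellman expectation update once to every non-terminal state."""
    new_values = dict(values)  # the terminal state s3 keeps its value of 0
    for state, actions in MODEL.items():
        new_values[state] = sum(
            POLICY[action] * (reward + GAMMA * values[next_state])
            for action, (next_state, reward) in actions.items()
        )
    return new_values


def evaluate_policy(rounds):
    """Return the list of value tables, starting from the all-zero table."""
    values = {"s1": 0.0, "s2": 0.0, "s3": 0.0}
    history = [values]
    for _ in range(rounds):
        values = bellman_sweep(values)
        history.append(values)
    return history


history = evaluate_policy(200)
for round_index in range(4):
    print(round_index, history[round_index])
print("converged", history[-1])
```

Each sweep reads only the previous round's table (a synchronous update), which is what the hand calculation does as well.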
MC: Update the Critic Using Complete Trajectories
Monte Carlo (MC) updates wait until a complete episode finishes, then use the actual return to train the Critic. The Critic loss is a squared error (averaged over samples when training on a batch):

$$L(\phi) = \bigl(G_t - V_\phi(s_t)\bigr)^2$$

Each symbol in this equation:
| Symbol | Meaning |
|---|---|
| $L(\phi)$ | Critic loss, measuring the prediction error |
| $G_t$ | Actual discounted return from time step $t$ to the end of the episode (the MC target) |
| $V_\phi(s_t)$ | Critic's current value prediction for state $s_t$ |

$G_t - V_\phi(s_t)$ is the Critic's prediction error -- the episode actually returned $G_t$, but the Critic previously predicted $V_\phi(s_t)$. The loss is the square of this error.
Numerical Example
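Suppose we sample the following trajectory:

$$s_1 \xrightarrow{\text{left},\,-2} s_1 \xrightarrow{\text{right},\,-1} s_2 \xrightarrow{\text{left},\,-2} s_1 \xrightarrow{\text{right},\,-1} s_2 \xrightarrow{\text{right},\,-1} s_3$$

With $\gamma = 1$, we compute $G_t$ by summing the rewards from each visit position to the end:
| Visit | State | Remaining rewards | $G_t$ computation | MC target |
|---|---|---|---|---|
| 1 | $s_1$ | $-2, -1, -2, -1, -1$ | $-2 - 1 - 2 - 1 - 1$ | $-7$ |
| 2 | $s_1$ | $-1, -2, -1, -1$ | $-1 - 2 - 1 - 1$ | $-5$ |
| 3 | $s_2$ | $-2, -1, -1$ | $-2 - 1 - 1$ | $-4$ |
| 4 | $s_1$ | $-1, -1$ | $-1 - 1$ | $-2$ |
| 5 | $s_2$ | $-1$ | $-1$ | $-1$ |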
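The returns for every visit can be computed in one backward pass over the rewards. The sketch below assumes the reward sequence $-2, -1, -2, -1, -1$ for the sampled trajectory and $\gamma = 1$; the helper name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=1.0):
    """Return G_t for every time step, computed backward as G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Assumed rewards along the trajectory: left, right, left, right, right.
rewards = [-2.0, -1.0, -2.0, -1.0, -1.0]
returns = discounted_returns(rewards)
print(returns)
```

Working backward avoids re-summing the tail of the episode for each visit, so the cost is linear in the episode length.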
Loss Computation and Gradient Update
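Assume the Critic is a simple value table with $V(s_1) = 0$ and $V(s_2) = 0$. Using the first visit to $s_1$ as an example, the MC target is $G_0 = -7$:

$$L = \bigl(G_0 - V(s_1)\bigr)^2 = (-7 - 0)^2 = 49$$

The gradient-descent update (learning rate $\alpha$):

$$V(s_1) \leftarrow V(s_1) - \alpha\,\frac{\partial L}{\partial V(s_1)}$$

Here $\frac{\partial L}{\partial V(s_1)} = -2\,\bigl(G_0 - V(s_1)\bigr)$, but it is more common to absorb the factor $2$ into the learning rate and write directly:

$$V(s_1) \leftarrow V(s_1) + \alpha\,\bigl(G_0 - V(s_1)\bigr)$$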
The complete update process across all visits:
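| Updated state | MC target $G_t$ | Old value | Update computation | New value |
|---|---|---|---|---|
| $s_1$ (1st visit) | $-7$ | 0 | | |
| $s_1$ (2nd visit) | $-5$ | | | |
| $s_2$ (1st visit) | $-4$ | 0 | | |
| $s_1$ (3rd visit) | $-2$ | | | |
| $s_2$ (2nd visit) | $-1$ | | | |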
MC methods (recall the MC value update: $V(s_t) \leftarrow V(s_t) + \alpha\,(G_t - V(s_t))$) provide an unbiased estimate because they use the true return, but they have two limitations:
- You must wait until the episode ends to compute $G_t$; you cannot learn online step by step.
- High variance -- $G_t$ can fluctuate drastically across different episodes.
In a neural-network implementation, the MC method is equivalent to: run one full episode, collect all $(s_t, G_t)$ pairs, then perform a gradient-descent update on the Critic parameters $\phi$ using this batch.
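For a value table, the per-visit MC update can be sketched as follows. The learning rate `alpha=0.5` is an illustrative assumption, not a value taken from the text, and the `(state, return)` pairs assume the trajectory with rewards $-2, -1, -2, -1, -1$ and $\gamma = 1$; `mc_update` is a hypothetical helper name.

```python
def mc_update(values, visits, alpha):
    """Every-visit MC: apply V(s) <- V(s) + alpha * (G - V(s)) in visit order."""
    values = dict(values)  # work on a copy so the caller's table is untouched
    for state, mc_target in visits:
        values[state] += alpha * (mc_target - values[state])
    return values


# (state, G_t) pairs collected after the episode ends (assumed trajectory).
visits = [("s1", -7.0), ("s1", -5.0), ("s2", -4.0), ("s1", -2.0), ("s2", -1.0)]
updated = mc_update({"s1": 0.0, "s2": 0.0}, visits, alpha=0.5)
print(updated)
```

Note that all five updates happen only after the episode has finished, since every target $G_t$ depends on rewards up to the terminal step.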
TD: One-Step Updates
Temporal Difference (TD) learning updates the Critic using the TD Error. The Critic loss is:

$$L(\phi) = \bigl(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\bigr)^2 = \delta_t^2$$

Each symbol in this equation:
| Symbol | Meaning |
|---|---|
| $L(\phi)$ | Critic loss, measuring the magnitude of the TD Error |
| $r_t$ | Immediate reward received at the current step |
| $\gamma$ | Discount factor |
| $V_\phi(s_{t+1})$ | Critic's current value prediction for next state $s_{t+1}$ |
| $V_\phi(s_t)$ | Critic's current value prediction for current state $s_t$ |
| $\delta_t$ | TD Error, i.e., $r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$ |

Minimizing $L(\phi)$ makes the Critic's predictions progressively more accurate. $\delta_t$ measures, after taking one step, the difference between "the reward actually received plus the next-step prediction" and "the current prediction." $\delta_t > 0$ means this step was better than expected; $\delta_t < 0$ means it was worse.
Numerical Example
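Using the same trajectory as MC:

$$s_1 \xrightarrow{\text{left},\,-2} s_1 \xrightarrow{\text{right},\,-1} s_2 \xrightarrow{\text{left},\,-2} s_1 \xrightarrow{\text{right},\,-1} s_2 \xrightarrow{\text{right},\,-1} s_3$$

The initial value table is all zeros, with learning rate $\alpha = 0.5$ and $\gamma = 1$. TD updates after every step, reading from the current latest table.
Step 1: $s_1 \to s_1$ (left), $r = -2$. Current $V(s_1) = 0$, $V(s_2) = 0$.

$$\delta = -2 + 1 \times 0 - 0 = -2, \qquad V(s_1) \leftarrow 0 + 0.5 \times (-2) = -1$$

Step 2: $s_1 \to s_2$ (right), $r = -1$. Current $V(s_1) = -1$ (updated in the previous step), $V(s_2) = 0$.

$$\delta = -1 + 1 \times 0 - (-1) = 0, \qquad V(s_1) \leftarrow -1 + 0.5 \times 0 = -1$$

$\delta = 0$ means "received $-1$ and arrived at $s_2$" exactly equals the previous estimate of $V(s_1)$ -- the prediction had no error.
Step 3: $s_2 \to s_1$ (left), $r = -2$. Current $V(s_2) = 0$, $V(s_1) = -1$.

$$\delta = -2 + 1 \times (-1) - 0 = -3, \qquad V(s_2) \leftarrow 0 + 0.5 \times (-3) = -1.5$$

Note that $V(s_1)$ here is the value just updated in step 1 -- TD immediately uses freshly learned information.
Step 4: $s_1 \to s_2$ (right), $r = -1$. Current $V(s_1) = -1$, $V(s_2) = -1.5$.

$$\delta = -1 + 1 \times (-1.5) - (-1) = -1.5, \qquad V(s_1) \leftarrow -1 + 0.5 \times (-1.5) = -1.75$$

Step 5: $s_2 \to s_3$ (right), $r = -1$. Current $V(s_2) = -1.5$, $V(s_3) = 0$.

$$\delta = -1 + 1 \times 0 - (-1.5) = 0.5, \qquad V(s_2) \leftarrow -1.5 + 0.5 \times 0.5 = -1.25$$

$\delta = 0.5 > 0$ indicates that moving right from $s_2$ to the terminal was better than the current estimate of $V(s_2) = -1.5$, so $V(s_2)$ is adjusted upward.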
Step-by-step Summary
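| Step | Transition | $r$ | Updated state | Old $V(s)$ | $V(s')$ | TD target | $\delta$ | New $V(s)$ |
|---|---|---|---|---|---|---|---|---|
| 1 | $s_1 \to s_1$ | $-2$ | $s_1$ | 0 | 0 | $-2$ | $-2$ | $-1$ |
| 2 | $s_1 \to s_2$ | $-1$ | $s_1$ | $-1$ | 0 | $-1$ | $0$ | $-1$ |
| 3 | $s_2 \to s_1$ | $-2$ | $s_2$ | 0 | $-1$ | $-3$ | $-3$ | $-1.5$ |
| 4 | $s_1 \to s_2$ | $-1$ | $s_1$ | $-1$ | $-1.5$ | $-2.5$ | $-1.5$ | $-1.75$ |
| 5 | $s_2 \to s_3$ | $-1$ | $s_2$ | $-1.5$ | 0 | $-1$ | $0.5$ | $-1.25$ |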
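The step-by-step table can be checked with a few lines of code. This is a minimal sketch, assuming $\alpha = 0.5$, $\gamma = 1$, and the trajectory with rewards $-2, -1, -2, -1, -1$; the function name `td0_episode` and the tuple layout of `transitions` are illustrative choices.

```python
def td0_episode(transitions, alpha=0.5, gamma=1.0):
    """Run tabular TD(0) along one trajectory, updating after every step."""
    values = {"s1": 0.0, "s2": 0.0, "s3": 0.0}  # s3 is terminal and stays at 0
    log = []
    for state, reward, next_state in transitions:
        td_target = reward + gamma * values[next_state]
        td_error = td_target - values[state]
        values[state] += alpha * td_error  # uses the latest table immediately
        log.append((state, td_target, td_error, values[state]))
    return values, log


# Assumed trajectory as (state, reward, next_state) triples.
transitions = [
    ("s1", -2.0, "s1"),
    ("s1", -1.0, "s2"),
    ("s2", -2.0, "s1"),
    ("s1", -1.0, "s2"),
    ("s2", -1.0, "s3"),
]
final_values, log = td0_episode(transitions)
for step, row in enumerate(log, start=1):
    print(step, row)
```

Unlike the MC version, each update only needs the current transition, so the loop could just as well run while the episode is still in progress.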
TD Loss Computation
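Using step 3 as an example, $\delta = -3$:

$$L = \delta^2 = (-3)^2 = 9$$

The gradient-descent direction, treating the TD target as a constant:

$$\frac{\partial L}{\partial V(s_2)} = -2\delta = 6$$

The parameter moves opposite to the gradient, that is, in the direction of $\delta$, so $V(s_2)$ decreases. In practice, the update is equivalently $V(s_2) \leftarrow V(s_2) + \alpha\delta = 0 + 0.5 \times (-3) = -1.5$, consistent with the table above.
The advantages of TD methods (recall the TD(0) update: $V(s_t) \leftarrow V(s_t) + \alpha\,[r_t + \gamma V(s_{t+1}) - V(s_t)]$):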
- No need to wait for the episode to end -- you can update at every step.
- Lower variance -- $V(s_{t+1})$ acts as an "anchor" that stabilizes the estimate.
- Matches the Actor's update cadence -- both update once per environment step.
The price is introducing bias: $V_\phi(s_{t+1})$ is itself an estimate, not the true value. This is called bootstrapping -- using your own estimates to update your own estimates. In practice, however, the reduction in variance usually outweighs this bias, and the bias shrinks as the Critic's estimates improve.
Comparing the Three Methods
| | DP | MC | TD |
|---|---|---|---|
| Used to train Critic? | Theoretical baseline | Usable | Practical default |
| Need episode to end? | No | Yes | No |
| Unbiased? | Yes | Yes | No (biased but lower variance) |
| Variance | Low | High | Medium |
| Bootstrapping | Yes | No | Yes |
MC vs. TD: A Numerical Comparison
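Same trajectory as in the examples above, initial table all zeros, learning rate $\alpha$, $\gamma = 1$.
MC -- updates only after the entire episode ends. At the first visit to $s_1$, the target is the complete return over the whole trajectory:

$$V(s_1) \leftarrow V(s_1) + \alpha\,\bigl(G_0 - V(s_1)\bigr) = 0 + \alpha\,(-7 - 0) = -7\alpha$$

MC uses all information from start to finish in a single update.
TD -- updates immediately after the first step. Step 1 uses only one-step information:

$$V(s_1) \leftarrow V(s_1) + \alpha\,\bigl(r + \gamma V(s_1) - V(s_1)\bigr) = 0 + \alpha\,(-2 + 0 - 0) = -2\alpha$$

The TD target ($-2$) is much smaller in magnitude than the MC target ($-7$), but TD does not need to wait for the episode to end. As more trajectories accumulate, TD's $V(s_1)$ also gradually approaches the true value of $-3.375$.
Both methods eventually converge to the same $V^\pi$ (given suitably small learning rates), but their update paths differ: MC makes large single updates (toward $-7$ here) with high variance; TD makes small updates (toward $-2$ here) but more frequently, with lower variance.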
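To see TD approach the DP answer over many sampled episodes, one can simulate the corridor directly. The sketch below is an illustration under stated assumptions: the reward values ($-2$ for left, $-1$ for right), the step size `alpha=0.02`, the episode count, and the seed are all choices made for this example, and `sample_step` and `td0_many_episodes` are hypothetical names. With a constant step size the estimates keep fluctuating around the true values rather than settling exactly.

```python
import random


def sample_step(state, rng):
    """Sample one step of the assumed corridor under the fixed 0.8/0.2 policy."""
    go_right = rng.random() < 0.8
    if state == "s1":
        return (-1.0, "s2") if go_right else (-2.0, "s1")  # left hits the wall
    return (-1.0, "s3") if go_right else (-2.0, "s1")


def td0_many_episodes(episodes, alpha, gamma=1.0, seed=0):
    """Tabular TD(0) over many sampled episodes, each starting at s1."""
    rng = random.Random(seed)
    values = {"s1": 0.0, "s2": 0.0, "s3": 0.0}
    for _ in range(episodes):
        state = "s1"
        while state != "s3":
            reward, next_state = sample_step(state, rng)
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values


values = td0_many_episodes(episodes=5000, alpha=0.02)
print(values)  # compare with the DP result: V(s1) = -3.375, V(s2) = -1.875
```

A smaller `alpha` reduces the residual fluctuation at the cost of slower learning, which is the same bias-variance tension discussed above in miniature.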
In practice, Actor-Critic methods almost always use TD to train the Critic. In more advanced implementations (e.g., GAE in Chapter 7), MC and TD are combined -- a parameter $\lambda$ interpolates between them to tune the bias-variance tradeoff.
The Full Critic-Training Workflow
Putting the pieces together, a one-step Actor-Critic training loop looks like this:
- Interact: At state $s_t$, the Actor selects action $a_t$; the environment returns $r_t$ and $s_{t+1}$.
- Forward pass: The Critic computes the current prediction $V_\phi(s_t)$ and the next-step prediction $V_\phi(s_{t+1})$.
- Compute TD Error: $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
- Update Critic: Update the Critic parameters $\phi$ using $\delta_t^2$ as the loss.
- Update Actor: Update the Actor parameters $\theta$ using $\delta_t$ as the advantage estimate.
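The five steps above can be sketched for a tabular Critic and a softmax Actor. Everything specific here is an assumption for illustration: the softmax parameterization over per-state logits, the learning rates, the starting value table, and the names `softmax` and `actor_critic_step`. It is one possible instance of a one-step Actor-Critic update, not a reference implementation.

```python
import math

ACTIONS = ["left", "right"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(values, logits, state, action, reward, next_state,
                      gamma=1.0, critic_lr=0.5, actor_lr=0.1):
    """Apply one update in place and return the TD error."""
    # Forward pass and TD error, computed before either update.
    td_error = reward + gamma * values[next_state] - values[state]
    # Critic: step on td_error**2 with the target held fixed (factor 2 absorbed).
    values[state] += critic_lr * td_error
    # Actor: d log pi(action|s) / d logit_k = 1[k == action] - pi(k|s).
    probs = softmax(logits[state])
    for k in range(len(probs)):
        indicator = 1.0 if k == action else 0.0
        logits[state][k] += actor_lr * td_error * (indicator - probs[k])
    return td_error


# Hypothetical starting point: made-up value table, policy 0.2 left / 0.8 right.
values = {"s1": -2.0, "s2": -1.5, "s3": 0.0}
logits = {"s1": [math.log(0.2), math.log(0.8)]}
delta = actor_critic_step(values, logits, "s1", ACTIONS.index("right"), -1.0, "s2")
print(delta, values["s1"], softmax(logits["s1"]))
```

With these made-up numbers the TD error is negative, so the Critic lowers its estimate for `s1` and the Actor shifts a little probability away from moving right.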
Numerical Walkthrough
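Assume the current Critic value table holds estimates $V(s_1)$ and $V(s_2)$, with $V(s_3) = 0$ for the terminal state, $\gamma = 1$, Critic learning rate $\alpha_\phi$, and Actor learning rate $\alpha_\theta$.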
Step 1: Interact
At state $s_1$, the Actor chooses right with probability 0.8 and left with probability 0.2. Suppose this sample picks right; the environment returns $r = -1$, $s' = s_2$.
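Step 2: Forward Pass
The Critic reads the current prediction $V(s_1)$ and the next-step prediction $V(s_2)$ from the value table.
Step 3: Compute TD Error

$$\delta = r + \gamma V(s_2) - V(s_1)$$

Suppose the table values give $\delta < 0$. This indicates that moving right from $s_1$ to $s_2$ was worse than the current prediction -- the reward actually received plus the estimate for $s_2$, that is, $r + \gamma V(s_2)$, is lower than the estimate $V(s_1)$.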
Step 4: Update Critic
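Parameter update (using a value table as an example, with Critic learning rate $\alpha_\phi$):

$$V(s_1) \leftarrow V(s_1) + \alpha_\phi\,\delta$$

Because $\delta < 0$, the Critic lowers $V(s_1)$ -- this experience suggests the value of $s_1$ is lower than previously estimated.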
Step 5: Update Actor
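$\delta < 0$ indicates that this action (moving right) performed worse than expected. The Actor's update direction is to decrease the probability of this action. Using the policy gradient as an example (with Actor learning rate $\alpha_\theta$):

$$\theta \leftarrow \theta + \alpha_\theta\,\delta\,\nabla_\theta \log \pi_\theta(\text{right} \mid s_1)$$

Since $\delta < 0$, the parameters move opposite to $\nabla_\theta \log \pi_\theta(\text{right} \mid s_1)$, reducing the probability $\pi_\theta(\text{right} \mid s_1)$.
If $\delta > 0$, the action was better than expected, and the Actor increases its probability.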
The Critic parameters update in the direction that makes $\delta^2$ smaller -- predictions become more accurate. The Actor parameters update in the direction that assigns higher probability to actions with positive $\delta$ -- choices become better. This creates a virtuous cycle: the more accurate the Critic's evaluation, the faster the Actor improves; the more diverse actions the Actor tries, the richer the data the Critic sees, and the more accurate its evaluation becomes.