E.2.5 Bellman Expectation Equation and Action Values
Prerequisites: E.2.1 Probability Basics and E.2.2 State Values -- you should know conditional probability and the linearity of expectation.
Earlier we interpreted the value function as "the average return from a state":

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
This definition is correct, but it hides all details inside the expectation symbol $\mathbb{E}$. If we do not know what is being summed inside the expectation or what weights are being used, we cannot actually compute the value. It is like knowing that "the average grade is 85" without knowing how many courses there are or how each course is weighted. This article expands the expectation step by step and shows which sums live behind the notation. The recursive relationship that appears after this expansion is the Bellman expectation equation. Its purpose is to compress "the value of all future steps" into the recursive form "one-step reward plus next-state value," so we can compute values without running a full trajectory.
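First split the return $G_t$ into "one-step reward plus future return":

$$G_t = R_{t+1} + \gamma G_{t+1}$$

Substitute this into the value function:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Expectation has a basic property: linearity. The expectation of a sum is the sum of expectations. Therefore:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

The left term, $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$, is the average immediate reward. It expands as:

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

The right term, $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$, is the average next-state value. Conditioning on the next state $s'$, the expected return from there on is $v_\pi(s')$, so:

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Combining the two terms gives the Bellman expectation equation:

$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$

The formula looks complex, but each layer is only a probability-weighted average:
- $\sum_a \pi(a \mid s)$: average over all possible actions.
- $\sum_r p(r \mid s, a)\, r$: average over all possible rewards.
- $\sum_{s'} p(s' \mid s, a)\, v_\pi(s')$: average over the values of all possible next states.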
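The three layers can be sketched in Python as nested weighted sums. This is a minimal sketch, not a reference implementation: the dictionaries `policy`, `reward_probs`, and `next_state_probs`, the state and action names, and every number below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def bellman_backup(s, v, policy, reward_probs, next_state_probs, gamma):
    """Apply the Bellman expectation equation once at state s."""
    total = 0.0
    for a, pi_a in policy[s].items():  # layer 1: average over actions
        # layer 2: average over possible rewards for (s, a)
        exp_reward = sum(p * r for r, p in reward_probs[(s, a)].items())
        # layer 3: average over the values of possible next states
        exp_next = sum(p * v[s2] for s2, p in next_state_probs[(s, a)].items())
        total += pi_a * (exp_reward + gamma * exp_next)
    return total

# Hypothetical toy problem (all values are illustrative assumptions).
policy = {"s0": {"left": 0.5, "right": 0.5}}
reward_probs = {
    ("s0", "left"): {0.0: 1.0},             # reward 0 with probability 1
    ("s0", "right"): {1.0: 0.5, 3.0: 0.5},  # reward 1 or 3, equally likely
}
next_state_probs = {
    ("s0", "left"): {"s0": 1.0},
    ("s0", "right"): {"s1": 1.0},
}
v = {"s0": 2.0, "s1": 4.0}  # assumed current value estimates
gamma = 0.5

backup_s0 = bellman_backup("s0", v, policy, reward_probs, next_state_probs, gamma)
```

With these assumed numbers, the left branch contributes 0.5 × (0 + 0.5 × 2) and the right branch contributes 0.5 × (2 + 0.5 × 4), so the backup is a plain weighted average of two "reward plus discounted next value" brackets.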
Checking With the Two-State Example
Return to the two-state example from the introduction. Suppose the policy is deterministic. In , it chooses one action, which always moves to and gives reward . In , it chooses one action, which always moves to and gives reward . Let .
Because the policy is deterministic, $\pi(a \mid s)$ is $1$ for the chosen action and $0$ for all others, so the "sum over actions" layer disappears. Expanding the first state's value:
Expanding the second state's value:
These are the two equations from the introduction. This verifies that the Bellman expectation equation is doing exactly "one-step reward plus probability-weighted future value."
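A two-state system like this can be checked numerically. The sketch below does not use the values from the introduction: the rewards `r1` and `r2`, the discount `gamma`, and the assumption that each state moves deterministically to the other are all hypothetical. It solves the two coupled equations in closed form and then confirms that repeated Bellman backups converge to the same answer.

```python
# Hypothetical setup: s1 -> s2 with reward r1, s2 -> s1 with reward r2.
gamma = 0.9
r1, r2 = 1.0, 2.0

# The two Bellman equations under these assumptions:
#   v1 = r1 + gamma * v2
#   v2 = r2 + gamma * v1
# Substituting the second into the first gives a closed form for v1.
v1_closed = (r1 + gamma * r2) / (1 - gamma ** 2)
v2_closed = r2 + gamma * v1_closed

# Fixed-point iteration: start from zero and apply both backups repeatedly.
v1, v2 = 0.0, 0.0
for _ in range(500):
    v1, v2 = r1 + gamma * v2, r2 + gamma * v1
```

Because the error shrinks by a factor of `gamma` on every sweep, the iteration converges without ever rolling out a full trajectory.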
What If the Policy Is Stochastic?
If the policy has two possible actions in , with probabilities and :
- Choosing left gives reward and moves next to .
- Choosing right gives reward and moves next to .
Then the Bellman expectation equation for is:
Each bracket contains "the immediate reward plus future value after choosing one action," and the coefficient in front is the probability of choosing that action. Expanding layer by layer means taking probability-weighted averages layer by layer.
Action Values and Advantage Functions
The state value function answers: "starting from state $s$, what average return can we get?" But when training a policy, we often face a more specific question: in state $s$, is action left better or action right better? Knowing only the average return cannot answer this, because the average may mix one good action and one bad action. We need to evaluate each action separately. That is why we need the action value function.
The action value function $q_\pi(s, a)$ is defined as: after choosing action $a$ in state $s$, and then continuing under policy $\pi$, what average return do we get?

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Its Bellman form is:

$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Notice the relationship between $v_\pi$ and $q_\pi$: the state value is the weighted average of action values:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

Intuitively, "the average return from $s$" equals "the sum of each action's return multiplied by the probability of choosing that action."
The advantage function is the difference between action value and state value:

$$A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$$

It measures how much better action $a$ is than the average action level in state $s$. The earlier example "return 10, average 8, so the advantage is 2" is a numerical version of this formula.
The Relationship Between q, v, and A in Numbers
Suppose that in a state , the policy chooses left with probability and right with probability . We know:
| Action | Action value |
|---|---|
| left | |
| right | |
The state value is the weighted average of action values:
The advantage values are:
Left is below average, so it has negative advantage. Right is above average, so it has positive advantage. Policy gradients use this sign to decide that the probability of right should go up and the probability of left should go down.
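The same bookkeeping can be written in a few lines of Python. The probabilities and action values below are hypothetical stand-ins, not the numbers from the table above; the point is only the order of operations: average the action values to get the state value, then subtract.

```python
# Hypothetical policy probabilities and action values for one state.
policy = {"left": 0.25, "right": 0.75}
q = {"left": 4.0, "right": 8.0}

# State value: policy-weighted average of the action values.
v = sum(policy[a] * q[a] for a in policy)

# Advantage: how far each action value sits above or below that average.
advantage = {a: q[a] - v for a in q}

# The policy-weighted average of the advantages is always zero,
# because v is itself the policy-weighted average of q.
expected_advantage = sum(policy[a] * advantage[a] for a in policy)
```

With these assumed values, left has a negative advantage and right a positive one, matching the sign pattern described above.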
Trajectory-Form Importance Sampling
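We have already seen the one-step importance weight, the ratio between the target policy's and the behavior policy's probability of the action that was taken:

$$\rho_t = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

If a full trajectory is sampled by a behavior policy $b$, and we want to estimate the return of a target policy $\pi$, then we multiply the probability ratios across all steps:

$$\rho_{0:T-1} = \prod_{t=0}^{T-1} \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$$

Here $\prod$ means "multiply all terms together," just as $\sum$ means "add all terms together." The off-policy Monte Carlo estimate becomes:

$$v_\pi(s) \approx \frac{1}{N} \sum_{i=1}^{N} \rho^{(i)}\, G^{(i)}$$

where $N$ is the number of sampled trajectories starting from $s$, and $\rho^{(i)}$ and $G^{(i)}$ are the trajectory weight and return of the $i$-th trajectory.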
This method works, but it is risky. If many probability ratios are multiplied, the weight can become extremely large and cause variance to explode. Practical algorithms therefore often use truncation, weighted importance sampling, or other more stable methods.
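To make the difference between the plain estimate and weighted importance sampling concrete, here is a minimal sketch. The three trajectories, their per-step probabilities, and their returns are invented for illustration, and `trajectory_weight` is a hypothetical helper name.

```python
import math

def trajectory_weight(target_probs, behavior_probs):
    """Product of per-step ratios: target probability / behavior probability."""
    return math.prod(p / b for p, b in zip(target_probs, behavior_probs))

# Hypothetical data: (target probs per step, behavior probs per step, return).
trajectories = [
    ([0.5, 0.5], [0.25, 0.25], 1.0),
    ([0.25, 0.5], [0.5, 0.5], 3.0),
    ([0.5, 0.25], [0.5, 0.5], 2.0),
]

weights = [trajectory_weight(t, b) for t, b, _ in trajectories]
returns = [g for _, _, g in trajectories]
weighted_sum = sum(w * g for w, g in zip(weights, returns))

# Ordinary importance sampling: divide by the number of trajectories.
ordinary_estimate = weighted_sum / len(trajectories)

# Weighted importance sampling: divide by the sum of the weights instead.
weighted_estimate = weighted_sum / sum(weights)
```

The weighted estimate is a convex combination of the observed returns, so it can never leave their range. That is one reason it tends to be more stable than the ordinary estimate, at the cost of some bias.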
Numerical Example: Trajectory Importance Weights
Suppose a two-step trajectory has the following target-policy and behavior-policy probabilities at each step:
| Step | Target-policy probability | Behavior-policy probability | One-step weight |
|---|---|---|---|
The trajectory importance weight is the product of the two steps:
If this trajectory has return , its weighted contribution is . This may still look acceptable. But if every step has weight , then after steps the weight becomes . This is what "variance explosion" means. PPO clipping and normalization are, in essence, methods for controlling such extreme weights.
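The growth of the trajectory weight is easy to see numerically. In the sketch below, the per-step weight of 2.0 and the clipping width `eps=0.2` are hypothetical choices, and `clip_ratio` is a simplified per-step clip in the spirit of PPO rather than the full PPO objective.

```python
# Hypothetical: every step has the same one-step weight.
per_step_weight = 2.0

# Trajectory weight after n steps is the per-step weight raised to the n-th power.
growth = {n: per_step_weight ** n for n in (1, 5, 10, 20)}

def clip_ratio(ratio, eps=0.2):
    """Keep a probability ratio inside [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

clipped_high = clip_ratio(2.0)    # pulled down to the upper bound
clipped_low = clip_ratio(0.5)     # pulled up to the lower bound
clipped_inside = clip_ratio(1.1)  # already inside the band, unchanged
```

Under these assumptions the weight passes one thousand after 10 steps and one million after 20, while a clipped ratio stays within a narrow band around 1.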
Covariance and Policy Gradient Variance
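The tendency of two random variables to change together is measured by covariance:

$$\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$$

The correlation coefficient further normalizes covariance to $[-1, 1]$ by dividing by the two standard deviations:

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$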
In reinforcement learning, gradient estimates are often random variables. For example, a policy gradient sample is:

$$\hat{g} = \hat{A}_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$$

If the advantage estimate $\hat{A}_t$ fluctuates heavily, the variance of $\hat{g}$ also grows. Baselines, advantage normalization, and GAE are all ways of controlling the variance of this stochastic gradient.
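The effect of a baseline can be computed exactly for a tiny case. The sketch below assumes a two-action policy parameterized by a single logit, so that the score (the derivative of the log-probability) is `1 - p_right` for right and `-p_right` for left. The probability and the returns are hypothetical numbers.

```python
def mean_and_variance(samples, probs):
    """Exact mean and variance of a discrete random variable."""
    mean = sum(p * x for x, p in zip(samples, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(samples, probs))
    return mean, var

# Hypothetical two-action policy; index 0 is left, index 1 is right.
p_right = 0.5
probs = [1 - p_right, p_right]
scores = [-p_right, 1 - p_right]  # d/d(logit) of log pi(action), sigmoid policy assumed
returns = [8.0, 10.0]             # assumed return after each action

def gradient_samples(baseline):
    """Policy gradient sample for each action: score * (return - baseline)."""
    return [s * (g - baseline) for s, g in zip(scores, returns)]

# Use the state value (the average return) as the baseline.
state_value = sum(p * g for p, g in zip(probs, returns))

mean_plain, var_plain = mean_and_variance(gradient_samples(0.0), probs)
mean_base, var_base = mean_and_variance(gradient_samples(state_value), probs)
```

Both versions have the same mean, because the score averages to zero under the policy, but subtracting the baseline removes the variance in this example.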
Numerical Example: Covariance
Suppose two policy gradient samples have advantages , and the corresponding gradient norms are .
Means: , .
Covariance:
The covariance is positive, which means the gradient tends to be larger when the advantage is larger. The two signals are positively correlated. If the covariance is close to $0$, the advantage size and gradient size have no stable relationship, so the gradient estimate becomes noisier.
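The calculation follows the same steps for any paired samples. The sketch below uses hypothetical advantages and gradient norms, not the numbers from the example above, and divides by the number of samples (population covariance) rather than by one less than that.

```python
import math

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired samples.
advantages = [1.0, 3.0]
grad_norms = [2.0, 6.0]

cov = covariance(advantages, grad_norms)

# Correlation: covariance divided by the product of the standard deviations.
std_adv = math.sqrt(covariance(advantages, advantages))
std_grad = math.sqrt(covariance(grad_norms, grad_norms))
corr = cov / (std_adv * std_grad)
```

Here the gradient norm is exactly twice the advantage, so the covariance is positive and the correlation reaches its maximum of 1.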
Summary
This article expanded the value function from a "black-box expectation" into the computable Bellman expectation equation, then introduced action values and advantage functions:
| Concept | Formula | Role |
|---|---|---|
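| Bellman expectation equation | $v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$ | Compresses "all future steps" into "one step plus recursion" |
| Action value | $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ | Evaluates the average return after choosing a specific action |
| Advantage function | $A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$ | Measures how much better this action is than the average |
| Trajectory importance sampling | $\prod_t \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$ (target $\pi$, behavior $b$) | Uses old-policy data to evaluate a new policy; watch for variance explosion |
| Covariance | $\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$ | Measures joint fluctuation of two random variables and helps reason about gradient variance |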
The core idea of the Bellman expectation equation is recursion: we do not need to run a full trajectory if we know "one-step reward plus next-state value." Action values let us compare actions, and the advantage function turns that comparison into a number. Trajectory importance sampling lets us reuse old data, but it carries the risk of variance explosion.
Next: E.2.6 Probability and Statistics Formula Reference and Exercises -- a formula summary for this module, with exercises.