E.2.2 From Random Trajectories to State Values
Prerequisites: E.2.1 Probability, Conditional Probability, and Expectation -- you should know the definition of expectation.
The previous article defined expectation as "averaging by weighting outcomes with their probabilities." This article applies expectation to reinforcement learning. Starting from a state, the trajectory is random and the return is random. The value of that state is the expectation over all possible returns.
From Random Trajectories to State Values
Starting from state $s$, suppose there are three possible trajectories:
| Trajectory | Probability | Discounted return |
|---|---|---|
| A | | |
| B | | |
| C | | |
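What is the value of state $s$ under policy $\pi$? By the definition of expectation, multiply each outcome by its probability and add:

$$V^\pi(s) = p_A G_A + p_B G_B + p_C G_C$$

Here $p_i$ and $G_i$ denote the probability and discounted return of trajectory $i$ in the table. This number is the concrete expansion of

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The subscript $\pi$ in $\mathbb{E}_\pi$ means "weighted by probabilities under policy $\pi$." The vertical bar means "under the condition that the starting state is $s$." Read the whole line as: starting from state $s$, following policy $\pi$, take the weighted average of all possible returns.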
Notice that $V^\pi(s)$ is not the return of any single trajectory. A sampled trajectory yields the return of A, B, or C, and $V^\pi(s)$ need not equal any of them. It is the average over infinitely many trajectories. This is the difference between "return" and "value": a return is single-run, while a value is an expectation.
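The distinction between a single return and a value can be checked numerically. The sketch below uses a hypothetical trajectory table (the probabilities and returns are illustrative assumptions, not the values from the table above) and compares the exact expectation with the average of many sampled returns.

```python
import random

# Hypothetical trajectory table: name -> (probability, discounted return).
# These numbers are illustrative assumptions, not the article's values.
trajectories = {
    "A": (0.5, 10.0),
    "B": (0.3, 4.0),
    "C": (0.2, -2.0),
}

def state_value(table):
    """Expectation: sum of probability * return over all trajectories."""
    return sum(p * g for p, g in table.values())

def sample_return(table, rng):
    """Return of one trajectory drawn according to the table's probabilities."""
    names = list(table)
    weights = [table[name][0] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return table[chosen][1]

rng = random.Random(0)  # fixed seed so the run is repeatable
value = state_value(trajectories)
samples = [sample_return(trajectories, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)

print(f"exact value V(s):        {value:.3f}")
print(f"mean of sampled returns: {sample_mean:.3f}")
print(f"distinct sampled returns: {sorted(set(samples))}")
```

Every individual sample is one of the three tabulated returns, while the value is a number that no single trajectory has to produce.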
How the Discount Factor Affects Value
The example above did not explicitly consider discounting. Each trajectory directly supplied a return. In real RL, the return is discounted:

$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

The closer the discount factor $\gamma$ is to $1$, the more the policy "cares about the future." The smaller $\gamma$ is, the more the policy "focuses on the immediate present." Consider a numerical example.
Suppose one trajectory has the immediate reward sequence .
| $\gamma$ | Discounted return | Meaning |
|---|---|---|
| | | Almost only the first two steps matter |
| | | Looks fairly far ahead |
| | | Almost all rewards are counted |
When $\gamma$ is small, the fifth reward is scaled by $\gamma^4$ and is almost negligible in the total return. When $\gamma$ is close to $1$, the factor $\gamma^4$ stays close to $1$, so the fifth reward is nearly as important as an immediate reward.
This directly changes the numerical value of the value function. With a larger $\gamma$, state values are often higher (when rewards are mostly positive) because they include more future rewards, but they are also harder to compute and estimate because the calculation must account for a longer and more uncertain future.
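The effect of the discount factor can be sketched in a few lines of Python. The reward sequence and the three discount factors below are hypothetical choices made for illustration, not the article's numbers; the first reward is assumed to be undiscounted, so the fifth reward is weighted by the discount factor to the fourth power.

```python
def discounted_return(rewards, gamma):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    total = 0.0
    # Work backwards so each step applies one more factor of gamma
    # to everything that comes after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical five-step reward sequence (an assumption for illustration).
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

for gamma in (0.1, 0.9, 0.99):  # illustrative discount factors
    g = discounted_return(rewards, gamma)
    fifth_weight = gamma ** 4  # weight applied to the fifth reward
    print(f"gamma={gamma:<4}  return={g:.4f}  weight of 5th reward={fifth_weight:.4f}")
```

The backward loop avoids computing powers of the discount factor explicitly, which is the same recursion that later appears in value-based methods.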
Variance: Measuring Instability
Two policies can have the same average return while feeling completely different during training.
Policy A has three returns:
Policy B has three returns:
Both policies have the same average return. But policy B fluctuates much more: some of its returns land far above that average and others far below it. If these returns are used to update policy parameters, the gradient direction for policy B swings sharply from update to update, making training unstable. Variance measures this kind of fluctuation.
For policy B, the deviations from the average are . Squaring them and averaging gives:
$\mathrm{Var}$ is short for variance. The formula is $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$: take each value's deviation from the mean, square it, and average.
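The variance computation can be spelled out directly. The two return lists below are hypothetical stand-ins for policies A and B (illustrative assumptions, not the article's numbers), chosen so that both have the same mean.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def variance(values):
    """Population variance: the average squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical returns (assumptions for illustration only).
returns_a = [9.0, 10.0, 11.0]    # tightly clustered around the mean
returns_b = [-10.0, 10.0, 30.0]  # same mean, widely spread

for name, returns in (("A", returns_a), ("B", returns_b)):
    print(f"policy {name}: mean={mean(returns):.2f}  variance={variance(returns):.2f}")
```

This divides by the number of values, matching the "square it, and average" description. Python's standard library offers the same quantity as `statistics.pvariance`.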
In reinforcement learning, high variance means the learning signal is unstable. Policy gradient methods often need baselines, advantages, GAE, and related techniques to reduce variance. The mathematical basis for these techniques appears in the following articles.
What Variance Means During Training
Consider policy gradients. Suppose the gradient update signal for "choose right" in a state $s$ is $\hat{A} \, \nabla_\theta \log \pi_\theta(\text{right} \mid s)$, where $\hat{A}$ is an advantage estimate.
Low-variance policy A: the advantage estimate stays close to a single value. Each update points in roughly the same direction, so the parameters move steadily.
High-variance policy B: the advantage estimate jumps between large negative and large positive values. One update says "right is bad, reduce its probability." The next says "right is good, increase its probability." The parameters oscillate, and training becomes inefficient.
This is why reducing variance is almost always one of the central issues in RL training. Baselines, GAE, and PPO clipping are all answering the same question in different forms: how can we make the gradient update signal more stable?
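A toy simulation can make the oscillation concrete. This is a minimal sketch under strong assumptions that are not part of the article: a single parameter `theta` controls the probability of "right" through a sigmoid, the action "right" is updated at every step, and the advantage estimate is drawn from a Gaussian with the same mean but different spread. The function name `run_updates` and all numbers are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_updates(noise_std, steps=200, lr=0.1, mean_advantage=1.0, seed=0):
    """Apply noisy policy-gradient-style updates to one logit.

    Returns the final logit and the fraction of updates whose advantage
    estimate was negative (that is, pushed "right" the wrong way).
    """
    rng = random.Random(seed)
    theta = 0.0  # logit of choosing "right"; starts at probability 0.5
    negative_updates = 0
    for _ in range(steps):
        p_right = sigmoid(theta)
        # Assumed noise model: Gaussian advantage estimate around a fixed mean.
        advantage = rng.gauss(mean_advantage, noise_std)
        # For a sigmoid policy, d/dtheta log p(right) = 1 - p(right).
        grad_log_prob = 1.0 - p_right
        theta += lr * advantage * grad_log_prob
        if advantage < 0:
            negative_updates += 1
    return theta, negative_updates / steps

low_theta, low_flips = run_updates(noise_std=0.1)
high_theta, high_flips = run_updates(noise_std=10.0)

print(f"low variance:  final p(right)={sigmoid(low_theta):.3f}  wrong-way updates={low_flips:.0%}")
print(f"high variance: final p(right)={sigmoid(high_theta):.3f}  wrong-way updates={high_flips:.0%}")
```

Both runs share the same expected advantage, so in expectation they push in the same direction. The difference is how often a single update contradicts that direction, which is the instability the text describes.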
Summary
This article applied the basic tools of probability theory to core RL concepts:
| Concept | Definition | RL role |
|---|---|---|
| State value | Conditional expectation of return, $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ | Measures the average return from a state |
| Discount factor | $\gamma \in [0, 1]$ | Controls how far the policy looks ahead; larger values emphasize the future |
| Variance | $\mathrm{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$ | Measures return fluctuation and affects training stability |
State value is the conditional expectation of return: it compresses the returns of many random trajectories into one representative number. The discount factor controls how far this compression looks: a small $\gamma$ focuses on the present, while a large $\gamma$ values the future but is harder to estimate. Variance tells us how stable the gradient signal is. Two policies may have the same average return, but the one with larger variance is harder to train. Later methods such as Monte Carlo estimation, baselines, and GAE all try to reduce variance while keeping the expectation as unchanged as possible.
Next: E.2.3 Monte Carlo, Incremental Averages, and Importance Sampling -- approximating expectations with sample averages when transition probabilities are unknown.