E.2.6 Probability and Statistics Formula Reference and Exercises
Prerequisites: This page collects every formula used in module E.2 and is meant as a review reference after you have read E.2.1 through E.2.5. If this is your first pass, start with those articles.
Probability Formulas Used in This Book
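| Concept | Formula | RL meaning |
|---|---|---|
| Policy probability | $\pi(a\mid s)$ | Probability of choosing action $a$ in state $s$ |
| State transition probability | $P(s'\mid s,a)$ | Probability of entering the next state $s'$ after taking action $a$ in state $s$ |
| Expectation | $\mathbb{E}[X]=\sum_x x\,p(x)$ | Average reward, average return, value function |
| State value | $V^\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]$ | Average discounted return from state $s$ |
| Variance | $\mathrm{Var}(X)=\mathbb{E}\left[(X-\mathbb{E}[X])^2\right]$ | Size of learning-signal fluctuation |
| Monte Carlo estimate | $\mathbb{E}[X]\approx\frac{1}{N}\sum_{i=1}^{N}x_i$ | Estimate value using sample averages |
| Trajectory probability | $p_\theta(\tau)=p(s_0)\prod_t \pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t)$ | Probability that a policy generates a full trajectory |
| Baseline variance reduction | $\mathbb{E}_{a\sim\pi_\theta}\left[\nabla_\theta\log\pi_\theta(a\mid s)\,b(s)\right]=0$ | Subtracting a baseline does not change the expected gradient |
| GAE | $\hat{A}_t^{GAE}=\sum_k(\gamma\lambda)^k\delta_{t+k}$ | Trade-off between TD and MC |
| Importance weight | $\rho_t=\dfrac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)}$ | Off-policy correction |
| PPO clipped objective | $L^{CLIP}(\theta)=\mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\,\hat{A}_t\right)\right]$ | Limit extreme changes in importance weights |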
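
The GAE and PPO rows can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the standard TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (not defined on this page), a `values` list with one extra bootstrap entry for the state after the last step, and a hypothetical clip range `eps=0.2`. The function names and toy numbers are illustrative only.

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    Assumption: `values` has len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final step.
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_k (gamma * lam)^k * delta_{t+k}, computed backwards in time."""
    deltas = td_residuals(rewards, values, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clipped_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy, made-up numbers for illustration.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 1.0, 0.0]  # last entry is the bootstrap value

adv_td = gae_advantages(rewards, values, gamma=0.9, lam=0.0)
adv_mc = gae_advantages(rewards, values, gamma=0.9, lam=1.0)
print("lam=0 (pure TD):", adv_td)
print("lam=1 (pure MC):", adv_mc)
print("clipped term, ratio 1.5, A=2.0:", ppo_clipped_term(1.5, 2.0))
```

With `lam=0.0` each advantage reduces to the one-step TD residual, and with `lam=1.0` it becomes the discounted Monte Carlo return minus the value estimate, which is the TD/MC trade-off named in the table.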
Summary
At this point, the basic tools of probability theory are in place: probability describes randomness, expectation describes averages, variance describes fluctuation, Monte Carlo approximates expectations through sampling, and importance sampling corrects bias with probability ratios. The structure of this page is: start from probability tables, weighted averages, and sample averages, then extend to the Bellman expectation equation, action values, trajectory importance sampling, and stochastic-gradient variance. When reading more complex formulas later, first ask what the formula is averaging over: actions, rewards, next states, trajectories, or gradient samples.
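
The step from a probability table to a weighted average to a sample average can be made concrete with a short sketch. The returns and probabilities below are hypothetical, chosen only for illustration, and the sample size and seed are arbitrary.

```python
import random

# Hypothetical probability table over returns (not taken from the text).
returns = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]

# Weighted average: the exact expectation under the table.
expected = sum(p * g for p, g in zip(probs, returns))

# Variance: the probability-weighted squared deviation from the expectation.
variance = sum(p * (g - expected) ** 2 for p, g in zip(probs, returns))

# Sample average: a Monte Carlo estimate of the same expectation.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(returns, weights=probs, k=10_000)
mc_estimate = sum(samples) / len(samples)

print("exact expectation:", expected)
print("exact variance:", variance)
print("Monte Carlo estimate:", mc_estimate)
```

The Monte Carlo estimate should land close to the exact expectation but not on it. The gap shrinks as the number of samples grows and is larger when the variance is larger.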
Common Pitfalls
- Treating one return as the value. A value is an expectation; the return from one trajectory is only one sample.
- Looking only at the mean and ignoring variance. Two policies can have the same average return but different variance, so their training stability can be very different.
- Thinking importance sampling reuses data for free. Probability ratios can correct bias, but they may also greatly increase variance.
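
The last pitfall can be checked exactly on a tiny example. The sketch below assumes a hypothetical two-action problem with deterministic rewards. The policies and rewards are made up, and the variances are computed analytically for a single sample rather than estimated from data.

```python
def mean_and_variance(values, probs):
    """Exact mean and variance of a discrete random variable."""
    mean = sum(p * v for p, v in zip(probs, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, var


# Hypothetical setup: action 0 pays 1.0, action 1 pays 0.0.
rewards = [1.0, 0.0]
target = [0.9, 0.1]    # target policy pi(a)
behavior = [0.1, 0.9]  # behavior policy mu(a), rarely picks the rewarding action

# One-step importance weights pi(a) / mu(a).
weights = [t / b for t, b in zip(target, behavior)]

# Sampling directly from the target policy.
on_policy_mean, on_policy_var = mean_and_variance(rewards, target)

# Sampling from the behavior policy and reweighting each reward.
weighted_rewards = [w * r for w, r in zip(weights, rewards)]
is_mean, is_var = mean_and_variance(weighted_rewards, behavior)

print("on-policy mean/variance:", on_policy_mean, on_policy_var)
print("importance-sampled mean/variance:", is_mean, is_var)
```

Both estimators have the same mean, so the ratio does correct the bias, but in this setup the per-sample variance of the reweighted estimator is far larger. This also illustrates the second pitfall: two quantities can have the same mean and very different variances.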
Exercises
- Three trajectories have returns $G_1, G_2, G_3$ and probabilities $p_1, p_2, p_3$. What is the state value?
- Sample returns are $g_1, \dots, g_n$. What are the mean and variance?
- The behavior policy probability is $\mu(a\mid s)$, and the target policy probability is $\pi(a\mid s)$. What is the one-step importance weight?
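
To check hand-computed answers, the three exercises can be mirrored with small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, and the inputs are placeholder numbers, not the values from the exercises. For the variance, the exercise does not say which convention is intended, so both are shown: `statistics.pvariance` divides by $n$ and `statistics.variance` divides by $n-1$.

```python
import statistics


def state_value(returns, probs):
    """Expected return: probability-weighted average of trajectory returns."""
    return sum(p * g for p, g in zip(probs, returns))


def sample_mean_and_variances(samples):
    """Sample mean plus both variance conventions (divide by n, divide by n - 1)."""
    return (
        statistics.mean(samples),
        statistics.pvariance(samples),  # population form, divides by n
        statistics.variance(samples),   # unbiased form, divides by n - 1
    )


def importance_weight(target_prob, behavior_prob):
    """One-step importance weight: target probability over behavior probability."""
    return target_prob / behavior_prob


# Placeholder inputs for illustration only.
v = state_value([10.0, 0.0, -5.0], [0.2, 0.5, 0.3])
mean, var_n, var_n_minus_1 = sample_mean_and_variances([1.0, 2.0, 3.0, 6.0])
w = importance_weight(0.6, 0.3)

print("state value:", v)
print("mean, variance (n), variance (n-1):", mean, var_n, var_n_minus_1)
print("importance weight:", w)
```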