E.2 Probability, Expectation, and Random Estimation

Data in reinforcement learning comes from random interaction: the policy may choose actions stochastically, and the environment may return random rewards and next states. Reasoning about this randomness requires probability theory. This section follows the natural order of probability: first sample spaces, events, and random variables; then probability, conditional probability, expectation, and variance; and finally Monte Carlo estimation, importance sampling, and the Bellman expectation equation.
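
To make the two sources of randomness concrete, here is a minimal sketch of a one-step problem in which both the policy and the environment are random. The action probabilities, reward probabilities, and function names are illustrative assumptions, not part of the course material. The sketch compares the exact expected reward with a Monte Carlo estimate obtained by averaging sampled rewards.

```python
import random

# Hypothetical toy problem: all names and numbers are illustrative assumptions.
POLICY = {0: 0.5, 1: 0.5}        # pi(a): probability that the policy picks action a
REWARD_PROB = {0: 0.2, 1: 0.8}   # P(reward = 1 | action a); otherwise reward = 0


def sample_episode(rng):
    """Sample one action from the policy, then one reward from the environment."""
    action = 0 if rng.random() < POLICY[0] else 1
    reward = 1.0 if rng.random() < REWARD_PROB[action] else 0.0
    return action, reward


def exact_expected_reward():
    """E[R] = sum over actions of pi(a) * E[R | a]."""
    return sum(POLICY[a] * REWARD_PROB[a] for a in POLICY)


def monte_carlo_estimate(num_episodes, seed=0):
    """Approximate E[R] by the sample mean of rewards over sampled episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_episodes):
        _, reward = sample_episode(rng)
        total += reward
    return total / num_episodes


if __name__ == "__main__":
    print("exact expectation:", exact_expected_reward())
    print("Monte Carlo estimate:", monte_carlo_estimate(20_000))
```

With more episodes the sample mean should settle closer to the exact expectation, which is the idea that the later articles on Monte Carlo estimation develop.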

[Figure: Random trajectories and expected value]

Roadmap

| Article | Mathematical path | Role in reinforcement learning |
| --- | --- | --- |
| E.2.1 Probability, Conditional Probability, and Expectation | sample space -> event -> random variable -> probability -> expectation | Describe stochastic policies and stochastic environments |
| E.2.2 Random Variables, Returns, and State Values | random return -> conditional expectation -> variance | Define value functions and the stability of learning signals |
| E.2.3 Variance, Monte Carlo, and Sample Averages | sample mean -> incremental average -> importance sampling | Estimate unknown expectations from data |
| E.2.4 Trajectory Probability, Baselines, and GAE | trajectory probability -> baseline invariance -> accumulated TD errors | Connect policy gradients with advantage estimation |
| E.2.5 Bellman Expectation Equation | take expectations over actions, rewards, and next states layer by layer | Derive the full Bellman expectation equation |
| E.2.6 Summary, Formulas, and Exercises | formula review -> common pitfalls -> exercises | Review and check understanding |
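
As a preview of the path "sample mean -> incremental average -> importance sampling" in E.2.3, the following is a minimal sketch under stated assumptions. The two action distributions, the per-action rewards, and the function names are hypothetical and chosen only for illustration. It computes a running average without storing a sum, then reuses it to estimate an expectation under a target distribution from samples drawn under a different behavior distribution.

```python
import random


def incremental_mean(samples):
    """Running average: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n
    return mean


# Hypothetical distributions over two actions (illustrative assumptions).
TARGET = {0: 0.9, 1: 0.1}     # distribution we want the expectation under
BEHAVIOR = {0: 0.5, 1: 0.5}   # distribution the samples actually come from
REWARD = {0: 1.0, 1: 5.0}     # deterministic reward for each action


def exact_target_expectation():
    """E_target[R] = sum over actions of target(a) * reward(a)."""
    return sum(TARGET[a] * REWARD[a] for a in TARGET)


def importance_sampling_estimate(num_samples, seed=0):
    """Estimate E_target[R] from behavior samples, weighting by target/behavior."""
    rng = random.Random(seed)
    weighted_rewards = []
    for _ in range(num_samples):
        action = 0 if rng.random() < BEHAVIOR[0] else 1
        ratio = TARGET[action] / BEHAVIOR[action]   # importance weight
        weighted_rewards.append(ratio * REWARD[action])
    return incremental_mean(weighted_rewards)


if __name__ == "__main__":
    print("exact:", exact_target_expectation())
    print("importance sampling estimate:", importance_sampling_estimate(20_000))
```

The incremental form gives the same result as summing and dividing, but it only needs the previous mean and the sample count, which is why it appears in online learning updates.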

Modern Reinforcement Learning: A Practical Course