E. Mathematical Foundations for RL

If you opened this appendix, it is probably because a formula in the main text slowed you down. Maybe it was an expectation symbol inside a Bellman equation, a KL divergence term inside PPO, or the gradient operator that suddenly appears in the policy gradient theorem.

These formulas can look intimidating, but the underlying toolkit is small: scalars, vectors, matrices, probability, expectation, derivatives, gradients, entropy, and KL divergence. Once you understand what each term means and how the terms combine inside reinforcement learning formulas, the notation becomes far less mysterious.

This appendix does not organize math by algorithm. It follows a more natural learning order:

mathematical objects -> linear operations -> probability and expectation -> stochastic estimation -> recursive equations -> optimization and gradients -> distribution distance -> full RL formulas.

Learning Route

The whole appendix can be summarized as one line:

mathematical objects -> linear algebra -> probability and expectation -> stochastic estimation -> Bellman recursion -> optimization and gradients -> distribution distance -> RL derivations.

Appendix Map

Section	Topic	Main question
E.1 Mathematical objects and linear algebra	scalars, vectors, matrices, dot products, norms, linear equations	How do we write states, values, and parameters as computable objects?
E.2 Probability, expectation, and stochastic estimation	probability, conditional probability, random variables, expectation, variance, sampling	How do random trajectories become an average value?
E.3 Calculus and optimization	derivatives, gradients, chain rule, Taylor expansion, optimization algorithms	Which direction should parameters move?
E.4 Information theory and distribution distance	self-information, entropy, cross-entropy, KL, mutual information	How do we measure policy randomness and policy change?

A Running Example

Several sections reuse the same tiny two-state environment. There are two states, $s_1$ and $s_2$:

in $s_1$, the agent receives reward $2$, then moves to $s_2$
in $s_2$, the agent receives reward $1$, then moves back to $s_1$
the discount factor is $\gamma = 0.5$

Let the state values be $v_1$ and $v_2$. Intuitively:

$\begin{aligned} v_1 &= 2 + 0.5v_2, \\ v_2 &= 1 + 0.5v_1. \end{aligned}$
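To make the two equations concrete, here is a minimal sketch in plain Python that solves them in two ways: once by substitution, and once by applying both equations repeatedly starting from zero. The function names `solve_exact` and `solve_by_iteration` and the sweep count are illustrative choices, not notation used elsewhere in this appendix.

```python
# Two-state running example:
#   v1 = 2 + 0.5 * v2
#   v2 = 1 + 0.5 * v1
GAMMA = 0.5
REWARDS = (2.0, 1.0)  # reward received in s1 and in s2


def solve_exact(rewards=REWARDS, gamma=GAMMA):
    """Solve the two-variable linear system by substitution."""
    r1, r2 = rewards
    # Substitute v2 = r2 + gamma * v1 into v1 = r1 + gamma * v2:
    #   v1 = r1 + gamma * r2 + gamma**2 * v1
    v1 = (r1 + gamma * r2) / (1.0 - gamma ** 2)
    v2 = r2 + gamma * v1
    return v1, v2


def solve_by_iteration(rewards=REWARDS, gamma=GAMMA, sweeps=50):
    """Apply both equations repeatedly, starting from v1 = v2 = 0."""
    r1, r2 = rewards
    v1, v2 = 0.0, 0.0
    for _ in range(sweeps):
        # Both updates use the values from the previous sweep.
        v1, v2 = r1 + gamma * v2, r2 + gamma * v1
    return v1, v2


exact = solve_exact()
approx = solve_by_iteration()
print("exact:    ", exact)
print("iterative:", approx)
```

Both routes agree on $v_1 = 10/3 \approx 3.33$ and $v_2 = 8/3 \approx 2.67$. The iterative version converges because each sweep shrinks the remaining error by the factor $\gamma = 0.5$.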

This same example plays different roles in different modules:

in linear algebra, it is a two-variable linear system
in probability, it is "immediate reward + expected next-state value"
in stochastic estimation, it can be approximated from sampled trajectories
in optimization, it becomes a target for a value network or policy network
in information theory, it connects to policy distributions, exploration, and update constraints

If you can translate a complicated formula back into this two-state example, math stops being a wall and becomes a tool.
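As one illustration of the "sampled trajectories" role, the sketch below rolls the running example forward for a fixed number of steps and sums the discounted rewards. The names `STEP` and `rollout_return` and the chosen horizons are assumptions made for this example. Because this particular environment is deterministic, a single trajectory is enough; in a stochastic environment you would average the returns of many sampled trajectories instead.

```python
GAMMA = 0.5

# Deterministic dynamics of the running example:
# state -> (reward received in that state, next state)
STEP = {
    "s1": (2.0, "s2"),
    "s2": (1.0, "s1"),
}


def rollout_return(start, horizon, gamma=GAMMA):
    """Discounted sum of rewards along one trajectory of `horizon` steps."""
    state = start
    total = 0.0
    discount = 1.0
    for _ in range(horizon):
        reward, state = STEP[state]
        total += discount * reward
        discount *= gamma
    return total


# Longer trajectories give better approximations of v1 and v2.
for horizon in (1, 2, 5, 20):
    print(horizon, rollout_return("s1", horizon), rollout_return("s2", horizon))
```

With a horizon of 1 the estimate for $s_1$ is just the immediate reward $2$; as the horizon grows, the estimates approach $v_1 = 10/3$ and $v_2 = 8/3$, the solution of the two equations above.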

How to Use This Appendix

The sidebar divides the content into four math modules. You do not need to read everything at once. There are three useful modes:

Systematic catch-up: start from E.1.1 and read in order.
Just-in-time lookup: if the matrix form of the Bellman equation is confusing, read E.1.2; if GAE is confusing, read the probability and calculus sections; if KL constraints are confusing, read E.4.2.
Quick review: use each module's formula summary and exercises after finishing the corresponding topic.

If the sidebar feels too large, read only the first page of each of the four modules (E.1 through E.4) first. After these four pages, return to the detailed topics whenever a formula in the main text needs support.
