
E.3 Calculus and Optimization

Training a reinforcement learning agent is, at its core, a matter of adjusting parameters so that the average return rises or the prediction error shrinks. The underlying language for this process is calculus. Derivatives tell us "which way to move", gradients tell us "how each parameter should move", and the chain rule lets that signal travel backward through the entire computation graph.

This section follows that thread. We start from functions and rates of change, move step by step to derivatives, gradients, and the chain rule, then see how these tools appear in policy gradients, Taylor approximations, PPO clipping, and GRPO normalization.
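To make the three ideas concrete before the subsections develop them, here is a minimal sketch in plain Python. The toy problem (fitting a single weight `w` so that `w * x` matches `y`), the function names, the learning rate, and the step count are all illustrative assumptions, not part of the course material. It shows a derivative obtained with the chain rule, a finite-difference check of that derivative, and repeated updates that reduce the prediction error.

```python
# Hypothetical toy problem: choose w so that the prediction w * x matches y.

def loss(w, x, y):
    """Squared prediction error of a one-parameter linear model."""
    return (w * x - y) ** 2


def grad_loss(w, x, y):
    """Analytic derivative via the chain rule.

    With u = w * x - y, the loss is u**2, so
    d(loss)/dw = d(u**2)/du * du/dw = 2 * u * x.
    """
    u = w * x - y
    return 2.0 * u * x


def numeric_grad(f, w, eps=1e-6):
    """Central finite difference, used only to sanity-check grad_loss."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


def gradient_descent(w0, x, y, lr=0.1, steps=50):
    """Repeatedly step against the derivative to shrink the loss."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_loss(w, x, y)
    return w


x, y = 2.0, 6.0  # assumed data point; the loss is minimized at w = 3
w_final = gradient_descent(0.0, x, y)
```

The sign of `grad_loss` says which way to move `w`, and its magnitude scales the step. With several parameters the same update is applied to each component of the gradient vector; to maximize return rather than minimize error, the step is taken along the gradient instead of against it.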

[Figure: gradient update diagram]

Roadmap

| Article | Mathematical progression | Role in reinforcement learning |
| --- | --- | --- |
| E.3.1 Derivatives, Gradients, and the Chain Rule | Function -> derivative -> gradient -> chain rule | Understand how parameters affect the objective function |
| E.3.2 From Gradients to Policy Gradients | Log-probability gradient -> return weighting -> advantage function | Derive the update direction behind "increase the probability of good actions" |
| E.3.3 Optimization Stability: PPO and Adam | Probability ratio -> clipping -> adaptive step size | Control policy update size and gradient noise |
| E.3.4 Derivation Tools: Log Trick and Taylor | Log-derivative trick -> Taylor expansion -> second-order intuition | Understand the derivation skeleton of policy gradients and PPO |
| E.3.5 Complete Optimization Formulas | Full expressions for PG, DQN, GAE, PPO, GRPO | Connect modern RL training objectives |
| E.3.6 Summary, Formulas, and Exercises | Formula review -> pitfalls -> exercises | Review and check understanding |

Modern Reinforcement Learning in Practice