
E.4 Information Theory and Distribution Distance

If you have trained a language model, you have probably seen it hesitate between two answers, or watched its style drift abruptly after a single update step. Two basic questions lie behind these problems: "how do we measure how random a distribution is?" and "how do we measure how far apart two distributions are?" Information theory supplies the tools that answer both.

Information theory began as the mathematical foundation of communication, but it appears throughout reinforcement learning: policy exploration relies on entropy, stable PPO updates rely on KL constraints, RLHF alignment training depends on cross-entropy and KL divergence, and DPO recasts these tools as a compact preference optimization loss.
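As a preview of the quantities named above, here is a minimal sketch of entropy, cross-entropy, and KL divergence for small discrete distributions. The function names and the example distributions (`hesitant`, `confident`, `drifted`) are illustrative assumptions, not part of the course code. The sketch uses natural logarithms (nats) and assumes `q` is nonzero wherever `p` is nonzero.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats; zero-probability outcomes contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical two-answer distributions, chosen only for illustration.
hesitant = [0.5, 0.5]    # the model cannot decide between two answers
confident = [0.9, 0.1]   # the model strongly prefers the first answer
drifted = [0.1, 0.9]     # after an update, the preference has flipped

print("H(hesitant)  =", entropy(hesitant))    # equals ln 2, the maximum for two outcomes
print("H(confident) =", entropy(confident))   # lower: less randomness
print("KL(confident || drifted) =", kl_divergence(confident, drifted))

# Cross-entropy decomposes into entropy plus KL divergence.
lhs = cross_entropy(confident, drifted)
rhs = entropy(confident) + kl_divergence(confident, drifted)
print("H(p, q) == H(p) + KL(p || q):", math.isclose(lhs, rhs))
```

The hesitant distribution has the largest entropy, which answers the "how random" question. The KL term grows as the updated distribution moves away from the original one, which answers the "how far apart" question.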

This section starts from the probability of a single event and builds up to the mathematical core of RLHF and DPO.

[Figure: Policy distributions, entropy, and KL]

Roadmap

| Subsection | Mathematical progression | Role in reinforcement learning |
| --- | --- | --- |
| E.4.1 Self-Information, Entropy, and Exploration | Probability event -> self-information -> entropy | Measures policy randomness and exploration |
| E.4.2 Cross-Entropy and KL Divergence | Encoding cost -> cross-entropy -> KL | Measures differences between prediction and policy distributions |
| E.4.3 KL Constraints, RLHF, and DPO | KL regularization -> log probability ratio -> preference loss | Explains policy constraints in alignment training |
| E.4.4 Mutual Information and Representation Learning | Reduction in conditional uncertainty -> mutual information | Measures task-relevant information kept in representations |
| E.4.5 Complete Information Theory Formulas | Full expressions for KL, RLHF, DPO, and mutual information | Unifies distribution distance and preference optimization |
| E.4.6 Summary, Formulas, and Exercises | Formula review -> common mistakes -> exercises | Reviews and checks understanding |

Modern Reinforcement Learning: A Practical Course