E.4.6 Information Theory Formula Reference and Exercises
Prerequisites: This page collects all formulas used in module E.4 and works best as a reference table after you have read E.4.1 through E.4.5. If this is your first pass, read the main articles first.
Information Theory Formulas You Will Meet in This Book
| Concept | Formula | Meaning in reinforcement learning |
|---|---|---|
| Self-information | $I(x)=-\log p(x)$ | Low-probability events contain more information |
| Entropy | $H(\pi)=-\sum_a \pi(a\mid s)\log\pi(a\mid s)$ | Policy randomness and exploration |
| Entropy bonus | $J(\pi_\theta)+\alpha H(\pi_\theta)$ | Encourages exploration and avoids premature certainty |
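| Cross-entropy | $H(p,q)=-\sum_x p(x)\log q(x)$ | Classification training and reward model training |
| KL divergence | $D_{KL}(p\Vert q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$ | Measures differences between old and new policies |
| Cross-entropy-KL relationship | $H(p,q)=H(p)+D_{KL}(p\Vert q)$ | KL is the extra encoding cost of using $q$ in place of $p$ |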
| KL penalty | $r(x,y)-\beta D_{KL}(\pi_\theta\Vert\pi_{ref})$ | Constrains policy drift in PPO/RLHF |
| RLHF objective | $J(\pi_\theta)=\mathbb{E}_{\pi_\theta}[r(x,y)]-\beta D_{KL}(\pi_\theta\Vert\pi_{ref})$ | Reward maximization with a reference model constraint |
| DPO loss | $\mathcal{L}_{DPO}=-\mathbb{E}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{ref}(y_w\mid x)}-\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{ref}(y_l\mid x)}\right)\right]$ | Uses preference data to optimize relative probabilities |
| Mutual information | $I(Z;Y)=H(Y)-H(Y\mid Z)$ | Whether a representation keeps task-relevant information |
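
The first rows of the table can be checked numerically. The sketch below is a minimal illustration, assuming discrete distributions given as plain Python lists and natural logarithms (nats); the function names and the two example distributions are hypothetical and chosen only for this page. It computes entropy, cross-entropy, and KL divergence, then confirms that cross-entropy equals entropy plus KL.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)); same support assumption."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three actions, chosen only for illustration.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl_pq = kl_divergence(p, q)

# Cross-entropy splits into entropy plus the extra cost measured by KL.
identity_gap = h_pq - (h_p + kl_pq)
print(f"H(p)={h_p:.4f}  H(p,q)={h_pq:.4f}  KL(p||q)={kl_pq:.4f}  gap={identity_gap:.2e}")
```

The printed gap should be zero up to floating-point rounding, which is the cross-entropy-KL relationship from the table.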
Summary
The formulas on this page build on one another: start from "smaller probability means more information" and "a more uniform policy has higher entropy," then extend to cross-entropy, KL, the RLHF regularized objective, and the DPO loss. When reading a complex information-theoretic formula, first ask: is this measuring randomness, prediction error, or the distance between two policy distributions?
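
As a concrete instance of the last step in that chain, the DPO loss for a single preference pair can be sketched as follows. This is an illustration under stated assumptions, not a training implementation: the function `dpo_loss` and its argument names are hypothetical, each argument is assumed to be the log-probability of a whole response already summed over tokens, and the default `beta=0.1` is an arbitrary example value.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given log-probabilities.

    logp_w / logp_l: log pi_theta(y_w | x) and log pi_theta(y_l | x).
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    """
    # beta * (log-ratio of the chosen response - log-ratio of the rejected one)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities, for illustration only.
loss_at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)   # policy equals reference
loss_after_shift = dpo_loss(-0.5, -2.0, -1.0, -2.0)    # chosen response more likely
print(loss_at_reference, loss_after_shift)
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; raising the relative probability of the preferred response lowers the loss.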
Common Mistakes
- Treating entropy as noise. High entropy means the policy is more random and may help exploration, but it does not mean the policy is worse.
- Treating KL as an ordinary distance. KL is asymmetric: $D_{KL}(p\|q)$ and $D_{KL}(q\|p)$ are generally different values and emphasize different errors.
- Thinking the KL constraint is only mathematical decoration. In RLHF, the KL term directly determines how far the model can move from the reference model.
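
The asymmetry of KL mentioned in the list above is easy to see on a small example. The sketch below assumes two hypothetical two-outcome distributions and natural logarithms; the helper `kl_divergence` is defined here only for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions: one uniform, one peaked.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # expectation taken under p
reverse = kl_divergence(q, p)   # expectation taken under q
print(f"KL(p||q)={forward:.4f}  KL(q||p)={reverse:.4f}")
```

The two directions give different numbers because each one weights the log-ratio by a different distribution, so swapping the arguments changes which mismatches are penalized most.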
Exercises
- Compare and . Which has higher entropy? Why?
- If the old policy is and the new policy is , write the expanded expression for .
- In the RLHF objective $J(\pi_\theta)=\mathbb{E}_{\pi_\theta}[r(x,y)]-\beta D_{KL}(\pi_\theta\|\pi_{ref})$, when $\beta$ becomes larger, does the policy update become more aggressive or more conservative?
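
After attempting the last exercise, one way to check your answer numerically is the sketch below. It assumes a toy setting with a finite set of three candidate responses, and it relies on the standard closed-form maximizer of the KL-regularized objective over a finite set, $\pi^*(y)\propto\pi_{ref}(y)\exp(r(y)/\beta)$. The reward values, reference probabilities, and function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def optimal_policy(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite set:
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical reference policy and rewards over three responses.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 0.5, 1.0]
betas = [0.1, 1.0, 10.0]

# How far the optimal policy moves from the reference for each beta.
drift = [kl_divergence(optimal_policy(ref_probs, rewards, b), ref_probs) for b in betas]
for b, d in zip(betas, drift):
    print(f"beta={b:<5} KL(pi*||pi_ref)={d:.4f}")
```

Compare how the printed KL values change as $\beta$ grows with the answer you wrote down.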