E.4.5 Complete Formulas for KL, RLHF, DPO, and Mutual Information
Prerequisites: This page collects the complete formulas from module E.4 in one place. Read E.4.1 through E.4.4 first, then use this page as a reference table.
The Relationship Between KL, Cross-Entropy, and Entropy
We have looked at entropy, cross-entropy, and KL divergence separately. They are tied together by one equation, which is the foundation for all later formulas on this page:
$$D_{KL}(P \| Q) = H(P, Q) - H(P)$$
Expanding the definitions:
$$H(P, Q) - H(P) = -\sum_x P(x) \log Q(x) + \sum_x P(x) \log P(x)$$
Combining the sums:
$$H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = D_{KL}(P \| Q)$$
This shows that KL divergence can be understood as the extra information cost paid when the true distribution is $P$ but you encode it with $Q$.
In machine learning:
- Minimize cross-entropy $H(P, Q)$ with respect to the model distribution $Q$.
- When $P$ is fixed, $H(P)$ is a constant, so this is equivalent to minimizing $D_{KL}(P \| Q)$.
This is the mathematical basis of cross-entropy loss in classification models, reward models, and language model training.
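The relationship between the three quantities can be checked numerically. The following is a minimal sketch in plain Python, assuming two small hand-picked discrete distributions (`p` and `q` are illustrative values, not data from this module) and natural logarithms:

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats; terms with P(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); assumes Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # hypothetical "true" distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

# Cross-entropy minus entropy should match the KL divergence up to rounding.
gap = cross_entropy(p, q) - entropy(p)
print(gap, kl_divergence(p, q))
```

Because `entropy(p)` does not depend on `q`, any change to `q` that lowers the cross-entropy lowers the KL divergence by the same amount.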
Advanced Formula: KL-Regularized Objective in RLHF
This section places KL divergence inside the full RLHF optimization objective. The reward term pushes the model toward high-scoring answers, while the KL term acts like a safety rope that pulls the model back near the reference model.
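RLHF policy optimization is often written as:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big] - \beta \, D_{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big]$$
where $x$ is a prompt drawn from the dataset $\mathcal{D}$ and:
- $r(x, y)$ is the score assigned by the reward model to answer $y$.
- $\pi_\theta$ is the current policy model being optimized.
- $\pi_{\text{ref}}$ is the reference model, usually the SFT model.
- $\beta$ controls the tradeoff between "pursue reward" and "do not drift away from the reference model."
If $\beta$ is too small, the model can drift too far in pursuit of reward and produce reward hacking. If $\beta$ is too large, the policy barely moves away from the reference model and learns little.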
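The effect of the tradeoff coefficient can be illustrated on a toy problem. The sketch below assumes a single prompt with only three candidate answers, so the expectation and the KL term can be computed exactly. All numbers and the function name `kl_regularized_objective` are hypothetical; real RLHF estimates these quantities from sampled token sequences.

```python
import math

def kl_regularized_objective(rewards, policy, reference, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)
    return expected_reward - beta * kl

rewards = [1.0, 0.2, -0.5]    # hypothetical reward-model scores for three answers
reference = [0.3, 0.4, 0.3]   # hypothetical reference (SFT) policy
greedy = [0.98, 0.01, 0.01]   # policy that chases the top-scoring answer
cautious = [0.4, 0.4, 0.2]    # policy that stays close to the reference

for beta in (0.01, 1.0):
    print(
        beta,
        kl_regularized_objective(rewards, greedy, reference, beta),
        kl_regularized_objective(rewards, cautious, reference, beta),
    )
```

With the small coefficient the greedy policy scores higher, because the KL penalty is almost free. With the large coefficient the cautious policy scores higher, because drifting away from the reference costs more than the extra reward is worth.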
Advanced Formula: DPO's Log Probability Ratio
DPO does not explicitly train a reward model and then run PPO. Instead, it directly optimizes the policy using preference data. Its core tool is the log probability ratio, which compares how much the current model prefers a certain answer relative to the reference model.
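For a preference sample $(x, y_w, y_l)$, where $y_w$ is the better answer and $y_l$ is the worse answer, the DPO loss is often written as:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$
Here $\sigma$ is the sigmoid function and $\beta$ scales the gap between the two log probability ratios.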
This expression can be read term by term:
- If the model raises the winner's probability above what the reference model assigns, the first term becomes larger.
- If the model raises the loser's probability above what the reference model assigns, the second term becomes larger and offsets the advantage.
- The larger the difference between the two terms, the more the model agrees with the preference data, and the lower the loss.
The core of DPO is not to make the winner's probability infinitely large. It is that, relative to the reference model, the winner should be preferred over the loser. This is the implicit form of KL regularization in preference learning.
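The per-sample loss can be sketched in a few lines. This is a minimal illustration, assuming the four sequence log probabilities have already been computed; the function name `dpo_loss`, the default `beta=0.1`, and all log probability values below are hypothetical.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

ref_w, ref_l = -12.0, -11.0  # hypothetical reference log probs (winner, loser)

loss_same = dpo_loss(ref_w, ref_l, ref_w, ref_l)         # policy equals reference
loss_good = dpo_loss(-10.0, -13.0, ref_w, ref_l)         # winner up, loser down
loss_bad = dpo_loss(-13.0, -9.0, ref_w, ref_l)           # winner down, loser up
print(loss_same, loss_good, loss_bad)
```

When the policy equals the reference, both log ratios are zero and the loss is $\log 2$. Only the gap between the two log ratios matters: shifting the winner and the loser by the same amount relative to the reference leaves the loss unchanged.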
Advanced Formula: Mutual Information and Representation Learning
Mutual information combines entropy and KL divergence to answer how much information two random variables share:
$$I(X; Y) = H(X) - H(X \mid Y) = D_{KL}\big( P(X, Y) \,\|\, P(X) P(Y) \big)$$
In representation learning, it is used to evaluate whether a state representation keeps information related to task return.
In representation learning for reinforcement learning, we may want the state representation $Z$ and the future return $G$ to have high mutual information:
$$\max \; I(Z; G)$$
This means the representation preserves information related to task return. At the same time, we may want the representation to have low mutual information with irrelevant noise, improving generalization.
Formulas like this do not necessarily appear directly in basic algorithms, but they are common in exploration, representation learning, world models, and unsupervised RL.
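For small discrete variables, mutual information can be computed directly from a joint probability table. The sketch below assumes a hypothetical 2x2 joint distribution in which rows index a discretized representation value and columns index a discretized return. It computes the KL form and checks it against the entropy-reduction form. Practical representation learning relies on estimators instead, because the joint distribution is not known.

```python
import math

def entropy(p):
    """H(P) in nats; zero-probability terms contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def conditional_entropy(joint):
    """H(X | Y) = -sum_{x,y} P(x, y) log P(x | y); rows are x, columns are y."""
    py = [sum(col) for col in zip(*joint)]
    return -sum(
        pxy * math.log(pxy / py[j])
        for row in joint
        for j, pxy in enumerate(row)
        if pxy > 0
    )

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} P(x, y) log(P(x, y) / (P(x) P(y)))."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

informative = [[0.45, 0.05], [0.05, 0.45]]  # representation tracks the return
independent = [[0.25, 0.25], [0.25, 0.25]]  # representation carries no information

mi_informative = mutual_information(informative)
mi_independent = mutual_information(independent)
px = [sum(row) for row in informative]
print(mi_informative, entropy(px) - conditional_entropy(informative), mi_independent)
```

The first two printed values agree up to rounding, and the independent table gives zero mutual information.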
Summary
This page summarized the core formulas from module E.4:
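| Formula category | Core equation/expression | Intuition |
|---|---|---|
| KL-cross-entropy-entropy | $D_{KL}(P \Vert Q) = H(P, Q) - H(P)$ | The extra encoding cost is the distribution gap |
| RLHF objective | $\max_{\pi_\theta} \mathbb{E}[r(x, y)] - \beta D_{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ | Pursue reward but do not move too far from the reference model |
| DPO loss | $-\log \sigma\big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\big)$ | A larger relative probability gap is better |
| Mutual information | $I(X; Y) = H(X) - H(X \mid Y)$ | How much uncertainty in $X$ is reduced after knowing $Y$ |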
Next: E.4.6 Formula Reference and Exercises -- review all formulas in this module and check your understanding with exercises.