E.4.2 Cross-Entropy and KL Divergence
Prerequisites: E.4.1 Self-Information, Entropy, and Exploration. You need to know the definition of entropy.
In the previous section, we used entropy to measure the randomness of a single policy. But in training, the more common situation is comparing two distributions: the model's predicted distribution vs. the true label distribution, or a new policy vs. an old policy. This requires two new tools: cross-entropy and KL divergence.
Cross-Entropy: The Cost of Walking with the Wrong Map
Cross-entropy measures the cost of making predictions with the wrong distribution. Classification models and reward models use it as a training loss. When the true distribution is $P$ and you predict with $Q$, cross-entropy tells you how large the cost is.
Imagine trying to find a road while holding an inaccurate map. If the map is close to the real roads, you only take a small detour. If the map is badly distorted, you may get completely lost. Cross-entropy measures this "cost of walking with the wrong map." Mathematically, it is defined as:

$$H(P, Q) = -\sum_x P(x) \log Q(x)$$
It looks similar to entropy. The difference is that the logarithm contains $Q(x)$ rather than $P(x)$. In other words, outcomes occur according to $P$, but you are encoding them with the "wrong" distribution $Q$.
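The definition can be sketched in a few lines of Python. This is a minimal illustration using natural logarithms; the three distributions are made-up values chosen for the example, and the helper names `entropy` and `cross_entropy` are not taken from any library.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in nats; outcomes with P(x) = 0 contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 * log Q(x) is treated as 0
        if qx == 0:
            return math.inf  # Q rules out an outcome that P says can happen
        total -= px * math.log(qx)
    return total


# Illustrative distributions (assumed for this sketch).
p = [0.7, 0.2, 0.1]        # the "real roads"
q_close = [0.6, 0.3, 0.1]  # a map with small errors
q_far = [0.1, 0.2, 0.7]    # a badly distorted map

print(entropy(p))                 # cost with a perfect map
print(cross_entropy(p, q_close))  # slightly higher
print(cross_entropy(p, q_far))    # much higher
```

With a perfect map ($Q = P$) the cross-entropy equals the entropy of $P$, and it can only go up from there.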
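Consider a classification example. Suppose the correct answer is the first class, so the true distribution $P$ is one-hot: $P(x_1) = 1$ and $P(x) = 0$ for every other class.
The model predicts a distribution $Q$ that assigns probability $Q(x_1)$ to the correct class.
Because $P(x) = 0$ for every class except the first, the cross-entropy keeps only the term for the correct class:

$$H(P, Q) = -\log Q(x_1)$$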
If the model predicts the correct class with more confidence, the probability it assigns to that class moves closer to 1, and the loss, the negative logarithm of that probability, moves closer to 0. The more accurate the prediction, the lower the cross-entropy. This is why it is widely used as a training loss for classification models. In RLHF, reward models, preference models, and policy models all rely on it.
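To make the classification case concrete, here is a small sketch with assumed numbers: three classes, the first one correct, and two hypothetical model predictions. The values are illustrative and are not the ones used in the original example.

```python
import math


def one_hot_cross_entropy(q, correct_index):
    """Cross-entropy against a one-hot target: only -log Q(correct class) survives."""
    return -math.log(q[correct_index])


# Hypothetical predictions over three classes; class 0 is the correct one.
hesitant = [0.6, 0.3, 0.1]
confident = [0.9, 0.05, 0.05]

loss_hesitant = one_hot_cross_entropy(hesitant, 0)
loss_confident = one_hot_cross_entropy(confident, 0)

print(f"{loss_hesitant:.3f}")   # 0.511
print(f"{loss_confident:.3f}")  # 0.105
```

Raising the probability of the correct class from 0.6 to 0.9 cuts the loss to roughly a fifth of its previous value.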
KL Divergence: The "Surprise" Between Two Distributions
Cross-entropy tells us how much it costs to predict one distribution with another. If we subtract the entropy of the true distribution from that cost, what remains reflects only the mismatch between the two distributions: $D_{\mathrm{KL}}(P \parallel Q) = H(P, Q) - H(P)$. This remainder is the KL divergence. PPO and RLHF use it to prevent a policy from changing too much.
Intuitively, KL divergence measures this: if your true belief is distribution $P$, but you must act according to distribution $Q$, how "surprised" would you be? The formula is:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
$\frac{P(x)}{Q(x)}$ is the ratio of two probabilities. If $P$ and $Q$ agree on some $x$, the ratio is close to 1, $\log 1 = 0$, and there is no surprise. If they disagree strongly, the ratio moves away from 1 and the KL divergence becomes larger.
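The following sketch implements KL divergence directly from the ratio form and checks numerically that it matches "cross-entropy minus entropy". The distributions and the helper names are assumptions made for the illustration.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # outcomes P never produces add no surprise
        if qx == 0:
            return math.inf  # Q calls impossible something P allows
        total += px * math.log(px / qx)
    return total


# Illustrative distributions (assumed for this sketch).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

gap = cross_entropy(p, q) - entropy(p)
print(kl_divergence(p, q), gap)  # the two values agree up to rounding
print(kl_divergence(p, p))       # 0.0: identical distributions, no surprise
```

Returning infinity when $Q(x) = 0$ but $P(x) > 0$ is one common convention; practical implementations often clip probabilities instead.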
Consider a practical RL situation. PPO and RLHF often need to compare old and new policies. Suppose the old policy is:
Two candidate new policies are:
Intuitively, new policy 2 is farther from the old policy. Using KL divergence, with the old policy as $P$ and new policy 1 as $Q$:
With new policy 2 as $Q$:
The second value is larger, showing that new policy 2 deviates from the old policy more aggressively.
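The comparison can be reproduced with a short sketch. The three policies below are hypothetical action distributions picked for illustration (one cautious update, one aggressive update); they are not the values from the example above.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical action probabilities over three actions.
old_policy = [0.5, 0.3, 0.2]
new_policy_1 = [0.45, 0.35, 0.2]  # small adjustment
new_policy_2 = [0.1, 0.2, 0.7]    # most probability moved to the third action

kl_small = kl_divergence(old_policy, new_policy_1)
kl_large = kl_divergence(old_policy, new_policy_2)

print(f"KL(old || new 1) = {kl_small:.4f}")
print(f"KL(old || new 2) = {kl_large:.4f}")
```

A training loop could compare such a value against a threshold to decide whether an update has drifted too far, which is the role KL plays in the methods discussed here.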
Why KL Divergence Is Not Symmetric
With the basic use of KL divergence in place, one common pitfall deserves attention: KL divergence is not symmetric. That is, in general $D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$. The choice of direction is not arbitrary. PPO and RLHF use different directions, emphasizing different types of error.
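To understand the asymmetry, consider two distributions over two actions: $P$, which puts almost all of its probability on the first action, and $Q$, which splits its probability evenly between the two.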
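Looking at $Q$ from the perspective of $P$, namely $D_{\mathrm{KL}}(P \parallel Q)$: in the real world, the first action almost always happens, but your model assigns half of the probability to the second action. This error, being vague when reality is almost certain, is penalized, but the penalty is bounded: against an even split over two actions it can never exceed $\log 2$.
Looking at $P$ from the perspective of $Q$, namely $D_{\mathrm{KL}}(Q \parallel P)$: in the real world, both actions are possible, but your model assigns almost all probability to the first one. This error, being overconfident when reality is ambiguous, has a completely different character: the penalty grows without bound as the probability the model assigns to the second action approaches zero.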
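A quick numerical check shows how different the two directions can be. The sketch assumes a near-deterministic distribution of (0.99, 0.01) and an even split of (0.5, 0.5); these specific numbers are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Assumed values: the first action almost always happens under p,
# while q treats both actions as equally likely.
p = [0.99, 0.01]
q = [0.5, 0.5]

forward = kl_divergence(p, q)  # reality is p, the model says q (too vague)
reverse = kl_divergence(q, p)  # reality is q, the model says p (overconfident)

print(f"KL(P || Q) = {forward:.3f}")  # roughly 0.637
print(f"KL(Q || P) = {reverse:.3f}")  # roughly 1.614
```

With these numbers, swapping the arguments changes the result by more than a factor of two, so the direction has to be stated whenever a KL value is reported.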
So when using KL divergence, the direction is not arbitrary. PPO uses $D_{\mathrm{KL}}(\pi_{\text{old}} \parallel \pi_{\text{new}})$, meaning "from the old policy's perspective, how much has the new policy changed?" RLHF uses $D_{\mathrm{KL}}(\pi_\theta \parallel \pi_{\text{ref}})$, meaning "from the current model's perspective, how far is it from the reference model?" Different directions emphasize different biases.
Summary
This article introduced two tools for comparing distributions:
| Concept | Problem it solves | Core formula | Role in RL |
|---|---|---|---|
| Cross-entropy | How costly it is to predict with the wrong distribution | $H(P, Q) = -\sum_x P(x) \log Q(x)$ | Training loss for classifiers and reward models |
| KL divergence | How far apart two distributions are | $D_{\mathrm{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ | Constrains policy drift in PPO/RLHF |
| KL asymmetry | Different KL directions mean different things | $D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)$ | PPO and RLHF use different directions |
The relationship between cross-entropy and KL divergence, $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \parallel Q)$, is the key bridge for understanding PPO, RLHF, and DPO in the next article.
Next: E.4.3 Information Theory in PPO, RLHF, and DPO -- applying cross-entropy and KL to alignment training.