
E.4.2 Cross-Entropy and KL Divergence

Prerequisites: E.4.1 Self-Information, Entropy, and Exploration. You need to know the definition of entropy.


In the previous section, we used entropy to measure the randomness of a single policy. But in training, the more common situation is comparing two distributions: the model's predicted distribution vs. the true label distribution, or a new policy vs. an old policy. This requires two new tools: cross-entropy and KL divergence.

Cross-Entropy: The Cost of Walking with the Wrong Map

Cross-entropy measures the cost of making predictions with the wrong distribution. Classification models and reward models use it as a training loss. When the true distribution is $P$ and you predict with $Q$, cross-entropy tells you how large the cost is.

Imagine trying to find a road while holding an inaccurate map. If the map is close to the real roads, you only take a small detour. If the map is badly distorted, you may get completely lost. Cross-entropy measures this "cost of walking with the wrong map." Mathematically, it is defined as:

$$H(P,Q)=-\sum_x P(x)\log Q(x).$$

It looks similar to entropy. The difference is that the logarithm contains $Q$ rather than $P$. In other words, you are encoding with the "wrong" distribution.

Consider a classification example. Suppose the correct answer is the first class:

$$P=[1,0].$$

The model predicts:

$$Q=[0.8,0.2].$$

Because $P=[1,0]$, the cross-entropy keeps only the term for the correct class:

$$H(P,Q)=-\log 0.8.$$

If the model predicts the correct class with more confidence, for example $Q=[0.95,0.05]$, the loss becomes $-\log 0.95$, which is smaller than $-\log 0.8$ (about 0.051 versus 0.223 with the natural logarithm). The more probability the model puts on the correct class, the lower the cross-entropy. This is why it is widely used as a training loss for classification models. In RLHF, reward models, preference models, and policy models all rely on it.
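The classification example above can be reproduced in a few lines. This is a minimal sketch, not a production loss function: the helper name `cross_entropy` is made up for illustration, and it assumes the natural logarithm (so values are in nats) and that the prediction gives nonzero probability wherever the true distribution does.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x), in nats.

    Terms with P(x) == 0 are skipped because they contribute nothing.
    Assumes Q(x) > 0 wherever P(x) > 0; otherwise math.log raises ValueError.
    """
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [1.0, 0.0]                                  # one-hot label: first class is correct
loss_ok = cross_entropy(p, [0.8, 0.2])          # equals -log 0.8
loss_better = cross_entropy(p, [0.95, 0.05])    # equals -log 0.95

print(f"{loss_ok:.4f} {loss_better:.4f}")       # prints: 0.2231 0.0513
```

Only the correct class contributes to the sum, so the loss depends solely on the probability the model gives to that class.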


KL Divergence: The "Surprise" Between Two Distributions

Cross-entropy tells us how much it costs to predict one distribution with another. But if we subtract the entropy of the true distribution itself from that cost, the remaining part purely reflects the difference between the two distributions. This is the problem KL divergence solves. KL divergence measures the difference between two distributions, and PPO and RLHF use it to prevent a policy from changing too much.

Intuitively, KL divergence measures this: if your true belief is distribution $P$, but you must act according to distribution $Q$, how "surprised" would you be? The formula is:

$$D_{KL}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}.$$

$\frac{P(x)}{Q(x)}$ is the ratio of two probabilities. If $P$ and $Q$ agree on some $x$, the ratio is close to 1, $\log 1=0$, and there is no surprise. If they disagree strongly, the ratio moves away from 1 and the KL divergence becomes larger.
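The subtraction described at the start of this section can be written out explicitly. The derivation below is a short sketch that assumes the entropy definition from E.4.1, $H(P)=-\sum_x P(x)\log P(x)$, together with the cross-entropy definition given earlier:

```latex
\begin{aligned}
D_{KL}(P\|Q)
  &= \sum_x P(x)\log\frac{P(x)}{Q(x)} \\
  &= \sum_x P(x)\log P(x) - \sum_x P(x)\log Q(x) \\
  &= -H(P) + H(P,Q) \\
  &= H(P,Q) - H(P).
\end{aligned}
```

When $P$ is fixed, as with training labels, $H(P)$ is a constant, so minimizing cross-entropy over $Q$ is the same as minimizing $D_{KL}(P\|Q)$.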

Consider a practical RL situation. PPO and RLHF often need to compare old and new policies. Suppose the old policy is:

$$\pi_{old}=[0.5,0.5].$$

Two candidate new policies are:

$$\pi_{new}^{(1)}=[0.6,0.4], \qquad \pi_{new}^{(2)}=[0.9,0.1].$$

Intuitively, new policy 2 is farther from the old policy. Using KL divergence, with the old policy as $P$ and new policy 1 as $Q$:

$$D_{KL}(\pi_{old}\|\pi_{new}^{(1)}) =0.5\log\frac{0.5}{0.6}+0.5\log\frac{0.5}{0.4}.$$

With new policy 2 as $Q$:

$$D_{KL}(\pi_{old}\|\pi_{new}^{(2)}) =0.5\log\frac{0.5}{0.9}+0.5\log\frac{0.5}{0.1}.$$

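With the natural logarithm, the first value is about 0.020 nats and the second is about 0.511 nats. The second is roughly 25 times larger, showing that new policy 2 deviates from the old policy far more aggressively.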
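The two KL values for the old and new policies can be checked numerically. The following is a minimal sketch under stated assumptions: `kl_divergence` is an illustrative helper (not taken from any library), it uses the natural logarithm, and it expects the second distribution to be nonzero wherever the first one is.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats.

    Terms with P(x) == 0 are skipped (they contribute zero by convention).
    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

pi_old = [0.5, 0.5]
pi_new_1 = [0.6, 0.4]   # small step away from the old policy
pi_new_2 = [0.9, 0.1]   # large step away from the old policy

kl_1 = kl_divergence(pi_old, pi_new_1)
kl_2 = kl_divergence(pi_old, pi_new_2)

print(f"{kl_1:.4f} {kl_2:.4f}")   # prints: 0.0204 0.5108
```

A training loop that constrains policy drift would compare a quantity like this against a threshold or add it to the loss as a penalty; the details are left to the next section.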


Why KL Divergence Is Not Symmetric

A common pitfall when using KL divergence is assuming that it is symmetric. In general, it is not:

$$D_{KL}(P\|Q)\neq D_{KL}(Q\|P).$$

The choice of direction therefore matters. PPO and RLHF use different directions, which emphasize different types of error.

To understand the asymmetry, consider:

$$P=[0.99,0.01], \qquad Q=[0.5,0.5].$$

Looking at $Q$ from the perspective of $P$, namely $D_{KL}(P\|Q)$: in the real world, the first action almost always happens, but your model $Q$ assigns half of the probability to the second action. This error, being vague when reality is almost certain, costs about 0.64 nats (using the natural logarithm).

Looking at $P$ from the perspective of $Q$, namely $D_{KL}(Q\|P)$: in the real world, both actions are equally likely, but your model $P$ assigns almost all probability to the first one. This error, being overconfident when reality is ambiguous, has a different character and a different size: about 1.61 nats, because the model gives probability 0.01 to an action that actually occurs half the time.
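The asymmetry for $P=[0.99,0.01]$ and $Q=[0.5,0.5]$ can be verified by computing both directions. This is a small sketch that assumes the natural logarithm; the helper `kl_divergence` is an illustrative name, not a library function.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.99, 0.01]   # nearly deterministic distribution
q = [0.5, 0.5]     # uniform distribution

kl_pq = kl_divergence(p, q)   # truth is P, model is Q: model is too vague
kl_qp = kl_divergence(q, p)   # truth is Q, model is P: model is overconfident

print(f"{kl_pq:.4f} {kl_qp:.4f}")   # prints: 0.6371 1.6145
```

The two numbers differ because each direction weights the log-ratio by a different distribution. In the second direction, the rare-under-$P$ action carries weight 0.5, so its large log-ratio dominates the sum.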

So when using KL divergence, the direction is not arbitrary. PPO uses $D_{KL}(\pi_{old}\|\pi_{new})$, meaning "from the old policy's perspective, how much has the new policy changed?" RLHF uses $D_{KL}(\pi_\theta\|\pi_{ref})$, meaning "from the current model's perspective, how far is it from the reference model?" Different directions emphasize different biases.


Summary

This article introduced two tools for comparing distributions (neither is a true distance, since both depend on the order of their arguments):

| Concept | Problem it solves | Core formula | Role in RL |
| --- | --- | --- | --- |
| Cross-entropy | How costly it is to predict with the wrong distribution | $H(P,Q)=-\sum_x P(x)\log Q(x)$ | Training loss for classifiers and reward models |
| KL divergence | How far apart two distributions are | $D_{KL}(P\Vert Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)}$ | Constrains policy drift in PPO/RLHF |
| KL asymmetry | Different KL directions mean different things | $D_{KL}(P\Vert Q)\neq D_{KL}(Q\Vert P)$ | PPO and RLHF use different directions |

The relationship between cross-entropy and KL divergence, $D_{KL}(P\|Q)=H(P,Q)-H(P)$, is the key bridge for understanding PPO, RLHF, and DPO in the next article.
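The identity $D_{KL}(P\|Q)=H(P,Q)-H(P)$ can be confirmed numerically. The sketch below uses made-up example distributions and illustrative helper names, and assumes the natural logarithm throughout. It also checks the one-hot case, where $H(P)=0$ and cross-entropy coincides with KL divergence.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) * log P(x), in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions chosen only for illustration.
p = [0.7, 0.3]
q = [0.5, 0.5]

gap = cross_entropy(p, q) - entropy(p)
identity_holds = math.isclose(gap, kl_divergence(p, q), rel_tol=1e-9)

# With a one-hot P, the entropy term vanishes, so H(P, Q) equals D_KL(P || Q).
one_hot = [1.0, 0.0]
pred = [0.8, 0.2]
one_hot_match = math.isclose(cross_entropy(one_hot, pred),
                             kl_divergence(one_hot, pred), rel_tol=1e-9)

print(identity_holds, one_hot_match)   # prints: True True
```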

Next: E.4.3 Information Theory in PPO, RLHF, and DPO -- applying cross-entropy and KL to alignment training.

Modern Reinforcement Learning: A Practical Course