E.4.4 Mutual Information and Representation Learning
Prerequisites: E.4.1 Entropy and E.4.2 KL Divergence. You need to know the definitions of entropy and KL divergence.
The previous three sections covered entropy, which describes the uncertainty of a single distribution, and KL divergence, which measures how much one distribution differs from another. Sometimes we want to ask a subtler question: how much information do two random variables share? Mutual information answers this. It measures how much the uncertainty of one variable decreases once we know the other, and in representation learning it is used to evaluate whether a state representation keeps task-relevant information.
Mutual Information: How Much Knowing One Variable Helps
The central question mutual information answers is: after knowing $Y$, how much does the uncertainty of $X$ decrease? The formula is:

$$I(X; Y) = H(X) - H(X \mid Y)$$
Here $H(X)$ is the uncertainty of $X$ itself, namely its entropy, and $H(X \mid Y)$ is "how much uncertainty remains in $X$ after $Y$ is known." The difference is the uncertainty removed because we know $Y$.
Consider an intuitive example. Suppose $X$ is "whether the next step succeeds," and $Y$ is "the current state representation." If we do not know the state representation, success and failure are equally likely:

$$H(X) = 1 \text{ bit}$$
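After knowing the state representation, we can almost determine whether the step will succeed, and the uncertainty drops to a small value, say:

$$H(X \mid Y) = 0.1 \text{ bit}$$

Then the mutual information is:

$$I(X; Y) = H(X) - H(X \mid Y) = 1 - 0.1 = 0.9 \text{ bit}$$

$0.9$ bit means that $Y$ removes $90\%$ of the uncertainty in $X$. It helps a great deal in predicting $X$.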
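The entropy-difference definition can be checked numerically on a small joint table. The following is a minimal sketch, not a reference implementation: the helper names `entropy_bits` and `mutual_information_bits` and the numbers in the joint table are assumptions chosen only for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector; zero entries contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def mutual_information_bits(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint table with rows indexed by x, columns by y."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal of X
    p_y = joint.sum(axis=0)  # marginal of Y
    # H(X|Y) = sum over y of p(y) * H(X | Y = y)
    h_x_given_y = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            h_x_given_y += py * entropy_bits(joint[:, j] / py)
    return entropy_bits(p_x) - h_x_given_y

# Hypothetical joint distribution (illustrative numbers only):
# rows are X in {fail, success}, columns are Y in {"bad" state, "good" state}.
# Y agrees with X 98% of the time, and success/failure are equally likely.
joint = np.array([[0.49, 0.01],
                  [0.01, 0.49]])

h_x = entropy_bits(joint.sum(axis=1))   # uncertainty before seeing Y
mi = mutual_information_bits(joint)     # uncertainty removed by Y
print(f"H(X) = {h_x:.3f} bits, I(X;Y) = {mi:.3f} bits")
```

With this particular table, $H(X)$ is 1 bit and the mutual information comes out at about 0.86 bits, so most of the uncertainty about success is removed once the state representation is known.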
Mutual information can also be defined using KL divergence:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$

This form says that mutual information is the KL divergence between the joint distribution of $X$ and $Y$ and the product of their marginals, which is the distribution that would hold if they were independent. If $X$ and $Y$ are truly independent, these two distributions are the same and the mutual information is $0$.
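The KL form can be sketched in a few lines as well. This is an illustrative example under stated assumptions: `kl_bits` and `mi_via_kl` are hypothetical helper names, and both joint tables are made up to contrast a dependent pair with an independent one.

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p = 0 contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def mi_via_kl(joint):
    """I(X;Y) as the KL divergence between the joint and the product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # p(x) * p(y)
    return kl_bits(joint, product)

# Hypothetical dependent pair: X and Y agree 80% of the time.
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
# Hypothetical independent pair: the joint is exactly the product of its marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])

mi_dep = mi_via_kl(dependent)
mi_ind = mi_via_kl(independent)
print(f"dependent: {mi_dep:.3f} bits, independent: {mi_ind:.3f} bits")
```

The dependent table gives a strictly positive value (about 0.28 bits), while the independent table gives zero up to floating-point error, which matches the independence property stated above.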
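In reinforcement learning, mutual information is often used in representation learning. A good state representation should preserve information related to future return while discarding task-irrelevant noise. One way to express the first requirement, writing $Z$ for the representation and $G$ for the future return, is to keep the following quantity large:

$$I(Z; G)$$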
Formulas like this do not necessarily appear directly in basic algorithms, but they are common in research on exploration, representation learning, world models, and unsupervised RL.
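To make the idea of "keeping task-relevant information" concrete, mutual information can be estimated from samples when both variables are discrete. The sketch below uses a simple plug-in (count-based) estimate. Everything in it is an assumption for illustration: the function name `empirical_mi_bits`, the symbols `z` (a discrete representation code) and `g` (a binary return outcome), and the synthetic data. It is not a method taken from a specific algorithm.

```python
import numpy as np

def empirical_mi_bits(z, g):
    """Plug-in estimate of I(Z;G) in bits from paired discrete samples."""
    z = np.asarray(z)
    g = np.asarray(g)
    # Map arbitrary labels to integer indices 0..K-1.
    _, z_idx = np.unique(z, return_inverse=True)
    _, g_idx = np.unique(g, return_inverse=True)
    z_idx = z_idx.reshape(-1)
    g_idx = g_idx.reshape(-1)
    # Build the empirical joint distribution from co-occurrence counts.
    counts = np.zeros((z_idx.max() + 1, g_idx.max() + 1))
    np.add.at(counts, (z_idx, g_idx), 1)
    joint = counts / counts.sum()
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / product[mask])).sum())

rng = np.random.default_rng(0)
n = 20_000
g = rng.integers(0, 2, size=n)            # hypothetical outcome: low (0) or high (1) return
flip = rng.random(n) < 0.1
z_informative = np.where(flip, 1 - g, g)  # code that matches the outcome about 90% of the time
z_noise = rng.integers(0, 4, size=n)      # code drawn independently of the outcome

mi_informative = empirical_mi_bits(z_informative, g)
mi_noise = empirical_mi_bits(z_noise, g)
print(f"informative code: {mi_informative:.3f} bits, noise code: {mi_noise:.4f} bits")
```

The informative code should score clearly higher than the noise code, whose estimate should sit near zero. The plug-in estimate is biased slightly upward on finite samples, and it only applies to discrete variables with few values; high-dimensional continuous representations generally call for other estimators or bounds.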
Summary
This article introduced a tool for measuring how much information two random variables share:
| Concept | Problem it solves | Core formula | Role in RL |
|---|---|---|---|
| Mutual information | How much one variable reduces uncertainty in another | $I(X; Y) = H(X) - H(X \mid Y)$ | Evaluates whether representations keep task-relevant information |
| KL definition of mutual information | Expressing mutual information with KL divergence | $I(X; Y) = D_{\mathrm{KL}}(p(x, y) \,\Vert\, p(x)\,p(y))$ | Mutual information is 0 under independence |
Mutual information connects the entropy and KL divergence from the previous articles: it uses KL to measure the difference between a joint distribution and the independence assumption, and it uses the reduction in entropy to measure information gain. The next article summarizes all complete formulas in module E.4.
Next: E.4.5 Complete Information Theory Formulas -- full expressions for KL, RLHF, DPO, and mutual information.