E.2.1 Probability Basics: Probability, Conditional Probability, and Expectation
Prerequisites: This article does not require prior knowledge of probability theory, but it helps to read the two-state running example in the appendix introduction first.
Start With "What Could Happen?"
Probability theory does not begin with policies. It begins with a more basic question: what can a random event look like?
The sample space $\Omega$ is the set of all possible outcomes. For example, when rolling one die:
$$\Omega = \{1, 2, 3, 4, 5, 6\}$$
An event is a subset of the sample space. For example, the event "the die shows an even number" is:
$$A = \{2, 4, 6\}$$
A random variable is a function that maps random outcomes to numbers. If $X$ denotes the die result, then $X$ takes values in $\{1, 2, 3, 4, 5, 6\}$. In reinforcement learning, the reward $R_t$, the return $G_t$, and the next state $S_{t+1}$ can all be viewed as random variables.
Only after we have random variables can we talk about expectation, variance, and value functions. A value function is essentially the conditional expectation of a random return.
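To make the three definitions concrete, here is a minimal Python sketch of the die example. It assumes a fair die (the text above does not say the die is fair), and the names `probability` and `indicator_even` are illustrative helpers rather than part of any library.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# An event is a subset of the sample space: "the die shows an even number".
even = {outcome for outcome in sample_space if outcome % 2 == 0}


def probability(event, space):
    """Probability of an event, ASSUMING every outcome is equally likely."""
    return Fraction(len(event & space), len(space))


def indicator_even(outcome):
    """A random variable: maps each outcome to 1 if it is even, otherwise 0."""
    return 1 if outcome % 2 == 0 else 0


p_even = probability(even, sample_space)
print(sorted(even))
print(even <= sample_space)  # subset check
print(p_even)
```

The random variable here is just an ordinary function from outcomes to numbers, which is all the definition requires.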
Probability as Long-Run Frequency
Suppose a policy $\pi$ has two possible actions in state $s$:
| Action | Probability |
|---|---|
| left | |
| right |
This means that if the agent visits state $s$ many times, the fractions of visits on which it chooses left and right approach the two probabilities in the table.
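In RL notation, the policy is written as a conditional probability:
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$
The vertical bar reads as "conditioned on." The expression $\pi(a \mid s)$ means: given that the current state is $s$, what is the probability of choosing action $a$?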
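The long-run-frequency reading can be checked with a short simulation. The sketch below uses hypothetical policy probabilities (0.3 for left, 0.7 for right); these numbers are assumptions chosen for illustration and are not taken from the table above.

```python
import random

# HYPOTHETICAL policy probabilities pi(a | s) for a single state s.
policy = {"left": 0.3, "right": 0.7}


def sample_action(policy, rng):
    """Draw one action according to the policy's probabilities."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def empirical_frequencies(policy, n_visits, seed=0):
    """Visit the state n_visits times and record how often each action is chosen."""
    rng = random.Random(seed)
    counts = {a: 0 for a in policy}
    for _ in range(n_visits):
        counts[sample_action(policy, rng)] += 1
    return {a: counts[a] / n_visits for a in policy}


freqs = empirical_frequencies(policy, n_visits=100_000)
print(freqs)
```

With more visits, the printed frequencies should drift closer to the probabilities stored in `policy`, which is what "probability as long-run frequency" means.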
Conditional Probability and State Transitions
The policy can be random when it chooses actions, and the environment can also be random when it returns the next state. The vertical bar in the transition probability $P(s' \mid s, a)$ expresses this conditional relationship. To see the environment's randomness more concretely, suppose the agent executes action right in state $s$:
| Next state | Probability |
|---|---|
This can be written as:
It means: given that the current state is $s$ and the action is right, these are the probabilities of the possible next states.
The state transition probability in a Markov decision process is exactly this kind of conditional probability:
$$P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$$
Do not rush to memorize the symbol. Read it as a sentence: after taking action $a$ in state $s$, the probability of moving to next state $s'$.
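A transition probability can be stored as a lookup table keyed by the condition. The following sketch assumes a two-state environment with state names `s0` and `s1` and made-up probabilities; none of these values come from the article.

```python
import random

# HYPOTHETICAL transition table: (state, action) -> distribution over next states.
transitions = {
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s0", "left"): {"s0": 0.9, "s1": 0.1},
}


def transition_prob(next_state, state, action):
    """Return P(next_state | state, action); unlisted next states have probability 0."""
    return transitions[(state, action)].get(next_state, 0.0)


def sample_next_state(state, action, rng):
    """Sample the environment's random response to taking `action` in `state`."""
    dist = transitions[(state, action)]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


# For each fixed condition (state, action), the next-state probabilities sum to 1.
for condition, dist in transitions.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-12, condition

rng = random.Random(0)
p_right_to_s1 = transition_prob("s1", "s0", "right")
sampled = sample_next_state("s0", "right", rng)
print(p_right_to_s1, sampled)
```

The key point is that the probabilities sum to 1 separately for each condition, not across the whole table.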
Expectation as a Weighted Average
Consider a simple example. An action has two possible outcomes:
| Outcome | Probability | Reward |
|---|---|---|
| success | ||
| failure |
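The average reward is not the simple average of the two reward values. It is weighted by probability:
$$\mathbb{E}[R] = \Pr(\text{success}) \cdot r_{\text{success}} + \Pr(\text{failure}) \cdot r_{\text{failure}}$$
Here $r_{\text{success}}$ and $r_{\text{failure}}$ are the rewards of the two outcomes. $\mathbb{E}$ denotes expectation, and inside the brackets is the random variable, here the reward $R$. The line reads "$R$ has expectation equal to this probability-weighted sum." In plain language: if we repeat this action many times, the average reward per trial approaches that value.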
The structure of this formula is:
$$\mathbb{E}[X] = \sum_{x} x \, p(x)$$
- $\mathbb{E}$ reads as "expectation." It is not a new number; it names the operation of adding all possible outcomes after weighting them by probability.
- The brackets $[X]$ contain the random variable being averaged.
- $\sum_{x} x \, p(x)$ is the expanded form of expectation: multiply every possible outcome $x$ by its probability $p(x)$, then add them all.
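The weighted-average definition translates directly into a few lines of Python. This is a minimal sketch with hypothetical numbers (reward 10 with probability 0.8 on success, reward -5 with probability 0.2 on failure); the values are assumptions for illustration only.

```python
def expectation(distribution):
    """Expectation of a discrete random variable.

    `distribution` is a list of (value, probability) pairs.
    """
    total_prob = sum(p for _, p in distribution)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Multiply every outcome by its probability, then add them all.
    return sum(x * p for x, p in distribution)


# HYPOTHETICAL reward distribution: (reward, probability).
reward_dist = [(10.0, 0.8), (-5.0, 0.2)]

expected_reward = expectation(reward_dist)
simple_average = sum(x for x, _ in reward_dist) / len(reward_dist)

print(expected_reward)  # probability-weighted average
print(simple_average)   # unweighted average, ignores how likely each outcome is
```

The two printed numbers differ whenever the outcomes are not equally likely, which is why the simple average is the wrong summary here.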
The state value function in reinforcement learning is also an expectation:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$
The subscript $\pi$ means "under policy $\pi$," and the expression after the vertical bar is the condition. The full line reads: starting from state $s$, following policy $\pi$, what is the average future discounted return $G_t$?
Joint Probability, Marginal Probability, and the Law of Total Probability
So far we have looked at the conditional probability $P(s' \mid s, a)$: it describes which next state appears after a particular action is known. But many RL calculations need to combine conditional probabilities. For example, what is the total probability of reaching state $s'$ from state $s$? This total probability must add over all possible actions. To perform this fuller probability reasoning, we need three tools: joint probability, marginal probability, and the law of total probability.
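Joint probability describes the probability that two events happen together. For example, $\Pr(S_t = s, A_t = \text{right})$ is the probability that the agent is in state $s$ and chooses right.
If state $s$ appears with probability $\Pr(S_t = s)$, and the probability of choosing action right in this state is $\pi(\text{right} \mid s)$, then the probability of "being in $s$ and choosing right" is:
$$\Pr(S_t = s, A_t = \text{right}) = \Pr(S_t = s) \, \pi(\text{right} \mid s)$$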
Marginal probability removes variables we do not care about by summing them out. For example, suppose next state $s'$ can be caused by two actions:
| Action | $\pi(a \mid s)$ | $P(s' \mid s, a)$ |
|---|---|---|
| left | | |
| right | | |
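Then, under policy $\pi$, the total probability of moving from $s$ to $s'$ is:
$$\Pr(s' \mid s) = \pi(\text{left} \mid s) \, P(s' \mid s, \text{left}) + \pi(\text{right} \mid s) \, P(s' \mid s, \text{right})$$
This is the law of total probability:
$$\Pr(s' \mid s) = \sum_{a} \pi(a \mid s) \, P(s' \mid s, a)$$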
The "sum over actions first, then sum over next states" pattern that often appears in Bellman equations is fundamentally a combination of the law of total probability and expectation.
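The summing-out step can be sketched in a few lines. The probabilities below are hypothetical (they are not the values from the table above), and the variable names are illustrative.

```python
# HYPOTHETICAL policy pi(a | s) for one fixed state s.
policy = {"left": 0.3, "right": 0.7}

# HYPOTHETICAL transition probabilities P(s' | s, a) for one fixed next state s'.
p_next_given_action = {"left": 0.1, "right": 0.8}

# Joint probability of "choose a AND land in s'", given the current state s.
joint = {a: policy[a] * p_next_given_action[a] for a in policy}

# Law of total probability: sum the joint over all actions to remove the action.
p_next = sum(joint.values())

print(joint)
print(p_next)
```

Each entry of `joint` is one term of the sum over actions, and `p_next` is the marginal probability of reaching the next state without knowing which action was taken.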
Conditional Expectation: The Mathematical Core of Value Functions
Ordinary expectation asks "what is the average?" over all possible cases. In reinforcement learning, however, we usually do not care about the average over every possible situation. We care about the average return given that the current state is $s$. This is like asking not "what is the average score in the whole school?" but "given that this student is in the advanced class, what score should we expect?" Conditional expectation answers the question "what is the average when a condition is already known?" It is the mathematical core of value functions.
Suppose state $s$ has two actions:
| Action | Probability | Average return after the action |
|---|---|---|
| left | ||
| right |
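If we know the current state is $s$, but the action is still sampled from the policy, then the average return starting from $s$ is the policy-weighted sum of the average returns after each action:
$$\mathbb{E}_{\pi}[G_t \mid S_t = s] = \pi(\text{left} \mid s) \, \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = \text{left}] + \pi(\text{right} \mid s) \, \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = \text{right}]$$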
The state value function
$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$
is a conditional expectation. It is not the result of one trajectory. It is the average over all possible future trajectories under the condition "starting from state $s$."
Once this is clear, the role of randomness in value functions becomes clear as well: individual trajectories may differ and their returns may differ, but the state value itself is a fixed number, the conditional average of those returns.
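The difference between one sampled return and the conditional average can be seen in a small simulation. The sketch below uses hypothetical action probabilities and average returns, and it assumes, purely for illustration, that the return after each action is normally distributed around that action's average with standard deviation 1.

```python
import random

# HYPOTHETICAL policy and per-action average returns for one state s.
policy = {"left": 0.3, "right": 0.7}
mean_return = {"left": 2.0, "right": 5.0}


def state_value(policy, mean_return):
    """Exact conditional expectation: policy-weighted sum of per-action averages."""
    return sum(policy[a] * mean_return[a] for a in policy)


def monte_carlo_value(policy, mean_return, n_episodes, seed=0):
    """Estimate the same quantity by averaging many sampled returns."""
    rng = random.Random(seed)
    actions = list(policy)
    weights = [policy[a] for a in actions]
    total = 0.0
    for _ in range(n_episodes):
        action = rng.choices(actions, weights=weights, k=1)[0]
        # ASSUMPTION: noisy return centered on the action's average return.
        total += rng.gauss(mean_return[action], 1.0)
    return total / n_episodes


exact = state_value(policy, mean_return)
estimate = monte_carlo_value(policy, mean_return, n_episodes=100_000)
print(exact, estimate)
```

Any single sampled return can land far from `exact`, but the average over many episodes should settle near it.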
Summary
This article built five basic concepts from probability theory:
| Concept | Definition | RL role |
|---|---|---|
| Sample space | The set of all possible outcomes | Defines possible states and actions of the environment |
| Random variable | Maps random outcomes to numbers | Reward $R_t$, return $G_t$, state $S_t$ |
| Probability | Long-run frequency of an outcome | Probability $\pi(a \mid s)$ that a policy chooses an action |
| Conditional probability | Probability given partial information | State transition $P(s' \mid s, a)$ |
| Expectation | Probability-weighted average | Value function $V^{\pi}(s)$ |
Probability describes randomness, conditional probability describes randomness under a known condition, and expectation compresses randomness into a representative number. Together, these three tools form the mathematical foundation of value functions.
Next: E.2.2 Random Variables, Returns, and State Values -- applying expectation to returns and value functions.