
E.2.1 Probability Basics: Probability, Conditional Probability, and Expectation

Prerequisites: This article does not require prior probability theory, but it is useful to read the two-state running example in the appendix introduction first.


Start With "What Could Happen?"

Probability theory does not begin with policies. It begins with a more basic question: what can a random event look like?

The sample space $\Omega$ is the set of all possible outcomes. For example, when rolling one die:

$$\Omega=\{1,2,3,4,5,6\}.$$

An event is a subset of the sample space. For example, the event "the die shows an even number" is:

$$A=\{2,4,6\}.$$

A random variable is a function that maps random outcomes to numbers. If $X$ denotes the die result, then $X(\omega)=\omega$. In reinforcement learning, the reward $R_{t+1}$, the return $G_t$, and the next state $S_{t+1}$ can all be viewed as random variables.

Only after we have random variables can we talk about expectation, variance, and value functions. A value function is essentially the conditional expectation of a random return.
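The three definitions above can be sketched in a few lines of Python. This is only an illustration: the die is assumed to be fair (the text lists the outcomes but not their probabilities), and the names `omega`, `even`, `prob`, and `X` are chosen here for the example.

```python
from fractions import Fraction

# Sample space for one die roll.
omega = {1, 2, 3, 4, 5, 6}

# An event is a subset of the sample space: "the die shows an even number".
even = {w for w in omega if w % 2 == 0}


def prob(event, space):
    """Probability of an event, ASSUMING every outcome is equally likely."""
    return Fraction(len(event & space), len(space))


def X(w):
    """Random variable: maps an outcome to a number (here, the face value)."""
    return w


p_even = prob(even, omega)
values = sorted(X(w) for w in omega)

print(even)    # the event A
print(p_even)  # probability of A under the fair-die assumption
print(values)  # the numbers the random variable can take
```

Using `Fraction` keeps the probability exact, which makes it easy to check by hand.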


Probability as Long-Run Frequency

Suppose a policy has two possible actions in state $s$:

| Action | Probability |
| --- | --- |
| left | $0.3$ |
| right | $0.7$ |

This means that if the agent visits state $s$ many times, it chooses left about $30\%$ of the time and right about $70\%$ of the time.

In RL notation:

$$\pi(\text{left} \mid s)=0.3, \qquad \pi(\text{right} \mid s)=0.7.$$

The vertical bar $\mid$ reads as "conditioned on." The expression $\pi(a \mid s)$ means: given that the current state is $s$, what is the probability of choosing action $a$?
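The long-run-frequency reading can be checked with a small simulation. The sketch below assumes the policy for state $s$ is stored as a plain dictionary; `policy`, `sample_action`, and the sample size are illustrative choices, not part of any library.

```python
import random

# pi(a | s) for the single state s used in the text.
policy = {"left": 0.3, "right": 0.7}


def sample_action(policy, rng):
    """Draw one action according to the policy's probabilities."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000             # number of simulated visits to state s

counts = {a: 0 for a in policy}
for _ in range(n):
    counts[sample_action(policy, rng)] += 1

# Empirical frequencies; they should be close to 0.3 and 0.7.
freqs = {a: counts[a] / n for a in policy}
print(freqs)
```

With more visits, the empirical frequencies drift closer to the policy's probabilities, which is exactly what "about $30\%$ of the time" means.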


Conditional Probability and State Transitions

The policy can be random when it chooses actions, and the environment can also be random when it returns the next state. The same vertical bar $\mid$ that appears in $\pi(a\mid s)$ expresses the conditioning in both cases. To see the environment's randomness concretely, suppose the agent executes action right in state $s$:

| Next state | Probability |
| --- | --- |
| $s_1$ | $0.2$ |
| $s_2$ | $0.8$ |

This can be written as:

$$p(s_1 \mid s, \text{right})=0.2, \qquad p(s_2 \mid s, \text{right})=0.8.$$

It means: given that the current state is $s$ and the action is right, these are the probabilities of the possible next states.

The state transition probability in a Markov decision process is exactly this kind of conditional probability:

$$p(s' \mid s, a).$$

Do not rush to memorize the symbol. Read it as a sentence: after taking action $a$ in state $s$, the probability of moving to next state $s'$.
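One way to make $p(s' \mid s, a)$ concrete is a lookup table keyed by the condition $(s, a)$. The following is a minimal sketch; the dictionary layout and the helper name `transition_prob` are assumptions made for this example, and only the single row given in the text is filled in.

```python
# Transition table: condition (state, action) -> distribution over next states.
# Only the (s, right) row from the text is included.
transitions = {
    ("s", "right"): {"s1": 0.2, "s2": 0.8},
}


def transition_prob(next_state, state, action):
    """Return p(next_state | state, action); unlisted next states get 0."""
    return transitions[(state, action)].get(next_state, 0.0)


p_s1 = transition_prob("s1", "s", "right")
p_s2 = transition_prob("s2", "s", "right")

# For a fixed condition (s, a), the probabilities over next states sum to 1.
row_total = sum(transitions[("s", "right")].values())

print(p_s1, p_s2, row_total)
```

Note that the probabilities sum to one for each fixed pair $(s, a)$, not across different conditions.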


Expectation as a Weighted Average

Consider a simple example. An action has two possible outcomes:

| Outcome | Probability | Reward |
| --- | --- | --- |
| success | $0.8$ | $10$ |
| failure | $0.2$ | $-5$ |

The average reward is not the simple average $(10-5)/2=2.5$. It is weighted by probability:

$$\mathbb{E}[R] = 0.8 \times 10 + 0.2 \times (-5) = 7.$$

$\mathbb{E}$ denotes expectation, and $R$ inside the brackets is the random variable. The line reads "the expectation of $R$ is $7$." In plain language: if we repeat this action many times, the average reward per trial approaches $7$.

In general, for a discrete random variable $X$ that takes value $x$ with probability $p(x)$:

$$\mathbb{E}[X]=\sum_x p(x)\,x.$$

The structure of this formula is:

  • $\mathbb{E}$ reads as "expectation." It is not a new number; it names the operation of adding all possible outcomes after weighting them by probability.
  • The brackets $[\cdot]$ contain the random variable being averaged.
  • $\sum_x p(x)\,x$ is the expanded form of expectation: multiply every possible outcome $x$ by its probability $p(x)$, then add them all.
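The weighted average and its "repeat many times" interpretation can be compared side by side. This is a sketch under simple assumptions: the distribution is stored as `(probability, reward)` pairs, and `expectation`, `outcomes`, and the sample size are names and values picked for illustration.

```python
import random

# (probability, reward) pairs from the success/failure table.
outcomes = [(0.8, 10.0), (0.2, -5.0)]


def expectation(dist):
    """Probability-weighted average: sum of p(x) * x over all outcomes."""
    return sum(p * x for p, x in dist)


# Exact expectation: 0.8 * 10 + 0.2 * (-5).
exact = expectation(outcomes)

# Monte Carlo estimate: repeat the action many times and average the rewards.
rng = random.Random(0)
n = 200_000
rewards = [x for _, x in outcomes]
weights = [p for p, _ in outcomes]
samples = rng.choices(rewards, weights=weights, k=n)
estimate = sum(samples) / n

print(exact)
print(estimate)  # close to the exact value, not identical
```

The exact value needs the probabilities; the sampled estimate only needs observed rewards, which is the situation an RL agent is usually in.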

The state value function in reinforcement learning is also an expectation:

$$v_\pi(s)=\mathbb{E}_\pi[G_t \mid S_t=s].$$

The subscript $\pi$ means "under policy $\pi$," and the expression after the vertical bar is the condition. The full line reads: starting from state $s$, following policy $\pi$, what is the average future discounted return $G_t$?

Joint Probability, Marginal Probability, and the Law of Total Probability

So far we have looked at conditional probability $p(s'\mid s,a)$: it describes which next state appears after a particular action is known. But many RL calculations need to combine conditional probabilities. For example, what is the total probability of reaching state $s'$ from state $s$? This total probability must add over all possible actions. To perform this fuller probability reasoning, we need three tools: joint probability, marginal probability, and the law of total probability.

Joint probability describes the probability that two events happen together. For example:

$$P(A,B)=P(A)P(B\mid A).$$

If state $s$ appears with probability $0.4$, and the probability of choosing action right in this state is $0.7$, then the probability of "being in $s$ and choosing right" is:

$$P(s,\text{right})=0.4\times0.7=0.28.$$

Marginal probability removes variables we do not care about by summing them out. For example, suppose next state $s'$ can be reached through either of two actions:

| Action | $\pi(a\mid s)$ | $p(s'\mid s,a)$ |
| --- | --- | --- |
| left | $0.3$ | $0.2$ |
| right | $0.7$ | $0.8$ |

Then, under policy $\pi$, the total probability of moving from $s$ to $s'$ is:

$$p_\pi(s'\mid s)=0.3\times0.2+0.7\times0.8=0.62.$$

This is the law of total probability:

$$p_\pi(s'\mid s)=\sum_a \pi(a\mid s)\,p(s'\mid s,a).$$

The "sum over actions first, then sum over next states" pattern that often appears in Bellman equations is fundamentally a combination of the law of total probability and expectation.
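The joint, marginal, and total-probability calculations in this section reduce to a multiplication followed by a sum. The sketch below reuses the numbers from the text; the variable names (`pi`, `p_next`, `joint_given_s`, `p_pi`) are illustrative only.

```python
# Policy in state s and transition probability to one particular s'.
pi = {"left": 0.3, "right": 0.7}      # pi(a | s)
p_next = {"left": 0.2, "right": 0.8}  # p(s' | s, a)

# Joint probability of "being in s and choosing right": P(s) * pi(right | s).
p_state = 0.4
p_s_and_right = p_state * pi["right"]

# Joint probability of (action a, next state s') given s, for each action.
joint_given_s = {a: pi[a] * p_next[a] for a in pi}

# Law of total probability: sum the action out to get p_pi(s' | s).
p_pi = sum(joint_given_s.values())

print(p_s_and_right)
print(joint_given_s)
print(p_pi)
```

Summing `joint_given_s` over actions is the "marginalizing out" step: the action no longer appears in the result.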


Conditional Expectation: The Mathematical Core of Value Functions

Ordinary expectation asks "what is the average?" over all possible cases. In reinforcement learning, however, we usually do not care about the average over every possible situation. We care about the average return given that the current state is $s$. This is like asking not "what is the average score in the whole school?" but "given that this student is in the advanced class, what score should we expect?" Conditional expectation answers the question "what is the average when a condition is already known?" It is the mathematical core of value functions.

Suppose state $s$ has two actions:

| Action | Probability | Average return after the action |
| --- | --- | --- |
| left | $0.4$ | $3$ |
| right | $0.6$ | $8$ |

If we know the current state is $s$, but the action is still sampled from the policy, then the average return starting from $s$ is:

$$\mathbb{E}[G\mid S=s]=0.4\times3+0.6\times8=6.$$

The state value function

$$v_\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]$$

is a conditional expectation. It is not the result of one trajectory. It is the average over all possible future trajectories under the condition "starting from state $s$."
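To see the difference between one trajectory and the conditional average, the two-action table can be simulated. This is a rough sketch with a loud assumption: the text only gives the average return after each action, so the Gaussian noise with standard deviation 1 around those averages is invented here purely for illustration, as are the names `policy`, `mean_return`, and `sample_return`.

```python
import random

policy = {"left": 0.4, "right": 0.6}       # action probabilities in state s
mean_return = {"left": 3.0, "right": 8.0}  # average return after each action

# Exact conditional expectation E[G | S = s]: weight each action's
# average return by the probability of choosing that action.
exact_value = sum(policy[a] * mean_return[a] for a in policy)


def sample_return(rng):
    """One simulated return starting from s.

    ASSUMPTION: returns are Gaussian around the per-action average with
    standard deviation 1; the source text does not specify a distribution.
    """
    action = rng.choices(list(policy), weights=list(policy.values()), k=1)[0]
    return rng.gauss(mean_return[action], 1.0)


rng = random.Random(0)
n = 100_000
returns = [sample_return(rng) for _ in range(n)]
estimate = sum(returns) / n

print(returns[:3])   # individual returns vary from run to run
print(exact_value)   # the state value is a single number
print(estimate)      # the sample average approaches it
```

Individual entries of `returns` differ widely, while their average settles near the exact value, which mirrors the distinction between a single trajectory and the state value.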

Once this is clear, the relationship between randomness and value functions becomes clear as well: individual trajectories may differ and their returns may differ, but the state value itself is not random. It is the conditional average of those returns.


Summary

This article built five basic concepts from probability theory:

| Concept | Definition | RL role |
| --- | --- | --- |
| Sample space | The set of all possible outcomes | Defines possible states and actions of the environment |
| Random variable | Maps random outcomes to numbers | Reward $R$, return $G$, state $S$ |
| Probability | Long-run frequency of an outcome | Probability that a policy chooses an action |
| Conditional probability | Probability given partial information | State transition $p(s'\mid s,a)$ |
| Expectation | Probability-weighted average | Value function $v_\pi(s)=\mathbb{E}_\pi[G_t\mid S_t=s]$ |

Probability describes randomness, conditional probability describes randomness under a known condition, and expectation compresses randomness into a representative number. Together, these three tools form the mathematical foundation of value functions.

Next: E.2.2 Random Variables, Returns, and State Values -- applying expectation to returns and value functions.
