Skip to content

3.3 V(s): Value Functions and Bellman Equations

Section Guide

Core content

  • Vπ(s)V^\pi(s): evaluates how much future return policy π\pi can obtain on average from state ss.
  • Bellman equation: rewrites “looking through the whole future” as “immediate reward + next-state value”.
  • Value table: in finite-state problems, stores each state’s V(s)V(s) as a set of numbers that can be updated.
  • From VV to QQ: first evaluate a policy, then compare actions, and finally use that comparison to find a better policy.

In the previous section, we built the formal framework of RL: an MDP describes the environment rules, GtG_t measures the total return of a trajectory, and a policy π\pi specifies how actions are chosen at each step. With these symbols, we can already describe how an agent interacts with an environment. But description is not solution. Reinforcement learning ultimately cares about the policy: in the same state, changing the action-selection rule can lead to a completely different future. Therefore, before discussing “how to find a good policy,” we first need to solve a basic problem that serves policy improvement: how do we evaluate the current policy’s performance in each state?

The state-value function Vπ(s)V^\pi(s) is prepared for exactly this problem. It asks: if the agent is currently standing in state ss, and from this moment onward always acts according to policy π\pi, how much total return can it obtain on average in the future? Notice that Vπ(s)V^\pi(s) is not a score that naturally belongs to the state itself. It is the performance of policy π\pi in state ss. In the same CartPole situation, a bad policy may quickly let the pole fall, while a good policy can keep it balanced. The state is the same; the policy is different; therefore the value is different.

Directly computing Vπ(s)V^\pi(s) seems to require enumerating every possible future trajectory starting from ss, which is usually impossible. The role of the Bellman equation is to break this long-term evaluation into a one-step recursion: first look at the immediate reward, then delegate the remaining future to the value of the next state. In this way, policy evaluation no longer needs to see the entire future at once. It can gradually propagate information backward through relationships between neighboring states.

Given a policy π\pi, how do we evaluate its performance in every state? Further, if we want a better policy, how do we move from “evaluating states” to “choosing actions”?

Core concept

The Bellman equation connects “policy evaluation” and “policy improvement”: first use VπV^\pi to evaluate a policy, then use QQ or an optimality equation to compare actions, so the policy has a direction for improvement.

Core formulas

Vπ(s)=Eπ[k=0γkrt+kst=s](definition of state-value function: evaluates the long-term return of policy π in state s)V^\pi(s)=\mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k}\mid s_t=s\right] \quad \text{(definition of state-value function: evaluates the long-term return of policy $\pi$ in state $s$)}

Vπ(s)=aAπ(as)[R(s,a)+γsSP(ss,a)Vπ(s)](Bellman expectation equation: recursively computes value under a fixed policy)V^\pi(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\left[R(s,a)+\gamma\sum_{s'\in\mathcal{S}}P(s'\mid s,a)V^\pi(s')\right] \quad \text{(Bellman expectation equation: recursively computes value under a fixed policy)}

V(s)=maxa[R(s,a)+γsSP(ss,a)V(s)](Bellman optimality equation: defines the optimal state value)V^*(s)=\max_a\left[R(s,a)+\gamma\sum_{s'\in\mathcal{S}}P(s'\mid s,a)V^*(s')\right] \quad \text{(Bellman optimality equation: defines the optimal state value)}

State Value and Bellman Equation:

  • Vπ(s)V^\pi(s): the value of policy π\pi in state ss, that is, the average total score obtained by starting from ss and following policy π\pi.
  • rt+kr_{t+k}: the one-step reward obtained at future step kk.
  • γ\gamma: the discount factor (Gamma), which controls how much future rewards matter (010 \sim 1).
  • A\mathcal{A}: the action space, the set of all actions available in the task. In CartPole, for example, these are usually “push left” and “push right”.
  • π(as)\pi(a\mid s): the policy, the probability of choosing action aa in state ss.
  • P(ss,a)P(s'\mid s,a): the state transition probability, the probability of moving to next state ss' after taking action aa in state ss.

This section repeatedly uses the following symbols. Let us first make clear what each one means:

SymbolHow to understand it first
sts_tThe state at step tt, such as a board position or robot pose
ata_tThe action at step tt, such as moving left or outputting a control value
A\mathcal{A}The action space, the set of all available actions
rtr_tThe immediate reward obtained at step tt
π(as)\pi(a\mid s)The probability that the policy chooses action aa in state ss
γ\gammaThe discount factor; farther rewards receive smaller weights
GtG_tThe total return accumulated from step tt onward
Vπ(s)V^\pi(s)The average total return from starting in state ss and following policy π\pi

There is an important distinction here:

  • rtr_t is the immediate reward of the current step. It only looks at one step, and we know it after taking that step.
  • GtG_t is the discounted total return actually obtained along one concrete trajectory starting from step tt. It looks at a whole trajectory, so different trajectories can have different GtG_t.
  • Vπ(s)V^\pi(s) is not the score of one trajectory. It is the average of GtG_t over many possible trajectories that start from state ss and then follow policy π\pi.
What is the difference between G and V? A numerical example

Start with the full definition. The original formula for discounted return GtG_t is:

Gt=k=0γkrt+kG_t = \sum_{k=0}^{\infty}\gamma^k r_{t+k}

Expanding the summation gives:

Gt=rt+γrt+1+γ2rt+2+γ3rt+3+G_t = r_t+\gamma r_{t+1}+\gamma^2 r_{t+2} +\gamma^3 r_{t+3} +\cdots

For demonstration, suppose the discount factor is γ=0.9\gamma=0.9. Starting from the same state ss, we play twice under the same policy, and in each run only keep the first three rewards. In other words, first look at this three-step truncated version:

Gt=rt+γrt+1+γ2rt+2G_t = r_t+\gamma r_{t+1}+\gamma^2r_{t+2}

The first run goes smoothly, with three rewards 2,4,62,4,6. Substitute them into the formula:

Gt(1)=2rt+0.9×4rt+1+0.92×6rt+2G_t^{(1)} = \underbrace{2}_{r_t} + 0.9\times \underbrace{4}_{r_{t+1}} + 0.9^2\times \underbrace{6}_{r_{t+2}}

The result is:

Gt(1)=10.46G_t^{(1)} = 10.46

The second run encounters a bad outcome later, with three rewards 2,1,32,1,-3. Substituting again:

Gt(2)=2rt+0.9×1rt+1+0.92×(3)rt+2G_t^{(2)} = \underbrace{2}_{r_t} + 0.9\times \underbrace{1}_{r_{t+1}} + 0.9^2\times \underbrace{(-3)}_{r_{t+2}}

The result is:

Gt(2)=0.47G_t^{(2)} = 0.47

Notice that both runs have the same first-step immediate reward, rt=2r_t=2, but what happens afterward is different, so the trajectory-level GtG_t differs greatly. The state value Vπ(s)V^\pi(s) is the expectation of these possible GtG_t values. If we currently only have these two samples, we can use their average as a rough estimate:

Vπ(s)10.46+0.472=5.465V^\pi(s)\approx \frac{10.46+0.47}{2}=5.465

So rtr_t is a one-step score, GtG_t is the total score actually obtained along one trajectory from a certain moment onward, and Vπ(s)V^\pi(s) answers: if we start many times from the same state ss, what is the long-run average of those total scores?

State-Value Function: Evaluating a State

In the previous section, we had the MDP tuple to describe the rules of the game, GtG_t to measure the total score obtained from a certain moment onward along one actual trajectory, and π\pi to define how actions are chosen. But when making decisions, you are not facing “an entire episode that has already happened.” You are facing a concrete intermediate situation. What you need is not an after-the-fact summary of “how many points this run finally got,” but a before-the-fact judgment: “if I am now standing in this situation, how good is the future likely to be?”

The word “average” is important here. It is not an average over time steps, nor is it the rewards in one episode simply divided by the number of steps. It is an average over many possible future trajectories that may happen when we start from the same state ss and continue following the same policy π\pi. Because the policy may be random and the environment transition may also be random, even if we always start from the same state ss, the future may unfold into different trajectories and produce different GtG_t values.

Using the numbers above: starting from the same state ss, one future trajectory may have return Gt=10.46G_t=10.46, while another may have return Gt=0.47G_t=0.47. Both are “the total score after one actual occurrence.” If, under policy π\pi, the first kind of future occurs with probability 60%60\% and the second with probability 40%40\%, then the state value is:

Vπ(s)=0.6×10.46+0.4×0.47=6.464V^\pi(s)=0.6\times 10.46+0.4\times 0.47=6.464

Therefore, GtG_t is like “the score report produced by this particular run,” while Vπ(s)V^\pi(s) is like “before the run starts, standing in the same state ss, an average prediction over many possible score reports.”

This naturally leads to the state-value function Vπ(s)V^\pi(s): it turns GtG_t from “the total score of one actual occurrence” into “the average prediction of future total score when standing in state ss.” Formally, it is the expectation of GtG_t under the condition st=ss_t=s:

The definition of the state-value function is:

Vπ(s)=Eπ[Gtst=s]V^\pi(s) = \mathbb{E}_\pi[G_t\mid s_t=s]

If we expand GtG_t, we get:

Vπ(s)=Eπ[k=0γkrt+kst=s]V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty}\gamma^k r_{t+k} \mid s_t=s \right]

Read this formula in three layers:

  1. k=0γkrt+k\sum_{k=0}^{\infty}\gamma^k r_{t+k}: starting from now, add up future rewards.
  2. γk\gamma^k: farther rewards are discounted more. γ=0\gamma=0 only cares about the present; γ\gamma close to 1 cares more about the long term.
  3. Eπ[]\mathbb{E}_\pi[\cdot]: the environment and policy may be random, so we look at the average result over many attempts.

Expanding the first few terms makes it clearer:

Vπ(s)=Eπ[rt+γrt+1+γ2rt+2+γ3rt+3+st=s]V^\pi(s) = \mathbb{E}_\pi\big[r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \gamma^3\, r_{t+3} + \cdots \mid s_t=s\big]

The meaning of each term is as follows: rtr_t is the reward obtained immediately (not discounted), γrt+1\gamma\, r_{t+1} is the next-step reward discounted once, γ2rt+2\gamma^2\, r_{t+2} is the reward two steps later discounted twice, and so on. The farther away a reward is, the higher the power of γ\gamma in its coefficient, and the smaller its contribution. To feel this with CartPole numbers: if each step has r=+1r=+1 and γ=0.9\gamma=0.9, then Vπ(s)1+0.9+0.81+0.729+=110.9=10V^\pi(s) \approx 1 + 0.9 + 0.81 + 0.729 + \cdots = \frac{1}{1-0.9} = 10 (assuming the policy can survive forever).

In one sentence: Vπ(s)V^\pi(s) is “starting from state ss, if we keep playing according to policy π\pi, how much total score can we obtain on average?”

Why does the superscript have to be π\pi? Because for the same state, different policies have different values. In the same chess position, a master continuing the game and a beginner continuing the game obviously have different final win rates. A good policy has high Vπ(s)V^\pi(s), and a poor policy has low Vπ(s)V^\pi(s). The goal of RL is to find the policy that makes value highest.

Bellman Equation

Now return to the computation itself. If we want to evaluate the performance of policy π\pi in state ss, the most direct method seems to be: compute every future trajectory starting from ss, then average their returns. But future trajectories may be very long and may branch heavily. Fortunately, return itself has a recursive structure: Gt=rt+γGt+1G_t = r_t + \gamma G_{t+1}. And the state value is defined as Vπ(s)=Eπ[Gtst=s]V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]. Put these two facts together, and a natural question appears: since the return GtG_t of one trajectory can be recursive, can the policy-evaluation quantity Vπ(s)V^\pi(s) also be recursive?

The answer is yes. This is the Bellman equation. It is not a concept invented from nowhere. Starting from the definition of Vπ(s)V^\pi(s) and using the recursive structure of GtG_t, it rewrites “policy evaluation” as “the current step + evaluation of the next state.” Before deriving it, let us first understand why it is necessary:

We have already defined the state value Vπ(s)V^\pi(s): it is the expected discounted sum of all future rewards. If we want to know the value of a state, the most intuitive thought is: just follow the policy forward and add the rewards along one long trajectory, right?

But in reality, this faces two huge difficulties:

  1. The future is too long: some tasks have no clear endpoint (for example, maintaining a robot’s balance). How far would you have to add?
  2. There are too many possibilities: because both the environment and the policy may be random, states branch exponentially like tree branches. To compute an accurate expectation, you would need to traverse countless trajectories.

Faced with infinite horizons and huge state trees, the American applied mathematician Richard Bellman proposed the famous Principle of Optimality in the 1950s while creating the theory of Dynamic Programming. He discovered that we do not need to see through the entire future at once, because “today’s value” must contain “tomorrow’s value.”

This is like calculating retirement savings. You do not need to separately calculate and add the interest for every day over the next 30 years right now. You only need to know: “today’s total asset = today’s income + the future total asset brought by tomorrow’s principal.” This idea of elegantly decomposing an infinite-horizon large problem into “the current step” and “all remaining steps” is the soul of dynamic programming.

The recursive equality written from this principle is called the Bellman Equation. It transforms policy evaluation from an “infinite summation problem” that is hard to compute directly into a recursive relationship that only depends on neighboring states. Precisely because this recursive relationship exists, later methods such as DP, MC, and TD can estimate value through iteration or sampling.

Next, let us see how this magical “recursive loop” is rigorously derived from the original definition of Vπ(s)V^\pi(s).

Derivation

We have already understood the recursive relationship of value intuitively. Now we give the rigorous mathematical derivation. The essence of the Bellman equation is that it reveals the recursive relationship between the current state value and future state values. The greatness of this derivation is that it strictly proves mathematically that we can completely replace an infinite enumeration of the future with a recursive computation that only looks at “one step ahead.”

According to the definition of the state-value function, expand it:

Vπ(s)=Eπ[Gtst=s]=Eπ[rt+γrt+1+γ2rt+2+st=s]=Eπ[rtst=s]+γEπ[rt+1+γrt+2+st=s]=rπ(s)+γEπ[Gt+1st=s]\begin{aligned} V^\pi(s) &= \mathbb{E}_\pi[G_t \mid s_t = s] \\ &= \mathbb{E}_\pi[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s] \\ &= \mathbb{E}_\pi[r_t \mid s_t = s] + \gamma \mathbb{E}_\pi[r_{t+1} + \gamma r_{t+2} + \dots \mid s_t = s] \\ &= r_\pi(s) + \gamma \mathbb{E}_\pi[G_{t+1} \mid s_t = s] \end{aligned}

Here rπ(s)r_\pi(s) is the expected immediate reward obtained in state ss after fixing policy π\pi. It has already averaged over action uncertainty:

rπ(s)=aπ(as)R(s,a)r_\pi(s)=\sum_a \pi(a\mid s)R(s,a)

The next step is the most important one: we do want to rewrite Eπ[Gt+1st=s]\mathbb{E}_\pi[G_{t+1} \mid s_t = s] in the form of “next-state value,” but we cannot directly write it as Vπ(st+1)V^\pi(s_{t+1}).

That direct equality is mathematically wrong.

Use an analogy to understand the subtle difference:

  • Eπ[Gt+1st=s]\mathbb{E}_\pi[G_{t+1} \mid s_t = s] is like: today (standing at sts_t) you are predicting how much money you can earn starting tomorrow (Gt+1G_{t+1}). Because tomorrow may land in different states, this number has already averaged out the uncertainty of “where tomorrow will be.”
  • Vπ(st+1)V^\pi(s_{t+1}) is like: tomorrow (already standing in a concrete st+1s_{t+1}) you compute how much money you can earn in the future. But from today’s perspective, st+1s_{t+1} has not happened yet, so Vπ(st+1)V^\pi(s_{t+1}) is still a random variable.

Therefore, the equality we truly cannot write directly is:

Eπ[Gt+1st=s]=Vπ(st+1)\mathbb{E}_\pi[G_{t+1} \mid s_t = s] = V^\pi(s_{t+1})

The left side is a number that has already been averaged under condition st=ss_t=s, while the right side still depends on the random next state. The correct target is to put the right side inside the same conditional expectation:

Eπ[Gt+1st=s]=Eπ[Vπ(st+1)st=s]\mathbb{E}_\pi[G_{t+1} \mid s_t = s] = \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t = s]

To smoothly move the perspective (that is, the condition in probability theory) from sts_t to st+1s_{t+1}, we do not need any advanced theorem. We only need three basic mathematical tools: the definition of conditional expectation, marginalization in probability theory, and the most important physical assumption in reinforcement learning, the Markov Property.

Supplementary proof: how do we rigorously push the condition from the current state to the next state?

To avoid introducing dazzling new notation, we use only three variables throughout: sts_t (current state), st+1s_{t+1} (next state), and Gt+1G_{t+1} (future total return). Our final goal is to prove: Eπ[Gt+1st]=Eπ[Vπ(st+1)st]\mathbb{E}_\pi[G_{t+1} \mid s_t] = \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t].

Before deriving it, let us review a basic probability idea: what is a conditioning variable? How exactly does expectation (E\mathbb{E}) expand?

Imagine rolling two dice, XX and YY. If no one tells you any information and asks you to guess the average value rolled by XX (that is, its expectation), how would you compute it? Very simply: multiply every possible die outcome (1 through 6) by its probability (1/6), then add everything. In formula form, E[X]=xxP(x)\mathbb{E}[X] = \sum_x x \cdot P(x). This is ordinary expectation. But if someone secretly tells you, “Hey, YY rolled a 6.” Then YY is known information, a condition (conditioning variable). When predicting XX under the condition Y=6Y=6, your prediction may change. This is called conditional expectation. In formula form, replace ordinary probability with conditional probability: E[XY]=xxP(xY)\mathbb{E}[X \mid Y] = \sum_x x \cdot P(x \mid Y).

This operation of “expanding the big E\mathbb{E} into a summation \sum multiplied by conditional probabilities PP” is the core tool in the derivation below. Remember: the thing after the vertical bar | is the “given fact” we are standing on at the moment.

Step 1: Expand the expectation of Vπ(st+1)V^\pi(s_{t+1})

By definition, Vπ(st+1)V^\pi(s_{t+1}) is the expectation of future return Gt+1G_{t+1} when st+1s_{t+1} is given and policy π\pi is followed afterward. Using the formula reviewed above: Vπ(st+1)=Eπ[Gt+1st+1]=Gt+1Gt+1Pπ(Gt+1st+1)V^\pi(s_{t+1}) = \mathbb{E}_\pi[G_{t+1} \mid s_{t+1}] = \sum_{G_{t+1}} G_{t+1} \cdot P_\pi(G_{t+1} \mid s_{t+1}).

Therefore, the expression on the right side that we want to compute, Eπ[Vπ(st+1)st]\mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t], is really the expectation of a long expression:

Eπ[Vπ(st+1)st]=Eπ[Gt+1Gt+1Pπ(Gt+1st+1)  |  st]\mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t] = \mathbb{E}_\pi \left[ \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1} \mid s_{t+1}) \;\middle|\; s_t \right]

Step 2: Expand the expectation again over the next state st+1s_{t+1}

The expression above still has an outer uppercase E[st]\mathbb{E}[\dots \mid s_t]. Although the thing inside the brackets is long, its essence is tomorrow’s value (st+1s_{t+1}).

Because “what tomorrow will be” is random, st+1s_{t+1} is an uncertain variable. How do we compute the expectation of this random variable? We use the same move: multiply every possible tomorrow by the probability that tomorrow occurs, then add them.

That is, the outer Eπ[st]\mathbb{E}_\pi[\dots \mid s_t] becomes st+1()Pπ(st+1st)\sum_{s_{t+1}} (\dots) \cdot P_\pi(s_{t+1} \mid s_t):

Eπ[Vπ(st+1)st]=st+1(Gt+1Gt+1Pπ(Gt+1st+1))this is Vπ(st+1)Pπ(st+1st)probability of tomorrow=st+1Gt+1Gt+1Pπ(Gt+1st+1)Pπ(st+1st)\begin{aligned} \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t] &= \sum_{s_{t+1}} \underbrace{\left( \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1} \mid s_{t+1}) \right)}_{\text{this is } V^\pi(s_{t+1})} \cdot \underbrace{P_\pi(s_{t+1} \mid s_t)}_{\text{probability of tomorrow}} \\ &= \sum_{s_{t+1}} \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1} \mid s_{t+1}) P_\pi(s_{t+1} \mid s_t) \end{aligned}

Note: the second line only opens the parentheses and multiplies in the outside term Pπ(st+1st)P_\pi(s_{t+1} \mid s_t).

Step 3: Inject the Markov property (the most important step!)

Notice the term Pπ(Gt+1st+1)P_\pi(G_{t+1} \mid s_{t+1}) in the expression above. According to the Markov property, once policy π\pi is fixed, the future return Gt+1G_{t+1} only depends on the state st+1s_{t+1} at that exact moment, not on the earlier state sts_t. This is like rolling dice: the result of the second roll depends only on how the second roll is made, not on the first roll. Therefore, mathematically, we can force sts_t into the condition without changing the probability value: Pπ(Gt+1st+1)=Pπ(Gt+1st+1,st)P_\pi(G_{t+1} \mid s_{t+1}) = P_\pi(G_{t+1} \mid s_{t+1}, s_t). Substitute this into the expression:

Eπ[Vπ(st+1)st]=st+1Gt+1Gt+1Pπ(Gt+1st+1,st)Pπ(st+1st)\mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t] = \sum_{s_{t+1}} \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1} \mid s_{t+1}, s_t) P_\pi(s_{t+1} \mid s_t)

Step 4: Probability multiplication rule and marginalization

If you have forgotten the basic probability rule, that is fine; we can derive it briefly. The most basic conditional probability formula is: P(AB)=P(A,B)P(B)P(A \mid B) = \frac{P(A, B)}{P(B)}, meaning that “the probability that A happens given B” equals “the probability that A and B happen together” divided by “the probability that B happens.” Multiplying the denominator across gives the multiplication rule: P(A,B)=P(AB)P(B)P(A, B) = P(A \mid B) \cdot P(B).

If we add an additional condition CC as a premise to every event, the formula still holds: P(A,BC)=P(AB,C)P(BC)P(A, B \mid C) = P(A \mid B, C) \cdot P(B \mid C).

Now take Gt+1G_{t+1} as AA, st+1s_{t+1} as BB, and sts_t as CC. Then the product of the two probabilities at the end of the expression can be merged: Pπ(Gt+1st+1,st)Pπ(st+1st)=Pπ(Gt+1,st+1st)P_\pi(G_{t+1} \mid s_{t+1}, s_t) \cdot P_\pi(s_{t+1} \mid s_t) = P_\pi(G_{t+1}, s_{t+1} \mid s_t). Continue simplifying:

Eπ[Vπ(st+1)st]=st+1Gt+1Gt+1Pπ(Gt+1,st+1st)=Gt+1Gt+1st+1Pπ(Gt+1,st+1st)(swap the order of summation)=Gt+1Gt+1Pπ(Gt+1st)(sum over all possible st+1, eliminating st+1)=Eπ[Gt+1st](this is exactly the definition of conditional expectation!)\begin{aligned} \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t] &= \sum_{s_{t+1}} \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1}, s_{t+1} \mid s_t) \\ &= \sum_{G_{t+1}} G_{t+1} \sum_{s_{t+1}} P_\pi(G_{t+1}, s_{t+1} \mid s_t) \quad \text{(swap the order of summation)} \\ &= \sum_{G_{t+1}} G_{t+1} P_\pi(G_{t+1} \mid s_t) \quad \text{(sum over all possible $s_{t+1}$, eliminating $s_{t+1}$)} \\ &= \mathbb{E}_\pi[G_{t+1} \mid s_t] \quad \text{(this is exactly the definition of conditional expectation!)} \end{aligned}

Proof complete. Step by step, we have rigorously shown: Eπ[Gt+1st]=Eπ[Vπ(st+1)st]\mathbb{E}_\pi[G_{t+1} \mid s_t] = \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t].

Substitute this conclusion back into the original expression, and we obtain the standard form of the Bellman equation:

Vπ(s)=rπ(s)+γEπ[Vπ(st+1)st=s]=rπ(s)+γst+1Pπ(st+1st=s)Vπ(st+1)\begin{aligned} V^\pi(s) &= r_\pi(s) + \gamma \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t = s] \\ &= r_\pi(s) + \gamma \sum_{s_{t+1}} P_\pi(s_{t+1} \mid s_t = s) V^\pi(s_{t+1}) \end{aligned}

The second line is expanded using the definition of discrete conditional expectation. For any discrete random variable XX and condition YY:

E[XY]=xxP(xY)\mathbb{E}[X \mid Y] = \sum_{x} x \cdot P(x \mid Y)

That is: multiply every possible outcome xx by the probability of its occurrence under condition YY, then add everything. Here XX is Vπ(st+1)V^\pi(s_{t+1}), the condition YY is st=ss_t = s, and st+1s_{t+1} is a discrete random variable. Substituting into the definition:

Eπ[Vπ(st+1)st=s]=st+1Vπ(st+1)Pπ(st+1st=s)\mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t = s] = \sum_{s_{t+1}} V^\pi(s_{t+1}) \cdot P_\pi(s_{t+1} \mid s_t = s)

The intuition for each term is: starting from current state ss, after acting according to policy π\pi, there is probability Pπ(st+1st=s)P_\pi(s_{t+1} \mid s_t = s) of jumping to next state st+1s_{t+1}, and the value of st+1s_{t+1} is Vπ(st+1)V^\pi(s_{t+1}). Weighting the values of all possible next states by their probabilities gives “how much the next step is worth on average.”

This is the core charm of the Bellman equation: the value of the current state equals the immediate reward plus the expectation of the next state’s value.

Matrix Form

The Bellman equation above only wrote the value of one state ss. But an environment has more than one state. CartPole has infinitely many states, and even a simple gridworld has dozens. Every state has its own Bellman equation, and these equations are coupled: Vπ(s)V^\pi(s) depends on Vπ(st+1)V^\pi(s_{t+1}), while the equation for Vπ(st+1)V^\pi(s_{t+1}) may refer back to Vπ(s)V^\pi(s). This means you cannot solve a single equation in isolation. You must solve all state equations jointly.

To write the equations as a matrix, first fix a policy π\pi. After fixing the policy, the uncertainty of action selection can be merged into two quantities:

rπ(s)=aπ(as)R(s,a)r_\pi(s)=\sum_a \pi(a\mid s)R(s,a)

Pπ(ss)=aπ(as)P(ss,a)P_\pi(s'\mid s)=\sum_a \pi(a\mid s)P(s'\mid s,a)

In other words, PπP_\pi is not the original environment transition P(ss,a)P(s'\mid s,a), but the state-to-state transition probability obtained after “first choosing an action according to the policy, then letting the environment transition.”

When the number of states is finite (NN states), these NN simultaneous equations can be written in matrix form:

(Vπ(s1)Vπ(s2)Vπ(sN))vπ=(rπ(s1)rπ(s2)rπ(sN))rπ+γ(Pπ(s1s1)Pπ(s2s1)Pπ(sNs1)Pπ(s1s2)Pπ(s2s2)Pπ(sNs2)Pπ(s1sN)Pπ(s2sN)Pπ(sNsN))Pπ(Vπ(s1)Vπ(s2)Vπ(sN))vπ\underbrace{ \begin{pmatrix} V^\pi(s_1) \\ V^\pi(s_2) \\ \vdots \\ V^\pi(s_N) \end{pmatrix} }_{\boldsymbol{v}_\pi} = \underbrace{ \begin{pmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ \vdots \\ r_\pi(s_N) \end{pmatrix} }_{\boldsymbol{r}_\pi} + \gamma \underbrace{ \begin{pmatrix} P_\pi(s_1 \mid s_1) & P_\pi(s_2 \mid s_1) & \dots & P_\pi(s_N \mid s_1) \\ P_\pi(s_1 \mid s_2) & P_\pi(s_2 \mid s_2) & \dots & P_\pi(s_N \mid s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P_\pi(s_1 \mid s_N) & P_\pi(s_2 \mid s_N) & \dots & P_\pi(s_N \mid s_N) \end{pmatrix} }_{\boldsymbol{P}_\pi} \underbrace{ \begin{pmatrix} V^\pi(s_1) \\ V^\pi(s_2) \\ \vdots \\ V^\pi(s_N) \end{pmatrix} }_{\boldsymbol{v}_\pi}

Here:

  • vπ\boldsymbol{v}_\pi is an N×1N \times 1 column vector containing the values of all states under fixed policy π\pi.
  • rπ\boldsymbol{r}_\pi is an N×1N \times 1 column vector representing the expected immediate reward for each state after fixing policy π\pi.
  • Pπ\boldsymbol{P}_\pi is an N×NN \times N policy-induced state transition matrix, describing the probability of jumping from any state to another state after acting according to the policy.

If you are familiar with linear algebra, you can immediately see that this large system of equations can be compressed into one concise expression:

vπ=rπ+γPπvπ\boldsymbol{v}_\pi = \boldsymbol{r}_\pi + \gamma \boldsymbol{P}_\pi \boldsymbol{v}_\pi

Why are both sides vπ\boldsymbol{v}_\pi, yet the right side still represents “next-state” values?

The easiest point to get confused about is this: in the single-state formula above, the left side looks at the current state, while the right side looks at next states:

Vπ(s)=rπ(s)+γsPπ(ss)Vπ(s)V^\pi(s) = r_\pi(s) + \gamma\sum_{s'}P_\pi(s'\mid s)V^\pi(s')

The left side is Vπ(s)V^\pi(s), but the right side contains Vπ(s)V^\pi(s'). Why, after writing the matrix form, do both sides appear to become the same vπ\boldsymbol{v}_\pi?

The key is that the matrix form does not write only one state. It writes all state equations at the same time. To see this clearly, fix the current state as sis_i. The single-state formula becomes:

Vπ(si)=rπ(si)+γjPπ(sjsi)Vπ(sj)V^\pi(s_i) = r_\pi(s_i) + \gamma\sum_j P_\pi(s_j\mid s_i)V^\pi(s_j)

The correspondence is:

Symbol in the single-state formulaWhat it corresponds to in the matrix
Current state ssRow ii, the value currently being computed for sis_i
Next state ss'Column jj, the possible next state sjs_j
Pπ(ss)P_\pi(s'\mid s)Matrix entry Pπ(sjsi)P_\pi(s_j\mid s_i)
Vπ(s)V^\pi(s')The jj-th component of the value vector, Vπ(sj)V^\pi(s_j)

Therefore, the right side is not just vπ\boldsymbol{v}_\pi alone, but:

Pπvπ\boldsymbol{P}_\pi\boldsymbol{v}_\pi

The column vector vπ\boldsymbol{v}_\pi merely stores all possible “next-state values”:

vπ=(Vπ(s1)Vπ(s2)Vπ(sN))\boldsymbol{v}_\pi= \begin{pmatrix} V^\pi(s_1)\\ V^\pi(s_2)\\ \vdots\\ V^\pi(s_N) \end{pmatrix}

What truly determines “which current state we are in, and with what probability each next state appears” is the preceding Pπ\boldsymbol{P}_\pi. The ii-th row of the matrix multiplication expands to:

(Pπvπ)i=jPπ(sjsi)Vπ(sj)(\boldsymbol{P}_\pi\boldsymbol{v}_\pi)_i = \sum_j P_\pi(s_j\mid s_i)V^\pi(s_j)

This is exactly the next-state expectation on the right side of the single-state formula:

sPπ(ssi)Vπ(s)\sum_{s'}P_\pi(s'\mid s_i)V^\pi(s')

For example, the first row corresponds to starting from s1s_1:

(Pπvπ)1=Pπ(s1s1)Vπ(s1)+Pπ(s2s1)Vπ(s2)++Pπ(sNs1)Vπ(sN)(\boldsymbol{P}_\pi\boldsymbol{v}_\pi)_1 = P_\pi(s_1\mid s_1)V^\pi(s_1) + P_\pi(s_2\mid s_1)V^\pi(s_2) + \cdots + P_\pi(s_N\mid s_1)V^\pi(s_N)

That is, “starting from s1s_1, take the probability-weighted average of the next-state values.” If the next step may go to s2s_2, it uses Vπ(s2)V^\pi(s_2); if it may stay in s1s_1, it also uses Vπ(s1)V^\pi(s_1). All possible next-state values are stored in the same column vπ\boldsymbol{v}_\pi, so the right side still uses that same column vector.

In other words:

  • The vπ\boldsymbol{v}_\pi on the left: each current state’s own value.
  • The vπ\boldsymbol{v}_\pi on the right: a table providing the values of all possible next states.
  • The preceding Pπ\boldsymbol{P}_\pi: weights that value table according to “where the process may jump from the current state.”
  • The product Pπvπ\boldsymbol{P}_\pi\boldsymbol{v}_\pi: the expected next-state value from each current state.

The Bellman equation is not saying “the current value equals itself.” It is saying:

current value=immediate reward+γ×expected next-state value\text{current value}=\text{immediate reward}+\gamma\times\text{expected next-state value}

Since this is a linear equation in the unknown vπ\boldsymbol{v}_\pi, we can solve directly by moving terms and factoring:

vπγPπvπ=rπ(IγPπ)vπ=rπvπ=(IγPπ)1rπ\begin{aligned} \boldsymbol{v}_\pi - \gamma \boldsymbol{P}_\pi \boldsymbol{v}_\pi &= \boldsymbol{r}_\pi \\ (\boldsymbol{I} - \gamma \boldsymbol{P}_\pi) \boldsymbol{v}_\pi &= \boldsymbol{r}_\pi \\ \boldsymbol{v}_\pi &= (\boldsymbol{I} - \gamma \boldsymbol{P}_\pi)^{-1} \boldsymbol{r}_\pi \end{aligned}

(Here I\boldsymbol{I} is the identity matrix.)

At this point, we have obtained the analytic solution of the Bellman equation under fixed policy π\pi. This means that as long as the environment rules (rewards and transition probabilities) and policy π\pi are fully transparent, we can in theory directly compute the exact value of every state under that policy.

If we have a formula, why learn other algorithms?

Because reality is harsh. Computing the analytic solution requires inverting the matrix (IγPπ)(\boldsymbol{I} - \gamma \boldsymbol{P}_\pi), and matrix inversion has computational complexity as high as O(N3)O(N^3). If the state space is even slightly large (for example, Go has around 1017010^{170} board positions), we cannot finish the computation in a lifetime. Therefore, this “God’s-eye-view” direct solution usually exists only in theory and in extremely simple toy environments. For complex problems, we must turn to Dynamic Programming (DP), Monte Carlo (MC), or Temporal Difference (TD) methods, which approximate the true value through repeated iteration.

Action-Value Function Q(s,a)

So far, we have been asking: “Standing in this state, if I continue according to policy π\pi, how many points can I obtain on average?” The answer is Vπ(s)V^\pi(s). It is useful, but it is not enough for making choices. The reason is that Vπ(s)V^\pi(s) only scores the state. It does not separately answer whether a particular action itself is good.

In a more everyday phrasing: you are standing at an intersection, and Vπ(s)V^\pi(s) says something like “this intersection is generally not bad.” But what you actually need to decide is: go left or go right? Knowing only that “this intersection is good” is not enough. You also want to know “if I first go left, about how many points can I get; if I first go right, about how many points can I get.” The same is true in CartPole: when the pole tilts right, the state value only tells you how this situation looks on average under the current policy. It does not directly tell you whether to push left or push right now.

So we ask the question more finely: in state ss, if I first take action aa, and then continue according to policy π\pi, how many points can I obtain on average in the future? This score is the Q function, also called the action-value function.

Compared with Vπ(s)V^\pi(s), Qπ(s,a)Q^\pi(s,a) looks at one additional thing: action aa. You can remember the difference like this:

Difference between V and Q

Comparison pointVπ(s)V^\pi(s)Qπ(s,a)Q^\pi(s,a)
Question askedStanding in state ss, following policy π\pi, how good is it on average?Standing in state ss, first take action aa, then follow policy π\pi, how good is it on average?
How actions are handledNo action is specified separately; actions are chosen by policy π\piThe first action aa is specified
Effect of actionsMixed into the average result of policy π\piPulled out and scored separately
Suitable for answeringIs this situation good overall?Which action is better in this situation?

Thus, VV is better suited for evaluating “whether the situation is good overall,” while QQ is better suited for comparing “which action is better in this situation.” Policy improvement needs the latter.

The definition of the Q function is:

Qπ(st,at)=Eπ[Gtst,at]Q^\pi(s_t, a_t) = \mathbb{E}_\pi[G_t \mid s_t, a_t]

This formula directly defines Qπ(st,at)Q^\pi(s_t,a_t): currently in state sts_t, first specify action ata_t, then continue acting according to policy π\pi; what is the expected future return GtG_t?

The relationship between V and Q can first be understood in one sentence: VV is “how much this state is worth before the action has been decided”; QQ is “how much this choice is worth after a particular action has been specified.”

Consider a concrete state sts_t. Suppose there are only two actions: right or left. The current policy π\pi does not always choose one action; sometimes it goes right and sometimes it goes left:

Action ata_tPolicy selection probability π(atst)\pi(a_t\mid s_t)Action value Qπ(st,at)Q^\pi(s_t,a_t)Contribution to Vπ(st)V^\pi(s_t)
Right0.70.710100.7×10=70.7\times 10=7
Left0.30.3440.3×4=1.20.3\times 4=1.2

Here Qπ(st,right)=10Q^\pi(s_t,\text{right})=10 means: if we first go right, then continue according to policy π\pi, we can obtain 10 points on average. Qπ(st,left)=4Q^\pi(s_t,\text{left})=4 means that if we first go left, we can only obtain 4 points on average.

But Vπ(st)V^\pi(s_t) does not ask “how much is it worth after specifying a particular action.” It asks “how much is this state worth on average when policy π\pi chooses the action by itself.” Therefore, we must add the two action values weighted by the policy probabilities:

0.7×10+0.3×4=8.20.7 \times 10 + 0.3 \times 4 = 8.2

So in this example:

Vπ(st)=8.2V^\pi(s_t)=8.2

That is, Vπ(st)V^\pi(s_t) is not a separate object redefined from scratch. It averages all possible Qπ(st,at)Q^\pi(s_t,a_t) values in the same state according to the policy probabilities:

Vπ(st)=atAπ(atst)Qπ(st,at)V^\pi(s_t) = \sum_{a_t \in \mathcal{A}} \pi(a_t \mid s_t) Q^\pi(s_t, a_t)

Similarly, Qπ(s,a)Q^\pi(s,a) has its own one-step recursive relationship: first look at the immediate reward brought by action aa, then look at the value of the next state reached after that action.

Qπ(st,at)=Eπ[Gtst,at]=Eπ[rt+γGt+1st,at]=R(st,at)+γEπ[Gt+1st,at]=R(st,at)+γEπ[Vπ(st+1)st,at]=R(st,at)+γst+1SP(st+1st,at)Vπ(st+1)\begin{aligned} Q^\pi(s_t, a_t) &= \mathbb{E}_\pi[G_t \mid s_t, a_t] \\ &= \mathbb{E}_\pi[r_t + \gamma G_{t+1} \mid s_t, a_t] \\ &= R(s_t, a_t) + \gamma \mathbb{E}_\pi[G_{t+1} \mid s_t, a_t] \\ &= R(s_t, a_t) + \gamma \mathbb{E}_\pi[V^\pi(s_{t+1}) \mid s_t, a_t] \\ &= R(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) V^\pi(s_{t+1}) \end{aligned}

This says: the value of taking action ata_t equals the immediate reward brought by action ata_t plus the average value of the next state caused by action ata_t.

Bellman Expectation Equation

With the conversion relationship between VV and QQ, we can continue along the equality and write the full Bellman expectation equation.

That is, first write the relationship between VV and QQ, then substitute the Bellman expansion of Qπ(st,at)Q^\pi(s_t, a_t):

Vπ(st)=atAπ(atst)Qπ(st,at)=atAπ(atst)[R(st,at)+γst+1SP(st+1st,at)Vπ(st+1)]\begin{aligned} V^\pi(s_t) &= \sum_{a_t \in \mathcal{A}} \pi(a_t \mid s_t) Q^\pi(s_t,a_t) \\ &= \sum_{a_t \in \mathcal{A}} \pi(a_t \mid s_t) \left[ R(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) V^\pi(s_{t+1}) \right] \end{aligned}

The logic of this formula is very clear. It expands in two layers:

LayerWhat is being averaged?Corresponding formula
Action layerThe policy may choose different actionsatπ(atst)\sum_{a_t} \pi(a_t \mid s_t)
State layerThe same action may reach different next statesst+1P(st+1st,at)\sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)

Intuitive example: treasure corridor

Imagine a five-cell corridor, with treasure on the far right. You can only keep walking right. Each step costs 1 unit of stamina, so every step has reward 1-1; after reaching the treasure, the game ends, and the terminal value is recorded as 00.

Do not rush to inspect the whole path. Start from the easiest position: cell 5 is the treasure, and the episode is already over, so its value is 0.

PositionWhy it is computed this wayValue
Cell 5Already at the treasure; no need to walk00
Cell 4One step to cell 5: 1+0-1 + 01-1
Cell 3One step to cell 4: 1+(1)-1 + (-1)2-2
Cell 2One step to cell 3: 1+(2)-1 + (-2)3-3
Cell 1One step to cell 2: 1+(3)-1 + (-3)4-4

So the value table for this corridor is:

[4,3,2,1,0][-4,\,-3,\,-2,\,-1,\,0]

This example is meant to explain one thing: the value of the current position can be computed as “the reward for taking one step + the value of the next cell.” This is the most intuitive form of the Bellman equation.

Bellman Optimality Equation

The Bellman expectation equation answers: “If I act according to policy π\pi, how many points is state sts_t worth?” But the ultimate goal of RL is to find the best policy. We want to know: “If I make the optimal choice at every step, how many points can state sts_t be worth at most?”

This introduces the optimal value function V(st)V^*(s_t) and the optimal action-value function Q(st,at)Q^*(s_t, a_t). In a state, the optimal choice is to select the action that maximizes Q(st,at)Q^*(s_t, a_t):

V(st)=maxatAQ(st,at)V^*(s_t) = \max_{a_t \in \mathcal{A}} Q^*(s_t, a_t)

Similarly, the optimal Q(st,at)Q^*(s_t, a_t) still follows the Bellman recursion (except that future states are also assumed to act according to the optimal policy):

Q(st,at)=R(st,at)+γst+1SP(st+1st,at)V(st+1)Q^*(s_t, a_t) = R(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) V^*(s_{t+1})

Substituting Q(st,at)Q^*(s_t, a_t) into V(st)V^*(s_t) gives the Bellman optimality equation:

V(st)=maxatA[R(st,at)+γst+1SP(st+1st,at)V(st+1)]V^*(s_t) = \max_{a_t \in \mathcal{A}} \left[ R(s_t, a_t) + \gamma \sum_{s_{t+1} \in \mathcal{S}} P(s_{t+1} \mid s_t, a_t) V^*(s_{t+1}) \right]

Compared with the expectation equation, the only difference is that summation (averaging) becomes maximization (max\max).

  • Expectation equation atπ(atst)\sum_{a_t} \pi(a_t \mid s_t): weighted average according to the policy probabilities.
  • Optimality equation maxat\max_{a_t}: discard probabilities and directly choose the action with the largest value. Once the optimality equation is solved, the optimal policy naturally follows: choose argmaxatQ(st,at)\arg\max_{a_t} Q^*(s_t,a_t) each time.

Look at a small two-step example. Suppose you are currently in state ss and have two actions: right or left. First write out the paths:

Current actionImmediate rewardReaches
Right+1+1sRs_R
Left+0+0sLs_L

The easy-to-confuse part is “after reaching the next state, how do we continue?” An ordinary policy and an optimal policy are different here:

Reached stateContinue under ordinary policy π\piFrom then on, always act optimally
sRs_RCan still get 55 on averageCan still get at most 77
sLs_LCan still get 22 on averageCan still get at most 44

This table means: if the first step goes right and reaches sRs_R, ordinary policy π\pi may still explore afterward and may not choose the best action every time, so it can get 55 on average. But if from sRs_R onward every step chooses the best action, it can get at most 77. If the first step goes left and reaches sLs_L, the ordinary policy can get 22 afterward on average, but if it acts optimally afterward, it can get at most 44.

Now computing QQ becomes clear. First consider ordinary policy π\pi. Qπ(s,a)Q^\pi(s,a) means: first take action aa, then continue following policy π\pi.

ActionRewardNext stateFuture value under π\piQπQ^\pi
Right+1+1sRs_R55Qπ(s,right)=1+5=6Q^\pi(s,\text{right})=1+5=6
Left+0+0sLs_L22Qπ(s,left)=0+2=2Q^\pi(s,\text{left})=0+2=2

If the current policy π\pi still explores randomly in state ss, for example choosing right with probability 40%40\% and left with probability 60%60\%, then its state value is the probability-weighted average of the two QπQ^\pi values:

Vπ(s)=0.4×Qπ(s,right)+0.6×Qπ(s,left)=0.4×6+0.6×2=3.6V^\pi(s)=0.4\times Q^\pi(s,\text{right})+0.6\times Q^\pi(s,\text{left}) =0.4\times 6+0.6\times 2=3.6

Now consider the optimal case. Q(s,a)Q^*(s,a) means: first take action aa, then act optimally at every later step. Therefore, after going right, we use the optimal future value of sRs_R, which is 77, not the ordinary-policy value 55:

ActionRewardNext stateFuture value under optimal behaviorQQ^*
Right+1+1sRs_R77Q(s,right)=1+7=8Q^*(s,\text{right})=1+7=8
Left+0+0sLs_L44Q(s,left)=0+4=4Q^*(s,\text{left})=0+4=4

In the current state, the optimal action value for right is 8 and for left is 4, so the optimal value directly takes the maximum:

V(s)=maxaQ(s,a)=max(8,4)=8V^*(s)=\max_a Q^*(s,a)=\max(8,4)=8

The difference between the two computations is:

CaseWhich Q is used?How the current action is handledComputed V
Ordinary policy VπV^\piQπQ^\piAverage by policy probability: 0.40.4 right, 0.60.6 left3.63.6
Optimal value VV^*QQ^*Directly choose the maximum: maxaQ(s,a)\max_a Q^*(s,a)88

The expectation equation evaluates “how good the current policy is on average,” so it uses QπQ^\pi and averages according to the policy probabilities. The optimality equation searches for “how good things can be if the best choice is made from now into the future,” so it uses QQ^* and takes the maximum.

Value Tables: From Value Equations to Updatable Tables

At this point, we know the definition of Vπ(s)V^\pi(s) and have derived the Bellman equation. It tells us: if policy π\pi is fixed, the true value function must satisfy the recursive relationship “immediate reward + next-state value.”

However, the Bellman equation only gives a constraint. It does not directly give the answer. It says, “the true VV must make the current-state and next-state values line up with each other.” This is like writing down a system of linear equations: doing so does not mean we have already solved every unknown. Knowing how values should recurse does not mean we already know the concrete number for every V(s)V(s).

To turn the equation into an algorithm, we need to assign each state a storage location, fill it initially with some number, and then let the program repeatedly modify it so it approaches the true value. This is the value table.

If the state space is small, the simplest method is to store one number for each state, without introducing a neural network first. For example, in a three-cell corridor with three states S,M,GS,M,G, we prepare three rows:

State ssCurrent estimate V(s)V(s)Meaning
SS00Long-term return estimate from SS
MM00Long-term return estimate from MM
GG00From the terminal state, usually there is no future return

This list of “state \rightarrow value estimate” is what this book calls a value table. It is not a new mathematical object, but a representation of VV. Mathematically, it can be viewed as the value vector v\boldsymbol{v} in the matrix form above. In a program, it can be a NumPy array, a Python dictionary, a hash table, or a column in a database table.

More generally, this belongs to tabular methods. In the reinforcement learning literature, tabular methods mean that the number of states and actions is small enough that we can store a separate entry for every state or every state-action pair and directly update those entries[1]. Therefore, the “table” does not necessarily have to be drawn as a spreadsheet; the key is that every state has its own storage location.

First separate two things that are easy to confuse:

  • Environment rule table: describes the task itself, recording “from where, doing what, to where, and receiving how much reward.” In small discrete problems, it is a tabular representation of the environment model. If the environment is stochastic, the table must also include probabilities of different next states.
  • Value table: describes the algorithm’s current judgment, recording “from this state, how much is it probably worth in the long run?”

Textbooks often draw the value table directly on the state space. The following figure comes from Sutton and Barto’s small Gridworld policy-evaluation example. Here GridWorld is a grid environment: cells represent states, movement directions represent actions, and the numbers in cells represent the current estimate of Vk(s)V_k(s).

The left column shows the value table after k=0,1,2,3,10,k=0,1,2,3,10,\infty rounds of policy evaluation under the same random policy. The right column shows what greedy policy would be obtained if we used these value tables to choose actions.

Sutton and Barto's iterative policy evaluation example in a small Gridworld: the left column shows state-value tables at different iteration counts, and the right column shows the corresponding greedy policies.

Source: Sutton and Barto, Reinforcement Learning: An Introduction, 2nd ed., Figure 4.1. The original book uses the CC BY-NC-ND 2.0 license.

For now, it is enough to look only at the left column. At k=0k=0, except for terminal states, many cell values start from 0. After several iterations, positions near the terminal states change first, and farther positions are gradually pulled along. The numbers are not “the reward obtained immediately upon entering this cell.” They are “the expected total return from this cell onward under the current policy.”

Consider a minimal example: a three-cell corridor.

Environment rule table: records what happens in one step

Current stateActionNext stateReward
SSMove rightMM1-1
MMMove rightGG1-1
GGEndNone00

Value table: records how much it is probably worth from here onward

State ssCurrent estimate V(s)V(s)Meaning
SS2-2Starting from SS, two steps are expected
MM1-1Starting from MM, one step is expected
GG00Already at the terminal state; no future reward

The first table comes from the environment and tells us “what happens after one step.” In this corridor example, transitions are deterministic, so each row contains only one next state. If the same action may lead to multiple next states, the environment rule table must be written with probabilities, which is exactly P(ss,a)P(s'\mid s,a). The second table comes from the algorithm and stores the current estimate of long-term return. Reward is one-step; value is long-term. Reward is given by the environment; value must be estimated by the algorithm.

Notice that the 2,1,0-2,-1,0 in the second table are the results after computation is already correct. At the beginning, the algorithm usually does not know these answers. It may initialize the entire table to 0:

State ssCurrent estimate V(s)V(s)
SS00
MM00
GG00

The 0 here does not mean these states are truly worth only 0 points. It means “we do not know yet, so use 0 as the initial guess.” What later DP, MC, and TD methods do is continuously modify the right-hand column so that it gradually approaches the true VπV^\pi.

To summarize: the value table solves a very concrete problem: it gives the abstract value function a carrier that can be repeatedly updated. When the number of states is finite and enumerable, the value table is the most direct representation. When the state space is huge or continuous, this table cannot fit everything, and later we need function approximation and neural networks to replace it.

The same idea naturally extends to action values. A state-value table stores one number for each state; an action-value table stores one number for every “state-action pair,” namely Q(s,a)Q(s,a). The Q table in Section 3.5 is essentially the value table here extended from “one cell per state” to “one cell per action under each state.” Classic TD learning can be viewed as using experience to correct state predictions[2]; Q-learning updates action values in a tabular setting with finite states, finite actions, and repeated sampling[3].

From Equation to Update Rule

The previous subsection explained the role of the value table: every state has a location in the table, but the number inside is initially only a temporary estimate. The next real problem is: how should these numbers gradually become estimates closer to the true values?

Look at the Bellman equation from another angle. Earlier, when we wrote it, it looked like an “equality that the correct answer must satisfy”:

Vπ(s)=aπ(as)[R(s,a)+γsP(ss,a)Vπ(s)].V^\pi(s)= \sum_a\pi(a\mid s) \left[ R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V^\pi(s') \right].

Both sides here use the true VπV^\pi, so the equation describes the relationship of the final answer. But at the beginning, the algorithm does not have the true VπV^\pi. It only has a temporary value table VkV_k. Therefore, the natural idea is: treat the right-hand side of the equation as a new target value, and use it to rewrite the old table.

Before answering how to rewrite it, distinguish two situations:

  • The environment model is known: here “environment model” means the mathematical version of the environment rules: after taking action aa in state ss, with what probability will we arrive at each next state ss', and how much reward will we obtain on average? In other words, we know the transition probability P(ss,a)P(s'\mid s,a) and reward R(s,a)R(s,a), so we can directly use the Bellman equation to construct an update rule.
  • The environment is a black box: we do not know this information. We can only actually take one step, observe one reward and one next state.

First look at the first situation.

For a fixed policy π\pi, denote the old table as VkV_k. Now we no longer pretend that we already know the true VπV^\pi. Instead, we use Vk(s)V_k(s') in the old table to estimate the value of next states:

BellmanTargetk(s)=aπ(as)[R(s,a)+γsP(ss,a)Vk(s)].\text{BellmanTarget}_k(s)= \sum_a\pi(a\mid s) \left[ R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V_k(s') \right].

This step means:

  • The inner sP(ss,a)Vk(s)\sum_{s'}P(s'\mid s,a)V_k(s') computes “the expected value of the next state after taking action aa.”
  • The outer aπ(as)\sum_a\pi(a\mid s) averages over actions according to the current policy π\pi.
  • After adding R(s,a)R(s,a) and discount factor γ\gamma, the result is “how much the old table thinks state ss should now be worth.”

The update rule is then direct:

Vk+1(s)BellmanTargetk(s).V_{k+1}(s)\leftarrow \text{BellmanTarget}_k(s).

Doing this computation for all states produces a new round of the value table, Vk+1V_{k+1}. This step is often called a Bellman backup: use the old estimate of the next state to update the current state in reverse.

Sometimes we do not directly replace the old value, but only move a small step toward the target:

V(s)V(s)+α[BellmanTargetk(s)V(s)].V(s)\leftarrow V(s)+\alpha\left[\text{BellmanTarget}_k(s)-V(s)\right].

Here α\alpha is the learning rate. α=1\alpha=1 means fully adopting the new target; α=0.1\alpha=0.1 means moving the old value only 10% of the way toward the target. Whether we replace directly or move by a small step, the core idea is the same: the right-hand side of the Bellman equation gives the target, and the value table stores the current estimate.

Use a minimal example to verify this. To avoid being distracted by multiple states, shrink the value table to the minimum: the entire table has only one row, and the only state is ss.

State ssCurrent estimate V(s)V(s)
ss00

Imagine there is only one slot machine A in front of you. Each round, you press the button, receive one reward, and then return to the same state ss. The policy is fixed: always choose A. Now ask only one question:

Starting from this state and continuing forever, what is the discounted average total return?

The setting is:

  • There is only one state, and after playing one round, we return to the same state.
  • Always choose machine A.
  • Machine A gives reward +1+1 with probability 60% and reward 1-1 with probability 40%.
  • The discount factor is γ=0.9\gamma=0.9.

The one-step expected reward is:

R=0.6×(+1)+0.4×(1)=0.2R=0.6\times(+1)+0.4\times(-1)=0.2

Because after each round we are still in the same state, the next-state value still looks up the same cell in the value table, namely V(s)V(s). According to the Bellman expectation equation:

V(s)=0.2+0.9V(s)V(s)=0.2+0.9V(s)

Analytic solution: solve the equation directly. V(s)0.9V(s)=0.2V(s)-0.9V(s)=0.2, so V(s)=2.0V(s)=2.0. This 2.02.0 is not a one-step reward. It is the number that the cell for state ss in the value table should ultimately approach, meaning “from now on, if we keep playing, the discounted average total return.”

Iterative method: if we do not want to, or cannot, solve the equation directly, we can also start from the initial value table V0(s)=0V_0(s)=0 and perform one Bellman update each round, changing the cell for state ss into a new estimate:

Vk+1(s)R+γVk(s)V_{k+1}(s)\leftarrow R+\gamma V_k(s)

Compute the first few rounds by hand:

RoundUpdate formulaNew value of V(s)V(s) in the table
Initial-00
Round 10.2+0.9×00.2+0.9\times00.20.2
Round 20.2+0.9×0.20.2+0.9\times0.20.380.38
Round 30.2+0.9×0.380.2+0.9\times0.380.5420.542
Round 40.2+0.9×0.5420.2+0.9\times0.5420.68780.6878
Round 50.2+0.9×0.68780.2+0.9\times0.68780.819020.81902

Each round uses “immediate expected reward 0.20.2 + discounted old value estimate” to obtain a new estimate. Because there is only one state, it looks like we are only updating one number. With multiple states, the same operation updates the whole value table cell by cell. At the beginning it is far from 2.02.0, but it approaches gradually.

python
R = 0.2
gamma = 0.9
V = 0.0

for i in range(50):
    V = R + gamma * V

print(round(V, 6))

运行结果

1.998698

This is the basic shape of Bellman iteration: instead of computing the answer all at once, repeatedly apply the Bellman relationship so that the estimate gradually approaches the true value. Under a fixed policy π\pi, this corresponds to iterative policy evaluation. If later we replace “averaging according to the policy” with “taking the maximum over actions,” it leads to value iteration in optimal control.

When the Environment Is a Black Box: TD Error

Return to the slot-machine example. There we could write V(s)=0.2+0.9V(s)V(s)=0.2+0.9V(s) because two things were already known: the average reward each time is 0.20.2, and after pressing the button we definitely return to the same state. In other words, we knew the environment rules, so we could directly compute “what happens next on average.”

Real tasks are usually not this convenient. The agent does not know where each action will lead, nor the probabilities of different outcomes. It can only actually take a step, see a reward, and see which state it lands in. TD error is used in this situation. First understand it as a reconciliation tool: compare “the target given by this one actual experience” with “the old estimate in the value table,” and see how much they differ.

If you do not fully understand this on the first reading, that is fine. The next section compares DP, MC, and TD side by side, and later algorithm chapters will repeatedly return to this concept. For now, grasp one distinction: when the model is known, we compute the average case; TD sees only the one case that just happened.

First look at the known-model form. Suppose we know the complete environment rules. Then we can write the exact Bellman target:

aπ(as)[R(s,a)+γsP(ss,a)V(s)].\sum_a\pi(a\mid s) \left[ R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V(s') \right].

Here R(s,a)R(s,a) is not one realized reward, but an average reward; P(ss,a)P(s'\mid s,a) is not one realized next state, but the probability of each possible next state. The expression means: include all possible actions and all possible next states, then average according to their probabilities.

TD cannot do this. It does not have that probability table. It only knows what truly happened in the most recent step: the reward observed this time was rr, and the next state reached this time was ss'. Therefore, it first constructs a temporary target from this one experience:

Target=r+γV(s)\text{Target} = r + \gamma V(s')

The lowercase rr and ss' remind us that they are not the average laws of the environment, but one result just observed. This target may be too high or too low, but after many samples, these high and low deviations gradually cancel out.

The comparison is:

MethodReward usedNext state usedNature of target
Exact Bellman updateExpected reward R(s,a)R(s,a)All possible ss', weighted by P(ss,a)P(s'\mid s,a)Exact expectation
TD updateThis actual reward rrThis actual next state ss'Noisy one-sample estimate

A numerical example makes it clearer. Suppose that after acting according to the current policy in state ss, the environment has two possible outcomes:

Next stateProbabilityThis-step rewardCurrent value estimate
s1s_170%+1+1V(s1)=5V(s_1)=5
s2s_230%1-1V(s2)=0V(s_2)=0

Let γ=0.9\gamma=0.9. If the model is known, we include both possibilities:

BellmanTarget=0.7×(1+0.9×5)+0.3×(1+0.9×0)=3.55.\text{BellmanTarget} =0.7\times(1+0.9\times5)+0.3\times(-1+0.9\times0) =3.55.

This is the exact Bellman update. It is like holding an environment manual: we know what outcomes may happen from ss and the probability of each, so we can compute the average target 3.553.55 at once.

TD is in a different position. It does not have this manual, and it is not actively sampling outcomes with 70% and 30% by itself. “Sampling” only means that the agent actually took one step and the environment gave it one result. If the environment naturally has the pattern “most of the time go to s1s_1, sometimes go to s2s_2,” then when repeatedly starting from ss, the agent will naturally see s1s_1 more often and s2s_2 occasionally. But in each update, TD uses only the single experience in front of it.

Suppose the value table currently has V(s)=3V(s)=3, and the learning rate is α=0.1\alpha=0.1.

If this time it really reached s1s_1 and received reward r=1r=1, TD computes:

TD Target=1+0.9×5=5.5,\text{TD Target}=1+0.9\times5=5.5,

δ=5.53=2.5,\delta=5.5-3=2.5,

Then it adjusts V(s)V(s) upward a little:

V(s)3+0.1×2.5=3.25.V(s)\leftarrow 3+0.1\times2.5=3.25.

If another time it really reached s2s_2 and received reward r=1r=-1, TD computes:

TD Target=1+0.9×0=1,\text{TD Target}=-1+0.9\times0=-1,

δ=13=4,\delta=-1-3=-4,

Then it adjusts V(s)V(s) downward a little:

V(s)3+0.1×(4)=2.6.V(s)\leftarrow 3+0.1\times(-4)=2.6.

Put the two cases together:

What really happened this timeTD TargetTD error δ\deltaUpdated V(s)V(s)
Reached s1s_1, reward +1+15.55.52.52.53.253.25
Reached s2s_2, reward 1-11-14-42.62.6

There is an important detail here: TD does not guarantee that every single update moves closer to the average target.

In this example, the exact Bellman target is 3.553.55, while the current V(s)=3V(s)=3, so the difference is 0.550.55. If this time we reached s1s_1, the updated value becomes 3.253.25, which is indeed closer to 3.553.55. But if this time we reached s2s_2, the updated value becomes 2.62.6, which is farther from 3.553.55.

This is not a contradiction. It is normal TD behavior. TD corrects the value table using only the experience in front of it, so one “bad outcome” pulls the estimate down, and one “good outcome” pushes it up. A single update may move in the wrong direction; what matters is the average direction over many updates.

Combine the two updates according to the environment’s own probabilities:

0.7×3.25+0.3×2.6=3.055.0.7\times3.25+0.3\times2.6=3.055.

That is, if we start from V(s)=3V(s)=3, the “average position” after the next TD update is 3.0553.055. It still does not jump directly to 3.553.55, but it is already closer to 3.553.55 than 33 was.

Similarly, from the TD error perspective:

0.7×2.5+0.3×(4)=0.55.0.7\times2.5+0.3\times(-4)=0.55.

The average TD error is positive, which means that over the long run, the current V(s)=3V(s)=3 is too low and should be adjusted upward. It is just that each actual experience does not necessarily move in that direction.

The point of this table is not that “the agent should sample according to 70% and 30%.” It is that TD only looks at one real experience each time. Common outcomes participate in updates more frequently, and rare outcomes participate less frequently. After many updates accumulate, the average direction gradually approaches the probability-weighted Bellman target above.

This δ\delta is the Temporal Difference Error, usually called the TD Error. It measures how far the “one-step target” is from the “current estimate”:

δ=r+γV(s)Bellman targetV(s)current estimate\delta = \underbrace{r + \gamma V(s')}_{\text{Bellman target}} - \underbrace{V(s)}_{\text{current estimate}}

Why is it called “temporal difference”? Because it compares prediction differences between two neighboring states in a time sequence: when standing at ss, we have an old prediction; after moving to ss', the term r+γV(s)r+\gamma V(s') gives a new target.

Its meaning is direct:

TD ErrorMeaningHow to adjust
δ>0\delta>0The target is higher than the current estimate, so we underestimatedIncrease V(s)V(s)
δ<0\delta<0The target is lower than the current estimate, so we overestimatedDecrease V(s)V(s)
δ=0\delta=0Under this sample, the target and estimate agreeAlmost no adjustment needed

The simplest update formula is:

V(s)V(s)+αδ.V(s)\leftarrow V(s)+\alpha\delta.

The 3.253.25 and 2.62.6 in the two branches above are exactly the results of this formula when α=0.1\alpha=0.1.

This is the basic move of TD learning: every time the agent takes a step, it uses “the one step actually experienced + the next-state estimate” to correct the current-state estimate. After many samples, the noise of single samples is averaged out, and the value table gradually approaches the true VπV^\pi.

DP, MC, and TD handle different environment conditions and information assumptions. The next section, Classic Methods at a Glance: DP, MC, and TD, compares them specifically.

Section Summary

This page has many formulas, but only one main thread:

  1. Vπ(s)V^\pi(s) is the average long-term return obtained by starting from state ss and acting according to policy π\pi.
  2. Return can recurse: Gt=rt+γGt+1G_t=r_t+\gamma G_{t+1}.
  3. Therefore value can also recurse: current value equals immediate reward plus next-state value.
  4. If the policy is fixed, we obtain the Bellman expectation equation.
  5. If the best action is chosen at every step, we obtain the Bellman optimality equation.
  6. In finite-state problems, V(s)V(s) can be stored as a value table, giving every state an updatable number.
  7. If we only have sampled data, we use the Bellman target and TD Error to gradually correct V(s)V(s) in the value table.

V(s)V(s) can already tell you “whether this situation is good,” but there are still two gaps at this point.

The first gap is: if we do not know the environment model, how do we estimate VV? The Bellman expectation equation contains P(ss,a)P(s'\mid s,a) and R(s,a)R(s,a), but in real tasks we often only obtain sampled trajectories one by one. The next section discusses DP, MC, and TD, which are three ways to estimate value from models or data.

The second gap is: VV does not directly tell you which action to choose. If V(s)=80V(s)=80, you only know the current situation is good, but you do not know whether to go left, go right, or stay still. A more direct approach is to score each action as well:

Q(s,a)Q(s,a)

There is no need to expand QQ yet. For now, just remember the division of labor between the two tables: the value table V(s)V(s) scores states, while the action-value table Q(s,a)Q(s,a) scores “first taking action aa in state ss.” If the environment model is known, meaning we know how each action affects the next state and reward, we can use the VV table for one-step lookahead to compare actions. If the environment model is unknown, directly learning the QQ table is often more convenient, because it stores the long-term consequences of each action directly.

So the next sequence is: first read DP, MC, and TD to solve “how to estimate VV”; then enter Action-Value Function to solve “how to use value to choose actions.”

References


  1. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. Part I of the book is titled “Tabular Solution Methods,” and Chapter 4, Figure 4.1 shows the policy-evaluation process of writing state values directly in Gridworld cells. MIT Press Open Access page: https://mitpress.mit.edu/9780262039246/reinforcement-learning/; author-hosted PDF: http://incompleteideas.net/book/RLbook2020.pdf. ↩︎

  2. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44. DOI: https://doi.org/10.1007/BF00115009. ↩︎

  3. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. DOI: https://doi.org/10.1007/BF00992698; author page: https://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html. ↩︎

现代强化学习实战课程