E.1.2 Matrix Form of the Bellman Equation
Prerequisites: E.1.1 Vectors and Matrices, especially vectors, matrices, and matrix multiplication. It is also helpful to first read the Chapter 3 discussion of the Bellman equation, so the single-state form is familiar.
From One Equation to a Matrix Equation
Chapter 3 gave the Bellman equation for a single state:
This formula works well for one state. But when an environment contains states, we need such equations, one for each state. Combining these equations gives a single matrix expression:
Here is the value vector for all states, is the immediate reward vector for all states, and is the state transition matrix. Moving terms also gives a closed-form solution:
We will derive these formulas from the single-state equation in three steps: first place all values into a vector, then place transitions into a matrix, and finally assemble and solve the system of equations.
Step 1: Put All State Values into a Vector
Use the two-state example from the appendix introduction:
- In , the reward is , and the next state is always .
- In , the reward is , and the next state is always .
- The discount factor is .
The two state values are and . Put all state values into one column vector:
is the unknown quantity, and is the known immediate reward. With two states, the vector has two components. With states, it has components, but the notation stays the same.
Step 2: Put the Transition Relationship into a Matrix
Starting from always leads to , and starting from always leads to . The transition matrix is:
Rows correspond to "which state we start from," and columns correspond to "which state we go to next." The meaning of matrix-vector multiplication is that row of gives the expected next-state value when starting from state . Check:
This says exactly that from the next state is , so the future value is ; from the next state is , so the future value is . Matrix-vector multiplication has automatically performed the probability-weighted sum.
Step 3: Assemble the System
We now have three objects: value vector , reward vector , and transition matrix . Writing the Bellman equation state by state gives:
In matrix language, the immediate reward is , and the discounted future value is . Adding them gives:
Verify the first row on the right:
The second row:
This is exactly the same as the two equations written separately. The matrix form introduces no new content; it only compresses many equations with the same structure into one equation.
Why the Compression Works
The key is the term . Each row of is exactly a set of transition probabilities whose sum is , and matrix multiplication is exactly weighted averaging: probability times value. The Bellman equation says "value = immediate reward + discounted future value," and the matrix equation states the same relationship for all states at once.
| Symbol | Meaning | Dimension |
|---|---|---|
| v | Values of all states, the unknown | n x 1 |
| r | Immediate rewards of all states | n x 1 |
| gamma P v | Discounted probability-weighted future value | n x 1 |
All three quantities have dimension , so both sides of the equation have the same shape. This is the operation behind DP in Chapter 3: repeatedly apply until convergence.
Closed-Form Solution and Fixed Point
is a linear system, so it can be solved directly. Move the terms containing to the left:
Factor out the common term. Here is the identity matrix, with on the diagonal and elsewhere:
If is invertible, the closed-form solution is:
Substitute the two-state numbers:
Solving the system:
Geometric Intuition: The Intersection Is the Fixed Point
This system corresponds to two lines in the plane:
- The first equation is .
- The second equation is .
Their intersection is . This point is also the fixed point of the Bellman update: substituting into returns the same . The value no longer changes, which means this is the true value.
From Two States to States
The closed-form solution was derived for two states. If we generalize from 2 states to any , the equation keeps the same form:
Here:
- : values of all states under policy
- : expected immediate reward for each state
- : the transition matrix induced by the policy, with
With three states, is , and and are . The equation still holds. This is the central advantage of matrix notation: no matter how many states there are, the equation keeps the same form.
When Is Invertible?
The key requirement for solving is that must be invertible. Intuitively, this means the Bellman update must not diverge. When and is a valid transition matrix whose rows sum to , is essentially always invertible.
More precisely, the spectral radius of , meaning the largest absolute eigenvalue, satisfies . Therefore all eigenvalues of stay away from , so the matrix is invertible. E.1.4 explains this in detail.
Why We Do Not Directly Invert the Matrix
The derivation makes the closed-form solution look simple: write the matrix equation, invert the matrix, and obtain the answer. In practice, this path is usually infeasible for three reasons:
- The scale is too large. If the number of states is , then is a matrix. Matrix inversion costs and is essentially impossible.
- The matrix may not be explicitly available. In many practical problems, the entries of are unknown; we only observe sampled transitions. The MC and TD methods in Chapter 3 operate exactly under this condition.
- The state may be continuous. If the state is an image or text, there may be no finite matrix at all, so cannot be constructed.
Practical algorithms approximate the solution with iterative methods:
- Value iteration: repeatedly execute until convergence.
- Policy evaluation: repeatedly apply Bellman updates inside policy iteration.
- TD learning: use sampled data for incremental updates.
All of these methods approximate the solution in a more scalable way, without actually computing the inverse. The evolution from DP to MC to TD in Chapter 3 corresponds to this path: direct iteration with a known model, sampling without a known model, and one-step sampling updates.
Common Pitfall
When you see , do not assume a practical algorithm is really computing a matrix inverse. This formula is a theoretical closed-form solution used to show existence and uniqueness. Practical algorithms are iterative.
Matrix Form of the Q Function
So far we have handled only the state value . The action value can also be written in matrix form, and it has a clear algebraic relationship with the matrix form of .
Notation
Put the Q values of all pairs into one long vector:
Similarly, stores the immediate reward of each pair.
The transition matrix becomes . Each row corresponds to one pair, and each column corresponds to a next state :
The policy matrix compresses the Q vector back into a V vector:
V-Q Relationship
Check row : . This is the matrix form of .
Bellman Expectation Equation for Q
In matrix form:
Substitute to obtain the pure-Q recursion:
The closed-form solution is .
Bellman Optimality Equation for Q
Matrix form:
Here extracts, for each state, the maximum Q value among all actions of that state. Because max is not a linear operation, the optimality equation has no closed-form linear solution and must be approximated by iteration, such as Q-Learning.
Deriving the Matrix Form of V from Q
Substitute into , then left-multiply both sides by :
This is the same derived earlier. In the matrix form of V, and already average over the policy. In the matrix form of Q, the action dimension is preserved, and policy averaging is handled separately by . This is the matrix-language expression of the idea that carries finer-grained information than .
Matrix Form of DP Iteration
Chapter 3 introduced the state-by-state update for DP policy evaluation:
The matrix form splits the Bellman expectation equation into an iteration:
Each round performs one Bellman update for all states simultaneously. In the two-state example above, starting from and iterating repeatedly converges to the closed-form solution .
Policy improvement, , becomes the following matrix operation: for each state , compare the rows of that belong to that state, where each row corresponds to one action, and choose the action with the largest value.
Comparison Table
| Concept | State-by-state form, Chapter 3 | Matrix form |
|---|---|---|
| Bellman expectation equation | $\boldsymbol{v}\pi = \boldsymbol{r}\pi + \gamma P_\pi \boldsymbol{v}_\pi$ | |
| Bellman optimality equation | $V^(s)=\max_a\left[R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)V^(s')\right]$ | , with rowwise max |
| Closed-form solution | - | |
| V-Q relationship | $\boldsymbol{v}\pi = \Pi\pi \boldsymbol{q}_\pi$ | |
| Q Bellman expectation | $\boldsymbol{q}\pi = \boldsymbol{r} + \gamma P \Pi\pi \boldsymbol{q}_\pi$ | |
| Q Bellman optimality | $Q^(s,a)=R(s,a)+\gamma\sum_{s'}P(s'\mid s,a)\max_{a'}Q^(s',a')$ | $\boldsymbol{q}* = \boldsymbol{r} + \gamma P \cdot\mathrm{rowmax}(\boldsymbol{q}*)$ |
| DP policy evaluation | $\boldsymbol{v}{k+1} = \boldsymbol{r}\pi + \gamma P_\pi \boldsymbol{v}_k$ |
MC and TD methods update individual states from samples, so they do not have a corresponding full-matrix form in the same sense.
Limitations of the Matrix Form
This article compressed the Bellman equation into the matrix form and gave the closed-form solution . No matter how many states there are, the equation is one line.
But the matrix form assumes one thing: each state has an independent value that can be stored.
| Environment | State-space size | Can store? |
|---|---|---|
| Two-state toy environment | 2 | Yes |
| GridWorld 10 x 10 | 100 | Yes |
| Go board | ~10^170 | No |
| Continuous state, such as robot joint angles | Infinite | No |
Once the number of states becomes large, not only is matrix inversion infeasible, but even the value table itself exceeds storage capacity. The closed-form solution is compact, but for Go's states, the required storage is unrealistic.
The solution is to avoid storing one value for every state. Instead, use a function to approximate value: extract features from the state, then compute value from the dot product of features and weights. The next article develops this idea.
Next: E.1.3 Dot Products, Norms, and Function Approximation explains how feature vectors and dot products approximate value when there are too many states to store.