Skip to content

E.1.5 Formula Review and Exercises

Prerequisites: This page summarizes all formulas in module E.1. It is best read after E.1.1 through E.1.4. If this is your first pass, read the main articles first.


Revisiting Chapter 3

You now have the tools of vectors, matrices, dot products, norms, and eigenvalues. Looking back at Chapter 3, many formulas that may have seemed acceptable by definition have cleaner matrix expressions underneath.

Bellman Equation: From State-by-State Writing to One Matrix Line

In Chapter 3, the Bellman expectation equation was written as:

Vπ(s)=aπ(as)[R(s,a)+γsP(ss,a)Vπ(s)].V^\pi(s) = \sum_{a} \pi(a|s)\left[R(s,a) + \gamma\sum_{s'} P(s'|s,a)V^\pi(s')\right].

With matrices, nn such equations become one line:

v=r+γPv.\boldsymbol{v} = \boldsymbol{r} + \gamma P\boldsymbol{v}.

The closed-form solution is v=(IγP)1r\boldsymbol{v} = (I-\gamma P)^{-1}\boldsymbol{r}. The DP methods in Chapter 3 repeatedly apply vk+1=r+γPvkv_{k+1} = r + \gamma Pv_k; in essence, they iteratively approximate this closed-form solution.

TD Error: Incremental Updates and the Matrix Equation

The TD update in Chapter 3 is:

V(s)V(s)+α[r+γV(s)V(s)].V(s) \leftarrow V(s) + \alpha\left[r + \gamma V(s') - V(s)\right].

In vector language, this can be written as:

vv+αesδ,\boldsymbol{v} \leftarrow \boldsymbol{v} + \alpha \cdot \boldsymbol{e}_s \cdot \delta,

where es\boldsymbol{e}_s is the one-hot vector for state ss, with only the ss-th position equal to 11, and δ=r+γV(s)V(s)\delta = r + \gamma V(s') - V(s) is the TD Error. This and the Bellman matrix form v=r+γPv\boldsymbol{v} = \boldsymbol{r} + \gamma P\boldsymbol{v} from E.1.2 are two expressions of the same object: the former is an incremental update that changes one component at a time, while the latter is a global equation that all components satisfy simultaneously.

Tabular Q-Learning: A Special Case of One-Hot plus Dot Product

Chapter 3's Q-Learning stores one number for every state-action pair. From the linear-algebra perspective, this is linear approximation with one-hot features: Q(s,a)=wx(s,a)Q(s,a) = \boldsymbol{w}^\top \boldsymbol{x}(s,a), where x(s,a)\boldsymbol{x}(s,a) is a one-hot vector. Tabular lookup is a special case of linear approximation.

Policy Gradient Updates: The Stage for Norms and Trust Regions

The second route in Chapter 3 defined a policy objective J(θ)J(\theta). When updating parameters, gradient clipping limits g2c\|\boldsymbol{g}\|_2 \leq c, and the trust-region constraint ΔθFΔθδ\Delta\theta^\top F\,\Delta\theta \leq \delta keeps training safe. These are all uses of linear algebra tools for stability.


Concept Map

The table below organizes all concepts in module E.1 by dependency. You can read down the dependency chain or use it as an index to jump to a concept.

ConceptCore formulaDepends onRole in RL
Scalarr=2, gamma=0.9NoneRewards and hyperparameters
Vectorv in R^nScalarStores values of all states
MatrixP in R^(n x n)VectorStores transitions among all states
Matrix multiplication(Pv)_i = sum_j P_ij v_jMatrix + vectorProbability-weighted future value
Bellman matrix formv = r + gamma P vMatrix multiplicationValue recursion equation
Linear-system solutionv = (I - gamma P)^-1 rBellman matrix formTheoretical closed-form solution
Dot productw^T x = sum_i w_i x_iVectorLinear value or policy approximation
L2 normx2=ixi2\lVert\boldsymbol{x}\rVert_2 = \sqrt{\sum_i x_i^2}VectorGradient clipping and regularization
EigenvalueA u = lambda uMatrixAnalyzes matrix scaling behavior
Spectral radius and convergencerho(gamma P) <= gamma < 1EigenvaluesValue-iteration convergence guarantee
Trust-region constraint(theta-theta_old)^T F (theta-theta_old) <= deltaNorms + eigenvaluesSafe update in TRPO

Reading from top to bottom, each concept adds a new capability on top of the previous one. If a concept feels unclear, return to the row named in its "Depends on" column.


From Difficulties to Tools

The central thread of module E.1 is moving from "cannot compute it" to "can compute it safely."

StageDifficultyMathematical toolKey formula
Difficulty 11000 states = 1000 equationsVectors, matrices, equation systemsv = (I - gamma P)^-1 r
Difficulty 2Too many states for the value tableDot products, normsv_hat(s) = w^T x(s), g2c\lVert\boldsymbol{g}\rVert_2 \leq c
Difficulty 3Training may diverge, explode, or driftEigenvalues, weighted normsrho(gamma P) <= gamma < 1, Delta theta^T F Delta theta <= delta

Common Pitfalls

  1. Treating matrices as abstract symbols. In the Bellman equation, every row of transition matrix PP is a probability table describing where the next state goes from the current state.
  2. Thinking inversion is the practical algorithm. v=(IγP)1r\boldsymbol{v}=(I-\gamma P)^{-1}\boldsymbol{r} is a theoretical closed-form solution. Real large-scale problems usually use iteration or function approximation, which is exactly the DP -> MC -> TD progression in Chapter 3.
  3. Ignoring vector direction. Gradients, advantage updates, and trust-region constraints care not only about numerical magnitude, but also about direction in parameter space.
  4. Confusing norms and regularization. A norm is a measurement tool, answering "how large." Regularization adds a norm to the training objective to constrain parameters. Gradient clipping limits the magnitude of a single update. These are three different uses.
  5. Thinking the L2 norm is the only norm. The L1 norm encourages sparse solutions, the Frobenius norm measures matrix size, and weighted norms, or quadratic forms, account for direction sensitivity. Different settings use different norms.

Exercises

Basic

  1. Bellman equation by hand. If γ=0.9\gamma=0.9, v1=1+0.9v2v_1=1+0.9v_2, and v2=2+0.9v1v_2=2+0.9v_1, compute v1v_1 and v2v_2 by hand.

  2. Write the matrix. Write the corresponding r\boldsymbol{r} and PP for the previous problem, then verify whether v=(IγP)1r\boldsymbol{v} = (I - \gamma P)^{-1}\boldsymbol{r} equals your hand calculation.

  3. Gradient clipping. If the gradient is [6,8][6,8]^\top and the maximum norm is 55, what is the clipped gradient?

Intermediate

  1. Eigenvalue calculation. What are the eigenvalues of A=[3102]A = \begin{bmatrix} 3 & 1 \\ 0 & 2 \end{bmatrix}? It is an upper-triangular matrix; what pattern do its eigenvalues follow?

  2. Three-state Bellman equation. There are three states, transition matrix P=[0.20.50.30.00.40.60.70.30.0]P = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.0 & 0.4 & 0.6 \\ 0.7 & 0.3 & 0.0 \end{bmatrix}, reward vector r=[1,2,3]\boldsymbol{r}=[1, 2, 3]^\top, and γ=0.5\gamma=0.5. Write the Bellman matrix form without solving it.

  3. Dot-product approximation. State ss has features x(s)=[0.3,0.5,0.8]\boldsymbol{x}(s)=[0.3, -0.5, 0.8]^\top, and weights are w=[2,1,3]\boldsymbol{w}=[2, -1, 3]^\top. Compute Q(s,a)=wx(s)Q(s,a) = \boldsymbol{w}^\top\boldsymbol{x}(s). If the true value of Q(s,a)Q(s,a) is 55, what is the error?

Challenge

  1. Convergence-speed estimate. Suppose γ=0.95\gamma=0.95 and the initial error is e0=100\|\boldsymbol{e}_0\|=100. How many Bellman updates are needed at minimum to make the error less than 0.010.01? Hint: ekγke0\|\boldsymbol{e}_k\| \leq \gamma^k \|\boldsymbol{e}_0\|.

  2. Trust-region check. Let F=[2008]F = \begin{bmatrix} 2 & 0 \\ 0 & 8 \end{bmatrix}, Δθ=[0.3,0.1]\Delta\theta = [0.3, 0.1]^\top, and δ=0.5\delta = 0.5. Is this update inside the trust region? What about Δθ=[0.3,0.2]\Delta\theta = [0.3, 0.2]^\top? Explain why the second direction is more expensive.

现代强化学习实战课程