Skip to content

E.1.4 Convergence, Eigenvalues, and Trust Regions

Prerequisites: E.1.2 Bellman Matrix Form, especially v=r+γPv\boldsymbol{v} = \boldsymbol{r} + \gamma P\boldsymbol{v}. E.1.3 Dot Products and Norms, especially the definition of the L2 norm.


Overview

The previous two articles derived the matrix form of the Bellman equation, v=r+γPv\boldsymbol{v} = \boldsymbol{r} + \gamma P\boldsymbol{v}, and discussed function approximation. Both share one premise: training repeatedly applies an update rule. Whether it is value iteration vk+1=r+γPvkv_{k+1} = r + \gamma P v_k or SGD updating weights wwαg\boldsymbol{w} \leftarrow \boldsymbol{w} - \alpha \boldsymbol{g}, the process is iterative.

The previous article ended with two unresolved questions:

  1. When we repeatedly apply vk+1=r+γPvkv_{k+1} = r + \gamma P v_k, how do we guarantee that the sequence converges instead of oscillating forever or diverging?
  2. When policy parameters are updated, gradient clipping limits the maximum step size, but the same step size in different parameter directions can have very different effects on the policy distribution. This requires a more refined constraint.

The answers involve three mathematical tools. They protect training stability from long-term, short-term, and fine-grained perspectives. This article establishes three core conclusions:

Eigenvalues guarantee convergence

ρ(γP)γ<1ekγke00\boxed{\rho(\gamma P) \leq \gamma < 1 \quad\Longrightarrow\quad \|\boldsymbol{e}_{k}\| \leq \gamma^k \|\boldsymbol{e}_0\| \to 0}

Norms limit single-step magnitude

gclipped2=min ⁣(g2,  c)\boxed{\|\boldsymbol{g}_{clipped}\|_2 = \min\!\left(\|\boldsymbol{g}\|_2,\; c\right)}

Weighted norms give direction-sensitive safe updates

ΔθFΔθδ\boxed{\Delta\theta^\top F\,\Delta\theta \leq \delta}

We now develop these ideas one by one.


Convergence: Conditions and Guarantees for Bellman Updates

Intuition for Convergence: From One Dimension to Two

Start with a simple numerical transformation:

xk+1=0.5xk.x_{k+1} = 0.5\,x_k.

If x0=8x_0=8, then 84210.58 \to 4 \to 2 \to 1 \to 0.5 \to \cdots. Since each step multiplies by 0.50.5, whose absolute value is less than 11, the sequence converges to 00.

Now move to two dimensions. Suppose a transformation multiplies the two components of a vector by 0.50.5 and 0.30.3:

A=[0.5000.3].A = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.3 \end{bmatrix}.

Then AkA^k multiplies the first component by 0.5k0.5^k and the second by 0.3k0.3^k. As kk grows, both components approach 00.

The key observation is: this transformation scales two directions by 0.50.5 and 0.30.3. Those numbers are the eigenvalues of the matrix.

Eigenvalues and Eigenvectors

More generally, if there exists a nonzero vector u\boldsymbol{u} and a scalar λ\lambda such that:

Au=λu,A\boldsymbol{u} = \lambda \boldsymbol{u},

then u\boldsymbol{u} is an eigenvector of matrix AA, and λ\lambda is the corresponding eigenvalue. In other words, when AA acts on u\boldsymbol{u}, it does not change the vector's direction; it only scales its length by λ\lambda.

A Concrete Calculation

We first compute eigenvalues for an ordinary matrix, then apply the conclusion to Bellman updates and explain why the eigenvalues of γP\gamma P must be less than 11 in absolute value.

Consider a 2×22\times2 matrix:

A=[4123].A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}.

The eigenvalues satisfy the characteristic equation det(AλI)=0\det(A - \lambda I) = 0:

det[4λ123λ]=0.\det\begin{bmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{bmatrix} = 0.

Expanding:

(4λ)(3λ)2=0λ27λ+10=0.(4-\lambda)(3-\lambda) - 2 = 0 \quad\Longrightarrow\quad \lambda^2 - 7\lambda + 10 = 0.

Solving the quadratic:

λ=7±49402=7±32.\lambda = \frac{7 \pm \sqrt{49-40}}{2} = \frac{7 \pm 3}{2}.

Therefore λ1=5\lambda_1 = 5 and λ2=2\lambda_2 = 2. Geometrically, this matrix scales some directions by 55 and other directions by 22. When repeatedly applying AA, the direction with λ1=5\lambda_1 = 5 will dominate growth.

This gives the direct conclusion: if the largest eigenvalue magnitude of a matrix is greater than 11, repeatedly applying it can make vectors grow without bound; if all eigenvalue magnitudes are less than 11, vectors shrink and the iteration converges.

Eigenvalue Argument for Bellman Update Convergence

Multiplying by 0.50.5 converges in one dimension, and in two dimensions the same is true when all eigenvalue magnitudes are less than 11. This conclusion applies directly to Bellman updates:

vk+1=r+γPvk.\boldsymbol{v}_{k+1} = \boldsymbol{r} + \gamma P\boldsymbol{v}_k.

Assume the true solution is v\boldsymbol{v}^* and satisfies v=r+γPv\boldsymbol{v}^* = \boldsymbol{r} + \gamma P\boldsymbol{v}^*. Subtract the two equations:

vk+1v=γP(vkv).\boldsymbol{v}_{k+1} - \boldsymbol{v}^* = \gamma P(\boldsymbol{v}_k - \boldsymbol{v}^*).

Let the error be ek=vkv\boldsymbol{e}_k = \boldsymbol{v}_k - \boldsymbol{v}^*. Then:

ek+1=γPek.\boldsymbol{e}_{k+1} = \gamma P\,\boldsymbol{e}_k.

This recurrence has the same structure as the original one-dimensional example xk+1=0.5xkx_{k+1} = 0.5\,x_k; the only difference is that scalar multiplication has been replaced by matrix multiplication with γP\gamma P.

PP is a transition probability matrix, and each row sums to 11, so its spectral radius, the largest absolute eigenvalue, satisfies ρ(P)1\rho(P) \leq 1. After multiplying by the discount factor:

ρ(γP)γ<1.\rho(\gamma P) \leq \gamma < 1.

Thus every eigenvalue of γP\gamma P has absolute value less than 11. The error shrinks in every direction, so Bellman updates must converge.

This is the mathematical guarantee behind the Chapter 3 DP instruction "iterate until convergence." γ<1\gamma < 1 is not merely an engineering choice; it is the mathematical condition for convergence.

How the Size of γ\gamma Affects Convergence Speed

γ\gamma affects not only whether the iteration converges, but also how fast it converges.

gammarho(gamma P)Convergence speedMeaning
0.1<= 0.1Very fastFocuses mostly on the immediate future
0.5<= 0.5Fairly fastBalances near and distant returns
0.9<= 0.9SlowLooks far ahead and needs more iterations
0.99<= 0.99Very slowStrongly emphasizes long-term return

Numerical demonstration: suppose the initial error is e0=10\|\boldsymbol{e}_0\|=10. After kk iterations, the error bound is:

Iterations kError with gamma=0.5Error with gamma=0.9Error with gamma=0.99
1599.9
50.315.99.51
100.013.499.04
50approx 00.0056.05
100approx 0approx 03.66

The closer γ\gamma is to 11, the slower convergence becomes, but the resulting value looks farther into the future. This is a common accuracy-efficiency tradeoff in RL.

Contraction Mapping: A Formal Guarantee

Eigenvalues explain that error shrinks in every direction. Mathematically, there is a more unified way to express the same conclusion: a contraction mapping. It can be shown that the Bellman operator Tv=r+γPv\mathcal{T}\boldsymbol{v} = \boldsymbol{r} + \gamma P\boldsymbol{v} satisfies:

TuTvγuv.\|\mathcal{T}\boldsymbol{u} - \mathcal{T}\boldsymbol{v}\|_\infty \leq \gamma \|\boldsymbol{u}-\boldsymbol{v}\|_\infty.

This inequality says: after one Bellman update, the distance between two estimates is at most γ\gamma times what it was before. Since γ<1\gamma < 1, the distance strictly decreases. This means the iteration has exactly one fixed point, the target value function v\boldsymbol{v}^*.

Expand: Banach Fixed-Point Theorem

A contraction mapping has an elegant property, known as the Banach fixed-point theorem: in a complete metric space, a contraction mapping has exactly one fixed point, and repeated iteration converges to it. v\boldsymbol{v}^* is that fixed point.

The full theorem statement is not necessary here. The core conclusion is: γ<1\gamma < 1 is the mathematical guarantee for Bellman update convergence. If γ=1\gamma = 1, meaning no discounting, the update may fail to converge in some cases.


Eigenvalues guarantee convergence from a long-term perspective, but convergence only says that repeated iteration eventually reaches a fixed point. It does not guarantee that each individual update has controlled magnitude. The gradient clipping introduced in E.1.3 uses the L2 norm to limit the maximum step length and already prevents many training instabilities. But it has a blind spot: parameter changes in different directions can affect the policy distribution very differently, while the L2 norm treats all directions equally. To address this, we need to upgrade "uniform length" into "weighted length."


Direction Sensitivity: From Ordinary Norms to Trust-Region Constraints

The Limitation of Ordinary Distance

The L2 norm is defined as:

x22=xx.\|\boldsymbol{x}\|_2^2 = \boldsymbol{x}^\top \boldsymbol{x}.

For example, if x=[3,4]\boldsymbol{x} = [3, 4]^\top, then xx=9+16=25\boldsymbol{x}^\top\boldsymbol{x} = 9 + 16 = 25. The L2 norm treats every direction equally: moving one step to the right and one step upward have the same "length."

In parameter space, however, different directions have different safety risks. Consider two parameter update directions:

Direction A: Δθ=[1,0]\Delta\theta = [1, 0]^\top

This direction mainly changes the first parameter. Suppose that parameter controls the logit for choosing action left. Changing it by 11 may only shift the probability from 0.50.5 to about 0.730.73.

Direction B: Δθ=[0,1]\Delta\theta = [0, 1]^\top

This direction changes the second parameter. Suppose that parameter controls output scale. Changing it by 11 may make all action probabilities change drastically.

Both directions have L2 norm 11, but their effects on the policy distribution are completely different. A plain L2 constraint Δθ2δ\|\Delta\theta\|_2 \leq \delta cannot distinguish this difference.

Weighted Norms: Direction-Aware Measurement

To distinguish risk across directions, introduce a matrix FF and define a weighted length in place of the ordinary L2 norm:

xF2=xFx.\|\boldsymbol{x}\|_F^2 = \boldsymbol{x}^\top F\,\boldsymbol{x}.

Consider a concrete example:

F=[1004],x=[11].F = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}, \qquad \boldsymbol{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.

Then:

xFx=[11][1004][11]=1×1+4×1=5.\boldsymbol{x}^\top F\,\boldsymbol{x} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 1 \times 1 + 4 \times 1 = 5.

The second direction is amplified by weight 44, so the same step length costs more in that direction. The larger the diagonal element of FF, the more tightly movement in that direction is constrained.

Trust-Region Constraint

TRPO uses a trust-region constraint of the following form:

(θθold)F(θθold)δ.(\theta - \theta_{old})^\top F(\theta - \theta_{old}) \leq \delta.

Here FF is usually the Fisher information matrix. The intuition is: a parameter update should not be judged only by Euclidean distance; it should also account for how much the step changes the policy distribution. If a direction would sharply change the policy distribution, FF shrinks the allowed step in that direction. If a direction barely affects the policy distribution, FF allows a larger step.

The (i,j)(i,j) entry of the Fisher information matrix is:

Fij=Eπ[logπθ(as)θilogπθ(as)θj].F_{ij} = \mathbb{E}_\pi\left[\frac{\partial \log \pi_\theta(a\mid s)}{\partial \theta_i}\frac{\partial \log \pi_\theta(a\mid s)}{\partial \theta_j}\right].

The core role of FF is to map distances in parameter space to distances in policy-distribution space. The former is Euclidean geometry; the latter is information geometry.

From TRPO to PPO: Exact Constraints and Approximate Constraints

TRPO directly solves an optimization problem with a quadratic constraint, which is computationally expensive because it requires computing the Fisher matrix and its inverse. PPO, the mainstream industrial RL algorithm, uses a simplified strategy:

  • It does not explicitly compute FF. Instead, it uses the probability ratio rt(θ)=πθ(atst)/πold(atst)r_t(\theta) = \pi_\theta(a_t\mid s_t)/\pi_{old}(a_t\mid s_t) to indirectly measure policy change.
  • It does not solve constrained optimization. Instead, it clips clip(r_t, 1-\epsilon, 1+\epsilon) to hard-limit how far the probability ratio can deviate.

The progression from ordinary norms to trust regions is:

Δθ22=ΔθΔθΔθFΔθδclip(rt,  1ϵ,  1+ϵ)\|\Delta\theta\|_2^2 = \Delta\theta^\top \Delta\theta \quad\longrightarrow\quad \Delta\theta^\top F\,\Delta\theta \leq \delta \quad\longrightarrow\quad \text{clip}(r_t,\; 1-\epsilon,\; 1+\epsilon)

Each step has the same goal: limit the magnitude of policy change. They differ in the tradeoff between precision and computational cost.

Common Pitfall

TRPO's trust-region constraint does not limit Euclidean distance in parameter space. It limits the effect of parameter changes on the policy distribution. The same Δθ2=0.1\|\Delta\theta\|_2 = 0.1 can produce completely different policy changes in different directions.


A Three-Level View

Putting the three tools together, they answer the same core question from three levels: how do we keep training stable?

LevelProblemToolFormula
Long termIteration convergenceEigenvalues, spectral radiusrho(gamma P) <= gamma < 1
Short termSingle-step magnitudeL2 norm, gradient clippinggclipped2c\lVert\boldsymbol{g}_{clipped}\rVert_2 \leq c
Fine-grainedDirection sensitivityWeighted norm, trust regionDelta theta^T F Delta theta <= delta

The progression is: eigenvalues guarantee convergence from the long-term perspective, norms limit single-step update size from the short-term perspective, and weighted norms distinguish risk across directions from the fine-grained perspective. The constraints become more precise, and the computational cost also rises.


E.1 Module Overview: The Logical Chain Across Four Articles

This is the last main article in module E.1. The four articles revolve around one core question: Can the Bellman equation V(s)=R(s)+γPV(s)V(s) = R(s) + \gamma\sum P V(s') actually be solved? Solving it faces three obstacles in sequence, and each obstacle introduces a new group of mathematical tools:

text
Bellman equation in Chapter 3
  |
  v  Obstacle 1: too many equations to write one by one
E.1.1 + E.1.2  vectors, matrices, linear systems
  |  -> v = r + gamma P v  writes all equations at once
  |  -> v = (I-gamma P)^-1 r  solves in one step
  |  But when there are too many states, the matrix cannot be stored
  v  Obstacle 2: too many states, matrix cannot be stored
E.1.3  dot products, norms, function approximation
  |  -> v_hat(s) = w^T x(s)  approximate with a dot product
  |  -> ||g||_2 <= c  use a norm to limit updates
  |  But with approximation plus iteration, is training stable?
  v  Obstacle 3: training stability under approximation and iteration
E.1.4  eigenvalues, weighted norms, trust regions
     -> three levels: convergence guarantee + gradient clipping + trust region

Mapped to the core problems in RL:

QuestionMathematical toolRL meaningArticle
How do we represent states and values?Vectors, matricesBasis of tabular methodsE.1.1
How do we compress Bellman equations?Linear systemsTheoretical basis of policy evaluationE.1.2
How do we approximate values that cannot be tabled?Dot products, normsFunction approximation and neural networksE.1.3
Why can training be stable?Eigenvalues, contraction mappingsConvergence guaranteeE.1.4
How can policies be updated safely?Weighted norms, trust regionsMathematical basis of TRPO/PPOE.1.4

Next: E.1.5 Formula Review and Exercises revisits the RL concepts from Chapter 3 through the lens of linear algebra.

现代强化学习实战课程