E.1.5 Formula Review and Exercises
Prerequisites: This page summarizes all formulas in module E.1. It is best read after E.1.1 through E.1.4. If this is your first pass, read the main articles first.
Revisiting Chapter 3
You now have the tools of vectors, matrices, dot products, norms, and eigenvalues. Looking back at Chapter 3, many formulas that may have seemed acceptable by definition have cleaner matrix expressions underneath.
Bellman Equation: From State-by-State Writing to One Matrix Line
In Chapter 3, the Bellman expectation equation was written as:
With matrices, such equations become one line:
The closed-form solution is . The DP methods in Chapter 3 repeatedly apply ; in essence, they iteratively approximate this closed-form solution.
TD Error: Incremental Updates and the Matrix Equation
The TD update in Chapter 3 is:
In vector language, this can be written as:
where is the one-hot vector for state , with only the -th position equal to , and is the TD Error. This and the Bellman matrix form from E.1.2 are two expressions of the same object: the former is an incremental update that changes one component at a time, while the latter is a global equation that all components satisfy simultaneously.
Tabular Q-Learning: A Special Case of One-Hot plus Dot Product
Chapter 3's Q-Learning stores one number for every state-action pair. From the linear-algebra perspective, this is linear approximation with one-hot features: , where is a one-hot vector. Tabular lookup is a special case of linear approximation.
Policy Gradient Updates: The Stage for Norms and Trust Regions
The second route in Chapter 3 defined a policy objective . When updating parameters, gradient clipping limits , and the trust-region constraint keeps training safe. These are all uses of linear algebra tools for stability.
Concept Map
The table below organizes all concepts in module E.1 by dependency. You can read down the dependency chain or use it as an index to jump to a concept.
| Concept | Core formula | Depends on | Role in RL |
|---|---|---|---|
| Scalar | r=2, gamma=0.9 | None | Rewards and hyperparameters |
| Vector | v in R^n | Scalar | Stores values of all states |
| Matrix | P in R^(n x n) | Vector | Stores transitions among all states |
| Matrix multiplication | (Pv)_i = sum_j P_ij v_j | Matrix + vector | Probability-weighted future value |
| Bellman matrix form | v = r + gamma P v | Matrix multiplication | Value recursion equation |
| Linear-system solution | v = (I - gamma P)^-1 r | Bellman matrix form | Theoretical closed-form solution |
| Dot product | w^T x = sum_i w_i x_i | Vector | Linear value or policy approximation |
| L2 norm | Vector | Gradient clipping and regularization | |
| Eigenvalue | A u = lambda u | Matrix | Analyzes matrix scaling behavior |
| Spectral radius and convergence | rho(gamma P) <= gamma < 1 | Eigenvalues | Value-iteration convergence guarantee |
| Trust-region constraint | (theta-theta_old)^T F (theta-theta_old) <= delta | Norms + eigenvalues | Safe update in TRPO |
Reading from top to bottom, each concept adds a new capability on top of the previous one. If a concept feels unclear, return to the row named in its "Depends on" column.
From Difficulties to Tools
The central thread of module E.1 is moving from "cannot compute it" to "can compute it safely."
| Stage | Difficulty | Mathematical tool | Key formula |
|---|---|---|---|
| Difficulty 1 | 1000 states = 1000 equations | Vectors, matrices, equation systems | v = (I - gamma P)^-1 r |
| Difficulty 2 | Too many states for the value table | Dot products, norms | v_hat(s) = w^T x(s), |
| Difficulty 3 | Training may diverge, explode, or drift | Eigenvalues, weighted norms | rho(gamma P) <= gamma < 1, Delta theta^T F Delta theta <= delta |
Common Pitfalls
- Treating matrices as abstract symbols. In the Bellman equation, every row of transition matrix is a probability table describing where the next state goes from the current state.
- Thinking inversion is the practical algorithm. is a theoretical closed-form solution. Real large-scale problems usually use iteration or function approximation, which is exactly the DP -> MC -> TD progression in Chapter 3.
- Ignoring vector direction. Gradients, advantage updates, and trust-region constraints care not only about numerical magnitude, but also about direction in parameter space.
- Confusing norms and regularization. A norm is a measurement tool, answering "how large." Regularization adds a norm to the training objective to constrain parameters. Gradient clipping limits the magnitude of a single update. These are three different uses.
- Thinking the L2 norm is the only norm. The L1 norm encourages sparse solutions, the Frobenius norm measures matrix size, and weighted norms, or quadratic forms, account for direction sensitivity. Different settings use different norms.
Exercises
Basic
Bellman equation by hand. If , , and , compute and by hand.
Write the matrix. Write the corresponding and for the previous problem, then verify whether equals your hand calculation.
Gradient clipping. If the gradient is and the maximum norm is , what is the clipped gradient?
Intermediate
Eigenvalue calculation. What are the eigenvalues of ? It is an upper-triangular matrix; what pattern do its eigenvalues follow?
Three-state Bellman equation. There are three states, transition matrix , reward vector , and . Write the Bellman matrix form without solving it.
Dot-product approximation. State has features , and weights are . Compute . If the true value of is , what is the error?
Challenge
Convergence-speed estimate. Suppose and the initial error is . How many Bellman updates are needed at minimum to make the error less than ? Hint: .
Trust-region check. Let , , and . Is this update inside the trust region? What about ? Explain why the second direction is more expensive.