Skip to content

E.1.1 Linear Algebra Basics: Vectors and Matrices

Prerequisites: This article does not require prior linear algebra, but it is helpful to first read the two-state running example in the appendix introduction.


Scalars, Sets, and Functions

A scalar is a single number. A reward r=2r=2 is a scalar, and a discount factor γ=0.9\gamma=0.9 is also a scalar. In RL, the following values are all scalars:

SymbolMeaningTypical values
rImmediate reward2, -1, 0.5
gammaDiscount factor0.9, 0.99, 0.5
alphaLearning rate0.001, 3e-4
epsilonExploration rate or clipping range0.1, 0.2

A scalar gives the numerical value of a reward. Another basic element of an environment is the state. The collection of all possible states forms a set.

A set lists all possible elements inside braces. For example, a small environment may have three states and two actions:

S={s1,s2,s3},A={left,right}.\mathcal{S}=\{s_1,s_2,s_3\}, \qquad \mathcal{A}=\{\text{left},\text{right}\}.

The calligraphic letters S\mathcal{S} and A\mathcal{A} are conventional; writing SS and AA would not change the meaning. When we discuss "the value under state ss," that ss must come from some set S\mathcal{S}.

A set defines the available states. Next, we need to assign a value to each state, which leads to the idea of a function.

A value function takes a state and returns a number:

v:SR,sv(s).v:\mathcal{S}\to\mathbb{R}, \qquad s\mapsto v(s).

The notation means:

  • v:SRv:\mathcal{S}\to\mathbb{R} says that vv is a function from S\mathcal{S} to R\mathbb{R}. The colon gives the type declaration, and the arrow gives the domain and codomain. R\mathbb{R} is the set of real numbers.
  • sv(s)s\mapsto v(s) says that input ss is mapped to output v(s)v(s).

In plain language, given a state ss, the function returns a number v(s)v(s). For example, v(s1)=3v(s_1)=3 means that the value of state s1s_1 is 33.

A policy function is similar. Its input is a state-action pair, and its output is a probability:

π:S×A[0,1],(s,a)π(as).\pi:\mathcal{S}\times\mathcal{A}\to[0,1], \qquad (s,a)\mapsto\pi(a\mid s).

Here:

  • The ×\times in S×A\mathcal{S}\times\mathcal{A} denotes the Cartesian product: the set of all state-action pairs.
  • [0,1][0,1] means the output range is the real interval from 00 to 11, because probabilities live in this range.
  • (s,a)π(as)(s,a)\mapsto\pi(a\mid s) means the input is a state-action pair and the output is the probability of choosing action aa in state ss.

A function requires the same input to correspond to only one output. v(s)v(s) gives one value for each state, so it satisfies this requirement. The transition probability p(ss,a)p(s'\mid s,a) is also a function: the input is the current state, action, and next state, and the output is the transition probability.

Scalars, sets, and functions handle one-to-one relationships: one state corresponds to one value, and one state-action pair corresponds to one probability. But reinforcement learning often needs to handle many states at once. If an environment has thousands of states, writing every v(si)v(s_i) separately becomes cumbersome. Putting all state values into one column and operating on them as a whole gives us a vector.


Vectors

Suppose an environment has three states and the current value estimates are:

StateValue
s13
s25
s32

Put all numbers into one column and treat them as a single object:

v=[352].\boldsymbol{v} = \begin{bmatrix} 3 \\ 5 \\ 2 \end{bmatrix}.

The three numbers inside the brackets are the components of the vector. v\boldsymbol{v} is written in bold to distinguish it from a single scalar. Once vectors are introduced, we can operate on "the values of all states" at the same time.

Addition. Suppose every state receives an additional reward of 11. This is equivalent to adding 11 to every component:

vnew=[352]+[111]=[463].\boldsymbol{v}_{new} = \begin{bmatrix} 3 \\ 5 \\ 2 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 6 \\ 3 \end{bmatrix}.

Vector addition adds components one by one. This requires the two vectors to have the same length. Vectors with different lengths cannot be added.

Scalar multiplication. Multiply every component of a vector by the same scalar. For example, applying discount factor γ=0.5\gamma=0.5 gives:

γv=0.5×[352]=[1.52.51.0].\gamma\boldsymbol{v} = 0.5 \times \begin{bmatrix} 3 \\ 5 \\ 2 \end{bmatrix} = \begin{bmatrix} 1.5 \\ 2.5 \\ 1.0 \end{bmatrix}.

This corresponds to discounting future value. In the Bellman equation v=r+γPvv = r + \gamma P v, γv\gamma v is this step: scale the future value by the discount factor before adding it to the immediate reward.

A vector can represent the values of all states, but it cannot represent how states transition into one another. From s1s_1, the next state may be s2s_2 or s3s_3, and this "from which state to which state, with what probability" relationship requires a matrix.


Matrices

Consider two states s1s_1 and s2s_2:

  • Starting from s1s_1, the next state is always s2s_2.
  • Starting from s2s_2, the next state is always s1s_1.

This transition relationship can be written as a matrix:

P=[0110].P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.

Rows correspond to "which state we start from," and columns correspond to "which state we go to next." The first row [0,1][0, 1] says: starting from s1s_1, the probability of reaching s1s_1 is 00, and the probability of reaching s2s_2 is 11. The second row has the analogous meaning.

If the current values of the two states are:

v=[3.332.67],\boldsymbol{v} = \begin{bmatrix} 3.33 \\ 2.67 \end{bmatrix},

then the first component of v\boldsymbol{v} is the value of s1s_1, and the second component is the value of s2s_2. Each row of the transition matrix is a probability distribution over next states. Multiplying a row by v\boldsymbol{v} takes a probability-weighted sum of possible next-state values:

next-state value from s1=0×v(s1)+1×v(s2)=2.67,\text{next-state value from }s_1 = 0 \times v(s_1) + 1 \times v(s_2) = 2.67,

next-state value from s2=1×v(s1)+0×v(s2)=3.33.\text{next-state value from }s_2 = 1 \times v(s_1) + 0 \times v(s_2) = 3.33.

Putting the two row results back into a vector gives PvP\boldsymbol{v}: the expected next-state value from each current state.

Pv=[0110][3.332.67]=[2.673.33].P\boldsymbol{v} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 3.33 \\ 2.67 \end{bmatrix} = \begin{bmatrix} 2.67 \\ 3.33 \end{bmatrix}.

The result matches the transition rule: from s1s_1 the next state is s2s_2, whose future value is 2.672.67; from s2s_2 the next state is s1s_1, whose future value is 3.333.33.

General Case

For three states, suppose the transition relationship is:

Current stateto s1to s2to s3
s10.10.70.2
s20.00.30.7
s30.50.50.0

Written as a matrix:

P=[0.10.70.20.00.30.70.50.50.0].P = \begin{bmatrix} 0.1 & 0.7 & 0.2 \\ 0.0 & 0.3 & 0.7 \\ 0.5 & 0.5 & 0.0 \end{bmatrix}.

The number of rows and columns both equal the number of states nn. Moving from 2 states to 3 states changes the matrix from 2×22\times2 to 3×33\times3, but the structure is unchanged: row ii always represents the probabilities of going from sis_i to every possible next state. This structure holds for any number of states.


Matrix Multiplication and Probability Weighting

So far we have placed state values into a vector and state transitions into a matrix. Now let us examine the computation inside matrix multiplication and why it exactly matches probability-weighted averaging.

Let

A=[1234],x=[1020].A= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad x= \begin{bmatrix} 10 \\ 20 \end{bmatrix}.

Then

Ax=[1×10+2×203×10+4×20]=[50110].Ax= \begin{bmatrix} 1\times10+2\times20 \\ 3\times10+4\times20 \end{bmatrix} = \begin{bmatrix} 50 \\ 110 \end{bmatrix}.

Each row of the matrix takes one dot product with the vector. Matrix-vector multiplication is many weighted sums performed together.

Probability Weighting as a Special Case

In reinforcement learning, when transition matrix PP multiplies value vector vv, row ii says: starting from state sis_i, average the values of all possible next states using transition probabilities as weights.

A concrete example:

P=[0.70.30.20.8],v=[105].P= \begin{bmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{bmatrix}, \qquad v= \begin{bmatrix} 10 \\ 5 \end{bmatrix}.

Then

Pv=[0.7×10+0.3×50.2×10+0.8×5]=[8.56].Pv= \begin{bmatrix} 0.7\times10+0.3\times5 \\ 0.2\times10+0.8\times5 \end{bmatrix} = \begin{bmatrix} 8.5 \\ 6 \end{bmatrix}.

The first row 0.7×10+0.3×5=8.50.7\times10+0.3\times5=8.5 means: starting from s1s_1, there is a 70%70\% chance of reaching a state with value 1010 and a 30%30\% chance of reaching a state with value 55, so the expected future value is 8.58.5.

The key property here is that every row of the matrix is a set of probabilities whose sum is 11. Therefore, matrix multiplication implements exactly "probability times value" weighted averaging. The Bellman equation v=r+γPvv = r + \gamma Pv is essentially "immediate reward plus discounted probability-weighted future value."

Matrix multiplication is not limited to probability matrices. In neural networks, rows of weight matrices usually do not sum to 11, but matrix multiplication is still weighted summation. Probability weighting is only one special case of matrix multiplication.


Dimension Checks

The simplest way to judge whether a linear algebra formula is valid is to check dimensions.

With nn states, the value vector is:

vRn.v\in\mathbb{R}^n.

The state transition matrix is:

PRn×n.P\in\mathbb{R}^{n\times n}.

Therefore the shape of PvPv is:

(n×n)(n×1)=n×1.(n\times n)(n\times1)=n\times1.

The result is still a value vector. Thus

v=r+γPvv=r+\gamma Pv

has matching shapes on both sides and is meaningful.

Shape Checks in Neural Networks

If a linear Q function is written as

Q(s,a)=wϕ(s,a),Q(s,a)=w^\top\phi(s,a),

then ww and ϕ(s,a)\phi(s,a) must have the same length. If wRdw\in\mathbb{R}^d, then ϕ(s,a)Rd\phi(s,a)\in\mathbb{R}^d, so the dot product is a scalar.

Shape checking is just as important in neural networks. Consider a simple two-layer network:

text
input state features, 128 dimensions -> hidden layer, 64 dimensions -> action logits, 2 dimensions

The weight matrix shapes are:

LayerWeight matrixShape
Layer 1W1128 x 64
Layer 2W264 x 2

Forward propagation:

h=σ(W1x),z=W2h.h = \sigma(W_1^\top x), \qquad z = W_2^\top h.

W1W_1 is 128×64128\times64, input xx is 128×1128\times1, so W1xW_1^\top x is 64×164\times1, and hidden state hh is also 64×164\times1. W2W_2 is 64×264\times2, so W2hW_2^\top h is 2×12\times1, exactly two action logits.

Dimension checking is an effective way to read papers and write code. Many formulas look complicated, but checking input and output dimensions often reveals whether they are reasonable.

Common Pitfall

In matrix multiplication, (A×B)(B×C)=A×C(A \times B)(B \times C) = A \times C, so the inner dimensions must match. If code raises RuntimeError: mat1 and mat2 shapes cannot be multiplied, some tensor dimension is usually wrong.


Summary

This article established five basic objects in linear algebra:

ObjectRole in RLExample
ScalarSingle reward or hyperparameterr=2, gamma=0.9
SetPossible states and actionsS=s1,s2S={s_1, s_2}
FunctionValue function and policy functionv(s), pi(a|s)
VectorAll state values togetherv=[3, 5, 2]^T
MatrixTransitions among all statesP in R^(n x n)

Their relationships are: scalars form vectors, vectors form matrices, and matrix-vector multiplication implements probability weighting. The next article combines these objects into a full system of equations: the matrix form of the Bellman equation v=r+γPvv = r + \gamma Pv.

Next: E.1.2 Matrix Form of the Bellman Equation shows how vectors, matrices, and matrix multiplication combine into the matrix form of the Bellman equation.

现代强化学习实战课程