E.1.1 Linear Algebra Basics: Vectors and Matrices
Prerequisites: This article does not require prior linear algebra, but it is helpful to first read the two-state running example in the appendix introduction.
Scalars, Sets, and Functions
A scalar is a single number. A reward is a scalar, and a discount factor is also a scalar. In RL, the following values are all scalars:
| Symbol | Meaning | Typical values |
|---|---|---|
| r | Immediate reward | 2, -1, 0.5 |
| gamma | Discount factor | 0.9, 0.99, 0.5 |
| alpha | Learning rate | 0.001, 3e-4 |
| epsilon | Exploration rate or clipping range | 0.1, 0.2 |
A scalar gives the numerical value of a reward. Another basic element of an environment is the state. The collection of all possible states forms a set.
A set lists all possible elements inside braces. For example, a small environment may have three states and two actions:
The calligraphic letters and are conventional; writing and would not change the meaning. When we discuss "the value under state ," that must come from some set .
A set defines the available states. Next, we need to assign a value to each state, which leads to the idea of a function.
A value function takes a state and returns a number:
The notation means:
- says that is a function from to . The colon gives the type declaration, and the arrow gives the domain and codomain. is the set of real numbers.
- says that input is mapped to output .
In plain language, given a state , the function returns a number . For example, means that the value of state is .
A policy function is similar. Its input is a state-action pair, and its output is a probability:
Here:
- The in denotes the Cartesian product: the set of all state-action pairs.
- means the output range is the real interval from to , because probabilities live in this range.
- means the input is a state-action pair and the output is the probability of choosing action in state .
A function requires the same input to correspond to only one output. gives one value for each state, so it satisfies this requirement. The transition probability is also a function: the input is the current state, action, and next state, and the output is the transition probability.
Scalars, sets, and functions handle one-to-one relationships: one state corresponds to one value, and one state-action pair corresponds to one probability. But reinforcement learning often needs to handle many states at once. If an environment has thousands of states, writing every separately becomes cumbersome. Putting all state values into one column and operating on them as a whole gives us a vector.
Vectors
Suppose an environment has three states and the current value estimates are:
| State | Value |
|---|---|
| s1 | 3 |
| s2 | 5 |
| s3 | 2 |
Put all numbers into one column and treat them as a single object:
The three numbers inside the brackets are the components of the vector. is written in bold to distinguish it from a single scalar. Once vectors are introduced, we can operate on "the values of all states" at the same time.
Addition. Suppose every state receives an additional reward of . This is equivalent to adding to every component:
Vector addition adds components one by one. This requires the two vectors to have the same length. Vectors with different lengths cannot be added.
Scalar multiplication. Multiply every component of a vector by the same scalar. For example, applying discount factor gives:
This corresponds to discounting future value. In the Bellman equation , is this step: scale the future value by the discount factor before adding it to the immediate reward.
A vector can represent the values of all states, but it cannot represent how states transition into one another. From , the next state may be or , and this "from which state to which state, with what probability" relationship requires a matrix.
Matrices
Consider two states and :
- Starting from , the next state is always .
- Starting from , the next state is always .
This transition relationship can be written as a matrix:
Rows correspond to "which state we start from," and columns correspond to "which state we go to next." The first row says: starting from , the probability of reaching is , and the probability of reaching is . The second row has the analogous meaning.
If the current values of the two states are:
then the first component of is the value of , and the second component is the value of . Each row of the transition matrix is a probability distribution over next states. Multiplying a row by takes a probability-weighted sum of possible next-state values:
Putting the two row results back into a vector gives : the expected next-state value from each current state.
The result matches the transition rule: from the next state is , whose future value is ; from the next state is , whose future value is .
General Case
For three states, suppose the transition relationship is:
| Current state | to s1 | to s2 | to s3 |
|---|---|---|---|
| s1 | 0.1 | 0.7 | 0.2 |
| s2 | 0.0 | 0.3 | 0.7 |
| s3 | 0.5 | 0.5 | 0.0 |
Written as a matrix:
The number of rows and columns both equal the number of states . Moving from 2 states to 3 states changes the matrix from to , but the structure is unchanged: row always represents the probabilities of going from to every possible next state. This structure holds for any number of states.
Matrix Multiplication and Probability Weighting
So far we have placed state values into a vector and state transitions into a matrix. Now let us examine the computation inside matrix multiplication and why it exactly matches probability-weighted averaging.
Let
Then
Each row of the matrix takes one dot product with the vector. Matrix-vector multiplication is many weighted sums performed together.
Probability Weighting as a Special Case
In reinforcement learning, when transition matrix multiplies value vector , row says: starting from state , average the values of all possible next states using transition probabilities as weights.
A concrete example:
Then
The first row means: starting from , there is a chance of reaching a state with value and a chance of reaching a state with value , so the expected future value is .
The key property here is that every row of the matrix is a set of probabilities whose sum is . Therefore, matrix multiplication implements exactly "probability times value" weighted averaging. The Bellman equation is essentially "immediate reward plus discounted probability-weighted future value."
Matrix multiplication is not limited to probability matrices. In neural networks, rows of weight matrices usually do not sum to , but matrix multiplication is still weighted summation. Probability weighting is only one special case of matrix multiplication.
Dimension Checks
The simplest way to judge whether a linear algebra formula is valid is to check dimensions.
With states, the value vector is:
The state transition matrix is:
Therefore the shape of is:
The result is still a value vector. Thus
has matching shapes on both sides and is meaningful.
Shape Checks in Neural Networks
If a linear Q function is written as
then and must have the same length. If , then , so the dot product is a scalar.
Shape checking is just as important in neural networks. Consider a simple two-layer network:
input state features, 128 dimensions -> hidden layer, 64 dimensions -> action logits, 2 dimensionsThe weight matrix shapes are:
| Layer | Weight matrix | Shape |
|---|---|---|
| Layer 1 | W1 | 128 x 64 |
| Layer 2 | W2 | 64 x 2 |
Forward propagation:
is , input is , so is , and hidden state is also . is , so is , exactly two action logits.
Dimension checking is an effective way to read papers and write code. Many formulas look complicated, but checking input and output dimensions often reveals whether they are reasonable.
Common Pitfall
In matrix multiplication, , so the inner dimensions must match. If code raises RuntimeError: mat1 and mat2 shapes cannot be multiplied, some tensor dimension is usually wrong.
Summary
This article established five basic objects in linear algebra:
| Object | Role in RL | Example |
|---|---|---|
| Scalar | Single reward or hyperparameter | r=2, gamma=0.9 |
| Set | Possible states and actions | |
| Function | Value function and policy function | v(s), pi(a|s) |
| Vector | All state values together | v=[3, 5, 2]^T |
| Matrix | Transitions among all states | P in R^(n x n) |
Their relationships are: scalars form vectors, vectors form matrices, and matrix-vector multiplication implements probability weighting. The next article combines these objects into a full system of equations: the matrix form of the Bellman equation .
Next: E.1.2 Matrix Form of the Bellman Equation shows how vectors, matrices, and matrix multiplication combine into the matrix form of the Bellman equation.