E.3.1 Calculus Basics: Derivatives, Gradients, and the Chain Rule
Prerequisite: This page does not require prior calculus knowledge, but you should first read the "two-state running example" in the appendix introduction and E.1.1 Vectors and Matrices.
Functions, Objectives, and Rates of Change
The central task in reinforcement learning is optimization: adjusting policy parameters so that the average return becomes higher and higher. The vectors, matrices, and Bellman equations introduced in earlier appendix pages tell us "how good the current policy is", but they do not tell us "how to make it better". To answer that question, we need to know how much the objective function changes when the parameters change a little. This is exactly the problem calculus solves.
Imagine turning the knob on a radio to find a station. At different knob positions, the received signal has different clarity. Calculus studies a similar question: when one quantity changes, how does another quantity change with it? In reinforcement learning, the knob is the policy parameter , and the signal clarity is the objective function .
An objective function can be understood as an input-output relationship:
If represents average return, then training means searching for parameter values that make larger. Derivatives and gradients answer a very plain question: standing here right now, which direction makes the objective rise fastest?
Once this point is clear, calculus no longer looks like a collection of isolated formulas. It becomes the basic language that optimization algorithms rely on.
Derivatives Through a One-Dimensional Function
The phrase "which direction to move" needs a more precise tool. That tool is the derivative. We first build intuition with a simple function that has only one parameter.
You can interpret as a highly simplified policy parameter. For example, means the probability of choosing right is small, while means the probability of choosing right is larger.
This function reaches its maximum at . Without drawing the graph, look at a few values:
From to , the objective increases. From to , the objective decreases. A derivative describes this phenomenon: near the current position, how fast and in which direction the function value changes as the parameter changes.
Taking the derivative of this function gives , read as "J prime of theta". The prime mark denotes a derivative:
At :
The derivative is positive, which means moving to the right increases the objective. At :
The derivative is negative, which means moving to the left increases the objective.
Gradient Ascent and Gradient Descent
Once we know the direction, the natural next step is to move along it. If the goal is to maximize return, we use gradient ascent:
Start from and choose learning rate :
Compute the derivative again:
Continue the update:
The parameter moves step by step toward . Each step moves a small distance in the direction indicated by the derivative. This is the core logic of gradient ascent.
Conversely, if the goal is to minimize a loss function , we change the plus sign to a minus sign. This is gradient descent:
The phrase "backpropagation plus optimizer" in deep learning is essentially this loop repeated over and over: compute gradients, then update parameters.
From Derivatives to Gradients: What If There Is More Than One Parameter?
One-dimensional derivatives are easy to understand, but real models often have hundreds of thousands or even hundreds of millions of parameters. Fortunately, the idea generalizes directly. Suppose the objective function has two parameters:
It is maximized at . If we take the derivative with respect to each parameter and arrange the results into a vector, we get the gradient. The symbol , read as "nabla", means "collect all partial derivatives together":
If the current parameters are , the gradient is
This means that should increase and should also increase, with increasing more strongly. The gradient is a kind of steering wheel: it tells every parameter where to move at the same time. With learning rate , one gradient-ascent update gives
The concept of a gradient is not limited to two parameters. A neural network may have thousands or millions of parameters; its gradient is a vector of the same dimension, telling each parameter how it should move.
The Chain Rule: How Signals Pass Between Layers
So far, we have discussed "taking derivatives directly with respect to parameters". But neural networks are compositions of many functions. The input is transformed by the first layer, the result is passed to the second layer, and so on until the loss is produced. The chain rule is the tool for handling this composite structure. It tells us how changes in an outer function pass step by step through intermediate variables to inner parameters.
Consider a concrete example:
If , then , and the loss is
We want to know how changes when changes a little. Of course, we can substitute into directly and then differentiate:
Differentiating gives
At :
Here is the derivative of the loss with respect to , written , and is the derivative of with respect to , written . The chain rule multiplies them:
In words: " affects , and affects , so the effect of on is the product of these two effects." Backpropagation is the efficient large-scale version of this process in neural networks. It starts from the loss and moves backward along the computation graph, applying the chain rule layer by layer.
Partial Derivatives: Move One Knob and Hold the Others Fixed
So far, the derivative notation we used was , which corresponds to functions with one variable. When a function has multiple variables, we need a new symbol, , read as "partial". It means "differentiate with respect to one variable while treating the other variables as constants". This is a partial derivative. Use the same two-parameter objective function:
It has two parameters. We can ask two separate questions: if is held fixed, how does affect the objective? Conversely, if is held fixed, how does affect it? This is what partial derivatives do.
The partial derivative with respect to is
The partial derivative with respect to is
Arranging all partial derivatives into a vector gives the gradient:
At :
This means both parameters should increase, and the ascent direction is stronger for the second parameter. Gradients in neural networks work the same way; only the number of parameters changes from to millions or more.
Learning Rate: How Far Each Step Goes
The gradient tells us the direction, but the distance to move is a separate issue. The learning rate is the knob that controls step size.
If the current parameter is
and the learning rate is , one gradient-ascent update is
If the learning rate is too small, training is very slow because each step moves only a little. If the learning rate is too large, the parameters may jump over the optimum or even diverge because the step passes straight over the summit. Gradients in reinforcement learning already contain considerable noise, so the choice of learning rate is especially sensitive. Techniques such as gradient clipping, the Adam optimizer, and PPO clipping are all, in essence, ways to stabilize step size.
Common Pitfall
The learning rate is not "the larger, the faster the learning". An overly large learning rate can make parameters oscillate near the optimum or even diverge. In practice, small values such as and are common, often combined with learning-rate decay or adaptive optimizers.
From Mathematical Formulas to Backpropagation in Code
The previous sections derived derivatives, gradients, and the chain rule on paper. In actual code, we do not hand-calculate the partial derivative of every parameter. Instead, an automatic differentiation system does it for us. Automatic differentiation relies precisely on the chain rule.
For example, consider the simple computation graph
Backpropagation starts from the final loss, first computes , then multiplies by to obtain .
The same principle applies to deep networks. Each layer only needs to know its local derivative, and the gradient of the whole chain can be passed back through the chain rule. Training policy networks, value networks, and reward models all depends on this mechanism.
Summary
This page introduced five core calculus tools:
| Concept | Role | Role in RL |
|---|---|---|
| Derivative | Describes the direction and speed of change at one point | Decides how a parameter should be adjusted |
| Gradient | Vector of all partial derivatives in a multi-parameter function | Tells every parameter where to move at once |
| Chain rule | Differentiates composite functions | The theoretical foundation of backpropagation |
| Partial derivative | Differentiates one variable while treating the rest as constants | Computes gradients parameter by parameter in neural networks |
| Learning rate | Controls the step size of each parameter update | Balances training speed and stability |
Together, these tools form the complete loop of "compute a gradient, then take one step". The next page applies these mathematical tools directly to policy optimization and derives policy gradients from the intuition that the probabilities of good actions should increase.
Next: E.3.2 Policy Gradients and Advantage Functions, which applies gradients to policy optimization.