
E.3.1 Calculus Basics: Derivatives, Gradients, and the Chain Rule

Prerequisite: This page does not require prior calculus knowledge, but you should first read the "two-state running example" in the appendix introduction and E.1.1 Vectors and Matrices.


Functions, Objectives, and Rates of Change

The central task in reinforcement learning is optimization: adjusting policy parameters so that the average return becomes higher and higher. The vectors, matrices, and Bellman equations introduced in earlier appendix pages tell us "how good the current policy is", but they do not tell us "how to make it better". To answer that question, we need to know how much the objective function changes when the parameters change a little. This is exactly the problem calculus solves.

Imagine turning the knob on a radio to find a station. At different knob positions, the received signal has different clarity. Calculus studies a similar question: when one quantity changes, how does another quantity change with it? In reinforcement learning, the knob is the policy parameter $\theta$, and the signal clarity is the objective function $J(\theta)$.

An objective function can be understood as an input-output relationship:

$$\theta \longmapsto J(\theta).$$

If $J(\theta)$ represents average return, then training means searching for parameter values that make $J(\theta)$ larger. Derivatives and gradients answer a very plain question: standing here right now, which direction makes the objective rise fastest?

Once this point is clear, calculus no longer looks like a collection of isolated formulas. It becomes the basic language that optimization algorithms rely on.


Derivatives Through a One-Dimensional Function

The phrase "which direction to move" needs a more precise tool. That tool is the derivative. We first build intuition with a simple function that has only one parameter.

$$J(\theta)=-(\theta-0.8)^2+1.$$

You can interpret $\theta$ as a highly simplified policy parameter. For example, $\theta=0.2$ means the probability of choosing right is small, while $\theta=0.8$ means the probability of choosing right is larger.

This function reaches its maximum at $\theta=0.8$. Without drawing the graph, look at a few values:

| $\theta$ | $J(\theta)$ |
| --- | --- |
| $0.2$ | $0.64$ |
| $0.5$ | $0.91$ |
| $0.8$ | $1.00$ |
| $1.0$ | $0.96$ |

From $0.2$ to $0.5$, the objective increases. From $0.8$ to $1.0$, the objective decreases. A derivative describes this phenomenon: near the current position, how fast and in which direction the function value changes as the parameter changes.

Taking the derivative of this function gives $J'(\theta)$, read as "J prime of theta". The prime mark denotes a derivative:

$$J'(\theta)=-2(\theta-0.8).$$
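For readers who have not seen derivatives before, this formula can be recovered from a difference quotient: change $\theta$ by a small amount and divide the change in $J$ by that amount. The following is a brief sketch, where the small step $h$ is a symbol introduced here purely for illustration:

$$
\begin{aligned}
\frac{J(\theta+h)-J(\theta)}{h}
&=\frac{-(\theta+h-0.8)^2+(\theta-0.8)^2}{h}\\
&=\frac{-2h(\theta-0.8)-h^2}{h}\\
&=-2(\theta-0.8)-h.
\end{aligned}
$$

As $h$ shrinks toward $0$, the leftover term $-h$ vanishes, and what remains is $J'(\theta)=-2(\theta-0.8)$.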

At $\theta=0.2$:

$$J'(0.2)=1.2.$$

The derivative is positive, which means moving to the right increases the objective. At $\theta=1.0$:

$$J'(1.0)=-0.4.$$

The derivative is negative, which means moving to the left increases the objective.
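One way to build trust in a derivative formula is to compare it with a numerical estimate. The sketch below does this for the one-parameter objective above. The helper name `numerical_derivative` and the step size `h` are illustrative choices, not part of the original material.

```python
def J(theta):
    """Objective from the text: J(theta) = -(theta - 0.8)^2 + 1."""
    return -(theta - 0.8) ** 2 + 1


def dJ(theta):
    """Analytic derivative from the text: J'(theta) = -2 (theta - 0.8)."""
    return -2 * (theta - 0.8)


def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x); h is an illustrative step size."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Compare the formula with the numerical estimate at a few points.
for theta in (0.2, 0.8, 1.0):
    print(theta, round(dJ(theta), 6), round(numerical_derivative(J, theta), 6))
```

The two columns should agree up to floating-point error: positive to the left of $0.8$, roughly zero at $0.8$, and negative to the right.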


Gradient Ascent and Gradient Descent

Once we know the direction, the natural next step is to move along it. If the goal is to maximize return, we use gradient ascent:

$$\theta \leftarrow \theta + \alpha J'(\theta).$$

Start from $\theta=0.2$ and choose learning rate $\alpha=0.1$:

$$\theta \leftarrow 0.2 + 0.1\times1.2 = 0.32.$$

Compute the derivative again:

$$J'(0.32)=-2(0.32-0.8)=0.96.$$

Continue the update:

$$\theta \leftarrow 0.32 + 0.1\times0.96 = 0.416.$$

The parameter moves step by step toward $0.8$. Each step moves a small distance in the direction indicated by the derivative. This is the core logic of gradient ascent.
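The hand-computed updates above can be repeated in a loop. This is a minimal sketch that assumes the same objective, starting point, and learning rate as the text. The function name `gradient_ascent` and the choice of 30 steps are illustrative.

```python
def dJ(theta):
    """Derivative of J(theta) = -(theta - 0.8)^2 + 1."""
    return -2 * (theta - 0.8)


def gradient_ascent(theta, alpha, steps):
    """Repeat theta <- theta + alpha * J'(theta) and record every value."""
    history = [theta]
    for _ in range(steps):
        theta = theta + alpha * dJ(theta)
        history.append(theta)
    return history


history = gradient_ascent(theta=0.2, alpha=0.1, steps=30)
print([round(t, 4) for t in history[:3]])  # starting point and first two updates
print(round(history[-1], 4))               # value after 30 updates
```

The first two updates reproduce $0.32$ and $0.416$ from the text, and later values keep approaching $0.8$ with ever smaller steps, because the derivative shrinks as the parameter nears the maximum.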

Conversely, if the goal is to minimize a loss function $L(\theta)$, we change the plus sign to a minus sign. This is gradient descent:

$$\theta \leftarrow \theta - \alpha L'(\theta).$$

The phrase "backpropagation plus optimizer" in deep learning is essentially this loop repeated over and over: compute gradients, then update parameters.


From Derivatives to Gradients: What If There Is More Than One Parameter?

One-dimensional derivatives are easy to understand, but real models often have hundreds of thousands or even hundreds of millions of parameters. Fortunately, the idea generalizes directly. Suppose the objective function has two parameters:

$$J(\theta_1,\theta_2)=-(\theta_1-1)^2-(\theta_2-2)^2+5.$$

It is maximized at $(1,2)$. If we take the derivative with respect to each parameter and arrange the results into a vector, we get the gradient. The symbol $\nabla$, read as "nabla", means "collect all partial derivatives together":

$$\nabla J(\theta_1,\theta_2)= \begin{bmatrix} -2(\theta_1-1) \\ -2(\theta_2-2) \end{bmatrix}.$$

If the current parameters are $(0,0)$, the gradient is

$$\nabla J(0,0)= \begin{bmatrix} 2 \\ 4 \end{bmatrix}.$$

This means that $\theta_1$ should increase and $\theta_2$ should also increase, with $\theta_2$ increasing more strongly. The gradient is a kind of steering wheel: it tells every parameter where to move at the same time. With learning rate $0.1$, one gradient-ascent update gives

$$\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \leftarrow \begin{bmatrix} 0 \\ 0 \end{bmatrix} +0.1 \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.4 \end{bmatrix}.$$

The concept of a gradient is not limited to two parameters. A neural network may have thousands or millions of parameters; its gradient is a vector of the same dimension, telling each parameter how it should move.
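The two-parameter update can be written in a few lines of code. The sketch below assumes the same objective as above and represents the parameter vector as a plain Python list. The names `grad_J` and `ascent_step`, and the choice of 50 iterations, are illustrative.

```python
def grad_J(theta):
    """Gradient of J(t1, t2) = -(t1 - 1)^2 - (t2 - 2)^2 + 5 as a list."""
    t1, t2 = theta
    return [-2 * (t1 - 1), -2 * (t2 - 2)]


def ascent_step(theta, alpha):
    """One update theta <- theta + alpha * grad J(theta), component-wise."""
    return [t + alpha * g for t, g in zip(theta, grad_J(theta))]


theta = [0.0, 0.0]
print(grad_J(theta))                    # gradient at (0, 0)
print(ascent_step(theta, alpha=0.1))    # the single update from the text

# Repeating the same step moves both parameters toward the maximum at (1, 2).
for _ in range(50):
    theta = ascent_step(theta, alpha=0.1)
print([round(t, 3) for t in theta])
```

Nothing in `ascent_step` depends on there being exactly two parameters, which is why the same update rule carries over to models with far more of them.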


The Chain Rule: How Signals Pass Between Layers

So far, we have discussed "taking derivatives directly with respect to parameters". But neural networks are compositions of many functions. The input is transformed by the first layer, the result is passed to the second layer, and so on until the loss is produced. The chain rule is the tool for handling this composite structure. It tells us how changes in an outer function pass step by step through intermediate variables to inner parameters.

Consider a concrete example:

$$y = 3\theta, \qquad L=(y-6)^2.$$

If $\theta=1$, then $y=3$, and the loss is

$$L=(3-6)^2=9.$$

We want to know how $L$ changes when $\theta$ changes a little. Of course, we can substitute $y=3\theta$ into $L$ directly and then differentiate:

$$L=(3\theta-6)^2.$$

Differentiating gives

$$\frac{dL}{d\theta}=2(3\theta-6)\times3.$$

At $\theta=1$:

$$\frac{dL}{d\theta}=2(3-6)\times3=-18.$$

Here $2(3\theta-6)$ is the derivative of the loss with respect to $y$, written $\frac{dL}{dy}$, and $3$ is the derivative of $y$ with respect to $\theta$, written $\frac{dy}{d\theta}$. The chain rule multiplies them:

$$\frac{dL}{d\theta}=\frac{dL}{dy}\cdot\frac{dy}{d\theta}.$$

In words: "$\theta$ affects $y$, and $y$ affects $L$, so the effect of $\theta$ on $L$ is the product of these two effects." Backpropagation is the efficient large-scale version of this process in neural networks. It starts from the loss and moves backward along the computation graph, applying the chain rule layer by layer.
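The product of local derivatives can be checked numerically. The following sketch computes $\frac{dL}{d\theta}$ for the example above in two ways: by multiplying the two local derivatives, and by a finite-difference estimate. The function names and the step size `h` are illustrative assumptions.

```python
def forward(theta):
    """Forward pass: theta -> y = 3 * theta -> L = (y - 6)^2."""
    y = 3 * theta
    loss = (y - 6) ** 2
    return y, loss


def chain_rule_gradient(theta):
    """dL/dtheta as the product of the two local derivatives."""
    y, _ = forward(theta)
    dL_dy = 2 * (y - 6)   # derivative of the outer function with respect to y
    dy_dtheta = 3         # derivative of the inner function with respect to theta
    return dL_dy * dy_dtheta


def finite_difference(theta, h=1e-6):
    """Numerical check of dL/dtheta; h is an illustrative step size."""
    return (forward(theta + h)[1] - forward(theta - h)[1]) / (2 * h)


print(chain_rule_gradient(1.0))
print(round(finite_difference(1.0), 4))
```

Both lines should report a value of about $-18$, matching the hand calculation at $\theta=1$.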


Partial Derivatives: Move One Knob and Hold the Others Fixed

So far, the derivative notation we used was $d$, which corresponds to functions with one variable. When a function has multiple variables, we need a new symbol, $\partial$, read as "partial". It means "differentiate with respect to one variable while treating the other variables as constants". This is a partial derivative. Use the same two-parameter objective function:

$$J(\theta_1,\theta_2)=-(\theta_1-1)^2-(\theta_2-2)^2+5.$$

It has two parameters. We can ask two separate questions: if $\theta_2$ is held fixed, how does $\theta_1$ affect the objective? Conversely, if $\theta_1$ is held fixed, how does $\theta_2$ affect it? This is what partial derivatives do.

The partial derivative with respect to $\theta_1$ is

$$\frac{\partial J}{\partial \theta_1}=-2(\theta_1-1).$$

The partial derivative with respect to $\theta_2$ is

$$\frac{\partial J}{\partial \theta_2}=-2(\theta_2-2).$$

Arranging all partial derivatives into a vector gives the gradient:

$$\nabla_\theta J= \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \end{bmatrix}.$$

At $(\theta_1, \theta_2)=(0,0)$:

$$\nabla_\theta J(0,0)= \begin{bmatrix} 2 \\ 4 \end{bmatrix}.$$

This means both parameters should increase, and the ascent direction is stronger for the second parameter. Gradients in neural networks work the same way; only the number of parameters changes from $2$ to millions or more.
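The idea of "move one knob and hold the others fixed" translates almost literally into code. The sketch below estimates each partial derivative by perturbing a single coordinate. The helpers `partial_derivative` and `numerical_gradient` and the step size `h` are hypothetical names introduced for illustration.

```python
def J(theta):
    """Two-parameter objective from the text."""
    t1, t2 = theta
    return -(t1 - 1) ** 2 - (t2 - 2) ** 2 + 5


def partial_derivative(f, theta, index, h=1e-6):
    """Perturb only theta[index]; every other coordinate stays fixed."""
    up = list(theta)
    down = list(theta)
    up[index] += h
    down[index] -= h
    return (f(up) - f(down)) / (2 * h)


def numerical_gradient(f, theta):
    """Stack the partial derivatives into one vector (a list here)."""
    return [partial_derivative(f, theta, i) for i in range(len(theta))]


print([round(g, 4) for g in numerical_gradient(J, [0.0, 0.0])])
```

The result should be close to the analytic gradient at $(0,0)$, namely $2$ and $4$. Estimating gradients this way costs two function evaluations per parameter, which is one reason large models rely on backpropagation instead.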


Learning Rate: How Far Each Step Goes

The gradient tells us the direction, but the distance to move is a separate issue. The learning rate $\alpha$ is the knob that controls step size.

If the current parameter is

$$\theta= \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \nabla J= \begin{bmatrix} 2 \\ 4 \end{bmatrix},$$

and the learning rate is $\alpha=0.1$, one gradient-ascent update is

$$\theta\leftarrow \begin{bmatrix} 0 \\ 0 \end{bmatrix} +0.1 \begin{bmatrix} 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.4 \end{bmatrix}.$$

If the learning rate is too small, training is very slow because each step moves only a little. If it is too large, a step can overshoot the optimum, so the parameters oscillate around it or even diverge. Gradient estimates in reinforcement learning are already noisy, so the choice of learning rate is especially sensitive. Techniques such as gradient clipping, the Adam optimizer, and PPO clipping can all be viewed, at least in part, as ways to keep the effective step size under control.

Common Pitfall

The learning rate is not "the larger, the faster the learning". An overly large learning rate can make parameters oscillate near the optimum or even diverge. In practice, small values such as $10^{-3}$ and $3\times10^{-4}$ are common, often combined with learning-rate decay or adaptive optimizers.
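The effect of step size is easy to see on the one-parameter objective $J(\theta)=-(\theta-0.8)^2+1$ used earlier. For this particular quadratic, each gradient-ascent update multiplies the distance to $0.8$ by $(1-2\alpha)$, so the behavior depends directly on $\alpha$. The sketch below compares a few learning rates. The specific values and the 20-step budget are illustrative choices.

```python
def dJ(theta):
    """Derivative of J(theta) = -(theta - 0.8)^2 + 1."""
    return -2 * (theta - 0.8)


def run(alpha, steps=20, theta=0.2):
    """Apply gradient ascent for a fixed number of steps and return theta."""
    for _ in range(steps):
        theta = theta + alpha * dJ(theta)
    return theta


# Print the remaining distance to the optimum for each learning rate.
for alpha in (0.01, 0.1, 0.9, 1.1):
    final = run(alpha)
    print(alpha, round(abs(final - 0.8), 4))
```

With $\alpha=0.01$ the parameter is still far from $0.8$ after 20 steps. With $\alpha=0.1$ it is close. With $\alpha=0.9$ it also ends up close, but only after jumping back and forth across the optimum. With $\alpha=1.1$ the distance grows at every step, which is divergence.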


From Mathematical Formulas to Backpropagation in Code

The previous sections derived derivatives, gradients, and the chain rule on paper. In actual code, we do not hand-calculate the partial derivative of every parameter. Instead, an automatic differentiation system does it for us. Automatic differentiation relies precisely on the chain rule.

For example, consider the simple computation graph

$$\theta \to y=3\theta \to L=(y-6)^2.$$

Backpropagation starts from the final loss, first computes $\frac{dL}{dy}$, then multiplies by $\frac{dy}{d\theta}$ to obtain $\frac{dL}{d\theta}$.

The same principle applies to deep networks. Each layer only needs to know its local derivative, and the gradient of the whole chain can be passed back through the chain rule. Training policy networks, value networks, and reward models all depend on this mechanism.
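To make "each layer only needs to know its local derivative" concrete, here is a toy reverse-mode sketch in plain Python. It is not how any particular framework is implemented. The `Value` class and its methods are hypothetical helpers written for this example, and real projects would normally use an automatic differentiation library instead.

```python
class Value:
    """Scalar node that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __mul__(self, const):
        # Multiplication by a plain number: d(c * x)/dx = c
        return Value(self.data * const, (self,), (const,))

    def __sub__(self, const):
        # Subtraction of a plain number: d(x - c)/dx = 1
        return Value(self.data - const, (self,), (1.0,))

    def square(self):
        # d(x^2)/dx = 2x
        return Value(self.data ** 2, (self,), (2 * self.data,))

    def backward(self):
        """Walk from the loss toward the inputs, multiplying local derivatives."""
        self.grad = 1.0
        stack = [(self, 1.0)]
        while stack:
            node, upstream = stack.pop()
            for parent, local in zip(node.parents, node.local_grads):
                contribution = upstream * local   # chain rule along this edge
                parent.grad += contribution
                stack.append((parent, contribution))


theta = Value(1.0)
y = theta * 3               # y = 3 * theta
loss = (y - 6).square()     # L = (y - 6)^2
loss.backward()
print(loss.data, y.grad, theta.grad)
```

After `loss.backward()`, `y.grad` holds $\frac{dL}{dy}$ and `theta.grad` holds $\frac{dL}{d\theta}$, matching the values $-6$ and $-18$ implied by the chain-rule example. The traversal here sums contributions path by path, which is simple but inefficient when nodes are shared; practical systems process nodes in reverse topological order instead.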


Summary

This page introduced five core calculus tools:

| Concept | What it does | Role in RL |
| --- | --- | --- |
| Derivative | Describes the direction and speed of change at one point | Decides how a parameter should be adjusted |
| Gradient | Vector of all partial derivatives in a multi-parameter function | Tells every parameter where to move at once |
| Chain rule | Differentiates composite functions | The theoretical foundation of backpropagation |
| Partial derivative | Differentiates with respect to one variable while treating the rest as constants | Computes gradients parameter by parameter in neural networks |
| Learning rate | Controls the step size of each parameter update | Balances training speed and stability |

Together, these tools form the complete loop of "compute a gradient, then take one step". The next page applies these mathematical tools directly to policy optimization and derives policy gradients from the intuition that the probabilities of good actions should increase.

Next: E.3.2 Policy Gradients and Advantage Functions, which applies gradients to policy optimization.

Modern Reinforcement Learning in Practice (course)