E.3.4 Policy Gradient, Taylor, and GRPO Derivations
Prerequisite: E.3.2 Policy Gradients and Advantage Functions. You need to know the basic form of the policy gradient.
How the Log-Derivative Trick Leads to Policy Gradients
So far, we have been using the conclusion of policy gradients. Now let us see how it is derived. Directly taking the gradient of the policy probability $\pi_\theta(a \mid s)$ is often inconvenient: $\pi_\theta$ may be produced by complicated functions such as softmax, making the gradient expression cumbersome. The log-derivative trick converts the difficult $\nabla_\theta \pi_\theta(a \mid s)$ into the easier $\nabla_\theta \log \pi_\theta(a \mid s)$, so the gradient can be estimated by sampling without knowing the environment transition probabilities. The most common form of the policy gradient is derived from this trick:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$$
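Here $\mathbb{E}_{\pi_\theta}$ means "the expectation when sampling according to policy $\pi_\theta$". In other words, it is a weighted average over all possible state-action pairs, where the weights are the probabilities with which the policy chooses each action. The key to deriving this formula is a simple identity, which is just the chain rule applied to the logarithm:

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}$$

Multiplying both sides by $\pi_\theta(a \mid s)$ gives an equivalent but more useful form:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$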
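The benefit of this transformation is that directly differentiating $\pi_\theta(a \mid s)$ is often hard, while the gradient of $\log \pi_\theta(a \mid s)$ is usually much simpler. Next, substitute this trick into the objective function. In a discrete action space, the objective for a single state $s$ can be written as

$$J(\theta) = \sum_a \pi_\theta(a \mid s)\, Q^{\pi}(s, a)$$

When taking the gradient with respect to the parameters in this single-state view, $Q^{\pi}(s, a)$ is treated as not depending on $\theta$; only $\pi_\theta(a \mid s)$ contains $\theta$:

$$\nabla_\theta J(\theta) = \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)$$

Substitute the log-derivative trick and replace $\nabla_\theta \pi_\theta(a \mid s)$:

$$\nabla_\theta J(\theta) = \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)$$

Look carefully at this summation. Every term contains $\pi_\theta(a \mid s)$ as a weight. This is exactly a weighted average when sampling according to the policy, that is, an expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$$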
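The expression above considers only one state. If we take a weighted average over all states, where $d^{\pi}(s)$ is the frequency of visiting state $s$ under policy $\pi_\theta$, we obtain the full policy gradient theorem:

$$\nabla_\theta J(\theta) = \sum_s d^{\pi}(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) = \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$$

In practical algorithms, $Q^{\pi}(s, a)$ is difficult to know exactly, so it is commonly replaced by a sampled cumulative return $G_t$ or an advantage estimate $\hat{A}_t$:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$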
This is the shared gradient structure behind algorithms such as REINFORCE, Actor-Critic, and PPO.
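The derivation above can be checked numerically. The sketch below is only an illustration under stated assumptions: a single state, three actions, a softmax policy whose parameters are the logits themselves, and made-up values for the logits and for $Q$. It compares a finite-difference gradient of $J(\theta)$ with the score-function form $\sum_a \pi_\theta(a)\, \nabla_\theta \log \pi_\theta(a)\, Q(a)$ and with a sampled estimate of the same expectation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(logits, action):
    """For a softmax over logits: grad log pi(action) = one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def objective(logits, q_values):
    """J(theta) = sum_a pi_theta(a) * Q(a) for a single state."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))


def finite_difference_gradient(logits, q_values, h=1e-6):
    """Central finite differences of J, used only as a reference value."""
    grad = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += h
        down[k] -= h
        grad.append((objective(up, q_values) - objective(down, q_values)) / (2 * h))
    return grad


def score_function_gradient(logits, q_values):
    """Exact expectation: sum_a pi(a) * grad log pi(a) * Q(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p, q) in enumerate(zip(probs, q_values)):
        g = grad_log_pi(logits, a)
        for k in range(len(logits)):
            grad[k] += p * g[k] * q
    return grad


def sampled_gradient(logits, q_values, num_samples=20000, seed=0):
    """Monte Carlo estimate: average grad log pi(a) * Q(a) over sampled actions."""
    rng = random.Random(seed)
    probs = softmax(logits)
    actions = list(range(len(logits)))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]
        g = grad_log_pi(logits, a)
        for k in range(len(logits)):
            grad[k] += g[k] * q_values[a] / num_samples
    return grad


# Hypothetical values, chosen only for illustration.
logits = [0.5, -0.2, 0.1]
q_values = [1.0, 0.0, 2.0]

reference = finite_difference_gradient(logits, q_values)
exact = score_function_gradient(logits, q_values)
estimate = sampled_gradient(logits, q_values)
```

The first two results should agree up to finite-difference error, while the sampled estimate fluctuates around them. That fluctuation is the variance that return baselines and advantage estimates are meant to reduce.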
Taylor Expansion, the Hessian, and PPO's Second-Order Intuition
A gradient-based update looks only at the first derivative, the slope at the current position, and then takes one step along that slope. But if the step is too large, the first-order approximation becomes unreliable: you may think you are still climbing uphill, while in fact you have already passed the summit and started descending. Taylor expansion is the tool for analyzing how large a step can be before the first-order approximation stops being trustworthy. A first-order expansion looks only at slope; a second-order expansion also considers curvature, meaning how the function bends and in which direction. This is the mathematical background for the trust-region idea behind PPO and TRPO.
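Consider a one-variable example. Let $f$ be a smooth function, $x$ the current point, and $\Delta x$ a small step.
The true value is

$$f(x + \Delta x)$$

The first-order approximation is

$$f(x + \Delta x) \approx f(x) + f'(x)\, \Delta x$$

For a small step it is already close, with an error on the order of $\Delta x^2$. A second-order Taylor expansion adds a curvature correction term:

$$f(x + \Delta x) \approx f(x) + f'(x)\, \Delta x + \tfrac{1}{2} f''(x)\, \Delta x^2$$

For $f(x) = x^2$, $f''(x) = 2$, so

$$(x + \Delta x)^2 = x^2 + 2x\, \Delta x + \Delta x^2$$

and the second-order expansion recovers the true value exactly.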
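In the multivariable case, the $f''(x)$ in the second-order term becomes the Hessian matrix $H$, which records how the function curves in each direction:

$$f(\theta + \Delta\theta) \approx f(\theta) + \nabla f(\theta)^\top \Delta\theta + \tfrac{1}{2}\, \Delta\theta^\top H\, \Delta\theta$$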
The trust-region idea behind PPO and TRPO is exactly the concern that when parameter updates are too large, first-order approximations are no longer reliable and the second-order curvature term begins to matter. If we still take large steps based only on first-order information, we may damage the policy.
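To make the step-size concern concrete, here is a small sketch that compares first-order and second-order approximation errors as the step grows. The test function $f(x) = e^x$ at $x = 0$ and the step sizes are assumptions chosen for illustration; the helper `quadratic_model` is a hypothetical name for the multivariable second-order form with a Hessian.

```python
import math


def first_order(f0, grad, step):
    """First-order Taylor approximation: f(x) + f'(x) * step."""
    return f0 + grad * step


def second_order(f0, grad, curvature, step):
    """Second-order approximation: adds 0.5 * f''(x) * step^2."""
    return f0 + grad * step + 0.5 * curvature * step ** 2


def compare_at(x, steps):
    """Approximation errors for f(x) = exp(x), where f = f' = f'' = exp(x)."""
    f0 = math.exp(x)
    rows = []
    for step in steps:
        true_value = math.exp(x + step)
        err1 = abs(true_value - first_order(f0, f0, step))
        err2 = abs(true_value - second_order(f0, f0, f0, step))
        rows.append((step, err1, err2))
    return rows


def quadratic_model(f0, grad, hessian, delta):
    """Multivariable form: f0 + grad^T delta + 0.5 * delta^T H delta."""
    linear = sum(g * d for g, d in zip(grad, delta))
    h_delta = [sum(h * d for h, d in zip(row, delta)) for row in hessian]
    quadratic = sum(d * hd for d, hd in zip(delta, h_delta))
    return f0 + linear + 0.5 * quadratic


# Illustrative step sizes: small, moderate, large.
rows = compare_at(0.0, [0.01, 0.1, 1.0])
for step, err1, err2 in rows:
    print(f"step={step:<5} first-order error={err1:.2e} second-order error={err2:.2e}")

# f(t1, t2) = t1^2 + t1 * t2 at (1, 2): value 3, gradient (4, 1), Hessian [[2, 1], [1, 0]].
model_value = quadratic_model(3.0, [4.0, 1.0], [[2.0, 1.0], [1.0, 0.0]], [0.5, -1.0])
```

The first-order error grows roughly with the square of the step, which is why a step that is safe at one scale can be badly wrong at ten times that scale. For the quadratic example the second-order model is exact, so `model_value` matches $f(1.5, 1)$.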
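For the PPO probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

expand around $\theta_{\text{old}}$, writing $\Delta\theta = \theta - \theta_{\text{old}}$ and $H_r$ for the Hessian of $r_t$ at $\theta_{\text{old}}$:

$$r_t(\theta) \approx 1 + \nabla_\theta r_t(\theta_{\text{old}})^\top \Delta\theta + \tfrac{1}{2}\, \Delta\theta^\top H_r\, \Delta\theta$$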
The three terms mean:
| Term | Meaning |
|---|---|
| $1$ | When the new and old policies match, the ratio is $1$ |
| First-order term | Linear change caused by a small update |
| Second-order term | Extra change from curvature after the step grows |
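The gap between the true ratio and its first-order term can be observed directly. The following sketch assumes a softmax policy parameterized by its logits, with made-up logits and an arbitrary update direction; it uses the fact that the gradient of the ratio at the old parameters equals $\nabla_\theta \log \pi_{\theta_{\text{old}}}(a)$, which for a softmax over logits is the one-hot vector of the action minus the probabilities.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ratio(new_logits, old_logits, action):
    """Exact probability ratio pi_new(action) / pi_old(action)."""
    return softmax(new_logits)[action] / softmax(old_logits)[action]


def first_order_ratio(old_logits, action, delta):
    """1 + grad r(theta_old)^T delta, with grad r = one_hot(action) - pi_old."""
    probs = softmax(old_logits)
    grad = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return 1.0 + sum(g * d for g, d in zip(grad, delta))


def approximation_gap(old_logits, action, direction, scale):
    """Absolute difference between the exact ratio and its linearization."""
    delta = [scale * d for d in direction]
    new_logits = [z + d for z, d in zip(old_logits, delta)]
    exact = ratio(new_logits, old_logits, action)
    return abs(exact - first_order_ratio(old_logits, action, delta))


# Hypothetical old policy and update direction, for illustration only.
old_logits = [0.2, -0.1, 0.4]
direction = [1.0, -0.5, 0.0]
action = 0

gaps = {s: approximation_gap(old_logits, action, direction, s) for s in (0.0, 0.01, 0.1, 1.0)}
for scale, gap in gaps.items():
    print(f"scale={scale:<5} |exact - linear| = {gap:.2e}")
```

With no update the ratio is exactly 1, and the gap widens as the update is scaled up, which is the contribution of the second-order and higher terms.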
Although PPO clipping does not explicitly compute the Hessian, it indirectly avoids the risk of uncontrolled higher-order terms by restricting the range of $r_t(\theta)$.
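As a rough illustration of how clipping restricts the ratio, the sketch below implements the commonly used PPO clipped surrogate $\min\left(r_t \hat{A}_t,\ \operatorname{clip}(r_t, 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t\right)$ for a single sample. The formula is the standard PPO-Clip form rather than something stated on this page, and $\epsilon = 0.2$ and the function names are assumptions for the example.

```python
import math


def probability_ratio(new_log_prob, old_log_prob):
    """r = pi_new / pi_old, computed from log-probabilities for stability."""
    return math.exp(new_log_prob - old_log_prob)


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    epsilon=0.2 is an assumed illustrative value, not taken from this page.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: pushing the ratio past 1 + epsilon earns no extra objective.
print(clipped_surrogate(1.5, 1.0))   # capped at (1 + epsilon) * advantage
# Negative advantage: pushing the ratio below 1 - epsilon earns no extra objective.
print(clipped_surrogate(0.5, -1.0))  # capped at (1 - epsilon) * advantage
# Inside the clip range the objective is the plain ratio-weighted advantage.
print(clipped_surrogate(1.1, 1.0))
```

Once the ratio leaves $[1 - \epsilon, 1 + \epsilon]$ in the direction favored by the advantage, the objective becomes flat, so the gradient no longer rewards moving further from the old policy.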
Group Normalization in GRPO
We have discussed policy gradients, PPO clipping, and Taylor expansion. All of these methods need an advantage estimate $\hat{A}_t$. Traditional methods such as PPO use a trained Critic network to estimate the advantage, but the Critic itself also has to be trained, which adds engineering complexity. The core idea of GRPO is to avoid the Critic and instead construct advantages through relative comparison among answers in the same group. Imagine a teacher grading an open-ended problem: put four students' scores together, give positive signals to answers above the group average, and negative signals to answers below the group average. There is no need for an additional "standard-score judge". Concretely, suppose the same prompt samples four answers with rewards

$$r_1,\ r_2,\ r_3,\ r_4$$

The mean is

$$\mu = \frac{1}{4} \sum_{i=1}^{4} r_i$$

The standard deviation is

$$\sigma = \operatorname{std}(r_1, r_2, r_3, r_4)$$

The standardized advantage of the fourth answer is

$$\hat{A}_4 = \frac{r_4 - \mu}{\sigma}$$

The general form for a group of $G$ answers is

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
The whole calculation has two steps:
- Subtract the mean: decide whether this answer is better or worse than the group average.
- Divide by the standard deviation: put rewards from different questions onto a comparable scale. Some questions naturally produce higher scores and some lower scores; after dividing by the standard deviation, they can be compared across questions.
GRPO can remove the traditional PPO Critic because it constructs the baseline from relative comparison within the group. It does not care about the absolute score of an answer; it cares about where that answer ranks within its group.
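The two steps of group normalization can be sketched in a few lines. The rewards below are hypothetical values chosen for illustration. Using the population standard deviation and adding a small `eps` to the denominator are both assumptions: implementations differ on the standard deviation convention, and the `eps` term only guards against a group in which every answer received the same reward.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / (std + eps).

    Assumes the population standard deviation; eps avoids division by zero
    when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four answers sampled from the same prompt.
rewards = [1.0, 2.0, 3.0, 4.0]
advantages = group_normalized_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Answers above the group mean receive positive advantages and answers below it receive negative ones, and the advantages within a group sum to approximately zero. A group with identical rewards yields all-zero advantages, so it contributes no learning signal.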
Summary
This page introduced three derivation tools:
| Tool | Core formula | Role |
|---|---|---|
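| Log-derivative trick | $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$ | Turns probability gradients into sampleable log form |
| Taylor expansion | $f(x + \Delta x) \approx f(x) + f'(x)\, \Delta x + \tfrac{1}{2} f''(x)\, \Delta x^2$ | Explains PPO trust regions and clipping through second-order intuition |
| GRPO group normalization | $\hat{A}_i = \dfrac{r_i - \operatorname{mean}(r)}{\operatorname{std}(r)}$ | Replaces the Critic with relative comparison inside a group |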
These three tools correspond to the derivation skeleton of policy gradients, the theoretical basis for limiting update size, and an alternative to using a Critic. The next page organizes them into a complete formula reference.
Next: E.3.5 Complete Optimization Formulas, a formula reference for PG, DQN, GAE, PPO, and GRPO.