E.3.5 Complete Formulas for PG, DQN, GAE, PPO, and GRPO
Prerequisite: This page summarizes all formulas in module E.3. It is best reviewed after reading E.3.1 through E.3.4.
On the previous pages, we derived the single-sample form of the policy gradient, the log-derivative trick, PPO clipping, and GRPO group normalization. This page organizes those results into complete formulas and adds the DQN loss function and the derivation of GAE. You can treat this page as a quick reference and return to it whenever an unfamiliar symbol appears.
Policy Gradient Theorem
We have already seen the single-sample form:

$$
\nabla_\theta J(\theta) \approx G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$
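The complete policy gradient theorem is written as

$$
\nabla_\theta J(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s)
$$

Each symbol means:
- $d^{\pi_\theta}(s)$ is the frequency with which state $s$ is visited under policy $\pi_\theta$. You can read it as "how much time the policy spends in state $s$ when it runs".
- $Q^{\pi_\theta}(s, a)$ is the action-value function. It represents the expected (average) future return after taking action $a$ in state $s$.
- $\nabla_\theta \pi_\theta(a \mid s)$ describes how the probability of choosing action $a$ changes when the parameter $\theta$ changes.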
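Using the log-derivative trick derived in the previous section, this formula can be rewritten in the more common log form. Because

$$
\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
$$

we have

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[ Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
$$

If the sampled return $G_t$ is used to estimate $Q^{\pi_\theta}(s_t, a_t)$, we get REINFORCE:

$$
\nabla_\theta J(\theta) \approx \sum_{t} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

If the action value is replaced by an advantage function $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, we get a more stable, lower-variance form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\left[ A^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right]
$$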
Seen as a whole, the complicated formula does not appear out of nowhere. It is simply the intuition "increase the probability of good actions and decrease the probability of bad actions" written as a weighted average over all states and actions.
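The equivalence between the direct form and the log form can be checked numerically. The sketch below is a minimal illustration, not part of the original derivation: it assumes a single state (so the state-visitation weight is 1), a softmax policy over three actions, and made-up action values. The names `grad_log_form` and `grad_direct_form` are hypothetical helpers introduced here.

```python
import math

def softmax(logits):
    """Softmax policy over actions, computed in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_form(logits, q_values):
    """Log form: sum_a pi(a) * Q(a) * grad log pi(a).

    For a softmax policy, d/d logit_k of log pi(a) equals 1[a == k] - pi(k).
    """
    pi = softmax(logits)
    n = len(logits)
    return [
        sum(pi[a] * q_values[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(n))
        for k in range(n)
    ]

def grad_direct_form(logits, q_values, h=1e-6):
    """Direct form: sum_a Q(a) * grad pi(a), with grad pi from central differences."""
    n = len(logits)
    grad = []
    for k in range(n):
        up = list(logits)
        down = list(logits)
        up[k] += h
        down[k] -= h
        pi_up, pi_down = softmax(up), softmax(down)
        grad.append(sum(q_values[a] * (pi_up[a] - pi_down[a]) / (2 * h) for a in range(n)))
    return grad

# Assumed toy numbers: one state, three actions.
logits = [0.1, -0.3, 0.5]
q_values = [1.0, 0.0, 2.0]
g_log = grad_log_form(logits, q_values)
g_direct = grad_direct_form(logits, q_values)
```

The two gradients should agree up to finite-difference error, which is the log-derivative trick seen from the numerical side.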
Loss for Value Function Approximation
Policy gradients handle "how to update the policy", but training also needs a module that estimates "how many points a state or action is worth". This is the job of the Critic or DQN. Why do we need this module? Because the advantage estimate in policy gradients depends on an accurate estimate of the value $V(s)$. If the value estimate is inaccurate, the policy update direction can become biased. The learning objective is direct: make the predicted value as close as possible to the target value.
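Given a sample $(s, a, r, s')$, the TD target in DQN is

$$
y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')
$$

Here $\theta^-$ denotes the target-network parameters. The loss function is

$$
L(\theta) = \frac{1}{2}\big(y - Q_\theta(s, a)\big)^2
$$

Taking the gradient with respect to $\theta$, with the target $y$ treated as a constant, gives

$$
\nabla_\theta L(\theta) = -\big(y - Q_\theta(s, a)\big)\, \nabla_\theta Q_\theta(s, a)
$$

The first term is the TD error:

$$
\delta = y - Q_\theta(s, a)
$$

The second term, $\nabla_\theta Q_\theta(s, a)$, tells us how the parameters change the predicted value. DQN training repeatedly reduces this prediction error.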
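One gradient step on this loss can be sketched in a few lines. The example below is an illustration under stated assumptions: the Q-function is linear in hand-written state features (one weight vector per action) rather than a neural network, the loss carries a factor of 1/2, terminal states are not handled, and all numbers are invented. `q_value` and `dqn_step` are hypothetical helper names.

```python
def q_value(theta, features, action):
    """Linear Q-function (an assumption): Q(s, a) = theta[a] . features(s)."""
    return sum(w * x for w, x in zip(theta[action], features))

def dqn_step(theta, theta_target, sample, gamma=0.99, lr=0.1):
    """Apply one gradient-descent step on 0.5 * (y - Q_theta(s, a))^2."""
    s, a, r, s_next = sample
    n_actions = len(theta_target)
    # TD target uses the frozen target-network parameters.
    y = r + gamma * max(q_value(theta_target, s_next, b) for b in range(n_actions))
    td_error = y - q_value(theta, s, a)
    loss = 0.5 * td_error ** 2
    # For a linear Q, grad of Q wrt theta[a] is the feature vector, so
    # grad of loss is -td_error * s; descending adds lr * td_error * s.
    new_theta = [list(row) for row in theta]
    new_theta[a] = [w + lr * td_error * x for w, x in zip(theta[a], s)]
    return new_theta, td_error, loss

# Assumed toy sample: (state features, action, reward, next-state features).
theta = [[0.0, 0.0], [0.0, 0.0]]
theta_target = [[1.0, 0.0], [0.0, 1.0]]
sample = ([1.0, 0.0], 0, 1.0, [0.0, 1.0])
new_theta, td_error, loss = dqn_step(theta, theta_target, sample, gamma=0.9, lr=0.1)
# Repeating the step with the updated parameters should shrink the error.
_, td_error_after, loss_after = dqn_step(new_theta, theta_target, sample, gamma=0.9, lr=0.1)
```

Only the weights of the action that was actually taken change, because the gradient of the predicted value with respect to the other actions' weights is zero in this linear setup.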
GAE: Estimating Advantages by Accumulating TD Errors
Policy gradients need the advantage function $A(s_t, a_t)$ to measure how much better an action is than the average level, but this quantity cannot be observed directly. There are two extreme estimation methods. Monte Carlo (MC) methods use the return of the whole trajectory: low bias but high variance, because randomness accumulates over many steps. Temporal difference (TD) methods use only one step, "reward plus next-state estimate": low variance but high bias, because the estimate leans on the Critic's own, possibly inaccurate, value prediction. GAE (Generalized Advantage Estimation) is introduced to adjust flexibly between these two extremes. It accumulates future multi-step TD errors with decreasing weights, using the parameter $\lambda$ to control whether the estimate leans more toward MC or TD.

Start with the one-step TD error:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

If $\delta_t > 0$, the actual outcome is better than the Critic expected. If $\delta_t < 0$, the actual outcome is worse than expected. TD error only looks one step ahead. GAE accumulates future TD errors with decreasing weights:

$$
\hat{A}^{GAE}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}
$$

Here $\lambda \in [0, 1]$ controls the bias-variance tradeoff:
- Small $\lambda$: relies more on short-term TD errors, with lower variance but potentially higher bias.
- Large $\lambda$: closer to the full return, with lower bias but potentially higher variance.
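The two ends of the range can be written out explicitly. The block below is a short worked derivation added for reference. It uses $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ for the one-step TD error and assumes the discounted value term $\gamma^{L} V(s_{t+L})$ vanishes as $L$ grows (or that the episode terminates).

$$
\lambda = 0: \quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

$$
\lambda = 1: \quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)
$$

In the second line the intermediate value terms cancel in pairs (a telescoping sum), leaving the Monte Carlo return minus the baseline $V(s_t)$. So $\lambda = 0$ recovers the one-step TD estimate and $\lambda = 1$ recovers the MC estimate.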
GAE is common in PPO because it provides a convenient knob for balancing stability and accuracy.
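In practice the weighted sum of TD errors is usually computed backwards over a finite trajectory with the recursion $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. The following is a minimal sketch under stated assumptions: the trajectory is truncated after the last reward, `values` contains one extra bootstrap entry for the state after the last step, and the function name `compute_gae` and the sample numbers are illustrative only.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return GAE advantages for one trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T + 1 value estimates V(s_0) .. V(s_T);
             the last entry is the bootstrap value (0.0 if s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0  # holds A_{t+1} while iterating backwards
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Assumed toy trajectory with three steps and a terminal final state.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
adv_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)  # pure one-step TD errors
adv_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)  # return minus V(s_t)
adv_mix = compute_gae(rewards, values, gamma=1.0, lam=0.5)  # something in between
```

Setting `lam=0.0` reproduces the individual TD errors, while `lam=1.0` reproduces the Monte Carlo return minus the value baseline, which matches the two extremes described above.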
PPO Clipped Objective
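We have already seen the intuition behind the probability ratio and clipping:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

The PPO clipped objective is

$$
L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right) \right]
$$

The formula looks complex, but it is not difficult if we unpack it part by part.
If $\hat{A}_t > 0$, the action is better than average. We want to increase its probability, but the objective gives no further credit once the probability exceeds $1+\epsilon$ times the old probability.
If $\hat{A}_t < 0$, the action is worse than average. We want to decrease its probability, but the objective gives no further credit once the probability falls below $1-\epsilon$ times the old probability.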
Therefore, the combination of min and clip implements a simple and effective principle: update the policy in the correct direction, but do not let it move too far in one step.
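The per-sample term inside the expectation is easy to evaluate by hand. The sketch below assumes the standard clipped surrogate with probability ratio `ratio`, advantage estimate `advantage`, and clipping range `eps`. The function name `ppo_clip_term` and the sample values are illustrative.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, probability already raised past the limit: the clipped branch wins,
# so pushing the ratio higher no longer increases the objective.
good_and_too_far = ppo_clip_term(ratio=1.5, advantage=1.0)
# Bad action, probability already lowered past the limit: clipped as well.
bad_and_too_far = ppo_clip_term(ratio=0.5, advantage=-1.0)
# Good action whose probability went down: not clipped, so the gradient
# can still push the policy back in the correct direction.
good_but_wrong_way = ppo_clip_term(ratio=0.5, advantage=1.0)
# Bad action whose probability went up: not clipped either.
bad_but_wrong_way = ppo_clip_term(ratio=1.5, advantage=-1.0)
```

The asymmetry is the point of taking the minimum: clipping only removes the incentive to keep moving in the direction that is already favorable, and never hides a move in the wrong direction.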
GRPO Group-Normalized Advantage
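GRPO uses relative comparison within a group of answers. Suppose the same question generates $G$ answers, with rewards

$$
r_1, r_2, \dots, r_G
$$

First compute the mean:

$$
\mu = \frac{1}{G} \sum_{i=1}^{G} r_i
$$

Then compute the standard deviation:

$$
\sigma = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (r_i - \mu)^2}
$$

The standardized advantage of each answer is

$$
\hat{A}_i = \frac{r_i - \mu}{\sigma}
$$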
For example, if the rewards are , the mean is . The third answer is clearly above average, so its advantage is positive. The first answer is below average, so its advantage is negative. The benefit of this within-group relative comparison is that the model does not need an additional Critic network. It can update the policy using only comparisons among answers in the same group.
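The same computation can be run on a small group. The reward values below are assumed for illustration and are not taken from the example above. They are chosen only so that the third answer is above the mean and the first is below it. The helper name `group_normalized_advantages` is hypothetical, and the sketch uses the population standard deviation without any guard for a group whose rewards are all equal (where the standard deviation is zero and the division would fail).

```python
import math

def group_normalized_advantages(rewards):
    """Standardize rewards within one group: (r_i - mean) / std."""
    g = len(rewards)
    mean = sum(rewards) / g
    variance = sum((r - mean) ** 2 for r in rewards) / g  # population variance
    std = math.sqrt(variance)
    advantages = [(r - mean) / std for r in rewards]
    return mean, std, advantages

# Assumed rewards for four answers to the same question.
rewards = [1.0, 2.0, 4.0, 1.0]
mean, std, advantages = group_normalized_advantages(rewards)
```

With these assumed rewards the third answer receives a positive advantage, the first and fourth receive negative ones, and the second, which sits exactly at the mean, receives zero. The advantages sum to zero by construction, which is what makes the comparison purely relative.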
Summary
This page summarized all core formulas in module E.3:
| Formula | Core expression | Use |
|---|---|---|
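| Policy gradient theorem | $\nabla_\theta J(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a Q^{\pi_\theta}(s,a)\,\nabla_\theta\pi_\theta(a\mid s)$ | Theoretical foundation of policy optimization |
| DQN loss | $L(\theta)=\tfrac{1}{2}\big(r+\gamma\max_{a'}Q_{\theta^-}(s',a')-Q_\theta(s,a)\big)^2$ | Training objective for value functions |
| GAE | $\hat{A}^{GAE}_t=\sum_l(\gamma\lambda)^l\delta_{t+l}$ | Advantage estimate with bias-variance tradeoff |
| PPO clipping | $\min\big(r_t\hat{A}_t,\ \operatorname{clip}(r_t,1-\epsilon,1+\epsilon)\,\hat{A}_t\big)$ | Limits policy update size |
| GRPO group advantage | $\hat{A}_i=(r_i-\mu)/\sigma$ | Within-group relative comparison without a Critic |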
When you encounter an unfamiliar symbol, return to this page for reference. The next page uses exercises to check your understanding of these formulas.
Next: E.3.6 Formula Reference and Exercises, which summarizes all formulas in this module and checks understanding through exercises.