C.4 GRPO and Reward Models

GRPO Loss

One-Line Memory

For the same prompt, sample $G$ completions. Normalize rewards within the group (z-score) to form advantages. Plug them into PPO’s clipped loss, then add a KL penalty. No critic is used.

Pseudocode

# 1) For a single prompt, sample G completions
completions = [generate(prompt) for _ in range(G)]

# 2) Score each completion
rewards = [reward_fn(c) for c in completions]   # [G]

# 3) Group-wise normalization -> advantage
advantages = (rewards - mean(rewards)) / (std(rewards) + eps)  # [G]

# 4) PPO clipped loss (using group-wise advantages)
ratio = exp(new_logp - old_logp)
surr1 = ratio * advantages
surr2 = clip(ratio, 1-eps, 1+eps) * advantages
policy_loss = -min(surr1, surr2).mean()

# 5) KL penalty (against a reference model)
kl = kl_penalty(log_probs, ref_log_probs)

# 6) Total
loss = policy_loss + kl_coeff * kl

What To Say In An Interview

GRPO stands for Group Relative Policy Optimization. Compared to PPO:

	PPO	GRPO
Where advantages come from	critic value $V(s)$ + GAE	group-wise normalized rewards
How many models are involved	often 4 (actor, critic, reference, reward model)	often 2-3 (actor, reference, RM or verifier)
KL penalty	optional	almost always used
Sampling style	single rollout per prompt	$G$ samples for the same prompt

Mnemonic: "PPO without the critic; replace advantages with group-wise z-score; everything else is basically copied."

Python (NumPy) Implementation

python

import numpy as np


def grpo_advantages(rewards):
    """
    rewards: [num_prompts, G]
    returns: [num_prompts, G] z-scored within each prompt group
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + 1e-8)


def grpo_policy_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Identical to PPO's clipped surrogate loss."""
    ratio = np.exp(new_logps - old_logps)
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.minimum(surr1, surr2).mean()

PyTorch Implementation

python

import torch


def grpo_loss(
    log_probs,
    old_log_probs,
    ref_log_probs,
    rewards,
    clip_eps=0.2,
    kl_coeff=0.05,
):
    """
    log_probs:     [B, G, seq_len] current policy
    old_log_probs: [B, G, seq_len] behavior policy (used during sampling)
    ref_log_probs: [B, G, seq_len] reference policy
    rewards:       [B, G]          group-wise rewards

    B = number of prompts, G = group size
    """
    B, G = rewards.shape

    # 1) Group-wise normalization
    advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-8)
    advantages = advantages.unsqueeze(-1)  # [B, G, 1], broadcast over seq_len if needed

    # 2) Sequence-level log-prob sums (one per completion)
    seq_logp = log_probs.sum(dim=-1)  # [B, G]
    seq_old = old_log_probs.sum(dim=-1)
    seq_ref = ref_log_probs.sum(dim=-1)

    # 3) Clipped policy loss
    ratio = torch.exp(seq_logp - seq_old)
    adv = advantages.squeeze(-1)  # [B, G]
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()

    # 4) KL penalty (a nonnegative estimator)
    log_ratio = seq_logp - seq_ref
    kl = (torch.exp(log_ratio) - 1 - log_ratio).mean()

    return policy_loss + kl_coeff * kl

Reward Model (Bradley-Terry)

One-Line Memory

If the chosen score is higher than the rejected score, sigmoid should be close to 1, so -log should be close to 0. The whole loss is one line: -log_sigmoid(r_chosen - r_rejected).

Pseudocode

r_w = reward_model(chosen_input)      # scalar reward for chosen
r_l = reward_model(rejected_input)    # scalar reward for rejected
loss = -log(sigmoid(r_w - r_l))

Intuition

The Bradley-Terry model assumes the probability that humans prefer $y_w$ over $y_l$ is:

$P(y_w \succ y_l) = \sigma(r(x, y_w) - r(x, y_l))$

Maximizing the log probability is equivalent to minimizing -log_sigmoid(diff).

Mnemonic: "Reward model training is pairwise cross-entropy."

Python (NumPy) Implementation

python

import numpy as np


def log_sigmoid(x):
    return -np.logaddexp(0, -x)


def reward_model_loss(r_chosen, r_rejected):
    """r_chosen, r_rejected: [B]"""
    return -log_sigmoid(r_chosen - r_rejected).mean()

PyTorch Implementation

python

import torch.nn.functional as F


def reward_model_loss(r_chosen, r_rejected):
    """
    r_chosen:   [B] reward scores for chosen
    r_rejected: [B] reward scores for rejected
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

Interview Follow-Up: DPO vs RLHF-PPO

Interviewers often ask: "What are the pros/cons of DPO compared to PPO-based RLHF?" This table is a good answer:

Dimension	PPO-RLHF	DPO
Needs a reward model	yes	no (implicit)
Needs a critic	yes	no
Needs a reference model	optional	required
Online vs offline	online (requires sampling)	offline (preference data only)
Training cost	high (often 4 models)	lower (often 2 models)
Reward hacking risk	present (RM can be exploited)	lower (no explicit RM in training)
Theoretical optimum	stronger (can keep exploring)	limited by offline data quality
Best-fit use case	large-scale online training	when preference data is abundant

Common Pitfalls

Pitfall	Explanation
GRPO advantages are group-wise	Not global normalization. Compare only the $G$ completions from the same prompt.
No value loss in GRPO	There is no critic, so there is no value loss. That is the key difference from PPO.
Reward model must be frozen during policy updates	When training the RM, you backprop through it; when training the policy, RM is usually frozen.
KL is sequence-level in practice	Commonly compute KL after summing token log-probs over the completion, not per-token.
DPO implicitly learns a reward difference	`log_ratio_w - log_ratio_l` acts like an implicit reward delta.
Choosing $G$	Typical range is 4-16. Too small gives noisy advantages; too large increases sampling cost.
RLVR case	Rewards may come from a rule-based verifier (code execution, math answer check), not from a learned RM.

1. CartPole Balancing

2. DPO Preference Tuning

3. MDP and Value Functions

4. Deep Q-Networks

5. Policy-Based Methods

6. Actor-Critic

7. PPO

8. The RLHF Pipeline

9. Post-Training Alignment

10. Agentic RL

11. VLM Reinforcement Learning

12. Future Trends

B. RL Engineering Practice

C. Code Cheatsheet

E. Math Foundations for RL

E.1 Linear Algebra

E.2 Probability & Estimation

E.3 Calculus & Optimization

E.4 Information Theory

C.4 GRPO and Reward Models

GRPO Loss

One-Line Memory

Pseudocode

What To Say In An Interview

Python (NumPy) Implementation

PyTorch Implementation

Reward Model (Bradley-Terry)

One-Line Memory

Pseudocode

Intuition

Python (NumPy) Implementation

PyTorch Implementation

Interview Follow-Up: DPO vs RLHF-PPO

Common Pitfalls

E.1 Linear Algebra

E.2 Probability & Estimation

E.3 Calculus & Optimization

E.4 Information Theory

C.4 GRPO and Reward Models ​

GRPO Loss ​

One-Line Memory ​

Pseudocode ​

What To Say In An Interview ​

Python (NumPy) Implementation ​

PyTorch Implementation ​

Reward Model (Bradley-Terry) ​

One-Line Memory ​

Pseudocode ​

Intuition ​

Python (NumPy) Implementation ​

PyTorch Implementation ​

Interview Follow-Up: DPO vs RLHF-PPO ​

Common Pitfalls ​

C.4 GRPO and Reward Models

GRPO Loss

One-Line Memory

Pseudocode

What To Say In An Interview

Python (NumPy) Implementation

PyTorch Implementation

Reward Model (Bradley-Terry)

One-Line Memory

Pseudocode

Intuition

Python (NumPy) Implementation

PyTorch Implementation

Interview Follow-Up: DPO vs RLHF-PPO

Common Pitfalls