C.8 DAPO

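DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is a refinement of GRPO proposed in 2025 by ByteDance Seed together with Tsinghua University, and it comes up increasingly often in interviews. The paper lists four techniques; this section covers three of them and leaves out token-level loss aggregation.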


DAPO vs GRPO: Three Improvements

| Improvement | GRPO | DAPO |
| --- | --- | --- |
| clipping | symmetric: clip(ratio, 1-eps, 1+eps) | decoupled: clip(ratio, 1-eps_low, 1+eps_high), with the two bounds tuned separately |
| sampling | trains on every sampled prompt | dynamic sampling: filter prompts whose completions are all correct or all wrong |
| overlong penalty | binary (overlong -> reward = 0) | progressive penalty: the longer the excess, the larger the deduction |

Decoupled Clipping

One-Line Memory

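For positive advantages, only the upper bound is active (do not be greedy). For negative advantages, only the lower bound is active (do not be vengeful). This is already true of the PPO-style objective min(ratio * A, clip(ratio) * A) that GRPO uses; what DAPO changes is that the two bounds get separate epsilons (eps_low, eps_high) instead of one shared eps.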

Pseudocode

```
ratio = exp(new_logp - old_logp)

# positive advantage: encourage improvement, but cap the ratio from above
pos_surr = min(ratio, 1 + eps_high) * advantage     # tokens with advantage > 0

# negative advantage: push the probability down, but floor the ratio from below
neg_surr = max(ratio, 1 - eps_low) * advantage      # tokens with advantage < 0

# each token contributes exactly one of the two terms
loss = -mean(surrogate over all tokens)
```

Intuition

Compare to symmetric clipping:

```
GRPO (symmetric, one shared eps):
  advantage > 0:  min(ratio, 1+eps) * A
  advantage < 0:  max(ratio, 1-eps) * A

DAPO (decoupled, eps_low and eps_high):
  advantage > 0:  min(ratio, 1+eps_high) * A
  advantage < 0:  max(ratio, 1-eps_low)  * A
```

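Because the two bounds are independent, exploration can be tuned separately in each direction. The DAPO paper raises only the upper bound (eps_high = 0.28 with eps_low = 0.2), which gives low-probability tokens more room to grow while keeping the penalty side conservative.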
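To see why the upper bound matters most for low-probability tokens, a small arithmetic sketch helps. The helper name `max_new_prob` is hypothetical, and 0.2 and 0.28 are the values reported in the DAPO paper; the function only computes the largest new probability that still lies inside the clip range for a positive-advantage token.

```python
def max_new_prob(old_prob: float, eps_high: float) -> float:
    """Largest new probability whose ratio new/old stays within 1 + eps_high."""
    return min(1.0, old_prob * (1.0 + eps_high))


for old_prob in (0.01, 0.5, 0.9):
    sym = max_new_prob(old_prob, eps_high=0.2)    # shared GRPO-style bound
    dec = max_new_prob(old_prob, eps_high=0.28)   # raised DAPO-style bound
    print(f"old={old_prob:.2f}  eps=0.20 -> {sym:.4f}  eps=0.28 -> {dec:.4f}")
```

A token that already has probability 0.9 can reach 1.0 under either bound, whereas a token at 0.01 is held to roughly 0.012 or 0.0128 per update. The upper clip therefore mainly restricts rare tokens, which is the motivation for loosening it.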

Python (NumPy) Implementation

```python
import numpy as np


def dapo_policy_loss(new_logp, old_logp, advantages, clip_high=0.28, clip_low=0.2):
    """
    new_logp:     [T] log-probs under the current policy
    old_logp:     [T] log-probs under the sampling (old) policy
    advantages:   [T]
    clip_high:    upper-bound epsilon, active for non-negative advantages
    clip_low:     lower-bound epsilon, active for negative advantages
    """
    ratio = np.exp(new_logp - old_logp)

    # non-negative advantage: cap the ratio at 1 + clip_high
    # negative advantage:     floor the ratio at 1 - clip_low
    clipped_ratio = np.where(
        advantages >= 0,
        np.minimum(ratio, 1 + clip_high),
        np.maximum(ratio, 1 - clip_low),
    )

    # token-level mean of the negated surrogate
    return -(clipped_ratio * advantages).mean()
```

PyTorch Implementation

```python
import torch


def dapo_policy_loss(new_logps, old_logps, advantages, clip_high=0.28, clip_low=0.2):
    """
    new_logps:    [B, seq_len]
    old_logps:    [B, seq_len]
    advantages:   [B, seq_len]
    Padding is not masked here; every position is treated as a real token.
    """
    ratio = torch.exp(new_logps - old_logps)

    # non-negative advantage: min(ratio, 1 + clip_high) * advantage
    # negative advantage:     max(ratio, 1 - clip_low) * advantage
    clipped = torch.where(
        advantages >= 0,
        torch.clamp(ratio, max=1 + clip_high),
        torch.clamp(ratio, min=1 - clip_low),
    )

    # mean over all B * seq_len positions (token-level aggregation)
    return -(clipped * advantages).mean()
```
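One consequence of the clamp is easy to miss: once a token's ratio is clipped, the surrogate is constant in the policy parameters, so that token contributes no gradient. The sketch below uses a hypothetical helper, `token_has_gradient`, to flag which tokens are still active under that reading. The default epsilons are assumptions matching the values in the DAPO paper.

```python
def token_has_gradient(ratio: float, advantage: float,
                       eps_low: float = 0.2, eps_high: float = 0.28) -> bool:
    """True if the clipped surrogate still depends on the ratio for this token."""
    if advantage > 0:
        return ratio < 1.0 + eps_high   # above the cap, min() picks the constant
    if advantage < 0:
        return ratio > 1.0 - eps_low    # below the floor, max() picks the constant
    return False                        # zero advantage: the term is zero anyway


# (ratio, advantage) pairs covering both clipped and unclipped cases
tokens = [(1.10, 1.0), (1.35, 1.0), (0.70, -1.0), (0.90, -1.0), (1.50, -1.0)]
active = [token_has_gradient(r, a) for r, a in tokens]
print(active)
```

Note the last pair: a negative-advantage token whose ratio has grown to 1.5 is not clipped at all, because only the lower bound applies on that side.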

Dynamic Sampling

One-Line Memory

If all G completions for the same prompt get the same reward (all correct or all wrong), skip that prompt: there is no learning signal.

Pseudocode

```
for each prompt:
    rewards = [get_reward(completion) for completion in group]
    if all rewards are the same:
        skip this prompt
```

PyTorch Implementation

```python
import torch


def dynamic_sampling_filter(rewards):
    """
    rewards: [B, G] where B prompts, each with G completions
    returns: bool mask [B], True means keep this prompt
    """
    # per-prompt standard deviation across the G completions
    reward_std = rewards.std(dim=1)
    return reward_std > 1e-6
```

Intuition

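GRPO uses group-wise z-score normalization. If all rewards in a group are identical, then std = 0 and the advantages are either all zero (when an epsilon is added to the denominator) or NaN (when it is not). Either way the group contributes no useful gradient. DAPO filters such groups at the data level, and keeps sampling further prompts until the batch is full, instead of discovering the problem later in the loss computation.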
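Filtering alone shrinks the batch, so in practice the filter is paired with a refill loop. The following is a minimal sketch of that idea in plain Python. `fill_batch`, `toy_sampler`, and `max_draws` are hypothetical names, and the sampler is a random stand-in for "generate G completions and score them".

```python
import random
from typing import Callable, List, Tuple

Group = Tuple[str, List[float]]  # (prompt, rewards of its G completions)


def fill_batch(sample_group: Callable[[], Group], batch_size: int,
               max_draws: int = 1000) -> List[Group]:
    """Draw prompt groups until `batch_size` of them have mixed rewards."""
    batch: List[Group] = []
    for _ in range(max_draws):
        prompt, rewards = sample_group()
        if max(rewards) - min(rewards) > 1e-6:   # skip all-correct / all-wrong groups
            batch.append((prompt, rewards))
        if len(batch) == batch_size:
            break
    return batch


def toy_sampler() -> Group:
    # Hypothetical stand-in: 4 completions, each correct with probability 0.5.
    rewards = [float(random.random() < 0.5) for _ in range(4)]
    return "prompt", rewards


random.seed(0)
batch = fill_batch(toy_sampler, batch_size=8)
print(len(batch), all(len(set(r)) > 1 for _, r in batch))
```

The `max_draws` cap is a design choice for this sketch: it keeps the loop from running forever when almost every prompt is too easy or too hard.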


Overlong Reward Shaping

One-Line Memory

Overlong responses are not cut to zero in one shot. Penalize linearly by the amount of overflow.

Pseudocode

```
if response_length > max_length:
    penalty_ratio = (response_length - max_length) / max_length
    reward = reward - penalty_weight * penalty_ratio
```

Python Implementation

```python
def overlong_reward_shaping(reward, response_length, max_length, penalty_weight=0.1):
    """Subtract a penalty proportional to the relative overflow beyond max_length."""
    if response_length <= max_length:
        return reward
    penalty = penalty_weight * (response_length - max_length) / max_length
    return reward - penalty
```

Intuition

Compare to GRPO:

  • GRPO: overlong -> reward = 0 (binary, discontinuous)
  • DAPO: overlong -> reward decreases linearly (smooth signal)

From an RL view, a binary reward provides little directional signal at the boundary. A linear penalty tells the policy: "shorter would be better."
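Note that the linear form above is a simplification. The DAPO paper describes a length penalty that starts inside a buffer just before the hard limit and saturates at -1 beyond it. A minimal sketch of that piecewise form follows. The names `soft_overlong_penalty`, `max_len`, and `buffer_len` are illustrative rather than taken from any library, and the returned value is meant to be added to the task reward.

```python
def soft_overlong_penalty(length: int, max_len: int, buffer_len: int) -> float:
    """Piecewise-linear length penalty in [-1, 0], added to the task reward."""
    soft_limit = max_len - buffer_len
    if length <= soft_limit:
        return 0.0                                   # comfortably short: no penalty
    if length <= max_len:
        return (soft_limit - length) / buffer_len    # linear ramp from 0 down to -1
    return -1.0                                      # past the hard limit


# Toy numbers, chosen only for readability.
for n in (800, 900, 1000, 1200):
    print(n, soft_overlong_penalty(n, max_len=1000, buffer_len=200))
```

Compared with the simpler `(len - max_len) / max_len` form, the penalty here begins before the response is truncated, so the policy receives the "shorter would be better" signal while the answer is still complete.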


DAPO Total Loss (Sketch)

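```
# 1) dynamic sampling filter: keep prompts whose group rewards are not all identical
valid_mask = dynamic_sampling_filter(rewards)           # [B]
rewards, new_logp, old_logp = keep only the groups where valid_mask is True

# 2) group-wise normalization (same as GRPO)
advantages = (rewards - mean) / (std + eps)

# 3) decoupled-clipping policy loss (already averaged over tokens)
policy_loss = dapo_policy_loss(new_logp, old_logp, advantages, clip_high, clip_low)

# 4) KL penalty (optional: the DAPO paper drops this term, i.e. kl_coeff = 0)
kl = kl_penalty(log_probs, ref_log_probs)

# 5) total
loss = policy_loss + kl_coeff * kl
```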
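To make the data flow concrete, here is a toy pass over two prompt groups in plain Python. It is only a sketch under simplifying assumptions: one log-probability per completion rather than per token, no KL term, and hypothetical helper names (`group_advantages`, `clipped_surrogate`).

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style z-score within one prompt group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(new_logp, old_logp, adv, eps_low=0.2, eps_high=0.28):
    """Decoupled-clip surrogate for a single completion (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    bounded = min(ratio, 1 + eps_high) if adv >= 0 else max(ratio, 1 - eps_low)
    return bounded * adv


# Each group: (rewards, new_logps, old_logps) for G = 3 completions.
groups = [
    ([1.0, 0.0, 0.0], [-1.0, -2.0, -2.0], [-1.2, -1.9, -2.0]),
    ([1.0, 1.0, 1.0], [-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0]),  # no signal
]

terms = []
for rewards, new_lp, old_lp in groups:
    if max(rewards) == min(rewards):      # dynamic sampling: drop the group
        continue
    for adv, n, o in zip(group_advantages(rewards), new_lp, old_lp):
        terms.append(clipped_surrogate(n, o, adv))

loss = -mean(terms)
print(len(terms), round(loss, 4))
```

Only the first group survives the filter, so the loss is averaged over its three completions. The second group would have produced zero advantages and diluted the mean without adding any signal.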

Full Comparison: GRPO vs DAPO

| Dimension | GRPO | DAPO |
| --- | --- | --- |
| clipping | symmetric clip(r, 1-eps, 1+eps) | decoupled: separate eps_low and eps_high |
| invalid data | not handled (std=0 -> zero or NaN advantages) | filtered via dynamic sampling |
| overlong reward | binary (0/1 style) | progressive linear penalty |
| exploration flexibility | fixed | can be more aggressive for positive direction, more conservative for negative |
| representative work | DeepSeekMath, DeepSeek-R1 | ByteDance / Tsinghua DAPO |

Common Pitfalls

| Pitfall | Explanation |
| --- | --- |
| Decoupled clipping is not "no clipping" | Clipping still exists; the positive/negative sides are tuned independently. |
| Wrong condition for dynamic sampling | It is not "reward below a threshold"; it is "group reward variance is (near) zero." |
| Overlong shaping is linear, not exponential | Simple (len - max_len) / max_len is enough. |
| Advantages are still group-wise normalized | This part is exactly the same as GRPO. |
| clip_high and clip_low can differ | In interviews: "you can tune exploration strength separately in the two directions." |

Modern Reinforcement Learning: A Hands-On Course