9.6 On-Policy Distillation: Turning the Teacher into Dense Reward

The previous section discussed how RLVR replaces an RM with a rule verifier, giving precise reward signals in domains such as mathematics and code where objective answers exist. This section looks at another route: instead of making the small model explore from scratch, let a stronger model give token-by-token guidance on trajectories generated by the small model itself. This route is called On-Policy Distillation, or OPD.

The core loop has only three steps. First, the student generates an answer to a prompt by itself, sampling from its own policy. Second, the teacher does not rewrite the answer; instead, it reads every step written by the student and judges whether each token is reasonable. Third, this judgment becomes a dense token-level signal sent back to the student for adjustment. The key terms line up as follows: the student model is the policy in RL, each selected token is an action, and the teacher's log-prob for that token is the feedback signal.

OPD differs from the previous methods along two dimensions. First, who generates the training trajectory: SFT / SeqKD uses answers written by the teacher, GRPO uses the student's own exploration, and OPD also uses the student's own trajectories. Second, how dense the feedback is: GRPO / RLVR usually gives outcome-level reward, where a correct answer receives 1 and an incorrect one receives 0, so a 2000-token solution process receives only that final 0 or 1. OPD, by contrast, has a signal for almost every token, because the teacher can assign a log-prob to each token.

| Method | Who generates the training trajectory | Where feedback comes from | Feedback granularity | Main problem |
| --- | --- | --- | --- | --- |
| SFT / SeqKD | Human or teacher | Reference answer tokens | token-level | The student does not practice its own mistakes. |
| PPO / GRPO | student | RM or rule verifier | mostly sequence-level | Rewards are sparse and sampling is expensive. |
| DPO | Offline preference data | chosen / rejected pairs | sequence-pair-level | Cannot explore online. |
| OPD | student | teacher log-prob | token-level | Requires a good teacher and good initialization. |

In one sentence: OPD gets both the on-policy distribution of RL and the dense supervision of distillation.

The Core Idea of OPD

The Teacher's Role in OPD

All distillation methods have a teacher. The difference is what role the teacher plays during training.

| Method | Teacher's role | Who produces the training trajectory | What the student learns | Limitation |
| --- | --- | --- | --- | --- |
| SFT / SeqKD | Answer author: generates the complete answer | teacher | How the teacher writes | Learns only in teacher contexts and does not practice its own mistakes. |
| Off-policy distillation | Expert demonstrator: shows complete solution trajectories | teacher | How the teacher thinks | The trajectory is still one taken by the teacher; there is no signal where the student gets stuck. |
| OPD | Online evaluator: reads the student prefix and judges each token | student | How to correct itself on paths it actually took | Requires a good teacher and good initialization. |

From top to bottom, the teacher changes from data producer to reward provider, and the student changes from imitating teacher trajectories to being corrected on its own trajectories. OPD's motivation is precisely to remove distribution shift: during inference the student sees contexts generated by itself, so training should provide signals directly in those regions, rather than only laying data on paths the teacher has taken.

What OPD Optimizes

The OPD training loop is short:

  1. Given a prompt $x$, the student $\pi_\theta$ samples an answer by itself, $y \sim \pi_\theta(\cdot \mid x)$.
  2. For every prefix $c_t=(x,y_{<t})$ generated by the student, the teacher $q$ computes the probability of the current token $y_t$.
  3. The student is updated using the teacher's evaluation of the student's own tokens.

Google DeepMind's GKD is a representative starting point for this line of work. It no longer depends only on fixed teacher outputs. Instead, the student receives teacher feedback on self-generated sequences, and the framework allows different divergences such as forward KL, reverse KL, and JSD.[1]

With the most common reverse KL form, the objective can be written as:

$$\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{y \sim \pi_\theta} \left[ \sum_t \log \frac{\pi_\theta(y_t \mid c_t)}{q(y_t \mid c_t)} \right]$$

This means minimizing the following quantity on states visited by the student:

$$D_{\text{KL}}(\pi_\theta(\cdot \mid c_t) \| q(\cdot \mid c_t))$$

After expansion, every token has a very natural training signal:

$$r_t = \log q(y_t \mid c_t) - \log \pi_\theta(y_t \mid c_t)$$

If the teacher approves this token more than the student does, $r_t$ is high. If the teacher considers this token unlikely under a good policy, $r_t$ is low. In Thinking Machines Lab's implementation, this is almost equivalent to replacing the reference model behind the KL regularizer in an RL training script with the teacher: sample student rollouts, compute student log-probs, let the teacher compute log-probs on the same trajectories, and finally use negative reverse KL as the per-token advantage.[2]
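To make the per-token signal concrete, here is a minimal numeric sketch at a single student state. The two distributions are made-up numbers, not taken from any model; the point is only to show the sign of $r_t$ for individual tokens and that its expectation under the student equals the negative reverse KL at that state.

```python
import math

# Toy next-token distributions at one student state c_t (made-up numbers).
student = {"therefore": 0.50, "so": 0.30, "banana": 0.20}
teacher = {"therefore": 0.70, "so": 0.29, "banana": 0.01}


def token_reward(token):
    """r_t = log q(y_t | c_t) - log pi_theta(y_t | c_t)."""
    return math.log(teacher[token]) - math.log(student[token])


def reverse_kl(p, q):
    """D_KL(p || q) for two distributions over the same token set."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


rewards = {tok: token_reward(tok) for tok in student}

# Expectation of r_t when the token is sampled from the student.
expected_reward = sum(student[tok] * rewards[tok] for tok in student)

for tok, r in rewards.items():
    print(f"{tok:10s} r_t = {r:+.3f}")
print(f"E[r_t] = {expected_reward:+.4f}")
print(f"-KL    = {-reverse_kl(student, teacher):+.4f}")
```

The token the teacher likes more than the student gets a positive reward, the over-sampled junk token gets a strongly negative one, and the two printed totals agree.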

Therefore, the mapping between OPD and RL is direct:

| RL concept | Counterpart in OPD |
| --- | --- |
| State $s_t$ | prompt + generated prefix $c_t$ |
| Action $a_t$ | next token $y_t$ |
| Policy | student $\pi_\theta$ |
| Reward | teacher's relative approval of the student's token |
| Sampling distribution | the student's own distribution |
| Advantage estimate | commonly $\log q(y_t \mid c_t)-\log \pi_\theta(y_t\mid c_t)$ |

One detail matters here: the teacher is not environmental truth. It is only a strong policy. OPD is not asking the model to discover new strategies beyond the teacher; it compresses the teacher's behavior on the student's own states into the student. It is closer to imitation learning with process supervision than to pure exploratory RL.

Terms: Distillation, Off-Policy, On-Policy

First separate three concepts that are often mixed together: distillation, off-policy / offline, and on-policy. They refer to different layers of the problem.

Distillation describes a teacher-student relationship: a strong model $q$ transfers ability to a smaller or cheaper model $\pi_\theta$. The simplest distillation asks the teacher to generate answers and trains the student on those answers with cross-entropy:

$$\mathcal{L}_{\text{hard KD}} = -\sum_t \log \pi_\theta(y_t^T \mid x, y_{<t}^T)$$

Here the student sees the tokens selected by the teacher, $y_t^T$. If you can access the teacher's full probability distribution, you can also do "soft distillation": instead of only telling the student what the correct token is, you also tell it how reasonable the other tokens are.

$$\mathcal{L}_{\text{soft KD}} = \sum_t D_{\text{KL}}\left(q(\cdot \mid c_t) \| \pi_\theta(\cdot \mid c_t)\right)$$

Soft distillation contains more information. For example, the teacher may think "therefore" is very good, "so" is also acceptable, and "banana" is completely wrong. A hard label only tells you which word the teacher finally chose. A soft label tells you the teacher's judgment over the whole action space.
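A small sketch of the difference, using invented probabilities for the three example tokens. Both hypothetical students give the teacher's chosen token the same probability, so the hard-label loss cannot tell them apart; the soft loss can, because one student wastes mass on "banana".

```python
import math

# Invented next-token distributions at one context (illustrative only).
teacher = {"therefore": 0.70, "so": 0.28, "banana": 0.02}
student_a = {"therefore": 0.70, "so": 0.05, "banana": 0.25}
student_b = {"therefore": 0.70, "so": 0.25, "banana": 0.05}


def hard_kd_loss(student, teacher_token):
    """Cross-entropy against the single token the teacher emitted."""
    return -math.log(student[teacher_token])


def soft_kd_loss(student, teacher_dist):
    """Forward KL D_KL(q || pi) against the teacher's full distribution."""
    return sum(q * math.log(q / student[tok]) for tok, q in teacher_dist.items())


for name, student in [("student_a", student_a), ("student_b", student_b)]:
    print(
        name,
        f"hard={hard_kd_loss(student, 'therefore'):.4f}",
        f"soft={soft_kd_loss(student, teacher):.4f}",
    )
```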

In LLMs, a policy is simply the model that outputs a distribution over the next token given a context:

$$\pi_\theta(a_t \mid s_t) \quad \Longleftrightarrow \quad \pi_\theta(y_t \mid x, y_{<t})$$

Here the state $s_t$ is the prompt plus generated prefix, and the action $a_t$ is the next token. Whoever generates the trajectory is the behavior policy. This determines the distribution of training data.

Off-policy means that the training data is not generated by the current student itself, but by another policy $\mu$. This $\mu$ may be the teacher, an old checkpoint, user logs, a historical model, or a fixed dataset. Ordinary teacher distillation is typical off-policy training:

$$y \sim q(\cdot \mid x), \quad \text{update } \pi_\theta$$

The data comes from the teacher $q$, while the updated model is the student $\pi_\theta$. The advantage is that this is cheap, reusable, and stable. The drawback is that the student does not train in the contexts created by its own mistakes.

Offline is stricter than off-policy: not only does the data come from another policy, but no new sampling is done during training; only a fixed dataset is used. SFT, DPO, and offline preference training are usually offline. A useful summary is:

| Concept | Where data comes from | Is new data sampled during training? | Examples |
| --- | --- | --- | --- |
| offline | Fixed historical dataset | No | SFT, DPO, offline SeqKD |
| off-policy | Not the current student's policy | May or may not | teacher trajectories, old-model replay |
| on-policy | Current student itself | Yes | PPO, GRPO, OPD rollout |

Therefore, offline training in LLM post-training is usually also off-policy, but off-policy training is not necessarily offline. If you ask the teacher to regenerate data in every round and then train the student, it is not offline, but it is still off-policy because the behavior policy is the teacher, not the student.

On-policy means using data generated by the current student to update the current student. Formally:

$$y \sim \pi_\theta(\cdot \mid x), \quad \text{update } \pi_\theta$$

Its advantage is that the training distribution matches the inference distribution. Wherever the student goes during inference, training really lets it go there and then provides feedback there. The cost is also clear: before each update round, rollouts must be regenerated; old data quickly becomes stale; sample efficiency is low.

Putting these concepts together, OPD's position is clear:

  • It is distillation, because feedback comes from a teacher.
  • It is on-policy, because trajectories come from the student itself.
  • It is usually not purely offline, because student rollouts are repeatedly regenerated during training.
  • Its difference from ordinary off-policy distillation is not whether there is a teacher, but whose trajectory the teacher scores.

Why Ordinary Distillation Is Not Enough

Knowledge Distillation, or KD, addresses a long-standing problem: large models are powerful but expensive, while small models are cheap but weak. The standard solution is direct: let the teacher generate data, then train the student on that data with supervised learning. LLM-era KD surveys usually divide the field into categories such as white-box distillation, which sees teacher logits, and black-box distillation, which sees only teacher outputs; they also classify distillation by ability, such as reasoning, alignment, domain knowledge, and tool use.[3][4]

This route is very useful. DeepSeek-R1's distilled models are a typical example: first let a strong reasoning model generate high-quality trajectories, then SFT those trajectories into smaller models. For small models, this is often more stable than doing RL directly.

But it has a fundamental gap: during training, the student sees the teacher's state distribution; during inference, the student follows its own state distribution.

Suppose the prompt is $x$, and the teacher trajectory is:

$$y^{T} = (y_1^T, y_2^T, \dots, y_T^T) \sim q(\cdot \mid x)$$

Ordinary distillation trains:

$$\mathcal{L}_{\text{off-policy}}(\theta) = -\sum_t \log \pi_\theta(y_t^T \mid x, y_{<t}^T)$$

The context $x, y_{<t}^T$ comes from the teacher. But once the student makes a mistake at step 3, every later context becomes $x, y_{<t}^{S}$ with $t > 3$, a prefix that contains the student's own error. The teacher may never have entered this state, and the SFT data contains no demonstration of "how to recover from here." Errors then amplify along autoregressive generation. This is exposure bias, and it can also be understood as distribution shift in imitation learning. DAgger pointed out long ago that to reduce this problem, the states visited by the learner itself must be included in training.[5]
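A rough back-of-the-envelope sketch of why this matters for long outputs. It assumes, purely for illustration, that the student leaves the teacher-like path independently at each token with a fixed probability; real errors are correlated, so treat the numbers as intuition rather than measurement. The function name and the error rates are hypothetical.

```python
def prob_error_free(per_token_error, num_tokens):
    """Chance that a prefix of num_tokens tokens contains no deviation,
    under the simplifying assumption of independent per-token errors."""
    return (1.0 - per_token_error) ** num_tokens


for eps in (0.0001, 0.001, 0.01):
    for length in (100, 2000):
        p = prob_error_free(eps, length)
        print(f"error rate {eps:<7} length {length:<5} P(no deviation) = {p:.3g}")
```

Even a small per-token deviation rate makes a fully teacher-like 2000-token prefix unlikely, which is why most of the states the student visits at inference time are states that teacher-written data never covered.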

OPD brings this idea into LLM distillation.

Online OPD vs Offline OPD

If OPD is on-policy, why is Lightning OPD called offline on-policy distillation? There is no contradiction. The key is to distinguish "who generated the trajectory" from "when the teacher scored it."

Standard OPD is online. Every training round uses the current student to generate new answers. The teacher computes each token's log-prob on the spot, and then the student is updated. In the next round, the student parameters have changed, so generation, scoring, and updating happen again.

text
Current student generates answers
-> teacher scores them on the spot
-> update student
-> new student generates answers again
-> teacher scores them again
-> ...

This route is most faithful to the definition of on-policy, but it is expensive. The teacher is often much larger than the student, and standard OPD requires a live teacher server to run continuously throughout training. The Lightning OPD paper calls this the infrastructure bottleneck of standard OPD: the teacher must compute log-probs for new rollouts at every gradient step.[6]

Lightning OPD is an offline approximation. It first trains an SFT student, then uses that SFT student to generate a fixed set of answers. The teacher scores this set once and stores each token's log-prob. During actual OPD training, the teacher server is no longer started; training reads the cached teacher scores and computes the current student's own log-probs online.

text
Preprocessing stage:
SFT student generates answers
-> teacher scores once
-> store tokens and teacher log-probs

Training stage:
read fixed answers and teacher log-probs
-> compute current student log-probs
-> update student
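The two stages can be sketched in a few lines. This is a toy illustration, not the Lightning OPD implementation: the "models" are context-independent token distributions invented for the example, and `make_model`, `sample`, and `logprobs` are hypothetical helpers. What it shows is the data flow: the teacher is called only while building the cache, and training reads cached teacher log-probs while recomputing student log-probs.

```python
import math
import random


def make_model(weights):
    """Toy stand-in for a language model: one fixed next-token distribution."""
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def sample(model, length, rng):
    tokens = list(model)
    return rng.choices(tokens, weights=[model[t] for t in tokens], k=length)


def logprobs(model, tokens):
    return [math.log(model[t]) for t in tokens]


rng = random.Random(0)
sft_student = make_model({"a": 5, "b": 3, "c": 2})
teacher = make_model({"a": 8, "b": 1, "c": 1})

# Preprocessing stage: SFT student generates, teacher scores once, results are stored.
cache = []
for _ in range(4):
    tokens = sample(sft_student, length=6, rng=rng)
    cache.append({"tokens": tokens, "teacher_logps": logprobs(teacher, tokens)})

# Training stage: the teacher is not called again. Only the current student
# (whose parameters have moved since SFT) is evaluated on the fixed rollouts.
current_student = make_model({"a": 6, "b": 2, "c": 2})
for record in cache:
    student_logps = logprobs(current_student, record["tokens"])
    record["rewards"] = [
        t_lp - s_lp for t_lp, s_lp in zip(record["teacher_logps"], student_logps)
    ]

print(cache[0]["tokens"])
print([round(r, 3) for r in cache[0]["rewards"]])
```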

Lightning OPD is still on-policy because these fixed answers are not written by the teacher; they are generated by a student-family model. The teacher still gives feedback on student trajectories, preserving the most important part of OPD: the teacher grades what the student would actually write.

But Lightning OPD is no longer strict online OPD, because student parameters change during training while the rollouts are not refreshed. It relies on an empirical observation: in OPD / RL post-training, the student often does not drift far from its SFT initialization, so rollouts from the SFT student can approximate later student rollouts. The paper also emphasizes one condition: teacher consistency. The teacher that generated the SFT data and the teacher that scores OPD should preferably be the same; otherwise, cached trajectories and later scoring can introduce systematic bias.[6]

A practical choice looks like this:

| Scheme | Advantage | Cost | When it fits |
| --- | --- | --- | --- |
| Standard online OPD | Closest to the definition; training distribution is freshest. | Requires a live teacher and is expensive. | Studying mechanisms, seeking maximum stability, enough resources. |
| Lightning / offline OPD | Cheap, easy to reproduce, no teacher server needed. | Fixed rollouts; depends on teacher consistency. | Quick validation, limited resources, small student drift. |
| Ordinary offline KD | Simplest; direct SFT. | Learns only teacher trajectories. | Cold start; first pull the student into a reasonable region. |

So the more precise statement is: standard OPD is online on-policy distillation; Lightning OPD is an engineering approximation that moves the teacher scoring of standard OPD offline; ordinary teacher SFT is off-policy / offline distillation.

Why Reverse KL

The easiest source of confusion in distillation is the KL direction.

Classical KD often uses forward KL:

$$D_{\text{KL}}(q \| \pi_\theta)$$

It asks the student to cover the teacher's probability mass. For classification tasks, this is natural: if the teacher says cat 0.7, dog 0.2, fox 0.1, the student should learn the same soft label. But for long-text generation, covering everything the teacher might say can make the student's distribution too smooth. Many low-probability, marginal tokens are raised, and generation can drift.

MiniLLM's core judgment is that generative LLM distillation is better served by reverse KL:

$$D_{\text{KL}}(\pi_\theta \| q)$$

Reverse KL is mode-seeking. It encourages the student to concentrate on a small number of high-probability teacher modes, rather than covering every possible teacher mode. MiniLLM also implements this objective with on-policy optimization, reducing exposure bias in long-text generation.[7]
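The two directions can be compared on a toy example. Assume a teacher with two good tokens and two junk tokens, and two imperfect candidate students: one spreads mass over everything, the other commits to a single teacher mode. All numbers are invented for illustration.

```python
import math


def kl(p, q):
    """D_KL(p || q) over a shared token set."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def forward_kl(teacher, student):
    return kl(teacher, student)  # D_KL(q || pi)


def reverse_kl(student, teacher):
    return kl(student, teacher)  # D_KL(pi || q)


# Teacher: two good modes (A, B) and two junk tokens (C, D).
teacher = {"A": 0.49, "B": 0.49, "C": 0.01, "D": 0.01}
# Candidate students that cannot match the teacher exactly.
covering = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
single_mode = {"A": 0.97, "B": 0.01, "C": 0.01, "D": 0.01}

for name, student in [("covering", covering), ("single_mode", single_mode)]:
    print(
        f"{name:12s} forward={forward_kl(teacher, student):.3f} "
        f"reverse={reverse_kl(student, teacher):.3f}"
    )
```

Forward KL ranks the covering student as better even though it puts half its mass on junk tokens; reverse KL prefers the student that commits to one mode the teacher actually supports.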

But reverse KL has a practical limitation: it can only correct the policy in regions the student currently samples. If an important token has probability nearly 0 under the student's initialization, the student will almost never sample it, and the teacher has no opportunity to assign it a high score. This is why many practical recipes are not "direct OPD" but:

  1. First use off-policy SFT / SeqKD for cold start, pulling the student near the teacher's support.
  2. Then use OPD to refine the student on its own trajectories.
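The reason for the cold start can be put in numbers. In the sketch below (hypothetical helper names, invented probabilities), a token the teacher strongly prefers contributes almost nothing to the on-policy signal when the student assigns it near-zero probability, because the term is weighted by how often the student samples it.

```python
import math


def expected_signal(student_prob, teacher_prob):
    """One token's contribution to E_{y ~ pi}[log q(y) - log pi(y)] at a state."""
    return student_prob * (math.log(teacher_prob) - math.log(student_prob))


def prob_sampled_at_least_once(student_prob, num_samples):
    """Chance the token shows up in any of num_samples independent draws."""
    return 1.0 - (1.0 - student_prob) ** num_samples


teacher_prob = 0.9  # the teacher strongly prefers this token
for student_prob in (1e-9, 1e-4, 1e-2, 0.3):
    print(
        f"pi={student_prob:<8g} "
        f"signal={expected_signal(student_prob, teacher_prob):.3g} "
        f"P(seen in 8 rollouts)={prob_sampled_at_least_once(student_prob, 8):.3g}"
    )
```

SFT on teacher data raises the student's probability for such tokens directly, after which on-policy sampling can reach them.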

Thinking Machines Lab's reproduction uses the same pattern: first do off-policy reasoning distillation, then improve with OPD post-training. A 2026 mechanism analysis explains this more thoroughly: OPD success depends not only on teacher scores, but also on whether teacher and student form an optimizable local overlap near the student's current states.[8]

Another engineering route is to tune the divergence itself. DistiLLM uses skew KL and adaptive off-policy strategies, trying to find a smoother tradeoff between teacher signal and student learnability.[9]

Why Strong Teachers Can Still Fail

Intuitively, doing OPD sounds like "find a stronger teacher and compress its ability into the student." But what OPD really transfers is not leaderboard score; it is the teacher's local preference field on the student's own prefixes.

More concretely, the signal becomes useful gradient only when the teacher, at a real student state $c_t=(x,y_{<t})$, can reorder candidate tokens that the student is already seriously considering. If the teacher is strong but its high-probability tokens barely overlap with the student's high-probability tokens, reverse KL sees two disjoint supports. The student cannot sample the key tokens the teacher wants, and the teacher can only assign low scores to strange tokens the student already wrote. Training becomes a process where every step is criticized, but no absorbable direction is provided.

This is the concrete meaning of "thinking-pattern consistency": whether teacher and student use similar intermediate language to think about the same problem. A math teacher may prefer to write a full derivation first, while a student may first guess a formula. A reasoning teacher may explicitly split the problem into subproblems, while a base student only continues the prompt with a short answer. Both may eventually produce a correct answer, but their intermediate paths differ, so token-by-token distillation can pull them against each other.

The 2026 mechanism analysis reframes OPD from "ability transfer" into a problem of "local teachability."[8] A high teacher score only means the teacher can solve the problem; a teachable teacher can provide directions that the student can absorb at its current position. In SFT, these two things are often conflated because SFT directly pulls the student onto teacher trajectories. But in OPD, the student first walks its own path, and the teacher can only speak where the student arrives. The core question changes from "how strong is the teacher?" to "can the teacher's preferences form gradients on the student's current distribution?"

High score is not the same as new knowledge. If the teacher is only a larger sibling from the same training pipeline, it may simply have absorbed the same data and recipe more fully. For the student, its local preferences may provide no new direction, only stronger confidence on an existing distribution. A teacher that has undergone additional RL post-training, data expansion, or task-specialized training may introduce decision boundaries the student has not seen, even if its parameter count is not extreme.

OPD does not compress the teacher's answers into the student; it compresses the teacher's "local tradeoff pattern" into the student. A teacher that is merely "larger but homogeneous" may give the student stronger confirmation bias. A teacher that has experienced new RL, seen new failure cases, or formed new reasoning habits is more likely to provide transferable structure.

This also explains why OPD is more selective about teachers than ordinary distillation. Off-policy SFT can force the student to look at teacher trajectories. OPD performs local navigation on the student's own terrain: if there is no gradient near the current position leading toward the teacher's mode, even a much larger teacher can only report the answer from far away.

Overlap Tokens Are the Main Battlefield

One of the most illuminating experiments in the mechanism analysis splits the token set apart: optimizing only on overlap tokens, those to which both student and teacher assign high probability, costs almost no performance, while optimizing only on non-overlap tokens barely helps. This explains why top-$k$ OPD is often enough, and also why the loss in a failed run may still move while capability does not improve.

The next-token distribution of a language model is very sharp: most probability mass is concentrated on a small number of candidates. If these candidate tokens are shared by teacher and student, the teacher's log-prob acts like a fine-grained ranker, telling the student, "you are already in the right candidate set; move weight toward the better option." If the candidate sets are not shared, the teacher's feedback mainly rejects the token sampled by the student without pulling the student into a new high-probability region.
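One simple way to quantify whether candidate sets are shared is a top-$k$ overlap ratio at a given state. The definition below is an assumption made for illustration and may differ from the metric used in the paper; the distributions are invented.

```python
def top_k_tokens(dist, k):
    """The k highest-probability tokens of a next-token distribution."""
    ranked = sorted(dist.items(), key=lambda item: item[1], reverse=True)
    return {tok for tok, _ in ranked[:k]}


def overlap_ratio(student_dist, teacher_dist, k):
    """Fraction of the student's top-k candidates that are also in the teacher's top-k."""
    shared = top_k_tokens(student_dist, k) & top_k_tokens(teacher_dist, k)
    return len(shared) / k


student = {"so": 0.50, "thus": 0.30, "hence": 0.15, "banana": 0.05}
# A teacher that ranks the same candidates differently.
teacher_close = {"thus": 0.60, "so": 0.30, "hence": 0.08, "banana": 0.02}
# A teacher whose preferred continuations the student barely considers.
teacher_far = {"Let": 0.70, "First": 0.25, "so": 0.03, "thus": 0.02}

print(overlap_ratio(student, teacher_close, k=2))
print(overlap_ratio(student, teacher_far, k=2))
```

In the first case the teacher acts as a ranker over candidates the student already holds; in the second it can only reject what the student sampled.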

Dense supervision does not mean every token is equally useful. The truly useful tokens are those the student can already imagine and the teacher can further rank. Overlap tokens are like a bridge: one end connects to the student's current ability, and the other connects to the teacher's better preference. Non-overlap tokens are more like a distant beacon: they can show that the direction is wrong, but they do not tell the student what the next step should be.

The paper also observes a counterintuitive phenomenon: the global reward from a failed teacher can still distinguish correct and incorrect rollouts, but this information does not form locally usable optimization geometry. Being able to score globally is not the same as being able to teach locally. This explains why some OPD runs appear to have a functioning reward and a non-weak teacher, yet the student still does not learn.

The Free Lunch Can Spoil in Long Chains

OPD is attractive because every token has reward, but long chains of thought expose a weakness: the later the trajectory goes, the more likely the student prefix is to drift away from the teacher's familiar distribution, and the more the teacher's log-prob on those unfamiliar prefixes looks like noise. The paper observes a backward-spreading entropy collapse in long responses: the suffix first becomes high-entropy and unstable, and then this instability gradually propagates back to earlier positions.

This shows that OPD's scaling bottleneck is not only expensive teacher forward passes, but also that the reliability of dense supervision decreases with trajectory depth. In short math problems or formatted answers, the teacher's grading of student prefixes is usually reliable. In 15K-token long reasoning, tool use, or multi-turn agent trajectories, the teacher may already be outside the state distribution it saw during training. More stable approaches include segmented distillation, mixing in sequence-level verifiers, limiting the horizon of each segment, or applying token-level loss only in high-confidence overlap regions.
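Two of these mitigations, limiting the horizon and restricting the token-level loss to overlap regions, amount to building a mask over response positions. The sketch below is one hypothetical way to do that; `overlap_flags` is assumed to be computed elsewhere (for example from a top-$k$ comparison of student and teacher candidates), and the horizon value is arbitrary.

```python
def dense_loss_mask(overlap_flags, max_horizon):
    """Keep position t only if it lies within the horizon and is flagged as overlap."""
    return [flag and (t < max_horizon) for t, flag in enumerate(overlap_flags)]


def masked_mean(values, mask):
    """Average of the values at kept positions; 0.0 if nothing is kept."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0


# Per-token rewards for one response (invented), plus overlap flags per position.
token_rewards = [1.0, 2.0, 3.0, 4.0]
overlap_flags = [True, False, True, True]

mask = dense_loss_mask(overlap_flags, max_horizon=3)
print(mask)
print(masked_mean(token_rewards, mask))
```

Positions outside the mask would then rely on a sequence-level signal such as a verifier, rather than on teacher log-probs that may no longer be reliable.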

Hands-On: Minimal OPD Scoring Implementation

The previous sections clarified OPD's mechanism and pitfalls. Now we implement the most central OPD step in code: the student generates an answer, and the teacher scores the student's trajectory token by token.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-0.5B-Instruct"
teacher_name = "Qwen/Qwen2.5-1.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(
    student_name, torch_dtype=torch.bfloat16, device_map="auto"
)
teacher = AutoModelForCausalLM.from_pretrained(
    teacher_name, torch_dtype=torch.bfloat16, device_map="auto"
)
student.eval()
teacher.eval()

prompt = "Solve: if x + 3 = 7, what is x? Show your work."
inputs = tokenizer(prompt, return_tensors="pt").to(student.device)
prompt_len = inputs["input_ids"].shape[1]

# Step 1: student generates an answer.
with torch.no_grad():
    output_ids = student.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.7,
    )

full_ids = output_ids


# Step 2: compute per-token log-probs from both student and teacher.
def next_token_logps(model, input_ids):
    """Compute the log probability of each token in the given sequence.

    logps[:, i] is the log-prob of token i+1 predicted by the logits at position i.
    """
    logits = model(input_ids).logits
    # Upcast to float32: log-softmax over a large vocabulary loses precision in bfloat16.
    logps = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    next_ids = input_ids[:, 1:]
    return logps.gather(-1, next_ids.unsqueeze(-1)).squeeze(-1)


with torch.no_grad():
    student_logps = next_token_logps(student, full_ids)
    teacher_logps = next_token_logps(teacher, full_ids.to(teacher.device)).to(student_logps.device)

# Step 3: compute per-token reward: teacher approval minus student confidence.
gen_mask = torch.zeros_like(student_logps, dtype=torch.bool)
gen_mask[:, prompt_len - 1 :] = True  # Only score the generated part.

token_rewards = teacher_logps - student_logps
generated_ids = full_ids[:, 1:][gen_mask]
generated_rewards = token_rewards[gen_mask]

for tok_id, reward in zip(generated_ids[:32], generated_rewards[:32]):
    token = tokenizer.decode([tok_id.item()])
    print(f"{token!r:12s} reward={reward.item():+.3f}")

Design points:

  • next_token_logps() is used for both student and teacher to compute the log probability at each position. This is the same log-prob calculation used in GRPO; OPD reuses much of the same engineering infrastructure as RL.
  • token_rewards = teacher_logps - student_logps: this is OPD's per-token reward. A positive value means the teacher approves this token more than the student does; a negative value means the teacher does not approve it.
  • This code only performs the "scoring" part. To turn it into a training loop, three pieces remain: batch rollout, reward normalization/clipping, and policy-gradient updates.

Pseudocode for the training loop:

python
for prompts in dataloader:
    # Step 1: student rollout, then score the same trajectories with both models.
    # Rollout-time log-probs are constants, so no gradients are tracked here.
    trajectories = student.rollout(prompts)
    with torch.no_grad():
        old_logps = student.logprobs(trajectories)
        teacher_logps = teacher.logprobs(trajectories)

    # Step 2: compute per-token advantage
    advantages = teacher_logps - old_logps
    advantages = normalize_and_mask(advantages, trajectories.response_mask)

    # Step 3: policy-gradient update
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = policy_gradient_loss(
        new_logps=student.logprobs(trajectories),  # recomputed with gradients
        old_logps=old_logps,
        advantages=advantages,
    )
    loss.backward()
    optimizer.step()

Real systems also add KL to a reference model, length control, repetition penalties, prompt difficulty sampling, and eval gating. OPD is not "one formula and done"; it connects teacher log-probs to existing RL training infrastructure.
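The `normalize_and_mask` helper in the pseudocode is left undefined. One possible version is sketched below with plain Python lists so it runs without any dependencies; a real training script would operate on tensors. Standardizing over response tokens and clipping to a fixed range are assumptions chosen for illustration, and the default `clip` value is arbitrary.

```python
import math


def normalize_and_mask(advantages, response_mask, clip=5.0, eps=1e-8):
    """Standardize advantages over response tokens, clip them, zero the rest.

    advantages, response_mask: lists of per-sequence lists with matching shapes.
    """
    kept = [
        a
        for adv_row, mask_row in zip(advantages, response_mask)
        for a, keep in zip(adv_row, mask_row)
        if keep
    ]
    if not kept:
        return [[0.0] * len(row) for row in advantages]

    mean = sum(kept) / len(kept)
    std = math.sqrt(sum((a - mean) ** 2 for a in kept) / len(kept))

    normalized = []
    for adv_row, mask_row in zip(advantages, response_mask):
        normalized.append([
            max(-clip, min(clip, (a - mean) / (std + eps))) if keep else 0.0
            for a, keep in zip(adv_row, mask_row)
        ])
    return normalized


# The third position is a prompt or padding token and is ignored entirely.
print(normalize_and_mask([[1.0, 3.0, 100.0]], [[True, True, False]]))
```

Whether to subtract the mean at all is a design choice: it reduces variance, but it also means a token's sign no longer directly says whether the teacher likes it more than the student does.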

OPD and Neighboring Methods

Relationship with SFT / SeqKD

SFT is a necessary foundation. It is cheap, stable, and can quickly pull the student into a reasonable output region. OPD does not replace SFT; it solves error correction on the student's own trajectories after SFT.

A simple analogy:

  • SFT / SeqKD: pave the road into regions the teacher often visits.
  • OPD: the student drives by itself, while the teacher sits in the passenger seat and corrects it step by step.

Relationship with GRPO / RLVR

GRPO / RLVR rewards usually come from external verifiers: whether the answer is correct, whether code runs, or whether the format is valid. These rewards are objective, but often sparse. A 2000-token solution process receives only a final 0 or 1.

OPD's reward comes from the teacher, and every token can receive a signal. It does not require a reference answer or an RM, but its ceiling is limited by the teacher.

The two are therefore natural to combine. For math problems, a rule reward can tell the model whether the final answer is correct, while teacher log-probs provide denser shaping signals for the intermediate process. Thinking Machines Lab also lists "distillation-style per-token reward plus sequence-level environment reward" as a direction worth exploring further.[2]
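A minimal sketch of one way such a combination could look. The mixing rule here, adding a weighted outcome reward to the final token of the response, is a hypothetical choice for illustration and is not taken from the cited work; `outcome_weight` is an invented parameter.

```python
def combined_token_rewards(teacher_logps, student_logps, outcome_reward, outcome_weight=1.0):
    """Dense teacher signal on every token plus a sequence-level reward at the end."""
    if not teacher_logps or len(teacher_logps) != len(student_logps):
        raise ValueError("expected two non-empty log-prob lists of equal length")
    # Dense part: log q(y_t | c_t) - log pi(y_t | c_t) for each response token.
    rewards = [t_lp - s_lp for t_lp, s_lp in zip(teacher_logps, student_logps)]
    # Sparse part: the verifier's 0/1 result is credited to the last token.
    rewards[-1] += outcome_weight * outcome_reward
    return rewards


# Two-token response, verifier says the final answer is correct (reward 1.0).
print(combined_token_rewards([-1.0, -2.0], [-1.5, -1.0], outcome_reward=1.0, outcome_weight=0.5))
```

The weight controls how much the verifier's verdict can override the teacher's local preferences, which matters when the teacher dislikes a path that nevertheless reaches the right answer.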

Relationship with DPO

DPO is elegant because the language model itself can be interpreted as an implicit reward model. OPD goes one step further: it directly treats a strong teacher as a token-by-token reward model.

But DPO is offline preference optimization, suitable when chosen / rejected data already exists. OPD is online sampling, suitable when you have a teacher but lack preference data or a verifier.

Relationship with Black-Box Distillation

The OPD described above assumes access to teacher log-probs, meaning a white-box teacher or at least logprob access. In practice, many teachers only provide API text outputs and no logits. Black-box OPD must use a different signal. For example, 2025's GAD treats the student as a generator, trains a discriminator to distinguish teacher and student answers, and then uses that discriminator as an on-policy reward model that evolves with the student.[10]

This line is practically valuable, but its engineering complexity is also higher: it introduces another discriminator that can be hacked, can drift, and must be trained stably.

There are also teacher-free variants. Self-Distilled Reasoner lets the same model play teacher and student under different contexts, using self-distillation to reduce dependence on an external strong teacher.[11]

When to Use OPD

OPD is best suited for these scenarios:

| Scenario | Why OPD fits |
| --- | --- |
| Small model inheriting a large model's reasoning ability | The small model explores poorly, and the teacher can give dense process signals. |
| No rule verifier | No reference answer is needed; the teacher only needs to evaluate tokens. |
| Domain-model post-training | A strong teacher can restore instruction following or formatting. |
| Already has decent SFT initialization | The student can sample tokens near the teacher's support. |
| Want to reduce RL cost | Teacher forward can be more efficient than full RL exploration with sparse rewards. |

It is not suitable for these scenarios:

| Scenario | Risk |
| --- | --- |
| Want to clearly surpass the teacher | OPD essentially compresses teacher behavior; it does not discover new strategies. |
| Student is too weak and cannot sample key tokens | Reverse KL cannot reinforce behavior near zero probability. |
| Teacher and student have conflicting reasoning styles | Token-by-token signals may pull against each other and destabilize training. |
| Long-horizon tasks rely only on local token reward | Local alignment does not necessarily imply global task success. |
| Only a black-box teacher API is available | Requires extra reward/discriminator design. |
| Teacher is only a scaled-up version from the same pipeline | It may have a higher score but provide no new decision boundary for the student. |

The 2026 OPD survey organizes the field along three dimensions: feedback signal (logit-based, outcome-based, self-play), teacher access (white-box, black-box, teacher-free), and loss granularity (token-level, sequence-level, hybrid).[12] This classification is useful: when you see a new "OPD method," ask these three questions first, and the name will not confuse you.

Practical Guide

A Practical Recipe

When running OPD in practice, use the following steps.

Step 1: choose the teacher. The teacher is not necessarily better just because it is larger. The key is that it is stronger than the student on the target task, can provide abilities the student has not yet learned, and is as compatible as possible with the student's tokenizer, output style, and reasoning format. Same-family models are usually easier, but "larger same-family model" is not a sufficient condition. It is best to run a small comparison: a scaled-up teacher from the same pipeline versus a teacher improved by extra RL or data augmentation. If the latter is clearly better, it suggests OPD needs new decision boundaries, not just parameter scale.

Step 2: do an off-policy cold start. Use the teacher to generate a batch of high-quality answers and run lightweight SFT on them before any OPD. The goal is not to solve everything in one step, but to move the student near the teacher's support. If the initial overlap ratio is low, direct OPD often cannot start; once the student can sample similar reasoning paths, switch to on-policy.
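One way to make the overlap check concrete is to measure how often the student's sampled token falls inside the teacher's top-k candidates for the same prefix. The sketch below assumes that definition of "overlap ratio"; the data layout (`teacher_topk_ids`) and the `threshold` value are illustrative assumptions, not values from the source.

```python
from typing import List, Sequence


def overlap_ratio(
    student_tokens: Sequence[int],
    teacher_topk_ids: Sequence[Sequence[int]],
) -> float:
    """Fraction of student-sampled tokens that appear in the teacher's top-k set.

    student_tokens[t]   : token id the student sampled at position t
    teacher_topk_ids[t] : teacher's top-k token ids given the same prefix
    """
    if len(student_tokens) != len(teacher_topk_ids):
        raise ValueError("one top-k set is required per student token")
    if not student_tokens:
        return 0.0
    hits = sum(tok in set(topk) for tok, topk in zip(student_tokens, teacher_topk_ids))
    return hits / len(student_tokens)


def needs_cold_start(ratios: List[float], threshold: float = 0.5) -> bool:
    """Hypothetical rule of thumb: run SFT first if mean overlap is below threshold."""
    return sum(ratios) / len(ratios) < threshold


# Toy example: 3 of the 4 student tokens are inside the teacher's top-2.
ratio = overlap_ratio([5, 9, 2, 7], [[5, 1], [9, 3], [4, 6], [7, 2]])
print(ratio)  # 0.75
```

Tracking this number before and after the cold start gives a simple signal for when to switch to on-policy training.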

Step 3: choose the prompt distribution. Prompts are not only task data; they determine which states the student will enter. Use templates, system prompts, and problem types familiar to the teacher as much as possible, so early OPD has enough overlap. At the same time, mix in some OOD prompts to prevent the student from learning only the teacher's fixed verbal habits and low-entropy templates.

Step 4: sample student rollouts. For each prompt, sample 2-8 answers, and keep tokens, log-probs, masks, lengths, and stop reasons. This is basically the same rollout infrastructure used by PPO / GRPO.
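The fields listed in Step 4 can be kept together in one record per sample. This is a minimal sketch; the class name `Rollout` and its field names are assumptions for illustration and do not correspond to any particular framework's data structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One student sample, holding the fields Step 4 says to keep."""
    prompt_ids: List[int]
    response_ids: List[int]         # tokens sampled by the student
    student_logprobs: List[float]   # log-prob of each sampled token at sampling time
    loss_mask: List[int]            # 1 = train on this token, 0 = skip it
    stop_reason: str                # e.g. "eos" or "length"

    def __post_init__(self) -> None:
        n = len(self.response_ids)
        if not (len(self.student_logprobs) == n == len(self.loss_mask)):
            raise ValueError("per-token fields must align with response_ids")

    @property
    def length(self) -> int:
        return len(self.response_ids)

    @property
    def truncated(self) -> bool:
        # A response cut off by the token limit never reached a real ending.
        return self.stop_reason == "length"


r = Rollout([1, 2], [10, 11, 12], [-0.1, -0.7, -0.3], [1, 1, 1], "eos")
print(r.length, r.truncated)  # 3 False
```

Validating alignment at construction time catches mask and log-prob offsets early, before they silently corrupt the per-token reward.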

Step 5: teacher scoring. Run a teacher forward pass on the complete contexts generated by the student and obtain the log-prob of each generated token. A white-box teacher can compute these log-probs directly from its logits; a black-box teacher requires a separate reward approximation. Do not treat the teacher as an absolute judge. Treat it as a local preference: it is telling the student which choices in the prefixes the student itself wrote look more like paths the teacher would keep.

Step 6: update the student. Use teacher_logp - student_logp as the per-token advantage, then connect it to a PPO-style loss or importance-sampling loss. In practice, monitor entropy, KL, response length, and repetition rate to avoid premature collapse. For long-response tasks, apply loss only to the first several tokens, segmented windows, or high-confidence overlap tokens; do not assume an entire 10K-token trajectory is equally reliable.
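The update in Step 6 can be sketched in plain Python. This is an illustration under stated assumptions: the clipping bound `clip`, the PPO range `eps`, and both function names are hypothetical, and a real implementation would run on tensors inside the training framework.

```python
import math
from typing import List, Sequence


def opd_advantages(
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    mask: Sequence[int],
    clip: float = 5.0,
) -> List[float]:
    """Per-token advantage = teacher_logp - student_logp, clipped, zero where masked."""
    advantages = []
    for t_lp, s_lp, m in zip(teacher_logp, student_logp, mask):
        clipped = max(-clip, min(clip, t_lp - s_lp))
        advantages.append(clipped if m else 0.0)
    return advantages


def clipped_surrogate_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    mask: Sequence[int],
    eps: float = 0.2,
) -> float:
    """Mean PPO-style clipped loss over unmasked tokens."""
    total, count = 0.0, 0
    for n_lp, o_lp, adv, m in zip(new_logp, old_logp, advantages, mask):
        if not m:
            continue
        ratio = math.exp(n_lp - o_lp)
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
        total += -min(ratio * adv, clipped_ratio * adv)
        count += 1
    return total / count if count else 0.0


teacher = [-0.2, -3.0, -0.5]
student = [-0.5, -0.4, -0.5]
mask = [1, 1, 0]
adv = opd_advantages(teacher, student, mask)
print([round(a, 2) for a in adv])  # [0.3, -2.6, 0.0]
# On the first update the new policy equals the sampling policy, so ratio = 1.
loss = clipped_surrogate_loss(student, student, adv, mask)
```

The second token shows the typical case: the student was confident in a token the teacher considers unlikely, so it receives a large negative advantage.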

Step 7: mix with task reward. If the task has a verifier, do not waste it. Use final reward for sequence-level direction and OPD reward for token-level shaping. This compensates for one blind spot of OPD: a token locally approved by the teacher does not necessarily guarantee a globally correct answer.
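One possible way to combine the two signals is to broadcast a group-normalized outcome advantage to every token and add the OPD term on top. The sketch assumes GRPO-style normalization within a prompt's sample group; the weight `opd_coef` and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_outcome_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize verifier rewards within one prompt's group of samples."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all samples agree: no sequence-level signal
    return [(r - mu) / sigma for r in rewards]


def mixed_token_advantages(
    outcome_adv: float,
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    opd_coef: float = 1.0,
) -> List[float]:
    """Sequence-level direction plus token-level shaping from the teacher."""
    return [outcome_adv + opd_coef * (t - s) for t, s in zip(teacher_logp, student_logp)]


# Four samples for one prompt; the verifier accepts the first and the last.
seq_adv = group_outcome_advantages([1.0, 0.0, 0.0, 1.0])
print(seq_adv)  # [1.0, -1.0, -1.0, 1.0]

token_adv = mixed_token_advantages(seq_adv[0], [-0.5, -1.0], [-1.0, -0.5], opd_coef=0.5)
print(token_adv)  # [1.25, 0.75]
```

When every sample in a group gets the same verifier reward, the outcome term vanishes and the teacher signal is the only thing left, which is exactly the case where dense shaping helps most.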

Step 8: run eval gating. OPD can easily compress the teacher's style into the student as well. Besides the target benchmark, evaluate general capability, formatting, refusal behavior, safety, and length distribution.
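Two of the cheaper gating checks, repetition and length, can be computed directly from token ids. The sketch below assumes repetition is measured as the share of repeated n-grams within a response; the choice of `n` and the function names are illustrative.

```python
from typing import Dict, List, Sequence


def repetition_rate(tokens: Sequence[int], n: int = 4) -> float:
    """Share of n-grams that already occurred earlier in the same response."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    seen, repeats = set(), 0
    for i in range(total):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total


def length_stats(responses: List[Sequence[int]]) -> Dict[str, float]:
    """Mean and maximum response length, to compare before and after training."""
    lengths = [len(r) for r in responses]
    return {"mean": sum(lengths) / len(lengths), "max": float(max(lengths))}


# The 3-grams (1,2,3), (2,3,1), (3,1,2) each appear twice: 3 repeats out of 6.
print(repetition_rate([1, 2, 3, 1, 2, 3, 1, 2], n=3))  # 0.5
print(length_stats([[1, 2, 3], [1, 2, 3, 4, 5]]))      # {'mean': 4.0, 'max': 5.0}
```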

Quick Test Plan

For a first OPD validation, do not start with large-scale training. Use small experiments to answer three questions:

  1. Is the teacher's token-level signal meaningful?
  2. Can an offline approximation show positive movement first?
  3. Is online rollout clearly better than offline caching?

0. First Ask Whether the Teacher Is Actually Learnable

This is the fastest sanity check and can be done in tens of minutes. The goal is to judge whether the teacher is speaking near the student's current ability.

Choose 50-100 prompts, and let the student generate 2-4 answers for each. The teacher does not generate answers; it only computes token-by-token log-probs for the student's answers. Then inspect manually: are the tokens that receive high teacher scores reasonable continuations the student could plausibly write? Do low-score positions correspond to reasoning branches, format drift, or premature conclusions? If high and low scores mainly punish length, punish a certain template, or push all student tokens down, the teacher and student are not in the same thinking space.
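The manual inspection can be supported by two small helpers: one that surfaces the positions the teacher scores lowest, and one that checks whether the teacher's mean score simply tracks response length. Both are sketches with hypothetical names; a strong negative correlation is only a hint that the teacher may be punishing length, not proof.

```python
import math
from typing import List, Sequence, Tuple


def lowest_scored_positions(
    tokens: Sequence[str],
    teacher_logp: Sequence[float],
    k: int = 3,
) -> List[Tuple[int, str, float]]:
    """Return the k positions the teacher likes least, for manual reading."""
    ranked = sorted(range(len(tokens)), key=lambda i: teacher_logp[i])
    return [(i, tokens[i], teacher_logp[i]) for i in ranked[:k]]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


worst = lowest_scored_positions(
    ["So", "the", "answer", "is", "42"],
    [-0.1, -0.2, -4.0, -0.3, -2.5],
    k=2,
)
print(worst)  # [(2, 'answer', -4.0), (4, '42', -2.5)]

# Toy data in which longer answers always get a lower mean teacher score.
length_corr = pearson([100, 200, 300], [-0.5, -1.0, -1.5])
```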

This step directly checks the core insight of the 2026 mechanism analysis: OPD often fails not because the teacher "cannot score," but because the teacher's scores do not land where the student can adjust. A globally informative reward can still leave training stuck if it gives no absorbable direction at each local token choice.

1. Lightning OPD Smoke Test

Next, run the offline version before any online training, because it is cheap, stable, and easy to reproduce.

The data scale can be small:

  • Training prompts: 200-1000
  • Validation prompts: 50-200
  • Student: a 0.5B-1.5B small model
  • Teacher: a larger same-family model, or the strong model you want to distill
  • Training: LoRA is enough; run 100-500 steps first and watch the trend

The process is:

```text
1. Use the SFT student to generate fixed answers.
2. Precompute each generated token's log-prob with the teacher.
3. Train the student to increase the probability of tokens approved by the teacher.
4. Regenerate answers on held-out prompts.
5. Compare before and after: task score, length, repetition rate, teacher score.
```

The minimal acceptable result is not merely "teacher score goes up," but:

| Observation | Expected result |
| --- | --- |
| held-out task score | Slight improvement, or at least no degradation. |
| average response length | No obvious explosion and no extreme shortening. |
| repetition rate | Does not increase. |
| teacher score | Goes up, but not by gaming with shorter answers, templates, or repetition. |
| manual samples | In at least 20 samples, most look more like the teacher in a useful way. |

If teacher score rises but task score falls, the student may be learning the teacher's local verbal habits rather than ability. In that case, check masks, length normalization, prompt distribution, and teacher consistency first.
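The acceptance criteria can be turned into an automatic check that runs after each smoke test. This is a sketch: `EvalSnapshot`, `smoke_test_verdict`, and the `max_length_change` threshold are assumptions chosen for illustration, and the manual sample review still has to be done by a person.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalSnapshot:
    task_score: float        # held-out task score
    mean_length: float       # average response length in tokens
    repetition_rate: float
    teacher_score: float     # mean teacher log-prob of student tokens


def smoke_test_verdict(
    before: EvalSnapshot,
    after: EvalSnapshot,
    max_length_change: float = 0.5,
) -> List[str]:
    """Return a list of warnings; an empty list means the minimal bar is met."""
    warnings = []
    if after.task_score < before.task_score:
        warnings.append("task score degraded")
    rel_len = abs(after.mean_length - before.mean_length) / before.mean_length
    if rel_len > max_length_change:
        warnings.append("response length shifted sharply")
    if after.repetition_rate > before.repetition_rate:
        warnings.append("repetition increased")
    if after.teacher_score <= before.teacher_score:
        warnings.append("teacher score did not rise")
    elif warnings:
        warnings.append("teacher score rose while other metrics regressed: possible style gaming")
    return warnings


before = EvalSnapshot(0.40, 300.0, 0.02, -1.2)
healthy = EvalSnapshot(0.42, 320.0, 0.02, -0.9)
gamed = EvalSnapshot(0.35, 120.0, 0.02, -0.6)

print(smoke_test_verdict(before, healthy))  # []
print(len(smoke_test_verdict(before, gamed)))  # 3
```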

2. Small Online OPD Comparison

If the offline smoke test gives a positive signal, run a small online comparison. Only run 2-3 rounds:

```text
Round 1: current student generates rollout -> teacher scores -> train 50-100 steps
Round 2: updated student regenerates rollout -> teacher rescoring -> train again
Round 3: optional, observe whether improvement continues
```

The comparison group is Lightning OPD: same prompts, same teacher, same training steps, but fixed rollouts that are not refreshed. Check whether online is clearly better than offline.

| If the result is... | Conclusion |
| --- | --- |
| offline is already close to online | Lightning OPD is more economical; no need for a live teacher. |
| online is clearly better | Student distribution drift is large; refreshing rollouts is valuable. |
| online is unstable and offline is steadier | Teacher signal may be noisy, or online sampling quality is too poor. |
| neither improves | Return first to SFT data, teacher choice, and task evaluation. |

This test helps decide the engineering route. If offline is enough, use Lightning first. If online clearly wins, then consider building a teacher server.
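With only 50-200 validation prompts, "clearly better" is easy to misjudge by eye. One hedged way to quantify it is a paired bootstrap over per-prompt scores; treating an interval that excludes zero as "clearly better" is an assumption of this sketch, not a rule from the source.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    online: List[float],
    offline: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean per-prompt gain of online over offline with a 95% bootstrap interval."""
    if len(online) != len(offline):
        raise ValueError("scores must be paired by prompt")
    diffs = [a - b for a, b in zip(online, offline)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), lo, hi


def online_clearly_better(online: List[float], offline: List[float]) -> bool:
    """True when the whole 95% interval of the gain lies above zero."""
    _, lo, _ = paired_bootstrap(online, offline)
    return lo > 0


# Toy per-prompt scores on the same held-out prompts.
print(online_clearly_better([0.6, 0.7, 0.8], [0.5, 0.5, 0.5]))  # True
print(online_clearly_better([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]))  # False
```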

3. Minimal Experiment Report Template

Every OPD run should record at least this table:

| Item | Content |
| --- | --- |
| student / teacher | Model names, parameter sizes, whether they are in the same family |
| data | Prompt source, quantity, whether deduplicated against evaluation |
| rollout | Number of samples per prompt, temperature, max tokens |
| reward | Whether teacher log-prob is length-normalized and clipped |
| training | online or offline, steps, LoRA rank |
| evaluation | task score, length, repetition rate, manual samples |
| insight notes | What new thing the teacher taught, whether the student actually absorbed it |
| conclusion | Continue online, use Lightning, or go back to SFT |

This table is much more important than looking only at loss. A decreasing OPD loss only means the student is becoming more like the teacher; it does not automatically mean the model is better at solving the task.
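To keep reports comparable across runs, the template can be stored as a structured record and written next to the checkpoints. The class and field names below are assumptions that mirror the table rows, and every value in the example is a placeholder rather than a real result.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class OPDRunReport:
    student: str
    teacher: str
    same_family: bool
    prompt_source: str
    num_prompts: int
    deduplicated_against_eval: bool
    samples_per_prompt: int
    temperature: float
    max_tokens: int
    reward_length_normalized: bool
    reward_clip: Optional[float]
    training_mode: str                 # "online" or "offline"
    steps: int
    lora_rank: Optional[int]
    evaluation: Dict[str, float] = field(default_factory=dict)
    insight_notes: str = ""
    conclusion: str = ""               # continue online / use Lightning / back to SFT


report = OPDRunReport(
    student="<student model name>", teacher="<teacher model name>", same_family=True,
    prompt_source="<prompt set>", num_prompts=500, deduplicated_against_eval=True,
    samples_per_prompt=4, temperature=1.0, max_tokens=2048,
    reward_length_normalized=False, reward_clip=5.0,
    training_mode="offline", steps=200, lora_rank=16,
)
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
```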

Open-Source Framework Support

This section lists the OPD and OPSD support status of mainstream training frameworks, along with their usage entry points. All information is verified against the current source code and official documentation of each framework.

Terminology Quick Reference

| Abbreviation | Meaning |
| --- | --- |
| OPD | On-Policy Distillation. The student generates its own rollouts, and an independent external teacher model scores each token. |
| OPSD | On-Policy Self-Distillation. The same model plays both teacher and student. The teacher receives extra information via privileged context (e.g., ground-truth answer, a "be concise" prefix) to supervise the student. |

OPD Framework Details

1. slime (THUDM)

slime is the post-training framework behind GLM-4.5 / 4.6 / 4.7. OPD is a first-class feature, designed as an additive KL penalty that can be stacked on any RL algorithm.

Supported modes:

  • SGLang mode: the teacher runs on an independent SGLang server. Token-level log-probs are fetched via --rm-url during rollout. Suitable when the teacher has a different architecture or does not fit in GPU memory alongside the student.
  • Megatron mode: the teacher is loaded directly into the Megatron training process via --opd-teacher-load. Requires the teacher to share the same architecture as the policy / ref model.

Minimal launch command (Megatron mode):

```bash
python train.py \
  --use-opd \
  --opd-type megatron \
  --opd-kl-coef 1.0 \
  --opd-teacher-load /path/to/teacher_ckpt \
  --advantage-estimator grpo   # can also use ppo / reinforce_plus_plus
```

Key design:

  • OPD is orthogonal to the advantage estimator; it simply adds a reverse KL penalty to the advantage.
  • slime/rollout/on_policy_distillation.py implements the SGLang-mode reward_func: it calls the teacher server for every sample, trims the teacher log-probs to the response span, and writes them back into the Sample.
  • The official example uses Qwen3-8B student + Qwen3-32B teacher on DAPO-Math-17k, improving Math500 from 76% to 94%.

Does not support OPSD: the README explicitly requires "use a different (stronger) model as the teacher." There is no privileged-context mechanism.


2. veRL (ByteDance Seed)

veRL is one of the most active distributed RL frameworks in the community. OPD is provided as a standalone trainer.

Entry point: examples/on_policy_distillation_trainer/

Minimal launch script:

```bash
bash examples/on_policy_distillation_trainer/run_qwen3_8b_fsdp.sh
```

Core configuration (Hydra YAML):

```yaml
distillation:
  enabled: True
  teacher_models:
    teacher_model:
      model_path: 'Qwen/Qwen3-32B' # HF path or local path
  distillation_loss:
    loss_mode: 'k3' # choices: k1 / k3 / forward_kl_topk
    use_policy_gradient: True # jointly train with GRPO PG loss
    topk: 64 # teacher sends only top-k logits to save memory
```

Key design:

  • The teacher is served through a separate Ray cluster, decoupled from the student training process.
  • Supports topk sparse logits: the teacher returns only the top-64 logit values and indices; the student computes KL from these sparse values without materializing the full vocabulary.
  • Supports both FSDP and Megatron training backends, as well as vLLM for inference.
  • Official examples cover both text (run_qwen3_8b_fsdp.sh) and VLM (run_qwen3_vl_8b_fsdp.sh) OPD training.
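The top-k idea in the list above can be illustrated with a small standalone function. This is a sketch of a forward KL restricted to the teacher's top-k ids, assuming the teacher distribution is renormalized over those ids; it is not veRL's implementation, and the function name and data layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def topk_forward_kl(
    teacher_topk_logits: Dict[int, float],
    student_logprobs: Sequence[float],
) -> float:
    """KL(teacher_topk || student), with the teacher renormalized over its top-k ids.

    teacher_topk_logits : {token_id: teacher logit} for the k ids the teacher returned
    student_logprobs    : student log-probs over the full vocabulary (log-softmax output)
    """
    # Log-sum-exp over the k returned logits only; the rest of the vocabulary is dropped.
    m = max(teacher_topk_logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in teacher_topk_logits.values()))
    kl = 0.0
    for tok, logit in teacher_topk_logits.items():
        t_logp = logit - log_z
        kl += math.exp(t_logp) * (t_logp - student_logprobs[tok])
    return kl


# Vocabulary of 4 tokens; the student is uniform, the teacher splits its mass over ids 0 and 3.
uniform_student = [math.log(0.25)] * 4
kl = topk_forward_kl({0: 2.0, 3: 2.0}, uniform_student)
print(round(kl, 4))  # 0.6931
```

Only k log-probs per position cross the process boundary, which is why the sparse form is much cheaper than shipping full-vocabulary logits.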

Does not support OPSD: the official on_policy_distillation_trainer teacher_models configuration only accepts an external model path. The third-party repo HJSang/OPSD_OnPolicyDistillation is built on top of veRL, but its README clearly states "TODO: Add OPSD support. Currently only OPD is included."


3. NeMo RL (NVIDIA)

NeMo RL is NVIDIA's industrial-grade post-training framework. OPD exists as a native algorithm module.

Entry point: nemo_rl/algorithms/distillation.py

Launch command:

```bash
python examples/run_distillation_math.py
```

Core configuration (YAML):

```yaml
teacher:
  model_path: 'nvidia/Nemotron-4-340B' # independent teacher config
  tensor_parallel_size: 4 # teacher can use different TP
distillation:
  topk_logits_k: 64 # sparse top-k teacher logits
loss_fn:
  kl_type: 'reverse' # forward / reverse / mixed
```

Key design:

  • Two-phase execution: Phase 1 loads the teacher, computes logits for all micro-batches, caches them on CPU, then offloads the teacher. Phase 2 loads the student for forward + backward. Avoids keeping both models on GPU at the same time.
  • Independent parallelism strategies: teacher and student can use different Tensor Parallelism and Context Parallelism configurations.
  • Supports multi-turn rollout via max_rollout_turns.
  • Deeply integrated with the NeMo ecosystem; a good fit for teams that already have Megatron training infrastructure.

Does not support OPSD: teacher: PolicyConfig is an independent model configuration block. There is no privileged-context or same-model dual-role logic in the code.


4. TRL (HuggingFace)

TRL has the richest collection of experimental OPD trainers. Its key trait is light dependencies and fast onboarding.

Entry point: trl/experimental/gkd/

Minimal code example:

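```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import GKDConfig, GKDTrainer  # some releases expose these under trl.experimental.gkd instead

# Hypothetical local file: conversational data with a "messages" column.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

trainer = GKDTrainer(
    model="Qwen/Qwen3-8B",           # student
    teacher_model="Qwen/Qwen3-32B",  # teacher (independent model)
    args=GKDConfig(
        output_dir="gkd-qwen3-8b",
        lmbda=1.0,                   # fraction of batches trained on student-generated outputs
        beta=1.0,                    # generalized JSD interpolation: 0.0 = forward KL, 1.0 = reverse KL
        temperature=1.0,
        per_device_train_batch_size=4,
    ),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```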

Key design:

  • GKDTrainer inherits from SFTTrainer, leveraging a mature architecture.
  • Supports forward KL, reverse KL, and generalized JSD, selected through the beta interpolation parameter.
  • Built on Accelerate; seamlessly switches between single GPU, DeepSpeed, and FSDP.
  • The experimental folder also contains minillm/, gold/, online_dpo/, and other variants covering different OPD algorithms.

TRL is also the underlying dependency for:

  • ms-swift: examples/train/rlhf/gkd/ directly calls TRL GKDTrainer, with additional wrappers for multimodal and Megatron adaptation.
  • LLaMA-Factory: supports OPD through TRL integration; no native standalone implementation.

5. Other Frameworks with Native OPD Support

| Framework | Positioning | Entry Point | Traits |
| --- | --- | --- | --- |
| rLLM (UC Berkeley Sky) | Lightweight OPD + OPSD | rllm/trainer/distill/ | Single GPU via tinker, multi GPU via verl backend. AwesomeOPD records OPSD support. |
| AReaL (AntGroup / Tsinghua) | Large-scale RL framework | examples/distillation/gsm8k_grpo_distill.yaml | Aligned with AntGroup's internal training platform. |
| ROLL (Alibaba) | Multimodal RL framework | roll/pipeline/distill/ | Native VLM support; built-in multiple-divergence library. |
| SkyRL (UC Berkeley NovaSky) | RL research framework | skyrl-train/examples/on_policy_distillation/ | From the NovaSky lab; aligned with the Sky Computing Lab ecosystem. |
| KDFlow (BJTU) | KD-first framework | examples/on_policy_kd/ | SGLang teacher + FSDP2 student decoupled; native cross-tokenizer and VLM support. |

6. Frameworks That Do Not Support OPD

OpenRLHF is explicitly excluded by AwesomeOPD. Its architecture supports separating rollout / teacher / update, but:

  • The teacher is a remote Ray worker; transferring full logits across processes is extremely expensive.
  • There is no native on-policy distillation implementation; the existing distillation path uses offline fixed corpora.
  • Community discussions have not yet resulted in merged native OPD support.

OPSD Framework Details

OPSD (On-Policy Self-Distillation) is rarer than OPD: it requires the same model to play both teacher and student during training. The teacher receives extra information via privileged context (ground-truth answers, additional instructions, longer context, etc.) that the student does not see, and then scores the student's rollout token by token.

1. TRL (HuggingFace) — Currently the Only Official Experimental Implementation

Entry point: trl/experimental/self_distillation/

Core mechanism:

  • SelfDistillationMixin._split_prompt_and_privileged_context() separates prompt and privileged_context from the batch.
  • The same model runs two forwards:
    • Student forward: prompt + completion
    • Teacher forward: prompt + privileged_context + completion (the teacher sees more information)
  • Computes reverse KL, KL(student || teacher), applying loss only to the student-generated portion.
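The two-forward layout described above can be sketched with plain token-id lists. This is an illustration of the idea only, not TRL's internal code: the function names are hypothetical, and placing the privileged context between the prompt and the completion follows the description in the list.

```python
from typing import List, Sequence, Tuple


def build_opsd_inputs(
    prompt: List[int],
    privileged: List[int],
    completion: List[int],
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Return (student_ids, student_mask, teacher_ids, teacher_mask).

    Masks are 1 only on completion tokens, so both forwards are compared on
    exactly the same student-generated positions.
    """
    student_ids = prompt + completion
    teacher_ids = prompt + privileged + completion
    student_mask = [0] * len(prompt) + [1] * len(completion)
    teacher_mask = [0] * (len(prompt) + len(privileged)) + [1] * len(completion)
    return student_ids, student_mask, teacher_ids, teacher_mask


def select_masked(values: Sequence, mask: Sequence[int]) -> list:
    """Keep only the entries where mask is 1 (the completion span)."""
    return [v for v, m in zip(values, mask) if m]


s_ids, s_mask, t_ids, t_mask = build_opsd_inputs(
    prompt=[1, 2, 3], privileged=[90, 91], completion=[7, 8],
)
# Both views expose the same completion tokens to the loss.
print(select_masked(s_ids, s_mask), select_masked(t_ids, t_mask))  # [7, 8] [7, 8]
```

Because the completion is identical in both sequences, per-token log-probs from the two forwards can be aligned by position within the masked span, even though the absolute offsets differ.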

Key classes:

  • BaseSelfDistillationTrainer: online self-distillation base class; supports vLLM rollouts.
  • SelfDistillationMixin: shared loss computation; supports grpo, bnpo, dr_grpo, dapo, and other loss types.
  • SDPO (Self-Distillation Policy Optimization): concrete trainer implementation.

Data format requirement: The dataset must contain both prompt and privileged_context columns. For example, in math problems privileged_context can be the ground-truth solution or a derivation hint.

Code snippet:

```python
from trl import SDPOTrainer, SelfDistillationConfig

trainer = SDPOTrainer(
    model="Qwen/Qwen3-8B",
    args=SelfDistillationConfig(
        kl_type="reverse",
        loss_type="grpo",      # or bnpo / dapo / etc.
    ),
    train_dataset=dataset,     # must contain "prompt" and "privileged_context"
)
trainer.train()
```

Supported divergences:

  • alpha=0: reverse KL, $D_{KL}(\text{student} \,\|\, \text{teacher})$
  • alpha=1: forward KL, $D_{KL}(\text{teacher} \,\|\, \text{student})$
  • 0 < alpha < 1: JSD mixture

Limitations: experimental code; API may change. Currently supports only single-model architectures; no multi-teacher support.


2. rLLM (UC Berkeley Sky)

AwesomeOPD records that rLLM has an OPSD implementation under examples/math_distill/ (including an opsd/ subdirectory). The framework is lightweight and suitable for:

  • Quick single-GPU validation (tinker backend)
  • Multi-GPU scaling (verl backend)

The current GitHub path may have migrated; verify against the latest repository.


3. Frameworks That Explicitly Do Not Support OPSD

| Framework | Reason |
| --- | --- |
| slime | Architecturally requires --opd-teacher-load to point to an independent model; no privileged-context interface. |
| veRL | Official configuration teacher_models.teacher_model.model_path only accepts external model paths. |
| NeMo RL | teacher: PolicyConfig is an independent model configuration block; no same-model dual-role logic in the code. |
| ms-swift / LLaMA-Factory | Indirectly support OPD via TRL GKDTrainer; TRL's self_distillation module has not yet been wrapped. |
| OpenRLHF | No native OPD, let alone OPSD. |

Selection Advice

  • Need OPD + large-scale distributed training (SGLang / Megatron / vLLM teacher server): Choose slime, veRL, or NeMo RL. All three are production-grade; slime and veRL have higher community activity.
  • Need OPSD (self-teaching): Currently only TRL has an official experimental implementation. If you are already in the TRL ecosystem, you can directly reuse the self_distillation/ module.
  • Already using SWIFT / ModelScope or LLaMA-Factory workflows: OPD is available indirectly via TRL GKDTrainer, but OPSD is not yet supported.
  • Just want a quick OPD mechanism validation: TRL GKDTrainer or veRL on_policy_distillation_trainer are both minimal-dependency starting points.

Chapter Summary

OPD's core is a choice of training paradigm:

  • Off-policy distillation has dense token supervision, but it does not train on the mistakes the student itself will make.
  • RL is on-policy, but rewards are often sparse and sample efficiency is low.
  • OPD connects the two: the student samples by itself, and the teacher gives dense token-by-token feedback.

It is especially suitable for small models, specialized models, and post-training ability transfer. But OPD is not a replacement for RL, nor is it a universal compressor. Its ceiling comes from the teacher, its stability comes from initialization, and its value comes from "giving dense signals on the student's own states."

The most important insight from the 2026 mechanism analysis is to shift the core OPD question from "is the teacher stronger?" to "is the teacher more learnable?" A stronger model may only be good at scoring on its own trajectories. A more learnable teacher can provide new knowledge, similar thinking paths, and absorbable local preferences on the student's current trajectories. This perspective changes how the entire distillation pipeline should be designed: first cold-start both models into the same language, then use teacher-aligned prompts to keep the teacher in a familiar distribution, and finally use task rewards to make sure local preferences do not diverge from the global objective.

From the main thread of this chapter, DPO, GRPO, RLVR, and OPD are all answering the same question: when we do not want to run full traditional RLHF, where else can training signals come from? DPO uses preference pairs, GRPO and RLVR use reward models or verifiers, and OPD uses a teacher. Understanding the boundaries of these three kinds of signal is the ability that truly transfers to new projects.

References


  1. Agarwal R, Vieillard N, Zhou Y, et al. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes, ICLR 2024. GKD uses teacher feedback on student self-generated sequences to reduce distribution shift.

  2. Lu K, Thinking Machines Lab. On-Policy Distillation, 2025. Engineering OPD reproduction and Tinker implementation, including Qwen3 comparisons and personalization experiments.

  3. Xu X, Li M, Tao C, et al. A Survey on Knowledge Distillation of Large Language Models, arXiv 2024. Surveys LLM KD from algorithm, skill, and verticalization perspectives.

  4. Yang C, Lu W, Zhu Y, et al. Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application, arXiv 2024. Organizes white-box / black-box KD, evaluation, and applications.

  5. Ross S, Gordon G, Bagnell D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, AISTATS 2011. DAgger brings states visited by the learner into the training set.

  6. Shi Z, Zhang J, Jiang W, et al. Lightning OPD: Cost-effective On-Policy Distillation, arXiv 2026. Offline-izes standard OPD by precomputing teacher log-probs and avoiding a live teacher server during training.

  7. Gu Y, Dong L, Wei F, Huang M. MiniLLM: Knowledge Distillation of Large Language Models, ICLR 2024. Uses reverse KL and on-policy optimization for generative LLM distillation.

  8. Li Y, Zuo Y, He B, et al. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe, arXiv 2026. Analyzes OPD success conditions, token-level mechanisms, and failure recovery strategies.

  9. Ko J, Kim S, Chen T, Yun S. DistiLLM: Towards Streamlined Distillation for Large Language Models, ICML 2024. Improves LLM distillation efficiency with skew KL and adaptive off-policy strategies.

  10. Ye T, Dong L, Chi Z, et al. Black-Box On-Policy Distillation of Large Language Models, arXiv 2025. GAD uses a discriminator to provide on-policy reward when teacher logits are unavailable.

  11. Zhao S, Xie Z, Liu M, et al. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models, arXiv 2026. A single model plays teacher and student under different contexts.

  12. Song M, Zheng M. A Survey of On-Policy Distillation for Large Language Models, arXiv 2026. Unifies OPD under an f-divergence framework and classifies methods by feedback signal, teacher access, and loss granularity.
