8.2 The RLHF Pipeline

Reading Guide

Core points

  • Master the three-stage InstructGPT-style RLHF pipeline: SFT, Reward Model, PPO.
  • Understand each stage's inputs, outputs, acceptance metrics, and typical failure modes.
  • Organize experiments as artifacts: datasets, checkpoints, and evaluation reports must be traceable.

Core formulas

$$\mathcal{L}_{SFT} = -\mathbb{E}_{(x,y)\sim \mathcal{D}_{SFT}}\left[\log \pi_\theta(y\mid x)\right] \quad \text{(SFT: imitate high-quality answers)}$$

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}_{pref}} \left[\log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l))\right] \quad \text{(RM: learn preference ranking)}$$

$$\max_\theta\ \mathbb{E}_{y\sim \pi_\theta(\cdot\mid x)} \left[r_\phi(x,y) - \beta D_{KL}(\pi_\theta(\cdot\mid x)\|\pi_{ref}(\cdot\mid x))\right] \quad \text{(PPO-RLHF: optimize reward, but do not drift)}$$

Keep one sentence in mind:

RLHF is not a training script. It is an artifact pipeline. Each stage must leave behind data, models, metrics, and failure cases.

This chapter follows the classic OpenAI InstructGPT recipe: SFT first, then a reward model, then PPO-based RLHF. It is not the only post-training method, but it is the reference point you need before you can properly understand DPO, GRPO, RLVR, and later variations.

One Prompt, Three Stages

Do not start from framework names. Start from a single user question:

text
Explain what PPO's clip ratio does, and give an intuitive example.

The standard RLHF pipeline does three different things around this same prompt:

  1. SFT stage: provide a high-quality demonstration answer so the model learns "how to respond to this type of instruction."
  2. RM stage: prepare multiple candidate answers to the same prompt and label which answer is better.
  3. PPO stage: let the current policy generate answers, score them with the RM, and update the policy to increase the probability of high-scoring answers.

These stages correspond to three different data formats:

json
{
  "sft_item": {
    "prompt": "Explain what PPO's clip ratio does, and give an intuitive example.",
    "response": "The clip ratio limits how far the new policy can move..."
  },
  "preference_item": {
    "prompt": "Explain what PPO's clip ratio does, and give an intuitive example.",
    "chosen": "Think of clipping like a seatbelt: it prevents one update from being too aggressive...",
    "rejected": "PPO is an algorithm and it is widely used. It is important."
  },
  "ppo_prompt_item": {
    "prompt": "Explain what PPO's clip ratio does, and give an intuitive example."
  }
}

The same prompt can appear in multiple stages, but you must prevent evaluation leakage: prompts used for evaluation should not be reused for training.
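A leakage check can be as simple as comparing normalized prompt strings across splits. The sketch below is one minimal way to do that; the function names `normalize` and `find_leaked_prompts` are hypothetical, and it only catches exact matches after lowercasing and whitespace collapsing. Paraphrased or near-duplicate prompts would need fuzzy matching on top.

```python
import re

def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return re.sub(r"\s+", " ", prompt.strip().lower())

def find_leaked_prompts(train_prompts, eval_prompts):
    """Return eval prompts whose normalized text also appears in training data."""
    train_keys = {normalize(p) for p in train_prompts}
    return [p for p in eval_prompts if normalize(p) in train_keys]

# Toy data for illustration only.
train = [
    "Explain what PPO's clip ratio does, and give an intuitive example.",
    "Explain PPO's KL penalty.",
]
held_out = [
    "explain  PPO's KL penalty. ",   # same prompt with different casing/spacing
    "What is a reward model?",
]

leaked = find_leaked_prompts(train, held_out)
print(leaked)  # the first held-out prompt is flagged
```

Running a check like this against the SFT, preference, and PPO prompt files before every training run is cheap compared with discovering a contaminated evaluation afterwards.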

The Three Stages and Their Deliverables

| Stage | Input | Output | Acceptance metrics | Most common failure |
| --- | --- | --- | --- | --- |
| SFT | instruction-answer pairs | an instruction-following assistant | SFT loss, format adherence, human inspection | learns style but stays shallow |
| RM | preference pairs (chosen/rejected) | a reward model that scores responses | held-out accuracy, margin, calibration samples | learns wrong preference (length bias) |
| PPO | prompts + RM + reference model | an improved policy (and a critic) | reward rises without KL/length exploding; preference win-rate | reward hacking, regression, instability |

The critical point is: SFT and RM are not "just preparation." They are where most RLHF success or failure is decided. Bad SFT data gives you a bad starting policy; a biased RM gives PPO an incorrect target to optimize.

Step 0: Choose a Base Checkpoint

RLHF does not start from training a model from scratch. It starts from a base checkpoint that becomes an artifact:

text
artifacts/
  base/
    model_name.txt
    tokenizer_config.json
    generation_probe.jsonl

When selecting a base model for a teaching-scale experiment, check:

| Dimension | Question | Practical suggestion |
| --- | --- | --- |
| size | can you run a four-model loop locally? | 360M to 0.5B is a good start |
| language | does it cover your target language? | pick a model trained for your language |
| license | can you fine-tune and redistribute? | read the model card |

The key output of Step 0 is not a trained model. It is a baseline report: how does the base model respond to a fixed prompt set? Without a baseline, there is no way to judge what SFT and RLHF actually changed.
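The baseline report can start as a plain JSONL file of prompt/response pairs. The sketch below assumes a hypothetical `generate(prompt)` callable that wraps whatever inference code you use; here it is replaced by a stub so the example runs without a model. The file name `generation_probe.jsonl` follows the directory listing above.

```python
import json
import os
import tempfile

def run_probe(generate, prompts, out_path):
    """Write one JSON line per prompt: the prompt and the model's response."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "response": generate(prompt)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def fake_generate(prompt):
    # Stand-in for a real model call; replace with your own inference code.
    return "(base model output for: " + prompt + ")"

# A fixed prompt set: reuse exactly the same list after SFT and after PPO.
probe_prompts = ["Explain PPO's KL penalty.", "Write a haiku about gradients."]
out_path = os.path.join(tempfile.mkdtemp(), "generation_probe.jsonl")
run_probe(fake_generate, probe_prompts, out_path)

with open(out_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(len(records), records[0]["prompt"])
```

Keeping the prompt list fixed is the point: the same probe run against the SFT and RLHF checkpoints gives directly comparable samples.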

Step 1: SFT Teaches "How To Answer"

SFT is supervised learning. Given a prompt $x$ and a demonstration answer $y$, we maximize the conditional likelihood:

$$\mathcal{L}_{SFT} = -\sum_{t=1}^{T}\log \pi_\theta(y_t \mid x, y_{<t}).$$

In plain words: given the prompt and the already-generated prefix, make the model more likely to generate the demonstration's next token.

One crucial implementation detail is the loss mask: in chat-format data, only the assistant tokens should contribute to the loss. If you train on the user/system text, you teach the model to repeat the user and to generate role markers.
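The loss mask can be illustrated without any training framework. In the sketch below, the token ids, roles, and log-probabilities are made up, and the helper names are hypothetical. It uses `-100` as the masked label because that is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; it also averages over assistant tokens rather than summing, which is a common normalization of the SFT loss, and it skips the one-position label shift that a real causal LM applies.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss ignores this label by default

def build_labels(token_ids, roles):
    """Copy token ids as labels, masking every non-assistant position."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over unmasked (assistant) positions only."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy example: ids, roles, and log-probs are invented for illustration.
token_ids = [11, 12, 13, 21, 22, 23]
roles = ["user", "user", "user", "assistant", "assistant", "assistant"]
token_logprobs = [-5.0, -4.0, -6.0, -1.0, -2.0, -3.0]  # log pi(y_t | x, y_<t)

labels = build_labels(token_ids, roles)
print(labels)                              # user positions are masked out
print(masked_nll(token_logprobs, labels))  # averages only the last three tokens
```

Without the mask, the three user tokens (with their large negative log-probs) would dominate the loss and push the model toward reproducing user text.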

Step 2: The Reward Model Teaches "What Is Better"

A reward model does not learn a single correct answer. It learns preference ordering. A typical sample is:

json
{
  "prompt": "Explain PPO's KL penalty.",
  "chosen": "The KL penalty acts like a safety rope: it prevents the policy from drifting too far from the reference.",
  "rejected": "KL is a math formula and PPO uses it, so it is important."
}

The RM learns a scoring function $r_\phi(x,y)$ such that:

$$r_\phi(x,y_w) > r_\phi(x,y_l).$$

The common Bradley-Terry loss makes this trainable:

$$\mathcal{L}_{RM} = -\log \sigma(r_\phi(x,y_w)-r_\phi(x,y_l)).$$

Do not only track accuracy. Also track the margin:

$$\text{margin} = r_\phi(x,y_w) - r_\phi(x,y_l).$$

If margins are tiny, PPO will receive a weak and noisy reward signal even if the ordering accuracy looks acceptable.
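To see why accuracy alone can mislead, it helps to compute accuracy, mean margin, and Bradley-Terry loss side by side. The sketch below uses invented reward scores for two hypothetical reward models; `bt_loss` and `rm_report` are illustrative names, not part of any library.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """-log sigmoid(margin), written as softplus(-margin) for numerical stability."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-m)) computed without overflow for large |m|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def rm_report(pairs):
    """pairs: list of (r_chosen, r_rejected) scores on held-out preference data."""
    margins = [c - r for c, r in pairs]
    return {
        "accuracy": sum(m > 0 for m in margins) / len(margins),
        "mean_margin": sum(margins) / len(margins),
        "mean_loss": sum(bt_loss(c, r) for c, r in pairs) / len(pairs),
    }

# Two toy reward models with the same ordering accuracy (3 of 4 pairs correct).
confident = [(2.0, -1.0), (1.5, -0.5), (3.0, 0.0), (-1.0, 0.5)]
hesitant = [(0.05, 0.0), (0.02, 0.0), (0.03, 0.0), (0.0, 0.04)]

print(rm_report(confident))
print(rm_report(hesitant))
```

Both toy models reach 75% accuracy, but the second one separates chosen from rejected by only a few hundredths, and its loss stays close to $\log 2$, the value for a model that cannot tell the two apart.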

Step 3: PPO-RLHF Optimizes the Policy Under Constraints

In PPO-RLHF, you typically run four components:

| Role | Source | Trained? | Purpose |
| --- | --- | --- | --- |
| Actor | initialized from SFT | yes | generate responses and get updated |
| Reference | frozen SFT checkpoint | no | KL anchor: "do not drift too far" |
| Reward model | trained on preferences | no | provide scalar reward signal |
| Critic | often initialized from Actor | yes | estimate value to reduce variance |

The total reward is typically written as:

$$R_{total}(x,y) = r_\phi(x,y) - \beta D_{KL}(\pi_\theta(\cdot\mid x)\|\pi_{ref}(\cdot\mid x))$$

It captures the core tension of RLHF:

  • The RM wants the Actor to move toward responses that better match preferences.
  • The Reference wants the Actor to stay close to the SFT policy.
  • PPO wants each update step to be moderate.

Without the KL penalty, the Actor may quickly exploit blind spots in the RM. If the KL penalty is too strong, the Actor can barely learn at all.
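The effect of $\beta$ can be shown with a few numbers. The sketch below estimates the KL term from a single sampled response by summing $\log \pi_\theta(y_t) - \log \pi_{ref}(y_t)$ over its tokens, which is a common sample-based estimator; many implementations apply the penalty per token instead of per sequence. All scores and log-probs are invented, and `kl_penalized_reward` is a hypothetical helper.

```python
def kl_penalized_reward(rm_score, actor_logprobs, ref_logprobs, beta):
    """Sequence-level RM score minus beta times a sampled KL estimate.

    The estimate sums log pi_theta(y_t) - log pi_ref(y_t) over the tokens of
    one sampled response; its expectation under pi_theta is the KL divergence.
    """
    kl_estimate = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate, kl_estimate

# Made-up per-token log-probs under the reference for one 4-token response.
ref_lp = [-2.0, -1.5, -1.0, -2.5]
candidates = {
    # name: (RM score, actor log-probs for the same tokens)
    "near": (1.0, [-1.9, -1.5, -0.9, -2.4]),  # actor still close to the reference
    "far": (1.6, [-0.5, -0.2, -0.1, -0.4]),   # higher RM score, but heavy drift
}

totals = {}
for beta in (0.0, 0.2):
    for name, (score, actor_lp) in candidates.items():
        total, kl = kl_penalized_reward(score, actor_lp, ref_lp, beta)
        totals[(beta, name)] = total
        print(f"beta={beta} {name}: kl={kl:.2f} total={total:.2f}")
```

With `beta=0.0` the drifted response wins on RM score alone; with `beta=0.2` the penalty outweighs its extra reward and the response that stays near the reference scores higher.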

Where Feedback Comes From

The H in classic RLHF stands for human feedback, but in real engineering the feedback source is usually a mixture:

| Source | Use case | Risk |
| --- | --- | --- |
| Human annotation | High-quality seed data, final calibration | Expensive, slow, limited consistency |
| AI Judge / RLAIF | Scaling preference data, fast iteration | Amplifies judge biases |
| Rule verification | Math, code, format and other verifiable tasks | Cannot cover open-ended dialogue quality |
| Online feedback | Likes, dislikes, copy, edit-and-resend | Noisy, requires aggregation |

This chapter still uses classic human preference as the main thread, but introduces AI Judges, rule checks, and manual review in data engineering and evaluation. This preserves the standard InstructGPT structure without reducing the course to an outdated purely-manual annotation workflow.

RLAIF, CAI, and Self-Play

RLAIF, self-play, and self-evolution should not become a separate main thread for Chapter 08. They belong in the "feedback source" position: they are fundamentally answering where preference data comes from and how to iterate faster.

| Method | Pipeline position | Purpose | Guardrails needed |
| --- | --- | --- | --- |
| RLAIF | Generate preference pairs / RM training set | Replace some human annotation with a strong model | Human spot-checks, judge consistency review |
| Constitutional AI | Generate chosen/rejected | Self-critique and self-revision by principles | Constitution quality, human calibration |
| Self-Play / Debate | Generate candidate answers and hard negatives | Let the model compete against past versions | Diversity monitoring, external eval anchors |
| Self-Rewarding | Multi-round data flywheel | Model self-evaluates, self-critiques, self-revises, then retrains | External RM or human eval to prevent degradation |

The key insight here is not "replace humans entirely" but "use AI to scale, and use humans to calibrate direction." If you rely entirely on an AI Judge and that judge favors verbose responses, fixed templates, or a particular style, those biases get amplified in the next training round.

A minimal RLAIF judge prompt might look like this:

python
rlaif_judge_prompt = """
You are a strict answer quality evaluator. Compare two responses.

Evaluation dimensions:
1. Accuracy: Are the facts correct, any hallucinations?
2. Helpfulness: Does it genuinely address the user's question?
3. Clarity: Is the writing clear and the logic coherent?
4. Safety: Does it contain harmful, biased, or misleading content?

User question:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Output only JSON:
{{"winner": "A" or "B" or "tie", "reason": "one-sentence reason"}}
"""

To reduce judge bias, do at least four things:

  1. Randomly swap A/B order.
  2. Record the judge's reasoning, not just the winner.
  3. Periodically run human spot-checks.
  4. Keep a fixed evaluation set so the data flywheel does not merely appease the current judge.
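The first two items in the list above (random A/B order and recorded reasoning) can be wrapped around any judge call. The sketch below assumes a hypothetical `judge(prompt, a, b)` callable that returns a dict with `winner` and `reason` keys, matching the JSON format of the prompt above; two toy judges stand in for a real LLM call.

```python
import random

def judge_with_random_order(judge, prompt, resp_1, resp_2, rng):
    """Randomly assign the two responses to slots A/B, then map the verdict back."""
    swapped = rng.random() < 0.5
    a, b = (resp_2, resp_1) if swapped else (resp_1, resp_2)
    verdict = judge(prompt, a, b)  # expected keys: "winner" (A/B/tie), "reason"
    winner = verdict["winner"]
    if winner == "tie":
        original = "tie"
    elif (winner == "A") != swapped:
        original = "resp_1"
    else:
        original = "resp_2"
    # Keep the slot assignment and the reasoning for later audits.
    return {"winner": original, "swapped": swapped, "reason": verdict["reason"]}

def position_biased_judge(prompt, a, b):
    # Toy stand-in for an LLM judge that always prefers slot A.
    return {"winner": "A", "reason": "first response looked better"}

def content_judge(prompt, a, b):
    # Toy stand-in that looks at content (a keyword), not at position.
    return {"winner": "A" if "ratio" in a else "B", "reason": "mentions the ratio"}

rng = random.Random(0)
good, bad = "Clipping caps the policy ratio per update.", "PPO is important."
shares = {}
for judge in (position_biased_judge, content_judge):
    results = [judge_with_random_order(judge, "q", good, bad, rng) for _ in range(200)]
    shares[judge.__name__] = sum(r["winner"] == "resp_1" for r in results) / len(results)
print(shares)
```

A judge that only reacts to position ends up near a 50% win share once the order is randomized, which makes the bias visible instead of silently crediting one response; the content-based judge picks the same response regardless of slot.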

Where Does the Data Flywheel Fit?

The data flywheel is not a separate algorithm. It connects SFT, RM, PPO, and evaluation into an iterable system:

text
Deploy model
  -> Collect bad cases, user feedback, evaluation failures
  -> Produce new SFT / preference data
  -> Train SFT or RM
  -> PPO-RLHF update policy
  -> Evaluate; deploy again if it passes

Key metrics for this flywheel include iteration cycle time, data utilization rate, evaluation coverage, and rollback rate. In a small-scale course experiment you can compress it into a single round: prepare fixed data, run SFT/RM/PPO, then use the evaluation results to decide what data to add in the next round.

Whether each turn of the flywheel actually improves the model depends primarily on quality gates, not on how much data was generated.

| Quality gate | What it checks | Typical practices |
| --- | --- | --- |
| Basic cleaning | Duplicates, contamination, format errors, length anomalies | Dedup, eval-set leak check, length filtering, format validation |
| Difficulty stratification | Whether data sits at the model's learning boundary | Use pass@k or judge scores to split too easy / learnable / too hard |
| Preference consistency | Whether chosen is genuinely better than rejected | Multi-judge voting, human spot-checks |
| Online regression | Whether the new model breaks old capabilities | Fixed benchmark + badcase replay |
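The first two gates are easy to prototype. The sketch below is a minimal version under stated assumptions: the length bounds and the 0.1/0.9 thresholds are arbitrary placeholders, the function names are hypothetical, and difficulty is bucketed by the empirical pass rate over k samples rather than by a formal pass@k estimate.

```python
def basic_clean(items, eval_prompts, min_chars=20, max_chars=4000):
    """Drop exact duplicates, eval-set prompts, and out-of-range responses."""
    blocked = {p.strip().lower() for p in eval_prompts}
    seen, kept = set(), []
    for item in items:
        key = (item["prompt"].strip().lower(), item["response"].strip())
        if key in seen or key[0] in blocked:
            continue  # duplicate pair, or prompt leaked from the eval set
        if not (min_chars <= len(item["response"]) <= max_chars):
            continue  # length anomaly
        seen.add(key)
        kept.append(item)
    return kept

def difficulty_bucket(pass_rate, low=0.1, high=0.9):
    """Bucket by the fraction of k sampled answers that pass; thresholds are assumptions."""
    if pass_rate >= high:
        return "too_easy"
    if pass_rate <= low:
        return "too_hard"
    return "learnable"

# Toy data for illustration only.
answer = "The KL penalty keeps the policy close to the reference model."
raw = [
    {"prompt": "Explain PPO's KL penalty.", "response": answer},
    {"prompt": "Explain PPO's KL penalty.", "response": answer},       # duplicate
    {"prompt": "What is a reward model?", "response": answer},         # eval prompt
    {"prompt": "Explain the clip ratio.", "response": "It clips."},    # too short
]
clean = basic_clean(raw, eval_prompts=["What is a reward model?"])
print(len(clean), [difficulty_bucket(p) for p in (1.0, 0.5, 0.0)])
```

Logging how many items each gate drops per round is a cheap way to notice when a data source has started to degrade.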

Minimal Experiment Directory

For reproducibility, this chapter recommends storing artifacts separately:

text
experiments/rlhf-smollm/
  data/
    sft_train.jsonl
    pref_train.jsonl
    prompts_ppo.jsonl
    eval_prompts.jsonl
  models/
    base.txt
    sft/
    reward_model/
    rlhf/
  reports/
    base_probe.md
    sft_eval.json
    rm_eval.json
    ppo_train_metrics.jsonl
    final_eval.md

This is not bureaucracy. When debugging RLHF you will often ask:

  • Which RM was used for this PPO run?
  • Which preference data was this RM trained on?
  • Did evaluation prompts leak into the training data?
  • From which checkpoint did the model start getting verbose?

If artifacts are not recorded explicitly, these questions become hard or impossible to answer after the fact.
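One lightweight way to keep these questions answerable is to write a small manifest next to every stage output that records its inputs by content hash. The sketch below is an assumption about how that could look: the file name `manifest.json` and the helper names are hypothetical and not part of the directory layout above.

```python
import hashlib
import json
import os
import tempfile

def sha256_of(path):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(stage_dir, inputs, notes=""):
    """Record which input files (by content hash) produced this stage's output."""
    manifest = {
        "inputs": {name: sha256_of(path) for name, path in inputs.items()},
        "notes": notes,
    }
    with open(os.path.join(stage_dir, "manifest.json"), "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return manifest

# Toy run in a temporary directory.
root = tempfile.mkdtemp()
pref_path = os.path.join(root, "pref_train.jsonl")
with open(pref_path, "w", encoding="utf-8") as f:
    f.write('{"prompt": "p", "chosen": "a", "rejected": "b"}\n')

rm_dir = os.path.join(root, "reward_model")
os.makedirs(rm_dir)
manifest = write_manifest(rm_dir, {"pref_train": pref_path}, notes="toy run")
print(manifest["inputs"]["pref_train"][:12])
```

A PPO run can then list the reward model's manifest hash among its own inputs, so the chain from final policy back to preference data stays traceable.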

Common Failure Mode Map

| Location | Failure symptom | Root cause | What to check first |
| --- | --- | --- | --- |
| Base | Output does not look like an assistant | Pretraining objective was not instruction-following | Base probe samples |
| SFT | Correct format but empty content | Demonstration data is low-quality or homogeneous | SFT data manual sampling |
| RM | Prefers long responses | Chosen responses are systematically longer in preference data | Reward-length correlation |
| PPO | Reward rises but quality drops | Actor found RM blind spots | High-reward sample spot-checks |
| Eval | Win rate fluctuates heavily | Judge bias or too few samples | Random seed, A/B order, confidence intervals |
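Two of the checks in the last column are simple statistics. The sketch below computes a Pearson correlation between response length and reward, and a Wilson score interval for a win rate; the numbers are invented, the helper names are hypothetical, and what counts as "too correlated" is a judgment call rather than a fixed threshold.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Reward-length check: made-up response lengths (tokens) and RM scores.
lengths = [50, 120, 200, 310, 400]
rewards = [0.1, 0.4, 0.5, 0.9, 1.2]
r = pearson(lengths, rewards)
print(f"reward-length correlation: {r:.2f}")

# Win-rate check: the same 60% win rate with few versus many comparisons.
small = wilson_interval(30, 50)
large = wilson_interval(300, 500)
print(f"n=50:  [{small[0]:.2f}, {small[1]:.2f}]")
print(f"n=500: [{large[0]:.2f}, {large[1]:.2f}]")
```

In this toy data the 50-comparison interval still contains 0.5, so a 60% win rate at that sample size does not yet show that the new policy is better; at 500 comparisons the interval clears 0.5.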

Section Summary

The standard RLHF pipeline can be compressed into three sentences:

  1. SFT turns a base model into an assistant starting point.
  2. The Reward Model converts preference data into an optimizable reward signal.
  3. PPO increases the probability of high-reward responses under a KL constraint.

But reliable RLHF is not just these three training steps. It also includes artifact management, data quality gates, and a closed evaluation loop. The next section, "SFT: Teaching the Model to Answer Instructions," enters the first stage: how SFT data and preference data are constructed, and why they relate naturally to imitation learning and inverse reinforcement learning.

Exercises

  1. Design an sft_item and a preference_item with the same prompt but different data purposes.
  2. Explain why high RM accuracy does not guarantee success in the PPO stage.
  3. In one sentence, describe the role of the Reference model in PPO-RLHF.

Modern Reinforcement Learning: A Practical Course