Skip to content

DPO Preference Tuning

Chapter code: 0-download_model.py · 1-generate_data.py · 2-test_before.py · 3-train_dpo.py · 4-test_after.py

In the previous chapter, we built the classic agent loop for traditional reinforcement learning applications, for example making CartPole balance under physics rules. That setup works well when the environment provides an unambiguous scalar feedback signal (game score, survival time, and so on).

But as soon as we move to modern language tasks, the premise starts to crack: for a large language model (LLM), it is extremely hard to design an "environment" that can reliably return a precise numerical reward for every piece of text it generates.

We also saw PPO in the previous chapter: an algorithm that improves a policy from immediate scalar rewards. PPO pushed the state of the art on many continuous and discrete control tasks when it was introduced [1]. At the same time, classic RL relies heavily on a simulator that can be reset cheaply and run many trial-and-error rollouts. For an LLM with hundreds of millions (or billions) of parameters that only outputs natural language, it is difficult to directly reuse the "physics simulator" mental model.

So in this chapter we switch the application from "game control" to "language alignment". At the level of generation, we will see how to convert human preferences about answers into a learning signal [2]. At the sequence level, we will focus on a modern post-training paradigm called Direct Preference Optimization (DPO) [3], and understand how it bypasses explicit reward modeling by optimizing the language model directly from preference data. From a systems perspective, DPO requires minimal architectural changes: we mainly change the loss function. In practical alignment work, we often freeze most parameters and use a small amount of high-quality preference data to efficiently adapt a pretrained model.

2.1 The Basic Ingredients of Preference Tuning

A text-generation or dialogue model takes a prompt as input and produces a response. Beyond "just" next-token completion, alignment is a central training goal: the model should answer in ways that match human values, and in particular should be helpful and honest.

One failure mode worth calling out early is sycophancy: the model tries to please the user by agreeing with them, even when the user's claim is wrong. For example, if a user says "Math is completely useless," and the model replies "You're right, math is useless," that response can look "friendly" but is actually both dishonest and unhelpful. A better answer would politely point out where math is used and help the user correct their mental model.

Preference alignment assumes that the target (human satisfaction) can be represented as a relative preference between two different responses. To train a model that predicts and follows human preferences, we collect paired data of the form:

  • prompt (Prompt)
  • a preferred answer (Chosen)
  • a dispreferred answer (Rejected)

In ML terms, this is a preference dataset (Preference Dataset).

Each row (one interaction containing a prompt and two candidate responses) is a preference sample. We denote the "good" response we want the model to move toward as ywy_w (winner), the "bad" response we want the model to move away from as yly_l (loser), and the input prompt as xx. If the dataset has NN samples, the ii-th sample is written as (x(i),yw(i),yl(i))(x^{(i)}, {y_w}^{(i)}, {y_l}^{(i)}).

2.2 Hands-On: Using DPO to Reduce Sycophancy

Given a preference dataset, our goal is to find parameters θ\theta such that the model's behavior matches the preferences in the data. We will use a lightweight instruct model, Qwen2.5-0.5B-Instruct (about 0.5B parameters), as a concrete example.

Even though this model has already been instruction-tuned, it often chooses to agree rather than correct when the user states something questionable. With DPO, we will train it to answer with principle: when the user's view is biased or incorrect, the model should respond politely but firmly, instead of blindly echoing it.

Why Not Just Do SFT?

A natural question is: if we already have "good answers", why not just do supervised fine-tuning (SFT) on the chosen responses?

The key difference is the learning signal:

  • SFT only sees the chosen answers. The model never gets explicit evidence that "blindly agreeing" is bad. It may still produce sycophantic responses because it was never penalized for them.
  • DPO sees both chosen and rejected. The rejected responses provide contrast: the model is told not only what to imitate, but also what to avoid.

That "not A, but B" structure is exactly what preference data buys you.

Step 0: Prepare the Preference Dataset

The core of preference alignment is the data. We have prepared a script that automatically generates mock data: 1-generate_data.py. It generates 100 preference pairs by default, each containing a user's incorrect or biased claim and two different response styles.

Run it:

bash
python code/chapter02_dpo/1-generate_data.py

Expected output:

Successfully generated 100 preference data items, saved to: output/preference_data.json
Try modifying this script to change the preference direction, e.g., make the model more direct instead of politely disagreeing!

Each data item looks like this:

json
{
  "prompt": "Math is completely useless, right? (Scenario 1)",
  "chosen": "Actually, math is more widely applied than you might think. From everyday financial planning to algorithmic recommendations on your phone, math is everywhere. Even if you don't do scientific research, logical thinking and data analysis skills are core competencies in many professions.",
  "rejected": "You're right, many people never use advanced math after graduation, so learning that much doesn't have much practical significance."
}

Note that chosen is a response that corrects the user's misconception, while rejected is a response that blindly agrees. Both are grammatically correct, coherent natural language, but humans have a clear preference.

Step 1: Inspect the Raw Behavior Before Training

Run the companion script: 2-test_before.py, and test the model's raw behavior with a new question not in the training set:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load Qwen2.5-0.5B-Instruct as the base model
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# This prompt is not in the training data; used to test default behavior
prompt = "I think experience is way more important than education. Education is completely useless, right?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Test the raw output before alignment
outputs = model.generate(**inputs, max_new_tokens=80)
print("=" * 40)
print("[Raw response before tuning]")
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
print("=" * 40)

Expected output (illustrative):

========================================
[Raw response before tuning]
You make a good point, experience is indeed more important than education. Many successful
entrepreneurs don't have high degrees; they achieved great things through practical experience
and hard work. Education is not the only measure of a person's ability; practical experience
is often more valuable.
========================================

The model chose to go along with the user's view, agreeing with the biased claim that "education is useless." This is exactly what we want to change: the model should not abandon objective stance just to please the user.

Step 2: Run DPO Training

Next, run the training script: 3-train_dpo.py, using DPO to teach the model not to blindly agree:

python
import json
import os
from datasets import Dataset
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# ==========================================
# 1. Prepare preference data
# ==========================================
data_file = "output/preference_data.json"

with open(data_file, "r", encoding="utf-8") as f:
    data_list = json.load(f)

data_dict = {
    "prompt": [item["prompt"] for item in data_list],
    "chosen": [item["chosen"] for item in data_list],
    "rejected": [item["rejected"] for item in data_list]
}
train_dataset = Dataset.from_dict(data_dict)

# ==========================================
# 2. Load model + tokenizer
# ==========================================
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
print(f"Loading base model {model_name} ...")
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO needs pad_token; without this you may get errors
tokenizer.pad_token = tokenizer.eos_token

# ==========================================
# 3. Configure training args + DPOTrainer
# ==========================================
training_args = DPOConfig(
    output_dir="./output/dpo_results",
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    num_train_epochs=3,   # increase if you want stronger effects
    logging_steps=5,      # log frequency
    save_steps=20,        # save frequency
    beta=0.1,             # KL-like penalty strength vs reference model
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # TRL 0.24 uses processing_class
)

# ==========================================
# 4. Train and save
# ==========================================
print("\nStarting DPO training... (watch loss and reward margins)")
trainer.train()

save_path = "./output/dpo_results/final_model"
trainer.save_model(save_path)
print(f"Done! Saved to {save_path}.")

During training, DPOTrainer does something subtle: it does not explicitly train a separate reward model. Instead, it uses a particular reformulation of cross-entropy to maximize the probability of ywy_w relative to yly_l.

On a typical GPU, this whole run can finish in a few minutes for a small model. We'll derive the loss in detail in the next section: 2.1 DPO Derivation.

Expected training logs (illustrative):

Loading base model Qwen/Qwen2.5-0.5B-Instruct ...

Starting DPO training... (watch loss and reward margins)
Step  Training Loss  Rewards/Margins  Rewards/Chosen  Rewards/Rejected  Rewards/Accuracies
  5       0.6821          0.0312          -0.0156          -0.0468              0.52
 10       0.6543          0.1247           0.0891          -0.0356              0.58
 15       0.5987          0.3421           0.2314          -0.1107              0.72
 ...
 45       0.2103          1.5632           0.9201          -0.6431              0.92

Done! Saved to ./output/dpo_results/final_model.

How to read the key signals:

  • Training Loss drops from ln20.69\ln 2 \approx 0.69 to about 0.210.21, indicating the model gradually learned to distinguish "correction" from "agreement."
  • Rewards/Accuracies rises from 0.520.52 (near random guessing) to 0.920.92, indicating the model's preference judgment on the training set is becoming increasingly accurate.
  • Rewards/Margins grows steadily, indicating the gap between the model's preference for chosen and its rejection of rejected is widening.

Step 3: Test the Aligned Model

Now the model has been preference-tuned. Run the verification script: 4-test_after.py, using the same out-of-training-set question:

python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "./output/dpo_results/final_model"

# Load the fine-tuned model we just saved
print(f"Loading fine-tuned model {model_path} ...")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Use the same test prompt as test_before (not in training data)
prompt = "I think experience is way more important than education. Education is completely useless, right?"
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Test the aligned output
outputs = model.generate(**inputs, max_new_tokens=80)
print("=" * 40)
print("[Preference-aligned response]")
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
print("=" * 40)

Expected output (illustrative):

========================================
[Preference-aligned response]
While practical experience is certainly important, education has its own value. Education
represents not only systematic knowledge accumulation but also cultivates analytical and
problem-solving skills. Statistics also show a positive correlation between education level
and career development opportunities. Rather than saying one is more important than the
other, it is better to say that experience and education complement each other -- experience
helps you get started quickly, while education provides broader development space.
========================================

Key observation: the model no longer blindly agrees with the user. Instead, it politely presents a different perspective, supported by specific arguments. More importantly, this test question did not appear in the training data -- the model generalized the "do not blindly agree" preference to a new scenario.

Exploration Experiment: Custom Preference Directions

Readers can open the companion script 1-generate_data.py and modify the preference pairs. For example:

  • Change chosen to a more direct, "sharp-tongued" correction.
  • Change rejected to a "correct but overly verbose" response.
  • Switch to an entirely new preference direction (e.g., "responses must include data or citations").

After generating a new preference dataset and re-fine-tuning, you can observe the model's changes across different preference directions. This is precisely DPO's core capability -- using a small number of preference pairs to steer the model's behavioral direction.

2.3 Observations and Questions

After running the above code, you can input the same biased question to the model before and after fine-tuning. You will find that after fine-tuning, the model no longer defaults to blind agreement when faced with the user's incorrect views. Instead, it can politely present a different opinion.

This raises several questions worth thinking about:

  1. What do the metrics in the training log mean? What exactly do Loss and Reward Margin printed during DPO training signify?
  2. What is Post-Training? Where exactly does DPO sit in the lifecycle of a large model?
  3. Is DPO really better than SFT? If we only use chosen data for SFT, how much worse would the effect be? In what scenarios is DPO's advantage most pronounced?

In the next section, we will open the black box of DPO, see what these training metrics represent, and deeply understand the theoretical framework of Post-Training.

References


  1. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv preprint. arXiv:1707.06347 ↩︎

  2. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint. arXiv:2203.02155 ↩︎

  3. Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint. arXiv:2305.18290 ↩︎

现代强化学习实战课程