
3.7 Data Sources: On-Policy, Off-Policy, Online, and Offline

One-sentence summary: this section introduces two pairs of concepts that are often confused: on-policy vs. off-policy (does the data come from the current policy?), and online vs. offline (can we still collect new data during training?). Think of them as two independent axes that jointly define what an RL algorithm's data looks like.

What This Section Solves

Core content

  • Behavior policy $\mu$ vs. target policy $\pi$: distinguish "who collected the data" from "who is being optimized."
  • On-policy / off-policy: whether the training data is generated by the policy being learned.
  • Online / offline: whether training can keep interacting with the environment and collecting new data.

In the previous sections, we derived Bellman equations and saw how TD errors update a value table. All of these update rules share a common prerequisite:

we need data.

To update $V(s)$ or $Q(s, a)$, we need tuples like $(s, a, r, s')$ as raw material.

The key question is: where does this raw material come from?

Consider learning to ride a bicycle:

  1. You ride yourself, fall, reflect, and adjust immediately.
  2. You sit aside and watch others ride (or watch recordings of yourself), and try to infer a better strategy in your head.
  3. A coach gives you a thick book of "human crash records" and forbids touching a real bicycle until you finish studying.

These correspond to very different data paradigms in RL. Often the biggest differences between algorithms are not the loss functions, but how they collect and use data.

Core concepts

  • On-policy: training data is generated by the policy currently being optimized (what you do is what you learn).
  • Off-policy: training data can come from old policies, experts, or humans, and is used to optimize a different target policy.
  • Online: during training, the agent continues interacting with the environment and the dataset keeps growing.
  • Offline: the dataset is fixed before training; interaction with the environment is not allowed during training.

Two Policy Roles: Who Acts, Who Learns?

Before we define the axes, we must separate two policy roles that are often conflated.

In many introductory explanations, we pretend the agent has a single policy π\pi: it both acts in the environment and is updated by learning. In modern RL systems, these roles are often separated.

  1. Behavior policy (often written $\mu(a \mid s)$). This answers: "How was this batch of data collected?" The behavior policy is the one that actually interacts with the environment, presses buttons, or emits tokens. It could be an $\epsilon$-greedy exploratory policy, an older snapshot from yesterday, an expert model, or even a human.

  2. Target policy (often written $\pi_\theta(a \mid s)$). This answers: "What policy do we ultimately want to learn?" This is the optimization object. Regardless of where the data comes from, we want updates to $\theta$ to make the target policy better.
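To make the two roles concrete, here is a minimal sketch for a single state in a tabular setting. The Q-values and function names are hypothetical and chosen only for illustration: the behavior policy is $\epsilon$-greedy (it explores), while the target policy is purely greedy (it is what we want to end up with).

```python
def greedy_policy(q_row):
    """Target policy: probability 1 on the highest-valued action (first one on ties)."""
    best = max(range(len(q_row)), key=lambda a: q_row[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_row))]


def epsilon_greedy_policy(q_row, epsilon):
    """Behavior policy: mostly greedy, but spreads epsilon uniformly over all actions."""
    n = len(q_row)
    greedy = greedy_policy(q_row)
    return [(1.0 - epsilon) * p + epsilon / n for p in greedy]


q_row = [0.2, 1.5, -0.3]                 # hypothetical Q(s, a) values for one state
pi = greedy_policy(q_row)                # "who learns"
mu = epsilon_greedy_policy(q_row, 0.3)   # "who acts"
```

Both are valid probability distributions over the same actions, but they differ: `mu` keeps some probability on actions that `pi` would never choose, which is exactly what lets the data contain exploratory transitions.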

Once we separate "who acts" from "who learns", the first axis becomes natural.

Axis 1: On-Policy vs. Off-Policy

The core question is:

When updating the target policy, are we using data generated by (approximately) the same policy?

On-Policy: Learn From Your Own Fresh Data

If training data is generated by the current target policy (or a very close snapshot of it), we call it on-policy.

In symbols, behavior and target policies are essentially aligned:

$$\mu(a \mid s) \approx \pi_\theta(a \mid s).$$

Intuition: you take a practice exam, grade it immediately, fix your mistakes, then take the next exam using the updated you. You always learn from your latest behavior.

Algorithmically, Sarsa is a classic on-policy method: it updates based on the action $a'$ that the current policy would actually take at $s'$. It evaluates "what happens if I keep behaving the way I currently behave."

In LLM training, PPO (and GRPO variants) are typically on-policy. A common loop is: sample responses from the current model (rollout), score them (reward model or rules), then update the model immediately. PPO additionally constrains how far the new policy can deviate from the sampling policy (clipping) to preserve the on-policy character. [1]

  • Pros: conceptually clean and often stable.
  • Cons: sample-inefficient. Once the policy changes, old data is no longer "from the current policy", and is usually not reused directly.

Off-Policy: Learn From Someone Else's Data (Including Your Past Self)

If $\mu$ and $\pi_\theta$ can be different, we call it off-policy:

$$\mu(a \mid s) \neq \pi_\theta(a \mid s).$$

Intuition: you study from a notebook that contains your old mistakes, yesterday's attempts, and even an excellent student's notes. The data comes from various sources, but what you are learning is the best solution strategy.

The classic off-policy algorithm is Q-learning. Its target uses a greedy next action:

$$\text{Target} = r + \gamma \max_{a'} Q(s', a').$$

The behavior policy can be exploratory and messy, but the learning target assumes you will act greedily in the next step. Data is collected under $\mu$, but the implicit target policy is greedy. [2]
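The difference between the Sarsa and Q-learning targets fits in a few lines. The sketch below assumes a hypothetical dictionary-based Q-table and made-up numbers; only the two target formulas come from the text.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a_next that was actually taken at s_next."""
    return r + gamma * q[s_next][a_next]


def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: assumes the greedy action at s_next, whatever was actually taken."""
    return r + gamma * max(q[s_next])


q = {"s1": [0.0, 2.0], "s2": [1.0, 4.0]}   # hypothetical Q-table: state -> list of action values
r, gamma = 1.0, 0.9

# Suppose the exploratory behavior policy picked action 0 at s2.
t_sarsa = sarsa_target(q, r, "s2", 0, gamma)   # 1.0 + 0.9 * 1.0
t_q = q_learning_target(q, r, "s2", gamma)     # 1.0 + 0.9 * 4.0
```

The same transition yields two different targets: Sarsa evaluates the behavior that actually happened, while Q-learning evaluates the greedy policy even though the data came from an exploratory one.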

In deep RL, DQN makes off-policy practical via a replay buffer: it stores old experience and samples it repeatedly for training. Since the stored data may come from much older policies, clearly $\mu \neq \pi$. [3]
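A replay buffer can be sketched with the standard library alone. This is a simplified illustration, not DQN's actual implementation; the class name, the transition layout `(s, a, r, s_next, done)`, and the capacity are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling mixes transitions collected by many past behavior policies.
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((f"s{step}", 0, 0.0, f"s{step + 1}", False))
batch = buffer.sample(2, rng=random.Random(0))
```

Because a sampled batch may contain transitions from any policy that ever wrote to the buffer, the learner has to be an off-policy method such as Q-learning.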

  • Pros: much more sample-efficient. Old data can be reused; expert data can be leveraged.
  • Cons: distribution shift. If the target policy wants to take actions that never appear in the dataset, the algorithm must extrapolate, which can be wildly wrong. [4]

Axis 2: Online vs. Offline

On-policy/off-policy describes "who generated the data." Now zoom out and ask:

During training, can the agent keep interacting with the environment to collect new data?

This is the online vs. offline axis: does the dataset keep growing, or is it fixed?

Online RL: Learn While Interacting

If during training the agent continues to interact and append new samples, we call it online RL. The dataset grows over time:

$$\mathcal{D}_{k+1} = \mathcal{D}_k \cup \{\tau_k\}.$$

Whether it is DQN playing Atari or PPO training a robot to walk, if training is not finished, the agent is still collecting new rollouts.

Important: online is not the same as on-policy. DQN is off-policy (it reuses old data), but it is still online (it keeps playing the game and collecting new data).

The major advantage of online RL is that incorrect value estimates can be corrected by fresh interaction: if the agent is uncertain about an action, it can try it in the environment and observe the actual reward.

Offline RL: Learn From a Fixed Dataset

If the dataset is fully fixed before training and interaction is forbidden during training, we call it offline RL.

$$\mathcal{D} = \mathcal{D}_{\text{fixed}}$$

Why would we use such a strict setting? Because in many real scenarios, "trial and error" is too expensive. [5]

  • Autonomous driving: You cannot let a still-training AI drive a real car on the road to "explore" the consequences of a crash. You can only give it tens of thousands of hours of human driving footage (fixed data) to learn from on a server.
  • Medical diagnosis: You cannot let an AI experiment on patients' lives.

In LLM alignment, DPO (Direct Preference Optimization) is, in common setups, very close to offline RL. Researchers first prepare a fixed batch of preference data (prompt + human-preferred response + human-dispreferred response), then let the model optimize its parameters directly on this fixed dataset. During gradient updates, the model does not need to converse with humans to request new ratings. [6]
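For reference, the per-pair DPO loss from [6] can be written as a small function of four sequence log-probabilities. The sketch below assumes those log-probabilities have already been computed (summed over response tokens) for the policy and for a frozen reference model; the argument names and the value `beta=0.1` are illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (policy and frozen reference)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# With policy == reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# If the policy has shifted probability toward the chosen response, the loss decreases.
loss_after = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

Nothing in this loss requires sampling from the current policy: every input can be computed from the fixed preference dataset, which is why the common DPO setup behaves like offline training.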

The fatal challenge of offline RL:

Since the data is fixed, offline RL is usually also off-policy (the data was already collected by some historical behavior policy). The biggest difficulty in offline RL is overestimation that can never be falsified by new interaction.

Suppose the autonomous driving logs contain only normal driving data. During offline training, due to neural network extrapolation effects, the AI might produce a delusion: "If I violently yank the steering wheel at 120 km/h on the highway, I would get an extremely high reward!" In online RL, the AI would only need to try this once in a simulator: the car would be destroyed and the delusion shattered. But in offline RL, it never gets the chance to try, so this fatal misestimate persists in the value function, ultimately causing policy collapse.

To address this problem, offline RL algorithms (such as CQL) typically introduce conservatism: for actions not seen in the dataset, give uniformly low scores, forcing the AI to stay within the safe zone of known data. [7]
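One way to see the conservatism idea is the regularizer used in the CQL(H) variant from [7]: for each state, a log-sum-exp over all action values minus the value of the action that actually appears in the dataset. The sketch below shows only that penalty for a single state with discrete actions; the Q-values are made up, and in the full method this term is weighted and added to the usual Bellman error.

```python
import math


def cql_penalty(q_values, dataset_action):
    """Conservative term for one state: logsumexp over all actions minus Q of the logged action."""
    m = max(q_values)  # subtract the max for numerical stability
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[dataset_action]


# Hypothetical Q-values for one state; only action 0 appears in the dataset.
sane = [1.0, 0.5, 0.2]
deluded = [1.0, 0.5, 9.0]   # an unseen action has been extrapolated to a huge value
penalty_sane = cql_penalty(sane, dataset_action=0)
penalty_deluded = cql_penalty(deluded, dataset_action=0)
```

The penalty grows when an action outside the data receives a large value, so minimizing it pushes such values down while leaving the logged action's value comparatively high.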

Four Quadrants

Putting the two axes together gives a useful map:

| Data regime | Meaning | Typical examples | Main risk |
| --- | --- | --- | --- |
| Online + On-policy | keep interacting, and update from fresh data produced by the current policy | REINFORCE, Sarsa, PPO, GRPO | sample inefficiency |
| Online + Off-policy | keep interacting, but store and reuse older data | Q-learning, DQN, SAC, TD3 | distribution mismatch between stored data and current targets |
| Offline + Off-policy | train only from a fixed historical dataset | CQL, IQL, DPO-like fixed preference training | extrapolation to unseen actions |
| Offline + On-policy | fixed data while requiring it to match the current policy | fixed-policy evaluation, very small-step imitation updates | the data becomes stale as soon as the policy changes |

The important point is that the axes are independent. DQN is off-policy because it reuses replay data, but it is still online because the agent keeps collecting new transitions. Offline RL is usually off-policy because the dataset was generated before the current policy existed.

Common Misunderstandings

"On-policy means data can only be used once."

Not exactly. PPO often performs several epochs over the same rollout batch. The approximation remains acceptable because PPO clips the update so the new policy does not drift too far from the sampling policy.

"Off-policy means any data is useful."

No. Off-policy learning still needs coverage. If the dataset never contains states or actions that matter for the target policy, the algorithm must extrapolate. Neural networks can then assign high values to actions that have never been tested.

"Off-policy equals offline."

No. DQN is the standard counterexample: it is off-policy because it learns from replay, but online because the agent continues playing the game and adding data.

"DPO is just supervised learning, not RL."

Although DPO writes its loss in the form of a classification problem and does not need to explicitly fit a reward model, it is still shifting the policy distribution using fixed preference data. In essence it solves a policy optimization problem in an offline setting, and if the preference data covers the relevant responses poorly, it faces the same out-of-distribution risks that are typical of offline RL.

Training-Inference Mismatch in Large-Model RL

In textbook RL, on-policy looks clean: the same policy samples data and then gets updated. In large-language-model RL systems, the engineering reality is less clean.

Rollout generation and training are often handled by different stacks:

  • rollout side: high-throughput inference engines such as vLLM or SGLang, sometimes with lower precision and KV-cache optimizations;
  • training side: FSDP, Megatron-style training, activation recomputation, and different numerical kernels.

Even when the model weights are nominally the same, the log probabilities computed during rollout can differ from the log probabilities recomputed during training. In notation, the true behavior policy $\pi_{\text{rollout}}$ may not exactly match the recorded old policy $\pi_{\text{old}}$.

This matters for PPO, because the importance ratio is

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}.$$

PPO clipping assumes that the denominator represents the policy that actually sampled the action. If the denominator is already biased by inference-training mismatch, clipping can limit optimization drift but cannot fully correct the original mismatch.

The practical lesson is modest but important: in large-model RL, "on-policy" is often an approximation. Good systems reduce the gap with careful log-prob recomputation, consistent precision, importance-sampling corrections, and monitoring of policy lag.
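As an illustration of what such monitoring might look like, the sketch below compares per-token log-probabilities of the same sampled tokens as reported by the rollout engine and as recomputed by the training engine. The function name, the choice of statistics, and the numbers are assumptions for illustration, not a standard API.

```python
import math


def mismatch_stats(rollout_logps, train_logps):
    """Summarize the gap between rollout-engine and training-engine log-probs for sampled tokens."""
    # Each diff is log(pi_old / pi_rollout) for one sampled token.
    diffs = [t - r for r, t in zip(rollout_logps, train_logps)]
    ratios = [math.exp(d) for d in diffs]
    return {
        "mean_abs_logp_diff": sum(abs(d) for d in diffs) / len(diffs),
        "max_ratio": max(ratios),
        "min_ratio": min(ratios),
    }


# Hypothetical per-token log-probs for the same three sampled tokens.
rollout_logps = [-0.10, -2.30, -6.90]
train_logps = [-0.10, -2.25, -6.20]
stats = mismatch_stats(rollout_logps, train_logps)
```

In this made-up example the high-probability token agrees exactly, while the low-probability token is off by a factor of about two, so the gap is easiest to spot in the tail of the distribution.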

Training-Inference Mismatch and PPO

Readers might ask: what does this have to do with the PPO we discussed earlier? The answer is: PPO's clipping mechanism is a "defense" against training-inference mismatch, but it can only defend against half the problem.

PPO's core formula is:

$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}\left[\min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}$ is the importance sampling ratio. PPO uses clipping to keep $r_t$ from deviating too far from 1, essentially saying: "if the new policy is too different from the old policy that sampled the data, do not trust this gradient; clip it away."
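The clipped objective is easy to evaluate by hand for a single sampled action. The sketch below is a scalar illustration with made-up probabilities and advantage; a real implementation would operate on batched tensors and differentiate through the new log-probability.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for one sampled action (a quantity to be maximized)."""
    ratio = math.exp(logp_new - logp_old)            # r_t(theta)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)


# Positive advantage: pushing the ratio beyond 1 + eps earns no extra objective.
inside = ppo_clip_term(math.log(0.55), math.log(0.5), advantage=2.0)    # ratio 1.1
outside = ppo_clip_term(math.log(0.9), math.log(0.5), advantage=2.0)    # ratio 1.8
# Negative advantage: the pessimistic (smaller) value is kept.
negative = ppo_clip_term(math.log(0.25), math.log(0.5), advantage=-1.0)  # ratio 0.5
```

With ratio 1.1 the term equals the unclipped value 2.2; with ratio 1.8 it is capped at 1.2 × 2.0 = 2.4, so there is no incentive to move further from the sampling policy on that sample.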

But PPO's clipping has a default assumption: the denominator $\pi_{\text{old}}$ is indeed "the policy that actually executed during sampling."

In classic RL (Atari, MuJoCo, etc.), sampling and training typically run in the same process with the same numerics, so $\pi_{\text{old}}$ is exactly the network weights saved at the moment of sampling, with no discrepancy. PPO's clipping then only has to guard against "drift from optimization," which is the job it was designed for.

But in LLM-RL, the situation changes:

  • $\pi_{\text{rollout}}$: the policy that actually took effect during sampling by the vLLM engine in FP8
  • $\pi_{\text{old}}$: the "policy you think was used during sampling," recomputed afterwards by the training framework in BF16/FP32

These two are not the same policy. That is, the importance sampling ratio $r_t$ has a biased denominator from the start: PPO's clipping tries to limit drift from optimization, but has no mechanism to correct the inconsistency between the inference engine and the training engine.

To use an analogy: PPO's clipping ensures you do not go too far from the old policy, but it does not guarantee that the "old policy" map itself is accurate. Training-inference mismatch means the map was biased from the start, and clipping cannot detect this problem.

This explains why LLM-RL training can still be unstable even with PPO clipping. Fixes for training-inference mismatch generally follow several lines:

  • Precision fixes: Use FP16/BF16 instead of FP8 for rollout to reduce the numerical discrepancy between $\pi_{\text{rollout}}$ and $\pi_{\text{old}}$. Some work goes the other direction and compresses training-side precision: FP8-RL in the veRL framework achieves W8A8 full-stack low-precision training, combined with importance sampling correction, improving rollout throughput by 44% while matching the BF16 baseline.
  • Importance sampling (IS) correction: Since $\pi_{\text{rollout}} \neq \pi_{\text{old}}$, explicitly introduce importance weights to correct the distribution shift. Truncated IS (TIS) is the most direct approach, clipping extreme IS ratios to prevent gradient explosion; more recent work includes MinPRO, which replaces the cumulative product with the minimum token-level ratio within a prefix, providing more stability under large off-policy drift.
  • Pruning long-tail tokens: Training-inference mismatch concentrates in low-probability regions; directly removing extreme long-tail tokens can eliminate the largest source of deviation at its origin.
  • MoE routing replay: Inference-time expert routing can differ from training-time routing; R3 (Rollout Routing Replay) replays the inference routing distribution during training, addressing the MoE-specific amplification of training-inference mismatch.
  • Optimization perspective: Treat training-inference mismatch as a dynamic optimization problem, triggering learning rate scheduling through signals like response length surges.
  • Engineering-side rollout correction: Before training, use the current training engine to recompute the rollout policy's log-probability, forcibly aligning $\pi_{\text{rollout}}$ and $\pi_{\text{old}}$. This is expensive but the most reliable option.
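As a rough sketch of the truncated importance sampling idea from the list above: multiply each token's PPO term by the ratio of the training-engine probability to the rollout-engine probability, capped at a constant. The exact form varies between implementations; the function names, the cap value of 2.0, and the scalar setting here are assumptions for illustration.

```python
import math


def tis_weight(logp_old, logp_rollout, cap=2.0):
    """Truncated importance weight min(pi_old / pi_rollout, cap) for one token."""
    return min(math.exp(logp_old - logp_rollout), cap)


def corrected_ppo_term(logp_new, logp_old, logp_rollout, advantage, eps=0.2, cap=2.0):
    """PPO clipped term rescaled by the truncated weight (the weight acts as a constant factor)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    return tis_weight(logp_old, logp_rollout, cap) * surrogate


no_gap = tis_weight(math.log(0.5), math.log(0.5))     # engines agree: weight 1
big_gap = tis_weight(math.log(0.5), math.log(0.01))   # raw ratio 50, truncated to the cap
term = corrected_ppo_term(math.log(0.55), math.log(0.5), math.log(0.5), advantage=2.0)
```

When the two engines agree, the weight is 1 and the usual PPO term is recovered; when they disagree strongly, the cap keeps a single token from dominating the update, at the cost of some bias.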

Making Peace with Reality

These lines of work collectively point to one conclusion: in LLM-RL engineering practice, there is no such thing as "purely" on-policy. What we can do is keep the gap between $\mu$ and $\pi_\theta$ within an acceptable range. PPO's clipping is one form of control, higher-precision rollout (FP16/BF16) is another, and R3 routing replay is yet another. The on/off-policy theory in the first half of the main text is a clean binary classification, while engineering reality is a continuous spectrum: on-policy in theory, but in practice always carrying a hint of off-policy flavor.

Summary

This section answered where RL data comes from:

  1. The behavior policy $\mu$ collects the data.
  2. The target policy $\pi_\theta$ is the policy being optimized.
  3. On-policy vs. off-policy asks whether those policies match.
  4. Online vs. offline asks whether new data can still be collected during training.
  5. The two axes form four regimes, and each regime has different stability and data-efficiency tradeoffs.

At this point, we know how to define value, how to update value estimates, and how to classify data sources. The next issue is reward design: if the reward function is wrong, the agent may learn exactly what we asked for and still fail at what we meant.


Appendix: Reading Terminology from Paper Titles

RL and Agentic RL paper titles are often packed with abbreviations and jargon. Rather than memorizing definitions, it helps to look directly at real paper titles -- the titles themselves are the most authentic usage examples. Below we use recent papers from 2024-2026 to unpack the core terminology that appears in titles, one by one.

On-policy and Off-policy: Who Sampled the Data?

These two words are among the most frequently appearing adjectives in RL paper titles. They describe the relationship between training data and the current policy.

Typical paper title breakdown:

"Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends" (arXiv 2509.24203, 2025)

This paper's title directly challenges a popular perception. DeepSeek's GRPO is usually treated as an on-policy algorithm, but the authors show that it admits a natural off-policy interpretation in mathematical terms. The "Secretly an Off-Policy Algorithm" in the title says: you thought the data had to be collected by the current policy itself (on-policy), but old data can legitimately be used as well (off-policy).

"Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?" (arXiv 2510.01161, 2025)

The "Off-Policy RL" + "Stale Data" in the title precisely captures the core tension of off-policy learning: the data was generated by old policies (stale), but you want to use it to train a new policy. This paper proposes the M2PO algorithm, which constrains the second moment of importance weights to make off-policy training match on-policy performance on models from 1.7B to 32B parameters.

"On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting" (arXiv 2508.11408, 2025)

This title places On-Policy and Off-Policy in opposition and tries to fuse them. "On-Policy RL" refers to the RL phase where the model samples and learns from its own data; "Off-Policy Experts" refers to the SFT phase using human-annotated data (from "experts," clearly not produced by the current policy). The proposed CHORD framework dynamically weights these two data sources, a typical case of on/off-policy mixing in LLM training.

"Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning" (arXiv 2511.10843, AAAI 2026)

"Behaviour Policy" and "Off-Policy" appearing together in the title echo this section's core concepts: in off-policy settings, the behavior policy $\mu$ and the target policy $\pi$ are separated, and this paper proves that a carefully designed behavior policy can yield lower variance than on-policy sampling.

One-sentence summary: When you see On-policy in a title, it means "the model updates itself with its own data"; when you see Off-policy, it means "the model is learning from someone else's data (or an older version of itself)."

Online and Offline: Can We Still Collect Data?

These two words describe whether the dataset is still growing during training, and are orthogonal to the on/off-policy dimension.

Typical paper title breakdown:

"Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies" (arXiv 2509.05735, RLC 2025)

The title directly pits Offline and Online against each other. This paper compares the two paradigms across 31 environments, concluding that online agents generally outperform offline agents, and that the main reason for offline performance degradation is encountering out-of-distribution (OOD) states at test time. This is precisely the fatal weakness of offline RL: what the dataset never showed stays unknown, with no opportunity to try it.

"Understanding the Performance Gap Between Online and Offline Alignment Algorithms" (arXiv 2405.08448, NeurIPS 2024)

The "Online and Offline Alignment" in the title places these concepts in the context of LLM alignment. Online alignment refers to methods like PPO that sample and train simultaneously; offline alignment refers to methods like DPO that optimize directly on fixed preference data. The paper systematically analyzes why online methods generally outperform offline methods in practice.

One-sentence summary: When you see Online in a title, it means the agent is still interacting with the environment during training and the dataset is growing; when you see Offline, it means the dataset is sealed and the agent is not allowed to interact with the environment during training.

Two Axes Crossed: Paper Examples for the Four Quadrants

Crossing the on/off-policy and online/offline axes, each of the four quadrants in the main text has corresponding frontier papers:

| Quadrant | Representative paper | Title keyword interpretation |
| --- | --- | --- |
| Online + On-policy | PPO (Schulman et al., 2017), GRPO (DeepSeek, 2024) | Sample and learn simultaneously; each fresh batch is used for at most a few epochs and then discarded. |
| Online + Off-policy | "TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning" (ICLR 2025 Spotlight) | Off-Policy means using experience replay to reuse old data, but Episodic means it is still continuously starting new episodes and collecting new data (Online). |
| Offline + Off-policy | "Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL" (arXiv 2405.18520, 2024) | Offline-Boosted refers to an offline-style policy learned from data that has already been collected; Off-Policy means the behavior policy and target policy are different. OBAC identifies the best-performing historical policies from the replay buffer to constrain the online policy's learning, so it is better read as offline-RL machinery used inside an online loop than as a purely offline method. |
| Offline + On-policy | (Boundary case, rarely appears independently) | Fixed data while requiring the data to represent the current policy. This almost only occurs in policy evaluation or very small-step imitation learning updates. |

References


  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347

  2. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292. https://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html

  3. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533. https://doi.org/10.1038/nature14236

  4. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. See Sections 5.5, 5.7, 6.4, and 6.5 on off-policy prediction, off-policy control, Sarsa, and Q-learning. MIT Press page: https://mitpress.mit.edu/9780262039246/reinforcement-learning/

  5. Levine, S., Kumar, A., Tucker, G., & Fu, J. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. https://arxiv.org/abs/2005.01643

  6. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290. https://arxiv.org/abs/2305.18290

  7. Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020. https://papers.nips.cc/paper_files/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html
