C.1 SFT Loss and KL Divergence

SFT Loss (Autoregressive Cross-Entropy)

Core problem: predict the next token at every position, computing loss only on the answer part.

Core variables:

logits: model output, shape [B, seq_len, vocab_size]; position $t$ predicts $t + 1$
labels: ground-truth token ids; prompt tokens marked with ignore_index=-100
ignore_index: cross-entropy skips this index (default -100)

One-Line Memory

Cut the tail of logits, the head of labels: position $t$ predicts $t + 1$ . Mask prompt positions with -100 so they don't enter the loss.

Pseudocode

logits = model(input_ids)                # position t predicts t+1
shift_logits = logits[:, :-1, :]         # cut tail: no "next" after end
shift_labels = labels[:, 1:]             # cut head: nobody predicts the first
loss = cross_entropy(shift_logits, shift_labels, ignore_index=-100)

An autoregressive model predicts the token at position $t + 1$ from the prefix up to $t$ , so logits index $t$ aligns with labels index $t + 1$ .

Python Implementation

python

import numpy as np

def softmax(x, axis=-1):
    x_max = np.max(x, axis=axis, keepdims=True)
    e_x = np.exp(x - x_max)  # subtract max first to avoid overflow
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def sft_loss(logits, labels, ignore_index=-100):
    """
    logits: [seq_len, vocab_size]
    labels: [seq_len]  (unshifted)
    """
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    probs = softmax(shift_logits, axis=-1)
    total, count = 0.0, 0
    for t in range(len(shift_labels)):
        if shift_labels[t] == ignore_index:
            continue
        total += -np.log(probs[t, shift_labels[t]] + 1e-12)
        count += 1
    return total / max(count, 1)

PyTorch Implementation

python

import torch
import torch.nn.functional as F

def sft_loss(logits, labels, ignore_index=-100):
    """
    logits: [B, seq_len, vocab_size]
    labels: [B, seq_len]
    """
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )

KL Divergence Estimates

Core problem: estimate the gap between the current policy $p$ and a reference policy $q$ , used as the KL penalty in PPO / GRPO.

Core variables:

log_probs: log-probabilities of sampled tokens under the current policy $p$
ref_log_probs: log-probabilities of the same tokens under the reference policy $q$ (usually a frozen SFT model)
log_ratio: $lo g (q / p)$ , the core quantity for k3

One-Line Memory

k1: mean(log_p − log_q), simple and unbiased but can go negative; k3: mean(exp(Δ) − 1 − Δ), $Δ = lo g \frac{q}{p}$ , always nonnegative.

Pseudocode

# k1 (common in PPO): plain average, unbiased but high variance, can go negative
kl = (log_probs - ref_log_probs).mean()

# k3 (default in GRPO / trl): always nonnegative, ratio direction q/p
log_ratio = ref_log_probs - log_probs        # log(q/p)
kl = (exp(log_ratio) - 1 - log_ratio).mean()

Python Implementation

python

import numpy as np

def kl_k1(log_p, log_q):
    """E_p[log p - log q]: unbiased, high variance, can be negative with few samples."""
    return np.mean(log_p - log_q)

def kl_k3(log_p, log_q):
    """E_p[exp(log q - log p) - 1 - (log q - log p)]: unbiased and always nonnegative."""
    log_ratio = log_q - log_p
    return np.mean(np.exp(log_ratio) - 1 - log_ratio)

PyTorch Implementation

python

import torch

def kl_penalty(log_probs, ref_log_probs, mode="k3"):
    """
    log_probs:     [B, seq_len]  current policy p
    ref_log_probs: [B, seq_len]  reference policy q
    """
    if mode == "k1":
        return (log_probs - ref_log_probs).mean()

    log_ratio = ref_log_probs - log_probs   # log(q/p)
    return (torch.exp(log_ratio) - 1 - log_ratio).mean()

Comparing the Two Estimators

Samples come from $p$ ; the target is $KL (p ∥ q)$ :

Estimator	Formula	Notes
k1	$E_{p} [lo g \frac{p}{q}]$	unbiased, simple, can be negative with limited samples
k3	$E_{p} [\frac{q}{p} - 1 - lo g \frac{q}{p}]$	unbiased, always $\geq 0$ , default in GRPO

Pitfall

In k3 the ratio must be $q / p$ (ref/current). Since $e^{x} - 1 - x \geq 0$ for every real $x$ , this guarantees nonnegativity; flipping it to $p / q$ keeps the value nonnegative but the expectation is no longer $KL (p ∥ q)$ .

Common Pitfalls

Pitfall	Explanation
Shift direction reversed	Cut the tail of logits and the head of labels: position $t$ predicts $t + 1$
Forgot `ignore_index`	Prompt tokens are marked `-100` and excluded from the loss
k3 ratio reversed	Must be $q / p$ (ref/current); flipping it biases the expectation
k1 with too few samples	A single batch can yield a negative estimate — that is sampling noise, not a bug
Softmax overflow	Subtract `max(x)` before `exp`
Missing `.contiguous()`	PyTorch `view` on a slice may fail; add `.contiguous()`

D.1 Linear Algebra

D.2 Probability & Estimation

D.3 Calculus & Optimization

D.4 Information Theory

C.1 SFT Loss and KL Divergence

SFT Loss (Autoregressive Cross-Entropy)

One-Line Memory

Pseudocode

Python Implementation

PyTorch Implementation

KL Divergence Estimates

One-Line Memory

Pseudocode

Python Implementation

PyTorch Implementation

Comparing the Two Estimators

Common Pitfalls

C.1 SFT Loss and KL Divergence ​

SFT Loss (Autoregressive Cross-Entropy) ​

One-Line Memory ​

Pseudocode ​

Python Implementation ​

PyTorch Implementation ​

KL Divergence Estimates ​

One-Line Memory ​

Pseudocode ​

Python Implementation ​

PyTorch Implementation ​

Comparing the Two Estimators ​

Common Pitfalls ​

C.1 SFT Loss and KL Divergence

SFT Loss (Autoregressive Cross-Entropy)

One-Line Memory

Pseudocode

Python Implementation

PyTorch Implementation

KL Divergence Estimates

One-Line Memory

Pseudocode

Python Implementation

PyTorch Implementation

Comparing the Two Estimators

Common Pitfalls