
C. Code Cheatsheet

Skim this once in the 30 minutes before an interview. For each item, memorize one sentence plus one formula. That is usually enough.

This appendix covers the algorithms that candidates are most often asked to write by hand in LLM post-training / RLHF interviews, ordered roughly by how often they come up. Each topic is presented from four angles:

| View | What It Is For |
| --- | --- |
| One-line memory | The short mantra you can recite before walking into the room |
| Pseudocode | The whiteboard version |
| Python | Explaining the logic with NumPy / plain Python |
| PyTorch | The engineering version interviewers often probe |

Contents

| Section | Topic | Frequency |
| --- | --- | --- |
| C.1 SFT Loss and KL Divergence | autoregressive SFT loss, shift-right, KL estimates | 4/5 |
| C.2 PPO Policy Loss and GAE | clipped surrogate, value loss, reverse-time GAE recursion | 5/5 |
| C.3 DPO and Variants | DPO loss, IPO, KTO, SimPO | 5/5 |
| C.4 GRPO and Reward Models | group-wise normalization in GRPO, Bradley-Terry reward model | 4/5 |
| C.5 Softmax and Cross-Entropy | numerically stable softmax, log-sum-exp, CE loss | 4/5 |
| C.6 Top-k / Top-p Sampling | temperature, top-k, top-p (nucleus) decoding | 4/5 |
| C.7 Attention / MHA / GQA | scaled dot-product attention, multi-head attention, MQA, GQA | 5/5 |
| C.8 DAPO | decoupled clipping, dynamic sampling, overlong penalty shaping | 3/5 |

How To Use This Appendix

  1. Start by memorizing the one-line mantra. Each file opens with a short sentence that is enough to reconstruct the pseudocode.
  2. Prioritize pseudocode. In a whiteboard interview, pseudocode plus clear variable definitions is often sufficient.
  3. Use the PyTorch snippet for details. If the interviewer asks about implementation specifics (for example ignore_index, log_sum_exp, clamp), jump to the PyTorch section.
  4. Review the “Common Pitfalls.” Each file ends with a short list of high-frequency mistakes. Read those the night before.
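Item 3 above names three implementation details: ignore_index, log_sum_exp, and clamp. As a quick reminder of what each one does, here is a minimal NumPy sketch. It is not taken from the individual sections: the function names `log_sum_exp`, `masked_cross_entropy`, and `clipped_ratio` are illustrative, `np.clip` stands in for PyTorch's `clamp`, and the `-100` sentinel is assumed to mirror PyTorch's default `ignore_index`.

```python
import numpy as np

IGNORE_INDEX = -100  # assumption: same sentinel as PyTorch's default ignore_index


def log_sum_exp(x, axis=-1):
    """Numerically stable log(sum(exp(x))): subtract the max before exponentiating."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose target is not ignore_index.

    logits: (N, V) float array, targets: (N,) int array.
    """
    mask = targets != ignore_index
    safe_targets = np.where(mask, targets, 0)  # any valid index; masked out below
    log_probs = logits - log_sum_exp(logits)[:, None]
    nll = -log_probs[np.arange(len(targets)), safe_targets]
    return float(np.sum(nll * mask) / max(int(mask.sum()), 1))


def clipped_ratio(ratio, eps=0.2):
    """Clamp a probability ratio into [1 - eps, 1 + eps], as in a clipped surrogate."""
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)


# Large logits would overflow a naive exp(); the stable version stays finite.
logits = np.array([[1000.0, 1000.0], [0.0, 0.0]])
targets = np.array([0, IGNORE_INDEX])  # second position is ignored (e.g. padding)
print(masked_cross_entropy(logits, targets))
print(clipped_ratio(np.array([0.5, 1.0, 1.5])))
```

The masking step is the part most often forgotten on a whiteboard: ignored positions must be excluded from both the numerator and the denominator of the mean.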

Modern Reinforcement Learning: A Hands-On Course