C. Code Cheatsheet

Skim this once in the 30 minutes before an interview. For each item, memorize one sentence plus one formula. That is usually enough.

This appendix covers the algorithms that candidates are most often asked to write by hand in LLM post-training / RLHF interviews, with a rough frequency rating for each. Each topic is presented from four angles:

View	What It Is For
One-line memory	The short mantra you can recite before walking into the room
Pseudocode	The whiteboard version
Python	Explaining the logic with NumPy / plain Python
PyTorch	The engineering version interviewers often probe
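As a hedged illustration of how the first three views fit together, here is a minimal plain-Python sketch for one topic from this appendix, group-wise normalization in GRPO. The function name, the `eps` value, and the choice of population standard deviation are assumptions made for this example; real codebases differ on these details.

```python
import statistics

# One-line memory: advantage = (reward - group mean) / group std.
#
# Pseudocode:
#   for each prompt, sample G responses and score them
#   A_i = (r_i - mean(r_1..r_G)) / (std(r_1..r_G) + eps)

def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize the rewards of responses sampled for the SAME prompt.

    rewards: list of scalar rewards, one per sampled response.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    # Population std is an assumption here; some implementations use the sample std.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Calling `group_normalized_advantages([1.0, 0.0, 0.0, 1.0])` gives values close to `[1, -1, -1, 1]`: responses above the group mean get a positive advantage and the rest a negative one, with no value network involved.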

Contents

Section	Topic	Frequency (out of 5)
C.1 SFT Loss and KL Divergence	autoregressive SFT loss, shift-right, KL estimates	4/5
C.2 PPO Policy Loss and GAE	clipped surrogate, value loss, reverse-time GAE recursion	5/5
C.3 DPO and Variants	DPO loss, IPO, KTO, SimPO	5/5
C.4 GRPO and Reward Models	group-wise normalization in GRPO, Bradley-Terry reward model	4/5
C.5 Softmax and Cross-Entropy	numerically stable softmax, log-sum-exp, CE loss	4/5
C.6 Top-k / Top-p Sampling	temperature, top-k, top-p (nucleus) decoding	4/5
C.7 Attention / MHA / GQA	scaled dot-product attention, multi-head attention, MQA, GQA	5/5
C.8 DAPO	decoupled clipping, dynamic sampling, overlong penalty shaping	3/5
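To give a feel for the level of detail these sections aim at, the reverse-time GAE recursion listed under C.2 can be sketched in plain Python. This is a minimal sketch under stated assumptions: a single episode with no `done` flags, a `values` list that carries one extra bootstrap entry, and illustrative default values for `gamma` and `lam`. The function name is hypothetical.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], one entry longer than rewards;
             the last entry is the bootstrap value (0.0 if s_T is terminal).
    Returns (advantages, returns), both of length T.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    # Walk backwards so each step reuses the accumulated future advantage.
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Value-function targets: advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns
```

With `gamma=1.0` and `lam=1.0` the recursion reduces to the Monte Carlo return minus the value baseline; with `lam=0.0` it reduces to the one-step TD residual.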

How To Use This Appendix

Start by memorizing the one-line mantra. Each file opens with a short sentence that is enough to reconstruct the pseudocode.
Prioritize pseudocode. In a whiteboard interview, pseudocode plus clear variable definitions is often sufficient.
Use the PyTorch snippet for details. If the interviewer asks about implementation specifics (for example ignore_index, log_sum_exp, clamp), jump to the PyTorch section.
Review the “Common Pitfalls.” Each file ends with a short list of high-frequency mistakes. Read those the night before.
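As an example of the kind of implementation detail an interviewer may probe, the sketch below shows in plain Python what `ignore_index` and a log-sum-exp helper do inside a cross-entropy loss. It is an illustration, not the PyTorch implementation: the function names are hypothetical, the sentinel `-100` mirrors the default `ignore_index` of PyTorch's cross-entropy loss, and returning `0.0` when every target is ignored is a choice made here for simplicity.

```python
import math

IGNORE_INDEX = -100  # same sentinel as PyTorch's default ignore_index

def log_sum_exp(xs):
    """Stable log(sum(exp(x))): subtract the max so exp() cannot overflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose target is not ignore_index.

    logits:  list of per-position score lists (one score per class).
    targets: list of class ids, with ignore_index marking masked positions
             (for example prompt or padding tokens in SFT).
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked positions contribute neither loss nor count
        # -log softmax(row)[target] = logsumexp(row) - row[target]
        total += log_sum_exp(row) - row[target]
        count += 1
    # Assumption: return 0.0 instead of NaN when everything is masked.
    return total / max(count, 1)
```

Averaging over the unmasked count rather than the full sequence length is the detail most often missed: dividing by the full length silently shrinks the loss for examples with long masked prompts.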