B. RL Engineering Practice

Once you have learned the core RL algorithms, a different reality quickly becomes obvious: the real difficulty is rarely the algorithm itself. It is the engineering.

The model does not fit on a single GPU. Training runs for an entire day and the loss still goes up. Offline scores disagree with your intuition. The evaluation looks fine, but the product regresses. Standard RL textbooks do not address these problems, yet in real work you will face them every week.

This appendix is deliberately structured so that each section explains one thing clearly. Jump around as needed.

Structure of This Appendix

Section	Topic	What Problem It Solves
B.1 RL Training Infrastructure: Sampling, Asynchrony, and Distributed Systems	How an RL training system actually runs	Sampling bottlenecks, rollout engines, async training, weight synchronization, DP/TP/PP/EP
B.2 Agentic RL Infrastructure	What infrastructure Agentic RL requires	Sandboxes, trajectory storage, tool execution, multi-turn scheduling, a Relax case study
B.3 RL Post-Training and Agentic RL Benchmarks	How to tell whether the model and agent are improving	Post-training evaluation, agentic benchmarks, training monitoring, bad-case attribution, release gates
B.4 A Glossary of RL Training Metrics for LLMs	What the metrics in training logs actually mean	PPO/GRPO/DPO/RM metrics grouped by function, abnormal signals, framework differences
B.5 Industry Exercises	Practical skills for post-training and RL roles	Real job tasks decomposed into stable capabilities, a skills map, and 8 industry-style exercises

Reading Suggestions

If you are doing LLM post-training: read B.1 → B.3 → B.4.
If you are doing Agentic RL: read B.1 → B.2 → B.3.
If you are doing game or robotics RL: focus on the non-LLM part of B.1 and the monitoring part of B.3.
If you are preparing for interviews: start with the exercises in B.5, then go back to the relevant section whenever you find a gap.
If you only need the meaning of a metric: go directly to B.4 Metrics Glossary and look it up.