E.3 Calculus and Optimization

Training a reinforcement learning agent is, at its core, a matter of adjusting parameters: making the average return higher and higher, or making prediction error smaller and smaller. The underlying language for this process is calculus. Derivatives tell us "which way to move", gradients tell us "how each parameter should move", and the chain rule lets that signal travel backward through the entire computation graph.
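As a concrete illustration of the three ideas above, here is a minimal sketch in plain Python. The tiny model `y = tanh(w * x + b)`, the objective, the learning rate, and all function names are assumptions chosen for illustration and do not come from this section. The gradient is built factor by factor with the chain rule, checked against a finite-difference estimate, and then used for gradient ascent.

```python
import math

def objective(w, b, x=2.0, target=1.0):
    """Toy objective J(w, b) = -(tanh(w*x + b) - target)^2; higher is better."""
    y = math.tanh(w * x + b)
    return -(y - target) ** 2

def gradient(w, b, x=2.0, target=1.0):
    """Analytic gradient via the chain rule: dJ/dy * dy/dz * dz/d(param)."""
    z = w * x + b
    y = math.tanh(z)
    dJ_dy = -2.0 * (y - target)  # derivative of the outer squared error
    dy_dz = 1.0 - y ** 2         # derivative of tanh
    # dz/dw = x and dz/db = 1 are the innermost factors.
    return dJ_dy * dy_dz * x, dJ_dy * dy_dz * 1.0

def numeric_gradient(w, b, eps=1e-6):
    """Central finite differences, used only to sanity-check the chain rule."""
    dw = (objective(w + eps, b) - objective(w - eps, b)) / (2 * eps)
    db = (objective(w, b + eps) - objective(w, b - eps)) / (2 * eps)
    return dw, db

def gradient_ascent(w=0.0, b=0.0, lr=0.1, steps=200):
    """Repeatedly move each parameter a small step along its gradient."""
    history = [objective(w, b)]
    for _ in range(steps):
        gw, gb = gradient(w, b)
        w, b = w + lr * gw, b + lr * gb
        history.append(objective(w, b))
    return w, b, history

w_final, b_final, history = gradient_ascent()
```

The sign of each gradient component answers "which way to move", its relative size answers "how each parameter should move", and the product of local derivatives inside `gradient` is the chain rule carrying the signal from the objective back to `w` and `b`.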

This section follows that thread. We start from functions and rates of change, move step by step to derivatives, gradients, and the chain rule, then see how these tools appear in policy gradients, Taylor approximations, PPO clipping, and GRPO normalization.
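To preview where this thread leads, the following sketch shows the log-probability gradient of a softmax policy and a single advantage-weighted update. It assumes a three-action policy parameterized directly by its logits; the function names, the learning rate, and the advantage values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """For a softmax policy: d log pi(action) / d logits = one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def policy_gradient_step(logits, action, advantage, lr=0.5):
    """One ascent step on advantage * log pi(action)."""
    g = grad_log_prob(logits, action)
    return [v + lr * advantage * gi for v, gi in zip(logits, g)]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]
after_good = softmax(policy_gradient_step(logits, action=1, advantage=+1.0))[1]
after_bad = softmax(policy_gradient_step(logits, action=1, advantage=-1.0))[1]
```

With a positive advantage the probability of the sampled action rises above its starting value of 1/3, and with a negative advantage it falls below it, which is the "increase the probability of good actions" behavior that the policy-gradient derivation formalizes.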

Figure: Gradient update diagram

Roadmap

Article	Mathematical progression	Role in reinforcement learning
E.3.1 Derivatives, Gradients, and the Chain Rule	Function -> derivative -> gradient -> chain rule	Understand how parameters affect the objective function
E.3.2 From Gradients to Policy Gradients	Log-probability gradient -> return weighting -> advantage function	Derive the update direction behind "increase the probability of good actions"
E.3.3 Optimization Stability: PPO and Adam	Probability ratio -> clipping -> adaptive step size	Control policy update size and gradient noise
E.3.4 Derivation Tools: Log Trick and Taylor	Log-derivative trick -> Taylor expansion -> second-order intuition	Understand the derivation skeleton of policy gradients and PPO
E.3.5 Complete Optimization Formulas	Full expressions for PG, DQN, GAE, PPO, GRPO	Connect modern RL training objectives
E.3.6 Summary, Formulas, and Exercises	Formula review -> pitfalls -> exercises	Review and check understanding
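Two of the stability tools named in the roadmap can be sketched in a few lines. The code below assumes the commonly published forms of the PPO clipped surrogate and the GRPO group-normalized advantage; the notation in the later articles may differ. The function names, the clipping range `eps=0.2`, the use of the population standard deviation, and the small constant `tiny` are illustrative assumptions.

```python
import math

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def grpo_advantages(rewards, tiny=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + tiny).

    Assumes the population standard deviation over one group of samples;
    `tiny` guards against division by zero when all rewards are equal.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + tiny) for r in rewards]

# The ratio moved too far in the favorable direction: the gain is capped.
capped_gain = ppo_clip_term(ratio=1.5, advantage=1.0)
# The ratio moved in the unfavorable direction: the full penalty is kept.
full_penalty = ppo_clip_term(ratio=1.5, advantage=-1.0)
# Rewards from one group of four sampled responses to the same prompt.
group_adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Taking the minimum makes the surrogate pessimistic: it stops rewarding a ratio that has already moved past the clipping range in the favorable direction, but it never hides a penalty. The group normalization centers rewards within a group so that above-average samples receive positive advantages and below-average samples receive negative ones.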