Appendix D: Learning Resources and Reproduction Projects

Goal of this appendix: give you a clear map for continued study. The first half organizes textbooks and courses with solid theory and clear exposition, so you can systematically build foundations or explore frontiers. The second half surveys classic milestones and common environments in RL's game and simulation ecosystem, offering inspiration and reference points for your next hands-on reproduction project.

Recommended Learning Resources

How to use this list: This book covers the complete pipeline from MDP basics through PPO, DPO, and GRPO, but RL goes far beyond that. If you want to dive deeper into a specific direction, compare different teaching styles, or find hands-on practice resources, this list can serve as a starting point. All resources are free or publicly accessible.

Choose based on your goal:

Just finished Chapter 3, want to see how other textbooks cover basic theory: start with Zhao Shiyu's Mathematical Foundations of Reinforcement Learning or Sutton & Barto.
Want to follow video lectures: start with David Silver's course or Li Hongyi's course.
Want to write code: start with OpenAI Spinning Up or Dive into Deep Reinforcement Learning.
Interested in LLM alignment / RLHF / GRPO: start with Nathan Lambert's RLHF Book or Ernest Ryu's RL-LLM course.
Want to explore frontier theory: start with Princeton ECE 524 or MIT 6.7920.

I. Classic Textbooks

Reinforcement Learning: An Introduction (Sutton & Barto, 2nd Edition, 2018)

URL: incompleteideas.net/book/the-book-2nd.html | Chinese Translation

The standard textbook for RL, listed as required reading by nearly every university RL course. After an introductory Chapter 1, it has three parts: Part I (tabular methods, Ch2-8) covers bandits, MDP, DP, MC, TD, n-step bootstrapping, planning; Part II (approximate methods, Ch9-13) covers function approximation, eligibility traces, policy gradients; Part III (Ch14-17) discusses psychology, neuroscience, and applications. Free PDF; Chinese translation quality is high. Best for systematically building foundations.

Mathematical Foundations of Reinforcement Learning (Zhao Shiyu)

URL: github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning (GitHub 10k+ stars)

Published by Springer + Tsinghua University Press. 10 chapters rigorously deriving core RL algorithms from a mathematical perspective: Bellman equations → VI/PI → MC → TD (including Sarsa, Q-Learning, n-step Sarsa) → function approximation → policy gradients → Actor-Critic. Each chapter includes mathematical proofs and exercises. Best for readers who prefer rigorous derivation and want to understand "why these algorithms work" at the mathematical level.

Deep Reinforcement Learning (Zhang Zhihua, Peking University)

URL: PDF Draft

Textbook for Peking University's math department course. Assumes ML basics but not necessarily RL familiarity. Covers value-based learning (DQN), policy learning (Policy Gradient), Actor-Critic, TRPO, etc. Paired with Wang Shusen's Bilibili video course. Best for Chinese readers seeking a quick DRL introduction.

Dive into Deep Reinforcement Learning (Zhang Weinan, Shen Jian, Yu Yong)

URL: Online Version | Shanghai Jiao Tong University RL course textbook

Practice-oriented with runnable Jupyter code throughout. Three parts: basics (Bandit → MDP → DP → MC → Planning) → advanced (function approximation → DQN → policy gradients → PPO) → frontier (Model-Based RL, Offline RL). Best for learners who want to read and code simultaneously.

II. University Courses

European and American Courses

Stanford CS234: Reinforcement Learning (Emma Brunskill)

URL: web.stanford.edu/class/cs234/

Stanford's foundational RL course. From tabular MDPs through policy evaluation, Q-Learning, function approximation, policy gradients, Offline RL, exploration, MCTS, and finally RLHF. About half the lectures build theory; the other half cover advanced topics. Textbook: Sutton & Barto.

Stanford CS224R: Deep Reinforcement Learning (Chelsea Finn)

URL: cs224r.stanford.edu | YouTube 2025

Stanford's Deep RL course. Assumes RL basics; starts directly with imitation learning, quickly moving into policy gradients, Actor-Critic, Q-Learning, Model-Based RL, Offline RL, Reward Learning, RLHF, and Meta-RL. Best for learners who already know basics and want to dive deep into DRL directions.

MIT 6.7920: Reinforcement Learning Foundations and Methods (Cathy Wu)

URL: web.mit.edu/6.7920/www/

MIT's RL theory course. Two-thirds "exploitation" (known theory: DP 7 lectures + RL core methods 9 lectures), one-third "exploration" (frontier topics). DP section is very solid, covering finite/infinite horizon, LQR, policy/value iteration, convergence proofs. Best for learners seeking theoretical depth.

UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)

URL: rail.eecs.berkeley.edu/deeprlcourse/

Berkeley's flagship Deep RL course. Only 1 lecture reviews RL basics, then dives into imitation learning, policy gradients, Actor-Critic, Value-Based RL, advanced policy gradients, variational inference & RL, LLM RL, Model-Based RL, Offline RL, and exploration. The 2026 spring edition adds hands-on assignments for LLM RL and Offline RL. Content most aligned with current industrial frontiers.

CMU 10-703: Deep Reinforcement Learning and Control

URL: cmudeeprl.github.io/703website_f25/

CMU's Deep RL course. After covering classical theory (MDP, DP, MC, TD), moves into function approximation, Deep Q-Learning, MCTS, policy gradients, imitation learning, inverse RL, optimal control, Model-Based RL, and exploration. Balanced theory and practice with broad coverage.

University of Alberta CMPUT 365: Introduction to RL (Marlos Machado)

URL: Syllabus PDF

Introductory RL course at Sutton's university, strictly following Sutton & Barto order: Bandits → MDP → DP (including PI, VI, GPI) → MC prediction and control → TD prediction → TD control (Sarsa, Q-Learning) → Planning (Dyna-Q) → function approximation → policy gradients. Most faithful course implementation of Sutton & Barto.

Georgia Tech CS 7642: Reinforcement Learning (OMSCS)

URL: omscs.gatech.edu/cs-7642-reinforcement-learning

Online RL course. Covers DP, TD (including Sarsa), n-step TD, Lambda Return, DQN, policy gradients, multi-agent RL, game theory, and POMDP. One of the best-regarded RL courses in the OMSCS program.

Princeton ECE 524: Foundations of RL (Chi Jin)

URL: sites.google.com/view/cjin/teaching/ece524 | YouTube

Theory-oriented, emphasizing finite-sample analysis and convergence proofs. Part I covers tabular MDPs, planning, exploration (Bandit and MDP), lower bounds; Part II covers large state spaces, linear VI, function approximation, multi-agent, and POMDP. Best for learners aiming to do RL theory research.

David Silver RL Course (UCL / DeepMind)

URL: davidsilver.uk/teaching | YouTube

10 classic lectures: MDP → DP → Model-Free Prediction → Model-Free Control → function approximation → policy gradients → Learning & Planning → exploration → classic game case studies. David Silver is the first author of AlphaGo/AlphaZero. Concise structure, clear explanations; the most widely disseminated RL video course.

DeepMind x UCL RL Lecture Series (2021)

URL: YouTube Playlist

Updated version of David Silver's course, taught by DeepMind researchers (Hado van Hasselt et al.). 13 lectures covering exploration and control, MDPs and DP, model-free methods, function approximation, planning, policy gradients and Actor-Critic, approximate DP, multi-step and off-policy, and Deep RL. More in-depth than the 2015 version with additional frontier content.

Chinese University Courses

Tsinghua University Reinforcement Learning (Fall 2025)

URL: coai.cs.tsinghua.edu.cn/Courses/RL2025/_site/

Undergraduate RL course. Starting from multi-armed bandits, covers MDP, Planning (DP), MC, TD Learning, policy gradients, function approximation, and Deep RL. 4 programming assignments (Bandit → MDP → TD & PG → Deep RL) + course project. Lecture slides are publicly available.

Nanjing University Foundations of Reinforcement Learning (Yu Yang, 2024)

URL: lamda.nju.edu.cn/introrl

Based on Sutton & Barto. 9 lectures covering RL basics, MDP, DP, MC, TD, and DQN. 5 programming assignments (DAgger → Q-Learning → DQN → Model-Based → Offline RL). One of the most theoretically solid Chinese university RL courses.

Nanjing University Advanced Reinforcement Learning (Yuan Lei, 2025)

URL: lamda.nju.edu.cn/advanceRL

Graduate advanced course. Covers DDPG/TD3, PPO techniques, multi-agent, RLHF/DPO theoretical derivations, and paper reading.

Shanghai Jiao Tong University Reinforcement Learning (Zhang Weinan, 2024)

URL: wnzhang.net/teaching/sjtu-rl-2024

Uses Dive into Deep Reinforcement Learning as textbook. 9 chapters covering basics through frontiers.

III. Chinese Online Courses and Tutorials

Li Hongyi Deep Reinforcement Learning (National Taiwan University)

URL: Course Page | Bilibili 2025

Uses Policy Gradient as the main thread, deeply explaining PPO (including Importance Sampling, On-policy → Off-policy derivation), then Q-Learning (DQN, Double DQN, Dueling DQN) and Actor-Critic. Lively explanations with polished slides. Most in-depth PPO coverage among Chinese courses.

Wang Shusen Deep Reinforcement Learning

URL: Bilibili Video

Video companion to Peking University's math department course. Five modules: basic concepts → value learning (DQN) → policy learning (Policy Gradient) → Actor-Critic (A3C, TRPO) → advanced (DDPG, AlphaGo, multi-agent). Paired with Zhang Zhihua's Deep Reinforcement Learning textbook. Concise content suitable for quick introduction.

Mushroom Book EasyRL (Datawhale)

URL: Online Version | GitHub

Synthesizes the best of Zhou Bolei's RL Outline, Li Hongyi's course, and Baidu's World Champion Takes You from Zero to RL Practice. 13 chapters + special topics, covering basics through DQN, PPO, DDPG, and AlphaStar. Most active open-source RL tutorial in the Chinese community.

Spinning Up Chinese Edition

URL: spinningup.qiwihui.com/zh-cn/latest

Chinese translation of OpenAI Spinning Up. Includes core concepts, algorithm taxonomy, policy gradient derivations, and implementations of VPG, TRPO, PPO, DDPG, TD3, and SAC.

IV. LLM Reinforcement Learning Specialized Resources

Nathan Lambert — RLHF Book + Course

URL: rlhfbook.com | Course | GitHub | YouTube

RLHF monograph by AI2 researcher Nathan Lambert. Covers the full RLHF pipeline: instruction tuning → reward model training → rejection sampling → PPO → DPO. Code repository implements PPO, REINFORCE, GRPO, RLOO and other policy gradient methods. 4 video lectures. Most systematic publicly available textbook on LLM alignment.

Ernest Ryu — Reinforcement Learning of Large Language Models (UCLA)

URL: ernestryu.com/courses/RL-LLM.html

The only university course that systematically combines classical RL theory with LLM RL. Three parts: Ch1 (5 lectures on classic RL: MDP → VI → PG → PPO/GRPO → AlphaGo) → Ch2 (4 lectures on LLM basics: NLP → Transformer → ICL/SFT) → Ch3 (2 lectures on LLM RL: RLHF/PPO/DPO → RLVR). LLM RL course with the deepest RL foundations.

DeepLearning.AI — Reinforcement Fine-Tuning LLMs with GRPO

URL: deeplearning.ai/short-courses/reinforcement-fine-tuning-llms-grpo

1-hour short course, 10 lessons. Uses Wordle as the running example, covering GRPO algorithm, reward function design, LLM-as-Judge, and reward hacking. 7 code experiments. Best for practitioners with LLM basics who want to quickly get started with GRPO.

Hugging Face — Deep RL Course

URL: huggingface.co/learn/deep-rl-course

8 units covering Q-Learning → DQN → Policy Gradient → A2C/A3C → PPO → multi-agent → Offline RL. Each unit includes theory and code practice. Bonus unit covers RLHF. Best for learners wanting to do RL experiments in the Hugging Face ecosystem.

V. Practical Tutorials and Technical Blogs

OpenAI Spinning Up in Deep RL

URL: spinningup.openai.com

The gold standard for RL basics education. Three parts: core concepts (V/Q/Bellman/Advantage) → algorithm taxonomy (Model-Based vs Model-Free) → policy optimization derivation (deriving Policy Gradient from scratch). Implements VPG, TRPO, PPO, DDPG, TD3, and SAC. Best combination of theoretical explanation and code implementation.

Cameron Wolfe — Deep (Learning) Focus

URL: PPO for LLMs: A Guide for Normal People | Online vs Offline RL for LLMs

Blog series explaining PPO in LLMs, online vs offline RL tradeoffs, DPO principles, etc., in accessible language. Best for readers wanting to understand "why LLM RL uses these algorithms."

Sebastian Raschka — Ahead of AI

URL: LLM Training: RLHF and Its Alternatives | State of LLMs 2025

Technical blog by the author of Build a Large Language Model From Scratch. Covers RLHF, DPO, RLVR, GRPO, inference-time scaling, and other frontier topics.

Reproduction Project Recommendations

RL projects can be split into two eras. The non-LLM era focuses on fixed simulation environments, game benchmarks, continuous control, multi-agent, and model learning. The LLM era extends actions to tokens, tool calls, web operations, visual reasoning, and long-horizon agent trajectories, with rewards expanding from environment scores to preference models, rule verifiers, process rewards, and real task success rates.

Reproduction Roadmap Quick Reference

Target Direction	Priority Resources	What to Reproduce
Classic algorithm introduction	CleanRL, Stable-Baselines3, RL Baselines3 Zoo, Dopamine	DQN, PPO, SAC, TD3, Rainbow DQN, Atari benchmarks
Environments & game benchmarks	Gymnasium, ALE, MiniGrid, Procgen, ViZDoom	CartPole, LunarLander, Atari, FPS, procedurally generated environments
Multi-agent & games	PettingZoo, OpenSpiel, SMAC, Google Research Football	Self-play, cooperative/competitive MARL, StarCraft micromanagement, football
Robotics & embodied control	MuJoCo, Isaac Lab, ManiSkill, Meta-World, LeRobot	Continuous control, robot arms, mobile robots, imitation learning + RL
Model-Based / world models	DreamerV3, TD-MPC2, mbrl-lib, MBPO	Learn dynamics models from pixels/states, then plan or optimize policies
LLM post-training	OpenAI InstructGPT, TRL, NVIDIA NeMo-RL, verl	PPO, DPO, GRPO, RLHF, preference alignment, reward model training
LLM reasoning	DeepSeek-R1, Open-R1, TinyZero, DAPO	RLVR, math/code reasoning, R1-style reproduction, verifier design
Deep Research RL	OpenAI Deep Research, Alibaba Tongyi DeepResearch, Search-R1, WebThinker	Search, reading, evidence filtering, citation, research-style answers
Agentic RL	OpenAI Agents SDK, Google ADK, Agent Lightning, AReaL	Code, tool calling, web browsing, long-horizon task success rate optimization
GUI / Computer Use	OpenAI CUA, Anthropic Computer Use, UI-TARS, OSWorld	Web, desktop, mobile GUI operations and visual grounding
VLM	TRL VLM GRPO, VLM-R1, Open Vision Reasoner, Gemini Robotics	Image QA, visual reasoning, GUI/web, robotic visual operations, vision-language rewards
Generative model RL	DDPO, Diffusers DDPO, AlignProp, RLAIF-V, VideoAlign	Optimize image/multimodal generation with preference, aesthetics, safety, and consistency rewards

RL Directions Overview

For systematically choosing reproduction directions, use three axes: "algorithm problem + environment type + reward source." The table below can serve as a long-term maintained directory skeleton.

Category	Representative Problem	Recommended Projects/Frameworks
Value-Based RL	Learn discrete-action policy from Q-values	DQN, Double DQN, Dueling DQN, Rainbow; Dopamine, CleanRL
Policy Gradient / Actor-Critic	Directly optimize policy, handle continuous or stochastic actions	REINFORCE, A2C/A3C, PPO, TRPO; Stable-Baselines3, TRL PPO
Off-Policy / Maximum Entropy	Improve sample efficiency, encourage exploration and robustness	DDPG, TD3, SAC, REDQ; RL Baselines3 Zoo, Tianshou
Distributional RL	Learn return distribution instead of single expectation	C51, QR-DQN, IQN, FQF; Dopamine, DI-engine
Exploration / Curiosity	Sparse rewards, long-horizon exploration, intrinsic motivation	RND, ICM, count-based exploration; MiniGrid, Procgen
Model-Based RL	Learn environment model, then plan or imagine rollouts	PETS, MBPO, Dreamer, TD-MPC; mbrl-lib, DreamerV3, TD-MPC2
Offline / Batch RL	Use only offline data, no online exploration	BCQ, CQL, IQL, TD3+BC; D4RL, Minari, d3rlpy, CORL
Imitation / Reward Learning	Learn from expert trajectories, preferences, or inverse RL	BC, DAgger, GAIL, AIRL; imitation, robomimic, LeRobot
Goal-Conditioned / Hierarchical	Long-horizon tasks, subgoals, options, and skills	HER, Options, HIRO, skill discovery; MiniGrid/BabyAI, Meta-World
Meta-RL / Multitask / Generalization	Cross-task transfer, fast adaptation, generalization	MAML-RL, PEARL, multi-task PPO/SAC; Meta-World, Procgen, LIBERO
Safe / Constrained RL	Constrain costs, risks, safe exploration	CPO, PPO-Lagrangian, shielding; Safety-Gymnasium, OmniSafe
Multi-Agent RL / Game AI	Cooperation, competition, self-play, communication	QMIX, MADDPG, MAPPO, AlphaZero; PettingZoo, OpenSpiel, JaxMARL
Robotics / Embodied RL	Continuous control, manipulation, navigation, Sim2Real	PPO/SAC on robots, domain randomization, VLA; Isaac Lab, ManiSkill, robosuite, OpenVLA
Distributed / Systems RL	High-throughput rollout, multi-node training, productionization	IMPALA, APPO, distributed PPO; Ray RLlib, Sample Factory, DI-engine, Acme
RLHF / Preference Alignment	Optimize language/multimodal models from human or AI preferences	PPO, DPO, IPO, KTO, ORPO; OpenAI InstructGPT, Anthropic Constitutional AI, TRL, NeMo-RL
RLVR / Reasoning RL	Rule-verifiable rewards, math/code reasoning, long CoT	GRPO, DAPO, RLOO, REINFORCE++; DeepSeek-R1, Open-R1, DAPO, reasoning-gym
Agentic RL	Search, tool calling, code execution, web/desktop tasks	Trajectory reward, tool-use reward, process reward; OpenAI Agents SDK, Google ADK, Agent Lightning, SkyRL
VLM / GUI / Computer-Use RL	Image understanding, GUI grounding, web/mobile/desktop control	Multimodal GRPO, GUI action RL; OpenAI CUA, Anthropic Computer Use, VLM-R1, OSWorld
Generative Model RL	Optimize image, video, audio generation models with rewards	DDPO, AlignProp, RLAIF-V; Diffusers DDPO, VideoAlign

Non-LLM Era: Fixed Environments, Simulation, and Classic Algorithms

This track is best for building solid RL fundamentals. Start with single-file implementations in small environments, then gradually move to Atari, continuous control, multi-agent, robotics, and Model-Based RL.

Environments and Algorithm Libraries

Environment/Tool	Type	Description	Recommended Use
Gymnasium	General RL environment	Successor to OpenAI Gym; CartPole, LunarLander, and other classic environments	Getting started, algorithm debugging, course experiments
Arcade Learning Environment	Game environment	Atari 2600 standard benchmark, used in DQN-series papers	Pixel input, discrete actions, DQN family
MiniGrid	Grid world	Lightweight GridWorld for studying exploration, sparse rewards, and generalization	Introduction to exploration, hierarchical RL, task generalization
Procgen	Procedurally generated games	16 procedurally generated environments focusing on generalization	Overfitting analysis, generalization experiments
ViZDoom	FPS 3D environment	First-person shooter, partially observable, visual input, long-horizon decisions	Visual policies, POMDP, navigation and combat
Stable-Retro	Classic games	Gymnasium-style wrapper for retro console games	Classic game reproduction, course demonstrations
MuJoCo	Physics simulation	High-precision physics engine; HalfCheetah, Ant, Humanoid benchmarks	PPO, SAC, TD3, continuous control
PyBullet	Physics simulation	Open-source robotics simulation, lightweight ecosystem	Robotics introduction, MuJoCo alternative experiments
Isaac Lab	GPU parallel simulation	NVIDIA successor to Isaac Gym; large-scale parallel robot training	Large-scale embodied RL, Sim2Real
ManiSkill	Robot manipulation	Benchmark for robotic arm manipulation, visual control, and large-scale parallel simulation	Visual manipulation, imitation learning + RL
Meta-World	Multi-task robotics	Multi-task robotic arm benchmark	Multi-task RL, meta-learning, generalization
PettingZoo	Multi-agent environment	Multi-agent version of Gymnasium, supporting cooperative and competitive scenarios	MARL introduction, parallel/turn-based action interfaces
OpenSpiel	Game framework	Board games, card games, matrix games, and multi-agent algorithm collection	Self-play, CFR, AlphaZero variants
Ray RLlib	Distributed RL	Distributed RL library in the Ray ecosystem	Large-scale training, multi-agent production experiments
CleanRL	Algorithm implementation	Single-file, readable, reproduction-friendly	Learning algorithm details, writing course code
Stable-Baselines3	Algorithm library	Well-packaged DQN, PPO, SAC, TD3 implementations	Quick baselines, hyperparameter tuning, comparisons
Dopamine	Atari algorithm library	Google's DQN/Rainbow/IQN research framework	Atari paper reproduction, distributional value learning

Recommended Reproduction Ladder

Stage	Project Suggestion	Recommended Tools	Acceptance Criteria
1	CartPole, MountainCar, LunarLander	Gymnasium, CleanRL, Stable-Baselines3	Can plot reward curves, understand replay and GAE
2	DQN / Rainbow on Atari	ALE, Dopamine, CleanRL	Reproduce at least 1 Atari experiment
3	PPO / SAC / TD3 on MuJoCo	MuJoCo, Stable-Baselines3, RL Baselines3 Zoo	Can explain entropy, target networks, Q bias
4	Self-play and multi-agent	PettingZoo, OpenSpiel, SMAC, Google Research Football	Can distinguish cooperative, competitive, and mixed games
5	Robot manipulation and visual control	Isaac Lab, ManiSkill, Meta-World, LeRobot	Can run parallel simulation or imitation-to-RL pipeline
6	Model-Based RL / World Models	DreamerV3, TD-MPC2, mbrl-lib, MBPO	Can explain latent dynamics and planning
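
Stage 1 asks you to understand GAE. The following is a minimal sketch of Generalized Advantage Estimation over a single rollout, written in plain Python so it can be checked by hand. The function name `compute_gae` and the default values of `gamma` and `lam` are illustrative assumptions, not taken from any particular library.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout (illustrative sketch).

    rewards[t], values[t], dones[t] describe step t. last_value is the value
    estimate of the state after the final step, used for bootstrapping.
    dones[t] is True when the episode ended at step t, which cuts the bootstrap.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus the baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting `lam=0` reduces every advantage to the one-step TD error, while `lam=1` with `gamma=1` recovers the plain Monte Carlo return minus the baseline. Comparing the two on a short CartPole rollout is a quick way to see the bias-variance tradeoff.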

Advanced Directions and Exercise Suggestions

Direction	Recommended Reproduction Projects	Course Assignment Ideas
Single-file algorithm implementation	CleanRL's DQN, PPO, SAC, C51, PPO-LSTM	Write 200-500 lines clearly covering replay, GAE, target networks, entropy
High-performance RL systems	Sample Factory, Ray RLlib, DI-engine	Compare single-machine, multi-process, and distributed rollout throughput and sample efficiency
JAX / GPU parallelism	Brax, PureJaxRL, JaxMARL	Use jit/vmap/pmap for large-batch environments; understand the "environments can also be accelerated" paradigm
Offline RL	D4RL + CQL/IQL/TD3+BC, Minari, d3rlpy, CORL	Compare online and offline RL; measure extrapolation error in the offline setting
Imitation learning	BC, DAgger, GAIL, AIRL; imitation, robomimic	Train policy from expert trajectories, then fine-tune with RL
Reward learning & preference learning	GAIL/AIRL, preference comparison, reward model	Construct "human preferences" or scripted preferences, observe reward hacking
Safe & constrained RL	Safety-Gymnasium, OmniSafe, PPO-Lagrangian, CPO	Plot both reward curve and cost curve; learn constrained optimization
Exploration & sparse rewards	MiniGrid, Montezuma's Revenge, Procgen; RND, ICM, episodic curiosity	Study whether intrinsic rewards actually improve exploration vs just inflating training scores
Hierarchical & goal-conditioned RL	HER, Options, HIRO, BabyAI, Meta-World	Decompose long-horizon tasks into subgoals; compare flat vs hierarchical policies
Multi-task & generalization	Procgen, Meta-World, LIBERO, ContinualWorld	High scores on training environments aren't enough; test on unseen tasks and seeds
Multi-agent cooperation/competition	PettingZoo, OpenSpiel, SMAC, Google Research Football, JaxMARL	Compare independent PPO, MAPPO, QMIX, self-play
Robot manipulation	MuJoCo, Isaac Lab, ManiSkill, robosuite, Meta-World	Do reaching, pushing, pick-and-place, then add visual input
World models & planning	DreamerV3, TD-MPC2, mbrl-lib, MBPO, IRIS	Learn dynamics model first, then compare model-free vs model-based sample efficiency
Industrial applications	RecSim, FinRL, Pearl	Bandit/RL experiments in recommendation, advertising, financial trading; emphasize offline evaluation and risk

Unity ML-Agents Introduction

Unity ML-Agents is a unique RL toolkit that enables training directly inside a 3D game engine. Unlike Gymnasium's 2D grids or PyBullet's pure physics simulation, ML-Agents provides complete 3D spaces including occlusion, perspective, gravity, and collision, suitable for studying visual navigation and spatial reasoning.

Typical usage:

python

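# ML-Agents exposes its own low-level Python API (mlagents_envs)
from mlagents_envs.environment import UnityEnvironment
# Gym wrapper location in recent mlagents_envs releases (older releases: gym_unity.envs)
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
from stable_baselines3 import PPO

# Load a pre-built Unity executable (3DBall: balance a ball on a platform)
env = UnityEnvironment(file_name="3DBall")

# Wrap it as a Gym-style environment; the wrapper supports single-agent scenes only
gym_env = UnityToGymWrapper(env)

# Then train with Stable-Baselines3 (SB3 2.x expects Gymnasium; it converts
# legacy Gym environments when the shimmy package is installed)
model = PPO("MlpPolicy", gym_env)
model.learn(total_timesteps=100_000)

# Release the Unity process when done
gym_env.close()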

Classic ML-Agents environment examples:

Environment	Task Type	Difficulty	Best For
3DBall	Balance control	Introductory	Understanding continuous action spaces
Crawler	Quadruped walking	Intermediate	Continuous control + multi-joint coordination
Walker	Bipedal walking	Intermediate	Compare with PyBullet's Walker2d
PushBlock	Push blocks	Introductory	Goal-conditioned RL
FoodCollector	Collect food	Intermediate	Multi-objective + navigation
HideAndSeek	Multi-agent hide-and-seek	Advanced	Multi-agent emergent behavior

See the Environment Setup Guide for installation and environment access.

Classic Milestone Project Reference

Below are 30 common game and simulation reproduction directions from the non-LLM era, organized by theme:

Classic/Board Games

#	Name	Game/Environment	Year	Key Information
1	TD-Gammon	Backgammon	1992	Gerald Tesauro; reached human expert level through self-play RL
2	Deep Blue	Chess	1997	IBM; defeated world champion Kasparov; primarily search-based, not pure RL
3	AlphaGo	Go	2016	DeepMind; RL + MCTS defeated Lee Sedol
4	AlphaGo Zero	Go	2017	No human game records; learned from self-play alone
5	AlphaZero	Go/Chess/Shogi	2017	General board-game RL algorithm; one method mastered all three games through self-play
6	MuZero	Go/Chess/Atari	2020	No explicit game rules needed; simultaneously learns model and policy

Atari Series

#	Name	Game/Environment	Year	Key Information
7	DQN (Playing Atari with Deep RL)	Atari 2600	2013	First to use deep RL to learn multi-game policies directly from pixels
8	Human-level Control through DRL	Atari 2600	2015	Nature 2015; improved DQN reaching human-level on multiple Atari games
9	Prioritized Experience Replay	Atari	2015	Improved experience replay; prioritizes high TD-error experiences
10	Rainbow DQN	Atari	2017	Integrates Double DQN, Dueling, PER, NoisyNet, Distributional RL, n-step return
11	IQN (Implicit Quantile Networks)	Atari	2018	Distributional RL; learns quantile representations of return distributions
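
Row 9 describes prioritizing high TD-error experiences. A minimal sketch of the proportional variant of that idea follows, using only the standard library. The helper names, the defaults for `alpha` and `beta`, and the choice to normalize importance weights by the batch maximum (instead of the maximum over the whole buffer) are simplifying assumptions for illustration.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Turn absolute TD errors into sampling probabilities.

    alpha = 0 gives uniform sampling; larger alpha favors large errors more.
    eps keeps transitions with zero error sampleable.
    """
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Sample indices by priority and return importance-sampling weights."""
    rng = rng or random.Random()
    probs = per_probabilities(td_errors, alpha)
    n = len(probs)
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    # Weights correct the bias introduced by non-uniform sampling.
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    # Simplification: normalize by the largest weight in this batch.
    max_w = max(weights)
    return indices, [w / max_w for w in weights]
```

Real implementations usually store priorities in a sum tree so that sampling and updates cost O(log N); the list-based version here is only meant to make the sampling rule itself easy to read.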

RTS / MOBA

#	Name	Game/Environment	Year	Key Information
12	SC2LE (StarCraft II Learning Environment)	StarCraft II	2017	DeepMind provides SC2 RL research environment and benchmarks
13	AlphaStar	StarCraft II	2019	Multi-agent RL reaching Grandmaster level
14	TStarBot	StarCraft II	2019	Tencent's StarCraft II agent system
15	OpenAI Five	Dota 2	2019	5v5 defeated world champion OG; large-scale distributed RL
16	Honor of Kings 1v1	Honor of Kings	2020	Tencent AI Lab; dual-clipped PPO; mastered complex operation control
17	Honor of Kings 5v5	Honor of Kings	2020	Multi-hero, multi-role, global cooperation MOBA AI system
18	Honor of Kings Arena	Honor of Kings	2022	Open MOBA RL environment; focuses on generalization challenges
19	Mini Honor of Kings	Honor of Kings	2024	Lightweight MARL environment; suitable for personal devices and course projects
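
Row 16 mentions dual-clipped PPO. As a rough per-sample sketch of the idea as it is usually described: the standard PPO clipped objective is kept, and when the advantage is negative the objective is additionally bounded below by `c * advantage` with `c > 1`, so that a very large probability ratio cannot produce an unbounded penalty. The function name and the default values of `eps` and `c` are illustrative assumptions.

```python
def dual_clip_ppo_objective(ratio, advantage, eps=0.2, c=3.0):
    """Per-sample dual-clip PPO surrogate objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s). eps is the usual PPO clip range.
    c > 1 is the extra lower-bound constant applied when advantage < 0.
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Standard PPO: pessimistic choice between unclipped and clipped terms.
    standard = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Second clip: bound the objective from below for negative advantages.
        return max(standard, c * advantage)
    return standard
```

For example, with `ratio=10` and `advantage=-1`, standard PPO would return -10, while the dual-clipped version returns -3. With positive advantages the two objectives coincide.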

FPS / 3D Games

#	Name	Game/Environment	Year	Key Information
20	Playing FPS Games with Deep RL	ViZDoom	2016	Deep RL for FPS games with visual input and partially observable states
21	Quake III Arena: Capture the Flag	Quake III CTF	2019	DeepMind; complex team cooperation and multi-agent emergent behavior
22	Obstacle Tower	Unity 3D	2019	Tests 3D navigation, visual generalization, and long-horizon exploration
23	Sample Efficient RL in Minecraft	Minecraft/MineRL	2021	Using human demonstration data to improve sample efficiency in Minecraft

Sports/Racing/Other

#	Name	Game/Environment	Year	Key Information
24	Google Research Football	Football 11v11	2020	Open-source football simulator supporting multi-agent RL research
25	RL in Rocket League	Rocket League	2022	High-dimensional continuous control and team cooperation in a racing-plus-football hybrid
26	Deep RL for Flappy Bird	Flappy Bird	2015	Early deep RL game practice project

Multi-Agent/Comprehensive

#	Name	Game/Environment	Year	Key Information
27	Deep RL for General Game Playing	General board games	2020	Extending AlphaZero-style methods to general game playing
28	OpenSpiel	Board/card games	2019	DeepMind game framework containing multiple games and classic game algorithms
29	Hide-and-Seek	Multi-agent hide-and-seek	2019	OpenAI; emergent tool use and complex strategies from multi-agent self-play
30	Multi-Agent RL in Video Games	Survey	2025	Covers Rocket League, Doom, Minecraft, StarCraft, Dota, MOBA directions

LLM Era: Post-Training, Reasoning, Agentic, VLM, and World Models

LLM-era RL is no longer just "maximize scores in fixed environments." Actions can be text, searches, tool calls, web clicks, code patches, visual grounding, or even entire multi-step agent trajectories. Rewards expand from environment scores to preference models, rule verifiers, process rewards, unit tests, web task success rates, and multimodal grounding signals.
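
To make the "rule verifier" idea concrete, here is a minimal sketch of a verifiable reward for a math-style task. It assumes, purely for illustration, that the model is asked to put its final answer inside `<answer>...</answer>` tags; the tag format, the function name, and the size of the format bonus are all hypothetical choices.

```python
import re

# Hypothetical output convention: the final answer sits inside <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(completion, ground_truth, format_bonus=0.1):
    """Score one completion with rules only, no learned reward model.

    Returns 0.0 if the answer tag is missing, format_bonus if the format is
    right but the answer is wrong, and format_bonus + 1.0 if both are right.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    correct = predicted == ground_truth.strip()
    return format_bonus + (1.0 if correct else 0.0)
```

Even a rule this simple can be gamed: a policy may learn to emit the tags with a guessed answer to collect the format bonus, which is one reason to keep the bonus small and to inspect low-reward and high-reward samples by hand.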

Modern and Classic Resource Quick Reference

The recommended reading order: start with classic papers and official documentation to build concepts, then pick a "small model + verifiable reward" project to run end-to-end, and finally move into distributed training, Deep Research, GUI/Computer Use, and multimodal environments.

Direction	Recommended First Look	Type	Why It's Worth Reading
RLHF / post-training classics	OpenAI InstructGPT, Anthropic Constitutional AI, Meta Llama 3	Classic papers/official docs	Understand the basic paradigms of SFT, RM, PPO, DPO, RLAIF, and safety alignment
Modern post-training engineering	NVIDIA NeMo-RL, verl, OpenRLHF, DAPO	Production/research frameworks	See directly how rollout, vLLM/SGLang, Ray, Megatron, GRPO/DAPO, and async agentic RL are implemented
Reasoning RLVR	DeepSeek-R1, DeepSeek-R1 Nature, Open-R1, TinyZero	Modern reasoning reproduction	Best for learning verifiable reward, GRPO/RLVR, cold-start data, long reasoning, and reward hacking
Open-source base models	Qwen3.6, Qwen3, Meta Llama Models	Open-source models	Suitable for SFT/DPO/GRPO, tool calling, long context, and agentic coding experiments
Deep Research	OpenAI Deep Research, Alibaba Tongyi DeepResearch, WebThinker, Search-R1	Product/open-source research	Turn search, reading, evidence filtering, citation, and long report synthesis into trainable trajectories
Agent frameworks & tool calling	OpenAI Agents SDK, Google ADK, Microsoft Agent Lightning, AutoGen	Agent engineering frameworks	Learn engineering boundaries: tools, handoffs, guardrails, tracing, sessions, agent trajectories, and RL interfaces
GUI / Computer Use	OpenAI CUA, Anthropic Computer Use, ByteDance UI-TARS, OSWorld	Models/tools/benchmarks	Core materials for modern computer use: screenshots, coordinate actions, web/desktop/mobile task success rates
VLM / VLA / Robotics	VLM-R1, Open Vision Reasoner, Gemini Robotics, LeRobot	Multimodal/embodied	Connect visual QA, grounding, GUI clicks, robot actions, and verifiable rewards
World models	DreamerV3 Nature, DreamerV3 Code, Google DeepMind Genie 3, Isaac Lab	Classic/frontier/simulation	From reproducible world models to interactive world generation to parallel robot simulation
Generative model RL	DDPO, Diffusers DDPO, AlignProp, RLAIF-V, VideoAlign	Image/video/multimodal rewards	Learn to turn aesthetics, preferences, safety, text-image consistency, or video quality into optimization objectives

(The remaining sections — LLM Post-Training, LLM Reasoning, Deep Research RL, Agentic RL & Tool Calling, GUI & Computer Use, VLM, World Models & Simulators, Generative Model RL, Evaluation Benchmarks, and Reproduction Order — contain detailed project recommendations, common pitfalls, and resource tables that follow the same pattern as the sections above. Each subsection includes: reproduction goals, resource tables with links, and a "common pitfalls" list.)

Evaluation Benchmarks and Projects

In the LLM era, RL evaluation is often more error-prone than training. For each direction, track all of the following together: final success rate, process quality, format constraints, reward hacking, length bias, data leakage, and multi-sample stability.

Acceptance Checklist

Final metrics: accuracy, pass rate, task success rate, preference win rate.
Process metrics: tool call count, invalid action ratio, repeated search ratio, citation accessibility rate, code test failure types.
Stability metrics: consistency of results across different random seeds, sampling temperatures, and model sizes.
Safety metrics: whether the model is more prone to fabricated citations, unauthorized tool calls, environment information leakage, or broken format constraints.
Cost metrics: average tokens, average tool calls, average latency, training and evaluation GPU/CPU overhead.
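As a minimal sketch of the final and stability metrics above, the helper below aggregates per-seed success rates measured on the same fixed evaluation set. The function name `summarize_runs` and the input layout (a dict mapping seed to a list of 0/1 outcomes) are assumptions made for illustration, not part of any framework named in this appendix.

```python
from statistics import mean, stdev


def summarize_runs(success_by_seed):
    """Aggregate success rates across seeds.

    success_by_seed: dict mapping a seed to a list of 0/1 outcomes,
    one per task of the same fixed evaluation set (hypothetical layout).
    """
    # Final metric per seed: fraction of tasks solved.
    rates = {seed: mean(outcomes) for seed, outcomes in success_by_seed.items()}
    values = list(rates.values())
    return {
        "per_seed": rates,
        "mean": mean(values),
        # Stability metric: spread of the final metric across seeds.
        "std": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

Reporting the minimum and the standard deviation alongside the mean makes it harder for a single lucky seed to pass as an improvement. The same pattern can be repeated per sampling temperature or per model size.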

Badcase Template

For each direction, maintain a badcases.jsonl or spreadsheet recording at minimum: task ID, input, model output, reward, scoring rationale, failure type, reproducibility, and fix suggestion. For LLM RL, badcases are not an afterthought — they are the entry point for next-round reward design, data filtering, and environment fixes.
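One possible shape for such a file is sketched below. The field names mirror the minimum list above, but the exact spelling, the `Badcase` class, and the helper functions are illustrative assumptions rather than a fixed schema.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class Badcase:
    # Field names are hypothetical; they follow the minimum list in the text.
    task_id: str
    input: str
    model_output: str
    reward: float
    scoring_rationale: str
    failure_type: str
    reproducible: bool
    fix_suggestion: str


def append_badcase(path, case):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case), ensure_ascii=False) + "\n")


def load_badcases(path):
    """Read all records back as plain dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def count_by_failure_type(records):
    """Tally failure types to see which problem to fix first."""
    return Counter(r["failure_type"] for r in records)
```

Calling `count_by_failure_type(load_badcases("badcases.jsonl"))` after each run gives a quick ranking of failure types, which is one way to decide whether the next round should go into reward design, data filtering, or environment fixes.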

Reproduction Order Suggestion

Start with small models (0.5B to 3B parameters) on math, code, and format verification tasks to observe reward hacking, length bias, and sampling temperature effects; then migrate from TRL/TinyZero/Open-R1 to distributed frameworks such as verl and OpenRLHF. For Agentic RL, prioritize tasks with a clear success criterion, such as search, web, and code. For VLM RL, prioritize scorable tasks such as image-text answers, grounding, OCR, and GUI clicks. For world models and embodied directions, first run DreamerV3/TD-MPC2, then add vision and real-robot complexity.

A Solid Roadmap

Week 1: Rule-reward tasks. Use TRL or TinyZero to run a small verifiable task such as Countdown, formatted JSON, or simple math. Goal: understand rollout, reward, advantage, KL, length bias, and log saving.
Week 2: Preference optimization and post-training comparison. Use the same small model to compare SFT, DPO/KTO, and PPO/GRPO. Don't change too many variables; just observe how different training methods affect the same batch of prompts.
Week 3: Reasoning RLVR. Introduce Math-Verify, reasoning-gym, or code unit tests so that the reward evolves from "format correct" to "answer verifiable." Focus on reward sparsity and verifier loopholes.
Week 4: Tool calling or Deep Research. Build a small search/reading environment and record complete trajectories. Start with offline trajectory replay, then move to online rollout.
Week 5: VLM or GUI. Choose a visual QA, bbox grounding, or web click task and add visualized badcases. Focus on checking coordinate systems, screenshot states, and reward interpretability.
Week 6+: Distributed and industrial frameworks. Move into verl, OpenRLHF, AReaL, SkyRL, and similar frameworks. By now you know what reward, logging, and evaluation you need, so engineering complexity won't lead you astray.
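To make the Week 1 rule-reward idea concrete, here is a minimal sketch of a Countdown-style verifier. It assumes the completion wraps its final expression in `<answer>...</answer>` tags, that each given number must be used exactly once, and that a well-formed but wrong answer earns a small format reward of 0.1. All three conventions, and the name `countdown_reward`, are assumptions for illustration and should be adapted to whatever your training framework expects.

```python
import ast
import operator
import re

# Only the four basic binary operators are allowed.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a tiny arithmetic AST; reject everything else."""
    if isinstance(node, ast.Expression):
        return _eval(node.body)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target, format_reward=0.1):
    """Return 1.0 for a verified answer, format_reward for a tagged
    but wrong or invalid answer, and 0.0 when no answer tag is found."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    try:
        tree = ast.parse(match.group(1).strip(), mode="eval")
        value = _eval(tree)  # raises on names, calls, floats, unary ops
    except (SyntaxError, ValueError, ZeroDivisionError, RecursionError):
        return format_reward
    # After _eval succeeds, every constant in the tree is a plain int.
    used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    if used != sorted(numbers):
        return format_reward  # closes the "just print the target" loophole
    return 1.0 if abs(value - target) < 1e-6 else format_reward
```

The number-usage check is the part worth studying: without it, a completion that simply writes the target inside the tags would score 1.0, which is exactly the kind of verifier loophole Week 3 asks you to watch for. Parsing with `ast` instead of calling `eval` keeps arbitrary model output from executing code.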

When to Increase Difficulty?

When a task meets three criteria, you can move to the next level: first, the pre/post-training difference on a fixed evaluation set is stable; second, badcases can be clearly classified; third, when reward rises, human spot-check quality also rises. Otherwise, don't rush to switch to a larger model or more complex environment — fix reward, data, and logging first.
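The three criteria can be turned into an explicit gate, sketched below under several assumptions: stability is judged from per-seed pre/post differences, "clearly classified" means a minimum fraction of badcases carry a label other than "unknown", and reward and human spot-check scores are compared at the first and last checkpoint. The function name `ready_to_scale` and both thresholds are hypothetical and should be tuned per task.

```python
from statistics import stdev


def ready_to_scale(deltas, badcase_labels, reward_curve, human_curve,
                   max_delta_std=0.02, min_labeled=0.9):
    """Check the three criteria for moving to a harder setup.

    deltas: post-minus-pre metric on the fixed eval set, one per seed.
    badcase_labels: failure-type label per badcase ("unknown" if unclear).
    reward_curve / human_curve: training reward and human spot-check
    score at the same checkpoints, in chronological order.
    Thresholds are illustrative defaults, not recommended values.
    """
    # 1. Stable difference: positive for every seed, with small spread.
    stable = (len(deltas) > 1 and min(deltas) > 0
              and stdev(deltas) <= max_delta_std)

    # 2. Badcases can be clearly classified.
    labeled = [lab for lab in badcase_labels if lab and lab != "unknown"]
    classifiable = (len(badcase_labels) > 0
                    and len(labeled) / len(badcase_labels) >= min_labeled)

    # 3. Reward and human-judged quality both rose over training.
    aligned = (reward_curve[-1] > reward_curve[0]
               and human_curve[-1] > human_curve[0])

    return {
        "stable": stable,
        "classifiable": classifiable,
        "aligned": aligned,
        "ready": stable and classifiable and aligned,
    }
```

A result with `aligned` set to false while reward keeps climbing is a typical sign of reward hacking, and it points back to fixing reward, data, and logging before scaling up.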
