Appendix D: Learning Resources and Reproduction Projects

Goal of this appendix: provide clear navigation for your continued advancement. The first half organizes textbooks and courses with solid theory and clear exposition to help you systematically build foundations or explore frontiers. The second half surveys classic milestones and common environments in RL's game and simulation ecosystem, giving you inspiration and coordinates for your next hands-on reproduction project.

How to use this list: This book covers the complete pipeline from MDP basics through PPO, DPO, and GRPO, but RL goes far beyond that. If you want to dive deeper into a specific direction, compare different teaching styles, or find hands-on practice resources, this list can serve as a starting point. All resources are free or publicly accessible.

Choose based on your goal:

  • Just finished Chapter 3, want to see how other textbooks cover basic theory: start with Zhao Shiyu's Mathematical Foundations of Reinforcement Learning or Sutton & Barto.
  • Want to follow video lectures: start with David Silver's course or Li Hongyi's course.
  • Want to write code: start with OpenAI Spinning Up or Dive into Deep Reinforcement Learning.
  • Interested in LLM alignment / RLHF / GRPO: start with Nathan Lambert's RLHF Book or Ernest Ryu's RL-LLM course.
  • Want to explore frontier theory: start with Princeton ECE 524 or MIT 6.7920.

I. Classic Textbooks

Reinforcement Learning: An Introduction (Sutton & Barto, 2nd Edition, 2018)

URL: incompleteideas.net/book/the-book-2nd.html | Chinese Translation

The standard textbook for RL, listed as required reading by nearly every university RL course. Three parts: Part I (tabular methods, Ch1-8) covers MDP, DP, MC, TD, n-step bootstrapping, planning; Part II (approximate methods, Ch9-13) covers function approximation, eligibility traces, policy gradients; Part III (Ch14-17) discusses psychology, neuroscience, and applications. Free PDF; Chinese translation quality is high. Best for systematically building foundations.

Mathematical Foundations of Reinforcement Learning (Zhao Shiyu)

URL: github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning (GitHub 10k+ stars)

Published by Springer + Tsinghua University Press. 10 chapters rigorously deriving core RL algorithms from a mathematical perspective: Bellman equations → VI/PI → MC → TD (including Sarsa, Q-Learning, n-step Sarsa) → function approximation → policy gradients → Actor-Critic. Each chapter includes mathematical proofs and exercises. Best for readers who prefer rigorous derivation and want to understand "why these algorithms work" at the mathematical level.

Deep Reinforcement Learning (Zhang Zhihua, Peking University)

URL: PDF Draft

Textbook for Peking University's math department course. Assumes ML basics but not necessarily RL familiarity. Covers value-based learning (DQN), policy learning (Policy Gradient), Actor-Critic, TRPO, etc. Paired with Wang Shusen's Bilibili video course. Best for Chinese readers seeking a quick DRL introduction.

Dive into Deep Reinforcement Learning (Zhang Weinan, Shen Jian, Yu Yong)

URL: Online Version | Shanghai Jiao Tong University RL course textbook

Practice-oriented with runnable Jupyter code throughout. Three parts: basics (Bandit → MDP → DP → MC → Planning) → advanced (function approximation → DQN → policy gradients → PPO) → frontier (Model-Based RL, Offline RL). Best for learners who want to read and code simultaneously.

II. University Courses

European and American Courses

Stanford CS234: Reinforcement Learning (Emma Brunskill)

URL: web.stanford.edu/class/cs234/

Stanford's foundational RL course. From tabular MDPs through policy evaluation, Q-Learning, function approximation, policy gradients, Offline RL, exploration, MCTS, and finally RLHF. About half the lectures build theory; the other half cover advanced topics. Textbook: Sutton & Barto.

Stanford CS224R: Deep Reinforcement Learning (Chelsea Finn)

URL: cs224r.stanford.edu | YouTube 2025

Stanford's Deep RL course. Assumes RL basics; starts directly with imitation learning, quickly moving into policy gradients, Actor-Critic, Q-Learning, Model-Based RL, Offline RL, Reward Learning, RLHF, and Meta-RL. Best for learners who already know basics and want to dive deep into DRL directions.

MIT 6.7920: Reinforcement Learning Foundations and Methods (Cathy Wu)

URL: web.mit.edu/6.7920/www/

MIT's RL theory course. Two-thirds "exploitation" (known theory: DP 7 lectures + RL core methods 9 lectures), one-third "exploration" (frontier topics). DP section is very solid, covering finite/infinite horizon, LQR, policy/value iteration, convergence proofs. Best for learners seeking theoretical depth.

UC Berkeley CS285: Deep Reinforcement Learning (Sergey Levine)

URL: rail.eecs.berkeley.edu/deeprlcourse/

Berkeley's flagship Deep RL course. Only 1 lecture reviews RL basics, then dives into imitation learning, policy gradients, Actor-Critic, Value-Based RL, advanced policy gradients, variational inference & RL, LLM RL, Model-Based RL, Offline RL, and exploration. The 2026 spring edition adds hands-on assignments for LLM RL and Offline RL. Content most aligned with current industrial frontiers.

CMU 10-703: Deep Reinforcement Learning and Control

URL: cmudeeprl.github.io/703website_f25/

CMU's Deep RL course. After covering classical theory (MDP, DP, MC, TD), moves into function approximation, Deep Q-Learning, MCTS, policy gradients, imitation learning, inverse RL, optimal control, Model-Based RL, and exploration. Balanced theory and practice with broad coverage.

University of Alberta CMPUT 365: Introduction to RL (Marlos Machado)

URL: Syllabus PDF

Introductory RL course at Sutton's university, strictly following Sutton & Barto order: Bandits → MDP → DP (including PI, VI, GPI) → MC prediction and control → TD prediction → TD control (Sarsa, Q-Learning) → Planning (Dyna-Q) → function approximation → policy gradients. Most faithful course implementation of Sutton & Barto.

Georgia Tech CS 7642: Reinforcement Learning (OMSCS)

URL: omscs.gatech.edu/cs-7642-reinforcement-learning

Online RL course. Covers DP, TD (including Sarsa), n-step TD, Lambda Return, DQN, policy gradients, multi-agent RL, game theory, and POMDP. One of the best-regarded RL courses in the OMSCS program.

Princeton ECE 524: Foundations of RL (Chi Jin)

URL: sites.google.com/view/cjin/teaching/ece524 | YouTube

Theory-oriented, emphasizing finite-sample analysis and convergence proofs. Part I covers tabular MDPs, planning, exploration (Bandit and MDP), lower bounds; Part II covers large state spaces, linear VI, function approximation, multi-agent, and POMDP. Best for learners aiming to do RL theory research.

David Silver RL Course (UCL / DeepMind)

URL: davidsilver.uk/teaching | YouTube

10 classic lectures: MDP → DP → Model-Free Prediction → Model-Free Control → function approximation → policy gradients → Learning & Planning → exploration → classic game case studies. David Silver is the first author of AlphaGo/AlphaZero. Concise structure, clear explanations; the most widely disseminated RL video course.

DeepMind x UCL RL Lecture Series (2021)

URL: YouTube Playlist

Updated version of David Silver's course, taught by DeepMind researchers (Hado van Hasselt et al.). 13 lectures covering exploration and control, MDPs and DP, model-free methods, function approximation, planning, policy gradients and Actor-Critic, approximate DP, multi-step and off-policy, and Deep RL. More in-depth than the 2015 version with additional frontier content.

Chinese University Courses

Tsinghua University Reinforcement Learning (Fall 2025)

URL: coai.cs.tsinghua.edu.cn/Courses/RL2025/_site/

Undergraduate RL course. Starting from multi-armed bandits, covers MDP, Planning (DP), MC, TD Learning, policy gradients, function approximation, and Deep RL. 4 programming assignments (Bandit → MDP → TD & PG → Deep RL) + course project. Lecture slides are publicly available.

Nanjing University Foundations of Reinforcement Learning (Yu Yang, 2024)

URL: lamda.nju.edu.cn/introrl

Based on Sutton & Barto. 9 lectures covering RL basics, MDP, DP, MC, TD, and DQN. 5 programming assignments (Dagger → Q-Learning → DQN → Model-Based → Offline RL). One of the most theoretically solid Chinese university RL courses.

Nanjing University Advanced Reinforcement Learning (Yuan Lei, 2025)

URL: lamda.nju.edu.cn/advanceRL

Graduate advanced course. Covers DDPG/TD3, PPO techniques, multi-agent, RLHF/DPO theoretical derivations, and paper reading.

Shanghai Jiao Tong University Reinforcement Learning (Zhang Weinan, 2024)

URL: wnzhang.net/teaching/sjtu-rl-2024

Uses Dive into Deep Reinforcement Learning as textbook. 9 chapters covering basics through frontiers.

III. Chinese Online Courses and Tutorials

Li Hongyi Deep Reinforcement Learning (National Taiwan University)

URL: Course Page | Bilibili 2025

Uses Policy Gradient as the main thread, deeply explaining PPO (including Importance Sampling, On-policy → Off-policy derivation), then Q-Learning (DQN, Double DQN, Dueling DQN) and Actor-Critic. Lively explanations with polished slides. Most in-depth PPO coverage among Chinese courses.

Wang Shusen Deep Reinforcement Learning

URL: Bilibili Video

Video companion to Peking University's math department course. Five modules: basic concepts → value learning (DQN) → policy learning (Policy Gradient) → Actor-Critic (A3C, TRPO) → advanced (DDPG, AlphaGo, multi-agent). Paired with Zhang Zhihua's Deep Reinforcement Learning textbook. Concise content suitable for quick introduction.

Mushu Book EasyRL (Datawhale)

URL: Online Version | GitHub

Synthesizes the best of Zhoubolei's RL Outline, Li Hongyi's course, and Baidu's World Champion Takes You from Zero to RL Practice. 13 chapters + special topics, covering basics through DQN, PPO, DDPG, and AlphaStar. Most active open-source RL tutorial in the Chinese community.

Spinning Up Chinese Edition

URL: spinningup.qiwihui.com/zh-cn/latest

Chinese translation of OpenAI Spinning Up. Includes core concepts, algorithm taxonomy, policy gradient derivations, and implementations of VPG, TRPO, PPO, DDPG, TD3, and SAC.

IV. LLM Reinforcement Learning Specialized Resources

Nathan Lambert — RLHF Book + Course

URL: rlhfbook.com | Course | GitHub | YouTube

RLHF monograph by AI2 researcher Nathan Lambert. Covers the full RLHF pipeline: instruction tuning → reward model training → rejection sampling → PPO → DPO. Code repository implements PPO, REINFORCE, GRPO, RLOO and other policy gradient methods. 4 video lectures. Most systematic publicly available textbook on LLM alignment.

Ernest Ryu — Reinforcement Learning of Large Language Models (UCLA)

URL: ernestryu.com/courses/RL-LLM.html

One of the few university courses that systematically combines classical RL theory with LLM RL. Three parts: Ch1 (5 lectures on classic RL: MDP → VI → PG → PPO/GRPO → AlphaGo) → Ch2 (4 lectures on LLM basics: NLP → Transformer → ICL/SFT) → Ch3 (2 lectures on LLM RL: RLHF/PPO/DPO → RLVR). Among LLM RL courses, the one with the deepest RL foundations.

DeepLearning.AI — Reinforcement Fine-Tuning LLMs with GRPO

URL: deeplearning.ai/short-courses/reinforcement-fine-tuning-llms-grpo

1-hour short course, 10 lessons. Uses Wordle as the running example, covering GRPO algorithm, reward function design, LLM-as-Judge, and reward hacking. 7 code experiments. Best for practitioners with LLM basics who want to quickly get started with GRPO.

Hugging Face — Deep RL Course

URL: huggingface.co/learn/deep-rl-course

8 units covering Q-Learning → DQN → Policy Gradient → A2C/A3C → PPO → multi-agent → Offline RL. Each unit includes theory and code practice. Bonus unit covers RLHF. Best for learners wanting to do RL experiments in the Hugging Face ecosystem.

V. Practical Tutorials and Technical Blogs

OpenAI Spinning Up in Deep RL

URL: spinningup.openai.com

The gold standard for RL basics education. Three parts: core concepts (V/Q/Bellman/Advantage) → algorithm taxonomy (Model-Based vs Model-Free) → policy optimization derivation (deriving Policy Gradient from scratch). Implements VPG, TRPO, PPO, DDPG, TD3, and SAC. Best combination of theoretical explanation and code implementation.

Cameron Wolfe — Deep (Learning) Focus

URL: PPO for LLMs: A Guide for Normal People | Online vs Offline RL for LLMs

Blog series explaining PPO in LLMs, online vs offline RL tradeoffs, DPO principles, etc., in accessible language. Best for readers wanting to understand "why LLM RL uses these algorithms."

Sebastian Raschka — Ahead of AI

URL: LLM Training: RLHF and Its Alternatives | State of LLMs 2025

Technical blog by the author of Build a Large Language Model From Scratch. Covers RLHF, DPO, RLVR, GRPO, inference-time scaling, and other frontier topics.

Reproduction Project Recommendations

RL projects can be split into two eras. The non-LLM era focuses on fixed simulation environments, game benchmarks, continuous control, multi-agent, and model learning. The LLM era extends actions to tokens, tool calls, web operations, visual reasoning, and long-horizon agent trajectories, with rewards expanding from environment scores to preference models, rule verifiers, process rewards, and real task success rates.

Reproduction Roadmap Quick Reference

| Target Direction | Priority Resources | What to Reproduce |
| --- | --- | --- |
| Classic algorithm introduction | CleanRL, Stable-Baselines3, RL Baselines3 Zoo, Dopamine | DQN, PPO, SAC, TD3, Rainbow DQN, Atari benchmarks |
| Environments & game benchmarks | Gymnasium, ALE, MiniGrid, Procgen, ViZDoom | CartPole, LunarLander, Atari, FPS, procedurally generated environments |
| Multi-agent & games | PettingZoo, OpenSpiel, SMAC, Google Research Football | Self-play, cooperative/competitive MARL, StarCraft micromanagement, football |
| Robotics & embodied control | MuJoCo, Isaac Lab, ManiSkill, Meta-World, LeRobot | Continuous control, robot arms, mobile robots, imitation learning + RL |
| Model-Based / world models | DreamerV3, TD-MPC2, mbrl-lib, MBPO | Learn dynamics models from pixels/states, then plan or optimize policies |
| LLM post-training | OpenAI InstructGPT, TRL, NVIDIA NeMo-RL, verl | PPO, DPO, GRPO, RLHF, preference alignment, reward model training |
| LLM reasoning | DeepSeek-R1, Open-R1, TinyZero, DAPO | RLVR, math/code reasoning, R1-style reproduction, verifier design |
| Deep Research RL | OpenAI Deep Research, Alibaba Tongyi DeepResearch, Search-R1, WebThinker | Search, reading, evidence filtering, citation, research-style answers |
| Agentic RL | OpenAI Agents SDK, Google ADK, Agent Lightning, AReaL | Code, tool calling, web browsing, long-horizon task success rate optimization |
| GUI / Computer Use | OpenAI CUA, Anthropic Computer Use, UI-TARS, OSWorld | Web, desktop, mobile GUI operations and visual grounding |
| VLM | TRL VLM GRPO, VLM-R1, Open Vision Reasoner, Gemini Robotics | Image QA, visual reasoning, GUI/web, robotic visual operations, vision-language rewards |
| Generative model RL | DDPO, Diffusers DDPO, AlignProp, RLAIF-V, VideoAlign | Optimize image/multimodal generation with preference, aesthetics, safety, and consistency rewards |

RL Directions Overview

To choose reproduction directions systematically, use three axes: algorithm problem, environment type, and reward source. The table below can serve as a directory skeleton to maintain over time.

| Category | Representative Problem | Recommended Projects/Frameworks |
| --- | --- | --- |
| Value-Based RL | Learn discrete-action policy from Q-values | DQN, Double DQN, Dueling DQN, Rainbow; Dopamine, CleanRL |
| Policy Gradient / Actor-Critic | Directly optimize policy, handle continuous or stochastic actions | REINFORCE, A2C/A3C, PPO, TRPO; Stable-Baselines3, TRL PPO |
| Off-Policy / Maximum Entropy | Improve sample efficiency, encourage exploration and robustness | DDPG, TD3, SAC, REDQ; RL Baselines3 Zoo, Tianshou |
| Distributional RL | Learn return distribution instead of single expectation | C51, QR-DQN, IQN, FQF; Dopamine, DI-engine |
| Exploration / Curiosity | Sparse rewards, long-horizon exploration, intrinsic motivation | RND, ICM, count-based exploration; MiniGrid, Procgen |
| Model-Based RL | Learn environment model, then plan or imagine rollouts | PETS, MBPO, Dreamer, TD-MPC; mbrl-lib, DreamerV3, TD-MPC2 |
| Offline / Batch RL | Use only offline data, no online exploration | BCQ, CQL, IQL, TD3+BC; D4RL, Minari, d3rlpy, CORL |
| Imitation / Reward Learning | Learn from expert trajectories, preferences, or inverse RL | BC, DAgger, GAIL, AIRL; imitation, robomimic, LeRobot |
| Goal-Conditioned / Hierarchical | Long-horizon tasks, subgoals, options, and skills | HER, Options, HIRO, skill discovery; MiniGrid/BabyAI, Meta-World |
| Meta-RL / Multitask / Generalization | Cross-task transfer, fast adaptation, generalization | MAML-RL, PEARL, multi-task PPO/SAC; Meta-World, Procgen, LIBERO |
| Safe / Constrained RL | Constrain costs, risks, safe exploration | CPO, PPO-Lagrangian, shielding; Safety-Gymnasium, OmniSafe |
| Multi-Agent RL / Game AI | Cooperation, competition, self-play, communication | QMIX, MADDPG, MAPPO, AlphaZero; PettingZoo, OpenSpiel, JaxMARL |
| Robotics / Embodied RL | Continuous control, manipulation, navigation, Sim2Real | PPO/SAC on robots, domain randomization, VLA; Isaac Lab, ManiSkill, robosuite, OpenVLA |
| Distributed / Systems RL | High-throughput rollout, multi-node training, productionization | IMPALA, APPO, distributed PPO; Ray RLlib, Sample Factory, DI-engine, Acme |
| RLHF / Preference Alignment | Optimize language/multimodal models from human or AI preferences | PPO, DPO, IPO, KTO, ORPO; OpenAI InstructGPT, Anthropic Constitutional AI, TRL, NeMo-RL |
| RLVR / Reasoning RL | Rule-verifiable rewards, math/code reasoning, long CoT | GRPO, DAPO, RLOO, REINFORCE++; DeepSeek-R1, Open-R1, DAPO, reasoning-gym |
| Agentic RL | Search, tool calling, code execution, web/desktop tasks | Trajectory reward, tool-use reward, process reward; OpenAI Agents SDK, Google ADK, Agent Lightning, SkyRL |
| VLM / GUI / Computer-Use RL | Image understanding, GUI grounding, web/mobile/desktop control | Multimodal GRPO, GUI action RL; OpenAI CUA, Anthropic Computer Use, VLM-R1, OSWorld |
| Generative Model RL | Optimize image, video, audio generation models with rewards | DDPO, AlignProp, RLAIF-V; DDPO, Diffusers DDPO, AlignProp, VideoAlign |

Non-LLM Era: Fixed Environments, Simulation, and Classic Algorithms

This track is best for building solid RL fundamentals. Start with single-file implementations in small environments, then gradually move to Atari, continuous control, multi-agent, robotics, and Model-Based RL.

Environments and Algorithm Libraries

| Environment/Tool | Type | Description | Recommended Use |
| --- | --- | --- | --- |
| Gymnasium | General RL environment | Successor to OpenAI Gym; CartPole, LunarLander, and other classic environments | Getting started, algorithm debugging, course experiments |
| Arcade Learning Environment | Game environment | Atari 2600 standard benchmark, used in DQN-series papers | Pixel input, discrete actions, DQN family |
| MiniGrid | Grid world | Lightweight GridWorld for studying exploration, sparse rewards, and generalization | Introduction to exploration, hierarchical RL, task generalization |
| Procgen | Procedurally generated games | 16 procedurally generated environments focusing on generalization | Overfitting analysis, generalization experiments |
| ViZDoom | FPS 3D environment | First-person shooter, partially observable, visual input, long-horizon decisions | Visual policies, POMDP, navigation and combat |
| Stable-Retro | Classic games | Gymnasium-style wrapper for retro console games | Classic game reproduction, course demonstrations |
| MuJoCo | Physics simulation | High-precision physics engine; HalfCheetah, Ant, Humanoid benchmarks | PPO, SAC, TD3, continuous control |
| PyBullet | Physics simulation | Open-source robotics simulation, lightweight ecosystem | Robotics introduction, MuJoCo alternative experiments |
| Isaac Lab | GPU parallel simulation | NVIDIA successor to Isaac Gym; large-scale parallel robot training | Large-scale embodied RL, Sim2Real |
| ManiSkill | Robot manipulation | Benchmark for robotic arm manipulation, visual control, and large-scale parallel simulation | Visual manipulation, imitation learning + RL |
| Meta-World | Multi-task robotics | Multi-task robotic arm benchmark | Multi-task RL, meta-learning, generalization |
| PettingZoo | Multi-agent environment | Multi-agent version of Gymnasium, supporting cooperative and competitive scenarios | MARL introduction, parallel/turn-based action interfaces |
| OpenSpiel | Game framework | Board games, card games, matrix games, and multi-agent algorithm collection | Self-play, CFR, AlphaZero variants |
| Ray RLlib | Distributed RL | Distributed RL library in the Ray ecosystem | Large-scale training, multi-agent production experiments |
| CleanRL | Algorithm implementation | Single-file, readable, reproduction-friendly | Learning algorithm details, writing course code |
| Stable-Baselines3 | Algorithm library | Well-packaged DQN, PPO, SAC, TD3 implementations | Quick baselines, hyperparameter tuning, comparisons |
| Dopamine | Atari algorithm library | Google's DQN/Rainbow/IQN research framework | Atari paper reproduction, distributional value learning |

| Stage | Project Suggestion | Recommended Tools | Acceptance Criteria |
| --- | --- | --- | --- |
| 1 | CartPole, MountainCar, LunarLander | Gymnasium, CleanRL, Stable-Baselines3 | Can plot reward curves, understand replay and GAE |
| 2 | DQN / Rainbow on Atari | ALE, Dopamine, CleanRL | Reproduce at least 1 Atari experiment |
| 3 | PPO / SAC / TD3 on MuJoCo | MuJoCo, Stable-Baselines3, RL Baselines3 Zoo | Can explain entropy, target networks, Q bias |
| 4 | Self-play and multi-agent | PettingZoo, OpenSpiel, SMAC, Google Research Football | Can distinguish cooperative, competitive, and mixed games |
| 5 | Robot manipulation and visual control | Isaac Lab, ManiSkill, Meta-World, LeRobot | Can run parallel simulation or imitation-to-RL pipeline |
| 6 | Model-Based RL / World Models | DreamerV3, TD-MPC2, mbrl-lib, MBPO | Can explain latent dynamics and planning |
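
Stage 1 lists GAE among its acceptance criteria. As a rough, framework-free sketch of what "understanding GAE" could mean in code, the function below computes generalized advantage estimates for one rollout. The name compute_gae and its argument layout are illustrative assumptions, not the API of CleanRL or Stable-Baselines3.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for a single rollout of length T.

    values[t] is the critic's estimate V(s_t); last_value is V(s_T) and is
    used to bootstrap the final step. dones[t] is True when the episode
    ended after step t, which cuts both the bootstrap and the GAE recursion.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus baseline
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Setting lam=0 reduces this to one-step TD errors, and lam=1 with gamma=1 gives Monte Carlo returns minus the baseline, which is a quick way to sanity-check an implementation.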

Advanced Directions and Exercise Suggestions

| Direction | Recommended Reproduction Projects | Course Assignment Ideas |
| --- | --- | --- |
| Single-file algorithm implementation | CleanRL's DQN, PPO, SAC, C51, PPO-LSTM | Write 200-500 lines clearly covering replay, GAE, target networks, entropy |
| High-performance RL systems | Sample Factory, Ray RLlib, DI-engine | Compare single-machine, multi-process, and distributed rollout throughput and sample efficiency |
| JAX / GPU parallelism | Brax, PureJaxRL, JaxMARL | Use jit/vmap/pmap for large-batch environments; understand the "environments can also be accelerated" paradigm |
| Offline RL | D4RL + CQL/IQL/TD3+BC, Minari, d3rlpy, CORL | Compare online RL and offline RL extrapolation error |
| Imitation learning | BC, DAgger, GAIL, AIRL; imitation, robomimic | Train policy from expert trajectories, then fine-tune with RL |
| Reward learning & preference learning | GAIL/AIRL, preference comparison, reward model | Construct "human preferences" or scripted preferences, observe reward hacking |
| Safe & constrained RL | Safety-Gymnasium, OmniSafe, PPO-Lagrangian, CPO | Plot both reward curve and cost curve; learn constrained optimization |
| Exploration & sparse rewards | MiniGrid, Montezuma's Revenge, Procgen; RND, ICM, episodic curiosity | Study whether intrinsic rewards actually improve exploration vs just inflating training scores |
| Hierarchical & goal-conditioned RL | HER, Options, HIRO, BabyAI, Meta-World | Decompose long-horizon tasks into subgoals; compare flat vs hierarchical policies |
| Multi-task & generalization | Procgen, Meta-World, LIBERO, ContinualWorld | High scores on training environments aren't enough; test on unseen tasks and seeds |
| Multi-agent cooperation/competition | PettingZoo, OpenSpiel, SMAC, Google Research Football, JaxMARL | Compare independent PPO, MAPPO, QMIX, self-play |
| Robot manipulation | MuJoCo, Isaac Lab, ManiSkill, robosuite, Meta-World | Do reaching, pushing, pick-and-place, then add visual input |
| World models & planning | DreamerV3, TD-MPC2, mbrl-lib, MBPO, IRIS | Learn dynamics model first, then compare model-free vs model-based sample efficiency |
| Industrial applications | RecSim, FinRL, Pearl | Bandit/RL experiments in recommendation, advertising, financial trading; emphasize offline evaluation and risk |
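
The single-file assignment in the first row asks for replay and target networks. The sketch below shows one minimal way those two pieces might look, assuming parameters are plain Python lists of floats. ReplayBuffer and soft_update are hypothetical names for illustration, not a library API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        # deque(maxlen=...) silently evicts the oldest transition when full
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

The soft update shown here is the style used by DDPG, TD3, and SAC; the original DQN instead copies the online weights into the target network every fixed number of steps, which corresponds to calling soft_update with tau=1.0 on that schedule.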

Unity ML-Agents Introduction

Unity ML-Agents is an RL toolkit for training agents directly inside the Unity 3D game engine. Unlike Gymnasium's classic-control and grid-style tasks or PyBullet's pure physics simulation, ML-Agents provides complete 3D scenes including occlusion, perspective, gravity, and collision, which makes it suitable for studying visual navigation and spatial reasoning.

Typical usage:

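```python
# ML-Agents ships its own Python API (mlagents_envs); a wrapper exposes a
# Gym-style interface for single-agent builds.
from mlagents_envs.environment import UnityEnvironment
# Import path used by recent ML-Agents releases; older ones used gym_unity.envs.
from mlagents_envs.envs.unity_gym_env import UnityToGymWrapper
from stable_baselines3 import PPO

# Load a pre-built Unity environment (3DBall: balance a ball on a platform).
# file_name is the path to your compiled Unity executable.
env = UnityEnvironment(file_name="3DBall")

# Wrap it as a Gym-style environment. The wrapper expects a single-agent build.
gym_env = UnityToGymWrapper(env)

# Then train with Stable-Baselines3. Recent SB3 versions target Gymnasium,
# so a compatibility layer may be needed depending on installed versions.
model = PPO("MlpPolicy", gym_env)
model.learn(total_timesteps=100_000)

gym_env.close()
```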

Classic ML-Agents environment examples:

| Environment | Task Type | Difficulty | Best For |
| --- | --- | --- | --- |
| 3DBall | Balance control | Introductory | Understanding continuous action spaces |
| Crawler | Quadruped walking | Intermediate | Continuous control + multi-joint coordination |
| Walker | Bipedal walking | Intermediate | Compare with PyBullet's Walker2d |
| PushBlock | Push blocks | Introductory | Goal-conditioned RL |
| FoodCollector | Collect food | Intermediate | Multi-objective + navigation |
| HideAndSeek | Multi-agent hide-and-seek | Advanced | Multi-agent emergent behavior |

See the Environment Setup Guide for installation and environment access.

Classic Milestone Project Reference

Below are 30 common game and simulation reproduction directions from the non-LLM era, organized by theme:

Classic/Board Games

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 1 | TD-Gammon | Backgammon | 1992 | Gerald Tesauro; reached human expert level through self-play RL |
| 2 | Deep Blue | Chess | 1997 | IBM; defeated world champion Kasparov; primarily search-based, not pure RL |
| 3 | AlphaGo | Go | 2016 | DeepMind; RL + MCTS defeated Lee Sedol |
| 4 | AlphaGo Zero | Go | 2017 | No human game records; learned from self-play alone |
| 5 | AlphaZero | Go/Chess/Shogi | 2017 | Universal board-game RL algorithm; mastered three games simultaneously |
| 6 | MuZero | Go/Chess/Atari | 2020 | No explicit game rules needed; simultaneously learns model and policy |

Atari Series

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 7 | DQN (Playing Atari with Deep RL) | Atari 2600 | 2013 | First to use deep RL to learn multi-game policies directly from pixels |
| 8 | Human-level Control through DRL | Atari 2600 | 2015 | Nature 2015; improved DQN reaching human-level on multiple Atari games |
| 9 | Prioritized Experience Replay | Atari | 2015 | Improved experience replay; prioritizes high TD-error experiences |
| 10 | Rainbow DQN | Atari | 2017 | Integrates Double DQN, Dueling, PER, NoisyNet, Distributional RL, n-step return |
| 11 | IQN (Implicit Quantile Networks) | Atari | 2018 | Distributional RL; learns quantile representations of return distributions |

RTS / MOBA

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 12 | SC2LE (StarCraft II Learning Environment) | StarCraft II | 2017 | DeepMind provides SC2 RL research environment and benchmarks |
| 13 | AlphaStar | StarCraft II | 2019 | Multi-agent RL reaching Grandmaster level |
| 14 | TStarBot | StarCraft II | 2019 | Tencent's StarCraft II agent system |
| 15 | OpenAI Five | Dota 2 | 2019 | 5v5 defeated world champion OG; large-scale distributed RL |
| 16 | Honor of Kings 1v1 | Honor of Kings | 2020 | Tencent AI Lab; dual-clipped PPO; mastered complex operation control |
| 17 | Honor of Kings 5v5 | Honor of Kings | 2020 | Multi-hero, multi-role, global cooperation MOBA AI system |
| 18 | Honor of Kings Arena | Honor of Kings | 2022 | Open MOBA RL environment; focuses on generalization challenges |
| 19 | Mini Honor of Kings | Honor of Kings | 2024 | Lightweight MARL environment; suitable for personal devices and course projects |

FPS / 3D Games

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 20 | Playing FPS Games with Deep RL | ViZDoom | 2016 | Deep RL for FPS games with visual input and partially observable states |
| 21 | Quake III Arena: Capture the Flag | Quake III CTF | 2019 | DeepMind; complex team cooperation and multi-agent emergent behavior |
| 22 | Obstacle Tower | Unity 3D | 2019 | Tests 3D navigation, visual generalization, and long-horizon exploration |
| 23 | Sample Efficient RL in Minecraft | Minecraft/MineRL | 2021 | Using human demonstration data to improve sample efficiency in Minecraft |

Sports/Racing/Other

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 24 | Google Research Football | Football 11v11 | 2020 | Open-source football simulator supporting multi-agent RL research |
| 25 | RL in Rocket League | Rocket League | 2022 | High-dimensional continuous control and team cooperation in a racing-plus-football hybrid |
| 26 | Deep RL for Flappy Bird | Flappy Bird | 2015 | Early deep RL game practice project |

Multi-Agent/Comprehensive

| # | Name | Game/Environment | Year | Key Information |
| --- | --- | --- | --- | --- |
| 27 | Deep RL for General Game Playing | General board games | 2020 | Extending AlphaZero-style methods to general game playing |
| 28 | OpenSpiel | Board/card games | 2019 | DeepMind game framework containing multiple games and classic game algorithms |
| 29 | Hide-and-Seek | Multi-agent hide-and-seek | 2019 | OpenAI; emergent tool use and complex strategies from multi-agent self-play |
| 30 | Multi-Agent RL in Video Games | Survey | 2025 | Covers Rocket League, Doom, Minecraft, StarCraft, Dota, MOBA directions |

LLM Era: Post-Training, Reasoning, Agentic, VLM, and World Models

LLM-era RL is no longer just "maximize scores in fixed environments." Actions can be text, searches, tool calls, web clicks, code patches, visual grounding, or even entire multi-step agent trajectories. Rewards expand from environment scores to preference models, rule verifiers, process rewards, unit tests, web task success rates, and multimodal grounding signals.

Modern and Classic Resource Quick Reference

The recommended reading order: start with classic papers and official documentation to build concepts, then pick a "small model + verifiable reward" project to run end-to-end, and finally move into distributed training, Deep Research, GUI/Computer Use, and multimodal environments.

| Direction | Recommended First Look | Type | Why It's Worth Reading |
| --- | --- | --- | --- |
| RLHF / post-training classics | OpenAI InstructGPT, Anthropic Constitutional AI, Meta Llama 3 | Classic papers/official docs | Understand the basic paradigms of SFT, RM, PPO, DPO, RLAIF, and safety alignment |
| Modern post-training engineering | NVIDIA NeMo-RL, verl, OpenRLHF, DAPO | Production/research frameworks | See directly how rollout, vLLM/SGLang, Ray, Megatron, GRPO/DAPO, and async agentic RL are implemented |
| Reasoning RLVR | DeepSeek-R1, DeepSeek-R1 Nature, Open-R1, TinyZero | Modern reasoning reproduction | Best for learning verifiable reward, GRPO/RLVR, cold-start data, long reasoning, and reward hacking |
| Open-source base models | Qwen3.6, Qwen3, Meta Llama Models | Open-source models | Suitable for SFT/DPO/GRPO, tool calling, long context, and agentic coding experiments |
| Deep Research | OpenAI Deep Research, Alibaba Tongyi DeepResearch, WebThinker, Search-R1 | Product/open-source research | Turn search, reading, evidence filtering, citation, and long report synthesis into trainable trajectories |
| Agent frameworks & tool calling | OpenAI Agents SDK, Google ADK, Microsoft Agent Lightning, AutoGen | Agent engineering frameworks | Learn engineering boundaries: tools, handoffs, guardrails, tracing, sessions, agent trajectories, and RL interfaces |
| GUI / Computer Use | OpenAI CUA, Anthropic Computer Use, ByteDance UI-TARS, OSWorld | Models/tools/benchmarks | Core materials for modern computer use: screenshots, coordinate actions, web/desktop/mobile task success rates |
| VLM / VLA / Robotics | VLM-R1, Open Vision Reasoner, Gemini Robotics, LeRobot | Multimodal/embodied | Connect visual QA, grounding, GUI clicks, robot actions, and verifiable rewards |
| World models | DreamerV3 Nature, DreamerV3 Code, Google DeepMind Genie 3, Isaac Lab | Classic/frontier/simulation | From reproducible world models to interactive world generation to parallel robot simulation |
| Generative model RL | DDPO, Diffusers DDPO, AlignProp, RLAIF-V, VideoAlign | Image/video/multimodal rewards | Learn to turn aesthetics, preferences, safety, text-image consistency, or video quality into optimization objectives |

(The remaining sections — LLM Post-Training, LLM Reasoning, Deep Research RL, Agentic RL & Tool Calling, GUI & Computer Use, VLM, World Models & Simulators, Generative Model RL, Evaluation Benchmarks, and Reproduction Order — contain detailed project recommendations, common pitfalls, and resource tables that follow the same pattern as the sections above. Each subsection includes: reproduction goals, resource tables with links, and a "common pitfalls" list.)

Evaluation Benchmarks and Projects

LLM-era RL evaluation is often more error-prone than training. For each direction, track final success rate, process quality, format constraints, reward hacking, length bias, data leakage, and multi-sample stability together rather than any single number.

Acceptance Checklist

  • Final metrics: accuracy, pass rate, task success rate, preference win rate.
  • Process metrics: tool call count, invalid action ratio, repeated search ratio, citation accessibility rate, code test failure types.
  • Stability metrics: effectiveness across different random seeds, sampling temperatures, and model sizes.
  • Safety metrics: whether the model is more prone to fabricated citations, unauthorized tool calls, environment information leakage, or broken format constraints.
  • Cost metrics: average tokens, average tool calls, average latency, training and evaluation GPU/CPU overhead.
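
The stability item in the checklist can be made concrete with a small aggregation over seeds. The sketch below assumes each evaluation run is stored as a list of per-task pass/fail booleans keyed by seed; summarize_runs and that data layout are illustrative assumptions, not part of any framework named in this appendix.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Aggregate pass rates across seeds.

    results maps a seed to a list of booleans, one per evaluation task,
    where True means the task passed under that seed.
    """
    per_seed = {seed: sum(outcomes) / len(outcomes)
                for seed, outcomes in results.items()}
    rates = list(per_seed.values())
    return {
        "per_seed": per_seed,
        "mean": mean(rates),
        # Population std over seeds; a large value signals unstable training
        "std": pstdev(rates),
        "min": min(rates),
        "max": max(rates),
    }


if __name__ == "__main__":
    demo = {0: [True, True, False, False], 1: [True, True, True, False]}
    print(summarize_runs(demo))
```

Reporting the minimum alongside the mean is a deliberate choice here: a method whose worst seed collapses is usually not ready for a larger model, even if its average looks good.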

Badcase Template

For each direction, maintain a badcases.jsonl or spreadsheet recording at minimum: task ID, input, model output, reward, scoring rationale, failure type, reproducibility, and fix suggestion. For LLM RL, badcases are not an afterthought — they are the entry point for next-round reward design, data filtering, and environment fixes.
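
One possible shape for such a file is sketched below using only the standard library. The field names mirror the list above, but the exact schema, the Badcase class, and the helper functions are assumptions made for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Badcase:
    task_id: str
    input: str
    model_output: str
    reward: float
    scoring_rationale: str
    failure_type: str
    reproducible: bool
    fix_suggestion: str


def append_badcase(path, case):
    """Append one record as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(case), ensure_ascii=False) + "\n")


def load_badcases(path):
    """Read every non-empty line back into a Badcase."""
    with Path(path).open(encoding="utf-8") as f:
        return [Badcase(**json.loads(line)) for line in f if line.strip()]


def count_by_failure_type(cases):
    """Tally failure types to see which problem to fix first."""
    return dict(Counter(case.failure_type for case in cases))
```

Appending one line per case keeps the file friendly to version control and lets the failure-type tally be recomputed after each training round.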

Reproduction Order Suggestion

First use 0.5B to 3B small models with math, code, and format verification tasks to observe reward hacking, length bias, and sampling temperature effects; then migrate from TRL/TinyZero/Open-R1 to distributed frameworks like verl/OpenRLHF. For Agentic RL, prioritize tasks with clear success rates like search, web, and code; for VLM RL, prioritize scorable tasks like image-text answers, grounding, OCR, and GUI clicks; for world models and embodied directions, first run DreamerV3/TD-MPC2, then add vision and real-robot complexity.

A Solid Roadmap

  1. Week 1: Rule-reward tasks. Use TRL or TinyZero to run a small verifiable task like Countdown, formatted JSON, or simple math. Goal: understand rollout, reward, advantage, KL, length bias, and log saving.

  2. Week 2: Preference optimization and post-training comparison. Use the same small model for SFT, DPO/KTO, and PPO/GRPO comparison. Don't change too many variables; just observe how different training methods affect the same batch of prompts.

  3. Week 3: Reasoning RLVR. Introduce Math-Verify, reasoning-gym, or code unit tests so reward evolves from "format correct" to "answer verifiable." Focus on observing reward sparsity and verifier loopholes.

  4. Week 4: Tool calling or Deep Research. Build a small search/reading environment and record complete trajectories. Start with offline trajectory replay, then move to online rollout.

  5. Week 5: VLM or GUI. Choose a visual QA, bbox grounding, or web click task and add visualized badcases. Focus on checking coordinate systems, screenshot states, and reward interpretability.

  6. Week 6+: Distributed and industrial frameworks. Move into verl, OpenRLHF, AReaL, SkyRL, and similar frameworks. By now you know what reward, logging, and evaluation you need, so engineering complexity won't lead you astray.
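
To make Week 1 concrete, the sketch below pairs a rule-based reward for a Countdown-style task with a group-normalized advantage in the spirit of GRPO. It is a simplified illustration under stated assumptions: the 0.1 partial credit for a well-formed but wrong expression, the requirement that every number be used exactly once, and the function names are all choices made here, not the TRL or TinyZero implementation.

```python
import ast
import operator
from statistics import mean, pstdev

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node, used):
    """Evaluate a parsed expression that may contain only ints and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """1.0 if correct, 0.1 if well-formed but wrong, 0.0 if unparseable."""
    try:
        used = []
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if sorted(used) != sorted(numbers):
        return 0.1  # assumed rule: each given number used exactly once
    return 1.0 if abs(value - target) < 1e-6 else 0.1


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Parsing with ast instead of calling eval keeps the verifier from executing arbitrary model output, which is one of the simplest verifier loopholes to close. Note also that a group whose completions all receive the same reward yields zero advantages everywhere, so such prompts contribute no learning signal.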

When to Increase Difficulty?

When a task meets three criteria, you can move to the next level: first, the pre/post-training difference on a fixed evaluation set is stable; second, badcases can be clearly classified; third, when reward rises, human spot-check quality also rises. Otherwise, don't rush to switch to a larger model or more complex environment — fix reward, data, and logging first.
