10.6 Agentic RL Extended Reading Index

The first six sections of this chapter covered the core theory, engineering practice, and industrial case studies of Agentic RL. But the landscape of Agentic RL extends far beyond that. In 2025--2026, RL is being applied to an increasing range of agent scenarios: from role-playing to creative writing, from scientific discovery to empathetic dialogue. This page organizes more than 100 representative works by theme for further exploration.

How to use this index

Each theme is ordered as: survey -> methods -> systems. We recommend starting with survey works to build a global view, then going deeper into specific directions as needed. Works marked [open-source] have released code or model weights (most link to GitHub) and can be used for hands-on experimentation.

Surveys and Theoretical Foundations

The theoretical foundations of Agentic RL are rapidly taking shape. The surveys collected here map the landscape of this emerging field from different angles: some focus on training recipes and engineering practice, others reconceptualize LLMs as autonomous decision-makers and survey 500+ works around six core capabilities, and still others are written specifically for deep research systems or agentic search tasks. If you want to quickly build a mental model of the Agentic RL landscape, start here.

Work	Key highlight	Link
Adaptation of Agentic AI: A Survey	Survey of post-training, memory, and skill adaptation techniques for AI agents	arXiv
Training Recipes for Agentic RL in LLMs	Systematic compilation of Agentic RL training recipes, including environments and sampling strategies	TechRxiv
The Landscape of Agentic RL for LLMs: A Survey	Treats LLMs as autonomous decision-makers and surveys 500+ works around six core capabilities	arXiv
A Comprehensive Survey on RL-based Agentic Search	Survey of reinforcement learning applied to agentic search tasks	arXiv
Meta-Thinking in LLMs via Multi-Agent RL	Explores how multi-agent RL can enable meta-thinking capabilities in LLMs	arXiv
Reinforcement Learning Foundations for Deep Research Systems	First survey written specifically for RL foundations of deep research systems	arXiv

Deep Research and Information Integration

Deep research agents are one of the hottest application directions in Agentic RL. Unlike simple search-and-summarize, they require models to perform multi-turn, long-horizon information search, cross-validation, and synthesis in real web environments. This section includes everything from end-to-end RL frameworks to citation-aware rewards, covering model scales from 4B to 30B parameters.

Work	Key highlight	Link
DeepResearcher [open-source]	End-to-end RL framework for search interaction in real web environments	GitHub
Tongyi DeepResearch [open-source]	Alibaba Tongyi Lab's 30.5B MoE model (3.3B active), using a two-stage "Agentic Mid-training + Post-training" pipeline	arXiv
IntentRL	Trains agents to actively clarify ambiguous user intent before starting long-horizon research	arXiv
DR Tulu / RLER	RL training scheme using evolved scoring criteria (RLER) to improve long-form research capabilities	AllenAI Blog
EigentSearch-Q+	Introduces structured reasoning tools (Q+) to enhance deep research agent capabilities	arXiv
Fathom-DeepResearch	Multi-agent system composed of Search and Reason 4B models, generating the DUETQA dataset	arXiv
PokeeResearch-7B [open-source]	7B-parameter open-source deep research agent	HuggingFace
SFR-DeepResearch	Salesforce; focuses on continuous RL training for autonomous single agents	arXiv
CaRR / C-GRPO [open-source]	Introduces citation-aware scoring rewards to curb model hallucination	GitHub

Reinforcement Reasoning and Code Generation

RLVR (Reinforcement Learning with Verifiable Rewards) is a natural fit for code generation: whether code executes correctly and passes its tests is an objectively verifiable signal. The works in this section build on this advantage. Some integrate code execution feedback directly into multi-turn training, some explore RLVR without ground-truth supervision, and others find that models spontaneously learn to generate and execute code, revealing scaling laws.

Work	Key highlight	Link
rStar2-Agent [open-source]	GRPO-based 14B Agent RL algorithm showing strong competitiveness on math reasoning	arXiv
Murphy	Multi-turn RLVR framework integrating code execution feedback directly into training	arXiv
ZeroCoder	Explores improving code generation through RLVR without ground-truth supervision	arXiv
SARL	Achieves label-free reasoning improvement by rewarding reasoning topology structure	arXiv
Agentic RL Scaling Law / ZeroTIR [open-source]	Discovers models spontaneously learn to generate and execute code, revealing training scaling laws	GitHub
Agnostics	Language-agnostic code RL training framework	Project
ReLook	RL based on visual feedback (rendered screenshots) to optimize web frontend code generation	arXiv
Agentic Code Reasoning	Provides low-cost, risk-free reward signals for RL through semi-formal reasoning	arXiv
Code-Space Response Oracles	Uses LLMs as code generation oracles, replacing traditional RL oracles	arXiv
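The verifiable signal that this section relies on can be illustrated with a small sketch. The helper below is hypothetical (the name `verifiable_code_reward`, the assert-based test format, and the binary 0/1 reward are all assumptions made for illustration, not the reward used by any work listed above). It runs a candidate program together with its tests in a subprocess and returns 1.0 only if every assertion passes.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verifiable_code_reward(candidate_src: str, test_src: str,
                           timeout_s: float = 5.0) -> float:
    """Hypothetical binary reward: 1.0 if the candidate passes its tests, else 0.0.

    Assumption: tests are plain `assert` statements appended to the candidate.
    A subprocess is NOT a security sandbox; real pipelines isolate execution.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_case.py"
        # Candidate code first, then the tests that exercise it.
        script.write_text(candidate_src + "\n\n" + test_src + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # code that does not terminate earns no reward
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return 1.0 if result.returncode == 0 else 0.0


if __name__ == "__main__":
    tests = "assert add(2, 3) == 5"
    good = "def add(a, b):\n    return a + b"
    bad = "def add(a, b):\n    return a - b"
    print(verifiable_code_reward(good, tests))  # 1.0
    print(verifiable_code_reward(bad, tests))   # 0.0
```

Because the reward depends only on the exit code, it needs no learned reward model. The trade-off is that the signal is sparse, which is one reason multi-turn execution feedback is attractive.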

GUI and Web Agents

GUI agents enable AI to operate graphical interfaces like humans -- clicking buttons, filling forms, navigating web pages. The value of RL here is that SFT can only teach models to "mimic clicks," while RL enables models to "choose the optimal action path based on goals." This section covers approaches from web to mobile, from 3B small models to continual learning frameworks.

Work	Key highlight	Link
WebAgent-R1 [open-source]	End-to-end multi-turn RL framework improving 3B model success rate from 6.1% to 33.9%	GitHub
Web-Shepherd [open-source]	First step-level reward model specifically for web navigation, evaluating each interaction step	GitHub
CRAFT-GUI	Combines curriculum learning with GRPO to improve GUI agent performance	arXiv
MobileRL [open-source]	Mobile online RL framework using ADAGRPO algorithm	GitHub
Co-EPG	Co-evolution framework simultaneously optimizing GUI agent planning and grounding capabilities	AAAI
Continual GUI Agents	Defines and addresses learning problems for GUI agents in continually changing environments	arXiv
WebFactory	Fully automated closed-loop RL flow that "compresses" LLM intelligence into efficient GUI agents	OpenReview
ZeroGUI	Zero human-cost online GUI agent learning framework	arXiv
UI-S1	Semi-online RL training method combining offline and online data advantages	arXiv
Generalization in Online RL for Mobile Agents	Studies generalization in online RL for mobile agents, showing that RL can surpass SFT baselines	OpenReview

Embodied Intelligence and Robotics

When RL moves from the digital world to the physical world, agents face not text or images, but continuous control signals and uncertain physical environments. The works in this section explore how LLMs can directly participate in robot reasoning and control: some use RL to optimize spatial reasoning so 7B models surpass GPT-4o, some train self-correction capabilities in pixel-level world models, and others study cross-embodiment transfer and maintaining "cognitive identity" during continual learning.

Work	Key highlight	Link
Robot-R1	Uses RL to directly optimize robot reasoning; 7B model spatial reasoning surpasses GPT-4o	arXiv
WMPO [open-source]	RL training in pixel-level visual world models, with emergent self-correction capabilities	GitHub
ViVa	Uses pre-trained video generation models as value function estimators for state value assessment	arXiv
RoboAgent	Achieves embodied task planning through composing foundational capabilities	arXiv
Cross-Embodiment Offline RL	Achieves offline RL across different robot morphologies through morphological grouping strategies	arXiv
Sensory-Motor Control with LLMs	Enables LLMs to directly generate continuous control policies through iterative policy refinement	arXiv
RM-RL	Proposes "role model" RL for precise robot manipulation	arXiv
Learning Without Losing Identity	Studies how embodied agents maintain stable "cognitive identity" while continually learning new capabilities	arXiv

Multi-Agent Systems and Collaboration

Multi-agent collaboration is far harder than the single-agent case. While you learn a new strategy, your teammates are changing too, which makes the environment non-stationary; and when the team succeeds or fails, it is unclear which agent deserves the credit or the blame. The works in this section address these challenges from multiple angles: extending GRPO to multi-agent settings, achieving decentralized coordination through knowledge distillation, solving context drift with digital twins, and large-scale MARL frameworks that jointly optimize sampling and training end-to-end.

Work	Key highlight	Link
MAPoRL	New paradigm for multi-agent collaborative training	arXiv
M-GRPO	Extends GRPO algorithm to multi-agent scenarios	arXiv
SAGE	Closed-loop self-evolution multi-agent RL framework	arXiv
MARTI [open-source]	Multi-agent debate framework	GitHub
KD-MARL	Transfers centralized expert coordination to lightweight decentralized agents through knowledge distillation	arXiv
Value-Guidance MeanFlow	Value-guided flow model for offline multi-agent RL	arXiv
FlexMARL	First end-to-end training framework jointly optimizing sampling, training, and their orchestration for large-scale LLM-based MARL	arXiv
TwinLoop	Proposes simulation-in-the-loop digital twin framework to address multi-agent performance degradation from context changes	arXiv
Equivariant Multi-agent RL	Equivariant multi-agent RL for multi-modal vehicle-infrastructure cooperative systems	arXiv

World Models and Model-Based RL

The core bottleneck of model-free RL is sample efficiency -- agents must learn through extensive trial and error. World models provide a path around this bottleneck: first learn to "simulate the environment in your head," then generate training data in imagination. This section collects approaches from diffusion world models to object-centric representations, all with the core idea of having policy models interact with world models to complete multi-step planning and training "in imagination."

Work	Key highlight	Link
GIRL	Generative imagination RL through information-theoretic hallucination control	arXiv
World4RL	Diffusion world model for policy refinement in robot manipulation	arXiv
Dreamer-CDP	Dreamer variant that does not require reconstructing raw pixel observations	Project
RLVR-World	Uses RLVR to directly optimize world models	arXiv
OC-STORM	Enhances world models with object-centric representations for sample-efficient RL	arXiv
Imagine-then-Plan (ITP)	Policy models interact with world models to generate multi-step trajectories "in imagination"	arXiv
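The "learn a model, then train in imagination" loop described at the start of this section can be sketched on a toy problem. Everything below is an illustrative assumption: the 5-state chain environment, the `TabularWorldModel` class, and the `imagine` helper are hypothetical and far simpler than the diffusion or object-centric world models listed above. The point is only to show that, once the model is fitted, trajectories are generated without touching the real environment.

```python
from typing import Callable, Dict, List, Tuple

State = int
Action = int
Transition = Tuple[State, Action, float, State]  # (s, a, reward, s')


def real_step(state: State, action: Action) -> Tuple[float, State]:
    """Toy 'real' environment: a 5-state chain, reward 1.0 on reaching state 4."""
    next_state = max(0, min(4, state + action))
    return (1.0 if next_state == 4 else 0.0), next_state


class TabularWorldModel:
    """Hypothetical world model that memorizes observed transitions."""

    def __init__(self) -> None:
        self.table: Dict[Tuple[State, Action], Tuple[float, State]] = {}

    def fit(self, transitions: List[Transition]) -> None:
        for s, a, r, s2 in transitions:
            self.table[(s, a)] = (r, s2)

    def predict(self, state: State, action: Action) -> Tuple[float, State]:
        # Unseen (state, action) pairs fall back to "stay in place, no reward".
        return self.table.get((state, action), (0.0, state))


def imagine(model: TabularWorldModel, policy: Callable[[State], Action],
            start: State, horizon: int) -> List[Transition]:
    """Roll the policy forward inside the model only (no real environment calls)."""
    trajectory: List[Transition] = []
    state = start
    for _ in range(horizon):
        action = policy(state)
        reward, next_state = model.predict(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


if __name__ == "__main__":
    # 1) Collect a small batch of real experience (10 transitions).
    real_data = [(s, a, *real_step(s, a)) for s in range(5) for a in (-1, 1)]
    # 2) Fit the world model on it.
    model = TabularWorldModel()
    model.fit(real_data)
    # 3) Generate training data in imagination with an "always move right" policy.
    for transition in imagine(model, lambda s: 1, start=0, horizon=4):
        print(transition)
```

In a realistic setting the model is imperfect, so errors compound over the imagined horizon; several of the works above focus on controlling exactly that.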

Role-Playing and Persona Simulation

Role-playing is not just "pretending to be someone" -- it requires models to maintain consistent personality traits, thinking styles, and behavioral patterns across long conversations. The value of RL here is that through verifiable role-awareness rewards, it reinforces the model's continuous perception of "who I am." The works in this section range from dual-layer thinking frameworks (distinguishing character perspective from model perspective) to multi-character self-play, exploring how to make AI truly "get into character" and maintain role consistency.

Work	Key highlight	Link
HER (Human-like Reasoning)	Proposes dual-layer thinking framework distinguishing character first-person thoughts from LLM third-person thoughts (note: not classic Hindsight Experience Replay)	arXiv
OMAR	Cultivates AI social intelligence through multi-turn self-play RL	arXiv
R4	Equips reward models and role-playing agents with reasoning and retrieval capabilities	ICLR Poster
VeriRole	Improves role awareness through verifiable prompt-guided RL	OpenReview
SPELL	Multi-character self-play RL framework for long-context reasoning	arXiv
Consistently Simulating Human Personas	Proposes a unified framework for evaluating and improving LLM role consistency	OpenReview
CPO	Comparative policy optimization for reward ambiguity in role-playing dialogue	arXiv
RAIDEN-R1	Proposes verifiable role-awareness reward (VRAR) to reinforce model perception of its own role	arXiv

Creative and Long-Form Writing

Creative writing poses unique challenges for RL: rewards are not as objectively verifiable as code execution, and "good" writing is subjective and multi-dimensional. The works in this section explore how to design reward signals that capture creative quality -- from generative reward models performing multi-dimensional reasoning about story preferences, to optimizing rubric-based reward models through alternating RL, to comparing different reward strategies via RLAIF to stimulate creative capabilities in small models.

Work	Key highlight	Link
Writer-R1	Memory-augmented Replay Policy Optimization	arXiv
R2-Write	Systematic study of open-domain writing, proposing a reflection and revision framework	arXiv
DPWriter	Addresses output diversity reduction during RL training through diverse planning branches	arXiv
RLMR	First to combine subjective preferences with objective verification in online RL training	arXiv
Rewarding Creativity	Develops generative reward models for multi-dimensional analysis and explicit reasoning about story preferences	arXiv
Alternating RL for Rubric-Based Reward Modeling	Optimizes rubric-based reward models through alternating RL, achieving SOTA on multiple writing benchmarks	arXiv
Igniting Creative Writing in SLMs	Compares two reward strategies under RLAIF framework to stimulate creative writing in 7B small models	ACL Anthology

Emotional Intelligence and Empathetic Dialogue

Empathy is not just "understanding emotions" -- it requires expressing appropriate responses at the right time while maintaining logical coherence in conversation. The value of RL here is enabling models to learn to balance "emotional support" with "cognitive reasoning." The works in this section range from verifiable emotion rewards to psychology-based empathetic reward modeling, exploring how to provide more grounded reward signals for RL.

Work	Key highlight	Link
RLVER	Trains LLM higher-order empathy using verifiable emotion rewards	arXiv
CARE	Cognitive reasoning-enhanced RL improving logical coherence and support quality in emotional support dialogue	arXiv
COMPEER	Unified process-outcome RL for structured empathetic reasoning	arXiv
DialogXpert	Online value RL-based dialogue planning with over 94% success rate on negotiation, emotional support, and other tasks	arXiv
EILS	Bio-emotion-inspired homeostatic learning signal framework for building adaptive autonomous agents	arXiv
SAGE (Steering Dialog Generation)	Uses latent variables to control long-term behavior of dialogue generation for building emotionally intelligent chatbots	arXiv
PERM	Psychology-based empathetic reward modeling providing more grounded reward signals for RL	arXiv

Art and Visual Creation

RL entering the art world is an interesting crossover -- it models "aesthetic judgment" as an optimizable reward signal. The works in this section cover applications from image generation optimization to hierarchical painting, from personalized hand-drawn illustrations to artistic style learning. Core approaches include: coordinating multiple expert models for iterative image generation optimization, learning artist styles from stroke data through inverse RL, and using hierarchical RL to separate high-level planning from low-level rendering.

Work	Key highlight	Link
Image-POSER	Reflective RL framework coordinating multiple expert models for iterative image generation optimization based on complex text prompts	arXiv
HRL-Painter	Hierarchical RL-based painting method with high-level region planning and low-level stroke execution	Neurocomputing
PersonaSketch-RL	RL-based strategy for optimizing personalized hand-drawn illustration generation	ScienceDirect
RMLer	Models cross-category concept fusion as an RL problem for synthesizing novel objects	arXiv
Sequential Art Creation	Deep RL framework for creating sequential artworks that are visually distinct from inputs	UTA Thesis
MVAEx-RL	RL-based multi-modal art element extraction and dynamic adaptation strategy for environment design	Springer
DailyArt	Models joint estimation as synthesis-mediated inference, inferring dynamics from single static images	arXiv

RL Training Infrastructure and Algorithm Innovation

The engineering complexity of Agentic RL far exceeds standard LLM RL -- you need to simultaneously manage model training on GPUs, tool execution on CPUs, and environment interaction over networks. This section focuses on the infrastructure and algorithm innovations supporting these complex training pipelines: from fully asynchronous training systems to scalable synthetic learning environments, from retrieval-augmented policy optimization to new paradigms that convert inference compute into training signals.

Work	Key highlight	Link
AReaL v1.0 [open-source]	Jointly open-sourced by Ant Group and Tsinghua, enabling "one-click agent integration into RL training"	GitHub
RollArt / RollARC	Maximizes multi-task Agentic RL training throughput through decoupled infrastructure (RollARC)	arXiv
SparrowRL	High-performance RL training system achieving lossless sparse incremental synchronization on commodity networks	arXiv
Laminar	Scalable, robust asynchronous RL post-training system based on fully decoupled architecture	arXiv
SCALER	Synthesizes scalable, adaptive learning environments, providing an effectively unlimited supply of verifiable reasoning environments for RL training	arXiv
L-Zero (L0)	Low-cost, scalable end-to-end universal agent training pipeline	arXiv
Compute as Teacher (CaT)	Converts inference-time parallel sampling compute into RL training supervision signals	arXiv
RAPO	Retrieval-augmented policy optimization, explicitly expanding agent exploration space during training	arXiv
LLM-Explorer [open-source]	Tsinghua; a plugin that can enhance exploration capabilities of various RL algorithms	GitHub
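The decoupling of environment interaction from model training that this section's introduction describes can be sketched with a producer-consumer queue. The code below is a minimal illustration under stated assumptions: `rollout_worker`, `trainer`, and `run_pipeline` are hypothetical names, `asyncio.sleep` stands in for tool and network latency, and no actual gradient step is performed. It does not reflect the architecture of any specific system listed above.

```python
import asyncio
from typing import Dict, List


async def rollout_worker(worker_id: int, queue: asyncio.Queue,
                         episodes: int) -> None:
    """Stand-in for environment interaction and tool calls (I/O-bound)."""
    for episode in range(episodes):
        await asyncio.sleep(0.01)  # placeholder for network / tool latency
        await queue.put({"worker": worker_id, "episode": episode, "reward": 1.0})


async def trainer(queue: asyncio.Queue, batch_size: int, total: int) -> List[int]:
    """Stand-in for the learner: consumes trajectories as they arrive."""
    batch_sizes: List[int] = []
    batch: List[Dict] = []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            # A real system would run a policy-gradient update here.
            batch_sizes.append(len(batch))
            batch = []
    if batch:  # flush the final partial batch
        batch_sizes.append(len(batch))
    return batch_sizes


async def run_pipeline(num_workers: int = 3, episodes: int = 4,
                       batch_size: int = 5) -> List[int]:
    # A bounded queue applies backpressure when rollouts outpace training.
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    workers = [
        asyncio.create_task(rollout_worker(i, queue, episodes))
        for i in range(num_workers)
    ]
    sizes = await trainer(queue, batch_size, total=num_workers * episodes)
    await asyncio.gather(*workers)
    return sizes


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))  # [5, 5, 2]
```

The trainer never waits for a whole round of rollouts to finish, which is the basic idea behind asynchronous designs. The cost is that some trajectories come from a slightly stale policy, and production systems have to account for that.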

Scientific Discovery and Industrial Applications

RL is moving out of the laboratory and into real application scenarios including chemistry, materials science, medicine, and industrial manufacturing. The works in this section model scientific problems as MDPs: lead compound optimization becomes a search problem under synthetic constraints, materials design becomes an optimization problem using formation energy feedback, and industrial anomaly detection becomes a policy learning problem for data synthesis. These applications demonstrate RL's potential as a "universal decision optimizer."

Work	Key highlight	Link
MolReAct	Models lead compound optimization as MDP, using RL for efficient search under synthetic constraints	arXiv
PolyRL	Multi-objective polymer generation and discovery guided by RL	RSC
Helix	Hierarchical evolutionary RL framework for open-ended scientific problem solving	arXiv
RLFEF	RL using formation energy feedback to fine-tune material diffusion models, improving crystal stability	dblp
AnomalyAgent	Industrial anomaly data synthesis agent that optimizes generation of highly realistic anomaly samples through RL	arXiv
Autonomous Adaptive Solver Selection	Uses constrained RL framework for autonomous solver selection during chemical integration	arXiv
PPO-based Surface Reconstruction	Deep RL framework based on PPO for surface reconstruction of AgPd alloy catalysts	AIP PDF
MedVR	For medical VQA, proposes two RL mechanisms: entropy-guided visual relocation (EVR) and consensus-driven credit assignment	arXiv

Note: The above works are papers or projects published or preprinted in 2025--2026. Some arXiv papers may have updated versions; we recommend searching by paper title on arxiv.org or Semantic Scholar for the latest versions.
