
10.6 Agentic RL Extended Reading Index

The first six sections of this chapter covered the core theory, engineering practice, and industrial case studies of Agentic RL. But the landscape of Agentic RL extends far beyond that. In 2025--2026, RL is being applied to an increasing range of agent scenarios: from role-playing to creative writing, from scientific discovery to empathetic dialogue. This page organizes roughly 100 representative works by theme for further exploration.

How to use this index

Each theme is ordered as: survey -> methods -> systems. We recommend starting with survey works to build a global view, then going deeper into specific directions as needed. Works marked [open-source] have publicly released code or model weights (most link to GitHub) and can be used for hands-on experimentation.

Surveys and Theoretical Foundations

The theoretical foundations of Agentic RL are rapidly taking shape. The surveys collected here map the landscape of this emerging field from different angles: some focus on training recipes and engineering practice, others reconceptualize LLMs as autonomous decision-makers and survey 500+ works around six core capabilities, and still others are written specifically for deep research systems or agentic search tasks. If you want to quickly build a mental model of the Agentic RL landscape, start here.

| Work | Key highlight | Link |
| --- | --- | --- |
| Adaptation of Agentic AI: A Survey | Survey of post-training, memory, and skill adaptation techniques for AI agents | arXiv |
| Training Recipes for Agentic RL in LLMs | Systematic compilation of Agentic RL training recipes, including environments and sampling strategies | TechRxiv |
| The Landscape of Agentic RL for LLMs: A Survey | Treats LLMs as autonomous decision-makers and surveys 500+ works around six core capabilities | arXiv |
| A Comprehensive Survey on RL-based Agentic Search | Survey of reinforcement learning applied to agentic search tasks | arXiv |
| Meta-Thinking in LLMs via Multi-Agent RL | Explores how multi-agent RL can enable meta-thinking capabilities in LLMs | arXiv |
| Reinforcement Learning Foundations for Deep Research Systems | First survey written specifically for RL foundations of deep research systems | arXiv |

Deep Research and Information Integration

Deep research agents are one of the hottest application directions in Agentic RL. Unlike simple search-and-summarize, they require models to perform multi-turn, long-horizon information search, cross-validation, and synthesis in real web environments. This section ranges from end-to-end RL frameworks to citation-aware rewards, and covers model scales from 4B and 7B small models to a 30B-class MoE model.

| Work | Key highlight | Link |
| --- | --- | --- |
| DeepResearcher [open-source] | End-to-end RL framework for search interaction in real web environments | GitHub |
| Tongyi DeepResearch [open-source] | Alibaba Tongyi Lab's 30.5B MoE model (3.3B active), using a two-stage "Agentic Mid-training + Post-training" pipeline | arXiv |
| IntentRL | Trains agents to actively clarify ambiguous user intent before starting long-horizon research | arXiv |
| DR Tulu / RLER | RL training scheme using evolved scoring criteria (RLER) to improve long-form research capabilities | AllenAI Blog |
| EigentSearch-Q+ | Introduces structured reasoning tools (Q+) to enhance deep research agent capabilities | arXiv |
| Fathom-DeepResearch | Multi-agent system composed of Search and Reason 4B models, generating the DUETQA dataset | arXiv |
| PokeeResearch-7B [open-source] | 7B-parameter open-source deep research agent | HuggingFace |
| SFR-DeepResearch | Salesforce; focuses on continuous RL training for autonomous single agents | arXiv |
| CaRR / C-GRPO [open-source] | Introduces citation-aware scoring rewards to curb model hallucination | GitHub |

Reinforcement Reasoning and Code Generation

RLVR (Reinforcement Learning with Verifiable Rewards) is a natural fit for code generation: whether code passes its tests and executes correctly is an objectively verifiable signal. The works in this section build on this core advantage: some integrate code execution feedback directly into multi-turn training, some explore RLVR without ground-truth supervision, and others find that models spontaneously learn to generate and execute code, revealing scaling laws.

| Work | Key highlight | Link |
| --- | --- | --- |
| rStar2-Agent [open-source] | GRPO-based 14B Agent RL algorithm showing strong competitiveness on math reasoning | arXiv |
| Murphy | Multi-turn RLVR framework integrating code execution feedback directly into training | arXiv |
| ZeroCoder | Explores improving code generation through RLVR without ground-truth supervision | arXiv |
| SARL | Achieves label-free reasoning improvement by rewarding reasoning topology structure | arXiv |
| Agentic RL Scaling Law / ZeroTIR [open-source] | Discovers models spontaneously learn to generate and execute code, revealing training scaling laws | GitHub |
| Agnostics | Language-agnostic code RL training framework | Project |
| ReLook | RL based on visual feedback (rendered screenshots) to optimize web frontend code generation | arXiv |
| Agentic Code Reasoning | Provides low-cost, risk-free reward signals for RL through semi-formal reasoning | arXiv |
| Code-Space Response Oracles | Uses LLMs as code generation oracles, replacing traditional RL oracles | arXiv |
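
The verifiable-reward idea that opens this section can be sketched in a few lines. This is a minimal illustration only, not the method of any work listed above: the helper name `verifiable_code_reward`, the binary 0/1 reward, the assert-based test format, and the timeout value are all assumptions chosen for clarity. It also runs the candidate without any sandboxing, which a real training pipeline would need.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verifiable_code_reward(candidate: str, tests: str, timeout_s: float = 5.0) -> float:
    """Return 1.0 if `candidate` passes `tests`, else 0.0 (hypothetical helper).

    NOTE: illustrative only. The candidate runs in a plain subprocess with
    no sandbox, so do not use this on untrusted model output as-is.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "check.py"
        # Candidate code first, then assert-based tests that exercise it.
        script.write_text(candidate + "\n\n" + tests, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # Programs that do not finish in time count as failures.
    # Exit code 0 means the script ran and every assertion held.
    return 1.0 if result.returncode == 0 else 0.0


# Toy task: implement add(a, b). The tests are the "verifier".
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"

if __name__ == "__main__":
    print("good candidate reward:", verifiable_code_reward(good, TESTS))
    print("bad candidate reward:", verifiable_code_reward(bad, TESTS))
```

A binary pass/fail reward is the simplest choice; a fraction-of-tests-passed reward is a common alternative when a denser signal is wanted.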

GUI and Web Agents

GUI agents enable AI to operate graphical interfaces like humans -- clicking buttons, filling forms, navigating web pages. The value of RL here is that SFT can only teach models to "mimic clicks," while RL enables models to "choose the optimal action path based on goals." This section covers approaches from web to mobile, from 3B small models to continual learning frameworks.

| Work | Key highlight | Link |
| --- | --- | --- |
| WebAgent-R1 [open-source] | End-to-end multi-turn RL framework improving 3B model success rate from 6.1% to 33.9% | GitHub |
| Web-Shepherd [open-source] | First step-level reward model specifically for web navigation, evaluating each interaction step | GitHub |
| CRAFT-GUI | Combines curriculum learning with GRPO to improve GUI agent performance | arXiv |
| MobileRL [open-source] | Mobile online RL framework using ADAGRPO algorithm | GitHub |
| Co-EPG | Co-evolution framework simultaneously optimizing GUI agent planning and grounding capabilities | AAAI |
| Continual GUI Agents | Defines and addresses learning problems for GUI agents in continually changing environments | arXiv |
| WebFactory | Fully automated closed-loop RL flow that "compresses" LLM intelligence into efficient GUI agents | OpenReview |
| ZeroGUI | Zero human-cost online GUI agent learning framework | arXiv |
| UI-S1 | Semi-online RL training method combining offline and online data advantages | arXiv |
| Generalization in Online RL for Mobile Agents | Studies generalization in online RL for mobile agents, showing RL can surpass SFT baselines | OpenReview |

Embodied Intelligence and Robotics

When RL moves from the digital world to the physical world, agents face not text or images, but continuous control signals and uncertain physical environments. The works in this section explore how LLMs can directly participate in robot reasoning and control: some use RL to optimize spatial reasoning so 7B models surpass GPT-4o, some train self-correction capabilities in pixel-level world models, and others study cross-embodiment transfer and maintaining "cognitive identity" during continual learning.

| Work | Key highlight | Link |
| --- | --- | --- |
| Robot-R1 | Uses RL to directly optimize robot reasoning; 7B model spatial reasoning surpasses GPT-4o | arXiv |
| WMPO [open-source] | RL training in pixel-level visual world models, emerging self-correction capabilities | GitHub |
| ViVa | Uses pre-trained video generation models as value function estimators for state value assessment | arXiv |
| RoboAgent | Achieves embodied task planning through composing foundational capabilities | arXiv |
| Cross-Embodiment Offline RL | Achieves offline RL across different robot morphologies through morphological grouping strategies | arXiv |
| Sensory-Motor Control with LLMs | Enables LLMs to directly generate continuous control policies through iterative policy refinement | arXiv |
| RM-RL | Proposes "role model" RL for precise robot manipulation | arXiv |
| Learning Without Losing Identity | Studies how embodied agents maintain stable "cognitive identity" while continually learning new capabilities | arXiv |

Multi-Agent Systems and Collaboration

Multi-agent collaboration is far harder than the single-agent case. While you learn a new strategy, your teammates are changing theirs, so the environment is non-stationary. Credit assignment is also ambiguous: when the team succeeds, who gets the credit, and when it fails, who is responsible? The works in this section address these challenges from multiple angles: extending GRPO to multi-agent settings, achieving decentralized coordination through knowledge distillation, solving context drift with digital twins, and building large-scale MARL frameworks that jointly optimize sampling and training end-to-end.

| Work | Key highlight | Link |
| --- | --- | --- |
| MAPoRL | New paradigm for multi-agent collaborative training | arXiv |
| M-GRPO | Extends GRPO algorithm to multi-agent scenarios | arXiv |
| SAGE | Closed-loop self-evolution multi-agent RL framework | arXiv |
| MARTI [open-source] | Multi-agent debate framework | GitHub |
| KD-MARL | Transfers centralized expert coordination to lightweight decentralized agents through knowledge distillation | arXiv |
| Value-Guidance MeanFlow | Value-guided flow model for offline multi-agent RL | arXiv |
| FlexMARL | First end-to-end training framework jointly optimizing sampling, training, and their orchestration for large-scale LLM-based MARL | arXiv |
| TwinLoop | Proposes simulation-in-the-loop digital twin framework to address multi-agent performance degradation from context changes | arXiv |
| Equivariant Multi-agent RL | Equivariant multi-agent RL for multi-modal vehicle-infrastructure cooperative systems | arXiv |

World Models and Model-Based RL

The core bottleneck of model-free RL is sample efficiency -- agents must learn through extensive trial and error. World models provide a path around this bottleneck: first learn to "simulate the environment in your head," then generate training data in imagination. This section collects approaches from diffusion world models to object-centric representations, all with the core idea of having policy models interact with world models to complete multi-step planning and training "in imagination."

| Work | Key highlight | Link |
| --- | --- | --- |
| GIRL | Generative imagination RL through information-theoretic hallucination control | arXiv |
| World4RL | Diffusion world model for policy refinement in robot manipulation | arXiv |
| Dreamer-CDP | Dreamer variant that does not require reconstructing raw pixel observations | Project |
| RLVR-World | Uses RLVR to directly optimize world models | arXiv |
| OC-STORM | Enhances world models with object-centric representations for sample-efficient RL | arXiv |
| Imagine-then-Plan (ITP) | Policy models interact with world models to generate multi-step trajectories "in imagination" | arXiv |
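
The "learn a model, then train in imagination" loop described at the start of this section can be illustrated with classic tabular Dyna-Q. This is a deliberately small stand-in, not the approach of any work listed above, which use learned neural world models. The five-state chain environment, the function names, and all hyperparameters are assumptions made for this sketch.

```python
import random
from collections import defaultdict

GOAL = 4          # states are 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def real_step(state, action):
    """Toy 'real' environment: a deterministic chain with reward at the goal."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def q_update(q, s, a, r, s2, done, alpha=0.5, gamma=0.9):
    """One Q-learning backup; used for both real and imagined transitions."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def dyna_q(real_steps=300, imagined_per_real=20, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}  # learned world model: (state, action) -> (next, reward, done)
    state = 0
    for _ in range(real_steps):
        action = rng.choice(ACTIONS)  # uniform exploration keeps the sketch short
        nxt, reward, done = real_step(state, action)
        q_update(q, state, action, reward, nxt, done)
        model[(state, action)] = (nxt, reward, done)  # fit the model to real data
        # Imagination: replay transitions predicted by the model, no env calls.
        for _ in range(imagined_per_real):
            s, a = rng.choice(list(model))
            s2, r, d = model[(s, a)]
            q_update(q, s, a, r, s2, d)
        state = 0 if done else nxt
    return q


if __name__ == "__main__":
    q = dyna_q()
    for s in range(GOAL):
        best = "right" if q[(s, 1)] > q[(s, 0)] else "left"
        print(f"state {s}: greedy action = {best}")
```

Each real transition here is followed by 20 imagined updates, which is where the sample-efficiency gain comes from: most of the value propagation happens without touching the environment.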

Role-Playing and Persona Simulation

Role-playing is not just "pretending to be someone" -- it requires models to maintain consistent personality traits, thinking styles, and behavioral patterns across long conversations. The value of RL here is that through verifiable role-awareness rewards, it reinforces the model's continuous perception of "who I am." The works in this section range from dual-layer thinking frameworks (distinguishing character perspective from model perspective) to multi-character self-play, exploring how to make AI truly "get into character" and maintain role consistency.

| Work | Key highlight | Link |
| --- | --- | --- |
| HER (Human-like Reasoning) | Proposes dual-layer thinking framework distinguishing character first-person thoughts from LLM third-person thoughts (note: not classic Hindsight Experience Replay) | arXiv |
| OMAR | Cultivates AI social intelligence through multi-turn self-play RL | arXiv |
| R4 | Equips reward models and role-playing agents with reasoning and retrieval capabilities | ICLR Poster |
| VeriRole | Improves role awareness through verifiable prompt-guided RL | OpenReview |
| SPELL | Multi-character self-play RL framework for long-context reasoning | arXiv |
| Consistently Simulating Human Personas | Proposes a unified framework for evaluating and improving LLM role consistency | OpenReview |
| CPO | Comparative policy optimization for reward ambiguity in role-playing dialogue | arXiv |
| RAIDEN-R1 | Proposes verifiable role-awareness reward (VRAR) to reinforce model perception of its own role | arXiv |

Creative and Long-Form Writing

Creative writing poses unique challenges for RL: rewards are not as objectively verifiable as code execution, and "good" writing is subjective and multi-dimensional. The works in this section explore how to design reward signals that capture creative quality -- from generative reward models performing multi-dimensional reasoning about story preferences, to optimizing rubric-based reward models through alternating RL, to comparing different reward strategies via RLAIF to stimulate creative capabilities in small models.

| Work | Key highlight | Link |
| --- | --- | --- |
| Writer-R1 | Memory-augmented Replay Policy Optimization | arXiv |
| R2-Write | Systematic study of open-domain writing, proposing a reflection and revision framework | arXiv |
| DPWriter | Addresses output diversity reduction during RL training through diverse planning branches | arXiv |
| RLMR | First to combine subjective preferences with objective verification in online RL training | arXiv |
| Rewarding Creativity | Develops generative reward models for multi-dimensional analysis and explicit reasoning about story preferences | arXiv |
| Alternating RL for Rubric-Based Reward Modeling | Optimizes rubric-based reward models through alternating RL, achieving SOTA on multiple writing benchmarks | arXiv |
| Igniting Creative Writing in SLMs | Compares two reward strategies under RLAIF framework to stimulate creative writing in 7B small models | ACL Anthology |

Emotional Intelligence and Empathetic Dialogue

Empathy is not just "understanding emotions" -- it requires expressing appropriate responses at the right time while maintaining logical coherence in conversation. The value of RL here is enabling models to learn to balance "emotional support" with "cognitive reasoning." The works in this section range from verifiable emotion rewards to psychology-based empathetic reward modeling, exploring how to provide more grounded reward signals for RL.

| Work | Key highlight | Link |
| --- | --- | --- |
| RLVER | Trains LLM higher-order empathy using verifiable emotion rewards | arXiv |
| CARE | Cognitive reasoning-enhanced RL improving logical coherence and support quality in emotional support dialogue | arXiv |
| COMPEER | Unified process-outcome RL for structured empathetic reasoning | arXiv |
| DialogXpert | Online value RL-based dialogue planning with over 94% success rate on negotiation, emotional support, and other tasks | arXiv |
| EILS | Bio-emotion-inspired homeostatic learning signal framework for building adaptive autonomous agents | arXiv |
| SAGE (Steering Dialog Generation) | Uses latent variables to control long-term behavior of dialogue generation for building emotionally intelligent chatbots | arXiv |
| PERM | Psychology-based empathetic reward modeling providing more grounded reward signals for RL | arXiv |

Art and Visual Creation

RL entering the art world is an interesting crossover -- it models "aesthetic judgment" as an optimizable reward signal. The works in this section cover applications from image generation optimization to hierarchical painting, from personalized hand-drawn illustrations to artistic style learning. Core approaches include: coordinating multiple expert models for iterative image generation optimization, learning artist styles from stroke data through inverse RL, and using hierarchical RL to separate high-level planning from low-level rendering.

| Work | Key highlight | Link |
| --- | --- | --- |
| Image-POSER | Reflective RL framework coordinating multiple expert models for iterative image generation optimization based on complex text prompts | arXiv |
| HRL-Painter | Hierarchical RL-based painting method with high-level region planning and low-level stroke execution | Neurocomputing |
| PersonaSketch-RL | RL-based strategy for optimizing personalized hand-drawn illustration generation | ScienceDirect |
| RMLer | Models cross-category concept fusion as an RL problem for synthesizing novel objects | arXiv |
| Sequential Art Creation | Deep RL framework for creating sequential artworks that are visually distinct from inputs | UTA Thesis |
| MVAEx-RL | RL-based multi-modal art element extraction and dynamic adaptation strategy for environment design | Springer |
| DailyArt | Models joint estimation as synthesis-mediated inference, inferring dynamics from single static images | arXiv |

RL Training Infrastructure and Algorithm Innovation

The engineering complexity of Agentic RL far exceeds standard LLM RL -- you need to simultaneously manage model training on GPUs, tool execution on CPUs, and environment interaction over networks. This section focuses on the infrastructure and algorithm innovations supporting these complex training pipelines: from fully asynchronous training systems to scalable synthetic learning environments, from retrieval-augmented policy optimization to new paradigms that convert inference compute into training signals.

| Work | Key highlight | Link |
| --- | --- | --- |
| AReaL v1.0 [open-source] | Jointly open-sourced by Ant Group and Tsinghua, enabling "one-click agent integration into RL training" | GitHub |
| RollArt / RollARC | Maximizes multi-task Agentic RL training throughput through decoupled infrastructure (RollARC) | arXiv |
| SparrowRL | High-performance RL training system achieving lossless sparse incremental synchronization on commodity networks | arXiv |
| Laminar | Scalable, robust asynchronous RL post-training system based on fully decoupled architecture | arXiv |
| SCALER | Synthesizes scalable adaptive learning environments providing infinitely verifiable reasoning environments for RL training | arXiv |
| L-Zero (L0) | Low-cost, scalable end-to-end universal agent training pipeline | arXiv |
| Compute as Teacher (CaT) | Converts inference-time parallel sampling compute into RL training supervision signals | arXiv |
| RAPO | Retrieval-augmented policy optimization, explicitly expanding agent exploration space during training | arXiv |
| LLM-Explorer [open-source] | Tsinghua; a plugin that can enhance exploration capabilities of various RL algorithms | GitHub |

Scientific Discovery and Industrial Applications

RL is moving out of the laboratory and into real application scenarios including chemistry, materials science, medicine, and industrial manufacturing. The works in this section model scientific problems as MDPs: lead compound optimization becomes a search problem under synthetic constraints, materials design becomes an optimization problem using formation energy feedback, and industrial anomaly detection becomes a policy learning problem for data synthesis. These applications demonstrate RL's potential as a "universal decision optimizer."

| Work | Key highlight | Link |
| --- | --- | --- |
| MolReAct | Models lead compound optimization as MDP, using RL for efficient search under synthetic constraints | arXiv |
| PolyRL | Multi-objective polymer generation and discovery guided by RL | RSC |
| Helix | Hierarchical evolutionary RL framework for open-ended scientific problem solving | arXiv |
| RLFEF | RL using formation energy feedback to fine-tune material diffusion models, improving crystal stability | dblp |
| AnomalyAgent | Industrial anomaly data synthesis agent that optimizes generation of highly realistic anomaly samples through RL | arXiv |
| Autonomous Adaptive Solver Selection | Uses constrained RL framework for autonomous solver selection during chemical integration | arXiv |
| PPO-based Surface Reconstruction | Deep RL framework based on PPO for surface reconstruction of AgPd alloy catalysts | AIP PDF |
| MedVR | For medical VQA, proposes two RL mechanisms: entropy-guided visual relocation (EVR) and consensus-driven credit assignment | arXiv |

Note: The above works are papers or projects published or preprinted in 2025--2026. Some arXiv papers may have updated versions; we recommend searching by paper title on arxiv.org or Semantic Scholar for the latest versions.
