Agentic Reasoning for Large Language Models
Executive Summary
1) Background and objective
Large language models excel at tasks like math and code in controlled settings but falter in open, changing environments. This survey reviews agentic reasoning, which turns models into active agents that plan, act, and learn through interaction. The goal is to map methods, applications, and challenges to guide future agent design.
2) Approach
Researchers reviewed papers up to 2025 and organized agentic reasoning into three layers: foundational single-agent skills (planning, tool use, search), self-evolving adaptation (feedback, memory), and multi-agent collaboration. They split systems by optimization type—in-context (prompt-based at runtime) versus post-training (fine-tuned with reinforcement learning)—and assessed real-world uses and benchmarks.
3) Key results
- Foundational agents combine planning (e.g., tree search), tools (e.g., APIs), and search (e.g., retrieval) to handle stable tasks, with post-training methods like ToolLLM outperforming in-context ones.
- Self-evolving agents use reflection (e.g., Reflexion) and memory (e.g., MemGPT) to improve over time, enabling lifelong learning without full retraining.
- Multi-agent systems divide roles (e.g., leader-worker-critic) and evolve via shared memory or RL, boosting complex tasks like software development.
- Applications span math, science, robotics, healthcare, and web tasks; benchmarks like WebArena test end-to-end performance.
- Open challenges include personalization, long interactions, world models, scalable training, and safety governance.
4) Main conclusion
Agentic reasoning unifies LLM thought and action into a roadmap across layers, optimization modes, and settings, advancing from passive models to adaptive, collaborative systems.
5) Implications
These advances cut costs by scaling inference compute over model size, reduce risks through verifiable tools and reflection, and speed timelines in domains like drug discovery (e.g., ChemCrow) and robotics (e.g., Voyager). Multi-agent setups match human teams for safety-critical tasks like healthcare, but unaddressed challenges like long-horizon credit assignment could limit real-world reliability compared to human benchmarks.
6) Recommendations and next steps
Prioritize hybrid in-context/post-training agents for quick deployment in stable settings; invest in RL for self-evolution in dynamic ones. For multi-agent systems, test role-adaptive topologies on benchmarks like MultiAgentBench. Run pilots in high-impact areas (e.g., robotics, healthcare) with safety checks. Next, gather data on long-horizon interactions and personalization to enable stronger decisions.
7) Limitations and confidence
The review covers work to 2025, missing later advances; benchmarks vary in realism, risking overfit evaluations. High confidence in the taxonomy and core findings as a synthesis of 700+ papers, but caution on untested scaling to 100+ agents or real-time governance.
Tianxin Wei¹†, Ting-Wei Li¹†, Zhining Liu¹†, Xuying Ning¹, Ze Yang², Jiaru Zou¹,
Zhichen Zeng¹, Ruizhong Qiu¹, Xiao Lin¹, Dongqi Fu², Zihao Li¹, Mengting Ai¹, Duo Zhou¹,
Wenxuan Bao¹, Yunzhe Li¹, Gaotang Li¹, Cheng Qian¹, Yu Wang⁵, Xiangru Tang⁶, Yin Xiao¹,
Liri Fang¹, Hui Liu³, Xianfeng Tang³, Yuji Zhang¹, Chi Wang⁴, Jiaxuan You¹, Heng Ji¹,
Hanghang Tong¹✉, Jingrui He¹✉
¹University of Illinois Urbana-Champaign  ²Meta  ³Amazon  ⁴Google DeepMind  ⁵UCSD  ⁶Yale
† Equal contribution  ✉ Corresponding Author
Abstract: Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, exemplified by standard benchmarks in mathematics and code, they struggle in open-ended and dynamic environments. The emergence of agentic reasoning marks a paradigm shift, bridging thought and action by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we provide a systematic roadmap by organizing agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning establishes core single-agent capabilities, including planning, tool use, and search, that operate in stable environments; self-evolving agentic reasoning examines how agents refine these capabilities through feedback, memory, and adaptation in evolving settings; and collective multi-agent reasoning extends intelligence to collaborative scenarios where multiple agents coordinate roles, share knowledge, and pursue shared goals. Across all layers, we analyze system constraints and optimization settings by distinguishing in-context reasoning, which scales test-time interaction through structured orchestration and adaptive workflow design, from post-training reasoning, which optimizes behaviors through reinforcement learning and supervised fine-tuning. We further review and contextualize agentic reasoning frameworks in real-world applications and benchmarks spanning science, robotics, healthcare, autonomous research, and math, illustrating how different reasoning mechanisms are instantiated and evaluated across domains. This survey synthesizes agentic reasoning methods into a unified roadmap that bridges thoughts and actions, offering actionable guidance for agentic systems across environmental dynamics, optimization settings, and agent interaction settings. Finally, we outline open challenges and future directions, situating how agentic reasoning has developed while identifying what remains ahead: personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance frameworks for real-world deployment.
Keywords: Agentic AI, LLM Agent, Agentic Reasoning, Self-evolving
1. Introduction
Large language models excel at closed-world reasoning such as math and code but falter in open-ended, dynamic environments that demand adaptation and interaction. Agentic reasoning reframes LLMs as autonomous agents that plan, act, and learn through continual engagement, structured across three layers: foundational single-agent capabilities (planning, tools, search), self-evolving adaptation (feedback, memory), and collective multi-agent coordination. Across these layers, we distinguish in-context orchestration from post-training optimization via reinforcement learning. This taxonomy synthesizes methods into a unified roadmap bridging thought and action, contextualizes applications in science, robotics, and healthcare alongside benchmarks, and highlights contributions such as the conceptual framing and open challenges in personalization, long-horizon planning, and governance.
Reasoning lies at the core of intelligence, enabling logical inference, problem-solving, and decision-making across interactive and dynamic settings. Large language models (LLMs) have achieved remarkable gains in closed-world domains such as mathematical problem solving and code generation. Empirically, techniques that make intermediate reasoning explicit, such as Chain-of-Thought prompting, decomposition, and program-aided solving, have significantly bolstered inference performance [1, 2, 3, 4]. Yet, these approaches often assume static contexts and short-horizon reasoning. Conventional LLMs lack mechanisms to act, adapt, or improve in open-ended environments where information evolves over time.
In this survey, we systematize this evolution under the framework of Agentic Reasoning: rather than passively generating sequences, LLMs are reframed as autonomous reasoning agents that plan, act, and learn through continual interaction with their environment. This reframing unifies reasoning with acting, positioning reasoning as the organizing principle for perception, planning, decision, and verification. Systems such as ReAct [5] interleave deliberation with environment interaction, tool-use frameworks enable self-directed API calling, and workflow-based agents dynamically orchestrate sub-tasks and verifiable actions [5, 6, 7]. Conceptually, this parallels the shift from static, one-shot inference to sequential decision-making under uncertainty. Unlike simple input-output mapping, this paradigm requires agents to plan over long horizons, navigate partial observability, and actively improve through feedback [8, 9, 10].
To systematically characterize the environmental dynamics, we structure our survey around three complementary scopes of agentic reasoning: foundational capabilities, self-evolution, and collective intelligence, spanning diverse interactive and dynamic settings. Foundational Agentic Reasoning establishes the bedrock of core single-agent capabilities, including planning, tool use, and search, that enable operations within stable, albeit complex, environments. Here, agents act by decomposing goals, invoking external tools, and verifying results through executable actions. For instance, program-aided reasoning [3] grounds logical derivations in code execution; repository-level systems such as OpenHands [11] integrate reasoning, planning, and testing into unified loops; and structured memory modules [12, 13] transform factual recall into procedural competence by persisting intermediate reasoning traces for reuse.
Building upon these foundations, Self-Evolving Agentic Reasoning enables agents to improve continually through cumulative experience. Encompassing task-specific self-improvement (e.g., via iterative critique), this paradigm extends adaptation to include persistent updates of internal states like memory and policy. Rather than following fixed reasoning paths, agents develop mechanisms for feedback integration and memory-driven adaptation to navigate evolving environments. Reflection-based frameworks such as Reflexion [14] allow agents to critique and refine their own reasoning processes, while reinforcement formulations such as RL-for-memory [15] formalize memory writing and retrieval as policy optimization. Through these mechanisms, agents dynamically integrate inference-time reasoning with learning, progressively updating internal representations and decision policies without full retraining. This continual adaptation links reasoning with learning, enabling models to accumulate competence and generalize across tasks.
Finally, Collective Multi-Agent Reasoning scales intelligence from isolated solvers to collaborative ecosystems. Rather than operating in isolation, multiple agents coordinate to achieve shared goals through explicit role assignment (e.g., manager–worker–critic), communication protocols, and shared memory systems [16, 17]. As agents specialize in subtasks and refine each other’s outputs, collaboration amplifies reasoning diversity, enabling systems to debate, resolve disagreements, and achieve consistency through natural language-based multi-turn interactions [18, 19]. However, this complexity also introduces challenges in stability, communication efficiency, and trustworthiness, necessitating structured coordination frameworks and rigorous evaluation standards [20, 21].
Across all layers, we analyze system constraints and optimization settings by distinguishing two complementary modes, corresponding to inference-time orchestration [5, 14, 22, 23, 24, 25] and training-based capability optimization [26, 27, 28, 15]. In-context Reasoning focuses on scaling inference-time compute: through structured orchestration, search-based planning, and adaptive workflow design, it enables agents to navigate complex problem spaces dynamically without modifying model parameters. Conversely, Post-training Reasoning targets capability internalization: it consolidates successful reasoning patterns or tool-use strategies into the model's weights via reinforcement learning and fine-tuning. Together, they provide an actionable roadmap for designing agents.
Building on the three-layer taxonomy, agentic reasoning has begun to underpin a wide range of practical applications, from mathematical exploration [29, 30] and vibe coding [11, 31, 32] to scientific discovery [33, 34, 35], embodied robotics [36, 37, 38], healthcare [39, 40], and autonomous web exploration [41, 42]. These applications expose distinct reasoning demands shaped by domain-specific data modalities, interaction constraints, and feedback loops, motivating diverse system designs [43, 44] that integrate planning, tool use, search, reflection, memory mechanisms, and multi-agent coordination. On the other hand, the benchmark landscape has emerged to evaluate agentic reasoning, ranging from targeted tests that isolate individual agentic capabilities to application-specific benchmarks that assess end-to-end behavior in domain-specific environments and scenarios [45, 46, 47, 48, 20, 21, 49, 50].
Together, this survey synthesizes agentic reasoning methods into a unified roadmap that bridges reasoning and acting. We systematically characterize these methods across the complementary scopes of foundational, self-evolving, and collective reasoning, while distinguishing between in-context and post-training optimization modes. We further contextualize this roadmap through representative applications and evaluation benchmarks, illustrating how different agentic reasoning mechanisms are instantiated and assessed across realistic domains and task settings. Finally, we outline open challenges and future directions, identifying key frontiers such as personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance frameworks for real-world deployment.
2. From LLM Reasoning to Agentic Reasoning
This section traces the shift from traditional LLM reasoning, which is limited to passive, single-pass prediction over static inputs, to agentic reasoning, which scales interactive, multi-step deliberation through environment engagement, external memory, and continual adaptation. The core mechanisms formalize agents as policies in partially observable Markov decision processes, decomposing each policy into internal thought (planning in a latent reasoning space) and external action, optimized via inference-time search such as tree-of-thoughts or post-training methods such as group-relative policy optimization; extensions cover multi-agent coordination through communication channels and self-evolving meta-updates over verbal, procedural, or structural states. This paradigm unifies reasoning with acting, transforming static models into adaptive, collaborative systems that bridge prediction and long-horizon decision-making.
Traditional reasoning with large language models (LLMs) is typically formulated as a one-shot or few-shot prediction task over static inputs. These models rely on scaling test-time computation, improving accuracy by increasing model size or inference budget, but without the ability to interact, remember, or adapt to changing goals. Methods such as prompt engineering, in-context learning, and chain-of-thought prompting have made reasoning more explicit, yet conventional LLMs remain passive sequence predictors that operate within fixed prompts.
Agentic reasoning, in contrast, emphasizes scaling test-time interaction. Instead of depending solely on internal parameters, agentic systems reason through action: invoking tools, exploring alternatives, updating memory, and integrating feedback. This transforms inference into an iterative process that includes decision steps, reflection, and learning from experience. Reasoning becomes a dynamic loop that connects the model, memory, and environment.
This transition marks a conceptual shift: reasoning no longer scales through static capacity, but through structured interaction that enables planning, adaptation, and collaboration across time and tasks.
2.1 Positioning Our Survey
While several recent surveys have examined LLM reasoning or agent architectures ([51, 52, 53, 54, 55, 56, 57, 58, 59]), our work focuses specifically on agentic reasoning as a unified paradigm for understanding reasoning as interaction. We position this survey at the intersection of model-centric reasoning and system-level intelligence, aiming to bridge prior discussions on reasoning mechanisms and agent architectures.
Relation to LLM Reasoning Surveys. Existing surveys on LLM reasoning mainly investigate how to elicit or enhance reasoning within a model’s internal computation process. For example, [51, 52, 53, 54] summarize prompting and scaling techniques such as chain-of-thought, reinforcement post-training, and long-context reasoning, emphasizing how LLMs can learn to reason better through inference-time supervision or post-training alignment. These works improve the internal expressiveness of reasoning traces but typically remain within static inference settings, where reasoning unfolds in a single forward pass without external interaction. In contrast, our survey examines how reasoning extends beyond text generation, encompassing dynamic planning, adaptive memory, and feedback-driven behavior during deployment.
Relation to AI Agent Surveys. Several contemporary surveys have begun to explore LLM-based agents from architectural or system perspectives ([56, 57, 58, 59]). These works analyze how agents employ reinforcement learning, planning, and tool-use modules to operate in complex environments. For instance, [56, 57] focus on reinforcement learning for agentic search and decision-making, while [58, 59] emphasize self-evolving and lifelong agentic systems that continuously learn from interaction. Our focus complements these perspectives by centering on the reasoning process that these architectures enable, specifically how interaction, feedback, and collaboration transform static inference into adaptive reasoning. Rather than viewing reasoning as an implicit by-product of architectural design, we treat it as the unifying mechanism that links single-agent reinforcement, multi-agent coordination, and self-evolving intelligence.
In summary, our survey provides a reasoning-centric lens on intelligent agency. We examine how foundational reasoning mechanisms, post-training adaptation, and long-term self-evolution jointly constitute the basis of agentic reasoning, illustrating the transition from static prediction to interactive, adaptive, and continually improving intelligence.
2.2 Preliminaries
This subsection formalizes the transition from static language modeling to agentic reasoning. To align with the three-layered dimensions (Foundational, Self-Evolving, Collaboration) outlined in the introduction, we unify these capabilities under a single control-theoretic framework.
Formalizing Agentic Reasoning: A Latent-Space View.
Standard approaches often conflate the agent's context with the environment state. We model the environment as a Partially Observable Markov Decision Process (POMDP) and introduce an internal reasoning variable to expose the "think–act" structure of agentic policies. Concretely, we consider the tuple $\langle \mathcal{X}, \mathcal{O}, \mathcal{A}, \mathcal{Z}, \mathcal{M}, \mathcal{T}, \Omega, \mathcal{R}, \gamma \rangle$, where $\mathcal{X}$ is the latent environment state space (unobservable to the agent), $\mathcal{O}$ is the observation space (e.g., user queries, API returns), $\mathcal{A}$ is the external action space (e.g., tool invocation, final answer), $\mathcal{Z}$ is a reasoning trace space (e.g., latent plans, optionally verbalized as chain-of-thought), and $\mathcal{M}$ is the agent's internal memory/context space (e.g., a sufficient statistic of interaction history). $\mathcal{T}$ and $\Omega$ denote the transition and observation kernels, $\mathcal{R}$ the reward, and $\gamma \in (0, 1)$ the discount factor.
At timestep $t$, the agent conditions on a history $h_t = (o_{\le t}, z_{<t}, a_{<t})$ (i.e., $o_t$ is observed before generating $z_t$ and then $a_t$). Equivalently, the history can be summarized by an internal memory state $m_t \in \mathcal{M}$. Crucially, we distinguish external actions from internal reasoning. We factorize the policy as
$$\pi_\theta(z_t, a_t \mid h_t) = \pi_\theta^{\text{reason}}(z_t \mid h_t)\,\pi_\theta^{\text{act}}(a_t \mid h_t, z_t).$$
This decomposition highlights the core shift in agentic systems: performing computation in $\mathcal{Z}$ (thinking) before committing to $\mathcal{A}$ (acting). The objective remains maximizing the expected return $J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t \ge 0} \gamma^t r_t\right]$.
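To make the think-act factorization concrete, the following is a minimal sketch (not taken from any surveyed system; `llm_reason`, `llm_act`, and `environment_step` are hypothetical stand-ins for a language model and an environment): at each step the agent samples a reasoning trace before committing to an external action, and the rollout accumulates the discounted return.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Internal memory m_t: a running record of (observation, thought, action) triples."""
    history: list = field(default_factory=list)

def llm_reason(memory, observation):
    # Hypothetical call to pi_reason(z_t | h_t): produce a latent plan / thought.
    return f"think about: {observation}"

def llm_act(memory, observation, thought):
    # Hypothetical call to pi_act(a_t | h_t, z_t): commit to an external action.
    return {"tool": "search", "query": observation}

def environment_step(action):
    # Hypothetical environment: returns the next observation o_{t+1} and a reward r_t.
    return f"results for {action['query']}", 0.0

def rollout(initial_observation, horizon=3, gamma=0.99):
    memory, obs, ret = AgentMemory(), initial_observation, 0.0
    for t in range(horizon):
        thought = llm_reason(memory, obs)        # internal computation in Z
        action = llm_act(memory, obs, thought)   # external action in A
        memory.history.append((obs, thought, action))
        obs, reward = environment_step(action)   # environment transition + observation
        ret += (gamma ** t) * reward             # discounted return
    return memory, ret

if __name__ == "__main__":
    mem, ret = rollout("What is agentic reasoning?")
    print(len(mem.history), ret)
```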
In-Context Reasoning: Inference-Time Search.
In this regime, model parameters $\theta$ are frozen. The agent optimizes the reasoning trajectory by searching over $\mathcal{Z}$ to maximize a heuristic value function $\hat{v}(h_t, z)$. We model inference as selecting a trajectory $\tau = (h_0, z_0, a_0, h_1, z_1, a_1, \ldots)$. Methods like ReAct ([5]) perform greedy decoding over alternating thoughts $z$ and actions $a$. Tree-of-Thoughts (ToT ([4])) and related MCTS-style approaches treat partial thoughts as nodes $u \in \mathcal{U}$ (e.g., a representation derived from $(h_t, z_t)$) and search for an optimal path:
$$(u_0^\star, \ldots, u_T^\star) = \arg\max_{(u_0, \ldots, u_T)} \sum_{t=0}^{T} \hat{v}_\phi(u_t),$$
where $\hat{v}_\phi$ is a heuristic evaluator or verifier. This corresponds to planning in $\mathcal{Z}$ without updating the policy parameters.
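The sketch below illustrates this kind of inference-time search with a simple beam search over partial thoughts, standing in for the ToT/MCTS family; `expand` and `value` are hypothetical stand-ins for an LLM proposal step and the heuristic evaluator $\hat{v}_\phi$, and the toy scoring rule is an assumption for illustration only.

```python
import heapq

def expand(node):
    """Hypothetical proposal step: ask an LLM for candidate next thoughts from a partial path."""
    state, depth = node
    return [f"{state} -> thought{i}@{depth + 1}" for i in range(3)]

def value(state):
    """Hypothetical heuristic evaluator: score a partial reasoning path.

    A real verifier would judge correctness or promise; this toy scores by brevity.
    """
    return -len(state)

def tree_search(root, beam_width=2, max_depth=3):
    # Beam search over the reasoning space Z: keep only the top-scoring partial paths.
    beam = [(value(root), (root, 0))]
    for _ in range(max_depth):
        candidates = []
        for _, node in beam:
            for child in expand(node):
                candidates.append((value(child), (child, node[1] + 1)))
        beam = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
    # Return the highest-valued path found; the policy parameters are never updated.
    return max(beam, key=lambda x: x[0])[1][0]

if __name__ == "__main__":
    print(tree_search("q: make 24 from 4 5 6 10"))
```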
Post-Training: Policy Optimization.
This paradigm optimizes $\theta$ to align the policy with long-horizon rewards $r_t$ (e.g., correctness, safety), including reasoning models (e.g., DeepSeek-R1 ([60])) and learning-to-search systems (e.g., Search-R1 ([27]), DeepRetrieval ([61])) that train multi-turn reasoning or tool use with RL. While PPO ([62]) is standard, Group Relative Policy Optimization (GRPO) ([63])-based methods are widely used for reasoning tasks. GRPO eliminates the value network by constructing advantages from group-relative rewards. For a group of $G$ sampled outputs $\{y_i\}_{i=1}^{G}$ from the same prompt $q$, a common GRPO objective is:
$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, A_i\big)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big],$$
where $\rho_i = \frac{\pi_\theta(y_i \mid q)}{\pi_{\theta_{\text{old}}}(y_i \mid q)}$ and the group-normalized advantage is
$$A_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big) + \delta},$$
with $\delta > 0$ a small constant for numerical stability. Advanced methods such as ARPO ([64]) and DAPO ([65]) extend this framework to handle sparse rewards and improve stability in complex tool-use environments (e.g., via replay/rollout strategies and decoupled clipping).
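As a concrete illustration of the group-relative scheme, the sketch below computes group-normalized advantages and the clipped surrogate for one prompt's group of sampled outputs; the KL penalty to a reference policy is omitted, and the function names are ours rather than from any cited implementation.

```python
import math

def grpo_advantages(rewards, delta=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + delta) for r in rewards]

def grpo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate averaged over the group (reference-policy KL term omitted)."""
    advantages = grpo_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(lp_new - lp_old)                      # importance ratio
        clipped = max(min(rho, 1 + clip_eps), 1 - clip_eps)  # clip(rho, 1-eps, 1+eps)
        total += min(rho * adv, clipped * adv)               # pessimistic (clipped) objective
    return total / len(rewards)

if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]   # e.g., binary correctness of four sampled answers
    print(grpo_advantages(rewards))
    print(grpo_surrogate([-1.2, -0.9, -1.1, -1.0], [-1.0, -1.0, -1.0, -1.0], rewards))
```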
Collective Intelligence: Multi-Agent Reasoning.
We extend the single-agent formulation to a decentralized partially observable multi-agent setting, commonly formalized as a Dec-POMDP. The core distinction lies in expanding each agent's observation to include a communication channel $\mathcal{C}$. For a system of $N$ agents, the joint policy $\boldsymbol{\pi}$ is composed of individual policies $\pi^i$, where agent $i$'s observation $o^i_t$ explicitly includes communicative messages $c^{-i}_{t-1}$ generated by peers. Crucially, in agentic MARL, communication is not merely signal transmission but an extension of the reasoning process: one agent's external action can act as a prompt that triggers another agent's internal reasoning chain. Existing frameworks like AutoGen ([66]) and CAMEL ([67]) represent static role-playing with fixed policies. Recent agentic RL advances (e.g., GPTSwarm [68], MaAS, agents trained via PPO/GRPO [69]) aim to optimize this joint reasoning distribution. The challenge shifts from single-agent planning to mechanism design: optimizing the communication topology and incentive structures to align decentralized reasoning processes $\pi^i_{\text{reason}}$ toward a coherent global objective, often utilizing Centralized-Training/Decentralized-Execution (CTDE) paradigms to stabilize the emergence of cooperative behaviors.
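A minimal sketch of this communication-as-observation view, assuming three hypothetical roles: each agent's outgoing message in one round becomes part of every peer's observation in the next, so one agent's action conditions another agent's reasoning.

```python
def agent_policy(name, role, observation, inbox):
    """Hypothetical per-agent policy pi^i: reason over own observation plus peers' messages."""
    context = " | ".join(inbox)
    return f"[{name}/{role}] response to '{observation}' given peers: {context or 'none'}"

def multi_agent_round(task, agents, rounds=2):
    # Each agent's message c^i_t is routed into every peer's observation o^{-i}_{t+1}.
    messages = {name: "" for name, _ in agents}
    for _ in range(rounds):
        new_messages = {}
        for name, role in agents:
            inbox = [msg for peer, msg in messages.items() if peer != name and msg]
            new_messages[name] = agent_policy(name, role, task, inbox)
        messages = new_messages
    return messages

if __name__ == "__main__":
    team = [("planner", "manager"), ("coder", "worker"), ("reviewer", "critic")]
    for name, msg in multi_agent_round("implement a parser", team).items():
        print(msg)
```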
Self-Evolving Agents: The Meta-Learning Loop.
While foundational agents optimize reasoning $z$ within an episode, self-evolving agents optimize the agent system itself across episodes $k = 1, \dots, K$. Let $\mathcal{S}_k$ denote the evolvable system state (e.g., explicit memories, tool libraries, or code). A generic meta-update rule is
$$\mathcal{S}_{k+1} = \mathcal{U}\big(\mathcal{S}_k, \mathcal{F}_k\big),$$
where $\mathcal{F}_k$ represents environmental feedback (rewards, execution errors) and $\mathcal{S}_k$ represents the evolvable state. We categorize self-evolution by the nature of $\mathcal{S}$:
- Verbal Evolution: $\mathcal{S}$ consists of textual reflections or guidelines. Methods like Reflexion ([14]) update $\mathcal{S}$ by synthesizing error logs into linguistic cues that condition future reasoning policies.
- Procedural Evolution: $\mathcal{S}$ consists of a library of executable tools or skills. Agents like Voyager ([36]) evolve by synthesizing new code-based skills, expanding the action space $\mathcal{A}$ permanently.
- Structural Evolution: $\mathcal{S}$ consists of the agent's source code or architecture itself. Advanced methods like AlphaEvolve ([70]) treat the agent's code as a hypothesis space, using an LLM as a mutation operator to search for superior reasoning algorithms.
This framework unifies these diverse approaches as gradient-free or gradient-based optimization steps over the agent's explicit memories and artifacts (and optionally parameters), closing the loop between experience and competence.
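The loop below sketches the meta-update $\mathcal{S}_{k+1} = \mathcal{U}(\mathcal{S}_k, \mathcal{F}_k)$ for the verbal-evolution case under strongly simplified assumptions: the evolvable state is a list of textual lessons, and `run_episode` is a toy stand-in for task execution and feedback collection rather than any cited system.

```python
def run_episode(task, lessons):
    """Hypothetical attempt at a task, conditioned on accumulated verbal lessons.

    Toy stand-in: the agent 'succeeds' once it has collected enough lessons.
    """
    success = len(lessons) >= 2
    feedback = "ok" if success else f"failed on '{task}': missed an edge case"
    return success, feedback

def update_state(lessons, feedback):
    """Meta-update U(S_k, F_k): fold environmental feedback into the evolvable state S."""
    if feedback != "ok":
        lessons = lessons + [f"lesson: {feedback}"]
    return lessons

def self_evolve(task, episodes=5):
    lessons = []                                          # S_0: empty verbal memory
    for k in range(episodes):
        success, feedback = run_episode(task, lessons)    # collect feedback F_k
        lessons = update_state(lessons, feedback)         # S_{k+1} = U(S_k, F_k)
        if success:
            return k, lessons
    return episodes, lessons

if __name__ == "__main__":
    print(self_evolve("sort a linked list"))
```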
3. Foundational Agentic Reasoning
Foundational agentic reasoning equips a single agent to convert abstract deliberation into verifiable action via an iterative cycle of planning, tool use, and search. Planning decomposes goals through in-context strategies such as workflows, tree search, formalization, and decomposition, augmented by post-training rewards; tool use optimizes invocation via interleaved reasoning and acting, supervised fine-tuning, reinforcement learning, and orchestration; search enhances decisions with retrieval and environment exploration. Together, these mechanisms elevate static LLMs into goal-driven autonomous solvers capable of long-horizon execution, backtracking, and error recovery, laying the groundwork for self-evolving adaptation and multi-agent collaboration.
Agentic reasoning originates from the behavior of a single agent. Before discussing adaptation and collaboration, we focus on how an individual agent translates reasoning into structured action through three core components: planning, search, and tool use. In this setting, the agent is not a passive text generator but an autonomous problem solver that formulates plans, explores alternatives through retrieval or environment search, and leverages tools to execute grounded operations. Together, these mechanisms establish the foundation of agentic reasoning, linking abstract deliberation with verifiable action.
A canonical foundational workflow can be viewed as an iterative cycle that interleaves planning (goal decomposition and task formulation), tool use (invoking external systems or APIs to act on the world), and search (retrieval and exploration for decision support). Reasoning serves as the organizing principle across these stages, determining when to plan, what to retrieve, and how to act, transforming static inference into interactive decision-making.
By analyzing these components, we clarify how structured reasoning elevates a static LLM into an autonomous, goal-driven agent. The next section introduces self-evolving reasoning, where feedback and memory enable continual adaptation and extension of these foundational capabilities. Subsequently, we examine collective reasoning, in which multiple agents coordinate through roles, communication, and shared memory to achieve objectives beyond individuals.
3.1 Planning Reasoning
Planning is a central component of intelligent behavior, enabling agents to decompose problems, sequence decisions, and navigate complex environments with foresight. Recent research has increasingly explored planning in the context of large language models (LLMs), either as autonomous agents or as components in broader systems. In this section, we categorize existing work in agent planning for reasoning into six methodological styles, where each category highlights a distinct planning strategy that supports complex agentic reasoning.
3.1.1 In-context Planning
Workflow Design.
Workflow-based approaches often emphasize structuring the overall planning process into distinct stages (e.g., perception, reasoning, execution, verification), which are either explicitly scaffolded or learned implicitly. For example, [72, 73, 71, 92] design planning pipelines that decompose task solving into subtasks, often leveraging a deliberate plan-and-act framework. Similarly, [2, 93, 75, 7] rely on structured prompting to sequentialize tasks and guide reasoning progression. Methods like [94] use structured transitions between diverse "X-of-Thought" strategies. PERIA [95] combines perception, imagination, and action in a unified multimodal workflow. Others such as [96] explicitly target long-horizon planning through structured sequencing, while [97] build workflows for code-related planning.
These workflows are then grounded by a reactive controller that iteratively consumes the current state and interleaves reasoning with actions: in web automation, agents follow inspect-reason-act-observe loops [5, 49], with robustness improved by dynamically adapting in-context examples [98]; in code, agents decide immediate executions/API calls, read outputs or errors, and refine step-by-step [99, 78, 14, 79, 100, 101, 102, 103, 104]; in robotics, monitors trigger on-the-fly safety interventions and VLM-guided subgoal execution with real-time adjustment [87, 105]. This reactive workflow view unifies scripted stage design with online adaptation: the workflow provides interpretable structure and interfaces (what is done when), while the reactive loop supplies closed-loop grounding and error recovery (how it is done in context). The approach is broadly effective yet can accumulate errors over long horizons, motivating incremental verification and memory within the workflow to stabilize execution.
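A minimal sketch of such a reactive workflow, assuming hypothetical `inspect`, `reason`, `act`, and `verify` stubs: the controller interleaves observation, reasoning, and action, and an incremental verification step guards against error accumulation over long horizons.

```python
def inspect(state):
    """Perceive the current environment state (hypothetical stub)."""
    return f"page shows: {state}"

def reason(goal, observation, scratchpad):
    """Decide the next sub-step toward the goal (hypothetical LLM call)."""
    return f"click the link relevant to '{goal}'"

def act(plan_step):
    """Execute the step; return the new state and an error, if any (hypothetical stub)."""
    return f"after: {plan_step}", None

def verify(goal, state):
    """Lightweight incremental check to catch errors before they accumulate (toy rule)."""
    return goal.split()[-1] in state

def reactive_workflow(goal, initial_state, max_steps=4):
    state, scratchpad = initial_state, []
    for _ in range(max_steps):
        observation = inspect(state)
        step = reason(goal, observation, scratchpad)
        state, error = act(step)
        scratchpad.append((observation, step, error))
        if error:                       # error recovery: re-plan instead of pressing on
            continue
        if verify(goal, state):         # incremental verification stabilizes execution
            return state, scratchpad
    return state, scratchpad

if __name__ == "__main__":
    print(reactive_workflow("find the pricing page", "homepage")[0])
```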
Tree Search / Algorithm Simulation.
Tree-based search strategies, especially BFS, DFS, A*, MCTS, and beam search, have become prominent as interpretable and effective planning scaffolds. Several works simulate tree traversal algorithms to mimic deliberative processes: [4, 106, 107, 108] apply breadth- or depth-first strategies to explore structured thought trees. A*-like guided expansions appear in [109, 110, 111], providing heuristic-driven planning with state evaluation. Besides that, MCTS is heavily explored in agentic research: [112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123] use MCTS or its variations for controlled exploration and improved reasoning fidelity. Beam search is leveraged in [124, 125, 126] to prune and prioritize reasoning trajectories efficiently. Other tree-search-inspired works include [127] which uses learned search policies and [128] which differentiates between fast (reactive) and slow (deliberative) planning. These methods mirror traditional algorithmic planning, grounding LLMs' search processes in classical decision-making frameworks.
This search-over-hierarchy view maps cleanly onto domain systems. In the web setting, planner-executor architectures generate high-level subtask trees in natural language and bind leaves to DOM-grounded actions, often with memory to persist context [84, 129, 85]. For code agents, hierarchical task trees and pseudo-code plans recursively break problems into compilable/editable units, while structured pipelines embed hierarchical RL or MCTS within the tree to choose promising edits and verification paths [76, 22, 130, 131, 132]. In robotics, behavior trees and high-level goal decomposition translate language instructions into subgoal sequences executed by low-level controllers and skills [133, 134, 135, 136, 137].
Taken together, hierarchical tree-search couples plan synthesis (node expansion, heuristic/evidence-based selection) with plan realization (leaf grounding and feedback), yielding interpretable, long-horizon agents that can backtrack, refine, and verify before committing to irreversible actions, while remaining flexible enough to incorporate learned policies and memory for efficiency and robustness.
Process Formalization.
Formalizing planning through symbolic representations, programming languages, or logic frameworks ensures compositionality, interpretability, and generalization. Several works encode plans as code-like artifacts or PDDL programs: [138, 139, 140, 97, 141, 142] incorporate symbolic logic or procedural programming into LLM prompting or output generation. These representations enable downstream tool execution and interface more cleanly with classical planners or robot controllers. PDDL-based formulations explicitly bridge LLM planning with well-established planning ecosystems, as in [139, 140]. CodePlan [97] highlights the use of program synthesis to scaffold long-horizon reasoning. Such formalization provides structural scaffolds for agent behavior and often enhances explainability and robustness of the generated plans.
Decoupling / Decomposition.
Decoupling strategies aim to modularize complex planning into separable components such as goal recognition, memory retrieval, and plan refinement. Notably, ReWOO [71] explicitly separates observation and reasoning modules to optimize for efficiency. Similarly, works like [143, 144, 145, 146, 147, 142, 148] break reasoning into reusable or hierarchical abstractions. [76] promotes hierarchical thinking through hypertrees, while [82] abstracts the world with symbolic predicates to reduce planning burden. Others, such as [149] and [119], decompose via latent variables or state spaces. These decompositions not only enhance tractability, but also align with neural-symbolic hybrid frameworks. They are especially common in long-horizon or multi-agent planning scenarios, such as [150, 151].
External Aid / Tool Use.
Many systems leverage external structures or tools to aid planning, including retrieval-augmented generation (RAG), knowledge graphs, world models, and general-purpose tool use. Knowledge-augmented frameworks like [80, 88, 152, 153, 143] inject structured representations (e.g., graphs, scene layouts) into the LLM context. RAG-style systems [86, 154, 155] retrieve relevant knowledge to support continual instruction planning. World model-based agents such as [112, 138, 156, 89, 90, 91, 157, 158] learn or leverage environment models for model-based planning. Tool-oriented frameworks like HuggingGPT [7], Tool-Planner [81], and RetroInText [148] use external APIs or modular toolchains to support planning execution. These systems often reflect agent-environment interaction and capitalize on external resources to scaffold or augment LLM capabilities.
3.1.2 Post-training Planning
Reward Design / Optimal Control.
Finally, planning as optimization entails designing suitable reward structures and solving for optimal behavior using RL or control-theoretic tools. Reflexion [14], Reflect-then-Plan [77], and Rational Decision Agents [159] incorporate utility-based learning to guide planning behavior. Reward modeling appears in works such as [160], while others like [161] emphasize reward shaping. Optimal control is tackled explicitly in [162, 163, 164, 165], and trajectory optimization via diffusion models is seen in [166, 167, 168]. Offline RL methods like [119, 169, 147] leverage pretrained dynamics or cost models. The control-theoretic orientation in these works complements symbolic or heuristic approaches by optimizing over continuous, structured, or learned reward spaces.
3.2 Tool-Use Optimization
Tool use optimization is the capacity of an agent to augment its intrinsic capabilities by intelligently invoking external modules. This allows agents to overcome limitations such as outdated knowledge, inability to perform precise calculations, or lack of access to private information. The core challenge lies in the agent's ability to reason about when to use a tool, which tool to select from a library, and how to generate a valid call. In this section, we examine existing approaches to tool use optimization, which can be broadly classified into three styles: in-context tool-integration, post-training tool-integration, and orchestration-based tool-integration.
3.2.1 In-Context Tool-integration
The in-context demonstration paradigm is a training-free approach to empowering LLMs with new capabilities at inference time. This method leverages the remarkable in-context learning ability of modern LLMs, guiding a frozen, off-the-shelf model to perform complex tasks by providing carefully crafted instructions, examples, and contextual information directly in the prompt.
Interleaving Reasoning and Tool Use.
The foundation of in-context agentic reasoning lies in augmenting the Chain-of-Thought (CoT) process with the ability to take action ([1]). ChatCoT [171] formalizes this paradigm by structuring reasoning traces as alternating "thought-tool-observation" steps in natural language, allowing LLMs to reflect on intermediate outputs and dynamically plan the next tool query. While CoT enables LLMs to break down problems into intermediate reasoning steps, it operates in a closed world, limited by the model's internal knowledge. The key innovation in agentic tool use is to interleave these reasoning steps with actions (tool calls), creating a dynamic loop that allows the agent to interact with external environments to gather information and execute tasks ([183, 184]). ReAct [5] introduced the "Reasoning+Acting" synergy. This approach enables the model to use reasoning to create, track, and adjust its action plans, while the actions allow it to interface with and gather information from external environments such as knowledge bases or the web. Similarly, ART [170] provides a structured approach by maintaining a library of successful task demonstrations. For a new task, ART retrieves a relevant multi-step exemplar and uses it as a few-shot prompt, guiding the LLM to follow a proven reasoning and tool-use path.
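The following sketch illustrates the interleaved thought-action-observation loop in the spirit of ReAct; the `Thought:`/`Action:`/`Final Answer:` format, the toy tool registry, and the `llm` stub are illustrative assumptions rather than the exact prompts of any cited system.

```python
import re

TOOLS = {
    "search": lambda q: f"top result for '{q}'",
    "calc": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy, restricted calculator
}

def llm(prompt):
    """Hypothetical model call that emits Thought/Action lines in a fixed format."""
    if "top result" in prompt:
        return "Thought: I have the evidence.\nFinal Answer: 42"
    return "Thought: I should look this up.\nAction: search[meaning of life]"

def react_loop(question, max_turns=4):
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        output = llm(transcript)                 # reasoning step (thought + proposed action)
        transcript += "\n" + output
        if "Final Answer:" in output:
            return output.split("Final Answer:")[-1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", output)
        if match:
            tool, arg = match.groups()
            observation = TOOLS[tool](arg)       # ground the step in an external tool call
            transcript += f"\nObservation: {observation}"
    return None

if __name__ == "__main__":
    print(react_loop("What is the meaning of life?"))
```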
Optimizing Context for Tool Interaction.
While the foundational interleaved loop is powerful, its performance degrades when agents must handle large or complex toolsets. A significant branch of research addresses this by optimizing the in-context information provided to the agent. Recent studies demonstrate that well-written tool documentation enables LLMs to utilize new tools in a zero-shot manner ([185, 186]). This finding aligns with the key insight that LLMs, much like humans, benefit from clear and concise instructions. Alternatively, GEAR ([172]) introduces a computationally efficient, training-free algorithm that delegates the tool selection process to a small language model while reserving the more powerful LLM for the final reasoning step to reduce costs. AVATAR [173] enhances the robustness of this choice by prompting the agent to perform in-context "contrastive reasoning" before acting.
While these in-context methods are flexible, their performance is ultimately bounded by the inherent capabilities of the frozen LLM and the length of its context window. Consequently, subsequent research has focused on post-training methods.
3.2.2 Post-training Tool-integration
Tool integration ([5, 187, 188]) with post-training techniques has emerged as a key strategy for addressing the inherent limitations of LLMs or LRMs, such as outdated knowledge, limited computational precision, and shallow multi-step reasoning. By learning how to interact with external tools, reasoning models can dynamically access up-to-date information, execute precise symbolic or numerical computations, and decompose complex tasks into grounded, tool-assisted reasoning steps ([189, 190, 9, 191, 192]). With tools as intermediaries, models are enriched and augmented by external capabilities, enabling the generation of more accurate and generalizable agentic reasoning trajectories ([193, 186, 194]).
Bootstrapping of Tool Use via SFT.
Early works on tool-integration ([5, 6, 174, 175, 195, 196, 197, 198, 199]) primarily apply supervised fine-tuning (SFT) over curated tool-use reasoning steps, where models were trained to imitate demonstrations of search queries, code executions, or API calls. The SFT stage provided an initial competency in invoking tools, interpreting tool outputs, and integrating the results into coherent reasoning chains ([196, 14]). For example, Toolformer [6] introduces a self-supervised framework in which large language models generate, validate, and retain useful API calls within unlabeled text, followed by fine-tuning on the filtered data to enhance factual accuracy and practical utility. ToolLLM [174] further scales SFT training to over 16,000 real-world APIs, applying supervised fine-tuning on massive curated demonstrations to endow models with robust planning and invocation abilities. ToolAlpaca [175] extends the idea to compact LLMs by automatically constructing a diverse toolset and generating multi-turn tool-use dialogues via multi-agent simulation, followed by fine-tuning to enable generalized tool use even for previously unseen tools. While effective at bootstrapping tool-awareness, applying SFT alone suffers from overfitting to the specific patterns in the training data [200, 201, 202, 203], leading to brittle tool-selection strategies and limited adaptability in unseen downstream application scenarios ([204, 178, 205]).
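To make the SFT setup concrete, the sketch below serializes a tool-use trajectory into a single (prompt, target) training example; the `<tool>`/`<result>` tags and the demonstration schema are assumptions for illustration, not the formats used by Toolformer or ToolLLM.

```python
import json

def to_sft_example(question, trace):
    """Serialize a tool-use trajectory into one (prompt, target) pair for supervised fine-tuning."""
    lines = []
    for step in trace:
        if step["type"] == "call":
            # Tool invocation the model should learn to emit verbatim.
            lines.append(f"<tool>{step['api']}({json.dumps(step['args'])})</tool>")
        elif step["type"] == "result":
            # Tool output the model should learn to condition on, not to invent.
            lines.append(f"<result>{step['value']}</result>")
        else:
            lines.append(step["text"])
    return {"prompt": question, "target": "\n".join(lines)}

if __name__ == "__main__":
    demo = [
        {"type": "text", "text": "I need the current weather before answering."},
        {"type": "call", "api": "get_weather", "args": {"city": "Chicago"}},
        {"type": "result", "value": "snow, -3C"},
        {"type": "text", "text": "It is snowing in Chicago at -3C."},
    ]
    print(json.dumps(to_sft_example("What's the weather in Chicago?", demo), indent=2))
```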
Mastery of Tool Use via RL.
Recent studies ([206, 178, 207, 176, 208, 27, 209, 177]) leverage reinforcement learning (RL) during model post-training to go beyond imitation and achieve mastery in tool-integrated reasoning. With the integration of RL, models refine their tool-use strategies through outcome-driven rewards, learning when, how, and which tools to invoke via trial and error [176, 210, 177, 211]. For instance, SWE-RL ([207]) optimizes code-editing policies on large-scale software evolution data, improving not only software issue resolution but also general reasoning skills. ReSearch [176] embeds search operations into multi-hop reasoning chains, enabling adaptive retrieval during complex QA. ReTool integrates real-time code execution into reasoning rollouts, leading to strong performance on advanced math reasoning benchmarks. ToolRL [178] generalizes this paradigm to diverse toolsets by introducing principled reward designs for stable and scalable multi-tool learning. Across these settings, RL has been shown to yield more robust, adaptive, and generalizable tool-use policies than SFT alone, often transferring effectively to out-of-domain tasks [212, 213, 214, 215, 216].
3.2.3 Orchestration-based Tool-integration
In real-world applications, tool use within complex systems often extends beyond the single-model, single-tool setting, requiring orchestration among multiple tools to complete complex tasks. This orchestration typically involves planning, sequencing, and managing dependencies across tools, i.e., ensuring that intermediate outputs are passed and transformed appropriately. Several early works ([7, 179, 217]) explore this direction by devising strategies for the coordinated use of multiple tools, enabling systems to solve multi-stage tasks that no single tool can handle in isolation. Specifically, HuggingGPT ([7]) employs a centralized agent that leverages a language interface to plan which tools to invoke and when, enabling the solution of complex tasks requiring multiple tools in sequence. TaskMatrix.AI ([179]) connects foundation models with millions of APIs, using the models to generate task-solution outlines and automatically matching certain sub-tasks to off-the-shelf models and systems with specialized functionalities. ToolkenGPT ([180]) augments frozen language models with massive tool sets by encoding each tool as a special token during next-token prediction.
Agentic Pipelines for Tool Orchestration.
There are many frameworks designed to enable LLMs to call and orchestrate tools effectively. Most current agentic frameworks follow a “plan before action” strategy, where the model first generates a structured plan for tool use and then executes it. ToolPlanner ([81]) introduces a two-stage reinforcement learning framework with path planning and feedback, supported by MGToolBench, to bridge the gap between API-heavy training data and real-world user instructions. Tool-MVR ([218]) enhances reliability and reflection through meta-verification of tool calls and exploration-based reflection learning, achieving strong gains over GPT-4 and other baselines. More recently, OctoTools ([180]) provides a training-free, extensible framework with standardized tool cards, a hierarchical planner, and an executor, showing broad improvements across multi-domain reasoning tasks. Chain-of-Tools ([219]) leverages frozen LLMs’ semantic representations to dynamically compose unseen tools in chain-of-thought reasoning, enabling generalization to massive tool pools without fine-tuning. PyVision [220] introduces an interactive, multi-turn framework that enables MLLMs to dynamically generate, execute, and refine Python-based tools, moving beyond static toolsets in visual reasoning. ConAgents [199] makes an initial extension of tool-use frameworks to interactive multi-agent settings. Such agentic tool-orchestration frameworks are also beginning to see applications in the chemistry domain ([221]).
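A minimal sketch of the shared "plan before action" pattern: a planner emits an ordered list of tool calls with explicit data dependencies, and an executor resolves those dependencies so intermediate outputs flow between tools. The plan schema and tool registry are assumptions for illustration, not the interface of any cited framework.

```python
TOOLS = {
    "transcribe": lambda x: f"transcript({x})",
    "summarize": lambda x: f"summary({x})",
    "translate": lambda x: f"translation({x})",
}

def plan(task):
    """Hypothetical planner: emit an ordered tool plan with data dependencies."""
    return [
        {"id": "s1", "tool": "transcribe", "input": task},
        {"id": "s2", "tool": "summarize", "input": "$s1"},   # depends on step s1's output
        {"id": "s3", "tool": "translate", "input": "$s2"},   # depends on step s2's output
    ]

def execute(steps):
    """Executor: resolve '$step_id' references so outputs are passed between tools."""
    results = {}
    for step in steps:
        arg = step["input"]
        if isinstance(arg, str) and arg.startswith("$"):
            arg = results[arg[1:]]                 # substitute an upstream result
        results[step["id"]] = TOOLS[step["tool"]](arg)
    return results

if __name__ == "__main__":
    print(execute(plan("meeting_audio.wav"))["s3"])
```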
Tool Representations for Orchestration.
Beyond designing orchestration pipelines, another line of research focuses on optimizing the tools themselves to facilitate more accurate selection, composition, and coordination during orchestration. ToolExpNet ([181]) models tools and their usage experiences as a network that encodes semantic similarity and dependency relations, allowing LLMs to distinguish between similar tools and account for interdependencies during selection. T2Agent ([222]) addresses multimodal misinformation detection by representing tools with standardized templates and using Bayesian optimization to select a task-relevant subset. Coupled with Monte Carlo Tree Search over this reduced action space, T2Agent enables efficient multi-source verification. ToolChain* ([182]) frames the entire tool action space as a decision tree and applies A* search with task-specific cost functions to guide navigation. This representation allows efficient pruning of high-cost branches and identification of optimal tool-use paths. ToolRerank ([223]) refines tool retrieval by introducing adaptive truncation for seen vs. unseen tools and hierarchy-aware reranking to balance concentration (for single-tool queries) and diversity (for multi-tool queries).
3.3 Agentic Search
Single-agent Agentic Retrieval-Augmented Generation (RAG) systems embed reasoning and control into a centralized agent that governs the entire retrieval-generation loop. Unlike traditional RAG pipelines [224, 10, 225] that perform fixed, one-shot retrieval before generation, agentic RAG agents dynamically control when, what, and how to retrieve based on real-time reasoning needs. This enables the model to adapt retrieval strategies mid-inference, refine its queries, and better integrate evidence from multiple sources. Based on how the agent selects, refines, and integrates retrieved content during reasoning, we categorize single-agent Agentic RAG systems into three distinct architectural styles: in-context, post-training, and structure-enhanced agentic RAG.
3.3.1 In-Context Search
Interleaving Reasoning and Search.
In-context agentic RAG systems embed retrieval behavior directly into the inference process of language models through carefully designed prompting strategies. Rather than training the model to learn retrieval behavior, these methods guide it to alternate between reasoning and search within a single forward pass, typically via few-shot exemplars or special tokens. A representative example is ReAct [5], which interleaves Chain-of-Thought reasoning with tool-use commands such as <Search> to dynamically invoke external APIs or knowledge sources. Extensions such as Self-Ask [226] and IRCoT [184] go beyond sequential reasoning by prompting the model to recursively decompose questions and retrieve sub-evidence accordingly. More recent methods [227, 154, 228, 235] introduce reflective retrieval, where the model explicitly assesses whether it needs additional information at each step, deciding to retrieve only when necessary. These approaches require no additional training, making them highly flexible and deployable, but often rely on prompt engineering and may struggle with stability across diverse domains.
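The sketch below shows the basic control flow these prompting-based methods share: the model either emits a retrieval request or answers directly, and retrieved evidence is spliced back into the context; the `<Search>` marker and the retriever/model stubs are illustrative assumptions.

```python
def retrieve(query, k=2):
    """Hypothetical retriever over an external corpus."""
    return [f"doc{i} about {query}" for i in range(k)]

def llm_step(context):
    """Hypothetical model call: either request evidence or answer from what is in context."""
    if "Evidence:" in context:
        return "Answer: based on the retrieved passages, ..."
    return "<Search>who proposed agentic RAG taxonomies</Search>"

def agentic_rag(question, max_rounds=3):
    context = f"Question: {question}"
    for _ in range(max_rounds):
        output = llm_step(context)
        if output.startswith("<Search>"):
            query = output[len("<Search>"):-len("</Search>")]
            passages = retrieve(query)                          # retrieve only when requested
            context += "\nEvidence: " + " | ".join(passages)    # splice evidence back in
        else:
            return output
    return "Answer: (no answer within the round budget)"

if __name__ == "__main__":
    print(agentic_rag("Summarize agentic RAG"))
```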
Structure-Enhanced Search.
Structure-enhanced agentic RAG systems enhance retrieval-augmented generation by enabling a single agent to reason over symbolic knowledge sources such as knowledge graphs through dynamic querying, tool invocation, and reflective self-monitoring. Unlike static KG retrievers or query executors, these agents decide when to access structured knowledge, how to formulate graph-based queries, and whether retrieved information suffices for continuing the reasoning trajectory. Agent-G [234] introduces a modular agentic architecture that integrates unstructured document retrieval with structured graph reasoning, using feedback loops and specialized retriever modules to ensure accurate multi-hop responses. MC-Search [235] introduces five canonical reasoning topologies to model the multimodal search-enhanced reasoning process and proposes an end-to-end agentic RAG and step-wise evaluation pipeline to assess a model's planning and retrieval fidelity across heterogeneous sources. Similarly, GeAR [236] incorporates graph expansion operations into an agentic controller to address challenges in complex multi-hop queries, enhancing coherence across structured and unstructured sources. Beyond retrieval orchestration, ARG [237] proposes a fully end-to-end agentic framework for reasoning over knowledge graphs via active self-reflection. The model autonomously determines when to retrieve, performs iterative critique based on symbolic inputs, and exhibits interpretable, step-wise reasoning behavior over graphs. Together, these systems represent a shift from passive graph access to active, feedback-driven symbolic reasoning, highlighting the potential of structured agentic RAG to achieve both factual reliability and interpretability.
3.3.2 Post-Training Search
Post-training agentic RAG methods endow language models with retrieval-aware capabilities by fine-tuning them to make informed decisions throughout multi-step reasoning. Unlike in-context prompting, these approaches train models, either via supervised fine-tuning (SFT) or reinforcement learning (RL), to determine when retrieval is necessary, how to formulate queries, and how to incorporate retrieved evidence.
SFT-Based Agentic Search.
These methods construct curated or synthetic datasets that interleave retrieval operations with natural language reasoning, and subsequently apply supervised fine-tuning to instill retrieval-aware capabilities into the model. Toolformer [6] introduces a self-supervised approach to annotate tool-use behaviors within model-generated text, enabling LLMs to learn when and how to invoke tools such as web search or calculators. INTERS [229] extends this direction by performing instruction-based fine-tuning over a diverse, multi-task dataset compiled from over 40 sources, capturing a wide spectrum of retrieval-reasoning patterns. This class of methods benefits from scalable data generation pipelines [238, 239, 23], which minimize the need for human annotation. Instructional reformulation techniques [240, 229, 241] further enhance generalization by aligning tasks with human-preferred formats and reasoning.
RL-Based Agentic Search.
These methods optimize retrieval-aware behaviors through reward signals that reflect answer quality, factuality, or user preferences. WebGPT [230] introduces reward modeling to supervise search-augmented chains aligned with human judgment, while RAG-RL [231] formulates retrieval as a sequential decision-making task over evidence access. More recent efforts such as Search-R1 [27] and Deep-Researcher [232] go further by training agents to dynamically issue retrieval actions (e.g., generating <Search> tokens mid-reasoning) and operate in open-ended environments such as the live web. These agents exhibit emergent capabilities such as iterative decomposition, re-verification, and evidence planning. Finally, systems like ReSearch [176] and ReARTeR [233] pursue not only accurate answers but also interpretable and faithful reasoning trajectories, highlighting the potential of reinforcement-learned retrievers to act as controllable and reflective agents.
4. Self-evolving Agentic Reasoning
Self-evolving agentic reasoning addresses the rigidity of static model inference by empowering agents to refine their reasoning processes through experience, via intertwined feedback and memory mechanisms. Feedback operates through reflective self-critique during inference, parametric adaptation that embeds corrections into model parameters, and validator-driven resampling guided by external success signals; memory evolves from in-context factual and experience buffers to structured graphs, workflows, and multimodal forms, and further to post-training reward-optimized control. This synergy creates a dynamic adaptation loop atop planning, search, and tools, yielding continual self-improvement, robust long-horizon reasoning, and the groundwork for lifelong and collective intelligence.
Self-evolving agentic reasoning refers to an agent’s capacity to improve its own reasoning process through experience. At the core of this evolution lie two fundamental mechanisms: feedback and memory. Feedback provides evaluative signals for self-correction and refinement, allowing the agent to revise its reasoning strategies based on outcomes or environmental responses. Memory, in turn, acts as a persistent substrate for storing, organizing, and synthesizing past interactions, enabling knowledge accumulation and reuse across tasks. Together, these mechanisms transform reasoning from a static process into a dynamic, adaptive loop capable of continual improvement.
Building upon foundational capabilities such as planning, search, and tool use, self-evolving agents integrate feedback and memory to refine their internal reasoning policies, adjust decision-making strategies, and generalize across diverse contexts, often without explicit external supervision. This continual adaptation marks a critical step toward lifelong reasoning and lays the groundwork for the collective intelligence explored in the next section.
4.1 Agentic Feedback Mechanisms
Agentic feedback mechanisms enable models to iteratively refine their reasoning and actions rather than relying on one-shot responses. By incorporating self-critique, verifier guidance, or validator-based resampling, these methods emulate human trial-and-error learning and form the foundation for autonomous self-improvement. Broadly, they operate through three distinct feedback regimes: (1) reflective feedback, where models revise their reasoning through self-critique or verification; (2) parametric adaptation, where feedback is consolidated into updated model parameters; and (3) validator-driven feedback, where binary outcome signals guide resampling without introspection.
These regimes define a continuum between dynamic, inference-time adaptability, durable learning through parameter updates, and efficient correction through external signals. Together, they highlight how modern agents leverage feedback to balance flexibility, reliability, and efficiency.
4.1.1 Reflective Feedback
Reflective feedback methods improve model reliability by modifying the reasoning process during inference, without updating model parameters. These approaches expose intermediate reasoning outputs, such as chains of thought or partial solutions, and introduce additional assessment steps that directly influence how the model continues its generation.
Early self-critique and rationale-refinement methods [14, 242] implement reflection through an explicit generate–critique–revise loop. A model first produces an answer together with its reasoning. The same model, or a separately prompted critic role, then analyzes this output to identify logical errors, unsupported assumptions, or missing steps. The critique is appended as context for a revised generation, and this process may be repeated multiple times or augmented with external evidence such as retrieval. More recent self-improvement frameworks [243] extend reflective feedback beyond a single inference episode by accumulating critiques or failure cases across interactions. Instead of correcting only one response, these methods reuse past feedback to guide future generations through prompt refinement or curated supervision signals, while still operating without direct parameter updates at inference time. Search-based reasoning strategies [244, 4, 74] improve reliability by generating and comparing multiple candidate reasoning paths. These methods explore the solution space through stochastic sampling or structured search, then select or aggregate outputs using voting schemes, heuristic scores, or learned evaluators. Improvement arises from comparison across alternatives rather than explicit revision of a single reasoning trajectory. Decomposition-based prompting methods [2, 245] reformulate complex problems into ordered sequences of simpler subproblems. Intermediate results are reused in later steps, allowing partial inspection of reasoning progress and reducing error propagation, even when no explicit critique step is introduced.
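To make the generate–critique–revise pattern concrete, the sketch below wires a single LLM into an explicit critique loop. The `llm(prompt) -> str` callable and the prompt wording are hypothetical stand-ins for whatever completion API a system actually uses.

```python
# Minimal sketch of a generate-critique-revise loop (reflective feedback).
# `llm` is a hypothetical text-completion callable; any chat/completion API
# could be wrapped to fit this signature.
from typing import Callable

def reflective_answer(llm: Callable[[str], str], question: str, max_rounds: int = 2) -> str:
    answer = llm(f"Question: {question}\nThink step by step, then answer.")
    for _ in range(max_rounds):
        critique = llm(
            f"Question: {question}\nProposed answer:\n{answer}\n"
            "Identify logical errors, unsupported assumptions, or missing steps. "
            "If the answer is already correct, reply exactly: NO ISSUES."
        )
        if "NO ISSUES" in critique:
            break
        # The critique is appended as context for a revised generation.
        answer = llm(
            f"Question: {question}\nPrevious answer:\n{answer}\n"
            f"Critique:\n{critique}\nWrite an improved answer."
        )
    return answer
```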
Overall, reflective feedback alters inference-time reasoning trajectories by introducing additional reasoning or comparison steps. Feedback is used to guide generation within an episode, while the model’s parameters remain unchanged.
4.1.2 Parametric Adaptation
Parametric adaptation incorporates feedback into a model’s parameters through additional training, producing persistent behavioral changes that generalize beyond individual inference episodes. Unlike reflective feedback, these methods transform feedback signals into supervised or preference-based training objectives that update the model’s weights.
Trajectory-level supervised fine-tuning approaches [246, 103] attach feedback to intermediate reasoning traces rather than only final answers. Models first generate multi-step trajectories, which are then reviewed by humans, auxiliary models, or automated verifiers. Incorrect steps are corrected or replaced, and the resulting feedback-enriched trajectories are used as supervised training data, encouraging the model to internalize improved reasoning patterns. Distillation-based methods [247] further leverage improved reasoning traces by training student models on high-quality chains of thought or self-corrected solutions generated by stronger teachers. This process transfers structured reasoning behaviors into more stable or efficient models, removing the need for explicit reflection at inference time. Preference-alignment approaches [248, 249, 250] incorporate feedback in the form of comparative judgments that distinguish preferred from dispreferred outputs. Training objectives such as reward modeling or direct preference optimization adjust the model’s parameters so that preferred behaviors become more likely. Although feedback is often defined over final outputs, it implicitly shapes the internal reasoning strategies that produce them. Recent work shows that verification-augmented training data can further improve reasoning robustness across domains [251, 252]. In these settings, trajectories are filtered or revised based on correctness or consistency signals before training, yielding datasets that emphasize reliable reasoning patterns.
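As a simplified illustration of the preference-alignment objectives cited above, the following function computes a DPO-style loss for one preference pair. The log-probabilities here are plain floats for clarity; in a real setup they would be token-level log-probabilities summed under the current policy and a frozen reference model, and the beta value is an arbitrary placeholder.

```python
# Illustrative direct preference optimization (DPO) objective for a single
# preference pair, written with plain floats. Minimizing this loss makes the
# preferred response more likely relative to the reference model.
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Negative log-sigmoid of the scaled margin between chosen and rejected responses.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Example: the policy already slightly prefers the chosen response over the reference.
print(dpo_loss(-10.0, -12.0, -11.0, -11.5))
```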
In summary, parametric adaptation embeds feedback directly into the model’s parameters, yielding durable improvements across tasks. This durability comes at the cost of additional training and reduced flexibility compared to inference-time methods.
4.1.3 Validator-Driven Feedback
Validator-driven feedback improves model outputs using external success or failure signals, without modifying the model’s reasoning process or parameters. A validator, such as a unit test, constraint checker, simulator, or environment signal, evaluates candidate outputs and determines whether they satisfy predefined correctness criteria.
Retry-based systems [253, 254] implement this paradigm by repeatedly sampling candidate outputs until one passes validation. The model generates a complete solution, submits it to the validator, and discards it if validation fails. Subsequent attempts are generated independently, without conditioning on explicit information about previous failures. This strategy is particularly effective in domains with reliable and inexpensive validation, such as program synthesis and software engineering [255, 256, 257]. Generated code can be executed against unit tests, providing an unambiguous correctness signal. The model iterates until a solution satisfies all tests, even in the absence of explicit reasoning correction. Similar mechanisms appear in embodied and interactive agents [136, 258], where action sequences are repeatedly executed until the environment signals task completion. Failed sequences are abandoned and new ones are attempted, based solely on external success signals. Some hybrid methods introduce lightweight guidance within the retry loop, for example by assigning higher reward to behaviors that eventually lead to successful outcomes [259]. However, the dominant mechanism remains selection through external validation rather than revision of reasoning steps or parameter updates.
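A minimal sketch of validator-driven resampling, assuming a hypothetical `generate_program` sampler and a `unit_tests` predicate: candidates are sampled independently and discarded until one passes validation, with no information about earlier failures fed back into generation.

```python
# Sketch of validator-driven resampling for program synthesis: candidate
# programs are sampled independently and kept only if they pass unit tests.
from typing import Callable, Optional

def solve_with_validator(generate_program: Callable[[str], str],
                         unit_tests: Callable[[str], bool],
                         task: str, max_attempts: int = 10) -> Optional[str]:
    for _ in range(max_attempts):
        candidate = generate_program(task)      # fresh, independent sample
        if unit_tests(candidate):               # binary, non-diagnostic signal
            return candidate                    # first validated solution wins
    return None                                 # validation budget exhausted
```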
Overall, validator-driven feedback offers an efficient and scalable way to improve output correctness when reliable validators are available. Its limitation is that feedback is non-diagnostic, correcting individual outputs without explaining failures or altering the model’s reasoning behavior.
4.2 Agentic Memory
Recent advances in memory-augmented LLM agents have shifted the focus from static memory storage to more dynamic, interactive mechanisms that directly support agentic reasoning. Rather than merely extending the context window or storing historical inputs, memory is increasingly treated as an integral component of the reasoning loop, used for reflecting on past experiences, guiding future actions, and dynamically adapting to complex, long-horizon tasks. Formally, an agent maintains a memory module where each memory entry may represent a raw observation, summarized trajectory, subgoal, tool invocation trace, or other structured element depending on the system design.
The agent’s reasoning process then operates not only on its immediate context but also on this persistent memory, enabling reflection, generalization, and long-term goal tracking. In this section, we organize prior work along four emerging trends in the use of memory to support and enable agentic reasoning. Figure 6 summarizes how agentic memory progresses from contextual recall to adaptive control. In-context memory captures textual and semantic information from prior interactions; structured memory integrates these into graph and multimodal representations; post-training control enables agents to evolve, update, and retrieve memory through learned reward-based mechanisms.
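One plausible, minimal realization of such a memory module is sketched below; the entry types and the naive keyword-overlap retrieval are illustrative assumptions rather than the design of any specific system.

```python
# Toy memory module: typed entries plus a flat store with simple keyword recall.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryEntry:
    kind: str        # e.g., "observation" | "trajectory_summary" | "subgoal" | "tool_trace"
    content: str
    tags: List[str] = field(default_factory=list)

@dataclass
class MemoryModule:
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def recall(self, query: str, k: int = 3) -> List[MemoryEntry]:
        # Rank by naive keyword overlap; real systems typically use dense retrieval.
        words = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(words & set(e.content.lower().split())),
                        reverse=True)
        return scored[:k]
```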
4.2.1 Agentic Use of Flat Memory
Factual Memory.
Traditional memory systems for LLM agents typically treat memory as a passive buffer, mainly used to store dialogue histories or recent observations to address the limited context window of transformer models. Examples include dense retrieval methods [224, 291, 269], pre-defined modules in LangChain and LlamaIndex [268], and cache-inspired designs like MemGPT [265]. These approaches usually retrieve semantically similar past content to augment prompts, without influencing the agent’s internal reasoning. Enhancements such as RET-LLM with differentiable memory [292], SCM with controller-based mechanisms [293], as well as LOCOMO and LongMemEval benchmarks for long-term retention [294, 295] further improve recall but remain largely static. These systems often rely on fixed heuristics and unstructured token lists [269], limiting adaptability for tasks involving goal decomposition [296, 143], long-term planning [150], or iterative self-improvement [297]. In contrast, emerging agentic memory treats memory as part of the reasoning loop, supporting reflection [298] and decision-making [299]. Amem [24] enables LLM agents to autonomously generate contextual memory descriptions, build dynamic links between related experiences, and evolve memory content in response to new information. Similarly, Zep [278], Mirix [300], MemOS [13], LightMem [271], and Nemori [272] leverage LLMs to automatically produce context-aware memory representations. Beyond LLM-driven approaches, recent work has explored reinforcement learning to explicitly train agents to acquire and organize factual memory, such as Mem-α [287] and Memory-R1 [15], which we discuss in detail in later sections.
Experience Memory.
Workflow Memory [270] explicitly tracks procedural traces during execution, supporting plan recovery, long-term consistency, and interpretable chaining of actions, which makes it particularly suited to procedural and tool-augmented tasks. Sleep-time Compute enables LLM agents to pre-compute and store anticipated reasoning steps before user interaction, effectively "thinking offline" using memory as a preparatory resource [276]. Dynamic Cheatsheet (DC) [275] equips black-box models with external memory to store reusable strategies, reducing redundant reasoning. Atomic reasoning [143] proposes a structured trace over a finite set of reusable atomic skills in a streamlined generation space to reduce spurious reasoning patterns. Context evolution (ACE) [273] treats contexts as evolving playbooks rather than building a static structured store, whereas Reasoning Bank [274] focuses on reusing failed reasoning traces to enhance future task performance. Evo-Memory [25] synthesizes these ideas by benchmarking self-evolving memory under streaming task settings, highlighting experience reuse as a central capability for stateful, long-horizon agentic reasoning. In addition to factual memory, Mirix [300] further introduces a procedural memory component to capture reusable action patterns, while Agentic Memory [289] and MemRL [290] adopt reinforcement learning to optimize the acquisition and management of experiential memory.
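The cheatsheet-style experience memory described above can be sketched in a few lines: after each episode the agent distills a reusable strategy and prepends the accumulated strategies to future prompts. The `llm` callable and prompt wording are hypothetical.

```python
# Sketch of a cheatsheet-style experience memory: distilled strategies are
# accumulated across episodes and injected into future prompts.
from typing import Callable, List

class ExperienceMemory:
    def __init__(self) -> None:
        self.strategies: List[str] = []

    def solve(self, llm: Callable[[str], str], task: str) -> str:
        context = "\n".join(f"- {s}" for s in self.strategies) or "(none yet)"
        answer = llm(f"Known strategies:\n{context}\n\nTask: {task}\nSolve it.")
        # Distill a reusable lesson from this episode and store it for later tasks.
        lesson = llm(f"Task: {task}\nSolution: {answer}\n"
                     "State one short, reusable strategy learned from this episode.")
        self.strategies.append(lesson.strip())
        return answer
```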
This marks a shift from static buffers toward structured, reasoning-centric memory architectures. In these agentic memory systems, memory serves as a dynamically growing context: agents not only record past actions but actively reflect, edit, and refine their strategy over time.
4.2.2 Structured Use of Memory
Beyond flat memory usage and control, the structure of memory plays a critical role in enabling complex reasoning. Recent work increasingly explores structured representations, such as semantic graphs, workflows, and hierarchical trees, often extended to multimodal settings, to better capture dependencies, and contextual relationships.
Graph-based representations provide a flexible substrate for organizing relational knowledge in agents [301]. GraphRAG [277] serves as a foundational technique that augments retrieval with graph-structured reasoning, enabling more contextually coherent and multi-hop information integration. Building on this foundation, agent systems such as MEM0 [12] and Zep [278] organize memory explicitly as dynamic knowledge graphs, allowing agents to store, retrieve, and reason over entities, attributes, and their relations with improved efficiency and semantic grounding. Beyond graphs, structured memory has also been explored through alternative organizational forms. MemTree [302] leverages a dynamic tree-structured representation to hierarchically organize and integrate information, while workflow-oriented systems such as AutoFlow [303], AFLOW [304], and FlowMind [305] represent reasoning workflows explicitly in memory, capturing sequences of subgoals, tool invocations, and decision points.
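As a toy illustration of graph-structured memory, the sketch below stores facts as labeled edges and answers multi-hop queries by expanding a bounded neighborhood; the schema and example facts are invented for illustration.

```python
# Minimal graph-structured memory: entities as nodes, labeled relations as edges,
# with a bounded-hop neighborhood query usable for multi-hop retrieval.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

class GraphMemory:
    def __init__(self) -> None:
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_fact(self, head: str, relation: str, tail: str) -> None:
        self.edges[head].append((relation, tail))

    def neighborhood(self, entity: str, hops: int = 2) -> Set[Tuple[str, str, str]]:
        frontier, facts = {entity}, set()
        for _ in range(hops):
            nxt = set()
            for node in frontier:
                for rel, tail in self.edges.get(node, []):
                    facts.add((node, rel, tail))
                    nxt.add(tail)
            frontier = nxt
        return facts

mem = GraphMemory()
mem.add_fact("aspirin", "inhibits", "COX-1")
mem.add_fact("COX-1", "produces", "thromboxane")
print(mem.neighborhood("aspirin"))  # two-hop facts usable for multi-hop reasoning
```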
New benchmarks have pushed reasoning memory into multimodal domains, where agents are required to ground, retrieve, and reuse information across heterogeneous modalities. M3-Agent [281] evaluates visual–audio–text reasoning through "see, listen, and reason," while Agent-Scankit [283] proposes multimodal agents with integrated memory modules for adaptive retrieval and grounding. Optimus-1 [279] proposes a hybrid multimodal memory architecture that represents world knowledge as a hierarchical directed knowledge graph and abstracts past interactions into a multimodal experience pool. RAP [280] retrieves relevant experiences based on contextual similarity, enabling adaptive reuse of multimodal memory.
These structured memory formats align task semantics, temporal dependencies, and multimodal signals, enabling agents to reason compositionally and maintain coherent behavior over extended interactions. As task complexity increases, the abstraction and organization of memory become increasingly critical for building robust and generalist agents.
4.2.3 Post-training Memory Control
Conversely, memory systems can also be controlled by the agent's reasoning process itself. Rather than relying on fixed heuristics for reading and writing memory, recent work has explored agent-controllable memory operations, where the agent explicitly decides what to store, when to retrieve, and how to interact with memory. This reframes memory as a policy target, no longer a passive buffer, but a resource that is actively shaped by reasoning.
MemAgent [286] formulates memory overwrite as a reinforcement learning problem: the agent is rewarded for preserving information that proves useful and for discarding irrelevant content. Using a DAPO-based reinforcement learning algorithm, the model learns to maintain a constant-sized memory across conversations while maximizing future utility. Mem1 [284] presents an end-to-end reinforcement learning framework where agents maintain a compact, shared internal state across turns, jointly supporting reasoning and memory consolidation. Memory-R1 [15] further advances this line by introducing a dual-agent design: a Memory Manager that dynamically decides when to add, update, or delete entries in the memory store, and an Answer Agent that distills the most relevant retrieved memories to guide response generation. Recent work such as Mem-α [287] also explores RL-based control of multi-component memory construction in agents, providing a unified perspective on adaptive memory construction and reasoning control. Memory-as-Action [285] integrates memory editing, including insertions, deletions, and modifications, directly into the reasoning policy, proposing a Dynamic Context Policy Optimization algorithm to handle non-prefix trajectory changes caused by memory operations. Agent Learning via Early Experience [288] further relaxes reward dependence by enabling agents to learn from their own interaction traces through self-prediction and reflection, bridging imitation and reinforcement learning. Moreover, Agentic Memory [289] and MemRL [290] adopt reinforcement learning to optimize the acquisition and management of experiential memory.
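The common action interface behind these learned memory controllers can be sketched as a small set of memory operations that a trained policy would emit; the operation names and the rule-based placeholder policy below are illustrative assumptions, not any system's actual API.

```python
# Sketch of the action interface behind learned memory control: a manager
# policy emits ADD / UPDATE / DELETE / NOOP operations over a key-value store,
# and reinforcement learning would optimize which operations to emit.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MemoryOp:
    op: str          # "ADD" | "UPDATE" | "DELETE" | "NOOP"
    key: str = ""
    value: str = ""

def apply_ops(store: Dict[str, str], ops: List[MemoryOp]) -> Dict[str, str]:
    for o in ops:
        if o.op in ("ADD", "UPDATE"):
            store[o.key] = o.value
        elif o.op == "DELETE":
            store.pop(o.key, None)
    return store

# A trained manager would emit these ops from the dialogue state; this
# placeholder policy simply stores every stated user preference it sees.
def placeholder_manager(message: str) -> List[MemoryOp]:
    if "prefer" in message.lower():
        return [MemoryOp("ADD", key=f"pref_{len(message)}", value=message)]
    return [MemoryOp("NOOP")]

store: Dict[str, str] = {}
apply_ops(store, placeholder_manager("I prefer morning meetings."))
print(store)
```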
Together, these systems mark a shift toward learning-based memory control, where memory usage is optimized through reinforcement or imitation learning. By integrating memory management into the reasoning policy, agents become more adaptive, scalable, and capable of long-horizon decision-making in dynamic environments.
4.3 Evolving Foundational Agentic Capabilities
4.3.1 Self-evolving Planning
Recent advances view planning not as a fixed reasoning routine but as an evolving capability. Instead of relying on static datasets or human-designed curricula, agents can autonomously generate tasks, learn from their own feedback, and adapt strategies through iterative interaction with the environment. This enables continuous improvement without external supervision.
A representative direction is self-generated task construction. For example, SCA enables agents to alternate between generating problems and solving them, reusing successful trajectories for fine-tuning [306]. Self-rewarding frameworks further allow agents to assess their own outputs, producing high-quality training signals without human labels [307, 308]. Other works directly leverage execution feedback for online adaptation, such as SELF, SCoRe, PAG, TextGrad, and AutoRule, which transform natural-language critiques or traces into training rewards, enabling continual policy refinement [309, 310, 311, 312].
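A minimal sketch of this generate-then-solve loop, assuming a hypothetical `llm` callable and an `is_correct` checker (e.g., a verifier or self-consistency vote): only trajectories that pass the check are kept as future fine-tuning data.

```python
# Sketch of self-generated task construction: the agent alternates between
# proposing tasks and solving them, keeping only checked trajectories.
from typing import Callable, List, Tuple

def self_generate_training_data(llm: Callable[[str], str],
                                is_correct: Callable[[str, str], bool],
                                rounds: int = 5) -> List[Tuple[str, str]]:
    keep: List[Tuple[str, str]] = []
    for _ in range(rounds):
        task = llm("Propose one new, self-contained reasoning problem.")
        solution = llm(f"Solve step by step:\n{task}")
        if is_correct(task, solution):       # e.g., a verifier or self-consistency vote
            keep.append((task, solution))    # reused later as fine-tuning data
    return keep
```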
Beyond internal feedback, agents can also evolve through environment shaping. AgentGen constructs adaptive environments to induce curriculum learning [313], while Reflexion and AdaPlanner use self-reflective or adaptive strategies to refine plans at runtime [14, 314]. Self-Refine iteratively critiques and improves outputs [242], and SICA allows self-modification of code and reasoning tools [315]. From an RL perspective, RAGEN and DYSTIL model planning as a Markov Decision Process and optimize strategies with dense feedback [316, 317].
Together, these methods establish a self-improving planning loop, where agents generate their own tasks, shape their environments, and refine strategies, laying the groundwork for autonomous, open-ended planning evolution.
4.3.2 Self-evolving Tool-use
Creating and Synthesizing Tools.
The culmination of in-context reasoning is the emergent capability of agents to autonomously create new tools. This is achieved not through training, but by prompting a frozen LLM to act as a programmer when it encounters a problem that its existing toolset cannot solve. The LATM framework [318] uses a powerful model as a one-time "tool maker" and a cheaper, lightweight model as a frequent "tool user," thus amortizing the cost of creation. To enable specialization beyond the limits of general-purpose APIs, frameworks like CRAFT [319] and CREATOR [320] generate custom tools tailored for specific domains. Taking this a step further, ToolMaker [321] can convert entire public code repositories into usable tools, allowing agents to leverage complex, human-written codebases on the fly.
4.3.3 Self-evolving Search
Search plays a central role in agentic reasoning, enabling models to retrieve, select, and synthesize relevant knowledge across large and evolving memory spaces. In early systems, search was typically static—built on fixed retrieval heuristics or similarity-based dense retrievers [224, 227, 269, 265]. These methods augmented prompts with retrieved information but lacked adaptive control over how memory evolves or how search strategies are improved over time.
Recent research increasingly links search and memory in a co-evolutionary loop: agents continuously update their memory base during task execution, while dynamically adjusting how search is performed over this evolving knowledge. Agentic memory systems such as MemGPT [265], MemoryBank [269], and Workflow Memory [270] already highlight how retrieved information can be synthesized and re-inserted into memory, gradually improving retrieval quality. Dynamic Cheatsheet (DC) [275] further demonstrates how reusable strategies can be accumulated and leveraged across queries, effectively transforming static search into a living retrieval substrate that evolves with agent experience.
Evolving Memory Bases.
Unlike static index-based retrieval, self-evolving agents actively refine their memory base through reflection and post-execution updates. Reflexion [14] allows agents to critique their own reasoning traces and store distilled insights, improving future search relevance. Reasoning Bank [274] and context evolution methods [273] explicitly restructure memory representations to align retrieval results with evolving problem-solving strategies, effectively making the retrieval target itself adaptive over time.
Dynamic Search and Synthesis.
Beyond memory updates, search strategies themselves can evolve through dynamic prioritization and synthesis. Structured memory representations—such as workflows [303, 304, 305] and knowledge graphs [301, 277, 12, 278]—provide semantic scaffolding that enables multi-hop and compositional search, supporting richer reasoning over longer horizons. Systems like MemOS [13] and Memory-as-Action [285] take this further by integrating search decisions directly into the reasoning policy, allowing retrieval targets, strategies, and sources to co-adapt as agents accumulate experience.
Overall, self-evolving search transforms retrieval from a static utility into a continuously adapting component of the reasoning loop. By evolving memory bases, dynamically adjusting search strategies, and synthesizing retrieval results into structured knowledge, agents can maintain more relevant, structured, and actionable information over extended time horizons.
5. Collective Multi-agent Reasoning
In this section, collective multi-agent reasoning scales single-agent foundations to collaborative systems where specialized agents assume complementary roles like leaders, workers, critics, memory keepers, and domain-specific experts in software engineering, finance, legal, education, healthcare, biomedicine, and music to jointly solve complex tasks. It tackles challenges in role differentiation, communication, and shared memory through in-context collaboration via manual or LLM-driven pipelines, agent routing, and theory-of-mind inference, plus post-training optimization of workflows and policies. Ultimately, embedding memory, reinforcement learning, and co-evolutionary feedback transforms static teams into adaptive, self-improving organizations that converge on superior solutions via distributed, iterative intelligence.
Building upon the single-agent foundation, where reasoning supports planning, search, and tool use within a unified perception–action loop, multi-agent reasoning extends these principles to collaborative settings. In a multi-agent system (MAS), multiple reasoning agents interact to jointly solve complex tasks. Rather than identical problem solvers, agents assume complementary roles, such as Manager for task decomposition, Worker for execution, and Verifier for evaluation, enabling specialization and division of cognitive labor. This role differentiation marks the first step toward collective intelligence, where reasoning is distributed and coordinated across multiple agents.
Beyond role assignment, the essence of multi-agent reasoning lies in how these agents collaborate, communicate, and co-evolve. Collaboration schemas define how reasoning traces are exchanged, conflicts are resolved, and shared memory is maintained to achieve alignment. Through such interaction, reasoning transitions from an individual process into a distributed, iterative loop, in which agents refine each other’s outputs and collectively converge toward better solutions.
Compared with single-agent systems, multi-agent reasoning introduces new challenges that require rethinking reasoning at the system level:
- Role differentiation: how to design static or adaptive roles that align with task structure and expertise distribution;
- Collaboration and communication: how agents exchange intermediate reasoning, negotiate consensus, and divide labor efficiently;
- Collective memory and evolution: how shared or distributed state supports long-term coordination and continual adaptation.
These challenges motivate the following structure of our analysis. Section 5.1 examines the role taxonomy of multi-agent systems, from generic organizational roles to domain-specific specializations. Section 5.2 focuses on collaboration and division of labor, including in-context and post-training coordination strategies. Finally, Section 5.3 explores how memory enables multi-agent systems to evolve over time and maintain collective consistency. Together, these perspectives provide a unified view of how reasoning scales from individual agents to adaptive, collaborative intelligence.
5.1 Role Taxonomy of Multi-Agent Systems (MAS)
In this subsection, we first summarize the generic roles that often appear in a multi-agent system (MAS). Then, we introduce the specific functions of different roles when an MAS is applied in different domains, such as software engineering, finance, legal activities, education, healthcare, biomedicine, and music applications.
5.1.1 Generic Roles
- Leader/Coordinator: The leader, or coordinator, is responsible for maintaining high-level coherence within the system. This role involves setting global objectives, decomposing tasks into manageable subgoals, and assigning them to appropriate agents. In addition, the leader arbitrates conflicts that emerge between agents with overlapping or contradictory outputs. In practice, this role often manifests itself as a meta-controller that monitors the progress of other agents and ensures that execution adheres to an overarching plan.
- Worker/Executor: Executors, often called workers, are the operational backbone of MAS. They engage in concrete actions such as invoking external tools, writing or executing code, retrieving documents, or interfacing with the environment. Although they typically act under the directives of a leader, well-designed systems allow for adaptive autonomy, where executors can refine or optimize their assigned tasks when new local information becomes available.
- Critic/Evaluator: The critic/evaluator role centers on quality assurance. This role includes verifying correctness, testing hypotheses, red-teaming responses, and surfacing potential risks. In LLM-based systems, this often corresponds to LLM-as-a-judge setups, where dedicated evaluators assess the factuality, safety, or stylistic alignment of output. Critic roles help introduce checks and balances into otherwise generative workflows, thereby mitigating error propagation.
- Memory Keeper: Effective MAS requires persistent memory to accumulate context, prevent repetitive failures, and enable learning across episodes. The memory keeper curates and maintains long-term knowledge structures such as episodic logs, semantic embeddings, retrieval indices, or knowledge graphs. By abstracting memory management into a dedicated role, the system can better balance short-term reactivity with long-term continuity and adaptation.
- Communication Facilitator: Communication overhead can easily undermine MAS efficiency. This role governs protocols for inter-agent exchange, including defining message schemas, managing communication bandwidth, enforcing gating mechanisms, and orchestrating consensus-building. By reducing ambiguity and ensuring structured information flow, the communication facilitator prevents bottlenecks and coordination failures in large-scale or heterogeneous agent populations.
5.1.2 Domain-Specific Roles
Beyond generic agent roles, domain-specific tasks often require specialized functions. These roles reflect professional practices in particular industries and map naturally onto MAS architectures.
Software Engineering: In software engineering, MAS generally maps onto roles that mirror the software development lifecycle: architects, developers, code reviewers/testers, CI orchestrators, and release managers [17, 322]. The rationale is to distribute the responsibilities in a way that balances creativity, verification, automation, and governance, just as in industrial software practice.
- Architects define system-level design principles and establish structural blueprints.
- Developers translate these abstractions into concrete implementations.
- Code reviewers and testers safeguard reliability, checking correctness, maintainability, and functional coverage.
- CI orchestrators automate builds, testing, and artifact pipelines, reducing integration frictions.
- Finally, release managers oversee deployment, aligning new versions with milestones and safety protocols.
Previous work has demonstrated similar mappings, such as MetaGPT [17], which decomposes development into Product Manager, Architect, and Engineer agents. ChatDev [322] further emphasizes communicative collaboration among specialized agents to support requirement analysis, coding, and testing. More recently, self-evolving collaboration networks have expanded this paradigm by enabling MAS to dynamically reorganize and optimize their roles throughout the software lifecycle [323]. A variant of MAS has also been applied to the High-Performance Computing (HPC) domain [324]. By structuring MAS around these stages, the architecture gains the same robustness and scalability as professional engineering workflows.
Finance: The financial domain can be roughly decomposed into four archetypal roles: analysts, risk managers, traders/execution agents, and compliance officers [325, 326]. This division reflects the established institutional design of financial organizations, where the responsibilities are segmented to balance profit generation with systemic stability.
- Analysts operate at different levels (e.g., fundamental, sentiment, or technical), each extracting distinct signals from raw market or textual data.
- Risk Managers then monitor portfolio exposure, apply stress tests, and enforce safeguards to prevent cascading vulnerabilities.
- Traders take responsibility for market interaction, while Execution agents ensure that orders are placed with speed and efficiency under liquidity constraints.
- Finally, Compliance roles ensure that activities remain aligned with regulatory requirements, enabling traceable decision-making and proper oversight.
Together, this layered ecology mirrors real-world financial institutions, where specialization and checks-and-balances are indispensable. Recent advances in MAS for finance mirror this layered ecology. R&D-Agent-Quant [327] demonstrates how agents can specialize in factor discovery and joint optimization for quantitative strategies. FinRobot [328] provides an open source multi-agent platform tailored to financial applications, reflecting the practical need for modularity and scalability. PEER [329] introduces expertization and tuning methods to adapt MAS to domain-specific responsibilities, while FinCon [330] highlights the role of conceptual verbal reinforcement to enhance decision-making and compliance. Together, these works underscore how MAS can replicate the specialization, checks, and balances of real-world financial institutions.
Legal Activities: Multi-agent systems are also designed to model the collaborative and adversarial processes inherent in legal practice, with roles assigned to manage consultation, reasoning, and argumentation.
- For legal consultation, frameworks often simulate a law firm's structure with a receptionist agent for client intake, specialized lawyer agents for providing advice, a secretary agent for documentation, and a boss agent for quality control. In a consultation model, the receptionist agent first clarifies a user's query before routing it to the appropriate lawyer agent. After the multi-turn consultation, the secretary agent summarizes the interaction, and the boss agent provides an evaluation, ensuring a comprehensive and high-quality service [331].
- For statutory reasoning, tasks are decomposed between knowledge acquisition agents that interpret legal texts and knowledge application agents that apply formalized rules to case facts. To be specific, in reasoning systems, the knowledge acquisition agent first builds a reusable ontology from legal statutes; then, the knowledge application agent uses this formal structure to analyze the specifics of a new case, ensuring consistent and transparent logic [332].
- To simulate courtroom dynamics, roles such as judge, plaintiff, defendant, and adversarial lawyer agents are created [333]. In courtroom simulations, adversarial lawyer agents engage in debate before a judge agent, reflecting on their performance after each trial to iteratively improve their argumentation strategies by updating their internal knowledge bases [333].
Education: In education, MAS is being developed to provide personalized and adaptive learning experiences by distributing pedagogical functions among specialized agents.
- For personalized tutoring, a central tutor agent might engage a student using Socratic dialogue, while a memory dispatcher agent tracks the student's progress and misconceptions to adapt the difficulty and focus of the lesson in real time [334].
- For curriculum design, a pipeline of agents collaborates: a research agent gathers relevant information, a planning agent structures it into a coherent course, and other agents generate specific learning activities or assessments. Curriculum design can also be modeled as an adversarial process, where an evaluator agent critiques a lesson plan created by a generator agent, and an optimizer agent refines it based on the feedback [335].
These systems demonstrate a shift towards creating intelligent, adaptive platforms that can support educators and provide students with more effective, engaging, and individualized learning journeys.
Healthcare: In the healthcare domain, multi-agent systems are structured to mirror clinical and research workflows, distributing complex tasks among specialized AI agents.
- For clinical diagnostics and consultation, these roles often include a triage agent (or moderator) for initial case assessment, various specialist agents (e.g., pathologists, neurologists), a doctor agent for patient interaction, and a measurement agent to provide test results [336, 337]. More specifically, in the diagnostic setting, a triage agent first assesses the complexity of a case and routes it to the appropriate specialist agents for analysis. These specialists may then engage in multi-round discussions, with a lead physician agent synthesizing their opinions to reach a consensus. In addition, a doctor agent conducts a multi-turn dialogue with a patient agent, requesting specific data from a measurement agent to gather information dynamically.
- For autonomous research, roles are modeled after the scientific process, featuring a meta agent for strategic planning, an executor for running analyses, an evaluator for assessing outcomes, and a reflector for synthesizing knowledge [338]. This division of labor allows for a systematic and comprehensive approach to multifaceted health challenges. Specifically, the meta agent plans an experiment, the executor carries it out, the evaluator provides immediate feedback, and the reflector distills successful strategies into a persistent knowledge base, creating a self-improving cycle that enhances future planning.
- For public health events, ShortageSim [339] models FDA regulators, manufacturers, and healthcare buyers interacting under information asymmetry, enabling counterfactual policy testing and evaluating how announcements and disruptions shape investment, stockpiling, and resolution timing against historical trajectories.
Other frameworks such as MMedAgent and MedAgent-Pro focus on orchestrating specialized medical tools, using a central agent to plan actions and aggregate results from various tool-based agents to handle multimodal data [39, 340].
Biomedicine: In biomedicine, particularly in drug and material discovery, MAS is designed to automate and accelerate the scientific process by assigning roles that reflect the iterative cycle of design, testing, and refinement.
For de novo molecule design, key roles include the actor (or reasoner) for generating novel structures, the evaluator for assessing chemical properties, and the self-reflector for refining future hypotheses based on results. To be specific, the actor agent proposes new candidates, which are then passed to the evaluator agent. The evaluator uses computational chemistry tools to calculate properties like binding affinity and synthetic accessibility, providing quantitative feedback [341]. This feedback is then analyzed by the self-reflector agent to update the system's strategy for the next generation cycle, creating a feedback-driven process of optimization [342].
Similarly, LIDDIA acts as a "digital chemist" with a Reasoner, Executor, Evaluator, and Memory component to navigate the drug discovery process and balance the exploration of new chemical spaces with the exploitation of promising candidates [341]. To streamline the creation of machine learning workflows, DrugAgent uses an LLM Planner and an LLM Instructor to automate programming for tasks like ADMET prediction [343]. In genomics, GenoMAS orchestrates six specialized agents through a guided-planning framework to analyze complex gene expression data, integrating the reliability of structured workflows with the adaptability of autonomous agents [344].
Music: In the creative domain of music composition, MAS is being explored to decompose the intricate process of creating music into collaborative, specialized roles. A system like ComposerX might feature a conductor agent that interprets a high-level user prompt and oversees the project, a melody agent that generates primary musical themes, a harmony agent that creates supporting chord progressions, and a rhythm agent that lays down the percussive and temporal foundation. These agents would interact iteratively, with the conductor agent synthesizing their outputs and providing feedback to ensure the different musical layers are coherent and aligned with the initial creative vision. This mirrors the collaborative process of a human orchestra or band, distributing creative responsibilities to achieve a complex and harmonious final product [345].
5.2 Collaboration and Division of Labor
Collaboration and division of labor constitute a central organizing principle in modern multi-agent systems. Instead of treating agents as homogeneous components, recent work emphasizes how responsibilities are decomposed and coordinated across specialized agents to improve efficiency and robustness. From this perspective, existing approaches can be broadly organized along two dimensions. In-context collaboration focuses on coordination strategies that are specified or induced at inference time without additional training. Post-training collaboration instead optimizes agent roles, interaction structures, or routing policies through learning or search. In addition, agentic routing can be viewed as a special case of this division of labor, where routing decisions explicitly offload cognition and computation to different agents based on task demands.
5.2.1 In-context Collaboration
In the design of multi-agent systems, several studies have observed that leveraging task-specific in-context information is often sufficient to build highly effective systems without the need for explicit training. Among these works, one line of research relies on manually crafted pipelines, where researchers design the agent interactions and workflows tailored to the target task. In contrast, another line explores LLM-driven automatic pipeline generation, allowing the model itself to construct and adapt the system’s structure dynamically based on the task context.
Manually Crafted Pipelines.
These approaches rely on predefined hierarchies or fixed collaboration workflows, where agent roles, execution order, and communication rules are determined before execution. Hierarchical systems such as AgentOrchestra [346], MetaGPT [17], and SurgRAW [347] feature a central planner or conductor directing subordinate agents through structured subgoals. Cascading pipelines like Collab-RAG [348], MA-RAG [349], Chain of Agents [350], and AutoAgents [16] process information sequentially, passing intermediate outputs downstream with limited revision. Modular role-decomposed frameworks such as RAG-KG-IL [351], SMoA [352], and MDocAgent [353] define fixed functional roles (e.g., retriever, reasoner, or vision agent) but allow minimal dynamic coordination. While these manually designed pipelines offer interpretability, modularity, and low execution complexity, their rigidity restricts adaptability to ambiguous or evolving reasoning tasks, motivating more flexible, reasoning-driven coordination mechanisms.
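A minimal sketch of such a hand-designed cascade is shown below: roles, execution order, and the revision rule are all fixed in code rather than decided by the agents. The role prompts and the `llm` callable are illustrative assumptions.

```python
# Sketch of a manually crafted cascading pipeline with fixed roles and a fixed
# execution order (planner -> solver -> reviewer).
from typing import Callable

def fixed_pipeline(llm: Callable[[str], str], task: str) -> str:
    plan = llm(f"Break this task into 2-4 ordered subgoals:\n{task}")
    draft = llm(f"Task: {task}\nPlan:\n{plan}\nExecute the plan and produce a solution.")
    review = llm(f"Task: {task}\nSolution:\n{draft}\nList any defects, or say APPROVED.")
    if "APPROVED" in review:
        return draft
    # A single, hard-coded revision step; the workflow itself never changes.
    return llm(f"Task: {task}\nSolution:\n{draft}\nDefects:\n{review}\nProduce a fixed solution.")
```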
LLM-Driven Pipelines.
This category leverages LLMs as orchestrators that decompose high-level goals into subgoals, route them to role-specialized agents or tools, and iteratively refine workflows based on intermediate feedback until completion. AutoML-Agent [354] proposes a full-pipeline, orchestrator-led agent team that plans, assigns, and coordinates web/API/code tools through role-specialized micro-agents (e.g., coder/tester/runner), enabling end-to-end software development workflows. Magentic-One [355] introduces a generalist multi-agent system where a central Orchestrator plans, tracks progress, and performs ledger-based routing over specialized agents (WebSurfer, FileSurfer, Coder, ComputerTerminal), achieving competitive results on GAIA, AssistantBench, and WebArena. MAS-GPT [356] trains an LLM to emit executable MAS code conditioned on a user query, so a single forward pass generates a query-specific multi-agent workflow. MetaAgent [357] presents a finite-state-machine (FSM) abstraction to declare states, transitions, and tools, from which an LLM designer automatically constructs the MAS pipeline. AOP [146] formalizes orchestrator responsibilities, introduces three design principles (solvability, completeness, and non-redundancy), and operationalizes them with fast decomposition/assignment plus a reward-model evaluator.
Agent Routing. Closely related to LLM-driven orchestration, a line of work explicitly models agent routing as a decision layer that selects appropriate specialists for each query or subtask. For example, AgentRouter [358] proposes a knowledge-graph-guided router that leverages structured task semantics to dispatch questions to relevant agents, enabling effective collaborative question answering without modifying individual agents. Similarly, Talk to Right Specialists [359] frames routing and planning as a unified inference-time process, where a controller dynamically assigns subtasks to domain-specialized agents based on intermediate reasoning states. These approaches highlight that agentic routing itself can be viewed as an inference-time realization of division of labor, where cognition is selectively offloaded to specialized agents.
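The routing idea can be sketched with a trivial keyword router standing in for a learned or LLM-based routing policy; the specialist names and routing table below are invented for illustration.

```python
# Sketch of agent routing as a decision layer: a router inspects the query and
# dispatches it to one of several specialist agents.
from typing import Callable, Dict

def route(query: str, specialists: Dict[str, Callable[[str], str]],
          default: str = "generalist") -> str:
    routing_table = {"code": "coder", "api": "coder",
                     "paper": "researcher", "cite": "researcher",
                     "sql": "data_analyst", "table": "data_analyst"}
    choice = default
    for keyword, agent_name in routing_table.items():
        if keyword in query.lower() and agent_name in specialists:
            choice = agent_name
            break
    return specialists[choice](query)   # cognition is offloaded to the chosen agent

specialists = {
    "generalist":   lambda q: f"[generalist] {q}",
    "coder":        lambda q: f"[coder] {q}",
    "researcher":   lambda q: f"[researcher] {q}",
    "data_analyst": lambda q: f"[data_analyst] {q}",
}
print(route("Write an API client in Python", specialists))  # dispatched to the coder
```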
Theory-of-Mind-Augmented Collaboration.
Another interesting line of research is Theory of Mind (ToM), which refers to the ability of an agent to infer and reason about the beliefs, intentions, and mental states of other agents. [360] first showed that equipping LLM agents with explicit belief-state representations in a cooperative text game improves both collaboration performance and accuracy over ToM-free LLM baselines. Building on this, Hypothetical Minds [361] scaffolds ToM as a modular hypothesis-generation and refinement loop for other agents’ strategies, while MindForge [362] extends ToM-aware reasoning to embodied collaborative learning. In parallel, [363] provides a mechanistic account of how LLMs encode ToM, identifying sparse parameter patterns whose perturbation selectively disrupts social reasoning. Pushing further, ToM-agent [364] augments LLM generative agents with counterfactual reflection over counterparts’ beliefs, and BeliefNet [365] offers a ToM-centric joint-action simulator where embodied agents act based on nested belief states.
5.2.2 Post-training Collaboration
In multi-agent systems, the design of agent prompts (or personas) and the interaction topology plays a critical role in determining the system’s ability to solve complex tasks. Recently, optimizing these components during the post-training phase has emerged as an important research direction. Based on the optimization objective, existing studies can be broadly categorized into two lines of work: prompt optimization and topology optimization.
Multi-agent Prompt Optimization. Prompt optimization in multi-agent systems focuses on how agent roles, workflows, and feedback are encoded in prompts to yield reliable coordination and stronger task performance. For example, AutoAgents [16] extends prompt optimization from single-agent contexts to multi-agent teams, refining role specialization and execution plans through structured dialogue among meta-agents. SPP [18] introduces a cognitive synergist that dynamically selects multiple personas during multi-agent collaboration for knowledge-intensive and reasoning-intensive tasks, enabling complementary expertise to emerge. DSPy Assertions [366] introduces LM Assertions that can be either hard (Assert) or soft (Suggest). When violated, these assertions trigger backtracking and prompt revision using erroneous outputs and error traces. During compilation, the mechanism bootstraps examples and counterexamples to reinforce few-shot prompts, which improves both recall and accuracy. MASS [367] demonstrates that prompts are often the dominant factor in MAS performance, and further applies automatic prompt optimization [368] by incorporating local and global topology information to refine each agent’s prompt in a fine-grained manner.
As for topology optimization, two categories of research have emerged, each pursuing relatively independent optimization pathways. The first category of work treats the multi-agent topology as a communication graph, leveraging graph-based methods to identify an optimal structure that achieves strong performance under constrained communication costs (i.e., limited graph size). The second category adopts a policy-based perspective, where variable training paradigms are employed to learn an agent-selection policy with specially designed rewards or supervision signals. Through iterative, policy-based selection of subsequent agents, these approaches aim to progressively construct topologies that yield optimal overall performance. We discuss these two categories of approaches in greater detail in the following paragraphs.
Graph-based Topology Generation.
A large body of work models multi-agent systems (MAS) as graphs where agents are nodes, and inter-agent communication forms edges. Then MAS design becomes a problem of learning the communication/coordination topology. These works could be roughly divided into three groups as follows.
Graph generation. These methods aim to construct communication topologies from scratch by adaptively generating task-conditioned graphs. GommFormer [369] uses an encoder-decoder framework to learn the communication graph via continuous relaxation of the graph representation, optimizing topology end-to-end under bandwidth constraints. G-designer [370] starts from a task-anchored network with a virtual task node, then uses a variational graph auto-encoder to decode a query-adaptive communication graph. MCGD [371] builds a sparse coordination graph with continuous node and discrete edge attributes, and performs categorical diffusion on edges and anisotropic diffusion on actions to capture structure diversity.
Graph pruning. These works start from dense collaboration graphs and aim to prune them into compact, task-appropriate pipelines while preserving utility and lowering token and compute costs. For example, AgentPrune [372] first formulates the MAS problem as a spatial-temporal graph sparsification problem, and then applies one-shot magnitude pruning to learn a sparse and effective pipeline. AGP [373] learns a dual-pruning policy, i.e., soft-pruning on edges and hard-pruning on nodes, to acquire a per-query topology. G-Safeguard [374] introduces pruning as a security mechanism. It treats communication edges as the search space, employs a graph neural network to identify risky nodes, and applies deterministic rules to prune their outward edges based on a model-driven threshold, thereby defending the system against adversarial attacks.
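A simplified sketch of magnitude-based topology pruning: given learned importance scores for communication edges, only the top fraction is kept. The scores and agent names below are made up for illustration.

```python
# Sketch of one-shot magnitude pruning of a communication graph: edges with
# importance scores outside the keep ratio are dropped, turning a dense
# collaboration graph into a sparse pipeline.
from typing import Dict, Tuple

def prune_topology(edge_scores: Dict[Tuple[str, str], float],
                   keep_ratio: float = 0.5) -> Dict[Tuple[str, str], float]:
    k = max(1, int(len(edge_scores) * keep_ratio))
    kept = sorted(edge_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

dense = {("planner", "coder"): 0.92, ("planner", "critic"): 0.40,
         ("coder", "critic"): 0.85, ("critic", "planner"): 0.15}
print(prune_topology(dense, keep_ratio=0.5))  # keeps the two strongest edges
```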
Topology search. This line of research explores the graph space by searching over agentic operators and communication edges to identify effective pipelines. Specifically, AFlow [304] automates multi-agent workflow design with Monte-Carlo Tree Search over a fixed library of operators. MASS [367] pre-defines influential graph motifs, such as debating and tool-using, implements topology search within this pruned motif subset, and then performs a prompt search on the resulting topology to maximize performance. MaAS [375] replaces single-graph search with a probabilistic “agentic supernet” over layered operator choices and uses a controller to sample a query-conditioned subgraph. DynaSwarm [376] broadens the design space from a single optimized communication graph to a portfolio of candidate structures. It employs Actor–Critic (A2C) optimization to refine this portfolio and introduces a lightweight graph selector that chooses the most suitable topology for each instance. GPTSwarm [68] formulates the search space as inter-agent connections within a computational graph. It relaxes the discrete topology into continuous edge probabilities and leverages reinforcement learning to optimize the resulting connection schemes, thereby enabling flexible and adaptive graph structures.
Policy-based Topology Generation.
A growing line of research strengthens multi-agent pipeline generation by learning the policy of selecting subsequent agents with advanced training paradigms such as supervised fine-tuning (SFT), and reinforcement learning (RL). These approaches embed auxiliary signals into the optimization process, enabling agents to acquire stronger reasoning skills and more reliable coordination. Routing can be viewed as a special case of collaboration, in which a router conditions on task state and system context to learn a policy for selecting agents that maximize efficiency and performance [377, 378, 379, 380]. Broadly, these methods can be grouped into three categories based on the signal type they inject into learning.
Relative-advantage policy learning. Several approaches rely on critic-free objectives to form advantages, thereby avoiding centralized value models while still providing effective guidance for policy optimization. For example, MAGRPO [381] proposes a Dec-POMDP formulation for LLM collaboration and replaces centralized critics with a group-relative advantage signal, enabling decentralized training/execution at dialog-turn granularity. MHGPO [382] extends GRPO-style signals to heterogeneous groups: it jointly optimizes different agent roles via a shared group-relative objective, and introduces practical sampling/optimization tweaks. COPY [26] utilizes a two-agent co-training framework with shared rewards and KL regularization (to a frozen reference and cross-agent policies), improving stability and transfer between pioneer/observer roles on reasoning tasks.
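In simplified form, the critic-free signal shared by these methods normalizes each trajectory's reward against its sampling group, as sketched below; this mirrors GRPO-style advantages without any learned value model.

```python
# Sketch of a group-relative (critic-free) advantage: each sampled trajectory's
# reward is compared against the mean and standard deviation of its own group.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # successes receive positive advantage
```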
LLM-generated prior guidance. Other methods leverage LLMs to generate rewards or priors for learning. Specifically, LGC-MARL [383] uses an LLM to propose a Reward Function Generator (RFG) that turns natural-language objectives into structured reward terms. LAMARL [384] lets an LLM synthesize a prior policy and a task-specific reward function, then fine-tunes agents with RL. MAPoRL [385] defines rewards as weighted sums of LLM verifier scores on current and future turns, then updates policies with multi-agent PPO. COPPER [298] learns a shared reflector with a counterfactual-PPO pipeline in which a learned reward model scores each agent’s reflection by its marginal contribution to task improvement. SIRIUS [386] builds an experience library by retaining trajectories that lead to successful outcomes and augmenting failures, while a Judgment–Critic–Actor triad supplies LLM-generated correctness signals that filter and supervise subsequent fine-tuning across reasoning tasks. Multiagent Finetuning [387] bootstraps reasoning by running multi-agent debates among generator LLMs and using LLM critics plus majority voting to produce self-generated supervisory signals, then fine-tunes role-specialized agents on critic-selected trajectories to improve both accuracy and diversity.
Human preference signals. This line of research replaces or augments environment rewards with human-derived feedback to align behavior with human intent, in both online and offline regimes. For instance, M3HF [388] organizes human input into multi-phase feedback (e.g., scalar ratings, pairwise comparisons, and natural-language rationales) processed by LLMs into reward shaping signals. O-MAPL [389] introduces an end-to-end preference-based learning framework and directly learns Q-values from offline preference data, bypassing the two-stage reward-model-then-RL pipeline.
5.3 Multi-Agent Evolution
While self-evolving agents enable individual models to continuously improve through interaction and feedback, many real-world applications require collective intelligence supported by cooperation among multiple agents. Therefore, recent studies extend self-evolution from single-agent settings including planning, tool-use, and search evolution [14, 314, 242, 316, 36] to multi-agent co-evolution, where adaptation emerges across distributed agents [304, 390, 391, 392, 393]. Beyond evolving model parameters, memory, prompts, and tools [12, 394, 395, 396], multi-agent evolution further targets shared memory, communication mechanisms, and collaboration protocols [367, 393, 270].
As a result, multi-agent memory must jointly evolve along architecture, topology, content, and management dimensions, supported by hierarchy-structured, role-aware architectures [397], governed and distributed storage topologies [398, 399], modular and task-structured memory contents [300, 400], and active management mechanisms for compression, verification, and continual updating [401, 402] to ensure coherent and scalable collaboration.
The goal thus shifts from optimizing a single agent’s capability to improving the collective performance of multiple agents on complex, long-horizon tasks [304, 390, 391, 70].
5.3.1 From Single-Agent Evolution to Multi-Agent Evolution
While the shift from single-agent evolution to multi-agent co-evolution broadens the spatial dimension of adaptation from an individual model to a collective, the temporal dimension of evolution remains equally crucial. Beyond determining who evolves (a single agent or a population), recent studies also investigate when and how fast agents should adapt during interaction. This perspective leads to a complementary axis of analysis that distinguishes short-horizon, within-episode updates from long-term, cross-episode improvements, commonly referred to as intra-test-time evolution and inter-test-time evolution. We summarize these temporal modes of self-evolving behavior.
Intra-test-time evolution refers to the ability of agents to adapt and improve during task execution, enabling them to correct failures and refine strategies on the fly when facing unseen states or unexpected feedback. Unlike static inference pipelines, this paradigm embeds self-reflection, dynamic planning, memory rewriting, or even localized fine-tuning into the execution loop. Representative works leverage natural-language self-critique [14, 242] and runtime adaptive planning [314, 403] to generate corrective signals without external supervision. Reflexion [14] allows agents to store distilled reflective feedback for immediate behavior improvement, while AdaPlanner [314] dynamically revises and replans mid-trajectory based on environmental mismatch detection. Beyond contextual adaptation, methods such as test-time supervised updating [404] and test-time reinforcement learning (TTRL) [405, 406] directly modify model behavior when encountering difficult cases, often through problem-variant generation and targeted optimization. These approaches demonstrate that performance at inference time can improve within a single episode, forming short-horizon adaptation loops where the model learns while solving, rather than merely executing a fixed policy.
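As a minimal illustration of such a within-episode loop, the sketch below follows the Reflexion-style attempt, evaluate, reflect, retry pattern; `attempt`, `evaluate`, and `reflect` are hypothetical stubs standing in for LLM and environment calls, not code from the cited systems.

```python
# Minimal sketch of an intra-test-time (within-episode) adaptation loop in the
# spirit of Reflexion-style self-critique: attempt, evaluate, reflect, retry.
# All three helper functions are toy stand-ins for LLM/environment calls.

def attempt(task: str, reflections: list[str]) -> str:
    return f"answer to '{task}' given {len(reflections)} reflections"

def evaluate(task: str, answer: str) -> bool:
    return "3 reflections" in answer  # toy success criterion

def reflect(task: str, answer: str) -> str:
    return f"previous attempt '{answer}' failed; adjust strategy"

def solve_with_reflection(task: str, max_tries: int = 4) -> str:
    reflections: list[str] = []   # short-horizon memory, discarded after the episode
    for _ in range(max_tries):
        answer = attempt(task, reflections)
        if evaluate(task, answer):
            return answer
        reflections.append(reflect(task, answer))  # corrective signal for the next try
    return answer

if __name__ == "__main__":
    print(solve_with_reflection("sort a list without built-ins"))
```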
Inter-test-time evolution extends the self-improving process to cross-task learning, where adaptations made in one task can be consolidated and transferred to future tasks. This enables the accumulation of persistent, generalizable capabilities over a lifelong interaction stream. A prominent paradigm involves offline self-distillation, where the agent generates responses and then refines them via self-evaluation before using them for supervised fine-tuning, as in SELF [309], STaR [407], and Quiet-STaR [408]. These methods turn incorrect initial reasoning into high-quality labeled data for future performance gains. Additionally, online reinforcement learning frameworks such as RAGEN [316] and DYSTIL [317] continuously update policies based on dense interaction feedback, allowing agents to gradually internalize complex decision-making strategies over long horizons. Inter-test-time evolution can also incorporate curriculum mechanisms that automatically adjust task difficulty and environment complexity [409, 410], as well as experience structuring via memory evolution to preserve accumulated reasoning heuristics [411, 412, 270]. This temporal mode focuses on stable long-term improvement, transforming short-lived corrections from individual tasks into continual competence growth across diverse task distributions.
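The sketch below gives a minimal STaR-style self-distillation loop under stated assumptions: `generate_rationale` is a toy stub in place of an LLM, sampled rationales are kept only when they reach the known answer, and the survivors form a fine-tuning set.

```python
# Minimal sketch of inter-test-time evolution via STaR-style self-distillation:
# sample rationales, keep only those that reach the known answer, and reuse the
# survivors as fine-tuning data. `generate_rationale` is a hypothetical LLM stub.

import random

def generate_rationale(question: str) -> tuple[str, str]:
    """Placeholder: returns (rationale, predicted_answer)."""
    ans = random.choice(["4", "5"])
    return f"step-by-step reasoning for {question}", ans

def build_finetune_set(dataset, samples_per_q: int = 4):
    kept = []
    for question, gold in dataset:
        for _ in range(samples_per_q):
            rationale, pred = generate_rationale(question)
            if pred == gold:  # keep only verified reasoning traces
                kept.append({"prompt": question, "target": rationale + " => " + pred})
                break
    return kept

if __name__ == "__main__":
    data = [("2+2?", "4"), ("3+2?", "5")]
    print(build_finetune_set(data))
```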
To support these new capabilities, mechanisms evolve from individual reward-based or reflective adaptation [309, 310, 311] to multi-agent reinforcement learning and game-theoretic co-optimization [391, 392], enabling collaborative structures to self-organize under evolving task requirements. Moreover, memory-driven multi-agent evolution (e.g., shared workflow memory or knowledge graphs) helps maintain accumulative group intelligence across episodes [270, 13]. Overall, multi-agent evolution transforms isolated self-improvement loops into adaptive intelligent ecosystems capable of self-correction, self-organization, and social learning. This transition marks a critical step toward artificial collective intelligence, where cooperative dynamics drive continuous progress beyond the capabilities of any individual agent [367, 304, 390, 413].
5.3.2 Multi-Agent Memory Management for Evolution
Multi-agent LLM systems pose unique challenges for memory design compared with single-agent settings. Beyond maintaining an individual agent's local context, they must capture inter-agent interactions, track roles and dependencies over time, and preserve both shared and private knowledge coherently. Memory must also remain scalable as collaboration grows and interactions accumulate. To provide a clearer understanding of this landscape, we categorize existing approaches along four key dimensions: (1) architecture, how memory is organized within and across agents; (2) topology, whether it is centralized, distributed, or hybrid; (3) content, the type and structure of stored knowledge; and (4) management, how memory is written, retrieved, and updated over time. Illustrations are shown in Figure 10.
Architecture Dimension: Hierarchical and Heterogeneous Designs.
Recent work highlights that prevailing multi-agent memory mechanisms are overly simplistic and lack per-agent customization [397]. To address this, G-Memory constructs a three-tier graph hierarchy (insight, query, and interaction graphs) that separates high-level generalizable insights from fine-grained execution traces. This hierarchical approach enables bi-directional memory traversal for retrieving both abstract lessons and concrete precedents across episodes. Intrinsic Memory Agents, by contrast, avoids global aggregation and instead maintains dedicated role-aligned memory templates for each agent [414]. This heterogeneous approach preserves specialized perspectives on collaborative planning benchmarks by reducing irrelevant information per agent. Recent work further explores hybrid strategies, with some systems employing adaptive hierarchical knowledge graphs in decentralized architectures that allow agents to reason over past interactions and share only relevant information rather than raw experiences [415]. These contrasting approaches reveal a fundamental trade-off: hierarchical designs optimize for global coherence and cross-episode learning, while heterogeneous designs optimize for role fidelity and computational efficiency.
Storage Topology and Memory Governance.
Systems employ different topologies to balance scalability, privacy, and coherence, each reflecting different assumptions about trust and coordination. SEDM (Self-Evolving Distributed Memory) [398] tackles memory management by turning memory into an active, self-optimizing component through verifiable write admission (via reproducible replay) and utility-based consolidation. This centralized approach with verification gates ensures that only factual or useful information enters the repository and performs cross-domain knowledge diffusion to enable transfer across heterogeneous tasks. In contrast, when privacy and organizational boundaries matter, Collaborative Memory [399] distinguishes private versus shared memory fragments using bipartite graph policies. Every entry carries immutable provenance (source agent, accessed resources, timestamp), enabling compliance auditing and safe cross-agent knowledge transfer in federated systems. At the other end of the spectrum, some systems like Memory Sharing [416] adopt uncontrolled pooling where all agents freely exchange experiences in a shared memory pool. Research shows that memory sharing among LLM agents leads to a more diverse collective memory pool, which improves performance on open-ended tasks by creating emergent collective intelligence. These three topologies represent increasing levels of formality and control, reflecting different priorities for managing the trade-off between knowledge diversity and verification rigor.
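The following sketch shows one way such governed, provenance-carrying memory could look; the schema and field names are illustrative assumptions, not the data model of Collaborative Memory or SEDM.

```python
# Minimal sketch of a governed shared-memory entry: immutable provenance plus a
# visibility flag, loosely inspired by the private-vs-shared distinction above.
# Field names are illustrative, not taken from any cited system.

from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class MemoryEntry:
    content: str
    source_agent: str
    resources: tuple              # resources consulted when the entry was written
    visibility: str = "private"   # "private" or "shared"
    timestamp: float = field(default_factory=time.time)

class CollaborativeStore:
    def __init__(self):
        self._entries: list[MemoryEntry] = []

    def write(self, entry: MemoryEntry):
        self._entries.append(entry)    # provenance is fixed at write time

    def read(self, agent: str):
        return [e for e in self._entries
                if e.visibility == "shared" or e.source_agent == agent]

if __name__ == "__main__":
    store = CollaborativeStore()
    store.write(MemoryEntry("API rate limit is 60/min", "agent_a", ("docs",), "shared"))
    store.write(MemoryEntry("my draft plan", "agent_a", ()))
    print([e.content for e in store.read("agent_b")])   # only the shared entry
```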
Memory Content: Semantic, Task, and Cognitive-Phase Decomposition.
Different content decomposition strategies suit different task characteristics, and the choice of content structure fundamentally shapes how agents interact with memory. MIRIX [300] pioneered semantic decomposition by defining six specialized memory types (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) managed by distinct agents, achieving a 35% accuracy gain on multimodal QA tasks while reducing storage through flexible routing. Building on this modular principle, LEGOMem [400] instead employs task-based decomposition, breaking execution traces into reusable memory units flexibly assigned to either central planners or specialist task agents. This design shows that orchestrator memory improves task decomposition and delegation, while agent memory enhances subtask execution, effectively narrowing performance gaps between small and large LLM teams. Recently, MAPLE introduced Cognitive-phase Decomposition [417], using specialized agents (Solver, Checker, Reflector, Archiver) to enable systematic error detection and plan repair cycles. The Reflector diagnoses errors after each episode, and the Archiver stores refined plans to avoid repeated mistakes, supporting feedback-driven learning. These three content decomposition strategies reveal that memory design should align with task structure: semantic content for heterogeneous information, task-based for workflow automation, and cognitive-phase for error-sensitive reasoning.
Memory Management Strategies.
Effective long-term memory requires active management balancing relevance, efficiency, and coherence through different approaches that trade off simplicity against sophistication. Lyfe Agents [401] pioneered the forgetting-based approach using Summarize-and-Forget mechanisms to regularly compress memory, retaining only critical context. This strategy is suitable when storage is severely constrained, though it risks losing nuanced details for edge cases. To improve upon simple forgetting, AGENT-KB [402] introduced more sophisticated management by organizing procedural traces into structured (entity, action, observation) triples and learning pattern abstractions reusable across tasks. Agents collaborate to retrieve, update, and reason over memory segments, enabling generalization without explicit retraining while central coordination ensures long-term consistency for scalable embodied planning. The choice among these strategies depends on system priorities: forgetting prioritizes storage efficiency, verification prioritizes reliability, and learning-based approaches prioritize adaptability. Production systems typically combine strategies, e.g., verification for critical memories and forgetting for low-utility peripheral information, to balance multiple objectives.
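As a minimal, hedged illustration of the summarize-and-forget pattern, the sketch below compresses the oldest entries into a single summary once a budget is exceeded; `summarize` is a placeholder for an LLM summarizer and the parameters are arbitrary.

```python
# Minimal sketch of a summarize-and-forget policy: when the buffer exceeds a
# budget, compress the oldest entries into one summary and drop them.
# `summarize` is a hypothetical stand-in for an LLM summarizer.

def summarize(entries: list[str]) -> str:
    return "summary(" + "; ".join(e[:20] for e in entries) + ")"

class CompressingMemory:
    def __init__(self, budget: int = 6, chunk: int = 4):
        self.items: list[str] = []
        self.budget, self.chunk = budget, chunk

    def add(self, item: str):
        self.items.append(item)
        if len(self.items) > self.budget:
            old, self.items = self.items[: self.chunk], self.items[self.chunk:]
            self.items.insert(0, summarize(old))   # keep the compressed gist only

if __name__ == "__main__":
    m = CompressingMemory()
    for i in range(10):
        m.add(f"observation {i}: something happened")
    print(m.items)
```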
Discussions.
Despite substantial progress, multi-agent memory systems remain largely unexplored with respect to post-training and model adaptation. Current approaches focus primarily on memory organization and retrieval for pre-trained models, with little investigation into how multiple agents can jointly optimize their memories through post-training procedures such as reinforcement learning or supervised fine-tuning. This represents a notable gap: while post-training techniques have been actively explored for single-agent memory systems, extending them to enable multi-agent teams to co-evolve their memory structures and management policies remains an open problem.
5.3.3 Training Multi-Agent Systems to Evolve
Recent advancements have shifted multi-agent systems from fixed, hand-designed coordination toward training paradigms that enable agents to evolve over time [26, 418, 386]. Training multi-agent systems to evolve represents a critical step toward realizing adaptive, long-horizon intelligence beyond static coordination. In this emerging paradigm, agents improve collectively through interaction, feedback, and shared memory, rather than isolated or independently optimized behaviors. By embedding reasoning into the learning loop, via reinforcement learning [419], self-play [420], curriculum evolution [385], and verifier-driven feedback [421], multi-agent systems can internalize coordination strategies, address inter-agent credit assignment, and progressively refine divisions of labor. This evolution transforms multi-agent reasoning from a static ensemble of cooperating LLMs into a self-improving organization that adapts its structure, communication patterns, and policies in response to task complexity and environmental change [422].
Co-evolution via Interaction and Intrinsic Feedback.
A growing body of work has operationalized multi-agent evolution through explicit training objectives that couple interaction, feedback, and role specialization. For instance, Multi-Agent Evolve [418] instantiates a closed-loop co-evolution framework containing three interacting roles (Proposer, Solver, and Judge), all of which are derived from a shared LLM backbone and jointly optimized via reinforcement learning. This forms a self-improving curriculum that enables collective skill growth without external supervision. In a related spirit, CoMAS [423] emphasizes intrinsic interaction rewards, extracting learning signals directly from multi-agent discussion dynamics through an LLM-based judge, thereby enabling decentralized co-evolution driven purely by collaborative interaction.
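A minimal sketch of such a Proposer-Solver-Judge loop is given below; all three roles are toy stubs rather than calls to a shared LLM backbone, and the difficulty update is a simplistic stand-in for the learned curriculum described in these works.

```python
# Minimal sketch of a Proposer-Solver-Judge self-improvement loop. The roles are
# toy functions standing in for LLM calls; collected (problem, reward) records
# would feed a reinforcement learning update in a real system.

import random

def proposer(difficulty: int) -> str:
    a, b = random.randint(1, 10 ** difficulty), random.randint(1, 10 ** difficulty)
    return f"{a}+{b}"

def solver(problem: str) -> int:
    a, b = map(int, problem.split("+"))
    return a + b if random.random() > 0.2 else a   # occasionally wrong, on purpose

def judge(problem: str, answer: int) -> float:
    a, b = map(int, problem.split("+"))
    return 1.0 if answer == a + b else 0.0

def co_evolution_round(difficulty: int, n: int = 8):
    records = [(p, judge(p, solver(p))) for p in (proposer(difficulty) for _ in range(n))]
    success = sum(r for _, r in records) / n
    # raise difficulty when the solver masters the current level (toy curriculum)
    return records, difficulty + 1 if success > 0.9 else difficulty

if __name__ == "__main__":
    d = 1
    for _ in range(3):
        recs, d = co_evolution_round(d)
        print(f"difficulty={d}, mean reward={sum(r for _, r in recs) / len(recs):.2f}")
```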
Multi-Agent Reinforcement Fine-Tuning for Collective Adaptation.
Additional works have focused on principled reinforcement fine-tuning frameworks tailored to LLM-based multi-agent systems. For example, MARFT [419] formalizes multi-agent reinforcement fine-tuning by highlighting key mismatches between classical MARL assumptions and LLM-based agent organizations, such as role heterogeneity, dynamic coordination, and long-horizon dialogue, and provides a systematic framework for stabilizing collective post-training. Stronger-MAS [420] further adapts on-policy reinforcement learning to multi-role, multi-turn settings by introducing agent- and turn-wise grouping strategies that extend GRPO-style optimization, enabling more effective coordination learning across complex agent workflows. Similarly, MAPoRL [385] proposes multi-agent post-co-training, where multiple LLMs are jointly optimized using a collaboration-aware verifier that rewards not only final outcomes but also the quality of intermediate discussions, encouraging the emergence of transferable communication strategies.
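For reference, the sketch below shows a generic group-relative (GRPO-style) advantage computation, where each sampled response is normalized against its own group; this is a simplified illustration, not the exact objective used by the systems above.

```python
# Minimal sketch of group-relative (GRPO-style) advantages: each sampled response
# is scored against the mean and spread of its own group, so no learned value
# function is needed. Generic illustration only.

from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # rewards for 4 responses sampled for the same prompt (or the same turn/agent group)
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```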
Role Specialization and Joint Credit Assignment.
Other approaches have explored structured role specialization and joint credit assignment. MALT [424] trains sequential pipelines of heterogeneous agents using trajectory expansion and outcome-based reinforcement signals, allowing each agent to improve its specialized function while optimizing end-to-end collaborative performance. MARS [425] extends this idea to long-horizon research settings by jointly training complementary System 1 (fast, intuitive) and System 2 (deliberate, tool-using) agents via multi-agent reinforcement learning, enabling adaptive division of labor under complex tool interactions.
Preference- and Alignment-Driven Multi-Agent Evolution.
Finally, another line of work has studied evolution under preference- and alignment-driven objectives. Preference-based multi-agent reinforcement learning [421] studies how collective policies and equilibria can be learned from preference-only feedback, addressing data coverage and stability challenges inherent in multi-agent settings. From a safety perspective, Alignment Waltz [426] frames alignment as a cooperative co-evolution process between a generation agent and a feedback agent, where evolving guidance enables the system to iteratively refine unsafe or unhelpful behaviors. Collectively, these methods demonstrate how embedding reinforcement learning, co-evolution, and verifier-driven feedback into multi-agent training enables LLM-based systems to evolve from static collaborations into adaptive, self-improving organizations.
6. Applications
This section traces how agentic reasoning moves from core LLM capabilities to practical applications across math exploration, scientific discovery, robotics, healthcare, and web research, addressing the limitations of static benchmarks through dynamic, goal-directed intelligence. It organizes systems via a three-layer taxonomy: foundational skills such as planning, tool use, and search; self-evolving mechanisms including feedback, reflection, and memory for iterative refinement; and collective multi-agent collaboration for specialized division of labor. This framework reveals how agents adapt to domain constraints, from mathematical conjecture generation and vibe coding to autonomous experiments and diagnostics, ultimately enabling creative, resilient problem-solving that accumulates competence over long horizons and fosters artificial collective intelligence.
Building on the established three-layer taxonomy (i.e. foundational, self-evolving and collective reasoning) mentioned in previous sections, we now examine how these capabilities manifest across real-world applications. This section surveys representative reasoning-empowered agentic systems across several key domains, as illustrated in Figure 11, including math exploration and vibe coding (Section 6.1), scientific discovery (Section 6.2), robotics (Section 6.3), healthcare (Section 6.4), and autonomous web exploration and research (Section 6.5). Specifically, each domain exhibits distinctive forms of reasoning, influenced by its data modalities and environmental constraints. Accordingly, our discussion in each subsection is organized around three layers: (1) core abilities such as planning, tool use and search that span scientific hypothesis generation, embodied control, medical reasoning, automated experimentation and symbolic problem solving, for example; (2) self-evolving abilities that integrate feedback, reflection and memory modules which refine domain-specific competence through iterative experiment loops, lifelong skill learning and clinical adaptation; and (3) collective multi-agent reasoning that enables collaboration and specialization from cooperative scientific assistants to coordinated robotic teams, diagnostic ensembles or multi-aspect experts. This section highlights how agentic reasoning frameworks adapt to domain-specific knowledge structures and tasks, illustrating the transition from traditional LLM reasoning to goal-directed, domain-aware and active agentic intelligence.
6.1 Math Exploration & Vibe Coding Agents
Mathematics and code have traditionally served as two of the most widely used domains for evaluating reasoning in artificial intelligence, as both require structured symbolic manipulation and precise multi-step deduction. Traditional benchmark-driven evaluation in these domains is showing clear limitations. Widely used math datasets such as GSM8K [427], MATH [428], and AIME [429] are increasingly saturated, which makes it difficult to distinguish among modern high-performing models. The problems in these datasets often rely on a small set of recurring techniques and do not require the sustained and exploratory reasoning needed to assess more advanced mathematical capabilities. Even recent evaluations such as FrontierMath [430] continue to emphasize final-answer accuracy, which offers only a partial view of an agent’s reasoning process and its ability to adjust strategies during problem solving.
Under the agentic reasoning paradigm, however, both areas are undergoing a substantial shift from static problem solving to dynamic processes that emphasize exploration, adaptation, and collaboration. In mathematics, recent systems ([70, 431, 29, 30]) demonstrate that agents can engage in competition-level reasoning, building on the success of LLMs in coding tasks. Work in foundational mathematics ([432, 433]) further shows that agents can search for new problems, propose conjectures, construct auxiliary lemmas, and explore deeper structures in mathematical concepts. These developments position mathematics not merely as an evaluation benchmark but as a domain of active mathematical exploration.
Large Language Models have also reshaped coding through the emerging workflow known as agentic coding and vibe coding ([32, 434]). In this paradigm, the model acts as an interactive collaborator that engages in multi-turn natural-language dialogue. Users iteratively design and refine programs while the agent maintains context, adapts to evolving requirements, and continuously self-corrects. Modern tools such as Copilot and Cursor have further popularized this collaborative workflow, making interactive programming a common practice in real-world software development.
In this section, we organize our discussion according to the three-layer framework introduced earlier. The foundational layer (Section 6.1.1) concerns the core reasoning and execution skills: mathematical agents perform symbolic manipulations and step-by-step derivations across arithmetic, algebra, geometry, and calculus, while code agents carry out syntax-aware generation, implement functions, and verify correctness through interpreter or compiler feedback. The self-evolving layer (Section 6.1.2) introduces mechanisms for reflection and adaptation. Mathematical agents learn from intermediate reasoning traces to correct missteps or explore alternative solution paths, and code agents iteratively debug, refine, and optimize implementations based on runtime feedback or test results. The collective layer (Section 6.1.3) focuses on collaboration, where agents exchange intermediate results, share reusable modules, and jointly develop complex proofs or codebases. Taken together, these layers reveal how mathematics and coding are becoming domains in which agentic reasoning enables increasingly creative and adaptive problem solving.
6.1.1 Foundational agentic reasoning
Planning.
Explicit planning is widely recognized as a core mechanism for enhancing the structured reasoning capabilities of LLMs. In the domain of mathematical discovery, several systems exhibit structures that can be interpreted as forms of planning. In representation theory and knot theory, the system of [435] guides human mathematicians by proposing intermediate objects and promising avenues of exploration, which function as high-level suggestions for organizing problem-solving workflows. In geometric reasoning, [29] solves Olympiad-level geometry problems by decomposing them into sequential stages of construction, lemma generation, and verification, yielding a structured multi-step process that resembles a planned reasoning trajectory. Program-search approaches ([30]) iteratively refine candidate programs and mathematical structures, a procedure that naturally forms a coarse-to-fine exploration path. Large-scale exploration frameworks ([433, 432]) also operate through cycles of proposing, testing, and modifying conjectures or geometric objects, which collectively create a procedural structure aligned with planning. Efforts toward more robust mathematical reasoning ([431]) similarly rely on stepwise reasoning patterns, further reinforcing the presence of implicit planning dynamics across mathematical agents.
In code agents, planning has likewise emerged as an essential component for organizing multi-step reasoning and enabling more structured decision-making. Early systems such as CodeChain ([436]) and CodeAct ([99]) introduce explicit planning or action spaces to support modular code construction, while KareCoder ([437]) integrates external knowledge sources or domain-specific information into the planning process. Subsequent works explore more structured planning organizations, including multi-stage control flows ([438, 439]), tree-shaped planning structures ([440, 22]), and adaptive refinement mechanisms ([441]). Planning has also been linked to improved exploration breadth: GIF-MCTS ([442]) incorporates Monte Carlo Tree Search to explore multiple code-generation trajectories. Recent extensions demonstrate applicability in specialized domains such as hardware design, where VerilogCoder ([443]) employs graph-structured planning and waveform-based verification. To address environments where state serialization is difficult, Guided Search ([444]) introduces lookahead and trajectory selection strategies for evaluating candidate actions without full environment access.
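To ground the search-based planning pattern, the sketch below runs a simplified candidate-expansion loop scored by executing toy tests; `propose_fix` stands in for an LLM mutation step, and no claim is made that this mirrors GIF-MCTS or any specific cited system.

```python
# Minimal sketch of search over candidate programs scored by executing tests,
# a simplified stand-in for tree/MCTS-style exploration of code-generation paths.

def run_tests(code: str) -> float:
    """Fraction of toy tests passed by exec-ing the candidate (sandboxing omitted)."""
    env: dict = {}
    try:
        exec(code, env)
        tests = [env["add"](1, 2) == 3, env["add"](-1, 1) == 0]
        return sum(tests) / len(tests)
    except Exception:
        return 0.0

def propose_fix(code: str) -> str:
    """Hypothetical LLM mutation step; here a trivial string edit."""
    return code.replace("a - b", "a + b")

def search(seed: str, depth: int = 3) -> str:
    best, best_score = seed, run_tests(seed)
    frontier = [seed]
    for _ in range(depth):
        frontier = [propose_fix(c) for c in frontier]
        for cand in frontier:
            score = run_tests(cand)
            if score > best_score:
                best, best_score = cand, score
    return best

if __name__ == "__main__":
    print(search("def add(a, b):\n    return a - b"))
```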
Tool-Use.
Integrating external computational tools with LLMs has become a central mechanism for extending the reasoning and generation capabilities of single-agent systems. A defining characteristic of many mathematical reasoning systems is their integration with external computational tools. Formal theorem-proving agents such as [445] operate directly within the Lean proof assistant, selecting tactics and interacting with the underlying prover through in-context guidance. Position papers on formal mathematical reasoning ([446]) emphasize that progress in mathematical AI will depend on systems that can call theorem provers, satisfiability solvers, and computer algebra systems as part of a broader reasoning loop. Program-search frameworks for discovery ([30]) rely on executing generated programs and employing symbolic routines for verification. Generative modelling approaches ([447]) make use of computational number-theoretic tools to check and filter generated candidates. Geometry-focused systems ([29, 448]) integrate automated geometric solvers and checkers to validate constructions and derived relations. Across these systems, external computational resources play a central role in enabling correct and scalable mathematical reasoning.
In code agents, external tools have similarly become crucial for extending the capabilities of LLM-based agents beyond pure text generation. Early work such as Toolformer ([6]) and ToolCoder ([449]) explored how models can learn to invoke APIs or search tools to obtain missing information during generation. Subsequent systems integrate increasingly rich toolchains: ToolGen ([450]) leverages automatic completion tools to resolve undefined dependencies, while CodeAgent ([451]) incorporates multiple programming utilities including search, documentation reading, symbol navigation, and code execution to support more realistic software workflows. Several methods focus on improving tool-feedback loops, such as ROCODE ([452]), which combines real-time error detection with adaptive backtracking, and CodeTool ([453]), which introduces process-level supervision to improve the reliability of tool invocation. Collectively, these systems show that tool integration provides essential external signals, via search results, documentation, static analysis, or execution feedback, that extend the reasoning and generation capabilities of single-agent LLMs.
Search and Retrieval.
Search and retrieval has emerged as a complementary mechanism that enriches model contexts through external information sources. Search is a recurring mechanism in mathematical discovery. Program-search based systems ([30]) treat mathematical discovery as navigating a program space in which candidate programs encode conjectures or structural hypotheses, with iterative filtering based on symbolic or numerical checks. Generative modelling approaches ([447]) explore families of mathematical objects by sampling from flexible distributions that capture structural regularities. Geometric systems such as [29] and [432] search over constructions, configurations, and high-dimensional polytopes, guided by learned heuristics or structural constraints. Large-scale discovery frameworks ([433]) operate through repeated propose–test–refine cycles across conjectures, supporting wide exploration over mathematical landscapes. All these systems rely on systematic search procedures that structure the exploration of mathematical ideas.
In code generation, repository-level retrieval systems such as RepoHyper ([454]) locate reusable code segments from large-scale code bases to provide more informative contexts for generation. CodeNav ([79]) dynamically indexes real repositories during generation, retrieving relevant functions and adjusting based on execution feedback. AUTOPATCH ([455]) applies retrieval to performance optimization, combining historical code examples with control flow graph analysis for context-aware improvements. Structure-aware retrieval has also been explored: knowledge-graph-based repository representations ([456]) improve retrieval quality by capturing symbolic and relational structure, while cAST ([457]) introduces AST-based chunking to enhance syntactic coherence and retrieval granularity. These retrieval methods demonstrate how external knowledge sources can augment single-agent LLMs by providing high-quality, structured contexts that guide both understanding and generation.
6.1.2 Self-evolving agentic reasoning
Agentic Feedback and Reflection.
Across mathematical and code reasoning tasks, feedback operates as an external signal that highlights discrepancies, confirms correct inferences, and directs the agent toward more reliable subsequent computations. Feedback mechanisms appear prominently across mathematical discovery systems. In program-search based discovery ([30]), executing candidate programs and evaluating their outputs against constraints yields counterexamples or confirmations, enabling iterative refinement of conjectures. In geometry, automated checkers validate constructions and derived relationships ([29, 448]), providing correctness signals that guide subsequent revisions. Interactive evaluation frameworks ([458]) show that human clarifications and follow-up prompts expose reasoning errors and improve model responses. Position work on formal reasoning ([446]) highlights verification, proof checking, and model checking as essential sources of structured feedback. In several systems involving multiple candidate hypotheses ([30, 447, 433]), the use of verification signals to retain promising candidates functions analogously to a fitness-based evaluation step, since these signals determine which hypotheses survive and which are discarded, thereby shaping the direction of subsequent exploration without introducing an explicit learning signal.
For code agents, feedback and reflection are central to improving reliability over multi-step reasoning. Fault-aware editing methods such as Self-Edit ([459]) incorporate execution-based signals to refine erroneous code, while Self-Repair ([460]) integrates code and feedback models to diagnose test failures and propose targeted corrections. More structured systems like LeDeX ([461]) combine stepwise annotation, execution-driven verification, and automated repair into a closed-loop pipeline in which feedback continually informs the next revision. Reflection also functions as a form of memory: iterative self-improvement frameworks such as Self-Refine ([242]), Self-Iteration ([462]), and Self-Debug ([463]) reuse earlier drafts, analyses, and explanations to guide subsequent revisions, while artifact-level mechanisms such as CodeChain ([436]) and LeDeX ([461]) retain reusable components, corrected snippets, and execution traces as persistent representations. Together, these approaches demonstrate how feedback—whether symbolic, execution-based, or self-generated—interacts with iterative memory to support structured refinement and long-horizon improvement in code-oriented agentic systems.
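The sketch below illustrates the generic execution-feedback repair loop under stated assumptions: `repair` is a placeholder for an LLM call conditioned on the error message and prior drafts, and the test itself is a toy assertion.

```python
# Minimal sketch of an execution-feedback repair loop: run the candidate, capture
# the failure, and feed it back to a (stubbed) repair model together with earlier
# drafts. Names and logic are illustrative only.

from typing import Optional

def execute(code: str) -> Optional[str]:
    """Return an error message, or None on success (sandboxing omitted)."""
    try:
        env: dict = {}
        exec(code, env)
        assert env["mean"]([1, 2, 3]) == 2
        return None
    except Exception as e:
        return repr(e)

def repair(code: str, error: str, history: list[str]) -> str:
    # placeholder for an LLM call conditioned on the error and prior drafts
    return "def mean(xs):\n    return sum(xs) / len(xs)"

def self_repair(draft: str, max_rounds: int = 3) -> str:
    history: list[str] = []
    for _ in range(max_rounds):
        error = execute(draft)
        if error is None:
            return draft
        history.append(draft)               # reflection-as-memory: keep failed drafts
        draft = repair(draft, error, history)
    return draft

if __name__ == "__main__":
    print(self_repair("def mean(xs):\n    return sum(xs)"))
```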
Memory.
Memory provides agents with a mechanism for retaining and leveraging information from earlier reasoning steps, allowing them to maintain consistency, improve intermediate states, and improve their performance over extended problem-solving horizons. While few systems introduce an explicit memory module, many mathematical agents rely on forms of persistent state that can be viewed as implicit memory. Interactive evaluation frameworks ([458]) maintain conversational and problem-state context across multiple turns, allowing models to build upon earlier partial derivations. Formal-theorem-proving agents ([445]) operate over evolving proof states in Lean, which accumulate tactics, subgoals, and intermediate lemmas, functioning as structured persistent information. Program-search and discovery systems ([30, 433]) retain conjecture histories, counterexamples, and successful constructions as part of their iterative refinement processes. Their role in preserving and reusing information across reasoning steps aligns with the broader notion of memory in agentic systems.
In code agents, memory increasingly takes the form of explicit structures that maintain coherence over long-horizon generation. Several systems construct shared or structured workspaces: Self-Collaboration ([464]) introduces a blackboard memory for storing task descriptions, intermediate drafts, and revision records, enabling agents to coordinate through a common representation. Architectural approaches such as L2MAC ([465]) and Cogito ([466]) extend this idea by organizing context into dedicated registers, hierarchical memory units, or long-term knowledge stores, overcoming context-window limits and supporting multi-file or large-function reasoning. Across these designs, the underlying insight is consistent: effective code agents require persistent, structured, and often domain-aware memory that preserves intermediate reasoning and enables self-improvement across extended development trajectories.
6.1.3 Collective multi-agent reasoning
To address the growing complexity of tasks in mathematical discovery and code generation, recent systems increasingly rely on multi-agent or modular designs that decompose problems into cooperating specialized components. Mathematical discovery frameworks often organize reasoning into explicitly defined multi-agent or multi-component workflows that collaborate to explore and validate mathematical ideas. The polytope-generation system ([432]) uses multiple specialized components that generate, evaluate, and refine geometric objects, forming a genuine collaborative workflow. Large-scale exploration frameworks ([433]) often divide discovery into modules for proposing conjectures, identifying counterexamples, and refining statements, which, although implemented within a unified system, mirror multi-agent role specialization. Early work on AI-assisted mathematical research ([435]) and Olympiad-level systems ([448]) also involve human–AI collaboration, where human mathematicians interact with AI systems in a complementary manner. These developments indicate that mathematical discovery is an inherently collaborative process, and multi-agent architectures provide a natural vehicle for expressing such collaboration in agentic systems.
Multi-agent systems for code generation have progressed from simple role-based pipelines to adaptive, collaborative frameworks capable of handling long-horizon software development. Early approaches such as Self-Collaboration ([464]) and AgentCoder ([467]) decompose tasks into sequential roles, while hierarchical designs like PairCoder ([468]) and FlowGen ([469]) introduce an architecture in which high-level agents handle planning and lower-level agents carry out concrete implementation. Flexible systems such as SoA ([470]) further adjust the number and specialization of agents in response to task complexity. Other frameworks, including MapCoder ([471]), AutoSafeCoder ([472]), and QualityFlow ([473]), rely on repeated cycles in which multiple agents generate, test, analyze, and repair code. Recent work explores self-evolving system structures, as in SEW ([474]), which reorganizes collaboration pathways based on runtime feedback, and EvoMAC ([323]), which adjusts agent strategies through an iterative text-based update mechanism. Collaborative optimization methods such as Lingma SWE-GPT ([475]), CodeCoR ([476]), SyncMind ([477]), and CANDOR ([478]) explicitly improve cross-agent coordination. Together, these systems show a clear shift toward multi-agent code generators that rely not only on role decomposition, but also on reflection, distributed evaluation, adaptive restructuring, and team-level optimization, transforming code generation into an increasingly coordinated and resilient problem-solving process.
6.2 Scientific Discovery Agents
Scientific-discovery agents aim to accelerate the entire life cycle of scientific research, from hypothesis generation through experimental execution, by coupling LLMs with domain-specific simulators, laboratory automation and up-to-date literature. These systems ground decisions in verifiable processes while handling heterogeneous data, safety constraints and long-horizon goals.
In this subsection, we begin with the foundational layer (Section 6.2.1), which encompasses planning under scientific context, tool-augmented interaction with scientific resources, search and retrieval mechanisms including RAG-based systems and execution-time integration with laboratory hardware. Building upon these capabilities, the self-evolving layer (Section 6.2.2) introduces agentic memory, feedback and reflection, which enable scientific agents to refine hypotheses, adapt protocols and learn from experimental outcomes. Finally, the collective layer (Section 6.2.3) explores multi-agent collaboration, where agents coordinate roles, share intermediate knowledge and jointly reason toward complex scientific goals.
6.2.1 Foundational agentic reasoning
Planning.
Scientific agents use reasoning-enhanced planning to decompose a research goal into steps, decide which tool or simulator to call next, and revise the plan as evidence arrives. In short, the chain of thought emerges from LLM reasoning that compiles instructions into rigorous, executable plans [1]. For example, ProtAgents [479] materializes a planner agent that uses LLM reasoning to formulate a concrete plan for protein analysis and keeps modifying it with feedback from another critic agent, and Eunomia [480] uses a ReAct-style [5] workflow for in-context reasoning: after retrieving a top-k evidence set, the backbone LLM quotes a warranting sentence, and that citation drives the next action choice. Other examples include MatExpert [35], which deploys a chain-of-thought LLM to author a stepwise transition pathway and then emits a structured crystal candidate from a feedback loop.
Planning can also act as a reasoning constraint. For instance, Curie [481] utilizes a rigor engine that performs alignment, setup, and reproducibility checks on the planning steps proposed by the Architect LLM. Thus, the Architect's free-form reasoning cannot advance unless these rigor gates are satisfied, which transforms planning into both a guide and a regulator of the reasoning process. In addition, Biomni [40], a general-purpose biomedical agent, constrains its reasoning within a dynamically constructed biomedical action space of comprehensive tools, software packages and databases, requiring each hypothesis to be operationalized as executable code.
Tool-Use.
Tool use is now an integral part of the reasoning loop for scientific agents. Rather than following rigid rules, these agents decide which tool to call and when, how to fill its parameters, and how to verify or revise based on the returned evidence. For example, SciAgent [482] formalizes tool-augmented reasoning as a four-step procedure: planning, retrieval, tool-based action and execution. Agents are trained to decide when to call a tool, which one, and how to integrate it into solving scientific tasks. Through domain-specific tools, ChemCrow [33] chains various expert chemistry tools so intermediate calculations become premises in the next reasoning step, which enables end-to-end planning and autonomous syntheses. CACTUS [483] similarly grounds explanations in cheminformatics outputs, reducing reliance on free-form reasoning by language models alone.
Other notable examples include ChemToolAgent [484] and CheMatAgent [485]. In particular, ChemToolAgent [484] employs a ReAct-like [5] architecture with multiple specialized chemistry tools, allowing the LLM to choose and parameterize tool calls while CheMatAgent [485] pushes further by learning tool use: it integrates over 100 chemistry/materials tools, curates a tool-specific benchmark, and uses Monte Carlo Tree Search with step-level fine-tuning to learn both which tool to pick and how to fill arguments.
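As a hedged, minimal example of the decide-act-observe pattern these systems rely on, the sketch below registers two toy tools and lets a stub policy (`llm_decide`) pick which one to call; the tool names and logic are illustrative only, not the toolchain of any cited system.

```python
# Minimal sketch of a ReAct-style tool-selection step: a stubbed policy alternates
# between choosing a registered tool and folding the observation back into context.

def molar_mass(formula: str) -> str:
    masses = {"H2O": 18.02, "CO2": 44.01}
    return f"{masses.get(formula, float('nan'))} g/mol"

def literature_search(query: str) -> str:
    return f"3 abstracts retrieved for '{query}'"

TOOLS = {"molar_mass": molar_mass, "literature_search": literature_search}

def llm_decide(context: str) -> tuple[str, str]:
    """Placeholder policy: pick a tool and its argument from the context."""
    if "mass" in context:
        return "molar_mass", "CO2"
    return "literature_search", context

def react_step(task: str, max_steps: int = 2) -> str:
    context = task
    for _ in range(max_steps):
        tool, arg = llm_decide(context)
        observation = TOOLS[tool](arg)           # act, then observe
        context += f"\n[{tool}({arg}) -> {observation}]"
    return context

if __name__ == "__main__":
    print(react_step("What is the molar mass of CO2?"))
```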
For biomedical agents, TxAgent [486] scales therapeutic reasoning across 211 vetted tools and carries out multi-step reasoning that reconciles drug labels, interactions, and patient context, turning clinical justification into an executable trace. On the other hand, AgentMD [487] builds a two-stage tool memory: it first mines thousands of clinical calculators from literature (i.e. making tools), then selects and applies the right ones at inference (i.e. using tools), pinning predictions to concrete computations. Other recent systems [488, 489, 490, 491, 492, 493] reinforce a similar design principle: co-designing tool use with reasoning so that each claim is computable and auditable.
Another notable category of tool use is agentic execution, which includes, but is not limited to, running code and simulating environments. Execution layers bridge high-level plans to physical infrastructure, enabling scientific agents to autonomously operate laboratory hardware, orchestrate simulation pipelines, and manage large-scale data workflows. Recent works illustrate this pattern: Organa [494] ties LLM reasoning to task-and-motion planning plus scheduling and perception, executing multi-step experiments with autonomous robots; AtomAgents [495] exemplifies the simulation side of execution, a physics-aware system that plans and runs atomistic workflows, coordinating tools for code execution, analysis, and hypothesis checking; and Chemist-x [496] shows wet-lab execution beyond a digital-only scenario, where agents generate control scripts and drive an automated platform to validate conditions without human intervention.
Several other platforms couple execution with optimization or team-based autonomy. For instance, SGA [497] formalizes the workflow of LLM-as-proposer and simulator-as-optimizer while MatExpert [35] operates a retrieval, transition and generation workflow for material discovery tasks, and CellAgent [498] coordinates planner, executor and evaluator roles to run full single-cell analysis pipelines.
Search and retrieval.
Beyond simple context stuffing, recent scientific agentic systems elevate retrieval into a deliberate reasoning step: agents decide when and what to fetch, and how to use the evidence before committing to a hypothesis. With retrieval ability, BioDiscoveryAgent [499] pulls literature and interim assay results inside a closed loop so the model’s next gene-perturbation choices are conditioned on what was read and measured; while DrugAgent [500] coordinates knowledge graph queries, targeted literature search through web API and machine learning predictors. Its planner selects retrieval actions and then reconciles heterogeneous evidence into an explainable rationale. To facilitate scientific research, ARIA [501] operationalizes a search, filter then synthesis workflow as role-bound steps that carry citations forward, turning literature into actionable procedures. Similarly, AI Scientist-v2 [502] employs an agentic tree-search framework in which the agent actively queries scientific literature database during hypothesis formulation and manuscript drafting, ensuring that analyses and writing are grounded in existing evidence. For research idea generation, another recent work [503] constrains the process with curated background packets, using retrieval as an experimental control.
Building on these developments, retrieval-augmented generation (RAG) frameworks position external sources not merely as supporting references but as active components of the reasoning process. Specifically, RAG-enhanced scientific agents treat external sources as primary inputs to the LLM's context and reasoning, typically with explicit planning, passage extraction, citation and contradiction checks. For example, PaperQA [488] and PaperQA2 [489] treat retrieval as the main loop. By deciding which documents to read, attributing every claim, and detecting conflicts to steer synthesis, these works can yield expert-level literature reviews that are inherently verifiable. In material science, LLaMP [490] extends RAG beyond text. Specifically, it employs hierarchical ReAct [5] agents that call material-specific APIs to fetch band gaps or elastic tensors, edit structures, and then reason over the computed properties.
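A minimal sketch of a retrieval-then-answer loop that carries citations forward is shown below; lexical-overlap retrieval and string concatenation stand in for embedding search and LLM synthesis, and the corpus is a toy placeholder.

```python
# Minimal sketch of retrieval-then-answer with citations carried forward: every
# claim in the synthesized answer cites the passage it came from. Toy corpus,
# toy scoring; not the pipeline of any cited system.

CORPUS = {
    "doc1": "Perovskite band gaps are tunable via halide substitution.",
    "doc2": "High-entropy alloys show enhanced creep resistance.",
}

def retrieve(query: str, k: int = 1):
    scored = sorted(
        CORPUS.items(),
        key=lambda kv: -len(set(query.lower().split()) & set(kv[1].lower().split())),
    )
    return scored[:k]

def synthesize(query: str, passages) -> str:
    # placeholder for an LLM synthesis call: each claim keeps its source id
    return " ".join(f"{text} [{doc_id}]" for doc_id, text in passages)

if __name__ == "__main__":
    q = "How are perovskite band gaps tuned?"
    print(synthesize(q, retrieve(q)))
```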
6.2.2 Self-evolving agentic reasoning
Scientific discovery agents can go beyond static reasoning and acquire the ability to self-evolve, that is, to learn from experience, refine their internal representations and improve decision quality over successive interactions. This self-evolving layer equips agents with mechanisms to monitor and revise their own reasoning, retain and reuse intermediate hypotheses and adjust future plans based on external feedback or environmental signals. In the following paragraphs, we discuss how memory modules enable the accumulation of scientific knowledge and how feedback and reflection mechanisms support continual adaptation and reasoning consistency throughout long-horizon scientific workflows.
Memory.
ChemAgent [504] implements a self-updating library. It decomposes chemistry problems into sub-tasks and writes reusable skills (e.g., procedures, patterns, and solutions) that later prompts can retrieve and adapt, stabilizing long multi-step reasoning without re-deriving everything from scratch. On the other hand, MatAgent [505] emphasizes interpretable generation for inorganic materials, where short-term memory recalls recent compositions and feedback, long-term memory preserves successful designs together with their reasoning traces, and both are reused across iterations to guide proposal refinement and enable transparent audit.
Agentic Feedback and Reflection.
Firstly, Scientific Generative Agent [497] ties discrete LLM proposals to inner-loop simulations that optimize continuous parameters, advancing only when evidence improves; its reflection is driven by measurable loss reductions. Next, ChemReasoner [506] performs heuristic search over the LLM’s idea space but scores and steers candidates with quantum-chemical feedback, turning electronic-structure signals into a principled critique of linguistic hypotheses. Complementing these physics-based signals, Curie [481] embeds rigor checks directly into control flow via intra-agent checks, inter-agent gates and an experiment-knowledge module. In parallel, LLMatDesign [507] builds explicit self-reflection into materials workflows, prompting the agent to surface and repair inconsistencies before they propagate to tool calls. Moreover, NovelSeek [508] utilizes reflection as a closed loop, updating code and plans with human-interactive feedback after each round. Finally, a recent study [509] regularizes the process up front with explicit goals and constraints and afterwards with standardized scoring, providing an objective standard that makes reflection repeatable.
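The sketch below distills the shared pattern of evidence-gated reflection: a proposed revision is accepted only when an inner-loop simulation lowers a measurable loss. `propose` and `simulate_loss` are toy stand-ins, not components of any cited system.

```python
# Minimal sketch of evidence-gated reflection: advance a design only when an
# inner-loop simulation reduces a measurable loss. Toy proposer and simulator.

import random

def propose(current: dict) -> dict:
    """Placeholder for an LLM proposal step."""
    return {"param": current["param"] + random.uniform(-1, 1)}

def simulate_loss(design: dict) -> float:
    return (design["param"] - 3.0) ** 2      # toy objective with optimum at 3

def evolve(design: dict, rounds: int = 20) -> dict:
    best_loss = simulate_loss(design)
    for _ in range(rounds):
        candidate = propose(design)
        loss = simulate_loss(candidate)
        if loss < best_loss:                 # advance only when evidence improves
            design, best_loss = candidate, loss
    return design

if __name__ == "__main__":
    print(evolve({"param": 0.0}))
```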
6.2.3 Collective multi-agent reasoning
Multi-agent frameworks for scientific discovery distribute labor across specialized LLM-driven roles, where advanced LLM reasoning not only orchestrates coordination between scientific agents but also adjudicates conflicting evidence to maintain coherence in the process.
To illustrate, we introduce some important multi-agent frameworks as follows. Firstly, ProtAgents [479] exemplifies this pattern in protein design. The framework involves agents for literature retrieval, structure analysis, physics simulation, and results analysis. Specifically, the backbone LLM directs reasoning over multi-modal outputs, choosing when to iterate or to run convergence checks based on feedback signals. PiFlow [510], on the other hand, instantiates reasoning as principle-aware uncertainty reduction with a multi-agent loop in which a Planner agent relays strategy to a Hypothesis agent and a validation loop, explicitly tying multi-agent communication to hypothesis–evidence alignment. AtomAgents [495] also brings similar role specialization to alloy discovery. In particular, the agent uses LLM-guided reasoning to control when to trigger simulations and how to evaluate multi-modal results, letting reasoning allocate computational resources and prune alloy candidates.
With a similar planner, executor and evaluator framework, CellAgent [498] automates research on single-cell analysis, where the planner LLM selects tools or hyper-parameters and the evaluator LLM triggers self-iterative re-runs when quality checks fail. Other notable works include ARIA [501], which introduces a four-agent framework (scout, filter, synthesizer and procedure-drafter); Curie [481], which embeds rigor into multi-agent planning; Team of AI-made Scientists (TAIS) [511] for gene-expression discovery; and the Virtual Lab [512] for nano-body design with role agents.
6.3 Embodied Agents
Embodied agents extend reasoning beyond text, anchoring language in robotic perception, manipulation and navigation. By embedding LLMs within robotic and simulated bodies, these embodied agents tackle real-world generalization, continual adaptation and multi-modal grounding.
In this subsection, we begin with the foundational layer (Section 6.3.1), which covers long-horizon embodied planning, tool-assisted perception, manipulation and execution. Building upon these capabilities, the self-evolving layer (Section 6.3.2) introduces agentic memory, feedback and self-reflection capabilities enabling robots to refine control policies, adapt to novel environments and improve performance through continual interaction. Finally, the collective reasoning layer explores multi-robot collaboration (Section 6.3.3), where agents coordinate perception, share learned representations and jointly reason about tasks to achieve complex embodied goals.
6.3.1 Foundational agentic reasoning
Planning.
Early work such as SayCan [136] established the template by mapping linguistic descriptions to skill affordance estimates, and SayPlan [513] refined this grounding by leveraging 3D scene graphs to align goal references with object-centric representations and spatial models. Beyond symbolic representations, EmbodiedGPT [514] uses curated video CoT annotations of sub-goals to train models that map multi-modal input to structured sequences for embodied planning, while the context-aware planning system of [515] adds a semantic spatial map and object-location information to the planning pipeline, enabling dynamic replanning during execution. In addition, DEPS [516] introduces an interactive planning loop (i.e. describe, explain, plan and select) for open-world multi-task agents.
Embodied agents also rely on multi-modal reasoning traces that explicitly align perception with action. For example, Embodied CoT [517] trains vision-language-action models to generate reasoning steps incorporating visual features before executing an action. Fast ECoT [518] accelerates this by caching and re-using reasoning segments across time-steps, reducing inference latency while preserving task success. More recently, Cosmos-Reason1 [519] establishes an ontology of space, time and dynamics that lets CoT sequences encode structured physical priors. CoT-VLA [520] builds a visual chain-of-thought by predicting future image frames as intermediate sub-goals prior to action generation. Finally, Emma-X [521] integrates grounded chain-of-thought with look-ahead spatial reasoning, improving long-horizon embodied task performance.
Another line of work strengthens embodied planning through reinforcement learning, treating planning not as static decomposition but as a self-evolving process that adapts to environment feedback. Robot-R1 [522] trains large VLMs to predict keypoint transitions under visual context, turning RL into a mechanism for learning physically grounded forward models. ManipLVM-R1 [523] exploits verifiable physical reward signals (e.g., trajectory match and affordance correctness) to reduce reliance on dense expert annotation. Embodied-R [38] presents a collaborative framework where VLMs handle perception and smaller LMs handle reasoning, with the whole system trained via RL for embodied spatial reasoning. VIKI-R [524] further extends this direction into heterogeneous multi-agent cooperation, employing a two-stage pipeline of chain-of-thought fine-tuning followed by hierarchical RL across agents to coordinate activation and planning.
Tool-use.
Embodied agents can also be strengthened to interact with external tools that enhance perception and compensate for incomplete observations. GSCE [525], for example, provides a prompt framework that binds skill APIs and constraints for safe LLM-driven drone control. MineDojo [526] links agents to internet-scale corpora, thereby enabling richer affordance grounding. Physical AI Agents [34] further introduces a modular architecture and a retrieval-augmented generation design pattern for embedding real-world physical interaction into LLM-driven agents. Beyond offline tool use, some systems treat the environment itself as an API. For example, Matcha agent [422] uses an LLM to issue queries about objects and scenes and thereby acquire perceptual information needed for task completion.
On the other hand, the execution module is one of the most important tool types. It translates high-level language instructions into continuous motor commands, enabling embodied agents to act reliably in physical environments. Early systems such as SayCan [136] use language to invoke robot pick-and-place skills; LEO [527] broadens execution to more general manipulation settings, and Hi Robot [528] uses a VLM reasoner to process complex prompts while a low-level action policy executes the chosen step. More recent efforts broaden the execution space: Gemini Robotics [529] introduces a large-scale vision-language-action model for real-world robot control and Octopus [530] generates executable code in simulated environments that bridges planning and manipulation.
Beyond single-agent control, hybrid pipelines couple reactive reflexes with language-guided policies to support complex domains. For example, CaPo [531] incorporates an execution phase where agents carry out decomposed sub-tasks and adapt their meta-plan based on progress; COHERENT [532] embeds a robot executor module within its PEFA (i.e. proposal, execution, feedback and adjustment) loop, which ensures each assigned sub-task is acted upon and refined appropriately; and MP5 [533] integrates multi-modal perception to generate executable plans in open-ended Minecraft. At the perception–action interface, LLM-Planner [83] generates sub-goals and maps them into action sequences via a low-level controller, and EmbodiedGPT [514] illustrates how LLM-generated plans can be translated into control policies for embodied control in physical environments.
Search and retrieval.
Embodied agents can also use search and retrieval ability to ground language in spatial structure and past experience. Early navigation systems such as L3MVN [534] use LLMs to query a semantic map and select promising frontiers as long-term goals during visual target navigation, while SayNav [535] and SayPlan [513] build 3D scene graphs and then search task-relevant subgraphs so language instructions can be translated into grounded waypoints and sub-tasks in large environments. Long-horizon navigation works like ReMEmbR [536] maintain a structured spatio-temporal memory that can be queried to answer “where” and “when” questions about past robot experience. Additionally, RAG-style systems make retrieval a first-class part of the planning loop: Embodied-RAG [537] and EmbodiedRAG [37] treat an agent’s experience and 3D scene graphs as non-parametric memories from which task-relevant episodes or subgraphs are retrieved for navigation and task planning; Retrieval-Augmented Embodied Agents [538] retrieve policies from a shared memory bank and condition action generation on them; and MLLM-as-Retriever [539] trains a multi-modal LLM retriever to rank past trajectories so each decision step can condition on the most useful prior experience rather than only the current observation.
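As an illustrative sketch of a queryable spatio-temporal memory (the schema is an assumption, not the design of ReMEmbR or any other cited system), the code below stores position- and time-stamped observations so simple "where" and "when" queries can be answered by filtering.

```python
# Minimal sketch of a queryable spatio-temporal memory: each observation is
# stored with a position and timestamp so "where/when" questions about past
# experience reduce to filtering. Schema and labels are illustrative only.

from dataclasses import dataclass

@dataclass
class Observation:
    label: str
    position: tuple       # (x, y) in the robot's map frame
    t: float              # seconds since episode start

class SpatioTemporalMemory:
    def __init__(self):
        self.events: list[Observation] = []

    def add(self, obs: Observation):
        self.events.append(obs)

    def where(self, label: str):
        return [o.position for o in self.events if o.label == label]

    def when(self, label: str):
        return [o.t for o in self.events if o.label == label]

if __name__ == "__main__":
    mem = SpatioTemporalMemory()
    mem.add(Observation("coffee mug", (2.0, 1.5), 12.3))
    mem.add(Observation("charging dock", (0.0, 0.0), 40.1))
    print("mug last seen at", mem.where("coffee mug"))
```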
6.3.2 Self-evolving agentic reasoning
Embodied agents achieve reliable long-horizon autonomy when they can self-evolve over time: monitoring their own internal states, storing and updating task-relevant knowledge, and adjusting behaviors when plans deviate. In the following paragraphs, we examine how memory modules, feedback signals and agentic reflection enable embodied agents to turn planning from a one-shot process into a continually improving cycle of behavior.
Memory.
Effective memory mechanisms enable agents to reuse past experiences and maintain coherent task execution over extended interactions. Many systems cache recent observations in episodic buffers while summarizing long-term semantics in structured graphs, as in household planning [540] and long-horizon agents with hybrid multi-modal memory [279]. Skills and routines can be shared across tasks via indexed memory stores. For example, HELPER-X [541] indexes discovered skills and action scripts, which aid future dialogue and can be shared across domains. Spatial navigation methods such as BrainNav [542] maintain biologically inspired dual-map memories linked by a hippocampal hub to reduce hallucinations and drift. Broader contexts also benefit: CAPEAM [515] incorporates environment-aware memory modules that track object states and spatial changes. Finally, lifelong episodic systems such as Ella [543] maintain a long-term multi-modal memory system to support social-robot interaction.
Agentic Feedback and Reflection.
Dialogue-based critique, calibrated uncertainty and environment-aware reward shaping refine policies beyond binary success signals. For example, the Matcha agent [422] treats objects and scenes as interactive information sources before acting, and FAMER [544] uses lightweight preference feedback to adapt embodied agents to user intentions in real time. Uncertainty-aware planners include KnowNo [545], which proactively solicits guidance when confidence falls below its calibrated guarantee, and Octopus [530], which exploits environmental feedback to improve generated executable programs over time. At the multi-agent level, MindForge [362] introduces theory-of-mind style perspective feedback so heterogeneous robots adapt to each other’s reasoning strategies, while ReAd [546] introduces an advantage-based feedback loop that enables an LLM planner to self-refine its collaboration strategies across embodied multi-agent tasks.
Robust reflection mechanisms help agents anticipate failures by monitoring their own reasoning and actions and then adjusting plans. Optimus-1 [279] couples a Knowledge-guided Planner with an Experience-Driven Reflector to revise decisions using stored experience, while another recent study [547] defines structured agentic workflows (including self-reflection, multi-agent reflection and LLM ensembles) that enable robots to reflect on and refine LLM-generated object-centered plans, thus reducing reasoning errors. Systems such as EMAC+ [548] interleave perception, planning and verification steps to perform online plan refinement, and earlier works such as Voyager [36] embed an iterative prompting loop that uses environment feedback and execution errors to refine their skill libraries over time.
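The reflection loop these systems share can be summarized as execute, critique, revise. The sketch below shows that loop in its simplest form under stated assumptions: execute_skill, critique and revise_skill are hypothetical stand-ins for an environment rollout and two LLM calls, and a skill is only written back to the library once it verifies.

```python
def execute_skill(skill_code: str) -> tuple[bool, str]:
    # Placeholder environment rollout: returns (success, error_or_feedback).
    return (False, "collision: shelf blocked the gripper path")

def critique(skill_code: str, feedback: str) -> str:
    # Placeholder reflection step; a real agent would ask an LLM to explain
    # why execution failed given the environment feedback.
    return f"Plan ignored obstacle reported in feedback: {feedback}"

def revise_skill(skill_code: str, reflection: str) -> str:
    # Placeholder revision step; a real agent would regenerate the skill code.
    return skill_code + "  # revised to avoid blocked path"

def improve(skill_code: str, library: dict[str, str], name: str, budget: int = 3) -> None:
    for _ in range(budget):
        success, feedback = execute_skill(skill_code)
        if success:
            library[name] = skill_code        # store only verified skills for reuse
            return
        reflection = critique(skill_code, feedback)
        skill_code = revise_skill(skill_code, reflection)
    # Budget exhausted: the unverified attempt is not added to the library.

skills: dict[str, str] = {}
improve("def fetch_mug(): ...", skills, "fetch_mug")
print(skills)   # empty here because the placeholder rollout always fails
```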
6.3.3 Collective multi-agent reasoning
Multi-agent collaboration enables embodied systems to divide labor and coordinate complex tasks more efficiently, with language often serving as the primary medium for negotiation and role allocation. For instance, SMART-LLM [549] decomposes high-level instructions and allocates sub-tasks across multiple robots, while CaPo [531] optimizes cooperative plans to avoid redundant exploration. For heterogeneity and coordination mechanisms, COHERENT [532] deploys a propose-execution-feedback-adjust loop across diverse robot types to enable seamless joint operation. In addition, Theory of Mind (ToM), which refers to an embodied agent’s ability to infer and reason about others' beliefs and mental states, is also highly related to embodied multi-agent systems [360, 363, 361]. For example, MindForge [362] equips agents with explicit theory-of-mind representations and natural inter-agent communication to coordinate collaboratively.
For multi-modal frameworks, EMAC+ [548] and COMBO [550] integrate vision and language modules and continuously refine plans via visual feedback, and VIKI-R [524] demonstrates reinforcement learning as a scalable coordination mechanism among embodied agents. At larger scales, studies such as RoCo [551] show how role negotiation and flexible protocols support adaptable teamwork in dynamic environments.
6.4 Healthcare & Medicine Agents
Healthcare and medical agents seek to support the full clinical decision pipeline, from initial symptom triage to treatment planning, by integrating LLMs with structured patient records, medical ontologies and expert guidelines. Unlike general assistants, these systems must operate under strict safety constraints, reason over multi-modal evidence and provide legally defensible justification.
In this subsection, we begin with the foundational layer (Section 6.4.1), which includes medical and diagnostic reasoning and tool-augmented access to various biomedical knowledge bases and APIs. Building on these primitives, the self-evolving layer (Section 6.4.2) examines memory, feedback and reflective modules that allow these agents to accumulate patient-specific context, adapt to longitudinal trajectories and revise clinical plans over time. Finally, the collective layer (Section 6.4.3) highlights multi-agent collaboration, which includes doctor–agent co-planning, human–AI shared autonomy and specialist model ensembles.
6.4.1 Foundational agentic reasoning
Planning.
Planning is a core capability for healthcare agents, enabling them to structure long-horizon clinical pathways into diagnostic and treatment phases, refine workflows dynamically as patient conditions evolve and coordinate across teams and tools toward cohesive care delivery. We discuss several recent advances as follows. For instance, a recent agentic clinical system [552] orchestrates specialized tools and guideline citations to support oncology decision-making, EHRAgent [553] decomposes multi-table EHR inference into code-execution steps with feedback learning, and PathFinder [337] presents a multi-agent, multi-modal histopathology workflow for diagnostic reasoning.
Other frameworks model planning as an explicit orchestration layer across levels of abstraction. For example, MedAgent‑Pro [340] proposes a hierarchical workflow which first generates disease-level diagnostic plans from guideline criteria and then dispatches tool-agent modules for execution. MedOrch [554] treats tool invocation itself as a planning primitive across modalities, orchestrating reasoning agents for multi-step diagnostic execution. On the other hand, ClinicalAgent [555] coordinates multi-agent workflows for clinical planning, leveraging LLM reasoning to allocate tools and synthesize evidence. In addition, planning in healthcare agents is increasingly adaptive, responding to new information and evolving contexts. For example, DoctorAgent‑RL [556] models clinical consultation as a dynamic decision-making process under uncertainty, optimizing questioning strategies and diagnostic paths via reinforcement learning; while DynamiCare [557] adjusts specialist-agent teams across multi-round interactions as new patient information emerges.
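The hierarchical pattern described above (draft a guideline-level plan, then dispatch each step to a tool agent) can be sketched as follows. This is an illustrative toy, not MedAgent-Pro's or MedOrch's actual interface: the TOOL_AGENTS registry, make_disease_plan and the run_* functions are hypothetical, and the clinical strings are placeholders.

```python
from typing import Callable

def run_lab_lookup(patient_id: str) -> str:
    return f"lab panel for {patient_id}: HbA1c 8.1%"

def run_imaging_review(patient_id: str) -> str:
    return f"retinal scan for {patient_id}: mild non-proliferative changes"

# Registry mapping plan steps to executable tool agents.
TOOL_AGENTS: dict[str, Callable[[str], str]] = {
    "check_labs": run_lab_lookup,
    "review_imaging": run_imaging_review,
}

def make_disease_plan(suspected_condition: str) -> list[str]:
    # Placeholder for an LLM call that drafts a guideline-derived plan.
    return ["check_labs", "review_imaging"]

def diagnose(patient_id: str, suspected_condition: str) -> list[str]:
    evidence = []
    for step in make_disease_plan(suspected_condition):
        agent = TOOL_AGENTS.get(step)
        if agent is None:
            # No executable tool for this step: surface it rather than guess.
            evidence.append(f"{step}: no tool agent available, deferring to clinician")
            continue
        evidence.append(agent(patient_id))
    return evidence

print(diagnose("patient-042", "type 2 diabetes with retinopathy"))
```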
Tool-use.
Tool integration significantly expands a healthcare agent’s action space, enabling precise calculations, medical image interpretation and access to specialized databases. Recent studies are summarized as follows. Several systems explicitly foreground extensibility. MedOrch [554] introduces a modular architecture that allows new diagnostic APIs to be incorporated without retraining, while TxAgent [486] integrates over two hundred pharmacological tools to support therapeutic decision-making across drug–disease–treatment relationships. AgentMD [487] similarly curates and leverages over two thousand executable clinical calculators to learn risk-prediction pipelines.
Other approaches focus on structured function calling for safe execution. For example, LLM-based agents can reliably invoke bedside calculators when provided with explicit function signatures, ensuring arithmetic correctness in dosing and risk scoring [558]. MeNTi [559] goes further by enabling nested tool calls across multi-step medical calculators. Complementing these text-based integrations, MMedAgent [39] demonstrates that agents can learn to select among multi-modal tools.
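The sketch below illustrates the structured function-calling pattern described above: a bedside calculator is exposed through an explicit signature and a generic JSON-schema-style description, and arguments are validated so the arithmetic is performed by code rather than by the LLM. The schema layout and dispatch helper are assumptions for illustration; the formula itself is the standard Cockcroft-Gault creatinine-clearance equation.

```python
def creatinine_clearance(age_years: float, weight_kg: float,
                         serum_creatinine_mg_dl: float, female: bool) -> float:
    """Cockcroft-Gault creatinine clearance in mL/min."""
    crcl = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return crcl * 0.85 if female else crcl

# Generic JSON-schema-style tool description an LLM could be prompted with.
TOOL_SCHEMA = {
    "name": "creatinine_clearance",
    "description": "Estimate renal function for dose adjustment.",
    "parameters": {
        "age_years": {"type": "number"},
        "weight_kg": {"type": "number"},
        "serum_creatinine_mg_dl": {"type": "number"},
        "female": {"type": "boolean"},
    },
}

def dispatch(tool_call: dict) -> float:
    # Check the model supplied every declared parameter before executing.
    missing = set(TOOL_SCHEMA["parameters"]) - set(tool_call["arguments"])
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return creatinine_clearance(**tool_call["arguments"])

print(dispatch({"name": "creatinine_clearance",
                "arguments": {"age_years": 67, "weight_kg": 80,
                              "serum_creatinine_mg_dl": 1.4, "female": False}}))
```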
In addition, execution is crucial for translating high-level clinical plans into concrete actions such as code operations, database queries or robotic procedures. VoxelPrompt [560] embeds 3-D volumetric priors so that language instructions drive spatial segmentation and analysis of medical image volumes. On the other hand, embodied ultrasound-robot controllers [561] translate LLM-generated plans into closed-loop robotic scanning via a “think-observe-execute” loop. Adaptive reasoning-and-acting systems [562] further refine both the reasoning and actions over time in simulated clinical environments. In medical imaging, systems like MedRAX [563] materialize multi-step reasoning by integrating specialized chest-X-ray tools and LLM reasoning into an end-to-end diagnostic agent. PathFinder [337] similarly executes multi-agent, multi-modal diagnostic workflows in histopathology.
Another class of healthcare agents deploys code-level workflows. For example, Conversational Health Agents [564] compile dialogue actions into function calls and code execution for downstream processing, while EHRAgent [553] materializes EHR operations via executable code. MedAgentGym [565] trains agents to produce code that is directly executed and graded, enforcing reliability of reasoning traces. DoctorAgent-RL [556] validates multi-turn dialogue acts by executing reinforcement-learned strategies in simulated consultations, AIPatient [566] materializes realistic patient scenarios for execution-based evaluation, and another recent study [567] demonstrates how self-evolving multi-agent simulations allow execution behaviors themselves to improve over time.
Search and retrieval.
Search-based agents enhance clinical decision-making by linking LLM reasoning with external biomedical knowledge sources. For instance, MeNTi [559] supplements therapeutic reasoning by bridging LLM calls into multi-step medical calculators while EHRAgent [553] dynamically executes code operations over multi-table EHR data to support complex tabular inference. Conversational Health Agents [564] enrich personalized dialogue by integrating developer-defined external sources and orchestrating action flows. Another line of work explicitly embeds retrieval-augmented generation (RAG) into healthcare agents. For example, CLADD [568] retrieves molecular graphs and prior assay results before proposing compound hypotheses and MedReason [569] issues targeted knowledge-graph sub-queries to anchor each reasoning step for clinical QA.
6.4.2 Self-evolving agentic reasoning
Self-evolving capabilities enable healthcare agents to maintain longitudinal clinical coherence. Representative use cases include accumulating relevant medical context across encounters, updating beliefs as new evidence arrives and revising decisions when inconsistencies surface. In the following paragraphs, we examine how memory, feedback and reflective mechanisms collectively turn clinical reasoning from a one-shot prediction into a continually improving care process.
Memory.
Persistent memory is essential for tracking medical or patient history and maintaining context across interactions. For instance, epidemic-modeling agents [570] maintain temporal contact histories to trace infection chains over time, while MedAgentSim [567] stores experience histories and refines diagnostic strategies over time. In structured data settings, EHRAgent [553] records intermediate computations over tabular EHRs so subsequent steps can reference prior results. EvoPatient [571] interleaves memory with coevolution, maintaining an evolving clinical state across dialogue phases, while AIPatient [566] persists longitudinal EHR-derived variables to drive consistent responses. Multi-agent systems such as MedOrch [554] contain a clinical knowledge-graph agent that can be viewed as external memory, queried to retrieve known relationships or diagnostic patterns.
Agentic Feedback and Reflection.
Agentic feedback and self-reflection are complementary mechanisms that improve the reliability and adaptability of healthcare agents. Feedback converts execution outcomes into learning signals: DynamiCare [557] updates multi-agent treatment strategies when newly observed patient state contradicts prior plans; DoctorAgent-RL [556] optimizes questioning policies from consultation rewards; and MedAgentGym [565] enforces correctness by executing and grading generated code. Tool-use pipelines also propagate execution feedback: for example, the success or failure of table queries in EHRAgent [553], and of calculator calls in MeNTi [559] and clinical-calculation agents [558], is used to refine subsequent actions.
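A minimal version of this execution-feedback loop is sketched below for the table-query case: a query is run, any database error is treated as a feedback signal, and a placeholder revision step rewrites the query before retrying within a fixed budget. The revise_query heuristic stands in for an LLM repair call, and the schema and query are toy examples, not EHRAgent's actual pipeline.

```python
import sqlite3

def run_sql(db: sqlite3.Connection, query: str) -> tuple[bool, str]:
    try:
        rows = db.execute(query).fetchall()
        return True, str(rows)
    except sqlite3.Error as exc:              # execution failure becomes feedback
        return False, f"SQL error: {exc}"

def revise_query(query: str, error: str) -> str:
    # Placeholder for an LLM call that rewrites the query given the error text.
    return query.replace("patient", "patients")

def answer_with_feedback(db: sqlite3.Connection, query: str, budget: int = 3) -> str:
    for _ in range(budget):
        ok, result = run_sql(db, query)
        if ok:
            return result
        query = revise_query(query, result)   # feed the error back into revision
    return "unable to answer within budget"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE patients (id INTEGER, hba1c REAL)")
db.execute("INSERT INTO patients VALUES (42, 8.1)")
print(answer_with_feedback(db, "SELECT hba1c FROM patient WHERE id = 42"))
```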
6.4.3 Collective multi-agent reasoning
Multi-agent collaboration is central to healthcare AI, since clinical decision-making often depends on consensus among specialists, negotiation of competing hypotheses and coordination across roles such as physicians, patients and trial designers. In the following, we discuss several strands of research centered around multi-agent capabilities.
For collaborative decision-making, notable frameworks include MDAgents [336], which automatically assigns tailored collaboration structures to teams of LLMs depending on medical task complexity, and DoctorAgent-RL [556], which uses a multi-agent reinforcement-learning framework to optimize multi-turn doctor-patient consultation dialogues. In addition, Agent-derived Multi-Specialist Consultation (AMSC) [572] explores staged multi-specialist dialogues for differential diagnosis that mimic the scenario of a patient consulting multiple specialists. Other notable works include ClinicalAgent [555], which organizes clinical trial workflows via role-based agent collaboration and LLM reasoning, and PathFinder [337], which integrates a diverse set of agents that can gather evidence and provide comprehensive diagnoses with natural language explanations.
On the other hand, there are studies focusing on simulation-driven collaboration. These works highlight how multi-agent setups enrich training and evaluation. MedAgentSim [567] co-evolves doctor and patient agents to simulate real-world multi-turn clinical interactions, and EvoPatient [571] co-evolves patient and doctor agents to generate diagnostic dialogue data, accumulating experience that improves the quality of both questions and answers and supports the training of human doctors. In addition, DynamiCare [557] initiates a team of specialist agents that iteratively queries the patient system to integrate new information and adapts its composition and strategy. Finally, medical agents can also collaborate to aid the medical reasoning process. For example, MedAgents [573] demonstrates zero-shot cooperation among domain-specialist agents in medical reasoning tasks, CLADD [568] uses retrieval-augmented generation to support drug-discovery workflows across agents and GMAI-VL-R1 [574] combines multi-modal reasoning and reinforcement learning in a multi-agent framework to support large-scale medical decision-making.
6.5 Autonomous Web Exploration & Research Agents
Web agents, GUI agents and autonomous research agents constitute three interlinked but distinct trajectories of agentic reasoning systems. Firstly, web agents specialize in navigating online resources, issuing web API calls or browser actions to retrieve dynamic evidence and steer research direction. GUI agents go further by manipulating software interfaces and multi-modal dashboards directly (i.e. clicking, typing, navigating) to execute experiments, data workflows and interface-based tasks. Autonomous research agents sit at the top of this hierarchy, pairing LLM reasoners with scientific workflows, tool-chains and meta-loops to drive hypothesis generation, data synthesis and paper writing. The core connection is a progression of autonomy: first web agents retrieve evidence from online resources, then GUI agents operationalize actions inside software interfaces, and finally autonomous research agents orchestrate full scientific workflows end-to-end.
In this subsection, we begin with the foundational layer (Section 6.5.1), which captures the core capabilities that any autonomous agent must support: perceiving its environment, reasoning about goals, planning actions and grounding those into tool-augmented workflows. Building on these primitives, the self-evolving layer (Section 6.5.2) examines how agents incorporate feedback, memory and reflection to iteratively refine their behaviors and improve methods over time. Finally, the collective layer (Section 6.5.3) highlights how agents move beyond individual competence into coordination, specialization and emergent collaboration. While web agents, GUI agents and autonomous research agents share common themes of goal-directed autonomy, tool use and iterative improvement, they differ in where they act, how they manipulate their environment and what goals they aim to achieve.
6.5.1 Foundational agentic reasoning
Planning.
Planning is essential for web agents because they must decompose long-horizon tasks into manageable steps, adapt to dynamic pages and coordinate tool-invocation strategies. Early work such as WebGPT [230] fine-tuned GPT-3 [575] to answer open-ended questions via a text-based web-browser interface. Then, various web-based methods deepened the planning paradigm: for example, SEEACT [576] explored large multi-modal models as generalists that integrate visual and HTML grounding for web-based tasks, and AutoWebGLM [577] introduced HTML simplification and various learning techniques for open-domain web task decomposition and navigation. These works paved the way for recent systems such as Agent Q [113] that integrate guided MCTS, self-critique and off-policy preference optimization on web-task benchmarks, and set the stage for even more advanced long-horizon web planners such as WebExplorer [578] and WebSailor [41].
In addition, reinforcement learning has become a core tool for improving the decision-making and planning behavior of web-based LLM agents. WebRL [409] introduces a self-evolving online curriculum that generates new tasks from unsuccessful attempts and trains an outcome-supervised reward model to guide policy optimization. WebAgent-R1 [28] performs end-to-end multi-turn RL, learning web interaction policies directly from online rollouts with binary success rewards. DeepResearcher [232] scales RL to real-world web environments, using a multi-agent browsing architecture and exhibiting emergent behaviors such as plan formulation, cross-source corroboration, and self-reflection. Hybrid pipelines like AutoWebGLM [577] combine supervised training with RL fine-tuning to strengthen task decomposition and structured navigation, while Navigating WebAI [579] combines supervised learning and RL techniques to improve web navigation performance. Methods such as Pangu DeepDiver [580], EvolveSearch [581] and WebEvolver [582] use RL-based self-improvement, for example by adaptively scaling search depth or jointly training an agent and a world-model-like simulator to improve long-horizon web decision-making. Hierarchical approaches like ArCHer [583] optimize high-level and low-level policies with a multi-turn hierarchical RL framework, while PAE [584] combines a task proposer, an acting agent and an evaluator to support autonomous skill discovery via RL in internet environments.
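The common training signal in the RL pipelines above is an episode-level outcome reward attached to a multi-turn trajectory. The sketch below shows that structure in its barest form under stated assumptions: the random policy, toy environment and empty update_policy are placeholders for a real browser environment and a policy-gradient or preference-optimization update; no cited trainer is reproduced here.

```python
import random

def policy(observation: str) -> str:
    # Placeholder policy: pick a browser action at random.
    return random.choice(["click(search)", "type(query)", "click(result)"])

def env_step(action: str) -> tuple[str, bool]:
    # Placeholder environment: the task succeeds once the agent clicks a result.
    done = action == "click(result)"
    return f"page after {action}", done

def rollout(max_turns: int = 8) -> tuple[list[tuple[str, str]], float]:
    trajectory, obs = [], "start page"
    for _ in range(max_turns):
        action = policy(obs)
        trajectory.append((obs, action))
        obs, done = env_step(action)
        if done:
            return trajectory, 1.0        # binary success reward for the episode
    return trajectory, 0.0                # horizon exhausted, no reward

def update_policy(trajectory, reward: float) -> None:
    # Placeholder for an RL update in which every step of the trajectory
    # shares the episode-level outcome reward.
    pass

for _ in range(4):                        # tiny training loop
    traj, r = rollout()
    update_policy(traj, r)
```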
Planning is a core capability for GUI agents, enabling them to coordinate long, multi-step interactions across applications and operating environments. OS-Copilot [585] approaches this by treating the desktop as a unified control space in which a generalist agent continually refines its multi-step workflows. Agent S [85] builds an experience-augmented planning stack that decomposes tasks into sub-goals while retrieving past trajectories and external knowledge to guide action sequencing. InfiGUIAgent [586] strengthens planning by integrating hierarchical task structuring into a multi-modal backbone, allowing agents to organize GUI procedures at multiple levels of abstraction. MobA [587] and PC Agent [588] employ hierarchical architectures that separate high-level planning from low-level execution—on mobile and desktop respectively. GUI foundation models such as OS-ATLAS [589], OSCAR [590] and UItron [591] further emphasize robust cross-application planning: OS-ATLAS offers a platform-agnostic action model for consistent control, OSCAR maintains state-aware plans that adapt as execution unfolds and UItron unifies offline and online planning within a single general-purpose GUI agent.
Likewise, reinforcement learning has become a central way to endow GUI agents with planning over long action sequences. End-to-end frameworks such as ARPO [592] and ComputerRL [593] directly optimize multi-step GUI trajectories with replay buffers or large-scale online interaction, replacing hand-crafted scripts with learned policies for general desktop control. R1-style and semi-online methods, including UI-R1 [594], GUI-R1 [595], InfiGUI-R1 [596] and UI-S1 [597], start from strong vision–language backbones and then use RL to sharpen action prediction and long-horizon reasoning. A complementary line focuses on where to act by improving visual grounding: GUI-Bee [598], SE-GUI [599], UIShift/GUI-Shift [600] and UI-AGILE [601] develop RL-based grounding frameworks to help agents reliably localize target elements before executing actions. ZeroGUI [602] pushes toward fully automated online RL loops, where the agent generates its own tasks and trajectories and improves with zero human annotation, while ComputerRL [593] scales end-to-end online RL in large distributed desktop environments. AgentCPM-GUI [603] couples supervised pre-training with reinforcement fine-tuning to strengthen decision quality on mobile apps. Finally, foundation-style GUI agents such as AutoGLM [604] and Mobile-Agent-v3 [605] serve as general backbones that unify perception, grounding and action, and are trained or fine-tuned with scalable RL frameworks to align long-horizon GUI planning with real-world success signals.
For autonomous research agents, planning modules translate abstract goals into actionable research itineraries. For example, Agent Laboratory [606] organizes work into three structured stages, namely literature review, experimentation and report writing, and supports the workflow with tool-hooks that automate code execution, experiment runs and documentation. GPT Researcher [607] uses a plan → research → write cycle, where a dedicated planner drafts the outline, retrieval/analysis agents gather evidence and a writer compiles the final report. Chain of Ideas [608] retrieves literature into a chain structure to reflect domain progression and support ideation via experiment design, whereas IRIS [609] performs hypothesis exploration via Monte Carlo Tree Search to expand promising branches before committing to downstream tasks. Broader variants include ARIA [501] and NovelSeek [508] that automate the research workflow with a complete literature search, hypothesis generation and experiment planning cycle.
Tool-Use.
For web agents, tool-use abilities underpin the execution of plans in realistic, dynamic environments. For example, WebVoyager [610] systematizes multi-modal execution by building an end-to-end agent that operates on real websites. On the interaction side, BrowserAgent [611] makes the action space more human-like, defining a compact set of browser primitives (e.g., click, scroll, type) and coupling them with an explicit memory mechanism to maintain key conclusions across steps, yielding strong gains on multi-hop QA benchmarks. Finally, methods like WALT [612] and pipeline-oriented systems such as WebDancer [613] and WebShaper [614] push tool use from mere execution toward tool discovery and data-centric interaction. Specifically, WALT teaches agents to reverse-engineer reusable tools from website functionality, while WebDancer and WebShaper embed web actions inside multi-turn information-seeking and dataset-synthesis loops, respectively.
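A compact, human-like browser action space of the kind described above can be sketched as a handful of primitives plus an explicit notes memory carried across steps. The Browser class and its methods below are hypothetical and purely illustrative; they are not the primitives of BrowserAgent or any other cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Browser:
    url: str = "about:blank"
    notes: list[str] = field(default_factory=list)   # conclusions kept across steps

    def click(self, element_id: str) -> str:
        return f"clicked {element_id} on {self.url}"

    def type(self, element_id: str, text: str) -> str:
        return f"typed {text!r} into {element_id}"

    def scroll(self, direction: str) -> str:
        return f"scrolled {direction}"

    def remember(self, fact: str) -> None:
        self.notes.append(fact)           # e.g., the answer to one hop of a question

browser = Browser(url="https://example.org/search")
print(browser.type("q", "capital of Australia"))
print(browser.click("submit"))
browser.remember("first hop: capital of Australia is Canberra")
print(browser.notes)
```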
Tool use is another core capability for GUI agents, enabling them to invoke system functions and application features as structured tools. As pioneering systems, AutoDroid [615] automatically analyzes Android apps to construct functionality-aware UI abstractions that LLM agents can reason over as capabilities rather than raw layouts, while its successor AutoDroid-V2 [616] re-frames mobile UI automation as LLM-driven code generation, with an on-device small language model emitting executable scripts for a local interpreter. MobileExperts [617] models each expert as a tool-capable specialist and uses a dual-layer controller to select which expert and its associated tool-set to invoke at different stages of a mobile workflow. AgentStore [618] pushes this idea to the platform level by treating heterogeneous agents themselves as tools: a MetaAgent uses AgentTokens to route operating-system subtasks to the most suitable specialized “tool-agent” through a unified interface. OS-Copilot [585] and OSCAR [590] integrate rich system-level tools into unified computer-control frameworks, so that complex desktop tasks are expressed as sequences of tool calls. OS-ATLAS [589] complements these systems with a foundation action model that offers robust cross-platform GUI grounding, serving as a reliable actuator layer for downstream tool-using agents. Finally, SeeClick [191] strengthens the execution stack by pre-training a visual GUI agent for GUI grounding, improving the ability to locate the correct on-screen elements from instructions.
Specialized tools can expand an autonomous research agent’s capabilities beyond pure text, enabling more fine-grained control over the research process. For instance, Agentic Reasoning [194] automatically routes queries to appropriate tool modules like code execution, web search and structured memory agents when the main LLM detects a gap in reasoning, while WebThinker [619] empowers autonomous web exploration and page navigation during long-horizon investigations by interleaving reasoning, search and draft-writing with a web explorer module. PaperQA [488] and its follow-up synthesis agent [489] integrate PDF parsing and citation-level grounding to produce verifiable answers and literature syntheses, while Scideator [620] provides an IDE-style tool-chain that combines paper facets with novelty checks for real-time brainstorming. In addition, DeepResearcher [232] shows that reinforcement learning over real-web interactions improves deep-research efficiency and quality, with emergent behaviors such as plan refinement and cross-source corroboration.
Execution components ground high-level reasoning in code, simulations or laboratory protocols to produce verifiable scientific outcomes. Agent Laboratory [606] executes experiments specified in declarative configuration files by orchestrating external toolchains, while Agentic Reasoning [194] integrates a coding agent that executes Python alongside web search and structured memory, feeding the results back into the reasoning process. MLR-Copilot [621] turns research plans into runnable implementations via an ExperimentAgent that leverages retrieved prototype code, runs experiments, and iteratively debugs implementations. Dolphin [622] closes the loop by generating ideas, implementing them through code templates with traceback-guided debugging, executing experiments, and using the analyzed results to steer the next research cycle. The AI Scientist [623] automates end-to-end ML experiments, i.e. generating ideas, writing code, executing experiments and visualizing results, so that observed outcomes can guide subsequent runs, while The AI Scientist-v2 [502] adds a dedicated experiment manager and progressive agentic tree search to prioritize and schedule experiment branches. Most recently, NovelSeek [508] introduces a unified closed-loop multi-agent framework that spans hypothesis generation, idea-to-methodology construction, and multi-round automated experiment execution with feedback across diverse scientific domains.
Search and Retrieval.
Search and retrieval lie at the heart of what differentiates web agents from static language models: they must locate, synthesize and refactor web-scale information in dynamic environments. WebExplorer [578] tackles this by generating challenging information-seeking trajectories and training agents to interleave a search tool and a browse tool over many turns, resulting in improved multi-step retrieval policies on complex benchmarks. WebSailor [41] likewise focuses on information-seeking under extreme uncertainty, constructing high-uncertainty search tasks and using a two-stage post-training pipeline to instill uncertainty-reducing search strategies for long-horizon web tasks. INFOGENT [624] also performs multi-query search across diverse web sources, enabling comprehensive information retrieval beyond task completion. For retrieval-augmented generation applications, RaDA [625] explicitly disentangles web-agent planning into Retrieval-augmented Task Decomposition and Retrieval-augmented Action Generation, so that each high-level subgoal and concrete action is conditioned on fresh search results while respecting context limits. In addition, GeAR [236] advances retrieval itself by augmenting a base retriever with graph expansion and an agent framework, enabling multi-hop passage retrieval along graph-structured evidence chains. Finally, WebRAGent [626] exemplifies retrieval-augmented generation for web agents by retrieving past trajectories and external knowledge into a multi-modal RAG policy.
Several GUI agents use retrieval capability to inject external experience or knowledge at inference time. Synapse [627] maintains an exemplar memory of abstracted trajectories and, for each new task, retrieves similar past trajectories as in-context plans, substantially improving multi-step decision-making. LearnAct [628] builds a three-agent pipeline that mines human demonstrations into a knowledge store and retrieves the most relevant instructions to guide mobile GUI execution on unseen and diverse tasks. MobileGPT’s Explore–Select–Derive–Recall framework [629] equips a phone agent with human-like app memory, storing modular procedures that can be recalled and recomposed when similar tasks reappear. TongUI [630] turns large-scale multi-modal web tutorials into the GUI-Net trajectory corpus, effectively giving agents a large offline memory of how humans operate hundreds of apps across multiple operating systems. RAG-GUI [631] makes retrieval explicit at inference time by querying web tutorials and generating textual guidelines that are fed into any VLM-based GUI agent as step-by-step hints. WebRAGent [626] shows a related pattern in web automation, combining a multi-modal retriever with a web agent so that each action is conditioned on retrieved guidance.
Search modules probe the research landscape to surface relevant papers and passages, enriching context and grounding subsequent reasoning. WebThinker [619] equips large reasoning models with a Deep Web Explorer module for autonomous web search and page navigation, and uses an Autonomous Think–Search–and–Draft strategy with RL-based training to decide when to browse and what to extract during long-horizon tasks. DeepResearcher [232] scales end-to-end training on the real web via reinforcement learning in live search environments, optimizing the iterative think–search loop and exhibiting emergent behaviors such as plan formulation, cross-source corroboration, and self-correction over multi-step research trajectories. Retrieval-centric agents like PaperQA [488] and its successor PaperQA2 [489] demonstrate that tightly coupling full-text retrieval with generation can substantially improve scientific QA accuracy while preserving cited provenance for literature synthesis.
In research settings, retrieval-augmented generation (RAG) grounds idea generation and analysis in freshly retrieved and citable passages. For example, GPT Researcher [607] is an autonomous research agent that retrieves sources and generates reports with citations, enabling traceability of claims to evidence. Chain of Ideas [608] organizes relevant literature into a chain-structured scaffold that mirrors a field’s progressive development, thereby guiding retrieval and ideation toward subsequent links in the argument. Meanwhile, Scideator [620] extracts key paper facets (e.g., purpose, mechanism, evaluation) and leverages them to drive targeted retrieval and recombination of ideas for identifying methodological or evidentiary gaps.
6.5.2 Self-evolving agentic reasoning
Effective self-evolving abilities enable these autonomous agents to adapt their behavior over time, retain crucial task context across interaction cycles and incrementally refine planning and execution strategies. The following paragraphs review how memory, feedback and self-reflection mechanisms support this continual improvement across these agent families, turning interaction from a one-shot pipeline into an iterative learning loop.
Memory.
Memory modules transform brittle, single-pass web interactions into reusable experience. For example, Agent Workflow Memory (AWM) [270] induces reusable workflows from successful trajectories and retrieves them to guide future tasks, while ICAL [632] distills noisy trajectories into high-level verbal and visual abstractions that are stored as a memory of multimodal experience and later injected into prompts. Control-oriented designs such as BrowserAgent [611] maintain explicit histories of past actions and intermediate conclusions in the agent’s context, instead of only re-encoding the current page view. GLM-based agents like AutoWebGLM [577] and AgentOccam [633] emphasize compressed page representations, using HTML simplification and carefully tuned observation spaces so that the agent’s prompt contains a shorter, more informative view of the state, with past steps preserved through the usual action–observation history. More integrated frameworks like LiteWebAgent [634] expose planning, memory and tree search as modular components, and can plug in workflow memories together with search traces for long-horizon reuse.
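The workflow-memory idea above (induce an abstract, reusable workflow from a successful trajectory and retrieve it for similar future tasks) is sketched below. The abstraction heuristic and keyword-overlap retrieval are deliberate simplifications standing in for LLM-based abstraction and embedding retrieval; this is not the cited method's implementation.

```python
from typing import Optional

def abstract_steps(trajectory: list[str]) -> list[str]:
    # Replace concrete arguments with placeholders so the workflow generalizes,
    # e.g. "type(q, 'red shoes')" -> "type(q, <arg>)".
    out = []
    for step in trajectory:
        head, _, _ = step.partition(",")
        out.append(head + ", <arg>)" if "," in step else step)
    return out

class WorkflowMemory:
    def __init__(self):
        self.workflows: dict[str, list[str]] = {}

    def induce(self, task: str, successful_trajectory: list[str]) -> None:
        self.workflows[task] = abstract_steps(successful_trajectory)

    def retrieve(self, task: str) -> Optional[list[str]]:
        # Naive retrieval by keyword overlap; a real system would use embeddings.
        best = max(self.workflows, default=None,
                   key=lambda t: len(set(t.split()) & set(task.split())))
        return self.workflows.get(best) if best else None

memory = WorkflowMemory()
memory.induce("buy shoes online",
              ["type(search_box, 'red shoes')", "click(first_result)", "click(add_to_cart)"])
print(memory.retrieve("buy a jacket online"))
```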
Recent GUI agents adopt explicit memory modules that store and retrieve task-relevant information during long-horizon execution. Earlier work such as MobileGPT [635] equips a mobile assistant with human-like app memory: it decomposes procedures into modular sub-tasks that are explored, selected, derived, and then stored so they can be recalled and reused instead of being re-discovered from scratch. Chain-of-Memory (CoM) [636] incorporates short- and long-term memory by recording action descriptions and task-relevant screen information in a dedicated memory module, enabling cross-application navigation to track task state. More recent systems build increasingly structured memories: MobA’s multifaceted memory module [587] maintains environment- and user-level traces that an adaptive planner retrieves when refining mobile task plans, while MGA [637] represents each step as a triad of current screenshot, spatial layout, and a dynamically updated structured memory that summarizes past transitions, mitigating error accumulation in long chains of actions. Mobile-Agent-E [638] adds a persistent long-term store of tips and shortcuts distilled from prior trajectories, so later plans can call reusable guidance and subroutines instead of relearning them. Mirage-1 [639] similarly organizes experience into a hierarchical skill memory that a planner can retrieve as reusable building blocks for new GUI tasks.
Long-term memory is crucial for autonomous research agents because it enables accumulation and reuse of prior knowledge, fostering continuity across research cycles. For example, Agent Laboratory [606] retains prior experiment code, results, and interpretation across its multi-phase workflow, enabling later stages to build on earlier work. GPT Researcher [607] generates reports with embedded citations and provides context for planning and extension of research topics. Chain of Ideas [608] structures relevant literature into a chain scaffold that reflects a field’s progression and can be revisited as new evidence arises. The AI Scientist-v2 [502] incorporates a progressive agentic tree-search approach that enables branching, backtracking and follow-up experimentation across iterations.
Agentic Feedback and Reflection.
Modern web agents treat interaction as a continual learning process, using feedback signals and reflection modules to refine their reasoning and recover from failures over time. Agent Q [113] combines guided Monte Carlo tree search with a self-critique stage, so that rollouts provide not only action sequences but also preference-style supervision. REAP [640] makes reflection explicit by treating it as a retrieval problem: it stores task–reflection key–value pairs summarizing what was learned from past trajectories, then, at inference time, retrieves the most relevant reflections and appends them to the agent’s prompt to guide planning on new web-navigation tasks. Agent-E [641] introduces an automatic validation pipeline that detects execution errors across text and vision, and then triggers self-refinement, enabling agents to iteratively correct their own workflows. Recon-Act [642] uses a dual-team architecture in which a Reconnaissance team extracts generalized tools from successful and failed trajectories, and an Action team applies these tools to re-plan tasks, forming a closed feedback loop. INFOGENT [624], on the other hand, leverages aggregator feedback to iteratively refine navigation and search strategies based on identified information gaps. WINELL [643], a web agent for continuous content updating, relies on feedback from the aggregation process to adapt its subsequent searches and update selection. Finally, self-reflective search agents such as WebSeer [644] integrate explicit self-reflection signals into reinforcement learning, constructing reflection-annotated trajectories and a two-stage training framework so that mis-solved or uncertain cases become targeted feedback that deepens future search and reasoning.
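Retrieval-based reflection of the kind described above can be reduced to storing (task, lesson) pairs and prepending the most similar lessons to the prompt for a new task. The sketch below uses Jaccard similarity over tokens purely for illustration; the store layout and prompt format are assumptions, not the cited system's design.

```python
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

class ReflectionStore:
    def __init__(self):
        self.entries: list[tuple[str, str]] = []   # (task description, lesson)

    def add(self, task: str, reflection: str) -> None:
        self.entries.append((task, reflection))

    def top_k(self, task: str, k: int = 2) -> list[str]:
        # Rank stored reflections by similarity to the new task description.
        ranked = sorted(self.entries, key=lambda e: jaccard(e[0], task), reverse=True)
        return [lesson for _, lesson in ranked[:k]]

store = ReflectionStore()
store.add("book a flight on an airline site",
          "Dates entered as free text were rejected; use the date picker widget.")
store.add("download a bank statement",
          "The export button only appears after selecting an account first.")

new_task = "book a hotel room with specific dates"
prompt = "\n".join(store.top_k(new_task)) + f"\nTask: {new_task}\nPlan:"
print(prompt)
```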
GUI agents also integrate explicit reflection so they can critique and repair their own plans. Early computer-control systems with structured reflection, for example, a zero-shot desktop control agent with structured self-reflection loops [645], provide conceptual templates that later GUI agents adapt to visual, multi-application settings. GUI-Reflection [646] instantiates this idea end-to-end: it builds a reflection-oriented task suite, automatically synthesizes error scenarios from existing successful trajectories, and adds an online reflection-tuning stage so multi-modal GUI models learn to detect failures, reason about causes, and generate corrective actions without human annotation. History-Aware Reasoning (HAR) [647] treats long-horizon GUI automation as a reflective learning problem, constructing reflective learning scenarios, synthesizing tailored correction guidelines, and designing a hybrid RL reward so the agent acquires episodic reasoning knowledge from its own errors and shifts from history-agnostic to history-aware reasoning. MobileUse [648] introduces hierarchical reflection on mobile devices, where the agent self-monitors at the action, subtask, and task level and triggers reflection on demand, pausing only when needed to diagnose and recover. InfiGUIAgent [586] integrates hierarchical and expectation–reflection reasoning in a second training stage, enabling the agent to run expectation–reflection cycles that compare expected and actual outcomes and revise multi-step plans when they diverge. Mobile-Agent-E [638] embeds an Action Reflector and Notetaker that evaluate executed steps and write refined Tips and Shortcuts back into persistent long-term memory, forming a self-evolution loop where the agent’s behavior is progressively refined from accumulated experience.
For autonomous research agents, learning from outcomes is essential to improve reasoning and experimental reliability over time. CycleResearcher [649] couples a research agent with a reviewer agent that provides automated peer-review feedback, and uses an iterative preference-training loop so the research agent can refine future drafts and decisions. MLR-Copilot [621] monitors execution results and human comments during experiment implementation and execution, using these signals to iteratively refine code, configurations and even upstream hypotheses. Dolphin [622] implements a closed-loop auto-research framework in which generated code is run on benchmarks and exception-guided debugging plus outcome analysis feed back into idea generation and implementation, pruning unproductive paths. At the search–reasoning interface, DeepResearcher [232] optimizes query, browsing, and answering policies via reinforcement learning on real-web trajectories, with outcome rewards inducing behaviors such as planning, cross-validation, and self-reflection. Agentic Deep Research [650] further emphasizes reward design for reasoning-driven search, arguing that principled incentives over answer quality and reasoning traces provide structured signals that improve downstream synthesis in deep-research agents.
6.5.3 Collective multi-agent reasoning
Collective multi-agent reasoning for web agents reframes browser use as cooperation among specialized roles rather than a single monolithic policy. WebPilot [651] models web task execution as a multi-agent system with a global planning agent that decomposes tasks and local MCTS-based executors that solve subtasks, jointly steering search in complex web environments. INFOGENT [624] organizes web information aggregation into a Navigator, Extractor, and Aggregator, so exploration, evidence extraction, and synthesis are handled by distinct cooperating agents with feedback from the Aggregator to guide future navigation; WINELL [643] leverages agentic web search to plan and execute iterative information gathering for discovering timely factual updates relevant to a target Wikipedia article. Recon-Act [642] adopts a Reconnaissance–Action paradigm in which a Recon team analyzes successful and failed trajectories to derive generalized tools or hints, and an Action team re-plans and executes with this evolving toolset. PAE [584] uses three roles, namely a task proposer, an acting web agent and a VLM-based evaluator, to autonomously generate vision-based web tasks and feed success signals back into the policy via RL. Hierarchical web agents such as Agent-E [84] and Plan-and-Act [93] similarly separate a high-level planner from a browser-navigation agent, enabling structured plan–execute cooperation. At a more conceptual level, Agentic Web [652] envisions the internet as an agentic web of interacting agents and analyzes how coordination, communication protocols and economic incentives shape such ecosystems, while Agentic Deep Research [650] frames information seeking as iterative feedback loops of reasoning, retrieval, and synthesis that can be instantiated by single- or multi-agent web research systems.
Multi-agent designs for GUI agents typically decompose “using a computer” into cooperating roles that plan, perceive, decide and execute. COLA [653] instantiates a scenario-aware task scheduler, a planner, a decision-agent pool, an executor, and a reviewer, so UI tasks are split into basic capability units and routed to domain-specialized agents rather than a single monolith. On mobile, Mobile-Agent-v2 [654] adopts a tri-role pattern with planner, decision, and reflection agents for progress navigation, local action selection, and error correction, while Mobile-Agent-E [638] further builds a hierarchical stack with a Manager and four subordinate agents (i.e. Perceptor, Operator, Action Reflector, Notetaker) plus a self-evolution module that learns long-term Tips and Shortcuts from experience. Mobile-Agent-V [655] similarly employs a video agent, decision agent, and reflection agent to coordinate multi-modal perception and execution, and MobileExperts [617] dynamically forms teams of expert agents with a dual-layer planner that allocates subtasks to tool-specialized experts. SWIRL [656] makes this structure explicit for RL, training a Navigator that converts language and screen context into structured plans and an Interactor that grounds those plans into atomic GUI actions within a multi-agent RL workflow. PC Agent [588] uses separate planning and grounding agents in a two-stage pipeline for desktop automation, illustrating how multi-agent decomposition can improve long-horizon PC control.
For autonomous research agents, multi-agent collaboration turns a single model’s linear workflow into the coordinated effort of a research group: specialized agents operate in parallel, exchange intermediate artifacts through explicit interfaces, and provide adversarial or complementary feedback to improve both creativity and rigor. For example, AgentRxiv [657] coordinates author, reviewer, and editor agents that iteratively refine manuscripts and share evolving artifacts across virtual “labs.” ARIA [501] instantiates a role-structured multi-LLM team that searches, filters, and synthesizes scientific literature into actionable experimental procedures. Earlier multi-agent designs such as CAMEL [503] demonstrate how cooperative role-play with tool access can enhance hypothesis generation and task decomposition. In experimental sciences, Coscientist [658] integrates planning, robotic instrument control, and analysis into a multi-agent closed loop that autonomously designs and executes wet-lab experiments. Finally, TAIS [511] defines a hierarchical team, namely project manager, data engineer and domain expert, that jointly discovers disease-predictive genes from expression data through coordinated division of labor.
7. Benchmarks
Benchmarks in this section tackle the challenge of evaluating the core capabilities of agentic reasoning amid diverse tools, contexts, and horizons, where existing suites often conflate errors across perception, planning, and coordination. We organize assessments into mechanism-level primitives, namely tool use (single- and multi-turn APIs), search (unimodal and multimodal web/video), memory and planning (long-horizon episodic and multi-session recall with feedback), and multi-agent coordination (games, simulations, social reasoning), together with application-level tests for end-to-end realism. This complementary framework pinpoints component failures, grounds progress in interpretable metrics, and guides scalable agent improvements.
Agentic reasoning has been evaluated through a rapidly growing set of benchmarks, but existing suites often differ in what they treat as the core capability, such as tool invocation accuracy, memory retention under long contexts, or coordination quality in multi-agent settings. To provide a coherent view, we organize benchmarks from two complementary perspectives. We first summarize benchmarks that isolate core mechanisms of agentic reasoning, which helps pinpoint where systems succeed or fail at the capability level. We then review application-level benchmarks that evaluate end-to-end agent behavior in realistic domains, capturing the combined effects of perception, planning, tool use, memory, and coordination.
7.1 Core Mechanisms of Agentic Reasoning
We begin with benchmarks that target mechanism-level capabilities, aiming to evaluate agentic reasoning in a more controlled and interpretable manner. Concretely, these benchmarks decompose agentic behavior into a small set of recurring primitives, including tool use, search, memory and planning, and multi-agent coordination. Such mechanism-centric evaluations make it easier to attribute performance changes to specific components, and they complement end-to-end benchmarks that may conflate multiple sources of errors.
7.1.1 Tool Use
Evaluating tool-using models remains an open challenge due to the diversity of tasks, tools, and usage scenarios involved [659]. The key difficulties arise from the wide range of available tools, varying levels of scenario complexity, and requirements specific to each task domain.
Single-Turn Tool Use.
While agentic reasoning often focuses on multi-turn or long-horizon interactions, single-turn tool use remains a foundational capability for evaluating LLMs' basic tool invocation skills. ToolQA [660] constructs a dataset of 1,530 dialogues involving 13 specialized tools, designed to assess LLMs' ability to interface with external knowledge sources in a question-answering context. APIBench [78] introduces a large-scale benchmark grounded in real-world APIs from HuggingFace, TorchHub, and TensorHub, comprising 1,645 unique APIs and 16,450 instruction–API pairs. It is used to train and evaluate Gorilla, an LLM capable of invoking a broad range of APIs, emphasizing generalization across diverse tool interfaces. ToolLLM-ToolBench [174] curates 16,464 real-world APIs across 49 categories from the RapidAPI Hub, and uses ChatGPT to generate diverse, instruction-style prompts for these APIs. The benchmark is used to train ToolLLaMA, a model that demonstrates strong tool-use capabilities and exhibits promising generalization to unseen APIs. MetaTool [661] introduces the TOOLE dataset, containing over 20,000 entries and a benchmark comprising approximately 200 tools across diverse scenarios, including software engineering, finance, and art design. It splits tool selection tasks into tool selection with similar choices, tool selection in specific scenarios, tool selection with possible reliability issues, and multi-tool selection. T-Eval [662] decomposes tool utilization into a series of sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review, and evaluates each step individually to provide a fine-grained assessment of tool-use capabilities. The benchmark includes a total of 23,305 test cases spanning 15 different tools. GTA (General Tool Agents) [663] targets realistic tool-use scenarios by emphasizing real user queries, real-world deployed tools, and multimodal inputs. It introduces 229 challenging tasks grounded in practical applications, spanning 14 tools across diverse domains. ToolRet [664] focuses specifically on the task of tool retrieval, introducing a heterogeneous benchmark consisting of 7.6K diverse retrieval tasks and a corpus of 43K tools.
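Although the suites above differ in scale and domain, single-turn tool-use evaluation usually reduces to comparing a predicted call against a reference call. The sketch below shows one plausible scoring scheme (exact tool-name match plus per-argument accuracy); the record format and metric names are illustrative assumptions, not the schema of any benchmark listed here.

```python
def score_call(predicted: dict, reference: dict) -> dict[str, float]:
    # Exact match on the tool name, plus the fraction of reference arguments
    # that the model reproduced exactly.
    name_ok = float(predicted.get("name") == reference["name"])
    ref_args = reference.get("arguments", {})
    if not ref_args:
        arg_acc = name_ok
    else:
        hits = sum(predicted.get("arguments", {}).get(k) == v for k, v in ref_args.items())
        arg_acc = hits / len(ref_args)
    return {"tool_accuracy": name_ok, "argument_accuracy": arg_acc}

prediction = {"name": "get_weather", "arguments": {"city": "Zurich", "unit": "F"}}
reference  = {"name": "get_weather", "arguments": {"city": "Zurich", "unit": "C"}}
print(score_call(prediction, reference))   # {'tool_accuracy': 1.0, 'argument_accuracy': 0.5}
```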
Multi-Turn Tool Use.
Multi-turn tool use offers a more realistic simulation of real-world applications, where agents autonomously select and sequence tools to solve complex tasks. ToolAlpaca [175] is one of the earliest efforts in this direction, using multi-agent simulations to generate 3,938 tool-use instances from over 400 real-world APIs across 50 distinct categories. SambaNova-ToolBench [665] introduces a benchmark centered on software tool manipulation for real-world tasks, with varying levels of API complexity to test agent capabilities. API-Bank [666] provides a dataset of 1,888 tool-use dialogues from 2,138 APIs, along with a runnable evaluation system containing 73 APIs and 314 tool-use test cases. UltraTool [667] evaluates tool-use capabilities across six dimensions: planning awareness, planning ability, creation, tool-use awareness, tool selection, and tool usage. The benchmark spans 22 domains, includes 2,032 tools, and provides 5,824 evaluation samples. ToolFlow distinguishes itself from prior benchmarks by emphasizing long-term planning. It features 224 expert-curated tasks involving 107 real-world tools, highlighting challenges in goal decomposition and multi-step decision-making. More recently, MTU-Bench [668] presents a multi-granularity benchmark for multi-turn, multi-tool scenarios, and releases MTU-Instruct, a large-scale instruction dataset containing 54,798 dialogues involving 136 tools. m&m's introduces a benchmark with over 4,000 multi-step, multimodal tasks involving 33 tools, including multimodal models, public APIs, and image processing modules. It also provides a high-quality subset of 1,565 task plans that are human-verified and executable end-to-end.
7.1.2 Search
To systematically assess an agent’s ability to acquire information through interaction, recent benchmarks cast search as a sequential reasoning problem and can be broadly categorized into unimodal and multimodal settings, differing in the nature of evidence sources, interaction spaces, and grounding requirements.
Unimodal Search.
Recent benchmarks for single-modal agentic search increasingly frame information seeking as a sequential, decision-driven process, emphasizing planning, interaction, and evidence synthesis. For example, WebWalker [669] emphasizes structured website traversal, explicitly modeling search as coordinated horizontal exploration and vertical drilling across interconnected pages. To reflect realistic open-world information seeking, InfoDeepSeek [670] introduces a dynamic Web setting with verifiable yet non-curated answers, highlighting robustness to noise and distributional shift. Several benchmarks scale search along temporal and informational dimensions: Mind2Web 2 [50] focuses on long-horizon browsing and citation-grounded synthesis, whereas RAVine [671] augments answer quality with process-level efficiency and interaction fidelity. Complementarily, WideSearch [672] and DeepWideSearch [673] distinguish between breadth-oriented large-scale fact aggregation and depth-oriented multi-hop reasoning, revealing the difficulty of jointly optimizing coverage and reasoning coherence. Domain-specific benchmarks further stress reliability under strict correctness constraints: MedBrowseComp [674] targets clinical decision support by requiring agents to integrate heterogeneous and potentially conflicting medical evidence, while FinAgentBench [675] evaluates retrieval-centric reasoning in financial analysis through document-type selection and fine-grained passage localization. Finally, LocalSearchBench [676] grounds agentic search in real-world local services, evaluating multi-constraint, multi-entity reasoning over large structured databases. Collectively, these benchmarks redefine agentic search evaluation around planning depth, interaction quality, evidence integration, and real-world fidelity, providing a more holistic assessment of search-centric reasoning in language-based agents.
Multimodal Search.
Recent benchmarks on multimodal agentic search move beyond static multimodal question answering to systematically evaluate an agent’s ability to actively retrieve, browse, and reason over heterogeneous information sources under realistic constraints. Benchmarks such as MMSearch [677] and its extension MMSearch-Plus [678] frame multimodal search as an end-to-end process, where agents must interpret multimodal queries and synthesize answers by jointly leveraging textual and visual evidence, explicitly modeling different input–output modality configurations. Complementing this setting, MM-BrowseComp [679] adapts the “hard-to-find, easy-to-verify” paradigm to multimodal web environments, enforcing mandatory image dependence to prevent text-only shortcuts and to stress-test multimodal evidence grounding during open-web browsing. BEARCUBS [680] further emphasizes computer-using agents in live web scenarios, requiring explicit interaction trajectories and multimodal manipulation (e.g., videos or 3D navigation), thereby evaluating not only retrieval accuracy but also procedural competence. Moving into domain-specific and tool-augmented regimes, PaperArena [681] evaluates multimodal agentic search in scientific workflows, where agents must coordinate PDF parsing, figure understanding, database queries, and web search to answer research-level questions. Finally, Video-BrowseComp [682] and VideoDR [683] extend agentic search to video-centric settings, requiring agents to extract visual-temporal cues from videos and iteratively validate hypotheses via open-web evidence, with carefully designed constraints to ensure dual dependence on video and external retrieval. Together, these benchmarks delineate a clear evolution toward evaluating multimodal agents as interactive researchers, highlighting planning, tool use, and multimodal evidence integration as first-class capabilities in agentic search.
7.1.3 Memory and Planning
A distinctive advantage of agents lies in their ability to leverage memory to achieve accurate long-term performance and strong reasoning capabilities. This ability can be assessed from two complementary perspectives. The first concerns memory management, which reflects how effectively an agent integrates, organizes, and retrieves long-term memories. The second concerns memory utilization, which captures how well an agent exploits historical information to support planning and informed feedback. In this section, we separately discuss benchmarks from these two aspects.
From the perspective of memory management, existing benchmarks can be broadly categorized into Long-Horizon Episodic Memory and Multi-session Recall, depending on whether the textual context consists of a single continuous long-form input or multiple discontinuous conversational sessions.
Long-Horizon Episodic Memory.
This category targets single-episode tasks with partial observability and delayed rewards, requiring agents to store and retrieve information over extended time spans. Benchmarks in this space evaluate memory retention, retrieval, and reasoning across long contexts. PerLTQA [684] simulates personalized dialogue, where agents answer questions using long-term persona and event memories. It includes 8.5K QA pairs and evaluates memory classification, retrieval ranking, and synthesis fidelity. ELITR-Bench [685] tests QA on noisy meeting transcripts, where relevant evidence may appear far earlier than the query. Models are scored via GPT-4 across various ASR noise levels and dialogue settings. Meanwhile, Multi-IF [686] and MultiChallenge [687] focus on multi-turn instruction following. Multi-IF [686] spans 4.5K tri-turn conversations in 8 languages, with evaluation based on strict and relaxed instruction accuracy. MultiChallenge [687] tests four memory-intensive phenomena: retention, inference, editing, and coherence, using 273 curated dialogues with binary pass/fail evaluation. TurnBench-MS [688] evaluates multi-step reasoning across 540 symbolic logic games, tracking win rate, round-level accuracy, and verifier usage. StoryBench [689] casts memory as decision-making in interactive narratives, where agents must remember prior choices to progress. It assesses decision accuracy, retry counts, and runtime efficiency. MemBench [690] tests factual and reflective memory across 60K episodes in participatory and observational settings, with metrics for accuracy, recall, capacity, and retrieval speed. MMRC [691] develops a multimodal memory benchmark focused on single-round multimodal conversations. Together, these benchmarks emphasize structured memory demands, with metrics capturing not just task success but also memory precision, synthesis quality, and robustness under long-context stress.
Multi-session Recall.
Multi-session Recall focuses on multi-episode tasks where agents must retain and integrate knowledge across separate sessions, supporting lifelong adaptation and mitigating catastrophic forgetting. A range of recent benchmarks systematically probe this capability under realistic, long-term interaction scenarios. LOCOMO [294] evaluates LLM agents on sustained conversational memory across 19-session dialogues, using tasks such as multi-hop QA, event summarization, and multi-modal response generation. MemSim [692] introduces a simulator-based framework with over 2,900 synthetic trajectories in daily life domains, assessing fact retention across sessions via accuracy, diversity, and rationality scores. LONGMEMEVAL [295] benchmarks assistants on five sub-tasks: information extraction, multi-session reasoning, temporal inference, knowledge updating, and abstention, over dialogue histories spanning up to 1.5M tokens, with GPT-4-judged accuracy and retrieval recall. REALTALK [693] presents 21-day real human conversations with 17K tokens per dyad, enabling evaluation of memory probing and persona simulation through multi-hop QA and emotional grounding metrics. Furthermore, MemoryAgentBench [694] unifies diverse memory tasks such as test-time learning, conflict resolution, and long-range understanding across multiple datasets, with task-specific metrics including classification accuracy, partial-match F1, and ROUGE. Mem-Gallery [282] introduces a multimodal long-term memory evaluation benchmark that systematically covers a wide range of memory management and utilization scenarios. Lastly, Evo-Memory [25] introduces a benchmark and a unified evaluation protocol for measuring experience reuse in test-time learning. Collectively, these benchmarks underscore the importance of dynamic memory integration across sessions and provide comprehensive evaluations across factual recall, adaptation, and reasoning.
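A minimal evaluation harness in the spirit of these multi-session setups is sketched below: the agent first ingests sessions one at a time, then answers probes whose evidence may span session boundaries. The `observe`, `end_session`, and `answer` methods and the substring-match scoring are simplifying assumptions rather than any benchmark's actual protocol.

```python
# Illustrative multi-session recall evaluation loop (assumed data layout and
# agent interface, not a specific benchmark's format).
def evaluate_multi_session(agent, sessions, probes) -> float:
    """sessions: list of dialogues (lists of turns); probes: [{"question", "answer"}]."""
    # Phase 1: the agent ingests sessions one at a time, with no probes visible,
    # so relevant facts must survive across session boundaries in its memory.
    for dialogue in sessions:
        for turn in dialogue:
            agent.observe(turn)
        agent.end_session()  # e.g., consolidate or summarize before the next session

    # Phase 2: probe recall of facts whose evidence may span several sessions.
    correct = 0
    for probe in probes:
        prediction = agent.answer(probe["question"])
        correct += int(probe["answer"].lower() in prediction.lower())
    return correct / max(len(probes), 1)
```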
From the perspective of memory utilization, we provide a detailed discussion of benchmarks that evaluate an agent’s ability to support planning and feedback using historical information.
Planning and Feedback.
Benchmarks targeting planning and feedback primarily assess whether agents can effectively utilize memory to support multi-step planning based on environmental feedback, and maintain coherent internal state over extended interactions. First, ALFWorld [48] employs interactive environments to evaluate the consistency of multi-step planning, requiring agents to accumulate observations across actions and maintain latent internal states throughout execution. Moreover, formal planning benchmarks such as PlanBench [695] and ACPBench [696] assess planning capabilities in explicitly defined dynamic environments, testing whether agents can correctly reason about action preconditions, effects, reachability, and overall plan validity. TEXT2WORLD [697] requires agents to integrate fragmented textual descriptions into a coherent, executable world model, evaluating their capacity to continuously consolidate historical facts into structured planning representations. More recent benchmarks place greater emphasis on feedback integration and planning under non-stationary conditions. For example, REALM-Bench [698] introduces dynamic disturbances in real-world manufacturing scenarios, requiring agents to remember prior commitments and replan when underlying assumptions are violated, while TravelPlanner [699] focuses on accurate itinerary construction under constrained and evolving information. Finally, FlowBench [700] and UrbanPlanBench [701] assess planning performance in procedural and domain-specific settings, respectively, where agents must preserve conversational or policy context and apply it consistently across decision steps. Together, these benchmarks go beyond one-shot plan generation and systematically investigate whether agents can leverage historical information to support sustained planning, adaptive feedback integration, and iterative decision revision over time.
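The plan-act-replan pattern that these benchmarks stress can be summarized in a short loop: the agent commits to a multi-step plan, executes it while accumulating feedback, and replans when the environment signals that an assumption no longer holds. The `planner` and `env` interfaces below are illustrative assumptions, not part of any cited benchmark.

```python
# Minimal sketch of planning with feedback-driven replanning (assumed interfaces).
def run_episode(planner, env, goal, max_steps=50) -> bool:
    observation = env.reset(goal)
    history = []                                       # accumulated (action, feedback) pairs
    plan = planner.make_plan(goal, observation, history)
    for _ in range(max_steps):
        if not plan:                                   # plan exhausted without success: replan
            plan = planner.make_plan(goal, observation, history)
        action = plan.pop(0)
        observation, feedback, done = env.step(action)
        history.append((action, feedback))
        if done:
            return True
        if feedback == "precondition_violated":
            # An assumption behind the current plan no longer holds (cf. the dynamic
            # disturbances in REALM-Bench-style tasks): drop the stale plan and replan
            # using the accumulated history.
            plan = []
    return False
```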
7.1.4 Multi-Agent System
To evaluate coordination, competition, and decision making beyond isolated reasoning, recent benchmarks situate multi-agent systems in interactive environments. These works broadly span game-based evaluations, simulation-centric real-world scenarios, and language-driven social reasoning tasks.
Game-based reinforcement learning evaluation.
Game-based reinforcement learning evaluation benchmarks leverage classical and novel gaming environments to systematically compare the performance of multi-agent RL algorithms under cooperative and adversarial settings. MAgent ([702]) facilitates massive-scale multi-agent scenarios such as pursuit and resource competition within customizable grid-worlds, evaluating individual cumulative rewards and competitive metrics like resource occupancy rates. Pommerman ([703]) adapts the classic Bomberman game for cooperative and adversarial interactions, quantifying performance through win rates, survival duration, and kill-to-suicide ratios. SMAC ([704]) centers on decentralized micromanagement challenges in StarCraft II scenarios, evaluating team success via win rates, average damage output, and formation dispersion. MineLand ([705]) utilizes Minecraft as a realistic ecological simulation for large-scale multi-agent coordination, with up to 64 agents cooperating to meet physical needs under partial observability. TeamCraft ([706]) also employs Minecraft to benchmark embodied multi-modal agents tasked with interpreting visual, textual, and environmental prompts to collaboratively complete 55,000 procedurally generated task instances. Melting Pot ([707]) assesses agents’ zero-shot generalization capabilities in diverse social dilemma environments, utilizing metrics such as per-capita return, social welfare, and inequality indices. BenchMARL ([708]) provides standardized algorithm comparisons across multiple scenarios (e.g., SMACv2, VMAS, MPE), measuring convergence rates, final performance, and hyperparameter sensitivity. Finally, Arena ([709]) encompasses a comprehensive suite of cooperative and adversarial games across various complexities, evaluating individual returns, collective social welfare, and emergent communication protocols.
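The aggregate metrics these game benchmarks report (win rate, per-capita return, and inequality indices) can be computed from per-episode logs as in the sketch below; the log schema is an assumption made for illustration.

```python
# Illustrative aggregation of common multi-agent game metrics from episode logs
# shaped like [{"won": bool, "returns": {agent_id: float}}, ...] (assumed schema).
def summarize_episodes(episodes) -> dict:
    win_rate = sum(ep["won"] for ep in episodes) / len(episodes)
    per_capita = [sum(ep["returns"].values()) / len(ep["returns"]) for ep in episodes]

    def gini(values) -> float:
        # Standard Gini coefficient over per-agent returns as a simple inequality index.
        xs = sorted(values)
        n = len(xs)
        total = sum(xs)
        if total <= 0:
            return 0.0
        cum = sum((i + 1) * x for i, x in enumerate(xs))
        return (2 * cum) / (n * total) - (n + 1) / n

    inequality = [gini(list(ep["returns"].values())) for ep in episodes]
    return {
        "win_rate": win_rate,
        "mean_per_capita_return": sum(per_capita) / len(per_capita),
        "mean_return_inequality": sum(inequality) / len(inequality),
    }
```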
Simulation-centric real-world assessment.
Simulation-centric real-world benchmarks simulate realistic or pseudo-realistic environments, emphasizing scalability, partial observability, and dynamic planning. SMARTS ([710]) offers a scalable multi-agent driving platform for real-world traffic scenarios like merges and intersections, with evaluation based on collision rates, task completion, and agent behavior distributions. Nocturne ([711]) provides high-throughput, partially observable driving simulations using Waymo trajectories, testing coordination and human-like behavior in tasks such as intersections and roundabouts. MABIM ([712]) benchmarks multi-echelon inventory management, simulating cooperative and competitive retail dynamics, evaluated via profit metrics across diverse inventory settings. IMP-MARL ([713]) addresses infrastructure inspection and maintenance scheduling, measuring risk reduction and cost efficiency in large-scale systems. POGEMA ([714]) focuses on decentralized multi-agent pathfinding in grids, tracking success rate, path efficiency, and large-scale coordination. INTERSECTIONZOO ([715]) studies contextual RL for cooperative eco-driving at intersections, using traffic simulations to evaluate emissions and travel-time performance. REALM-Bench ([698]) introduces real-world planning tasks from logistics to disaster relief, with dynamic disruptions, multi-threaded dependencies, and evaluation via planning quality, adaptability, and constraint satisfaction. Together, these benchmarks reflect challenges in scaling, uncertainty, coordination, and dynamic adaptation, offering rigorous testbeds for real-world multi-agent systems.
Language, Communication, and Social Reasoning.
Benchmarks in Language, Communication, and Social Reasoning explore multi-agent communication protocols, Theory-of-Mind reasoning, game-theoretic interactions, and language-driven coordination. LLM-Coordination ([716]) examines collaborative reasoning and joint-planning abilities of LLM agents through cooperative gameplay (e.g., Hanabi, Overcooked-AI), measured by holistic scores and fine-grained coordination question accuracy. AVALONBENCH ([717]) leverages the social deduction game Avalon to assess role-conditioned language-based reasoning, with datasets of thousands of five-player dialogues and metrics on win-rate, role accuracy, and voting dynamics. Welfare Diplomacy ([718]) extends the classic game Diplomacy to general-sum welfare negotiation, using 50-game datasets to quantify coalition stability and welfare-oriented strategic reasoning. MAgIC ([719]) covers social deduction and classic dilemmas (e.g., Chameleon, Prisoner's Dilemma), employing handcrafted scenario datasets to benchmark reasoning, deception, coordination, and rationality. BattleAgentBench ([19]) assesses language-based cooperative and competitive dynamics in strategic gameplay environments, scoring navigation accuracy, agent interactions, and exploitability across diverse map datasets. COMMA ([720]) evaluates multimodal communicative reasoning through collaborative puzzle-solving tasks involving visual-language coordination, measured by grounding accuracy, privacy compliance, and dialogue effectiveness across thousands of scenarios. IntellAgent ([721]) introduces synthetic conversational AI tasks in retail and airline domains, generating extensive policy-constrained dialogue datasets evaluated by conversational success, mistake frequency, and policy adherence. Finally, MultiAgentBench ([21]) provides a comprehensive assessment across tasks such as Minecraft building, coding, and bargaining, employing dynamic key-performance indicators and LLM-scored communication quality across various multi-agent topologies and scenarios.
7.2 Applications of Agentic Reasoning
While mechanism-centric benchmarks help isolate individual capabilities, real-world deployments require these capabilities to work together under realistic constraints, such as partial observability, long-horizon dependencies, and safety-critical decisions. We therefore next review application-level benchmarks that evaluate end-to-end agent performance across representative environments, with tasks that jointly stress perception, reasoning, action execution, and coordination.
We organize the discussion into six categories based on the application environment: Embodied Agents, Scientific Discovery Agents, Autonomous Research Agents, Medical and Clinical Agents, Web Agents, and Tool-Use Agents. Each subsubsection introduces representative benchmarks and describes their design motivation, task format, and evaluation metrics.
7.2.1 Embodied Agents
Benchmarks under this category evaluate agents that interact with physical or simulated environments, requiring grounding, perception, and action planning. AgentX [722] provides a diverse suite of vision-language embodied tasks in driving and sports, where agents must make decisions using multimodal information from videos. It emphasizes reasoning across scenes with occlusions, temporal gaps, or distractors. BALROG [723] builds a reinforcement learning-centric framework for benchmarking agentic planning in game environments, focusing on instruction-following, temporal abstraction, and error correction. ALFWorld [48] links language instructions to object interactions in a text-based 3D environment, evaluating perception-grounded execution. AndroidArena [724] targets GUI-based mobile tasks, where agents must perform actions like form-filling and app navigation using vision-language understanding. StarDojo [725] leverages the open-ended Stardew Valley game to study social planning and role-based coordination. MindAgent [726] and NetPlay [727] create multiplayer gaming testbeds to benchmark emergent social reasoning and negotiation under uncertainty. OSWorld [728] offers a simulated desktop environment with diverse cross-app productivity tasks, such as opening files, converting formats, and modifying documents. These environments challenge agents to coordinate between perception, planning, and symbolic action in dynamic and often partially observable scenarios.
7.2.2 Scientific Discovery Agents
Scientific benchmarks aim to test agents' capabilities in knowledge acquisition, hypothesis generation, and experimental automation. DISCOVERYWORLD [729] introduces a virtual lab where agents explore scientific phenomena in biology, chemistry, and physics through simulated tools and instruments. ScienceWorld [730] focuses on elementary science experiments using textual instructions and environment interactions, requiring step-by-step hypothesis testing. ScienceAgentBench [731] builds a benchmark from real-world scientific papers, translating tasks like code implementation, figure generation, and variable extraction into executable subtasks, assessing agents’ ability to automate the research process. The AI Scientist [623] simulates a full end-to-end research pipeline, where agents perform literature review, method writing, experiment execution, and peer-review simulation. LAB-Bench [732] evaluates biology-specific agents on tasks involving genetic sequence reasoning and experiment planning. MLAgentBench [733] benchmarks agents’ ability to autonomously train, evaluate, and tune machine learning models, offering realistic experimentation workflows. These benchmarks collectively probe open-ended reasoning, long-horizon planning, and scientific grounding in semi-structured data settings.
7.2.3 Autonomous Research Agents
This category benchmarks agents designed for long-horizon workflows across general-purpose research, office, or planning tasks. WorkArena [734] and its extension WorkArena++ [735] propose enterprise task benchmarks where agents must complete ticket-based workflows involving retrieval, summarization, and coordination across documents. OfficeBench [736] simulates a productivity software suite environment with tasks such as creating meeting memos, modifying spreadsheets, and replying to emails, emphasizing goal decomposition and tool selection. PlanBench [695] and FlowBench [700] test general workflow planning skills with abstracted task graphs and structured dependencies. ACPBench [696] evaluates agents in assistant–collaborator–planner triads, tracking performance in a hybrid role hierarchy. TRAIL [737] focuses on multi-agent trace debugging and error attribution [738] in LLM-based systems, providing dense annotations for reasoning chains. CLIN [739] introduces lifelong few-shot learning benchmarks where agents adapt to distribution shift and task evolution. Agent-as-a-Judge [740] studies peer-review style evaluation with agents grading reasoning chains and correctness of other agents’ outputs. InfoDeepSeek [670] measures information-seeking abilities in open-domain QA and synthesis tasks. Together, these benchmarks capture the growing demand for agentic reasoning in complex knowledge workflows that involve abstraction, iteration, and evaluation.
7.2.4 Medical and Clinical Agents
These benchmarks test agents’ abilities to reason with clinical knowledge, patient data, and multimodal biomedical sources. AgentClinic [741] introduces a virtual hospital environment where agents make diagnostic decisions based on patient symptoms and medical imaging. MedAgentBench [742] combines medical QA, patient simulation, and retrieval tasks in a multi-format benchmark grounded in standardized exams. MedAgentsBench [743] evaluates multi-hop medical reasoning over structured and unstructured data, scoring agents on correctness and evidence alignment. EHRAgent [553] benchmarks agents working over structured electronic health record (EHR) tables and clinical notes to complete tasks like diagnosis code prediction and medication reasoning. MedBrowseComp [674] focuses on browsing-based medical QA, where agents must retrieve and verify information across web pages. ACC [744] explores trustworthy medical agents with retrieval, hallucination detection, and citation-based support evaluation. MedAgents [573] uses a collaborative multi-agent dialogue setup to simulate patient–doctor–nurse interactions, scoring fluency and factual accuracy. GuardAgent [745] proposes a clinical privacy safeguard agent with structured risk detection benchmarks on EHR and website forms. These datasets emphasize correctness, trustworthiness, and safety in real-world clinical deployment contexts.
7.2.5 Web Agents
Web agents operate in realistic browsing environments and are benchmarked on their ability to parse layouts, execute actions, and handle dynamic content. WebArena [45] introduces a browser-based benchmark suite containing realistic, fully functional websites across several domains such as shopping, where agents complete tasks with structured goals and click-based APIs. VisualWebArena [46] extends this with visual rendering, requiring agents to parse webpage images and align instructions with rendered components. WebVoyager [610] proposes goal-driven navigation with long-horizon tasks involving multi-page traversal and backtracking. Mind2Web [50] targets cross-domain web automation with multi-task datasets and rich grounding annotations. WebCanvas [746] supports fine-grained layout manipulation, such as drag-drop and resize actions. WebLINX [747] simulates information gathering tasks with browsing, summarization, and answer synthesis. BrowseComp-ZH [748] brings language and infrastructure diversity with Chinese websites, challenging agents on multilingual understanding. LASER [749], WebWalker [669], and AutoWebBench [577] focus on structured page representation, real-time action execution, and policy learning in web navigation. These benchmarks highlight perception, grounding, and policy generalization challenges in web settings.
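Most of these benchmarks expose a small structured action space (click, type, navigate) that the agent's policy must map page observations onto, as in the toy loop below; the action schema and the `browser`/`policy` objects are illustrative assumptions rather than any benchmark's actual interface.

```python
# Minimal sketch of a web-agent episode over a click/type/navigate action space
# (assumed interfaces; not WebArena's or any other benchmark's API).
SUPPORTED_ACTIONS = {"click", "type", "goto", "scroll", "stop"}


def web_episode(policy, browser, instruction: str, max_steps: int = 30):
    page = browser.open("about:blank")
    for _ in range(max_steps):
        # The policy sees the instruction plus a textual or visual page observation
        # (e.g., an accessibility tree or screenshot) and emits one structured action.
        action = policy.next_action(instruction, page.observation())
        if action["name"] not in SUPPORTED_ACTIONS:
            continue                              # ignore malformed actions instead of crashing
        if action["name"] == "stop":
            return action.get("answer")           # final answer / task-completion signal
        page = browser.execute(action)            # click, type, goto, or scroll
    return None                                   # step budget exhausted
```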
7.2.6 General Tool-Use Agents
This group of benchmarks emphasizes LLM agents' ability to invoke, coordinate, and reason over tools and APIs. GTA [663] presents a realistic tool-use benchmark grounded in user queries and deployed software tools, spanning APIs from image generation to analytics dashboards. NESTFUL [750] evaluates nested API invocation tasks requiring compositional planning across toolchains. CodeAct [99] simulates executable function calling and evaluates agents on parsing, composition, and runtime accuracy. RestGPT [196] connects LLMs with RESTful APIs via coarse-to-fine planning pipelines, tested on 60+ tool types. Search-o1 [23] frames tool use as sequential retrieval, with benchmarks spanning code search, PDF querying, and scientific tool usage. Agentic RL [751] proposes a reinforcement learning agent with access to tool interfaces and evaluation tasks such as calendar scheduling and translation. ActionReasoningBench [752] benchmarks agents’ ability to reason about action side effects and downstream consequences using a structured action grammar. R-Judge [753] introduces safety judgment benchmarks where agents assess risky plans involving tools. These datasets jointly reflect the increasing complexity and compositionality of tool-augmented agent environments.
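Nested invocation of the kind stressed by benchmarks such as NESTFUL can be illustrated with a tiny plan executor in which one tool's output feeds the next tool's arguments; the registry, the two toy tools, and the plan format below are hypothetical.

```python
# Illustrative nested tool-call executor (hypothetical tools and plan format).
TOOLS = {
    "search_papers": lambda query: [{"id": "p1", "title": f"Result for: {query}"}],
    "summarize": lambda text: str(text)[:80] + "...",
}


def execute_plan(plan) -> dict:
    """plan: [{"tool": name, "args": {...}, "out": var}]; "$var" refers to an earlier output."""
    variables = {}
    for step in plan:
        # Resolve arguments, substituting "$name" references with prior results,
        # which is what makes the invocation nested rather than independent.
        args = {
            key: variables[val[1:]] if isinstance(val, str) and val.startswith("$") else val
            for key, val in step["args"].items()
        }
        variables[step["out"]] = TOOLS[step["tool"]](**args)
    return variables


# Two-step nested call: the summarizer consumes the search tool's output.
outputs = execute_plan([
    {"tool": "search_papers", "args": {"query": "agentic reasoning"}, "out": "hits"},
    {"tool": "summarize", "args": {"text": "$hits"}, "out": "summary"},
])
```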
8. Open Problems
Agentic reasoning faces critical open challenges in adapting to dynamic user needs, sustaining long-horizon planning amid error accumulation and credit assignment gaps, building reliable world models for lookahead simulation, enabling scalable multi-agent collaboration through trainable policies, unlocking efficient yet auditable latent reasoning, and establishing governance for safe autonomous operations. These issues stem from non-stationary environments, partial observability, and emergent risks in extended interactions, tools, and ecosystems, where current methods falter on generalization, interpretability, and safety. Ultimately, resolving them demands integrated frameworks for adaptive learning, evaluation benchmarks, and holistic safeguards to realize robust, deployable agentic systems.
In this section, we highlight open problems arising from user-centric personalization, long-horizon interaction and credit assignment, world-model-based reasoning, multi-agent collaboration and training, latent internal reasoning, and the governance of agentic systems operating autonomously in real-world environments.
8.1 User-centric Agentic Reasoning and Personalization
User-centric agentic reasoning [754, 755] refers to an agent’s ability to tailor its reasoning and actions to a specific individual user by modeling user characteristics, preferences, and interaction history over time. Rather than optimizing a fixed, task-defined objective, a user-centric agent treats the user as part of the environment and continuously adapts its strategy through extended, multi-turn interaction. This requires the agent to dynamically infer evolving user intent, accommodate changes in goals and behavior styles, and adjust decisions based on explicit or implicit user feedback as the dialogue progresses. Crucially, user-centric agentic reasoning involves balancing short-term task rewards with long-term user experience, satisfaction, and trust, which introduces non-stationary objectives and long-horizon credit assignment challenges beyond conventional agentic reasoning settings.
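The objective tension described above can be made concrete with a toy reward model that blends immediate task reward with a running estimate of user satisfaction; the exponential-moving-average update and the mixing weight are illustrative assumptions, not a proposed method.

```python
# Toy sketch of balancing short-term task reward against long-term user satisfaction.
class UserModel:
    def __init__(self, decay: float = 0.9):
        self.satisfaction = 0.0   # exponential moving average of user feedback
        self.decay = decay

    def update(self, feedback: float) -> None:
        """feedback in [-1, 1], e.g., inferred from explicit ratings or implicit signals."""
        self.satisfaction = self.decay * self.satisfaction + (1 - self.decay) * feedback


def combined_reward(task_reward: float, user: UserModel, alpha: float = 0.5) -> float:
    # Non-stationary objective: as the satisfaction estimate drifts over the
    # interaction, the same action can receive different combined rewards.
    return alpha * task_reward + (1 - alpha) * user.satisfaction
```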
8.2 Long-horizon Agentic Reasoning from Extended Interaction
A central open challenge in agentic reasoning is robust long-horizon planning and credit assignment across extended interactions. While methods such as ReAct and Tree of Thoughts improve short-horizon reasoning ([5, 4]), errors still compound rapidly in long tasks, as illustrated by embodied agents like Voyager ([36]). RL-trained agents such as WebRL and Agent-R1 improve performance in realistic environments but rely on heavily engineered, domain-specific rewards and largely treat episodes independently ([409, 28]). More recent process-aware approaches attempt to construct finer-grained credit signals ([756, 15, 757]), yet remain environment-specific. A core open problem is how to assign credit across tokens, tool calls, skills, and memory updates, and how to generalize such learning across a long sequence of episodes and tasks.
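At its core, the credit-assignment question is how to turn a sparse trajectory-level outcome (possibly augmented with process rewards) into per-step learning signals. The sketch below shows the generic discounted-return computation over an episode of token, tool-call, or skill steps; it is standard RL bookkeeping, not any cited method's specific formulation.

```python
# Generic per-step credit from discounted returns (standard RL computation).
def discounted_returns(step_rewards, gamma: float = 0.99):
    """step_rewards: one entry per token / tool call / skill step in the episode."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in reversed(range(len(step_rewards))):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns


# Example: a 5-step episode with only a terminal outcome reward. Early steps receive
# exponentially discounted credit (~0.961, 0.970, 0.980, 0.990, 1.0), illustrating why
# long horizons make it hard to tell which tool call actually mattered.
print(discounted_returns([0.0, 0.0, 0.0, 0.0, 1.0]))
```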
8.3 Agentic Reasoning with World Models
World-model-based agents [758, 288] aim to mitigate myopic reasoning by enabling internal simulation and lookahead. Model-based RL systems such as DreamerV3 demonstrate the effectiveness of imagined rollouts for long-horizon control ([759]), while recent LLM-based agents adapt world models to web, code, and GUI environments ([760, 758, 761, 762]). However, current designs rely on ad hoc representations and are typically trained on short-horizon or environment-specific data, raising concerns about calibration and generalization. Only a few works explore co-evolving world models and agents over long time scales ([582, 763]). An open problem is how to jointly train, update, and evaluate world models in non-stationary environments, and how to assess their causal impact on downstream planning reliability.
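The basic lookahead mechanism these agents rely on can be sketched as scoring candidate actions by imagined rollouts under a learned transition model; the `world_model`, `policy`, and `value` components below are assumed interfaces, not a particular system's design.

```python
# Minimal sketch of world-model-based lookahead (assumed component interfaces).
def plan_with_lookahead(world_model, policy, value, state, candidates, horizon=5, gamma=0.99):
    best_action, best_score = None, float("-inf")
    for action in candidates:
        sim_state, sim_action = state, action
        score, discount = 0.0, 1.0
        for _ in range(horizon):
            # Imagined transition: the model predicts the next state and reward
            # without acting in the real environment.
            sim_state, reward = world_model.predict(sim_state, sim_action)
            score += discount * reward
            discount *= gamma
            sim_action = policy.propose(sim_state)      # continue the imagined rollout
        score += discount * value.estimate(sim_state)   # bootstrap beyond the horizon
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```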
8.4 Multi-agent Collaborative Reasoning and Training
Multi-agent collaboration has emerged as a powerful paradigm for scaling agentic reasoning through role specialization and division of labor ([67, 764, 66]). While debate- and role-based systems often outperform single agents, most collaboration structures are still manually designed. Recent multi-agent RL approaches begin to treat collaboration itself as a trainable skill ([381, 385, 26]), but credit assignment at the group level remains poorly understood. Scaling to larger agent populations further introduces challenges in topology adaptation, coordination overhead, and safety ([765, 766, 738]). A key open problem is how to learn adaptive, interpretable collaboration policies that remain robust under partial observability and adversarial conditions.
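A minimal leader-worker-critic loop illustrates the kind of role-specialized collaboration such systems hand-design today; the three roles are plain text-in/text-out callables, and the PASS-based stopping rule is a simplifying assumption.

```python
# Illustrative leader-worker-critic collaboration loop (hypothetical role interfaces).
def collaborate(leader, workers, critic, task: str, max_rounds: int = 3) -> str:
    draft = ""
    for _ in range(max_rounds):
        # Leader decomposes the task; each worker solves one subtask.
        subtasks = leader(f"Decompose into subtasks for {len(workers)} workers:\n{task}")
        partials = [
            worker(f"Solve your subtask:\n{sub}")
            for worker, sub in zip(workers, subtasks.split("\n"))
        ]
        draft = leader("Integrate the partial solutions:\n" + "\n".join(partials))
        verdict = critic(f"Task: {task}\nCandidate solution:\n{draft}\nReply PASS or list issues.")
        if verdict.strip().startswith("PASS"):
            break
        # Otherwise fold the critic's feedback into the next round's task description.
        task = f"{task}\nCritic feedback to address:\n{verdict}"
    return draft
```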
8.5 Latent Agentic Reasoning
Latent agentic reasoning [767, 768, 413] explores performing planning, decision-making and collaboration in internal latent spaces rather than explicit natural language or symbolic traces. Recent work suggests that latent reasoning can improve efficiency and scalability, but at the cost of reduced interpretability and controllability. In agentic settings, this raises additional challenges, including how to align latent reasoning with external objectives, tools, agents and memory systems. Diagnosing failures becomes particularly difficult when intermediate reasoning steps are not externally observable. An open problem is how to design learning objectives, probing methods, and evaluation benchmarks that make latent agentic reasoning both effective and auditable.
8.6 Governance of Agentic Reasoning
Governance is a cross-cutting challenge for agentic reasoning systems that act autonomously over tools, environments, and other agents. Beyond standard LLM safety issues, agentic systems introduce new risks due to long-horizon planning, persistent memory, and real-world action execution ([769]). Failures may arise from interactions across time and components, making attribution and auditing difficult. Existing benchmarks and guardrails mainly focus on short-horizon behaviors ([745, 753]), leaving planning-time failures and multi-agent dynamics underexplored. A central open problem is to develop governance frameworks that jointly address model-level alignment, agent-level policies, and ecosystem-level interactions under realistic deployment conditions.
References
[1] Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems. 35. pp. 24824–24837.
[2] Zhou et al. (2022). Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
[3] Gao et al. (2023). Pal: Program-aided language models. In International Conference on Machine Learning. pp. 10764–10799.
[4] Yao et al. (2023). Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems. 36. pp. 11809–11822.
[5] Yao et al. (2023). React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
[6] Schick et al. (2023). Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems. 36. pp. 68539–68551.
[7] Shen et al. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems. 36. pp. 38154–38180.
[8] Wang et al. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science. 18(6). pp. 186345.
[9] Singh et al. (2025). Agentic retrieval-augmented generation: A survey on agentic rag. arXiv preprint arXiv:2501.09136.
[10] Huang, Yizheng and Huang, Jimmy (2024). A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981.
[11] Wang et al. (2024). Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741.
[12] Chhikara et al. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
[13] Li et al. (2025). MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models. arXiv preprint arXiv:2505.22101.
[14] Shinn et al. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems. 36. pp. 8634–8652.
[15] Yan et al. (2025). Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv preprint arXiv:2508.19828.
[16] Chen et al. (2023). Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288.
[17] Hong et al. (2024). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In The Twelfth International Conference on Learning Representations.
[18] Wang et al. (2024). Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. In Proc. 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL2024).
[19] Wang et al. (2024). Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems. arXiv preprint arXiv:2408.15971.
[21] Zhu et al. (2025). Multiagentbench: Evaluating the collaboration and competition of llm agents. arXiv preprint arXiv:2503.01935.
[22] Ni et al. (2025). Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks. In Findings of the Association for Computational Linguistics: ACL 2025. pp. 9804–9819.
[23] Li et al. (2025). Search-o1: Agentic search-enhanced large reasoning models. arXiv preprint arXiv:2501.05366.
[24] Xu et al. (2025). A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110.
[25] Wei et al. (2025). Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv preprint arXiv:2511.20857.
[26] Ma et al. (2024). Coevolving with the other you: Fine-tuning llm with sequential cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems. 37. pp. 15497–15525.
[27] Jin et al. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
[28] Wei et al. (2025). Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. arXiv preprint arXiv:2505.16421.
[29] Trinh et al. (2024). Solving olympiad geometry without human demonstrations. Nature. 625(7995). pp. 476–482.
[30] Romera-Paredes et al. (2024). Mathematical discoveries from program search with large language models. Nature. 625(7995). pp. 468–475.
[31] Sapkota et al. (2025). Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI.
[33] Bran et al. (2023). Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376.
[34] Bousetouane, Fouad (2025). Physical AI Agents: Integrating Cognitive Intelligence with Real-World Action. arXiv preprint arXiv:2501.08944.
[35] Ding et al. (2024). Matexpert: Decomposing materials discovery by mimicking human experts. arXiv preprint arXiv:2410.21317.
[36] Wang et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
[37] Meghan et al. (2024). EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning. arXiv preprint arXiv:2410.23968.
[38] Zhao et al. (2025). Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning. arXiv preprint arXiv:2504.12680.
[39] Li et al. (2024). Mmedagent: Learning to use medical tools with multi-modal agent. arXiv preprint arXiv:2407.02483.
[40] Huang et al. (2025). Biomni: A general-purpose biomedical ai agent. biorxiv.
[41] Li et al. (2025). WebSailor: Navigating Super-human Reasoning for Web Agent. arXiv preprint arXiv:2507.02592.
[42] Zheng et al. (2025). Skillweaver: Web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
[43] Sapkota et al. (2025). Ai agents vs. agentic ai: A conceptual taxonomy, applications and challenges. arXiv preprint arXiv:2505.10468.
[44] Liu et al. (2024). A dynamic llm-powered agent network for task-oriented agent collaboration. In First Conference on Language Modeling.
[47] Jang et al. (2024). Videowebarena: Evaluating long context multimodal agents with video understanding web tasks. arXiv preprint arXiv:2410.19100.
[48] Shridhar et al. (2020). Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768.
[49] Deng et al. (2023). Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems. 36. pp. 28091–28114.
[50] Gou et al. (2025). Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge. arXiv preprint arXiv:2506.21506.
[51] Huang, Jie and Chang, Kevin Chen-Chuan (2022). Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403.
[52] Chen et al. (2025). Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567.
[53] Xu et al. (2025). Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686.
[54] Ke et al. (2025). A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems. arXiv preprint arXiv:2504.09037.
[55] Zhang et al. (2025). A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827.
[56] Zhang et al. (2025). The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547.
[57] Lin et al. (2025). A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications. arXiv preprint arXiv:2510.16724.
[58] Fang et al. (2025). A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems. arXiv preprint arXiv:2508.07407.
[59] Gao et al. (2025). A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046.
[60] Guo et al. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
[61] Jiang et al. (2025). Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning. arXiv preprint arXiv:2503.00223.
[62] Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[63] Shao et al. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[64] Lu et al. (2025). ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay. arXiv preprint arXiv:2505.16282.
[65] Yu et al. (2025). Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
[67] Li et al. (2023). Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems. 36. pp. 51991–52008.
[68] Zhuge et al. (2024). Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning.
[69] Hong et al. (2025). Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO. arXiv preprint arXiv:2511.13288.
[70] Novikov et al. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
[71] Xu et al. (2023). REWOO: Decoupling reasoning from observations for efficient augmented language models. arXiv preprint arXiv:2305.18323.
[72] Liu et al. (2023). LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.
[73] Valmeekam et al. (2023). On the planning abilities of large language models: A critical investigation. Advances in Neural Information Processing Systems. 36. pp. 75993–76005.
[74] Besta et al. (2024). Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence. pp. 17682–17690.
[75] Sel et al. (2023). Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379.
[76] Gui et al. (2025). HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking. arXiv preprint arXiv:2505.02322.
[77] Jeong et al. (2025). Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens. arXiv preprint arXiv:2506.06261.
[78] Patil et al. (2024). Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems. 37. pp. 126544–126565.
[79] Gupta et al. (2024). CodeNav: Beyond tool-use to using real-world codebases with LLM agents. arXiv preprint arXiv:2406.12276.
[80] Chen et al. (2024). Plan-on-graph: Self-correcting adaptive planning of large language model on knowledge graphs. Advances in Neural Information Processing Systems. 37. pp. 37665–37691.
[82] Liang et al. (2024). Visualpredicator: Learning abstract world models with neuro-symbolic predicates for robot planning. arXiv preprint arXiv:2410.23156.
[83] Song et al. (2023). Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision. pp. 2998–3009.
[84] Abuelsaad et al. (2024). Agent-e: From autonomous web navigation to foundational design principles in agentic systems. arXiv preprint arXiv:2407.13032.
[85] Agashe et al. (2024). Agent s: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164.
[86] Yoo et al. (2024). Exploratory retrieval-augmented planning for continual embodied instruction following. Advances in Neural Information Processing Systems. 37. pp. 67034–67060.
[87] Sinha et al. (2024). Real-time anomaly detection and reactive planning with large language models. arXiv preprint arXiv:2407.08735.
[88] Cornelio et al. (2025). Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification. arXiv preprint arXiv:2504.04578.
[89] Zhou et al. (2024). Behaviorgpt: Smart agent simulation for autonomous driving with next-patch prediction. Advances in Neural Information Processing Systems. 37. pp. 79597–79617.
[90] Zhou et al. (2024). Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.
[91] Gao et al. (2024). Flip: Flow-centric generative planning as general-purpose manipulation world model. arXiv preprint arXiv:2412.08261.
[92] Hao et al. (2024). LLM reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models. arXiv preprint arXiv:2404.05221.
[93] Wang et al. (2023). Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2609–2634.
[94] Liu et al. (2023). Plan, verify and switch: Integrated reasoning with diverse x-of-thoughts. arXiv preprint arXiv:2310.14628.
[95] Ni et al. (2024). Peria: Perceive, reason, imagine, act via holistic language and vision planning for manipulation. Advances in Neural Information Processing Systems. 37. pp. 17541–17571.
[96] Erdogan et al. (2025). Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572.
[97] Wen et al. (2024). Codeplan: Unlocking reasoning potential in large language models by scaling code-form planning. In The Thirteenth International Conference on Learning Representations.
[98] Lutz et al. (2024). Wilbur: Adaptive in-context learning for robust and accurate web agents. arXiv preprint arXiv:2404.05902.
[99] Wang et al. (2024). Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning.
[100] Rahman et al. (2025). MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing. arXiv preprint arXiv:2505.03906.
[101] He et al. (2024). Enhancing llm reasoning with multi-path collaborative reactive and reflection agents. arXiv preprint arXiv:2501.00430.
[102] Rawat et al. (2025). Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents. arXiv preprint arXiv:2505.09970.
[103] Aksitov et al. (2023). Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003.
[104] Jiang et al. (2024). Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology. 33(7). pp. 1–30.
[105] Shah et al. (2023). Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on robot learning. pp. 492–504.
[106] Markowitz et al. (2024). Tree-of-traversals: A zero-shot reasoning algorithm for augmenting black-box language models with knowledge graphs. arXiv preprint arXiv:2407.21358.
[107] Long, Jieyi (2023). Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291.
[108] Koh et al. (2024). Tree search for language model agents. arXiv preprint arXiv:2407.01476.
[109] Wang et al. (2024). Q*: Improving multi-step reasoning for llms with deliberative planning. arXiv preprint arXiv:2406.14283.
[110] Meng et al. (2024). Llm-a*: Large language model enhanced incremental heuristic search on path planning. arXiv preprint arXiv:2407.02511.
[111] Liu et al. (2024). Multimodal large language models for inverse molecular design with retrosynthetic planning. arXiv preprint arXiv:2410.04223.
[112] Hao et al. (2023). Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.
[113] Putta et al. (2024). Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199.
[114] Sprueill et al. (2023). Monte carlo thought search: Large language model querying for complex scientific reasoning in catalyst design. arXiv preprint arXiv:2310.14420.
[115] Yu et al. (2023). Prompt-based Monte-Carlo tree search for goal-oriented dialogue policy planning. arXiv preprint arXiv:2305.13660.
[116] Zhao et al. (2023). Large language models as commonsense knowledge for large-scale task planning. Advances in neural information processing systems. 36. pp. 31967–31987.
[117] Ding et al. (2023). Everything of thoughts: Defying the law of penrose triangle for thought generation. arXiv preprint arXiv:2311.04254.
[118] Chen et al. (2024). When is tree search useful for llm planning? it depends on the discriminator. arXiv preprint arXiv:2402.10890.
[119] Kong et al. (2024). Latent plan transformer for trajectory abstraction: Planning as latent space inference. Advances in Neural Information Processing Systems. 37. pp. 123379–123401.
[120] Feng et al. (2023). Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
[121] Yoon et al. (2025). Monte carlo tree diffusion for system 2 planning. arXiv preprint arXiv:2502.07202.
[122] Schultz et al. (2024). Mastering board games by external and internal planning with language models. arXiv preprint arXiv:2412.12119.
[123] Chen et al. (2025). Broaden your SCOPE! efficient multi-turn conversation planning for LLMs with semantic space. arXiv preprint arXiv:2503.11586.
[124] Xie et al. (2023). Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems. 36. pp. 41618–41650.
[125] Golovneva et al. (2023). Pathfinder: Guided search over multi-step reasoning paths. arXiv preprint arXiv:2312.05180.
[126] Qian et al. (2025). Discriminator-Guided Embodied Planning for LLM Agent. In The Thirteenth International Conference on Learning Representations.
[127] Gandhi et al. (2024). Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683.
[128] Saha et al. (2024). System-1. x: Learning to balance fast and slow planning with language models. arXiv preprint arXiv:2407.14414.
[129] Guan et al. (2023). Intelligent virtual assistants with llm-based process automation. arXiv preprint arXiv:2312.06677.
[130] Chen et al. (2025). Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution. arXiv preprint arXiv:2504.16563.
[131] Hu et al. (2025). Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning. arXiv preprint arXiv:2505.19761.
[132] Antoniades et al. (2024). Swe-search: Enhancing software agents with monte carlo tree search and iterative refinement. arXiv preprint arXiv:2410.20285.
[133] Lykov, Artem and Tsetserukou, Dzmitry (2024). Llm-brain: Ai-driven fast generation of robot behaviour tree based on large language model. In 2024 2nd International Conference on Foundation and Large Language Models (FLLM). pp. 392–397.
[134] Cao, Yue and Lee, CS (2023). Robot behavior-tree-based task generation with large language models. arXiv preprint arXiv:2302.12927.
[135] Izzo et al. (2024). Btgenbot: Behavior tree generation for robotic tasks with lightweight llms. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 9684–9690.
[136] Ahn et al. (2022). Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
[137] Huang et al. (2022). Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
[138] Guan et al. (2023). Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. Advances in Neural Information Processing Systems. 36. pp. 79081–79094.
[139] Mahdavi et al. (2024). Leveraging environment interaction for automated pddl translation and planning with large language models. Advances in Neural Information Processing Systems. 37. pp. 38960–39008.
[140] Katz et al. (2024). Thought of search: Planning with language models through the lens of efficiency. Advances in Neural Information Processing Systems. 37. pp. 138491–138568.
[141] Hao et al. (2024). Planning anything with rigor: General-purpose zero-shot planning with llm-based formalized programming. arXiv preprint arXiv:2410.12112.
[142] Vyas et al. (2024). From an LLM Swarm to a PDDL-empowered Hive: Planning Self-executed Instructions in a Multi-modal Jungle. arXiv preprint arXiv:2412.12839.
[143] Zhang et al. (2025). Atomic Reasoning for Scientific Table Claim Verification. arXiv preprint arXiv:2506.06972.
[144] Dong et al. (2024). Diffuserlite: Towards real-time diffusion planning. Advances in Neural Information Processing Systems. 37. pp. 122556–122583.
[145] Lo et al. (2024). Goal-space planning with subgoal models. Journal of Machine Learning Research. 25(330). pp. 1–57.
[146] Li et al. (2024). Agent-oriented planning in multi-agent systems. arXiv preprint arXiv:2410.02189.
[147] Wang et al. (2023). Goplan: Goal-conditioned offline reinforcement learning by planning with learned models. arXiv preprint arXiv:2310.20025.
[148] Kang et al. (2025). Retrointext: A multimodal large language model enhanced framework for retrosynthetic planning via in-context representation learning. In The Thirteenth International Conference on Learning Representations.
[149] Ye et al. (2024). Beyond autoregression: Discrete diffusion for complex reasoning and planning. arXiv preprint arXiv:2410.14157.
[150] Zheng et al. (2024). Planagent: A multi-modal large language agent for closed-loop vehicle motion planning. arXiv preprint arXiv:2406.01587.
[151] Nayak et al. (2024). Long-horizon planning for multi-agent robots in partially observable environments. Advances in Neural Information Processing Systems. 37. pp. 67929–67967.
[152] Meng, Yue and Fan, Chuchu (2025). TeLoGraF: Temporal Logic Planning via Graph-encoded Flow Matching. arXiv preprint arXiv:2505.00562.
[153] Zhong et al. (2024). FlexPlanner: Flexible 3D Floorplanning via Deep Reinforcement Learning in Hybrid Action Space with Multi-Modality Representation. Advances in Neural Information Processing Systems. 37. pp. 49252–49278.
[154] Li et al. (2024). Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent. arXiv preprint arXiv:2411.02937.
[156] Qiao et al. (2024). Agent planning with world knowledge model. Advances in Neural Information Processing Systems. 37. pp. 114843–114871.
[157] Liu et al. (2025). Continual Reinforcement Learning by Planning with Online World Models. arXiv preprint arXiv:2507.09177.
[158] Wang et al. (2025). Adawm: Adaptive world model based planning for autonomous driving. arXiv preprint arXiv:2501.13072.
[159] Ye et al. (2023). Rational decision-making agent with internalized utility judgment. arXiv preprint arXiv:2308.12519.
[160] Chen et al. (2025). Scaling autonomous agents via automatic reward modeling and planning. arXiv preprint arXiv:2502.12130.
[161] Luyten et al. (2025). Strategic Planning: A Top-Down Approach to Option Generation. In Forty-second International Conference on Machine Learning.
[162] Ma et al. (2024). Non-myopic generation of language models for reasoning and planning. arXiv preprint arXiv:2410.17195.
[163] Ni et al. (2025). Physics-informed Temporal Difference Metric Learning for Robot Motion Planning. arXiv preprint arXiv:2505.05691.
[164] Matada et al. (2024). Generalizable Motion Planning via Operator Learning. arXiv preprint arXiv:2410.17547.
[166] Xie et al. (2025). Latent diffusion planning for imitation learning. arXiv preprint arXiv:2504.16925.
[167] Xiao et al. (2023). Safediffuser: Safe planning with diffusion probabilistic models. In The Thirteenth International Conference on Learning Representations.
[168] Shan et al. (2025). ContraDiff: Planning Towards High Return States via Contrastive Learning. In The Thirteenth International Conference on Learning Representations.
[169] Ruoss et al. (2024). Amortized planning with large-scale transformers: A case study on chess. Advances in Neural Information Processing Systems. 37. pp. 65765–65790.
[171] Chen et al. (2023). ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 14777–14790.
[172] Lu et al. (2024). GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 112–138.
[173] Wu et al. (2024). AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning. In Advances in Neural Information Processing Systems. pp. 25981–26010.
[174] Qin et al. (2024). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In The Twelfth International Conference on Learning Representations.
[175] Tang et al. (2023). ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. arXiv preprint arXiv:2306.05301.
[176] Chen et al. (2025). Learning to reason with search for llms via reinforcement learning. arXiv preprint arXiv:2503.19470.
[177] Dong et al. (2025). Reinforcement Pre-Training. arXiv preprint arXiv:2506.08007.
[178] Qian et al. (2025). Toolrl: Reward is all tool learning needs. arXiv preprint arXiv:2504.13958.
[179] Liang et al. (2023). TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs. arXiv preprint arXiv:2303.16434.
[181] Zhang et al. (2025). ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks. In Findings of the Association for Computational Linguistics: ACL 2025. pp. 15706–15722.
[182] Zhuang et al. (2024). ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search. In The Twelfth International Conference on Learning Representations.
[183] Inaba et al. (2023). MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 1522–1532.
[184] Trivedi et al. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.
[186] Yuan et al. (2024). Easytool: Enhancing llm-based agents with concise tool instruction. arXiv preprint arXiv:2401.06201.
[187] Qu et al. (2025). Tool learning with large language models: A survey. Frontiers of Computer Science. 19(8). pp. 198343.
[188] Shi et al. (2025). Tool learning in the wild: Empowering language models as automatic tool agents. In Proceedings of the ACM on Web Conference 2025. pp. 2222–2237.
[189] Wang et al. (2024). Empowering large language models: Tool learning for real-world interaction. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2983–2986.
[190] Yang et al. (2024). Buffer of thoughts: Thought-augmented reasoning with large language models. Advances in Neural Information Processing Systems. 37. pp. 113519–113544.
[191] Cheng et al. (2024). Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935.
[193] Nam et al. (2024). Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. pp. 1–13.
[195] Lu et al. (2023). Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems. 36. pp. 43447–43478.
[196] Song et al. (2023). Restgpt: Connecting large language models with real-world restful apis. arXiv preprint arXiv:2306.06624.
[197] Prasad et al. (2023). Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772.
[198] Yin et al. (2023). Agent lumos: Unified and modular training for open-source language agents. arXiv preprint arXiv:2311.05657.
[199] Shi et al. (2024). Learning to use tools via cooperative and interactive agents. arXiv preprint arXiv:2403.03031.
[200] Kirk et al. (2023). Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452.
[201] Li et al. (2024). Preserving diversity in supervised fine-tuning of large language models. arXiv preprint arXiv:2408.16673.
[202] O’Mahony et al. (2024). Attributing mode collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.
[204] Zeng et al. (2025). itool: Reinforced fine-tuning with dynamic deficiency calibration for advanced tool use. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 13901–13916.
[205] Yu et al. (2025). Demystifying Reinforcement Learning in Agentic Reasoning. arXiv preprint arXiv:2510.11701.
[206] Zhou et al. (2025). Sweet-rl: Training multi-turn llm agents on collaborative reasoning tasks. arXiv preprint arXiv:2503.15478.
[207] Wei et al. (2025). Swe-rl: Advancing llm reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449.
[208] Zhang et al. (2025). RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents. arXiv preprint arXiv:2507.22844.
[210] Feng et al. (2025). Retool: Reinforcement learning for strategic tool use in llms. arXiv preprint arXiv:2504.11536.
[211] Sun et al. (2025). Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588.
[212] Team et al. (2025). Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
[213] Comanici et al. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
[214] Team et al. (2025). Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534.
[215] Zeng et al. (2025). GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models. arXiv preprint arXiv:2508.06471.
[216] Zou et al. (2025). TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning. arXiv preprint arXiv:2510.06217.
[218] Ma et al. (2025). Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning. arXiv preprint arXiv:2506.04625.
[219] Wu et al. (2025). Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models. arXiv preprint arXiv:2503.16779.
[223] Zheng et al. (2024). ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC/COLING 2024). pp. 16263–16273.
[224] Lewis et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 33. pp. 9459–9474.
[225] Yang et al. (2024). CRAG: Comprehensive RAG benchmark. Advances in Neural Information Processing Systems. 37. pp. 10470–10490.
[226] Press et al. (2022). Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
[227] Asai et al. (2023). Self-RAG: Self-reflective retrieval augmented generation. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
[228] Guan et al. (2025). DeepRAG: Thinking to Retrieve Step by Step for Large Language Models. arXiv preprint arXiv:2502.01142.
[229] Zhu et al. (2024). INTERS: Unlocking the power of large language models in search with instruction tuning. arXiv preprint arXiv:2401.06532.
[230] Nakano et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
[231] Huang et al. (2025). RAG-RL: Advancing retrieval-augmented generation via RL and curriculum learning. arXiv preprint arXiv:2503.12759.
[232] Zheng et al. (2025). DeepResearcher: Scaling deep research via reinforcement learning in real-world environments. arXiv preprint arXiv:2504.03160.
[233] Sun et al. (2025). ReARTeR: Retrieval-augmented reasoning with trustworthy process rewarding. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1251–1261.
[234] Lee et al. Agent-G: An Agentic Framework for Graph Retrieval Augmented Generation.
[235] Ning et al. (2025). MC-Search: Benchmarking Multimodal Agentic RAG with Structured Reasoning Chains. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling.
[236] Shen et al. (2025). GeAR: Graph-enhanced agent for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025. pp. 12049–12072.
[237] Zhang et al. (2025). Learning to retrieve and reason on knowledge graph through active self-reflection. arXiv preprint arXiv:2502.14932.
[238] Mao et al. (2024). RAG-Studio: Towards in-domain adaptation of retrieval augmented generation through self-alignment. In Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 725–735.
[239] Zhang et al. (2024). RAFT: Adapting language model to domain-specific RAG. arXiv preprint arXiv:2403.10131.
[240] Lin et al. (2023). RA-DIT: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference on Learning Representations.
[241] Nguyen et al. (2024). SFR-RAG: Towards contextually faithful LLMs. arXiv preprint arXiv:2409.09916.
[242] Madaan et al. (2023). Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems. 36. pp. 46534–46594.
[243] Wang et al. (2024). Enable Language Models to Implicitly Learn Self-Improvement From Data. In The Twelfth International Conference on Learning Representations (ICLR 2024).
[244] Wang et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations.
[245] Chen et al. (2023). Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Transactions on Machine Learning Research.
[246] Zeng et al. (2024). AgentTuning: Enabling generalized agent abilities for LLMs. In Findings of the Association for Computational Linguistics: ACL 2024. pp. 3053–3077.
[247] Hsieh et al. (2023). Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023. pp. 8003–8017.
[248] Christiano et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems. 30.
[249] Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS).
[250] Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. In Advances in Neural Information Processing Systems (NeurIPS).
[252] Zheng and Lee (2025). Reasoning-CV: Fine-tuning Powerful Reasoning LLMs for Knowledge-Assisted Claim Verification. arXiv preprint arXiv:2505.12348.
[253] Dao and Le (2025). ReZero: Enhancing LLM search ability by trying one-more-time. arXiv preprint arXiv:2504.11001.
[254] Potamitis and Arora (2025). Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback. arXiv preprint arXiv:2504.12951.
[255] Le et al. (2022). CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. In Advances in Neural Information Processing Systems (NeurIPS).
[256] Ni et al. (2023). LEVER: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning. pp. 26106–26128.
[257] Jimenez et al. (2024). SWE-bench: Can Language Models Resolve Real-world GitHub Issues? In International Conference on Learning Representations (ICLR).
[258] Driess et al. (2023). PaLM-E: An Embodied Multimodal Language Model. In International Conference on Machine Learning. pp. 8469–8488.
[259] Bensal et al. (2025). Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning. arXiv preprint arXiv:2505.24726.
[260] Lee et al. (2024). RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback.
[261] Manakul et al. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 9004–9017.
[262] Chowdhury and Caragea (2025). Zero-Shot Verification-guided Chain of Thoughts. arXiv preprint arXiv:2501.13122.
[263] Zhang et al. (2025). ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs. arXiv preprint arXiv:2508.05282.
[266] Dou et al. (2024). Re-ReST: Reflection-Reinforced Self-Training for Language Agents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 15394–15411.
[271] Fang et al. (2025). LightMem: Lightweight and efficient memory-augmented generation. arXiv preprint arXiv:2510.18866.
[272] Nan et al. (2025). Nemori: Self-organizing agent memory inspired by cognitive science. arXiv preprint arXiv:2508.03341.
[273] Zhang et al. (2025). Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv preprint arXiv:2510.04618.
[274] Ouyang et al. (2025). ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory. arXiv preprint arXiv:2509.25140.
[275] Suzgun et al. (2025). Dynamic cheatsheet: Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952.
[278] Rasmussen et al. (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
[279] Li et al. (2024). Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks. Advances in Neural Information Processing Systems. 37. pp. 49881–49913.
[280] Kagaya et al. (2024). RAP: Retrieval-augmented planning with contextual memory for multimodal LLM agents. arXiv preprint arXiv:2402.03610.
[281] Long et al. (2025). Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736.
[282] Bei et al. (2026). Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents. arXiv preprint arXiv:2601.03515.
[283] Cheng et al. (2025). Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations. arXiv preprint arXiv:2510.00496.
[284] Zhou et al. (2025). MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv preprint arXiv:2506.15841.
[285] Zhang et al. (2025). Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2510.12635.
[286] Yu et al. (2025). MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. arXiv preprint arXiv:2507.02259.
[287] Wang et al. (2025). Mem-