A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
arXiv:2603.19685v1 Abstract: Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation in particular requires handling dynamic content and long sequences of actions, making it especially challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately 10 absolute percentage points in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model raises its success rate from 6.4% to 43.0%, surpassing proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
Executive Summary
This paper proposes a framework for improving the long-horizon capabilities of large language model (LLM)-based agents. The authors introduce a subgoal-driven framework that leverages proprietary models for online planning through subgoal decomposition, and present MiRA, an RL training framework that uses dense, milestone-based reward signals. The results show substantial gains in success rate on the WebArena-Lite benchmark, with a fine-tuned open Gemma3-12B model surpassing proprietary systems such as GPT-4-Turbo and GPT-4o as well as the previous open-model state of the art, WebRL. The approach could pave the way for more robust and general-purpose autonomous systems, though further work is needed to establish how well it generalizes beyond web navigation to other domains.
Key Points
- ▸ Introduction of a subgoal-driven framework for improving long-horizon capabilities of LLM-based agents.
- ▸ Presentation of MiRA, an RL training framework that uses dense, milestone-based reward signals.
- ▸ Significant improvement in success rates for various models, surpassing existing state-of-the-art systems.
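The subgoal-driven planning described above can be sketched as a planner that decomposes the task into an ordered subgoal list and replans whenever a new observation arrives. This is a minimal illustrative sketch: the class, method names, and the mocked `plan_fn` (standing in for a call to a proprietary planner model) are assumptions, not the paper's actual interface.

```python
# Illustrative sketch of subgoal decomposition with online replanning.
# `plan_fn` is a placeholder for an LLM planner call (mocked here);
# all names are hypothetical, not taken from the paper's code.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubgoalPlanner:
    """Maintains an adaptive subgoal list toward the final goal."""
    plan_fn: Callable[[str, str], List[str]]  # (task, observation) -> subgoals
    subgoals: List[str] = field(default_factory=list)

    def plan(self, task: str, observation: str) -> List[str]:
        # Re-derive the subgoal list from the latest observation, so the
        # agent keeps a clear path even as new information arrives.
        self.subgoals = self.plan_fn(task, observation)
        return self.subgoals

    def current_subgoal(self) -> Optional[str]:
        # The executor acts toward the first pending subgoal.
        return self.subgoals[0] if self.subgoals else None

    def complete_subgoal(self) -> None:
        # Pop the finished subgoal and move on to the next one.
        if self.subgoals:
            self.subgoals.pop(0)
```

In use, the executor would act toward `current_subgoal()` each step and trigger `plan()` again whenever the page state changes unexpectedly.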
Merits
Strength
The proposed framework addresses the challenges of long-horizon planning in LLM-based agents, providing a clear and adaptive path toward the final goal.
Strength
The use of milestone-based reward signals in MiRA allows for more efficient and effective RL training.
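The milestone-based reward idea can be sketched as a shaping function that grants partial credit for each milestone reached along a trajectory, plus a terminal bonus on success, instead of a single sparse end-of-episode signal. The function name, bonus values, and milestone representation are illustrative assumptions; the paper's exact shaping scheme may differ.

```python
# Illustrative sketch of dense, milestone-based reward shaping
# (hypothetical constants; not the paper's actual reward function).
from typing import List


def milestone_reward(trajectory: List[str], milestones: List[str],
                     terminal_success: bool,
                     milestone_bonus: float = 0.25,
                     success_bonus: float = 1.0) -> float:
    """Partial credit per milestone reached, plus a terminal bonus.

    A sparse reward would return only success_bonus (or 0) at the end,
    making credit assignment over long horizons difficult; milestones
    give the agent feedback on intermediate progress.
    """
    reached = {m for m in milestones if m in trajectory}
    reward = milestone_bonus * len(reached)
    if terminal_success:
        reward += success_bonus
    return reward
```

A failed episode that still hit two of three milestones would earn 0.5 here rather than 0, so RL fine-tuning can distinguish near-misses from episodes that made no progress at all.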
Demerits
Limitation
The proposed framework may not be generalizable to diverse domains, requiring further research and adaptation.
Limitation
The reliance on proprietary models may limit the applicability of the proposed framework to other systems.
Expert Commentary
The proposed framework is a meaningful contribution to the field of autonomous agents, tackling long-horizon planning from two complementary directions: explicit subgoal decomposition at inference time, and dense, milestone-based reward signals during RL fine-tuning. The milestone-based rewards in MiRA are particularly notable, as they directly address the credit-assignment problem created by sparse, delayed rewards. That said, the evaluation is confined to the WebArena-Lite benchmark, so generalization to other domains remains an open question, and the planning component's reliance on proprietary models may limit the framework's applicability to fully open systems. Overall, the findings have significant implications for the development of autonomous systems, particularly in web navigation.
Recommendations
- ✓ Further research is needed to explore the generalizability of the proposed approach and its applicability to diverse domains.
- ✓ Developers of autonomous systems should consider incorporating explicit subgoal planning and milestone-based rewards to improve performance and robustness.
Sources
Original: arXiv - cs.AI