AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback
arXiv:2603.15888v1 Announce Type: new Abstract: With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
Executive Summary
AsgardBench is a targeted benchmark that evaluates whether visually grounded agents can adapt high-level plans in response to visual observations during execution, deliberately setting aside navigation and low-level manipulation. Unlike existing embodied AI benchmarks that conflate reasoning with navigation or supply rich corrective feedback that substitutes for perception, AsgardBench isolates interactive planning by restricting agent input to images, action history, and minimal success/failure signals. With 108 task instances across 12 task types, systematically varied via object state, placement, and scene configuration, the benchmark creates conditional-branching scenarios that demand plan adaptation during execution. Evaluations of leading vision-language models show a sharp performance drop when visual input is removed, underscoring the critical role of visual grounding in adaptive planning. The benchmark narrows the focus to a core question: can models actually use what they see to revise plans when outcomes deviate from expectations? The controlled environment and systematic variation also aid reproducibility and comparative analysis.
Key Points
- ▸ Isolation of interactive planning via restricted input
- ▸ Controlled variation of task instances to induce conditional branching
- ▸ Evidence of performance degradation without visual input
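The restricted interaction protocol described above can be sketched as a minimal loop. The code below is a hypothetical illustration, not AsgardBench's actual API: `Observation`, `ToyEnv`, and the action strings are invented stand-ins that show how an agent limited to a frame, its own action history, and a binary success flag can still repair a plan (here, a fridge that may start closed creates the conditional branch).

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    # Everything the agent is allowed to see at each step: a rendered
    # frame, its own action history, and a binary success flag for the
    # most recent action -- no state text, no corrective hints.
    image: bytes
    action_history: list = field(default_factory=list)
    last_action_succeeded: bool = True

class ToyEnv:
    # Stand-in simulator (hypothetical). The task is "put the apple in
    # the fridge", but the fridge may start closed, so the same
    # instruction requires different action sequences.
    def __init__(self, fridge_open: bool):
        self.fridge_open = fridge_open
        self.apple_inside = False
        self.history = []

    def reset(self) -> Observation:
        return Observation(image=b"<frame>", action_history=[])

    def step(self, action: str) -> Observation:
        ok = True
        if action == "open(fridge)":
            self.fridge_open = True
        elif action == "put(apple, fridge)":
            ok = self.fridge_open          # fails while the door is closed
            self.apple_inside = ok
        self.history.append(action)
        return Observation(b"<frame>", list(self.history), ok)

def scripted_agent(obs: Observation) -> str:
    # Plan repair driven only by the success flag: after a failed put,
    # open the fridge, then retry.
    if obs.action_history and not obs.last_action_succeeded:
        return "open(fridge)"
    if obs.action_history and obs.action_history[-1] == "open(fridge)":
        return "put(apple, fridge)"
    if not obs.action_history:
        return "put(apple, fridge)"
    return "done"

def run_episode(env: ToyEnv, max_steps: int = 6) -> bool:
    obs = env.reset()
    for _ in range(max_steps):
        action = scripted_agent(obs)
        if action == "done":
            break
        obs = env.step(action)
    return env.apple_inside
```

Running `run_episode(ToyEnv(fridge_open=False))` takes the repair branch (put, open, put), while `run_episode(ToyEnv(fridge_open=True))` succeeds in one step, mirroring how a single instruction can require different action sequences depending on observed state.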
Merits
Targeted Scope
AsgardBench fills a gap by focusing narrowly on interactive planning under minimal feedback, enabling more precise evaluation of visual adaptation capabilities.
Controlled Variability
The systematic variation of task parameters across 108 instances provides a robust framework for testing conditional adaptation without extraneous variables.
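The factorial variation scheme can be sketched as a cross product over variation axes. The axis names and values below are illustrative assumptions, not AsgardBench's published factors; note only that 108 instances over 12 task types implies nine variants per type if variants are distributed evenly.

```python
from itertools import product

# Hypothetical variation axes -- the real benchmark's factor values are
# not given in the abstract, so these are placeholders.
OBJECT_STATES = ["fridge_open", "fridge_closed", "fridge_blocked"]
PLACEMENTS = ["apple_on_counter", "apple_in_drawer", "apple_on_table"]
SCENES = ["kitchen_a"]

def enumerate_instances(task_types, states, placements, scenes):
    # One task instance per combination; each combination may demand a
    # different action sequence for the same natural-language instruction.
    return [
        {"task": t, "state": s, "placement": p, "scene": c}
        for t, s, p, c in product(task_types, states, placements, scenes)
    ]

# 1 task type x 3 states x 3 placements x 1 scene = 9 variants per type;
# over 12 task types this would total 108 instances.
instances = enumerate_instances(["put_away_apple"], OBJECT_STATES, PLACEMENTS, SCENES)
```

Enumerating the full grid rather than sampling it is what makes the variable-isolation argument above hold: any performance difference between two instances can be attributed to the single axis on which they differ.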
Demerits
Narrow Constraints
The exclusion of navigation and low-level manipulation may limit applicability to real-world embodied agents, which must integrate high-level planning with perception and motor control.
Limited Feedback Complexity
Lightweight success/failure signals may not capture the richness of real-world feedback, potentially constraining generalization.
Expert Commentary
AsgardBench represents a significant methodological advancement in the evaluation of embodied AI systems. Its design cleverly circumvents the conflation of perception, planning, and execution that plagues many current benchmarks, thereby enabling more accurate attribution of adaptive capabilities to visual perception. The use of controlled variations to induce conditional branching without introducing external noise is particularly commendable—it allows researchers to isolate the effect of visual input on plan revision. Moreover, the observed performance drop in the absence of visual input is not merely a statistical artifact; it reflects a deeper epistemological issue: many models fail to integrate perception into higher-level reasoning, treating visual data as auxiliary rather than foundational. This benchmark thus serves as a litmus test for the maturity of visual grounding in autonomous systems. While it may not capture the full spectrum of real-world agent interactions, its precision in targeting the specific failure mode—failure to adapt via observation—makes it indispensable for benchmarking progress in embodied cognition.
Recommendations
- ✓ Extend AsgardBench with additional modalities (e.g., audio, tactile) to assess multimodal grounding in future iterations.
- ✓ Develop companion evaluation protocols for human-agent interaction to validate findings in hybrid human-machine systems.