Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

arXiv:2603.23149v1 Announce Type: new Abstract: Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from "simulate-then-act" to "describe-then-act." DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.
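
The "describe-then-act" loop from the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the class and function names, the stub student, and the failure heuristic are all assumptions for demonstration, not the paper's implementation.

```python
# Hypothetical sketch of a "describe-then-act" steering loop: a text-only
# student predicts a semantic outcome from (policy latent, planned action),
# and unsafe actions are filtered before execution. All names are illustrative.
from dataclasses import dataclass

@dataclass
class Prediction:
    description: str   # predicted semantic outcome, e.g. "gripper misses the handle"
    is_failure: bool   # whether the description indicates a failure

class StubStudent:
    """Stand-in for the latent-conditioned LLM student: maps (latent, action) -> text."""
    def predict(self, latent, action) -> Prediction:
        # A real student would decode a description from the policy latent;
        # here we simply flag large action magnitudes as failures.
        risky = max(abs(a) for a in action) > 1.0
        text = "arm overshoots the target" if risky else "gripper reaches the handle"
        return Prediction(description=text, is_failure=risky)

def steer(latent, candidate_actions, student):
    """Execute the first candidate action whose predicted outcome is not a failure."""
    for action in candidate_actions:
        outcome = student.predict(latent, action)
        if not outcome.is_failure:
            return action, outcome.description
    return None, "no safe action found"

action, desc = steer(latent=[0.1, 0.2],
                     candidate_actions=[[2.0, 0.0], [0.5, 0.3]],
                     student=StubStudent())
print(action, "->", desc)
```

Because the check runs entirely in the text/latent domain, each candidate can be vetted in milliseconds rather than the seconds a visual rollout would cost.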

Executive Summary

The article introduces DILLO, a framework that shifts from the traditional "simulate-then-act" paradigm to "describe-then-act" by leveraging a policy's latent state and planned actions to anticipate action outcomes without visual simulation. DILLO employs cross-modal distillation to enable a text-only inference pipeline that bypasses computationally intensive visual processing, achieving a 14x speedup. Experiments on MetaWorld and LIBERO show significant improvements in episode success rate (up to 15 percentage points, and 9.3 points on average), demonstrating a viable alternative to latency-heavy visual simulation for safety-critical agents.

Key Points

  • DILLO replaces visual simulation with latent state-based prediction using cross-modal distillation
  • Achieves significant speed gains (14x) without compromising predictive fidelity
  • Experimental results validate effectiveness across multiple environments

Merits

Technical Innovation

DILLO introduces a novel paradigm shift: by distilling a privileged VLM teacher into a latent-conditioned LLM student, it offers a scalable and efficient alternative to visual simulation in safety-critical domains.

Demerits

Generalizability Concern

While promising, the approach may face limitations in highly complex or novel environments where visual cues are intrinsically critical to contextual understanding.

Expert Commentary

DILLO represents a pivotal evolution in proactive agent steering by redefining the necessity of visual simulation. The integration of distilled language-action models through cross-modal distillation is both elegant and pragmatic—it directly confronts a critical bottleneck in real-time autonomy without sacrificing predictive accuracy. The authors rightly challenge the assumed dependency between visual fidelity and safety, and their empirical results corroborate this thesis. Notably, the 14x speedup is not merely a performance gain; it is a structural enabler for deploying agents in constrained environments—such as edge devices or low-bandwidth systems—where latency was previously prohibitive. However, the long-term viability of this approach hinges on the robustness of latent state encoding in edge-case scenarios. Future work should explore hybrid architectures that combine DILLO’s efficiency with selective visual augmentation in critical decision nodes. This work sets a new benchmark for efficiency-safety tradeoffs in autonomous systems.

Recommendations

  • Adopt DILLO as a baseline for latency-constrained agent development in safety-critical domains
  • Investigate hybrid architectures that selectively integrate visual inputs where latent encoding may be ambiguous
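
The hybrid architecture recommended above could gate between the two paths on predictor confidence. The sketch below is a hypothetical illustration under that assumption; the stub classes, confidence heuristic, and threshold are invented for demonstration.

```python
# Hypothetical hybrid check: take the fast text-only path by default, and fall
# back to slow visual simulation only when the student's confidence is low.
# All names and heuristics are illustrative, not from the paper.

class StubStudent:
    """Stand-in text-only predictor returning (is_failure, confidence)."""
    def predict_with_confidence(self, latent, action):
        score = sum(abs(a) for a in action)
        is_failure = score > 1.5
        # Confidence drops when the score is near the decision boundary.
        confidence = 0.9 if abs(score - 1.5) > 0.5 else 0.4
        return is_failure, confidence

class StubVisualModel:
    """Stand-in visual world model: slow but assumed reliable."""
    def simulate_is_safe(self, latent, action):
        return sum(abs(a) for a in action) <= 1.5

def hybrid_check(latent, action, student, visual_model, conf_threshold=0.8):
    """Return (is_safe, source): text path unless the latent encoding is ambiguous."""
    is_failure, confidence = student.predict_with_confidence(latent, action)
    if confidence >= conf_threshold:
        return (not is_failure), "text"   # fast path: ~milliseconds
    # Ambiguous case: pay the latency cost of visual simulation.
    return visual_model.simulate_is_safe(latent, action), "visual"
```

A deployment would tune `conf_threshold` so the expensive visual path fires only on the small fraction of steps where the latent-based prediction is genuinely ambiguous, preserving most of the 14x speedup.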

Sources

Original: arXiv - cs.AI