SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

arXiv:2603.12382v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW
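The abstract describes TSF as injecting "temporally aligned referent cues" but does not specify the mechanism, so the sketch below is only an illustration of the general idea: carry a per-target feature across frames by matching it against each frame's patch features and updating it with an exponential moving average. The function name `track_target_feature`, the nearest-patch matching (a stand-in for a real tracker such as SAM2), and the `momentum` parameter are all assumptions, not SPARROW's actual implementation.

```python
import numpy as np

def track_target_feature(frame_feats, init_feat, momentum=0.9):
    """Propagate a target-specific feature across frames.

    At each frame, the patch most similar to the running referent feature
    is selected (a crude stand-in for a real tracker), and the running
    feature is updated with an exponential moving average. The returned
    per-frame features are what would be injected as referent cues.
    """
    feat = init_feat / np.linalg.norm(init_feat)
    cues = []
    for patches in frame_feats:  # patches: (num_patches, dim)
        norm = patches / np.linalg.norm(patches, axis=1, keepdims=True)
        best = norm[np.argmax(norm @ feat)]       # most similar patch
        feat = momentum * feat + (1 - momentum) * best
        feat = feat / np.linalg.norm(feat)        # keep unit length
        cues.append(feat.copy())
    return cues
```

Because the cue is re-anchored to the current frame before each update, the referent representation stays aligned with the object as it moves, which is the kind of temporal context a static [SEG] token lacks.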

Executive Summary

The paper introduces SPARROW, a pixel-grounded video multimodal large language model (MLLM) designed to improve spatial precision and temporal referential consistency. SPARROW achieves this through two key components: Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and a dual-prompt design that fuses box-based geometric priors with segmentation-based semantic grounding. Integrated into three open-source video MLLMs and evaluated on six benchmarks, it delivers consistent gains in referential stability, spatial precision, and temporal coherence. Backed by a curated referential video dataset and end-to-end operation without external detectors, SPARROW advances pixel-grounded video understanding.

Key Points

  • SPARROW is a pixel-grounded video MLLM that improves spatial precision and temporal referential consistency
  • The model uses Target-Specific Tracked Features (TSF) and a dual-prompt design to achieve this
  • SPARROW is evaluated on six benchmarks, showing consistent gains in performance
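The abstract says the dual-prompt design "decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding" but gives no fusion details. The sketch below shows one plausible reading under stated assumptions: the [BOX] token embedding is projected to normalized box coordinates, the box is rasterized into a spatial prior, and that prior gates the mask logits decoded from the [SEG] token. The names `decode_box`, `box_prior_mask`, and `fuse`, and the gating strength `alpha`, are all hypothetical.

```python
import numpy as np

def decode_box(box_embed, W):
    """Project the [BOX] token embedding to normalized (x1, y1, x2, y2).
    A sigmoid keeps each coordinate in [0, 1]."""
    logits = W @ box_embed                 # (4,)
    return 1.0 / (1.0 + np.exp(-logits))

def box_prior_mask(box, h, w):
    """Rasterize the box into a binary spatial prior over an h x w grid:
    1 inside the predicted box, 0 outside."""
    x1, y1, x2, y2 = box
    ys = np.arange(h)[:, None] / h         # (h, 1) normalized row coords
    xs = np.arange(w)[None, :] / w         # (1, w) normalized col coords
    return ((xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)).astype(np.float32)

def fuse(seg_logits, prior, alpha=4.0):
    """Fuse semantic mask logits with the geometric prior: logits inside
    the box are boosted, logits outside are suppressed."""
    return seg_logits + alpha * (prior - 0.5)
```

The intuition matches the paper's motivation: even when per-frame mask logits drift toward a look-alike object, a box prior anchored to the tracked referent suppresses activations outside the expected region, reducing identity switches.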

Merits

Improved Performance

SPARROW delivers consistent gains across six benchmarks when integrated into three open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG.

End-to-End Operation

SPARROW operates end-to-end via a class-agnostic SAM2-based proposer, removing the dependency on external detectors and streamlining the grounding pipeline.

Demerits

Limited Dataset

The curated referential video dataset (30,646 videos and 45,231 Q&A pairs), although extensive, may not be comprehensive enough to cover all scenarios and edge cases.

Complexity

The dual-prompt design and TSF components add architectural complexity, which may affect the model's interpretability and maintainability.

Expert Commentary

The introduction of SPARROW marks a significant advancement in pixel-grounded video MLLMs, addressing long-standing challenges in spatial precision and temporal referential consistency. The model's ability to operate end-to-end without external detectors is a notable strength, but the complexity of its architecture and potential limitations of the dataset warrant further investigation. As the field continues to evolve, it is essential to prioritize explainability, transparency, and robustness in video MLLMs to ensure their safe and effective deployment.

Recommendations

  • Future research should focus on evaluating SPARROW's performance on edge cases and adversarial attacks to ensure its robustness
  • The development of more comprehensive and diverse datasets is necessary to further improve the model's generalizability and accuracy
