Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies
arXiv:2603.12510v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models have significant potential to enable general-purpose robotic systems for a range of vision-language tasks. However, the performance of VLA-based robots is highly sensitive to the precise wording of language instructions, and it remains difficult to predict when such robots will fail. To improve the robustness of VLAs to different wordings, we present Q-DIG (Quality Diversity for Diverse Instruction Generation), which performs red-teaming by scalably identifying diverse natural language task descriptions that induce failures while remaining task-relevant. Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) to generate a broad spectrum of adversarial instructions that expose meaningful vulnerabilities in VLA behavior. Our results across multiple simulation benchmarks show that Q-DIG finds more diverse and meaningful failure modes compared to baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. Furthermore, results from a user study highlight that Q-DIG generates prompts judged to be more natural and human-like than those from baselines. Finally, real-world evaluations of Q-DIG prompts show results consistent with simulation, and fine-tuning VLAs on the generated prompts further improves success rates on unseen instructions. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots. Our anonymous project website is at qdigvla.github.io.
Executive Summary
This article presents Q-DIG, a novel approach for improving the robustness of Vision-Language-Action (VLA) models in robotic systems. By integrating Quality Diversity (QD) techniques with Vision-Language Models (VLMs), Q-DIG generates diverse natural language task descriptions that expose vulnerabilities in VLA behavior. The results demonstrate that Q-DIG finds more diverse and meaningful failure modes than baseline methods, and that fine-tuning VLAs on the generated instructions improves task success rates. A user study further shows that Q-DIG produces prompts judged more natural and human-like than those from baselines. Together, these findings suggest that Q-DIG is a promising approach for identifying vulnerabilities and improving the robustness of VLA-based robots.
Key Points
- ▸ Q-DIG integrates Quality Diversity (QD) techniques with Vision-Language Models (VLMs) for robust VLA-based robots
- ▸ Q-DIG generates diverse natural language task descriptions that expose vulnerabilities in VLA behavior
- ▸ Fine-tuning VLAs on Q-DIG generated instructions improves task success rates
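The core mechanism described above is a Quality Diversity search: maintain an archive of instructions keyed by a behavior descriptor, mutate archive members (in the paper, via a VLM), and keep each mutant only if it induces more failures than the current occupant of its cell. The sketch below illustrates this MAP-Elites-style loop on a toy problem; the "VLA" success function, the verb-substitution mutator, and the descriptor are all hypothetical stand-ins, not the paper's actual components.

```python
import random

# Hypothetical stand-ins for illustration only:
# - a toy "VLA" whose success depends on which verb the instruction uses
# - a VLM-style rewriter, approximated here by random verb substitution
VERBS = ["pick up", "grab", "lift", "take", "fetch", "retrieve"]

def vla_success_rate(instruction):
    """Toy policy: robust to 'pick up' and 'grab', brittle to other verbs."""
    if any(verb in instruction for verb in ("pick up", "grab")):
        return 0.9
    return 0.3

def mutate(instruction):
    """Stand-in for VLM-based rewriting: swap the verb for a random one."""
    for verb in VERBS:
        if verb in instruction:
            return instruction.replace(verb, random.choice(VERBS))
    return instruction

def descriptor(instruction):
    """Behavior descriptor: (verb used, instruction-length bucket)."""
    verb = next((v for v in VERBS if v in instruction), "other")
    return (verb, len(instruction.split()) // 4)

def qd_search(seed, iterations=200, rng_seed=0):
    """MAP-Elites loop: archive maps descriptor -> (failure rate, instruction)."""
    random.seed(rng_seed)
    archive = {descriptor(seed): (1.0 - vla_success_rate(seed), seed)}
    for _ in range(iterations):
        _, parent = random.choice(list(archive.values()))
        child = mutate(parent)
        fit = 1.0 - vla_success_rate(child)  # fitness = induced failure rate
        key = descriptor(child)
        # Keep the mutant only if it beats the elite in its cell (or fills it)
        if key not in archive or fit > archive[key][0]:
            archive[key] = (fit, child)
    return archive

archive = qd_search("pick up the red block on the table")
worst_fit, worst_prompt = max(archive.values())
print(f"{len(archive)} archive cells; most adversarial: {worst_prompt!r} "
      f"(failure rate {worst_fit:.1f})")
```

The archive, rather than a single best adversarial prompt, is the output: each cell holds the most failure-inducing instruction found for one region of behavior space, which is what yields a diverse set of failure modes rather than many rephrasings of one.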
Merits
Strength in Identifying Vulnerabilities
Q-DIG effectively identifies diverse and meaningful failure modes in VLA-based robots, enabling the development of more robust robotic systems.
Improved Task Success Rates
Fine-tuning VLAs on Q-DIG generated instructions leads to improved task success rates, indicating the potential for real-world applications.
Demerits
Limited Real-World Evaluations
Although the results from real-world evaluations are consistent with simulation, further real-world testing is necessary to confirm the effectiveness of Q-DIG.
Scalability Challenges
The scalability of Q-DIG may be a concern, as it relies on computationally intensive QD techniques and VLMs.
Expert Commentary
The article presents a novel approach to improving the robustness of VLA-based robots, leveraging the strengths of Quality Diversity and Vision-Language Models. The results demonstrate the effectiveness of Q-DIG in identifying vulnerabilities and improving task success rates. However, the scalability of Q-DIG and the need for further real-world testing are concerns. The approach has significant implications for the development of more robust robotic systems and may have broader implications for human-robot interaction and job displacement.
Recommendations
- ✓ Future research should focus on addressing the scalability challenges of Q-DIG and exploring its application in various real-world scenarios.
- ✓ The development of more robust VLA-based robots, enabled by Q-DIG, should be accompanied by careful consideration of the potential implications for human-robot interaction and job displacement.