
The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes


Benedikt Hornig, Reuth Mirsky

arXiv:2603.20994v1 Abstract: In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.

Executive Summary

This article proposes the Intelligent Disobedience Game (IDG), a game-theoretic framework that models the interaction between a human leader and an assistive follower in shared autonomy. The IDG formalizes the concept of intelligent disobedience, where an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. The framework identifies strategic phenomena such as 'safety traps,' where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG is translated into a shared control Multi-Agent Markov Decision Process representation, providing a compact computational testbed for training reinforcement learning agents.
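To make the leader-follower dynamic concrete, here is a minimal one-step sketch of the interaction, not the paper's formal IDG: the human leader commits to a command, and the assistive follower best-responds using its private hazard knowledge (the information asymmetry). All payoff values, function names, and the specific hazard probabilities are illustrative assumptions.

```python
# Hypothetical one-step Stackelberg sketch of intelligent disobedience.
# Payoffs and names are assumptions, not the paper's formal model.

def follower_best_response(command, hazard_prob,
                           goal_reward=1.0, harm_cost=10.0):
    """Follower compares the expected utility of obeying vs. overriding."""
    # Obeying pursues the human's goal but risks harm the human cannot see.
    utility_obey = goal_reward - hazard_prob * harm_cost
    # Overriding (intelligent disobedience) forgoes the goal but avoids harm.
    utility_override = 0.0
    return "obey" if utility_obey >= utility_override else "override"

def leader_commits(hazard_prob):
    """Stackelberg leader moves first; the follower then best-responds."""
    response = follower_best_response("go", hazard_prob)
    return ("go", response)

# Low perceived hazard: the follower complies.
print(leader_commits(0.05))  # ('go', 'obey')
# High perceived hazard: the follower disobeys to prevent harm.
print(leader_commits(0.5))   # ('go', 'override')
```

The sequential structure matters: because the leader moves first and the follower best-responds under better information, the follower's override can be optimal even when it frustrates the leader's stated command.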

Key Points

  • The IDG framework provides a mathematical foundation for understanding intelligent disobedience in shared autonomy.
  • The framework models the interaction between a human leader and an assistive follower under asymmetric information.
  • The IDG identifies strategic phenomena such as 'safety traps' and optimal strategies for both agents across multi-step scenarios.

Merits

Strength

The IDG framework gives intelligent disobedience a formal, analyzable structure, which is essential for developing safe and trustworthy AI systems.

Strength

The framework's translation into a shared control Multi-Agent Markov Decision Process representation enables the training of reinforcement learning agents, which can be applied to real-world scenarios.
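A toy stand-in for such a shared-control testbed can illustrate both the MDP translation and the "safety trap" phenomenon. The corridor layout, hazard location, and policies below are assumptions for illustration, not the paper's actual environment: a maximally cautious follower that always vetoes moves toward a hazard never causes harm, but also never lets the human reach the goal.

```python
# Illustrative shared-control corridor MDP (a toy stand-in for the paper's
# Multi-Agent MDP testbed; states, hazard cell, and policies are assumptions).
GOAL, HAZARD, LENGTH = 4, 2, 5

def step(state, human_action, follower_overrides):
    """Follower may veto the human's move; otherwise the command executes."""
    if follower_overrides:
        return state  # override = stay put, avoiding the hazard
    return min(state + 1, LENGTH - 1) if human_action == "right" else state

def rollout(follower_policy, horizon=10):
    """Human repeatedly commands 'right' toward the goal cell."""
    state, trajectory = 0, [0]
    for _ in range(horizon):
        state = step(state, "right", follower_policy(state))
        trajectory.append(state)
    return trajectory

# A maximally cautious follower vetoes any move into the hazard cell.
cautious = lambda s: s + 1 == HAZARD
traj = rollout(cautious)
# The system never enters the hazard, but also never reaches the goal:
# this is the 'safety trap' the IDG identifies.
print(traj)  # pinned at cell 1 after the first step
```

In an RL setting, the follower's override policy would be learned rather than hard-coded, and a well-shaped reward would have to penalize exactly this kind of indefinite stalling.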

Demerits

Limitation

The IDG framework analyzes a stylized scenario: although it models asymmetric information, the payoff structures and dynamics are treated as fully specified, which may not hold in real-world applications.

Limitation

The framework does not consider the emotional and social aspects of human-AI interaction, which are crucial for understanding intelligent disobedience in shared autonomy.

Expert Commentary

The IDG framework is a significant contribution to AI safety and shared autonomy. By giving intelligent disobedience a formal, analyzable structure, it lays the groundwork both for building agents that disobey safely and for studying how humans perceive them. Its simplifying assumptions should, however, be kept in mind when applying it to real-world scenarios. The translation into a shared control Multi-Agent Markov Decision Process is equally valuable: it yields a compact testbed in which reinforcement learning agents can be trained to learn safe non-compliance.

Recommendations

  • Future research should focus on relaxing the framework's assumptions and incorporating the emotional and social aspects of human-AI interaction.
  • The IDG framework should be applied to real-world scenarios, such as autonomous vehicles and robots, to develop safe and trustworthy AI systems that can learn safe non-compliance.

Sources

Original: arXiv - cs.AI