CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
arXiv:2603.11915v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
Executive Summary
The article introduces CoMMET, a novel multimodal benchmark designed to evaluate Large Language Models (LLMs) on Theory of Mind (ToM) tasks in a multi-turn conversational context. Whereas existing ToM assessments for LLMs are largely constrained to text-only, belief-focused tasks, CoMMET broadens the scope by incorporating multimodal inputs and multi-turn interactions, drawing inspiration from the Theory of Mind Booklet Task. The paper represents a significant advance in evaluating the social cognitive capabilities of LLMs, offering a broader, more dynamic framework for ToM evaluation. This benchmark fills a critical gap in the current landscape, enabling more comprehensive assessments of LLM capacities in social reasoning.
Key Points
- CoMMET introduces a multimodal benchmark for ToM evaluation
- Expands beyond text-based belief tasks to include multi-turn conversational settings
- First multimodal dataset for evaluating ToM in a conversational context
Merits
Strength
CoMMET addresses a significant gap by offering a multimodal, multi-turn evaluation framework that more accurately reflects real-world ToM demands.
Demerits
Limitation
The current scope of CoMMET may still be limited by the absence of specific, standardized metrics for quantifying nuanced ToM performance across diverse LLM architectures.
Expert Commentary
The introduction of CoMMET represents a pivotal shift in the evaluation of LLM social intelligence. Traditional ToM assessments have been constrained by their reliance on static, text-only inputs, which inadequately capture the complexity of human social cognition. CoMMET's integration of multimodal inputs and multi-turn dialogues offers a more ecologically valid simulation of human interaction. Moreover, by drawing on the Theory of Mind Booklet Task, the dataset builds on a well-established psychological paradigm, enhancing its credibility and applicability. However, the apparent absence of a standardized scoring mechanism for ToM performance may hinder comparative analysis across models. Future iterations should incorporate standardized evaluation indices to allow for more rigorous benchmarking. Overall, CoMMET sets a new precedent for evaluating AI social reasoning and encourages a more nuanced, multi-dimensional approach to LLM assessment.
Recommendations
- Develop standardized ToM scoring metrics for future benchmarking
- Expand CoMMET to include additional modalities, such as audio or video, for richer contextual evaluation