CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
arXiv:2603.11915v1 Announce Type: new Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
Executive Summary
The article introduces CoMMET, a novel multimodal benchmark designed to evaluate Large Language Models (LLMs) on Theory of Mind (ToM) tasks in a multi-turn conversational context. Whereas existing ToM assessments for LLMs are largely constrained to text-only, belief-focused tasks, CoMMET broadens the scope by incorporating multimodal inputs and multi-turn interactions, drawing inspiration from the Theory of Mind Booklet Task. The paper represents a significant advance in evaluating the social cognitive capabilities of LLMs, offering a broader, more dynamic framework for ToM evaluation. This benchmark fills a critical gap in the current landscape, enabling more comprehensive assessments of LLM capacities in social reasoning.
Key Points
- CoMMET introduces a multimodal benchmark for ToM evaluation
- Expands beyond text-based belief tasks to include multi-turn conversational settings
- First multimodal dataset for evaluating ToM in a conversational context
Merits
Strength
CoMMET addresses a significant gap by offering a multimodal, multi-turn evaluation framework that more accurately reflects real-world ToM demands.
Demerits
Limitation
The current scope of CoMMET may still be limited by the absence of specific, standardized metrics for quantifying nuanced ToM performance across diverse LLM architectures.
Expert Commentary
The introduction of CoMMET represents a pivotal shift in the evaluation of LLM social intelligence. Traditional ToM assessments have been constrained by their reliance on static, text-only inputs, which inadequately capture the complexity of human social cognition. CoMMET's integration of multimodal inputs and multi-turn dialogues offers a more ecologically valid simulation of human interaction. Moreover, by drawing on the Theory of Mind Booklet Task, the dataset builds on a well-established psychological paradigm, enhancing its credibility and applicability. However, the apparent absence of a standardized scoring mechanism for ToM performance may hinder comparative analysis across models. Future iterations should incorporate standardized evaluation indices to allow for more rigorous benchmarking. Overall, CoMMET sets a new precedent for evaluating AI social reasoning and encourages a more nuanced, multi-dimensional approach to LLM assessment.
Recommendations
- Develop standardized ToM scoring metrics for future benchmarking
- Expand CoMMET to include additional modalities, such as audio or video, for richer contextual evaluation