Academic

I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes

arXiv:2603.23229v1 Announce Type: new Abstract: Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.

Executive Summary

This article summarizes an evaluation of eight state-of-the-art multimodal large language models (MLLMs) on their ability to detect and explain figurative meaning in memes, spanning three datasets and six types of figurative meaning. The findings reveal that all models are strongly biased toward assigning figurative meaning to a meme even when none is present, and that correct predictions are not always accompanied by faithful explanations. The work sheds light on how MLLMs combine visual and textual information in online communication, while its limitations point to the need for further research into more reliable detection of figurative meaning in memes.

Key Points

  • Evaluation of eight state-of-the-art MLLMs on figurative meaning in memes
  • Three datasets used to assess MLLMs' performance across six types of figurative meaning
  • Strong bias among MLLMs to associate memes with figurative meaning, even when none is present
  • Correct predictions not always accompanied by faithful explanations

Merits

Strength in Methodology

The study employs a rigorous evaluation framework, utilizing multiple datasets and assessing MLLMs' performance across various types of figurative meaning. This comprehensive approach provides a robust foundation for the research findings and contributes to the advancement of the field.
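The paper's headline finding is a bias toward figurative labels on literal memes. As a minimal illustration (not the authors' code, and using made-up labels rather than their data), that bias can be quantified as the fraction of gold-literal items a model nevertheless labels with some figurative type:

```python
def figurative_bias_rate(gold, pred):
    """Fraction of gold-literal memes that the model labels as
    figurative -- the false-positive tendency the paper reports."""
    literal_preds = [p for g, p in zip(gold, pred) if g == "literal"]
    if not literal_preds:
        return 0.0
    return sum(p != "literal" for p in literal_preds) / len(literal_preds)

# Toy example with invented labels, purely for illustration:
gold = ["literal", "literal", "metaphor", "irony", "literal", "literal"]
pred = ["metaphor", "irony", "metaphor", "irony", "literal", "hyperbole"]
print(figurative_bias_rate(gold, pred))  # 0.75: 3 of 4 literal memes mislabelled
```

A rate near 0 would mean the model reserves figurative labels for genuinely figurative memes; the study's finding corresponds to this rate being high across all eight models.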

Demerits

Limitation in Generalizability

The study evaluates a single task family on three specific datasets, which may limit how well the findings transfer to other tasks, meme genres, or datasets, where the models' performance and behavior could differ.

Bias in MLLMs' Performance

The models' strong tendency to read figurative meaning into memes, even when none is present, inflates false positives on literal content. Moreover, correct labels are not reliably backed by faithful explanations, which undermines trust in the models' stated reasoning.

Expert Commentary

The findings have practical implications for deploying MLLMs in online communication. A model that defaults to figurative readings will over-interpret literal content, and explanations that do not faithfully reflect the meme make such errors harder to audit. Beyond accuracy, the study raises broader questions about how these biases and limitations might shape online discourse, and it is essential to weigh such societal implications as the field evolves.

Recommendations

  • Develop MLLMs that distinguish literal from figurative memes more reliably and ground their explanations in the actual meme content
  • Conduct further research on the societal implications of MLLMs' biases and limitations in online communication

Sources

Original: arXiv - cs.CL