COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
arXiv:2603.15897v1 Announce Type: new Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
Executive Summary
The article presents a system for SemEval-2026 Task 5, which rates the plausibility of homonym word senses in short stories on a 5-point Likert scale. The system combines ensembles of large language models (LLMs) with three prompting strategies: zero-shot, Chain-of-Thought, and comparative. The best official submission, an ensemble across all three strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho; post-competition experiments with additional models raised this to 0.92 accuracy and 0.85 rho. The results demonstrate the effectiveness of LLM ensembles for subjective semantic evaluation tasks, particularly in addressing inter-annotator variation.
Key Points
- The system uses LLM ensembles with multiple prompting strategies
- Comparative prompting consistently improved performance across model families
- Model ensembling enhanced alignment with mean human judgments
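The ensembling step described above, averaging Likert ratings from several model/prompt combinations, can be sketched as follows. This is a minimal illustration, not the authors' code; the example ratings are hypothetical.

```python
from statistics import mean

def ensemble_rating(predictions):
    """Average per-model Likert ratings and clip to the 1-5 scale."""
    avg = mean(predictions)
    return min(max(avg, 1.0), 5.0)

# Hypothetical ratings from three model/prompt combinations for one item
ratings = [4, 5, 4]
print(ensemble_rating(ratings))  # ≈ 4.33
```

Averaging naturally mimics the mean-of-annotators gold label, which is one plausible reason ensembles align better with human judgments than any single model.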
Merits
Effective Use of LLM Ensembles
The system's use of LLM ensembles demonstrates a robust approach to addressing the challenges of subjective semantic evaluation tasks.
Improved Performance with Comparative Prompting
The consistent improvement in performance with comparative prompting highlights its potential as a valuable strategy in similar tasks.
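A comparative prompt of the kind described, which shows all candidate senses at once rather than scoring each in isolation, might be assembled like this. The template wording, function name, and sense glosses are illustrative assumptions, not the authors' actual prompt.

```python
def build_comparative_prompt(story, word, senses):
    # Present every candidate sense in a single prompt so the model can
    # weigh them against each other before assigning Likert ratings.
    sense_lines = "\n".join(
        f"{i}. {gloss}" for i, gloss in enumerate(senses, start=1)
    )
    return (
        f"Story:\n{story}\n\n"
        f"The word '{word}' is ambiguous. Candidate senses:\n"
        f"{sense_lines}\n\n"
        "Compare the senses and rate the plausibility of each in this "
        "story on a 1-5 Likert scale, one rating per line."
    )

prompt = build_comparative_prompt(
    story="She walked to the bank and sat down by the water.",
    word="bank",
    senses=["a financial institution", "the sloping side of a river"],
)
print(prompt)
```

Presenting senses jointly lets the model calibrate ratings against each other, which may explain the consistent gains over per-sense scoring.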
Demerits
Dependence on Closed-Source Commercial LLMs
The system's reliance on closed-source commercial LLMs may limit its accessibility and reproducibility for other researchers.
Inter-annotator Variation
The substantial inter-annotator variation in the gold labels means the reference standard is itself noisy, which complicates evaluation and limits how closely any system can be expected to match the mean human judgment.
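The task's official metric, as described in the abstract, is the unweighted average of accuracy (a prediction counts as correct when it falls within one standard deviation of the mean human judgment) and Spearman's rho. A self-contained sketch of that score on hypothetical data might look like this; `spearman_rho` and `task_score` are illustrative names, not the official scoring code.

```python
def average_ranks(values):
    # Assign average ranks, handling ties (common with Likert-scale data).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson correlation of the rank vectors.
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def task_score(preds, human_means, human_stds):
    # Accuracy: fraction of predictions within one SD of the mean judgment.
    acc = sum(
        abs(p - m) <= s for p, m, s in zip(preds, human_means, human_stds)
    ) / len(preds)
    return (acc + spearman_rho(preds, human_means)) / 2
```

The one-SD tolerance builds the annotator spread directly into the accuracy term, so items where humans disagree more are scored more leniently.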
Expert Commentary
The article presents a significant contribution to the field of natural language processing, demonstrating the potential of LLM ensembles in addressing the challenges of subjective semantic evaluation tasks. The use of comparative prompting and model ensembling highlights the importance of carefully designed prompting strategies and the benefits of combining multiple models. However, the reliance on closed-source commercial LLMs and the impact of inter-annotator variation are important considerations for future research. Overall, the system's performance and the insights gained from this study have important implications for the development of more effective language understanding systems.
Recommendations
- Future research should explore the use of open-source LLMs and alternative prompting strategies to improve the accessibility and reproducibility of the system.
- The development of more robust evaluation metrics and methods for addressing inter-annotator variation is crucial for advancing subjective semantic evaluation tasks.