COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
arXiv:2603.15897v1 Announce Type: new Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
Executive Summary
The article presents a system for SemEval-2026 Task 5, which rates the plausibility of homonym word senses in short stories on a 5-point Likert scale. The system combines ensembles of large language models (LLMs) with three prompting strategies: zero-shot, Chain-of-Thought, and comparative. The best official submission, an ensemble across all three strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho; post-competition experiments with additional models raised this to 0.92 accuracy and 0.85 rho. The results demonstrate the effectiveness of LLM ensembles for subjective semantic evaluation tasks, particularly in addressing inter-annotator variation.
Key Points
- The system uses LLM ensembles with multiple prompting strategies
- Comparative prompting consistently improved performance across model families
- Model ensembling enhanced alignment with mean human judgments
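The ensembling step described above, averaging Likert ratings from several model/prompt combinations, can be sketched as follows. This is a minimal illustration, not the authors' code; the example ratings are hypothetical.

```python
from statistics import mean

def ensemble_rating(predictions):
    """Average per-model Likert ratings and clip to the 1-5 scale."""
    avg = mean(predictions)
    return min(max(avg, 1.0), 5.0)

# Hypothetical ratings from three model/prompt combinations for one item
ratings = [4, 5, 4]
print(ensemble_rating(ratings))  # ≈ 4.33
```

Averaging naturally mimics the mean-of-annotators gold label, which is one plausible reason ensembles align better with human judgments than any single model.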
Merits
Effective Use of LLM Ensembles
The system's use of LLM ensembles demonstrates a robust approach to addressing the challenges of subjective semantic evaluation tasks.
Improved Performance with Comparative Prompting
The consistent improvement in performance with comparative prompting highlights its potential as a valuable strategy in similar tasks.
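A comparative prompt of the kind described, which shows all candidate senses at once rather than scoring each in isolation, might be assembled like this. The template wording, function name, and sense glosses are illustrative assumptions, not the authors' actual prompt.

```python
def build_comparative_prompt(story, word, senses):
    # Present every candidate sense in a single prompt so the model can
    # weigh them against each other before assigning Likert ratings.
    sense_lines = "\n".join(
        f"{i}. {gloss}" for i, gloss in enumerate(senses, start=1)
    )
    return (
        f"Story:\n{story}\n\n"
        f"The word '{word}' is ambiguous. Candidate senses:\n"
        f"{sense_lines}\n\n"
        "Compare the senses and rate the plausibility of each in this "
        "story on a 1-5 Likert scale, one rating per line."
    )

prompt = build_comparative_prompt(
    story="She walked to the bank and sat down by the water.",
    word="bank",
    senses=["a financial institution", "the sloping side of a river"],
)
print(prompt)
```

Presenting senses jointly lets the model calibrate ratings against each other, which may explain the consistent gains over per-sense scoring.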
Demerits
Dependence on Closed-Source Commercial LLMs
The system's reliance on closed-source commercial LLMs may limit its accessibility and reproducibility for other researchers.
Inter-annotator Variation
The substantial inter-annotator variation in the gold labels means the reference standard is itself noisy, which complicates evaluation and limits how closely any system can be expected to match the mean human judgment.
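The task's official metric, as described in the abstract, is the unweighted average of accuracy (a prediction counts as correct when it falls within one standard deviation of the mean human judgment) and Spearman's rho. A self-contained sketch of that score on hypothetical data might look like this; `spearman_rho` and `task_score` are illustrative names, not the official scoring code.

```python
def average_ranks(values):
    # Assign average ranks, handling ties (common with Likert-scale data).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson correlation of the rank vectors.
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def task_score(preds, human_means, human_stds):
    # Accuracy: fraction of predictions within one SD of the mean judgment.
    acc = sum(
        abs(p - m) <= s for p, m, s in zip(preds, human_means, human_stds)
    ) / len(preds)
    return (acc + spearman_rho(preds, human_means)) / 2
```

The one-SD tolerance builds the annotator spread directly into the accuracy term, so items where humans disagree more are scored more leniently.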
Expert Commentary
The article presents a significant contribution to the field of natural language processing, demonstrating the potential of LLM ensembles in addressing the challenges of subjective semantic evaluation tasks. The use of comparative prompting and model ensembling highlights the importance of carefully designed prompting strategies and the benefits of combining multiple models. However, the reliance on closed-source commercial LLMs and the impact of inter-annotator variation are important considerations for future research. Overall, the system's performance and the insights gained from this study have important implications for the development of more effective language understanding systems.
Recommendations
- Future research should explore the use of open-source LLMs and alternative prompting strategies to improve the accessibility and reproducibility of the system.
- The development of more robust evaluation metrics and methods for addressing inter-annotator variation is crucial for advancing subjective semantic evaluation tasks.