GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals
arXiv:2603.09979v1 Announce Type: new Abstract: Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
Executive Summary
The article introduces GhazalBench, a benchmark for evaluating large language models (LLMs) on Persian ghazals under usage-grounded conditions. It assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, the results show a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially narrow this gap. A parallel evaluation on English sonnets, where recall is markedly higher, suggests these limitations stem from differences in training exposure rather than inherent architectural constraints.
Key Points
- ▸ GhazalBench is a benchmark for evaluating LLMs on Persian ghazals
- ▸ The benchmark assesses two complementary abilities: prose paraphrasing of couplets and cue-dependent access to canonical verses
- ▸ The results show a dissociation between capturing poetic meaning and exact verse recall, with recognition-based tasks substantially narrowing the gap
Merits
Comprehensive Evaluation Framework
GhazalBench provides a comprehensive framework for evaluating LLMs on culturally significant texts
Usage-Grounded Approach
The benchmark assesses LLMs under usage-grounded conditions, reflecting real-world interactions with poetic texts
Demerits
Limited Scope
The benchmark covers only Persian ghazals, so its findings may not generalize to other poetic forms or languages
Dependence on Training Data
The results suggest that LLM performance on verse recall depends heavily on training exposure, as evidenced by the markedly higher recall on English sonnets
Expert Commentary
The article makes a valuable contribution to natural language processing by highlighting the need for more nuanced, culturally sensitive evaluation frameworks for LLMs. The observed dissociation between semantic understanding and exact verse recall indicates that task performance depends heavily on training exposure, pointing to a need for more diverse and representative training data. GhazalBench offers a practical tool for evaluating LLMs on culturally significant texts, with implications for building more effective and inclusive AI systems.
Recommendations
- ✓ Develop evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts
- ✓ Increase the diversity and representativeness of training data for LLMs
- ✓ Consider the cultural significance of texts when developing and evaluating LLMs