GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals
arXiv:2603.09979v1 Announce Type: new Abstract: Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
Executive Summary
The article introduces GhazalBench, a benchmark for evaluating large language models (LLMs) on Persian ghazals under usage-grounded conditions. It assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, the results show a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially narrow this gap. A parallel evaluation on English sonnets, where recall is markedly higher, suggests these limitations stem from differences in training exposure rather than inherent architectural constraints.
Key Points
- ▸ GhazalBench is a benchmark for evaluating LLMs on Persian ghazals
- ▸ The benchmark assesses two complementary abilities: prose paraphrasing of couplets and cue-dependent access to canonical verses
- ▸ The results show a dissociation between capturing poetic meaning and exact verse recall, with recognition-based tasks substantially narrowing the gap
Merits
Comprehensive Evaluation Framework
GhazalBench provides a comprehensive framework for evaluating LLMs on culturally significant texts
Usage-Grounded Approach
The benchmark assesses LLMs under usage-grounded conditions, reflecting real-world interactions with poetic texts
Demerits
Limited Scope
The benchmark covers only Persian ghazals, so its findings may not generalize to other poetic forms or languages
Dependence on Training Data
The results suggest that LLM performance on verse recall depends heavily on training exposure, as evidenced by the markedly higher recall on English sonnets
Expert Commentary
The article makes a valuable contribution to natural language processing by highlighting the need for more nuanced, culturally sensitive evaluation frameworks for LLMs. The observed dissociation between semantic understanding and exact verse recall indicates that task performance depends heavily on training exposure, pointing to a need for more diverse and representative training data. GhazalBench offers a practical tool for evaluating LLMs on culturally significant texts, with implications for building more effective and inclusive AI systems.
Recommendations
- ✓ Develop evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts
- ✓ Increase the diversity and representativeness of training data for LLMs
- ✓ Consider the cultural significance of texts when developing and evaluating LLMs