CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context
arXiv:2603.22576v1

Abstract: We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at \$0.13 vs Claude-Haiku-4.5: 73.5% at \$1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
Executive Summary
The CAPITU benchmark introduces a culturally grounded evaluation framework for instruction-following in Brazilian Portuguese, distinguishing itself by anchoring tasks within eight canonical literary works. With 59 instruction types across seven categories—specifically incorporating Portuguese linguistic nuances such as verb terminations and morphological markers—CAPITU enables fully automatable verification, eliminating reliance on LLM judges or human evaluation. The evaluation of 18 LLMs across single- and multi-turn settings reveals that frontier reasoning models (e.g., GPT-5.2) achieve near-perfect strict accuracy, while Portuguese-specialized models deliver competitive accuracy at a fraction of the cost. Multi-turn evaluation exposes persistent challenges in constraint retention across turns, particularly for morphological and exact-counting tasks. The open release of the benchmark, evaluation code, and baseline results is a significant contribution to research on instruction-following beyond English.
Key Points
- ▸ CAPITU is culturally contextualized within Brazilian literature
- ▸ Instruction types include Portuguese-specific linguistic constraints
- ▸ Automatable verification reduces human intervention
Merits
Cultural Relevance
By anchoring tasks in canonical Brazilian literature, CAPITU provides a more authentic and linguistically nuanced evaluation environment than generic English-centric benchmarks.
Automatable Verification
The design of instruction types with verifiable constraints enables scalable, repeatable, and objective evaluation without human intervention.
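To make the idea concrete, a verifier for a Portuguese suffix constraint can be a few lines of deterministic code. The sketch below is illustrative, not CAPITU's actual implementation: the function names, the example constraint ("at least N gerunds"), and the sample response are all hypothetical, but they show why such constraints need no LLM judge.

```python
import re

# Hypothetical rule-based verifier illustrating the kind of automatic check
# CAPITU-style constraints permit. Names and the example constraint are
# illustrative assumptions, not the benchmark's real code.

def count_suffix_words(text: str, suffixes: tuple[str, ...]) -> int:
    """Count words ending in any of the given Portuguese suffixes."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    return sum(1 for w in words if w.lower().endswith(suffixes))

def verify_min_gerunds(text: str, minimum: int) -> bool:
    """Check a constraint like 'use at least N gerunds (-ando/-endo/-indo)'."""
    return count_suffix_words(text, ("ando", "endo", "indo")) >= minimum

# Sample model response (invented for illustration): contains three gerunds.
resposta = "Capitu estava sorrindo e caminhando pela praia, pensando no mar."
print(verify_min_gerunds(resposta, 3))  # → True
```

The same pattern extends to diminutives (-inho/-inha) or adverbs (-mente), and to structural checks such as exact word counts, which is what makes the whole benchmark scorable by string matching alone.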
Diverse Model Comparison
The benchmark facilitates comparative analysis across both generalist and specialized models, revealing nuanced performance trade-offs between cost and accuracy.
Demerits
Limited Generalizability
The focus on Brazilian Portuguese and canonical literature limits applicability to other linguistic or cultural domains, potentially restricting broader adoption.
Constraint Complexity
Some morphological and exact-counting constraints may be interpreted inconsistently across models, introducing variability that reflects prompt ambiguity rather than genuine differences in instruction-following ability.
Expert Commentary
CAPITU represents a notable advance in the evaluation of LLMs for non-English languages. The integration of literary context with verifiable instruction constraints is particularly effective: it transforms evaluation from a generic, abstract task into a culturally embedded, linguistically authentic measurement. Unlike prior benchmarks that treat language as a neutral variable, CAPITU acknowledges that language is inherently tied to cultural expression, syntax, and morphology—factors that profoundly affect model behavior. Evaluating 18 state-of-the-art models under unified, objective criteria demonstrates methodological rigor, and the finding that Portuguese-specialized models perform competitively at lower cost suggests that linguistic specialization need not be a trade-off against efficacy. This benchmark does more than measure performance: it sharpens the criteria by which we assess linguistic competence across diverse language ecosystems, and it offers a template for future culturally anchored benchmarks in multilingual AI research.
Recommendations
- ✓ Adopt CAPITU as a reference benchmark for Portuguese instruction-following in academic and industry research.
- ✓ Extend CAPITU’s framework to other regional languages (e.g., Spanish, French, Arabic) to create a global suite of culturally anchored evaluation tools.
- ✓ Develop supplementary datasets or clarified prompt variants for constraints that prove ambiguous to certain models, improving interpretability without diluting rigor.
Sources
Original: arXiv - cs.CL