CAPITU: A Benchmark for Evaluating Instruction-Following in Brazilian Portuguese with Literary Context
arXiv:2603.22576v1

Abstract: We introduce CAPITU, a benchmark for evaluating instruction-following capabilities of Large Language Models (LLMs) in Brazilian Portuguese. Unlike existing benchmarks that focus on English or use generic prompts, CAPITU contextualizes all tasks within eight canonical works of Brazilian literature, combining verifiable instruction constraints with culturally-grounded content. The benchmark comprises 59 instruction types organized into seven categories, all designed to be automatically verifiable without requiring LLM judges or human evaluation. Instruction types include Portuguese-specific linguistic constraints (word termination patterns like -ando/-endo/-indo, -inho/-inha, -mente) and structural requirements. We evaluate 18 state-of-the-art models across single-turn and multi-turn settings. Our results show that frontier reasoning models achieve strong performance (GPT-5.2 with reasoning: 98.5% strict accuracy), while Portuguese-specialized models offer competitive cost-efficiency (Sabiazinho-4: 87.0% at \$0.13 vs Claude-Haiku-4.5: 73.5% at \$1.12). Multi-turn evaluation reveals significant variation in constraint persistence, with conversation-level accuracy ranging from 60% to 96% across models. We identify specific challenges in morphological constraints, exact counting, and constraint persistence degradation across turns. We release the complete benchmark, evaluation code, and baseline results to facilitate research on instruction-following in Portuguese.
Executive Summary
The CAPITU benchmark introduces a culturally grounded evaluation framework for instruction-following in Brazilian Portuguese, distinguishing itself by anchoring tasks within eight canonical literary works. With 59 instruction types across seven categories—specifically incorporating Portuguese linguistic nuances such as verb terminations and morphological markers—CAPITU enables fully automatable verification, eliminating reliance on LLM judges or human evaluation. The evaluation of 18 LLMs across single- and multi-turn settings reveals that frontier reasoning models (e.g., GPT-5.2) achieve near-perfect strict accuracy, while Portuguese-specialized models deliver competitive accuracy at a fraction of the cost. Multi-turn evaluation exposes persistent challenges in constraint retention across turns, particularly for morphological and exact-counting tasks. The open release of the benchmark, evaluation code, and baseline results is a significant contribution to research on instruction-following beyond English.
Key Points
- ▸ CAPITU is culturally contextualized within Brazilian literature
- ▸ Instruction types include Portuguese-specific linguistic constraints
- ▸ Automatable verification reduces human intervention
Merits
Cultural Relevance
By anchoring tasks in canonical Brazilian literature, CAPITU provides a more authentic and linguistically nuanced evaluation environment than generic English-centric benchmarks.
Automatable Verification
The design of instruction types with verifiable constraints enables scalable, repeatable, and objective evaluation without human intervention.
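To make the idea concrete, a verifier for a Portuguese suffix constraint can be a few lines of deterministic code. The sketch below is illustrative, not CAPITU's actual implementation: the function names, the example constraint ("at least N gerunds"), and the sample response are all hypothetical, but they show why such constraints need no LLM judge.

```python
import re

# Hypothetical rule-based verifier illustrating the kind of automatic check
# CAPITU-style constraints permit. Names and the example constraint are
# illustrative assumptions, not the benchmark's real code.

def count_suffix_words(text: str, suffixes: tuple[str, ...]) -> int:
    """Count words ending in any of the given Portuguese suffixes."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    return sum(1 for w in words if w.lower().endswith(suffixes))

def verify_min_gerunds(text: str, minimum: int) -> bool:
    """Check a constraint like 'use at least N gerunds (-ando/-endo/-indo)'."""
    return count_suffix_words(text, ("ando", "endo", "indo")) >= minimum

# Sample model response (invented for illustration): contains three gerunds.
resposta = "Capitu estava sorrindo e caminhando pela praia, pensando no mar."
print(verify_min_gerunds(resposta, 3))  # → True
```

The same pattern extends to diminutives (-inho/-inha) or adverbs (-mente), and to structural checks such as exact word counts, which is what makes the whole benchmark scorable by string matching alone.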
Diverse Model Comparison
The benchmark facilitates comparative analysis across both generalist and specialized models, revealing nuanced performance trade-offs between cost and accuracy.
Demerits
Limited Generalizability
The focus on Brazilian Portuguese and canonical literature limits applicability to other linguistic or cultural domains, potentially restricting broader adoption.
Constraint Complexity
Some morphological and exact-counting constraints may be interpreted inconsistently across models, introducing variability that reflects prompt ambiguity rather than genuine differences in instruction-following ability.
Expert Commentary
CAPITU represents a notable advance in the evaluation of LLMs for non-English languages. The integration of literary context with verifiable instruction constraints is particularly effective: it transforms evaluation from a generic, abstract task into a culturally embedded, linguistically authentic measurement. Unlike prior benchmarks that treat language as a neutral variable, CAPITU acknowledges that language is inherently tied to cultural expression, syntax, and morphology—factors that profoundly affect model behavior. Evaluating 18 state-of-the-art models under unified, objective criteria demonstrates methodological rigor, and the finding that Portuguese-specialized models perform competitively at lower cost suggests that linguistic specialization need not be a trade-off against efficacy. This benchmark does more than measure performance: it sharpens the criteria by which we assess linguistic competence across diverse language ecosystems, and it offers a template for future culturally anchored benchmarks in multilingual AI research.
Recommendations
- ✓ Adopt CAPITU as a reference benchmark for Portuguese instruction-following in academic and industry research.
- ✓ Extend CAPITU’s framework to other regional languages (e.g., Spanish, French, Arabic) to create a global suite of culturally anchored evaluation tools.
- ✓ Develop supplementary datasets or clarified prompt variants for constraints that prove ambiguous to certain models, improving interpretability without diluting rigor.
Sources
Original: arXiv - cs.CL