The Token Games: Evaluating Language Model Reasoning with Puzzle Duels
arXiv:2602.17831v1 Announce Type: new Abstract: Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a Python function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models, not measured by previous benchmarks. Overall, our work suggests new paradigms for evaluating reasoning that cannot be saturated by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.
Executive Summary
The article proposes a novel evaluation framework, The Token Games (TTG), to assess the reasoning capabilities of Large Language Models. Inspired by 16th-century mathematical duels, TTG has models challenge each other with puzzles of their own creation, yielding an evaluation process that adapts as models improve. The framework leverages programming puzzles to represent problems and verify solutions, and computes Elo ratings from pairwise duels to compare models. The results show that TTG closely matches the rankings from existing benchmarks without requiring any human effort to create puzzles, and also reveal that creating good puzzles remains a highly challenging task for current models.
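The Programming Puzzles format the framework builds on can be sketched in a few lines: a puzzle is a Python function that returns a boolean, a solution is any input that makes it return `True`, and verification is simply execution. The puzzle below is an illustrative example, not one taken from the paper.

```python
def puzzle(s: str) -> bool:
    """An example puzzle: find a length-5 string that reads the same reversed."""
    return s == s[::-1] and len(s) == 5

def verify(f, answer) -> bool:
    """Verification requires no human judgment: run the puzzle on the answer."""
    return f(answer) is True

# "level" is a palindrome of length 5, so it solves the puzzle.
print(verify(puzzle, "level"))
```

Because any checkable property can be wrapped in such a boolean function, the format can represent problems flexibly while keeping grading fully automatic.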
Key Points
- ▸ The Token Games (TTG) is a novel evaluation framework for assessing Large Language Model reasoning capabilities
- ▸ TTG enables models to challenge each other by creating their own puzzles
- ▸ The framework uses programming puzzles to represent problems and verify solutions, with Elo ratings for model comparison
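The Elo computation over pairwise duel results can be sketched with the standard update rule; the paper's exact rating scheme (K-factor, scale, how a duel is scored) is not specified here, so these parameters are assumptions.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one duel; score_a is 1 (win), 0.5, or 0."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Example: a 1500-rated model beats an equally rated opponent.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)  # ratings move to 1516 and 1484
```

Iterating this update over all pairwise duels yields a relative ranking of the models without any absolute, saturable score.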
Merits
Innovative Evaluation Approach
TTG offers a unique and dynamic evaluation framework that can adapt to improving model capabilities
Efficient Use of Resources
The framework eliminates the need for human curation of hard questions, reducing the expense and effort required
Comprehensive Assessment
TTG evaluates not only reasoning capabilities but also creativity and task creation skills
Demerits
Limited Generalizability
The framework's reliance on programming puzzles may limit its applicability to other problem domains
Model Bias and Fairness
The use of Elo ratings and pairwise duels may introduce biases and unfair comparisons between models
Expert Commentary
The Token Games framework represents a significant advancement in the evaluation of Large Language Models, offering a more dynamic and adaptive approach to assessing reasoning capabilities. By leveraging model-generated puzzles and Elo ratings, TTG provides a comprehensive assessment of model performance, including creativity and task creation skills. However, the framework's limitations, such as limited generalizability and potential biases, must be carefully considered. As the field continues to evolve, it is essential to address these challenges and explore the potential applications and implications of TTG in various domains.
Recommendations
- ✓ Future research should investigate the generalizability of TTG to other problem domains and model architectures
- ✓ Evaluation frameworks with automatic, verifiable grading, such as TTG, should be prioritized to improve the trustworthiness and reliability of model comparisons