Academic

BenchBench: Benchmarking Automated Benchmark Generation

arXiv:2603.20807v1 Announce Type: new Abstract: Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer-answerer matrices with item-level quality flags and psychometric diagnostics. Across nine variants spanning computer science, mathematics, medicine, and theory-of-mind reasoning (including multilingual and multimodal settings), we generate 16.7K items, retain ~15K core items post-filtering, and produce ~152K graded model-item responses. BenchBench shows that benchmark-design ability is only moderately correlated with answer-time strength (Spearman ρ ≈ 0.37), invalidity is negatively associated with discrimination (Pearson r ≈ -0.62), and the resulting designer-answerer matrices enable scalable audits of format/modality/language fidelity and suite-dependent self/family interactions. The project is available at: https://github.com/koanatakiyo/BenchBench.
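
Read end to end, the three stages form a simple data flow: seed benchmark → domain card → quota-controlled suites from several designer models → grading by an answerer panel into a designer-answerer matrix. The Python sketch below illustrates that flow under stated assumptions; every name in it (DomainCard, extract_domain_card, generate_suite, grade_suite and their fields) is a hypothetical stand-in, not the repository's actual API.

```python
# Hypothetical sketch of the three BenchBench stages; all names are
# illustrative stand-ins, not the actual pipeline in the linked repository.
from dataclasses import dataclass

@dataclass
class DomainCard:
    """Stage (i): structured summary distilled from a seed benchmark."""
    domain: str
    item_formats: list[str]      # e.g. ["multiple_choice", "short_answer"]
    quotas: dict[str, int]       # items requested per format

@dataclass
class Item:
    prompt: str
    gold_answer: str
    designer: str                # which designer LLM produced the item

def extract_domain_card(seed_benchmark: str) -> DomainCard:
    """Stage (i), stubbed: parse a seed benchmark into a domain card."""
    return DomainCard(domain=seed_benchmark,
                      item_formats=["short_answer"],
                      quotas={"short_answer": 10})

def generate_suite(card: DomainCard, designer: str) -> list[Item]:
    """Stage (ii), stubbed: one designer LLM fills the card's quotas."""
    total = sum(card.quotas.values())
    return [Item(prompt=f"[{card.domain} question {i}]", gold_answer="42",
                 designer=designer) for i in range(total)]

def grade_suite(items: list[Item], answerers: list[str]) -> dict:
    """Stage (iii), stubbed: every answerer attempts every item, producing a
    designer-answerer matrix of mean scores (placeholder values here)."""
    return {(item.designer, answerer): 0.0
            for item in items for answerer in answerers}

card = extract_domain_card("mathematics")
suite = [item for d in ("designer_a", "designer_b")
         for item in generate_suite(card, d)]
matrix = grade_suite(suite, answerers=["answerer_x", "answerer_y"])
print(f"{len(suite)} items, {len(matrix)} designer-answerer cells")
```

In the released dataset, the matrix cells carry graded scores, item-level quality flags, and psychometric diagnostics rather than the placeholder values used here.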

Executive Summary

BenchBench is a novel three-stage pipeline and dataset designed to evaluate the ability of large language models (LLMs) to generate high-quality benchmarks for a range of tasks, including computer science, mathematics, medicine, and theory-of-mind reasoning. The BenchBench framework extracts structured domain cards from seed benchmarks, prompts multiple designer LLMs to generate quota-controlled suites, and validates items with a multi-model answerer panel. The study generates 16.7K items, retains ~15K core items post-filtering, and produces ~152K graded model-item responses. BenchBench demonstrates that benchmark-design ability is only moderately correlated with answer-time strength and highlights the importance of evaluating LLMs' ability to design benchmarks.
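
The headline correlation (design ability vs. answer-time strength, Spearman ρ ≈ 0.37) should in principle be reproducible from the released per-model scores. The snippet below is a hedged sketch of that check with made-up numbers; the column names design_quality and answer_accuracy are assumptions, not the dataset's actual schema.

```python
# Hypothetical check: rank-correlate per-model benchmark-design quality with
# answer-time strength. Values and column names are illustrative only.
import pandas as pd
from scipy.stats import spearmanr

scores = pd.DataFrame({
    "model":           ["m1", "m2", "m3", "m4", "m5"],
    "design_quality":  [0.71, 0.58, 0.66, 0.49, 0.62],  # e.g. share of valid, discriminative items
    "answer_accuracy": [0.82, 0.64, 0.59, 0.55, 0.77],  # e.g. mean graded score as an answerer
})

rho, p_value = spearmanr(scores["design_quality"], scores["answer_accuracy"])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```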

Key Points

  • BenchBench is a three-stage pipeline for benchmarking automated benchmark generation
  • The framework extracts structured domain cards from seed benchmarks and validates items with a multi-model answerer panel, preferring deterministic verifiers and falling back to rubric-guided judging (see the sketch after this list)
  • The study demonstrates that benchmark-design ability is only moderately correlated with answer-time strength
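
For stage (iii), the abstract describes exact/numeric/symbolic verifiers with rubric-guided judging as the fallback. Below is a minimal sketch of such a verifier cascade, assuming sympy for symbolic equivalence; the function names (verify_item, numeric_equal, symbolic_equal) and the rubric_judge callback are hypothetical, not BenchBench's actual interfaces.

```python
# Hypothetical sketch of stage-(iii) validation: try exact, then numeric, then
# symbolic verification, and fall back to a rubric-guided LLM judge only when
# no deterministic check applies.
from typing import Callable, Optional
import sympy

def numeric_equal(pred: str, gold: str, tol: float = 1e-6) -> Optional[bool]:
    """Compare answers as floats within a tolerance; None if not numeric."""
    try:
        return abs(float(pred) - float(gold)) <= tol
    except ValueError:
        return None

def symbolic_equal(pred: str, gold: str) -> Optional[bool]:
    """Compare answers as symbolic expressions; None if they do not parse."""
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        return None

def verify_item(pred: str, gold: str,
                rubric_judge: Callable[[str, str], bool]) -> bool:
    """Deterministic verifiers first; rubric-guided judging as the fallback."""
    if pred.strip() == gold.strip():           # exact match
        return True
    for checker in (numeric_equal, symbolic_equal):
        result = checker(pred, gold)
        if result is not None:                 # verifier applied and decided
            return result
    return rubric_judge(pred, gold)            # open-ended: LLM judge with a rubric

# Example with a trivially strict stand-in judge: resolved by the symbolic check.
print(verify_item("1/2 + 1/2", "1", rubric_judge=lambda p, g: False))  # True
```

Preferring deterministic checks keeps grading cheap and reproducible, and confines judge-induced bias to the open-ended items where no verifier applies.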

Merits

Strength in Addressing Benchmark Saturation

BenchBench addresses the limitations of static test sets and provides a scalable solution for evaluating open-ended items, reducing the risk of contamination and the need for costly refreshes.

Demerits

Limited Transferability to Real-World Scenarios

The study's reliance on a controlled environment and a narrow range of tasks may limit the transferability of BenchBench to real-world scenarios, which often involve complex, dynamic, and uncertain contexts.

Expert Commentary

BenchBench is a significant contribution to the field of AI research, highlighting the need for more nuanced and multifaceted evaluations of LLMs. By introducing a three-stage pipeline for benchmarking automated benchmark generation, the study provides a valuable framework for assessing how well LLMs can design high-quality benchmarks, not merely answer them. That said, its controlled setting and limited range of tasks leave open how well the approach transfers to real-world evaluation needs, which are often complex, dynamic, and uncertain. As AI systems become increasingly ubiquitous, fair and unbiased benchmarking processes such as those proposed by BenchBench are essential for accountability and transparency in AI decision-making.

Recommendations

  • Further research is needed to explore the applicability of BenchBench to a broader range of tasks and domains.
  • More diverse and representative seed datasets should be developed to ensure fairness and mitigate bias when LLMs generate benchmarks.

Sources

Original: arXiv - cs.CL