
Are Large Language Models Truly Smarter Than Humans?

Eshwar Reddy M, Sourav Karmakar

Abstract (arXiv:2603.16197v1): Public leaderboards increasingly suggest that large language models (LLMs) surpass human experts on benchmarks spanning academic knowledge, law, and programming. Yet most benchmarks are fully public, their questions widely mirrored across the internet, creating systematic risk that models were trained on the very data used to evaluate them. This paper presents three complementary experiments forming a rigorous multi-method contamination audit of six frontier LLMs: GPT-4o, GPT-4o-mini, DeepSeek-R1, DeepSeek-V3, Llama-3.3-70B, and Qwen3-235B. Experiment 1 applies a lexical contamination detection pipeline to 513 MMLU questions across all 57 subjects, finding an overall contamination rate of 13.8% (18.1% in STEM, up to 66.7% in Philosophy) and estimated performance gains of +0.030 to +0.054 accuracy points by category. Experiment 2 applies a paraphrase and indirect-reference diagnostic to 100 MMLU questions, finding accuracy drops by an average of 7.0 percentage points under indirect reference, rising to 19.8 pp in both Law and Ethics. Experiment 3 applies TS-Guessing behavioral probes to all 513 questions and all six models, finding that 72.5% trigger memorization signals far above chance, with DeepSeek-R1 displaying a distributed memorization signature (76.6% partial reconstruction, 0% verbatim recall) that explains its anomalous Experiment 2 profile. All three experiments converge on the same contamination ranking: STEM > Professional > Social Sciences > Humanities.
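
To make the lexical audit concrete, here is a minimal sketch of the generic idea behind lexical contamination detection: flag a benchmark question when a large share of its word n-grams also appears in an indexed training corpus. This is an illustration, not the authors' pipeline; the n-gram length, the 0.5 threshold, and the `corpus_ngrams` index are all assumptions.

```python
# Illustrative sketch only; the paper's exact lexical pipeline is not shown here.
# Idea: a question is suspect when many of its word n-grams also occur in the
# training corpus. `corpus_ngrams` is a placeholder index, not a real artifact.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text`, case-folded."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question: str, corpus_ngrams: set[tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus index."""
    q = ngrams(question, n)
    return len(q & corpus_ngrams) / len(q) if q else 0.0

def is_contaminated(question: str, corpus_ngrams: set[tuple[str, ...]], threshold: float = 0.5) -> bool:
    """Flag the question if more than `threshold` of its n-grams are indexed."""
    return overlap_ratio(question, corpus_ngrams) > threshold
```

At web scale the corpus index would live in a hashed or Bloom-filter structure rather than an in-memory set, but the flagging logic stays the same.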

Executive Summary

This article examines how LLM performance compares with that of human experts and argues that headline benchmark scores may be inflated by training data contamination. Across three complementary experiments on six frontier models, the authors report an overall MMLU contamination rate of 13.8%, an average accuracy drop of 7.0 percentage points when questions are referenced indirectly, and memorization signals on 72.5% of probed questions. All three experiments converge on the same contamination ranking (STEM > Professional > Social Sciences > Humanities), suggesting that the models' apparent superiority partly reflects exposure to evaluation data during training rather than genuine capability.

Key Points

  • LLMs may hold an unfair benchmark advantage because public evaluation questions leaked into their training data
  • Contamination varies widely by subject: STEM is the highest-ranked category (18.1%), while Philosophy shows the largest single-subject rate (66.7%)
  • TS-Guessing probes trigger memorization signals on 72.5% of questions, suggesting recall of seen items rather than genuine understanding (see the probe sketch after this list)
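
The sketch below illustrates a TS-Guessing-style probe of the kind Experiment 3 relies on. It is a minimal sketch under stated assumptions: `query_model(prompt) -> str` is a hypothetical wrapper around the model under test, and the prompt wording is illustrative, not the paper's. One answer option is masked and the model is asked to reproduce it; recovering hidden text far above chance is a memorization signal, and the verbatim/partial distinction mirrors the abstract's DeepSeek-R1 profile (76.6% partial reconstruction, 0% verbatim recall).

```python
# Minimal TS-Guessing-style probe. `query_model` is an assumed callable
# (prompt -> completion string) for whichever LLM is being audited.

def ts_guess_probe(question: str, options: list[str], hidden_idx: int, query_model) -> dict:
    """Mask one option, ask the model to reconstruct it, and score the guess."""
    shown = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options) if i != hidden_idx]
    prompt = (
        "One option of this multiple-choice question is masked as [MASK].\n"
        f"Question: {question}\n" + "\n".join(shown) +
        f"\n{chr(65 + hidden_idx)}. [MASK]\n"
        "Reply with the exact text of the masked option only."
    )
    guess = query_model(prompt).strip().lower()
    target = options[hidden_idx].strip().lower()
    return {
        "verbatim": guess == target,  # exact recall of the hidden option
        "partial": bool(guess) and guess != target and (guess in target or target in guess),
    }
```

A model with no exposure to the test set should rarely reproduce a masked option exactly, since option text is largely arbitrary; systematically correct guesses are hard to explain without memorization.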

Merits

Rigorous Methodology

The study triangulates three complementary methods: lexical overlap detection, a paraphrase and indirect-reference diagnostic, and TS-Guessing behavioral probes. Because all three converge on the same contamination ranking, the headline conclusion does not hinge on any single detection technique.

Demerits

Limited Generalizability

The audit covers six models and a 513-question MMLU sample, so the findings may not transfer to other models, benchmarks, or domains; contamination patterns on private or freshly authored test sets could look quite different.

Expert Commentary

The study's findings carry significant implications for AI evaluation: they underscore the need for contamination-resistant benchmarks and transparent reporting of training data. The memorization signals raise the concern that high scores reflect recall of previously seen test items rather than genuine understanding. As LLMs become increasingly pervasive, addressing these issues is essential to ensure they are evaluated and deployed in a fair and transparent manner.

Recommendations

  • Develop and adopt evaluation methods that resist training data contamination, such as held-out, regularly refreshed, or paraphrased test sets (a minimal sketch follows this list)
  • Establish guidelines for transparent disclosure of training data and evaluation practices across the AI industry
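
As a complement to these recommendations, here is a minimal sketch of a paraphrase-robustness check, loosely modeled on the paper's Experiment 2: score a model on original questions and on paraphrases, and report the accuracy drop. `answer` and `paraphrase` are assumed helper callables (e.g., the model's own answer function and a second-model rewriter), not any real API.

```python
# Minimal paraphrase-sensitivity harness. `answer(question) -> str` and
# `paraphrase(question) -> str` are assumed helpers, not a specific library.

def paraphrase_sensitivity(items: list[tuple[str, str]], answer, paraphrase) -> float:
    """Accuracy drop in percentage points when questions are paraphrased.

    `items` holds (question, gold_answer) pairs; a large drop suggests the
    original score leaned on surface-form memorization.
    """
    orig = sum(answer(q) == gold for q, gold in items)
    para = sum(answer(paraphrase(q)) == gold for q, gold in items)
    return 100.0 * (orig - para) / len(items)
```

A drop near the paper's reported average of 7.0 percentage points (or the 19.8 pp observed in Law and Ethics) would be a red flag that the benchmark result does not reflect transferable understanding.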

Sources

  • arXiv:2603.16197v1: https://arxiv.org/abs/2603.16197