
Introducing the Evaluations & Datasets Track at NeurIPS 2026


March 23, 2026 · Communication Chairs

We are excited to announce that the Datasets & Benchmarks Track at NeurIPS 2026 has been officially renamed the Evaluations & Datasets (ED) Track. While the track continues to align with the main conference (see the call for papers) in terms of requirements and timeline, this year we introduce an important refinement: an expanded scope that explicitly positions evaluation as a scientific object of study. The goal is to improve clarity and rigor in how we assess progress in the field. This post accompanies the ED Track's Call for Papers and outlines the rationale behind this update.

From "Datasets & Benchmarks" to "Evaluations & Datasets"

The Datasets & Benchmarks Track has become a leading venue for work on dataset creation and benchmarking. As the field matures, it has become increasingly clear that evaluation plays a central role in shaping scientific conclusions. What we measure, under which assumptions, and how we interpret results often determine conclusions as much as model architecture does. Datasets and benchmarks are tightly connected to this process: they define what can be measured, compared, and ultimately claimed. Ongoing concerns around reproducibility and comparability have further highlighted that disagreement frequently stems from differences in evaluation design, assumptions, and reporting practices. As models become more standardized and widely available, evaluation methodology more frequently determines the conclusions we draw.

For NeurIPS 2026, we refine the track to reflect this evolution: evaluation itself becomes an object of scientific study. The new name places evaluation first, while datasets remain central and receive renewed focus within broader evaluative practices rather than as endpoints in themselves. This shift broadens the scope of the track, allowing works that were previously submitted to the main track to be more clearly aligned with the ED Track's reframed focus.

Defining evaluation as a scientific object

We explicitly define evaluation as:

Processes, practices, tools, and resources for making evaluative claims about AI/ML systems, including, but not limited to, datasets, benchmarks, user studies, simulators, auditing, red-teaming methods, interaction protocols, metrics, and experimental or qualitative study designs.

This definition also includes datasets and related resources when they are intended for use in any part of the AI/ML lifecycle (e.g., training, fine-tuning, or testing) and are accompanied by a clear description of their scope, underlying assumptions, and an explanation of how models or systems studied using them are expected to be meaningfully assessed.
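To make this concrete, the sketch below shows one way a submission might record scope, assumptions, and intended evaluative claims alongside a dataset. This is a hypothetical illustration only: the call does not prescribe any format, and every field name and value here is invented.

```python
# Hypothetical sketch of a machine-readable "evaluation card" accompanying a
# dataset submission. The ED Track call does not prescribe any schema; every
# field name and value below is invented for illustration.

evaluation_card = {
    # What the resource is and where in the lifecycle it may be used.
    "resource": "toy-dialogue-safety-v1",            # hypothetical dataset
    "lifecycle_stages": ["fine-tuning", "testing"],
    # The evaluative claims the dataset is intended to support.
    "intended_claims": [
        "Models fine-tuned on the training split refuse unsafe requests "
        "more often, as measured on the held-out test split."
    ],
    # Assumptions under which those claims are valid.
    "assumptions": [
        "Annotator majority vote approximates the intended notion of 'unsafe'.",
        "English-only, single-turn prompts represent the deployment domain.",
    ],
    # Limitations that constrain the claims.
    "limitations": [
        "Results may not transfer to multilingual or multi-turn settings.",
        "Scores above inter-annotator agreement are not meaningful.",
    ],
    # How models studied with the dataset are expected to be assessed.
    "evaluation_protocol": {
        "metric": "refusal rate on labeled unsafe prompts",
        "comparison": "same base model with vs. without fine-tuning",
    },
}
```

Whether such a record lives in a README, a datasheet, or structured metadata matters less than that the claims, assumptions, and limitations are stated explicitly.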
By making this definition explicit, we clarify the motivation behind the track's new name. Evaluation is not merely a procedural step, but a scientific object in its own right: one that can be studied, stress-tested, reproduced, audited, and improved. It shapes how we interpret performance, robustness, fairness, and reliability, and ultimately determines which conclusions and comparisons are justified. Placing "evaluations" first reflects this central role.

What This Means for Submissions

Under this reframing:

  • Contributions extend beyond introducing new datasets or benchmarks. Works whose primary contribution is the analysis, critique, redesign, or stress-testing of evaluation practices or methodologies are fully in scope. Advancing evaluation science is itself a central goal of the track.
  • Datasets remain central. They may be used at any stage of the AI/ML lifecycle, including training, fine-tuning, auditing, or testing. Submissions introducing datasets are welcome, including those whose primary contribution is enabling new capabilities, applications, or problem settings. To support meaningful use and comparison, authors should clearly explain what claims the dataset is intended to support (e.g., improved model performance, fairness, robustness, safety, or other model characteristics), under what assumptions those claims are valid, and what limitations constrain them. The key requirement is not the stage at which a dataset is used, but that its relationship to evaluative claims is clearly articulated: authors should clarify how models developed or studied using the dataset are intended to be assessed and what kinds of claims such assessments can support. For instance, if a dataset enables the creation of better models, the submission should explain how those models should be meaningfully evaluated (e.g., how improvements are to be assessed and what constitutes meaningful evidence of progress). Papers that present data without clarifying the intended problem formulation, evaluation setup, or interpretive boundaries are unlikely to provide sufficient context for review and may therefore not fall within the intended scope of the track. For example, simply releasing a large collection of data without specifying the problem setting it addresses, how it shapes evaluation, or what kinds of claims it is meant to enable would not provide a sufficient basis for review under the track's criteria.
  • Submissions need not introduce a new model or outperform prior work. Rather than results showing that one model outperforms another, we seek contributions that advance our understanding of model performance and how evaluative claims are constructed, supported, and interpreted. Negative results, critical analyses, and use-case-inspired evaluations are welcome. A submission need not "beat a baseline"; its primary contribution should be to deepen and refine our understanding of evaluation practices (a toy illustration of how evaluation design alone can flip a model ranking follows this list).
  • Benchmarks remain in scope. They are a subset of evaluations: standardized, reusable setups that serve as shared comparison points. Removing "benchmarks" from the title does not signal exclusion; rather, it reflects that benchmarks are one form of evaluation. This framing allows the track to welcome both the development of new benchmarks (when scientifically justified) and rigorous analyses of when and why existing benchmarks fail.
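As promised above, here is a toy illustration of why evaluation design itself deserves study: two defensible metric choices rank the same pair of models differently on a class-imbalanced test set. The models, data, and numbers are fabricated for illustration; only the metric definitions are standard.

```python
# Toy illustration (not from the call): two plausible evaluation designs
# rank the same pair of models differently on an imbalanced task.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# 90 negatives followed by 10 positives: a class-imbalanced test set.
y_true = [0] * 90 + [1] * 10

# Model A ignores the minority class entirely.
pred_a = [0] * 100
# Model B recovers 8/10 positives at the cost of 15 false positives.
pred_b = [1] * 15 + [0] * 75 + [1] * 8 + [0] * 2

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, f"accuracy={accuracy(y_true, pred):.3f}",
          f"macro_f1={macro_f1(y_true, pred):.3f}")
# A: accuracy=0.900 macro_f1=0.474  -> A "wins" under accuracy
# B: accuracy=0.830 macro_f1=0.691  -> B "wins" under macro-F1
```

Neither ranking is wrong; each follows from different assumptions about what matters. Making such dependencies explicit, and analyzing their consequences, is exactly the kind of contribution the track invites.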
In-Scope Submissions

Under the revised scope of the track, we welcome work that:

  • Analyzes strengths, limitations, or failure modes of existing benchmarks or evaluation practices
  • Studies benchmark saturation or overfitting and their impact on scientific conclusions
  • Compares evaluation designs and demonstrates how different assumptions lead to different results or conclusions
  • Provides rigorous reproduction, auditing, and stress-testing of prior evaluations
  • Develops documentation methodologies (e.g., Data Cards, Model Cards, evaluation cards) that improve how evaluative claims are made, interpreted, and compared
  • Refines existing evaluation setups
  • Proposes new evaluation protocols, practices, or methodologies
  • Conducts human- or interaction-centered evaluations (e.g., user studies, red-teaming)
  • Introduces datasets and clearly explains their scope, assumptions, limitations, and how they are intended to support or shape evaluative claims across the AI/ML lifecycle
  • Contributes tools, analyses, or frameworks that improve how evaluative claims are constructed or interpreted
  • Presents negative results, critical analyses, and use-case-inspired evaluations

Moreover, data-centric and benchmarking submissions historically welcomed by the track remain fully in scope. These include, but are not limited to:

  • New datasets and dataset collections
  • Data generators and reinforcement learning environments
  • Data-centric AI methods and tools
  • Advanced data collection and curation practices
  • Responsible dataset development frameworks
  • Audits of existing datasets
  • Benchmarks on new or existing datasets, benchmarking tools, and methodologies
  • Systematic analyses of systems on novel datasets
  • In-depth analyses of machine learning challenges and competitions (by organisers and/or participants) that yield important new insights
  • Competition papers from prior NeurIPS competitions

For 2026, we encourage all submissions, including those in the categories above, to clearly explain their contribution in light of the revised scope.

We welcome your feedback

As the Evaluations & Datasets Track continues to evolve, we welcome input from the community. Our goal is to improve clarity and rigor, and we value perspectives on how to best achieve that. Please feel free to reach out to the ED Track Chairs at evaluationsdatasets@neurips.cc with feedback or suggestions.

ED Track chairs: Konstantina Palla, Jessica Schrouff, Alexandre Drouin, Lijun Wu, Joaquin Vanschoren

Executive Summary

The NeurIPS 2026 Evaluations & Datasets (ED) Track has undergone a significant refinement, expanding its scope to position evaluation as a scientific object of study. This shift aims to improve clarity and rigor in assessing progress in the field. The ED Track now places evaluation first, with datasets remaining central and receiving renewed focus within broader evaluative practices. The change also broadens the track's scope, so that work previously directed to the main track can align more clearly with the ED Track's reframed focus. The update acknowledges the increasing importance of evaluation methodology in shaping scientific conclusions and highlights the need for a more refined approach to assessing AI/ML systems.

Key Points

  • The ED Track has been renamed to place evaluation first, with datasets remaining central.
  • The track's scope has been expanded to position evaluation as a scientific object of study.
  • Evaluation methodology is increasingly important in shaping scientific conclusions.

Merits

Strength

The update acknowledges the increasing importance of evaluation methodology, promoting a more refined approach to assessing AI/ML systems.

Strength

The expanded scope of the ED Track gives work that was previously directed to the main track a clearer home under the track's reframed focus.

Demerits

Limitation

The updated definition of evaluation may be too broad, potentially leading to confusion and challenges in distinguishing between evaluation and other aspects of the AI/ML lifecycle.

Expert Commentary

The ED Track's update is a significant refinement that acknowledges the increasing importance of evaluation methodology in shaping scientific conclusions in AI/ML research. By positioning evaluation as a scientific object of study, the track promotes a more rigorous approach to assessing AI/ML systems. However, the breadth of the new definition could blur the line between evaluation and other parts of the AI/ML lifecycle, a tension submitters and reviewers will need to navigate. Nevertheless, the update is a step in the right direction, and it is likely to promote better evaluation practices and more robust conclusions in AI/ML research.

Recommendations

  • Researchers and practitioners should familiarize themselves with the updated ED Track's definition of evaluation and ensure that their submission aligns with the reframed focus.
  • Funding agencies and research institutions should revisit their evaluation criteria and methodologies to ensure alignment with the ED Track's reframed focus.

Sources

Original: NeurIPS
