
Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments


Xiaoyi Li

arXiv:2603.15916v1

Abstract: When LLM agents autonomously design ML experiments, do they perform genuine architecture search -- or do they default to hyperparameter tuning within a narrow region of the design space? We answer this question by analyzing 10,469 experiments executed by two LLM agents (Claude Opus and Gemini 2.5 Pro) across a combinatorial configuration space of 108,000 discrete cells for dashcam collision detection over 27 days. Through ANOVA decomposition, we find that architectural choices explain 94% of performance variance (F = 1324, η² = 0.94), while hyperparameter variation within a fixed architecture explains only 6%. Cross-task validation on a second collision dataset confirms this finding (75% architecture-explained variance) with a different winning backbone, confirming genuine architecture discovery. The agents' key contribution is discovering that V-JEPA 2 video features with Zipformer temporal encoders achieve 0.9245 AP -- a configuration no human proposed -- and concentrating search on productive architectural regions: at N = 50, LLM-guided search reaches AP = 0.985 versus 0.965 for from-scratch random search. Post-bugfix convergence follows a power law (c = 0.11, R² = 0.93); the low exponent reflects the cost of broad exploration, not inefficiency, since the LLM discovers qualitatively better regions than random or Bayesian baselines. We characterize multi-agent search dynamics via entropy cycles and Jensen-Shannon specialization, providing the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
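The power-law convergence reported in the abstract can be illustrated in miniature: fit the exponent of a best-so-far regret curve by linear regression in log-log space. The curve shape, noise level, and constants below are synthetic, invented for illustration; only the target exponent c = 0.11 comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic best-so-far curve: regret (AP_max - AP_best(N)) decays as N^(-c),
# with multiplicative noise. c = 0.11 mirrors the paper's reported exponent.
N = np.arange(1, 501)
true_c = 0.11
regret = 0.2 * N ** (-true_c) * np.exp(rng.normal(0, 0.02, N.size))

# Fit the exponent by linear regression in log-log space:
#   log(regret) = log(a) - c * log(N)
slope, intercept = np.polyfit(np.log(N), np.log(regret), 1)
c_hat = -slope
print(f"fitted exponent c ~= {c_hat:.3f}")
```

A small exponent like this means each doubling of the experiment budget shaves off only a modest fraction of the remaining regret, which is consistent with the paper's reading that the agents spend budget on broad exploration rather than rapid local exploitation.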

Executive Summary

This study analyzes the convergence behavior of large language model (LLM) agents that autonomously design machine learning (ML) experiments. Across 10,469 trials in a 108,000-cell combinatorial design space for dashcam collision detection, the authors show that the agents perform genuine architecture search rather than mere hyperparameter tuning: architectural choices account for 94% of performance variance, while hyperparameter variation within a fixed architecture contributes only 6%. Cross-task validation on a second collision dataset reproduces the pattern with a different winning backbone, and the agents surface a configuration no human had proposed, V-JEPA 2 video features paired with Zipformer temporal encoders, reaching 0.9245 AP. Together with the analysis of multi-agent search dynamics via entropy cycles and Jensen-Shannon specialization, the work provides the first large-scale empirical framework for LLM-guided combinatorial ML experiment design.
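The 94%/6% variance split rests on a one-way ANOVA-style decomposition: treat each architecture as a group, and compare the between-group sum of squares to the total. The sketch below shows the computation on synthetic scores; the group means, run counts, and noise level are invented for illustration, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 "architectures", 50 runs each. Architecture shifts the mean
# (large between-group effect); hyperparameters add within-group noise.
arch_means = np.array([0.70, 0.80, 0.88, 0.92])
scores = np.concatenate([m + rng.normal(0, 0.01, 50) for m in arch_means])
groups = np.repeat(np.arange(4), 50)

grand = scores.mean()
# Between-group (architecture) sum of squares vs. total sum of squares
ss_between = sum(50 * (scores[groups == g].mean() - grand) ** 2 for g in range(4))
ss_total = ((scores - grand) ** 2).sum()

eta_sq = ss_between / ss_total            # variance explained by architecture
ms_between = ss_between / (4 - 1)         # mean squares for the F statistic
ms_within = (ss_total - ss_between) / (scores.size - 4)
F = ms_between / ms_within
print(f"eta^2 = {eta_sq:.3f}, F = {F:.0f}")
```

When the group means are far apart relative to the within-group noise, as here, η² approaches 1 and F is large; this is the same signature the paper reports (η² = 0.94, F = 1324) at full scale.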

Key Points

  • LLM agents primarily perform architecture search rather than hyperparameter tuning
  • Architectural choices account for 94% of performance variance; hyperparameter variation within a fixed architecture contributes only 6%
  • The agents discovered a configuration no human proposed, V-JEPA 2 video features with Zipformer temporal encoders, reaching 0.9245 AP
  • LLM-guided search finds qualitatively better architectural regions than random or Bayesian baselines (AP = 0.985 vs. 0.965 at N = 50)
  • The study provides the first large-scale empirical framework for LLM-guided combinatorial ML experiment design

Merits

Strength in methodology

The study employs a robust experimental design: 10,469 trials executed over 27 days across a 108,000-cell combinatorial configuration space, with cross-task validation on a second dataset, providing an unusually comprehensive empirical view of LLM-guided ML experiment design.

Advancements in theory

The study contributes to the development of a new framework for understanding LLM-guided combinatorial ML experiment design, including the analysis of multi-agent search dynamics and the characterization of entropy cycles.
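Jensen-Shannon specialization, used to characterize the multi-agent dynamics, measures how far apart two agents' choice distributions have drifted: 0 when they sample the design space identically, up to 1 bit when they specialize in disjoint regions. A self-contained sketch (the two example distributions are hypothetical, not the paper's measurements):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative: two agents' empirical distributions over 5 backbone choices.
agent_a = [0.60, 0.20, 0.10, 0.05, 0.05]  # concentrated on backbone 0
agent_b = [0.05, 0.05, 0.10, 0.20, 0.60]  # concentrated on backbone 4
print(f"JSD(identical)   = {js_divergence(agent_a, agent_a):.3f}")
print(f"JSD(specialized) = {js_divergence(agent_a, agent_b):.3f}")
```

Tracking this quantity over time, alongside the entropy of each agent's own distribution, is one plausible way to operationalize the "entropy cycles" and specialization the paper describes.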

Practical implications

The finding that architectural choices dominate performance suggests that automated search budgets are better spent exploring architectures than tuning hyperparameters within one, a directly actionable guideline for AutoML and agent-driven experiment design.

Demerits

Limited generalizability

Both evaluation tasks are dashcam collision detection, so the findings may not generalize to other domains or model families; further validation in substantially different contexts is needed.

Computational resource-intensive

Reproducing the study's methodology requires substantial compute: more than 10,000 training runs over 27 days, which may put this style of LLM-guided search out of reach for smaller teams.

Expert Commentary

While the study's findings are compelling, its limitations deserve attention: both evaluation tasks are collision detection, and the scale of the search (over 10,000 runs across 27 days) is beyond most practitioners' budgets. Nevertheless, the study contributes significantly to our understanding of LLM-guided ML experiment design. Future research should validate the findings on unrelated tasks and investigate whether LLM agents retain their advantage over random and Bayesian baselines under tighter compute budgets.

Recommendations

  • Validate the findings on tasks beyond collision detection to establish generalizability.
  • Prioritize more compute-efficient LLM agents, reducing the number of runs needed to locate strong architectural regions.
