
Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training


William Hoy, Binxu Wang, Xu Pan

arXiv:2604.01499v1 Announce Type: new Abstract: Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement learning based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough progress on the task to match gradient-based RL in downstream accuracy. These results show that gradient-free and gradient-based fine-tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: https://github.com/Bhoy1/ESvsGRPO.

Executive Summary

This paper conducts a rigorous comparison between Evolution Strategies (ES) and Group Relative Policy Optimization (GRPO) for fine-tuning large language models (LLMs), revealing that despite achieving comparable task performance, the methods produce geometrically distinct parameter updates. Through empirical and theoretical analysis, the authors demonstrate that ES induces broader off-task KL drift and larger parameter changes, while GRPO makes smaller, more localized updates. Notably, the solutions from both methods are linearly connected with no loss barrier, indicating they occupy a shared low-loss region even though their update directions are nearly orthogonal. The study also explores sequential continual-learning settings, showing ES remains competitive when iteration budgets are controlled. These findings challenge the assumption that gradient-free and gradient-based optimization reach equivalent solutions in LLM fine-tuning, with significant implications for model stability, knowledge preservation, and the interpretability of optimization landscapes.
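The paper's exact ES configuration is not reproduced here, but the gradient-free update at the heart of the comparison can be illustrated with a minimal sketch of vanilla ES using antithetic (mirrored) sampling. The hyperparameters (`sigma`, `lr`, `pop`) and the toy reward are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.01, lr=0.05, pop=8, rng=None):
    """One vanilla ES update with antithetic sampling.

    theta: flat parameter vector; reward_fn: maps parameters to a scalar
    reward. The estimator averages reward-weighted noise directions, so it
    never backpropagates through reward_fn -- the sense in which ES is
    "gradient-free".
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal((pop, theta.size))
    grad = np.zeros_like(theta)
    for e in eps:
        r_plus = reward_fn(theta + sigma * e)   # perturb in +direction
        r_minus = reward_fn(theta - sigma * e)  # mirrored perturbation
        grad += (r_plus - r_minus) * e
    grad /= (2 * sigma * pop)
    return theta + lr * grad

# Toy check: maximizing -||theta||^2 should shrink the parameter norm.
theta0 = np.ones(4)
theta1 = es_step(theta0, lambda t: -float(t @ t))
```

Because every perturbation direction contributes to the estimate regardless of how informative it is, a sketch like this also hints at the paper's explanation for large off-task movement along weakly informative directions.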

Key Points

  • ES and GRPO achieve comparable task performance despite fundamentally different update geometries, with ES making larger, broader changes and GRPO making localized, smaller updates.
  • The solutions produced by ES and GRPO are linearly connected without loss barriers, despite their update directions being nearly orthogonal, suggesting the two optima share a low-loss basin rather than a common update direction.
  • In sequential continual-learning settings, ES remains competitive with GRPO when iteration budgets are controlled, but induces greater off-task KL drift.
  • An analytical theory of ES is developed to explain the accumulation of off-task movement in weakly informative directions while maintaining task progress.
  • The study highlights the importance of considering parameter-space behavior, not just task performance, in evaluating fine-tuning methods for LLMs.
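The two geometric diagnostics named in the key points, linear connectivity and near-orthogonality of updates, are straightforward to compute given flattened parameter vectors. A minimal sketch (the loss function and parameter vectors here are hypothetical stand-ins for real model checkpoints):

```python
import numpy as np

def interp_losses(theta_a, theta_b, loss_fn, n=11):
    """Loss along the straight line between two solutions.

    If the profile never rises above the endpoint losses, the solutions
    are linearly connected with no loss barrier.
    """
    alphas = np.linspace(0.0, 1.0, n)
    return [float(loss_fn((1 - a) * theta_a + a * theta_b)) for a in alphas]

def update_cosine(theta0, theta_a, theta_b):
    """Cosine similarity between the two update directions measured from
    the shared initialization; a value near zero means the updates are
    nearly orthogonal."""
    u, v = theta_a - theta0, theta_b - theta0
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

On real checkpoints one would flatten and concatenate all weight tensors into `theta_a` / `theta_b` and evaluate `loss_fn` on a held-out batch at each interpolation point.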

Merits

Rigorous Empirical and Theoretical Analysis

The paper combines extensive empirical comparisons across four tasks with a unified analytical theory that explains the observed phenomena, providing a robust framework for understanding ES in LLM fine-tuning.
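One of the empirical quantities central to this comparison, off-task KL drift, can be measured by comparing the base and fine-tuned models' token distributions on off-task prompts. The paper's exact protocol is not reproduced here; this is a minimal sketch operating on aligned per-token logit arrays, which is one common way such a measurement is set up:

```python
import numpy as np

def mean_token_kl(logits_base, logits_tuned):
    """Mean per-token KL(base || tuned) from two aligned logit arrays of
    shape (tokens, vocab) -- a simple proxy for how far a fine-tuned
    model has drifted from the base model on off-task text."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)  # numerical stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    lp, lq = log_softmax(logits_base), log_softmax(logits_tuned)
    p = np.exp(lp)
    return float((p * (lp - lq)).sum(axis=-1).mean())
```

Averaging this quantity over a corpus of prompts unrelated to the fine-tuning task gives a single drift score per method; under the paper's findings, ES would be expected to score higher than GRPO on such a probe.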

Novel Geometric Insights

The discovery of linear connectivity between ES and GRPO solutions, despite nearly orthogonal update directions, offers genuinely new insight into the optimization landscapes of LLM fine-tuning.

Practical Relevance

The findings have direct implications for model stability, forgetting, and knowledge preservation, which are critical concerns in the deployment of LLMs.

Open-Source Contribution

The provision of publicly available source code (https://github.com/Bhoy1/ESvsGRPO) enhances reproducibility and facilitates further research in the field.

Demerits

Limited Task Scope

The study focuses on four specific tasks, which may not fully capture the diversity of real-world LLM applications or edge cases where performance differences could emerge.

Iteration Budget Sensitivity

The competitive performance of ES in sequential continual-learning settings is contingent on controlled iteration budgets, which may not always be feasible in practical scenarios.

Theoretical Generalizability

While the analytical theory explains the observed phenomena, its generalizability to other gradient-free methods or more complex LLM architectures remains to be tested.

Expert Commentary

This paper represents a significant advancement in the comparative analysis of gradient-free and gradient-based fine-tuning methods for LLMs. The authors' empirical and theoretical contributions challenge the conventional wisdom that comparable task performance implies equivalent solutions in parameter space. The revelation that ES and GRPO solutions are linearly connected despite orthogonal update directions is particularly striking and suggests that the optimization landscapes of LLMs may harbor hidden geometric structures that are not yet fully understood. This work also raises important questions about the trade-offs between task performance and parameter-space behavior, particularly in the context of continual learning. For practitioners, the study underscores the need to look beyond accuracy metrics and consider the long-term implications of fine-tuning choices on model stability and knowledge preservation. The analytical theory of ES provides a valuable framework for understanding these phenomena, though further research is needed to test its generalizability to other optimization methods and LLM architectures. Overall, this paper is a must-read for researchers and practitioners in the field of LLM fine-tuning and optimization.

Recommendations

  • Conduct further studies to validate the linear connectivity hypothesis across a broader range of fine-tuning methods and LLM architectures, including exploration of the underlying causes of this phenomenon.
  • Develop standardized metrics for evaluating parameter-space behavior in LLM fine-tuning to complement traditional task performance metrics, ensuring a more holistic assessment of optimization methods.

Sources

Original: arXiv - cs.LG