Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization

arXiv:2603.12596v1 Announce Type: new Abstract: Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we can use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but waste grows with additional epochs, creating an optimization-depth dilemma. We propose Consensus Aggregation for Policy Optimization (CAPO), which redirects compute from depth to width: $K$ PPO replicates are optimized on the same batch, differing only in minibatch shuffling order, and then aggregated into a consensus. We study aggregation in two spaces: Euclidean parameter space, and the natural parameter space of the policy distribution via the logarithmic opinion pool. In natural parameter space, the consensus provably achieves higher KL-penalized surrogate and tighter trust region compliance than the mean expert; parameter averaging inherits these guarantees approximately. On continuous control tasks, CAPO outperforms PPO and compute-matched deeper baselines under fixed sample budgets by up to 8.6x. CAPO demonstrates that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions.

Executive Summary

This study proposes Consensus Aggregation for Policy Optimization (CAPO), a novel approach to policy optimization in reinforcement learning. By redirecting computation from depth to width, CAPO optimizes multiple proximal policy optimization (PPO) replicates on the same batch and aggregates them into a consensus that achieves a higher KL-penalized surrogate and tighter trust region compliance. The study demonstrates CAPO's efficacy on continuous control tasks, where it outperforms both standard PPO and compute-matched deeper baselines under fixed sample budgets. The findings suggest that policy optimization can be improved by optimizing wider, rather than deeper, without additional environment interactions. The study's contributions include a sharper characterization of the optimization-depth dilemma and a computationally efficient consensus-based method for policy optimization.
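The decomposition underlying the optimization-depth dilemma can be made concrete. A minimal numpy sketch, with a randomly generated SPD matrix standing in for the true Fisher information (function and variable names are illustrative, not from the paper): the accumulated update is split into its Fisher-projection onto the natural gradient direction (signal) and the Fisher-orthogonal residual (waste).

```python
import numpy as np

def signal_waste_decomposition(delta, grad, fisher):
    """Split an update `delta` into signal (projection onto the natural
    gradient under the Fisher inner product <u, v>_F = u^T F v) and
    waste (the Fisher-orthogonal residual)."""
    nat_grad = np.linalg.solve(fisher, grad)  # natural gradient F^{-1} g
    coef = (delta @ fisher @ nat_grad) / (nat_grad @ fisher @ nat_grad)
    signal = coef * nat_grad
    waste = delta - signal                    # Fisher-orthogonal to nat_grad
    return signal, waste

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
F = A @ A.T + 4 * np.eye(4)   # SPD stand-in for the Fisher information
g = rng.normal(size=4)        # Euclidean policy gradient
d = rng.normal(size=4)        # accumulated multi-epoch update
s, w = signal_waste_decomposition(d, g, F)
```

By construction the two parts recompose the update exactly, and the waste component satisfies the Fisher-orthogonality condition w^T F (F^{-1} g) = 0, so it consumes trust region budget without first-order surrogate improvement.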

Key Points

  • CAPO redirects computation from depth to width, aggregating multiple PPO replicates
  • CAPO achieves higher KL-penalized surrogate and tighter trust region compliance
  • CAPO outperforms PPO and compute-matched deeper baselines on continuous control tasks
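The replicate-and-aggregate structure from the key points above can be sketched on a toy convex problem (minibatch SGD on a fixed least-squares batch standing in for clipped PPO epochs; all names are illustrative). Each replicate starts from the same parameters and sees the same data, differing only in minibatch shuffling order, and the consensus is a Euclidean parameter-space average.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=64)

def replicate(theta0, order_seed, epochs=4, lr=0.05, mb=8):
    """One 'expert': several epochs of minibatch SGD on the same fixed
    batch, differing from its siblings only in shuffling order."""
    theta = theta0.copy()
    shuffler = np.random.default_rng(order_seed)
    for _ in range(epochs):
        idx = shuffler.permutation(len(X))
        for start in range(0, len(X), mb):
            b = idx[start:start + mb]
            theta -= lr * 2 * X[b].T @ (X[b] @ theta - y[b]) / mb
    return theta

theta0 = np.zeros(3)
K = 8
replicates = [replicate(theta0, seed) for seed in range(K)]
consensus = np.mean(replicates, axis=0)  # Euclidean parameter average
```

On a convex loss, Jensen's inequality guarantees the averaged parameters do no worse than the mean replicate, which mirrors (in a much simpler setting) the paper's mean-expert comparison.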

Merits

Strength in Addressing the Optimization-Depth Dilemma

The study provides a nuanced understanding of the optimization-depth dilemma, highlighting the trade-off between signal (the natural gradient projection) and waste (the Fisher-orthogonal residual) in policy updates: signal saturates while waste grows with additional optimization epochs.

Demerits

Limited Exploration of Aggregation Methods

The study considers only two aggregation schemes, Euclidean parameter averaging and the logarithmic opinion pool in natural parameter space, leaving other aggregation methods unexplored.
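The logarithmic opinion pool mentioned above is easy to illustrate for Gaussian policies, since a Gaussian is an exponential family and the equal-weight pool simply averages natural parameters. A minimal one-dimensional sketch (function name is illustrative):

```python
import numpy as np

def gaussian_lop(mus, sigmas):
    """Equal-weight logarithmic opinion pool of 1-D Gaussian experts:
    average the natural parameters (mu/sigma^2, -1/(2 sigma^2)),
    then map back to (mu, sigma)."""
    mus = np.asarray(mus, float)
    sigmas = np.asarray(sigmas, float)
    eta1 = np.mean(mus / sigmas**2)
    eta2 = np.mean(-0.5 / sigmas**2)
    var = -0.5 / eta2
    return eta1 * var, np.sqrt(var)

mu_c, sigma_c = gaussian_lop([0.0, 2.0], [1.0, 1.0])
print(mu_c, sigma_c)  # 1.0 1.0: equal variances give the average mean
```

For equal variances the consensus is just the precision-weighted (here plain) average of the expert means, which is the setting in which the paper's natural-parameter-space guarantees apply exactly; parameter averaging only inherits them approximately.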

Expert Commentary

The study makes a significant contribution to the field of reinforcement learning by addressing the optimization-depth dilemma and providing a computationally efficient method for policy optimization. The use of consensus aggregation in both Euclidean and natural parameter spaces is a novel and effective approach. However, the limited exploration of alternative aggregation schemes highlights the need for further research in this area. The implications of the findings are far-reaching, with potential applications in areas such as robotics and autonomous systems. As the field of reinforcement learning continues to evolve, CAPO is likely to play an increasingly important role in the development of more efficient and effective algorithms.

Recommendations

  • Future studies should explore the application of CAPO in more complex and diverse reinforcement learning tasks
  • Researchers should investigate the use of alternative aggregation methods and explore the potential benefits of combining CAPO with other policy optimization techniques
