Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
arXiv:2603.23926v1
Abstract: Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the $\gamma$-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form $\tilde{O}( \sqrt{SA\,\text{Var}} + \text{lower-order terms})$, where $S,A$ are the state and action space sizes, and $\text{Var}$ captures cumulative transition variance. This implies minimax-optimal average-reward and $\gamma$-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span $\Vert h^\star\Vert_\text{sp}$, our algorithm obtains lower-order terms scaling as $\Vert h^\star\Vert_\text{sp} S^2 A$, which we prove is optimal in both $\Vert h^\star\Vert_\text{sp}$ and $A$. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than $\Vert h^\star \Vert_\text{sp}^2 S A$, and we provide a prior-free algorithm whose lower-order terms scale as $\Vert h^\star\Vert_\text{sp}^2 S^3 A$, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on $\Vert h^\star\Vert_\text{sp}$ in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.
Executive Summary
This article reviews a new algorithm for online reinforcement learning in infinite-horizon Markov decision processes (MDPs), addressing the shortcomings of existing methods in the average-reward and γ-regret settings. The proposed UCB-style algorithm achieves the first optimal variance-dependent regret guarantees: it matches the minimax lower bounds in the worst case while adapting to benign instances, for example yielding nearly constant regret in deterministic MDPs. The paper also completely characterizes the optimal dependence on the bias span ‖h⋆‖_sp in both the leading and lower-order regret terms, revealing a fundamental gap between what is achievable with and without prior knowledge of the span.
Key Points
- ▸ The authors develop a single tractable UCB-style algorithm applicable to both average-reward and γ-regret settings.
- ▸ The algorithm achieves the first optimal variance-dependent regret guarantees, with bounds of the form Õ(√(SA · Var) + lower-order terms), where Var captures the cumulative transition variance of the instance.
- ▸ With prior knowledge of the optimal bias span ‖h⋆‖_sp, the lower-order terms improve to ‖h⋆‖_sp S²A, shown to be optimal in both ‖h⋆‖_sp and A; without prior knowledge, no algorithm can achieve lower-order terms below ‖h⋆‖²_sp SA, revealing a fundamental gap between the two regimes.
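To make the "UCB-style algorithm with variance-dependent bonuses" concrete, the sketch below shows the generic pattern such algorithms follow: optimistic value iteration on the empirical model, with a Bernstein-style exploration bonus that scales with the variance of the next-state value rather than with a worst-case range. This is an illustrative sketch only, not the paper's algorithm; the bonus constants, the clipping range, and the function names (`bernstein_bonus`, `optimistic_q`) are all assumptions made for the example.

```python
import numpy as np

def bernstein_bonus(p_hat, v, count, delta=0.05):
    """Bernstein-style bonus for one (s, a) pair: a sqrt(Var/n) leading term
    plus a span/n lower-order term.  Constants here are illustrative only."""
    n = max(int(count), 1)
    mean = float(p_hat @ v)
    var = float(p_hat @ (v - mean) ** 2)   # variance of v under p_hat(.|s,a)
    log_term = np.log(1.0 / delta)
    span = float(np.ptp(v)) if v.size else 0.0
    return np.sqrt(var * log_term / n) + span * log_term / n

def optimistic_q(p_hat, r, counts, gamma=0.9, iters=200, delta=0.05):
    """Optimistic (UCB-style) value iteration on the empirical model:
    Q(s,a) = r(s,a) + bonus(s,a) + gamma * p_hat(.|s,a) @ V,  V = max_a Q,
    with V clipped to its valid range [0, 1/(1-gamma)]."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                b = bernstein_bonus(p_hat[s, a], v, counts[s, a], delta)
                q[s, a] = r[s, a] + b + gamma * float(p_hat[s, a] @ v)
        v = np.clip(q.max(axis=1), 0.0, 1.0 / (1.0 - gamma))
    return q, v
```

Because the bonus is nonnegative, the resulting values dominate plain value iteration on the same empirical model (optimism), and because the bonus shrinks with the visit counts and with the variance of the next-state value, the optimism vanishes on well-explored, low-variance (e.g., deterministic) instances.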
Merits
Strength in Theoretical Contributions
The authors provide a comprehensive theoretical analysis, establishing optimal variance-dependent regret bounds and characterizing the dependence on the optimal bias span. This work significantly advances the field of online reinforcement learning in infinite-horizon MDPs.
Strength in Algorithmic Development
The proposed algorithm is a significant improvement over existing approaches, offering improved regret guarantees and adaptability to easier problem instances.
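The adaptivity claim can be illustrated with simple arithmetic on the bound's two pieces: the variance-dependent leading term √(SA·Var) and the lower-order terms ‖h⋆‖_sp S²A (with a known span) versus ‖h⋆‖²_sp S³A (prior-free). The instance sizes below are invented for illustration, and all logarithmic factors and constants in the paper's bounds are dropped.

```python
import numpy as np

def leading_term(S, A, total_var):
    """Leading term sqrt(S * A * Var) of the variance-dependent bound."""
    return np.sqrt(S * A * total_var)

def lower_order_with_prior(span, S, A):
    """Lower-order term ||h*||_sp * S^2 * A, achievable with a known span."""
    return span * S**2 * A

def lower_order_prior_free(span, S, A):
    """Lower-order term ||h*||_sp^2 * S^3 * A of the prior-free algorithm."""
    return span**2 * S**3 * A

S, A, T = 20, 5, 10**6
print(leading_term(S, A, T))    # stochastic instance: Var ~ T, so regret grows as sqrt(T)
print(leading_term(S, A, 0.0))  # deterministic instance: Var = 0, leading term vanishes,
                                # leaving only the T-independent lower-order terms
```

With Var = 0 the leading term vanishes entirely, which is why the bound gives nearly constant regret on deterministic MDPs; the ratio of the two lower-order terms, ‖h⋆‖_sp · S, quantifies the extra price of not knowing the bias span in advance.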
Demerits
Limitation in Experimental Evaluation
The contribution is purely theoretical: the abstract reports no empirical evaluation of the proposed algorithm. Empirical studies across a range of MDP instances would be needed to assess how the variance-adaptive guarantees translate into practical performance.
Limitation in Scope
The work addresses the tabular average-reward and γ-regret settings (the latter already covering discounted-reward MDPs), leaving extensions beyond the tabular regime, such as large state spaces, unexplored.
Expert Commentary
The article's contributions are significant and timely, as the theory of online reinforcement learning in infinite-horizon MDPs continues to mature. The algorithm's ability to adapt to easier problem instances is particularly noteworthy, since instance-dependent guarantees can yield far better performance than worst-case bounds suggest; the results imply, for example, nearly constant regret in deterministic MDPs. However, further investigation into the algorithm's empirical behavior is necessary to understand its practical implications. Since the γ-regret objective already captures discounted-reward MDPs, the main open question is whether these guarantees can be extended beyond the tabular setting.
Recommendations
- ✓ Future research should focus on extending the algorithm's guarantees beyond the tabular setting and evaluating its performance in real-world applications.
- ✓ Investigating the algorithm's performance under different assumptions, such as varying levels of prior knowledge, can provide further insights into its limitations and potential applications.
Sources
Original: arXiv - cs.LG