Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
Guopeng Li, Matthijs T. J. Spaan, Julian F. P. Kooij

arXiv:2603.23889v1 Announce Type: new Abstract: When safety is formulated as a limit of cumulative cost, safe reinforcement learning (RL) aims to learn policies that maximize return subject to the cost constraint in data collection and deployment. Off-policy safe RL methods, although offering high sample efficiency, suffer from constraint violations due to cost-agnostic exploration and estimation bias in cumulative cost. To address this issue, we propose Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe RL algorithm that integrates cost-bounded online exploration and conservative offline distributional value learning. First, we introduce a novel cost-constrained optimistic exploration strategy that resolves gradient conflicts between reward and cost in the action space and adaptively adjusts the trust region to control the training cost. Second, we adopt truncated quantile critics to stabilize the cost value learning. Quantile critics also quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate that COX-Q achieves high sample efficiency, competitive test safety performance, and controlled data collection cost. The results highlight COX-Q as a promising RL method for safety-critical applications.

Executive Summary

The article proposes Constrained Optimistic eXploration Q-learning (COX-Q), an off-policy safe reinforcement learning algorithm that addresses constraint violations caused by cost-agnostic exploration and estimation bias in cumulative cost. COX-Q integrates cost-bounded online exploration with conservative offline distributional value learning: a novel cost-constrained optimistic exploration strategy resolves gradient conflicts between reward and cost in the action space, while truncated quantile critics stabilize cost value learning and quantify epistemic uncertainty to guide exploration. Experiments on safe velocity, safe navigation, and autonomous driving tasks demonstrate COX-Q's high sample efficiency, competitive test safety performance, and controlled data collection cost.
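To make the "truncated quantile critics" component concrete, here is a minimal sketch of TQC-style truncation on an ensemble of quantile estimates. This is an illustrative reading, not the paper's exact implementation: the function names, the ensemble shape, and the choice of which tail to drop are assumptions. For a reward critic one typically drops the highest pooled quantiles to counter overestimation; for a cost critic, one plausible conservative variant would instead drop the lowest quantiles so the cost is not underestimated.

```python
import numpy as np

def truncated_quantile_target(quantiles, drop_per_critic=1):
    """TQC-style truncated target (illustrative sketch, not the paper's code).

    quantiles: array of shape (n_critics, n_quantiles), one row per critic
               in the ensemble.
    drop_per_critic: how many of the LARGEST pooled quantiles to discard per
               critic before averaging, biasing the estimate downward.
    """
    pooled = np.sort(quantiles.reshape(-1))                # pool all critics, sort ascending
    keep = pooled.size - drop_per_critic * quantiles.shape[0]
    return pooled[:keep].mean()                            # mean of the kept quantiles
```

Dropping the extreme tail of the pooled distribution is what damps the overestimation bias that plain max-based Q-targets suffer from, which is the stabilization role the summary attributes to quantile critics.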

Key Points

  • COX-Q addresses constraint violations in off-policy safe RL methods
  • Integrates cost-bounded online exploration and conservative offline distributional value learning
  • Utilizes a novel cost-constrained optimistic exploration strategy and truncated quantile critics
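The "gradient conflicts between reward and cost in the action space" mentioned above can be illustrated with a simple projection scheme. This is a hedged sketch in the spirit of PCGrad-style deconfliction, not COX-Q's actual rule (the paper's resolution mechanism and its trust-region coupling are not specified in this summary): when the reward-improvement direction would also increase cost, remove its component along the cost gradient.

```python
import numpy as np

def deconflict_action_gradient(grad_reward, grad_cost):
    """Resolve a reward/cost gradient conflict in action space (illustrative
    PCGrad-style projection; the paper's exact rule may differ).

    If the reward gradient has positive inner product with the cost gradient,
    a step along it would also raise cost, so project that component out.
    """
    dot = float(np.dot(grad_reward, grad_cost))
    if dot > 0.0:  # conflict: reward step also increases cost
        grad_reward = grad_reward - (dot / np.dot(grad_cost, grad_cost)) * grad_cost
    return grad_reward
```

After the projection, the returned direction is orthogonal to the cost gradient whenever a conflict existed, so a small exploration step improves reward (to first order) without increasing cost.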

Merits

Enhanced Safety Performance

COX-Q achieves competitive test safety performance in safety-critical applications, ensuring the reliability of reinforcement learning policies.

Improved Sample Efficiency

COX-Q's high sample efficiency means fewer environment interactions are needed during training, reducing the data collection burden that safety-critical applications can ill afford.

Demerits

Limited Exploration Strategy

The cost-constrained optimistic exploration strategy may not adapt well to environments beyond those evaluated, which could limit the algorithm's generalizability.

Expert Commentary

COX-Q is a significant contribution to the field of reinforcement learning, as it addresses the critical issue of constraint violations in off-policy safe RL methods. The algorithm's integration of cost-bounded online exploration and conservative offline distributional value learning is a promising approach to balancing exploration and exploitation in safety-critical applications. While the article's experiments demonstrate the algorithm's effectiveness, further research is needed to fully understand its generalizability and adaptability to diverse environments.

Recommendations

  • Future research should investigate the extension of COX-Q to more complex safety-critical domains, such as robotics and healthcare.
  • The development of more adaptable exploration strategies for COX-Q is essential for its generalizability in diverse environments.

Sources

Original: arXiv - cs.LG