Jorge Cortés

Professor

Cymer Corporation Endowed Chair





Bias helps in anytime safe reinforcement learning
A. Marzabal, J. Cortés
Proceedings of the IEEE Conference on Decision and Control, Honolulu, Hawaii, 2026, submitted


Abstract

Anytime-safe reinforcement learning seeks to optimize performance while preserving constraint satisfaction throughout training. Existing methods with formal anytime guarantees rely on unbiased policy-gradient estimators, which can suffer from high variance and poor sample efficiency. In this paper, we develop a general framework that extends anytime-safe updates to biased value and gradient estimators. We characterize how estimator bias and variance affect safety, derive a probabilistic condition on the number of episodes required to preserve one-step safety, and provide convergence guarantees to KKT points under vanishing-bias assumptions. The analysis yields explicit trade-offs showing when bias can be beneficial through variance reduction. The framework accommodates practical estimators and objectives, including GAE and PPO-style losses, while retaining the core anytime-safety mechanism. Preliminary experiments on constrained continuous-control tasks support the theoretical predictions and show improved learning efficiency compared to anytime-safe policy-gradient baselines.


Mechanical and Aerospace Engineering, University of California, San Diego
9500 Gilman Dr, La Jolla, California, 92093-0411

Ph: 1-858-822-7930
Fax: 1-858-822-3107

cortes at ucsd.edu
Skype id: jorgilliyo