Jorge Cortés
Professor
Cymer Corporation Endowed Chair
Bias helps in anytime safe reinforcement learning
A. Marzabal, J. Cortés
Proceedings of the IEEE Conference on Decision and Control, Honolulu, Hawaii, 2026, submitted
Abstract
Anytime-safe reinforcement learning seeks to
optimize performance while preserving constraint
satisfaction throughout training. Existing methods
with formal anytime guarantees rely on unbiased
policy-gradient estimators, which can suffer from
high variance and poor sample efficiency. In this
paper, we develop a general framework that extends
anytime-safe updates to biased value and gradient
estimators. We characterize how estimator bias and
variance affect safety, derive a probabilistic
condition on the number of episodes required to
preserve one-step safety, and provide convergence
guarantees to KKT points under vanishing-bias
assumptions. The analysis yields explicit trade-offs
showing when bias can be beneficial through variance
reduction. The framework accommodates practical
estimators and objectives, including GAE and
PPO-style losses, while retaining the core
anytime-safety mechanism. Preliminary experiments on
constrained continuous-control tasks support the
theoretical predictions and show improved learning
efficiency compared to anytime-safe policy gradient
baselines.
pdf
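As a point of reference for the biased estimators the abstract mentions, the following is a minimal sketch of standard Generalized Advantage Estimation (GAE). It is not the paper's method; it only illustrates the bias–variance trade-off the abstract alludes to, where the parameter `lam` interpolates between the low-bias, high-variance Monte Carlo advantage (`lam=1`) and the biased, low-variance one-step TD advantage (`lam=0`). All function and variable names here are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single episode.

    rewards:    shape (T,), per-step rewards r_t
    values:     shape (T,), critic estimates V(s_t)
    last_value: bootstrap value V(s_T) for the state after the last step
    gamma:      discount factor
    lam:        GAE parameter trading bias for variance
    """
    T = len(rewards)
    values_ext = np.append(values, last_value)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` each advantage reduces to the one-step TD error, while `lam=1` (and `gamma=1`) recovers the undiscounted return-to-go minus the value baseline.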
Mechanical and Aerospace Engineering,
University of California, San Diego
9500 Gilman Dr,
La Jolla, California, 92093-0411
Ph: 1-858-822-7930
Fax: 1-858-822-3107
cortes at ucsd.edu
Skype id:
jorgilliyo