Batch-Constrained deep Q-learning (BCQ) [1] is a batch reinforcement learning method for continuous control. BCQ aims to perform Q-learning while constraining the action space to eliminate actions that are unlikely to be selected by the behavioral policy $\pi_b$, and are therefore unlikely to be contained in the batch. At its core, BCQ uses a state-conditioned generative model $G_\omega(s)$ to model the distribution of data in the batch, akin to a behavioral cloning model. As it is easier to sample from $G_\omega(s)$ than to model $\pi_b$ exactly in a continuous action space, the policy is defined by sampling $n$ actions $a_i \sim G_\omega(s)$ and selecting the highest valued action according to a Q-network. Since BCQ was designed for continuous actions, the method also includes a perturbation model $\xi_\phi(s, a_i, \Phi)$, which is a residual added to the sampled actions in the range $[-\Phi, \Phi]$, and trained with the deterministic policy gradient. Finally, the authors include a weighted version of Clipped Double Q-learning to penalize high-variance estimates and reduce overestimation bias, using a convex combination of the two target Q-values with weight $\lambda$:

$$r + \gamma \max_{a_i} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \right],$$

where $a_i = a_i + \xi_\phi(s', a_i, \Phi)$ and $a_i \sim G_\omega(s')$. During evaluation, the policy is defined similarly, by sampling $n$ actions from $G_\omega(s)$, perturbing them, and selecting the highest valued perturbed action:

$$\pi(s) = \underset{a_i + \xi_\phi(s, a_i, \Phi)}{\arg\max} \; Q_\theta\big(s, a_i + \xi_\phi(s, a_i, \Phi)\big), \quad a_i \sim G_\omega(s), \; i = 1, \dots, n.$$
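The sampling-and-perturbation policy and the weighted Clipped Double Q target above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `sample_from_generative_model`, `perturbation`, `q1`, and `q2` are hypothetical stand-ins for the conditional VAE $G_\omega$, the perturbation model $\xi_\phi$, and the two Q-networks, which would be learned neural networks in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 2
PHI = 0.05   # perturbation range Phi
N = 10       # number of candidate actions sampled per state
LAM = 0.75   # weight on the min term in the weighted Clipped Double Q target
GAMMA = 0.99

def sample_from_generative_model(state, n):
    """Hypothetical stand-in for G_w(s): returns n candidate actions.
    In BCQ this is a conditional VAE trained on the batch."""
    return np.tanh(rng.normal(size=(n, ACTION_DIM)))

def perturbation(state, actions):
    """Hypothetical stand-in for xi_phi(s, a, Phi): a residual
    clipped to [-PHI, PHI] (trained with the DPG in the paper)."""
    return np.clip(0.1 * np.tanh(actions), -PHI, PHI)

def q1(state, actions):
    """Toy Q-network 1 (arbitrary smooth function for illustration)."""
    return -np.sum((actions - 0.3) ** 2, axis=-1)

def q2(state, actions):
    """Toy Q-network 2."""
    return -np.sum((actions - 0.2) ** 2, axis=-1)

def bcq_policy(state):
    """pi(s): sample n actions from G_w(s), perturb them, and pick
    the perturbed action with the highest Q-value."""
    actions = sample_from_generative_model(state, N)
    perturbed = np.clip(actions + perturbation(state, actions), -1.0, 1.0)
    return perturbed[np.argmax(q1(state, perturbed))]

def bcq_target(reward, next_state):
    """Weighted Clipped Double Q target:
    r + gamma * max_i [ lam * min_j Q_j + (1 - lam) * max_j Q_j ],
    evaluated over perturbed actions sampled from G_w(s')."""
    actions = sample_from_generative_model(next_state, N)
    perturbed = np.clip(actions + perturbation(next_state, actions), -1.0, 1.0)
    qs = np.stack([q1(next_state, perturbed), q2(next_state, perturbed)])
    mixed = LAM * qs.min(axis=0) + (1 - LAM) * qs.max(axis=0)
    return reward + GAMMA * mixed.max()
```

With $\lambda = 1$ this reduces to standard Clipped Double Q-learning; smaller $\lambda$ trades some of the underestimation penalty for the max term, which the authors found helpful in the batch setting.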
```shell
python bcq-train.py --dataset=walker2d-random-v2 --seed=0 --gpu=0
```
- [1] Fujimoto, S., Meger, D., and Precup, D. Off-Policy Deep Reinforcement Learning without Exploration. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2019, pp. 2052-2062.