Off-dynamics reinforcement learning (RL), in which the transition dynamics differ between training and deployment, can be formulated as learning in a robust Markov decision process (RMDP) whose transition dynamics are subject to uncertainty. Most existing work assumes access to a generative model that allows arbitrary state-action queries, or to pre-collected datasets with good coverage of the deployment environment, thereby bypassing the challenge of exploration. In this work, we study a more realistic and challenging setting in which the agent is limited to online interaction with the training environment. To capture the intrinsic difficulty of exploration in online RMDPs, we introduce the supremal visitation ratio, a novel quantity that measures the mismatch between the training dynamics and the deployment dynamics. We show that if this ratio is unbounded, online learning becomes exponentially hard. We propose the first computationally efficient algorithm that achieves sublinear regret in online RMDPs with $f$-divergence-based transition uncertainties. We also establish matching regret lower bounds, demonstrating that our algorithm attains optimal dependence on both the supremal visitation ratio and the number of interaction episodes. Finally, we validate our theoretical results through comprehensive numerical experiments.
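As a rough illustration of the kind of mismatch quantity involved (a hedged sketch only: the finite-horizon tabular setting, the occupancy-measure notation $d_h^{\pi,P}$, and the specific form below are our assumptions, not necessarily the paper's formal definition), one natural way to compare training dynamics $P^{\mathrm{tr}}$ and deployment dynamics $P^{\mathrm{de}}$ is through a supremal ratio of visitation probabilities,
\[
C^{\star} \;=\; \sup_{\pi}\; \max_{h \in [H]}\; \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} \; \frac{d_h^{\pi, P^{\mathrm{de}}}(s,a)}{d_h^{\pi, P^{\mathrm{tr}}}(s,a)},
\]
where $d_h^{\pi,P}(s,a)$ denotes the probability that policy $\pi$ visits $(s,a)$ at step $h$ under dynamics $P$. If some state-action pair matters under the deployment dynamics but is nearly unreachable under the training dynamics, a ratio of this form blows up, which is consistent with the exponential hardness of online learning described above.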