基于高斯过程的有限时域马尔可夫决策过程的无悔汤普森采样 (No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes)

Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.

翻译：汤普森采样（TS）是一种强大且广泛使用的序列决策策略，其应用范围涵盖从贝叶斯优化到强化学习（RL）。尽管取得了成功，但TS的理论基础仍然有限，尤其是在具有复杂时间结构的场景（如RL）中。我们通过为使用具有高斯边缘分布的模型的TS建立无悔保证来弥补这一不足。具体而言，我们研究了在幕式RL中，对奖励和状态转移使用联合高斯过程（GP）先验的TS。我们证明了在$K$个时域为$H$的幕中，遗憾界为$\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$，其中$\Gamma(\cdot)$刻画了GP模型的复杂度。我们的分析解决了若干挑战，包括价值函数的非高斯性质以及贝尔曼更新的递归结构，并将椭圆势引理等经典工具扩展到多输出设置。这项工作增进了对RL中TS的理解，并强调了结构假设和模型不确定性如何影响其在有限时域马尔可夫决策过程中的性能。