The problem of combinatorial multi-armed bandits with probabilistically triggered arms (CMAB-T) has been extensively studied. Prior work focuses primarily on either the online setting, where an agent learns about the unknown environment through iterative interaction, or the offline setting, where a policy is learned solely from logged data. Each paradigm has inherent limitations: online algorithms suffer from high interaction costs and slow adaptation, while offline methods are constrained by dataset quality and the absence of exploration. To address these complementary weaknesses, we propose hybrid CMAB-T, a new framework that integrates offline data with online interaction in a principled manner. Our hybrid CUCB algorithm leverages offline data to guide exploration and accelerate convergence, while strategically incorporating online interactions to compensate for insufficient coverage or distributional bias in the offline dataset. We provide theoretical guarantees on the algorithm's regret, showing that hybrid CUCB significantly outperforms purely online approaches when high-quality offline data is available, and effectively corrects the bias inherent in offline-only methods when the data is limited or misaligned. Empirical results further demonstrate the consistent advantage of our algorithm.
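To make the high-level description above concrete, the sketch below shows one plausible way offline data could warm-start a CUCB-style learner: offline pull counts and empirical means initialize each base arm's statistics, shrinking the confidence radius for well-covered arms while leaving poorly covered arms to be explored online. This is an illustrative sketch only; the class name, the exploration constant, and the warm-start scheme are assumptions and need not match the paper's exact hybrid CUCB specification.

```python
import numpy as np

class HybridCUCBSketch:
    """Illustrative CUCB-style learner warm-started with offline data.

    Assumption: offline counts n_off and empirical means mu_off are
    available per base arm; the actual hybrid CUCB algorithm may
    incorporate offline data differently.
    """

    def __init__(self, n_arms, n_off, mu_off):
        # Initialize each base arm's statistics from the logged (offline) data.
        self.n = np.array(n_off, dtype=float)    # observations seen so far
        self.mu = np.array(mu_off, dtype=float)  # empirical mean rewards
        self.n_arms = n_arms
        self.t = 0                               # online round counter

    def ucb_indices(self):
        # Optimistic estimates: offline observations shrink the confidence
        # radius, so well-covered arms need little further online exploration.
        total = self.t + self.n.sum()
        bonus = np.sqrt(1.5 * np.log(max(total, 2.0)) / np.maximum(self.n, 1.0))
        return np.clip(self.mu + bonus, 0.0, 1.0)

    def select(self, oracle):
        # As in standard CUCB, an (approximation) oracle maps the optimistic
        # base-arm estimates to a super arm to play this round.
        self.t += 1
        return oracle(self.ucb_indices())

    def update(self, triggered_arms, rewards):
        # Update only the base arms that were probabilistically triggered
        # and hence observed in this round.
        for i, r in zip(triggered_arms, rewards):
            self.n[i] += 1.0
            self.mu[i] += (r - self.mu[i]) / self.n[i]
```

With this warm start, arms whose offline coverage is poor retain large exploration bonuses and are corrected through online interaction, which mirrors the bias-correction behavior described in the abstract.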