The cascading bandit (CB) is a variant of both the multi-armed bandit (MAB) and the cascade model (CM), in which a learning agent aims to maximize the total reward by recommending $K$ out of $L$ items to a user. We focus on a common real-world scenario in which the user's preference can change in a piecewise-stationary manner. Two efficient algorithms, \texttt{GLRT-CascadeUCB} and \texttt{GLRT-CascadeKL-UCB}, are developed. The key idea behind the proposed algorithms is to incorporate an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT), within classical upper confidence bound (UCB) based algorithms. Gap-dependent regret upper bounds of the proposed algorithms are derived, and both match the lower bound $\Omega(\sqrt{T})$ up to a poly-logarithmic factor $\sqrt{\log{T}}$ in the number of time steps $T$. We also present numerical experiments on both synthetic and real-world datasets to show that \texttt{GLRT-CascadeUCB} and \texttt{GLRT-CascadeKL-UCB} outperform state-of-the-art algorithms in the literature.
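To make the change-point detection step concrete, below is a minimal sketch of a Bernoulli GLR detector of the kind the abstract refers to: it scans every candidate split point $s$ of a reward stream and compares $s\,\mathrm{kl}(\hat\mu_{1:s},\hat\mu_{1:n}) + (n-s)\,\mathrm{kl}(\hat\mu_{s+1:n},\hat\mu_{1:n})$ against a threshold. The function names and the exact threshold form are illustrative assumptions, not the paper's specification.

```python
import math


def kl_bern(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with clamping."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def glr_statistic(rewards):
    """Bernoulli GLR statistic: the sup over split points s of
    s*kl(mu_{1:s}, mu_{1:n}) + (n-s)*kl(mu_{s+1:n}, mu_{1:n})."""
    n = len(rewards)
    total = sum(rewards)
    mu_all = total / n
    best, head = 0.0, 0.0
    for s in range(1, n):
        head += rewards[s - 1]
        mu1 = head / s                    # mean before the candidate change
        mu2 = (total - head) / (n - s)    # mean after the candidate change
        stat = s * kl_bern(mu1, mu_all) + (n - s) * kl_bern(mu2, mu_all)
        best = max(best, stat)
    return best


def change_detected(rewards, delta=0.05):
    """Flag a change when the GLR statistic exceeds a threshold.
    The log(n^{3/2}/delta)-style threshold below is a placeholder
    assumption; the paper's tuning may differ."""
    n = len(rewards)
    threshold = math.log(3 * n * math.sqrt(n) / (2 * delta))
    return glr_statistic(rewards) > threshold
```

Within a \texttt{GLRT-CascadeUCB}-style loop, each item's observed click stream would be fed to such a detector, and a detection resets that item's UCB statistics.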