In this paper we consider an online recommendation setting, where a platform recommends a sequence of items to its users at every time period. The users respond either by selecting one of the recommended items or by abandoning the platform due to fatigue from seeing less useful items. Assuming a parametric stochastic model of user behavior, which captures the positional effects of these items as well as the abandonment behavior of users, the platform's goal is to recommend sequences of items that are competitive with the single best sequence of items in hindsight, without knowing the true user model a priori. Naively applying a stochastic bandit algorithm in this setting leads to regret with an exponential dependence on the number of items. We propose a new Thompson-sampling-based algorithm whose expected regret is polynomial in the number of items in this combinatorial setting, and which performs extremely well in practice.
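To give intuition for the stochastic-bandit baseline the abstract contrasts against, here is a minimal Thompson sampling sketch for independent Bernoulli arms. This is purely illustrative background, not the paper's combinatorial algorithm: the function name and parameters are hypothetical, and a single "arm" here stands in for what the paper treats as an entire sequence of items (hence the exponential blow-up when applied naively).

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    """Vanilla Thompson sampling for Bernoulli bandits (illustrative sketch).

    Each arm keeps a Beta posterior over its unknown success probability;
    every round we sample from each posterior and play the arm with the
    highest sample, then update that arm's posterior with the observed reward.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1.0] * k  # Beta-posterior success counts (uniform prior)
    beta = [1.0] * k   # Beta-posterior failure counts
    total_reward = 0
    for _ in range(horizon):
        # Draw one plausible success probability per arm from its posterior
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward
```

Applied naively to the recommendation problem, each distinct ordered sequence of items would be one arm, so the number of arms (and hence the regret) grows exponentially with the number of items; the paper's contribution is an algorithm that exploits the structure of the user model to avoid this.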