Improved Algorithms for Multi-period Multi-class Packing Problems with~Bandit~Feedback

We consider the linear contextual multi-class multi-period packing problem~(LMMP), where the goal is to pack items so that the total consumption vector stays below a given budget vector and the total value is as large as possible. We study the setting in which the reward and the consumption vector associated with each action are class-dependent linear functions of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new, more efficient estimator that guarantees a faster convergence rate and, consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of the estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon~$T$, provided the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016) and extend the result to a multi-class setting. In our numerical experiments, the proposed policy outperforms other benchmarks from the literature.
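
To make the setting concrete, the following is a minimal simulation sketch of an LMMP-style interaction loop with bandit feedback. It is not the estimator or the policy proposed here. The names `run_episode`, `theta`, and `W`, the uniform contexts, the Gaussian noise, the skip action encoded as `-1`, and the rule of stopping at the first budget violation are all illustrative assumptions.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def run_episode(
    policy: Callable[[int, Vector, Vector], int],
    theta: List[List[Vector]],    # theta[j][k]: reward parameter for class j, action k
    W: List[List[List[Vector]]],  # W[j][k][r]: consumption parameter for resource r
    budget: Vector,
    horizon: int,
    seed: int = 0,
) -> Tuple[float, Vector]:
    """Simulate one episode; return (total reward, remaining budget)."""
    rng = random.Random(seed)
    n_classes = len(theta)
    dim = len(theta[0][0])
    remaining = list(budget)
    total_reward = 0.0
    for _ in range(horizon):
        j = rng.randrange(n_classes)            # class of the arrival (assumed uniform)
        x = [rng.random() for _ in range(dim)]  # context (assumed uniform on [0, 1)^dim)
        k = policy(j, x, remaining)
        if k < 0:                               # hypothetical "skip" action: nothing consumed
            continue
        # Bandit feedback: only the chosen action's noisy outcome is observed.
        use = [max(0.0, dot(w, x) + rng.gauss(0.0, 0.01)) for w in W[j][k]]
        if any(u > r for u, r in zip(use, remaining)):
            break                               # assumed rule: stop before exceeding any budget
        total_reward += dot(theta[j][k], x) + rng.gauss(0.0, 0.01)
        remaining = [r - u for r, u in zip(remaining, use)]
    return total_reward, remaining
```

A policy here is any function of the class, the context, and the remaining budget, for example `lambda j, x, remaining: 0` to always take action 0. A learning policy would additionally update parameter estimates from the observed reward and consumption, which this sketch omits.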
