Linear Bandits with Limited Adaptivity and Learning Distributional Optimal Design

Motivated by practical needs such as large-scale learning, we study the impact of adaptivity constraints on linear contextual bandits, a central problem in online active learning. We consider two popular limited-adaptivity models from the literature: batch learning and rare policy switches. We show that, when the context vectors are adversarially chosen in $d$-dimensional linear contextual bandits, the learner needs $O(d \log d \log T)$ policy switches to achieve the minimax-optimal regret, and this is optimal up to $\mathrm{poly}(\log d, \log \log T)$ factors; for stochastic context vectors, even in the more restricted batch learning model, only $O(\log \log T)$ batches are needed to achieve the optimal regret. Together with known results in the literature, our results present a complete picture of adaptivity constraints in linear contextual bandits. Along the way, we propose the distributional optimal design, a natural extension of optimal experiment design, and provide a learning algorithm for the problem that is both statistically and computationally efficient, which may be of independent interest.
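To give a feel for how few batches $O(\log \log T)$ is, the sketch below builds one batch schedule of that size. The abstract does not specify a schedule, so the grid $t_k \approx T^{1 - 2^{-k}}$ used here is an assumption: it is a geometric-in-the-exponent grid often seen in batched bandit analyses, not necessarily the one used in this work. The function name `batch_grid` and its parameters are hypothetical.

```python
import math
from typing import List, Optional


def batch_grid(T: int, M: Optional[int] = None) -> List[int]:
    """Return batch end times t_1 < ... < t_M = T with t_k ~ T^(1 - 2^-k).

    Illustrative assumption only: the grid and the default choice of M
    are not taken from the abstract.
    """
    if T < 1:
        raise ValueError("T must be a positive integer")
    if M is None:
        # Default to a number of batches on the order of log log T.
        M = math.ceil(math.log2(math.log2(T))) + 1 if T > 2 else 1
    ends: List[int] = []
    for k in range(1, M + 1):
        # The last batch always ends at the horizon T.
        t = T if k == M else min(T, math.ceil(T ** (1 - 2.0 ** (-k))))
        # Skip duplicates so the end times stay strictly increasing.
        if not ends or t > ends[-1]:
            ends.append(t)
    return ends


if __name__ == "__main__":
    # With a horizon of one million rounds, only a handful of batches result.
    print(batch_grid(10**6))
```

Under this assumed grid, the policy is updated only at the listed end times, so a horizon of a million rounds needs at most six updates.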
