Hedging the Drift: Learning to Optimize under Non-Stationarity

From arXiv. Journal version of the AISTATS 2019 paper (available at arXiv:1810.03024). This version contains improved algorithm design and dynamic regret bounds, and applications to K-armed bandits, generalized linear bandits, and combinatorial semi-bandits.

We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe for a wide variety of non-stationary bandit problems. Specifically, we design and analyze the sliding window-upper confidence bound algorithm that achieves the optimal dynamic regret bound for each of the settings when we know the respective underlying variation budget, which quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, we can further enjoy the (nearly) optimal dynamic regret bounds in a (surprisingly) parameter-free manner. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning processes, which is vital in changing environments. Our extensive numerical experiments on both synthetic and real-world online auto-loan datasets show that our proposed algorithms achieve superior empirical performance compared to existing algorithms.
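The "forgetting principle" behind a sliding-window upper confidence bound rule can be illustrated for the K-armed case with a minimal sketch. It assumes rewards in [0, 1], a fixed window length chosen by hand, and a textbook UCB-style confidence radius; the paper's actual algorithm, constants, and window tuning may differ. The names `sliding_window_ucb` and `piecewise_reward` are hypothetical and introduced only for this illustration.

```python
import math
from collections import deque


def sliding_window_ucb(reward_fn, n_arms, horizon, window):
    """Illustrative sliding-window UCB: estimates use only the last `window` rounds."""
    history = deque()        # (arm, reward) pairs currently inside the window
    counts = [0] * n_arms    # pulls per arm inside the window
    sums = [0.0] * n_arms    # reward totals per arm inside the window
    choices = []
    for t in range(horizon):
        # Forgetting step: drop the observation that has fallen out of the window.
        if len(history) == window:
            old_arm, old_reward = history.popleft()
            counts[old_arm] -= 1
            sums[old_arm] -= old_reward
        # Any arm with no data in the window is explored first.
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            # Assumed confidence radius: sqrt(2 * log(min(t + 1, window)) / count).
            log_term = math.log(min(t + 1, window))
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * log_term / counts[a]),
            )
        reward = reward_fn(arm, t)
        history.append((arm, reward))
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
    return choices


def piecewise_reward(arm, t, change_point=200):
    """Hypothetical drifting environment: the best arm switches at `change_point`."""
    best_arm = 0 if t < change_point else 1
    return 1.0 if arm == best_arm else 0.0
```

Calling `sliding_window_ucb(piecewise_reward, n_arms=2, horizon=400, window=50)` should show most pulls moving from arm 0 to arm 1 shortly after round 200, because observations older than 50 rounds no longer influence the estimates. A shorter window reacts faster to change but leaves fewer samples per estimate, which is the trade-off that knowledge of the variation budget (or the bandit-over-bandit layer) is meant to resolve.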
