Online learning with dynamics: A minimax perspective

We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round of the interaction, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and the current state of the world. The state-evolution dynamics and the costs are allowed to be time-varying, possibly in an adversarial way. In this setting, we study the problem of minimizing policy regret and provide non-constructive upper bounds on the minimax rate for the problem. Our main results provide sufficient conditions for online learnability in this setup, with corresponding rates. The rates are characterized by 1) a complexity term capturing the expressiveness of the underlying policy class under the dynamics of state change, and 2) a dynamics stability term measuring the deviation of the instantaneous loss from a certain counterfactual loss. Further, we provide matching lower bounds which show that both terms are indeed necessary. Our approach provides a unifying analysis that recovers regret bounds for several well-studied problems, including online learning with memory, online control of linear quadratic regulators, online Markov decision processes, and tracking adversarial targets. In addition, we show how our tools help obtain tight regret bounds for new problems (with non-linear dynamics and non-convex losses) for which such bounds were not known prior to our work.
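To make the interaction protocol and the notion of policy regret concrete, the following is a minimal toy sketch, not the construction used in the paper. It assumes a scalar state, a finite policy class, and a comparator defined as the best single policy deployed in every round and evaluated on the counterfactual states that policy would itself have induced. All names (`rollout_cost`, `policy_regret`, the example dynamics and loss) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical toy types: states and actions are plain floats.
Policy = Callable[[float], float]            # state -> action
Dynamics = Callable[[float, float], float]   # (state, action) -> next state
Loss = Callable[[float, float], float]       # (state, action) -> cost


def rollout_cost(policies: Sequence[Policy],
                 dynamics: Sequence[Dynamics],
                 losses: Sequence[Loss],
                 x0: float = 0.0) -> float:
    """Total cost of deploying policies[t] in round t, starting from state x0."""
    x, total = x0, 0.0
    for pi, f, loss in zip(policies, dynamics, losses):
        u = pi(x)            # the deployed policy picks an action
        total += loss(x, u)  # cost depends on the current state and the action
        x = f(x, u)          # the state evolves under this round's dynamics
    return total


def policy_regret(played: Sequence[Policy],
                  policy_class: Sequence[Policy],
                  dynamics: Sequence[Dynamics],
                  losses: Sequence[Loss],
                  x0: float = 0.0) -> float:
    """Learner's cost minus the cost of the best fixed policy in hindsight.

    Each comparator is rolled out from x0 on its own, so it is charged on the
    states it would have reached, not on the states the learner visited.
    """
    T = len(played)
    learner_cost = rollout_cost(played, dynamics, losses, x0)
    best_fixed = min(rollout_cost([pi] * T, dynamics, losses, x0)
                     for pi in policy_class)
    return learner_cost - best_fixed


# Toy instance (assumed for illustration): two constant policies, three rounds.
def always_zero(x: float) -> float:
    return 0.0


def always_one(x: float) -> float:
    return 1.0


T = 3
# The same dynamics and loss are reused every round here, but each round
# may use a different function, which is how time variation would enter.
toy_dynamics = [lambda x, u: 0.5 * x + u] * T
toy_losses = [lambda x, u: (x - 1.0) ** 2] * T

regret = policy_regret([always_zero, always_one, always_one],
                       [always_zero, always_one],
                       toy_dynamics, toy_losses)
print(regret)  # 0.75
```

In this instance the learner pays 2.0 in total, while deploying `always_one` in every round would have paid 1.25, so the policy regret is 0.75. The gap arises partly because the learner's early choice changes the states on which its later losses are evaluated, which is the effect the counterfactual comparison is meant to capture.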
