Branching Reinforcement Learning

In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model and investigate both the Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for it. Unlike standard RL, where the trajectory of each episode is a single $H$-step path, branching RL allows an agent to take multiple base actions in a state, so that transitions branch out to multiple successor states and the episode generates a tree-structured trajectory. This model has important applications in hierarchical recommendation systems and online advertising. For branching RL, we establish new Bellman equations and key lemmas, namely the branching value difference lemma and the branching law of total variance, and we bound the total variance by only $O(H^2)$ even though the trajectory is exponentially large. For the RM and RFE metrics, we propose the computationally efficient algorithms BranchVI and BranchRFE, respectively, and derive nearly matching upper and lower bounds. Our results are polynomial in the problem parameters despite the exponentially large trajectories.
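To make the tree-structured trajectory concrete, the following is a minimal sketch, not the paper's formal model. It assumes that every state takes exactly `branching` base actions and that each base action leads to exactly one successor state, drawn uniformly at random as a placeholder for a real transition kernel. The names `sample_branching_trajectory`, `tree_size`, `num_states`, and `branching` are hypothetical and introduced only for illustration.

```python
import random


def sample_branching_trajectory(horizon, num_states, branching, rng):
    """Sample a tree-structured trajectory, returned level by level.

    Assumption (illustrative only): each state takes `branching` base
    actions and each action yields one successor chosen uniformly at random.
    """
    levels = [[rng.randrange(num_states)]]  # step 1: a single initial state
    for _ in range(horizon - 1):
        next_level = []
        for _state in levels[-1]:
            # Every base action taken in this state spawns its own successor.
            next_level.extend(
                rng.randrange(num_states) for _ in range(branching)
            )
        levels.append(next_level)
    return levels


def tree_size(horizon, branching):
    """Number of states in a full trajectory tree: sum of branching**h."""
    return sum(branching ** h for h in range(horizon))


rng = random.Random(0)
levels = sample_branching_trajectory(horizon=4, num_states=5, branching=2, rng=rng)
print([len(level) for level in levels])  # [1, 2, 4, 8]
print(tree_size(4, 2))                   # 15
```

Under these assumptions, the number of visited states grows geometrically with the horizon $H$, which is the sense in which the trajectory is exponentially large, whereas a standard RL episode visits only $H$ states.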
