Current work on reinforcement learning (RL) from demonstrations often assumes that the demonstrations are sampled from an optimal policy, an assumption that is unrealistic in practice. When demonstrations are generated by sub-optimal policies or contain only sparse state-action pairs, policies learned from them may mislead the agent with incorrect or non-local action decisions. We propose a new method, Local Ensemble and Reparameterization with Split and Merge of expert policies (LEARN-SAM), to improve learning efficiency and make better use of sub-optimal demonstrations. First, LEARN-SAM employs a new concept, the lambda-function, based on a discrepancy measure between the current state and the demonstrated states, to "localize" the weights of the expert policies during learning. Second, LEARN-SAM employs a split-and-merge (SAM) mechanism that separates the helpful parts of each expert demonstration and regroups them into new expert policies, so that the demonstrations are used selectively. Both the lambda-function and the SAM mechanism boost learning speed. Theoretically, we prove that the reparameterized policy is invariant before and after the SAM mechanism, which provides guarantees for the convergence of the employed policy gradient method. We demonstrate the superiority of LEARN-SAM and its robustness to varying demonstration quality and sparsity in six experiments on complex continuous control problems of low to high dimensions, comparing against existing RL-from-demonstration methods.
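As a rough illustration of the localization idea (a sketch only, not necessarily the exact form used in the paper), such a lambda-function could down-weight the $k$-th expert policy when the current state $s$ is far from that expert's demonstrated states, e.g. via a kernel on a state discrepancy measure $d(\cdot,\cdot)$:
\[
\lambda_k(s) \;=\; \exp\!\Big(-\tfrac{1}{\sigma}\,\min_{s' \in \mathcal{D}_k} d(s, s')\Big),
\]
where $\mathcal{D}_k$ denotes the set of states in the $k$-th expert demonstration and $\sigma > 0$ is an assumed locality scale; the influence of the $k$-th expert policy on the update at state $s$ would then be scaled by $\lambda_k(s)$, so demonstrations guide learning only near states they actually cover.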