The Partially Observable Markov Decision Process (POMDP) provides a principled and generic framework for modeling real-world sequential decision-making processes, yet it remains difficult to solve, especially in high-dimensional continuous spaces with unknown models. The main challenge lies in how to accurately obtain the belief state, the probability distribution over the unobservable environment states given the history of observations. Accurately computing this belief state is a precondition for obtaining an optimal policy for POMDPs. Recent advances in deep learning techniques show great potential for learning good belief states. However, existing methods can only learn approximate distributions with limited flexibility. In this paper, we introduce the \textbf{F}l\textbf{O}w-based \textbf{R}ecurrent \textbf{BE}lief \textbf{S}tate model (FORBES), which incorporates normalizing flows into variational inference to learn general continuous belief states for POMDPs. Furthermore, we show that the learned belief states can be plugged into downstream RL algorithms to improve performance. In experiments, we show that our method successfully captures complex belief states that enable multi-modal predictions as well as high-quality reconstructions, and results on challenging visual-motor control tasks show that our method achieves superior performance and sample efficiency.
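To make the core idea concrete, the sketch below shows one standard way to incorporate normalizing flows into a variational belief posterior: a diagonal Gaussian base distribution conditioned on a recurrent hidden state is refined by a stack of planar flows, with the change-of-variables correction applied to the log-density. This is a minimal illustration under assumed names and dimensions (PlanarFlow, BeliefFlow, hidden/belief sizes), not the authors' exact FORBES architecture.

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """One planar transform z' = z + u * tanh(w^T z + b) with a tractable log-det.
    (A real implementation would also constrain u so the map stays invertible.)"""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w + self.b                     # (batch,)
        f = torch.tanh(lin)
        z_new = z + self.u * f.unsqueeze(-1)          # (batch, dim)
        psi = (1 - f ** 2).unsqueeze(-1) * self.w     # derivative of tanh term
        log_det = torch.log(torch.abs(1 + psi @ self.u) + 1e-8)
        return z_new, log_det

class BeliefFlow(nn.Module):
    """Gaussian base belief conditioned on an RNN hidden state, refined by K flows."""
    def __init__(self, hidden_dim, belief_dim, num_flows=4):
        super().__init__()
        self.base = nn.Linear(hidden_dim, 2 * belief_dim)  # predicts mean, log-std
        self.flows = nn.ModuleList(PlanarFlow(belief_dim) for _ in range(num_flows))

    def forward(self, h):
        mean, log_std = self.base(h).chunk(2, dim=-1)
        std = log_std.exp()
        eps = torch.randn_like(mean)
        z = mean + std * eps                          # reparameterized base sample
        # log-density of z under the diagonal Gaussian base, q0
        log_q = (-0.5 * eps ** 2 - log_std
                 - 0.5 * torch.log(torch.tensor(2 * torch.pi))).sum(-1)
        for flow in self.flows:
            z, log_det = flow(z)
            log_q = log_q - log_det                   # change-of-variables correction
        return z, log_q                               # flexible belief sample + density

# Hypothetical usage: h is a recurrent hidden state summarizing the history.
# z, log_q = BeliefFlow(hidden_dim=256, belief_dim=32)(torch.randn(8, 256))

The resulting log_q can enter a variational (ELBO-style) training objective in place of the plain Gaussian density, which is what lets the belief become multi-modal rather than restricted to a single Gaussian mode.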