A critical challenge for reinforcement learning (RL) is making decisions from incomplete and noisy observations, especially in perturbed and partially observable Markov decision processes (P$^2$OMDPs). Existing methods address partial observability but fail to simultaneously mitigate perturbations. We propose \textit{Causal State Representation under Asynchronous Diffusion Model (CaDiff)}, a framework that enhances any RL algorithm by uncovering the underlying causal structure of P$^2$OMDPs. CaDiff achieves this by incorporating a novel asynchronous diffusion model (ADM) and a new bisimulation metric. The ADM allows its forward and reverse processes to use different numbers of steps, so that the perturbation of a P$^2$OMDP is treated as part of the noise suppressed through diffusion. The bisimulation metric quantifies the similarity between partially observable environments and their causal counterparts. Moreover, we establish a theoretical guarantee for CaDiff by deriving an upper bound on the value-function approximation error between perturbed observations and denoised causal states, reflecting a principled trade-off between the reward and transition-model approximation errors. Experiments on Roboschool tasks show that CaDiff improves returns by at least 14.18\% over baselines. CaDiff is the first framework to approximate causal states with diffusion models while offering both theoretical rigor and practicality.
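To make the asynchronous step counts concrete, the following is a minimal formal sketch in standard DDPM notation; the specific symbols ($o$ for the perturbed observation, $M$ forward steps, $N > M$ reverse steps, the noise schedule $\beta_m$, and the learned reverse parameters $\mu_\theta, \Sigma_\theta$) are illustrative assumptions rather than the paper's exact formulation.
\begin{align*}
  &\text{Forward process ($M$ steps, starting from the perturbed observation } s_0 := o\text{):}\\
  &\qquad q(s_m \mid s_{m-1}) = \mathcal{N}\!\big(s_m;\ \sqrt{1-\beta_m}\, s_{m-1},\ \beta_m I\big), \quad m = 1,\dots,M,\\[2pt]
  &\text{Reverse process ($N > M$ steps, ending in an estimate of the causal state):}\\
  &\qquad p_\theta(s_{n-1} \mid s_n) = \mathcal{N}\!\big(s_{n-1};\ \mu_\theta(s_n, n),\ \Sigma_\theta(s_n, n)\big), \quad n = N,\dots,1.
\end{align*}
Under this reading, the $N - M$ extra reverse steps allow the model to remove not only the injected diffusion noise but also the environment perturbation already present in $o$, which is the sense in which the perturbation is interpreted as part of the noise suppressed through diffusion.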