Offline reinforcement learning (RL) has recently become a popular RL paradigm. In offline RL, data providers share pre-collected datasets -- either individual transitions or sequences of transitions forming trajectories -- to enable the training of RL models (also called agents) without direct interaction with the environment. Compared to traditional online RL, offline RL reduces costly environment interaction and has proven effective in critical areas such as navigation tasks. Meanwhile, concerns about privacy leakage from offline RL datasets have emerged. To safeguard private information in offline RL datasets, we propose PrivORL, the first differentially private (DP) offline dataset synthesis method, which leverages a diffusion model and a diffusion transformer to synthesize transitions and trajectories, respectively, under DP. The synthetic dataset can then be securely released for downstream analysis and research. PrivORL adopts the popular approach of pre-training a synthesizer on public datasets and then fine-tuning it on sensitive datasets with DP stochastic gradient descent (DP-SGD). In addition, PrivORL introduces curiosity-driven pre-training, which uses feedback from a curiosity module to diversify the synthetic data, enabling the synthesizer to generate diverse transitions and trajectories that closely resemble the sensitive dataset. Extensive experiments on five sensitive offline RL datasets show that our method achieves better utility and fidelity than baselines in both DP transition and trajectory synthesis. The replication package is available in our GitHub repository.
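To make the pre-train-then-DP-fine-tune recipe described above concrete, the following is a minimal sketch (not the authors' code): a toy transition synthesizer is first trained on a public offline RL dataset with ordinary SGD, then fine-tuned on the sensitive dataset with DP-SGD via the Opacus `PrivacyEngine`. The denoiser architecture, placeholder tensors, transition dimension, and privacy budget are illustrative assumptions, and the curiosity module is omitted.

```python
# Minimal sketch, assuming a simple MLP denoiser and synthetic placeholder data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

TRANSITION_DIM = 23   # assumed dim of a flattened (state, action, reward, next_state)

class Denoiser(nn.Module):
    """Toy stand-in for the diffusion model's noise-prediction network."""
    def __init__(self, dim=TRANSITION_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_noisy, t):
        return self.net(torch.cat([x_noisy, t], dim=-1))

def diffusion_loss(model, x0):
    """DDPM-style objective: predict the Gaussian noise injected into x0."""
    t = torch.rand(x0.size(0), 1)          # noise level in [0, 1), crude linear schedule
    noise = torch.randn_like(x0)
    x_noisy = (1 - t) * x0 + t * noise
    return nn.functional.mse_loss(model(x_noisy, t), noise)

def train_epoch(model, optimizer, loader):
    for (batch,) in loader:
        optimizer.zero_grad()
        diffusion_loss(model, batch).backward()
        optimizer.step()

# Stage 1: non-private pre-training on a public dataset (placeholder tensor).
public_transitions = torch.randn(10_000, TRANSITION_DIM)
model = Denoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
public_loader = DataLoader(TensorDataset(public_transitions), batch_size=256, shuffle=True)
for _ in range(5):
    train_epoch(model, optimizer, public_loader)

# Stage 2: DP-SGD fine-tuning on the sensitive dataset (placeholder tensor).
sensitive_transitions = torch.randn(5_000, TRANSITION_DIM)
sensitive_loader = DataLoader(TensorDataset(sensitive_transitions), batch_size=256, shuffle=True)
fine_tune_epochs = 10
privacy_engine = PrivacyEngine()
model, optimizer, sensitive_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=sensitive_loader,
    target_epsilon=10.0,       # assumed privacy budget
    target_delta=1e-5,
    epochs=fine_tune_epochs,
    max_grad_norm=1.0,         # per-sample gradient clipping bound
)
for _ in range(fine_tune_epochs):
    train_epoch(model, optimizer, sensitive_loader)
print(f"spent epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

After fine-tuning, the denoiser would be sampled to produce the synthetic transitions that are released instead of the sensitive data; by the post-processing property of DP, sampling consumes no additional privacy budget.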