While real-world applications of reinforcement learning (RL) are becoming popular, the security and robustness of RL systems deserve more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. a Trojan agent), causing it to fail catastrophically as soon as it observes the backdoor trigger actions. To ensure the security of RL agents against such malicious backdoors, in this work we formulate the problem of backdoor detection in a multi-agent competitive reinforcement learning system, with the objectives of detecting Trojan agents and their corresponding potential trigger actions, and further mitigating the Trojan behavior. To solve this problem, we propose PolicyCleanse, which builds on the property that the accumulated reward of an activated Trojan agent degrades noticeably after several timesteps. Along with PolicyCleanse, we also design a machine-unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents and outperform existing backdoor mitigation baselines by at least 3% in winning rate across various types of agents and environments.
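For concreteness, the detection signal that PolicyCleanse builds on (a sharp drop in the suspect agent's accumulated reward once a candidate trigger-action sequence appears) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the PolicyCleanse algorithm itself; the environment and policy interfaces, the function names, and the threshold are hypothetical placeholders.

```python
# Minimal sketch of reward-degradation-based Trojan detection.
# All interfaces (env.reset, env.step, env.opponent_action) and names
# (rollout_return, is_trojan_agent) are assumptions for illustration only.

import numpy as np


def rollout_return(env, policy, trigger_actions=None, horizon=200, seed=0):
    """Accumulated reward of `policy` over one episode.

    If `trigger_actions` is given, the opponent plays that action sequence at
    the start of the episode (the candidate backdoor trigger); afterwards it
    reverts to its normal behavior.
    """
    obs = env.reset(seed=seed)
    total = 0.0
    for t in range(horizon):
        if trigger_actions is not None and t < len(trigger_actions):
            opp_action = trigger_actions[t]          # inject candidate trigger
        else:
            opp_action = env.opponent_action(obs)    # normal opponent behavior
        action = policy(obs)
        obs, reward, done = env.step(action, opp_action)
        total += reward
        if done:
            break
    return total


def is_trojan_agent(env, policy, candidate_triggers,
                    n_episodes=20, drop_threshold=0.5):
    """Flag the agent if any candidate trigger causes a large relative reward drop."""
    clean = np.mean([rollout_return(env, policy, None, seed=s)
                     for s in range(n_episodes)])
    for trig in candidate_triggers:
        triggered = np.mean([rollout_return(env, policy, trig, seed=s)
                             for s in range(n_episodes)])
        if clean > 0 and (clean - triggered) / abs(clean) > drop_threshold:
            return True, trig   # Trojan behavior activated by this trigger
    return False, None
```

In this sketch the candidate triggers are assumed to be given; in the paper's setting they would first have to be reverse-engineered, which is part of what PolicyCleanse addresses.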