Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations of an agent's specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states over which the agent's action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that avoids the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains individual policy actions but also captures the evolution of policy logic during training (from erratic, unstable behavior to optimized, robust strategies), providing actionable insights into agent sensitivity and decision boundaries.
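To make the two explanation components concrete, the following is a minimal sketch (not the authors' released implementation) of how a Robustness Region and counterfactual candidates could be computed for a black-box policy over a factored discrete state: a breadth-first search expands single-feature, unit perturbations, keeping neighbors on which the policy's action is unchanged and recording action-flipping neighbors as counterfactual candidates. The names `policy`, `feature_ranges`, the unit step size, and the L1 distance used to rank counterfactuals are illustrative assumptions.

```python
from collections import deque


def robustness_region(state, policy, feature_ranges):
    """Sketch of an exact, search-based local explanation for a discrete factored state.

    state          -- tuple of ints (one entry per state feature)
    policy         -- black-box callable mapping a state tuple to a discrete action
    feature_ranges -- list of (lo, hi) bounds, one per feature
    Returns (region, minimal_counterfactuals).
    """
    base_action = policy(state)
    region = {state}                 # connected states with the same action
    counterfactuals = set()          # boundary states where the action flips
    frontier = deque([state])

    while frontier:
        s = frontier.popleft()
        for i, (lo, hi) in enumerate(feature_ranges):
            for delta in (-1, +1):   # unit perturbation of a single feature
                v = s[i] + delta
                if not (lo <= v <= hi):
                    continue
                nxt = s[:i] + (v,) + s[i + 1:]
                if nxt in region:
                    continue
                if policy(nxt) == base_action:
                    region.add(nxt)          # same action: grow the connected region
                    frontier.append(nxt)
                else:
                    counterfactuals.add(nxt)  # action changed: counterfactual candidate

    # Minimal counterfactuals: candidates closest to the query state (L1 distance assumed).
    dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    d_min = min((dist(state, c) for c in counterfactuals), default=None)
    minimal = [c for c in counterfactuals if dist(state, c) == d_min]
    return region, minimal
```

In this sketch the region is grown only through states sharing the query action, so it is connected by construction, and counterfactual candidates are discovered for free on its boundary; ranking them by distance to the query state yields the minimal ones under the assumed metric.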