Reliability in multi-agent systems (MAS) built on large language models is increasingly limited by cognitive failures rather than infrastructure faults. Existing observability tools describe failures but do not quantify how quickly distributed reasoning recovers once coherence is lost. We introduce MTTR-A (Mean Time-to-Recovery for Agentic Systems), a runtime reliability metric that measures cognitive recovery latency in MAS. MTTR-A adapts classical dependability theory to agentic orchestration, capturing the time required to detect reasoning drift and restore coherent operation. We further define complementary metrics, including mean time between failures (MTBF) and a normalized recovery ratio (NRR), and establish theoretical bounds linking recovery latency to long-run cognitive uptime. Using a LangGraph-based benchmark with simulated drift and reflex recovery, we empirically demonstrate measurable recovery behavior across multiple reflex strategies. This work establishes a quantitative foundation for runtime cognitive dependability in distributed agentic systems.
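To make the relationship between recovery latency and long-run uptime concrete, the following minimal sketch computes a mean recovery time, a mean time between failures, and the classical steady-state availability ratio MTBF / (MTBF + MTTR) from a list of drift episodes. It is illustrative only: the `Incident` record, the function names, and the sample timestamps are hypothetical, MTBF is taken here to mean the average coherent time preceding each drift episode, and the availability formula is the standard one from dependability theory rather than the specific bound derived in this work. NRR is left out because its definition is not given in this summary.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """One hypothetical drift episode; times are seconds since run start."""
    drift_start: float  # moment reasoning coherence was lost
    recovered: float    # moment coherent operation was restored


def mttr(incidents):
    """Mean recovery latency: average of (recovered - drift_start)."""
    return mean(i.recovered - i.drift_start for i in incidents)


def mtbf(incidents, run_start=0.0):
    """Mean coherent (up) time preceding each drift episode (assumed definition)."""
    up_times = []
    last_up = run_start
    for i in sorted(incidents, key=lambda x: x.drift_start):
        up_times.append(i.drift_start - last_up)
        last_up = i.recovered
    return mean(up_times)


def steady_state_uptime(mtbf_s, mttr_s):
    """Classical steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_s / (mtbf_s + mttr_s)


# Made-up sample data, not results from the benchmark.
incidents = [
    Incident(drift_start=100.0, recovered=110.0),
    Incident(drift_start=310.0, recovered=315.0),
    Incident(drift_start=615.0, recovered=630.0),
]

recovery = mttr(incidents)
between = mtbf(incidents)
uptime = steady_state_uptime(between, recovery)
print(f"MTTR={recovery:.1f}s MTBF={between:.1f}s uptime={uptime:.3f}")
```

Under these assumptions, shortening the recovery latency while holding the failure rate fixed raises the uptime ratio toward 1, which is the intuition behind linking recovery latency to long-run cognitive uptime.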