Large-scale cyber-physical systems (CPS), such as railway control systems and smart grids, consist of geographically distributed subsystems that are connected via unreliable, asynchronous inter-region networks. Their scale and distribution make them especially vulnerable to faults and attacks. Unfortunately, existing fault-tolerant methods either consume excessive resources or provide only eventual guarantees, making them unsuitable for real-time, resource-constrained CPS. We present GeoShield, a resource-efficient solution for defending geo-distributed CPS against Byzantine faults. GeoShield leverages the property that CPS are designed to tolerate brief disruptions and maintain safety, as long as they recover (i.e., resume normal operations or transition to a safe mode) within a bounded amount of time following a fault. Instead of masking faults, GeoShield detects them and recovers the system within bounded time, thus guaranteeing safety with far fewer resources. GeoShield introduces protocols for Byzantine fault-resilient network measurement and inter-region omission fault detection that proactively detect malicious message delays, along with recovery mechanisms that guarantee timely recovery while maximizing operational robustness. It is the first bounded-time recovery solution that operates effectively under unreliable networks without relying on trusted hardware. Evaluations using real-world case studies show that it significantly outperforms existing methods in both effectiveness and resource efficiency.
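The bounded-time recovery idea in the abstract can be illustrated with a minimal sketch. It is not GeoShield's protocol. It only assumes that a fault is tolerable when worst-case detection time plus worst-case recovery time fits within the window during which the physical system stays safe, and that an omission is suspected when an expected message is later than a delay bound. All names (`RecoveryBudget`, `detect_omission`, `choose_action`), the millisecond values, and the fallback policy are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryBudget:
    """Hypothetical timing parameters in milliseconds (not taken from the paper)."""
    tolerance_ms: float   # how long the physical system stays safe after a fault
    detection_ms: float   # assumed worst-case time to notice the fault
    recovery_ms: float    # assumed worst-case time to resume service

    def is_safe(self) -> bool:
        # The bound holds when detection plus recovery fits in the tolerance window.
        return self.detection_ms + self.recovery_ms <= self.tolerance_ms


def detect_omission(last_received_ms: float, now_ms: float, delay_bound_ms: float) -> bool:
    """Suspect an omission when no message has arrived within the delay bound."""
    return now_ms - last_received_ms > delay_bound_ms


def choose_action(suspected: bool, budget: RecoveryBudget) -> str:
    """Illustrative policy: keep running, recover, or fall back to safe mode."""
    if not suspected:
        return "continue"
    # Assumed policy: attempt full recovery only if it fits the budget.
    return "recover" if budget.is_safe() else "safe_mode"
```

For example, with an assumed tolerance of 500 ms, detection of 120 ms, and recovery of 300 ms, `RecoveryBudget(500, 120, 300).is_safe()` holds because 420 ms fits in the window. Shrinking the tolerance to 300 ms makes `choose_action` fall back to `"safe_mode"`.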