Software bugs in cloud management systems often cause erratic behavior, hindering the detection of and recovery from failures. As a consequence, failures are not detected and reported in a timely manner, and can silently propagate through the system. To address these issues, we propose a lightweight approach to runtime verification for monitoring and failure detection in cloud computing systems. We performed a preliminary evaluation of the proposed approach on the OpenStack cloud management platform, an "off-the-shelf" distributed system, showing that the approach can be applied with high failure detection coverage.