Transparent and Coherent Procedural Mistake Detection

Procedural mistake detection (PMD) is the challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. We therefore extend PMD to require generating visual self-dialog rationales that inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. Because our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but that their accuracy, coherence, and efficiency can be improved, with some trade-offs, by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics make it possible to visualize common outcomes, highlighting areas for further improvement.
