Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.
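As a point of reference for the Precision@10 figures reported above, the following is a minimal sketch of one common reading of Precision@k in slice discovery: the fraction of the k highest-ranked samples of a predicted slice that belong to the matched ground-truth slice. The exact evaluation protocol used for FeSD is not specified here, so the function name `precision_at_k`, the sample IDs, and the matching scheme are illustrative assumptions rather than the benchmark's actual implementation.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], ground_truth: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked samples that lie in the ground-truth slice.

    Assumption: `ranked_ids` is ordered from most to least confident slice member.
    If fewer than k samples are returned, the missing positions count as misses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked_ids)[:k]
    hits = sum(1 for sample_id in top_k if sample_id in ground_truth)
    return hits / k


# Hypothetical example: 10 ranked samples, 7 of which are true slice members.
ranked = [f"img_{i:02d}" for i in range(10)]
truth = {"img_00", "img_01", "img_02", "img_04", "img_05", "img_07", "img_09"}
print(precision_at_k(ranked, truth, k=10))  # prints 0.7
```

Under this reading, a reported score of 0.73 versus 0.31 would mean that, averaged over slices, roughly 7 rather than 3 of the top 10 retrieved samples are genuine members of the ground-truth slice.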