Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal products, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, like blur and misframing of items, affect the accuracy of VLM-generated captions and whether resulting captions meet BLV people's information needs. Grounded in a survey with 86 BLV people, we systematically evaluate how image quality issues affect captions generated by VLMs. We show that the best model recognizes products in images with no quality issues with 98% accuracy, but drops to 75% accuracy overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.