Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal products, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, like blur and misframing of items, affect the accuracy of VLM-generated captions and whether resulting captions meet BLV people's information needs. Grounded in a survey with 86 BLV people, we systematically evaluate how image quality issues affect captions generated by VLMs. We show that the best model recognizes products in images with no quality issues with 98% accuracy, but drops to 75% accuracy overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people's experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.