Large Language Models (LLMs) have been positioned as having the potential to expand access to health information in the Global South, yet their evaluation remains heavily dependent on benchmarks designed around Western norms. We present insights from a preliminary benchmarking exercise with a sexual and reproductive health (SRH) chatbot built for an underserved community in India. We evaluated the chatbot using HealthBench, a benchmark for conversational health models developed by OpenAI. We extracted 637 SRH queries from the dataset and evaluated the chatbot on the 330 single-turn conversations among them. Responses were graded with HealthBench's rubric-based automated grader, which rated them consistently low. However, qualitative analysis by trained annotators and public health experts revealed that many responses were in fact culturally appropriate and medically accurate. We highlight recurring issues rooted in Western bias: legal framing and social norms (e.g., breastfeeding in public), dietary assumptions (e.g., that fish is safe to eat during pregnancy), and cost framing (e.g., insurance models). Our findings demonstrate the limitations of current benchmarks in capturing the effectiveness of systems built for different cultural and healthcare contexts. We argue for the development of culturally adaptive evaluation frameworks that uphold quality standards while recognizing the needs of diverse populations.
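To make the evaluation setup concrete, below is a minimal sketch of the filtering and rubric-grading steps described above. The field names (`prompt`, `rubrics`, `criterion`, `points`), the file name, and the `grader` callable are illustrative assumptions, not the actual HealthBench schema or grader API; the scoring follows the general pattern of summing earned rubric points over the maximum possible positive points, clipped to [0, 1].

```python
# Illustrative sketch only: field names and file path are assumptions,
# not the actual HealthBench schema. The model-based judge is stubbed as
# a callable `grader(criterion, response) -> bool`.
import json
from typing import Callable

def load_examples(path: str) -> list[dict]:
    """Load HealthBench-style examples from a JSONL file (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def is_single_turn(example: dict) -> bool:
    """Keep conversations with exactly one user message (assumed structure)."""
    return sum(1 for m in example["prompt"] if m["role"] == "user") == 1

def rubric_score(example: dict, response: str,
                 grader: Callable[[str, str], bool]) -> float:
    """Score a response against weighted rubric criteria: points earned on
    met criteria (positive or negative) over the maximum possible positive
    points, clipped to [0, 1]."""
    earned = sum(c["points"] for c in example["rubrics"]
                 if grader(c["criterion"], response))
    total = sum(c["points"] for c in example["rubrics"] if c["points"] > 0)
    return min(1.0, max(0.0, earned / total)) if total else 0.0

# Example: restrict the extracted SRH queries to single-turn conversations.
examples = [e for e in load_examples("healthbench_srh.jsonl")
            if is_single_turn(e)]
```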