Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation on real-world data. We investigate how romanization affects the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent performance degradation on romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could translate into nearly 2 million excess triage errors. Crucially, this script-dependent performance gap does not stem from a failure of clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries; nevertheless, their final classification outputs remain brittle to the orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
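To make the script comparison concrete, the sketch below shows one way a per-script F1 gap of the kind reported here could be computed. It is a minimal illustration only: the record fields, triage labels, and model predictions are hypothetical placeholders, not the paper's actual data or pipeline.

```python
# Illustrative sketch (not the paper's code): computing the native-vs-romanized
# macro-F1 gap for a triage classifier. All records below are hypothetical.
from sklearn.metrics import f1_score

# Each record: (script, gold_triage_label, model_predicted_label)
records = [
    ("native",    "urgent",     "urgent"),
    ("native",    "non_urgent", "non_urgent"),
    ("romanized", "urgent",     "non_urgent"),  # the failure mode of interest
    ("romanized", "non_urgent", "non_urgent"),
]

def macro_f1_for(script: str) -> float:
    """Macro-averaged F1 restricted to queries written in the given script."""
    gold = [g for s, g, _ in records if s == script]
    pred = [p for s, _, p in records if s == script]
    return f1_score(gold, pred, average="macro", zero_division=0)

gap = macro_f1_for("native") - macro_f1_for("romanized")
print(f"native macro-F1:    {macro_f1_for('native'):.2f}")
print(f"romanized macro-F1: {macro_f1_for('romanized'):.2f}")
print(f"script gap (F1 points): {100 * gap:.1f}")
```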