Large language models perform well on technical problem solving in English but poorly when the same questions are asked in Bangla. A straightforward workaround is to translate Bangla questions into English first and then pass them to these models. However, existing Bangla-English translation systems struggle with technical terms: they often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then had human evaluators select the highest-quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and evaluate it on two downstream tasks: code generation and math problem solving. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
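The translate-then-solve workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the names `translate_then_solve` and `load_translator` are hypothetical, and it is an assumption that the released checkpoint loads through the standard Hugging Face `transformers` seq2seq classes under the repository id taken from the URL above, with no task prefix required. The downstream solver is left as a pluggable callable because the abstract does not name a specific model.

```python
from typing import Callable, Dict


def translate_then_solve(
    question_bn: str,
    translate: Callable[[str], str],
    solve: Callable[[str], str],
) -> Dict[str, str]:
    """Translate a Bangla question to English, then pass it to an English-focused solver."""
    english = translate(question_bn).strip()
    if not english:
        # An empty translation would silently change the problem, so fail loudly.
        raise ValueError("translator returned an empty string")
    return {"bangla": question_bn, "english": english, "answer": solve(english)}


def load_translator(
    model_id: str = "reyazul/BanglaSTEM-T5",  # assumed repo id, taken from the URL
    max_new_tokens: int = 128,                # illustrative value, not from the paper
) -> Callable[[str], str]:
    """Build a translate() callable from a T5-style checkpoint (assumed layout)."""
    # Imported lazily so the pipeline above can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    def translate(text: str) -> str:
        # Assumption: the model takes raw Bangla text with no task prefix.
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return translate
```

In use, `translate = load_translator()` would be passed together with any function that sends an English prompt to a code-generation or math-solving model. Keeping the translator and solver as separate callables makes it possible to swap in a different translation system and compare downstream answers on the same questions.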