Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and the variability of handwriting styles. Prior methods have hit performance bottlenecks because they propose isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31\% and the top-performing VLM Gemini2.5-flash by 24.42\% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER