Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition

Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) because symbol layouts are inherently free-form and handwriting styles vary widely. Prior methods have faced performance bottlenecks and responded with isolated architectural modifications that are difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves new state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% in the zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
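
The abstract does not say how the Symbol Counting targets are constructed. As a rough illustration of the idea only, the sketch below counts the visible symbols in a LaTeX string, the kind of auxiliary signal a counting task could ask a model to predict alongside the expression. The tokenizer regex, the set of excluded structural tokens, and the function names `tokenize` and `count_symbols` are all assumptions made for this example and are not taken from the Uni-MuMER code base.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the authors' implementation):
# a LaTeX command such as \frac, an escaped character such as \{,
# or any single non-whitespace character.
TOKEN_PATTERN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Tokens that encode layout rather than a written symbol. Excluding them
# from the count is a choice made for this sketch.
STRUCTURAL = {"{", "}", "^", "_"}


def tokenize(latex):
    """Split a LaTeX string into command and single-character tokens."""
    return TOKEN_PATTERN.findall(latex)


def count_symbols(latex):
    """Return a Counter of the non-structural tokens in a LaTeX string."""
    return Counter(tok for tok in tokenize(latex) if tok not in STRUCTURAL)


if __name__ == "__main__":
    counts = count_symbols(r"\frac{a+b}{a-b} + x^{2}")
    for symbol, n in sorted(counts.items()):
        print(symbol, n)
```

A count of this kind gives a cheap consistency check on long expressions: if a predicted expression contains a different number of some symbol than the predicted count, at least one of the two outputs is wrong.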

