This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a sequence-to-sequence (Seq2Seq) Transformer trained from scratch for Neural Machine Translation (NMT). Our approach first uses the U-Net, trained on a synthetic dataset, to detect and segment text regions in an image. The detected regions are then processed by Tesseract to extract the source text, which is fed into the Transformer, trained on a multilingual parallel corpus spanning five languages. Unlike systems that rely on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on text detection accuracy, text recognition quality, and translation performance measured by BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
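
As a rough illustration of how the three stages compose, the sketch below wires a U-Net detector, Tesseract, and a translation function into one pipeline. The U-Net interface (`unet`, assumed to emit a per-pixel text/background logit map), the 0.5 mask threshold, and `translate_fn` are hypothetical placeholders rather than the paper's actual implementation; pytesseract and OpenCV are assumed for OCR and for grouping mask pixels into boxes.

```python
# Minimal sketch of the detect -> recognize -> translate pipeline.
# `unet` and `translate_fn` stand in for the paper's trained models.
import cv2
import numpy as np
import pytesseract
import torch
from PIL import Image


def detect_text_boxes(unet: torch.nn.Module, img: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Stage 1: segment text pixels with the U-Net, then group them into boxes."""
    x = torch.from_numpy(img.astype(np.float32) / 255.0)[None, None]  # (1, 1, H, W)
    with torch.no_grad():
        mask = (torch.sigmoid(unet(x))[0, 0] > 0.5).numpy().astype(np.uint8)
    # Connected components turn the binary text mask into axis-aligned boxes.
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    return [tuple(stats[i, :4]) for i in range(1, n)]  # skip background label 0


def recognize(img: np.ndarray, box: tuple[int, int, int, int], lang: str = "eng") -> str:
    """Stage 2: crop a detected region and run Tesseract OCR on it."""
    x, y, w, h = box
    crop = Image.fromarray(img[y : y + h, x : x + w])
    return pytesseract.image_to_string(crop, lang=lang).strip()


def translate_image(path: str, unet: torch.nn.Module, translate_fn) -> list[str]:
    """Full pipeline: detect, recognize, then translate each text region."""
    img = np.asarray(Image.open(path).convert("L"))
    texts = [recognize(img, box) for box in detect_text_boxes(unet, img)]
    return [translate_fn(t) for t in texts if t]  # Stage 3: Seq2Seq Transformer
```

In this framing, each stage is swappable behind a narrow interface (boxes in, strings out), which is the adaptability the abstract claims for a fully custom pipeline.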