Automatic diacritic restoration is crucial for text processing in languages with rich diacritical marks, such as Romanian. This study evaluates the performance of several large language models (LLMs) in restoring diacritics in Romanian texts. Using a comprehensive corpus, we tested models including OpenAI's GPT-3.5, GPT-4, GPT-4o, Google's Gemini 1.0 Pro, Meta's Llama 2 and Llama 3, MistralAI's Mixtral 8x7B Instruct, airoboros 70B, and OpenLLM-Ro's RoLlama 2 7B, under multiple prompt templates ranging from zero-shot to complex multi-shot instructions. Results show that models such as GPT-4o achieve high diacritic restoration accuracy, consistently surpassing a neutral echo baseline, while others, including Meta's Llama family, exhibit wider variability. These findings highlight the impact of model architecture, training data, and prompt design on diacritic restoration performance and outline promising directions for improving NLP tools for diacritic-rich languages.
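To make the evaluation setup concrete, the sketch below shows one plausible way to score diacritic restoration and to compute the neutral "echo" baseline mentioned above (simply returning the diacritic-stripped input unchanged). This is an illustrative assumption, not the paper's evaluation code; the word-level metric, function names, and example sentence are hypothetical.

```python
# Illustrative sketch (not the authors' code) of diacritic restoration scoring
# and the "echo" baseline, which returns the undiacritized input unchanged.
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove Romanian diacritics (ă, â, î, ș, ț, ...) via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def word_accuracy(restored: str, gold: str) -> float:
    """Fraction of whitespace-separated tokens that exactly match the gold text."""
    r_tokens, g_tokens = restored.split(), gold.split()
    if len(r_tokens) != len(g_tokens) or not g_tokens:
        return 0.0
    return sum(r == g for r, g in zip(r_tokens, g_tokens)) / len(g_tokens)

gold = "În această dimineață, știința și țara înseamnă totul."  # hypothetical gold sentence
stripped = strip_diacritics(gold)   # model input without diacritics
echo = stripped                     # echo baseline: output equals the input
model_output = gold                 # a perfect restoration, for illustration only

print(f"echo baseline accuracy: {word_accuracy(echo, gold):.2f}")
print(f"model accuracy:         {word_accuracy(model_output, gold):.2f}")
```

In this toy example the echo baseline scores only on words that need no diacritics, which is why a model must beat it to demonstrate genuine restoration ability.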