EMCee: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context

Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric training data leads to significant performance degradation in non-English languages. While existing multilingual prompting methods emphasize reformulating queries into English or enhancing reasoning capabilities, they often fail to incorporate the language- and culture-specific grounding that is essential for some queries. To address this limitation, we propose EMCee (Extracting synthetic Multilingual Context and merging), a simple yet effective framework that enhances the multilingual capabilities of LLMs by explicitly extracting and utilizing query-relevant knowledge from the LLM itself. In particular, EMCee first extracts synthetic context to uncover latent, language-specific knowledge encoded within the LLM, and then dynamically merges this contextual insight with reasoning-oriented outputs through a judgment-based selection mechanism. Extensive experiments on four multilingual benchmarks covering diverse languages and tasks demonstrate that EMCee consistently outperforms prior approaches, achieving an average relative improvement of 16.4% overall and 31.7% in low-resource languages.
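The two stages described in the abstract (context extraction, then judgment-based selection between a context-grounded output and a reasoning-oriented output) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, every function name, and all prompt wordings are hypothetical, and the abstract does not specify how the merge or the judgment is actually carried out.

```python
import re
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def extract_context(llm: LLM, query: str, language: str) -> str:
    """Stage 1: ask the model for query-relevant background knowledge in the query's language."""
    prompt = (
        f"Write down background knowledge in {language} that is relevant to the "
        f"question below. Do not answer the question.\n\nQuestion: {query}"
    )
    return llm(prompt)


def answer_with_context(llm: LLM, query: str, context: str) -> str:
    """Knowledge-grounded candidate: answer using the extracted synthetic context."""
    return llm(f"Context: {context}\n\nQuestion: {query}\nAnswer:")


def answer_with_reasoning(llm: LLM, query: str) -> str:
    """Reasoning-oriented candidate: answer step by step, without the context."""
    return llm(f"Question: {query}\nThink step by step, then give the answer:")


def judge_and_select(llm: LLM, query: str, candidates: List[str]) -> str:
    """Stage 2: ask the model which candidate is best and return that candidate."""
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    verdict = llm(
        f"Question: {query}\nCandidate answers:\n{listing}\n"
        "Reply with the number of the best candidate only."
    )
    match = re.search(r"[0-9]+", verdict)
    index = int(match.group()) - 1 if match else 0
    # Fall back to the first candidate if the verdict is missing or out of range.
    return candidates[index] if 0 <= index < len(candidates) else candidates[0]


def emcee_style_answer(llm: LLM, query: str, language: str) -> str:
    """Run both stages with the same model (assumed pipeline shape)."""
    context = extract_context(llm, query, language)
    grounded = answer_with_context(llm, query, context)
    reasoned = answer_with_reasoning(llm, query)
    return judge_and_select(llm, query, [grounded, reasoned])
```

Passing the model as a plain callable keeps the sketch independent of any particular API client. The same model plays all three roles here (context extractor, answerer, judge), which mirrors the abstract's claim that the knowledge is drawn from the LLM itself.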
