System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework for assessing system prompts in multilingual environments. Through large-scale experiments spanning five languages, three LLMs, and three benchmarks, we find that certain prompt components, such as chain-of-thought (CoT), emotion, and scenario cues, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show that it automatically discovers prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language switching. Together, our results highlight system prompt optimization as a scalable path toward accurate and robust multilingual LLM behavior.