Over the years, automatic MT metrics have climbed benchmark leaderboards and shown strong, sometimes human-level, agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing on real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
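The evaluate-revise loop behind Remedy-R Agent can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` and `revise` are hypothetical stand-ins for the Remedy-R judge and the MT model being refined, and the stopping criterion (a score threshold with a round limit) is an assumption.

```python
def evaluate(source: str, translation: str) -> tuple[float, str]:
    """Hypothetical stand-in for a Remedy-R-style judge, which would
    reason over accuracy, fluency, and completeness before scoring.
    Here we use crude heuristic checks purely for illustration."""
    feedback = []
    score = 1.0
    if not translation.strip():
        score = 0.0
        feedback.append("translation is empty")
    elif len(translation) < 0.5 * len(source):
        score = min(score, 0.5)
        feedback.append("possible omission: translation much shorter than source")
    return score, "; ".join(feedback) or "no issues found"


def revise(source: str, translation: str, feedback: str) -> str:
    """Hypothetical stand-in for the MT model: a real agent would prompt
    it with the evaluator's analysis and ask for a corrected translation."""
    return translation  # placeholder: returns the input unchanged


def evaluate_revise(source: str, translation: str,
                    threshold: float = 0.9, max_rounds: int = 3):
    """Iteratively evaluate, then revise using the judge's feedback,
    until the score clears the threshold or the round budget runs out."""
    score, feedback = evaluate(source, translation)
    for _ in range(max_rounds):
        if score >= threshold:
            break
        translation = revise(source, translation, feedback)
        score, feedback = evaluate(source, translation)
    return translation, score
```

The design point is that the judge's textual analysis, not just its scalar score, is fed back to the translator, which is what distinguishes this pipeline from plain reranking.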