As Retrieval-Augmented Generation (RAG) systems evolve toward more sophisticated architectures, ensuring their trustworthiness through explainable and robust evaluation becomes critical. Existing scalar metrics suffer from limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies. We introduce DICE (Discrete Interpretable Comparative Evaluation), a two-stage, evidence-coupled framework that advances explainability and robustness in RAG evaluation. DICE combines deep analytical reasoning with probabilistic $\{A, B, \mathrm{Tie}\}$ scoring to produce transparent, confidence-aware pairwise judgments; the accompanying interpretable reasoning traces support accountable system improvement by enabling systematic error diagnosis and actionable insights. To address efficiency at scale, DICE employs a Swiss-system tournament that reduces the number of required comparisons from $O(N^2)$ to $O(N \log N)$, a 42.9% reduction in our eight-system evaluation, while preserving ranking fidelity. Validation on a curated Chinese financial QA dataset shows that DICE achieves 85.7% agreement with human experts, substantially outperforming existing LLM-based metrics such as RAGAS. These results establish DICE as a responsible, explainable, and efficient paradigm for trustworthy RAG system assessment.
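As a sanity check on the reported savings (our own arithmetic, not a formula from the paper): a full round-robin over $N = 8$ systems requires $\binom{8}{2} = 28$ pairwise comparisons, while a Swiss-system schedule consistent with the reported figure uses $16$, matching the stated reduction:
$$
% assumed schedule: four Swiss rounds of N/2 = 4 matches each (round count is inferred from the 42.9% figure, not stated in the abstract)
\frac{28 - 16}{28} = \frac{12}{28} \approx 0.429 \approx 42.9\%.
$$
More generally, $O(\log N)$ rounds of $N/2$ matches each yield the stated $O(N \log N)$ comparison count, versus $\binom{N}{2} = O(N^2)$ for a full round-robin.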