This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is flagged when the system validates the same submission under both the standard and counterfactual conditions, since a legitimate answer should not be accepted against a false reference. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.
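The detection logic can be illustrated with a minimal sketch. The function and judge interface below (`evaluate_with_cfe`, `llm_judge`, `false_answer`) are illustrative assumptions, not the paper's implementation: they only show how combining SE and CFE verdicts yields an attack flag.

```python
# Minimal sketch of the SE+CFE detection logic, assuming a prompted LLM
# judge exposed as a callable `llm_judge(question, candidate, reference)`
# that returns True when it deems `candidate` correct w.r.t. `reference`.
from typing import Callable


def evaluate_with_cfe(
    question: str,
    candidate: str,
    true_answer: str,
    false_answer: str,
    llm_judge: Callable[[str, str, str], bool],
) -> dict:
    """Run Standard Evaluation (SE) and Counterfactual Evaluation (CFE).

    SE judges the candidate against the true ground truth; CFE repeats the
    judgment against a deliberately false ground truth. A candidate accepted
    in both conditions is flagged as a likely blind prompt-injection attack,
    since an honest answer should not also match a false reference.
    """
    se_pass = llm_judge(question, candidate, true_answer)    # standard condition
    cfe_pass = llm_judge(question, candidate, false_answer)  # counterfactual condition

    attack_detected = se_pass and cfe_pass
    return {
        "se_pass": se_pass,
        "cfe_pass": cfe_pass,
        "attack_detected": attack_detected,
        # Final verdict: accept only if SE passes and no attack is flagged.
        "accepted": se_pass and not attack_detected,
    }
```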