基于大语言模型评估系统中盲攻击检测的反事实评估方法 (Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems) - 专知论文

会员服务 ·

0

攻击 · 系统 · 反事实 · 评估系统 · 攻击检测 ·

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

翻译：基于大语言模型评估系统中盲攻击检测的反事实评估方法

Lijia Liu,Takumi Kondo,Kyohei Atarashi,Koh Takeuchi,Jiyi Li,Shigeru Saito,Hisashi Kashima

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, where a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is detected if the system validates an answer under both standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, our SE+CFE framework significantly improves security by boosting attack detection with minimal performance trade-offs.

翻译：本文研究针对大语言模型（LLM）评估系统在提示注入攻击下的防御策略。我们形式化了一类称为盲攻击的威胁，其中候选答案独立于真实答案精心设计，旨在欺骗评估器。为应对此类攻击，我们提出一个框架，将标准评估（SE）与反事实评估（CFE）相结合，后者通过故意使用错误的标准答案对提交内容进行重新评估。若系统在标准条件和反事实条件下均验证同一答案，则判定为攻击。实验表明，标准评估方法极易受攻击，而我们的SE+CFE框架通过显著提升攻击检测能力，在性能损失最小的情况下大幅增强了系统安全性。

0

相关内容

【ICCV2021】参数化对比学习

专知会员服务

33+阅读 · 2021年7月27日

【ICML2021】加速异构数据的分散式深度学习

专知会员服务

16+阅读 · 2021年7月7日

【NeurIPS2020】无限可能的联合对比学习

专知会员服务

29+阅读 · 2020年10月2日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

专知会员服务

24+阅读 · 2020年2月22日

《面向军事应用的数据驱动的行为建模》荷兰应用科学研究组织（NTO）

《面向军事应用的数据驱动的行为建模》荷兰应用科学研究组织（NTO）

专知

50+阅读 · 2022年6月2日

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

专知

13+阅读 · 2020年4月1日

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

专知

18+阅读 · 2018年4月2日

网络节点表示学习论文笔记03—基于异构网络节点表示的推荐系统

网络节点表示学习论文笔记03—基于异构网络节点表示的推荐系统

专知

27+阅读 · 2018年2月24日

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

炼数成金订阅号

26+阅读 · 2017年7月10日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

基于非对称群体兴趣相关性并融合情境与群体信任的Web服务推荐研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向异分布数据的主动学习方法

国家自然科学基金

12+阅读 · 2015年12月31日

高维数据下的模型平均方法

国家自然科学基金

6+阅读 · 2014年12月31日

基于动态分层与自学习的多智能体自适应协作模型

国家自然科学基金

17+阅读 · 2008年12月31日

An Efficient Gradient-Based Inference Attack for Federated Learning

Arxiv

0+阅读 · 12月17日

Edge-Only Universal Adversarial Attacks in Distributed Learning

Arxiv

0+阅读 · 12月5日

Evolving Prompts for Toxicity Search in Large Language Models

Arxiv

0+阅读 · 11月16日

Partial Information Decomposition for Data Interpretability and Feature Selection

Arxiv

0+阅读 · 11月14日

A Novel Data-Dependent Learning Paradigm for Large Hypothesis Classes

Arxiv

0+阅读 · 11月13日

VIP会员

文章信息

相关主题

相关VIP内容

【ICCV2021】参数化对比学习

专知会员服务

33+阅读 · 2021年7月27日

【ICML2021】加速异构数据的分散式深度学习

专知会员服务

16+阅读 · 2021年7月7日

【NeurIPS2020】无限可能的联合对比学习

专知会员服务

29+阅读 · 2020年10月2日

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

【Mila-Google】使用元学习动态调整源代码模型，On-the-Fly Adaptation of Source Code Models using Meta-Learning

专知会员服务

21+阅读 · 2020年3月28日

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

GeoffreyHinton-ICML2020投稿论文-偏转对抗攻击 Deflecting Adversarial Attacks

专知会员服务

24+阅读 · 2020年2月22日

热门VIP内容

开通专知VIP会员享更多权益服务

《北约认知战概念报告》

《预测促成大规模货运无人机的技术趋势与影响》报告

美海军放弃星座级转而采用国家安全巡逻舰设计

《北约作战弹性概念》报告

相关资讯

《面向军事应用的数据驱动的行为建模》荷兰应用科学研究组织（NTO）

《面向军事应用的数据驱动的行为建模》荷兰应用科学研究组织（NTO）

专知

50+阅读 · 2022年6月2日

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

【CVPR2020-旷视】DPGN：分布传播图网络的小样本学习

专知

13+阅读 · 2020年4月1日

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

【推荐系统论文笔记】DKN: 基于深度知识感知的新闻推荐网络（WWW2018 ）

专知

18+阅读 · 2018年4月2日

网络节点表示学习论文笔记03—基于异构网络节点表示的推荐系统

网络节点表示学习论文笔记03—基于异构网络节点表示的推荐系统

专知

27+阅读 · 2018年2月24日

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

语义分割中的深度学习方法全解：从FCN、SegNet到DeepLab

炼数成金订阅号

26+阅读 · 2017年7月10日

相关论文

An Efficient Gradient-Based Inference Attack for Federated Learning

Arxiv

0+阅读 · 12月17日

Edge-Only Universal Adversarial Attacks in Distributed Learning

Arxiv

0+阅读 · 12月5日

Evolving Prompts for Toxicity Search in Large Language Models

Arxiv

0+阅读 · 11月16日

Partial Information Decomposition for Data Interpretability and Feature Selection

Arxiv

0+阅读 · 11月14日

A Novel Data-Dependent Learning Paradigm for Large Hypothesis Classes

Arxiv

0+阅读 · 11月13日

相关基金

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

基于非对称群体兴趣相关性并融合情境与群体信任的Web服务推荐研究

国家自然科学基金

1+阅读 · 2015年12月31日

面向异分布数据的主动学习方法

国家自然科学基金

12+阅读 · 2015年12月31日

高维数据下的模型平均方法

国家自然科学基金

6+阅读 · 2014年12月31日

基于动态分层与自学习的多智能体自适应协作模型

国家自然科学基金

17+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员