一场接一场的较量：基于演化基准框架探究大语言模型在多轮指令遵循中的极限 (One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework) - 专知论文

会员服务 ·

0

基准 · 指令遵循 · 语言模型 · 大语言模型 · 基准测试 ·

One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

翻译：一场接一场的较量：基于演化基准框架探究大语言模型在多轮指令遵循中的极限

Qi Jia,Ye Shen,Xiujie Song,Kaiwei Zhang,Shibo Wang,Dun Pei,Xiangyang Zhu,Guangtao Zhai

Evaluating LLMs' instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, susceptible to saturation and failing to account for users' interactive experience. In this work, we propose a novel framework backed by a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Incorporating Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Upon this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Results indicate that GPT-5 excels, sustaining 14 turns with 66.40% robustness. It outperforms Gemini-3.0-Pro by a margin of 5.59%, while other models trail behind. Resources are available at https://github.com/JiaQiSJTU/EvolvingInstructionFollowing.

翻译：评估大语言模型在多主题对话中的指令遵循能力至关重要，但也极具挑战性。现有基准测试通常局限于固定轮次，容易达到性能饱和，且未能充分考虑用户的交互体验。本研究提出了一种新颖的框架，该框架依托三层追踪机制和一个查询合成智能体来模拟序列化的用户行为。结合心流理论，我们引入了以过程为中心的评估指标，并仅在耗尽用户耐心时才终止对话评估。基于此框架，我们提出了EvolIF——一个覆盖12个约束组的演化基准。实验结果表明，GPT-5表现卓越，能够持续14轮对话并保持66.40%的鲁棒性，其性能优于Gemini-3.0-Pro达5.59%，而其他模型则相对落后。相关资源已发布于https://github.com/JiaQiSJTU/EvolvingInstructionFollowing。

0

相关内容

DeepSeek模型综述：V1 V2 V3 R1-Zero

DeepSeek模型综述：V1 V2 V3 R1-Zero

专知会员服务

116+阅读 · 2月11日

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

专知会员服务

17+阅读 · 2022年5月10日

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

专知会员服务

22+阅读 · 2022年3月7日

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【深度图相似学习综述】Deep Graph Similarity Learning: A Survey，29页pdf，117条参考文献

【深度图相似学习综述】Deep Graph Similarity Learning: A Survey，29页pdf，117条参考文献

专知会员服务

98+阅读 · 2019年12月31日

【CVPR 2020 Oral】小样本类增量学习

【CVPR 2020 Oral】小样本类增量学习

专知

20+阅读 · 2020年6月26日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

论文笔记之Feature Selective Networks for Object Detection

论文笔记之Feature Selective Networks for Object Detection

统计学习与视觉计算组

21+阅读 · 2018年7月26日

误差反向传播——CNN

误差反向传播——CNN

统计学习与视觉计算组

30+阅读 · 2018年7月12日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

粗糙回归模型与算法研究

国家自然科学基金

8+阅读 · 2015年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

复杂多元数据的半参数统计推断

国家自然科学基金

5+阅读 · 2014年12月31日

基于动态分层与自学习的多智能体自适应协作模型

国家自然科学基金

17+阅读 · 2008年12月31日

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Arxiv

0+阅读 · 12月15日

Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Arxiv

0+阅读 · 12月2日

NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model

Arxiv

0+阅读 · 12月1日

MS-PPO: Morphological-Symmetry-Equivariant Policy for Legged Robot Locomotion

Arxiv

0+阅读 · 11月30日

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Arxiv

0+阅读 · 11月26日

VIP会员

文章信息

相关主题

大语言模型

相关VIP内容

DeepSeek模型综述：V1 V2 V3 R1-Zero

DeepSeek模型综述：V1 V2 V3 R1-Zero

专知会员服务

116+阅读 · 2月11日

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

【超越消息传递:图神经网络的物理启发范式】Beyond Message Passing: a Physics-Inspired Paradigm for Graph Neural Networks

专知会员服务

17+阅读 · 2022年5月10日

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

【MIT-ICLR2022】在机器学习模型中注入公平性, Injecting fairness into machine-learning models

专知会员服务

22+阅读 · 2022年3月7日

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

KG-BERT：基于BERT的知识图谱补全，KG-BERT: BERT for Knowledge Graph Completion

专知会员服务

195+阅读 · 2020年5月31日

【深度图相似学习综述】Deep Graph Similarity Learning: A Survey，29页pdf，117条参考文献

【深度图相似学习综述】Deep Graph Similarity Learning: A Survey，29页pdf，117条参考文献

专知会员服务

98+阅读 · 2019年12月31日

热门VIP内容

开通专知VIP会员享更多权益服务

美海军作战管理系统：变革战场空间的二十年

《任务与武器驱动美海军舰队设计》报告

俄罗斯“沙希德”/“天竺葵”攻击无人机

《利用动态图对网络攻击进行建模与仿真：在云安全评估中的应用》90页

相关资讯

【CVPR 2020 Oral】小样本类增量学习

【CVPR 2020 Oral】小样本类增量学习

专知

20+阅读 · 2020年6月26日

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图机器学习 2.2-2.4 Properties of Networks, Random Graph

图与推荐

10+阅读 · 2020年3月28日

论文笔记之Feature Selective Networks for Object Detection

论文笔记之Feature Selective Networks for Object Detection

统计学习与视觉计算组

21+阅读 · 2018年7月26日

误差反向传播——CNN

误差反向传播——CNN

统计学习与视觉计算组

30+阅读 · 2018年7月12日

读论文Discriminative Deep Metric Learning for Face and KV

读论文Discriminative Deep Metric Learning for Face and KV

统计学习与视觉计算组

12+阅读 · 2018年4月6日

相关论文

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Arxiv

0+阅读 · 12月15日

Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Arxiv

0+阅读 · 12月2日

NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model

Arxiv

0+阅读 · 12月1日

MS-PPO: Morphological-Symmetry-Equivariant Policy for Legged Robot Locomotion

Arxiv

0+阅读 · 11月30日

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling

Arxiv

0+阅读 · 11月26日

相关基金

粗糙回归模型与算法研究

国家自然科学基金

8+阅读 · 2015年12月31日

“自然语言-草图”耦合的地理场景查询方法研究

国家自然科学基金

3+阅读 · 2015年12月31日

基于自主学习的Ad hoc Agent序贯决策研究

国家自然科学基金

46+阅读 · 2015年12月31日

复杂多元数据的半参数统计推断

国家自然科学基金

5+阅读 · 2014年12月31日

基于动态分层与自学习的多智能体自适应协作模型

国家自然科学基金

17+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员