Evaluating LLMs' instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, making them susceptible to saturation and leaving users' interactive experience unaccounted for. In this work, we propose a novel framework backed by a three-layer tracking mechanism and a query synthesis agent that mimics sequential user behaviors. Incorporating Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only when the user's patience is exhausted. On top of this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Results indicate that GPT-5 excels, sustaining 14 turns with 66.40% robustness. It outperforms Gemini-3.0-Pro by a margin of 5.59%, while other models trail behind. Resources are available at https://github.com/JiaQiSJTU/EvolvingInstructionFollowing.
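To make the patience-terminated protocol concrete, below is a minimal sketch, not the released implementation: the function names (`synthesize_query`, `follows_instructions`), the patience budget, and the robustness formula are all illustrative assumptions. It shows a dialogue loop that keeps issuing synthesized user queries until simulated patience is exhausted and then reports process-centric statistics (turns sustained, per-turn robustness).

```python
def follows_instructions(response: str, constraints: list[str]) -> bool:
    """Hypothetical per-turn check: does the response satisfy all active constraints?
    A real checker would verify each constraint type programmatically or with a judge model."""
    return all(c in response for c in constraints)


def evaluate_dialogue(model_reply, synthesize_query, max_patience: int = 3) -> dict:
    """Run turns until the simulated user's patience is exhausted (assumed rule:
    patience drops after each failed turn and resets after a success), so the
    dialogue length itself becomes a process-centric signal."""
    patience = max_patience
    history, successes, turn = [], 0, 0
    while patience > 0:
        turn += 1
        query, constraints = synthesize_query(history)   # next user request + its constraints
        response = model_reply(history, query)            # model under evaluation
        ok = follows_instructions(response, constraints)
        history.append((query, response))
        successes += ok
        patience = max_patience if ok else patience - 1    # failures erode patience
    return {"turns": turn, "robustness": successes / turn}
```

Under these assumptions, a model that keeps satisfying constraints sustains a long dialogue with high robustness, while repeated failures end the evaluation early, which is the intuition behind reporting turns and robustness jointly rather than accuracy at a fixed turn count.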