We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The key technical breakthroughs of QwenLong-L1.5 are as follows: (1) Long-Context Data Synthesis Pipeline: We develop a systematic synthesis framework that generates challenging reasoning tasks requiring multi-hop grounding over globally distributed evidence. By deconstructing documents into atomic facts and their underlying relationships, and then programmatically composing verifiable reasoning questions, our approach creates high-quality training data at scale, moving substantially beyond simple retrieval tasks to enable genuine long-range reasoning capabilities. (2) Stabilized Reinforcement Learning for Long-Context Training: To overcome the critical instability in long-context RL, we introduce task-balanced sampling with task-specific advantage estimation to mitigate reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off. (3) Memory-Augmented Architecture for Ultra-Long Contexts: Recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we develop a memory management framework with multi-stage fusion RL training that seamlessly integrates single-pass reasoning with iterative memory-based processing for tasks exceeding 4M tokens. Based on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its baseline by 9.90 points on average. On ultra-long tasks (1M–4M tokens), QwenLong-L1.5's memory-agent framework yields a 9.48-point gain over the agent baseline. Additionally, the acquired long-context reasoning ability translates to enhanced performance in general domains such as scientific reasoning, memory tool use, and extended dialogue.
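To make point (2) concrete, one common way to realize task-specific advantage estimation is GRPO-style group normalization applied per task, so that tasks with different reward scales cannot bias the policy update. The sketch below is illustrative only; the function name and exact normalization are assumptions, not the paper's actual formulation.

```python
# Illustrative sketch: normalize rewards within each task group so that
# advantages are comparable across tasks with different reward scales.
from collections import defaultdict

def task_balanced_advantages(samples, eps=1e-8):
    """samples: list of (task_name, reward) pairs from one RL batch.

    Returns one advantage per sample, normalized using the mean and
    standard deviation of that sample's own task group.
    """
    by_task = defaultdict(list)
    for task, reward in samples:
        by_task[task].append(reward)

    # Per-task mean and (population) standard deviation.
    stats = {}
    for task, rewards in by_task.items():
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        stats[task] = (mean, var ** 0.5)

    # Advantage of each sample relative to its task group only.
    return [(r - stats[t][0]) / (stats[t][1] + eps) for t, r in samples]
```

With this grouping, a task whose rewards range over 0-10 and a task whose rewards range over 0-1 both produce advantages on the same scale, which is the kind of reward-bias mitigation the abstract describes.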
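The iterative memory-based processing in point (3) can be pictured as a chunk-and-summarize loop: the model reads the ultra-long input window by window, carries forward a condensed memory of question-relevant evidence, and performs a final single-pass reasoning step over that memory. This is a minimal sketch assuming a generic `call_model` text-in/text-out stub; the prompts, chunking granularity, and fusion with single-pass reasoning are assumptions, not the paper's actual framework.

```python
# Minimal sketch of iterative memory-based processing for inputs that
# exceed the context window. `call_model` is a hypothetical stub that
# maps a prompt string to a model completion string.
def answer_over_ultra_long(document, question, call_model, window=100_000):
    memory = ""  # running summary of evidence gathered so far
    for start in range(0, len(document), window):
        chunk = document[start:start + window]
        # Ask the model to fold question-relevant evidence from this
        # chunk into the running memory.
        memory = call_model(
            f"Question: {question}\nMemory: {memory}\nChunk: {chunk}\n"
            "Update the memory with any evidence relevant to the question."
        )
    # Final single-pass reasoning over the condensed memory alone.
    return call_model(f"Question: {question}\nEvidence: {memory}\nAnswer:")
```

The key design point is that per-iteration context usage is bounded by the window size plus the memory size, so arbitrarily long inputs (e.g., 4M tokens) reduce to a fixed number of bounded model calls.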

