Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models match the performance of state-of-the-art LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show that this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more faithfully aligned with final outputs than explicit CoT. We hope our results demonstrate the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
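To make the two core ideas concrete — applying a weight-tied block iteratively in latent space, and regularizing the entropy of a learned exit-step distribution — here is a minimal toy sketch in NumPy. All names (`block`, `halting_logit`, `looped_forward`) and the reduction of the transformer block to a single tanh layer are hypothetical illustrations, not the actual Ouro architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # hidden size (toy scale)
n_loops = 4    # maximum number of latent recurrence steps

# One shared (weight-tied) "block": here reduced to a linear map + tanh.
# A real LoopLM would reuse a full transformer block across loop steps.
W = rng.normal(scale=0.1, size=(d, d))

def block(h):
    # Hypothetical shared block applied at every recurrence step.
    return np.tanh(h @ W)

def halting_logit(h):
    # Hypothetical per-step exit score; a real model would learn this head.
    return float(h.sum())

def looped_forward(x, n_loops=n_loops):
    """Apply the same block repeatedly, collecting per-step exit logits."""
    h = x
    logits = []
    for _ in range(n_loops):
        h = block(h)            # latent-space iteration, no tokens emitted
        logits.append(halting_logit(h))
    # Softmax over exit steps gives a depth-allocation distribution p.
    z = np.exp(np.array(logits) - max(logits))
    p = z / z.sum()
    return h, p

def entropy(p):
    return float(-(p * np.log(p)).sum())

x = rng.normal(size=(d,))
h, p = looped_forward(x)
# An entropy bonus on p discourages collapsing onto one fixed depth;
# the regularized objective would look like task_loss - lam * entropy(p).
print(p, entropy(p))
```

The sketch shows only the control flow: the same parameters are reused at every depth, and the model learns a distribution over how many loops to spend per input, with an entropy term keeping that allocation from degenerating early in training.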