Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models achieve performance matching that of state-of-the-art LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model is available here: http://ouro-llm.github.io.
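The two mechanisms named in (i) and (ii) can be sketched as a toy recurrence: one shared block is reapplied across loop steps (so effective depth grows while parameter count stays fixed), and a per-step exit head produces a distribution over depths whose entropy an entropy-regularized objective would shape during training. All names, the update rule, and the exit head below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # toy hidden size
max_loops = 4  # maximum number of latent iterations

# One shared "block": the same weights are reused at every loop step,
# standing in for a full transformer layer stack (illustrative only).
W = rng.normal(scale=0.1, size=(d, d))
exit_w = rng.normal(scale=0.1, size=d)  # per-step exit (depth) logit head

def looped_forward(h):
    """Apply the shared block up to max_loops times, collecting an exit
    logit at each step; the softmax over steps is the learned
    depth-allocation distribution."""
    exit_logits = []
    for _ in range(max_loops):
        h = np.tanh(h @ W) + h          # shared-weight latent update (residual)
        exit_logits.append(exit_w @ h)  # confidence to stop at this depth
    z = np.exp(exit_logits - np.max(exit_logits))
    p_exit = z / z.sum()                # distribution over loop depths
    return h, p_exit

h0 = rng.normal(size=d)
h_final, p_exit = looped_forward(h0)

# Entropy of the exit distribution: the quantity an entropy regularizer
# would act on to keep depth allocation from collapsing to one step.
entropy = -np.sum(p_exit * np.log(p_exit + 1e-12))
```

In this sketch, extra "depth" costs no extra parameters, only extra loop iterations; the exit distribution is what lets the model spend fewer iterations on easy inputs and more on hard ones.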