Large language models are trained on tokenized text, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where the benefit comes from. We therefore perform a controlled study that scales a language model's vocabulary from 24K to 196K while holding data, computation, and optimization unchanged. We begin by quantifying the complexity of tokenized text (formalized via Kolmogorov complexity) and show that larger vocabularies reduce this complexity. Beyond 24K, every common word is already tokenized as a single token, so enlarging the vocabulary further only deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy loss almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Those same frequent words cover roughly 75% of the tokens in downstream benchmarks, so this training advantage transfers intact. We further show that scaling up model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results recast "bigger vocabularies help" as "lowering the complexity of tokenized text helps," offering a simple, principled knob for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling during pre-training.
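The central analysis summarized above is the word-level decomposition of cross-entropy loss into frequent ("head") and rare ("tail") words. The following is a minimal sketch of one way such a decomposition can be computed, not the paper's code: the function name, the input format (per-word-occurrence lists of per-token negative log-likelihoods), and the two-bucket head/tail split are illustrative assumptions; only the 2,500-word cutoff is taken from the abstract.

```python
# Minimal sketch (not the paper's released code) of a word-level loss
# decomposition: per-token cross-entropy losses are summed within each word
# occurrence, and word occurrences are split into a "head" bucket (the top_k
# most frequent words, 2,500 in the abstract) and a "tail" bucket.
from collections import Counter, defaultdict
from statistics import mean

def decompose_loss_by_word_frequency(word_token_nlls, top_k=2500):
    """word_token_nlls: list of (word, [per-token NLLs]) pairs, one entry per
    word occurrence in reading order. Because a word's loss is the sum of its
    tokens' negative log-likelihoods, the decomposition is exact: the corpus
    loss equals the sum of all word-occurrence losses."""
    freq = Counter(word for word, _ in word_token_nlls)
    head_words = {word for word, _ in freq.most_common(top_k)}

    bucket_losses = defaultdict(list)
    for word, nlls in word_token_nlls:
        bucket = "head" if word in head_words else "tail"
        bucket_losses[bucket].append(sum(nlls))

    # Mean loss per word occurrence in each frequency bucket.
    return {bucket: mean(losses) for bucket, losses in bucket_losses.items()}

# Toy usage: frequent words tend to be single tokens with low loss, while rare
# words split into several tokens whose losses add up.
example = [
    ("the", [0.9]), ("kolmogorov", [3.1, 2.4, 1.8]),
    ("the", [0.7]), ("complexity", [2.0, 1.1]), ("the", [0.8]),
]
print(decompose_loss_by_word_frequency(example, top_k=1))
# roughly {'head': 0.8, 'tail': 5.2}
```

Under this accounting, the abstract's claim is that growing the vocabulary lowers the head-bucket average while the tail-bucket average rises, with the head dominating the overall loss because it covers most token occurrences.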