Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this challenge, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, sharing the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance under real-world perturbations that are likely to influence tokenization. Together, the models and benchmark in TokSuite allow robust decoupling of a tokenizer's influence on model behavior, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.