Long-context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference at the operator, model-architecture, and system-framework levels, tokenization remains an overlooked bottleneck. Existing parallel tokenization methods accelerate processing by segmenting text and tokenizing the segments in multiple processes, but they suffer from inconsistent results caused by boundary artifacts introduced when the tokenized segments are merged. To address this, we propose LoPT, a novel Lossless Parallel Tokenization framework that guarantees output identical to standard sequential tokenization. Our approach employs character-position-based matching and dynamic chunk-length adjustment to align and merge tokenized segments accurately. Extensive experiments across diverse long-text datasets demonstrate that LoPT achieves significant speedup while guaranteeing lossless tokenization. We also provide a theoretical proof of consistency and comprehensive analytical studies that validate the robustness of our method.
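To make the idea concrete, the following is a minimal, hypothetical sketch of chunked tokenization with character-position-based merging, in the spirit of the approach described above; it is not the authors' implementation. The Hugging Face "gpt2" tokenizer, the overlap and margin sizes, and the single-boundary setup are all illustrative assumptions, and the multi-process dispatch of chunks is omitted for brevity.

```python
# Illustrative sketch only; tokenizer choice, overlap, and margin are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def encode_with_ends(chunk, base):
    """Tokenize a chunk and return (token ids, absolute character end offsets)."""
    enc = tokenizer(chunk, add_special_tokens=False, return_offsets_mapping=True)
    ends = [base + end for _, end in enc["offset_mapping"]]
    return enc["input_ids"], ends

def merge_at_boundary(text, cut, overlap=128):
    """Tokenize text[:cut] and text[cut-overlap:] independently, then splice
    them at a character offset where both tokenizations place a token
    boundary, keeping a safety margin from either side of the overlap."""
    left_ids, left_ends = encode_with_ends(text[:cut], base=0)
    start = cut - overlap
    right_ids, right_ends = encode_with_ends(text[start:], base=start)

    margin = overlap // 4
    shared = set(left_ends) & set(right_ends)
    candidates = [p for p in shared if start + margin <= p <= cut - margin]
    if not candidates:
        return None  # a caller would adjust the chunk length and retry
    p = max(candidates)
    return left_ids[: left_ends.index(p) + 1] + right_ids[right_ends.index(p) + 1 :]

text = "Long-context inference is increasingly important for language models. " * 300
parallel_ids = merge_at_boundary(text, cut=len(text) // 2)
sequential_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(parallel_ids == sequential_ids)  # expected True once the boundary resynchronizes
```

In this sketch, the final comparison against sequential tokenization plays the role of the lossless-ness check: if no shared boundary is found inside the overlap, the chunk length would be adjusted and the boundary re-tokenized, which is where a dynamic chunk-length adjustment strategy would come in.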