Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: the fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this gap, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT$^3$) model, a pure TTT architecture with linear complexity and parallelizable computation. We evaluate ViT$^3$ across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT$^3$ consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT$^3$ baseline can facilitate future work on visual TTT models. Code is available at https://github.com/LeapLabTHU/ViTTT.
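For concreteness, below is a minimal sketch of the generic TTT reformulation described above, not ViT$^3$'s specific design: it assumes a linear inner model and a single SGD step per token, and all names and the learning rate are illustrative. Each key-value pair supplies one step of inner training, and the query then reads out the updated inner model, so the cost grows linearly with sequence length.

```python
import numpy as np

def ttt_linear_layer(Q, K, V, lr=0.1):
    """Minimal TTT sketch (illustrative, not ViT^3's actual inner design).

    The inner model is a linear map W, trained online: for each token t we
    take one gradient step on the reconstruction loss ||W k_t - v_t||^2 and
    then read out z_t = W q_t. One fixed-size update per token gives linear
    complexity in sequence length.
    """
    seq_len, d_k = K.shape
    W = np.zeros((V.shape[1], d_k))          # inner model weights (the "state")
    outputs = np.empty((seq_len, V.shape[1]))
    for t in range(seq_len):
        k, v, q = K[t], V[t], Q[t]
        err = W @ k - v                      # residual of the inner loss
        W -= lr * 2.0 * np.outer(err, k)     # one SGD step on ||W k - v||^2
        outputs[t] = W @ q                   # query the updated inner model
    return outputs

# Toy usage: 16 tokens with 8-dim queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
Z = ttt_linear_layer(Q, K, V)
print(Z.shape)  # (16, 8)
```

The inner module (here a linear map) and the inner training rule (here plain SGD) are exactly the design choices the study above examines; richer inner models and update rules fill out the design space while preserving the per-token, fixed-size update.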