The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial, a problem exacerbated by the growing diversity of target hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses the decoder parameter count as a proxy for perplexity, without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices, from ARM CPUs to NVIDIA GPUs, and on two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL models can be achieved with up to 1.5x and 2.5x faster runtime, respectively, and 1.2x and 2.0x lower peak memory utilization. When evaluated in zero- and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy than the 350M-parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto frontier in under 3 hours while running on a commodity laptop. By eliminating model training during search, we effectively remove the carbon footprint of hundreds of GPU hours, offering a strong, simple baseline for future NAS methods in autoregressive language modeling.
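To make the core idea concrete, below is a minimal sketch of a training-free, proxy-based Pareto-frontier search in the spirit of LTS. It assumes a toy search space and a placeholder latency model; all names here (ArchConfig, decoder_params, measure_latency, pareto_frontier) are hypothetical illustrations rather than the paper's actual API, and in practice the latency and memory numbers would come from on-target-device measurements.

```python
"""Minimal sketch of a training-free, proxy-based Pareto-frontier search.
All names are hypothetical illustrations, not the paper's actual API."""

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchConfig:
    n_layer: int   # number of decoder layers
    d_model: int   # model (embedding) dimension
    n_head: int    # attention heads (does not change the parameter count)
    d_inner: int   # feed-forward inner dimension


def decoder_params(cfg: ArchConfig) -> int:
    """Approximate decoder parameter count, used as the zero-cost proxy
    for perplexity (higher count correlates with lower perplexity)."""
    attn = 4 * cfg.d_model * cfg.d_model   # Q, K, V, and output projections
    ffn = 2 * cfg.d_model * cfg.d_inner    # two feed-forward matrices
    return cfg.n_layer * (attn + ffn)


def measure_latency(cfg: ArchConfig) -> float:
    """Placeholder for an on-target-device measurement; a toy cost model here."""
    return 1e-9 * decoder_params(cfg) + 0.05 * cfg.n_layer


def sample_architecture() -> ArchConfig:
    """Draw a random candidate from a toy search space."""
    return ArchConfig(
        n_layer=random.randint(2, 16),
        d_model=random.choice([256, 512, 768, 1024]),
        n_head=random.choice([4, 8, 12]),
        d_inner=random.choice([1024, 2048, 3072, 4096]),
    )


def pareto_frontier(candidates):
    """Keep candidates not dominated in (proxy score, latency):
    maximize decoder parameters, minimize latency."""
    frontier = []
    for cfg, params, lat in candidates:
        dominated = any(
            p >= params and l <= lat and (p > params or l < lat)
            for _, p, l in candidates
        )
        if not dominated:
            frontier.append((cfg, params, lat))
    return frontier


if __name__ == "__main__":
    random.seed(0)
    pool = []
    for _ in range(200):
        cfg = sample_architecture()
        pool.append((cfg, decoder_params(cfg), measure_latency(cfg)))
    for cfg, params, lat in sorted(pareto_frontier(pool), key=lambda t: t[2]):
        print(f"{params / 1e6:8.1f}M params  {lat * 1e3:7.1f} ms  {cfg}")
```

Because the proxy score is a closed-form parameter count and the hardware cost is measured directly on the target device, no GPU or training step is needed anywhere in the search loop.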