Language models have made enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a distorted picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to assemble the largest dataset to date of current and historical prices for running benchmarks. We find that the price of reaching a given level of benchmark performance has fallen remarkably fast, by around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency improves by around $3\times$ per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
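The decomposition described above can be illustrated with a minimal sketch; it is not the estimation procedure used in this work. It assumes the annual price decline is read off a least-squares fit of log price against time, and that the algorithmic share is obtained by dividing that factor by an annual hardware price decline factor. The function names, the synthetic price points, and the hardware factor are all hypothetical and chosen only to make the arithmetic visible.

```python
import math


def annual_decline_factor(observations):
    """Fit log(price) against time (in years) by least squares and
    return the factor by which the price falls per year.

    `observations` is a list of (years_since_start, price) pairs.
    """
    n = len(observations)
    mean_t = sum(t for t, _ in observations) / n
    mean_y = sum(math.log(p) for _, p in observations) / n
    num = sum((t - mean_t) * (math.log(p) - mean_y) for t, p in observations)
    den = sum((t - mean_t) ** 2 for t, _ in observations)
    slope = num / den  # change in log price per year (negative if falling)
    return math.exp(-slope)


def algorithmic_factor(price_factor, hardware_factor):
    """Divide the total annual price decline by the hardware price decline."""
    return price_factor / hardware_factor


# Synthetic, made-up prices (not data from this paper): the cost of
# reaching a fixed benchmark score falls fivefold each year.
synthetic_prices = [(0.0, 100.0), (1.0, 20.0), (2.0, 4.0)]

# Hypothetical hardware price decline factor per year, for illustration only.
assumed_hardware_factor = 1.4

price_factor = annual_decline_factor(synthetic_prices)
print(f"price decline per year: {price_factor:.2f}x")
print(f"implied algorithmic share: "
      f"{algorithmic_factor(price_factor, assumed_hardware_factor):.2f}x")
```

Working in log space makes the factors multiplicative, so a constant yearly ratio appears as a straight line, and removing the hardware contribution reduces to a single division.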