The advancement of machine learning for compiler optimization, particularly within the polyhedral model, is constrained by the scarcity of large-scale, public performance datasets. This data bottleneck forces researchers to undertake costly data generation campaigns, slowing innovation and hindering reproducible research in learned code optimization. To address this gap, we introduce LOOPerSet, a new public dataset containing 28 million labeled data points derived from 220,000 unique, synthetically generated polyhedral programs. Each data point maps a program and a complex sequence of semantics-preserving transformations (such as fusion, skewing, tiling, and parallelization) to a ground-truth performance measurement (execution time). The scale and diversity of LOOPerSet make it a valuable resource for training and evaluating learned cost models, benchmarking new model architectures, and exploring the frontiers of automated polyhedral scheduling. The dataset is released under a permissive license to foster reproducible research and lower the barrier to entry for data-driven compiler optimization.