This research designs a unified architecture for a CTR prediction benchmark platform (Bench-CTR) that offers flexible interfaces to datasets and to the components of a wide range of CTR prediction models. Moreover, we construct a comprehensive system of evaluation protocols encompassing real-world and synthetic datasets, a taxonomy of metrics, standardized procedures, and experimental guidelines for calibrating the performance of CTR prediction models. Furthermore, we implement the proposed benchmark platform and conduct a comparative study that evaluates a wide range of state-of-the-art models, from traditional multivariate statistical methods to modern large language model (LLM)-based approaches, on three public datasets and two synthetic datasets. Experimental results reveal that (1) high-order models largely outperform low-order models, though the advantage varies across metrics and datasets; (2) LLM-based models demonstrate remarkable data efficiency, achieving performance comparable to other models while using only 2% of the training data; and (3) the performance of CTR prediction models improved significantly from 2015 to 2016 and has since progressed slowly, a trend that is consistent across datasets. This benchmark is expected to facilitate model development and evaluation and to enhance practitioners' understanding of the underlying mechanisms of CTR prediction models. Code is available at https://github.com/NuriaNinja/Bench-CTR.
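
To make the role of the platform concrete, below is a minimal sketch of what a registry-style benchmark interface could look like. The names used here (`BenchCTR`, `register_dataset`, `register_model`, `evaluate`) are hypothetical illustrations and are not taken from the Bench-CTR repository; the actual interfaces are defined in the code linked above.

```python
# Hypothetical sketch of a registry-style CTR benchmark harness.
# All class and method names are illustrative, not the Bench-CTR API.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class BenchCTR:
    """Maps dataset and model names to loaders/constructors and runs a shared protocol."""
    datasets: Dict[str, Callable] = field(default_factory=dict)
    models: Dict[str, Callable] = field(default_factory=dict)

    def register_dataset(self, name: str, loader: Callable) -> None:
        # loader() is expected to return ((X_train, y_train), (X_test, y_test)).
        self.datasets[name] = loader

    def register_model(self, name: str, constructor: Callable) -> None:
        # constructor() is expected to return an object with fit(X, y) and predict_proba(X).
        self.models[name] = constructor

    def evaluate(
        self,
        metric: Callable,
        dataset_names: List[str],
        model_names: List[str],
    ) -> Dict[Tuple[str, str], float]:
        """Run every (model, dataset) pair under the same procedure and collect one metric."""
        results: Dict[Tuple[str, str], float] = {}
        for d in dataset_names:
            (X_tr, y_tr), (X_te, y_te) = self.datasets[d]()
            for m in model_names:
                model = self.models[m]()
                model.fit(X_tr, y_tr)
                results[(m, d)] = metric(y_te, model.predict_proba(X_te))
        return results
```

In this sketch, datasets and models are decoupled behind a common registry so new components plug in without changing the evaluation loop, which is one plausible reading of the "flexible interfaces" and "standardized procedures" described above.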