The recent rise in the popularity of large language models has spurred the development of the extensive code datasets needed to train them. This has left limited code available for collection and use in downstream investigations of specific behaviors, or for the evaluation of large language models, without suffering from data contamination. To address this problem, we release The Heap, a large multilingual dataset covering 57 programming languages that has been deduplicated with respect to other open datasets of code, enabling researchers to conduct fair evaluations of large language models without significant data cleaning overhead.