We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web anew, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, re-filter several of them with quality filters designed specifically for Arabic text, and perform cross-dataset deduplication at both the MinHash and sentence level. This process reveals that nearly 60% of the tokens across these independently collected corpora are duplicates, redundancy that any new scraping effort would simply reproduce. Our results suggest that for lower-resource languages, investing in curation pipelines for existing data yields greater returns than additional web crawls; this approach let us curate the largest heavily filtered, publicly available Arabic pretraining corpus to date.
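The abstract names two deduplication stages but not their configuration. Below is a minimal sketch of that two-level structure, assuming the open-source `datasketch` library for MinHash LSH; the shingle size, the 0.8 Jaccard threshold, and the naive sentence splitter are illustrative placeholders, not AraMix's actual pipeline.

```python
# Sketch of two-stage cross-dataset deduplication: document-level MinHash
# near-duplicate removal, then sentence-level exact deduplication.
# Parameters are illustrative, not the paper's configuration.
import hashlib
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # hash permutations per MinHash signature


def minhash_of(text: str, ngram: int = 5) -> MinHash:
    """Build a MinHash signature over word n-gram shingles."""
    words = text.split()
    sig = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(words) - ngram + 1)):
        shingle = " ".join(words[i : i + ngram])
        sig.update(shingle.encode("utf-8"))
    return sig


def minhash_dedup(docs):
    """Yield (doc_id, text) pairs, keeping only the first member of
    each near-duplicate cluster across all input datasets."""
    lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
    for doc_id, text in docs:
        sig = minhash_of(text)
        if lsh.query(sig):  # a near-duplicate was already kept
            continue
        lsh.insert(doc_id, sig)
        yield doc_id, text


def dedup_sentences(text: str, seen: set) -> str:
    """Sentence-level exact dedup: drop any sentence whose hash has
    already been seen anywhere in the corpus."""
    kept = []
    for sent in (s.strip() for s in text.split(".")):
        if not sent:
            continue
        h = hashlib.md5(sent.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(sent)
    return ". ".join(kept)
```

At corpus scale both stages would need to be sharded and parallelized; the sketch only illustrates the two-level document-then-sentence structure the abstract describes.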

