Data quality remains an important challenge in data-driven systems, as errors in tabular data can severely compromise downstream analytics and machine learning performance. Although numerous error detection algorithms have been proposed, the lack of diverse, real-world error datasets limits comprehensive evaluation. Manual error annotation is both time-consuming and inconsistent, motivating the exploration of synthetic error generation as an alternative. In this work, we introduce TableEG, a framework that leverages large language models (LLMs) to generate authentic errors. By employing a table fine-tuning strategy and a triplet representation $(I, T, O)$ to model error generation, detection, and correction tasks, TableEG captures the complex dependencies inherent in two-dimensional tables. TableEG is trained on 12 real-world datasets spanning 10 diverse domains so that the synthesized errors faithfully reflect authentic error distributions. Experimental results indicate that errors generated by TableEG are more similar to real errors in both pattern and distribution than errors produced by rule-based methods or by LLMs without fine-tuning. Furthermore, the performance of detection algorithms on TableEG-generated errors closely aligns with their performance on real-world errors across nearly all datasets and detection algorithms, particularly for machine-learning-based detection techniques. Overall, TableEG not only bridges the gap between synthetic and real-world errors but also establishes a robust benchmark for subsequent error detection and correction tasks.


