Exploring the Heterogeneity of Tabular Data: A Diversity-aware Data Generator via LLMs

Tabular data generation has become increasingly essential for robust machine learning applications, which require large-scale, high-quality data. Existing solutions use generative models to learn the original data distribution. Real-world data, however, are naturally heterogeneous and follow diverse distributions, which makes it difficult to obtain a single model that generates all of them well. To address this limitation, we introduce the Diversity-Aware Tabular data gEnerator (DATE), a framework that (i) prepares high-quality, distributionally distinct examples for in-context learning by partitioning the original heterogeneous data into multiple diverse subsets, and (ii) harnesses Large Language Models (LLMs) to explore the diversity of the partitioned distributions, with decision tree reasoning as feedback, generating high-quality labeled data for each subset. The resulting large pool of generated data inherently involves a trade-off between diversity and quality. To handle this trade-off, existing solutions greedily select the data that perform best on validation. We prove that selection in heterogeneous settings does not possess the greedy-choice property, and we design a multi-armed bandit-based sampling algorithm that balances the diversity and quality of the generated data. Extensive experiments on tabular classification and regression benchmarks demonstrate that DATE consistently outperforms state-of-the-art GAN-based and LLM-based methods. On average, DATE achieves a 23.75% reduction in error rate with just 100 generated samples. We also show empirically that data generated by DATE can improve the accuracy of Direct Preference Optimization (DPO) and enhance the reasoning capability of LLMs on the target data. Code is available at https://github.com/windblow32/DATE.
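To make the sampling idea concrete, the following is a minimal sketch of bandit-style selection over per-subset pools of generated samples. It uses the standard UCB1 rule, which is an assumption made here for illustration; the abstract does not specify which bandit algorithm DATE uses. The function name `ucb1_select`, the `pools` layout, and the `reward_fn` quality score are all hypothetical and are not taken from the DATE codebase.

```python
import math


def ucb1_select(pools, reward_fn, budget, c=math.sqrt(2)):
    """Pick up to `budget` samples from per-subset pools using UCB1.

    pools: dict mapping a subset id to a list of candidate samples
           (hypothetical layout, one pool per data partition).
    reward_fn: hypothetical quality score for a sample, assumed in [0, 1].
    Returns a list of (subset_id, sample) pairs in selection order.
    """
    pools = {k: list(v) for k, v in pools.items()}  # do not mutate the caller's data
    counts = {k: 0 for k in pools}    # number of pulls per subset
    means = {k: 0.0 for k in pools}   # running mean reward per subset
    selected = []

    for t in range(1, budget + 1):
        live = [k for k in pools if pools[k]]  # subsets that still have candidates
        if not live:
            break
        untried = [k for k in live if counts[k] == 0]
        if untried:
            # Pull every subset once before using the confidence bound.
            arm = untried[0]
        else:
            # Mean reward (quality) plus an exploration bonus that favors
            # rarely sampled subsets (diversity).
            arm = max(
                live,
                key=lambda k: means[k] + c * math.sqrt(math.log(t) / counts[k]),
            )
        sample = pools[arm].pop()
        reward = reward_fn(sample)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        selected.append((arm, sample))

    return selected
```

In this sketch a purely greedy selector would draw only from the subset with the highest observed score, whereas the exploration bonus keeps lower-scoring subsets represented in the final sample. In practice `reward_fn` would be replaced by whatever quality signal is available, for example the validation gain from adding a sample.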
