Diffusion models have become the predominant generative models for tabular data. However, they face a dilemma: whether to model the data under a separate or a unified data representation. The former must jointly model all the multi-modal distributions of tabular data in one model. The latter alleviates this by learning a single representation for all features, but current approaches rely on sparse, suboptimal encoding heuristics and incur additional computational cost. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results show that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient. Code is available at https://github.com/jacobyhsi/TabRep.