Deep learning has achieved impressive performance in many domains, such as computer vision and natural language processing, but its advantage over classical shallow methods on tabular datasets remains questionable. It is especially challenging to surpass tree-based ensembles, such as XGBoost or Random Forests, on small datasets (fewer than 1k samples). To tackle this challenge, we introduce HyperTab, a hypernetwork-based approach to solving small-sample problems on tabular datasets. By combining the advantages of Random Forests and neural networks, HyperTab generates an ensemble of neural networks, where each target model is specialized to process a specific lower-dimensional view of the data. Since each view plays the role of data augmentation, we virtually increase the number of training samples while keeping the number of trainable parameters unchanged, which prevents overfitting. We evaluated HyperTab on more than 40 tabular datasets of varying sizes and domains of origin, and compared its performance with shallow and deep learning models representing the current state of the art. We show that HyperTab consistently outperforms the other methods on small data (with a statistically significant difference) and scores comparably to them on larger datasets. We make a Python package with the code available at https://pypi.org/project/hypertab/
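The core idea — a hypernetwork that turns a binary feature mask (a lower-dimensional view of the data) into the weights of a small target network, with the ensemble averaging the views' predictions — can be sketched as follows. This is a minimal NumPy illustration under assumed sizes and a linear hypernetwork, not the authors' implementation; all names (`target_weights`, `predict`) and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 8   # full feature dimension (hypothetical)
view_size = 4    # features kept in each low-dimensional view
hidden = 16      # hidden width of each generated target network

# Hypernetwork parameters: a linear map from a binary feature mask
# to the weights of a one-hidden-layer target network. These are the
# only trainable parameters, shared across all views.
W_h1 = rng.normal(0.0, 0.1, size=(n_features, n_features * hidden))
W_h2 = rng.normal(0.0, 0.1, size=(n_features, hidden))

def target_weights(mask):
    """Generate the weights of a target network from a feature mask."""
    w1 = (mask @ W_h1).reshape(n_features, hidden)
    w2 = mask @ W_h2
    return w1, w2

def predict(x, masks):
    """Average the scores of the mask-conditioned target networks."""
    scores = []
    for mask in masks:
        w1, w2 = target_weights(mask)
        h = np.tanh((x * mask) @ w1)  # each network sees only its view
        scores.append(h @ w2)
    return np.mean(scores, axis=0)

# Ensemble of random views: each mask keeps `view_size` features.
masks = np.zeros((5, n_features))
for m in masks:
    m[rng.choice(n_features, size=view_size, replace=False)] = 1.0

x = rng.normal(size=(3, n_features))  # 3 dummy samples
print(predict(x, masks).shape)        # one score per sample: (3,)
```

Because every masked view of a training sample is a distinct input to the shared hypernetwork, the number of effective training examples grows with the number of views while the parameter count (`W_h1`, `W_h2`) stays fixed, which is the augmentation effect described above.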