Recent improvements in synthetic data generation make it possible to produce images that are highly photorealistic and indistinguishable from real ones. Furthermore, synthetic generation pipelines can, in principle, generate an unlimited number of images. The combination of high photorealism and scale makes synthetic data a promising candidate for improving various machine learning (ML) pipelines. Thus far, a large body of research in this field has focused on using synthetic images for training, by augmenting and enlarging the training data. In contrast, in this work we explore whether synthetic data can be beneficial for model selection. Considering the task of image classification, we demonstrate that when data is scarce, synthetic data can replace the held-out validation set, thus allowing the model to be trained on a larger dataset.
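The idea of selecting a model on synthetic data can be illustrated with a minimal sketch. It is not the method evaluated in this work: it assumes a toy two-class problem on 2-D points in place of images, a k-nearest-neighbour classifier whose hyperparameter k plays the role of the model being selected, and a hypothetical generator (`make_points` with a slightly different spread) standing in for a synthetic image pipeline. All function names and parameter values are illustrative assumptions.

```python
import random
from collections import Counter


def make_points(rng, n_per_class, centers, spread):
    """Draw labeled 2-D points around each class center (toy stand-in for images)."""
    data = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, spread), rng.gauss(cy, spread)), label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, eval_set, k):
    """Fraction of eval_set points that k-NN on `train` labels correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in eval_set)
    return correct / len(eval_set)


def select_k(train, validation, candidates):
    """Return the candidate k with the highest validation accuracy, plus all scores."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(candidates, key=lambda k: scores[k])
    return best_k, scores


rng = random.Random(0)
centers = [(0.0, 0.0), (2.0, 2.0)]

# Scarce real data: every real sample is used for training, none is held out.
real_train = make_points(rng, 10, centers, spread=1.0)

# Assumption: a generator supplies labeled synthetic samples; the different
# spread mimics a gap between synthetic and real distributions.
synthetic_val = make_points(rng, 200, centers, spread=1.1)

best_k, scores = select_k(real_train, synthetic_val, candidates=[1, 3, 5, 7])
print("selected k:", best_k)
print("synthetic-validation accuracy per k:", scores)
```

In this sketch no real sample is set aside for validation, so the classifier sees the full real dataset, while the ranking of candidate models relies entirely on the synthetic set. Whether that ranking agrees with the ranking on real held-out data is the question the work examines.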