Visual instruction tuning (VIT) datasets are typically constructed from randomly sampled image-question pairs, without regard to the informativeness of each pair. Recent dataset selection methods have shown that a small fraction of such datasets, enriched with informative samples, can enable efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich, informative text questions for each image, significantly reducing the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (versus 97.5% for the state-of-the-art method) across eight multimodal benchmarks. Moreover, training on COMPACT data outperforms training on the full-scale data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT thus offers a scalable and efficient synthetic data generation recipe for improving performance on vision-language tasks.
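The atomic-to-complex composition step described above can be pictured with a minimal sketch. The capability list, the prompt wording, and the `compose_question` helper below are illustrative assumptions for combining k atomic capabilities into one synthesized question; they are not the paper's released pipeline, and the resulting prompt would be handed to any instruction-following LLM to produce the actual QA training pair.

```python
# Minimal sketch of COMPACT-style sample composition (illustrative only).
# ATOMIC_CAPABILITIES and the prompt template are assumptions, not the
# paper's actual taxonomy or prompts.
import random

# Hypothetical atomic visual capabilities; the real taxonomy is defined
# in the paper, not reproduced in this abstract.
ATOMIC_CAPABILITIES = [
    "object recognition",
    "counting",
    "color identification",
    "spatial reasoning",
    "attribute comparison",
]

def compose_question(image_caption: str, k: int, seed: int | None = None) -> str:
    """Build a prompt asking an LLM to synthesize ONE question that
    requires k atomic capabilities at once (larger k = more complex sample)."""
    rng = random.Random(seed)
    skills = rng.sample(ATOMIC_CAPABILITIES, k)
    return (
        f"Image description: {image_caption}\n"
        f"Write one question about this image that can only be answered "
        f"by combining these skills: {', '.join(skills)}. "
        f"Then provide the answer."
    )

if __name__ == "__main__":
    # Example: a complexity-3 prompt for a single image.
    prompt = compose_question(
        "Three red apples and one green pear on a wooden table.", k=3, seed=0
    )
    print(prompt)
```

Under this reading, scaling k trades sample count for per-sample complexity, which is how a 10% data budget can match or exceed full-scale VIT on composition-heavy benchmarks.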