Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently face high data acquisition costs, limited access to annotated data, and key events that are rare or sensitive. These issues, collectively referred to as the dataset challenge, hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this challenge, we introduce FlexiDataGen, an adaptive large language model (LLM) framework for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components ensure that the generated data are high-quality and domain-relevant. Experimental results show that FlexiDataGen alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
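To make the four-component structure named in the abstract more concrete, the following is a minimal toy sketch of how such stages could be chained. It is not the FlexiDataGen implementation: the abstract does not give implementation details, so every name here (`KNOWLEDGE_BASE`, `SYNONYMS`, `analyze`, `retrieve`, `inject`, `paraphrase`, `generate`) is hypothetical. A dictionary lookup stands in for retrieval, a synonym table stands in for LLM paraphrasing, and Jaccard token overlap stands in for semantic validation.

```python
import re
from itertools import product

# Hypothetical in-memory knowledge base standing in for a real retriever.
KNOWLEDGE_BASE = {
    "symptom": ["chest pain", "shortness of breath"],
    "drug": ["aspirin", "metformin"],
}

# Hypothetical synonym table standing in for an LLM paraphraser.
SYNONYMS = {"reports": "describes", "prescribed": "given"}


def analyze(template):
    """Stage 1 (toy syntactic-semantic analysis): list the slot names in a template."""
    return re.findall(r"\{(\w+)\}", template)


def retrieve(slots):
    """Stage 2 (toy retrieval): look up candidate values for each slot."""
    return {slot: KNOWLEDGE_BASE.get(slot, []) for slot in slots}


def inject(template, candidates):
    """Stage 3 (toy dynamic element injection): fill slots with every combination."""
    slots = list(candidates)
    return [
        template.format(**dict(zip(slots, combo)))
        for combo in product(*(candidates[s] for s in slots))
    ]


def jaccard(a, b):
    """Token-set overlap, used here as a crude proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def paraphrase(sentence, rounds=2, threshold=0.5):
    """Stage 4 (toy iterative paraphrasing): swap one synonym per round and keep
    a variant only if it stays similar enough to the original sentence."""
    accepted = []
    current = sentence
    for _ in range(rounds):
        for word, alt in SYNONYMS.items():
            candidate = re.sub(rf"\b{re.escape(word)}\b", alt, current, count=1)
            if candidate != current and jaccard(sentence, candidate) >= threshold:
                accepted.append(candidate)
                current = candidate
                break
    return accepted


def generate(template):
    """Chain the four toy stages and return seed sentences plus validated paraphrases."""
    candidates = retrieve(analyze(template))
    dataset = []
    for sentence in inject(template, candidates):
        dataset.append(sentence)
        dataset.extend(paraphrase(sentence))
    return dataset
```

Calling `generate("Patient reports {symptom} and was prescribed {drug}.")` returns the filled-in seed sentences together with their accepted paraphrases. In a realistic setting each stage would presumably be replaced by a model-backed component, with an embedding-based similarity check in place of the token-overlap threshold.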