FlexiDataGen: An Adaptive LLM Framework for Dynamic Semantic Dataset Generation in Sensitive Domains

Dataset availability and quality remain critical challenges in machine learning, especially in domains where data are scarce, expensive to acquire, or constrained by privacy regulations. Fields such as healthcare, biomedical research, and cybersecurity frequently encounter high data acquisition costs, limited access to annotated data, and the rarity or sensitivity of key events. These issues, collectively referred to as the dataset challenge, hinder the development of accurate and generalizable machine learning models in such high-stakes domains. To address this, we introduce FlexiDataGen, an adaptive large language model (LLM) framework designed for dynamic semantic dataset generation in sensitive domains. FlexiDataGen autonomously synthesizes rich, semantically coherent, and linguistically diverse datasets tailored to specialized fields. The framework integrates four core components: (1) syntactic-semantic analysis, (2) retrieval-augmented generation, (3) dynamic element injection, and (4) iterative paraphrasing with semantic validation. Together, these components ensure the generation of high-quality, domain-relevant data. Experimental results show that FlexiDataGen effectively alleviates data shortages and annotation bottlenecks, enabling scalable and accurate machine learning model development.
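As a rough illustration of how components (3) and (4) might fit together, the sketch below fills a template with domain elements and then paraphrases the result over several rounds, keeping a paraphrase only if it passes a validation check. This is not the authors' implementation. The template, the element and synonym tables, and the function names are all hypothetical. A word-level synonym swap stands in for an LLM paraphraser, and Jaccard token overlap stands in for a real semantic similarity measure. Syntactic-semantic analysis and retrieval-augmented generation are not modeled.

```python
import random
import re

# Hypothetical toy resources; none of these come from the paper.
TEMPLATE = "Patient reports {symptom} after taking {drug}."
ELEMENTS = {
    "symptom": ["nausea", "dizziness"],
    "drug": ["ibuprofen", "aspirin"],
}
SYNONYMS = {
    "reports": ["describes", "mentions"],
    "after": ["following"],
    "taking": ["using"],
}


def inject(rng):
    """Dynamic element injection: fill each slot with a random element."""
    chosen = {slot: rng.choice(values) for slot, values in ELEMENTS.items()}
    return TEMPLATE.format(**chosen), chosen


def paraphrase(sentence, rng):
    """Swap some words for synonyms (stand-in for an LLM paraphraser)."""
    def swap(match):
        word = match.group(0)
        options = SYNONYMS.get(word)
        if options and rng.random() < 0.5:
            return rng.choice(options)
        return word
    return re.sub(r"[A-Za-z]+", swap, sentence)


def similarity(a, b):
    """Jaccard overlap of lowercase word sets; a crude proxy for meaning."""
    ta = set(re.findall(r"[a-z]+", a.lower()))
    tb = set(re.findall(r"[a-z]+", b.lower()))
    return len(ta & tb) / len(ta | tb)


def generate(n, rounds=3, threshold=0.5, seed=0):
    """Return n (seed_sentence, final_sentence) pairs."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        seed_text, chosen = inject(rng)
        current = seed_text
        for _round in range(rounds):
            candidate = paraphrase(current, rng)
            # Validation: injected elements must survive, and the candidate
            # must stay close to the seed sentence (not just the last round).
            keeps_elements = all(v in candidate for v in chosen.values())
            if keeps_elements and similarity(seed_text, candidate) >= threshold:
                current = candidate
        results.append((seed_text, current))
    return results


if __name__ == "__main__":
    for seed_text, final in generate(3):
        print(seed_text, "->", final)
```

Validating each candidate against the original seed sentence, not against the previous round, is one way to keep repeated paraphrasing from drifting away from the intended meaning.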

