Synthetic datasets are used across linguistic domains and NLP tasks, particularly in scenarios where authentic data is limited (or even non-existent). One such domain is the clinical (healthcare) context, where significant and long-standing challenges (e.g., privacy, anonymization, and data governance) have led to the development of an increasing number of synthetic datasets. One increasingly important category of clinical dataset is clinical dialogues, which are especially sensitive and difficult to collect and are therefore commonly synthesized. While such synthetic datasets have been shown to be sufficient in some situations, little theory exists to inform how they may best be used and generalized to new applications. In this paper, we provide an overview of how synthetic datasets are created, evaluated, and used for dialogue-related tasks in the medical domain. Additionally, we propose a novel typology for classifying types and degrees of data synthesis, to facilitate comparison and evaluation.