We present and apply two methods for selecting relevant training data from a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method, which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used here for the first time within a machine translation system, and we compare them head-to-head. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), yielding lower perplexity and lower OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% less data than the other methods. Furthermore, the new approaches can be used to select training data that yields better machine translation systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that the data it selects more closely matches the sentence length of the task corpus.
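For readers unfamiliar with the Moore-Lewis baseline mentioned above, the following is a minimal sketch of cross-entropy difference scoring. It assumes add-one-smoothed unigram models in place of the n-gram language models normally used for this purpose, and all function names (`train_unigram`, `cross_entropy`, `moore_lewis_rank`) are illustrative rather than taken from the paper. Each pool sentence is scored by its cross-entropy under a task (in-domain) model minus its cross-entropy under a general-domain model; lower scores indicate sentences that look more like the task corpus.

```python
import math
from collections import Counter

UNK = "<unk>"  # placeholder for out-of-vocabulary tokens (an assumption of this sketch)


def tokens(sentence, vocab):
    """Whitespace-tokenize and map unknown words to UNK."""
    return [t if t in vocab else UNK for t in sentence.split()]


def train_unigram(sentences, vocab):
    """Return an add-one-smoothed unigram probability function."""
    counts = Counter(t for s in sentences for t in tokens(s, vocab))
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 accounts for UNK
    return lambda tok: (counts[tok] + 1) / (total + size)


def cross_entropy(sentence, model, vocab):
    """Per-token cross-entropy (in bits) of a sentence under a model."""
    toks = tokens(sentence, vocab)
    return -sum(math.log2(model(t)) for t in toks) / len(toks)


def moore_lewis_rank(pool, task, general_sample):
    """Rank pool sentences by H_task(s) - H_general(s), lowest first."""
    # Shared vocabulary so that both models assign comparable probabilities.
    vocab = {t for s in task + general_sample for t in s.split()}
    task_lm = train_unigram(task, vocab)
    general_lm = train_unigram(general_sample, vocab)
    scored = [
        (cross_entropy(s, task_lm, vocab) - cross_entropy(s, general_lm, vocab), s)
        for s in pool
        if s.split()  # skip empty lines
    ]
    return sorted(scored)


if __name__ == "__main__":
    task = ["the patient received a dose", "the dose was reduced"]
    general = ["the market fell today", "stocks rose after the report"]
    pool = ["the patient dose was high", "the market report was bad"]
    for score, sentence in moore_lewis_rank(pool, task, general):
        print(f"{score:+.3f}  {sentence}")
```

A class-based variant in the spirit of the cluster-based method would replace each word with its cluster identifier before counting, which shrinks the vocabulary the models have to estimate; that mapping is omitted here because it depends on an externally trained clustering.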