In recent years, semi-supervised learning algorithms have attracted considerable interest in both academia and industry. Among existing techniques, self-training methods have arguably received the most attention. These methods seek the decision boundary in low-density regions without making additional assumptions about the data distribution, and use the unsigned output score of a learned classifier, i.e., its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to those unlabeled training samples whose margin exceeds a certain threshold. The pseudo-labeled examples are then used to enrich the labeled training data, and a new classifier is trained on the enlarged labeled set. We present self-training methods for binary and multiclass classification, as well as their recent variants based on neural networks. Finally, we discuss directions for future research on self-training. To the best of our knowledge, this is the first thorough and complete survey on the subject.
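The iterative procedure described above can be sketched in a few lines. The following is a minimal illustration, not a method from the survey itself: the choice of base classifier (logistic regression), the fixed confidence threshold, and all function names are assumptions made for the example. At each round, unlabeled points whose unsigned margin |f(x)| exceeds the threshold are pseudo-labeled and moved into the labeled set, and the classifier is retrained.

```python
# Minimal self-training sketch for binary classification (illustrative only;
# the base learner, threshold, and names below are assumptions, not the
# survey's method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, max_rounds=10):
    """Iteratively pseudo-label unlabeled samples whose unsigned margin
    |f(x)| exceeds `threshold`, then retrain on the enlarged labeled set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        scores = clf.decision_function(X_unlab)  # signed scores f(x)
        confident = np.abs(scores) >= threshold  # unsigned margin |f(x)|
        if not confident.any():
            break  # no sample is confident enough; stop
        pseudo_labels = (scores[confident] > 0).astype(int)
        # Enrich the labeled set with the pseudo-labeled examples
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo_labels])
        X_unlab = X_unlab[~confident]
        # Train a new classifier on labeled + pseudo-labeled data
        clf = LogisticRegression().fit(X_lab, y_lab)
    return clf
```

In practice the threshold may be tuned or annealed across rounds, and scikit-learn ships a comparable wrapper (`sklearn.semi_supervised.SelfTrainingClassifier`) for probability-based confidence.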