Imbalanced data is common in the real world, especially in sentiment-related corpora, which makes it difficult to train a classifier to distinguish the latent sentiment in text data. We observe that humans often express a transition in emotion between two adjacent discourses with discourse markers such as "but", "though", and "while", and that the head discourse and the tail discourse usually indicate opposite emotional tendencies. Based on this observation, we propose a novel plug-and-play method that first samples discourses according to transitional discourse markers and then validates their sentiment polarities with the help of a pretrained attention-based model. Because our method increases sample diversity at the outset, it can serve as an upstream preprocessing step for data augmentation. We conduct experiments on three public sentiment datasets with several frequently used algorithms. Results show that our method is consistently effective, even in highly imbalanced scenarios, and can easily be integrated with oversampling methods to boost performance on imbalanced sentiment classification.
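The two stages named in the abstract (marker-based sampling, then polarity validation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The marker list contains only the three examples quoted above, the helper names `split_on_marker`, `sample_discourses`, and `toy_classify` are hypothetical, and the rule of keeping a pair only when its head and tail polarities differ is an assumption, since the abstract does not state the exact validation criterion. A keyword lookup stands in for the pretrained attention-based model.

```python
import re
from typing import Callable, List, Optional, Tuple

# Only the markers quoted in the abstract; the authors' full list is not given.
MARKERS = ("but", "though", "while")

# Match a marker as a whole word, case-insensitively.
_MARKER_RE = re.compile(r"\b(" + "|".join(MARKERS) + r")\b", re.IGNORECASE)


def split_on_marker(text: str) -> Optional[Tuple[str, str, str]]:
    """Split text at its first marker into (head, marker, tail), or None."""
    match = _MARKER_RE.search(text)
    if match is None:
        return None
    head = text[:match.start()].strip(" ,;")
    tail = text[match.end():].strip(" ,;.")
    if not head or not tail:
        return None  # marker at the very start or end: no discourse pair
    return head, match.group(1).lower(), tail


def sample_discourses(
    texts: List[str], classify: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (discourse, label) samples from texts that contain a marker.

    Assumption: a head/tail pair is kept only when the classifier assigns
    the two discourses different polarities.
    """
    kept: List[Tuple[str, str]] = []
    for text in texts:
        parts = split_on_marker(text)
        if parts is None:
            continue
        head, _, tail = parts
        head_label, tail_label = classify(head), classify(tail)
        if head_label != tail_label:
            kept.append((head, head_label))
            kept.append((tail, tail_label))
    return kept


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for the pretrained validator: keyword lookup."""
    negative = {"slow", "bad", "terrible"}
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "neg" if words & negative else "pos"


if __name__ == "__main__":
    reviews = [
        "The food was great, but the service was slow.",
        "I liked it.",
    ]
    for discourse, label in sample_discourses(reviews, toy_classify):
        print(label, "|", discourse)
```

In practice `classify` would wrap a real pretrained sentiment model, and the resulting samples could then be passed to an oversampling step, in line with the upstream-preprocessing role described above.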