Weakly-supervised text classification aims to induce text classifiers from only a few user-provided seed words. The vast majority of previous work assumes that high-quality seed words are given. However, expert-annotated seed words are sometimes non-trivial to come up with. Furthermore, in the weakly-supervised learning setting, no labeled documents are available to measure the seed words' efficacy, making seed word selection "a walk in the dark". In this work, we remove the need for expert-curated seed words. We first mine (noisy) candidate seed words associated with the category names. We then train an interim model for each individual candidate seed word. Lastly, we estimate the interim models' error rates in an unsupervised manner. The seed words that yield the lowest estimated error rates are added to the final seed word set. A comprehensive evaluation on six binary classification tasks across four popular datasets demonstrates that the proposed method outperforms a baseline that uses only category-name seed words and achieves performance comparable to a counterpart that uses expert-annotated seed words.
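The selection step described in the abstract (train one interim model per candidate seed word, estimate each model's error without labels, keep the candidates with the lowest estimates) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the interim model or the unsupervised error estimator, so `keyword_model` and `disagreement_with_majority` below are hypothetical stand-ins, and candidate mining is assumed to have already produced `candidates`.

```python
from typing import Callable, List, Sequence

Model = Callable[[str], int]  # maps a document to a label in {0, 1}


def select_seed_words(
    candidates: Sequence[str],
    docs: Sequence[str],
    train_interim_model: Callable[[str], Model],
    estimate_error: Callable[[List[List[int]]], List[float]],
    top_k: int,
) -> List[str]:
    """Keep the top_k candidates whose interim models have the lowest estimated error."""
    # One interim model per candidate seed word.
    models = [train_interim_model(word) for word in candidates]
    # Predictions of every interim model on the unlabeled documents.
    predictions = [[model(doc) for doc in docs] for model in models]
    # Unsupervised error estimate per model; no gold labels are involved.
    errors = estimate_error(predictions)
    ranked = sorted(zip(errors, candidates))
    return [word for _, word in ranked[:top_k]]


# --- Toy stand-ins (assumptions, NOT the paper's components) ---

def keyword_model(word: str) -> Model:
    """Hypothetical interim model: predict 1 if the document contains the word."""
    return lambda doc: int(word in doc.lower().split())


def disagreement_with_majority(predictions: List[List[int]]) -> List[float]:
    """Hypothetical proxy: share of documents where a model differs from the majority vote."""
    n_models, n_docs = len(predictions), len(predictions[0])
    majority = [
        int(2 * sum(p[j] for p in predictions) > n_models) for j in range(n_docs)
    ]
    return [
        sum(p[j] != majority[j] for j in range(n_docs)) / n_docs
        for p in predictions
    ]


docs = [
    "great movie with a great cast",
    "wonderful and great acting",
    "a wonderful story",
    "the plot was dull",
    "the dull script ruined it",
]
candidates = ["great", "wonderful", "the"]  # assumed output of candidate mining
selected = select_seed_words(
    candidates, docs, keyword_model, disagreement_with_majority, top_k=2
)
print(selected)
```

Passing the interim-model trainer and the error estimator in as callables keeps the selection logic independent of how either is realized, so a stronger classifier or a more principled unsupervised estimator could be swapped in without touching `select_seed_words`.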