Topic models are typically represented by top-$m$ word lists for human interpretation. The corpus is often pre-processed with lemmatization (or stemming) so that those representations are not undermined by a proliferation of words with similar meanings, but there is little published work on the effects of that pre-processing. Recent work studied the effect of stemming on topic models of English texts and found no evidence supporting the practice. We study the effect of lemmatization on topic models of Russian Wikipedia articles and find that, in one configuration, it significantly improves interpretability according to a word intrusion metric. We conclude that lemmatization may benefit topic models of morphologically rich languages, but that further investigation is needed.