Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
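To make the parameters under study concrete, the following is a minimal sketch, not the paper's actual code, of fitting LDA with gensim (an illustrative toolkit choice). The tunable parameters referred to in the abstract are the number of topics k and the Dirichlet priors alpha and beta (called `eta` in gensim); the documents and values below are purely illustrative, and the 50/k and 0.01 settings reflect the popular rules of thumb that the study finds unreliable across corpora.

```python
# A minimal sketch of LDA fitting with gensim, assuming a toy corpus.
# The three parameters tuned in the study: num_topics (k), alpha, eta (beta).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy tokenised documents standing in for GitHub / Stack Overflow text.
docs = [["parse", "error", "compiler"],
        ["merge", "pull", "request"],
        ["stack", "trace", "exception"]]

dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# Common rules of thumb suggest alpha = 50/k and beta = 0.01; the study
# reports that such defaults do not transfer to its corpora, which is why
# all three parameters are treated as tunable per corpus.
k = 2
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=k, alpha=50.0 / k, eta=0.01,
               passes=10, random_state=1)

for topic_id, words in lda.show_topics(num_topics=k, num_words=3):
    print(topic_id, words)
```

In a per-corpus configuration setting, k, alpha, and eta would be searched over rather than fixed as above, and model fit compared across candidate configurations.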