Text Mining for Processing Interview Data in Computational Social Science

We use commercially available text analysis technology to process interview text data from a computational social science study. We find that topical clustering and terminological enrichment allow convenient exploration and quantification of the responses. This makes it possible to generate and test hypotheses and to compare textual and non-textual variables, and it saves analyst effort. We encourage social science studies to use text analysis, especially exploratory open-ended studies. We discuss how text analysis technology meets replicability requirements. We note that the most recent learning models are not designed with transparency in mind, whereas research requires a model to be editable and its decisions to be explainable. The tools available today, such as the one used in the present study, are not built for processing interview texts. While many of the variables under consideration are quantifiable using lexical statistics, we find that some interesting and potentially valuable features are difficult or impossible to automatise reliably at present. We note that traditional natural language processing mechanisms such as named entity recognition and anaphora resolution have some potentially interesting applications in this area. We conclude with a suggestion for language technologists to investigate the challenge of processing interview data comprehensively, especially the interplay between question and response, and we encourage social science researchers not to hesitate to use text analysis tools, especially in the exploratory phase of processing interview data.
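
As a rough illustration of what "quantifiable using lexical statistics" can mean in practice, the following minimal sketch counts topic-term mentions in interview responses. It is not the commercial tool used in the study. The topic lexicon `TOPIC_TERMS`, the helper names, and the sample responses are all hypothetical and serve only to show the idea.

```python
import re
from collections import Counter

# Hypothetical topic lexicon (an assumption for illustration, not from the study):
# topic label -> terms taken as evidence that a response touches on the topic.
TOPIC_TERMS = {
    "cost": {"price", "expensive", "cheap", "cost"},
    "trust": {"trust", "reliable", "honest"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def topic_counts(response):
    """Count how many tokens in one response belong to each topic."""
    counts = Counter()
    for token in tokenize(response):
        for topic, terms in TOPIC_TERMS.items():
            if token in terms:
                counts[topic] += 1
    return counts


def topic_shares(responses):
    """Return, per topic, the fraction of responses that mention it at least once."""
    if not responses:
        return {}
    mentioned = Counter()
    for response in responses:
        mentioned.update(topic_counts(response).keys())
    return {topic: mentioned[topic] / len(responses) for topic in TOPIC_TERMS}


# Invented sample responses, used only to exercise the functions.
responses = [
    "It was too expensive, and the price kept going up.",
    "I trust them; they have always been honest with me.",
    "Cheap, but I would not call it reliable.",
]

if __name__ == "__main__":
    for topic, share in topic_shares(responses).items():
        print(f"{topic}: mentioned in {share:.0%} of responses")
```

Per-topic shares of this kind can be set alongside non-textual variables such as respondent demographics. A fixed word list like this one captures only surface vocabulary, which is consistent with the observation that some features of interview data resist reliable automation.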
