The extensive use of social media for sharing and obtaining information has led to the development of topic detection models that help make sense of the overwhelming volume of short and distributed posts. Probabilistic topic models, such as Latent Dirichlet Allocation, and matrix-factorization-based approaches, such as Latent Semantic Analysis and Non-negative Matrix Factorization, represent topics as sets of terms that are useful for many automated processes. However, determining what a topic is about is left as a further task. Alternatively, techniques that produce summaries yield human-comprehensible output that is less suitable for automated processing. This work proposes an approach that uses Linked Open Data (LOD) resources to extract semantically represented topics from collections of microposts. The approach uses entity linking to identify the elements of topics in microposts. The elements are related through co-occurrence graphs, which are processed to yield topics. The topics are represented using an ontology introduced for this purpose. A prototype of the approach is used to identify topics from 11 datasets consisting of more than one million posts collected from Twitter during various events, such as the 2016 US election debates and the death of Carrie Fisher. The characteristics of the approach are described in detail on the basis of more than 5,000 generated topics. The potential of semantic topics to reveal information that is not otherwise easily observable is demonstrated with semantic queries of varying complexity. A human evaluation of topics from 36 randomly selected intervals resulted in a precision of 81.0% and an F1 score of 93.3%. Furthermore, the topics are compared with topics generated from the same datasets by an approach that produces human-readable topics from microblog post collections.
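The step from linked entities to topics can be illustrated with a minimal sketch. The abstract does not say how the co-occurrence graphs are processed, so the code below makes two assumptions: infrequent co-occurrences are dropped using a hypothetical `min_weight` threshold, and connected components stand in for whatever graph processing the authors actually use. The entity identifiers and the `posts` sample are invented for illustration and are not taken from the paper's datasets.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_graph(posts, min_weight=2):
    """Link two entities when they appear together in at least min_weight posts."""
    counts = Counter()
    for entities in posts:
        # Each unordered pair of distinct entities in a post counts once.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    graph = defaultdict(set)
    for (a, b), weight in counts.items():
        if weight >= min_weight:  # assumed pruning rule, not from the paper
            graph[a].add(b)
            graph[b].add(a)
    return graph


def candidate_topics(graph):
    """Return the connected components of the graph as sets of entities."""
    seen, topics = set(), []
    for start in list(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        topics.append(component)
    return topics


# Hypothetical entity-linker output: one list of DBpedia-style identifiers per post.
posts = [
    ["dbr:Hillary_Clinton", "dbr:Donald_Trump", "dbr:Debate"],
    ["dbr:Donald_Trump", "dbr:Hillary_Clinton"],
    ["dbr:Carrie_Fisher", "dbr:Star_Wars"],
    ["dbr:Star_Wars", "dbr:Carrie_Fisher", "dbr:Princess_Leia"],
    ["dbr:Debate", "dbr:Star_Wars"],
]

topics = candidate_topics(cooccurrence_graph(posts))
for topic in topics:
    print(sorted(topic))
```

In this toy input, only pairs seen in at least two posts survive pruning, so the single post mentioning both a debate and Star Wars does not merge the two groups. A real pipeline would also need the entity-linking step itself and a mapping of each group onto the topic ontology, both of which are outside this sketch.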