CompText: Visualizing, Comparing & Understanding Text Corpus

A common practice in Natural Language Processing (NLP) is to visualize a text corpus so that its central idea and key points can be grasped without reading through the entire literature. For a long time, researchers focused on extracting topics from the text and visualizing them based on their relative significance in the corpus. Recently, however, researchers have started building more complex systems that expose not only the topics of the corpus but also the words closely related to each topic, giving users a holistic view. These detailed visualizations spawned research on comparing text corpora based on their visualizations. Topics are often compared to identify the differences between corpora. However, to capture greater semantics from different corpora, researchers have started to compare texts based on the sentiment of the topics related to the text. By comparing the words that carry the most weight, we can get an idea of the important topics in a corpus. Multiple existing text comparison methods compare topics rather than sentiments, but we feel that focusing on sentiment-carrying words would better compare the two corpora. Only sentiments can explain the real feeling of a text; topics without sentiments are just nouns. We aim to differentiate corpora with a focus on sentiment, as opposed to comparing all the words appearing in the two corpora. The rationale is that two corpora do not have many identical words for side-by-side comparison, so comparing the sentiment words gives us an idea of how the corpora appeal to the emotions of the reader. We can argue that the entropy, or unexpectedness, and the divergence of topics should also be of importance, helping us identify key pivot points and the importance of certain topics in the corpus alongside relative sentiment.
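
To make the idea of comparing corpora through sentiment-carrying words, entropy, and divergence more concrete, the following is a minimal sketch and not the method of this work. The tiny lexicon `SENTIMENT_WORDS`, the function names, and the choice of Jensen-Shannon divergence as the divergence measure are all assumptions introduced here for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy sentiment lexicon (an assumption, not from the paper);
# a real lexicon would be far larger.
SENTIMENT_WORDS = {"good", "great", "love", "bad", "poor", "hate"}


def sentiment_distribution(text, lexicon=SENTIMENT_WORDS):
    """Relative frequency of each lexicon word among the sentiment words in text."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t in lexicon)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def entropy(p):
    """Shannon entropy of a distribution, in bits."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; assumes p and q are both non-empty."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of words.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


corpus_a = "The food was good, great service. I love it!"
corpus_b = "Bad food and poor service. I hate it."

p = sentiment_distribution(corpus_a)
q = sentiment_distribution(corpus_b)
print("entropy A:", entropy(p))
print("entropy B:", entropy(q))
print("JS divergence:", js_divergence(p, q))
```

In this toy example the two corpora share topic words such as "food" and "service" but no sentiment words, so the divergence over sentiment words reaches its maximum of 1 bit even though a topic-only comparison would treat the texts as similar.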
