Where should I start with LDA text analysis in Python?

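I'm a second-year master's student, and I've recently been planning some research that combines topic modeling with recommender systems. So far I've only been reading up on theory (recommender system basics, the mathematical foundations of LDA, and Gibbs sampling), …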

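I'll just share one small application.

In "People's Daily on Weibo: A Small Text-Analysis Assignment", after scraping the posts with Python BeautifulSoup, segmenting the Chinese text with Python Jieba, and extracting keywords, I'm now ready to do topic modeling and see which topics the official Weibo account actually focuses on.

The tool used here is Gensim (= "Generate Similar"), an open-source natural language processing library that can do topic modeling without supervision.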

First, some basic concepts. All of them are taken from this beginner tutorial:

Gensim Tutorial

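Token − a single word.
Document − a sentence, paragraph, or whole document; in short, a Python 'str' object.
Corpus − a collection of documents; no manually annotated extra information is required.
Vector − a mathematical representation of a document.
BoW (bag-of-words) − one vector representation of a document, holding the number of times each dictionary word occurs in it.
Model − an algorithm that transforms one vector representation of a document into another.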
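
To make the Token / Document / Corpus / BoW terms concrete, here is a minimal pure-Python sketch of what a bag-of-words conversion does. The helpers `build_vocab` and `to_bow` are hypothetical names used only for illustration; they are a simplified stand-in for Gensim's `Dictionary` and `doc2bow`, and the ids Gensim assigns may differ.

```python
from collections import Counter

def build_vocab(texts):
    """Map each unique token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in texts:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# A tiny corpus: two documents, already split into tokens
texts = [["weather", "rain", "rain"], ["weather", "sunny"]]
vocab = build_vocab(texts)
bow_corpus = [to_bow(doc, vocab) for doc in texts]
print(vocab)
print(bow_corpus)
```

Each document ends up as a sparse vector: only the words that actually occur in it are listed, together with their counts.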

1. Convert the text into vectors that Gensim can process

import numpy as np
from gensim import corpora, models


# Read the input text: one document per line, already segmented and with stop words removed
with open('title_keywords.txt', encoding='utf-8') as f:
    texts = [line.split() for line in f]
M = len(texts)
print('Number of documents: %d' % M)

# Build the dictionary, which maps each unique token to an integer id
dictionary = corpora.Dictionary(texts)
print(dictionary)
V = len(dictionary)
print('Vocabulary size: %d' % V)

# Create the bag-of-words representation: each document becomes a list of (token_id, count) pairs
BoW_corpus = [dictionary.doc2bow(text) for text in texts]
print(BoW_corpus)
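
The bag-of-words counts printed above are what step 2 reweights with TF-IDF. As a rough illustration of that idea, the sketch below applies the common tf × log2(N / df) weighting to `(token_id, count)` vectors. The function `tfidf` is a hypothetical helper, not Gensim's implementation; Gensim's `TfidfModel` also normalizes each vector by default, which this sketch leaves out.

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Reweight (token_id, count) vectors with count * log2(N / document_frequency)."""
    n_docs = len(bow_corpus)
    # Number of documents in which each token id appears
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    return [
        [(token_id, count * math.log2(n_docs / doc_freq[token_id]))
         for token_id, count in doc]
        for doc in bow_corpus
    ]

bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
weighted = tfidf(bow)
print(weighted)
```

A token that appears in every document (id 0 here) gets weight 0, while tokens confined to a few documents are weighted up.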

2. Transform the vectors, fit the model, and output the results

# Compute TF-IDF weights for every document
corpus_tfidf = models.TfidfModel(BoW_corpus)[BoW_corpus]

# Fit the LDA model
num_topics = 15  # number of topics
lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                      alpha=0.01, eta=0.01, minimum_probability=0.001,
                      update_every=1, chunksize=100, passes=1)

# Topic distribution of every document
doc_topic = list(lda[corpus_tfidf])
print('Document-Topic:')
print(doc_topic)

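# Write the top words of each topic to a file, one topic per line
with open('ldatopic.txt', 'w', encoding='utf-8') as tm:
    for topic_id in range(num_topics):
        term_distribute_all = lda.get_topic_terms(topicid=topic_id, topn=15)
        term_distribute = np.array(term_distribute_all)
        term_id = term_distribute[:, 0].astype(int)  # np.int was removed in NumPy 1.24
        for t in term_id:
            # dictionary[...] fills the id-to-token mapping on demand; id2token may still be empty here
            tm.write(dictionary[int(t)] + " ")
        tm.write("\n")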
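
Since the goal is to see which topics the account cares about most, one simple follow-up is to pick the highest-probability topic for each document and tally the results. The sketch below assumes `doc_topic` has the shape printed above, a list of `(topic_id, probability)` lists; `dominant_topics` and the sample data are hypothetical and only for illustration.

```python
from collections import Counter

def dominant_topics(doc_topic):
    """Return the highest-probability topic id per document (None for an empty list)."""
    return [
        max(doc, key=lambda pair: pair[1])[0] if doc else None
        for doc in doc_topic
    ]

# Made-up topic distributions for three documents
example = [[(0, 0.7), (3, 0.3)], [(3, 0.9), (5, 0.1)], [(3, 0.6), (0, 0.4)]]
top = dominant_topics(example)
print(Counter(top).most_common())
```

With the real `doc_topic`, the tally shows how many posts fall mainly under each topic.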

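The number of topics and the number of words shown per topic both have to be found by trial and error, until the resulting clusters make sense to the eye rather than being odd combinations of words. It is partly a matter of luck.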
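
To make that trial and error depend a little less on eyeballing alone, one option is to score each topic's top words by how often they co-occur in the same documents. Gensim ships a `CoherenceModel` for this purpose; the pure-Python function below is only a simplified UMass-style stand-in, and `umass_coherence` is a hypothetical name. It sums the scores over word pairs rather than averaging them, so it is best used to compare topics with the same number of top words.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, texts):
    """UMass-style coherence of one topic; values closer to 0 mean the
    top words co-occur in the same documents more often."""
    doc_sets = [set(doc) for doc in texts]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations keeps ranking order, so `earlier` always outranks `later`
    for earlier, later in combinations(topic_words, 2):
        df_earlier = doc_freq(earlier)
        if df_earlier == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / df_earlier)
    return score

texts = [["economy", "growth", "market"], ["economy", "market"],
         ["football", "match"], ["football", "goal", "match"]]
coherent = umass_coherence(["economy", "market"], texts)
mixed = umass_coherence(["economy", "football"], texts)
print(coherent, mixed)
```

A topic whose top words never appear together scores lower, which is one way to flag the "odd combinations of words" when comparing runs with different topic counts.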

For example, in the content of the People's Daily official Weibo account, a few topics clustered quite clearly.

Edited 2022-01-16 23:48
