Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach to topic modeling.
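The final step of the pipeline, the class-based TF-IDF procedure, can be illustrated with a minimal sketch. It is not the reference implementation: it assumes the embedding and clustering steps have already produced one cluster label per document, it tokenizes on whitespace, and it assumes the weighting `tf * log(1 + A / tf_t)`, where `A` is the average number of words per class and `tf_t` is the frequency of a term across all classes. The abstract does not state this formula. The function name `class_tfidf`, the toy documents, and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(docs, labels, top_n=3):
    """Rank terms per cluster with a class-based TF-IDF weighting (sketch)."""
    # Treat all documents in a cluster as one concatenated "class document".
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.lower().split())

    # Frequency of each term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per class (assumed normalizer A).
    avg_words = sum(total_counts.values()) / len(class_counts)

    topics = {}
    for label, counts in class_counts.items():
        # Assumed weighting: term frequency in the class times
        # log(1 + A / frequency of the term across all classes).
        scores = {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics[label] = ranked[:top_n]
    return topics


# Toy corpus; the labels stand in for the output of the clustering step.
docs = [
    "cat chased mouse",
    "cat slept",
    "stocks fell market closed",
    "market rallied",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments
topics = class_tfidf(docs, labels, top_n=2)
```

In this sketch, terms that are frequent within one cluster but rare across the corpus receive the highest weights, which is the sense in which the representation describes a topic rather than a single document.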