Construction of a Large-Scale Chinese Diachronic Corpus and Research on Lexical Semantic Change

Project title: Construction of a Large-Scale Chinese Diachronic Corpus and Research on Lexical Semantic Change

Project number: No.61472017

Project type: General Program

Year of approval: 2015

Discipline: Computer Science

Project author: 胡俊峰 (Hu Junfeng)

Affiliation: Peking University (北京大学)

Funding: 800,000 yuan

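Abstract (translated from Chinese): Language change is an important topic in sociolinguistics and has seen notable progress over the past half-century, yet relatively little of this research has drawn on natural language processing and semantic mining techniques. One reason is that computational methods produce statistical results, which fall short of the precision sociolinguists require; another is that there is still no large-scale diachronic corpus that has been word-segmented, part-of-speech tagged, and manually proofread. Building on the applicant's existing work, this project plans to pursue three lines of research: (1) Construction of a Chinese diachronic corpus: a large-scale diachronic corpus consisting mainly of modern Chinese and including some ancient Chinese texts, built alongside an online platform for corpus search and data visualization. (2) Research on diachronic ontology mining algorithms and construction of a Chinese diachronic lexical ontology knowledge base. (3) Research on semantic change in the modern Chinese lexicon based on the diachronic lexical ontology, and construction of a word sense annotation knowledge base. Taken together, the project aims to build a large-scale diachronic corpus (including diachronic lexical ontology knowledge) while presenting a complete demonstration of how computational methods can be applied to the study of language change.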

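Keywords (translated from Chinese): diachronic corpus; ontology mining; language change; Chinese information processing; word sense disambiguation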

English abstract: The study of language change and variation has grown significantly over the past half-century, but little of this research has been conducted from the perspective of natural language processing (NLP) or semantic mining. One reason is that computational approaches yield only statistical trends and fail to meet the accuracy requirements of sociolinguists. The other is that there is not enough segmented, POS-tagged, and proofread diachronic corpus data to support such research. Building on solid existing work, this project will proceed along three lines: (1) Corpus construction: building a large-scale diachronic corpus consisting mainly of modern Chinese, together with Web-based corpus search tools and a data visualization platform; (2) Research on ontology mining and the construction of ontologies from the diachronic corpus; (3) Research on mining semantic change using diachronic ontologies, and construction of a knowledge base for lexical sense disambiguation. Through this work, the project aims to build a large-scale Chinese diachronic corpus (including diachronic ontologies) and a public research platform based on that corpus, which will also serve as a demonstration study of diachronic language variation using computational methods.

English keywords: Diachronic Corpus; Ontology Mining; Language Change and Variation; Chinese Language Processing; Word Sense Disambiguation
