[Recommended] A Guide to Natural Language Processing (NLP)

November 17, 2017 · 机器学习研究会 (Machine Learning Research Society)
Abstract

Reposted from: 网路冷眼

Natural Language Processing (NLP) comprises a set of techniques that can be used to achieve many different objectives. Take a look at the following table to figure out which technique can solve your particular problem.

WHAT YOU NEED | WHERE TO LOOK
Grouping similar words for search | Stemming, Splitting Words, Parsing Documents
Finding words with the same meaning for search | Latent Semantic Analysis
Generating realistic names | Splitting Words
Estimating how long it takes to read a text | Reading Time
Understanding how difficult a text is to read | Readability of a Text
Identifying the language of a text | Identifying a Language
Generating a summary of a text | SumBasic (word-based), Graph-based Methods: TextRank (relationship-based), Latent Semantic Analysis (semantic-based)
Finding similar documents | Latent Semantic Analysis
Identifying entities (e.g., cities, people) in a text | Parsing Documents
Understanding the attitude expressed in a text | Parsing Documents
Translating a text | Parsing Documents

We are going to talk about parsing in the general sense of analyzing a document and extracting its meaning. So we will cover actual parsing of natural languages, but we will spend most of our time on other techniques. For programming languages, parsing is the way to understand the source; for natural languages, you can pick more specific alternatives. In other words, we are mostly going to talk about what you would use instead of parsing to accomplish your goals.

For instance, if you wanted to find all for statements in a programming language file, you would parse it and then count the for statements. To find all mentions of cats in a natural language document, by contrast, you would probably use something like stemming.
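As a minimal sketch of this contrast, assume the programming language is Python. The standard `ast` module parses the source, so the `for` statements can be counted exactly. For the natural language side, the hypothetical `crude_stem` helper below only lowercases a word and strips one trailing "s"; it is a stand-in for a real stemmer, not an implementation of one.

```python
import ast
import re

def count_for_statements(source: str) -> int:
    """Parse Python source and count its `for` statements."""
    tree = ast.parse(source)
    return sum(isinstance(node, (ast.For, ast.AsyncFor)) for node in ast.walk(tree))

def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a stemmer: lowercase, drop one trailing 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def count_mentions(text: str, term: str) -> int:
    """Count words whose crude stem matches the crude stem of `term`."""
    target = crude_stem(term)
    words = re.findall(r"[A-Za-z]+", text)
    return sum(crude_stem(w) == target for w in words)

code = "for i in range(3):\n    for j in range(2):\n        pass\n"
text = "Cats sleep a lot. My cat ignores other cats."

print(count_for_statements(code))   # 2
print(count_mentions(text, "cat"))  # 3
```

The first function relies on a full parse of the source; the second never builds any structure for the sentence and simply normalizes individual words.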

This is necessary because, although the theory behind parsing natural languages may be the same as the theory behind parsing programming languages, the practice is very different. In fact, you are not going to build a parser for a natural language, unless you work in artificial intelligence or as a researcher. You will rarely even use one. Instead, you will look for an algorithm that works on a simplified model of the document and solves only your specific problem.

In short, you are going to find tricks that let you avoid actually parsing a natural language. That is why this area of computer science is usually called natural language processing rather than natural language parsing.


Algorithms That Require Data

We are going to see specific solutions to each problem. Keep in mind that these solutions can be quite complex themselves. The more advanced they are, the less they rely on simple algorithms and the more they need a vast body of data about the language. A logical consequence is that it is rarely easy to adapt a tool built for one language to another. Or rather, the tool itself might work with a few adaptations, but building the data would require a large investment. So, for example, you would probably find a ready-to-use tool to summarize an English text, but maybe not one for an Italian text.

For this reason, in this article we concentrate mostly on English-language tools, although we mention when these tools work for other languages. You do not need to know the theoretical differences between languages, such as the number of genders or cases they have. However, you should be aware that the more a language differs from English, the harder it will be to apply these techniques or tools to it.

For example, you should not expect to find tools that work with Chinese (or rather, with the Chinese writing system). It is not necessarily that such languages are harder to understand programmatically, but there might be less research on them, or the methods might be completely different from the ones adopted for English.


The Structure of This Guide

This article is organized according to the tasks we want to accomplish, which means that the tools and explanations are grouped by the task they are used for. For instance, there is a section about measuring the properties of a text, such as its difficulty. The sections are also generally in ascending order of difficulty: it is easier to classify words than entire documents. We start with simple information retrieval techniques and end in the proper field of natural language processing.

We think this is the most useful way to provide the information you need: if you need to do X, we directly show the methods and tools you can use.


Table Of Contents

The following table of contents shows the whole content of this guide.

  1. Classifying Words

    • Stemming

    • Splitting Words

    • Grouping Similar Words

  2. Classifying Documents

    • Reading Time

    • Calculating the Readability of a Text

    • Text Metrics

    • Identifying a Language

  3. Understanding Documents

    • You Need Data

    • The Things You Can Do

    • The Libraries You Can Use

    • SumBasic

    • Graph-based Methods: TextRank

    • Latent Semantic Analysis

    • Other Methods and Libraries

    • Other Uses

    • Generation of Summaries

    • Parsing Documents

  4. Summary

Classifying Words

By classifying words, we mean techniques and libraries that group words together.


Grouping Similar Words

We are going to talk about two methods that group similar words together for the purpose of information retrieval. Basically, these methods are used to find the documents that contain the words we care about in a pool of all documents. That is useful because a user who searches for documents containing the word friend is probably equally interested in documents containing friends, and possibly friended and friendship.

So, to be clear, in this section we are not going to talk about methods to group semantically connected words, such as identifying all pets or all English towns.

The two methods are stemming and the division of words into groups of characters. The algorithms for the former are language dependent, while those for the latter are not. We are going to examine each of them separately.
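The second method, dividing words into groups of characters, can be sketched in a few lines. This is only an illustration, not the guide's own implementation: the names `char_ngrams` and `ngram_similarity` are hypothetical, the group size of 3 is an arbitrary choice, and Jaccard overlap is one assumed way of comparing the resulting sets. Because the code looks only at characters, nothing in it depends on the language.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Split a word into its overlapping groups of n characters."""
    word = word.lower()
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

query = "friend"
for candidate in ["friends", "friended", "friendship", "fried"]:
    # Higher scores mean the candidate shares more character groups with the query.
    print(candidate, round(ngram_similarity(query, candidate), 2))
```

Forms such as friends and friended share most of their character groups with friend, so they score high. An unrelated word like fried also shares some groups, which shows that this kind of matching can produce false positives.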


Link:

https://tomassetti.me/guide-natural-language-processing/


Original post:

https://m.weibo.cn/1715118170/4174604782463416

