A Comprehensive Summary of Word Embedding Papers (with Paper Links)

A curious mind who loves life

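I have recently been doing some research on word embeddings and put together a list of the relevant papers. It covers early statistical methods, the well-known word2vec, the more recent pre-trained contextualized word embeddings, and some new embedding architectures proposed lately. Since plenty of articles online already explain these methods in detail, this post only lists the more important papers from each stage together with their key ideas. Additions are welcome.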

P.S. Papers that are not in bold can be skipped on a first pass.

First, some surveys. For each one I roughly list the methods it reviews, so you can use these surveys to get a general picture of word embeddings.

A survey of word embeddings based on deep learning, Wang et al., 2019

Word2Vec, GloVe, FastText;
ELMo, GPT, BERT.

Word Embeddings: A Survey, Almeida and Xexeo, 2019

Early distributed word embeddings;
Word2Vec, GloVe, FastText.

A Survey on Contextual Embeddings, Liu et al., 2020

ELMo, GPT, BERT, variants of BERT (ERNIE, RoBERTa, ALBERT, XLNet, …)

A Survey on Language Models, Qudar and Mago, 2020

Static Word Embedding (Word2Vec, GloVe, FastText);
Contextualized Word Embedding (ELMo, BERT, ALBERT, BioBERT, SciBERT, ...)

A Survey of Word Embeddings Evaluation Methods, Bakarov, 2018

Extrinsic evaluation: downstream NLP models, e.g., POS tagging, NER, ...
Intrinsic evaluation: word semantic similarity, word analogy, synonym detection,...

From Word To Sense Embeddings: A Survey on Vector Representations of Meaning, Camacho-Collados and Pilehvar, 2018

The rest of this post is organized into the following directions:

Statistical models (early statistical methods)

Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), …

Early distributed word representations (early methods that represent words as dense vectors)

Neural probabilistic language model, …

Static distributed word embedding (the well-known methods of recent years)

Word2Vec, GloVe, FastText, …

Contextual word embedding (context-dependent word representations obtained through pre-training)

ELMo, GPT, BERT, …

Recent new embedding architecture (some recently proposed embedding structures)

Complex representation, Gaussian representation, Cone representation, Box representation

Feel free to jump to whichever part interests you. (I suggest starting directly from the static distributed word embedding part; the earlier material is really quite old, haha.)

Statistical models and Early distributed word representations

Word Representations: A Simple and General Method for Semi-Supervised Learning, Turian et al., 2010 (ACL-IJCNLP 2021, The Test of Time Award)

This paper summarizes three types of word representations that predate word2vec:

Distributional word representations:

LSA, LDA, Hyperspace Analogue to Language (HAL), …

Clustering-based word representations:

Brown Clusters, ...

Distributed word representation (word embedding):

Collobert and Weston (2008) embeddings
Hierarchical Log-BiLinear (HLBL) embeddings

a) Distributional word representations

The main idea is to exploit co-occurrence information between words and their contexts. The general model structure is as follows:

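First, build a co-occurrence matrix F of size W x C, where W is the vocabulary size and C is the number of contexts.
For word-word co-occurrence, C is the vocabulary size;
for word-document co-occurrence, C is the total number of documents.
Element (i, j) of F holds the co-occurrence information between the i-th word and the j-th context.
For word-word co-occurrence, it is the number of times the i-th word and the j-th word appear together (within some window);
for word-document co-occurrence, it is the number of times the i-th word appears in the j-th document.
The co-occurrence information can be the raw count, the PMI (pointwise mutual information), or tf-idf.
Each row of F can be viewed as the initial representation of a word (of dimension C), and each column as the initial representation of a context (of dimension W). These representations are usually sparse.
Then apply some function g that maps F to a W x d matrix G, where d is usually much smaller than C. Each row of the new matrix G is the new d-dimensional representation of a word.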
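
As a rough illustration of the first steps above, here is a minimal sketch that builds word-word co-occurrence counts with a symmetric window and converts them to PMI. The function names, the toy corpus, and the choice to score only observed pairs (unobserved pairs would have PMI of negative infinity) are assumptions made for this example, not something taken from any of the papers.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context word) pairs that co-occur within a symmetric window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def pmi_scores(counts):
    """PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ), computed for observed pairs only."""
    total = sum(counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    return {
        (w, c): math.log(n * total / (word_totals[w] * context_totals[c]))
        for (w, c), n in counts.items()
    }

# Toy corpus, made up for illustration.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
F = cooccurrence_counts(toy_corpus, window=1)   # sparse F stored as a Counter
F_pmi = pmi_scores(F)
```

Storing F as a sparse dictionary of pairs rather than a dense W x C array reflects the fact that most entries are zero.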

An Introduction to Latent Semantic Analysis (LSA), Landauer et al., 1998

LSA considers word-document co-occurrence, and g computes the SVD of F (keeping only the top d singular directions).
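
The LSA recipe can be sketched with NumPy's `np.linalg.svd`. The vocabulary, the counts in F, and the choice d = 2 are all made up for illustration; real LSA pipelines usually also reweight the counts (for example with tf-idf) before the decomposition.

```python
import numpy as np

# Toy word-document count matrix F (rows: words, columns: documents).
vocab = ["cat", "dog", "pet", "stock", "market"]
F = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

d = 2  # target dimensionality (assumed for this toy example)
U, S, Vt = np.linalg.svd(F, full_matrices=False)
G = U[:, :d] * S[:d]  # each row of G is a d-dimensional word representation

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine(G[0], G[1])    # words that share documents
sim_cat_stock = cosine(G[0], G[3])  # words that never share a document
```

In this toy matrix "cat" and "dog" occur in the same documents while "cat" and "stock" never do, so the first similarity comes out much larger than the second.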

Producing high-dimensional semantic spaces from lexical co-occurrence (HAL), Lund and Burgess, 1996

HAL considers word-word co-occurrence (counting context words only on the left or only on the right), and g keeps the 200 columns with the highest variance.
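
The column-selection step can be sketched as follows. The helper names and the tiny matrix are hypothetical, and k = 2 stands in for the 200 columns mentioned above.

```python
from statistics import pvariance

def top_variance_columns(F, k):
    """Return the indices of the k highest-variance columns of F, in column order."""
    n_cols = len(F[0])
    variances = [pvariance([row[j] for row in F]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])

def select_columns(F, columns):
    """Keep only the given columns of F."""
    return [[row[j] for j in columns] for row in F]

# Toy co-occurrence matrix: 3 words x 4 contexts (values are made up).
F = [
    [5, 1, 0, 2],
    [0, 1, 9, 2],
    [4, 1, 0, 3],
]
kept = top_variance_columns(F, k=2)
G = select_columns(F, kept)
```

Column 1 is constant and column 3 barely varies, so neither helps tell the words apart and both are dropped.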

Latent Dirichlet Allocation (LDA), Blei et al., 2003

b) Clustering-based word representations

Class-Based n-gram Models of Natural Language, Brown et al., 1992

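A slide deck for reference: [2019] Class-based N-gram Models of Natural Language
Main idea: cluster words into classes and redefine the language model in terms of those classes, where c_j is the class of w_j: p(w_j | w_1, ..., w_(j-1)) = p(w_j | c_j) * p(c_j | c_1, ..., c_(j-1))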
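
A minimal sketch of the class-based factorization, under two simplifying assumptions that are mine rather than the source's: the class history is truncated to the previous class only (a bigram model), and both factors are plain maximum-likelihood estimates. The word-to-class mapping and the corpus are hypothetical.

```python
from collections import Counter

# Hypothetical hard clustering: every word belongs to exactly one class.
word_class = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "runs": "VERB"}
corpus = ["the", "cat", "runs", "a", "dog", "runs", "the", "dog", "runs"]

classes = [word_class[w] for w in corpus]
word_counts = Counter(corpus)
class_counts = Counter(classes)
class_bigrams = Counter(zip(classes, classes[1:]))
prev_class_counts = Counter(classes[:-1])

def class_bigram_prob(prev_word, word):
    """p(word | prev_word) ~= p(word | class(word)) * p(class(word) | class(prev_word))."""
    c_prev, c = word_class[prev_word], word_class[word]
    emission = word_counts[word] / class_counts[c]
    transition = class_bigrams[(c_prev, c)] / prev_class_counts[c_prev]
    return emission * transition

p_a_cat = class_bigram_prob("a", "cat")
p_the_dog = class_bigram_prob("the", "dog")
```

The bigram "a cat" never occurs in the toy corpus, yet it receives a nonzero probability because DET followed by NOUN is frequent. Sharing statistics across a class in this way is the point of the model.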

Name Tagging with Word Clusters and Discriminative Training, Miller et al., 2004

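Main idea: cluster words into a binary clustering tree in which every word is a leaf. Record the path from the root to that leaf and use it to encode the word as a 0-1 bit string, where 0 denotes the left child and 1 the right child.
In practice, only the first 16 or 20 bits of the encoding are used.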
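
The path encoding can be sketched as below. The tree, stored here as nested tuples, is hypothetical, and because it is so shallow the example takes prefixes of length 1 and 2 instead of 16 or 20 bits.

```python
# Hypothetical binary clustering tree: (left_subtree, right_subtree); leaves are words.
tree = ((("apple", "pear"), "car"), ("run", "walk"))

def bit_paths(node, prefix=""):
    """Map each leaf word to its root-to-leaf path (0 = left child, 1 = right child)."""
    if isinstance(node, str):
        return {node: prefix}
    left, right = node
    paths = bit_paths(left, prefix + "0")
    paths.update(bit_paths(right, prefix + "1"))
    return paths

def prefix_features(path, lengths=(1, 2)):
    """Truncate a bit path to several prefix lengths; shorter prefixes mean coarser clusters."""
    return [path[:n] for n in lengths]

paths = bit_paths(tree)
apple_features = prefix_features(paths["apple"])
pear_features = prefix_features(paths["pear"])
```

Words that sit close together in the tree share long prefixes, so a truncated prefix acts as a cluster identifier at a chosen granularity.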

c) Early Distributed Word Representation (early methods that train word embeddings with neural networks)

A neural probabilistic language model, Bengio et al., 2003

The first paper to propose modeling a language model with a word embedding lookup table and a neural network.

A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Collobert and Weston, 2008

Real n-grams are treated as positive samples and corrupted n-grams as negative samples. The embeddings of the words in an n-gram are concatenated and fed into a CNN, which is trained to output a high score s(x) for the ground-truth n-gram and a low score s(x') for the corrupted n-gram.
The loss function is L(x) = max(0, 1 - s(x) + s(x'))
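
The ranking loss itself is easy to sketch. The scores below are made-up numbers standing in for the network outputs s(x) and s(x'), and the scoring network is not modeled here.

```python
def ranking_hinge_loss(score_true, score_corrupt, margin=1.0):
    """L = max(0, margin - s(x) + s(x')); zero once the true n-gram wins by the margin."""
    return max(0.0, margin - score_true + score_corrupt)

# Hypothetical scores for a real n-gram and its corrupted version.
loss_separated = ranking_hinge_loss(score_true=2.0, score_corrupt=0.5)
loss_violated = ranking_hinge_loss(score_true=0.2, score_corrupt=0.5)
```

The loss only produces a gradient when the corrupted n-gram scores within the margin of the real one, so pairs that are already well separated stop contributing to training.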

A Scalable Hierarchical Distributed Language Model, Mnih and Hinton, 2009

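The ideas in the word2vec papers are somewhat similar to this one. This paper concatenates the embeddings of the first n-1 words and uses them to predict the embedding of the last word.
Similarity between words is likewise modeled as a probability, obtained by exponentiating and then normalizing.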
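
A minimal sketch of "exponentiating and then normalizing": given a predicted vector for the next word, score every vocabulary word by its dot product with that vector and turn the scores into a distribution. The embeddings and the predicted vector are made up, and the flat normalization over the whole vocabulary shown here is exactly the cost that the hierarchical model in the paper is designed to avoid.

```python
import math

def softmax_over_vocab(predicted, embeddings):
    """Dot each word vector with the predicted vector, exponentiate, then normalize."""
    scores = {
        w: sum(p * v for p, v in zip(predicted, vec))
        for w, vec in embeddings.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical 2-dimensional embeddings and predicted next-word vector.
embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [-1.0, 0.0]}
probs = softmax_over_vocab([1.0, 0.0], embeddings)
```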

……

Static Distributed Word Embeddings

Next come the well-known word embedding methods of recent years. When people talk about word embeddings, they are usually referring to this family of methods.

a) Word2Vec:

Efficient Estimation of Word Representations in Vector Space, Mikolov et al., 2013

The origin of word2vec. It proposes skip-gram and CBOW (continuous bag-of-words). In this original paper, however, hierarchical softmax is used to handle the expensive denominator in the probability formula.
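
The difference between the two architectures shows up in how training examples are formed: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context. The sketch below only builds the examples; the function names and window size are assumptions for illustration, and the model and the hierarchical softmax are not shown.

```python
def context_indices(i, length, window):
    """Positions within `window` of position i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window=2):
    """One (center, context) pair per context word."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in context_indices(i, len(tokens), window)
    ]

def cbow_examples(tokens, window=2):
    """One (context words, center) example per position."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
        for i in range(len(tokens))
    ]

tokens = ["a", "b", "c"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```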

Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al., 2013

Proposes negative sampling to speed up computation. The word2vec that people use today is essentially this version.
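
For one (center, context) pair, the negative sampling objective can be sketched as a loss: push the dot product with the true context word up and the dot products with k sampled negative words down. The vectors below are made up, and how the negatives are sampled is left out of the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """-log sigmoid(v_c . u_o) - sum over negatives of log sigmoid(-v_c . u_k)."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Hypothetical 2-dimensional vectors.
aligned = sgns_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
misaligned = sgns_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Each update touches only 1 + k output vectors instead of the whole vocabulary, which is where the speed-up comes from.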

Neural Word Embedding as Implicit Matrix Factorization, Levy and Goldberg, 2014

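This paper provides a theoretical analysis showing that word2vec (specifically, skip-gram with negative sampling) is implicitly factorizing a word-context matrix in which each entry is the pointwise mutual information (PMI) shifted by a global constant.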
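
In the paper's analysis the global constant is log k, where k is the number of negative samples, so the matrix entry for a pair is PMI(w, c) - log k. A small sketch of that entry, with made-up counts and an illustrative function name:

```python
import math

def shifted_pmi(pair_count, word_count, context_count, total_pairs, k):
    """PMI(w, c) - log(k), where k is the number of negative samples."""
    pmi = math.log(pair_count * total_pairs / (word_count * context_count))
    return pmi - math.log(k)

# Made-up counts: the pair occurs once, each word occurs twice, 8 pairs in total.
entry_k1 = shifted_pmi(1, 2, 2, 8, k=1)  # plain PMI, since log(1) = 0
entry_k2 = shifted_pmi(1, 2, 2, 8, k=2)  # PMI shifted down by log(2)
```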

b) GloVe:

GloVe: Global Vectors for Word Representation, Pennington et al., 2014

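The two best-known families of word embeddings are word2vec, described above, and GloVe from this paper. word2vec maximizes a probability approximated through vector inner products, whereas GloVe fits word co-occurrence statistics directly.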
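
To make "fits co-occurrence statistics directly" concrete, here is a sketch of the GloVe loss for a single word pair: a weighted squared error between the model's prediction (dot product plus two biases) and the log of the co-occurrence count. The defaults x_max = 100 and alpha = 0.75 are the values reported in the paper; the function names and the example numbers are mine.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w, w_tilde, b, b_tilde, x):
    """f(x) * (w . w_tilde + b + b_tilde - log x)^2 for one pair with count x > 0."""
    prediction = sum(a * c for a, c in zip(w, w_tilde)) + b + b_tilde
    return glove_weight(x) * (prediction - math.log(x)) ** 2

# A pair whose prediction matches log(10) exactly incurs no loss.
perfect_fit = glove_pair_loss([0.0], [0.0], math.log(10), 0.0, 10)
```

The full objective sums this term over all pairs with a nonzero count, so training works on the aggregated co-occurrence matrix rather than on individual windows of text.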

c) FastText:

Enriching Word Vectors with Subword Information, Bojanowski et al., 2017

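A subword version of skip-gram. This paper works at the level of character n-grams: it applies the skip-gram objective to learn a vector for every character n-gram, and each word is represented as a bag of character n-grams, i.e., by the sum (or average) of the vectors of all its character n-grams.
Starting from subwords lets fastText handle words that never appeared in the training corpus.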
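
A minimal sketch of the subword idea. As in the paper, the word is wrapped in the boundary symbols < and > before n-grams are extracted, and n ranges from 3 to 6 by default. The n-gram table is hypothetical, the sketch averages rather than sums, and it leaves out the special whole-word sequence that the paper also includes.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary symbols < and >."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Average the vectors of the word's n-grams found in the table (zeros if none)."""
    known = [ngram_vectors[g] for g in char_ngrams(word, n_min, n_max) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]

# Hypothetical table of learned n-gram vectors.
toy_table = {"<wh": [1.0, 0.0], "re>": [0.0, 1.0]}
trigrams = char_ngrams("where", 3, 3)
seen = word_vector("where", toy_table, dim=2, n_min=3, n_max=3)
unseen = word_vector("there", toy_table, dim=2, n_min=3, n_max=3)
```

Here "there" plays the role of a word missing from the training vocabulary: it still gets a non-trivial vector because it shares the n-gram `re>` with "where".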

Contextual Word Embeddings

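Next come the pre-trained models that have recently become extremely popular. The vector that such a model outputs for each word can be viewed as a representation that incorporates contextual information.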

a) ELMo: (Embeddings from Language Models)

Deep Contextualized Word Representations, Peters et al., 2018