In the previous post we introduced several language models used in NLP. Among them, the RNN has a clear advantage over the neural network language model (NNLM) and the n-gram language model: an RNN takes all of the preceding words into account, whereas NNLM and n-gram models are both built on the Markov assumption, so the probability of the (n+1)-th word depends only on the previous n words. The problem with a vanilla RNN, however, is that during parameter learning, when the parameters are updated by back propagation, gradients can explode or vanish. LSTM (long short-term memory) and GRU (gated recurrent unit), proposed later, were designed precisely to address this problem, and both are still, in essence, RNNs.
We first discuss the differences and connections between RNNs and HMMs / LDSs, and then explain how to build a machine translation model with RNNs.
(1) RNN vs. HMM / LDS
If we unroll an RNN along the time dimension, it is also a linear chain, with a structure similar to the Hidden Markov Model (HMM) and the Linear Dynamical System (LDS). So what are the differences and connections between an RNN and an HMM or LDS?
HMMs and LDSs are two simple kinds of Dynamic Bayesian Network (DBN). The state space of an HMM is discrete, i.e. the hidden state is a discrete random variable, whereas the state space of an LDS is continuous and the hidden state is assumed to follow a Gaussian distribution. The discrete state space and the Gaussian assumption both make inference convenient, so each model has an exact inference algorithm, such as the Viterbi algorithm and the Kalman filter. If the hidden state is continuous but not Gaussian, exact inference becomes a serious problem, and we have to turn to approximate inference algorithms such as particle-based inference (e.g. MCMC) or variational inference (e.g. EM). In the parameter-learning stage of an HMM or LDS, the goal is to maximize the marginal likelihood, where the marginal probability of the observed variables is obtained by integrating out the hidden states.
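For concreteness, this is the standard HMM marginal likelihood that parameter learning maximizes, summing out the hidden state sequence (for an LDS the sum becomes an integral over Gaussian states); this is the textbook factorization, written here only to make the point explicit:

```latex
p(x_{1:T}) \;=\; \sum_{z_{1:T}} p(z_1)\,\prod_{t=2}^{T} p(z_t \mid z_{t-1})\,\prod_{t=1}^{T} p(x_t \mid z_t)
```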
Now consider the RNN. Although it is also a linear chain, it is deterministic: the transition from one hidden state to the next is a fixed function rather than a stochastic one, and so is the mapping from hidden state to observation. Since an RNN is a deterministic parametric model, once its parameters have been learned by back propagation, inference is straightforward; there is no need to integrate or sum over the hidden states. As a result, the hidden state of an RNN can be an arbitrary continuous vector, with no assumptions about its distribution. We will go into the technical details of RNNs in a later post.
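To make the contrast concrete, here is a minimal NumPy sketch of a single vanilla RNN step. Once the weights are learned, both the state transition and the emission are fixed functions with no randomness; the names and the tanh choice are illustrative assumptions, not tied to any particular paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b, c):
    """One deterministic RNN step: no sampling, no distribution over h_t."""
    h_t = np.tanh(W @ h_prev + U @ x_t + b)   # deterministic hidden-state transition
    logits = V @ h_t + c                      # deterministic emission (e.g. word logits)
    return h_t, logits
```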
(2) Machine translation
Machine translation is the task of translating language A into language B. The best translation models today are based on neural nets: a neural network is used to model a conditional distribution, namely the probability p( Sb | Sa ) of a target sentence Sb (in language B) given a source sentence Sa (in language A). At NIPS 2014, Ilya Sutskever et al. published a classic paper on machine translation: Sequence to Sequence Learning with Neural Networks. The paper splits machine translation into two stages: encoding the source sentence and decoding the target sentence. An encoding LSTM first encodes the source sentence into a fixed-dimensional vector C; then, taking C as input, a decoding LSTM decodes it into the target sentence.
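Here is a minimal PyTorch sketch of this two-stage encode-decode idea. The class and argument names are my own assumptions rather than the paper's code, and details such as the deep (multi-layer) LSTMs and the reversed source sentence used in the original paper are omitted:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_size=256, hidden_size=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_size)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size, batch_first=True)  # encoding LSTM
        self.decoder = nn.LSTM(emb_size, hidden_size, batch_first=True)  # decoding LSTM
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into a fixed-size state (the vector C).
        _, (h, c) = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence conditioned only on that state (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits
```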
In the parameter-learning stage, the parameters of the two LSTMs are trained jointly by back propagation. At translation time, given a sentence in language A, how do we translate it into language B? We first obtain the encoding of the source sentence with the encoding LSTM, then feed that encoding into the decoding LSTM, which produces the target words one after another; this sequence of words forms the target sentence. To find the target sentence with the highest conditional probability, i.e. the Sb that maximizes p( Sb | Sa ), brute-force search is out of the question; in practice a beam search algorithm is used to obtain a handful of candidate sentences with relatively high conditional probability. This encoding-decoding paradigm is a very effective framework for machine translation: it not only inspired a series of follow-up models, but also inspired image / video description research in computer vision.
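A rough sketch of beam search decoding follows. Here `step_logprobs` is a hypothetical function that, given the target prefix generated so far, returns log-probabilities for the next token; in a real system it would call the decoding LSTM conditioned on the source encoding:

```python
import heapq

def beam_search(step_logprobs, beam_size=4, max_len=50, bos="<s>", eos="</s>"):
    beams = [(0.0, [bos])]                       # (cumulative log-prob, token list)
    for _ in range(max_len):
        candidates = []
        for score, toks in beams:
            if toks[-1] == eos:                  # keep finished hypotheses as they are
                candidates.append((score, toks))
                continue
            for tok, lp in step_logprobs(toks).items():
                candidates.append((score + lp, toks + [tok]))
        beams = heapq.nlargest(beam_size, candidates, key=lambda b: b[0])
        if all(toks[-1] == eos for _, toks in beams):
            break
    return beams                                 # a few high-probability candidate sentences
```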
The translation model built on the encoding-decoding framework above has one prominent problem: the encoding LSTM compresses the entire source sentence into a single fixed-length vector C, and this vector C is the only input to the decoding LSTM. The consequence is that short sentences are translated reasonably well, but on long sentences the BLEU score drops noticeably (BLEU is a standard metric of machine translation quality). Intuitively this is easy to understand: if the decoding LSTM sees only one encoding vector for the whole source sentence, the temporal information in the source sentence and the information carried by each individual word are lost. To address this, Dzmitry Bahdanau et al. proposed the jointly align and translate model in Neural Machine Translation by Jointly Learning to Align and Translate. It is still based on the encoding-decoding framework, but the key difference is that the input to the decoding LSTM is no longer a single encoding vector C for the whole source sentence; instead, the decoder receives the per-word encodings produced by the encoding LSTM, so the information of every word is preserved. Their model not only translates the whole sentence, but can also align each translated word with the corresponding words in the source sentence. This paper is well worth a careful read.
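A minimal NumPy sketch of the alignment idea: at each decoding step the decoder scores every per-word source encoding against its current state and uses the weighted sum as context. The additive scoring form and all names here are assumptions following the general attention recipe, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(dec_state, enc_states, W_s, W_h, v):
    # dec_state: (d,); enc_states: (src_len, h) -- one encoding per source word
    scores = np.array([v @ np.tanh(W_s @ dec_state + W_h @ h_j) for h_j in enc_states])
    alphas = softmax(scores)        # soft alignment of the next target word to source words
    context = alphas @ enc_states   # (h,) context vector fed to the decoder at this step
    return context, alphas
```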
That's all for now; we have mainly walked through the best machine translation models currently available, and you probably still have plenty of questions about the details. In later posts (not necessarily within this "Natural Language Processing" series), we will dig into the technical details of RNNs and LSTMs.
Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT'14 English-French and WMT'16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.
Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT'14 English-German translation, we match the accuracy of (Vaswani et al., 2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 91 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset.
Machine translation systems require semantic knowledge and grammatical understanding. Neural machine translation (NMT) systems often assume this information is captured by an attention mechanism and a decoder that ensures fluency. Recent work has shown that incorporating explicit syntax alleviates the burden of modeling both types of knowledge. However, requiring parses is expensive and does not explore the question of what syntax a model needs during translation. To address both of these issues we introduce a model that simultaneously translates while inducing dependency trees. In this way, we leverage the benefits of structure while investigating what syntax NMT must induce to maximize performance. We show that our dependency trees are 1. language pair dependent and 2. improve translation quality.
In neural machine translation, a source sequence of words is encoded into a vector from which a target sequence is generated in the decoding phase. Differently from statistical machine translation, the associations between source words and their possible target counterparts are not explicitly stored. Source and target words are at the two ends of a long information processing procedure, mediated by hidden states at both the source encoding and the target decoding phases. This makes it possible that a source word is incorrectly translated into a target word that is not any of its admissible equivalent counterparts in the target language. In this paper, we seek to somewhat shorten the distance between source and target words in that procedure, and thus strengthen their association, by means of a method we term bridging source and target word embeddings. We experiment with three strategies: (1) a source-side bridging model, where source word embeddings are moved one step closer to the output target sequence; (2) a target-side bridging model, which explores the more relevant source word embeddings for the prediction of the target sequence; and (3) a direct bridging model, which directly connects source and target word embeddings seeking to minimize errors in the translation of ones by the others. Experiments and analysis presented in this paper demonstrate that the proposed bridging models are able to significantly improve quality of both sentence translation, in general, and alignment and translation of individual source words with target words, in particular.
Neural machine translation (NMT) has been a new paradigm in machine translation, and the attention mechanism has become the dominant approach with the state-of-the-art records in many language pairs. While there are variants of the attention mechanism, all of them use only temporal attention where one scalar value is assigned to one context vector corresponding to a source word. In this paper, we propose a fine-grained (or 2D) attention mechanism where each dimension of a context vector will receive a separate attention score. In experiments with the task of En-De and En-Fi translation, the fine-grained attention method improves the translation quality in terms of BLEU score. In addition, our alignment analysis reveals how the fine-grained attention mechanism exploits the internal structure of context vectors.
Homographs, words with different meanings but the same surface form, have long caused difficulty for machine translation systems, as it is difficult to select the correct translation based on the context. However, with the advent of neural machine translation (NMT) systems, which can theoretically take into account global sentential context, one may hypothesize that this problem has been alleviated. In this paper, we first provide empirical evidence that existing NMT systems in fact still have significant problems in properly translating ambiguous words. We then proceed to describe methods, inspired by the word sense disambiguation literature, that model the context of the input word with context-aware word embeddings that help to differentiate the word sense before feeding it into the encoder. Experiments on three language pairs demonstrate that such models improve the performance of NMT systems both in terms of BLEU score and in the accuracy of translating homographs.
Neural sequence-to-sequence networks with attention have achieved remarkable performance for machine translation. One of the reasons for their effectiveness is their ability to capture relevant source-side contextual information at each time-step prediction through an attention mechanism. However, the target-side context is solely based on the sequence model which, in practice, is prone to a recency bias and lacks the ability to capture effectively non-sequential dependencies among words. To address this limitation, we propose a target-side-attentive residual recurrent network for decoding, where attention over previous words contributes directly to the prediction of the next word. The residual learning facilitates the flow of information from the distant past and is able to emphasize any of the previously translated words, hence it gains access to a wider context. The proposed model outperforms a neural MT baseline as well as a memory and self-attention network on three language pairs. The analysis of the attention learned by the decoder confirms that it emphasizes a wider context, and that it captures syntactic-like structures.
Monolingual data have been demonstrated to be helpful in improving translation quality of both statistical machine translation (SMT) systems and neural machine translation (NMT) systems, especially in resource-poor or domain adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on training data. In each iteration step, both NMT models are first used to translate monolingual data from one language to the other, forming pseudo-training data of the other NMT model. Then two new NMT models are learnt from parallel data together with the pseudo-training data. Both NMT models are expected to be improved and better pseudo-training data can be generated in the next step. Experiment results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems which are enhanced with monolingual data for model training including back-translation.
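A very rough Python sketch of the iterative pseudo-parallel-data loop described in this abstract; `train_nmt`, `translate`, and the data containers are all hypothetical placeholders, not the paper's implementation or its EM formulation:

```python
def joint_training(train_nmt, translate, parallel_s2t, mono_src, mono_tgt, n_iters=3):
    # parallel_s2t: list of (src, tgt) pairs; mono_src / mono_tgt: monolingual sentences
    model_s2t = train_nmt(parallel_s2t)
    model_t2s = train_nmt([(t, s) for s, t in parallel_s2t])
    for _ in range(n_iters):
        # Each model translates monolingual text to build pseudo data for the other direction.
        pseudo_s2t = [(translate(model_t2s, t), t) for t in mono_tgt]
        pseudo_t2s = [(translate(model_s2t, s), s) for s in mono_src]
        model_s2t = train_nmt(parallel_s2t + pseudo_s2t)
        model_t2s = train_nmt([(t, s) for s, t in parallel_s2t] + pseudo_t2s)
    return model_s2t, model_t2s
```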
In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need of parallel data and propose a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Our model builds upon the recent work on unsupervised embedding mappings, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation. Despite the simplicity of the approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014 French-to-English and German-to-English translation. The model can also profit from small parallel corpora, and attains 21.81 and 15.24 points when combined with 100,000 parallel sentences, respectively. Our implementation is released as an open source project.
In this paper, we present Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences using Sleep-WAke Networks (SWAN), a recently proposed segmentation-based sequence modeling method. To mitigate the monotonic alignment requirement of SWAN, we introduce a new layer to perform (soft) local reordering of input sequences. Different from existing neural machine translation (NMT) approaches, NPMT does not use attention-based decoding mechanisms. Instead, it directly outputs phrases in a sequential order and can decode in linear time. Our experiments show that NPMT achieves superior performances on IWSLT 2014 German-English/English-German and IWSLT 2015 English-Vietnamese machine translation tasks compared with strong NMT baselines. We also observe that our method produces meaningful phrases in output languages.