June 9, 2019 · AINLP


A curated list of NLP resources focused on BERT, attention mechanism, Transformer networks, and transfer learning.

Awesome BERT & Transfer Learning in NLP 

This repository contains a hand-curated list of great machine (deep) learning resources for Natural Language Processing (NLP), with a focus on Bidirectional Encoder Representations from Transformers (BERT), the attention mechanism, Transformer architectures/networks, and transfer learning in NLP.

Table of Contents

  • Papers

  • Articles

    • BERT and Transformer

    • Attention Concept

    • Transformer Architecture

  • Official Implementations

  • Other Implementations

    • PyTorch

    • Keras

    • TensorFlow

    • Chainer

  • Transfer Learning in NLP

  • Other Resources

  • Tools

  • Tasks

    • Named-Entity Recognition (NER)

    • Classification

    • Text Generation

    • Question Answering (QA)

    • Knowledge Graph


Papers

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

  2. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le and Ruslan Salakhutdinov.

  • Uses smart caching to improve the learning of long-term dependencies in the Transformer. Key results: state-of-the-art on 5 language modeling benchmarks, including a perplexity of 21.8 on One Billion Word (LM1B) and 0.99 bits per character on enwiki8. The authors claim that the method is more flexible, faster during evaluation (1874 times speedup), generalizes well on small datasets, and is effective at modeling short and long sequences.

  3. Conditional BERT Contextual Augmentation by Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han and Songlin Hu.

  4. SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering by Chenguang Zhu, Michael Zeng and Xuedong Huang.

  5. Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever.

  6. The Evolved Transformer by David R. So, Chen Liang and Quoc V. Le.

  • Uses architecture search to improve the Transformer architecture. The key is to use evolution and to seed the initial population with the Transformer itself. The resulting architecture is better and more efficient, especially for small models.
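
The seeding idea in the Evolved Transformer note above can be illustrated with a toy evolutionary loop. This is a minimal sketch under stated assumptions, not the paper's procedure: the layer vocabulary `LAYER_CHOICES`, the scoring function `toy_fitness`, and the oldest-out replacement rule are hypothetical stand-ins, and a real search would train and evaluate every candidate model.

```python
import random

# Hypothetical search space: each architecture is a list of layer types.
LAYER_CHOICES = ["attention", "conv", "feed_forward", "identity"]


def toy_fitness(arch):
    """Hypothetical stand-in for validation quality (higher is better).

    A real architecture search would train each candidate and measure it.
    """
    score = 0.0
    for position, layer in enumerate(arch):
        if layer == "conv" and position < 2:
            score += 1.5  # arbitrary rule: reward convolutions in early layers
        elif layer == "attention":
            score += 1.0
        elif layer == "feed_forward":
            score += 0.8
        # "identity" layers and late convolutions add nothing in this toy
    return score


def mutate(arch, rng):
    """Return a copy of `arch` with one randomly chosen layer replaced."""
    child = list(arch)
    child[rng.randrange(len(child))] = rng.choice(LAYER_CHOICES)
    return child


def evolve(seed_arch, fitness, population_size=8, tournament_size=3,
           steps=50, rng=None):
    """Tournament-selection search whose population starts from `seed_arch`."""
    rng = rng or random.Random(0)
    # Seeding: every initial individual is the known-good architecture,
    # so the search starts from a strong point instead of random models.
    population = [list(seed_arch) for _ in range(population_size)]
    best, best_score = list(seed_arch), fitness(seed_arch)
    for _ in range(steps):
        contenders = rng.sample(population, tournament_size)
        parent = max(contenders, key=fitness)
        child = mutate(parent, rng)
        population.pop(0)         # assumption: drop the oldest individual
        population.append(child)
        child_score = fitness(child)
        if child_score > best_score:
            best, best_score = child, child_score
    return best, best_score


seed = ["attention", "feed_forward"] * 3
best, best_score = evolve(seed, toy_fitness)
print(best, best_score)
```

Because the seed is scored before the loop starts, the best architecture returned can never score below the seed, which is the practical appeal of starting the search from a known-good model.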


Articles

BERT and Transformer

  1. Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing from Google AI.

  2. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning).

  3. Dissecting BERT by Miguel Romero and Francisco Ingham - Understand BERT in depth with an intuitive, straightforward explanation of the relevant concepts.

  4. A Light Introduction to Transformer-XL.

  5. Generalized Language Models by Lilian Weng, Research Scientist at OpenAI.

Attention Concept

  1. The Annotated Transformer by Harvard NLP Group - Further reading to understand the "Attention is all you need" paper.

  2. Attention? Attention! - Attention guide by Lilian Weng from OpenAI.

  3. Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) by Jay Alammar, an Instructor from Udacity ML Engineer Nanodegree.

Transformer Architecture

  1. The Transformer blog post.

  2. The Illustrated Transformer by Jay Alammar, an Instructor from Udacity ML Engineer Nanodegree.

  3. Watch Łukasz Kaiser’s talk walking through the model and its details.

  4. Transformer-XL: Unleashing the Potential of Attention Models by Google Brain.

  5. Generative Modeling with Sparse Transformers by OpenAI - an algorithmic improvement of the attention mechanism to extract patterns from sequences 30x longer than possible previously.

OpenAI Generative Pre-Training Transformer (GPT) and GPT-2

  1. Better Language Models and Their Implications.

  2. Improving Language Understanding with Unsupervised Learning - this is an overview of the original GPT model.

  3. 🦄  How to build a State-of-the-Art Conversational AI with Transfer Learning by Hugging Face.

Additional Reading

  1. How to Build OpenAI's GPT-2: "The AI That's Too Dangerous to Release".

  2. OpenAI’s GPT2 - Food to Media hype or Wake Up Call?

Official Implementations

  1. google-research/bert - TensorFlow code and pre-trained models for BERT.

Other Implementations


PyTorch

  1. huggingface/pytorch-pretrained-BERT - A PyTorch implementation of Google AI's BERT model with a script to load Google's pre-trained models, by Hugging Face.

  2. codertimo/BERT-pytorch - Google AI 2018 BERT pytorch implementation.

  3. innodatalabs/tbert - PyTorch port of BERT ML model.

  4. kimiyoung/transformer-xl - Code repository associated with the Transformer-XL paper.

  5. dreamgonfly/BERT-pytorch - PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

  6. dhlee347/pytorchic-bert - PyTorch implementation of Google BERT.


Keras

  1. Separius/BERT-keras - Keras implementation of BERT with pre-trained weights.

  2. CyberZHG/keras-bert - Implementation of BERT that could load official pre-trained models for feature extraction and prediction.


TensorFlow

  1. guotong1988/BERT-tensorflow - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

  2. kimiyoung/transformer-xl - Code repository associated with the Transformer-XL paper.


Chainer

  1. soskek/bert-chainer - Chainer implementation of "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".

Transfer Learning in NLP

As Jay Alammar put it:

The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (It's been referred to as NLP's ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks).

One of the latest milestones in this development is the release of BERT, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks. Soon after the release of the paper describing the model, the team also open-sourced the code of the model, and made available for download versions of the model that were already pre-trained on massive datasets. This is a momentous development since it enables anyone building a machine learning model involving language processing to use this powerhouse as a readily-available component – saving the time, energy, knowledge, and resources that would have gone to training a language-processing model from scratch.

BERT builds on top of a number of clever ideas that have been bubbling up in the NLP community recently – including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al).

ULMFiT: Nailing down Transfer Learning in NLP

ULMFiT introduced methods to effectively utilize a lot of what the model learns during pre-training – more than just embeddings, and more than contextualized embeddings. ULMFiT introduced a language model and a process to effectively fine-tune that language model for various tasks.

NLP finally had a way to do transfer learning probably as well as Computer Vision could.

Other Resources

  1. hanxiao/bert-as-service - Mapping a variable-length sentence to a fixed-length vector using pretrained BERT model.

  2. brightmart/bert_language_understanding - Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN.

  3. algteam/bert-examples - BERT examples.

  4. JayYip/bert-multiple-gpu - A multiple GPU support version of BERT.

  5. HighCWu/keras-bert-tpu - Implementation of BERT that could load official pre-trained models for feature extraction and prediction on TPU.

  6. whqwill/seq2seq-keyphrase-bert - Add BERT to encoder part for

  7. xu-song/bert_as_language_model - BERT as language model, a fork from Google official BERT implementation.

  8. Y1ran/NLP-BERT--Chinese version

  9. yuanxiaosc/Deep_dynamic_word_representation - TensorFlow code and pre-trained models for deep dynamic word representation (DDWR). It combines the BERT model and ELMo's deep context word representation.

  10. yangbisheng2009/cn-bert

  11. Willyoung2017/Bert_Attempt

  12. Pydataman/bert_examples - Some examples of BERT: a classifier based on Google BERT for the Kaggle Quora Insincere Questions Classification challenge, and an NER example written with BERT, based on the first season of the Ruijin Hospital AI contest.

  13. guotong1988/BERT-chinese - Pre-training of deep bidirectional transformers for Chinese language understanding.

  14. zhongyunuestc/bert_multitask - Multi-task.

  15. Microsoft/AzureML-BERT - End-to-end walk through for fine-tuning BERT using Azure Machine Learning.

  16. bigboNed3/bert_serving - Export BERT model for serving.

  17. yoheikikuta/bert-japanese - BERT with SentencePiece for Japanese text.
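
The first item in this list describes mapping a variable-length sentence to a fixed-length vector. One common way to do that, assumed here purely for illustration, is to average the per-token vectors. The sketch below is not bert-as-service's API: `toy_token_vector` is a hypothetical stand-in for the contextual token vectors a pretrained BERT model would produce, and whitespace splitting stands in for WordPiece tokenization.

```python
import hashlib


def toy_token_vector(token, dim=4):
    """Hypothetical stand-in for a BERT token vector.

    Derives `dim` deterministic pseudo-features from a hash of the token;
    `dim` must be at most 32 (the SHA-256 digest length in bytes).
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def sentence_vector(sentence, dim=4):
    """Mean-pool token vectors so any sentence maps to exactly `dim` numbers."""
    tokens = sentence.split()
    if not tokens:
        return [0.0] * dim  # convention chosen here for empty input
    vectors = [toy_token_vector(token, dim) for token in tokens]
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


short = sentence_vector("BERT rocks")
longer = sentence_vector("BERT maps every sentence to one vector")
print(len(short), len(longer))
```

Both calls return vectors of the same length regardless of how many tokens the sentence contains, which is what lets downstream code treat sentences as fixed-size features.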


Tools

  1. jessevig/bertviz - Tool for visualizing BERT's attention.

  2. FastBert - A simple deep learning library that allows developers and data scientists to train and deploy BERT based models for NLP tasks beginning with text classification. The work on FastBert is inspired by


Tasks

Named-Entity Recognition (NER)

  1. kyzhouhzau/BERT-NER - Use Google BERT to do CoNLL-2003 NER.

  2. zhpmatrix/bert-sequence-tagging - Chinese sequence labeling.

  3. JamesGu14/BERT-NER-CLI - Bert NER command line tester with step by step setup guide.

  4. sberbank-ai/ner-bert

  5. mhcao916/NER_Based_on_BERT - A Chinese NER project based on the Google BERT model.

  6. macanv/BERT-BiLSMT-CRF-NER - TensorFlow solution of NER task using Bi-LSTM-CRF model with Google BERT fine-tuning.

  7. ProHiryu/bert-chinese-ner - Use the pre-trained language model BERT to do Chinese NER.

  8. FuYanzhe2/Name-Entity-Recognition - LSTM-CRF, Lattice-CRF, and recent NER-related papers.

  9. king-menin/ner-bert - NER task solution (BERT-Bi-LSTM-CRF) with Google BERT.


Classification

  1. brightmart/sentiment_analysis_fine_grain - Multi-label classification with BERT; Fine Grained Sentiment Analysis from AI challenger.

  2. zhpmatrix/Kaggle-Quora-Insincere-Questions-Classification - Kaggle baseline: fine-tuning BERT and a tensor2tensor-based Transformer encoder solution.

  3. maksna/bert-fine-tuning-for-chinese-multiclass-classification - Fine-tune Google's pre-trained BERT model for Chinese multiclass classification.

  4. NLPScott/bert-Chinese-classification-task - BERT Chinese classification practice.

  5. fooSynaptic/BERT_classifer_trial - BERT trial for Chinese corpus classification.

  6. xiaopingzhong/bert-finetune-for-classfier - Fine-tuning the BERT model while building your own dataset for classification.

  7. Socialbird-AILab/BERT-Classification-Tutorial - Tutorial.

Text Generation

  1. asyml/texar - Toolkit for Text Generation and Beyond. Texar is a general-purpose text generation toolkit; it also implements BERT for classification and, combined with Texar's other modules, for text generation applications.

Question Answering (QA)

  1. matthew-z/R-net - R-net in PyTorch, with BERT and ELMo.

  2. vliu15/BERT - TensorFlow implementation of BERT for QA.

  3. benywon/ChineseBert - This is a Chinese BERT model specific for question answering.

  4. xzp27/BERT-for-Chinese-Question-Answering

Knowledge Graph

  1. sakuranew/BERT-AttributeExtraction - Using BERT for attribute extraction in knowledge graphs. BERT-based fine-tuning and feature extraction methods are used to extract knowledge attributes of Baidu Encyclopedia characters.

  2. lvjianxin/Knowledge-extraction - Chinese knowledge-based extraction. Baseline: bi-LSTM+CRF; upgrade: BERT pre-training.


License

This repository contains a variety of content; some developed by Cedric Chee, and some from third-parties. The third-party content is distributed under the license provided by those parties.

I am providing code and resources in this repository to you under an open source license. Because this is my personal repository, the license you receive to my code and resources is from me and not my employer.

The content developed by Cedric Chee is distributed under the following license:


The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.


The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.



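BERT, short for Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations: a general-purpose "language understanding" model is trained on a large text corpus (such as Wikipedia) and then used for downstream NLP tasks such as machine translation and question answering.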

Transformer networks have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. As a solution, we propose a novel neural architecture, Transformer-XL, that enables Transformer to learn dependency beyond a fixed length without disrupting temporal coherence. Concretely, it consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the problem of context fragmentation. As a result, Transformer-XL learns dependency that is about 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformer during evaluation. Additionally, we improve the state-of-the-art (SoTA) results of bpc/perplexity from 1.06 to 0.99 on enwiki8, from 1.13 to 1.08 on text8, from 20.5 to 18.3 on WikiText-103, from 23.7 to 21.8 on One Billion Word, and from 55.3 to 54.5 on Penn Treebank (without finetuning). Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch.
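
The segment-level recurrence described in this abstract can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it has no learned projections, omits the paper's positional encoding scheme, and has no gradients to stop. The names `causal_attention`, `run_with_memory`, `seg_len`, and `mem_len` are illustrative choices. The sequence is processed one segment at a time, and each layer caches its inputs from earlier segments so that later segments can still attend to them.

```python
import math


def causal_attention(segment, memory):
    """Scaled dot-product self-attention without learned weights.

    Position i of `segment` may attend to all of `memory` plus positions
    0..i of `segment`; vectors act as query, key, and value at once.
    """
    context = memory + segment
    offset = len(memory)
    dim = len(segment[0])
    outputs = []
    for i, query in enumerate(segment):
        visible = context[: offset + i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, visible)) / total
            for d in range(dim)
        ])
    return outputs


def run_with_memory(tokens, seg_len, mem_len, num_layers=2):
    """Process `tokens` segment by segment, reusing cached layer inputs."""
    mems = [[] for _ in range(num_layers)]  # one cache per layer
    outputs = []
    for start in range(0, len(tokens), seg_len):
        hidden = tokens[start:start + seg_len]
        new_mems = []
        for layer in range(num_layers):
            # Cache this layer's inputs (old memory + current segment),
            # keeping only the most recent `mem_len` positions.
            cached = (mems[layer] + hidden)[-mem_len:] if mem_len > 0 else []
            new_mems.append(cached)
            hidden = causal_attention(hidden, mems[layer])
        mems = new_mems
        outputs.extend(hidden)
    return outputs, mems


tokens = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
with_memory, _ = run_with_memory(tokens, seg_len=2, mem_len=6)
no_memory, _ = run_with_memory(tokens, seg_len=2, mem_len=0)
print(with_memory[2], no_memory[2])
```

With `mem_len=0` the first position of every segment can see only itself, which is the context fragmentation the abstract mentions. With a memory at least as long as the sequence, the segmented run matches a single full-length pass in this toy setting.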
