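Research Progress on Pre-trained Language Models, as Seen Through 200+ Top-Conference Papers
About the author: Xiaolei Wang (王晓磊) is a first-year PhD student at the Gaoling School of Artificial Intelligence, Renmin University of China, advised by Professor Xin Zhao (赵鑫). Research interests: dialogue systems and pre-trained models.
Introduction: In recent years, large-scale pre-trained language models (PLMs), represented by BERT and the GPT series, have achieved great success across all areas of NLP. This article collects PLM-related papers published since BERT and GPT first appeared. It selects representative works by citation count, together with works published in 2021 at major top conferences (ACL, EMNLP, ICLR, ICML, NeurIPS, etc.), for a total of 285 papers. These are organized into 6 major categories and 22 subcategories: surveys, benchmarks, PLM design, PLM analysis, efficient PLMs, and using PLMs.
The paper list has also been published on GitHub and will be updated continuously; follows and stars are welcome.
Wherever possible, each paper is followed by links to its PDF, code implementation, and project homepage, so that readers can explore the work further.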
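Contents
- Surveys
- Benchmarks
- PLM Design
  - General Design
  - Knowledge Enhancement
  - Multilingual
  - Multimodal
  - Information Retrieval
  - Code
  - Others
- PLM Analysis
  - Knowledge
  - Robustness
  - Sparsity
  - Others
- Efficient PLMs
  - Model Training
  - Model Inference
  - Model Compression
- Using PLMs
  - Two-Stage Fine-Tuning
  - Multi-Task Fine-Tuning
  - Adapter
  - Prompt
  - Others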
Surveys
- "Pre-trained models for natural language processing: A survey".
Science China Technological Sciences(2020)
[PDF] - "Which *BERT? A Survey Organizing Contextualized Encoders".
EMNLP(2020)
[PDF] - "A Primer in BERTology: What We Know About How BERT Works".
TACL(2020)
[PDF] - "From static to dynamic word representations: a survey".
International Journal of Machine Learning and Cybernetics(2020)
[PDF] - "Overview of the Transformer-based Models for NLP Tasks".
2020 15th Conference on Computer Science and Information Systems (FedCSIS)
[PDF] - "A Survey on Contextual Embeddings".
arXiv(2020)
[PDF] - "The NLP Cookbook: Modern Recipes for Transformer Based Deep Learning Architectures".
IEEE Access(2021)
[PDF] - "Pre-Trained Models: Past, Present and Future".
arXiv(2021)
[PDF] - "Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing".
arXiv(2021)
[PDF] - "AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing".
arXiv(2021)
[PDF] - "On the Opportunities and Risks of Foundation Models".
arXiv(2021)
[PDF] - "Paradigm Shift in Natural Language Processing".
arXiv(2021)
[PDF] - "Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey".
arXiv(2021)
[PDF]
Benchmarks
- XNLI: "XNLI: Evaluating Cross-lingual Sentence Representations".
EMNLP(2018)
[PDF] [Dataset] - GLUE: "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding".
ICLR(2019)
[Homepage] - SuperGLUE: "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems".
NeurIPS(2019)
[Homepage] - CLUE: "CLUE: A Chinese Language Understanding Evaluation Benchmark".
COLING(2020)
[Homepage] - XTREME: "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization".
ICML(2020)
[Homepage] - XGLUE: "XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation".
EMNLP(2020)
[Homepage] - DialoGLUE: "DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue".
arXiv(2020)
[Homepage]
PLM Design
General Design
- GPT: "Improving Language Understanding by Generative Pre-Training".
OpenAI(2018)
[Project] - GPT-2: "Language Models are Unsupervised Multitask Learners".
OpenAI(2019)
[Project] - BERT: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
NAACL(2019)
[PDF] [Code] - XLNet: "XLNet: Generalized Autoregressive Pretraining for Language Understanding".
NeurIPS(2019)
[PDF] [Code] - SBERT: "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks".
EMNLP(2019)
[PDF] [Code] - UniLM: "Unified Language Model Pre-training for Natural Language Understanding and Generation".
NeurIPS(2019)
[PDF] [Code] - MASS: "MASS: Masked Sequence to Sequence Pre-training for Language Generation".
ICML(2019)
[PDF] [Code] - Chinese-BERT-wwm: "Pre-Training with Whole Word Masking for Chinese BERT".
arXiv(2019)
[PDF] [Code] - "Cloze-driven Pretraining of Self-attention Networks".
EMNLP(2019)
[PDF] - "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model".
Workshop on Methods for Optimizing and Evaluating Neural Language Generation(2019)
[PDF] [Code] - GPT-3: "Language Models are Few-Shot Learners".
NeurIPS(2020)
[PDF] [Code] - T5: "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer".
JMLR(2020)
[PDF] [Code] - BART: "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension".
ACL(2020)
[PDF] [Code] - Poly-encoders: "Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring".
ICLR(2020)
[PDF] - SpanBERT: "SpanBERT: Improving Pre-training by Representing and Predicting Spans".
TACL(2020)
[PDF] [Code] - ERNIE 2.0: "ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding".
AAAI(2020)
[PDF] [Code] - SemBERT: "Semantics-Aware BERT for Language Understanding".
AAAI(2020)
[PDF] [Code] - "Leveraging Pre-trained Checkpoints for Sequence Generation Tasks".
TACL(2020)
[PDF] [Code] - ProphetNet: "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training".
EMNLP(2020)
[PDF] - UniLMv2: "UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training".
ICML(2020)
[PDF] [Code] - MacBERT: "Revisiting Pre-Trained Models for Chinese Natural Language Processing".
EMNLP(2020)
[PDF] [Code] - MPNet: "MPNet: Masked and Permuted Pre-training for Language Understanding".
arXiv(2020)
[PDF] [Code] - DeBERTa: "DeBERTa: Decoding-enhanced BERT with Disentangled Attention".
ICLR(2021)
[PDF] [Code] - PALM: "PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation".
EMNLP(2020)
[PDF] - Optimus: "Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space".
EMNLP(2020)
[PDF] [Code] - "Self-training Improves Pre-training for Natural Language Understanding".
NAACL(2021)
[PDF] [Code] - CAPT: "Rethinking Denoised Auto-Encoding in Language Pre-Training".
EMNLP(2021)
[PDF] - "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling".
EMNLP(2021)
[PDF] [Code] - "Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models".
ACL(2021)
[PDF] [Code] - ERNIE-Doc: "ERNIE-Doc: A Retrospective Long-Document Modeling Transformer".
ACL(2021)
[PDF] [Code] - "Pre-training Universal Language Representation".
ACL(2021)
[PDF] [Code]
Knowledge Enhancement
- ERNIE(Baidu): "ERNIE: Enhanced Representation through Knowledge Integration".
arXiv(2019)
[PDF] [Code] - KnowBert: "Knowledge Enhanced Contextual Word Representations".
EMNLP(2019)
[PDF] - ERNIE(Tsinghua): "ERNIE: Enhanced Language Representation with Informative Entities".
ACL(2019)
[PDF] [Code] - COMET: "COMET: Commonsense Transformers for Automatic Knowledge Graph Construction".
ACL(2019)
[PDF] [Code] - K-BERT: "K-BERT: Enabling Language Representation with Knowledge Graph".
AAAI(2020)
[PDF] [Code] - WKLM: "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model".
ICLR(2020)
[PDF] - LUKE: "LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention".
EMNLP(2020)
[PDF] [Code] - K-Adapter: "K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters".
ICLR(2021)
[PDF] - KEPLER: "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation".
TACL(2021)
[PDF] [Code] - RuleBERT: "RuleBERT: Teaching Soft Rules to Pre-Trained Language Models".
EMNLP(2021)
[PDF] [Code] - BeliefBank: "BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief".
EMNLP(2021)
[PDF] [Code] - Phrase-BERT: "Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration".
EMNLP(2021)
[PDF] [Code] - "Syntax-Enhanced Pre-trained Model".
ACL(2021)
[PDF] [Code] - StructFormer: "StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling".
ACL(2021)
[PDF] - ERICA: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".
ACL(2021)
[PDF] [Code] - "Structural Guidance for Transformer Language Models".
ACL(2021)
[PDF] [Code] - HORNET: "HORNET: Enriching Pre-trained Language Representations with Heterogeneous Knowledge Sources".
CIKM(2021)
[PDF] - "Drop Redundant, Shrink Irrelevant: Selective Knowledge Injection for Language Pretraining".
IJCAI(2021)
[PDF]
Multilingual
- XLM: "Cross-lingual Language Model Pretraining".
arXiv(2019)
[PDF] [Code] - "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond".
TACL(2019)
[PDF] [Code] - UDify: "75 Languages, 1 Model: Parsing Universal Dependencies Universally".
EMNLP(2019)
[PDF] [Code] - Unicoder: "Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks".
EMNLP(2019)
[PDF] - XLM-R: "Unsupervised Cross-lingual Representation Learning at Scale".
ACL(2020)
[PDF] - "Multilingual Alignment of Contextual Word Representations".
ICLR(2020)
[PDF] - mBART: "Multilingual Denoising Pre-training for Neural Machine Translation".
TACL(2020)
[PDF] [Code] - mT5: "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer".
NAACL(2021)
[PDF] [Code] - InfoXLM: "InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training".
NAACL(2021)
[PDF] [Code] - "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training".
EMNLP(2021)
[PDF] [Code] - ERNIE-M: "ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora".
EMNLP(2021)
[PDF] [Code] - "A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders".
EMNLP(2021)
[PDF] - "Boosting Cross-Lingual Transfer via Self-Learning with Uncertainty Estimation".
EMNLP(2021)
[PDF] - "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models".
ACL(2021)
[PDF] [Code] - "Multilingual Pre-training with Universal Dependency Learning".
NeurIPS(2021)
[PDF]
Multimodal
- ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks".
NeurIPS(2019)
[PDF] - LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers".
EMNLP(2019)
[PDF] [Code] - VideoBERT: "VideoBERT: A Joint Model for Video and Language Representation Learning".
ICCV(2019)
[PDF] - VisualBERT: "VisualBERT: A Simple and Performant Baseline for Vision and Language".
arXiv(2019)
[PDF] - B2T2: "Fusion of Detected Objects in Text for Visual Question Answering".
EMNLP(2019)
[PDF] [Code] - VL-BERT: "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".
ICLR(2020)
[PDF] [Code] - Unicoder-VL: "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training".
AAAI(2020)
[PDF] - VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA".
AAAI(2020)
[PDF] [Code] - UNITER: "UNITER: UNiversal Image-TExt Representation Learning".
ECCV(2020)
[PDF] [Code] - Oscar: "Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks".
ECCV(2020)
[PDF] [Code] - "12-in-1: Multi-Task Vision and Language Representation Learning".
CVPR(2020)
[PDF] [Code] - ActBERT: "ActBERT: Learning Global-Local Video-Text Representations".
CVPR(2020)
[PDF] - VLN: "Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks".
CVPR(2020)
[PDF] - VILLA: "Large-Scale Adversarial Training for Vision-and-Language Representation Learning".
arXiv(2020)
[PDF] [Code] - ImageBERT: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data".
arXiv(2020)
[PDF] - ALIGN: "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision".
ICML(2021)
[PDF] - ClipBERT: "Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling".
CVPR(2021)
[PDF] [Code] - DALL·E: "Zero-Shot Text-to-Image Generation".
arXiv(2021)
[PDF] [Code] - CLIP: "Learning Transferable Visual Models From Natural Language Supervision".
arXiv(2021)
[PDF] [Code] - IPT: "Pre-Trained Image Processing Transformer".
CVPR(2021)
[PDF] [Code] - CvT: "CvT: Introducing Convolutions to Vision Transformers".
ICCV(2021)
[PDF] [Code] - TERA: "TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech".
TASLP(2021)
[PDF] [Code] - CaiT: "Going deeper with Image Transformers".
ICCV(2021)
[PDF] [Code] - ViViT: "ViViT: A Video Vision Transformer".
ICCV(2021)
[PDF] [Code] - VirTex: "VirTex: Learning Visual Representations From Textual Annotations".
CVPR(2021)
[PDF] [Code] - M6: "M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining".
KDD(2021)
[PDF] - "Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training".
NeurIPS(2021)
[PDF] - GilBERT: "GilBERT: Generative Vision-Language Pre-Training for Modality-Incomplete Visual-Linguistic Tasks".
SIGIR(2021)
[PDF]
Information Retrieval
- ORQA: "Latent Retrieval for Weakly Supervised Open Domain Question Answering".
ACL(2019)
[PDF] - REALM: "REALM: Retrieval-Augmented Language Model Pre-Training".
arXiv(2020)
[PDF] - RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks".
NeurIPS(2020)
[PDF] [Code] - DPR: "Dense Passage Retrieval for Open-Domain Question Answering".
EMNLP(2020)
[PDF] [Code] - "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering".
EACL(2021)
[PDF] [Code]
Code
- CodeT5: "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation".
EMNLP(2021)
[PDF] [Code] - Codex: "Evaluating Large Language Models Trained on Code".
arXiv(2021)
[PDF] [Code]
Others
- ReasonBERT: "ReasonBERT: Pre-trained to Reason with Distant Supervision".
EMNLP(2021)
[PDF] [Code] - "Sentence Bottleneck Autoencoders from Transformer Language Models".
EMNLP(2021)
[PDF] [Code] - "Numeracy enhances the Literacy of Language Models".
EMNLP(2021)
[PDF] [Code] - EnsLM: "EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering".
ACL(2021)
[PDF] [Code] - "Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models".
ACL(2021)
[PDF] [Code] - BERTAC: "BERTAC: Enhancing Transformer-based Language Models with Adversarially Pretrained Convolutional Neural Networks".
ACL(2021)
[PDF] [Code] - "Natural Language Understanding with Privacy-Preserving BERT".
CIKM(2021)
[PDF] - BANG: "BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining".
ICML(2021)
[PDF] [Code]
PLM Analysis
Knowledge
- "What Does BERT Look at? An Analysis of BERT’s Attention".
BlackboxNLP(2019)
[PDF] [Code] - "BERT Rediscovers the Classical NLP Pipeline".
ACL(2019)
[PDF] - "How Multilingual is Multilingual BERT?".
ACL(2019)
[PDF] - "A Structural Probe for Finding Syntax in Word Representations".
NAACL(2019)
[PDF] [Code] - "Language Models as Knowledge Bases?".
EMNLP(2019)
[PDF] [Code] - "What Does BERT Learn about the Structure of Language?".
ACL(2019)
[PDF] [Code] - "Linguistic Knowledge and Transferability of Contextual Representations".
NAACL(2019)
[PDF] - "Assessing BERT's Syntactic Abilities".
arXiv(2019)
[PDF] [Code] - "Probing Neural Network Comprehension of Natural Language Arguments".
ACL(2019)
[PDF] - "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings".
EMNLP(2019)
[PDF] - "Visualizing and Measuring the Geometry of BERT".
NeurIPS(2019)
[PDF] - "Designing and Interpreting Probes with Control Tasks".
EMNLP(2019)
[PDF] - "Open Sesame: Getting inside BERT’s Linguistic Knowledge".
BlackboxNLP(2019)
[PDF] [Code] - "What do you learn from context? Probing for sentence structure in contextualized word representations".
ICLR(2019)
[PDF] [Code] - "Commonsense Knowledge Mining from Pretrained Models".
EMNLP(2019)
[PDF] - "Do NLP Models Know Numbers? Probing Numeracy in Embeddings".
EMNLP(2019)
[PDF] - "On the Cross-lingual Transferability of Monolingual Representations".
ACL(2020)
[PDF] - "Cross-Lingual Ability of Multilingual BERT: An Empirical Study".
ICLR(2020)
[PDF] [Code] - "What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models".
TACL(2020)
[PDF] [Code] - "How Much Knowledge Can You Pack Into the Parameters of a Language Model?".
EMNLP(2020)
[PDF] [Code] - "How Can We Know What Language Models Know?".
TACL(2020)
[PDF] [Code] - "oLMpics-On What Language Model Pre-training Captures".
TACL(2020)
[PDF] [Code] - "Information-Theoretic Probing with Minimum Description Length".
EMNLP(2020)
[PDF] [Code] - "Inducing Relational Knowledge from BERT".
AAAI(2020)
[PDF] - AutoPrompt: "AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts".
EMNLP(2020)
[PDF] [Code] - "Emergent linguistic structure in artificial neural networks trained by self-supervision".
PNAS(2020)
[PDF] - "Evaluating Commonsense in Pre-Trained Language Models".
AAAI(2020)
[PDF] [Code] - "Editing Factual Knowledge in Language Models".
EMNLP(2021)
[PDF] [Code] - "How much pretraining data do language models need to learn syntax?".
EMNLP(2021)
[PDF] - "Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?".
EMNLP(2021)
[PDF] [Code] - "Putting Words in BERT's Mouth: Navigating Contextualized Vector Spaces with Pseudowords".
EMNLP(2021)
[PDF] [Code] - "Frequency Effects on Syntactic Rule Learning in Transformers".
EMNLP(2021)
[PDF] [Code] - "Exploring the Role of BERT Token Representations to Explain Sentence Probing Results".
EMNLP(2021)
[PDF] [Code] - "How is BERT surprised? Layerwise detection of linguistic anomalies".
ACL(2021)
[PDF] [Code] - "Implicit Representations of Meaning in Neural Language Models".
ACL(2021)
[PDF] [Code] - "Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases".
ACL(2021)
[PDF] [Code]
Robustness
- "Universal Adversarial Triggers for Attacking and Analyzing NLP".
EMNLP(2019)
[PDF] [Code] - "Pretrained Transformers Improve Out-of-Distribution Robustness".
ACL(2020)
[PDF] [Code] - BERT-ATTACK: "BERT-ATTACK: Adversarial Attack Against BERT Using BERT".
EMNLP(2020)
[PDF] [Code] - "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment".
AAAI(2020)
[PDF] [Code] - "The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers".
EMNLP(2021)
[PDF] [Code] - "Sorting through the noise: Testing robustness of information processing in pre-trained language models".
EMNLP(2021)
[PDF] [Code]
Sparsity
- "Are Sixteen Heads Really Better than One?".
NeurIPS(2019)
[PDF] [Code] - "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned".
ACL(2019)
[PDF] [Code] - "Revealing the Dark Secrets of BERT".
EMNLP(2019)
[PDF] - "The Lottery Ticket Hypothesis for Pre-trained BERT Networks".
NeurIPS(2020)
[PDF] [Code] - "When BERT Plays the Lottery, All Tickets Are Winning".
EMNLP(2020)
[PDF] [Code]
Others
- "Scaling Laws for Neural Language Models".
arXiv(2020)
[PDF] - "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?".
FAccT(2021)
[PDF] - "Extracting Training Data from Large Language Models".
USENIX Security(2021)
[PDF] [Code] - "Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little".
EMNLP(2021)
[PDF] [Code] - "Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent".
EMNLP(2021)
[PDF] [Code] - "Discretized Integrated Gradients for Explaining Language Models".
EMNLP(2021)
[PDF] [Code] - "Do Long-Range Language Models Actually Use Long-Range Context?".
EMNLP(2021)
[PDF] - "Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right".
EMNLP(2021)
[PDF] [Code] - "Incorporating Residual and Normalization Layers into Analysis of Masked Language Models".
EMNLP(2021)
[PDF] [Code] - "Sequence Length is a Domain: Length-based Overfitting in Transformer Models".
EMNLP(2021)
[PDF] - "Are Pretrained Convolutions Better than Pretrained Transformers?".
ACL(2021)
[PDF] - "Positional Artefacts Propagate Through Masked Language Model Embeddings".
ACL(2021)
[PDF] - "When Do You Need Billions of Words of Pretraining Data?".
ACL(2021)
[PDF] [Code] - "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?".
ACL(2021)
[PDF] [Code] - "Examining the Inductive Bias of Neural Language Models with Artificial Languages".
ACL(2021)
[PDF] [Code] - "Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning".
NeurIPS(2021)
[PDF]
Efficient PLMs
Model Training
- RoBERTa: "RoBERTa: A Robustly Optimized BERT Pretraining Approach".
arXiv(2019)
[PDF] [Code] - "Efficient Training of BERT by Progressively Stacking".
ICML(2019)
[PDF] [Code] - Megatron-LM: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism".
arXiv(2019)
[PDF] [Code] - ELECTRA: "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators".
ICLR(2020)
[PDF] [Code] - "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes".
ICLR(2020)
[PDF] [Code] - GShard: "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding".
arXiv(2020)
[PDF] - Admin: "Understanding the Difficulty of Training Transformers".
EMNLP(2020)
[PDF] [Code] - ZeRO: "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models".
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis
[PDF] [Code] - Switch Transformers: "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity".
arXiv(2021)
[PDF] [Code] - "How to Train BERT with an Academic Budget".
EMNLP(2021)
[PDF] - "Optimizing Deeper Transformers on Small Datasets".
ACL(2021)
[PDF] [Code] - "EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets".
ACL(2021)
[PDF] [Code]
Model Inference
- "BERT Loses Patience: Fast and Robust Inference with Early Exit".
NeurIPS(2020)
[PDF] [Code] - GAML-BERT: "GAML-BERT: Improving BERT Early Exiting by Gradient Aligned Mutual Learning".
EMNLP(2021)
[PDF] - "Efficient Nearest Neighbor Language Models".
EMNLP(2021)
[PDF] [Code] - GhostBERT: "GhostBERT: Generate More Features with Cheap Operations for BERT".
ACL(2021)
[PDF] [Code] - LeeBERT: "LeeBERT: Learned Early Exit for BERT with cross-level optimization".
ACL(2021)
[PDF] - "Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search".
ACL(2021)
[PDF] [Code] - "Distilling Knowledge from BERT into Simple Fully Connected Neural Networks for Efficient Vertical Retrieval".
CIKM(2021)
[PDF]
Model Compression
- DistilBERT: "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".
arXiv(2019)
[PDF] [Code] - PKD: "Patient Knowledge Distillation for BERT Model Compression".
EMNLP(2019)
[PDF] [Code] - "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks".
arXiv(2019)
[PDF] - Q8BERT: "Q8BERT: Quantized 8Bit BERT".
5th Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS 2019
[PDF] - ALBERT: "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations".
ICLR(2020)
[PDF] [Code] - TinyBERT: "TinyBERT: Distilling BERT for Natural Language Understanding".
EMNLP(2020)
[PDF] [Code] - Layerdrop: "Reducing Transformer Depth on Demand with Structured Dropout".
ICLR(2020)
[PDF] [Code] - Q-BERT: "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT".
AAAI(2020)
[PDF] - MobileBERT: "MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices".
ACL(2020)
[PDF] [Code] - "Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning".
5th Workshop on Representation Learning for NLP(2020)
[PDF] [Code] - MiniLM: "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers".
arXiv(2020)
[PDF] [Code] - FastBERT: "FastBERT: a Self-distilling BERT with Adaptive Inference Time".
ACL(2020)
[PDF] [Code] - DeeBERT: "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference".
ACL(2020)
[PDF] [Code] - "Compressing Large-Scale Transformer-Based Models: A Case Study on BERT".
TACL(2021)
[PDF] - "Winning the Lottery with Continuous Sparsification".
NeurIPS(2020)
[PDF] [Code] - SqueezeBERT: "SqueezeBERT: What can computer vision teach NLP about efficient neural networks?".
SustaiNLP(2020)
[PDF] - Audio ALBERT: "Audio Albert: A Lite Bert for Self-Supervised Learning of Audio Representation".
SLT(2021)
[PDF] [Code] - T2R: "Finetuning Pretrained Transformers into RNNs".
EMNLP(2021)
[PDF] [Code] - "Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression".
EMNLP(2021)
[PDF] [Code] - Meta-KD: "Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains".
ACL(2021)
[PDF] [Code] - "Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization".
ACL(2021)
[PDF] [Code] - BinaryBERT: "BinaryBERT: Pushing the Limit of BERT Quantization".
ACL(2021)
[PDF] [Code] - AutoTinyBERT: "AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models".
ACL(2021)
[PDF] [Code] - "Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation".
ACL(2021)
[PDF] [Code] - "Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators".
ACL(2021)
[PDF] [Code] - NAS-BERT: "NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search".
KDD(2021)
[PDF]
Using PLMs
Two-Stage Fine-Tuning
- "Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks".
arXiv(2018)
[PDF] [Code] - "How to Fine-Tune BERT for Text Classification?".
CCL(2019)
[PDF] - "Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks".
ACL(2020)
[PDF] [Code] - "Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?".
ACL(2020)
[PDF] - "What to Pre-Train on? Efficient Intermediate Task Selection".
EMNLP(2021)
[PDF] [Code] - "On the Influence of Masking Policies in Intermediate Pre-training".
EMNLP(2021)
[PDF] - TADPOLE: "TADPOLE: Task ADapted Pre-Training via AnOmaLy DEtection".
EMNLP(2021)
[PDF]
Multi-Task Fine-Tuning
- MT-DNN: "Multi-Task Deep Neural Networks for Natural Language Understanding".
ACL(2019)
[PDF] [Code] - "BAM! Born-Again Multi-Task Networks for Natural Language Understanding".
ACL(2019)
[PDF] [Code] - "Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding".
arXiv(2019)
[PDF] [Code] - GradTS: "GradTS: A Gradient-Based Automatic Auxiliary Task Selection Method Based on Transformer Networks".
EMNLP(2021)
[PDF] - "What's in Your Head? Emergent Behaviour in Multi-Task Transformer Models".
EMNLP(2021)
[PDF] - MTAdam: "MTAdam: Automatic Balancing of Multiple Training Loss Terms".
EMNLP(2021)
[PDF] - Muppet: "Muppet: Massive Multi-task Representations with Pre-Finetuning".
EMNLP(2021)
[PDF] - "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders".
EMNLP(2021)
[PDF] [Code] - BERTGen: "BERTGen: Multi-task Generation through BERT".
ACL(2021)
[PDF] [Code] - "Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks".
ACL(2021)
[PDF] [Code]
Adapter
- "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning".
ICML(2019)
[PDF] [Code] - Adapter: "Parameter-Efficient Transfer Learning for NLP".
ICML(2019)
[PDF] [Code] - AdapterDrop: "AdapterDrop: On the Efficiency of Adapters in Transformers".
EMNLP(2021)
[PDF] - "On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation".
ACL(2021)
[PDF] - "Learning to Generate Task-Specific Adapters from Task Description".
ACL(2021)
[PDF] [Code]
Prompt
- PET: "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference".
EACL(2021)
[PDF] [Code] - "It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners".
NAACL(2021)
[PDF] [Code] - "Prefix-Tuning: Optimizing Continuous Prompts for Generation".
arXiv(2021)
[PDF] - LM-BFF: "Making Pre-trained Language Models Better Few-shot Learners".
ACL(2021)
[PDF] [Code] - "What Makes Good In-Context Examples for GPT-3?".
arXiv(2021)
[PDF] [Code] - "The Power of Scale for Parameter-Efficient Prompt Tuning".
EMNLP(2021)
[PDF] [Code] - "Finetuned Language Models Are Zero-Shot Learners".
arXiv(2021)
[PDF] - "Calibrate Before Use: Improving Few-shot Performance of Language Models".
ICML(2021)
[PDF] [Code] - TransPrompt: "TransPrompt: Towards an Automatic Transferable Prompting Framework for Few-shot Text Classification".
EMNLP(2021)
[PDF] [Code] - SFLM: "Revisiting Self-training for Few-shot Learning of Language Model".
EMNLP(2021)
[PDF] [Code] - ADAPET: "Improving and Simplifying Pattern Exploiting Training".
EMNLP(2021)
[PDF] [Code]
Others
- "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks".
RepL4NLP(2019)
[PDF] - "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models".
NAACL(2019)
[PDF] [Code] - "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping".
arXiv(2020)
[PDF] - SMART: "SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization".
EMNLP(2020)
[PDF] [Code] - "Revisiting Few-sample BERT Fine-tuning".
ICLR(2021)
[PDF] - Mirror-BERT: "Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders".
EMNLP(2021)
[PDF] [Code] - "Pre-train or Annotate? Domain Adaptation with a Constrained Budget".
EMNLP(2021)
[PDF] [Code] - AVocaDo: "AVocaDo: Strategy for Adapting Vocabulary to Downstream Domain".
EMNLP(2021)
[PDF] - CHILD-TUNING: "Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning".
EMNLP(2021)
[PDF] [Code] - "Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation".
ACL(2021)
[PDF] [Code] - LexFit: "LexFit: Lexical Fine-Tuning of Pretrained Language Models".
ACL(2021)
[PDF] [Code] - "Selecting Informative Contexts Improves Language Model Fine-tuning".
ACL(2021)
[PDF] [Code] - "An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models".
ACL(2021)
[PDF] [Code] - "How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?".
NeurIPS(2021)
[PDF] [Code]
Edited on 2021-11-26 10:55