Cross-lingual document representations enable language understanding in multilingual contexts and allow transfer learning from high-resource to low-resource languages at the document level. Recently, large pre-trained language models such as BERT, XLM, and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks. It is tempting to apply these cross-lingual models to document representation learning. However, there are two challenges: (1) these models impose high computational costs on long documents, and many of them therefore enforce strict length limits; (2) fine-tuning requires extra data and computational resources, which is impractical in resource-limited settings. In this work, we address these challenges by proposing unsupervised Language-Agnostic Weighted Document Representations (LAWDR). We study the geometry of pre-trained sentence embeddings and leverage it to derive document representations without fine-tuning. Evaluated on cross-lingual document alignment, LAWDR demonstrates performance comparable to state-of-the-art models on benchmark datasets.
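To make the high-level idea concrete, the sketch below shows one plausible way to turn pre-trained sentence embeddings into a document vector with a weighted average and a common-direction correction. All specifics here are assumptions for illustration (NumPy, 768-dimensional random stand-in embeddings, uniform weights, removal of the top principal component); the exact weighting and geometric correction used by LAWDR may differ.

```python
import numpy as np

def document_representation(sent_embs, weights=None, common_direction=None):
    """Build one document vector from sentence embeddings.

    A minimal sketch of a weighted-average document representation:
    sentence vectors are averaged with per-sentence weights, and an
    optional shared direction (e.g. the top principal component of a
    collection of sentence embeddings) is projected out. Illustrative
    only; not the exact LAWDR procedure.
    """
    sent_embs = np.asarray(sent_embs, dtype=float)      # (n_sentences, dim)
    if weights is None:
        weights = np.ones(len(sent_embs))                # uniform weights by default
    weights = np.asarray(weights, dtype=float)
    doc = (weights[:, None] * sent_embs).sum(axis=0) / weights.sum()
    if common_direction is not None:                     # remove a shared direction
        u = common_direction / np.linalg.norm(common_direction)
        doc = doc - (doc @ u) * u
    return doc

# Toy usage with random stand-in sentence embeddings; a real pipeline
# would obtain them from a pre-trained encoder such as XLM-RoBERTa.
rng = np.random.default_rng(0)
sents = rng.normal(size=(5, 768))
centered = sents - sents.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)  # vt[0] = top principal direction
doc_vec = document_representation(sents, common_direction=vt[0])
print(doc_vec.shape)  # (768,)
```

Because no fine-tuning is involved, a pipeline of this shape only requires a forward pass of the sentence encoder plus lightweight linear algebra, which is what makes it attractive in resource-limited settings.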