Learning Cross-Lingual IR from an English Retriever

We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR is far more accurate than direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.
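To make the two KD objectives and the token alignment step more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the loss functions or the alignment procedure, so the choices here (KL divergence over softmaxed relevance scores, mean squared error between aligned token vectors, greedy cosine-similarity alignment) and all function names are assumptions made for illustration only.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_kd_loss(teacher_scores, student_scores, temperature=1.0):
    """Assumed retrieval objective: KL(teacher || student) over passage scores."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    eps = 1e-12  # guards against log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def align_tokens(teacher_vecs, student_vecs):
    """Assumed alignment: map each teacher (English) token to the index of
    the most cosine-similar student (non-English) token."""
    return [
        max(range(len(student_vecs)), key=lambda j: cosine(t, student_vecs[j]))
        for t in teacher_vecs
    ]


def representation_kd_loss(teacher_vecs, student_vecs, alignment):
    """Assumed representation objective: mean squared distance between each
    teacher token vector and its aligned student token vector."""
    total = 0.0
    for i, j in enumerate(alignment):
        total += sum((a - b) ** 2 for a, b in zip(teacher_vecs[i], student_vecs[j]))
    return total / len(alignment)


# Toy data: two English query tokens, two non-English query tokens.
teacher_vecs = [[1.0, 0.0], [0.0, 1.0]]
student_vecs = [[0.1, 0.9], [0.8, 0.2]]
alignment = align_tokens(teacher_vecs, student_vecs)

rep_loss = representation_kd_loss(teacher_vecs, student_vecs, alignment)
ret_loss = retrieval_kd_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5])
print(alignment, rep_loss, ret_loss)
```

In this toy setup the teacher would see the English translation of the query and the student the original non-English query. A real system would obtain the token vectors and passage scores from neural encoders and would minimize both losses by gradient descent.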

