We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related mutual information maximization methods that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods, helping to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
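For concreteness, one standard lower bound of this kind (stated here for illustration; the symbols $A$, $B$, $f_\theta$, and $N$ are not taken from the abstract itself) is the InfoNCE bound: given $N$ paired samples $(a_i, b_i)$ drawn from two parts $A$ and $B$ of a sentence and a learned scoring function $f_\theta$,
\[
I(A; B) \;\ge\; \log N + \mathbb{E}\!\left[\log \frac{\exp f_\theta(a_i, b_i)}{\sum_{j=1}^{N} \exp f_\theta(a_i, b_j)}\right],
\]
so maximizing the softmax-style objective on the right-hand side tightens a lower bound on the mutual information between the two parts, which is the sense in which the objectives discussed here bound mutual information.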