Deep language models that learn hierarchical representations have proved to be powerful tools for natural language processing, text mining, and information retrieval. However, representations that perform well for retrieval must capture semantic meaning at different levels of abstraction, or context scopes. In this paper, we propose a new method for generating multi-resolution word embeddings that represent documents at multiple resolutions in terms of context scope. To investigate its performance, we use the Stanford Question Answering Dataset (SQuAD) and the Question Answering by Search And Reading (QUASAR) dataset in an open-domain question-answering setting, where the first task is to find documents useful for answering a given question. To this end, we first compare the retrieval quality of various text-embedding methods and give an extensive empirical comparison of several non-augmented base embeddings with and without multi-resolution representations. We argue that multi-resolution word embeddings are consistently superior to their original counterparts, and that deep residual neural models trained specifically for retrieval can yield further significant gains when used to augment those embeddings.
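The abstract does not spell out the construction, but one minimal sketch of a multi-resolution document representation is to pool word vectors at several context scopes and concatenate the pooled views; the window sizes, pooling operators, and function name below are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def multi_resolution_embedding(word_vecs, scopes=(1, 4, 16)):
    """Illustrative sketch (not the paper's exact construction):
    mean-pool word vectors over non-overlapping windows at each
    context scope, max-pool across windows, then concatenate the
    per-scope views into a single document vector."""
    word_vecs = np.asarray(word_vecs, dtype=float)  # shape: (n_words, dim)
    views = []
    for w in scopes:
        # mean over each window of width w captures local context at scale w
        windows = [word_vecs[i:i + w].mean(axis=0)
                   for i in range(0, len(word_vecs), w)]
        # max across windows yields one fixed-size view per scope
        views.append(np.max(windows, axis=0))
    # final representation: one block per context scope
    return np.concatenate(views)  # shape: (len(scopes) * dim,)
```

A retrieval system could then score query-document pairs by cosine similarity over these concatenated vectors, so that both fine-grained (word-level) and coarse (passage-level) matches contribute.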