Text-based image retrieval has seen considerable progress in recent years. However, existing methods degrade in real-world use because users are likely to provide an incomplete description of a complex scene, which often yields results filled with false positives that match the incomplete query. In this work, we introduce the partial-query problem and extensively analyze its influence on text-based image retrieval. We then propose an interactive retrieval framework called Part2Whole to tackle this problem by iteratively enriching the missing details. Specifically, an Interactive Retrieval Agent is trained to learn an optimal policy for refining the initial query based on user-friendly interaction and the statistical characteristics of the gallery. Compared to other dialog-based methods that rely heavily on the user to feed back differentiating information, we let the AI take over the search for optimal feedback and prompt the user with confirmation-based questions about missing details. Furthermore, since fully supervised training is often infeasible due to the difficulty of obtaining human-machine dialog data, we present a weakly supervised reinforcement learning method that requires no human-annotated data beyond the text-image dataset. Experiments show that our framework significantly improves the performance of text-based image retrieval in complex scenes.