Despite the abundance of multi-modal data, such as image-text pairs, there has been little effort to understand the individual entities and the different roles they play in the construction of these data instances. In this work, we endeavour to discover the entities and their corresponding importance in cooking recipes automatically, framed as a visual-linguistic association problem. More specifically, we introduce a novel cross-modal learning framework that jointly models the latent representations of images and text in food image-recipe association and retrieval tasks. This model allows one to discover complex functional and hierarchical relationships between images and text, and among the textual parts of a recipe, including the title, ingredients and cooking instructions. Our experiments show that by using an efficient tree-structured Long Short-Term Memory (Tree-LSTM) network as the text encoder in our cross-modal retrieval framework, we are not only able to identify the main ingredients and cooking actions in recipe descriptions without explicit supervision, but we can also learn more meaningful feature representations of food recipes, suitable for challenging cross-modal retrieval and recipe adaptation tasks.
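To make the two core components of such a framework concrete, the sketch below shows (i) a Child-Sum Tree-LSTM cell in the style of Tai et al. (2015), a standard choice of tree-structured text encoder, and (ii) a bidirectional triplet ranking loss commonly used for cross-modal image-text retrieval. This is a minimal illustration under assumptions, not the paper's actual implementation; all class and variable names, dimensions, and the margin value are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChildSumTreeLSTMCell(nn.Module):
    """One node update of a Child-Sum Tree-LSTM (Tai et al., 2015)."""
    def __init__(self, in_dim: int, mem_dim: int):
        super().__init__()
        # Input, output, and update gates computed jointly for efficiency.
        self.W_iou = nn.Linear(in_dim, 3 * mem_dim)
        self.U_iou = nn.Linear(mem_dim, 3 * mem_dim, bias=False)
        # One forget gate per child, sharing the same parameters.
        self.W_f = nn.Linear(in_dim, mem_dim)
        self.U_f = nn.Linear(mem_dim, mem_dim, bias=False)

    def forward(self, x, child_h, child_c):
        # x: (in_dim,); child_h, child_c: (num_children, mem_dim).
        # Leaves pass zero tensors of shape (1, mem_dim).
        h_tilde = child_h.sum(dim=0)                  # sum of children hidden states
        i, o, u = (self.W_iou(x) + self.U_iou(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x) + self.U_f(child_h))  # per-child forget gates
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.3):
    """Hinge ranking loss over a batch of matched image/recipe embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                               # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)                     # similarities of true pairs
    cost_txt = (margin + sim - pos).clamp(min=0)      # image anchor, recipe negatives
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # recipe anchor, image negatives
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(eye, 0).mean() + cost_img.masked_fill(eye, 0).mean()

# Toy usage: compose two leaf words under one parent node of a recipe tree.
cell = ChildSumTreeLSTMCell(in_dim=300, mem_dim=512)
zeros = torch.zeros(1, 512)
h1, c1 = cell(torch.randn(300), zeros, zeros)   # leaf, e.g. "chop"
h2, c2 = cell(torch.randn(300), zeros, zeros)   # leaf, e.g. "onions"
h, c = cell(torch.randn(300), torch.stack([h1, h2]), torch.stack([c1, c2]))
```

Because each node's forget gate is conditioned on an individual child's hidden state, the cell can learn to retain some children (e.g. a main ingredient) while down-weighting others, which is one plausible mechanism for the unsupervised identification of salient ingredients and actions described above.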