This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing a distinct challenge in multimodal machine learning. Chapter 3 introduces Spatial-Reasoning BERT, which translates text-based spatial relations into 2D arrangements of clip-art objects. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding. Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function that leverages spatial co-occurrences of medical terms to create interpretable mappings, substantially improving the navigability of medical text. Chapter 5 tackles the translation of structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights. Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method that fuses video-frame and object-detection representations, improving recognition robustness and accuracy. Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic the capabilities of multimodal fusion models, reducing computational requirements while maintaining performance. Together, these contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex multimodal inputs across diverse applications.
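To make the distillation idea summarized for Chapter 7 concrete, the snippet below is a minimal sketch, not the thesis implementation: an RGB-only student is trained to mimic a multimodal fusion teacher by combining hard-label cross-entropy with a tempered KL term. The function name, hyperparameter values, and toy tensors standing in for the two networks are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hard-label cross-entropy plus a soft-target KL term.

    The temperature and alpha values here are illustrative defaults,
    not the settings used in the thesis.
    """
    # Ground-truth supervision on the action classes.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: the student matches the teacher's tempered distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd

if __name__ == "__main__":
    # Toy demonstration with random logits: in practice the teacher would be a
    # frozen RGB+object-detection fusion model and the student an RGB-only model.
    batch, num_classes = 8, 10
    teacher_logits = torch.randn(batch, num_classes)            # teacher output, no gradient needed
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    labels = torch.randint(0, num_classes, (batch,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()                                             # gradients flow only to the student
    print(f"distillation loss: {loss.item():.4f}")
```

At inference time only the student is kept, so no object detector or fusion branch is required, which is the source of the computational savings described above.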