Causality knowledge is vital to building robust AI systems. Deep learning models often perform poorly on tasks that require causal reasoning, which typically relies on commonsense knowledge that is not immediately available in the input but is implicitly inferred by humans. Prior work has exposed the spurious observational biases that models fall prey to in the absence of causal knowledge. While language representation models preserve contextual knowledge within their learned embeddings, they do not factor in causal relationships during training. Blending causal relationships into the input features of an existing model that performs visual cognition tasks (such as scene understanding, video captioning, and video question-answering) can improve performance, owing to the insight that causal relationships provide. Several recent models have tackled the task of mining causal knowledge from either the visual or the textual modality; however, little research mines causal relationships by juxtaposing the two. While images offer a rich and easy-to-process resource for mining causality knowledge, videos are denser and consist of naturally time-ordered events, and textual information supplies details that may be only implicit in videos. We propose iReason, a framework that infers visual-semantic commonsense knowledge using both videos and natural language captions. Furthermore, iReason's architecture integrates a causal rationalization module to aid interpretability, error analysis, and bias detection. We demonstrate the effectiveness of iReason through a two-pronged comparative analysis against language representation models (BERT, GPT-2) as well as current state-of-the-art multimodal causality models.
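To make the "blending causal relationships into the input features" idea concrete, here is a minimal sketch of one common fusion pattern: embedding mined causal relations and concatenating them with pooled visual features before a downstream task head (e.g., video question-answering). This is not iReason's actual architecture; the module, vocabulary, and dimension names are illustrative assumptions.

```python
# Minimal sketch: fusing causal-relation embeddings with visual features.
# All names and sizes are hypothetical, not taken from the iReason paper.
import torch
import torch.nn as nn

class CausalFeatureFusion(nn.Module):
    def __init__(self, visual_dim=2048, causal_vocab=10000,
                 causal_dim=256, hidden_dim=512, num_answers=1000):
        super().__init__()
        # Embeds IDs of mined causal relations (hypothetical vocabulary).
        self.causal_emb = nn.Embedding(causal_vocab, causal_dim)
        # Projects the concatenated [visual ; causal] feature vector.
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + causal_dim, hidden_dim),
            nn.ReLU(),
        )
        # Downstream head, e.g. classification over a fixed answer set.
        self.head = nn.Linear(hidden_dim, num_answers)

    def forward(self, visual_feats, causal_ids):
        # visual_feats: (batch, visual_dim) pooled video features
        # causal_ids:   (batch, num_relations) IDs of relevant causal relations
        causal = self.causal_emb(causal_ids).mean(dim=1)  # pool relations
        fused = self.fuse(torch.cat([visual_feats, causal], dim=-1))
        return self.head(fused)

# Usage with random tensors:
model = CausalFeatureFusion()
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 8)))
print(logits.shape)  # torch.Size([4, 1000])
```

Mean-pooling the relation embeddings is the simplest choice; an attention mechanism over relations would be a natural alternative when some causal facts matter more than others for a given query.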

