Humans possess a unique social cognition capability: nonverbal communication conveys rich social information among agents. In contrast, such crucial social characteristics are largely missing from the existing scene understanding literature. In this paper, we incorporate different nonverbal communication cues (e.g., gaze, human poses, and gestures) to represent, model, learn, and infer agents' mental states from purely visual inputs. Crucially, this mental representation accounts for each agent's beliefs: it represents the true world state while also inferring the beliefs held in each agent's mind, which may differ from that true state. By aggregating the agents' beliefs with the true world state, our model forms "five minds" during the interaction between two agents. This "five minds" model differs from prior work that infers beliefs through infinite recursion; instead, the agents' beliefs converge into a "common mind". Based on this representation, we further devise a hierarchical energy-based model that jointly tracks and predicts all five minds. From this new perspective, a social event is interpreted as a sequence of nonverbal communication and belief dynamics, which transcends the classic keyframe video summary. In our experiments, we demonstrate that this social account yields better summaries of videos with rich social interactions than state-of-the-art keyframe-based video summarization methods.