Action recognition is key for applications ranging from robotics to healthcare monitoring. Action information can be extracted from body pose and movements, as well as from the background scene. However, the extent to which deep neural networks (DNNs) use information about the body versus information about the background remains unclear. Since these two sources of information may be correlated within a training dataset, DNNs might learn to rely predominantly on one of them without taking full advantage of the other. Unlike DNNs, humans have domain-specific brain regions selective for perceiving bodies and regions selective for perceiving scenes. The present work tests whether humans are thus more effective at extracting information from both body and background, and whether building brain-inspired deep network architectures with separate domain-specific streams for body and scene perception endows them with more human-like performance. We first demonstrate that DNNs trained on the HAA500 dataset perform almost as accurately on versions of the stimuli from which the body was removed as on versions that show both body and background, but perform at chance level on versions from which the background was removed. In contrast, human participants (N=28) recognize the same set of actions accurately in all three versions of the stimuli, and perform significantly better on stimuli that show only the body than on stimuli that show only the background. Finally, we implement and test a novel architecture, patterned after domain specificity in the brain, with separate streams to process body and background information. We show that (1) this architecture improves action recognition performance, and (2) its accuracy across the different versions of the stimuli follows a pattern that more closely matches the pattern of accuracy observed in human participants.
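To make the idea of separate body and background streams concrete, the following is a minimal sketch of one possible way to combine two streams: each stream scores the action classes independently, and the class probabilities are averaged (late fusion). The abstract does not state how the streams are combined, so the fusion rule, the function names `fuse_streams` and `predict`, and the example logits are all illustrative assumptions rather than the authors' method.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(body_logits, background_logits):
    """Average the class probabilities of the two streams.

    Assumption: simple late fusion by averaging. The source does not
    specify the fusion rule used in the proposed architecture.
    """
    if len(body_logits) != len(background_logits):
        raise ValueError("both streams must score the same set of classes")
    body_probs = softmax(body_logits)
    background_probs = softmax(background_logits)
    return [(b + s) / 2 for b, s in zip(body_probs, background_probs)]


def predict(body_logits, background_logits):
    """Return the index of the class with the highest fused probability."""
    fused = fuse_streams(body_logits, background_logits)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical logits for three action classes.
# Background removed: the background stream is uninformative (flat scores),
# so the fused prediction follows the body stream.
body_only_prediction = predict([0.2, 2.5, 0.1], [0.0, 0.0, 0.0])

# Body removed: the body stream is uninformative, so the fused prediction
# follows the background stream.
background_only_prediction = predict([0.0, 0.0, 0.0], [3.0, 0.1, 0.2])
```

Under this assumed fusion rule, an uninformative stream contributes a flat distribution and leaves the other stream's ranking unchanged, which is one way a two-stream model could stay above chance when either the body or the background is missing.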