Recognizing an activity with a single reference sample using metric learning approaches is a promising research field. The majority of few-shot methods focus on object recognition or face identification. We propose a metric learning approach that reduces the action recognition problem to a nearest neighbor search in embedding space. We encode signals as images, extract features using a deep residual CNN, and learn a feature embedding with a triplet loss. The resulting encoder maps features into an embedding space in which smaller distances encode similar actions and larger distances encode different actions. Our approach is based on a signal-level formulation and remains flexible across a variety of modalities. It outperforms the baseline on the large-scale NTU RGB+D 120 dataset for the One-Shot action recognition protocol by 5.6%. With just 60% of the training data, our approach still outperforms the baseline by 3.7%; with 40% of the training data, it performs comparably to the second-best follow-up approach. Further, we show that our approach generalizes well in experiments on the UTD-MHAD dataset for inertial, skeleton, and fused data, and on the Simitate dataset for motion capture data. Our inter-joint and inter-sensor experiments additionally suggest good capabilities on previously unseen setups.
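The core idea above, learning an embedding with a triplet loss and then classifying a query by nearest-neighbor search against single reference samples, can be sketched as follows. This is a minimal illustration assuming precomputed embedding vectors; the function names, toy 2-D embeddings, and class labels are hypothetical and not taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the anchor toward the positive sample
    and push it away from the negative sample by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def one_shot_classify(query_emb, reference_embs, reference_labels):
    """One-shot classification: assign the label of the nearest
    reference embedding (one reference sample per action class)."""
    dists = np.linalg.norm(reference_embs - query_emb, axis=1)
    return reference_labels[int(np.argmin(dists))]

# Toy example: one reference embedding per action class (illustrative only).
refs = np.array([[0.0, 1.0], [1.0, 0.0]])
labels = ["wave", "kick"]
query = np.array([0.1, 0.9])
print(one_shot_classify(query, refs, labels))  # → wave
```

In training, the triplet loss shapes the embedding space so that this simple nearest-neighbor rule becomes an effective classifier, which is what allows recognition from a single reference sample per class.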