We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. This provides temporal appearance cues from prior frames, which are then fused with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes of 1.7% and 2.1% mIoU respectively, while increasing the inference time of ERFNet by only 1.5 ms.
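The core mechanism described above, reading temporal cues from a memory of past-frame features via attention and fusing them with the current frame's encoding, can be illustrated with a minimal sketch. This is not the paper's exact module; the scaled dot-product attention, the concatenation-based fusion, and all shapes and names here are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_read(query, memory_keys, memory_values):
    """Attend over memory entries (features from past frames)
    with scaled dot-product attention; returns one read per query."""
    d = query.shape[-1]
    scores = query @ memory_keys.T / np.sqrt(d)  # (n_query, n_memory)
    weights = softmax(scores, axis=-1)           # rows sum to 1
    return weights @ memory_values               # (n_query, d_value)

# Toy example: 4 query locations in the current frame, a memory of
# 6 feature vectors aggregated from past frames, feature dim 8.
rng = np.random.default_rng(0)
queries = rng.standard_normal((4, 8))   # derived from the current frame
mem_k = rng.standard_normal((6, 8))     # memory keys (past frames)
mem_v = rng.standard_normal((6, 8))     # memory values (past frames)
current = rng.standard_normal((4, 8))   # current-frame encoding

temporal = attention_read(queries, mem_k, mem_v)      # temporal cues
fused = np.concatenate([current, temporal], axis=-1)  # simple fusion
print(fused.shape)  # (4, 16) -- fed to the segmentation decoder
```

In the actual pipeline the fusion is a second attention-based module rather than a plain concatenation, and the fused representation is decoded into per-pixel class predictions.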