Scene parsing from images is a fundamental yet challenging problem in visual content understanding. In this dense prediction task, the parsing model assigns a categorical label to every pixel, which requires contextual information from adjacent image patches. The challenge of this learning task is therefore to describe the geometric and semantic properties of objects or a scene simultaneously. In this paper, we explore the effective use of the multi-layer feature outputs of deep parsing networks to enforce spatial-semantic consistency, designing a novel feature aggregation module that generates an appropriate global representation prior and thereby improves the discriminative power of the features. The proposed module automatically selects among the intermediate visual features to correlate spatial and semantic information. At the same time, its multiple skip connections provide strong supervision, making the deep parsing network easy to train. Extensive experiments on four public scene parsing datasets show that a deep parsing network equipped with the proposed feature aggregation module achieves very promising results.
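The abstract does not specify the module's internals, so the following is a minimal PyTorch sketch of one plausible reading: multi-level backbone features are projected to a common width, re-weighted by learned channel gates (standing in for the "auto-select" mechanism), and summed skip-connection style into a single global prior. `FeatureAggregationModule`, the gating design, and all shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAggregationModule(nn.Module):
    """Hypothetical sketch: gate multi-level features with learned
    channel weights, then fuse them into one global representation."""

    def __init__(self, in_channels, out_channels):
        # in_channels: list of channel counts, one per backbone stage
        super().__init__()
        # 1x1 convs project every stage to a common channel width
        self.project = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # per-stage channel gates computed from globally pooled features
        # (an assumed stand-in for the paper's auto-selection)
        self.gate = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(out_channels, out_channels, kernel_size=1),
                nn.Sigmoid(),
            )
            for _ in in_channels
        )
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, features):
        # features: list of stage outputs, finest first, varying spatial sizes
        target_size = features[0].shape[-2:]  # upsample all to the finest map
        fused = 0
        for feat, proj, gate in zip(features, self.project, self.gate):
            x = proj(feat)
            x = x * gate(x)  # channel-wise re-weighting ("auto-selection")
            x = F.interpolate(x, size=target_size,
                              mode="bilinear", align_corners=False)
            fused = fused + x  # skip-connection-style summation across stages
        return self.fuse(fused)


# usage: aggregate four ResNet-like stage outputs into one 256-channel prior
if __name__ == "__main__":
    stages = [torch.randn(1, c, s, s)
              for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)]]
    module = FeatureAggregationModule([256, 512, 1024, 2048], 256)
    print(module(stages).shape)  # torch.Size([1, 256, 64, 64])
```

Summation rather than concatenation is chosen here only to keep the fused width fixed regardless of how many stages are tapped; the actual module may combine features differently.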