Context modeling is crucial for visual recognition: by integrating both intrinsic and extrinsic relationships between objects and labels, it yields highly discriminative image representations. A limitation of current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded: salient anchors are selected at the finer scale, and their neighborhood features are dynamically fused via attention. The resulting cross-scale modeling combines multi-order and cross-scale context-aware features and thereby substantially improves complex scene understanding. Extensive multi-label classification experiments on the NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks show that PanCAN consistently outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
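The abstract does not spell out the aggregation mechanics, so the following is a minimal sketch of one plausible reading of the per-scale, multi-order step: k-th order neighborhood contexts are taken as powers of a row-normalized affinity (random-walk transition) matrix and fused with attention weights learned over orders. The class name MultiOrderContext, the residual connection, and all hyperparameters are hypothetical; the cross-scale cascading, salient-anchor selection, and Hilbert-space mapping described above are not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiOrderContext(nn.Module):
    """Hypothetical sketch of multi-order neighborhood aggregation at one
    scale: powers of a random-walk transition matrix give k-th order
    neighborhoods, fused with learned attention weights over orders.
    This is an illustration of the idea, not the authors' implementation."""

    def __init__(self, dim: int, num_orders: int = 3):
        super().__init__()
        self.num_orders = num_orders
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        # one learnable attention logit per neighborhood order (assumed)
        self.order_logits = nn.Parameter(torch.zeros(num_orders))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) region features at a single scale
        # pairwise affinity, row-normalized into a random-walk transition matrix
        sim = self.query(x) @ self.key(x).t() / x.size(-1) ** 0.5
        P = F.softmax(sim, dim=-1)            # first-order transition
        contexts, Pk = [], P
        for _ in range(self.num_orders):
            contexts.append(Pk @ x)           # k-th order neighborhood context
            Pk = Pk @ P                       # advance the random walk one step
        w = F.softmax(self.order_logits, 0)   # attention over orders
        ctx = sum(wi * c for wi, c in zip(w, contexts))
        return x + ctx                        # residual context aggregation

# Usage: 7x7 feature map flattened into 49 region features of width 256
module = MultiOrderContext(dim=256, num_orders=3)
features = torch.randn(49, 256)
out = module(features)                        # (49, 256) context-enhanced
```

Under this reading, cascading copies of such a module across scales, with attention restricted to anchors selected at the finer scale, would yield the cross-scale aggregation the abstract describes.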