Most polyp segmentation methods use CNNs as their backbone, leading to two key issues when exchanging information between the encoder and decoder: 1) how to account for the differing contributions of different-level features and 2) how to design an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three standard modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM collects semantic and location information of polyps from high-level features; the CIM captures polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area, equipped with high-level semantic position information, to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (\emph{e.g.}, appearance changes, small objects, rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
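To make the described data flow concrete, below is a minimal PyTorch sketch of the encoder-decoder pipeline: a multi-scale backbone standing in for the PVT encoder, a CFM fusing the three high-level maps, a CIM filtering the low-level map, and a SAM combining the two into a prediction. All module internals here (StubPVTEncoder, CFM, CIM, SAM, PolypPVTSketch) are hypothetical placeholders illustrating only the wiring; the authoritative implementation is in the repository linked above.

```python
# Structural sketch of the Polyp-PVT pipeline; module internals are
# simplified placeholders, not the paper's actual components.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StubPVTEncoder(nn.Module):
    """Placeholder backbone emitting four feature maps at strides
    4/8/16/32, mimicking the multi-scale outputs of a PVT encoder."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for i, d in enumerate(dims):
            stride = 4 if i == 0 else 2
            self.stages.append(nn.Conv2d(in_ch, d, 3, stride=stride, padding=1))
            in_ch = d

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [f1 (low-level), f2, f3, f4 (high-level)]


class CFM(nn.Module):
    """Cascaded fusion (placeholder): merges the three high-level maps
    into a coarse map carrying polyp semantics and location."""
    def __init__(self, dims=(128, 320, 512), out_ch=64):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(d, out_ch, 1) for d in dims])
        self.fuse = nn.Conv2d(out_ch * 3, out_ch, 3, padding=1)

    def forward(self, f2, f3, f4):
        target = f2.shape[2:]
        maps = [F.interpolate(r(f), size=target, mode="bilinear",
                              align_corners=False)
                for r, f in zip(self.reduce, (f2, f3, f4))]
        return self.fuse(torch.cat(maps, dim=1))


class CIM(nn.Module):
    """Camouflage identification (placeholder): attention over the
    low-level feature to expose weakly visible polyp cues."""
    def __init__(self, ch=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())

    def forward(self, f1):
        return f1 * self.attn(f1)


class SAM(nn.Module):
    """Similarity aggregation (placeholder): spreads the high-level
    semantic map over the low-level feature to cover the full polyp."""
    def __init__(self, ch=64):
        super().__init__()
        self.fuse = nn.Conv2d(ch * 2, ch, 3, padding=1)
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.head(self.fuse(torch.cat([low, high], dim=1)))


class PolypPVTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = StubPVTEncoder()
        self.cfm, self.cim, self.sam = CFM(), CIM(64), SAM(64)

    def forward(self, x):
        f1, f2, f3, f4 = self.encoder(x)
        high = self.cfm(f2, f3, f4)   # semantic + location cues
        low = self.cim(f1)            # de-camouflaged low-level cues
        logits = self.sam(low, high)  # cross-level fusion
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = PolypPVTSketch()
    print(model(torch.randn(1, 3, 352, 352)).shape)  # -> (1, 1, 352, 352)
```

The sketch mirrors the division of labor stated above: high-level features supply "where the polyp is" (CFM), low-level features supply "what the boundary looks like" once camouflage is suppressed (CIM), and the SAM ties the two together before upsampling to a full-resolution mask.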