Contextual information is vital in visual understanding problems, such as semantic segmentation and object detection. We propose a Criss-Cross Network (CCNet) for obtaining full-image contextual information in a very effective and efficient way. Concretely, for each pixel, a novel criss-cross attention module harvests the contextual information of all the pixels on its criss-cross path. By taking a further recurrent operation, each pixel can finally capture full-image dependencies. In addition, a category consistent loss is proposed to enforce the criss-cross attention module to produce more discriminative features. Overall, CCNet has the following merits: 1) GPU memory friendly. Compared with the non-local block, the proposed recurrent criss-cross attention module requires 11x less GPU memory. 2) High computational efficiency. The recurrent criss-cross attention reduces the FLOPs of the non-local block by about 85%. 3) State-of-the-art performance. We conduct extensive experiments on the semantic segmentation benchmarks Cityscapes and ADE20K, the human parsing benchmark LIP, the instance segmentation benchmark COCO, and the video segmentation benchmark CamVid. In particular, our CCNet achieves mIoU scores of 81.9%, 45.76%, and 55.47% on the Cityscapes test set, the ADE20K validation set, and the LIP validation set respectively, which are new state-of-the-art results. The source code is available at \url{https://github.com/speedinghzl/CCNet}.
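To make the mechanism concrete, the criss-cross path described above can be sketched as follows. This is a minimal, unoptimized NumPy illustration under my own assumptions (the function name, loop structure, and shapes are illustrative), not the paper's actual CUDA implementation: each pixel attends only to the H + W - 1 pixels sharing its row and column, and applying the module twice (the recurrent operation, R = 2) lets information from any pixel reach any other pixel via one row step and one column step.

```python
import numpy as np

def criss_cross_attention(q, k, v):
    """Illustrative criss-cross attention over feature maps of shape (H, W, C).

    For each pixel (i, j), attention is computed only over its criss-cross
    path: all pixels in row i plus all pixels in column j (H + W - 1 keys),
    rather than over all H * W pixels as in a non-local block.
    """
    H, W, C = q.shape
    out = np.zeros_like(v)
    for i in range(H):
        for j in range(W):
            # Criss-cross path: full row i, plus column j with (i, j) removed
            # so the pixel itself is counted once.
            keys = np.concatenate([k[i, :, :], np.delete(k[:, j, :], i, axis=0)])
            vals = np.concatenate([v[i, :, :], np.delete(v[:, j, :], i, axis=0)])
            scores = keys @ q[i, j]              # (H + W - 1,)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()             # softmax over the path
            out[i, j] = weights @ vals           # weighted sum of path values
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 8))
y = criss_cross_attention(x, x, x)     # one pass: row + column context
y2 = criss_cross_attention(y, y, y)    # recurrent pass: full-image context
print(y2.shape)  # (4, 5, 8)
```

The efficiency claims in the abstract follow from this structure: per pixel, attention covers H + W - 1 positions instead of H * W, so the attention map shrinks from (HW)^2 entries to HW(H + W - 1), which is the source of the reduced GPU memory and FLOPs relative to the non-local block.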