Recently, the vision Transformer (ViT) and its follow-up works abandon convolutions and exploit the self-attention operation, attaining comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both convolution and self-attention, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. MLP-Mixer achieves promising results when trained on an extremely large-scale dataset, but it cannot match the outstanding performance of its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global receptive field and a spatial-specific configuration. However, the global receptive field and the spatial-specific property make the token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our S$^2$-MLP contains only channel-mixing MLPs and utilizes a spatial-shift operation for communication between patches. The spatial-shift operation has a local receptive field and is spatial-agnostic; it is also parameter-free and computationally efficient. The proposed S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when trained on the ImageNet-1K dataset. Meanwhile, S$^2$-MLP achieves performance on par with ViT on ImageNet-1K with a considerably simpler architecture and fewer FLOPs and parameters.
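To make the spatial-shift operation concrete, the sketch below gives a minimal PyTorch rendition of one way such a parameter-free shift can be realized: the channels are split into four groups and each group is shifted by one patch along one of the four directions, so every patch mixes information from its immediate neighbors through the subsequent channel-mixing MLP. The function name `spatial_shift`, the (B, H, W, C) layout, and the keep-at-border handling are our assumptions for illustration, not necessarily the paper's exact formulation.

```python
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Parameter-free spatial shift over a grid of patch features.

    x: (B, H, W, C) tensor of patch embeddings; C is assumed divisible by 4.
    Each quarter of the channels is shifted by one patch in one direction.
    """
    B, H, W, C = x.shape
    g = C // 4
    out = x.clone()  # border patches keep their original features
    out[:, 1:, :, 0 * g:1 * g] = x[:, :-1, :, 0 * g:1 * g]  # shift down along H
    out[:, :-1, :, 1 * g:2 * g] = x[:, 1:, :, 1 * g:2 * g]  # shift up along H
    out[:, :, 1:, 2 * g:3 * g] = x[:, :, :-1, 2 * g:3 * g]  # shift right along W
    out[:, :, :-1, 3 * g:4 * g] = x[:, :, 1:, 3 * g:4 * g]  # shift left along W
    return out

# usage: y = spatial_shift(torch.randn(2, 14, 14, 64))  # 14x14 patches, 64 channels
```

Because the shift itself contains no learnable weights and touches only adjacent patches, it is spatial-agnostic and has a local receptive field, in contrast to the global, spatial-specific token-mixing MLP.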