We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers. Compared to convolutional layers, FC layers are more efficient, better at modeling the long-range dependencies and positional patterns, but worse at capturing the local structures, hence usually less favored for image recognition. We propose a structural re-parameterization technique that adds local prior into an FC to make it powerful for image recognition. Specifically, we construct convolutional layers inside a RepMLP during training and merge them into the FC for inference. On CIFAR, a simple pure-MLP model shows performance very close to CNN. By inserting RepMLP in traditional CNN, we improve ResNets by 1.8% accuracy on ImageNet, 2.9% for face recognition, and 2.3% mIoU on Cityscapes with lower FLOPs. Our intriguing findings highlight that combining the global representational capacity and positional perception of FC with the local prior of convolution can improve the performance of neural network with faster speed on both the tasks with translation invariance (e.g., semantic segmentation) and those with aligned images and positional patterns (e.g., face recognition). The code and models are available at https://github.com/DingXiaoH/RepMLP.

6
下载
关闭预览

相关内容

FC:Financial Cryptography and Data Security。 Explanation:金融密码与数据安全。 Publisher:Springer。 SIT: http://dblp.uni-trier.de/db/conf/fc/

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We also train ResMLP models in a self-supervised setup, to further remove priors from employing a labelled dataset. Finally, by adapting our model to machine translation we achieve surprisingly good results. We share pre-trained models and our code based on the Timm library.

0
0
下载
预览

Vision Transformers (ViT) have recently emerged as a powerful alternative to convolutional networks (CNNs). Although hybrid models attempt to bridge the gap between these two architectures, the self-attention layers they rely on induce a strong computational bottleneck, especially at large spatial resolutions. In this work, we explore the idea of reducing the time spent training these layers by initializing them as convolutional layers. This enables us to transition smoothly from any pre-trained CNN to its functionally identical hybrid model, called Transformed CNN (T-CNN). With only 50 epochs of fine-tuning, the resulting T-CNNs demonstrate significant performance gains over the CNN (+2.2% top-1 on ImageNet-1k for a ResNet50-RS) as well as substantially improved robustness (+11% top-1 on ImageNet-C). We analyze the representations learnt by the T-CNN, providing deeper insights into the fruitful interplay between convolutions and self-attention. Finally, we experiment initializing the T-CNN from a partially trained CNN, and find that it reaches better performance than the corresponding hybrid model trained from scratch, while reducing training time.

0
0
下载
预览

Classical sampling is based on acquiring signal amplitudes at specific points in time, with the minimal sampling rate dictated by the degrees of freedom in the signal. The samplers in this framework are controlled by a global clock that operates at a rate greater than or equal to the minimal sampling rate. At high sampling rates, clocks are power-consuming and prone to electromagnetic interference. An integrate-and-fire time encoding machine (IF-TEM) is an alternative power-efficient sampling mechanism which does not require a global clock. Here, the samples are irregularly spaced threshold-based samples. In this paper, we investigate the problem of sampling finite-rate-of-innovation (FRI) signals using an IF-TEM. We provide theoretical recovery guarantees for an FRI signal with arbitrary pulse shape and without any constraint on the minimum separation between the pulses. In particular, we show how to design a sampling kernel, IF-TEM, and recovery method such that the FRI signals are perfectly reconstructed. We then propose a modification to the sampling kernel to improve noise robustness. Our results enable designing low-cost and energy-efficient analog-to-digital converters for FRI signals.

0
0
下载
预览

Multi-label image recognition is a practical and challenging task compared to single-label image classification. However, previous works may be suboptimal because of a great number of object proposals or complex attentional region generation modules. In this paper, we propose a simple but efficient two-stream framework to recognize multi-category objects from global image to local regions, similar to how human beings perceive objects. To bridge the gap between global and local streams, we propose a multi-class attentional region module which aims to make the number of attentional regions as small as possible and keep the diversity of these regions as high as possible. Our method can efficiently and effectively recognize multi-class objects with an affordable computation cost and a parameter-free region localization module. Over three benchmarks on multi-label image classification, we create new state-of-the-art results with a single model only using image semantics without label dependency. In addition, the effectiveness of the proposed method is extensively demonstrated under different factors such as global pooling strategy, input size and network architecture. Code has been made available at~\url{https://github.com/gaobb/MCAR}.

0
0
下载
预览

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably served as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT have several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can tackle with input images of arbitrary size without interpolation or fine-tune; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operation with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform the recently state-of-the-art backbones by a large margin, demonstrating the potential of ResT as strong backbones. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.

0
0
下载
预览

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perception (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

0
0
下载
预览

Temporal modeling still remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition. The core of our TDN is to devise an efficient temporal module (TDM) by explicitly leveraging a temporal difference operator, and systematically assess its effect on short-term and long-term motion modeling. To fully capture temporal information over the entire video, our TDN is established with a two-level difference modeling paradigm. Specifically, for local motion modeling, temporal difference over consecutive frames is used to supply 2D CNNs with finer motion pattern, while for global motion modeling, temporal difference across segments is incorporated to capture long-range structure for motion feature excitation. TDN provides a simple and principled temporal modeling framework and could be instantiated with the existing CNNs at a small extra computational cost. Our TDN presents a new state of the art on the Something-Something V1 and V2 datasets and is on par with the best performance on the Kinetics-400 dataset. In addition, we conduct in-depth ablation studies and plot the visualization results of our TDN, hopefully providing insightful analysis on temporal difference operation. We release the code at https://github.com/MCG-NJU/TDN.

0
3
下载
预览

Adversarial examples are commonly viewed as a threat to ConvNets. Here we present an opposite perspective: adversarial examples can be used to improve image recognition models if harnessed in the right manner. We propose AdvProp, an enhanced adversarial training scheme which treats adversarial examples as additional examples, to prevent overfitting. Key to our method is the usage of a separate auxiliary batch norm for adversarial examples, as they have different underlying distributions to normal examples. We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger. For instance, by applying AdvProp to the latest EfficientNet-B7 [28] on ImageNet, we achieve significant improvements on ImageNet (+0.7%), ImageNet-C (+6.5%), ImageNet-A (+7.0%), Stylized-ImageNet (+4.8%). With an enhanced EfficientNet-B8, our method achieves the state-of-the-art 85.5% ImageNet top-1 accuracy without extra data. This result even surpasses the best model in [20] which is trained with 3.5B Instagram images (~3000X more than ImageNet) and ~9.4X more parameters. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

0
4
下载
预览

The convolution layer has been the dominant feature extractor in computer vision for years. However, the spatial aggregation in convolution is basically a pattern matching process that applies fixed filters which are inefficient at modeling visual elements with varying spatial distributions. This paper presents a new image feature extractor, called the local relation layer, that adaptively determines aggregation weights based on the compositional relationship of local pixel pairs. With this relational approach, it can composite visual elements into higher-level entities in a more efficient manner that benefits semantic inference. A network built with local relation layers, called the Local Relation Network (LR-Net), is found to provide greater modeling capacity than its counterpart built with regular convolution on large-scale recognition tasks such as ImageNet classification.

0
3
下载
预览

Spatial pyramid pooling module or encode-decoder structure are used in deep neural networks for semantic segmentation task. The former networks are able to encode multi-scale contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple effective fields-of-view, while the latter networks can capture sharper object boundaries by gradually recovering the spatial information. In this work, we propose to combine the advantages from both methods. Specifically, our proposed model, DeepLabv3+, extends DeepLabv3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further explore the Xception model and apply the depthwise separable convolution to both Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. We demonstrate the effectiveness of the proposed model on the PASCAL VOC 2012 semantic image segmentation dataset and achieve a performance of 89% on the test set without any post-processing. Our paper is accompanied with a publicly available reference implementation of the proposed models in Tensorflow.

0
7
下载
预览
小贴士
相关论文
ResMLP: Feedforward networks for image classification with data-efficient training
Hugo Touvron,Piotr Bojanowski,Mathilde Caron,Matthieu Cord,Alaaeldin El-Nouby,Edouard Grave,Gautier Izacard,Armand Joulin,Gabriel Synnaeve,Jakob Verbeek,Hervé Jégou
0+阅读 · 6月10日
Stéphane d'Ascoli,Levent Sagun,Giulio Biroli,Ari Morcos
0+阅读 · 6月10日
Hila Namman,Satish Mulleti,Yonina C. Eldar
0+阅读 · 6月10日
Qinglong Zhang,Yubin Yang
0+阅读 · 6月6日
Enze Xie,Wenhai Wang,Zhiding Yu,Anima Anandkumar,Jose M. Alvarez,Ping Luo
0+阅读 · 6月5日
TDN: Temporal Difference Networks for Efficient Action Recognition
Limin Wang,Zhan Tong,Bin Ji,Gangshan Wu
3+阅读 · 2020年12月18日
Cihang Xie,Mingxing Tan,Boqing Gong,Jiang Wang,Alan Yuille,Quoc V. Le
4+阅读 · 2019年11月21日
Local Relation Networks for Image Recognition
Han Hu,Zheng Zhang,Zhenda Xie,Stephen Lin
3+阅读 · 2019年4月25日
Liang-Chieh Chen,Yukun Zhu,George Papandreou,Florian Schroff,Hartwig Adam
7+阅读 · 2018年2月7日
相关VIP内容
专知会员服务
35+阅读 · 2020年3月19日
[综述]深度学习下的场景文本检测与识别
专知会员服务
32+阅读 · 2019年10月10日
相关资讯
鲁棒机器学习相关文献集
专知
5+阅读 · 2019年8月18日
Hierarchically Structured Meta-learning
CreateAMind
10+阅读 · 2019年5月22日
Deep Compression/Acceleration:模型压缩加速论文汇总
极市平台
11+阅读 · 2019年5月15日
LibRec 精选:基于参数共享的CNN-RNN混合模型
LibRec智能推荐
5+阅读 · 2019年3月7日
Unsupervised Learning via Meta-Learning
CreateAMind
27+阅读 · 2019年1月3日
Hierarchical Disentangled Representations
CreateAMind
3+阅读 · 2018年4月15日
ResNet, AlexNet, VGG, Inception:各种卷积网络架构的理解
全球人工智能
14+阅读 · 2017年12月17日
【推荐】ResNet, AlexNet, VGG, Inception:各种卷积网络架构的理解
机器学习研究会
16+阅读 · 2017年12月17日
Capsule Networks解析
机器学习研究会
10+阅读 · 2017年11月12日
可解释的CNN
CreateAMind
11+阅读 · 2017年10月5日
Top