Self-Supervised Learning is a new paradigm that sits between unsupervised and supervised learning, and aims to reduce the challenging demand for large amounts of annotated data. It defines annotation-free pretext tasks that provide surrogate supervision signals for feature learning. jason718 maintains a curated collection of recent self-supervised learning papers that is well worth a look!
Link: https://github.com/jason718/awesome-self-supervised-learning
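To make the notion of a pretext task concrete, here is a minimal sketch (PyTorch; the toy backbone and random data are placeholders) of rotation prediction in the spirit of "Unsupervised Representation Learning by Predicting Image Rotations" listed below: the supervision comes for free from rotating each image.

```python
# Minimal sketch of a rotation-prediction pretext task; `encoder` stands in
# for any backbone that maps images to feature vectors.
import torch
import torch.nn as nn

def make_rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation index is the label."""
    rotated, labels = [], []
    for k in range(4):
        rotated.append(torch.rot90(images, k, dims=(2, 3)))  # NCHW spatial rotation
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

encoder = nn.Sequential(  # toy backbone; swap in any CNN
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
head = nn.Linear(16, 4)          # predicts which rotation was applied
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)           # stand-in for unlabeled images
x, y = make_rotation_batch(images)
loss = loss_fn(head(encoder(x)), y)          # proxy supervision, no human labels
opt.zero_grad(); loss.backward(); opt.step()
```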
A curated list of awesome Self-Supervised Learning resources. Inspired by awesome-deep-vision, awesome-adversarial-machine-learning, awesome-deep-learning-papers, and awesome-architecture-search.
Self-Supervised Learning has become an exciting direction in the AI community.
Please help contribute to this list by contacting me or adding a pull request.
Markdown format:
- Paper Name.
[[pdf]](link)
[[code]](link)
- Author 1, Author 2, and Author 3. *Conference Year*
FAIR Self-Supervision Benchmark [repo]: various benchmark (and legacy) tasks for evaluating the quality of visual representations learned by various self-supervision approaches.
Unsupervised Visual Representation Learning by Context Prediction. [pdf] [code]
Unsupervised Learning of Visual Representations using Videos. [pdf] [code]
Learning to See by Moving. [pdf] [code]
Learning image representations tied to ego-motion. [pdf] [code]
Joint Unsupervised Learning of Deep Representations and Image Clusters. [pdf] [code-torch] [code-caffe]
Unsupervised Deep Embedding for Clustering Analysis. [pdf] [code]
Slow and steady feature analysis: higher order temporal coherence in video. [pdf]
Context Encoders: Feature Learning by Inpainting. [pdf] [code]
Colorful Image Colorization. [pdf] [code]
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. [pdf] [code]
Ambient Sound Provides Supervision for Visual Learning. [pdf] [code]
Learning Representations for Automatic Colorization. [pdf] [code]
Unsupervised Visual Representation Learning by Graph-based Consistent Constraints. [pdf] [code]
Adversarial Feature Learning. [pdf] [code]
Self-supervised learning of visual features through embedding images into text topic spaces. [pdf] [code]
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. [pdf] [code]
Learning Features by Watching Objects Move. [pdf] [code]
Colorization as a Proxy Task for Visual Understanding. [pdf] [code]
DeepPermNet: Visual Permutation Learning. [pdf] [code]
Unsupervised Learning by Predicting Noise. [pdf] [code]
Multi-task Self-Supervised Visual Learning. [pdf]
Representation Learning by Learning to Count. [pdf]
Transitive Invariance for Self-supervised Visual Representation Learning. [pdf]
Look, Listen and Learn. [pdf]
Unsupervised Representation Learning by Sorting Sequences. [pdf] [code]
Unsupervised Feature Learning via Non-parametric Instance Discrimination. [pdf] [code]
Learning Image Representations by Completing Damaged Jigsaw Puzzles. [pdf]
Unsupervised Representation Learning by Predicting Image Rotations. [pdf] [code]
Learning Latent Representations in Neural Networks for Clustering through Pseudo Supervision and Graph-based Activity Regularization. [pdf] [code]
Improvements to context based self-supervised learning. [pdf]
Self-Supervised Feature Learning by Learning to Spot Artifacts. [pdf] [code]
Boosting Self-Supervised Learning via Knowledge Transfer. [pdf]
Cross-domain Self-supervised Multi-task Feature Learning Using Synthetic Imagery. [pdf] [code]
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids. [pdf]
Deep Clustering for Unsupervised Learning of Visual Features. [pdf]
Cross Pixel Optical-Flow Similarity for Self-Supervised Learning. [pdf]
Representation Learning with Contrastive Predictive Coding. [pdf]
Self-Supervised Learning via Conditional Motion Propagation. [pdf] [code]
Self-Supervised Representation Learning by Rotation Feature Decoupling. [pdf] [code]
Revisiting Self-Supervised Visual Representation Learning. [pdf] [code]
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data. [pdf] [code]
Unsupervised Deep Learning by Neighbourhood Discovery. [pdf] [code]
Contrastive Multiview Coding. [pdf] [code]
Large Scale Adversarial Representation Learning. [pdf]
Learning Representations by Maximizing Mutual Information Across Views. [pdf] [code]
Selfie: Self-supervised Pretraining for Image Embedding. [pdf]
Data-Efficient Image Recognition with Contrastive Predictive Coding. [pdf]
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty. [pdf] [code]
Boosting Few-Shot Visual Learning with Self-Supervision. [pdf]
Self-Supervised Generalisation with Meta Auxiliary Learning. [pdf] [code]
Wasserstein Dependency Measure for Representation Learning. [pdf] [code]
Scaling and Benchmarking Self-Supervised Visual Representation Learning. [pdf] [code]
A critical analysis of self-supervision, or what we can learn from a single image. [pdf] [code]
On Mutual Information Maximization for Representation Learning. [pdf] [code]
Understanding the Limitations of Variational Mutual Information Estimators. [pdf] [code]
Automatic Shortcut Removal for Self-Supervised Representation Learning. [pdf]
Momentum Contrast for Unsupervised Visual Representation Learning. [pdf]
A Simple Framework for Contrastive Learning of Visual Representations. [pdf]
ClusterFit: Improving Generalization of Visual Representations. [pdf]
Self-Supervised Learning of Pretext-Invariant Representations. [pdf]
Unsupervised Learning of Video Representations using LSTMs. [pdf] [code]
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. [pdf] [code]
LSTM Self-Supervision for Detailed Behavior Analysis. [pdf]
Self-Supervised Video Representation Learning With Odd-One-Out Networks. [pdf]
Unsupervised Learning of Long-Term Motion Dynamics for Videos. [pdf]
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning. [pdf]
Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning. [pdf]
Self-supervised learning of a facial attribute embedding from video. [pdf]
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles. [pdf]
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics. [pdf]
DynamoNet: Dynamic Action and Motion Network. [pdf]
Learning Correspondence from the Cycle-consistency of Time. [pdf] [code]
Joint-task Self-supervised Learning for Temporal Correspondence. [pdf] [code]
Self-supervised Learning of Motion Capture. [pdf] [code] [web]
Unsupervised Learning of Depth and Ego-Motion from Video. [pdf] [code] [web]
ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems. [project]
Self-Supervised Relative Depth Learning for Urban Scene Understanding. [pdf] [project]
Geometry-Aware Learning of Maps for Camera Localization. [pdf] [code]
Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection. [pdf] [web]
Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry. [pdf]
SelFlow: Self-Supervised Learning of Optical Flow. [pdf]
Unsupervised Learning of Landmarks by Descriptor Vector Exchange. [pdf] [code] [web]
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. [pdf] [code]
Objects that Sound. [pdf]
Learning to Separate Object Sounds by Watching Unlabeled Video. [pdf] [project]
The Sound of Pixels. [pdf] [project]
Learnable PINs: Cross-Modal Embeddings for Person Identity. [pdf] [web]
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. [pdf]
Self-Supervised Generation of Spatial Audio for 360° Video. [pdf]
TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision. [pdf]
Self-taught Learning: Transfer Learning from Unlabeled Data. [pdf]
Representation Learning: A Review and New Perspectives. [pdf]
Curiosity-driven Exploration by Self-supervised Prediction. [pdf] [code]
Large-Scale Study of Curiosity-Driven Learning. [pdf]
Playing hard exploration games by watching YouTube. [pdf]
Unsupervised State Representation Learning in Atari. [pdf] [code]
Improving Robot Navigation Through Self-Supervised Online Learning. [pdf]
Reverse Optical Flow for Self-Supervised Adaptive Autonomous Robot Navigation. [pdf]
Online self-supervised learning for dynamic object segmentation. [pdf]
Self-Supervised Online Learning of Basic Object Push Affordances. [pdf]
Self-supervised learning of grasp dependent tool affordances on the iCub Humanoid robot. [pdf]
Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance. [pdf]
The Curious Robot: Learning Visual Representations via Physical Interactions. [pdf]
Learning to Poke by Poking: Experiential Learning of Intuitive Physics. [pdf]
Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. [pdf]
Supervision via Competition: Robot Adversaries for Learning Tasks. [pdf]
Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge. [pdf] [Project]
Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation. [pdf] [Project]
Learning to Fly by Crashing. [pdf]
Self-supervised learning as an enabling technology for future space exploration robots: ISS experiments on monocular distance learning. [pdf]
Unsupervised Perceptual Rewards for Imitation Learning. [pdf] [project]
Self-Supervised Visual Planning with Temporal Skip Connections. [pdf]
CASSL: Curriculum Accelerated Self-Supervised Learning. [pdf]
Time-Contrastive Networks: Self-Supervised Learning from Video. [pdf] [Project]
Self-Supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation. [pdf]
Learning Actionable Representations from Visual Observations. [pdf] [Project]
Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning. [pdf] [Project]
Visual Reinforcement Learning with Imagined Goals. [pdf] [Project]
Grasp2Vec: Learning Object Representations from Self-Supervised Grasping. [pdf] [Project]
Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning. [pdf] [Project]
Learning Long-Range Perception Using Self-Supervision from Short-Range Sensors and Odometry. [pdf]
Learning Latent Plans from Play. [pdf] [Project]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. [pdf] [link]
Self-Supervised Dialogue Learning. [pdf]
Self-Supervised Learning for Contextualized Extractive Summarization. [pdf]
A Mutual Information Maximization Perspective of Language Representation Learning. [pdf]
VL-BERT: Pre-training of Generic Visual-Linguistic Representations. [pdf] [code]
Learning Robust and Multilingual Speech Representations. [pdf]
Unsupervised pretraining transfers well across languages. [pdf] [code]
wav2vec: Unsupervised Pre-Training for Speech Recognition. [pdf] [code]
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations. [pdf]
Effectiveness of self-supervised pre-training for speech recognition. [pdf]
Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning. [pdf]
Self-Training for End-to-End Speech Recognition. [pdf]
Generative Pre-Training for Speech with Autoregressive Predictive Coding. [pdf] [code]
We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero- or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Second, we introduce the concept of loss function evolution, using an evolutionary search algorithm to automatically find an optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Third, we propose an unsupervised representation evaluation metric that uses distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces results similar to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.
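As a rough illustration of the loss-evolution idea in this abstract, below is a toy sketch of evolutionary search over loss-function weightings. The `evaluate` fitness function is a stand-in for the paper's Zipf-prior distribution-matching metric, and all names are hypothetical.

```python
# Toy sketch of evolving a weighted combination of self-supervised losses.
# `evaluate(weights)` is a placeholder fitness (lower is better); the real
# method trains networks and scores them with a Zipf-prior matching metric.
import numpy as np

rng = np.random.default_rng(0)
n_losses = 4                           # e.g. rotation, CPC, colorization, distillation

def evaluate(weights):                 # stand-in fitness function
    target = np.array([0.5, 0.2, 0.2, 0.1])
    return float(np.sum((weights - target) ** 2))

pop = rng.dirichlet(np.ones(n_losses), size=16)    # population of loss weightings
for gen in range(50):
    fitness = np.array([evaluate(w) for w in pop])
    parents = pop[np.argsort(fitness)[:4]]         # keep the best weightings
    children = parents[rng.integers(4, size=12)]
    children = np.abs(children + rng.normal(0, 0.05, children.shape))  # mutate
    children /= children.sum(axis=1, keepdims=True)                    # renormalize
    pop = np.vstack([parents, children])

best = pop[np.argmin([evaluate(w) for w in pop])]
print("best loss weighting:", np.round(best, 3))
```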
Few-shot image classification aims to classify unseen classes with limited labeled samples. Recent works benefit from the meta-learning process with episodic tasks and can adapt quickly from training classes to testing classes. Due to the limited number of samples for each task, the initial embedding network for meta-learning becomes an essential component and can largely affect performance in practice. To this end, many pre-training methods have been proposed, but most of them are trained in a supervised way with limited transfer ability to unseen classes. In this paper, we propose to train a more generalized embedding network with self-supervised learning (SSL), which can provide slow and robust representations for downstream tasks by learning from the data itself. We evaluate our work through extensive comparisons with previous baseline methods on two few-shot classification datasets (i.e., MiniImageNet and CUB). Based on the evaluation results, the proposed method achieves significantly better performance, improving 1-shot and 5-shot tasks by nearly 3% and 4% on MiniImageNet, and by nearly 9% and 3% on CUB. Moreover, the proposed method gains a further improvement of (15%, 13%) on MiniImageNet and (15%, 8%) on CUB by pre-training with more unlabeled data. Our code will be available at https://github.com/phecy/SSL-FEW-SHOT.
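To illustrate the episodic evaluation setting this abstract refers to, here is a hedged sketch of a 5-way 1-shot episode classified by nearest centroid on top of a frozen embedding; `embed` is a placeholder for the SSL-pretrained network, and the paper's exact protocol may differ.

```python
# Sketch: evaluate a frozen (e.g. SSL-pretrained) embedding on a 5-way 1-shot
# episode with a nearest-centroid classifier. Data here is random noise.
import numpy as np

rng = np.random.default_rng(1)
def embed(x):                       # placeholder for the pretrained network
    return x                        # identity; real code would run the CNN

n_way, k_shot, n_query, dim = 5, 1, 15, 64
support = rng.normal(size=(n_way, k_shot, dim))     # labeled samples per class
query = rng.normal(size=(n_way, n_query, dim))      # samples to classify

centroids = embed(support).mean(axis=1)             # one prototype per class
q = embed(query).reshape(-1, dim)
dists = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=-1)
pred = dists.argmin(axis=1)
truth = np.repeat(np.arange(n_way), n_query)
print("episode accuracy:", (pred == truth).mean())
```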
Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge of task identity, and we explore scenarios in which there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d. setting, or when adapting the technique to supervised tasks such as incremental class learning.
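The "dynamically expand to capture new concepts" idea can be sketched with a toy prototype set that grows whenever no existing prototype explains a sample. CURL itself performs this expansion inside a latent-variable model, so everything below (threshold, update rule) is a simplified assumption.

```python
# Toy sketch of dynamic expansion on a non-stationary stream: keep a set of
# prototypes and add a new one when the nearest prototype is too far away.
import numpy as np

rng = np.random.default_rng(2)
prototypes = [rng.normal(size=8)]        # start with one concept
THRESH = 5.0                             # expansion threshold (hand-tuned)

for step in range(200):
    # non-stationary stream: the mean shifts abruptly halfway through
    mean = np.zeros(8) if step < 100 else np.full(8, 5.0)
    x = rng.normal(loc=mean)
    d = [np.linalg.norm(x - p) for p in prototypes]
    if min(d) > THRESH:
        prototypes.append(x.copy())      # new concept: expand the model
    else:
        i = int(np.argmin(d))
        prototypes[i] += 0.1 * (x - prototypes[i])   # refine the best match

print("concepts discovered:", len(prototypes))
```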
This work tackles the problem of semi-supervised learning of image classifiers. Our main insight is that the field of semi-supervised learning can benefit from the quickly advancing field of self-supervised visual representation learning. Unifying these two approaches, we propose the framework of self-supervised semi-supervised learning (S4L) and use it to derive two novel semi-supervised image classification methods. We demonstrate the effectiveness of these methods in comparison to both carefully tuned baselines and existing semi-supervised learning methods. We then show that S4L and existing semi-supervised methods can be jointly trained, yielding a new state-of-the-art result on semi-supervised ILSVRC-2012 with 10% of the labels.
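A minimal sketch of an S4L-style joint objective: supervised cross-entropy on a labeled batch plus a self-supervised rotation loss on an unlabeled batch, sharing one backbone. The choice of rotation as the pretext task and the unit loss weight are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of a joint semi-supervised + self-supervised objective:
# supervised loss on labeled data, rotation loss on unlabeled data.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
cls_head = nn.Linear(16, 10)    # real task: 10-way classification
rot_head = nn.Linear(16, 4)     # pretext task: 4-way rotation prediction
ce = nn.CrossEntropyLoss()

x_lab = torch.randn(8, 3, 32, 32); y_lab = torch.randint(0, 10, (8,))
x_unl = torch.randn(8, 3, 32, 32)                    # no labels available

k = torch.randint(0, 4, (1,)).item()                 # one random rotation
x_rot = torch.rot90(x_unl, k, dims=(2, 3))
y_rot = torch.full((8,), k, dtype=torch.long)

loss = ce(cls_head(backbone(x_lab)), y_lab) \
     + 1.0 * ce(rot_head(backbone(x_rot)), y_rot)    # weight 1.0 is an assumption
loss.backward()
```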
Deep learning has been shown to be successful in a number of domains, ranging from acoustics and images to natural language processing. However, applying deep learning to ubiquitous graph data is non-trivial because of the unique characteristics of graphs. Recently, significant research effort has been devoted to this area, greatly advancing graph analysis techniques. In this survey, we comprehensively review the different kinds of deep learning methods applied to graphs. We divide existing methods into three main categories: semi-supervised methods, including Graph Neural Networks and Graph Convolutional Networks; unsupervised methods, including Graph Autoencoders; and recent advancements, including Graph Recurrent Neural Networks and Graph Reinforcement Learning. We then provide a comprehensive overview of these methods in a systematic manner, following their history of development. We also analyze the differences between these methods and how different architectures can be composed. Finally, we briefly outline their applications and discuss potential future directions.
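To make the Graph Convolutional Network category concrete, here is a minimal NumPy sketch of one GCN layer in the Kipf & Welling form H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W); the toy graph, features, and weights are arbitrary.

```python
# One graph-convolution layer: normalize the self-loop-augmented adjacency,
# propagate node features, apply a linear map and a ReLU.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))         # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

A = np.array([[0, 1, 0],                           # toy 3-node path graph
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.default_rng(3).normal(size=(3, 4))   # node features
W = np.random.default_rng(4).normal(size=(4, 2))   # layer weights
print(gcn_layer(A, H, W))                          # new features, shape (3, 2)
```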
Although transfer learning has been shown to be successful for tasks like object and speech recognition, its applicability to question answering (QA) has yet to be well-studied. In this paper, we conduct extensive experiments to investigate the transferability of knowledge learned from a source QA dataset to a target dataset using two QA models. The performance of both models on a TOEFL listening comprehension test (Tseng et al., 2016) and MCTest (Richardson et al., 2013) is significantly improved via a simple transfer learning technique from MovieQA (Tapaswi et al., 2016). In particular, one of the models achieves the state-of-the-art on all target datasets; for the TOEFL listening comprehension test, it outperforms the previous best model by 7%. Finally, we show that transfer learning is helpful even in unsupervised scenarios when correct answers for target QA dataset examples are not available.
Surgical data science is a new research field that aims to observe all aspects and factors of the patient treatment process in order to provide the right assistance to the right person at the right time. Due to the breakthrough successes of deep learning-based solutions for automatic image annotation, the availability of reference annotations for algorithm training is becoming a major bottleneck in the field. The purpose of this paper was to investigate the concept of self-supervised learning to address this issue. Our approach is guided by the hypothesis that unlabeled video data can be used to learn a representation of the target domain that boosts the performance of state-of-the-art machine learning algorithms when used for pre-training. Essentially, the method involves an auxiliary task trained on unlabeled endoscopic video data from the target domain to initialize a convolutional neural network (CNN) for the target task. In this paper, we propose the re-colorization of medical images with a generative adversarial network (GAN)-based architecture as an auxiliary task. A variant of the method involves a second pre-training step based on labeled data for the target task from a related domain. We have validated both variants using medical instrument segmentation as the target task. The proposed approach can be used to radically reduce the manual annotation effort involved in training CNNs. Compared to the baseline approach of generating annotated data from scratch, our method reduces the number of labeled images required by up to 60% without sacrificing performance. Our method also outperforms alternative methods for CNN pre-training, such as pre-training on publicly available non-medical (COCO) or medical data (MICCAI endoscopic vision challenge 2017) using the target task (in this instance: segmentation).
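A hedged sketch of the re-colorization pretext described above: train an encoder-decoder to restore color from grayscale frames, then reuse the encoder to initialize the target segmentation network. The paper's GAN loss and medical data pipeline are omitted; this plain L1 variant is only illustrative.

```python
# Sketch of a colorization pretext task: predict RGB from grayscale, then
# reuse the trained encoder for the target task. Uses an L1 reconstruction
# loss in place of the paper's GAN-based objective.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(32, 3, 3, padding=1)           # predicts color channels
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

frames = torch.rand(4, 3, 64, 64)                  # stand-in unlabeled frames
gray = frames.mean(dim=1, keepdim=True)            # cheap grayscale conversion
loss = nn.functional.l1_loss(decoder(encoder(gray)), frames)
opt.zero_grad(); loss.backward(); opt.step()

# Target task: keep the pretrained encoder, attach a segmentation head.
seg_head = nn.Conv2d(32, 2, 1)                     # 2-class instrument mask
logits = seg_head(encoder(torch.rand(1, 1, 64, 64)))
```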
Standard deep learning systems require thousands or millions of examples to learn a concept, and cannot integrate new concepts easily. By contrast, humans have an incredible ability to do one-shot or few-shot learning. For instance, from just hearing a word used in a sentence, humans can infer a great deal about it by leveraging what the syntax and semantics of the surrounding words tell us. Here, we draw inspiration from this to highlight a simple technique by which deep recurrent networks can similarly exploit their prior knowledge to learn a useful representation for a new word from little data. This could make natural language processing systems much more flexible, by allowing them to learn continually from the new words they encounter.
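As a toy stand-in for the technique described here, the sketch below initializes an unseen word's vector as the centroid of its context embeddings; the paper instead updates the embedding with gradient steps on a recurrent language model's loss, so this heuristic is only an approximation of the idea.

```python
# Toy sketch: estimate an embedding for an unseen word from one sentence of
# context, as the centroid of the surrounding words' vectors.
import numpy as np

rng = np.random.default_rng(5)
vocab = {w: rng.normal(size=16) for w in
         ["the", "cat", "sat", "on", "a", "soft", "warm"]}

sentence = ["the", "cat", "sat", "on", "a", "blicket"]   # "blicket" is novel
context = [vocab[w] for w in sentence if w in vocab]
vocab["blicket"] = np.mean(context, axis=0)              # one-shot embedding

# The new vector can now be compared to known words by cosine similarity.
def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

nearest = sorted((w for w in vocab if w != "blicket"),
                 key=lambda w: -cos(vocab[w], vocab["blicket"]))
print("closest known words:", nearest[:3])
```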