2018 年 9 月 12 日 CreateAMind


This is a curated list of papers on disentangled (and an occasional "conventional") representation learning. Within each year, the papers are ordered from newest to oldest. I've scored the importance/quality of each paper (in my own personal opinion) on a scale of 1 to 3, as indicated by the number of stars in front of each entry in the list. If stars are replaced by a question mark, then it represents a paper I haven't fully read yet, in which case I'm unable to judge its quality.


2018

  • ? Learning Deep Representations by Mutual Information Estimation and Maximization (Aug, Hjelm et al.) [paper]

  • ? Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (Aug, Achille et al.) [paper]

  • ? Insights on Representational Similarity in Neural Networks with Canonical Correlation (Jun, Morcos et al.) [paper]

  • ** Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects (Jun, Kosiorek et al.) [paper]

  • *** Neural Scene Representation and Rendering (Jun, Eslami et al.) [paper]

  • ? Image-to-image translation for cross-domain disentanglement (May, Gonzalez-Garcia et al.) [paper]

  • * Learning Disentangled Joint Continuous and Discrete Representations (May, Dupont) [paper] [code]

  • ? DGPose: Disentangled Semi-supervised Deep Generative Models for Human Body Analysis (Apr, Bem et al.) [paper]

  • ? Structured Disentangled Representations (Apr, Esmaeili et al.) [paper]

  • ** Understanding disentangling in β-VAE (Apr, Burgess et al.) [paper]

  • ? On the importance of single directions for generalization (Mar, Morcos et al.) [paper]

  • ** Unsupervised Representation Learning by Predicting Image Rotations (Mar, Gidaris et al.) [paper]

  • ? Disentangled Sequential Autoencoder (Mar, Li & Mandt) [paper]

  • *** Isolating Sources of Disentanglement in Variational Autoencoders (Mar, Chen et al.) [paper] [code]

  • ** Disentangling by Factorising (Feb, Kim & Mnih) [paper]

  • ** Disentangling the Independently Controllable Factors of Variation by Interacting with the World (Feb, Bengio's group) [paper]

  • ? On the Latent Space of Wasserstein Auto-Encoders (Feb, Rubenstein et al.) [paper]

  • ? Auto-Encoding Total Correlation Explanation (Feb, Gao et al.) [paper]

  • ? Fixing a Broken ELBO (Feb, Alemi et al.) [paper]

  • * Learning Disentangled Representations with Wasserstein Auto-Encoders (Feb, Rubenstein et al.) [paper]

  • ? Rethinking Style and Content Disentanglement in Variational Autoencoders (Feb, Shu et al.) [paper]

  • ? A Framework for the Quantitative Evaluation of Disentangled Representations (Feb, Eastwood & Williams) [paper]


2017

  • ? The β-VAE's Implicit Prior (Dec, Hoffman et al.) [paper]

  • ** The Multi-Entity Variational Autoencoder (Dec, Nash et al.) [paper]

  • ? Learning Independent Causal Mechanisms (Dec, Parascandolo et al.) [paper]

  • ? Variational Inference of Disentangled Latent Concepts from Unlabeled Observations (Nov, Kumar et al.) [paper]

  • * Neural Discrete Representation Learning (Nov, Oord et al.) [paper]

  • ? Disentangled Representations via Synergy Minimization (Oct, Steeg et al.) [paper]

  • ? Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data (Sep, Hsu et al.) [paper] [code]

  • * Experiments on the Consciousness Prior (Sep, Bengio & Fedus) [paper]

  • ** The Consciousness Prior (Sep, Bengio) [paper]

  • ? Disentangling Motion, Foreground and Background Features in Videos (Jul, Lin et al.) [paper]

  • * SCAN: Learning Hierarchical Compositional Visual Concepts (Jul, Higgins et al.) [paper]

  • *** DARLA: Improving Zero-Shot Transfer in Reinforcement Learning (Jul, Higgins et al.) [paper]

  • ** Unsupervised Learning via Total Correlation Explanation (Jun, Ver Steeg) [paper] [code]

  • ? PixelGAN Autoencoders (Jun, Makhzani & Frey) [paper]

  • ? Emergence of Invariance and Disentanglement in Deep Representations (Jun, Achille & Soatto) [paper]

  • ** A Simple Neural Network Module for Relational Reasoning (Jun, Santoro et al.) [paper]

  • ? Learning Disentangled Representations with Semi-Supervised Deep Generative Models (Jun, Siddharth et al.) [paper]

  • ? Unsupervised Learning of Disentangled Representations from Video (May, Denton & Birodkar) [paper]


2016

  • ** Deep Variational Information Bottleneck (Dec, Alemi et al.) [paper]

  • *** β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework (Nov, Higgins et al.) [paper] [code]

  • ? Disentangling factors of variation in deep representations using adversarial training (Nov, Mathieu et al.) [paper]

  • ** Information Dropout: Learning Optimal Representations Through Noisy Computation (Nov, Achille & Soatto) [paper]

  • ** InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets (Jun, Chen et al.) [paper]

  • *** Building Machines That Learn and Think Like People (Apr, Lake et al.) [paper]

  • *** Attend, Infer, Repeat: Fast Scene Understanding with Generative Models (Mar, Eslami et al.) [paper]

  • * Understanding Visual Concepts with Continuation Learning (Feb, Whitney et al.) [paper]

  • ? Disentangled Representations in Neural Models (Feb, Whitney) [paper]

Older work

  • ** Deep Convolutional Inverse Graphics Network (2015, Kulkarni et al.) [paper]

  • ? Learning to Disentangle Factors of Variation with Manifold Interaction (2014, Reed et al.) [paper]

  • *** Representation Learning: A Review and New Perspectives (2013, Bengio et al.) [paper]

  • ? Disentangling Factors of Variation via Generative Entangling (2012, Desjardins et al.) [paper]

  • *** Transforming Auto-encoders (2011, Hinton et al.) [paper]

  • ** Learning Factorial Codes By Predictability Minimization (1992, Schmidhuber) [paper]

  • *** Self-Organization in a Perceptual Network (1988, Linsker) [paper]


Talks

  • Building Machines that Learn & Think Like People (2018, Tenenbaum) [youtube]

  • From Deep Learning of Disentangled Representations to Higher-level Cognition (2018, Bengio) [youtube]

  • What is wrong with convolutional neural nets? (2017, Hinton) [youtube]



Representation learning derives vector representations from training data, which can overcome the limitations of hand-crafted methods. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised representation learning methods use the latent variables of autoencoders (e.g., denoising autoencoders and sparse autoencoders) as representations. Variational autoencoders, which have emerged more recently, are more tolerant of noise and outliers. However, inferring the latent structure of given data is nearly intractable, and several approximate-inference strategies exist. In addition, some unsupervised representation learning methods aim to approximate a specific similarity measure. One such approach is an unsupervised similarity-preserving representation learning framework that uses matrix factorization to preserve pairwise DTW similarities: by learning DTW-preserving shapelets, Euclidean distances in the transformed space approximate the true DTW distances of the original data. Supervised representation learning methods can exploit the label information of the data to better capture its semantic structure. Siamese networks and triplet networks are two popular models of this kind; their objective is to maximize the distance between classes while minimizing the distance within classes.
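The triplet objective just described can be sketched in a few lines of Python; the margin value and the toy embeddings below are illustrative assumptions, not taken from any particular paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: push the anchor-negative distance to exceed
    the anchor-positive distance by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # intra-class distance
    d_neg = np.linalg.norm(anchor - negative)  # inter-class distance
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: the positive is close to the anchor, the negative far away,
# so the margin is already satisfied and the loss is zero.
anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
negative = np.array([3.0, 0.0])
print(triplet_loss(anchor, positive, negative))  # 0.0
```

Minimizing this loss over many triplets is what pulls same-class embeddings together and pushes different-class embeddings apart.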

In English, prosody adds a broad range of information to segment sequences, from information structure (e.g. contrast) to stylistic variation (e.g. expression of emotion). However, when learning to control prosody in text-to-speech voices, it is not clear what exactly the control is modifying. Existing research on discrete representation learning for prosody has demonstrated high naturalness, but no analysis has been performed on what these representations capture, or if they can generate meaningfully-distinct variants of an utterance. We present a phrase-level variational autoencoder with a multi-modal prior, using the mode centres as "intonation codes". Our evaluation establishes which intonation codes are perceptually distinct, finding that the intonation codes from our multi-modal latent model were significantly more distinct than a baseline using k-means clustering. We carry out a follow-up qualitative study to determine what information the codes are carrying. Most commonly, listeners commented on the intonation codes having a statement or question style. However, many other affect-related styles were also reported, including: emotional, uncertain, surprised, sarcastic, passive aggressive, and upset.


Mining graph data has become a popular research topic in computer science and has been widely studied in both academia and industry, given the increasing amount of network data in recent years. However, the sheer volume of network data poses great challenges for efficient analysis. This motivates graph representation learning, which maps a graph into a low-dimensional vector space while preserving the original graph structure and supporting graph inference. The investigation of efficient graph representations has profound theoretical significance and practical importance; we therefore introduce some basic ideas in graph representation/network embedding, as well as some representative models, in this chapter.
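As a minimal, hypothetical illustration of mapping a graph into a low-dimensional vector space while keeping its structure, the sketch below embeds nodes via a truncated SVD of the adjacency matrix; this is just one simple spectral strategy, not a specific model from the chapter, and the toy graph is made up.

```python
import numpy as np

def svd_embed(adj, dim):
    """Embed nodes as the top-`dim` left singular vectors of the adjacency
    matrix, scaled by the singular values (a simple spectral embedding)."""
    u, s, _ = np.linalg.svd(adj, full_matrices=False)
    return u[:, :dim] * s[:dim]

# Toy graph: nodes 0 and 1 both attach only to node 2; nodes 4 and 5 only to 3.
adj = np.zeros((6, 6))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4), (3, 5)]:
    adj[i, j] = adj[j, i] = 1.0

emb = svd_embed(adj, dim=2)
# Nodes 0 and 1 have identical neighbourhoods, so they get identical embeddings:
# the embedding reflects graph structure.
print(np.allclose(emb[0], emb[1]))  # True
```

Real network-embedding models (DeepWalk, node2vec, GCN-style encoders, etc.) replace the SVD with learned objectives, but the goal is the same: nearby or structurally similar nodes end up close in the embedding space.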


User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users' decision making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions, to low-level ones that characterize a user's preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists.
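A rough numpy sketch of the macro-disentanglement idea above, under the assumption that items are soft-assigned to high-level concepts by similarity to concept prototypes; the prototype vectors, temperature, and dimensions are illustrative stand-ins, not MacridVAE's actual parameterization.

```python
import numpy as np

def concept_assignment(item_emb, prototypes, temperature=0.1):
    """Soft-assign each item to a high-level concept via a softmax over
    cosine similarities to concept prototype vectors."""
    items = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = items @ protos.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
items = rng.normal(size=(5, 8))       # 5 items, 8-dim embeddings (toy data)
prototypes = rng.normal(size=(2, 8))  # 2 concepts (e.g. "shirt", "cellphone")
probs = concept_assignment(items, prototypes)
print(probs.shape, np.allclose(probs.sum(axis=1), 1.0))  # (5, 2) True
```

With such an assignment in hand, a user's preference can be inferred separately per concept, which is the "macro" half of the approach; the "micro" half then regularizes individual latent dimensions.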


Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
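One plausible form of the combined objective, shown only as a sketch: a soft-label term (temperature-scaled KL divergence between teacher and student output distributions) plus an internal-representation term (mean squared error between matched hidden states). The temperature, the weighting `alpha`, and the direct hidden-state matching are my assumptions here; the paper formulates its own two distillation variants.

```python
import numpy as np

def softmax(x, t=1.0):
    """Temperature-scaled softmax for a 1-D logit vector."""
    z = np.exp((x - x.max()) / t)
    return z / z.sum()

def distill_loss(t_logits, s_logits, t_hidden, s_hidden, temp=2.0, alpha=0.5):
    """Soft-label KD plus an internal-representation matching term."""
    p_t = softmax(t_logits, temp)
    p_s = softmax(s_logits, temp)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))  # soft-label term
    mse = np.mean((t_hidden - s_hidden) ** 2)       # internal-representation term
    return alpha * kl + (1 - alpha) * mse

t_logits = np.array([2.0, 0.5, -1.0])
s_logits = np.array([1.8, 0.7, -0.9])
t_hidden = np.ones(4)          # a (projected) teacher hidden state
s_hidden = np.full(4, 0.9)     # the corresponding student hidden state
print(round(distill_loss(t_logits, s_logits, t_hidden, s_hidden), 4))
```

The internal term is what penalizes a student that matches the soft labels while drifting internally, which is exactly the mismatch the abstract warns about.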


This is an official PyTorch implementation of Deep High-Resolution Representation Learning for Human Pose Estimation. In this work, we are interested in the human pose estimation problem, with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from the low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich high-resolution representations. As a result, the predicted keypoint heatmap is potentially more accurate and spatially more precise. We empirically demonstrate the effectiveness of our network through superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset and the MPII Human Pose dataset. The code and models are publicly available at \url{https://github.com/leoxiaobin/deep-high-resolution-net.pytorch}.


Knowledge representation learning (KRL) aims to represent the entities and relations of a knowledge graph in a low-dimensional semantic space, and has been widely used in knowledge-driven tasks at scale. In this article, we introduce the reader to the motivations for KRL and give an overview of existing KRL approaches. Afterwards, we conduct an extensive quantitative comparison and analysis of several typical KRL methods on three knowledge acquisition evaluation tasks: knowledge graph completion, triple classification, and relation extraction. We also review real-world applications of KRL, such as language modeling, question answering, information retrieval, and recommender systems. Finally, we discuss the remaining challenges and outline future directions for KRL. The codes and datasets used in the experiments can be found at https://github.com/thunlp/OpenKE.
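As a concrete taste of what a "typical KRL method" computes, here is a sketch of TransE-style scoring, one of the classic translation-based models implemented in OpenKE. The toy embeddings below are illustrative, not trained.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE models a true triple (h, r, t) as h + r ≈ t in embedding space;
    a lower L2 distance means a more plausible triple."""
    return np.linalg.norm(h + r - t)

# Toy 2-D embeddings (illustrative only).
head = np.array([1.0, 0.0])
rel = np.array([0.0, 1.0])
true_tail = np.array([1.0, 1.0])   # head + rel lands exactly here
wrong_tail = np.array([-1.0, 0.0])
print(transe_score(head, rel, true_tail) < transe_score(head, rel, wrong_tail))  # True
```

Knowledge graph completion with such a model amounts to ranking candidate tails (or heads) by this score; triple classification thresholds it.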


Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.


Learning compact representations is vital and challenging for large-scale multimedia data. Cross-view/cross-modal hashing for effective binary representation learning has received significant attention with the exponentially growing availability of multimedia content. Most existing cross-view hashing algorithms emphasize the similarities within individual views, which are then connected via cross-view similarities. In this work, we focus on exploiting the discriminative information from different views, and propose an end-to-end method, dubbed Discriminative Cross-View Hashing (DCVH), that learns a semantic-preserving and discriminative binary representation serving multiple tasks, including cross-view retrieval, image-to-image retrieval, and image annotation/tagging. The proposed DCVH has the following key components. First, it uses convolutional neural network (CNN)-based nonlinear hashing functions and multilabel classification for both images and texts simultaneously. These hashing functions achieve effective continuous relaxation during training, without an explicit quantization loss, by using Direct Binary Embedding (DBE) layers. Second, we propose an effective view alignment via Hamming distance minimization, which is efficiently accomplished by the bit-wise XOR operation. Extensive experiments on two image-text benchmark datasets demonstrate that DCVH outperforms state-of-the-art cross-view hashing algorithms as well as single-view image hashing algorithms. In addition, DCVH provides competitive performance for image annotation/tagging.
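The bit-wise XOR trick for Hamming distance mentioned above can be sketched directly; the 8-bit toy codes below are made up for illustration.

```python
def hamming_xor(a, b):
    """Hamming distance between two binary codes stored as Python ints:
    XOR leaves a 1 exactly where the codes disagree, then count the 1s."""
    return bin(a ^ b).count("1")

# 8-bit codes for two views (image and text) of the same item:
# well-aligned codes should disagree in very few bits.
code_image = 0b10110010
code_text  = 0b10110110
print(hamming_xor(code_image, code_text))  # 1
```

Because XOR and popcount are single machine instructions on modern hardware, minimizing (and later querying by) Hamming distance is far cheaper than comparing real-valued embeddings, which is the appeal of binary representations for retrieval.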


Although Faster R-CNN and its variants have shown promising performance in object detection, they exploit only a simple first-order representation of object proposals for the final classification and regression. Recent classification methods demonstrate that integrating high-order statistics into deep convolutional neural networks can achieve impressive improvements, but their goal is to model whole images while discarding location information, so they cannot be directly adopted for object detection. In this paper, we attempt to exploit high-order statistics in object detection, aiming to generate more discriminative representations of proposals to enhance detector performance. To this end, we propose a novel Multi-scale Location-aware Kernel Representation (MLKP) to capture high-order statistics of deep features in proposals. Our MLKP can be efficiently computed on a modified multi-scale feature map using a low-dimensional polynomial kernel approximation. Moreover, unlike existing orderless global representations based on high-order statistics, the proposed MLKP is location retentive and sensitive, so it can be flexibly adopted for object detection. Integrated into the Faster R-CNN schema, the proposed MLKP achieves very competitive performance with state-of-the-art methods, improving Faster R-CNN by 4.9% (mAP), 4.7% (mAP) and 5.0% (AP at IoU=[0.5:0.05:0.95]) on the PASCAL VOC 2007, VOC 2012 and MS COCO benchmarks, respectively. Code is available at: https://github.com/Hwang64/MLKP.
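Why a polynomial kernel can capture high-order statistics cheaply: a homogeneous degree-2 kernel equals the inner product of the explicit second-order (outer-product) feature maps, so it encodes second-order statistics without ever materializing them. The sketch below only verifies this identity on random vectors; it is not MLKP's actual low-dimensional approximation.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel."""
    return float(x @ y) ** 2

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), rng.normal(size=4)

# (x·y)^2 == <xx^T, yy^T>: the kernel implicitly compares the full
# second-order statistics (outer products) of the two feature vectors.
explicit = float(np.outer(x, x).ravel() @ np.outer(y, y).ravel())
print(np.isclose(poly2_kernel(x, y), explicit))  # True
```

The explicit feature map is quadratic in dimension (16 values for a 4-dim input here), which is why kernel-style approximations matter once the features are deep CNN channels.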


Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner to map Gaussian noise to the learned embedding feature space, one per factor. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate targeted manipulations that provide more control over the generation process. Experiments on the Market-1501 and DeepFashion datasets show that our model not only generates realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.

Author lists for the excerpted abstracts above (in order):

  • Wenwu Zhu, Xin Wang, Peng Cui

  • Learning Disentangled Representations for Recommendation (Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, Wenwu Zhu)

  • Knowledge Distillation from Internal Representations (Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Edward Guo)

  • Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang

  • Knowledge Representation Learning: A Quantitative Review (Yankai Lin, Xu Han, Ruobing Xie, Zhiyuan Liu, Maosong Sun)

  • GAN Dissection: Visualizing and Understanding Generative Adversarial Networks (David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba)

  • Liu Liu, Hairong Qi

  • Hao Wang, Qilong Wang, Mingqi Gao, Peihua Li, Wangmeng Zuo

  • Liqian Ma, Qianru Sun, Stamatios Georgoulis, Luc Van Gool, Bernt Schiele, Mario Fritz