Unsupervised Object Representation Learning using Translation and Rotation Group Equivariant VAE

In many imaging modalities, objects of interest can occur in a variety of locations and poses (i.e., they are subject to translations and rotations in 2D or 3D), but the location and pose of an object do not change its semantics (i.e., the object's essence). That is, the specific location and rotation of an airplane in satellite imagery, the 3D rotation of a chair in a natural image, or the rotation of a particle in a cryo-electron micrograph does not change the intrinsic nature of those objects. Here, we consider the problem of learning semantic representations of objects that are invariant to pose and location in a fully unsupervised manner. We address shortcomings in previous approaches to this problem by introducing TARGET-VAE, a translation and rotation group-equivariant variational autoencoder framework. TARGET-VAE combines three core innovations: 1) a rotation and translation group-equivariant encoder architecture, 2) a structurally disentangled distribution over latent rotation, latent translation, and a rotation-translation-invariant semantic object representation, all of which are jointly inferred by the approximate inference network, and 3) a spatially equivariant generator network. In comprehensive experiments, we show that TARGET-VAE learns disentangled representations without supervision that significantly improve upon, and avoid the pathologies of, previous methods. When trained on images highly corrupted by rotation and translation, TARGET-VAE learns semantic representations similar to those learned on consistently posed objects, which dramatically improves clustering in the semantic latent space. Furthermore, TARGET-VAE is able to perform remarkably accurate unsupervised pose and location inference. We expect methods like TARGET-VAE will underpin future approaches for unsupervised object generation, pose prediction, and object detection.
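To make the idea of a structurally disentangled latent space more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one common way to obtain a spatially equivariant generator: the generator maps pixel coordinates to intensities, and the latent rotation and translation act only on those coordinates, while the semantic latent controls the object's appearance. The names `coordinate_grid`, `transform_coordinates`, `toy_generator`, and `render` are hypothetical, and the Gaussian blob merely stands in for a learned decoder.

```python
import numpy as np


def coordinate_grid(size):
    """Return a (size*size, 2) array of (x, y) pixel coordinates in [-1, 1]."""
    axis = np.linspace(-1.0, 1.0, size)
    xs, ys = np.meshgrid(axis, axis, indexing="xy")
    return np.stack([xs.ravel(), ys.ravel()], axis=1)


def transform_coordinates(coords, theta, shift):
    """Map image coordinates into the object's canonical frame.

    Computes R(theta)^T (x - shift) for every coordinate x, i.e. it undoes
    the latent translation first and the latent rotation second.
    """
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    # For row vectors, right-multiplying by R applies R^T to each point.
    return (coords - shift) @ rotation


def toy_generator(canonical_coords, z):
    """Hypothetical stand-in for a learned decoder (assumption, not the paper's).

    Draws a Gaussian blob whose canonical position depends on the scalar
    semantic latent z; it never sees the pose or location.
    """
    center = np.array([z, 0.0])
    sq_dist = np.sum((canonical_coords - center) ** 2, axis=1)
    return np.exp(-sq_dist / 0.02)


def render(z, theta, shift, size=33):
    """Render an image from the semantic latent z, rotation theta, and translation shift."""
    coords = coordinate_grid(size)
    canonical = transform_coordinates(coords, theta, np.asarray(shift, dtype=float))
    return toy_generator(canonical, z).reshape(size, size)


# Same semantic latent, two different poses: only the coordinates change.
canonical_image = render(z=0.5, theta=0.0, shift=(0.0, 0.0))
posed_image = render(z=0.5, theta=np.pi / 2, shift=(0.25, 0.0))
```

In this sketch the semantic latent `z` is untouched when the pose changes, so any variation in rotation or translation has to be explained by `theta` and `shift`. That separation is the property the abstract calls structural disentanglement; how the encoder infers these quantities is outside the scope of this toy example.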
