Humans can rapidly understand scenes by using concepts extracted from prior experience. Such concepts are diverse, and include global scene descriptors, such as the weather or lighting, as well as local scene descriptors, such as the color or size of a particular object. So far, unsupervised discovery of concepts has focused on modeling either the global scene-level or the local object-level factors of variation, but not both. In this work, we propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts and objects under a unified framework. COMET discovers energy functions by recomposing the input image, which we find captures independent factors without additional supervision. Sample generation in COMET is formulated as an optimization process on the underlying energy functions, enabling us to generate images with permuted and composed concepts. Finally, the visual concepts discovered by COMET generalize well, enabling us to compose concepts across separate modalities of images, as well as with concepts discovered by a separate instance of COMET trained on a different dataset. Code and data are available at https://energy-based-model.github.io/comet/.
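To make the idea of generation as optimization over composed energy functions concrete, the following is a minimal toy sketch and not the COMET implementation. It assumes hand-written quadratic energies in place of learned neural energy functions, a four-value list in place of an image, and plain gradient descent as the optimizer; the names `make_quadratic_energy`, `generate`, `brightness`, and `object_color` are hypothetical and chosen only for illustration.

```python
# Toy illustration: each "concept" is an energy function, concepts are
# composed by summing their energies, and a sample is generated by
# gradient descent on the summed energy. All names here are hypothetical.

def make_quadratic_energy(target, mask):
    """Return (energy, grad) for E(x) = sum_i mask[i] * (x[i] - target[i])**2."""
    def energy(x):
        return sum(m * (xi - t) ** 2 for xi, t, m in zip(x, target, mask))

    def grad(x):
        # Analytic gradient of the quadratic energy above.
        return [2.0 * m * (xi - t) for xi, t, m in zip(x, target, mask)]

    return energy, grad


def generate(concepts, x0, step=0.1, n_steps=200):
    """Minimize the sum of the concept energies, starting from x0."""
    x = list(x0)
    for _ in range(n_steps):
        total_grad = [0.0] * len(x)
        for _, grad in concepts:
            total_grad = [a + b for a, b in zip(total_grad, grad(x))]
        x = [xi - step * gi for xi, gi in zip(x, total_grad)]
    return x


# A stand-in "global" concept that constrains the first two values, and a
# stand-in "local" concept that constrains the last two.
brightness = make_quadratic_energy([0.8, 0.8, 0.0, 0.0], [1, 1, 0, 0])
object_color = make_quadratic_energy([0.0, 0.0, 0.2, 0.5], [0, 0, 1, 1])

sample = generate([brightness, object_color], x0=[0.0, 0.0, 0.0, 0.0])
total_energy = sum(energy(sample) for energy, _ in (brightness, object_color))
```

Because the two toy energies act on disjoint coordinates, minimizing their sum satisfies both at once. Swapping `object_color` for an energy with a different target changes only the last two values of the sample, which mirrors, in a very simplified form, the permuting and composing of concepts that the abstract describes.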