Vision transformer networks have shown superior performance in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer for salient object detection, whose latent variables follow an informative energy-based prior. The vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which sampling from the intractable posterior and prior distributions of the latent variables is performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model's confidence in predicting saliency from that image. Unlike existing generative models, which define the prior distribution of the latent variables as a simple isotropic Gaussian distribution, our model uses an informative energy-based prior that can be more expressive in capturing the latent space of the data. We apply the proposed framework to both RGB and RGB-D salient object detection tasks. Extensive experimental results show that our framework achieves not only accurate saliency predictions but also meaningful uncertainty maps that are consistent with human perception.
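To make the sampling step concrete, the following is a minimal sketch of Langevin dynamics drawing latent samples from an energy-based prior, followed by a per-pixel variance over decoded samples as one possible uncertainty map. It is not the paper's implementation. Several pieces are assumptions made purely for illustration: the prior is taken to be a standard Gaussian reweighted by exp(-E(z)), the energy `E` is a toy quadratic rather than a learned network, the "decoder" is a hypothetical random linear-sigmoid map standing in for the vision transformer, and variance across samples is only one plausible way to define uncertainty. The names `langevin_sample`, `grad_potential`, `MU`, and `W` are all hypothetical.

```python
import numpy as np


def langevin_sample(grad_potential, z0, step_size=0.3, n_steps=100, rng=None):
    """Run unadjusted Langevin dynamics on a batch of latent vectors."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        # Gradient step on the potential plus injected Gaussian noise.
        z = z - 0.5 * step_size**2 * grad_potential(z) + step_size * noise
    return z


# Toy stand-in for a learned energy (assumption): E(z) = 0.5 * ||z - MU||^2.
MU = np.array([2.0, -1.0])


def grad_potential(z):
    # Assumed prior: p(z) proportional to exp(-E(z)) * N(z; 0, I), so the
    # potential is U(z) = E(z) + 0.5 * ||z||^2 and grad U(z) = (z - MU) + z.
    return (z - MU) + z


rng = np.random.default_rng(0)
z_init = rng.standard_normal((2000, 2))  # chains start from N(0, I)
z_samples = langevin_sample(grad_potential, z_init, rng=rng)

# Hypothetical linear-sigmoid "decoder" producing a 4x4 map per latent sample.
W = rng.standard_normal((2, 16))
preds = 1.0 / (1.0 + np.exp(-(z_samples @ W)))     # shape (2000, 16)
uncertainty_map = preds.var(axis=0).reshape(4, 4)  # per-pixel variance
```

For this toy energy the target distribution is Gaussian with mean `MU / 2`, which gives a simple sanity check on the sampler. In a trained model, `grad_potential` would instead come from automatic differentiation through the learned energy network, and posterior sampling would add the gradient of the reconstruction term.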