We present in this work the first end-to-end deep-learning-based method that predicts both 3D hand shape and pose from RGB images in the wild. Our network consists of the concatenation of a deep convolutional encoder and a fixed model-based decoder. Given an input image, and optionally 2D joint detections obtained from an independent CNN, the encoder predicts a set of hand and view parameters. The decoder has two components: a pre-computed articulated mesh deformation hand model that generates a 3D mesh from the hand parameters, and a re-projection module controlled by the view parameters that projects the generated hand into the image domain. We show that using the shape and pose prior knowledge encoded in the hand model within a deep learning framework yields state-of-the-art performance in 3D pose prediction from images on standard benchmarks, and produces geometrically valid and plausible 3D reconstructions. Additionally, we show that training with weak supervision, in the form of 2D joint annotations on datasets of images in the wild, in conjunction with full supervision, in the form of 3D joint annotations on the limited available datasets, allows for good generalization to 3D shape and pose predictions on images in the wild.
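The re-projection module described above can be sketched as a weak-perspective camera: view parameters (here assumed to be a global rotation, a scale, and a 2D translation; the exact camera parameterization in the paper may differ) map the 3D joints produced by the hand model into the image plane, which is what makes weak supervision from 2D joint annotations possible. A minimal NumPy sketch under these assumptions:

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector to a 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def reproject(joints_3d, rot, scale, trans):
    """Weak-perspective projection of 3D joints (N, 3) into 2D image coordinates (N, 2).

    rot:   axis-angle global rotation (3,)   -- hypothetical view parameterization
    scale: orthographic scale (scalar)
    trans: 2D image-plane translation (2,)
    """
    rotated = joints_3d @ rodrigues(rot).T      # apply global rotation to every joint
    return scale * rotated[:, :2] + trans       # drop depth, then scale and translate

# Toy example: 21 hand joints, identity view rotation
joints = np.random.randn(21, 3)
j2d = reproject(joints, rot=np.zeros(3),
                scale=100.0, trans=np.array([112.0, 112.0]))
```

Because this projection is differentiable in the view parameters, a 2D joint loss on `j2d` against annotated keypoints can be backpropagated through it into the encoder, alongside a 3D joint loss where full annotations exist.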