Computer vision has improved significantly in the past few decades and has enabled machines to perform many human tasks. However, the real challenge lies in enabling machines to carry out tasks that the average human is not skilled at. One such challenge, which we tackle in this paper, is providing accessibility for deaf individuals by offering a means of communication with others with the aid of computer vision. Unlike most prior work, which relies on multiple cameras, depth cameras, electronic gloves, or visual gloves, we focus on the sole use of RGB video, which allows anyone to communicate with a deaf individual through their personal devices. This is not a new approach, but the lack of a realistic large-scale data set has prevented recent computer vision trends in video classification from reaching this field. In this paper, we propose the first large-scale ASL data set that covers over 200 signers, signer-independent sets, challenging and unconstrained recording conditions, and a large class count of 1,000 signs. We evaluate baselines from action recognition techniques on the data set. We propose I3D, known from video classification, as a powerful and suitable architecture for sign language recognition. We also propose a new pre-trained model more appropriate for sign language recognition. Finally, we estimate the effect of the number of classes and the number of training samples on recognition accuracy.
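The key idea behind the I3D architecture mentioned above is "inflating" 2D convolutional filters from an image-pretrained network into 3D spatiotemporal filters. A minimal NumPy sketch of that inflation step follows; the function name and shapes are illustrative, not taken from this paper's implementation. Each 2D kernel is repeated along a new temporal axis and rescaled so that, on a "boring" video (the same frame repeated), the 3D response matches the original 2D response.

```python
import numpy as np

def inflate_2d_filter(w2d, time_depth):
    """Inflate a 2D conv kernel into a 3D spatiotemporal kernel (I3D-style).

    w2d: array of shape (out_ch, in_ch, kH, kW), e.g. from an
         ImageNet-pretrained 2D network (shapes here are illustrative).
    time_depth: temporal extent of the resulting 3D kernel.
    """
    # Insert a temporal axis and tile the 2D weights along it.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_depth, axis=2)
    # Divide by the temporal depth so the summed response over a video of
    # identical frames equals the original 2D filter's response.
    return w3d / time_depth

# Example: inflate a 7x7 kernel bank to 7x7x7.
w2d = np.random.randn(64, 3, 7, 7)
w3d = inflate_2d_filter(w2d, time_depth=7)
print(w3d.shape)  # (64, 3, 7, 7, 7)
```

Summing the inflated kernel over its temporal axis recovers the original 2D kernel, which is exactly the property that lets the 3D network bootstrap from 2D pre-training.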