Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance.
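The abstract does not specify an implementation, but the following minimal sketch illustrates the kind of self-supervised speech representation it refers to, assuming PyTorch and the Hugging Face transformers library with a wav2vec 2.0 checkpoint; the model name and pooling choice are illustrative, not necessarily those used in this work.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load a pretrained self-supervised speech model. The checkpoint name is an
# assumption for illustration, not the one used in the paper.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# One second of dummy 16 kHz audio, standing in for a recorded user utterance.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the frame-level features into one utterance-level embedding that
# could then be paired with a visual percept for grounded language learning.
speech_embedding = outputs.last_hidden_state.mean(dim=1)  # shape: (1, 768)
print(speech_embedding.shape)

An embedding of this kind replaces the transcript that a text-based pipeline would require, which is how the approach avoids depending on an automatic speech recognition step.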