A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection - Zhuanzhi Papers


Speech recognition · Attention mechanism · Performer · Automatic speech recognition · Transcript

May 11, 2022

A Closer Look at Audio-Visual Multi-Person Speech Recognition and Active Speaker Selection


Otavio Braga, Olivier Siohan

From arXiv. arXiv admin note: text overlap with arXiv:2205.05586

Audio-visual automatic speech recognition is a promising approach to robust ASR under noisy conditions. However, until recently it had traditionally been studied in isolation, assuming that the video of a single speaking face matches the audio, while selecting the active speaker at inference time when multiple people are on screen was set aside as a separate problem. As an alternative, recent work has proposed addressing the two problems simultaneously with an attention mechanism, baking the speaker selection problem directly into a fully differentiable model. One interesting finding was that the attention indirectly learns the association between the audio and the speaking face, even though this correspondence is never explicitly provided at training time. In the present work we further investigate this connection and examine the interplay between the two problems. With experiments involving over 50 thousand hours of public YouTube videos as training data, we first evaluate the accuracy of the attention layer on an active speaker selection task. Second, we show under closer scrutiny that, under various noise conditions and numbers of parallel face tracks, an end-to-end model performs at least as well as a considerably larger two-step system that uses a hard decision boundary.
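The contrast the abstract draws between soft attention over face tracks and a two-step system with a hard decision can be illustrated with a toy sketch. This is not the authors' model: the scaled dot-product scoring, the feature shapes, and the function names `soft_select` and `hard_select` are all assumptions made for illustration, and the inputs are synthetic features rather than real audio or video embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def track_scores(audio, faces):
    """Scaled dot-product score between each audio frame and each face track.

    audio: (T, D) audio features; faces: (N, T, D) features for N face tracks.
    Returns an array of shape (T, N). The scoring function is an assumption.
    """
    d = audio.shape[-1]
    return np.einsum("td,ntd->tn", audio, faces) / np.sqrt(d)


def soft_select(audio, faces):
    """Soft (differentiable) selection: per-frame weighted sum over tracks."""
    weights = softmax(track_scores(audio, faces), axis=-1)  # (T, N)
    attended = np.einsum("tn,ntd->td", weights, faces)      # (T, D)
    return attended, weights


def hard_select(audio, faces):
    """Hard selection: pick one track for the whole clip by mean score."""
    best = int(np.argmax(track_scores(audio, faces).mean(axis=0)))
    return faces[best], best


# Toy input: track 1 matches the audio, track 0 is its negation, track 2 is zeros.
rng = np.random.default_rng(0)
audio = rng.normal(size=(5, 8))
faces = np.stack([-audio, audio, np.zeros_like(audio)])

attended, weights = soft_select(audio, faces)
selected, best = hard_select(audio, faces)
print(best)                     # index of the track chosen by the hard decision
print(weights.argmax(axis=-1))  # per-frame track favored by the soft attention
```

In the soft variant the selection weights stay inside the computation graph, so a recognizer trained on `attended` could in principle shape them through its own loss. The hard variant commits to a single track before recognition starts.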


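Related content

Speech recognition

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methods and technologies enabling computers to recognize spoken language and convert it to text. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT). It draws on knowledge and research from computer science, linguistics, and computer engineering.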

