LibriVAD: A Scalable Open Dataset with Deep Learning Benchmarks for Voice Activity Detection

Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD, a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over speech-to-noise ratio, silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone filter bank cepstral coefficients, and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models, such as the boosted deep neural network and the convolutional long short-term memory deep neural network, across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, showing experimentally that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.
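
The two controlled quantities named above, speech-to-noise ratio and SSR, can be illustrated with a minimal sketch. This is not the LibriVAD generation code: the function names `mix_at_snr` and `silence_to_speech_ratio` are hypothetical, the ratio is computed here from average power over the whole signal, and SSR is assumed to mean silence frames divided by speech frames. The released dataset may define or compute either quantity differently.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that the speech-to-noise power ratio is snr_db.

    Assumption: power is averaged over the whole signal, not only speech regions.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Choose gain so that p_speech / (gain^2 * p_noise) == 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def silence_to_speech_ratio(labels):
    """Assumed SSR definition: number of silence frames (0) over speech frames (1)."""
    speech_frames = sum(1 for x in labels if x == 1)
    silence_frames = len(labels) - speech_frames
    if speech_frames == 0:
        raise ValueError("no speech frames in labels")
    return silence_frames / speech_frames


# Toy usage: a 440 Hz tone stands in for speech, Gaussian samples for noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)

frame_labels = [0, 0, 1, 1, 1, 1]  # 2 silence frames, 4 speech frames
ssr = silence_to_speech_ratio(frame_labels)  # 0.5
```

Under these assumptions, a balanced SSR corresponds to a value near 1.0, meaning roughly equal amounts of silence and speech in the frame labels.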

