CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has many potential applications, there is still no established benchmark built on real-world data. The earliest AMR study trained its model solely on synthetic datasets and evaluated it on an annotated dataset of fewer than 100 samples, which makes the reported performance less reliable. To support reliable performance in real-world applications, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, validation, and test splits, respectively, which makes it 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
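The Recall1@0.7 figure quoted above can be illustrated with a minimal sketch. It assumes the definition commonly used in moment retrieval: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. The function names, the (start, end) representation in seconds, and the one-ground-truth-span-per-query simplification are assumptions made for illustration and are not taken from the CASTELLA evaluation code.

```python
from typing import List, Tuple

# Hypothetical representation: a moment is (start_sec, end_sec).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at(top1_preds: List[Span], gts: List[Span],
               threshold: float = 0.7) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes exactly one ground-truth span per query (a simplification).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one top-1 prediction per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 8.0), (0.0, 10.0)]
    truth = [(0.0, 10.0), (5.0, 15.0)]
    # First pair has IoU 0.8 (hit), second has IoU 1/3 (miss).
    print(recall1_at(preds, truth))
```

Under this reading, a gain of 10.4 points means that 10.4 percentage points more test queries have a top-ranked segment that clears the 0.7 IoU threshold.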
