Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data and lack real human speech in sufficient quantity and diversity. Existing models are further limited by rigid single-order generation templates and weak semantic alignment, both of which substantially restrict their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, which exploits data diversity by generating triplet elements in different orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show that our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
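To make the multi-order idea concrete, the following is a minimal sketch rather than the RPG-MoGe implementation. It assumes that each triplet is linearized once per permutation of its three elements to form training targets, and that at inference the triplet sets decoded under each order are combined by simple vote counting. The tag format, the function names (`linearize`, `training_targets`, `vote`), and the `min_votes` threshold are all hypothetical; the abstract does not specify how the ensemble is combined.

```python
from collections import Counter
from itertools import permutations

# Hypothetical element names; the abstract only says "diverse element orders".
ELEMENTS = ("head", "relation", "tail")


def linearize(triplet, order):
    """Render one triplet dict as a target string in the given element order."""
    return " ".join(f"<{name}> {triplet[name]}" for name in order)


def training_targets(triplet):
    """Build one target string per element order (3! = 6 orders)."""
    return {order: linearize(triplet, order) for order in permutations(ELEMENTS)}


def vote(predictions_per_order, min_votes):
    """Keep triplets predicted under at least `min_votes` generation orders.

    `predictions_per_order` is a list with one set of (head, relation, tail)
    tuples per order. Vote counting is an assumed combination rule.
    """
    counts = Counter(t for preds in predictions_per_order for t in set(preds))
    return {t for t, c in counts.items() if c >= min_votes}


if __name__ == "__main__":
    gold = {"head": "Paris", "relation": "capital of", "tail": "France"}
    for order, target in training_targets(gold).items():
        print(order, "->", target)

    decoded = [
        {("Paris", "capital of", "France")},
        {("Paris", "capital of", "France"), ("Paris", "located in", "Europe")},
        set(),
    ]
    print(vote(decoded, min_votes=2))
```

Under these assumptions, a triplet that only one generation order produces is filtered out, while a triplet recovered under several orders is kept, which is one plausible way an ensemble over orders could trade recall for precision.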