Recent audio language models can follow long conversations, yet research on emotion-aware and spoken dialogue summarization remains constrained by the lack of data linking speech, summaries, and paralinguistic cues. We introduce Spoken DialogSum, the first corpus that aligns raw conversational audio with factual summaries, emotion-rich summaries, and utterance-level labels for speaker age, gender, and emotion. The dataset is built in two stages: first, an LLM rewrites DialogSum scripts with Switchboard-style fillers and back-channels and tags each utterance with emotion, pitch, and speaking rate; second, an expressive TTS engine synthesizes speech from the tagged scripts, keeping the audio aligned with the paralinguistic labels. Spoken DialogSum comprises 13,460 emotion-diverse dialogues, each paired with both a factual and an emotion-focused summary. The dataset is available online at https://fatfat-emosum.github.io/EmoDialog-Sum-Audio-Samples/. Baseline experiments show that an Audio-LLM raises emotional-summary ROUGE-L by 28% relative to a cascaded ASR-LLM system, confirming the value of end-to-end speech modeling.
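To make the corpus structure concrete, the sketch below shows one way a single Spoken DialogSum example could be represented, with audio, both summaries, and utterance-level paralinguistic tags grouped together. All field names and values here are illustrative assumptions for exposition; the released corpus defines its own schema.

```python
# Minimal sketch of one Spoken DialogSum example (illustrative schema, not the official one).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Utterance:
    speaker: str        # e.g. "A" or "B"
    text: str           # rewritten script with Switchboard-style fillers/back-channels
    emotion: str        # utterance-level emotion tag, e.g. "happy"
    pitch: str          # coarse pitch tag, e.g. "high" / "mid" / "low"
    speaking_rate: str  # coarse rate tag, e.g. "fast" / "normal" / "slow"
    age: str            # speaker age group label
    gender: str         # speaker gender label
    audio_path: str     # path to the synthesized utterance audio

@dataclass
class DialogueExample:
    dialogue_id: str
    utterances: List[Utterance] = field(default_factory=list)
    factual_summary: str = ""
    emotional_summary: str = ""

# Purely hypothetical record for illustration:
example = DialogueExample(
    dialogue_id="dlg_0001",
    utterances=[
        Utterance("A", "Um, so how did the interview go?", "curious",
                  "mid", "normal", "adult", "female", "dlg_0001_utt01.wav"),
    ],
    factual_summary="Speaker A asks Speaker B about a job interview.",
    emotional_summary="A curious Speaker A eagerly asks B how the interview went.",
)
```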