With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS (singing voice synthesis) pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, tailored to different needs in terms of latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, featuring modular, open-source code that supports customization and extension. Demo: https://huggingface.co/spaces/espnet/SingingSDS. Code: https://github.com/SingingSDS/SingingSDS.
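The cascaded ASR-LLM-SVS pipeline described above can be sketched as three pluggable stages chained in sequence. This is a minimal illustrative sketch, not the actual SingingSDS implementation: all function names (`asr_transcribe`, `llm_respond`, `svs_sing`, `dialogue_turn`) and their stub bodies are hypothetical stand-ins for the configurable ASR, LLM, and SVS backends the system supports.

```python
# Illustrative sketch of one cascaded spoken-to-singing dialogue turn.
# Every function here is a hypothetical stub standing in for a pluggable
# backend (ASR model, LLM, SVS model); names are not from the real codebase.

def asr_transcribe(audio: bytes) -> str:
    """Stub ASR stage: convert user speech audio to text."""
    return "hello, how are you today?"

def llm_respond(text: str, persona: str) -> str:
    """Stub LLM stage: generate a persona-conditioned reply as lyrics."""
    return f"As {persona}, I sing: I'm doing well, thank you!"

def svs_sing(lyrics: str, melody: str) -> bytes:
    """Stub SVS stage: render lyrics as singing audio for a given melody."""
    return f"<sung|{melody}|{lyrics}>".encode()

def dialogue_turn(audio: bytes, persona: str = "bard",
                  melody: str = "default-melody") -> bytes:
    """One cascaded turn: ASR -> LLM -> SVS, with configurable
    persona and melody source, as in the modular design described above."""
    text = asr_transcribe(audio)        # speech -> text
    lyrics = llm_respond(text, persona) # text -> persona reply
    return svs_sing(lyrics, melody)     # reply -> singing audio

output = dialogue_turn(b"\x00\x01", persona="bard")
```

Because each stage only exchanges text (or audio bytes) with its neighbors, any stage can be swapped for a different backend without touching the others, which is the property that lets the system trade off latency, quality, and musical style.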