A wake-up word (WUW) is a short phrase used to activate a speech recognition system so that it can receive the user's speech input. WUW utterances carry not only the lexical information needed to wake the system but also non-lexical information such as speaker identity and emotion. In particular, recognizing the user's emotional state may enrich voice communication. However, few datasets label the emotional state of WUW utterances. In this paper, we introduce Hi, KIA, a new WUW dataset consisting of 488 Korean-accented emotional utterances collected from four male and four female speakers, each labeled with one of four emotional states: angry, happy, sad, or neutral. We present the step-by-step procedure used to build the dataset, covering scenario selection, post-processing, and human validation for label agreement. We also provide two classification models for WUW speech emotion recognition trained on the dataset. One is based on traditional hand-crafted features, and the other is a transfer-learning approach using a pre-trained neural network. These classification models can serve as benchmarks for further research.