Audio classification plays an essential role in sentiment analysis and emotion recognition, especially for analyzing customer attitudes in marketing phone calls. Efficiently categorizing customer purchasing propensity from large volumes of audio data remains challenging. In this work, we propose a novel Multi-Segment Multi-Task Fusion Network (MSMT-FN) designed specifically to address this business need. Evaluations on our proprietary MarketCalls dataset, as well as on established benchmarks (CMU-MOSI, CMU-MOSEI, and MELD), show that MSMT-FN consistently outperforms or matches state-of-the-art methods. Additionally, our newly curated MarketCalls dataset will be available upon request, and the code base is accessible in the GitHub repository MSMT-FN, to facilitate further research and advances in the audio classification domain.