Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument recognition module is optional: human users can instead specify the target instruments directly, which makes Jointist a flexible, user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 dB SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt. when utilizing transcription results obtained from Jointist.
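The data flow described above can be illustrated with a minimal sketch. It only shows how the three modules are wired together: the function names (`run_pipeline`, `toy_recognize`, `toy_transcribe`, `toy_separate`), the type aliases, and the toy stand-in modules are all hypothetical placeholders introduced here for illustration, not the networks or the API used in Jointist.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type aliases for this sketch (not taken from the paper).
Audio = List[float]          # mono waveform samples
PianoRoll = List[List[int]]  # frames x pitches, binary note activations


def run_pipeline(
    audio: Audio,
    recognize: Callable[[Audio], List[str]],
    transcribe: Callable[[Audio, str], PianoRoll],
    separate: Callable[[Audio, str, PianoRoll], Audio],
    user_instruments: Optional[List[str]] = None,
) -> Dict[str, Dict[str, object]]:
    """Wire recognition, transcription, and separation as the abstract describes."""
    # The recognition module is optional: a user-supplied list replaces it.
    if user_instruments is not None:
        instruments = user_instruments
    else:
        instruments = recognize(audio)

    results: Dict[str, Dict[str, object]] = {}
    for name in instruments:
        # Transcription is conditioned on the instrument.
        roll = transcribe(audio, name)
        # Separation uses both the instrument and the transcription result.
        stem = separate(audio, name, roll)
        results[name] = {"piano_roll": roll, "stem": stem}
    return results


# Toy stand-ins so the sketch runs; real modules would be trained networks.
def toy_recognize(audio: Audio) -> List[str]:
    return ["piano", "bass"]


def toy_transcribe(audio: Audio, name: str) -> PianoRoll:
    # Empty roll: one frame per sample, 88 pitches (an arbitrary choice here).
    return [[0] * 88 for _ in audio]


def toy_separate(audio: Audio, name: str, roll: PianoRoll) -> Audio:
    # Placeholder "separation": halve the mixture.
    return [0.5 * sample for sample in audio]


if __name__ == "__main__":
    mixture = [0.0, 1.0, -1.0, 0.5]
    automatic = run_pipeline(mixture, toy_recognize, toy_transcribe, toy_separate)
    manual = run_pipeline(
        mixture, toy_recognize, toy_transcribe, toy_separate,
        user_instruments=["violin"],
    )
    print(sorted(automatic), sorted(manual))
```

Passing `user_instruments` bypasses the recognition step entirely, which mirrors the user-controllable behavior described in the abstract. The joint training of the transcription and separation modules is outside the scope of this sketch.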
