The softmax-contaminated mixture of experts (MoE) model arises when a large-scale pre-trained model, playing the role of a fixed expert, is fine-tuned for downstream tasks by adding a new contamination component, or prompt, that functions as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In this paper, we study the convergence rates of the maximum likelihood estimator of the gating and prompt parameters in order to gain insight into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires knowledge that overlaps with the pre-trained model, in a sense that we make precise by formulating a novel analytic notion of distinguishability. When the pre-trained and prompt models are distinguishable, we derive minimax optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower owing to their dependence on the rate at which the prompt converges to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
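For concreteness, a minimal sketch of a two-component softmax-contaminated MoE density is given below. The notation is an illustrative assumption rather than taken from the paper: $h_0$ denotes the frozen pre-trained expert, $h(\cdot;\eta)$ the trainable prompt expert with prompt parameters $\eta$, $f(\cdot \mid \mu, \sigma^2)$ a generic expert density with location $\mu$ and scale $\sigma^2$, and $(\beta_0, \beta_1)$ the gating parameters.
\[
p_{(\beta_0,\beta_1,\eta)}(y \mid x)
= \frac{\exp(\beta_1^{\top} x + \beta_0)}{1 + \exp(\beta_1^{\top} x + \beta_0)}\, f\!\left(y \mid h(x;\eta),\, \sigma^2\right)
+ \frac{1}{1 + \exp(\beta_1^{\top} x + \beta_0)}\, f\!\left(y \mid h_0(x),\, \sigma_0^2\right).
\]
In this sketch, the second component is the fixed pre-trained expert, while the gating parameters $(\beta_0,\beta_1)$ and prompt parameters $\eta$ are estimated by maximum likelihood; the distinguishability condition governs how well $h(\cdot;\eta)$ can be separated from $h_0$, and hence how fast these estimators converge.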