Development of language proficiency models for non-native learners has been an active area of interest in NLP research for the past few years. Although language proficiency is multidimensional in nature, existing research typically considers a single "overall proficiency" while building models. Further, existing approaches also consider only one language at a time. This paper describes our experiments and observations on the role of pre-trained and fine-tuned multilingual embeddings in performing multi-dimensional, multilingual language proficiency classification. We report experiments with three languages -- German, Italian, and Czech -- and model seven dimensions of proficiency ranging from vocabulary control to sociolinguistic appropriateness. Our results indicate that while fine-tuned embeddings are useful for multilingual proficiency modeling, no single feature set achieves the best performance consistently across all dimensions of language proficiency. All code, data, and related supplementary material can be found at: https://github.com/nishkalavallabhi/MultidimCEFRScoring.
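The setup described above — one classifier per proficiency dimension over a shared text representation — can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: random vectors stand in for pre-trained multilingual embeddings (e.g. mBERT outputs), the seven dimension names are assumptions for illustration, and a simple nearest-centroid rule stands in for whatever classifiers the experiments actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pre-trained multilingual embeddings: each learner
# text is a fixed 768-dim vector. Real features would come from a model such
# as multilingual BERT, fine-tuned or frozen.
N_TEXTS, DIM = 60, 768
X = rng.normal(size=(N_TEXTS, DIM))

# Seven proficiency dimensions (names assumed for illustration), each labelled
# on a 6-point CEFR-style scale (0=A1 .. 5=C2). Labels here are random.
DIMENSIONS = ["vocabulary control", "grammatical accuracy", "orthography",
              "coherence", "range", "fluency", "sociolinguistic appropriateness"]
y = rng.integers(0, 6, size=(N_TEXTS, len(DIMENSIONS)))

def nearest_centroid_fit(X, y):
    """Per-class mean embedding: a minimal classifier over frozen features."""
    labels = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in labels])
    return labels, centroids

def nearest_centroid_predict(model, X):
    labels, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return labels[dists.argmin(axis=1)]

# One independent classifier per proficiency dimension, as in the
# multi-dimensional formulation of the task.
models = {name: nearest_centroid_fit(X, y[:, i])
          for i, name in enumerate(DIMENSIONS)}
preds = {name: nearest_centroid_predict(m, X) for name, m in models.items()}
print(preds["vocabulary control"][:5])
```

Because the classifiers are independent per dimension, the comparison the abstract reports (which feature set wins on which dimension) falls out naturally: the same embedding matrix can be swapped for any other feature representation without touching the per-dimension training loop.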