Recent NLP tasks have benefited greatly from pre-trained language models (LMs), since they are able to encode knowledge of various aspects. However, current LM evaluations focus on downstream performance and therefore fail to comprehensively inspect in which aspects, and to what extent, these models have encoded knowledge. This paper addresses both questions by proposing four tasks on syntactic, semantic, commonsense, and factual knowledge, aggregating to a total of $39{,}308$ questions covering both linguistic and world knowledge in Chinese. Through our experiments, the probes and knowledge data prove to be a reliable benchmark for evaluating pre-trained Chinese LMs. Our work is publicly available at https://github.com/ZhiruoWang/ChnEval.