We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated by language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework--paired with significance tests--for evaluating the fit of language models to certain statistical tendencies of natural language. We find that neural language models appear to learn only a subset of the statistical tendencies considered, but they align much more closely with empirical trends than with theoretical laws (when such laws exist). Further, the fit to different distributions depends on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type--token relationship of natural language than text produced using standard ancestral sampling; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.
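For readers unfamiliar with the two generation strategies compared above, the following is a minimal sketch of nucleus (top-p) sampling: instead of drawing the next token from the full model distribution (as ancestral sampling does), it restricts sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p. The toy probability vector and threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability mass is at least p (nucleus / top-p sampling)."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]        # token indices, descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()  # renormalize over the nucleus
    return int(rng.choice(nucleus, p=renorm))

# Illustrative distribution: with p=0.8, only the two most
# probable tokens (indices 0 and 1) form the nucleus.
probs = np.array([0.5, 0.3, 0.1, 0.1])
token = nucleus_sample(probs, p=0.8)
```

Ancestral sampling corresponds to `rng.choice(len(probs), p=probs)` with no truncation; the truncation is what changes the statistics of the generated text.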