Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with meaning. We propose that $n$-grams composed of random character sequences, or \textit{garble}, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly generated character $n$-grams lack meaning but carry primitive information determined by the distribution of characters they contain. By studying the CharacterBERT embeddings of a large corpus of garble, extant language, and pseudowords, we identify an axis in the model's high-dimensional embedding space that separates these classes of $n$-grams. Furthermore, we show that this axis relates to structure within extant language, including word part-of-speech, morphology, and concept concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that meaning and primitive information are intrinsically linked.
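As a concrete illustration of this setup, the following is a minimal sketch, not the authors' pipeline: it samples random character $n$-grams (garble), uses a placeholder embed() function standing in for a CharacterBERT encoder, and estimates a separating axis as the normalized difference between the class centroids of extant words and garble. The word list, vector dimension, and centroid-difference axis are illustrative assumptions.

```python
# Minimal sketch (illustrative only): random "garble" n-grams, placeholder embeddings,
# and a simple candidate axis separating garble from extant words in embedding space.
import random
import string

import numpy as np


def random_ngram(min_len: int = 3, max_len: int = 10) -> str:
    """Sample a garble token: a uniformly random lowercase character sequence."""
    length = random.randint(min_len, max_len)
    return "".join(random.choices(string.ascii_lowercase, k=length))


def embed(tokens: list[str]) -> np.ndarray:
    """Hypothetical placeholder for a character-level encoder such as CharacterBERT.
    Returns one vector per token; random here purely so the sketch runs end to end."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), 768))


garble = [random_ngram() for _ in range(1000)]          # meaningless n-grams
words = ["language", "meaning", "character", "corpus"]  # tiny extant-language sample

E_garble, E_words = embed(garble), embed(words)

# One simple candidate for a separating axis: the normalized difference of the class
# centroids. Projecting any token's embedding onto it scores where that token falls
# between the garble-like and word-like regions of the space.
axis = E_words.mean(axis=0) - E_garble.mean(axis=0)
axis /= np.linalg.norm(axis)

scores = np.vstack([E_garble, E_words]) @ axis
```

With real CharacterBERT embeddings in place of the placeholder, the projections of garble, pseudowords, and extant words onto such an axis could be compared directly, which is the kind of separation the abstract describes.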