Language tasks involving character-level manipulations (e.g., spelling correction, many word games) are challenging for models based on subword tokenization. To address this, we adapt the interchange intervention training method of Geiger et al. (2021) to operate on type-level variables over characters. This allows us to encode robust, position-independent character-level information in the internal representations of subword-based models. We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context. While simple character-level tokenization approaches still perform best on purely form-based tasks like string reversal, our method is superior for more complex tasks that blend form, meaning, and context, such as spelling correction in context and word search games. Our approach also leads to subword-based models with human-interpretable internal representations of characters.