Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training

Language tasks involving character-level manipulations (e.g., spelling correction, many word games) are challenging for models based on subword tokenization. To address this, we adapt the interchange intervention training method of Geiger et al. (2021) to operate on type-level variables over characters. This allows us to encode robust, position-independent character-level information in the internal representations of subword-based models. We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context. While simple character-level tokenization approaches still perform best on purely form-based tasks like string reversal, our method is superior for more complex tasks that blend form, meaning, and context, such as spelling correction in context and word search games. Our approach also leads to subword-based models with human-interpretable internal representations of characters.
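To make the idea of an interchange intervention concrete, the following is a minimal sketch on a toy high-level causal model for string reversal, with one variable per character. It is an illustration under stated assumptions, not the authors' implementation: the function names (`to_char_variables`, `reverse_from_variables`, `interchange_intervention`) are hypothetical, and the sketch only shows how a counterfactual output is obtained when character variables from a source input are swapped into a base input. In interchange intervention training, counterfactual outputs of this kind would serve as targets for a neural model whose aligned internal representations are swapped in the same way.

```python
from typing import Iterable, List


def to_char_variables(word: str) -> List[str]:
    """Toy causal model, step 1: one high-level variable per character."""
    return list(word)


def reverse_from_variables(chars: List[str]) -> str:
    """Toy causal model, step 2: compute the task output (string reversal)."""
    return "".join(reversed(chars))


def interchange_intervention(base: str, source: str, positions: Iterable[int]) -> str:
    """Run the toy causal model on `base`, but overwrite the character
    variables at `positions` with the values they take on `source`.

    Assumes `base` and `source` have the same length (hypothetical
    simplification for this sketch).
    """
    base_vars = to_char_variables(base)
    source_vars = to_char_variables(source)
    for i in positions:
        base_vars[i] = source_vars[i]
    return reverse_from_variables(base_vars)


if __name__ == "__main__":
    print(interchange_intervention("cat", "dog", []))      # tac
    print(interchange_intervention("cat", "dog", [0]))     # tad
    print(interchange_intervention("cat", "dog", [0, 2]))  # gad
```

With no positions swapped, the output is simply the reversal of the base input; swapping position 0 replaces "c" with "d" before the reversal is computed.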

