We study the learnability of languages in the Next Symbol Prediction (NSP) setting, where a learner receives only positive examples from a language together with, for every prefix, (i) whether the prefix itself is in the language and (ii) which next symbols can lead to a string in the language. This setting has been used in prior work to empirically analyze neural sequence models, and we further observe that efficient algorithms for the NSP setting can be used to learn the (truncated) support of language models. We formalize the setting so as to make it amenable to PAC-learning analysis. While the setting provides a much richer set of labels than the conventional classification setting, we show that learning concept classes such as DFAs and Boolean formulas remains computationally hard. The proof is via a construction that renders almost all of the additional labels uninformative, yielding a reduction from the conventional learning problem to learning with NSP labels. Under cryptographic assumptions, the reduction implies that learning DFAs is computationally hard in the NSP setting.
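To make the label structure concrete, the following is a minimal illustrative sketch (not taken from the paper) of how NSP-style labels could be generated for the prefixes of a positive example, using a toy DFA for the assumed example language a*b* over the alphabet {a, b}; the state names and the language itself are assumptions chosen for illustration.

```python
# Illustrative sketch: NSP-style labels for prefixes of a positive example,
# using a toy DFA for the language a*b* (no 'a' may follow a 'b').
# The language, state names, and helper functions are assumptions for illustration.

ALPHABET = "ab"
START, ONLY_B, DEAD = 0, 1, 2          # hypothetical state names
ACCEPTING = {START, ONLY_B}
TRANSITIONS = {
    (START, "a"): START, (START, "b"): ONLY_B,
    (ONLY_B, "a"): DEAD, (ONLY_B, "b"): ONLY_B,
    (DEAD, "a"): DEAD,   (DEAD, "b"): DEAD,
}

def run(state, string):
    """Run the DFA from `state` over `string` and return the final state."""
    for ch in string:
        state = TRANSITIONS[(state, ch)]
    return state

def can_reach_accept(state):
    """True iff some continuation leads from `state` to an accepting state."""
    return state != DEAD   # in this toy DFA, only DEAD is a rejecting sink

def nsp_labels(word):
    """For every prefix of `word`, return (prefix, in-language?, viable next symbols)."""
    labels = []
    for i in range(len(word) + 1):
        prefix = word[:i]
        state = run(START, prefix)
        member = state in ACCEPTING
        next_syms = {c for c in ALPHABET if can_reach_accept(TRANSITIONS[(state, c)])}
        labels.append((prefix, member, next_syms))
    return labels

if __name__ == "__main__":
    # For the positive example "aab", every prefix gets a membership bit and
    # the set of next symbols that can still lead to a string in the language.
    for prefix, member, next_syms in nsp_labels("aab"):
        print(f"{prefix!r:6} in-language={member} next={sorted(next_syms)}")
```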