We study the problem of language identification in the limit, where given a sequence of examples from a target language, the goal of the learner is to output a sequence of guesses for the target language such that all the guesses beyond some finite time are correct. Classical results of Gold showed that language identification in the limit is impossible for essentially any interesting collection of languages. Later, Angluin gave a precise characterization of language collections for which this task is possible. Motivated by recent positive results for the related problem of language generation, we revisit the classic language identification problem in the setting where the learner is given the additional power of producing a list of $k$ guesses at each time step. The goal is to ensure that beyond some finite time, one of the guesses is correct at each time step. We give an exact characterization of collections of languages that can be $k$-list identified in the limit, based on a recursive version of Angluin's characterization (for language identification with a list of size $1$). This further leads to a conceptually appealing characterization: A language collection can be $k$-list identified in the limit if and only if the collection can be decomposed into $k$ collections of languages, each of which can be identified in the limit (with a list of size $1$). We also use our characterization to establish rates for list identification in the statistical setting where the input is drawn as an i.i.d. stream from a distribution supported on some language in the collection. Our results show that if a collection is $k$-list identifiable in the limit, then the collection can be $k$-list identified at an exponential rate, and this is best possible. On the other hand, if a collection is not $k$-list identifiable in the limit, then it cannot be $k$-list identified at any rate that goes to zero.
翻译:我们研究极限语言识别问题,其中给定目标语言的示例序列,学习者的目标是输出对目标语言的一系列猜测,使得在某个有限时间之后的所有猜测均正确。Gold的经典结果表明,对于任何有意义的语言集合,极限语言识别本质上是不可能的。随后,Angluin对这一任务可行的语言集合给出了精确刻画。受近期语言生成相关问题的积极成果启发,我们在学习者被赋予额外能力——能够在每个时间步产生包含$k$个猜测的列表——这一设定下,重新审视经典的语言识别问题。其目标是确保在某个有限时间之后,每个时间步的猜测列表中至少有一个是正确的。基于Angluin刻画(针对列表大小为$1$的语言识别)的递归版本,我们给出了能够在极限下进行$k$列表识别的语言集合的精确刻画。这进一步导出了一个概念上引人入胜的刻画:一个语言集合能够在极限下进行$k$列表识别,当且仅当该集合可以分解为$k$个语言子集,每个子集均可在极限下(以列表大小$1$)被识别。我们还利用该刻画建立了统计设定下列表识别的速率,其中输入是从集合中某个语言支持的分布中独立同分布抽取的流。我们的结果表明,如果一个集合在极限下是$k$列表可识别的,则该集合能以指数速率进行$k$列表识别,且这是最优的。另一方面,如果一个集合在极限下不是$k$列表可识别的,则它无法以任何趋于零的速率进行$k$列表识别。