How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to $V$. The soundness property of $V$ guarantees that, for every input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves the correctness of most of its outputs, while all incorrect outputs (of any model) are detected by $V$. We devise and analyze two generic methods for learning Self-Proving models: Transcript Learning (TL), which relies on access to transcripts of accepting interactions, and Reinforcement Learning from Verifier Feedback (RLVF), which trains a model by emulating interactions with the verifier.
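To make the setup concrete, the following is a minimal, self-contained Python sketch of a toy instance of the framework, not the paper's implementation: the task is computing $\gcd(a, b)$, and a proof of correctness is a pair of Bézout coefficients that the verifier checks. All names here (`verifier`, `honest_prover`, `rlvf_reward`) and the choice of task are illustrative assumptions; TL is sketched as dataset construction from accepting transcripts, and RLVF as a reward equal to verifier acceptance.

```python
# Hedged toy sketch of the Self-Proving setup; names and task are illustrative.
# Task: given (a, b), output g = gcd(a, b) and prove it with Bezout coefficients,
# so the verifier V can check correctness without trusting the model.

import random


def verifier(a: int, b: int, g: int, u: int, v: int) -> bool:
    """V accepts iff g divides both inputs and u*a + v*b == g.

    Soundness: any common divisor g of a and b that is also an integer
    combination of a and b must equal gcd(a, b), so no (u, v) can make V
    accept an incorrect output g.
    """
    return g > 0 and a % g == 0 and b % g == 0 and u * a + v * b == g


def honest_prover(a: int, b: int):
    """Extended Euclid: returns (g, u, v) with g = gcd(a, b) and u*a + v*b == g."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, old_u, old_v


# Transcript Learning (TL): fit a model on (input, accepting-transcript) pairs.
tl_dataset = []
for _ in range(1000):
    a, b = random.randint(1, 10**4), random.randint(1, 10**4)
    g, u, v = honest_prover(a, b)
    assert verifier(a, b, g, u, v)          # keep only accepting transcripts
    tl_dataset.append(((a, b), (g, u, v)))


# RLVF: sample the model's own (output, proof) and reward verifier acceptance.
def rlvf_reward(model_output, a: int, b: int) -> float:
    g, u, v = model_output
    return 1.0 if verifier(a, b, g, u, v) else 0.0
```

In this sketch, TL corresponds to supervised learning on `tl_dataset` (transcripts produced by an honest prover that the verifier accepts), while RLVF corresponds to reinforcement learning with `rlvf_reward` as the signal; by the soundness of `verifier`, an incorrect output never earns the reward.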