E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they come with no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Overall, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
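To make the idea of an anytime-valid sequential test concrete, here is a minimal sketch of a generic likelihood-ratio e-process. It is not the authors' construction, which the abstract does not spell out. It assumes the null hypothesis is "the trajectory will succeed", that verifier scores are binned into three levels, and that the per-step score distributions `P_SUCCESS` and `P_FAILURE` are known; all of these names and numbers are hypothetical. Under those assumptions the running product of likelihood ratios is a nonnegative martingale under the null, so by Ville's inequality the chance that it ever reaches 1/alpha is at most alpha, no matter how long the trajectory runs.

```python
import random

# Hypothetical per-step distributions over three verifier-score bins
# (0 = low, 1 = mid, 2 = high). In practice they would have to be
# estimated, e.g. from held-out calibration trajectories.
P_SUCCESS = [0.1, 0.3, 0.6]  # null: the trajectory will succeed
P_FAILURE = [0.5, 0.3, 0.2]  # alternative: the trajectory will fail


def monitor(score_bins, alpha=0.05):
    """Return the first step (1-indexed) at which the trajectory is flagged,
    or None if the e-process never reaches the threshold 1/alpha."""
    e_value = 1.0
    threshold = 1.0 / alpha
    for step, b in enumerate(score_bins, start=1):
        # Multiply in the likelihood ratio of the newly observed score bin.
        e_value *= P_FAILURE[b] / P_SUCCESS[b]
        if e_value >= threshold:
            return step
    return None


def false_alarm_rate(n_traj=5000, horizon=50, alpha=0.05, seed=0):
    """Simulate successful trajectories and report the fraction flagged."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_traj):
        bins = rng.choices([0, 1, 2], weights=P_SUCCESS, k=horizon)
        if monitor(bins, alpha) is not None:
            flagged += 1
    return flagged / n_traj
```

For example, `monitor([0, 0])` flags at step 2, because two low scores multiply the e-value to 25, which exceeds 1/0.05 = 20. Calling `false_alarm_rate()` simulates trajectories drawn from the assumed success distribution; the flagged fraction should stay below `alpha` up to simulation noise, even though the test is checked at every step.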

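Related content

Hypothesis testing

Hypothesis testing is a method in inferential statistics for testing statistical hypotheses. A "statistical hypothesis" is a scientific hypothesis that can be tested by observing a model of a set of random variables. Once the unknown parameters can be estimated, one wants to draw appropriate inferences about their true values from the results. In statistics, a hypothesis about parameters is a statement about one or more parameters. The statement whose correctness is to be tested is the null hypothesis, which is usually chosen by the researcher and reflects the researcher's view of the unknown parameter. The other statements about the parameter, set against the null hypothesis, form the alternative hypothesis, which usually reflects a different (opposing) view of the parameter's possible values; in other words, the alternative hypothesis is usually what the researcher most wants to establish. Common hypothesis tests include the t-test, the Z-test, the chi-squared test, and the F-test.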
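As a small illustration of the null/alternative setup described above, the following sketch runs a two-sided one-sample Z-test. It assumes the population standard deviation `sigma` is known and the data are approximately normal; the function name and the numbers in the usage note are made up for illustration.

```python
import math


def one_sample_z_test(sample, mu0, sigma):
    """Two-sided Z-test of H0: mean == mu0 against H1: mean != mu0,
    assuming a known population standard deviation sigma."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With `sample = [6.0, 6.0, 6.0, 6.0]`, `mu0 = 5.0`, and `sigma = 1.0`, the statistic is z = 2 and the p-value is about 0.046, so the null hypothesis would be rejected at the 5% level but not at the 1% level.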
