Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, they carry no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Overall, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
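To make the sequential-testing idea concrete, the following is a minimal sketch of a generic e-process monitor. It is not the construction used by e-valuator. It assumes, purely for illustration, that verifier scores are independent across steps and Gaussian with known parameters for successful and unsuccessful trajectories (the hypothetical constants `SUCCESS` and `FAILURE`). The null hypothesis is that the trajectory is successful, the e-process is the running product of likelihood ratios, and a trajectory is flagged once the e-process reaches 1/alpha. Under these assumptions Ville's inequality bounds the probability of ever flagging a successful trajectory by alpha, no matter how many steps are monitored.

```python
import math
import random

# Hypothetical (mean, std) of verifier scores; illustrative assumptions only.
SUCCESS = (0.8, 0.15)  # score distribution under the null: trajectory succeeds
FAILURE = (0.5, 0.15)  # score distribution under the alternative: trajectory fails


def gaussian_log_pdf(x, mean, std):
    """Log density of a normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def monitor(scores, alpha=0.05):
    """Flag a trajectory once the e-process reaches 1/alpha.

    Returns (stop_step, log_e_values); stop_step is None if never flagged.
    The e-process is accumulated in log space for numerical stability.
    """
    log_threshold = math.log(1.0 / alpha)
    log_e = 0.0  # e-process starts at 1
    log_e_values = []
    for step, score in enumerate(scores, start=1):
        # Likelihood ratio of "failure" to "success" for this step's score.
        log_e += gaussian_log_pdf(score, *FAILURE) - gaussian_log_pdf(score, *SUCCESS)
        log_e_values.append(log_e)
        if log_e >= log_threshold:
            return step, log_e_values  # terminate the trajectory early
    return None, log_e_values


def false_alarm_rate(n_trajectories=2000, n_steps=20, alpha=0.05, seed=0):
    """Fraction of simulated successful trajectories that get flagged."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_trajectories):
        scores = [rng.gauss(*SUCCESS) for _ in range(n_steps)]
        stop_step, _history = monitor(scores, alpha=alpha)
        flagged += stop_step is not None
    return flagged / n_trajectories
```

In practice the score distributions are unknown, so a real method would need to estimate the likelihood ratio (or another valid e-value) from calibration data; the known-Gaussian setup above only illustrates why thresholding at 1/alpha gives validity at every step.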