Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with the human signal. AutoMetrics thus converts expensive human measurements into interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can serve as a proxy reward with effectiveness equal to a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
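To make the composition step concrete, the following is a minimal sketch (not the authors' implementation) of how candidate metric scores might be combined via regression against a small set of human ratings and then checked with Kendall correlation. The array names and the choice of ridge regression are illustrative assumptions.

```python
# Minimal sketch, assuming: per-output scores from several candidate metrics
# (retrieved and/or LLM-as-a-Judge generated) and a small human-rating signal.
# `metric_scores` and `human_ratings` are hypothetical placeholders.
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# 80 outputs, each scored by 6 candidate metrics, plus one human rating each.
metric_scores = rng.normal(size=(80, 6))            # (n_outputs, n_metrics)
human_ratings = metric_scores @ rng.normal(size=6) + rng.normal(scale=0.5, size=80)

# Fit a regularized linear combination of metrics on the human signal.
composer = Ridge(alpha=1.0).fit(metric_scores, human_ratings)
composite = composer.predict(metric_scores)

# Kendall correlation between the learned composite metric and human ratings.
tau, _ = kendalltau(composite, human_ratings)
print(f"Kendall tau: {tau:.3f}")
print("metric weights:", np.round(composer.coef_, 3))  # interpretable weights
```

The learned weights make the composite metric inspectable: each coefficient indicates how much a retrieved or generated criterion contributes to matching the human signal.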