Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
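The decompose-then-verify evaluation loop described above can be sketched as follows. This is a toy illustration only: the claim extractor and verifier here are trivial rule-based stand-ins (hypothetical helpers, not the actual FactScore/VeriScore models, each of which would be a separate LLM call), and the final score is simply the fraction of claims supported by the retrieved evidence.

```python
def decompose(response: str) -> list[str]:
    """Mock claim decomposition: treat each sentence as one 'atomic claim'.
    (In the real pipelines, an LLM performs this decomposition.)"""
    return [s.strip() for s in response.split(".") if s.strip()]

def verify(claim: str, evidence: str) -> bool:
    """Mock verifier: a claim counts as 'supported' if it appears verbatim
    in the evidence. (The real verifier is an LLM judging noisy search results.)"""
    return claim.lower() in evidence.lower()

def pipeline_score(response: str, evidence: str) -> float:
    """VeriScore-style loop: decompose first, then verify every claim.
    Each claim normally incurs its own model call, which is the cost
    that a single-pass model like VeriFastScore amortizes."""
    claims = decompose(response)
    if not claims:
        return 0.0
    supported = sum(verify(c, evidence) for c in claims)
    return supported / len(claims)

evidence = "Paris is the capital of France. The Seine flows through Paris."
response = "Paris is the capital of France. Paris is in Spain"
print(pipeline_score(response, evidence))  # 0.5: one of the two claims is supported
```

The speedup claimed in the abstract comes from collapsing the per-claim calls inside `pipeline_score` into a single forward pass that emits all claims and their verdicts at once.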