Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.
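The decompose-then-verify evaluation loop described above can be sketched as follows. This is a toy illustration only: the claim extractor and verifier here are trivial rule-based stand-ins (hypothetical helpers, not the actual FactScore/VeriScore models, each of which would be a separate LLM call), and the final score is simply the fraction of claims supported by the retrieved evidence.

```python
def decompose(response: str) -> list[str]:
    """Mock claim decomposition: treat each sentence as one 'atomic claim'.
    (In the real pipelines, an LLM performs this decomposition.)"""
    return [s.strip() for s in response.split(".") if s.strip()]

def verify(claim: str, evidence: str) -> bool:
    """Mock verifier: a claim counts as 'supported' if it appears verbatim
    in the evidence. (The real verifier is an LLM judging noisy search results.)"""
    return claim.lower() in evidence.lower()

def pipeline_score(response: str, evidence: str) -> float:
    """VeriScore-style loop: decompose first, then verify every claim.
    Each claim normally incurs its own model call, which is the cost
    that a single-pass model like VeriFastScore amortizes."""
    claims = decompose(response)
    if not claims:
        return 0.0
    supported = sum(verify(c, evidence) for c in claims)
    return supported / len(claims)

evidence = "Paris is the capital of France. The Seine flows through Paris."
response = "Paris is the capital of France. Paris is in Spain"
print(pipeline_score(response, evidence))  # 0.5: one of the two claims is supported
```

The speedup claimed in the abstract comes from collapsing the per-claim calls inside `pipeline_score` into a single forward pass that emits all claims and their verdicts at once.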