Many research areas rely on data from the web to gain insights and test their methods. However, collecting comprehensive research datasets often demands manually reviewing many web pages to identify and record relevant data points, which is labor-intensive and error-prone. Although large language model (LLM)-powered web agents have begun to automate parts of this process, they often struggle to ensure the validity of the data they collect. These agents exhibit several recurring failure modes, including hallucinating or omitting values, misinterpreting page semantics, and failing to detect invalid information, all of which are subtle and difficult to detect and correct manually. To address this, we introduce the AI Committee, a novel model-agnostic multi-agent system that automates the validation and remediation of web-sourced datasets. Each agent specializes in a distinct task in the data quality assurance pipeline, from source scrutiny and fact-checking to data remediation and integrity validation. The AI Committee leverages several LLM capabilities (in-context learning for dataset adaptation, chain-of-thought reasoning for complex semantic validation, and a self-correction loop for data remediation), all without task-specific training. We demonstrate the effectiveness of our system by applying it to three real-world datasets, showing that it generalizes across LLMs and significantly outperforms baseline approaches, achieving data completeness of up to 78.7% and precision of up to 100%. We additionally conduct an ablation study demonstrating the contribution of each agent to the Committee's performance. This work is released as an open-source tool for the research community.
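To make the self-correction loop for data remediation more concrete, the following is a minimal sketch of one way such a loop could be structured; it is not the AI Committee's actual implementation. All names here (`self_correct`, `RemediationResult`, `toy_validate`, `toy_remediate`, `max_rounds`) are hypothetical, and the validator and remediator are plain rule-based functions standing in for the LLM-backed agents the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical types: a record maps field names to (possibly missing) values.
Record = dict[str, Optional[str]]
Validator = Callable[[Record], list[str]]            # returns a list of issues
Remediator = Callable[[Record, list[str]], Record]   # proposes a fixed record


@dataclass
class RemediationResult:
    record: Record
    issues: list[str] = field(default_factory=list)  # issues left unresolved
    rounds: int = 0                                   # remediation attempts made

    @property
    def valid(self) -> bool:
        return not self.issues


def self_correct(record: Record, validate: Validator, remediate: Remediator,
                 max_rounds: int = 3) -> RemediationResult:
    """Alternate validation and remediation until the record passes
    or the round budget (an assumed safeguard) is exhausted."""
    current = dict(record)  # work on a copy; never mutate the caller's data
    issues = validate(current)
    rounds = 0
    while issues and rounds < max_rounds:
        current = remediate(current, issues)
        issues = validate(current)
        rounds += 1
    return RemediationResult(current, issues, rounds)


# Toy stand-ins for the agents (assumptions for illustration only).
def toy_validate(record: Record) -> list[str]:
    year = record.get("year")
    if not year:
        return ["year: missing"]
    if not year.isdigit():
        return ["year: not numeric"]
    return []


def toy_remediate(record: Record, issues: list[str]) -> Record:
    fixed = dict(record)
    year = fixed.get("year")
    if year:
        # Keep only the digits; a real agent would re-consult the source page.
        digits = "".join(ch for ch in year if ch.isdigit())
        fixed["year"] = digits or None
    return fixed


if __name__ == "__main__":
    result = self_correct({"name": "X", "year": "c. 2021"},
                          toy_validate, toy_remediate)
    print(result.valid, result.rounds, result.record)
```

In this sketch, the loop returns any unresolved issues instead of silently accepting a record it could not repair, so a record that still fails after the round budget can be flagged for manual review.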