We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher-quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper at constructing the patent dataset from our image corpus. Each page contains about 20 to 50 patent entries, arranged in a double-column format and printed in Gothic and Roman fonts. Given the font and layout complexity of this primary source material, we argue that multimodal LLMs represent a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barrier for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.
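To make the core idea concrete, the sketch below shows one way such an extraction step could look: a single multimodal call per page scan that asks a Gemini model to return the patent entries as structured JSON. This is not the authors' open-sourced pipeline; it is a minimal illustration assuming the `google-genai` Python SDK, an API key in `GEMINI_API_KEY`, and an illustrative prompt, field schema, and file path.

```python
# Minimal sketch (not the authors' pipeline): send one page scan to a
# multimodal Gemini model and parse the patent entries it returns.
# Assumes the `google-genai` SDK; prompt wording and output fields are
# illustrative only.
import json
import os
from pathlib import Path

from google import genai
from google.genai import types

PROMPT = (
    "The image is a double-column page from a German patent register "
    "(1877-1918), printed in Gothic and Roman fonts. Extract every patent "
    "entry and return a JSON array of objects with the fields "
    "'patent_number', 'title', 'patentee', and 'date'."
)

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])


def extract_entries(image_path: str) -> list[dict]:
    """Run one page scan through the model and parse the JSON it returns."""
    image_bytes = Path(image_path).read_bytes()
    response = client.models.generate_content(
        model="gemini-2.5-flash-lite",  # or "gemini-2.5-pro" for harder pages
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            PROMPT,
        ],
        config=types.GenerateContentConfig(
            # Ask the model for machine-readable output instead of prose.
            response_mime_type="application/json",
        ),
    )
    return json.loads(response.text)


if __name__ == "__main__":
    # Hypothetical input path; in practice this would loop over all scans.
    entries = extract_entries("scans/page_0001.png")
    print(f"extracted {len(entries)} entries from one page")
```

Looping such a call over an image corpus, with validation and retries, is the kind of batch workflow the abstract describes as being adaptable to other corpora via LLM-assisted coding tools.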


