Forecasting is not only a fundamental intellectual pursuit but also of significant importance to societal systems such as finance and economics. The rapid advance of large language models (LLMs) trained on Internet-scale data raises the promise of employing LLMs to forecast real-world future events, an emerging paradigm we call "LLM-as-a-Prophet". This paper systematically investigates this predictive intelligence of LLMs. To this end, we build Prophet Arena, a general evaluation benchmark that continuously collects live forecasting tasks and decomposes each task into distinct pipeline stages, in order to support controlled, large-scale experimentation. Our comprehensive evaluation reveals that many LLMs already exhibit impressive forecasting capabilities, reflected in, e.g., their small calibration errors, consistent prediction confidence, and promising market returns. However, we also uncover key bottlenecks on the path to superior predictive intelligence via LLM-as-a-Prophet, such as LLMs' inaccurate event recall, misunderstanding of data sources, and slower information aggregation compared to markets as resolution nears.