Large Causal Models from Large Language Models

We introduce a new paradigm for building large causal models (LCMs) that exploits the enormous potential latent in today's large language models (LLMs). We describe our ongoing experiments with an implemented system called DEMOCRITUS (Decentralized Extraction of Manifold Ontologies of Causal Relations Integrating Topos Universal Slices), which aims to build, organize, and visualize LCMs that span disparate domains and are extracted through carefully targeted textual queries to LLMs. DEMOCRITUS is methodologically distinct from traditional narrow-domain, hypothesis-centered causal inference, which builds causal models from experiments that produce numerical data. A high-quality LLM is used to propose topics, generate causal questions, and extract plausible causal statements from a diverse range of domains. The technical challenge is then to take these isolated, fragmented, potentially ambiguous, and possibly conflicting causal claims and weave them into a coherent whole by converting them into relational causal triples and embedding them into an LCM. Addressing this challenge required inventing new categorical machine learning methods, which we can only briefly summarize in this paper, as it focuses more on the systems side of building DEMOCRITUS. We describe the implementation pipeline for DEMOCRITUS, which comprises six modules, and examine its computational cost profile to determine where the current bottlenecks lie in scaling the system to larger models. We describe the results of using DEMOCRITUS over a wide range of domains, spanning archaeology, biology, climate change, economics, medicine, and technology. We discuss the limitations of the current DEMOCRITUS system and outline directions for extending its capabilities.
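To make the step from free-text causal claims to relational causal triples concrete, the following is a minimal sketch and not the DEMOCRITUS implementation. It assumes that causal statements arrive as short English sentences of the form "X <relation> Y"; the relation vocabulary `RELATIONS`, the `CausalTriple` type, and the functions `parse_statement` and `build_graph` are all hypothetical names introduced here for illustration. A simple regular expression stands in for whatever extraction method the actual system uses.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple


class CausalTriple(NamedTuple):
    """A relational causal triple: (cause, relation, effect)."""
    cause: str
    relation: str
    effect: str


# Hypothetical relation vocabulary (an assumption; the abstract does not list one).
RELATIONS = ("causes", "increases", "reduces", "leads to")

# Matches "<cause> <relation> <effect>" with an optional trailing period.
_PATTERN = re.compile(
    r"^(?P<cause>.+?)\s+(?P<relation>"
    + "|".join(re.escape(r) for r in RELATIONS)
    + r")\s+(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def parse_statement(statement: str) -> Optional[CausalTriple]:
    """Turn one causal sentence into a triple, or None if it does not match."""
    match = _PATTERN.match(statement.strip())
    if match is None:
        return None
    return CausalTriple(
        cause=match["cause"].strip().lower(),
        relation=match["relation"].lower(),
        effect=match["effect"].strip().lower(),
    )


def build_graph(statements: Iterable[str]) -> Dict[str, Set[Tuple[str, str]]]:
    """Collect triples into an adjacency map: cause -> {(relation, effect)}."""
    graph: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for statement in statements:
        triple = parse_statement(statement)
        if triple is not None:  # unparseable claims are skipped, not guessed
            graph[triple.cause].add((triple.relation, triple.effect))
    return dict(graph)
```

Calling `build_graph` on a list of statements returns a dictionary keyed by cause, so claims about the same cause collected from different queries end up side by side. Resolving ambiguity or conflict among them is left out of this sketch.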
