Standardising the NLP Workflow: A Framework for Reproducible Linguistic Analysis

Yves Pauli, Jan-Bernard Marsman, Finn Rabe, Victoria Edkins, Roya Hüppi, Silvia Ciampelli, Akhil Ratan Misra, Nils Lang, Wolfram Hinzen, Iris Sommer, Philipp Homan

From arXiv, 26 pages, 3 figures

The introduction of large language models and other influential developments in AI-based language processing have led to an evolution in the methods available to quantitatively analyse language data. With the resultant growth of attention on language processing, significant challenges have emerged, including the lack of standardisation in organising and sharing linguistic data and the absence of standardised and reproducible processing methodologies. Striving for future standardisation, we first propose the Language Processing Data Structure (LPDS), a data structure inspired by the Brain Imaging Data Structure (BIDS), a widely adopted standard for handling neuroscience data. It provides a folder structure and file naming conventions for linguistic research. Second, we introduce pelican nlp, a modular and extensible Python package designed to enable streamlined language processing, from initial data cleaning and task-specific preprocessing to the extraction of sophisticated linguistic and acoustic features, such as semantic embeddings and prosodic metrics. The entire processing workflow can be specified within a single, shareable configuration file, which pelican nlp then executes on LPDS-formatted data. Depending on the specifications, the reproducible output can consist of preprocessed language data or standardised extraction of both linguistic and acoustic features and corresponding result aggregations. LPDS and pelican nlp collectively offer an end-to-end processing pipeline for linguistic data, designed to ensure methodological transparency and enhance reproducibility.
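
The abstract says LPDS takes its folder structure and file naming conventions from BIDS, but it does not spell out the rules. As a rough illustration of what a BIDS-style convention looks like, the sketch below builds and parses file names made of key-value entities joined by underscores. The entity names (sub, ses, task), the "transcript" suffix, and the helper functions build_path and parse_name are all assumptions borrowed from BIDS practice for illustration; they are not taken from the LPDS specification or from the pelican nlp API.

```python
from pathlib import PurePosixPath

# Hypothetical entity order, borrowed from BIDS conventions; the abstract
# does not list the entities LPDS actually uses.
ENTITY_ORDER = ("sub", "ses", "task")


def build_path(entities: dict[str, str], suffix: str, ext: str) -> PurePosixPath:
    """Build a BIDS-style relative path from key-value entities."""
    for key, value in entities.items():
        if key not in ENTITY_ORDER:
            raise ValueError(f"unknown entity: {key}")
        if not value.isalnum():
            raise ValueError(f"entity value must be alphanumeric: {value!r}")
    if "sub" not in entities:
        raise ValueError("the 'sub' entity is required")
    # File name: entities in a fixed order, then the suffix, then the extension.
    parts = [f"{k}-{entities[k]}" for k in ENTITY_ORDER if k in entities]
    filename = "_".join(parts + [suffix]) + ext
    # Folder levels: one per subject and, if present, one per session.
    folders = [f"{k}-{entities[k]}" for k in ("sub", "ses") if k in entities]
    return PurePosixPath(*folders, filename)


def parse_name(filename: str) -> tuple[dict[str, str], str]:
    """Split a BIDS-style file name back into its entities and suffix."""
    stem = filename.split(".", 1)[0]
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return entities, suffix


path = build_path({"sub": "01", "ses": "01", "task": "interview"}, "transcript", ".txt")
print(path)
print(parse_name(path.name))
```

Because every file name carries its own subject, session, and task labels, a processing tool can locate and group files without a separate index, which is the property that makes a single shareable configuration file workable on a standardised dataset.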
