Operon: Incremental Construction of Ragged Data via Named Dimensions

Modern data processing workflows frequently encounter ragged data: collections with variable-length elements that arise naturally in domains like natural language processing, scientific measurements, and autonomous AI agents. Existing workflow engines lack native support for tracking the shapes and dependencies inherent to ragged data, forcing users to manage complex indexing and dependency bookkeeping manually. We present Operon, a Rust-based workflow engine that addresses these challenges through a novel formalism of named dimensions with explicit dependency relations. Operon provides a domain-specific language in which users declare pipelines with dimension annotations that are statically verified for correctness, while the runtime system dynamically schedules tasks as data shapes are incrementally discovered during execution. We formalize the mathematical foundation for reasoning about partial shapes and prove that Operon's incremental construction algorithm guarantees deterministic and confluent execution in parallel settings. The system's explicit modeling of partially known states enables robust persistence and recovery mechanisms, while its per-task multi-queue architecture achieves efficient parallelism across heterogeneous task types. Empirical evaluation shows that Operon reduces baseline overhead by 14.94x relative to an existing workflow engine while maintaining near-linear end-to-end output rates as workloads scale, making it particularly suitable for large-scale data generation pipelines in machine learning applications.
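
The idea of a partial shape that is filled in as extents are discovered can be illustrated with a minimal Rust sketch. This is not Operon's actual API or DSL, neither of which is shown in the abstract: the type `PartialShape`, the methods `discover` and `known_elements`, and the dimension names `doc` and `sent` are all hypothetical, and the sketch assumes a simple chain of dimensions in which each extent depends on the indices of the dimensions before it.

```rust
use std::collections::BTreeMap;

/// A partially known ragged shape over a chain of named dimensions, where each
/// dimension's extent depends on the indices of the dimensions before it.
/// Hypothetical type for illustration only; not Operon's actual API.
#[derive(Debug, PartialEq)]
struct PartialShape {
    dims: Vec<&'static str>,
    /// Index prefix of length k -> extent of dims[k] under that prefix.
    extents: BTreeMap<Vec<usize>, usize>,
}

impl PartialShape {
    fn new(dims: Vec<&'static str>) -> Self {
        PartialShape { dims, extents: BTreeMap::new() }
    }

    /// Record a newly discovered extent and return the index paths that
    /// become addressable, i.e. the work items a scheduler could now enqueue.
    fn discover(&mut self, prefix: &[usize], len: usize) -> Vec<Vec<usize>> {
        assert!(prefix.len() < self.dims.len(), "prefix too deep");
        let previous = self.extents.insert(prefix.to_vec(), len);
        assert!(previous.is_none(), "extent already known for this prefix");
        (0..len)
            .map(|i| {
                let mut path = prefix.to_vec();
                path.push(i);
                path
            })
            .collect()
    }

    /// Fully indexed elements whose innermost extent is known so far,
    /// in lexicographic order.
    fn known_elements(&self) -> Vec<Vec<usize>> {
        let depth = self.dims.len();
        let mut out = Vec::new();
        for (prefix, &len) in &self.extents {
            if prefix.len() + 1 == depth {
                for i in 0..len {
                    let mut path = prefix.clone();
                    path.push(i);
                    out.push(path);
                }
            }
        }
        out
    }
}

fn main() {
    // Order A: document count first, then sentence counts for doc 0 and doc 1.
    let mut a = PartialShape::new(vec!["doc", "sent"]);
    a.discover(&[], 2);
    a.discover(&[0], 3);
    a.discover(&[1], 1);

    // Order B: the same facts, but doc 1 reports its length before doc 0.
    let mut b = PartialShape::new(vec!["doc", "sent"]);
    b.discover(&[], 2);
    b.discover(&[1], 1);
    b.discover(&[0], 3);

    // Both discovery orders converge to the same shape.
    assert_eq!(a, b);
    println!("{:?}", a.known_elements());
}
```

In this toy model, the final shape depends only on which extents were discovered, not on the order in which they arrived, which is the flavor of the confluence property the abstract claims for the real algorithm. The sketch does not attempt static verification, persistence, or scheduling.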