Real-time data processing applications with low latency requirements have led to the increasing popularity of stream processing systems. While such systems offer convenient APIs that can be used to achieve data parallelism automatically, they offer limited support for computations that require synchronization between parallel nodes. In this paper we propose \emph{dependency-guided synchronization (DGS)}, an alternative programming model and stream processing API for stateful streaming computations with complex synchronization requirements. In a nutshell, our API views the input as partially ordered, and a program consists of a set of parallelization constructs that are applied to decompose the partial order and process events independently. Our API maps to an execution model called \emph{synchronization plans}, which supports synchronization between parallel nodes. Our evaluation shows that the APIs offered by two widely used systems (Flink and Timely Dataflow) cannot suitably expose parallelism in some representative applications. In contrast, DGS enables implementations that scale automatically, the resulting synchronization plans offer throughput improvements when implemented in existing systems, and the programming overhead is small compared to writing sequential code.