To conduct real-time analytics computations, big data stream processing engines are required to process unbounded data streams at millions of events per second. However, current streaming engines exhibit low throughput and high tuple processing latency. Performance engineering is complicated by the fact that streaming engines constitute complex distributed systems of multiple nodes in the cloud. A profiling technique is therefore required that can measure time durations across nodes with high accuracy. Standard clock synchronization techniques such as the network time protocol (NTP) are limited to millisecond accuracy and hence cannot be used. We propose a profiling technique that relates the time-stamp counters (TSCs) of nodes to measure the duration of events in a streaming framework. The precision of the TSC relation determines the accuracy of the measured durations. The TSC relation is established in quiescent periods of the network to achieve an accuracy in the tens of microseconds. We further propose a throughput-controlled data generator to reliably determine the sustainable throughput of a streaming engine. To facilitate high-throughput data ingestion, we propose a concurrent object factory that moves the deserialization overhead of incoming data tuples off the critical path of the streaming framework. The evaluation of the proposed techniques within the Apache Storm streaming framework on the Google Compute Engine public cloud shows that data ingestion increases from $700$ $\text{k}$ to $4.68$ $\text{M}$ tuples per second, and that time durations can be profiled at a measurement accuracy of $92$ $\mu\text{s}$, which is three orders of magnitude better than the accuracy of NTP and one order of magnitude better than prior work.
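To illustrate the cross-node timing idea summarized above, the following C sketch relates a remote node's TSC to the local one from request/response probes. This is a minimal sketch under stated assumptions, not the paper's implementation: the names (`read_tsc`, `relate_tsc`, `probe_t`) and the synthetic probe values are illustrative, the network probe exchange and the detection of quiescent periods are omitted, and both TSCs are assumed to tick at (nearly) the same rate over the measurement window.

```c
/*
 * Illustrative sketch: relate a remote node's TSC to the local TSC from
 * request/response probes taken during a quiescent network period.
 * All names and values are hypothetical; the probe exchange is omitted
 * and represented only by pre-recorded samples.
 */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() on x86 with GCC/Clang */

static inline uint64_t read_tsc(void) {
    return __rdtsc();           /* read the local time-stamp counter */
}

/* One probe: local TSC at send, remote TSC echoed back, local TSC at receive. */
typedef struct {
    uint64_t t_send;
    uint64_t t_remote;
    uint64_t t_recv;
} probe_t;

/*
 * Pick the probe with the smallest round trip (least network interference),
 * assume the remote sample was taken at the midpoint of that round trip,
 * and report half the round trip as the uncertainty of the offset.
 */
static int64_t relate_tsc(const probe_t *p, int n, uint64_t *uncertainty) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (p[i].t_recv - p[i].t_send < p[best].t_recv - p[best].t_send)
            best = i;
    uint64_t rtt = p[best].t_recv - p[best].t_send;
    if (uncertainty)
        *uncertainty = rtt / 2;
    return (int64_t)(p[best].t_remote - (p[best].t_send + rtt / 2));
}

int main(void) {
    /* Synthetic probe samples for illustration only (values in TSC ticks). */
    probe_t probes[] = {
        { 1000, 51200, 1400 },
        { 2000, 52050, 2300 },   /* smallest round trip: used for the relation */
        { 3000, 53500, 3600 },
    };
    uint64_t err;
    int64_t offset = relate_tsc(probes, 3, &err);
    printf("remote - local TSC offset: %lld ticks (+/- %llu ticks)\n",
           (long long)offset, (unsigned long long)err);
    printf("local TSC now: %llu\n", (unsigned long long)read_tsc());
    return 0;
}
```

The uncertainty reported by `relate_tsc` shrinks as the round trip shrinks, which is why the relation is performed in quiescent periods of the network: with little competing traffic, the smallest observed round trip bounds the offset error in the tens of microseconds.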