Dagger: Accelerating RPCs in Cloud Microservices Through Tightly-Coupled Reconfigurable NICs

The ongoing shift of cloud services from monolithic designs to microservices creates high demand for efficient, high-performance datacenter networking stacks optimized for fine-grained workloads. Commodity networking systems based on software stacks and peripheral NICs introduce high overheads when delivering small messages. We present Dagger, a hardware acceleration fabric for cloud RPCs based on FPGAs, where the accelerator is closely coupled with the host processor over a configurable memory interconnect. The three key design principles of Dagger are: (1) offloading the entire RPC stack to an FPGA-based NIC, (2) leveraging memory interconnects instead of PCIe buses as the interface with the host CPU, and (3) making the acceleration fabric reconfigurable, so it can accommodate the diverse needs of microservices. We show that the combination of these principles significantly improves the efficiency and performance of cloud RPC systems while preserving their generality. Dagger achieves 1.3-3.8x higher per-core RPC throughput than both highly optimized software stacks and systems using specialized RDMA adapters. It also scales up to 84 Mrps with 8 threads on 4 CPU cores, while maintaining state-of-the-art µs-scale tail latency. We also demonstrate that large third-party applications, such as memcached and MICA KVS, can be ported to Dagger with minimal changes to their codebase, bringing their median and tail KVS access latency down to 2.8-3.5 µs and 5.4-7.8 µs, respectively. Finally, we show that Dagger is beneficial for multi-tier end-to-end microservices with different threading models by evaluating it on an 8-tier application implementing a flight check-in service.
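To put the peak throughput figure in perspective, the short sketch below derives the average per-thread and per-core rates implied by 84 Mrps on 8 threads and 4 cores, along with the corresponding average time budget per request. The function names are hypothetical, and the even split across threads and cores is an assumption made for illustration, not a measurement reported in the abstract.

```python
def per_unit_mrps(total_mrps: float, units: int) -> float:
    """Average throughput per thread or core, assuming an even split (an assumption)."""
    return total_mrps / units


def ns_per_request(mrps: float) -> float:
    """Average time between requests in nanoseconds at the given rate.

    1 Mrps is one request every 1000 ns, so the budget is 1000 / mrps.
    """
    return 1000.0 / mrps


TOTAL_MRPS = 84.0  # peak throughput quoted in the abstract
THREADS = 8
CORES = 4

per_thread = per_unit_mrps(TOTAL_MRPS, THREADS)  # 10.5 Mrps per thread
per_core = per_unit_mrps(TOTAL_MRPS, CORES)      # 21.0 Mrps per core

print(f"per thread: {per_thread} Mrps, about {ns_per_request(per_thread):.1f} ns per request")
print(f"per core:   {per_core} Mrps, about {ns_per_request(per_core):.1f} ns per request")
```

Under that assumption, each thread has on average under 100 ns per RPC, which gives a sense of why per-message overhead in the CPU-NIC interface matters for small messages.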
