CoroAMU: Unleashing Memory-Driven Coroutines through Latency-Aware Decoupled Operations

Modern data-intensive applications face memory latency challenges that are exacerbated by disaggregated memory systems. Recent work shows that coroutines are a promising way to interleave tasks and hide memory latency, but they struggle to balance latency-hiding efficiency against runtime overhead. We present CoroAMU, a hardware-software co-designed system for memory-centric coroutines. It introduces compiler procedures that optimize coroutine code generation, minimize context, and coalesce requests, paired with a simple interface. With hardware support for decoupled memory operations, we enhance the Asynchronous Memory Unit to further exploit dynamic coroutine schedulers through coroutine-specific memory operations and a novel memory-guided branch prediction mechanism. CoroAMU is implemented with LLVM and the open-source XiangShan RISC-V processor on an FPGA platform. Experiments demonstrate that the CoroAMU compiler achieves a 1.51x speedup over state-of-the-art coroutine methods on Intel server processors. When combined with the optimized hardware for decoupled memory access, it delivers average performance improvements of 3.39x and 4.87x over the baseline processor on FPGA-emulated disaggregated systems under 200ns and 800ns latency, respectively.
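
The basic idea of interleaving tasks to hide memory latency can be illustrated with a minimal software-only sketch. It is not CoroAMU's compiler output or the Asynchronous Memory Unit interface; the names `chase` and `run_interleaved` and the toy `nodes` table are hypothetical, and the `yield` only stands in for the point where a real coroutine would issue an asynchronous load and switch to another task.

```python
from collections import deque

def chase(nodes, head):
    """Sum one linked chain, suspending before every dependent load."""
    total, cur = 0, head
    while cur is not None:
        yield cur                # stand-in for "issue async load, then switch away"
        value, cur = nodes[cur]  # resumed: the data is assumed to have arrived
        total += value
    return total

def run_interleaved(tasks):
    """Round-robin scheduler: resume each coroutine in turn until all finish."""
    ready = deque(enumerate(tasks))
    results = [None] * len(tasks)
    switches = 0                 # number of suspensions, a rough proxy for switch overhead
    while ready:
        idx, task = ready.popleft()
        try:
            next(task)           # run the task up to its next memory access
            switches += 1
            ready.append((idx, task))
        except StopIteration as done:
            results[idx] = done.value
    return results, switches

# Toy "memory": node id -> (value, next node id). Two independent chains.
nodes = {0: (1, 1), 1: (2, 2), 2: (3, None), 10: (5, 11), 11: (7, None)}
results, switches = run_interleaved([chase(nodes, 0), chase(nodes, 10)])
print(results, switches)
```

Every suspension in this sketch costs a scheduler round trip, which is the runtime overhead the abstract says must be balanced against latency-hiding efficiency.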
