Wang
Zongwu
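A Systematic Literature Survey on Automatic Generation and Optimization of Computation-Communication Overlap Operators for Distributed Heterogeneous Hardware Platforms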
2026
05/22
17:00
zongwu wang
Article table of contents
1. Communication Mechanisms and Technical Bottlenecks in Distributed Deep Learning Frameworks
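2. Communication Inference and Code Generation in Computation Graph Compilation and Automatic Parallelization
2.1. Core Mechanisms for Automatic Inference of Distributed Communication Operations
2.2. Compiler-Level Overlap Operators and Code Generation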
3. Structured Analysis of Representative Key Papers (I)
3.1. Paper 1: GSPMD: General and Scalable Parallelization for ML Computations (SOSP 2021) [26]
3.2. Paper 2: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [8]
3.3. Paper 3: A shared compilation stack for distributed-memory parallelism in stencil DSLs (ASPLOS 2024) [37, 38]
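4. Advanced Computation-Communication Overlap Scheduling Techniques and Dependency Analysis Methods
4.1. Whole-Graph Dependency Analysis and Non-Serialized Overlap Scheduling (the Lancet Mechanism)
4.2. Three-Dimensional Communication Partitioning Space and Hierarchical Scheduling (the Centauri Mechanism)
4.3. Runtime Hardware "Wave"-Level Atomic Signaling and Reordering (the FlashOverlap Mechanism)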
5. Structured Analysis of Representative Key Papers (II)
5.1. Paper 4: Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Overlapping (MLSys 2024) [17, 25]
5.2. Paper 5: Centauri: Enabling Efficient Scheduling for Overlap via Communication Partitioning (ASPLOS 2024) [7, 42]
5.3. Paper 6: Efficient and Adaptable Overlapping via Signaling and Reordering (FlashOverlap, arXiv 2025) [43]
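6. Custom Operators, Low-Level Optimization, and Full Physical-Level Kernel Fusion
6.1. Low-Level High-Speed Communication Techniques: Device-Initiated Communication and Symmetric Memory
6.2. Compile-Time Adaptive Automatic Fusion: the Syncopate Mechanism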
7. Structured Analysis of Representative Key Papers (III)
7.1. Paper 7: Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Overlap (arXiv 2026) [12]
7.2. Paper 8: T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives (ASPLOS 2024) [50, 51]
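8. Distributed Programming Models for Heterogeneous Hardware and Multi-Vendor Cross-Chip Coordination
8.1. Cross-Vendor Collective Communication Library: the HetCCL Mechanism
8.2. Heterogeneity-Adaptive Parallel Plan Search: the Metis Mechanism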
9. Structured Analysis of Representative Key Papers (IV)
9.1. Paper 9: HetCCL: Accelerating LLM Training with Heterogeneous GPUs (arXiv 2026) [53]
9.2. Paper 10: Metis: Fast Automatic Distributed Training on Heterogeneous GPUs (ATC 2024) [20]
10. Comprehensive Feature Matrix of Representative Compilation and Overlap Techniques
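11. Identifying Research Gaps and Unique Challenges: Automatic Operator Generation on Heterogeneous Hardware
11.1. Gap 1: The MLIR Compiler Ecosystem Lacks an Operator Lowering Path for Unified Heterogeneous Fused Code Generation
11.2. Gap 2: Extremely Fine-Grained Overlap Operator Generation Under Very Low Interconnect Bandwidth (PCIe) Across Mixed Multi-Vendor Heterogeneous Hardware
11.3. Gap 3: An Adaptive, Power-Contention-Aware ("Contention-Aware") Overlap Strategy That Accounts for Both Physical Device-Memory Bandwidth and On-Chip Power Limits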
12. Conclusions and Concrete Implications for Our Project