| Week | Date | Topic | Reading | Notes |
| --- | --- | --- | --- | --- |
| 1 | Thu 9/25 | — | Course Introduction | Introduction by Yang; slides |
| 2 | Tue 9/30 | Datacenter networking | The Tail at Scale<br>Optional: Attack of the Killer Microseconds | Presentation by Yang; Paper Presentation Selection Due |
| 2 | Thu 10/02 | Datacenter networking | A Scalable, Commodity Data Center Network Architecture<br>Optional: VL2: A Scalable and Flexible Data Center Network | — |
| 3 | Tue 10/07 | Datacenter networking | Data Center TCP (DCTCP)<br>Optional: Swift: Delay is Simple and Effective for Congestion Control in the Datacenter | Project Membership and Topic Due |
| 3 | Thu 10/09 | Datacenter networking | Design Guidelines for High Performance RDMA Systems<br>Optional: Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! | — |
| 4 | Tue 10/14 | Host networking | RDMA over Ethernet for Distributed AI Training at Meta Scale<br>Optional: An Extensible Software Transport Layer for GPU Networking | — |
| 4 | Thu 10/16 | Host networking | IX: A Protected Dataplane Operating System for High Throughput and Low Latency<br>Optional: Arrakis: The Operating System is the Control Plane | Project Proposal Due |
| 5 | Tue 10/21 | Host networking | Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads<br>Optional: Snap: a Microkernel Approach to Host Networking | — |
| 5 | Thu 10/23 | Host networking | Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms<br>Optional: MSCCL++: Rethinking GPU Communication Abstractions for Cutting-Edge AI Applications | — |
| 6 | Tue 10/28 | LLM Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention<br>Optional: vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | — |
| 6 | Thu 10/30 | LLM Inference | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving<br>Optional: Optimizing SLO-oriented LLM Serving with PD-Multiplexing | — |
| 7 | Tue 11/04 | LLM Inference | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving<br>Optional: XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models | — |
| 7 | Thu 11/06 | LLM Inference | NanoFlow: Towards Optimal Large Language Model Serving Throughput<br>Optional: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Mid-Quarter Milestone Due |
| 8 | Tue 11/11 | LLM Training | — | Skipped for Veterans Day Holiday |
| 8 | Thu 11/13 | LLM Training | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness<br>Optional: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning<br>Optional: FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | — |
| 9 | Tue 11/18 | LLM Training | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models<br>Optional: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel<br>Optional: Everything about Distributed Training and Efficient Finetuning | — |
| 9 | Thu 11/20 | LLM Training | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures<br>Optional: DeepSeek Open Infra | — |
| 10 | Tue 11/25 | LLM Training | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints<br>Optional: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | — |
| 10 | Thu 11/27 | LLM Training | — | Skipped for Thanksgiving Holiday |
| 11 | Tue 12/02 | — | Wrap Up | Final Q&A; in-class presentations |
| 11 | Thu 12/04 | — | Project Presentations | In-class presentations |
| — | Tue 12/09 | — | Project Reports Due | Final reports due |