1 |
Thu 9/25 |
— |
Course Introduction |
Introduction by Yang |
2 |
Tue 9/30 |
Datacenter networking |
The Tail at Scale
Optional: Attack of the Killer Microseconds
|
Presentation by Yang
Paper Presentation Selection Due
|
Thu 10/02 |
A Scalable, Commodity Data Center Network Architecture
|
— |
3 |
Tue 10/07 |
Data Center TCP (DCTCP)
Optional: Swift: Delay is Simple and Effective for Congestion Control in the Datacenter
|
Project Membership and Topic Due |
Thu 10/09 |
Design Guidelines for High Performance RDMA Systems
|
— |
4 |
Tue 10/14 |
Host networking |
RDMA over Ethernet for Distributed AI Training at Meta Scale
Optional: An Extensible Software Transport Layer for GPU Networking
|
— |
Thu 10/16 |
IX: A Protected Dataplane Operating System for High Throughput and Low Latency
Optional: Arrakis: The Operating System is the Control Plane
|
Project Proposal Due |
5 |
Tue 10/21 |
Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads
|
— |
Thu 10/23 |
MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud
Optional: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
|
— |
6 |
Tue 10/28 |
LLM Inference |
Efficient Memory Management for Large Language Model Serving with PagedAttention
|
— |
Thu 10/30 |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
|
— |
7 |
Tue 11/04 |
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
|
— |
Thu 11/06 |
NanoFlow: Towards Optimal Large Language Model Serving Throughput
|
Mid-Quarter Milestone Due |
8 |
Tue 11/11 |
LLM Training |
— |
Skipped for Veterans Day Holiday |
Thu 11/13 |
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Optional: Everything about Distributed Training and Efficient Finetuning
|
— |
9 |
Tue 11/18 |
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
|
— |
Thu 11/20 |
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
|
— |
10 |
Tue 11/25 |
Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
|
— |
Thu 11/27 |
— |
Skipped for Thanksgiving Holiday |
11 |
Tue 12/02 |
— |
Wrap Up |
Final Q&A |
Thu 12/04 |
— |
Project Presentations |
In-class presentations |
— |
Tue 12/09 |
— |
Project Reports Due |
Final reports due |