ECS 289D (Fall 2025) — Datacenter Systems for ML

A quarter-long seminar on datacenter systems for ML training and inference, based on research paper discussions and a quarter-long team project.

General Information

Schedule

| Week | Date | Topic | Paper(s) | Notes |
|------|------|-------|----------|-------|
| 1 | Thu 9/25 | Course Introduction | — | Introduction by Yang |
| 2 | Tue 9/30 | Datacenter networking | The Tail at Scale (optional: Attack of the Killer Microseconds) | Presentation by Yang; Paper Presentation Selection due |
| | Thu 10/02 | | A Scalable, Commodity Data Center Network Architecture | |
| 3 | Tue 10/07 | | Data Center TCP (DCTCP) (optional: Swift: Delay is Simple and Effective for Congestion Control in the Datacenter) | Project Membership and Topic due |
| | Thu 10/09 | | Design Guidelines for High Performance RDMA Systems | |
| 4 | Tue 10/14 | Host networking | RDMA over Ethernet for Distributed AI Training at Meta Scale (optional: An Extensible Software Transport Layer for GPU Networking) | |
| | Thu 10/16 | | IX: A Protected Dataplane Operating System for High Throughput and Low Latency (optional: Arrakis: The Operating System is the Control Plane) | Project Proposal due |
| 5 | Tue 10/21 | | Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads | |
| | Thu 10/23 | | MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud (optional: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms) | |
| 6 | Tue 10/28 | LLM Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention | |
| | Thu 10/30 | | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | |
| 7 | Tue 11/04 | | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | |
| | Thu 11/06 | | NanoFlow: Towards Optimal Large Language Model Serving Throughput | Mid-Quarter Milestone due |
| 8 | Tue 11/11 | LLM Training | — | Skipped for Veterans Day holiday |
| | Thu 11/13 | | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (optional: Everything about Distributed Training and Efficient Finetuning) | |
| 9 | Tue 11/18 | | PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel | |
| | Thu 11/20 | | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | |
| 10 | Tue 11/25 | | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints | |
| | Thu 11/27 | | — | Skipped for Thanksgiving holiday |
| 11 | Tue 12/02 | Wrap Up | — | Final Q&A |
| | Thu 12/04 | Project Presentations | — | In-class presentations |
| | Tue 12/09 | Project Reports Due | — | Final reports due |

Readings, Presentations, and Project

Each class meeting centers on a research paper. Students are expected to read the assigned paper before class and participate in the discussion. Over the quarter, each student will lead one paper discussion and complete a quarter-long team project.

Coursework and Grading

Last updated: August 25, 2025