| Week | Date | Topic | Reading | Notes |
| --- | --- | --- | --- | --- |
| 1 | Thu 9/25 | — | Course Introduction | Introduction by Yang; slides |
| 2 | Tue 9/30 | Datacenter networking | The Tail at Scale<br>Optional: Attack of the Killer Microseconds | Presentation by Yang; Paper Presentation Selection Due |
| 2 | Thu 10/02 | Datacenter networking | A Scalable, Commodity Data Center Network Architecture<br>Optional: VL2: A Scalable and Flexible Data Center Network | — |
| 3 | Tue 10/07 | Datacenter networking | Data Center TCP (DCTCP)<br>Optional: Swift: Delay is Simple and Effective for Congestion Control in the Datacenter | Project Membership and Topic Due |
| 3 | Thu 10/09 | Datacenter networking | Design Guidelines for High Performance RDMA Systems<br>Optional: Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better! | — |
| 4 | Tue 10/14 | Host networking | RDMA over Ethernet for Distributed AI Training at Meta Scale<br>Optional: An Extensible Software Transport Layer for GPU Networking | — |
| 4 | Thu 10/16 | Host networking | IX: A Protected Dataplane Operating System for High Throughput and Low Latency<br>Optional: Arrakis: The Operating System is the Control Plane | Project Proposal Due |
| 5 | Tue 10/21 | Host networking | Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads<br>Optional: Snap: a Microkernel Approach to Host Networking | — |
| 5 | Thu 10/23 | Host networking | Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms<br>Optional: MSCCL++: Rethinking GPU Communication Abstractions for Cutting-Edge AI Applications | — |
| 6 | Tue 10/28 | LLM Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention<br>Optional: vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | — |
| 6 | Thu 10/30 | LLM Inference | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving<br>Optional: Optimizing SLO-oriented LLM Serving with PD-Multiplexing | — |
| 7 | Tue 11/04 | LLM Inference | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving<br>Optional: XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models | — |
| 7 | Thu 11/06 | LLM Inference | NanoFlow: Towards Optimal Large Language Model Serving Throughput<br>Optional: Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Mid-Quarter Milestone Due |
| 8 | Tue 11/11 | LLM Training | — | Skipped for Veterans Day Holiday |
| 8 | Thu 11/13 | LLM Training | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness<br>Optional: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning<br>Optional: FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | — |
| 9 | Tue 11/18 | LLM Training | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models<br>Optional: PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel<br>Optional: Everything about Distributed Training and Efficient Finetuning | — |
| 9 | Thu 11/20 | LLM Training | Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures<br>Optional: DeepSeek Open Infra | — |
| 10 | Tue 11/25 | LLM Training | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints<br>Optional: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | — |
| 10 | Thu 11/27 | LLM Training | — | Skipped for Thanksgiving Holiday |
| 11 | Tue 12/02 | — | Wrap Up | Final Q&A; in-class presentations |
| 11 | Thu 12/04 | — | Project Presentations | In-class presentations |
| — | Tue 12/09 | — | Project Reports Due | Final reports due |