# Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Authors: Lianmin Zheng,\* Zhuohan Li,\* Hao Zhang,\* Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica

Presenter: Yihan Zhang

# Background:

1. Training of DL models tends to be distributed, which means it relies on parallelism



# Background

1. Data Parallelism
2. Operator Parallelism
3. Pipeline Parallelism

Figuring out how to combine them is HARD!!!



# What are System Challenges?

1. What if the input dataset is very large? 😃 Easy. Use data parallelism: partition the input data and replicate the model.
   - GPU 1: input batch 1 (32 GB) + model
   - GPU 2: input batch 2 (32 GB) + model
2. What if the model is very large (model: 350 GB)? 😖 Hard!!

**Challenge**: How to partition a computational graph?
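A minimal sketch of the easy case (data parallelism), assuming a toy one-parameter model `y_hat = w * x` with mean squared error and a batch that divides evenly across devices. The function names are hypothetical and the "devices" are simulated by list slices; the point is only that averaging per-shard gradients reproduces the full-batch gradient.

```python
def grad(w, xs, ys):
    # Gradient of mean squared error w.r.t. w for y_hat = w * x.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, num_devices):
    # Assumption: len(xs) is divisible by num_devices (equal-sized shards).
    shard = len(xs) // num_devices
    local_grads = [
        grad(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_devices)  # each "device" holds a full copy of w
    ]
    # Stand-in for an all-reduce (mean) across devices.
    return sum(local_grads) / num_devices


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = grad(1.0, xs, ys)
sharded = data_parallel_grad(1.0, xs, ys, num_devices=2)
```

Every device needs the whole model in memory, which is why this approach stops working once the model itself no longer fits on one GPU.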

# **Partition Computational Graphs**



(Figure: a computational graph partitioned across Device 1 and Device 2.)

#### Strategy 1: Inter-operator Parallelism



#### Trade-off

|                  | Inter-operator<br>Parallelism | Intra-operator<br>Parallelism |
|------------------|-------------------------------|-------------------------------|
| Communication    | Less                          | More                          |
| Device Idle Time | More                          | Less                          |

#### Strategy 2: Intra-operator Parallelism
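
A minimal sketch of intra-operator parallelism for a single matrix multiplication, assuming the weight matrix is split by columns and the column count divides evenly across devices. The helper names are hypothetical and devices are simulated in plain Python; a real system would run each shard on its own device and use an all-gather collective for the final step.

```python
def matmul(a, b):
    # Plain dense matmul on lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    # Assumption: the number of columns is divisible by `parts`.
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def intra_op_matmul(x, w, num_devices):
    shards = split_columns(w, num_devices)
    # Each "device" multiplies the full input by its own column shard.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for an all-gather: concatenate the partial outputs column-wise.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
reference = matmul(x, w)
sharded = intra_op_matmul(x, w, num_devices=2)
```

All devices stay busy on the same operator, but the gather step is the extra communication listed in the trade-off table. Inter-operator parallelism would instead place whole operators on different devices and pass only activations between them.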



# Alpa Compiler:

A unified compiler that automatically finds and executes the best Inter-op and Intra-op parallelism for large deep learning models



Two-level hierarchical space of parallelism techniques.



Effective optimization algorithms at each level.



Efficient compiler and runtime system implementation.

#### **Overview**



#### Computational Graph



#### Whole Search Space



#### **Alpa Hierarchical Space**



# Alpa Compiler: Hierarchical Optimization



# Inter-op Pass






# Inter-op Steps:

1. Minimize 1F1B iteration latency by partitioning the model into stages and mapping stages to device meshes under memory and device constraints.

2. Use dynamic programming to choose stage boundaries and meshes that minimize the pipeline latency $T^*$ below, adding boundary communication/resharding costs and enforcing that the meshes of different stages do not overlap.

$$T^* = \min_{\substack{s_1, \dots, s_S; \\ (n_1, m_1), \dots, (n_S, m_S)}} \left\{ \sum_{i=1}^S t_i + (B-1) \cdot \max_{1 \le j \le S} \{t_j\} \right\}.$$

Here $s_i$ is the $i$-th stage, $(n_i, m_i)$ is the shape of the device mesh assigned to it, $t_i$ is its latency on that mesh, $S$ is the number of stages, and $B$ is the number of microbatches.

3. Output the ordered stage-mesh plan with cross-mesh communication specs, compile per-mesh executables, and run with a 1F1B pipeline.
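The latency objective above can be sketched in a few lines. This is a simplified illustration, not Alpa's implementation: meshes are reduced to a device count, `stage_latency` is a hypothetical cost model (perfect speedup plus a fixed penalty per extra device), and memory constraints and boundary communication are ignored. The dynamic program enumerates a bound `t_max` on the slowest stage and, for each bound, finds the cheapest way to cover all layers using exactly the available devices.

```python
import math
from functools import lru_cache


def pipeline_latency(stage_latencies, num_microbatches):
    # T = sum_i t_i + (B - 1) * max_j t_j
    return sum(stage_latencies) + (num_microbatches - 1) * max(stage_latencies)


def stage_latency(layer_costs, start, end, n_dev):
    # Hypothetical cost model (an assumption, not from the paper).
    return sum(layer_costs[start:end]) / n_dev + 0.1 * (n_dev - 1)


def plan_stages(layer_costs, total_devices, num_microbatches, mesh_sizes=(1, 2, 4)):
    n = len(layer_costs)
    candidates = sorted({stage_latency(layer_costs, i, j, m)
                         for i in range(n) for j in range(i + 1, n + 1)
                         for m in mesh_sizes})
    best = (math.inf, None)
    for t_max in candidates:
        @lru_cache(maxsize=None)
        def f(k, d):
            # Min total latency to place layers k..n-1 on exactly d devices,
            # with every stage latency <= t_max. Returns (latency, plan).
            if k == n:
                return (0.0, ()) if d == 0 else (math.inf, ())
            out = (math.inf, ())
            for j in range(k + 1, n + 1):
                for m in mesh_sizes:
                    if m > d:
                        continue
                    t = stage_latency(layer_costs, k, j, m)
                    if t > t_max:
                        continue
                    rest, plan = f(j, d - m)
                    if t + rest < out[0]:
                        out = (t + rest, ((k, j, m),) + plan)
            return out

        total, plan = f(0, total_devices)
        latency = total + (num_microbatches - 1) * t_max
        if latency < best[0]:
            best = (latency, plan)
    return best  # (latency, ((first_layer, end_layer_exclusive, devices), ...))
```

With two equal layers and two devices, `plan_stages([1.0, 1.0], 2, 20, mesh_sizes=(1, 2))` picks two single-device stages, while a small microbatch count such as 4 favors one two-device stage under this toy cost model.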

### Intra-op Pass



Stage with intra-operator parallelization

# **Intra-op Steps**

1. Take the stage-mesh assignment produced by the inter-op pass as input.

2. Create the search space by enumerating a set of tensor-sharding options with their required collectives (all-reduce/all-gather/all-to-all), and define the resulting shard layouts. Estimate per-candidate communication time using mesh bandwidth.

3. Solve a compact ILP to pick one candidate per op that minimizes the total cost (op communication + resharding), outputting the sharding and collective plan for the whole stage.

$$\min_{s} \sum_{v \in V} s_v^{\mathsf{T}}(c_v + d_v) + \sum_{(v,u) \in E} s_v^{\mathsf{T}} R_{vu} s_u,$$

where $s_v$ is the one-hot vector selecting a candidate for op $v$, $c_v$ and $d_v$ are the per-candidate communication and compute costs of $v$, and $R_{vu}$ is the resharding cost matrix for edge $(v, u)$.
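To make the objective concrete, the sketch below evaluates it on a tiny two-op graph. It uses exhaustive search as a stand-in for an ILP solver (fine only for toy graphs), and all cost numbers and names are made up for illustration. Choosing index `k` for op `v` corresponds to setting the `k`-th entry of the one-hot vector `s_v`.

```python
from itertools import product


def stage_cost(choice, c, d, R):
    # choice maps op -> index of the selected sharding candidate.
    node = sum(c[v][k] + d[v][k] for v, k in choice.items())
    edge = sum(R[(v, u)][choice[v]][choice[u]] for (v, u) in R)
    return node + edge


def best_sharding(c, d, R):
    ops = sorted(c)
    best_cost, best_choice = None, None
    # Enumerate one candidate per op (exponential; an ILP solver avoids this).
    for ks in product(*(range(len(c[v])) for v in ops)):
        choice = dict(zip(ops, ks))
        cost = stage_cost(choice, c, d, R)
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice


# Hypothetical costs: two ops, two sharding candidates each.
c = {"a": [0, 2], "b": [1, 0]}   # communication cost per candidate
d = {"a": [1, 1], "b": [1, 1]}   # compute cost per candidate
R = {("a", "b"): [[0, 3],        # resharding cost, indexed [a's choice][b's choice]
                  [3, 0]]}
cost, choice = best_sharding(c, d, R)
```

Here the cheapest plan keeps both ops on candidate 0: switching `b` to its locally cheaper candidate would add a resharding cost on the edge that outweighs the saving.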

# **Evaluation:** Comparing with Previous Works







Weak scaling results, where the model size grows with the number of GPUs. Evaluated on 8 AWS EC2 p3.16xlarge nodes, each with eight 16 GB V100 GPUs (64 GPUs in total).

# **Evaluation:** Ablation Study with Inter-op and Intra-op Only



Combining inter- and intra-operator parallelism scales to more devices.

# THANKS!!!