Model Parallelism
Model Parallel
https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html
Split the model into pieces and distribute them across multiple GPUs. With good pipelining, this can naturally get faster.
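A minimal sketch of the basic split, assuming two GPUs (`cuda:0`, `cuda:1`); the model and layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model-parallel model: first half lives on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.seq1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to('cuda:0')
        self.seq2 = nn.Linear(256, 10).to('cuda:1')

    def forward(self, x):
        x = self.seq1(x.to('cuda:0'))
        # activations hop between devices; without pipelining, cuda:1 sits idle
        # while cuda:0 works, and vice versa
        return self.seq2(x.to('cuda:1'))

model = TwoGPUModel()
out = model(torch.randn(32, 128))  # output tensor ends up on cuda:1
```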
The key knob in pipelining is split_size, i.e., the size of the micro-batch each GPU device processes at a time. A small split_size shrinks each GPU's idle time, but it also means many tiny CUDA kernel launches, and since every launch carries a fixed overhead, too many of them can end up inefficient; a split_size that is too large makes each GPU's idle time grow.
"Intuitively speaking, using small split_size leads to many tiny CUDA kernel launch, while using large split_size results to relatively long idle times during the first and last splits. Neither are optimal. There might be an optimal split_size configuration for this specific experiment."
There is no general solution; you have to tune split_size for your particular model and hardware.
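Roughly the pipelined forward from the tutorial, reusing TwoGPUModel from the sketch above; split_size=8 is an arbitrary value:

```python
class PipelinedTwoGPUModel(TwoGPUModel):
    """Split the batch into micro-batches of split_size and overlap the two stages."""
    def __init__(self, split_size=8):
        super().__init__()
        self.split_size = split_size

    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next.to('cuda:0')).to('cuda:1')
        outputs = []
        for s_next in splits:
            # CUDA ops are asynchronous, so seq2 on cuda:1 runs concurrently
            # with seq1 on cuda:0 processing the next micro-batch
            outputs.append(self.seq2(s_prev))
            s_prev = self.seq1(s_next.to('cuda:0')).to('cuda:1')
        outputs.append(self.seq2(s_prev))  # drain the last micro-batch
        return torch.cat(outputs)
```

Smaller split_size means more of these iterations (more kernel launches); larger split_size means a longer fill-and-drain phase at the first and last splits.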
If you ran multiple processes asynchronously, couldn't you eliminate the idle time except for the very first stretch? DDP doesn't do it that way, though:
- Recall from the prior tutorial that if your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel; DataParallel does not at this time. When DDP is combined with model parallel, each DDP process would use model parallel, and all processes collectively would use data parallel.
- https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?highlight=distributeddataparallel
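A sketch of that combination, following the demo_model_parallel pattern from the DDP tutorial; the GPU assignment (rank i owning GPUs 2i and 2i+1) and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

class MPModel(nn.Module):
    """One model-parallel replica, split across the two GPUs owned by this process."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.seq1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).to(dev0)
        self.seq2 = nn.Linear(256, 10).to(dev1)

    def forward(self, x):
        x = self.seq1(x.to(self.dev0))
        return self.seq2(x.to(self.dev1))

def run(rank, world_size):
    # assumes MASTER_ADDR / MASTER_PORT are set in the environment
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    mp_model = MPModel(dev0=rank * 2, dev1=rank * 2 + 1)
    # no device_ids for a multi-device module; DDP just all-reduces gradients,
    # so each process does model parallel and the processes together do data parallel
    ddp_model = DDP(mp_model)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    out = ddp_model(torch.randn(20, 128))
    out.sum().backward()   # gradients are averaged across processes here
    opt.step()
    dist.destroy_process_group()

# launch with e.g. torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)
```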
FSDP probably does something like that, I think. Not sure.
https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
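Haven't checked the details, but what the tutorial does show is that FSDP shards parameters, gradients, and optimizer state across ranks; the minimal wrap looks roughly like this (assumes init_process_group has already run and `rank` is this process's local GPU index):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

torch.cuda.set_device(rank)                 # rank: this process's local GPU index
model = FSDP(nn.Linear(128, 10).to(rank))   # parameters get sharded across ranks
```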