DP vs DDP
Update: 02/18/2024
https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_many#:~:text=DP%20copies%20data%20within%20the,not%20the%20case%20with%20DP.
————
Use DDP!
The key point: DP is single-process, multi-thread, while DDP is multi-process. To avoid Python GIL overhead, multi-process is the more efficient choice over multi-thread.
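For reference, a minimal DDP sketch, assuming a single node launched with `torchrun --nproc_per_node=<num_gpus>`; the linear model and random dataset are placeholders, not from any of the linked pages:

```python
# Minimal DDP sketch: one process per GPU, launched with
#   torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler gives each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optim.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across processes here
            optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```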

CUDA semantics — PyTorch documentation (pytorch.org): torch.cuda tracks the currently selected GPU; CUDA tensors are created on that device by default, and the selection can be changed with the torch.cuda.device context manager.
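A tiny sketch of that device-selection behavior (assumes a machine with at least two GPUs):

```python
import torch

# CUDA tensors are created on the currently selected device by default.
torch.cuda.set_device(0)                   # select GPU 0 for this process
a = torch.randn(2, 2, device="cuda")       # lands on cuda:0

# The selection can be changed temporarily with a context manager...
with torch.cuda.device(1):                 # assumes a second GPU exists
    b = torch.randn(2, 2, device="cuda")   # lands on cuda:1

# ...or bypassed by addressing a device explicitly.
c = torch.randn(2, 2, device="cuda:1")
```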
DDP overview
https://pytorch.org/tutorials/beginner/dist_overview.html
DDP, DP and batch norm
https://discuss.pytorch.org/t/how-does-dataparallel-handels-batch-norm/14040
How does DataParallel handle batch norm? (discuss.pytorch.org) — i.e., does each GPU compute batch-norm statistics independently over the mini-batch slice allocated to it, or do the replicas communicate to compute them?
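If batch-norm statistics should be computed over the combined batch across GPUs under DDP, the usual route is torch.nn.SyncBatchNorm. A minimal sketch (the conv model is a placeholder; launch and process-group init mirror the DDP sketch above):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, as in the DDP sketch above.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Placeholder model containing BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
)

# Replace every BatchNorm*d layer with SyncBatchNorm so statistics are
# computed over the combined batch across all DDP processes.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])
```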
Note: DataLoader's num_workers specifies the number of worker subprocesses used for data loading (0 means loading happens in the main process), not threads.
On process vs thread, see: https://stackoverflow.com/questions/37430255/difference-between-subprocess-and-thread
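A quick sketch to see that the workers are separate processes (dummy dataset; prints the PID that loads each batch):

```python
import os
from torch.utils.data import DataLoader, Dataset

class Dummy(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # Runs inside a worker process when num_workers > 0.
        return idx, os.getpid()

if __name__ == "__main__":
    print("main pid:", os.getpid())
    loader = DataLoader(Dummy(), batch_size=2, num_workers=2)
    for idx, pid in loader:
        # The worker PIDs differ from the main process PID.
        print("batch", idx.tolist(), "loaded by pid(s)", pid.tolist())
```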