DP vs DDP
Update: 02/18/2024
https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_many#:~:text=DP%20copies%20data%20within%20the,not%20the%20case%20with%20DP.
————
Use DDP!
The key point: DP is single-process, multi-thread, while DDP is multi-process. To avoid Python GIL overhead, multi-process is the more efficient choice over multi-thread.
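For reference, a minimal DDP sketch, assuming a single node launched with `torchrun --nproc_per_node=<num_gpus>`; the linear model and random dataset are placeholders, not from any of the linked pages:

```python
# Minimal DDP sketch: one process per GPU, launched with
#   torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    # DistributedSampler gives each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optim.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across processes here
            optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```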

CUDA semantics — PyTorch documentation (pytorch.org): torch.cuda tracks the currently selected GPU; CUDA tensors are created on that device by default, and the selection can be changed with the torch.cuda.device context manager.
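A tiny sketch of that device-selection behavior (assumes a machine with at least two GPUs):

```python
import torch

# CUDA tensors are created on the currently selected device by default.
torch.cuda.set_device(0)                   # select GPU 0 for this process
a = torch.randn(2, 2, device="cuda")       # lands on cuda:0

# The selection can be changed temporarily with a context manager...
with torch.cuda.device(1):                 # assumes a second GPU exists
    b = torch.randn(2, 2, device="cuda")   # lands on cuda:1

# ...or bypassed by addressing a device explicitly.
c = torch.randn(2, 2, device="cuda:1")
```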
DDP overview
https://pytorch.org/tutorials/beginner/dist_overview.html
DDP, DP and batch norm
https://discuss.pytorch.org/t/how-does-dataparallel-handels-batch-norm/14040
How does DataParallel handle batch norm? (discuss.pytorch.org) — i.e., does each GPU compute batch-norm statistics independently over the mini-batch slice allocated to it, or do the replicas communicate to compute them?
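If batch-norm statistics should be computed over the combined batch across GPUs under DDP, the usual route is torch.nn.SyncBatchNorm. A minimal sketch (the conv model is a placeholder; launch and process-group init mirror the DDP sketch above):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, as in the DDP sketch above.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# Placeholder model containing BatchNorm layers.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
)

# Replace every BatchNorm*d layer with SyncBatchNorm so statistics are
# computed over the combined batch across all DDP processes.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])
```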
Note: DataLoader's num_workers specifies the number of worker subprocesses used for data loading (0 means loading happens in the main process), not threads.
On process vs thread, see: https://stackoverflow.com/questions/37430255/difference-between-subprocess-and-thread
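A quick sketch to see that the workers are separate processes (dummy dataset; prints the PID that loads each batch):

```python
import os
from torch.utils.data import DataLoader, Dataset

class Dummy(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # Runs inside a worker process when num_workers > 0.
        return idx, os.getpid()

if __name__ == "__main__":
    print("main pid:", os.getpid())
    loader = DataLoader(Dummy(), batch_size=2, num_workers=2)
    for idx, pid in loader:
        # The worker PIDs differ from the main process PID.
        print("batch", idx.tolist(), "loaded by pid(s)", pid.tolist())
```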