
Intro to CUDA

홍돌 2024. 2. 19. 04:51

I've run into CUDA before: when writing rasterization code, when trying to understand/install a sparse conv API, and when wondering whether bundle adjustment could be written in CUDA (never did it). Honestly I still don't feel like I get it, so I'm going to organize it from scratch. (02/18/2024)

More directly, though: while browsing Waymo interview questions on Glassdoor I came across "What does thread mean for GPU?", thought "hmm, how would I answer that?", and decided to watch this video and write things up.

Intro to CUDA (part 1) https://www.youtube.com/watch?v=4APkMJdiudU

- GPU > CPU: tremendous computational throughput (e.g. GFLOP/s) and extremely high memory bandwidth.

Computational throughput and memory bandwidth are two different things.
> In general, if a computation re-uses data, it will require less memory bandwidth. Re-use can be accomplished by:
  > sending more inputs to be processed by the same weights
  > sending more weights to process the same inputs
  
  Source: https://culurciello.medium.com/computation-and-memory-bandwidth-in-deep-neural-networks-16cbac63ebd5
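
A quick worked example with numbers of my own (not from the article): SAXPY, y[i] = a*x[i] + y[i], does 2 FLOPs per element but moves 12 bytes (load x[i], load y[i], store y[i], 4 bytes each), about 0.17 FLOP/byte with zero reuse. Multiplying two NxN matrices does about 2N^3 FLOPs on 3N^2 values, so every value is reused on the order of N times; that is why GEMM can approach peak throughput while SAXPY stays bandwidth-bound.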

- CPU vs GPU

The CPU is designed to minimize latency, so its silicon area is primarily devoted to advanced control logic and large caches. That's why the CPU runs the OS.

The GPU is designed to maximize throughput, so its silicon area is primarily dedicated to a massive number of cores (NVIDIA calls them CUDA cores). The GPU is well-suited to problems that can be expressed as data-parallel computations, where the same program is executed on many data elements in parallel. Because the same program is executed for each data element, there is a lower requirement for sophisticated control logic; and because it is executed on many data elements, the memory access latency can be hidden with calculations instead of big caches. - Core content with nothing to trim, so I transcribed it as-is.

- Kernels & Threads: kernels are functions that run on the GPU, and they execute as a set of parallel threads. Each thread is mapped to a single CUDA core on the GPU.
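
A minimal sketch of a kernel (my own toy example, not from the video):

__global__ void square(float *data, int n)   // __global__ marks a kernel
{
    // Every launched thread runs this same body on its own element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                               // guard: n may not fill the grid exactly
        data[i] = data[i] * data[i];
}

// Launch as a set of parallel threads: 256 per block, enough blocks to cover n.
// square<<<(n + 255) / 256, 256>>>(d_data, n);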

- What is CUDA? CUDA is a heterogeneous parallel programming language designed specifically for NVIDIA GPUs. CUDA is simply C with a set of extensions that allow the HOST (CPU + its memory) and the DEVICE (GPU + its memory) to work together.
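
A minimal HOST/DEVICE round trip, assuming a toy addOne kernel of mine (the runtime calls cudaMalloc / cudaMemcpy / cudaFree are standard CUDA):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float h_x[n];                                  // HOST memory (CPU side)
    for (int i = 0; i < n; i++) h_x[i] = (float)i;

    float *d_x;                                    // DEVICE memory (GPU side)
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(d_x, n);      // kernel runs on the DEVICE

    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    printf("h_x[1] = %f\n", h_x[1]);               // 2.0, computed on the GPU
    return 0;
}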

- CUDA threads execute in a SIMD fashion, which NVIDIA calls SIMT; each thread executes the same instruction but on a separate piece of data (like a convolutional operator applied to an image?). Different data can lead to different execution rates even though the instruction (operation) is the same (see the divergence sketch after the quote below).

> Parallelism is doing operations concurrently, and can be roughly split into data-parallelism and task-parallelism.
> Data-parallelism is applying the same operation to multiple data-items (SIMD),
> and task-parallelism is doing different operations (MIMD/MISD).
> A GPU is designed for data-parallelism, while a CPU is designed for task-parallelism.

Source: https://streamhpc.com/blog/2017-01-24/many-threads-can-run-gpu/
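
A sketch of how the same instruction stream can still run at different rates on different data (my own example): threads in a warp that take different branches get serialized.

// SIMT: the 32 threads of a warp execute in lockstep. If a data-dependent
// branch splits the warp, the two paths run one after the other with part
// of the warp masked off, so the warp takes longer than an unsplit one.
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f)            // branch depends on the data
        x[i] = sqrtf(x[i]);     // some lanes run this...
    else
        x[i] = 0.0f;            // ...while the rest wait, then swap
}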

- Thread ∈ Block ∈ Grid; one kernel launch executes as one grid, where one grid is mapped onto the entire device (one GPU and its memory).
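
How the hierarchy appears in code (threadIdx, blockIdx, blockDim, gridDim are CUDA built-ins; the kernel itself is a toy of mine):

__global__ void whereAmI(int *out)
{
    // threadIdx : this thread's index within its block
    // blockIdx  : this block's index within the grid
    // blockDim  : threads per block; gridDim : blocks per grid
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    out[global_id] = global_id;
}

// One launch = one grid. This grid has 4 blocks of 128 threads each,
// i.e. 512 threads total, all mapped onto one device.
// whereAmI<<<4, 128>>>(d_out);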

 

More)

Computational Throughput (GFLOP/s) and Memory Bandwidth (and latency)

Compute intensity is the ratio of compute throughput to data rate (FLOPS / bytes-per-second). The higher a device's compute intensity, the more operations it must perform per byte loaded to keep its arithmetic units busy; a workload that reuses data less than that leaves the device idle, waiting on memory.

 

Source: https://www.youtube.com/watch?v=3l10o0DYJXg
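
A worked example with round numbers of my own (not from the talk): a GPU with 10 TFLOP/s of compute and 1 TB/s of memory bandwidth has a compute intensity of 10 FLOPs per byte. SAXPY at roughly 0.17 FLOP/byte supplies far fewer operations per byte than that, so the arithmetic units sit idle waiting on memory; the kernel is bandwidth-bound, and latency makes it worse unless enough threads are in flight to hide it.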

 
