CUDA (Compute Unified Device Architecture)

CUDA is a parallel computing platform and programming model created by NVIDIA.
It leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems.
CUDA comes with a software environment that allows developers to use C++ as a high-level programming language.

CUDA Programming Model

Concepts

Grid, Block, Threads

Term     Meaning
Thread   Smallest unit of execution in CUDA.
         Threads within the same block run the same code but on different data (data parallelism).
         Each thread is identified by its thread index (threadIdx.x, threadIdx.y, threadIdx.z).
Block    A group of threads that executes on the same Streaming Multiprocessor (SM) of the GPU.
         One block can have at most 1024 threads.
         Threads in a block can communicate with each other through shared memory.
Grid     Collection of blocks that execute a kernel.

Grid (3D: gridDim.x, gridDim.y, gridDim.z)
├── Block (0,0,0) (3D: blockDim.x, blockDim.y, blockDim.z)
│   ├── Thread (0,0,0)
│   ├── Thread (1,0,0)
│   ├── Thread (2,0,0)
│   ├── ...
│   ├── Thread (blockDim.x-1, blockDim.y-1, blockDim.z-1)
├── Block (1,0,0)
│   ├── Thread (0,0,0)
│   ├── ...
├── Block (0,1,0)
├── ...
└── Block (gridDim.x-1, gridDim.y-1, gridDim.z-1)
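
To make the hierarchy concrete, here is a minimal sketch of how a kernel turns these built-in variables into a unique global index (assuming a 1D launch; the kernel name globalIndexDemo is an illustrative placeholder):

__global__ void globalIndexDemo(float* data)
{
    // blockIdx.x  : which block this thread belongs to (0 .. gridDim.x - 1)
    // blockDim.x  : number of threads per block
    // threadIdx.x : position of this thread inside its block (0 .. blockDim.x - 1)
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    data[globalIdx] = (float)globalIdx;   // each thread touches exactly one element
}

With a launch such as globalIndexDemo<<<4, 256>>>(d_data), threads 0..1023 each write one element.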
                            
CUDA View of GPU
Thread Block Cluster: A Thread Block Cluster (or simply Cluster) is a feature introduced in NVIDIA’s Hopper architecture (e.g., H100 GPU, Compute Capability 9.0) and later.
This is a group of thread blocks that are scheduled to run together on a GPU Processing Cluster (GPC).
Advantage: clusters allow blocks within the same cluster to share resources and synchronize. Note that blocks in different clusters cannot communicate directly.
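
As a rough sketch of how a cluster is declared (assuming a Hopper-class GPU and a CUDA 12 toolchain that supports the __cluster_dims__ attribute; the kernel and variable names are illustrative placeholders):

// Compile-time cluster size: 2 blocks per cluster along x, 1 along y and z.
__global__ void __cluster_dims__(2, 1, 1) clusterKernel(float* input, float* output)
{
    // Blocks within the same cluster can cooperate and synchronize;
    // blocks in different clusters cannot communicate directly.
}

// The launch is still expressed in blocks, and the grid size in x must be
// a multiple of the cluster size (2 here):
//     clusterKernel<<<numBlocks, threadsPerBlock>>>(d_input, d_output);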

Kernel

Kernels are functions that run on the GPU; they are executed in parallel by multiple threads.
These functions are called from the host (CPU) and executed in parallel on the GPU (device).
A kernel is defined using the __global__ declaration specifier.
The number of CUDA threads that execute the kernel is specified with the <<<blocks, threads>>> execution configuration syntax.
Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.

/*
__global__ : This is a kernel function.
Parameters:
    A,B,C: Pointers to floating-point arrays stored in GPU global memory.
           A,B: input arrays. C: output array where the sum of A & B is stored.
*/
__global__ void VecAdd(float* A, float* B, float* C)
{
    /*
    threadIdx {x,y,z} This is a built-in CUDA variable that gives the unique ID of the thread within the block.
    threadIdx.x: x position of the thread in the block.
        Thread0, x=0. C[0] = A[0] + B[0]
        Thread1, x=1. C[1] = A[1] + B[1]
    */
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    /*
        <<<blocks, threads>>>
        <<<1, N>>> means there will be 1 block with N threads.
        N: set somewhere earlier in the code.
           If the arrays have 1024 elements, we can set N = 1024 so that each thread processes one element of the arrays.
    */
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
            

Inter Thread Communication

1. Threads within the Same Block

Threads in the same block can communicate using shared memory
Synchronization within a block is achieved using __syncthreads() (see the sketch below)
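
A minimal sketch of both ideas (the kernel name reverseInBlock and the fixed block size of 256 are illustrative assumptions): each thread stages one value in shared memory, the whole block synchronizes, and threads then read values written by other threads.

__global__ void reverseInBlock(float* data)
{
    __shared__ float tile[256];           // visible to all threads in this block

    int i = threadIdx.x;
    tile[i] = data[i];                    // each thread writes one element

    __syncthreads();                      // wait until every thread has written

    data[i] = tile[blockDim.x - 1 - i];   // safely read another thread's element
}

// Launched as reverseInBlock<<<1, 256>>>(d_data); without the __syncthreads(),
// a thread could read a tile slot before its owner has written it.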

2. Threads in Different Blocks

There is no built-in synchronization mechanism across blocks within a kernel (like __syncthreads())
Workarounds for Inter-Block Communication (an atomics sketch follows this list):
  Global Memory: Threads in different blocks can write to and read from global memory (e.g., the arrays A, B, C in the code).
  Atomic Operations: Use atomic functions to ensure that updates to global memory are done safely.
  Multiple Kernel Launches: If inter-block communication is needed, you can launch multiple kernels sequentially, where the output of one kernel serves as the input to the next
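
For instance, a sum spanning many blocks can be accumulated safely through atomic updates to global memory (a hedged sketch; the kernel name sumAll and the single-accumulator design are illustrative, not the only pattern):

__global__ void sumAll(const float* data, float* result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // atomicAdd serializes conflicting updates, so threads from any block
        // can safely accumulate into the same global-memory location.
        atomicAdd(result, data[i]);
    }
}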

How many threads can a GPU have?

Maximum grid dimensions: 2³¹ - 1 blocks in x, 65,535 in y and z
Maximum threads per block: 1024
Even a single grid dimension of 65,535 blocks × 1,024 threads per block ≈ 67 million threads per grid
Theoretically, billions of threads per kernel launch (the exact limits can be queried at runtime, as shown below)
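
These limits can also be read at runtime from the device properties (a small sketch using the standard CUDA runtime API; device 0 is assumed):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size         : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}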

Example

Vector Addition on GPU

Below is a simple teaching example that demonstrates CUDA’s basic concepts:
  Launching a kernel
  Managing GPU memory
  Parallelizing a simple computation


#include <cuda_runtime.h>
#include <stdio.h>
#define N 512

__global__ void VecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    // 1. Allocate memory on the host
    float *h_A = (float*)malloc(N * sizeof(float));
    float *h_B = (float*)malloc(N * sizeof(float));
    float *h_C = (float*)malloc(N * sizeof(float));

    // 2. Initialize host arrays
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)i;
        h_B[i] = (float)i * 2;
    }

    // 3. Allocate memory on the GPU device
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_B, N * sizeof(float));
    cudaMalloc(&d_C, N * sizeof(float));

    // 4. Copy data from host to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // 5. Launch the kernel.
    // <<<1, N>>> (with N = 512) means this kernel is executed on 1 block with 512 threads.
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // 6. Once results are calculated, copy the result back to the host
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

    // 7. Verify (optional)
    for (int i = 0; i < N; i++) {
        if (h_C[i] != h_A[i] + h_B[i]) {
            printf("Error at index %d\n", i);
            break;
        }
    }

    // 8. Clean up GPU and host memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);

    return 0;
}
                    
Real-World Scenarios for GPU Usage:
  Huge matrix operations for training neural networks
  Image processing
  Cryptography
  Big data processing

How to decide on Number of threads to use?
  In the example above, we used 512 threads.
  Common choices are 256, 512, or 1024 threads per block, as they are multiples of the warp size (32 threads), which is the unit of execution on NVIDIA GPUs.
  Threads per block also depends on the problem size and the GPU architecture.

How to decide on Number of blocks to use?
  Based on the number of threads needed, as shown in the sketch below.
  blocks = ceil(total_elements / threads_per_block)   (total_elements = 512 in the example above, so 1 block of 512 threads suffices)
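
A common pattern for larger inputs (a hedged sketch; numElements and addKernel are illustrative names) derives the block count from the element count and guards against out-of-range indices:

__global__ void addKernel(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the last block may have surplus threads; skip them
        C[i] = A[i] + B[i];
}

// Launch configuration (host side):
//     int numElements = 1 << 20;                                           // e.g., 1M elements
//     int threadsPerBlock = 256;
//     int blocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;  // ceil division
//     addKernel<<<blocks, threadsPerBlock>>>(d_A, d_B, d_C, numElements);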

Streaming Multiprocessor (SM)

An SM is responsible for executing threads in parallel.
Each SM contains multiple CUDA cores (for arithmetic operations), load/store units, and special function units.
SMs execute thread blocks assigned to them by the CUDA runtime.
Example: NVIDIA A100 GPU (Ampere architecture) has 108 SMs, each with 64 CUDA cores. Each SM can handle up to 2,048 threads concurrently.
Does the GPU Have as Many SMs as Blocks?
No. Blocks: a software construct defined by the programmer in the CUDA kernel launch (e.g., <<<blocks, threads>>>). The number of blocks can be arbitrary, often much larger than the number of SMs.
SMs: A fixed hardware resource on the GPU (e.g., 108 SMs on an A100).

Relation between Blocks and SMs:
  The CUDA runtime schedules blocks onto available SMs. Each SM can execute multiple blocks concurrently, up to its resource limits (see the device-property sketch below).
  If there are more blocks than SMs, the GPU schedules blocks in waves.
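
Extending the device-property sketch from earlier, the SM-related limits can be queried the same way (maxBlocksPerMultiProcessor assumes a reasonably recent CUDA toolkit; device 0 is assumed):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Number of SMs              : %d\n", prop.multiProcessorCount);
    printf("Max threads per SM         : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident blocks per SM : %d\n", prop.maxBlocksPerMultiProcessor);
    return 0;
}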

Graphics Processing Clusters (GPCs)

A GPC is primarily used in the GPU’s graphics pipeline (e.g., for rendering tasks in gaming or visualization).

                1 GPC = Multiple SMs + raster engines + Texture processing clusters (TPCs)
            
The NVIDIA A100 GPU has 8 GPCs, each with 13–14 SMs (108 SMs in total).

GPU
├── GPC 0
│   ├── SM 0
│   │   ├── CUDA Cores (e.g., 64 cores)
│   │   ├── Shared Memory
│   │   ├── Registers
│   │   ├── Warp Schedulers
│   │   ├── Load/Store Units
│   │   ├── Thread Blocks (e.g., Block 0, Block 1)
│   │   │   ├── Threads (0,0,0), (1,0,0), ..., (threadIdx.x, y, z)
│   ├── SM 1
│   ├── ...
│   ├── SM n (e.g., 12 SMs per GPC)
│   ├── Texture Processing Cluster (TPC)
│   ├── Raster Engine
├── GPC 1
│   ├── SMs, TPCs, etc.
├── ...
├── GPC m (e.g., 8 GPCs in A100)
├── Global Memory (HBM/GDDR)
├── L2 Cache
├── Memory Controllers