CUDA Programming 3: Understanding the CUDA Programming Model

Introduction:
In the previous blogs, you learned about parallel computing, installed CUDA on your system, and ran your first simple program. In this post, we’ll take a deeper dive into how CUDA works under the hood. We’ll cover the fundamental programming model behind CUDA, which is based on threads, blocks, and grids, as well as explain how memory management works on the GPU.
By the end of this post, you’ll understand how CUDA organizes work across the GPU and how to write efficient code for parallel execution.
CUDA Programming Model Overview:
CUDA programming revolves around writing kernels—functions that run on the GPU. These kernels are executed by many threads simultaneously, organized into blocks. A group of blocks forms a grid. This structure allows you to harness the full power of the GPU, with thousands of threads working together in parallel.
Key Concepts:
- Threads: The smallest unit of execution in CUDA. Each thread executes the same kernel but works on different data.
- Blocks: Threads are grouped into blocks. Each block executes on a single streaming multiprocessor (SM) on the GPU.
- Grids: Blocks are organized into grids. A grid is a collection of blocks that work together to perform the overall task. A minimal example of this hierarchy follows below.
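To make the hierarchy concrete, here is a minimal sketch. The kernel name hello and the 4-block, 8-thread launch are illustrative choices, not part of the example we build later in this post: a function marked __global__ is a kernel, and the <<<blocks, threads>>> syntax launches it as a grid of blocks.
#include <cstdio>
#include <cuda_runtime.h>
// CUDA kernel: every launched thread runs this function body
__global__ void hello() {
    // Each thread can identify itself by its block and thread index
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}
int main() {
    // Launch a grid of 4 blocks, each containing 8 threads (32 threads total)
    hello<<<4, 8>>>();
    cudaDeviceSynchronize(); // wait for the GPU to finish before the program exits
    return 0;
}
Running it prints one line per thread; the order of the lines is not guaranteed, because blocks and threads execute independently.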
CUDA Execution Model:
When you launch a kernel, CUDA automatically distributes the work among the threads, blocks, and grids. Each thread gets a unique identifier, which can be used to access different pieces of data. This allows you to process large datasets in parallel.
Here’s how the model works in practice:
- Thread Indexing: Each thread has a unique index that helps it determine what data to process. The index is typically calculated using the threadIdx, blockIdx, blockDim, and gridDim built-in variables.
- Block Dimensions: Each block can contain up to 1024 threads, depending on the GPU architecture. The number of threads per block and the number of blocks in the grid can be adjusted based on the problem at hand; you can also query your GPU's limits at runtime, as shown in the sketch after this list.
- Grid Dimensions: The grid is essentially a collection of blocks. The blocks can be arranged in a 1D, 2D, or 3D grid, allowing you to work with multidimensional data more naturally.
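If you want to see these limits for your own card, you can query the device properties at runtime. This is a minimal sketch that assumes device 0 is the GPU you intend to use:
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    // Query the capabilities of device 0
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
On most current GPUs this reports a maximum of 1024 threads per block.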
Understanding Thread and Block Indices:
Let’s break down how you can access threads within blocks and organize your data.
1. Thread Indexing
The threadIdx built-in variable gives the index of the thread within its block. It's a 3D variable, so you can access its x, y, and z components.
int idx = threadIdx.x + blockIdx.x * blockDim.x;
- threadIdx.x gives the thread's index within its block.
- blockIdx.x gives the block's index in the grid.
- blockDim.x gives the number of threads in each block.
By calculating idx, we obtain a unique index for each thread in the entire grid.
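For example, with blockDim.x = 256, the thread with threadIdx.x = 3 in block blockIdx.x = 2 computes idx = 3 + 2 * 256 = 515, and no other thread in the grid produces that value.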
2. Block Dimensions
Each block has a size defined by blockDim.x, blockDim.y, and blockDim.z. These variables give the number of threads in each dimension of the block.
3. Grid Dimensions
The grid has a size defined by gridDim.x, gridDim.y, and gridDim.z. This allows you to define how many blocks are in each dimension of the grid.
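Multidimensional launches are configured with the dim3 type. Below is a minimal sketch of a 2D launch; the kernel name doubleElements, the 1024 x 768 array size, and the 16 x 16 block shape are illustrative choices, not part of the vector example that follows.
#include <cuda_runtime.h>
// Illustrative kernel: each thread doubles one element of a 2D array stored in row-major order
__global__ void doubleElements(float *data, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x; // x position across the whole grid
    int row = blockIdx.y * blockDim.y + threadIdx.y; // y position across the whole grid
    if (col < width && row < height) {
        data[row * width + col] *= 2.0f;
    }
}
int main() {
    int width = 1024, height = 768;
    float *d_data;
    cudaMalloc((void**)&d_data, width * height * sizeof(float));
    cudaMemset(d_data, 0, width * height * sizeof(float)); // give the array defined contents

    // 16 x 16 = 256 threads per block, and enough blocks to cover the whole array
    dim3 threadsPerBlock(16, 16);
    dim3 blocksPerGrid((width + threadsPerBlock.x - 1) / threadsPerBlock.x,
                       (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
    doubleElements<<<blocksPerGrid, threadsPerBlock>>>(d_data, width, height);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
With a 1024 x 768 array and 16 x 16 blocks, this launch creates a 64 x 48 grid of blocks, mirroring the shape of the data.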
A Simple Example: Vector Addition with Threads and Blocks
Let’s revisit our earlier vector addition program and break it down further by utilizing the thread and block indices. Here’s the updated code:
#include <iostream>
#include <cuda_runtime.h>

// CUDA Kernel function to add two vectors
__global__ void addVectors(int *A, int *B, int *C, int N) {
    // Calculate the index of the thread
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Ensure we don't go out of bounds
    if (idx < N) {
        C[idx] = A[idx] + B[idx]; // Add corresponding elements of A and B
    }
}

int main() {
    int N = 50000; // Number of elements in the vectors
    int size = N * sizeof(int);

    // Allocate memory on the host (CPU)
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize vectors A and B
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate memory on the device (GPU)
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Define the number of threads per block and the number of blocks
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the kernel
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Print the first 10 elements of the result
    for (int i = 0; i < 10; i++) {
        std::cout << "C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
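If you save the file as vector_add.cu (the name is just an assumption for the commands below), you can compile and run it with nvcc:
nvcc vector_add.cu -o vector_add
./vector_add
Since h_A[i] = i and h_B[i] = 2 * i, each result element is C[i] = 3 * i, so the printed output starts at C[0] = 0, C[1] = 3 and ends at C[9] = 27.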
Code Explanation:
- Thread Indexing: In the kernel addVectors, the index idx is calculated using threadIdx.x and blockIdx.x. This allows each thread to operate on a unique element of the vectors.
- Grid and Block Dimensions: The kernel is launched with blocksPerGrid blocks and threadsPerBlock threads per block. This ensures that enough threads are launched to process all the elements of the vectors.
Step-by-Step Guide to Understanding the Execution Model:
- Thread Execution: Each thread calculates its unique index using threadIdx.x + blockIdx.x * blockDim.x. This ensures that every thread gets a unique index and processes a different element of the vectors.
- Block Execution: Each block contains a number of threads. The number of threads per block (threadsPerBlock) is set to 256, which is a common choice for efficient GPU utilization.
- Grid Execution: The grid contains a number of blocks, determined by the total number of elements (N) and the number of threads per block. The grid size is calculated as (N + threadsPerBlock - 1) / threadsPerBlock, ensuring we have enough blocks to handle all elements. A worked example of this calculation follows below.
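To see the numbers concretely: with N = 50000 and threadsPerBlock = 256, blocksPerGrid = (50000 + 255) / 256 = 196 (integer division), so the grid launches 196 * 256 = 50176 threads. The last 176 of them fail the idx < N check and simply do nothing.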
Memory Management in CUDA:
- Global Memory: Memory accessible by all threads on the GPU. In our example, the vectors A, B, and C are stored in global memory.
- Shared Memory: A small, fast memory space shared by threads within the same block. We will explore shared memory in depth in future posts when we talk about optimization, but a brief preview follows below.
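As a preview only, here is a minimal sketch of what declaring shared memory looks like inside a kernel. The kernel name sumTile is a placeholder, and the sketch assumes it is launched with 256 threads per block; why and when this speeds things up is the subject of a later post.
// Illustrative only: each block stages 256 input elements in fast shared memory
__global__ void sumTile(const int *input, int *blockSums, int N) {
    __shared__ int tile[256];                 // visible to every thread in this block
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Each thread copies one element from global memory into the shared tile
    tile[threadIdx.x] = (idx < N) ? input[idx] : 0;
    __syncthreads();                          // wait until the whole tile is loaded

    // Thread 0 of each block adds up its tile (deliberately simple, not optimized)
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; i++) {
            sum += tile[i];
        }
        blockSums[blockIdx.x] = sum;
    }
}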
Conclusion:
Now that you have a basic understanding of the CUDA programming model, you are ready to start writing efficient kernels that can run on the GPU. In the next blog, we’ll explore more advanced topics like memory management and optimization techniques to help you write faster and more efficient CUDA programs.
Call to Action:
- Got questions? Drop them in the comments below, and we’ll answer them in the next post!
- Up next: In the next blog, we’ll dive into optimizing memory usage for CUDA programs, including how to use shared memory effectively.