CUDA Programming 1: Introduction to CUDA and Parallel Computing

Introduction:
Welcome to the first post in our series on CUDA programming for beginners. In this blog, we’ll introduce you to the concepts of parallel computing, explain what CUDA is, and provide a simple example you can run on your machine.
What is Parallel Computing?
Parallel computing allows tasks to be divided into smaller pieces that can be executed simultaneously, making large-scale computations much faster. Think of it like a factory where workers are assembling parts at the same time instead of one by one.
What is CUDA?
CUDA is NVIDIA’s parallel computing platform that enables developers to harness the power of NVIDIA GPUs for general-purpose computing tasks. CUDA makes it easy to offload computationally intensive operations to the GPU, speeding up processes like image processing, simulations, and machine learning.
How Does CUDA Work?
CUDA uses the GPU’s many cores to execute thousands of threads in parallel. These threads are grouped into blocks, and blocks are organized into a grid. Each thread executes the same kernel (function) but operates on different data, allowing the program to process large datasets at once.
Hands-on Example: A Simple CUDA Program
Let’s look at a simple example where we use CUDA to perform vector addition. This program will add two vectors, A and B, element-wise and store the result in vector C.
Here’s the code:
#include <iostream>
#include <cuda_runtime.h>

// CUDA kernel to add two vectors element-wise
__global__ void addVectors(int *A, int *B, int *C, int N) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N) {
        C[index] = A[index] + B[index]; // Add corresponding elements of A and B
    }
}

int main() {
    int N = 50000;              // Number of elements in the vectors
    int size = N * sizeof(int); // Size of each vector in bytes

    // Allocate memory on the host (CPU)
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize vectors A and B
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate memory on the device (GPU)
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy input data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Print the first 10 elements of the result
    for (int i = 0; i < 10; i++) {
        std::cout << "C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
Code Explanation:
- Memory Allocation: We first allocate memory for the vectors on both the host (CPU) and the device (GPU). Host memory is managed with malloc(), and device memory is allocated with cudaMalloc().
- CUDA Kernel: The addVectors function is a simple CUDA kernel, marked with the __global__ keyword. This function runs on the GPU; each thread adds corresponding elements from vectors A and B and stores the result in C.
- Kernel Launch: We launch the kernel with addVectors<<<blocksPerGrid, threadsPerBlock>>>(...). The launch configuration defines how many blocks and threads execute the kernel. In this example, we use 256 threads per block, and the number of blocks is computed from the size of the input vectors so that every element is covered.
- Memory Copying: The cudaMemcpy() function copies data between host and device memory; its last argument gives the direction of the transfer.
- Result Output: After the kernel finishes, we copy the results back to host memory and print the first 10 elements of the result vector C.
How to Run the Code:
- Install CUDA: Make sure you have CUDA installed on your machine. You can download the CUDA Toolkit from NVIDIA’s website and follow the installation instructions for your platform.
- Compile the Code: Save the code in a file named vector_add.cu (the .cu extension tells the NVIDIA compiler, nvcc, to treat it as CUDA code) and compile it:
nvcc vector_add.cu -o vector_add
- Run the Program: After compiling, run the program:
./vector_add
You should see output similar to:
C[0] = 0
C[1] = 3
C[2] = 6
C[3] = 9
C[4] = 12
...
Conclusion:
In this first blog, we’ve introduced parallel computing and CUDA and provided a simple example to get you started. This hands-on experience will give you the foundation to explore more advanced CUDA programming concepts in the upcoming posts. Stay tuned for the next blog, where we’ll walk through setting up your CUDA environment!
Call to Action:
- Have you tried the vector addition example? Let us know how it worked for you in the comments!
- Up next: In our next post, we’ll guide you through setting up your CUDA environment step-by-step.