CUDA Programming 1: Introduction to CUDA and Parallel Computing

Introduction:

Welcome to the first post in our series on CUDA programming for beginners. In this post, we’ll introduce the basics of parallel computing, explain what CUDA is, and walk through a simple example you can run on your own machine.

What is Parallel Computing?

Parallel computing divides a task into smaller pieces that can be executed simultaneously, making large-scale computations much faster. Think of it like a factory where many workers assemble parts at the same time instead of a single worker building each part one by one.
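
To make the contrast concrete, here’s what the same kind of work looks like without parallelism: a plain serial loop on the CPU, adding two vectors one element at a time. (This is just a sketch for comparison; the CUDA version appears later in this post.)

#include <vector>

// Serial vector addition: one worker, one element at a time.
void addVectorsSerial(const std::vector<int>& A, const std::vector<int>& B,
                      std::vector<int>& C) {
    for (size_t i = 0; i < A.size(); i++) {
        C[i] = A[i] + B[i];  // each iteration runs only after the previous one
    }
}

On a GPU, each of these iterations can instead be handed to its own thread and executed at the same time, which is exactly what the CUDA example below does.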

What is CUDA?

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. It lets developers harness the power of NVIDIA GPUs for general-purpose computing, making it easy to offload computationally intensive operations to the GPU and speed up workloads like image processing, simulations, and machine learning.

How Does CUDA Work?

CUDA uses the GPU’s many cores to execute thousands of threads in parallel. These threads are grouped into blocks, and blocks are organized into a grid. Each thread executes the same kernel (function) but operates on different data, allowing the program to process large datasets at once.
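
To see the indexing in action, here’s a minimal sketch you can compile and run (the kernel name whoAmI is ours, just for illustration). Each thread combines its position within its block with its block’s position in the grid to get a unique global index:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes and prints its unique global index.
__global__ void whoAmI() {
    // Global index = position within the block + block index * block size.
    // Example: with 4 threads per block, thread 1 of block 2 gets 1 + 2 * 4 = 9.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    printf("block %d, thread %d -> global index %d\n",
           blockIdx.x, threadIdx.x, index);
}

int main() {
    whoAmI<<<3, 4>>>();         // a grid of 3 blocks, each with 4 threads
    cudaDeviceSynchronize();    // wait for the device-side printf output
    return 0;
}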

Hands-on Example: A Simple CUDA Program

Let’s look at a simple example where we use CUDA to perform vector addition. This program will add two vectors, A and B, element-wise and store the result in vector C.

Here’s the code:

#include <iostream>
#include <cuda_runtime.h>

// CUDA Kernel function to add two vectors
__global__ void addVectors(int *A, int *B, int *C, int N) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N) {
        C[index] = A[index] + B[index];  // Add corresponding elements of A and B
    }
}

int main() {
    int N = 50000;  // Number of elements in the vectors
    int size = N * sizeof(int);

    // Allocate memory on the host (CPU)
    int *h_A = (int*)malloc(size);
    int *h_B = (int*)malloc(size);
    int *h_C = (int*)malloc(size);

    // Initialize vectors A and B
    for (int i = 0; i < N; i++) {
        h_A[i] = i;
        h_B[i] = i * 2;
    }

    // Allocate memory on the device (GPU)
    int *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy data from host to device
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks and threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Print the first 10 elements of the result
    for (int i = 0; i < 10; i++) {
        std::cout << "C[" << i << "] = " << h_C[i] << std::endl;
    }

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

Code Explanation:

  • Memory Allocation: We first allocate memory for the vectors on both the host (CPU) and device (GPU). The host memory is managed using malloc(), and the device memory is allocated using cudaMalloc().
  • CUDA Kernel: The addVectors function is a simple CUDA kernel, marked with the __global__ keyword. This function runs on the GPU. Each thread adds corresponding elements from vectors A and B and stores the result in C.
  • Kernel Launch: We launch the kernel using addVectors<<<blocksPerGrid, threadsPerBlock>>>(...). This specifies how many blocks, and how many threads per block, will execute the kernel. In this example we use 256 threads per block, and the number of blocks is rounded up so that every element gets a thread: for N = 50000, (50000 + 255) / 256 = 196 blocks, or 50,176 threads in total. The if (index < N) guard in the kernel keeps the 176 surplus threads from writing past the end of the arrays.
  • Memory Copying: The cudaMemcpy() function copies data between host and device memory; the last argument specifies the direction of the transfer. For brevity, the example doesn’t check the status codes these calls return (see the error-checking sketch after this list).
  • Result Output: After the kernel execution, we copy the results back to the host memory and print the first 10 elements of the result vector C.
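
One thing the example above skips for brevity is error checking: every CUDA API call returns a status code, and kernel launch errors are only reported by a later call. Here’s a minimal sketch of a common pattern (the CUDA_CHECK macro name is ours, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap any call that returns cudaError_t; abort with a message on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage in the vector addition program would look like:
//   CUDA_CHECK(cudaMalloc((void**)&d_A, size));
//   CUDA_CHECK(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));
//   addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution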

How to Run the Code:

  1. Install CUDA: Make sure you have CUDA installed on your machine. You can download the CUDA Toolkit from NVIDIA’s website and follow the installation instructions for your platform; running nvcc --version afterwards is a quick way to confirm the compiler is available.
  2. Compile the Code: Save the code in a file named vector_add.cu (the .cu extension tells nvcc to treat it as CUDA source; a .cpp file would be compiled as plain C++ and the kernel syntax would fail) and compile it using the NVIDIA compiler (nvcc):

     nvcc vector_add.cu -o vector_add
  3. Run the Program: After compiling, run the executable:

     ./vector_add

You should see output similar to:

C[0] = 0
C[1] = 3
C[2] = 6
C[3] = 9
C[4] = 12
...
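
Since h_A[i] = i and h_B[i] = 2 * i, every element should satisfy C[i] = 3 * i. If you’d like to verify all 50,000 elements instead of eyeballing the first 10, you could add a check like this (a sketch, not part of the original program) just before freeing the host memory:

// Optional: verify every element of the result on the host.
// We expect h_C[i] = i + 2 * i = 3 * i.
bool ok = true;
for (int i = 0; i < N; i++) {
    if (h_C[i] != 3 * i) {
        std::cout << "Mismatch at index " << i << ": got " << h_C[i]
                  << ", expected " << 3 * i << std::endl;
        ok = false;
        break;
    }
}
std::cout << (ok ? "All results correct!" : "Verification failed.") << std::endl;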

Conclusion:

In this first post, we’ve introduced parallel computing and CUDA and walked through a simple example to get you started. This hands-on experience gives you the foundation to explore more advanced CUDA programming concepts in the upcoming posts. Stay tuned for the next post, where we’ll walk through setting up your CUDA environment step by step!


Call to Action:

  • Have you tried the vector addition example? Let us know how it worked for you in the comments!
  • Up next: In our next post, we’ll guide you through setting up your CUDA environment step-by-step.

