GPU Programming with C++

28/12/2020

Overview

In this guide, we’ll explore the power of GPU programming with C++. Developers can expect incredible performance with C++, and accessing the phenomenal power of the GPU with a low-level language can yield some of the fastest computation currently available.

Requirements

While any machine capable of running a modern version of Linux can support a C++ compiler, you’ll need an NVIDIA-based GPU to follow along with this exercise. If you don’t have a GPU, you can spin up a GPU-powered instance in Amazon Web Services or another cloud provider of your choice.

If you choose a physical machine, please ensure you have the NVIDIA proprietary drivers installed. You can find instructions for this here: https://linuxhint.com/install-nvidia-drivers-linux/

In addition to the driver, you’ll need the CUDA toolkit. In this example, we’ll use Ubuntu 16.04 LTS, but there are downloads available for most major distributions at the following URL: https://developer.nvidia.com/cuda-downloads

For Ubuntu, you would choose the .deb based download. The downloaded file will not have a .deb extension by default, so I recommend renaming it to have a .deb at the end. Then, you can install with:

sudo dpkg -i package-name.deb

If you are prompted to install a GPG key, follow the instructions provided.

Once you’ve done that, update your repositories and install CUDA:

sudo apt-get update
sudo apt-get install cuda -y

Once done, I recommend rebooting to ensure everything is properly loaded.

The Benefits of GPU Development

CPUs handle many different kinds of input and output and provide a large set of functions, both to meet a wide range of program needs and to manage varying hardware configurations. They also handle memory, caching, the system bus, segmenting, and IO functionality, making them a jack of all trades.

GPUs are the opposite: they contain many individual processors that focus on very simple mathematical functions. Because of this, they can process work that splits into many independent operations many times faster than CPUs. By specializing in scalar functions (a function that takes one or more inputs but returns only a single output), they achieve extreme performance at the cost of extreme specialization.
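
To make the idea of many small, independent operations concrete, here is a minimal host-only C++ sketch (no GPU required). The helper names add_one_element, add_forward, and add_backward are hypothetical and exist only for this illustration. Each call reads one pair of inputs and writes one output, so the calls can run in any order, which is the property that lets a GPU run them in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: the work a single GPU thread would do for element i.
// It reads a[i] and b[i] and writes only c[i], so it never depends on the
// result computed for any other element.
void add_one_element(const std::vector<int>& a, const std::vector<int>& b,
                     std::vector<int>& c, std::size_t i) {
    c[i] = a[i] + b[i];
}

// Apply the per-element function from the first element to the last.
// Assumes b has at least as many elements as a.
std::vector<int> add_forward(const std::vector<int>& a,
                             const std::vector<int>& b) {
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        add_one_element(a, b, c, i);
    }
    return c;
}

// Apply it from the last element to the first. The result is the same
// because no element depends on another.
std::vector<int> add_backward(const std::vector<int>& a,
                              const std::vector<int>& b) {
    std::vector<int> c(a.size());
    for (std::size_t i = a.size(); i-- > 0;) {
        add_one_element(a, b, c, i);
    }
    return c;
}
```

Because the order of the calls does not matter, nothing stops hardware from running all of them at once, one per processor.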

Example Code

In the example code, we add vectors together. I have added a CPU and GPU version of the code for speed comparison.
gpu-example.cpp contents below:

#include "cuda_runtime.h"
#include <chrono>
#include <cstdlib>
#include <iostream>

typedef std::chrono::high_resolution_clock Clock;

#define ITER 65535

// CPU version of the vector add function
void vector_add_cpu(int *a, int *b, int *c, int n) {
    // Add the vector elements a and b and store the result in c
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// GPU version of the vector add function
__global__ void vector_add_gpu(int *gpu_a, int *gpu_b, int *gpu_c, int n) {
    // No for loop needed: each GPU thread handles one element,
    // identified by its block number and its position in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may contain threads past the end of the vectors
    if (i < n) {
        gpu_c[i] = gpu_a[i] + gpu_b[i];
    }
}

int main() {

    int *a, *b, *c;
    int *gpu_a, *gpu_b, *gpu_c;

    a = (int *)malloc(ITER * sizeof(int));
    b = (int *)malloc(ITER * sizeof(int));
    c = (int *)malloc(ITER * sizeof(int));

    // We need variables accessible to the GPU.
    // cudaMallocManaged allocates memory that both the CPU and GPU can use
    cudaMallocManaged(&gpu_a, ITER * sizeof(int));
    cudaMallocManaged(&gpu_b, ITER * sizeof(int));
    cudaMallocManaged(&gpu_c, ITER * sizeof(int));

    // Fill the CPU and GPU vectors with the same input values
    for (int i = 0; i < ITER; ++i) {
        a[i] = i;
        b[i] = i;
        c[i] = i;
        gpu_a[i] = i;
        gpu_b[i] = i;
        gpu_c[i] = i;
    }

    // Call the CPU function and time it
    auto cpu_start = Clock::now();
    vector_add_cpu(a, b, c, ITER);
    auto cpu_end = Clock::now();
    std::cout << "vector_add_cpu: "
    << std::chrono::duration_cast<std::chrono::nanoseconds>(cpu_end - cpu_start).count()
    << " nanoseconds.\n";

    // Call the GPU function and time it
    // The triple angle brackets are a CUDA runtime extension that passes
    // launch parameters to a CUDA kernel call: <<<blocks, threads per block>>>.
    // A block holds a limited number of threads (1024 on current GPUs),
    // so we split ITER elements across enough blocks of 256 threads.
    int threads = 256;
    int blocks = (ITER + threads - 1) / threads;
    auto gpu_start = Clock::now();
    vector_add_gpu <<<blocks, threads>>> (gpu_a, gpu_b, gpu_c, ITER);
    cudaDeviceSynchronize();
    auto gpu_end = Clock::now();
    std::cout << "vector_add_gpu: "
    << std::chrono::duration_cast<std::chrono::nanoseconds>(gpu_end - gpu_start).count()
    << " nanoseconds.\n";

    // Free the GPU-function based memory allocations
    cudaFree(gpu_a);
    cudaFree(gpu_b);
    cudaFree(gpu_c);

    // Free the CPU-function based memory allocations
    free(a);
    free(b);
    free(c);

    return 0;
}
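
A CUDA block can hold only a limited number of threads (1024 on current NVIDIA GPUs), so a vector of 65535 elements is normally covered by several blocks, with each thread deriving a global index from its block number and its position inside the block. The host-only C++ sketch below emulates that arithmetic with two nested loops so it can be checked without a GPU. The function names blocks_needed and emulate_launch are hypothetical, and the loop variables stand in for CUDA's blockIdx.x, blockDim.x, and threadIdx.x.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n
// (integer ceiling division).
int blocks_needed(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host emulation of a kernel launch over n elements.
// Returns, for each element, how many emulated threads touched it.
std::vector<int> emulate_launch(int n, int threads_per_block) {
    std::vector<int> visits(n, 0);
    int blocks = blocks_needed(n, threads_per_block);
    for (int block_idx = 0; block_idx < blocks; ++block_idx) {
        for (int thread_idx = 0; thread_idx < threads_per_block; ++thread_idx) {
            // Same formula a kernel would use:
            // blockIdx.x * blockDim.x + threadIdx.x
            int i = block_idx * threads_per_block + thread_idx;
            // Threads in the last block may fall past the end of the data
            if (i < n) {
                ++visits[i];
            }
        }
    }
    return visits;
}
```

With 65535 elements and 256 threads per block, blocks_needed returns 256 blocks, which gives 65536 threads in total; the bounds check discards the single extra thread.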

Makefile contents below:

INC=-I/usr/local/cuda/include
NVCC=/usr/local/cuda/bin/nvcc
# -x cu tells nvcc to treat the .cpp file as CUDA source
NVCC_OPT=-std=c++11 -x cu

all:
	$(NVCC) $(NVCC_OPT) gpu-example.cpp -o gpu-example

clean:
	-rm -f gpu-example

To run the example, compile it:

make

Then run the program:

./gpu-example

The program prints the elapsed time of each version in nanoseconds. On my machine, the CPU version (vector_add_cpu) runs considerably slower than the GPU version (vector_add_gpu).

If that is not the case for you, you may need to raise the ITER define in gpu-example.cpp. The GPU has a fixed setup cost, and for a small loop that cost can be longer than the time the CPU needs to finish the whole job. I found 65535 to work well on my machine, but your mileage may vary. However, once you clear this threshold, the GPU is dramatically faster than the CPU.
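
One way to look for that threshold, rather than editing ITER by hand and recompiling, is to time the same loop over a range of sizes. The sketch below is host-only C++ and measures only the CPU side; the GPU kernel would need the same treatment, with the launch and cudaDeviceSynchronize inside the timed region. The Sample struct and the functions time_cpu_add and sweep are hypothetical names introduced for this illustration.

```cpp
#include <chrono>
#include <vector>

// One measurement: vector length, elapsed time, and a checksum that
// keeps the compiler from discarding the timed loop.
struct Sample {
    int n;
    long long nanoseconds;
    long long checksum;
};

// Time the CPU vector add for vectors of length n.
Sample time_cpu_add(int n) {
    std::vector<int> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = i;
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    auto end = std::chrono::steady_clock::now();

    long long checksum = 0;
    for (int v : c) {
        checksum += v;
    }
    long long ns = static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    return Sample{n, ns, checksum};
}

// Measure sizes first_n, 2 * first_n, 4 * first_n, ... up to last_n.
std::vector<Sample> sweep(int first_n, int last_n) {
    std::vector<Sample> samples;
    if (first_n <= 0) {
        return samples;  // doubling would never advance past zero
    }
    for (long long n = first_n; n <= last_n; n *= 2) {
        samples.push_back(time_cpu_add(static_cast<int>(n)));
    }
    return samples;
}
```

Printing the nanoseconds field of each Sample next to the matching GPU timing shows the size at which the GPU starts to win on a particular machine. Single measurements are noisy, so repeating each size a few times gives a steadier picture.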

Conclusion

I hope you’ve learned a lot from our introduction to GPU programming with C++. The example above doesn’t accomplish a great deal, but the concepts it demonstrates provide a framework that you can build on to apply your own ideas and unleash the power of your GPU.
