5.2 Vector Addition Example: see hardware in action

This example is designed to show you how to run an experiment that reveals the difference in speed between a single CPU core and a single GPU core. To do this, we will run the vector addition on the host, where it uses one CPU core, and then run it in a kernel function that uses only one thread on one core of the GPU (never a good idea for real code, but fine for this type of experiment).

Parts that remain the same from the previous chapter

  • We use the same macro for detecting and reporting errors in code run on the device.

  • We use the same function for initializing the arrays with values and the same function for verifying that the result of the vector addition is correct.

  • We use managed memory as in the final example from the previous chapter.

Differences for this example

  1. The command line argument is now the array size so that we can change it each time we run it. The getArguments() function now looks like this:

getArguments function
// Simple argument gathering for this 1D example program.
//
// Design: one optional argument:
//   the number of data elements in the 1D vector arrays
void getArguments(int argc, char **argv, int *numElements) {

  if (argc == 2) {
    *numElements = atoi(argv[1]);
  }
}

We use it in main by setting a default value and then overwriting it if a value is given on the command line, like this:

use of getArguments function
  int N = 32*1048576;   // default: 33,554,432 elements
  
  // get optional argument: change array size
  getArguments(argc, argv, &N); 

  printf("size (N) of 1D array is: %d\n\n", N);

  2. Case 1: We have a host function to add the two arrays on the host CPU:

HostAdd function for CPU
// To run code on host for comparison
void HostAdd(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

Note that this function does not use the __global__ keyword and therefore acts as a regular function that runs on the host CPU.

  3. Case 2: We have a kernel function that runs on only one thread on one core of the GPU:

the add kernel function for 1 thread
// Kernel function to add the elements of two arrays
// This one is sequential on one GPU core.
__global__ void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

Note that we make no use of multiple threads here: the kernel never computes a thread number to derive an array index, as a parallel kernel would (see the sketch just below for contrast). Note below how we call this function in main.
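For contrast, a parallel version of this kernel, following the pattern from the previous chapter, would compute a per-thread index. The sketch below is for comparison only and is not part of this example's code:

// Sketch for contrast only: the usual parallel pattern,
// where each thread computes its own array index.
__global__ void addParallel(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = x[i] + y[i];
}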

  4. We introduce how we can time our code. In this case we will use standard C functions. CUDA libraries also provide timing functions, but this method works just as well in many cases. In main, we time the host function like this:

C language timing of HostAdd function
  // case 1: run on the host on one core
  t_start = clock();
  // sequentially on the host
  HostAdd(N, x, y);
  t_end = clock();
  tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
  tot_time_milliseconds = tot_time_secs*1000;
  printf("\nSequential time on host: %f seconds (%f milliseconds)\n", 
          tot_time_secs, tot_time_milliseconds);
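The timing variables used in this snippet are declared earlier in main, and clock() comes from the standard C time header. A minimal sketch of the declarations this snippet assumes (the names match the snippet; the exact placement in the full file may differ):

#include <time.h>   // at the top of the file, for clock() and CLOCKS_PER_SEC

  clock_t t_start, t_end;                        // raw clock ticks, in main
  double tot_time_secs, tot_time_milliseconds;   // converted times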

We do similar timing around the kernel function call in main:

C language timing of add kernel function
  // case 2:
  // Purely an illustration of something you do not ordinarily do:
  // Run kernel on all elements on the GPU sequentially on one thread
  
  // re-initialize
  initialize(x, y, N);

  t_start = clock();

  add<<<1, 1>>>(N, x, y);   // the kernel call
  cudaCheckErrors("add kernel call");

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  cudaCheckErrors("Failure to synchronize device");
  
  t_end = clock();
  tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
  tot_time_milliseconds = tot_time_secs*1000;
  printf("\nSequential time on one device thread: %f seconds (%f milliseconds)\n", 
         tot_time_secs, tot_time_milliseconds);
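As an aside, the CUDA timing functions mentioned earlier are CUDA events. A minimal sketch of timing the same kernel call with events, not part of this example's code, looks like this:

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  add<<<1, 1>>>(N, x, y);       // the kernel call
  cudaEventRecord(stop);

  cudaEventSynchronize(stop);   // wait until the stop event completes
  float elapsed_ms = 0.0f;
  cudaEventElapsedTime(&elapsed_ms, start, stop);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);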

Note

The simple trick to get the code to run on one GPU device thread is the kernel call itself, shown above:

add<<<1, 1>>>(N, x, y); // the kernel call

This also illustrates that instead of using dim3 variables between the <<< and >>> symbols, we can use integers for just the x values: the number of blocks in the grid and the number of threads in each block. In this case, <<<1, 1>>> indicates one block containing one thread.
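For comparison, the equivalent call using dim3 variables would look like this sketch (the variable names here are our own, not from the example code):

  dim3 numBlocks(1);         // 1 block in the grid (x dimension only)
  dim3 threadsPerBlock(1);   // 1 thread per block (x dimension only)
  add<<<numBlocks, threadsPerBlock>>>(N, x, y);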

We use this kernel function solely to compare the time taken by a CPU core running HostAdd to the time taken to run the same code on one GPU thread.

The complete code for the CPU to GPU comparison

The full program is in the file named below under “Build and run on your machine”. Try using ‘16777216’ as the command line argument to run with half the default array size, and compare the two reported times.


Build and run on your machine

File: 4-UMVectorAdd-timing/vectorAdd-1.cu

Just as for previous examples, you can use the make command on your own machine or compile the code like this:

nvcc -arch=native -o vectorAdd-1 vectorAdd-1.cu

Remember that you will need to use a different -arch flag if native does not work for you. (See note at end of section 4.1.)
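For example, if you know your GPU's compute capability, you can name the architecture explicitly; sm_70 here is just an illustration, for a Volta-class GPU:

nvcc -arch=sm_70 -o vectorAdd-1 vectorAdd-1.cu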

You can execute this code like this:

./vectorAdd-1
./vectorAdd-1 16777216