5.4 Vector Addition Example: exploring thread block mappings¶

Here is a full example that tries running the code under several conditions:

Case 3: using one block of 256 threads (or you can change this default block size)
Case 4: using 16 blocks of 256 threads (or you can change this default block size)
Case 5: using as many blocks of 256 threads that we need for the entire array (or you change the block size)

The default first argument is the blockSize of 256, and the second argument is the array size, N, whose default size is 33554432 (a multiple of 256 and also 1024, which is sometimes the largest recommended number of threads per block). Note at the end of the code that the getArguments() function now handles two optional arguments for blockSize and N.

Each case takes a bit of time, so be patient while it executes each one before sending the results back.

//
 // Demonstration using a single 1D grid and
 // 1, or more 1D blocks of optional size
 //
 /*
 * Example of vector addition :
 * Array of floats x is added to array of floats y and the
 * result is placed back in y
 *
 * Timing added for analysis of block size differences.
 */

#include <math.h>
 #include <iostream> // alternative cout print for illustration
 #include <time.h>
 #include <cuda.h>

void initialize(float *x, float *y, int N);
 void verifyCorrect(float *y, int N);
 void getArguments(int argc, char **argv, int *blockSize, int *numElements);

///////
 // error checking macro taken from Oakridge Nat'l lab training code:
 // https://github.com/olcf/cuda-training-series
 ////////
 #define cudaCheckErrors(msg) \
     do { \
         cudaError_t __err = cudaGetLastError(); \
         if (__err != cudaSuccess) { \
             fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                 msg, cudaGetErrorString(__err), \
                 __FILE__, __LINE__); \
             fprintf(stderr, "*** FAILED - ABORTING\n"); \
             exit(1); \
         } \
     } while (0)

// Parallel version that uses threads in the block.
 //
 //  If block size is 8, e.g.
 //    thread 0 works on index 0, 8, 16, 24, etc. of each array
 //    thread 1 works on index 1, 9, 17, 25, etc.
 //    thread 2 works on index 2, 10, 18, 26, etc.
 //
 // This is mapping a 1D block of threads onto these 1D arrays.
 __global__
 void add_parallel_1block(int n, float *x, float *y)
 {
   int index = threadIdx.x;    // which thread am I in the block?
   int stride = blockDim.x;    // threads per block
   for (int i = index; i < n; i += stride)
       y[i] = x[i] + y[i];
 }

// In this version, thread number is its block number
 // in the grid (blockIdx.x) times
 // the threads per block plus which thread it is in that block.
 //
 // Then the 'stride' to the next element in the array goes forward
 // by multiplying threads per block (blockDim.x) times
 // the number of blocks in the grid (gridDim.x).

__global__
 void add_parallel_nblocks(int n, float *x, float *y)
 {
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = index; i < n; i += stride)
     y[i] = x[i] + y[i];
 }

// Kernel function based on 1D grid of 1D blocks of threads
 // In this version, thread number is:
 //  its block number in the grid (blockIdx.x) times
 // the threads per block plus which thread it is in that block.
 //
 // This thread id is then the index into the 1D array of floats.
 // This represents the simplest type of mapping:
 // Each thread takes care of one element of the result
 //
 // For this to work, the number of blocks specified
 // times the specified threads per block must
 // be the same or greater than the size of the array.
 __global__
 void vecAdd(float *x, float *y, int n)
 {
     // Get our global thread ID
     int id = (blockIdx.x * blockDim.x) + threadIdx.x;

// Make sure we do not go out of bounds
     if (id < n)
         y[id] = x[id] + y[id];
 }
 /// temp test

int main(int argc, char **argv)
 {
   printf("This program lets us experiment with the number of blocks and the \n");
   printf("number of threads per block to see its effect on running time.\n");
   printf("\nUsage:\n");
   printf("%s [num threads per block] [array_size]\n\n", argv[0]);
   printf("\nwhere you can optionally specify the number of threads per block \n");
   printf("and the array size\n");

// Set up default size of arrays
   // multiple of 1024 to match largest threads per block
   // allowed in many NVIDIA GPUs
   //
   int N = 32*1048576;
   int blockSize = 256;     // threads per block
   float *x, *y;

// determine largest number of threads per block allowed
   int devId;            // the number assigned to the GPU
   int maxThreadsPerBlock;  // maximum threads available per block
   cudaGetDevice(&devId);
   cudaDeviceGetAttribute(&maxThreadsPerBlock,
     cudaDevAttrMaxThreadsPerBlock, devId);

// get optional arguments
   getArguments(argc, argv, &blockSize, &N);

//change if requested block size is too large
   if (blockSize > maxThreadsPerBlock) {
     blockSize = maxThreadsPerBlock;
     printf("WARNING: using %d threads per block, which is the maximum.", blockSize);
   }

printf("size (N) of 1D array is: %d\n\n", N);
   // Size, in bytes, of each vector; just use below
   size_t bytes = N*sizeof(float);

// Allocate Unified Memory - accessible from CPU or GPU
   cudaMallocManaged(&x, bytes);
   cudaMallocManaged(&y, bytes);
   cudaCheckErrors("allocate managed memory");

// initialize x and y arrays on the host
   initialize(x, y, N);

clock_t t_start, t_end;              // for timing
   double tot_time_secs;
   double tot_time_milliseconds;

///////////////////////////////////////////////////////////////////
   // case 3: using a single block of threads
   // re-initialize x and y arrays on the host
   initialize(x, y, N);

// Use the GPU in parallel with one block of threads.
   // Essentially using one SM for the block.

t_start = clock();

add_parallel_1block<<<1, blockSize>>>(N, x, y);   // the kernel call
   cudaCheckErrors("add_parallel_1block kernel call");

// Wait for GPU to finish before accessing on host
   cudaDeviceSynchronize();
   cudaCheckErrors("Failure to synchronize device");

t_end = clock();
   tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
   tot_time_milliseconds = tot_time_secs*1000;

printf("\nCase 3: Parallel time on 1 block of %d threads: %f milliseconds\n",
      blockSize, tot_time_milliseconds);

verifyCorrect(y, N);

///////////////////////////////////////////////////////////////////
   // Case 4:
   // Now use multiple blocks so that we use more than one SM.
   // Also use a slightly different 'stride' pattern for which
   // thread works on which element.

// re-initialize x and y arrays on the host
   initialize(x, y, N);

// Number of thread blocks in grid could be fixed
   // and smaller than maximum needed.
   int gridSize = 16;

printf("\n----------- number of %d-thread blocks: %d\n", blockSize, gridSize);

t_start = clock();
   // the kernel call assuming a fixed grid size and using a stride
   add_parallel_nblocks<<<gridSize, blockSize>>>(N, x, y);
   cudaCheckErrors("add_parallel_nblocks kernel call");

// Wait for GPU to finish before accessing on host
   cudaDeviceSynchronize();
   cudaCheckErrors("Failure to synchronize device");

t_end = clock();
   tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
   tot_time_milliseconds = tot_time_secs*1000;
   printf("Case 4: Stride loop pattern: \n");
   printf("Parallel time on %d blocks of %d threads = %f milliseconds\n", gridSize, blockSize, tot_time_milliseconds);

verifyCorrect(y, N);

//////////////////////////////////////////////////////////////////
   // case 5: without using stride
   //
   // re-initialize x and y arrays on the host
   initialize(x, y, N);

// set grid size based on array size and block size
   gridSize = ((int)ceil((float)N/blockSize));

printf("\n----------- number of %d-thread blocks: %d\n", blockSize, gridSize);

t_start = clock();
   // the kernel call
   vecAdd<<<gridSize, blockSize>>>(x, y, N);
   cudaCheckErrors("vecAdd kernel call");

// Wait for GPU to finish before accessing on host
   cudaDeviceSynchronize();
   cudaCheckErrors("Failure to synchronize device");

t_end = clock();
   tot_time_secs = ((double)(t_end-t_start)) / CLOCKS_PER_SEC;
   tot_time_milliseconds = tot_time_secs*1000;
   printf("Case 5: No stride loop pattern: \n");
   printf("Parallel time on %d blocks of %d threads = %f milliseconds\n", gridSize, blockSize, tot_time_milliseconds);

verifyCorrect(y, N);

///////////////////////////// end of tests

// Free memory
   cudaFree(x);
   cudaFree(y);
   cudaCheckErrors("free cuda memory");

return 0;
 }

// To reset the arrays for each trial
 void initialize(float *x, float *y, int N) {
   // initialize x and y arrays on the host
   for (int i = 0; i < N; i++) {
     x[i] = 1.0f;
     y[i] = 2.0f;
   }
 }

// check whether the kernel functions worked as expected
 void verifyCorrect(float *y, int N) {
   // Check for errors (all values should be 3.0f)
   float maxError = 0.0f;
   for (int i = 0; i < N; i++)
     maxError = fmax(maxError, fabs(y[i]- 3.0f));
   std::cout << "Max error: " << maxError << std::endl;
 }

// simple argument gather for this simple 1D example program
 //
 // Design is the arguments will be optional in this order:
 //  number of threads per block
 //  number of  data elements in 1D vector arrays
 void getArguments(int argc, char **argv, int *blockSize, int *numElements) {

if (argc == 3) {   // both given
     *numElements = atoi(argv[2]);
     *blockSize = atoi(argv[1]);
   } else if (argc == 2) {   // just threads per block given
     *blockSize = atoi(argv[1]);
   }
 }

5.4-1: What explains the different run times?

Choose all that are correct. After observing this code running, using 1 block of threads instead of multiple blocks of threads is slower because:

The 1 block case is scheduled on multiple SMs, making it slower.
1 block maps to 1 SM, so it does not use all the SMs available on the GPU.
The multiple blocks cases use multiple SMs, making better use of the hardware.
Using more SMs often increases the speed of your code.
The 1 block case is scheduled on 1 SM, so it does not use all the SMs available on the GPU.
Using more SMs often increases the speed of your code.

An Exercise¶

Observe the times that each case in the code takes with the default array size, which is 33554432, and the default blockSize of 256. What time do you observe when using half of the default array size, or 16777216?

['256', '16777216']

Try doubling the size of the original array in the command line arguments, like this:

['256', '67108864']

What time does it take to run each array size on each of case 4 (stride) and case 5 (no stride)?

Go even higher:

['256', '134217728']

On a sequential version of this algorithm, the timings should be twice as long as we double the array size N for this type of algorithm, which we call an O(N) algorithm.

The reason you see better results than an O(N) single core solution is that as the size of the array increases, we are able to use more cores in parallel across all the SMs on this particular GPU card and more importantly it is able to schedule the computations on those cores effectively.

Note

Though there is a difference between case 4 and 5 times, it is fairly small (the time is reported in milliseconds, or thousandths of a second), and may be different for each GPU card. This means either method works well for this particular problem.

Note

An important point about the design of the NVIDIA cards is that the block size should be a multiple of 32 and that for today’s cards, experiments seem to show that block sizes of 128, 256, or 512 are preferred choices for the design of the hardware.

Build and run on your machine¶

File: 4-UMVectorAdd-timing/vectorAdd.cu

Just as for previous examples, you can use the make command on your own machine or compile the code like this:

nvcc -arch=native  -o vectorAdd vectorAdd.cu

Remember that you will need to use a different -arch flag if native does not work for you. (See note at end of section 4.1.)

You can execute this code like this:

./vectorAdd

You can also experiment with trying a smaller or larger block size, by running like this:

./vectorAdd 128
./vectorAdd 512

Also try changing the array sizes as we did earlier.

Test your understanding¶

5.4-2: Using one block of threads is a reasonable choice for this example.

True.
Look carefully at the case of one block.
False.
Yes! The single block case ran much slower than when using multiple blocks of threads.

5.4-3: When we use multiple blocks and double the size of our problem, the code takes slightly less than twice the time to run on the GPU for each case.

True.
Yes! In this case, the GPU cores are being scheduled very efficiently.
False.
Try running each case a few more times to determine if it really is true.

Summary¶

This example shows that a host CPU is faster than a single core on a GPU by quite some margin. So to use GPUs effectively, you need to use as many cores as possible in parallel to complete the computation. From this example, you can see that this is possible when the mapping of cores to data elements is straightforward and the computation on each data element is simple (though this example still works well with more sophisticated mathematical calculations involving single elements of each array).

In Araujo(2023), the authors performed an extensive study to determine the affect of the block size on a variety of different benchmark code examples. They concluded that under some circumstances the block size had very small effects on the runtime, but for other cases, keeping it small or large made a considerable difference on how fast the codes ran. Here you likely will see minor effects for vector addition, but in other cases you may not, so it is best to design your code so that you can change it and run experiments. It still holds from their results that block sizes of 128, 256, or 512 are preferred choices.

In this same study, the authors ran their experiments 10 times for the same conditions, getting an average time. This is also a practice you will want to get into the habit of when testing out your code. You should have seen variation in your timing results as you ran the same condition multiple times.

Exercises for Further Exploration¶

There is overhead creating the Unified Memory array and copying it to the GPU. The vector addition computation as we use more blocks of threads on the GPU does not increase by exactly twice as we double the array size, but a complete solution with timing should include that memory overhead. You could try creating a version of just case 5 that timed all parts of the code.
Given this example of how the code can be timed on the host, go back and add timing to the code that did not use unified memory. The results will enable you to determine whether our assertion that using unified memory is the preferred method is true for this example.
Another exercise is to consider when the 5th case, using one thread id per array index and calculate the number of blocks, could fail. Though likely a rare case, it is worth thinking about. To do it, go back to the information about your device and determine the maximum number of blocks allowed in a grid.
For case 4, experiment with changing the fixed number of blocks. Is there any case where the time is consistently better or worse than the case where we calculate the number of blocks in the grid based on N and the block size?
There is an example provided in our GitHub repository where CUDA library functions are used for timing the code instead of C timing functions on the host. If you want to explore this example, you can see how CUDA has also provided mechanisms for timing code that can sometimes be useful for adding timing to sophisticated kernel or device functions.

References¶

Araujo, G., Griebler, D., Rockenbach, D. A., Danelutto, M., & Fernandes, L. G. (2023). NAS Parallel Benchmarks with CUDA and beyond. Software: Practice and Experience, 53(1), 53-80.

You have attempted of activities on this page