7.4 GPU device code using OpenACC pragmas with the pgcc compiler

In this section we will finally see how we use OpenACC pragmas and pgcc compiler options to indicate that we want to execute the addition of the two arrays on the GPU device.

Let’s examine the code below, starting with the compiler arguments shown at the bottom of the code; each of them tells pgcc something about how to build, and report on, the accelerator version of this program.
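
The exact arguments are part of the runnable example itself, so they are not repeated here. As a rough guide, a pgcc build of an OpenACC program like this one commonly uses options along the following lines (the file names are made up for illustration, and the exact set of flags in the example may differ):

    pgcc -acc -ta=tesla -Minfo=accel -mp -o gpu_add gpu_add.c

Here -acc asks pgcc to process the OpenACC pragmas, -ta=tesla targets NVIDIA GPU devices, -Minfo=accel makes the compiler report how it handled each accelerator region (this is the output examined in the exercises below), and -mp enables the OpenMP functions we use for timing.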

Next, let’s look at the OpenACC pragmas used on lines 10 and 11 in the code below.
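
The complete, runnable code (with its line numbers) is in the interactive example; the following is only a minimal sketch of what the GPUadd function and those two pragmas could look like, with the signature and clause details assumed from the descriptions in the exercises below rather than copied from the example:

    void GPUadd(int n, float *x, float *y) {
        // Copy x to the device; copy y to the device and back to the host.
        #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
        // Assert that the iterations do not depend on one another, so the
        // compiler is free to spread them across the GPU's threads.
        #pragma acc loop independent
        for (int i = 0; i < n; i++) {
            y[i] = x[i] + y[i];
        }
    }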

Note

  • With a GPU device containing thousands of cores, the programming model is different: we consider that there are enough threads to work on each data element of the array independently. This isn’t strictly true when our arrays are extremely large, but the GPU system manages which threads in different thread blocks will map to the updates of the elements of the arrays in the loop.

  • In the main program, code executes on the host CPU until the function GPUadd() is called; then execution moves to the GPU, with array memory being managed between the device and the host. When GPUadd() completes and device memory is copied back to the host, execution resumes on the host CPU (see the sketch below).
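
A sketch of that host-side flow might look like the following, where the array size, initial values, and allocation details are assumptions for illustration rather than the example's actual code:

    #include <stdlib.h>

    void GPUadd(int n, float *x, float *y);   // defined as sketched above

    int main(int argc, char **argv) {
        int n = 1 << 20;    // in the real example the size comes from the -n argument
        float *x = (float *) malloc(n * sizeof(float));
        float *y = (float *) malloc(n * sizeof(float));

        for (int i = 0; i < n; i++) {   // runs on the host CPU
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        GPUadd(n, x, y);    // arrays move to the device, the loop runs there,
                            // and y is copied back to the host

        // execution resumes on the host CPU; y now holds the element-wise sums
        free(x);
        free(y);
        return 0;
    }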

Exercises

  • Run as is to see that the output still looks the same with small arrays.

  • Note the compiler output that appears after the ===== STANDARD ERROR ===== line. The compiler is indicating two important things:
    1. That the data in the array called x is being copied in from the host and the data in the array called y is being copied in and back out, as indicated by the keyword ‘copy’.

    2. That the compiler is parallelizing the loop for the GPU and in this case is setting up gangs (equivalent to CUDA blocks) of 128 threads.

  • Remove ‘-n’, ‘10’ from the square brackets in the command arguments and run again with the default size.

  • Explore the need for ‘loop independent’: try using [‘-n’, ‘8192’] for the command line arguments and eliminating the word ‘independent’ from the second pragma in the GPUadd function (see the sketch after this list). Carefully observe the compiler output. When you see output indicating that a possible dependence prevents parallelization, the compiler is choosing not to run the loop in parallel. This is the result of the compiler being conservative: as the developer, you need to tell it that the calculations are independent, or it often will not choose to set up the parallelism.
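
For reference, the change described in the last exercise amounts to going from the first form below to the second. This is only a sketch; the loop body matches the earlier GPUadd sketch, not necessarily the example's exact code:

    /* With the assertion, the compiler parallelizes the loop. */
    #pragma acc loop independent
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];

    /* Without it, the compiler must prove that the iterations are independent.
       Because x and y are plain C pointers that could, in principle, refer to
       overlapping memory, it typically cannot, so it reports a possible
       dependence and does not generate a parallel version of the loop. */
    #pragma acc loop
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];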

Important point

As with other examples in this chapter, we are using OpenMP functions to time our code, specifically how long it takes to copy the data to the device, compute the addition of the elements, and copy the result back. The main point you should see here is that this version with the GPUadd function runs slower than the previous CPU versions. We need to give the GPU more computational work to make running functions on the device worthwhile: there is a cost for the data movement between the host and the GPU device, and the computation time must be high enough that the data movement time becomes an insignificant portion of the overall time. We will see examples where this is the case in the next couple of chapters.
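
The timing pattern itself is small; a sketch of it, assuming omp_get_wtime() is the timing function the example uses (compile with -mp so the OpenMP runtime is available):

    #include <omp.h>
    #include <stdio.h>

    // ... x and y allocated and initialized as before ...
    double start = omp_get_wtime();
    GPUadd(n, x, y);    // the host-to-device copies, the device loop, and the
                        // copy of y back to the host all happen inside this call
    double elapsed = omp_get_wtime() - start;
    printf("GPUadd took %f seconds\n", elapsed);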
