8.3 A classic linear algebra example: matrix multiply

Let’s now introduce an operation that takes two matrices as input and creates a third as output: matrix multiply.

If you are unfamiliar with how matrix multiplication code works, you should read over our explanation in Chapter 6, section 2 of our PDC for Beginners book, where we describe the problem and introduce solutions, including a sequential, an OpenMP, and a CUDA version.

Here we can essentially start from the sequential version and add OpenACC pragmas that enable the pgcc/nvc compiler to generate a GPU version whose performance is similar to the CUDA version we introduced in PDC for Beginners. As we saw there, this example works extremely well on GPUs and enables us to work on much larger matrices without waiting very long for the operations to complete.

The key features to note in this code are:

As in prior examples in this chapter, the command line arguments are the same, in this order:


  1. You can try running the following problem sizes for the side of each square matrix. What do you observe about the changes in the running times? You can refer to the detailed explanation in Chapter 6, section 2 of our PDC for Beginners book to get a better sense of the big-Oh order of this algorithm and why the times scale as they do when you double the side of each matrix.

  2. Visit Chapter 6, section 2 of our PDC for Beginners book and scroll to the complete code and the section labeled ‘Experimenting with the programs’. There you will find a sequential and an OpenMP version of code for this problem. Try collecting times for the sequential version in the first tab and the OpenMP version with 8 cores in the second tab, and compare them with the OpenACC GPU version here. How many times faster is the OpenACC version than either of those two for 1024x1024 matrices? Note that this is how we compare GPU device versions, which use many cores that are individually slower than a CPU core, to versions run on CPUs.

Final thought: a way of work

In this chapter we have used examples that stay true to a general ‘way of work’ for creating parallel versions of code.

  1. First, have a method for verifying the correctness of your solution.

  2. Next, run experiments to determine how well your program scales. From these examples, you can see that knowing how the algorithm works and its sequential performance in terms of big-Oh often helps you to see and explain the improvements on the manycore version.

  3. For manycore GPU versions of algorithms, we usually examine their performance by considering how many times faster they run on the same problem size than a sequential or multicore version.

As we look at other examples, we will also see that determining where to focus on parallelizing a larger program will be another step in our working process of parallelizing its code.
