8.2 Reduction with a parallel for loop

This example adds one feature to the example in the previous section: after updating each cell, we add the newly computed value to a running sum of all of the values.

As we have seen in other examples of this kind, we need to ensure that this sum is computed correctly when many threads are used, which we do by making it part of a reduction clause (this clause has the same syntax as the one we use for OpenMP). Look for this in the function called MatrixSum() below, which has pragmas for running it on the GPU.

/*
* OpenACC GPU version of matrix summation operation.
* Demonstrates collapse and reduction clauses.
*/
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h> // just for timing

// function declarations
void fillMatrix(int size, float * A);
void getArguments(int argc, char **argv, int *size, int *verbose, int *check);
void debugPrintMatrix(int verbose, int size, float *matrix, const char *msg);
void showMatrix(int size, float * matrix);
void checkForErrors(float *y, int size, float sum);

// device function:
// Update matrix A values by doing some math and
// return the sum of all of the updated values
float MatrixSum(int size, float * __restrict__ A) {

    float sum = 0.0f;

    #pragma acc kernels
    #pragma acc loop collapse(2) independent reduction(+:sum)
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            // do some contrived calculations that add up to 1.0 in each cell
            // when each data element is equal to Pi.
            A[i*size + j] = hypot(cos(A[i*size + j]), sin(A[i*size + j]));
            sum += A[i*size + j];
        }
    }
    return sum;
}

////////////////////////////////////////////////////////// main
int main (int argc, char **argv) {

    // default values
    int size = 256;          // num rows, cols of square matrix
    int verbose = 0;         // default to not printing matrices
    int check = 0;           // check for errors if >0
    getArguments(argc, argv, &size, &verbose, &check); //change defaults

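    float * A;  // matrix to fill and perform calculations on

    // Use a 'flattened' 1D array of contiguous memory for the matrix
    // size = number of rows = number of columns in the square matrix
    // cast before multiplying so the byte count is computed as a size_t
    size_t num_bytes = (size_t)size * size * sizeof(float);
    A = (float *)malloc(num_bytes);
    if (A == NULL) {
        fprintf(stderr, "Could not allocate %zu bytes\n", num_bytes);
        exit(EXIT_FAILURE);
    }

    fillMatrix(size, A);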

    char msgA[32] = "matrix A after filling:";
    debugPrintMatrix(verbose, size, A, msgA);

    double startTime = omp_get_wtime();

    float total = MatrixSum(size, A);

    // stop the clock before printing so output is not included in the time
    double endTime = omp_get_wtime();

    char msgC[32] = "matrix A after MatrixSum(): ";
    debugPrintMatrix(verbose, size, A, msgC);
    printf("Sum of all values = %f\n", total);

    printf("\nTotal runtime %f seconds (%f milliseconds)\n",
        (endTime-startTime), (endTime-startTime)*1000);

    if (check) {
        checkForErrors(A, size, total);
    }

    free(A);
    return 0;
}
////////////////////////////////////// end main

// fill a given square matrix with rows of float values
// equal to Pi
void fillMatrix(int size, float * A) {
    for (int i = 0; i < size; ++i) {
        for (int j = 0; j < size; ++j) {
            A[i*size + j] = ((float)M_PI);
        }
    }
}

void getArguments(int argc, char **argv, int *size, int *verbose, int *check) {
    // 3 arguments optional:
    //   size of one side of square matrix
    //   verbose printing for debugging
    //   whether to check for correct result
    if (argc > 4) {
        fprintf(stderr,"Use: %s [size] [verbose] [check] \n", argv[0]);
        exit(EXIT_FAILURE);
    }

    if (argc >= 2) {
        *size = atoi(argv[1]);
        if (argc >= 3) {
            *verbose = atoi(argv[2]);
        }
        if (argc == 4) {
            *check = atoi(argv[3]);
        }
    }

    if (*verbose) {
        printf("size of matrix side: %d\n", *size);
    }
}

void debugPrintMatrix(int verbose, int size, float *matrix, const char *msg) {
    if (verbose){
        printf("%s \n", msg);
        showMatrix(size, matrix);
    }
}

// display a given square matrix for debugging purposes
void showMatrix(int size, float * matrix) {
    int i, j;
    for (i=0; i<size; i++){
        for (j=0; j<size; j++) {
            printf("element [%d][%d] = %f \n",i,j, matrix[i*size + j]);
        }
    }
}

// check whether the kernel functions worked as expected
void checkForErrors(float *y, int size, float sum) {
    // Check for errors (all values should be 1.0f)
    float maxError = 0.0f;
    for (int i=0; i<size; i++){
        for (int j=0; j<size; j++) {
            // index the flattened matrix by row and column
            maxError = fmaxf(maxError, fabsf(y[i*size + j]-1.0f));
        }
    }
    printf("Max error in any data element: %f\n", maxError);
    // every cell should hold 1.0, so the sum should equal the cell count
    float estSum = (float)size * (float)size;
    float sumError = estSum - sum;
    printf("Sum is off by: %f\n", sumError);
}
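
To isolate the reduction clause used in MatrixSum() above, here is a minimal sketch of the same idea on a one-dimensional loop, written once with OpenACC and once with OpenMP so the shared clause syntax is easy to compare. The function names sumOpenACC and sumOpenMP, and the copyin data clause, are assumptions made for this illustration and are not part of the listing above. If the compiler is run without OpenACC or OpenMP support enabled, the pragmas are ignored and both loops simply run serially.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper (not in the original listing):
// sum n floats using an OpenACC reduction.
float sumOpenACC(int n, const float *x) {
    assert(n >= 0 && (x != NULL || n == 0));
    float sum = 0.0f;
    // reduction(+:sum) gives each thread its own partial sum and
    // combines the partial sums with + when the loop finishes.
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n])
    for (int i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}

// Hypothetical helper: the same loop with the OpenMP form of the clause.
float sumOpenMP(int n, const float *x) {
    assert(n >= 0 && (x != NULL || n == 0));
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```

Without the reduction clause, many threads would update sum at the same time and some additions could be lost, which is the problem the clause in MatrixSum() is there to prevent.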

Just like the previous example, this code takes 3 optional command line arguments, in this order:

1. The size of one side of the square matrix being used.
2. Whether to print out the arrays after the manipulation (default of zero is don’t print, non-zero is print). Use this only with very small values for the size of a side of the matrix, since this book doesn’t return large print buffers and long output is hard to read.
3. Whether to check if the results are correct. The particular contrived computation we chose is easy to check.

Exercises

These exercises are very similar to the previous section’s example.

1. Use the command line arguments above to see what the manipulation of the data elements produces and to confirm that the data check passes, including whether the sum has the right value.
2. After running the default, try larger matrix sizes to take advantage of the GPU. Try 5000, 10000, 20000, and 40000 as the size in the command line arguments. Jot down the time for each one.
3. How many times more calculations than the previous trial are we doing when we double the size of one side of the matrix like this? (Hint: try with 2x2, then 4x4, then 8x8, then 16x16, dividing the current one by the preceding one.) Comparing this with the times you recorded in Exercise 2 can give you some sense of the scalability of this GPU solution.
4. You could try creating a multicore CPU version and test it for correctness and timing.
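
One way to check your arithmetic for the exercise on doubling the matrix side is a small helper like the sketch below. The names cellCount and workRatio are hypothetical and do not appear in the program above. The sketch assumes that the amount of work is proportional to the number of cells, which matches MatrixSum() doing one update and one addition per cell.

```c
#include <assert.h>

// Hypothetical helper: number of cells in a size x size matrix.
// Uses long long so that large sides do not overflow an int.
long long cellCount(int size) {
    assert(size > 0);
    return (long long)size * size;
}

// Hypothetical helper: how many times more cells the larger matrix
// has than the smaller one.
double workRatio(int smaller, int larger) {
    return (double)cellCount(larger) / (double)cellCount(smaller);
}
```

For example, calling workRatio(5000, 10000) compares the first two sizes suggested in the exercises. If the recorded times grow by a similar factor, the run time is scaling with the amount of work.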
