# CHAPTER 3: Running Loops in Parallel¶

Next we will consider a program that has a loop in it. An iterative for loop is a remarkably common pattern in all programming, primarily used to perform a calculation N times, often over a set of data containing N elements, using each element in turn inside the for loop.

If there are no dependencies between the iterations (i.e. the order of them is not important), then the code inside the loop can be split between forked threads. However, the programmer must first decide how to partition the work between the threads. Specifically, how many and which iterations of the loop will each thread complete on its own?

The data decomposition pattern describes the way how work is distributed across multiple threads. This chapter presents two patternlets, parallelLoop-equalChunks and parallelLoop-chunksOf1, that describe two common data decomposition strategies.

## 3.1 Parallel Loop, Equal Chunks¶

Let’s next visit the parallelLoop-equalChunks directory. You can navigate to this directory directly from the spmd directory by using the following command (don’t forget to use the Tab key to complete the typing for you):

cd ../parallelLoop-equalChunks


Running ls on this directory reveals two new files:

• Makefile

• parallelLoopEqualChunks.c

Let’s study the code in parallelLoopEqualChunks.c in greater detail:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 /* parallelLoopEqualChunks.c * ... illustrates the use of OpenMP's default parallel for loop in which * threads iterate through equal sized chunks of the index range * (cache-beneficial when accessing adjacent memory locations). * * Joel Adams, Calvin College, November 2009. * * Usage: ./parallelLoopEqualChunks [numThreads] * * Exercise * - Compile and run, comparing output to source code * - try with different numbers of threads, e.g.: 2, 3, 4, 6, 8 */ #include // printf() #include // atoi() #include // OpenMP int main(int argc, char** argv) { const int REPS = 16; printf("\n"); if (argc > 1) { omp_set_num_threads( atoi(argv[1]) ); } #pragma omp parallel for for (int i = 0; i < REPS; i++) { int id = omp_get_thread_num(); printf("Thread %d performed iteration %d\n", id, i); } printf("\n"); return 0; } 

The omp parallel for pragma on line 27 directs the compiler to do the following:

• Generate a team of threads (default is equal to the number of cores)

• Assign each thread an equal number of iterations (a chunk) of the for loop.

• At the end of the scope of the for loop, join all the theads back to a single-threaded process.

As in our previous example, the code up to line 27 is run serially. The code that is in the scope of the omp parallel for pragma (defined by the for loop) is run in parallel, with a subset of iterations assigned to each thread. After the implicit join on line 32, the program once again is a single-threaded process that executes serially to completion.

In the above program, REPS is set to 16. If the program is run with 4 threads, then each thread gets 4 iterations of the loop (see illustration below):

### Try It Out¶

Compile and run the program using the following command:

make
./parallelLoopEqualChunks 4


### Unequal Iterations¶

Sometimes, the number of iterations that a loop runs is not equally divisible by the number of threads.

The equal-chunk decomposition patternlet is especially useful in the following scenarios:

• Each iteration of the loop takes the same amount of time to finish

• The loop involves accesses to data in consecutive memory locations (e.g. an array), allowing the program to take advantage of spatial locality.

## 3.2 Parallel Loop, Chunks of 1¶

In some cases, it makes sense to have iterations assigned to threads in “round-robin” style. In other words, iteration 0 goes to thread 0, iteration 1 goes to thread 1, iteration 2 goes to thread 2, and so on.

Let’s look at the code example in the parallelLoop-chunksOf1 directory. You can visit this directory from the parallelLoopEqualChunks directory by executing the following command:

cd ../parallelLoop-chunksOf1


Running the ls command in this new directory reveals the following files:

• Makefile

• parallelLoopChunksOf1.c

Let’s take a closer look at the parallelLoopChunksOf1.c file:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 /* parallelLoopChunksOf1.c * ... illustrates how to make OpenMP map threads to * parallel loop iterations in chunks of size 1 * (use when not accesssing memory). * * Joel Adams, Calvin College, November 2009. * * Usage: ./parallelLoopChunksOf1 [numThreads] * * Exercise: * 1. Compile and run, comparing output to source code, * and to the output of the 'equal chunks' version. * 2. Uncomment the "commented out" code below, * and verify that both loops produce the same output. * The first loop is simpler but more restrictive; * the second loop is more complex but less restrictive. */ #include #include #include int main(int argc, char** argv) { const int REPS = 16; printf("\n"); if (argc > 1) { omp_set_num_threads( atoi(argv[1]) ); } #pragma omp parallel for schedule(static,1) for (int i = 0; i < REPS; i++) { int id = omp_get_thread_num(); printf("Thread %d performed iteration %d\n", id, i); } /* printf("\n---\n\n"); #pragma omp parallel { int id = omp_get_thread_num(); int numThreads = omp_get_num_threads(); for (int i = id; i < REPS; i += numThreads) { printf("Thread %d performed iteration %d\n", id, i); } } */ printf("\n"); return 0; } 

The code up to line 37 is near identical to the previous program. The difference is the pragma on line 31. The omp parallel for pragma has a new schedule clause which specifies the way iterations should be assigned to threads. The static keyword indicates that the the compiler should assign work to each thread at compile time (a static scheduling policy). The 1 indicates that the chunk size should be 1 iteration. Therefore, the above code would have 16 total chunks.

In the case where the number of chunks exceed the number of theads, each successive chunk is assigned to a thread in round-robin fashion.

### Try It Out¶

Similar to the other examples, you can make the program and run it as follows:

 make
./parallelLoopChunksOf1 4


### Some equivalent code¶

If you carefully looked at the above code file, you would have noticed some code commented out starting on line 38:

 38 39 40 41 42 43 44 45 46 47 48 49 /* printf("\n---\n\n"); #pragma omp parallel { int id = omp_get_thread_num(); int numThreads = omp_get_num_threads(); for (int i = id; i < REPS; i += numThreads) { printf("Thread %d performed iteration %d\n", id, i); } } */ 

This code shows how the round-robin scheduling approach could be achieved with using just the omp parallel pragma. To understand how this works, pay close attention to the for loop on line 44. Each thread initializes its loop with its thread id, continues until it exceeds REPS iterations, incrementing by the number of threads each iteration. Uncommenting and running this code should therefore result in each iteration of the for loop being assigned to each thread in round-robin fashion.

### Dynamic Scheduling¶

In some cases, it is beneficial for the assignment of loop iterations to occur at run-time. This is especially useful when each iteration of the loop can take a different amount of time. Dynamic scheduling, or assigning iterations to threads at run time, allows threads that have finished work to start on new work, while letting threads that are still busy continue to work in peace.

To employ a dynamic scheduling policy, replace the static keyword in the schedule clause on line 31 with the keyword dynamic.