
OpenMP: Parallel For

HPC Education

Hey guys! Welcome to HPC Education!
And today we’ll be looking at the Parallel For loop construct.
Before we begin with the Parallel For loop, let's quickly go through the basic for loop. The for loop is a repetition control structure that executes a block of code a number of times. Init-expr is executed first, and only once; this step lets you declare and initialize any loop control variables, for example int i = 0. Next, the condition is evaluated. If it is true, the body of the loop is executed; if it is false, the body does not execute and control jumps to the first statement after the for loop. After the body executes once, the incr-expr statement runs, which either increments or decrements the loop control variable. The loop keeps running until the condition fails, and then it exits.
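To make init-expr, condition, and incr-expr concrete, here is a minimal C illustration (my example; the video's own code isn't shown verbatim in this transcript):

```c
#include <stdio.h>

int main(void) {
    /* init-expr runs once; the condition is checked before each
       iteration; incr-expr runs after each iteration of the body. */
    for (int i = 0; i < 5; i++) {
        printf("iteration %d\n", i);
    }
    return 0;
}
```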
Now let's look at a trivial example of the serial for loop. Here's a simple for loop that finds the sum of the first 100 natural numbers. We initialize i to 1 and sum to 0, then run the loop up to 100, adding the value of i to sum and incrementing i each iteration. When i crosses 100, the condition fails and the loop exits, and we print the sum. The answer is 5050, which can be verified with the formula n(n+1)/2.

Okay, so that's simple. Now we want to use multiple threads to make this loop faster. This is where the parallel for construct comes into play. By adding just a few statements, you can make your for loop divide its computation over multiple threads to speed things up drastically. Although this is just a trivial example and we aren't really making any difference; in fact, we are actually taking more time due to a concept called false sharing, but I'll get to that later.

So, coming to our modified code: to use any OpenMP constructs we first need to include the necessary header file, which is done by including omp.h. In our variable declarations you must have noticed a new array, thread_sum, of size 4. This array will store some intermediate values; you will understand what that means pretty soon. Our next statement, omp_set_num_threads, sets the number of threads to 4. We can skip this line to let the runtime choose for us. #pragma omp parallel is a compiler directive that marks the beginning of the parallel portion of our code, which has to be enclosed in a pair of curly braces. Everything inside these braces will be executed in parallel; you can picture the serial program splitting into 4 blocks running in parallel from this point, each block running on its own thread. First we use omp_get_thread_num() to get the thread ID and store it in a variable called ID. Then we initialize each position of the thread_sum array, indexed by the thread ID, to zero. I think you have an idea of what we're trying to do here: we're splitting the basic for loop that runs 100 times to add 100 numbers into 4 loops that run 25 times each to add 25 numbers. Each of these loops saves its respective sum into the thread_sum array.

Now we exit the parallel region of our code and, just like that, everything becomes serial again! Here we run another simple for loop 4 times, adding the 4 sums calculated by the 4 threads, and we print this sum. And with just a few more lines of code we have, in principle, reduced the loop to a quarter of its running time. Now try this simple code by yourself, and don't forget to enable OpenMP using -fopenmp during compilation on GCC. For other compilers, please refer to the documentation. In case you get stuck, don't worry!
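The transcript describes the code rather than showing it, so here is a sketch of both versions reconstructed from the narration. The names thread_sum and ID come from the narration; the exact loop-splitting arithmetic (each thread handling a contiguous block of 25 numbers) is my assumption:

```c
/* Serial version: sum of the first 100 natural numbers. */
#include <stdio.h>

int main(void) {
    int sum = 0;
    for (int i = 1; i <= 100; i++) {
        sum += i;
    }
    printf("sum = %d\n", sum);  /* prints 5050 = 100 * 101 / 2 */
    return 0;
}
```

And the parallel version, as described in the walkthrough above:

```c
/* Parallel version, reconstructed from the narration.
   Each of 4 threads sums its own 25-number slice into thread_sum[ID];
   the partial sums are combined serially afterwards.
   Compile with: gcc -fopenmp sum.c */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int thread_sum[4];          /* one partial sum per thread */
    int sum = 0;

    omp_set_num_threads(4);     /* optional: omit to let the runtime choose */

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();   /* 0, 1, 2, or 3 */
        thread_sum[ID] = 0;
        /* Thread ID handles the range [ID*25 + 1, (ID+1)*25]. */
        for (int i = ID * 25 + 1; i <= (ID + 1) * 25; i++) {
            thread_sum[ID] += i;
        }
    }   /* implicit barrier: all threads finish before we continue */

    /* Back in serial code: combine the 4 partial sums. */
    for (int t = 0; t < 4; t++) {
        sum += thread_sum[t];
    }
    printf("sum = %d\n", sum);  /* still 5050 */
    return 0;
}
```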
So why is this parallel loop slower than the serial for loop? Two words: false sharing! This concept is out of the scope of this video, but I'll still try to explain it in simple terms. It has to do with our thread_sum array. Processors usually have a cache between the high-speed registers of the CPU and the slow main memory. When we access memory, a slice of memory called a cache line is copied into the cache; this is what happens to our thread_sum array. The processor works on this copy stored in the cache. This offers speed, but updates must never clash. Our parallel loop updates individual elements of the same cache line from different threads. These simultaneous updates are logically independent of each other, but they still mark the entire cache line as invalid, because coherence is tracked for the cache line as a whole and not for individual elements. This forces each thread to fetch a fresh copy of the thread_sum array every time, which creates a lot of overhead and makes the program slow and inefficient. This problem is very common and can be overcome by using special constructs.
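The narration doesn't name those special constructs, but one standard OpenMP remedy is the reduction clause, which gives every thread its own private partial sum so no two threads write into the same cache line. A minimal sketch, assuming the same sum-of-100 example (my reconstruction, not code shown in the video):

```c
/* Avoiding false sharing with OpenMP's reduction clause:
   each thread accumulates into a private copy of sum, and OpenMP
   combines the copies at the end of the loop. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 100; i++) {
        sum += i;   /* each thread updates its own private sum */
    }

    printf("sum = %d\n", sum);  /* 5050 */
    return 0;
}
```

As a bonus, this is the actual parallel for construct from the video's title: the `#pragma omp parallel for` directive splits the loop's iterations across threads automatically, with no hand-written index arithmetic.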

Published: 17 Sep 2024
Comments: 3

@pratikpaudel7624 · 1 year ago
Thanks for the example. It was super helpful! :)

@ustcreators · 1 year ago
Why is the parallel loop slow?

@andreyplotkin6646 · 3 months ago
False Sharing