1.4 Real World Problem - Drug Design¶
Let’s look at a larger example. An important problem in the biological sciences is that of drug design. The goal is to find small molecules, called ligands, that are good candidates for developing into drug treatments.
The biology of drug design¶
The proteins in our bodies have particular shapes that enable them to carry out the processes we need to live. Each protein consist of a long sequence of biological material, called amino acids, that naturally folds into a particular three-dimensional shape, according to the type of that protein. A disease may arise if something goes wrong with the shape of a type of protein. Ligands are shorter sequences of amino acids that fold into their own shapes. If a ligand can be discovered that binds (fits) into a target protein in order to change that protein’s shape in a beneficial way, then a drug supplying that type of ligand could treat a disease caused by that protein.
Here is an image illustrating the concept of a ligand (represented by small sticks in center) binding to the folded shape of a protein (represented by ribbon structure):
But how can the right ligands be found and formulated into drugs for treating diseases? Teams of laboratory scientists and medical professionals need years to develop and test new drugs. Fortunately software simulations can help identify favorable ligands for the laboratory scientists to work from, thus greatly reducing the time and costs for the design of drugs. The software assigns a matching score to each ligand that indicates how well they are likely to bind with the desired region of a (folded) target protein. The scientists can then start their laboratory work with the promising high-scoring ligands.
Working with actual ligand and protein data is beyond the scope of this example, so we will represent the computation by a simpler string-based comparison.
Specifically, we simplify the computation as follows:
Proteins and ligands will be represented as (randomly-generated) character strings.
The docking-problem computation will be represented by comparing a ligand string
Lto a protein string
P. The score for a pair
[L, P]will be the maximum number of matching characters among all possibilities when
Lis compared to
P, moving from left to right, allowing possible insertions and deletions. For example, if
Lis the string “cxtbcrv” and
Pis the string “lcacxtqvivg,” then the score is 4, arising from this comparison of
Lto a segment of
This is not the only comparison of that ligand to that protein that yields four matching characters. Another one is
However, there is no comparison that matches five characters while moving from left to right, so the score is 4.
The first example program below provides a sequential C++ implementation of our simplified drug design problem.
The program optionally accepts up to three command-line arguments:
maximum length of the (randomly generated) ligand strings
number of ligands generated
protein string to which ligands will be compared
We will explore some example code that generates short strings, representing ligands, then assigns a score to each ligand according to how well it matches a longer string, representing a protein. In real drug design work, the scoring algorithms based on molecular biology are much more sophisticated than the example code’s simple matching algorithm. But for both the real software and our example code, the longer the ligand or the longer the protein, the longer it takes for the matching and score of the match to complete. Also, parallel computing can significantly speed up the computation time for our example code, as it does for real drug design software.
By default, we create a list of 100 possible ligands as random strings of length between 1 and 7. Below in the code box at the bottom of the code, the first command line argument represents the ligand length, and the second command line argument represents the number of ligands to try.
We have created a default fake protein in the example code. This can be changed on the command line by adding a third string, or you can update the code itself (see the tab for the MR class declaration).
In this implementation, the class
MR encapsulates the map-reduce steps
Reduce() as private methods (member functions of the class defined in the first tab for the MR class definition in the second tab), and a public method
run() invokes these steps according to a map-reduce program strategy (in first tab below):
A set of random ligands of varying lengths are put in a queue (Generate_tasks()).
In a loop, the Map function computes a score for each ligand, creating a list of pairs containing the ligand and its score.
The ligands are sorted according to their scores, highest first.
The Reduce function keeps only the ligands with the highest score (multiple ones could have the highest score).
When you try running this, note that it takes several seconds to run, so be patient.
We advise that you keep the maximum length of 7 and 100 ligands as given, because you will likely exceed the time limit for this book’s code execution on the remote machine that this runs on.
The Parallel OpenMP Version¶
To create an OpenMP parallel version the runs correctly, we have to make a couple of key changes:
The while loop that works through each generated ligand needs to be converted to a for loop so that the data decomposition pattern using a parallel for loop can be used to split the work onto threads.
The pairs of each ligand with their score must be held in a data structure that multiple threads can update concurrently by incorporating appropriate locking mechanisms. This is done by using a concurrent_vector container from the threaded building blocks (tbb) library.
The important portion of the code is line 66, where the pragma for OpenMP requests a parallel for loop.
The third command line argument provided is the number of threads to use. Try running using 1, 2, 3, and 4 threads. Note the time for each thread by recording them. You will enter them later below.
On line 66 above, we used a clause for the parallel for loop pattern:
schedule(static). This clause indicates that the work of processing each ligand in the for loop should be divided evenly among the threads.
After recording the times for this version for 1, 2, 3, and 4 threads, you can try changing the word ‘static’ inside the parentheses to dynamic, so it looks like this:
schedule(dynamic). Then repeat this, collecting times for 1, 2, 3, and 4 threads.
You can write your results as a table that looks like this:
Recall that the equation for speedup is:
Where \(T_1\) is the time it takes to execute a program on one thread, \(T_n\) is the time it takes to execute that same program on n threads, and \(S_n\) is the associated speedup.
We will use Python to assist us with our speedup calculation. Fill in the code below to compute the speedup for each version on each set of threads:
In many cases, static scheduling is sufficient. However, there is an implicit assumption with static scheduling that all components take about the same amount of time. However, if some components take longer than others, a load balancing issue can arise. In the case of the drug design example, different ligands take longer to compute than others. Therefore, a dynamic scheduling approach is better.