One major difficulty with directly parallelizing X!Tandem lies in the complexity of the data structures that are created internally. X!Tandem is written in C++, and during the run it creates a complex, highly interlinked graph of C++ objects that represent spectra, protein sequences, scores, etc. These data structures could be shared easily in the multithreaded version of X!Tandem, but a distributed version normally requires splitting the data structure among the separate processes, a daunting prospect.
To resolve this problem, we turned to a technique that we have often found useful when attempting to parallelize codes with complex data structures.14,15
We call this technique Owner Computes. The basic approach is as follows: we create k instances of the code. Each copy behaves much like the standalone, sequential version, in that it does a complete initialization, creating the entire set of data structures rather than just a fraction of them.
However, when each copy reaches the main computational section of the code (for example, the main computational loop), it performs only some of the iterations, keeping track of which iterations it actually performed and skipping the rest. Each iteration is performed by exactly one copy. Which subset of iterations a particular copy performs can be determined in a number of ways and can be static or dynamic. At the end of the main computational section, each copy holds some data that are up-to-date (the parts it computed) and some that require updating (the parts other copies computed). To correct for this, a merge phase is performed that updates each copy's data structures to contain all the results. The benefits of the Owner Computes method are that (i) we need to understand only a few data structures well enough to move them around, and (ii) very few changes are needed to the original source code.
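The skeleton of this approach can be sketched in a few lines of C++. Everything here is illustrative rather than X!Tandem's actual code: the class name, the static round-robin ownership rule, and the squared-value stand-in for the real computation are all our own assumptions.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of Owner Computes: every copy allocates the *full* result
// array (mirroring the complete initialization described above), but executes
// only the loop iterations it owns. Ownership here is a static round-robin
// over the copy's rank; a dynamic assignment would also fit this skeleton.
struct OwnerComputesCopy {
    int rank, k;
    std::vector<double> results;   // full-sized; entries owned by others stay stale
    std::vector<int> performed;    // bookkeeping: iterations this copy actually ran

    OwnerComputesCopy(int rank, int k, int n) : rank(rank), k(k), results(n, 0.0) {}

    void run_main_loop() {
        for (int i = 0; i < (int)results.size(); ++i) {
            if (i % k != rank) continue;   // not ours: skip, another copy owns it
            results[i] = double(i) * i;    // stand-in for the real computation
            performed.push_back(i);        // remember what we computed
        }
    }
};
```

After the loop, each copy's `performed` list identifies exactly the entries that the merge phase must propagate to the other copies.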
In the case of X!Tandem, we had an additional advantage, in that the code already contained a threaded parallelization. This parallelization operated by allowing multiple threads to crawl over the data structures in parallel, computing results for disjoint sets of spectra. Once that was completed, a single thread performed a merge step that combined the results appropriately. This pre-existing merge step contained most of the logic we would need for our Owner Computes merge step.
X!!Tandem performs the following steps. We create k copies of the code, each of which is already structured as a k-threaded program. As described above, each copy behaves much like a standalone k-threaded program. However, at the point where it would normally create k threads, the ith copy instead creates only a single thread that does exactly what the ith thread would normally have done. In aggregate, we have k threads, each running in its own process on its own CPU. When each thread finishes, we gather up the results it computed and send them to the master (the copy running the first thread), where each set of results is unpacked and placed exactly where the ith thread would normally have created it.
At this point, the data on the master are exactly what they would have been in a normal, k-threaded version, so the normal merge code can be invoked, combining the results. Next the combined results are gathered up and sent back to the other program copies, where they are unpacked and laid out in their proper location. At this point, all k copies have the same data structures and are ready to continue forward. Since X!Tandem has two computationally intensive sections (the unrefined and refined search steps), the process described above occurs twice, with a merge step at the end of each (see ).
Using the Owner Computes approach, fewer than 30 lines of code were added or changed in the main X!Tandem source file, tandem.cpp. The changes did not alter the overall logic of the program. A few other files required the addition of simple, boilerplate code necessary for moving a few key data structures during the merge step (data serialization). Two additional C++ modules were added; one manages the details of the Owner Computes method, and the other manages data serialization. The most difficult part of the parallelization involved this serialization: the packing, sending, and unpacking of the updated C++ data structures. We used the Boost Serialization package,16 an open-source library, to manage the packing and unpacking. MPI was used to send the packaged data between processes and for coordination.
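In X!!Tandem the packing and unpacking is delegated to Boost Serialization; as a rough illustration of what the serialization step involves, the following hand-rolled sketch packs a per-spectrum result into a byte buffer and unpacks it on the receiving side. The `SpectrumResult` type and its fields are invented for illustration and are not X!Tandem structures.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical per-spectrum result, standing in for the C++ objects that
// X!!Tandem must move between processes during the merge step.
struct SpectrumResult {
    uint32_t spectrum_id;
    double hyperscore;
    std::string peptide;
};

// Pack: append each field (and the string's length) to a flat byte buffer.
std::vector<char> pack(const SpectrumResult& r) {
    std::vector<char> buf;
    auto put = [&buf](const void* p, size_t n) {
        const char* c = static_cast<const char*>(p);
        buf.insert(buf.end(), c, c + n);
    };
    put(&r.spectrum_id, sizeof r.spectrum_id);
    put(&r.hyperscore, sizeof r.hyperscore);
    uint32_t len = (uint32_t)r.peptide.size();
    put(&len, sizeof len);
    put(r.peptide.data(), len);
    return buf;
}

// Unpack: read the fields back in the same order, reconstructing the object.
SpectrumResult unpack(const std::vector<char>& buf) {
    SpectrumResult r;
    size_t off = 0;
    auto get = [&](void* p, size_t n) { std::memcpy(p, buf.data() + off, n); off += n; };
    get(&r.spectrum_id, sizeof r.spectrum_id);
    get(&r.hyperscore, sizeof r.hyperscore);
    uint32_t len;
    get(&len, sizeof len);
    r.peptide.assign(buf.data() + off, len);
    return r;
}
```

Boost Serialization generates the equivalent of `pack`/`unpack` automatically from a single `serialize` member per class, which is why only boilerplate additions were needed in the X!Tandem source files.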
It is obviously important that the parallel versions of X!Tandem compute correct results. This seemingly simple requirement is, in fact, somewhat complicated with X!Tandem. The SMP version of the program produces textually different results depending on the number of threads used, so simple textual comparison of the results (e.g., using the “diff” file comparison program) proved quite difficult. We found that holding the number of threads constant made it possible to compare the output of X!!Tandem and X!Tandem. For example, when we compared X!!Tandem on 4 processors to X!Tandem with 4 threads, the results were identical, with one exception: the “nextscore” field of the output, which is supposed to represent the second-best score for a particular peptide. Very infrequently, this value differed between X!!Tandem and X!Tandem. This turned out to be due to the use of exact floating-point equality comparisons in the original X!Tandem code, a practice that is known to make code sensitive to seemingly innocuous reordering of computations, such as changing the order in which numbers are summed. We consider this discrepancy minor; the nextscore field is relatively unimportant, and future releases of X!Tandem will correct this nondeterministic behavior.13
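A minimal illustration of why exact equality is fragile under reordering: summing the same three numbers left-to-right versus right-to-left produces results that differ in the last bits, so an `==` comparison flips. The helper names below are ours, not X!Tandem's.

```cpp
#include <cassert>
#include <cmath>

// Summing identical values in a different order changes the low-order bits of
// an IEEE-754 double, so code that compares scores with exact == can behave
// differently depending on thread/process count.
inline bool exactly_equal_sums() {
    double forward  = 0.1 + 0.2 + 0.3;   // evaluated left-to-right
    double backward = 0.3 + 0.2 + 0.1;   // same numbers, reversed order
    return forward == backward;          // false with IEEE-754 doubles
}

// Order-tolerant alternative: compare within a small absolute tolerance.
inline bool approx_equal(double a, double b, double tol = 1e-9) {
    return std::fabs(a - b) <= tol;
}
```

A tolerance-based comparison of this kind would make the nextscore computation insensitive to summation order.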
Producing identical results in the parallel version greatly simplifies the task of verifying the results; we consider this to be extremely important for creating confidence in the parallel version.
The output from the Duncan parallel X!Tandem method is not the same as that from X!Tandem or X!!Tandem. The results differ substantially, including scores, expectation values, and even which proteins are found in the case of marginal matches. According to the developer, the goal in that project was to find strong matches as quickly as possible, rather than to reproduce the X!Tandem output exactly.17
. compares a small subset of the peptides found by the Duncan method to those found by X!Tandem and X!!Tandem. Since the X!Tandem and X!!Tandem output was identical, it is only presented once in the table. The peptides found agree well, although some differences exist in the expectation values.
Comparison of a Small Subset of the Peptides Found by the Duncan Method to Those Found by X!Tandem and X!!Tandem
. compares the top 20 protein hits found by the Duncan method versus X!Tandem/X!!Tandem. The table lists the union of the two sets, ordered by rank for X!Tandem. Here the differences are more pronounced. Although the Duncan method found most of the top 20 found by X!Tandem/X!!Tandem, the scores and ranking differ significantly. We attribute the lack of agreement between the two outputs to the complexity of the input spectra and database; in particular, the database contains numerous cases of analogous proteins that would score identically, and which might be ranked arbitrarily by the two methods. In addition, for this data set, the expectation values found are not particularly high, indicating a weaker signal that is likely to result in less definitive matches and therefore less consistency between the two methods.
Top 20 Protein Hits Found by the Duncan Method versus X!Tandem, X!!Tandem
Given this, we do not assert that the differences are necessarily significant biologically. The challenge of interpreting these differences, however, illustrates the value of an approach to parallelism that produces results that can be easily compared to the original.
We compared the performance of the three methods using the input files described in the Methods and Materials section. shows the runtime vs the number of CPUs, plotted log–log. X!!Tandem shows significantly better performance than the other two methods as the number of CPUs is increased.
Total runtime of the X!Tandem, X!!Tandem, and Duncan methods, plotted log–log. Lower is better. The dotted line indicates perfect scaling versus sequential X!Tandem.
shows the same data as speedup, relative to the sequential time of X!Tandem, again plotted log–log. Note that the log–log plot, though allowing all the data to be seen clearly, has the effect of making the speedup curves seem better than they are. For example, although the X!!Tandem speedup curve looks nearly straight, the actual 64-CPU speedup is ~29-fold.
Speedup of the X!Tandem, X!!Tandem, and Duncan methods, plotted log–log. Higher is better. The dotted line indicates perfect speedup versus sequential X!Tandem.
Figure 4 breaks the runtimes into their major components, giving some insight into the relative cost of each. The X!Tandem data show that the refinement step dominates the computation, with the merge steps representing relatively little time (the first merge step is so small that it is not visible). The reason for the relatively poor performance on 4 CPUs is unclear, although we suspect that the machine was experiencing contention when all CPUs were busy. The X!!Tandem data show that the unrefined and refined steps scale nicely as the number of CPUs increases, and the second merge time only begins to dominate at 64 processors. Here, too, the first merge is so small as to be invisible. The Duncan data show very clearly that method's problem with scaling to large numbers of CPUs: although the first two (parallel) invocations of X!Tandem scale well, the third invocation, which is run as a single instance of X!Tandem (using two threads when the overall run uses more than one CPU) to obtain proper scoring, dominates as the number of CPUs increases. In fairness, it is possible to run the Duncan method omitting this third invocation, which produces results with incorrect expectation scores but improves both the absolute runtime and the scalability substantially.
Figure 4 Breakdown of the X!Tandem, X!!Tandem, and Duncan methods, showing the time spent in each step. Steps that are too small to see are omitted. X!Tandem and X!!Tandem break down into unrefined and refined search steps, each followed by a communication/merge step.
We further investigated the scaling properties of X!!Tandem by performing a 32-CPU run with instrumentation added to determine the time each worker spent performing useful computation, on overhead activities (serialization and communication), and waiting due to load imbalance. The results are summarized in and . highlights the problem of load imbalance, particularly following the refinement phase. Note that the time spent in the send 2 phase is almost entirely due to load imbalance, since buffering limitations force MPI to delay a worker's send until the master is ready to receive the data. Sends that occur after the master is ready to receive (e.g., worker 20) take almost no time. aggregates the time spent on all CPUs. Also note that the aggregate computation time is greater for the parallel run than for the uniprocessor run; we attribute this to processing and I/O that is duplicated under the parallelization strategy employed.
Breakdown of Time Spent by X!!Tandem Workers on Various Activitiesa
Execution timeline for 32-CPU run, showing the various processing phases for the workers and master (shown as M). Send/recv and serialize/deserialize are reversed for the master. The serialize/deserialize steps are too short to be visible.
Optimizing load balance is a difficult problem given X!Tandem's decomposition strategy, which X!!Tandem inherits. That strategy requires that the spectra be subdivided into disjoint sets at the beginning of the computation and that these subsets remain fixed through the entire computation, which prevents dynamic rebalancing as the true cost of computing individual spectra becomes known. Unfortunately, spectra vary greatly in the computation they require (for example, due to varying loci of PTMs). It is possible that a heuristic could be developed to estimate the cost of each spectrum a priori; the estimates would then be used to create subsets of putatively equal computational cost.
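As a sketch of what such heuristic-driven partitioning might look like, the following greedy longest-processing-time assignment builds k subsets of roughly equal estimated cost. The per-spectrum cost estimates are assumed to be given; the function and its behavior are illustrative, not part of X!Tandem.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Greedy LPT partitioning: visit spectra from highest to lowest estimated
// cost, assigning each to the currently lightest of k subsets. Returns, for
// each spectrum, the index of the subset that owns it.
std::vector<int> balance(const std::vector<double>& cost, int k) {
    std::vector<size_t> order(cost.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return cost[a] > cost[b]; });  // heaviest first

    std::vector<double> load(k, 0.0);       // running total per subset
    std::vector<int> owner(cost.size());
    for (size_t idx : order) {
        int lightest = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        owner[idx] = lightest;              // heaviest remaining -> lightest subset
        load[lightest] += cost[idx];
    }
    return owner;
}
```

With accurate estimates, a rule of this kind would narrow the gap between the fastest and slowest workers while preserving the fixed, disjoint subsets that X!Tandem's decomposition requires.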
X!!Tandem was also tested using a custom scoring mechanism, via X!Tandem's plugin scoring feature. We obtained the k-score scoring plugin and used it to score the same spectra/database combination described in the Methods and Materials section. We found that the output files were identical to those generated by X!Tandem. The only modification required for plugin scoring was the addition of a small amount (fewer than 10 lines) of code to the plugin, extending the serialization routine for the scoring object, which was subclassed in the plugin.