MultiPhyl is one of a number of applications that runs on our general-purpose distributed computing system (16
). The overall design of the system is based on the client-server model. This model describes a system consisting of a single server computer and a number of client computers. Our system is divided into three separate pieces of software: server, client and remote interface (18
). Apart from the server, the system is completely dynamic meaning that new donor machines (a donor machine is owned by someone that offers their machines spare clock cycles to MultiPhyl) can be added at any time and existing donor machines can fail arbitrarily with no adverse effect to the computations. We have deployed the distributed system across our university campus using the idle computing resources of several hundred desktop computers (see http://www.cs.nuim.ie/distributed/statistics.php
The overall flow of control in MultiPhyl can be broken up into three distinct stages: model selection, tree searching and bootstrapping. The user uploads a group of n
nucleotide or amino acid alignments in either FASTA or PHYLIP file format. MultiPhyl makes extensive use of the Java-based Phylogenetic Analysis Library (PAL) v1.5 (19
) to perform all likelihood calculations. Each of the analysis stages (model selection, tree searching and bootstrapping) operates completely asynchronously across different alignments, thus achieving maximum speedup and avoiding performance bottlenecks. One of the essential requirements for ML phylogenetic inference is a model of amino acid or nucleotide substitution. It has been shown that using incorrect or suboptimal substitution models can produce less accurate or incorrect phylogenies when the wrong model of evolution is assumed (20
). Therefore, it may be useful to first compute the most suitable model before performing a treesearch. However, model selection is one area that has been overlooked as an essential part of the phylogeny construction process by many other parallel phylogenetic construction programs. We have recognized the importance of integrating model selection into all phylogenetic studies and therefore MultiPhyl allows the user to perform model selection on every alignment prior to all tree searches (20
). The task of testing multiple substitution models against an alignment is very amenable to parallelization as each model can be optimized completely independently and the corresponding likelihoods can be collected and analysed at the server to determine the best model. In order to gain the maximum speedup for the model selection stage of the program, we implemented an adaptive parallel strategy where the server first recruits a single donor machine to construct the initial NJ tree to be used to optimize each model. The parallel granularity of a model selection stage is estimated dynamically by recording how many models are optimized by the first donor machine in m
min (where the default value of m
min). Although it may seem that the optimal parallel granularity would be to just issue a single model per donor machine, this could have a number of adverse effects on the distributed system such as overloading the non-dedicated network or overloading the server with many simultaneous connections. After the server receives the first set of o
optimized models, it issues groups of o
models to multiple donor machines to be optimized. As MultiPhyl is based on a network of heterogeneous processors with varying computational abilities, each donor machine has the option to return a partial set of p
optimized models (where p
is the number of models optimized by the particular machine within 15
min and p
) with the unoptimized models being redistributed to other machines, thereby avoiding a bottleneck that might be caused by a few slow donor machines in the system.
In order to address the trade-off between running time and accuracy when performing large-scale analysis, we have implemented two different tree search algorithms in MultiPhyl and given the user the option to select which algorithm they wish to use to analyse their dataset. The first tree- building algorithm initially constructs an NJ tree and then iteratively improves the trees by repeatedly performing all possible simultaneous NNI and local branch length optimization swaps that improve the likelihood until convergence (4
). The model parameters are also optimized after each round of NNI swaps. Finally, all of the branch lengths are globally optimized to improve the final likelihood of the tree. The second tree search algorithm implemented in MultiPhyl allows the user to define a maximal depth of subtree pruning and regrafting (SPR) rearrangements to perform from each node of the tree. The SPR tree search algorithm begins by creating a starting tree by performing the NNI search algorithm outlined above and then iteratively visits each node of the tree and performs SPR rearrangements spanning increasing levels of the tree (beginning at 2 levels and continuing up to a user-defined maximum). In order to accommodate both long tree searches and the varying computational abilities of the available donor machines, if an individual tree search has not converged after 60
min on a particular machine, then the current state of the tree search is saved and returned to the server. This tree search is then re-issued to another donor machine to continue optimizing the tree. This also ensures that if a donor machine fails (e.g. crashes or is switched off), it is only possible to lose a maximum of approximately 60
min computing time for any tree search.
One significant challenge in a distributed computing environment is the huge disparity of available memory on the donor machines. This is a particular issue in phylogenetics as the amount of memory consumed by ML-based phylogeny programs increases linearly with sequence length and number of taxa. MultiPhyl implements a dynamic memory scheme where the minimum memory requirements of each task is first set to some initial value (default is 60
Mb). When a donor machine requests a work unit, the donor machine can only be issued a particular task (model optimization or tree search) that has a memory requirement less than the amount of memory available to the Java Virtual Machine (JVM) on the donor machine. If a donor machine reports that the JVM ran out of memory while attempting to perform a particular task, the server dynamically increases the minimum memory requirements of that individual task to a level that is suitable for the task. If no machine with sufficient memory can be detected after q
attempts (default value for q
is 5), then the problem is removed from the system with an appropriate entry in the log file returned to the user. This dynamic scheme ensures that each phylogenetic computation can gain access to the maximum amount of computational resources available to MultiPhyl as the user is not required to decide a priori what types of machines are capable of processing their dataset.