We aimed to develop a toolkit for phylogeny-based maximum-likelihood modelling of molecular evolution that is:
• Extensible, fast and scalable
• Able to implement most existing methods straightforwardly, while simplifying the development of novel methods
• Straightforward for novice biologists and for developers to learn
• Free, and modifiable without restriction
To meet these objectives, PyEvolve is implemented as a Python module with the most computationally intensive algorithms written in C/C++ and is released under the GPL.
There are several reasons behind the choice of Python. Principal among them is Python's utility as a programmable scripting language. By selecting a popular language as the principal interface to the toolkit, learning the toolkit is trivial for those already familiar with Python, while for others the general utility of the language means it can be applied to many other problems. Development-related factors also affected our choice. Python is designed to connect readily with external modules written in other languages such as C/C++ or Fortran. This ability to glue together pieces of code from different languages increases the choice of existing code for critical numerical routines (e.g. optimisation). The syntactical clarity of Python, the ready availability of good documentation suitable for novices [such as the open book project, [14]], and the self-documenting capabilities of Python code all contribute to lowering the barrier to use of the toolkit. Together, these attributes significantly reduced the amount of work required of us to generate a toolkit useful to an end-user group with diverse interests and expertise.
The top-level objects of PyEvolve and their relationships are illustrated in Figure 1. Briefly, these objects and their functions are:
Figure 1 Relationship among the PyEvolve components. The boxes represent the objects (classes) that make up PyEvolve. The filled diamonds, as specified by the Unified Modelling Language specification, indicate that the object at which the diamond ends has the connected …
Defines a parallelisation stack with virtual processors. Communicates among processors using PyPar, a Python MPI toolkit.
A parametric bootstrapping module. Can be used to assess parameter confidence intervals or probabilities. In either case, it requires users to provide a reference to function(s) that construct controller object(s). Runs in parallel when multiple CPUs are available.
A generic interface to bound-constrained numerical optimisers. Takes a vector of floating point values for optimisation, corresponding vectors of bounds defining the range of acceptable values, and a ParameterController object. Presently, a simulated annealing optimiser is provided.
Defines the parameterisation of the statistical model, sets parameter starting values and bounds for optimisation. Specifies the mapping of parameters in the optimisation vector to the likelihood calculation.
Performs the likelihood calculation by calling the calculatelikelihood module (which is written in C++). The calculatelikelihood module also calculates the matrix of substitution probabilities, unless overridden. Can also simulate an alignment, and estimate posterior probabilities of ancestral motifs [15].
For reading and manipulating phylogenetic trees. Tree also provides methods for extracting subtrees, and for identifying portions of a tree using the names of tips.
For reading and manipulating sequence alignments.
Provides services for defining and implementing Markov process models of substitution. Both the preparation of the instantaneous average relative rate matrix and the matrix exponential calculations can be performed by calc_psubs, a module written in C/C++. In the fastest implementation, the instantaneous matrix and matrix exponentiation routines are inlined with the calculatelikelihood C++ module, and SubstitutionModel is relegated to defining the configuration of the substitution model.
Represents the motifs (states) in the substitution model. Relates alphabet motifs to IUPAC ambiguity codes, and performs translation for different genetic codes.
Serial performance innovations
At the heart of the computational challenge is the poor scalability of the pruning and matrix exponentiation algorithms. The order of Felsenstein's pruning algorithm is ~O(NA²), where A is the number of motifs in the sequence alphabet (e.g. 4 for DNA, 20 for amino acids) and N is the number of sequences. The matrix exponentiation, which is done by eigenvalue decomposition using the same code as that used by PAML [13], also scales poorly (> ~O(N²)). Given the computationally intensive nature of these algorithms, they have been implemented in C/C++.
One way efficiencies can be gained is by reducing the number of times the algorithm is used. When the sequences descended from a node are identical at several sites in the alignment, the partial likelihoods for that node of the tree will also be identical at those sites. This has commonly been exploited for the special case in which the node is the root node; in other words, when entire alignment columns are identical. In this conventional case the likelihood is calculated for only one instance of a site-pattern, and that site-pattern's log-likelihood is then multiplied by its frequency of occurrence. PyEvolve implements a more advanced site-pattern algorithm whereby the data are pre-processed to identify node-specific site-patterns that do not fall within the site-pattern of a higher node (such as the root). The partial likelihoods calculated for the first instance of a node-specific site-pattern are reused for all other occurrences of that site-pattern.
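The conventional root-level site-pattern compression can be sketched in a few lines. This is an illustrative minimal version, not PyEvolve's implementation; the function names and the per-column `site_lnL` callback are ours.

```python
from collections import Counter

def compress_alignment(columns):
    """Collapse an alignment into its unique site-patterns and their counts.

    `columns` is a sequence of alignment columns (tuples of residues, one
    per sequence).  Returns (patterns, counts) so the likelihood need only
    be evaluated once per pattern.
    """
    counts = Counter(columns)
    patterns = list(counts)
    return patterns, [counts[p] for p in patterns]

def log_likelihood(columns, site_lnL):
    """Total log-likelihood: evaluate each unique pattern once, then weight
    its log-likelihood by its frequency of occurrence."""
    patterns, counts = compress_alignment(columns)
    return sum(n * site_lnL(p) for p, n in zip(patterns, counts))
```

PyEvolve's node-specific scheme generalises this idea below the root, reusing partial likelihoods wherever a pattern repeats under any node.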
Another optimisation strategy involves storing the results of previous calculations. Many numerical optimisers change one parameter at a time in the vector being optimised. For a parameter with a local (single-branch) scope, only the matrix exponentiation for that branch need be recomputed. PyEvolve takes advantage of such optimisers by storing the results of each exponentiation. If the optimiser returns a parameter vector in which the parameters relevant to a subset of branches are unchanged, the stored substitution probability matrices are used instead. If the parameters for a branch remain unchanged twice in a row, a second-level backup of that matrix is created. This second level becomes most valuable in the later stages of optimisation, when most changes to the parameter set are sub-optimal. The performance improvement conferred by this approach is striking for a model in which most parameters are local (e.g. a model with branch lengths and local substitution model parameters). In this instance, the majority of likelihood function evaluations involve recalculating the probabilities of substitution for a single branch.
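The reuse of exponentiation results amounts to memoising the expensive call on the parameters that determine each branch's matrix. A minimal sketch of the idea (class and attribute names are ours; PyEvolve's internals, including the second-level backup, are more elaborate):

```python
class PsubCache:
    """Cache of substitution-probability matrices, keyed by the branch and
    the parameter values that determine its matrix.  If an optimiser step
    leaves a branch's parameters unchanged, the stored matrix is returned
    instead of re-exponentiating."""

    def __init__(self, expm):
        self._expm = expm        # the (expensive) exponentiation routine
        self._cache = {}
        self.recomputations = 0  # how often we actually had to recompute

    def psub(self, branch, params):
        key = (branch, params)
        if key not in self._cache:
            self.recomputations += 1
            self._cache[key] = self._expm(params)
        return self._cache[key]
```

With a one-parameter-at-a-time optimiser, nearly every evaluation hits the cache for all branches except the one whose parameter just changed.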
PyEvolve also stores previous partial likelihoods for reuse, again taking advantage of this optimiser behaviour. The effect is that partial likelihoods need only be recomputed for the nodes affected by a parameter change, and for their parents. The optimisation vector is ordered to ensure that the minimum number of nodes need recalculating: it follows a natural tree-traversal order, in which the descendants of each node are added before the node itself.
The memory impact of storing previous results is minimal for both strategies. Both the partial likelihood and probability of substitution matrix reuse algorithms minimise the overhead for backing up by using the same memory space. The partial likelihood reuse algorithm has also been implemented with two versions that differ in memory usage. The default version keeps all the tree tips and nodes for the fastest and simplest reuse of previous partial likelihoods. The second version only keeps the nodes, thus requiring only half the memory. The performance hit is low because the partial likelihoods are considerably quicker to recalculate for tips than nodes. This algorithm may be of greatest value with bigger trees as the memory saving becomes more important. The reuse algorithm can be specified using an argument of the LikelihoodFunction constructor. The higher memory algorithm is on by default.
An optional pruning-related efficiency in PyEvolve concerns the number of motifs in the alphabet. As the pruning algorithm is most sensitive to the alphabet size, reducing it can provide enormous benefits. This efficiency can be achieved trivially in PyEvolve by reducing the alphabet to only those motifs observed in the data, and constructing the SubstitutionModel with the revised alphabet.
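Deriving the reduced alphabet is a one-line set operation. A hedged sketch (the function name is ours, not PyEvolve's API):

```python
def observed_alphabet(seqs, full_alphabet):
    """Restrict an alphabet to the motifs actually observed in the data.
    Pruning cost is most sensitive to alphabet size, so e.g. an amino-acid
    alignment using only 12 residues justifies a 12-state model rather
    than the full 20-state one."""
    seen = set().union(*map(set, seqs))
    # preserve the canonical ordering of the full alphabet
    return [m for m in full_alphabet if m in seen]
```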
Parallel performance innovations
With the availability of multi-processor clusters of computers becoming commonplace, performance gains can be achieved by parallelising portions of the computations. The limits to the maximum gain in efficiency are governed by the proportion of the algorithm that is strictly serial (as suggested by Amdahl's Law). Accordingly, the general principle for parallelisation is to minimise the amount of serial computation. A secondary, but similarly important, impact on the performance of a parallelised program is the time taken up by communication between nodes in the compute cluster. The combined impact of the diversity of possible use cases and the heterogeneity in hardware performance precludes a simple fixed parallelisation scheme.
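Amdahl's Law makes the limit concrete: with a parallelisable fraction p of the work and n CPUs, speedup is bounded by 1 / ((1 − p) + p/n). A small worked illustration (not part of PyEvolve):

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Amdahl's Law: the maximum speedup on n_cpus processors when a
    fraction p of the work is parallelisable and (1 - p) is strictly
    serial."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_cpus)
```

For example, if half the computation is serial, no number of CPUs can deliver more than a twofold speedup, which is why minimising the serial portion dominates the design.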
Within a single model there are three obviously parallelisable stages. At the highest level, portions of the numerical optimisation procedure may be parallelised. The details of this parallelisation differ between procedures. For instance, optimisers that use a finite difference method for estimating the gradient of the likelihood function can be trivially parallelised at the finite-differences step. While this provides a significant performance boost, these optimisers can suffer from a tendency to find local maxima rather than the global maximum being sought. This may require the optimiser to be started multiple times, each from a different position in the parameter space, significantly reducing real-world performance. Although global optimisation techniques tend to be slower and, at least for the simulated annealing procedure implemented in PyEvolve, benefit less from parallelisation, global procedures can be more efficient overall [16]. For these reasons we have developed a parallelised version of the bound-constrained simulated annealing algorithm [16]. This level of parallelisation will be referred to as the SA level.
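The "trivially parallelisable" finite-differences step works because each perturbed evaluation of the likelihood function is independent. A minimal sketch using threads for illustration (an MPI scatter would play the same role on a cluster; the function names are ours):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_gradient(f, x, h=1e-6):
    """Forward-difference gradient estimate.  Each perturbed evaluation
    of f is independent of the others, so all len(x) of them can be
    dispatched to separate workers at once."""
    f0 = f(x)

    def partial(i):
        xh = list(x)
        xh[i] += h
        return (f(xh) - f0) / h

    with ThreadPoolExecutor() as pool:
        return list(pool.map(partial, range(len(x))))
```

With p parameters and p workers, the gradient step takes roughly the time of one likelihood evaluation instead of p of them.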
The two other obvious opportunities for parallelisation are the calculation of the probabilities of substitution by matrix exponentiation and the calculation of the log-likelihood for columns in the alignment. We have not implemented the former because the overhead of communicating an array of 3721 double precision floating point values (a 61 × 61 matrix), as would be required for a codon substitution model, would negate the benefit of parallelising the matrix exponentiation on most hardware. The likelihood calculation has been parallelised at the level of the alignment, which will be referred to as the LF level parallelisation.
In order to permit adaptation of the mixture of parallelised routines to different tasks, we have implemented a flexible multi-level parallelisation schema in PyEvolve. The Parallel component of PyEvolve uses the PyPar interface to MPI [17]. The basic principle of MPI/PyPar parallelisation is that a copy of the same program runs on all processors, but each process uses its knowledge that the other processes exist to selectively do less work. This works because each process knows how many processes there are in total, and its ID (or rank) within that number. Usually the final result is collected onto processor 0, and at the beginning of the next calculation the starting state is sent out to all the other processes.
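The rank-and-size principle can be shown without any MPI machinery: each process derives its share of the work from (rank, size) alone. A sketch of partitioning alignment columns this way (a plain function so no MPI installation is needed to follow it; the name is ours):

```python
def my_columns(rank, size, n_columns):
    """The half-open range of alignment columns one process is
    responsible for, given its rank and the total process count.
    Columns are spread as evenly as possible; the first `extra`
    processes take one additional column each."""
    per, extra = divmod(n_columns, size)
    start = rank * per + min(rank, extra)
    stop = start + per + (1 if rank < extra else 0)
    return range(start, stop)
```

Every process computes the same function on its own rank, so the ranges are disjoint and jointly cover the alignment with no communication at all; only the resulting log-likelihoods need to be gathered onto processor 0.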
PyEvolve's schema allows processors to be grouped into a virtual processor, and provides methods for stacking these virtual processors on top of one another. The parallelisation stack created to perform the hypothesis test described below is illustrated in Figure 2. Each level of the parallelisation stack is made up of a number of virtual processors that in turn correspond to the same, or a greater, number of actual CPUs. The following rules must be satisfied to define a multi-level parallelisation stack: at each level, the maximum number of processors that make up a virtual CPU (the subgroup size) must be a divisor of the number of CPUs in the level above, and at the top level, the subgroup size must be a divisor of the total number of CPUs assigned to the program.
Figure 2 The 3-level parallelisation stack for a model. Model – the null or alternative hypothesis parameterisations; SA – the simulated annealing parallelisation level; LF – the likelihood function parallelisation level; CPU – actual …
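The divisibility rules for a stack can be checked mechanically. A minimal validator (ours, for illustration; PyEvolve's Parallel component enforces the equivalent constraints internally):

```python
def valid_stack(total_cpus, subgroup_sizes):
    """Check the rules for a multi-level parallelisation stack: the top
    level's subgroup size must divide the total CPU count, and each
    subsequent level's subgroup size must divide the number of CPUs in a
    virtual processor of the level above."""
    cpus = total_cpus
    for size in subgroup_sizes:
        if size <= 0 or cpus % size != 0:
            return False
        cpus = size  # this level's virtual processor size
    return True
```

For example, 12 CPUs can be stacked as subgroups of 6, then 3, then 1, but a subgroup of 3 cannot sit beneath a level whose virtual processors hold 4 CPUs.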