To illustrate the performance gains that GPUs afford in statistical phylogenetics, we explore the evolutionary relationships of mitochondrial genomes from 62 extant carnivores and a pangolin outgroup, compiled from GenBank. This genomic sequence alignment contains 10 860 nucleotide columns coding for 12 mitochondrial proteins; translated under a 60-state vertebrate mitochondrial codon model, it yields 3620 codon columns, of which 3600 are unique. Conducting a phylogenetic analysis on such an extensive dataset using a codon substitution model would previously have been considered computationally foolhardy.
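The distinction between total and unique columns matters because likelihood evaluation cost scales with the number of distinct site patterns: identical columns are computed once and weighted by their multiplicity. A minimal sketch of this compression, using a hypothetical toy alignment:

```python
# Collapse alignment columns into unique "site patterns" and their weights.
# The likelihood of each distinct pattern is computed once and multiplied
# by its count -- the mechanism that reduces 3620 codon columns to 3600
# unique ones in the carnivore alignment.
from collections import Counter

def unique_patterns(alignment):
    """alignment: equal-length sequences (rows = taxa).
    Returns (patterns, weights): distinct columns and their counts."""
    columns = zip(*alignment)               # transpose rows -> columns
    counts = Counter(columns)
    patterns = list(counts)
    weights = [counts[p] for p in patterns]
    return patterns, weights

# Hypothetical toy alignment: three taxa, six columns.
patterns, weights = unique_patterns(["ACGTAC", "ACGTAC", "ATGTAT"])
print(len(patterns), weights)               # -> 4 [2, 2, 1, 1]
```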
Figure 2 displays the tree for these carnivores that we infer from a Bayesian analysis under the M0 codon model (Goldman and Yang, 1994) with a 4-class discrete-Γ model for rate variation (Yang, 1994) and a relaxed molecular clock (Drummond et al.). This analysis uses GPU-enabled calculations through MCMC to draw 25 million random samples from the joint posterior of both an unknown phylogeny and the unknown parameters characterizing the substitution model. The resulting phylogeny is compatible with the current understanding of the familial relationships of Carnivora and, importantly, helps resolve relationships within the Arctoidea. This infraorder within the caniform (dog-like) carnivores comprises the Ursidae (bears), Musteloidea (weasels, raccoons, skunks and red panda) and the Pinnipedia (seals and walruses), and its branching order has been debated. Delisle and Strobeck (2005) conduct a nucleotide-based analysis of a smaller dataset of the same 12 genes and show marginal support for either the grouping of Pinnipedia and Ursidae or of Musteloidea and Ursidae (depending on whether a Bayesian or a maximum likelihood approach, respectively, is employed). When run under a conventional GTR nucleotide model, our alignment also yields the Pinnipedia and Ursidae grouping, but with less than emphatic support (posterior probability 0.79). However, with the codon model, the same data proffer significant support (posterior probability 0.97) for the Pinnipedia and Musteloidea grouping, a relationship also found for nuclear markers (Flynn et al.; Yu et al.) and ‘supertree’ approaches that attempt to synthesize available phylogenetic knowledge (Bininda-Emonds et al.).
Fig. 2. Reconstructed codon-based majority clade consensus tree of 62 carnivore mitochondrial protein-coding sequences with the long-tailed pangolin (Manis tetradactyla) as an outgroup. We label clades with posterior probabilities except where they approach 1.
Table 1 reports the marginal posterior estimates for the tree height in expected substitutions, the transition:transversion rate ratio κ, the non-synonymous:synonymous rate ratio ω, the rate-variation dispersion α and the relaxed-clock deviation σ. These estimates demonstrate a high overall rate of synonymous substitution, suggesting that synonymous changes may saturate faster than amino acid replacements, an effect not adequately modeled at the nucleotide level. The codon model employed here allows different rates for synonymous and replacement substitutions, which helps explain the congruence of the resulting tree with that of more slowly evolving nuclear genes.
Table 1. Codon substitution model parameter estimates for 62 extant carnivores and the pangolin outgroup
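For readers unfamiliar with the M0 parameterization, its rate structure is compact: substitutions altering more than one codon position get rate zero, transitions are scaled by κ, and non-synonymous changes by ω. The sketch below illustrates the construction; the two-amino-acid genetic-code fragment is purely hypothetical, and equilibrium codon frequencies are omitted for brevity (a real analysis uses the full 60-codon vertebrate mitochondrial code):

```python
# Sketch of the Goldman-Yang M0 codon rate construction: zero rate for
# multi-nucleotide changes, a kappa multiplier for transitions, and an
# omega multiplier for non-synonymous (amino-acid-changing) substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def m0_rate(codon_i, codon_j, kappa, omega, translate):
    """Unnormalized M0 rate from codon_i to codon_j
    (equilibrium codon frequencies omitted)."""
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0                      # multi-nucleotide changes forbidden
    rate = kappa if diffs[0] in TRANSITIONS else 1.0
    if translate(codon_i) != translate(codon_j):
        rate *= omega                   # penalize non-synonymous changes
    return rate

# Hypothetical code fragment: TTT/TTC -> Phe (F), TTA -> Leu (L).
code = {"TTT": "F", "TTC": "F", "TTA": "L"}.get
print(m0_rate("TTT", "TTC", kappa=2.0, omega=0.1, translate=code))  # 2.0
print(m0_rate("TTT", "TTA", kappa=2.0, omega=0.1, translate=code))  # 0.1
```

The two example calls show a synonymous transition (rate κ = 2.0) and a non-synonymous transversion (rate ω = 0.1), the contrast that lets the model separate silent from replacement change.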
We perform this codon-based analysis on a standard desktop PC sporting a 3.2 GHz Intel Core 2 Extreme (QX9770) CPU and 8 GB of 1.6 GHz DDR3 RAM. This CPU was the fastest available at the end of 2008 and, together with the high-end RAM, maximizes the speed at which the Java and C implementations run, providing a benchmark comparison. We equip the desktop with three NVIDIA GTX 280 cards, each of which carries a single GPU with 240 processing cores running at 1.3 GHz and 1 GB of RAM. Exploiting all three GPUs, our analysis requires only 64 h to run.
Figure 3 compares runtimes using all three GPUs, a single GPU, and the Java and C implementations. For codon models, a C version is not available in the current BEAST release, and our C implementation provides only a 1.6-fold speed increase over Java. To make these comparisons in reasonable time, we run short MCMC chains of 10 000 steps. Between a single GPU and all three GPUs, we observe slightly less than a 3-fold improvement for the codon model. This factor compares well with the theoretical maximum, indicating that the replicated work in recomputing the finite-time transition probabilities is not a burden when fitting models with large state spaces. Between the three GPUs and the current BEAST implementation, we observe an over 140-fold improvement (a 90-fold improvement over the C implementation). One major difference between the default GPU and CPU computations is precision, as modern CPUs provide double precision at full performance. Fortunately, the GTX 280 also contains a limited number (30) of double-precision floating-point units, so it is possible to exploit double precision on the GPU, but at a performance cost; for the three-GPU configuration, this cost is also ~3-fold. Computing this data example in double precision requires more than the 1 GB of RAM available on a single card, so we are unable to replicate the experiment on a single GPU. This is a limitation of our particular hardware; cards with 4 GB of RAM are now available. For this dataset, however, the extra precision offers no scientific gain: posterior estimates at double and single precision remain unchanged.
Fig. 3. Speed comparison of GPU, Java and C BEAST implementations. Speedup factors are on the log-scale.
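The precision question arises because a phylogenetic likelihood is a product of thousands of small per-column probabilities, which underflows even double precision if accumulated naively; single precision underflows far sooner (near 1e-38), which is why implementations work in log space or rescale partial likelihoods. A toy illustration of the problem (the per-column value of 1e-3 is made up):

```python
# Naive product of 3600 small per-column likelihoods underflows to zero,
# while accumulating logarithms stays well-conditioned. Real likelihood
# code rescales partial likelihoods for the same reason.
import math

site_likelihoods = [1e-3] * 3600      # hypothetical tiny per-column values

naive = 1.0
for p in site_likelihoods:
    naive *= p                        # 1e-10800 is not representable -> 0.0

stable = sum(math.log(p) for p in site_likelihoods)

print(naive)                          # 0.0
print(round(stable, 1))               # -24867.9
```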
One important scaling dimension of the parallelization is the model state-space size S. We reanalyze the carnivore alignment at the nucleotide level using the GTR substitution model. For nucleotides, S = 4, and we expect the speedup that the GPUs afford to be much more modest. However, we still observe ~20-fold (three GPUs) and ~12-fold (single GPU) performance improvements at single precision, and single- and double-precision MCMC chains return the same estimates. These substantial increases warrant using GPUs even for small state spaces.
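As a concrete reference for the GTR computation, each branch of length t requires the finite-time transition matrix P(t) = exp(Qt), where Q combines symmetric exchangeabilities with equilibrium base frequencies. A self-contained sketch using a truncated Taylor series, which is adequate for this toy case (production code uses eigendecomposition; the rates and frequencies below are invented):

```python
# Build a GTR rate matrix Q and compute P(t) = exp(Qt), the finite-time
# transition probabilities recomputed per branch during likelihood
# evaluation. States are indexed 0..3 for A, C, G, T.
def gtr_q(rates, freqs):
    """rates: symmetric exchangeabilities keyed by sorted index pairs;
    freqs: equilibrium frequencies pi. Rows of Q sum to zero."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = rates[(min(i, j), max(i, j))] * freqs[j]
        q[i][i] = -sum(q[i])
    return q

def expm(q, t, terms=40):
    """P(t) = exp(Qt) via truncated Taylor series (fine for small Qt)."""
    n = len(q)
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in p]      # running term Q^k t^k / k!
    for k in range(1, terms):
        term = [[sum(term[i][m] * q[m][j] * t / k for m in range(n))
                 for j in range(n)] for i in range(n)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return p

freqs = [0.3, 0.2, 0.2, 0.3]          # invented pi_A, pi_C, pi_G, pi_T
rates = {(0, 1): 1.0, (0, 2): 4.0, (0, 3): 1.0,
         (1, 2): 1.0, (1, 3): 4.0, (2, 3): 1.0}
p = expm(gtr_q(rates, freqs), t=0.1)
print([round(sum(row), 6) for row in p])   # each row of P(t) sums to 1
```

Because rows of Q sum to zero, every Taylor term beyond the identity has zero row sums, so the rows of P(t) sum to one up to floating-point error; this is a useful sanity check for any transition-probability implementation.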
Two other scaling dimensions of practical concern are the number of alignment columns C and the number of taxa N. For very small C, the overhead in transferring data onto and off of the GPU may outweigh the computational benefit; performance should then rapidly improve until the parallel resources on the GPU become fully occupied. After this saturation point, we expect only moderate performance increases as C grows. The former cross-over point is critical in choosing whether or not to attempt a GPU analysis, and the latter provides guidance in splitting datasets across multiple GPUs. Changing N, on the other hand, should have a much less pronounced effect, with each additional taxon introducing one internal node and two branches to the phylogeny. Figure 4 reports how performance scales when analyzing 1–3600 unique codon alignment columns for all 62 carnivores and 200 columns for 3–62 taxa. Interestingly, even for a single codon column, the GPU implementation runs over 10-fold faster owing to the performance gains in our parallel calculation of the finite-time transition probabilities. For this codon model, the saturation point occurs around 500 unique columns; optimal performance gains therefore arise from distributing several hundred columns to each GPU in a multi-GPU system.
Fig. 4. GPU performance scaling by number of unique alignment columns C and of taxa N. Speedup factors are on the log-scale.
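A back-of-envelope cost model (our own simplification, not taken from the analysis above) connects these scaling dimensions: the pruning recursion performs on the order of N·C·S² multiply-accumulates, so S = 60 codon states versus S = 4 nucleotides implies roughly (60/4)² = 225× more work per column, which is why large state spaces keep the GPU cores saturated even at modest C:

```python
# Rough work estimate for the pruning (Felsenstein) recursion: each of
# the ~N internal-node updates does S^2 operations per unique column.
# Constants and the S^3 transition-probability exponentials are ignored,
# so only ratios between configurations are meaningful.
def peeling_cost(n_taxa, n_columns, n_states):
    internal_nodes = n_taxa - 1   # one new internal node per added taxon
    return internal_nodes * n_columns * n_states ** 2

ratio = peeling_cost(62, 3600, 60) // peeling_cost(62, 3600, 4)
print(ratio)                       # 225: codon columns cost ~225x more
```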