PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of molbiolevolLink to Publisher's site
 
Mol Biol Evol. Sep 2008; 25(9): 1795–1808.
Published online May 2, 2008. doi:  10.1093/molbev/msn104
PMCID: PMC2515873
Confirming the Phylogeny of Mammals by Use of Large Comparative Sequence Data Sets
Arjun B. Prasad,* Marc W. Allard,§ and Eric D. Greencorresponding author*
NISC Comparative Sequencing Program*
*Genome Technology Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD
NIH Intramural Sequencing Center, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD
Integrated Biosciences Program, George Washington University
§Department of Biological Sciences, George Washington University
corresponding authorCorresponding author.
E-mail: egreen/at/nhgri.nih.gov.
Scott Edwards, Associate Editor
Accepted April 7, 2008.
The ongoing generation of prodigious amounts of genomic sequence data from myriad vertebrates is providing unparalleled opportunities for establishing definitive phylogenetic relationships among species. The size and complexities of such comparative sequence data sets not only allow smaller and more difficult branches to be resolved but also present unique challenges, including large computational requirements and the negative consequences of systematic biases. To explore these issues and to clarify the phylogenetic relationships among mammals, we have analyzed a large data set of over 60 megabase pairs (Mb) of high-quality genomic sequence, which we generated from 41 mammals and 3 other vertebrates. All sequences are orthologous to a 1.9-Mb region of the human genome that encompasses the cystic fibrosis transmembrane conductance regulator gene (CFTR). To understand the characteristics and challenges associated with phylogenetic analyses of such a large data set, we partitioned the sequence data in several ways and utilized maximum likelihood, maximum parsimony, and Neighbor-Joining algorithms, implemented in parallel on Linux clusters. These studies yielded well-supported phylogenetic trees, largely confirming other recent molecular phylogenetic analyses. Our results provide support for rooting the placental mammal tree between Atlantogenata (Xenarthra and Afrotheria) and Boreoeutheria (Euarchontoglires and Laurasiatheria), illustrate the difficulty in resolving some branches even with large amounts of data (e.g., in the case of Laurasiatheria), and demonstrate the valuable role that very large comparative sequence data sets can play in refining our understanding of the evolutionary relationships of vertebrates.
Keywords: Placentalia, Eutheria, Mammalia, mammalian phylogeny, phylogenomics, Atlantogenata, molecular systematics
Advances in large-scale DNA sequencing are creating new opportunities for molecular phylogeneticists to examine ever-larger amounts of genomic sequence from increasing numbers of taxa. These data have the potential to greatly enhance our ability to answer difficult phylogenetic questions; however, the size and inherent imperfections of such data sets present some unique challenges for accurate tree inference. To begin with, the large numbers of characters that serve as input demand a robust computational infrastructure. Further, the fast-evolving nature of most eukaryotic genomes has yielded large amounts of nonprotein-coding sequences that are not conserved across species, making it difficult to generate complete and accurate multisequence alignments (Margulies et al. 2006).
These and other challenges in dealing with nucleotide sequence–based characters have prompted some to make phylogenetic inferences based on genomic characters that change less frequently than individual nucleotides, such as inversions, transposon insertions, and coding insertions/deletions (indels) (Shimamura et al. 1997; Murphy et al. 2004, 2007; Bashir et al. 2005; Chaisson et al. 2006; Kriegs et al. 2006). However, these genomic characters present their own challenges. First, they are less common, so few may be found to help differentiate short branches (Nishihara et al. 2005). This leads to particular difficulties when assessing support using traditional methods (e.g., bootstrapping). Second, assigning the actual character state (e.g., the presence of the same insertion at a given position or a given rearrangement shared between 2 species) can be difficult because overlapping rearrangements, changing boundaries, and/or sequence divergence can obscure the historical relationships (Murphy et al. 2004). Finally, although methods for modeling such rare genomic characters have been developed (Waddell et al. 2001; Chaisson et al. 2006), biases leading to the potential for homoplastic evolution are not well understood (Boissinot et al. 2004; Chen et al. 2005). For example, it is probable that indel events are less likely to occur independently in multiple lineages than single nucleotide changes; however, the extent of biases in indel location appears to vary among lineages and types of indel events. Thus, although rare genomic changes can be used as informative phylogenetic characters, there are still reasons why sequence-based characters are helpful as independent sources of phylogenetic information.
Meanwhile, traditional phylogenetic analyses based on nucleotide mutations present a different set of challenges. As the costs of procuring and operating large clusters of commodity computers have decreased, it has become increasingly practical to harness significant amounts of processing power to analyze very large sequence-based data sets. This provides the ability to exploit single nucleotide mutations more extensively, yielding more robust phylogenetic inferences. Additionally, there is extensive theory and experience relevant to both modeling the evolution of these characters and using the algorithms to infer phylogenetic trees. However, care must be taken to rule out sources of systematic (or nonstochastic) error, such as long-branch attraction, alignment guide trees, and base-composition biases that can hinder the use of such data sets (Kluge and Wolf 1993; Hillis et al. 2003; Philippe et al. 2005; Rokas and Carroll 2005).
Indeed, there remain a number of uncertainties about the mammalian phylogeny that are beginning to be clarified using both substitution-based and rare genomic character–based methods. For example, although recent molecular studies have broken the placental (eutherian) mammals into 4 groups or superorders—Afrotheria (Stanhope et al. 1998), Euarchontoglires (also called Supraprimates by Waddell et al. 2001), Laurasiatheria (Waddell, Okada, and Hasegawa 1999), and Xenarthra (Cope 1889)—the relative arrangement of these groups is uncertain (Waddell, Okada, and Hasegawa 1999; Madsen et al. 2001; Murphy, Eizirik, Johnson, et al. 2001; Waddell et al. 2001; Scally et al. 2002; Kriegs et al. 2006; Nishihara et al. 2006). Studies investigating the relationships among these 4 mammalian superorders agree that Euarchontoglires and Laurasiatheria should be grouped together to form Boreoeutheria (Murphy, Eizirik, O'Brien, et al. 2001; Springer and de Jong 2001; Waddell et al. 2001; Kriegs et al. 2006; Nishihara et al. 2006), although the exact root among the 3 remaining groups (Boreoeutheria, Afrotheria, and Xenarthra) remains unsettled. Most molecular analyses have weakly supported a basal Afrotherian root (Murphy, Eizirik, O'Brien, et al. 2001; Waddell et al. 2001; Scally et al. 2002; Amrine-Madsen et al. 2003; Waddell and Shelley 2003; Kriegs et al. 2006; Nikolaev et al. 2007; Nishihara et al. 2007), whereas a few have weakly supported the association of Afrotheria and Xenarthra to form Atlantogenata, thereby placing the root between Atlantogenata and Boreoeutheria (Waddell, Cao, et al. 1999; Madsen et al. 2001; Delsuc et al. 2002; Lin et al. 2002; Waddell and Shelley 2003; Hallstrom et al. 2007; Murphy et al. 2007; Waters et al. 2007). On the other hand, the traditional placement of Xenarthra at the root of the placental tree, with Boreoeutheria and Afrotheria together forming Epitheria, has been supported by others (Shoshani and McKenna 1998).
Several recent studies have further investigated these issues, leading to different conclusions. Kriegs et al. (2006) identified 2 shared transposon insertions in Afrotheria and Boreoeutheria that could not be found in Xenarthra or in an opossum outgroup, although they did not consider these results statistically significant. Nikolaev et al. (2007) used comparative sequence data generated for 1% of the human genome (as part of the ENCODE project) to examine the root of Placentalia; they reported significant support for a root between Afrotheria and Exafroplacentalia (Boreoeutheria + Xenarthra), though they found it necessary to perform separate analyses on conserved noncoding sequences and amino-acid sequences to exclude both other possible roots. Murphy et al. (2007) searched for informative coding indels within whole-genome sequence data, finding 4 examples supporting Atlantogenata as the root and none supporting the 2 alternative roots; they also identified 2 retroelement insertions with well-conserved flanking sequence that also support Atlantogenata as the root. In addition, Waters et al. (2007) analyzed a phylogeny of L1 sequences and found further support for Atlantogenata as the root. Hallstrom et al. (2007) and Wildman et al. (2007) used coding sequence extracted from whole-genome shotgun sequencing data to find support for an Atlantogenatan root; however, Nishihara et al. (2007) used a similar data set and found that the use of more complex models of evolution that partitioned individual genes suggested an Afrotherian root. Another unresolved issue in mammalian phylogenetics relates to the relationships among orders within Laurasiatheria. Though the monophyly of Laurasiatheria is fairly well established, the relative arrangements within this taxon have been difficult to establish (aside from placing Eulipotyphla at the base of Laurasiatheria) (Murphy, Eizirik, O'Brien, et al. 2001; Waddell et al. 2001; Arnason and Janke 2002; Chaisson et al. 2006; Nishihara et al. 2006).
To investigate some of the above issues, we sought to confirm and refine the mammalian phylogeny by examining the performance of nucleotide-based phylogenetic analyses using very large genomic sequence data sets. Specifically, we mapped, sequenced, assembled, and analyzed a large genomic region (~1.9 Mb in humans) in 41 mammals, 1 bird, and 2 fishes. Here, we report the experience and results of performing phylogenetic analyses of this large (>69 Mb) comparative sequence data set.
The comparative sequence data set analyzed here (available at http://www.nisc.nih.gov/data) is an expanded version of that reported by Thomas et al. (2003), with all sequences orthologous to a 1.9-Mb region of human chromosome 7 (build hg18, chr7:115,597,757–117,475,182) that includes 10 known genes (e.g., CFTR, ST7, and CAV1). All species’ sequences were generated by first isolating bacterial artificial chromosome (BAC) clones from the orthologous genomic region using overgo-based hybridization methods (Thomas et al. 2002) and then generating high-quality sequence of each selected BAC (Blakesley et al. 2004). For each species, sets of overlapping BAC sequences were compiled into a single ordered and oriented sequence. The assembled BAC sequences are provided as supplementary data online (available at http://www.nisc.nih.gov/data). All the analyzed sequences were generated in this fashion as part of the NIH Intramural Sequencing Center Comparative Sequencing Program (Thomas et al. 2003), except for the human sequence (generated by the International Human Genome Sequencing Consortium [2001]) and the Fugu sequence (generated by the Fugu Genome Consortium [Aparicio et al. 2002]). Table 1 provides a list of the species whose sequences were analyzed.
Table 1
Table 1
Multispecies Comparative Sequence Data Set
The assembled sequences were aligned using the threaded blockset aligner (TBA), a local alignment program designed to generate multisequence alignments of large data sets (Blanchette et al. 2004). The final alignment size of all alignable sequence in the data set was 44 taxa by 6,270,442 characters. The initial alignment guide tree was based on the results of Murphy, Eizirik, Johnson, et al. (2001); this was then modified to test alternate hypotheses and to verify that the results were not dependent on the alignment guide tree (see Results and Discussion). The alignment was divided into partitions (i.e., corresponding subportions of the genomic region, as described below) using custom perl scripts (available on request).
Coding sequences were identified based on data from the Consensus CDS Project (http://www.ncbi.nlm.nih.gov/CCDS), as provided on the UCSC Genome Browser (hg17 build; http://www.genome.ucsc.edu). This approach was used for all genes except MET, which was derived from the longest GENCODE annotation (http://genome.imim.es/gencode/) because it was not present in the Consensus CDS annotation. Coding regions within the multisequence alignment were manually edited, and areas of uncertain alignment were removed, with gap columns added where necessary to maintain phase using jalView 2.2.1 (Clamp et al. 2004). Codon position partitions were generated using every third base of the alignment.
Evolutionarily conserved sequences were identified using annotations represented on the “17-way Most Conserved” track of the UCSC Genome Browser (hg18 build) (Kent et al. 2002; Karolchik et al. 2004; Siepel et al. 2005). These annotations reflect conserved elements that were detected using phastCons (Siepel et al. 2005), which applies a 2-state conserved versus nonconserved phylogenetic hidden Markov model to a 17-species multisequence alignment. PhastCons also uses the 5-parameter general time reversible (GTR) model of sequence evolution with a scaling parameter for conserved sequence. With the parameters used for generating the 17-way Most Conserved track, 90% of the human coding bases in our analyzed genomic region reside within conserved regions. For the studies described here, we extracted all coding bases from the annotated conserved regions, leaving a conserved noncoding sequence partition of 104,918 human bases and a total alignment (including gaps) of 132,422 bases. A character state matrix (coding plus conserved noncoding) was created by adding all manually edited protein-coding sequence alignments to the above conserved noncoding sequence alignment. We also generated another conserved sequence matrix for comparison purposes using Gblocks 0.91b, which uses a phylogenetically naive approach to identify sequence conservation (parameters: minimum 23 sequences for a conserved position, minimum 37 sequences for a flanking position, maximum 8 contiguous nonconserved positions, minimum initial block length of 10, minimum block length of 10, and half allowed gap positions) (Castresana 2000). The resulting matrix of conserved sequences contained 77,961 bases.
The extraction of specific bases and conversion of data files to FASTA, NEXUS, or PHYLIP formats for subsequent analyses were performed using custom perl scripts (available on request). Maximum parsimony tree searching was performed using PAUP* 4.0b10 (Swofford 2003). All trees were rooted using the fishes (Fugu and Tetraodon) as outgroup taxa, and a constraint for the monophyly of mammals was used. Maximum parsimony trees were generated using random addition replicates as well as bootstrapped with 1,000 replicates of 10 random addition subtree pruning regrafting runs. Neighbor-Joining trees were generated with 1,000 bootstrap replicates using maximum likelihood (ML) HKY85 distances. Incongruence length difference (ILD) or partition homogeneity tests were performed with 1,000 replicates of 10 Tree Bisection-Reconnection random addition tree searches with PAUP* (Bull et al. 1993; Cunningham 1997).
The monophyly of mammals was constrained for all tree searching, except when performing ILD tests, Shimodaira–Hasegawa tests (SH tests), or otherwise noted (Shimodaira and Hasegawa 1999). Bayesian phylogenetic analysis was performed with both partitioned likelihood models and single partition models using the MPI version of MrBayes v3.1.2 and GTR + I + Γ models, as suggested by MrModeltest or Modeltest (Posada and Crandall 1998; Nylander 2004). For the MrBayes analysis, model parameters were estimated from the data, and default priors were used. Metropolis-coupled Markov chain Monte Carlo chains were run for 500,000 or 1,000,000 generations, sampling every 100 generations and using 6 heated chains per each of 2 independent runs. Stationarity was confirmed by manual inspection for convergence of independent runs as well as topological and likelihood value stability. Majority rules consensus trees were generated from the final two-thirds of sampled trees using PAUP. Bootstrapped Bayesian runs were performed using seqboot from PHYLIP to create 100 bootstrap data sets, which were then independently analyzed with MrBayes using the settings described above (500,000 generations, sampling every 100 generations, with the final 6,668 sampled trees used for the consensus) (Felsenstein 2007). The RY-coded coding plus conserved non-coding sequence matrix was run to 5,000,000 generations, the average likelihood values increased until around 500,000 generations, but a single tree became completely resolved before 200,000 generations (data not shown).
ML tree searches were performed with the GTR + Γ model using RAxML-VI-HPC v2.1.3 (Stamatakis 2006). A proportion of invariable sites was not used because that parameter (I) is not implemented in RAxML-VI-HPC v2.1.3 (Stamatakis 2006). The best ML trees were obtained by performing greater than 20 independent tree searches from both completely random and random addition parsimony-based starting trees (default for RAxML) using the “−f d” high-performance hill-climbing algorithm. Highest likelihood trees from multiple runs of RAxML were the same as the trees obtained from multiple runs of PHYML using GTR + Γ + I models for several data sets tested (Guindon and Gascuel 2003). Even though its hill-climbing algorithm was slightly more likely to end at local maxima, we used RAxML because a single run with the conserved sequence partition took roughly one-third the time (~1.5 h) and one-tenth the RAM (~400 Mb) compared with PHYML; this allowed for more efficient use of available cluster resources. Bootstrap replicates were performed using the parallelized MPI-enabled version of RAxML-VI-HPC v2.1.3 with default settings. SH tests were performed with the best ML trees found using 15 random addition runs of RAxML using constraints for taxa in question. The best trees were used for SH tests with PAUP* and 10,000 RELL replicates with the GTR + Γ + I model (Shimodaira and Hasegawa 1999). All tree searching was performed using Linux clusters. Analyses with PAUP and RAxML were scripted and split using custom perl and shell scripts. Trees were visualized with the assistance of TreeGraph (Muller and Muller 2004).
Overview
We generated and compiled a high-quality comparative sequence data set consisting of sequences from 44 vertebrate species (table 1), all of which are orthologous to a 1.9-Mb region on human chromosome 7 (Thomas et al. 2003). This entire genomic region is syntenic in all mammals, reptiles, and fishes that we have examined to date (including some whose sequence was not analyzed in this study). Together, the consistent long-range synteny, Blast-based sequence comparisons, and nature of the cross-species BAC isolation and mapping process (see Thomas et al. [2002]) confirm the orthologous relationship of the sequences within the analyzed data set. The amount of assembled, annotated, and quality-trimmed sequence in the data set varies from 257 kb from Fugu to 2 Mb from elephant, with this variance reflecting both intrinsic differences in the size of the genomic region among species as well as incomplete sequence coverage for some species (see table 1).
Sequences were aligned using TBA (Blanchette et al. 2004). Because TBA produces local multisequence alignments, it handles small inversions or other local rearrangements well and avoids incorporating regions where the alignment uncertainty is high (Blanchette et al. 2004; Pollard et al. 2006). Protein-coding sequences were excised and manually edited to constrain coding indels to multiples of 3 bases, unless there was significant evidence for other indel sizes. We also removed any portions of the alignment where gap positions could not be easily determined. From this alignment, we made a coding sequence matrix, which contained 20,647 human-coding bases and 21,129 total characters; this matrix was then analyzed with ML, Bayesian, maximum parsimony, and Neighbor-Joining approaches. Bayesian analysis yielded a highly resolved tree with posterior probabilities of 1.0 for all nodes (supplementary fig. 1, Supplementary Material online). This tree is fully congruent with the ML tree, and both trees are largely in agreement with other recent phylogenetic studies (Waddell, Cao, et al. 1999; Murphy, Eizirik, Johnson, et al. 2001; Murphy, Eizirik, O'Brien, et al. 2001; Waddell et al. 2001; Delsuc et al. 2002; Lin et al. 2002; Amrine-Madsen et al. 2003; Phillips and Penny 2003; Springer et al. 2003, 2004; Waddell and Shelley 2003; Kriegs et al. 2006; Nishihara et al. 2006; Hallstrom et al. 2007; Murphy et al. 2007), though a few nodes are different than those reported in some of these studies (see below and fig. 1).
FIG. 1.
FIG. 1.—
ML tree derived from the analysis of the coding sequence partition using RY-coded bases and a codon position partitioned CF + Γ model. Branch lengths indicate likelihood-inferred substitutions per site with a GTR + Γ model. ML bootstrap (more ...)
The ML bootstrap proportions are significantly lower than Bayesian posterior probabilities. To investigate the relationship between the bootstrap proportions and the Bayesian posterior probabilities, we developed a “Bayesian bootstrap score” based on the majority rules consensus tree at stationarity for each bootstrapped data set. The resulting score was only slightly higher than the ML bootstrap proportions (see supplementary fig. 2, Supplementary Material online); these results support the notion that Bayesian posterior probabilities may be misleadingly high in some cases and are not necessarily comparable with traditional bootstrap measures of support (Waddell et al. 2001, 2002; Douady et al. 2003; Yang 2007). It is possible that more complex models and/or Bayesian techniques for combining data could lead to posterior probabilities that better reflect the uncertainties in the tree (Edwards et al. 2007; Liu and Pearl 2007). Some bootstrap replicates resulted in unlikely rearrangements of chicken and platypus (e.g., platypus outside the chicken and other mammals). We also saw this when the coding plus conserved noncoding sequence matrix was nucleotide coded and analyzed with maximum parsimony. We believe that these findings are a consequence of the long branches leading to the monotreme and reptile species represented in this study, as well as the low taxon sampling of those groups and the distant fish outgroup. Because these unlikely arrangements affected bootstrap support values, we constrained the monophyly of mammals unless otherwise noted.
Although Bayesian posterior probabilities are high for all branches except the Homo–Pan–Gorilla group (human, chimpanzee, and gorilla), ML support is not sufficient to resolve some branches. In an attempt to rectify this and to take advantage of our notably large sequence data set, we investigated using conserved noncoding sequence for tree construction. Conserved bases were identified using phastCons (Siepel et al. 2005), which utilizes a phylogenetic hidden Markov model to distinguish conserved versus nonconserved sequence and a GTR model of substitution rates to identify conserved segments (other details are provided in Methods). With the parameters used, 90% of the known coding sequence and 5.5% of the presumed noncoding sequence within the region were identified as conserved. We generated a matrix containing 153,552 bases by combining the coding sequence alignment and the conserved noncoding sequence alignment to form a coding plus conserved noncoding sequence matrix; this matrix yielded highly resolved trees with strong branch support (fig. 2).
FIG. 2.
FIG. 2.—
ML tree derived from the analysis of coding plus conserved noncoding sequence matrix using RY-coded bases. A CF + Γ model was used, with 4 partitions: 3 for codon positions and 1 for conserved noncoding sequence. Long branches leading to platypus (more ...)
Although we found significant differences between the trees generated with the coding versus conserved noncoding sequence partitions based on ILD tests (P = 0.001), this was entirely due to the less-conserved third codon positions (Bull et al. 1993). When third codon positions were excluded (i.e., using partitions consisting of codon positions 1 and 2 vs. conserved noncoding sequences), there was no significant difference between the results obtained with each partition (ILD test P = 0.432; see table 2). RY-based coding of the data (discussed below) eliminated the significant differences between partitions, even when the third codon positions were included (ILD test P = 0.328; see table 2). Indeed, we only encountered differences between the trees generated with the 2 partitions on branches that were weakly supported by the coding sequence data set (figs. 1 and and2).2). Finally, we performed likelihood and Bayesian analysis with models partitioned by codon position (i.e., with the same model of evolution but allowing parameters of those models to vary between partitions); no significant differences in tree topology were seen, although minor differences in branch support scores were noted. Because phastCons bases its identification of conserved sequences on a phylogenetic tree, we also used Gblocks, a phylogenetically naive program for finding conserved regions. We found no differences in the ML tree generated from conserved regions identified by phastCons and Gblocks, with the associated bootstrap proportions similar in both cases (data not shown).
Table 2
Table 2
Pairwise ILD P Valuesa
We also examined coding sequence for high-confidence indels, finding 24 phylogenetically informative indels (supplementary fig. 6, Supplementary Material online). A large number of the coding indels were shared by closely related species, such as rat and mouse (7 indels supporting), and separated the fish as our outgroup (5 indels supporting). Notably, we found 3 indels that are homoplastic on any of our trees, 2 of which (labeled “l” and “p” in supplementary fig. 6, Supplementary Material online) likely reflect multiple independent deletion events as they were detected in marsupials and only 1–3 species in Euarchontoglires. The third indel that appears homoplastic on our trees joins dusky titi to the apes (human, chimpanzee, gorilla, orangutan, and gibbon); this may be the result of lineage sorting or independent events. Although multiple deletion events may be relatively rare, the observed homoplasy suggests that caution should be used in interpreting support for taxa based on small numbers of such events.
Nonphylogenetic Signals
Large sequence data sets, such as the one analyzed here, offer the potential to resolve weakly supported branches; however, they can also be prone to detecting nonphylogenetic signals that confound the results (Philippe et al. 2005). We examined several potential sources of “systematic error” or “nonphylogenetic signals,” attempting to exclude them or to control for their influence (Philippe et al. 2005). Specifically, we considered base-composition bias, incongruence across the genomic region, missing data, influence of the alignment guide tree, and long-branch attraction as possible sources of nonphylogenetic signal. We further examined long-branch attraction during the analysis of various individual taxa.
Base Composition
There are significant differences in base composition among species, ranging from 45% to 58% G + C in the coding sequence partition and 32% to 45% G + C in the conserved noncoding sequence partition. The chi-square test for homogeneity of base frequencies across taxa was highly significant (P < 0.000001 for all partitions examined, except with codon second positions P = 0.00926835), though the validity of this test is questionable because it does not take phylogenetic structure into account. To reduce the effects of nonphylogenetic signals due to base-composition differences among species, we coded the nucleotides as purines or pyrimidines (RY coding) (Phillips et al. 2001, 2004; Philippe et al. 2005). This approach also has the benefit of removing signals deriving from the more common transitions that may be associated with higher rates of saturation due to reversals. Indeed, we found that RY-coding eliminated significant differences in trees supported by the less-conserved codon third positions (table 2). Because of the large data set size, we maintained sufficient signal with RY-coded data to make robust phylogenetic inferences (figs. 1 and and2).2). We further found that RY-coding eliminates almost all base-composition differences among species. The coefficient of variation between purines and pyrimidines was 84% lower than the coefficient of variation between G + C and A + T for the coding sequence partition and 87% lower for the conserved noncoding sequence partition. RY-coding also eliminated significant differences in trees supported by codon third positions (ILD P = 0.948).
Incongruence Across the Genomic Region
Combining phylogenetic data is thought to potentially be problematic (Bull et al. 1993). For example, recent genome-wide studies in yeast and bacteria encountered problems with combining large numbers of protein-coding sequence alignments (Comas et al. 2007; Edwards et al. 2007). To check for heterogeneity of support across the alignment, we split the coding plus conserved noncoding sequence matrix into 10 equal-sized segments and analyzed each with ML (fig. 3 and table 3). Although we found support for the Atlantogenata hypothesis with all 10 segments, there was considerable variation in the strength of that support, with some segments providing much larger shares of the overall support. One segment (6; see table 3 and fig. 3) did not contain any armadillo sequence (due to a gap in the BAC map) and thus could not provide support for the placental root. There was considerable heterogeneity of support among the segments for some other clades as well. Laurasiatheria was generally not well resolved in any of our analyses, and orders within Laurasiatheria did not have a consistent relationship among the different segments. The relationship between the marsupials and platypus varied as well, perhaps because of the long branches and poor taxon sampling of only one monotreme. Additionally, the relationships among owl monkey, marmoset, and squirrel monkey differed among partitions, though with weak support (discussed further below). A majority of segments supported Marsupionta, though overall support was strongly in favor of Theria (figs. 2 and and33).
FIG. 3.
FIG. 3.—
ML trees for each of 10 sequential, equal-sized partitions from the coding plus conserved noncoding sequence matrix. Numbers (1–10) reflect the specific partition used. The arrangement of taxa and branches indicated in colors other than black (more ...)
Table 3
Table 3
Relative Likelihood Support for Placental Root across Coding Plus Conserved Noncoding Sequence Matrix
Missing Data
Missing data have been considered a source of systematic error in phylogenetic analyses (Huelsenbeck 1991; Kearney 2002), although some simulation studies suggest that missing data should not be a problem for data sets with sufficient characters to provide robust signal (Rosenberg and Kumar 2003; Philippe et al. 2004, 2005; Wiens 2005). The coding sequence alignment had 11% missing data (including indels and sequencing gaps). To test if the missing data were biasing our analyses, we performed ILD tests on the RY-coded coding sequence partition, comparing alignment columns with gaps in 3 or fewer species and those with gaps in more than 3 species; no significant differences were seen (P = 0.591, supplementary table 1, Supplementary Material online). Similar results were obtained when the same analysis was performed with the RY-coded conserved noncoding sequence partition (P = 0.627). Finally, to see if additional missing data would strongly affect the analysis, we randomly deleted 25% of nongap bases from the coding plus conserved noncoding sequence matrix and observed no effect on the resulting ML tree, other than slightly changing the bootstrap support for a few branches (data not shown).
Alignment Guide Tree
We also analyzed a matrix consisting of all TBA-aligned sequence (containing 1,798,347 human bases). Because of computational constraints, we analyzed this data set only by maximum parsimony and ML methods. The trees derived from these analyses were almost completely resolved; however, by permuting the alignment guide tree, we were able to change the relative arrangement of those branches showing 100% bootstrap support in this analysis. Using only the conserved and protein-coding portions of the alignment yielded a tree with fewer well-resolved branches; however, branches with >70% bootstrap support were resistant to permutations of the alignment guide tree. Notably, the only branches with bootstrap proportions <70% were the interordinal branches within Laurasiatheria. These results confirm that difficult-to-resolve branches are more susceptible to biases introduced by aligning less-conserved sequences and that biases due to alignment guide trees can be largely controlled by only considering conserved sequences and strongly supported branches. To control for any possible effect of the alignment guide tree on our phylogenetic analysis, we permuted the alignment guide tree and reanalyzed the data for all controversial branches to confirm that the alignment guide tree was not biasing the results.
Individual Taxa
Euarchontoglires
The primate portion of the tree was strongly supported regardless of the partition or alignment guide tree used with the exception of the Homo–Pan–Gorilla group, which had insufficient informative changes in the RY-coded coding sequence partition to resolve (fig 1). However, when transitions are included, there is sufficient signal and 100% support for the sister relationship between chimpanzees and humans (supplementary fig. 1, Supplementary Material online). The major primate clades (including Catarrhini, Platyrrhini, and Strepsirrhini) were all supported at 100% by ML, Bayesian, Neighbor-Joining, and maximum parsimony approaches. These results largely agree with other recently reported molecular phylogenies for the primates (Barroso et al. 1997; Goodman et al. 1998; Poux and Douzery 2004; Ray et al. 2005; Opazo et al. 2006), with the exception of the relationships within Cebidae (marmoset, squirrel monkey, and owl monkey) (Barroso et al. 1997). Molecular systematic studies have disagreed on the arrangement of these 3 taxa. We find support for the association of squirrel monkey with marmoset. Although we see consistent support for this association regardless of alignment guide tree, RY- or nucleotide-coding, or tree inference algorithm, bootstrap proportions are relatively weak, and the support varies across the genomic region under study (fig. 3). The monophyly of both Glires and Euarchontoglires was strongly supported by Bayesian, ML, and maximum parsimony approaches, as in other recent studies (Thomas et al. 2003; Douzery and Huchon 2004); note that this is in sharp contrast to the results of Misawa and Janke (2003). Notably, Neighbor-Joining support for Glires and Euarchontoglires was also strong (bootstrap proportion 91%, supplementary fig. 5, Supplementary Material online) in contrast to the Neighbor-Joining results of Wildman et al. (2007), who used whole-genome shotgun sequence data; this is potentially explained by the much greater taxon sampling afforded by our data set. The placement of guinea pig securely within Rodentia is also strongly supported by our data (SH test P < 0.0001 excluding guinea pig as an outgroup to the other rodents), in agreement with others (Sullivan and Swofford 2004); this result holds when the alignment guide tree is permuted to place guinea pig outside the rodents.
Laurasiatheria
Within Laurasiatheria, there has been considerable disagreement about the arrangement and composition of the historical order Insectivora, although most recent molecular studies divide up this group among several orders, with tenrec falling in Afrotheria (Stanhope et al. 1998; Lin et al. 2002; Nishihara et al. 2006) and shrew and Erinaceous hedgehogs falling in Eulipotyphla (within Laurasiatheria) (Madsen et al. 2001; Murphy, Eizirik, Johnson, et al. 2001; Lin et al. 2002; Malia et al. 2002; Amrine-Madsen et al. 2003). Many studies of mitochondrial DNA have placed the Eulipotyphlans at the root of Placentalia, probably because of the unusually high AT content of their mitochondrial genomes; however, recent studies with greater taxon sampling and more complex models have also placed them in Laurasiatheria (Waddell, Cao, et al. 1999; Arnason et al. 2002; Lin et al. 2002; Gibson et al. 2005; Kjer and Honeycutt 2007). Our Neighbor-Joining tree has Eulipotyphla at the root of Placentalia, but this may be a consequence of the long-branch lengths of the 2 Eulipotyphlans (hedgehog and shrew) represented in our data set (fig. 2 and supplementary fig. 5, Supplementary Material online). Using ML, Bayesian, and maximum parsimony approaches, our analyses consistently place Eulipotyphla at the root of Laurasiatheria, as do most other recent studies (fig. 2 and supplementary figs. 3 and 5; Supplementary Material online) (Murphy, Eizirik, O'Brien, et al. 2001; Waddell et al. 2001; Arnason et al. 2002; Scally et al. 2002; Amrine-Madsen et al. 2003; Waddell and Shelley 2003; Nishihara et al. 2006; Nikolaev et al. 2007).
The placement of Perissodactyla, represented here by the horse, has been another source of controversy. Usually, this group is placed either sister to Cetartiodactyla (Murphy, Eizirik, Johnson, et al. 2001; Lin et al. 2002) or sister to Carnivora (Murphy, Eizirik, O'Brien, et al. 2001; Arnason and Janke 2002; Amrine-Madsen et al. 2003), in most cases with weak support (Waddell, Cao, et al. 1999). Schwartz et al. (2003) found a single transposon insertion supporting the Perissodactyla–Carnivora association; meanwhile, Nishihara et al. (2006) found a single transposon insertion supporting a Perissodactyla–Carnivora association and 5 insertions supporting a Perissodactyla–Chiroptera–Carnivora (Pegasoferae, Nishihara et al. 2006) association (i.e., excluding the traditional Perissodactyla–Cetartiodactyla association of hoofed mammals that we see with this analyses). Of note, Nishihara et al. (2006) did also find one transposon insertion that conflicted with the Pegasoferae hypothesis. Even with the large number of characters analyzed here, we only found weak bootstrap support for the placement of Perissodactyla (fig. 2), although it tended to associate closest with the Cetartiodactylans and secondarily with the Carnivores. Across the region, support for the arrangement of orders varied significantly, with only segments 4, 5, and 10 agreeing and none of the segments agreeing with the ML tree for the entire matrix (fig. 3). Thus, the arrangement of orders within Laurasiatheria appears to be difficult to resolve even with large amounts of sequence data and reasonably large numbers of species represented. We further found that the relative arrangement of Laurasiatherian orders was highly sensitive to alignment guide tree artifacts, though not in a predictable way. Using the coding plus conserved noncoding sequence matrix, we performed SH tests with the 5 most supported Laurasiatherian trees from the literature; none could be excluded with high confidence, and this likely is due, in part, to the short branches separating Laurasiatherian orders. Perhaps with increased taxon sampling, this problem will be more tractable. It may be that a strong nonphylogenetic signal or incomplete lineage sorting is obscuring the interordinal relationships within Laurasiatheria. Methods that treat gene trees and species trees simultaneously, such as that described by Liu and Pearl (2007), might also be able to better resolve such regions.
Theria
Although considered a mammal, the phylogenetic placement of monotremes has long been controversial. The hierarchical placement of monotremes as an outgroup of the other mammals has been challenged by molecular and morphological studies that placed Monotremata as a sister group to the marsupials in a clade called Marsupionta (Janke et al. 1996, 1997). Our results, however, agree with recent molecular studies that yielded significant evidence (including coding indels) in support of the monophyly of Theria (placental mammals and marsupials), with the monotremes as the first branch of the mammalian tree (Killian et al. 2001; Phillips and Penny 2003; van Rheede et al. 2006). We find strong bootstrap support (≥94%, see figs. 1 and and2)2) for Theria by Neighbor-Joining, maximum parsimony, Bayesian, and ML approaches, and SH tests with nucleotide-coded data just reach significance (P = 0.0251) in excluding Marsupionta with the coding plus conserved noncoding sequence matrix. However, there is significant heterogeneity of support for Theria across the region, and a majority of the segments (6/10) support Marsupionta (fig. 3). These findings are consistent with other recent molecular and morphological analyses that supported the monophyly of Theria (Hu et al. 1997; Phillips and Penny 2003; van Rheede et al. 2006) but illustrate the difficulty of determining the relationships between these clades.
Placentalia
Although the nucleotide-coded protein-coding sequence partition failed to resolve the root of Placentalia (ML bootstrap <60%, supplementary fig. 1, Supplementary Material online), the RY-coded coding sequence partition supports an Atlantogenatan root (fig. 1). Adding the conserved noncoding partition provides high statistical support for Atlantogenata, both with nucleotide- and RY-coded data (figs. 2 and and4).4). Bootstrap support of 100% is seen with all model-based approaches used (including ML, Bayesian, Neighbor-Joining, and minimum evolution supplementary figs. 4 and 5). SH tests using the coding plus conserved noncoding sequence matrix exclude Epitheria and Exafroplacentalia, with the results significant past the limits of PAUP and CONSEL (P < 0.0001) (fig. 4). Because of the limited number of Afrotherian and Atlantogenatan species in this study, some caution is warranted in interpreting these results. Maximum parsimony analysis of the nucleotide-coded coding sequence partition supports an Epitherian root (e.g., fig. 4A), but when codon third position sites are removed or the sequence is RY-coded, bootstrap support is reduced to <50%. Maximum parsimony analysis of the coding plus conserved noncoding sequence partition also weakly supports Epitheria (supplementary fig. 4).
FIG. 4.
FIG. 4.—
Three possible roots for Placentalia. SH test results from the coding plus conserved noncoding sequence matrix for both nucleotide- and RY-coded matrices. (A) Hypothesis rooting Placentalia between Xenarthra and Epitheria (Boreoeutheria + Afrotheria). (more ...)
To exclude the influence of biases introduced by the alignment guide tree, we realigned the sequences using a guide tree based on the highest likelihood tree constrained to each possible root, then repeated the likelihood analysis. In all cases, the ML tree derived from the coding plus conserved noncoding sequence matrix was rooted between Atlantogenata and Boreoeutheria with 100% bootstrap support and highly significant SH test results (P < 0.001). Because tenrec has a significantly longer branch length with these data than elephant or armadillo, we tried individually removing tenrec and elephant. With either tenrec or elephant missing, we still saw ≥99% ML bootstrap support for Atlantogenata. We also separated the coding plus conserved noncoding sequence matrix into 10 equally sized partitions and analyzed each separately (fig. 3). Although the likelihood values for Atlantogenata varied significantly among the partitions, all partitions supported an Atlantogenatan root (table 3).
These results agree with some other recent large-scale analyses on mostly independent data sets (e.g., Hallstrom et al. [2007]; Murphy et al. [2007]; and Wildman et al. [2007]) but conflict with the findings of Nikolaev et al. (2007), Nishihara et al. (2007), and Kriegs et al. (2006). Kriegs et al. (2006) found 2 transposon insertions in Afrotheria and Boreoeutheria that were not found in armadillo or sloth. These transposon sequences are quite old, and the flanking sequence is not well conserved. Thus, the transposon-associated sequences may have mutated out of recognition in the 30 Myr from the placental root to the divergence of armadillo and sloth; alternatively, multiple transposon insertions may have occurred in Afrotheria and Boreoeutheria (Springer et al. 2003; Kriegs et al. 2006; Murphy et al. 2007). Homoplasy for transposon insertions due to targeted insertions or lineage sorting on short branches, though presumably rare, has been reported (Pecon-Slattery et al. 2004; van de Lagemaat et al. 2005; Yu and Zhang 2005; Nishihara et al. 2006) and could also explain these results.
Nikolaev et al. (2007) analyzed amino acid and conserved noncoding genomic sequences from 14 species to examine the root of Placentalia. Using ML analyses of conserved noncoding sequence from the ENCODE pilot project regions (http://www.genome.gov/10005107), they exclude the Epitheria hypothesis and, separately, use amino acid sequences derived from the same regions to exclude the Atlantogenata hypothesis. Notably, analyses using their largest data set (conserved noncoding sequence) failed to differentiate between rooting Placentalia at Atlantogenata or Exafroplacentalia. Additionally, their limited taxon and outgroup sampling argues for caution in interpreting the final results (Delsuc et al. 2002); for example, when we only analyzed data from the 14 species studied by Nikolaev et al. (2007), we still found support for Atlantogenata as the root, although the bootstrap support was weak. The data used in our analysis contain significantly more taxa, both ingroup and outgroup, than other recent large-scale nucleotide-based analyses, and this may affect the results significantly.
In summary, we used a comparative sequence data set that contains a remarkably large number of conserved bases to derive a phylogeny that provides additional evidence to resolve some of the controversial branches in the mammalian lineage. We find significant support for an Atlantogenatan root of Placentalia, as well as additional evidence for the monophyly of Theria. Our studies highlight the difficulties in resolving some very short mammalian branches (e.g., interordinal relationships within Laurasiatheria), even with large amounts of data. Our work further illustrates the value of large genomic sequence data sets for improving the resolution of phylogenetic trees, in this case, to clarify some of the remaining ambiguities within the mammalian tree. Sequences from an increasing number of mammalian taxa should help to resolve the remaining ambiguities associated with the short branches within and between the placental orders.
Supplementary Material
Supplementary figures 16 and table 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Supplementary Material
[Supplementary Data]
Acknowledgments
We thank Adam Siepel and Webb Miller for helpful comments about this manuscript. We also thank members of the NISC Comparative Sequencing Program (particularly B. Blakesley, G. Bouffard, J. McDowell, B. Maskeri, M. Park, N. Hanson, and P. Thomas) for providing leadership in the generation of the comparative sequence data analyzed here. We also thank Associate Editor Scott Edwards and 2 anonymous reviewers for unusually thorough and helpful comments that ultimately resulted in a significantly improved manuscript. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.
  • Amrine-Madsen H, Koepfli KP, Wayne RK, Springer MS. A new phylogenetic marker, apolipoprotein B, provides compelling evidence for eutherian relationships. Mol Phylogenet Evol. 2003;28:225–240. [PubMed]
  • Aparicio S, Chapman J, Stupka E, et al. (41 co-authors) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science. 2002;297:1301–1310. [PubMed]
  • Arnason U, Adegoke JA, Bodin K, Born EW, Esa YB, Gullberg A, Nilsson M, Short RV, Xu X, Janke A. Mammalian mitogenomic relationships and the root of the eutherian tree. Proc Natl Acad Sci USA. 2002;99:8151–8156. [PubMed]
  • Arnason U, Janke A. Mitogenomic analyses of eutherian relationships. Cytogenet Genome Res. 2002;96:20–32. [PubMed]
  • Barroso CML, Schneider H, Schneider MPC, Sampaio I, Harada ML, Czelusniak J, Goodman M. Update on the phylogenetic systematics of new world monkeys: further DNA evidence for placing the pygmy marmoset (Cebuella) within the genus Callithrix. Int J Primatol. 1997;18:651–674.
  • Bashir A, Ye C, Price AL, Bafna V. Orthologous repeats and mammalian phylogenetic inference. Genome Res. 2005;15:998–1006. [PubMed]
  • Blakesley RW, Hansen NF, Mullikin JC, et al. (22 co-authors) An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 2004;14:2235–2244. [PubMed]
  • Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715. [PubMed]
  • Boissinot S, Entezam A, Young L, Munson PJ, Furano AV. The insertional history of an active family of L1 retrotransposons in humans. Genome Res. 2004;14:1221–1231. [PubMed]
  • Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ. Partitioning and combining data in phylogenetic analysis. Syst Biol. 1993;42:384–397.
  • Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000;17:540–552. [PubMed]
  • Chaisson MJ, Raphael BJ, Pevzner PA. Microinversions in mammalian evolution. Proc Natl Acad Sci USA. 2006;103:19824–19829. [PubMed]
  • Chen JM, Stenson PD, Cooper DN, Ferec C. A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum Genet. 2005;117:411–427. [PubMed]
  • Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics. 2004;20:426–427. [PubMed]
  • Comas I, Moya A, Gonzalez-Candelas F. From phylogenetics to phylogenomics: the evolutionary relationships of insect endosymbiotic gamma-proteobacteria as a test case. Syst Biol. 2007;56:1–16. [PubMed]
  • Cope ED. The Edentata of North America. Am Nat. 1889;23:657–664.
  • Cunningham CW. Can three incongruence tests predict when data should be combined? Mol Biol Evol. 1997;14:733–740. [PubMed]
  • Delsuc F, Scally M, Madsen O, Stanhope MJ, de Jong WW, Catzeflis FM, Springer MS, Douzery EJ. Molecular phylogeny of living xenarthrans and the impact of character and taxon sampling on the placental tree rooting. Mol Biol Evol. 2002;19:1656–1671. [PubMed]
  • Douady CJ, Delsuc F, Boucher Y, Doolittle WF, Douzery EJ. Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol. 2003;20:248–254. [PubMed]
  • Douzery EJ, Huchon D. Rabbits, if anything, are likely Glires. Mol Phylogenet Evol. 2004;33:922–935. [PubMed]
  • Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc Natl Acad Sci USA. 2007;104:5936–5941. [PubMed]
  • Felsenstein J. PHYLIP (phylogeny inference package). Distributed by the author. Seattle (WA): Department of Genome Sciences, University of Washington; 2007.
  • Gibson A, Gowri-Shankar V, Higgs PG, Rattray M. A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Mol Biol Evol. 2005;22:251–264. [PubMed]
  • Goodman M, Porter CA, Czelusniak J, Page SL, Schneider H, Shoshani J, Gunnell G, Groves CP. Toward a phylogenetic classification of primates based on DNA evidence complemented by fossil evidence. Mol Phylogenet Evol. 1998;9:585–598. [PubMed]
  • Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003;52:696–704. [PubMed]
  • Hallstrom BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic data analyses provide evidence that Xenarthra and Afrotheria are sister groups. Mol Biol Evol. 2007;24:2059–2068. [PubMed]
  • Hillis DM, Pollock DD, McGuire JA, Zwickl DJ. Is sparse taxon sampling a problem for phylogenetic inference? Syst Biol. 2003;52:124–126. [PMC free article] [PubMed]
  • Hu Y, Wang Y, Luo Z, Li C. A new symmetrodont mammal from China and its implications for mammalian evolution. Nature. 1997;390:137–142. [PubMed]
  • Huelsenbeck JP. When are fossils better than extant taxa in phylogenetic analysis? Syst Zool. 1991;40:458–469.
  • International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [PubMed]
  • Janke A, Gemmell NJ, Feldmaier-Fuchs G, von Haeseler A, Paabo S. The mitochondrial genome of a monotreme–the platypus (Ornithorhynchus anatinus) J Mol Evol. 1996;42:153–159. [PubMed]
  • Janke A, Xu X, Arnason U. The complete mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremata, Marsupialia and Eutheria. Proc Natl Acad Sci USA. 1997;94:1276–1281. [PubMed]
  • Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. The UCSC table browser data retrieval tool. Nucleic Acids Res. 2004;32:D493–D496. [PMC free article] [PubMed]
  • Kearney M. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Syst Biol. 2002;51:369–381. [PubMed]
  • Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12:996–1006. [PubMed]
  • Killian JK, Buckley TR, Stewart N, Munday BL, Jirtle RL. Marsupials and Eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution. Mamm Genome. 2001;12:513–517. [PubMed]
  • Kjer KM, Honeycutt RL. Site specific rates of mitochondrial genomes and the phylogeny of eutheria. BMC Evol Biol. 2007;7:8. [PMC free article] [PubMed]
  • Kluge AG, Wolf AJ. Cladistics: what's in a Name. Cladistics. 1993;9:183–199.
  • Kriegs JO, Churakov G, Kiefmann M, Jordan U, Brosius J, Schmitz J. Retroposed elements as archives for the evolutionary history of placental mammals. PLoS Biol. 2006;4:e91. [PMC free article] [PubMed]
  • Lin YH, McLenachan PA, Gore AR, Phillips MJ, Ota R, Hendy MD, Penny D. Four new mitochondrial genomes and the increased stability of evolutionary trees of mammals from improved taxon sampling. Mol Biol Evol. 2002;19:2060–2070. [PubMed]
  • Liu L, Pearl DK. Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol. 2007;56:504–514. [PubMed]
  • Madsen O, Scally M, Douady CJ, Kao DJ, DeBry RW, Adkins R, Amrine HM, Stanhope MJ, de Jong WW, Springer MS. Parallel adaptive radiations in two major clades of placental mammals. Nature. 2001;409:610–614. [PubMed]
  • Malia MJ, Jr, Adkins RM, Allard MW. Molecular support for Afrotheria and the polyphyly of Lipotyphla based on analyses of the growth hormone receptor gene. Mol Phylogenet Evol. 2002;24:91–101. [PubMed]
  • Margulies EH, Chen CW, Green ED. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends Genet. 2006;22:187–193. [PubMed]
  • Misawa K, Janke A. Revisiting the Glires concept—phylogenetic analysis of nuclear sequences. Mol Phylogenet Evol. 2003;28:320–327. [PubMed]
  • Muller J, Muller K. TREEGRAPH: automated drawing of complex tree figures using an extensible tree description format. Molecular Ecology Notes. 2004;4:786–788.
  • Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ. Molecular phylogenetics and the origins of placental mammals. Nature. 2001;409:614–618. [PubMed]
  • Murphy WJ, Eizirik E, O'Brien SJ, et al. (11 co-authors) Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001;294:2348–2351. [PubMed]
  • Murphy WJ, Pevzner PA, O'Brien SJ. Mammalian phylogenomics comes of age. Trends Genet. 2004;20:631–639. [PubMed]
  • Murphy WJ, Pringle TH, Crider TA, Springer MS, Miller W. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res. 2007;17:413–421. [PubMed]
  • Nikolaev S, Montoya-Burgos JI, Margulies EH, Rougemont J, Nyffeler B, Antonarakis SE. Early history of mammals is elucidated with the ENCODE multiple species sequencing data. PLoS Genet. 2007;3:e2. [PubMed]
  • Nishihara H, Hasegawa M, Okada N. Pegasoferae, an unexpected mammalian clade revealed by tracking ancient retroposon insertions. Proc Natl Acad Sci USA. 2006;103:9929–9934. [PubMed]
  • Nishihara H, Okada N, Hasegawa M. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol. 2007;8:R199. [PMC free article] [PubMed]
  • Nishihara H, Satta Y, Nikaido M, Thewissen JG, Stanhope MJ, Okada N. A retroposon analysis of Afrotherian phylogeny. Mol Biol Evol. 2005;22:1823–1833. [PubMed]
  • Nylander JAA. Program distributed by the author. Uppsala (Sweden): Evolutionary Biology Centre, Uppsala University; 2004. MrModeltest.
  • Opazo JC, Wildman DE, Prychitko T, Johnson RM, Goodman M. Phylogenetic relationships and divergence times among New World monkeys (Platyrrhini, Primates) Mol Phylogenet Evol. 2006;40:274–280. [PubMed]
  • Pecon-Slattery J, Pearks Wilkerson AJ, Murphy WJ, O'Brien SJ. Phylogenetic assessment of introns and SINEs within the Y chromosome using the cat family felidae as a species tree. Mol Biol Evol. 2004;21:2299–2309. [PubMed]
  • Philippe H, Delsuc F, Brinkmann H, Lartillot N. Phylogenomics. Annu Rev Ecol Evol. 2005;36:541–562.
  • Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004;21:1740–1752. [PubMed]
  • Phillips MJ, Delsuc F, Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004;21:1455–1458. [PubMed]
  • Phillips MJ, Lin YH, Harrison GL, Penny D. Mitochondrial genomes of a bandicoot and a brushtail possum confirm the monophyly of australidelphian marsupials. Proc Biol Sci. 2001;268:1533–1538. [PMC free article] [PubMed]
  • Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. [PubMed]
  • Pollard DA, Moses AM, Iyer VN, Eisen MB. Detecting the limits of regulatory element conservation and divergence estimation using pairwise and multiple alignments. BMC Bioinformatics. 2006;7:376. [PMC free article] [PubMed]
  • Posada D, Crandall KA. MODELTEST: testing the model of DNA substitution. Bioinformatics. 1998;14:817–818. [PubMed]
  • Poux C, Douzery EJ. Primate phylogeny, evolutionary rate variations, and divergence times: a contribution from the nuclear gene IRBP. Am J Phys Anthropol. 2004;124:1–16. [PubMed]
  • Ray DA, Xing J, Hedges DJ, et al. (13 co-authors) Alu insertion loci and platyrrhine primate phylogeny. Mol Phylogenet Evol. 2005;35:117–126. [PubMed]
  • Rokas A, Carroll SB. More genes or more taxa? The relative contribution of gene number and taxon number to phylogenetic accuracy. Mol Biol Evol. 2005;22:1337–1344. [PubMed]
  • Rosenberg MS, Kumar S. Taxon sampling, bioinformatics, and phylogenomics. Syst Biol. 2003;52:119–124. [PMC free article] [PubMed]
  • Scally M, Madsen O, Douady CJ, de Jong WW, Stanhope MJ, Springer MS. Molecular evidence for the major clades of placental mammals. J Mammal Evol. 2002;8:239–277.
  • Schwartz S, Elnitski L, Li M, Weirauch M, Riemer C, Smit A, Green ED, Hardison RC, Miller W. MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 2003;31:3518–3524. [PMC free article] [PubMed]
  • Shimamura M, Yasue H, Ohshima K, Abe H, Kato H, Kishiro T, Goto M, Munechika I, Okada N. Molecular evidence from retroposons that whales form a clade within even-toed ungulates. Nature. 1997;388:666–670. [PubMed]
  • Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol. 1999;16:1114–1116.
  • Shoshani J, McKenna MC. Higher taxonomic relationships among extant mammals based on morphology, with selected comparisons of results from molecular data. Mol Phylogenet Evol. 1998;9:572–584. [PubMed]
  • Siepel A, Bejerano G, Pedersen JS, et al. (16 co-authors) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. [PubMed]
  • Springer MS, de Jong WW. Phylogenetics. Which mammalian supertree to bark up? Science. 2001;291:1709–1711. [PubMed]
  • Springer MS, Murphy WJ, Eizirik E, O'Brien SJ. Placental mammal diversification and the cretaceous-tertiary boundary. Proc Natl Acad Sci USA. 2003;100:1056–1061. [PubMed]
  • Springer MS, Stanhope MJ, Madsen O, de Jong WW. Molecules consolidate the placental mammal tree. Trends Ecol Evol. 2004;19:430–438. [PubMed]
  • Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–2690. [PubMed]
  • Stanhope MJ, Waddell VG, Madsen O, de Jong W, Hedges SB, Cleven GC, Kao D, Springer MS. Molecular evidence for multiple origins of Insectivora and for a new order of endemic African insectivore mammals. Proc Natl Acad Sci USA. 1998;95:9967–9972. [PubMed]
  • Sullivan J, Swofford DL. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J Mammal Evol. 2004;4:77–86.
  • Swofford DL. PAUP*: Phylogenetic analyisis using parsimony (*and other methods) 2003. Sunderland (MA): Sinauer Associates.
  • Thomas JW, Prasad AB, Summers TJ, Lee-Lin SQ, Maduro VV, Idol JR, Ryan JF, Thomas PJ, McDowell JC, Green ED. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Res. 2002;12:1277–1285. [PubMed]
  • Thomas JW, Touchman JW, Blakesley RW, et al. (71 co-authors) Comparative analyses of multi-species sequences from targeted genomic regions. Nature. 2003;424:788–793. [PubMed]
  • van de Lagemaat LN, Gagnier L, Medstrand P, Mager DL. Genomic deletions and precise removal of transposable elements mediated by short identical DNA segments in primates. Genome Res. 2005;15:1243–1249. [PubMed]
  • van Rheede T, Bastiaans T, Boone DN, Hedges SB, de Jong WW, Madsen O. The platypus is in its place: nuclear genes and indels confirm the sister group relation of monotremes and Therians. Mol Biol Evol. 2006;23:587–597. [PubMed]
  • Waddell PJ, Cao Y, Hauf J, Hasegawa M. Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid-invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. Syst Biol. 1999;48:31–53. [PubMed]
  • Waddell PJ, Kishino H, Ota R. A phylogenetic foundation for comparative mammalian genomics. Genome Inform. 2001;12:141–154. [PubMed]
  • Waddell PJ, Kishino H, Ota R. Very fast algorithms for evaluating the stability of ML and Bayesian phylogenetic trees from sequence data. Genome Inform. 2002;13:82–92. [PubMed]
  • Waddell PJ, Okada N, Hasegawa M. Towards resolving the interordinal relationships of placental mammals. Syst Biol. 1999;48:1–5. [PubMed]
  • Waddell PJ, Shelley S. Evaluating placental inter-ordinal phylogenies with novel sequences including RAG1, gamma-fibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino acid, and codon models. Mol Phylogenet Evol. 2003;28:197–224. [PubMed]
  • Waters PD, Dobigny G, Waddell PJ, Robinson TJ. Evolutionary history of LINE-1 in the major clades of placental mammals. PLoS ONE. 2007;2:e158. [PMC free article] [PubMed]
  • Wiens J. Can incomplete taxa rescue phylogenetic analyses from long-branch attraction? Syst Biol. 2005;54:731–742. [PubMed]
  • Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, Gascuel O, Grossman LI, Romero R, Goodman M. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci USA. 2007;104:14395–14400. [PubMed]
  • Yang Z. Fair-balance paradox, star-tree paradox, and Bayesian phylogenetics. Mol Biol Evol. 2007;24:1639–1655. [PubMed]
  • Yu L, Zhang YP. Evolutionary implications of multiple SINE insertions in an intronic region from diverse mammals. Mamm Genome. 2005;16:651–660. [PubMed]
Articles from Molecular Biology and Evolution are provided here courtesy of
Oxford University Press