All species on Earth – current estimates exceed 1.8 million – are related through common ancestors in the evolutionary Tree of Life. The construction of this phylogeny is a major endeavor for biology and largely now depends on the unprecedented growth of molecular sequence data available in public databases. Efforts focused on single clades, whole genome sequencing, genomic library construction (ESTs, BACs), and large collaborative efforts, such as NSF's Assembling the Tree of Life project, are contributing to the fast-paced growth of public databases, with more than 92 million sequences stored in the current release of GenBank (release 167). Current efforts to infer really large phylogenetic trees center on data combination using so-called supertree [e.g., [
1]] and supermatrix methods [e.g., [
2-
4]] as opposed to using a single gene (or multiple genes) sampled very widely across taxa [e.g., [
5,
6]]. For example, recent large-scale database-enabled phylogenetic analyses employing these approaches have shed light on the radiation and early evolution of mammals [
1], and the phylogenetic diversity of Bacteria [
3]. Recent advances in phylogenetic tree-building methods have provided the necessary first steps in approaching the problem of producing large and comprehensive phylogenetic trees [
7-
9]. However, assembling large datasets from databases remains a critical problem upstream of the tree-building process.
Supertree methods compile many source trees with partially overlapping taxa into a single comprehensive tree [
10,
11]. Generally, each source topology is converted into a data matrix and combined with other topological matrices. Many different algorithms exist for creating the final supertree including MRP (matrix representation with parsimony; [
12,
13]), MRF (matrix representation with flipping; [
14]), MinCut [
15], and modified MinCut [
16]. Although straightforward, supertree methods are not without their limitations, including problems related to data independence (same data can contribute to more than one source tree), "signal enhancement" ([
11] novel relationships in supertrees contradicting one or several source trees), and the assessment of uncertainty and confidence in relationships [
17]. In addition, supertrees are strictly topological, thus requiring sequence data to obtain useful branch lengths [
1]. Most importantly however, supertrees do not directly rely on the primary data for tree inference, making novel topologies suspect. Perhaps due to these limitations, and despite active development of methodologies [e.g., [
1]], few large supertrees for diverse groups have been successfully constructed (but see [
18,
19]).
Supermatrix methods, on the other hand, are directly inferred from the sequence data through the construction of a large multiple sequence alignment for simultaneous analysis of the final data-matrix [
20]. Given the fact that few genes are sampled very completely across many taxa, supermatrix methods often sacrifice completeness in the interest of size. In fact, one of the largest supermatrices, with >2000 tips, had 95% missing data [
4]. Other supermatrix analyses have focused on the number of gene regions and not on the number of species [
2,
3]. The construction of a large supermatrix involves a number of computationally challenging steps including, but not limited to, database operations, BLAST comparisons, sequence clustering, multiple sequence alignment, and combining data sets. An exhaustive discussion of these steps is presented elsewhere [
4], but each will be briefly touched upon as it relates to the approach presented here. Typically, sequences have been deposited in a database and all-by-all sequence comparisons with BLAST are conducted to assemble sequence clusters based on similarity. Methods for this step include agglomerative procedures, like single linkage clustering (e.g. blastclust) and stochastic methods (e.g. Tribe-MCL; [
21]). Clustered sequences are then submitted to multiple sequence alignment (MSA). There are a host of other procedures that can be conducted once multiple sequence alignments are produced, especially related to identifying sequence orthology. Multi-locus datasets are created from individual alignments that do not have "too many" missing entries using a bipartite graph of taxa and loci and combining with bicliques or quasi-bicliques [
22,
23].
Each step described above is computationally difficult and rarely has been discussed in the context of what might be optimal for the final goal of tree construction. Despite the computational difficulties and potential shortcomings of specific steps in their construction, supermatrix methods allow for simultaneous data analysis. Also, unlike supertrees, they do not suffer from data independence or "signal enhancement" problems, and, at least in principle, confidence can be assessed using standard bootstrapping approaches. However, problems related to missing data and assessing the quality of the trees produced persists. Tools addressing certain steps of supermatrix construction are beginning to become available (e.g., Phylota; [
24]) and some notable large trees have been successfully produced [
4]. However, tools for constructing supermatrices are not readily available for comparative biologists, and rather few large matrices for specific clades have been successfully analyzed. Nevertheless, supermatrix methods have made enormous strides forward and recent discussions have begun to center on methods that combine elements of both supertree and supermatrix approaches [e.g. [
17,
25,
26]].
The method introduced here, to which we refer to as a "mega-phylogeny", is most similar to supermatrix methods, but differs from previous methods used to create large matrices in a number of ways. The mega-phylogeny method relies on the user identifying the gene regions of interest by presenting actual examples of the gene region and the breadth of molecular diversity of that gene within the clade of interest. Also, the mega-phylogeny method employs profile alignments to combine alignments of orthologous gene regions that would either be poorly aligned if done across a broad taxonomic group or would be broken up by clustering analyses. The mega-phylogeny approach can quickly create enormous phylogenetic matrices as more data from the same gene may be used, and the problems associated with sequence saturation are specifically attenuated. The first demonstration of this method [
27] produced phylogenies for plant clades from 366 species and 11,374 sites (Dipsacales) to 4657 species and 22,391 sites (Commelinidae).
Here, we describe this new approach and its current implementation. We also present two example phylogenies for two plant clades created using our method: an Asterales phylogeny containing 4954 species and five gene regions and an rbcL phylogeny of green plants (Viridiplantae) comprising more than 13,533 species.