In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key to the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.
Phylogenies have provided a historical framework for interpreting the evolution of form and function since Darwin (1859) and Haeckel (1866) published their iconic tree figures some 150 years ago. In recent years, phylogenetics has come to play a multifaceted role in genomic analyses and interpretation of genomics data. Phylogenetic analyses are now being performed on a genomic scale in order to address issues ranging from the prediction of gene and protein function (Eisen, 1998; Sjölander, 2004; Engelhardt et al., 2005) to organismal relationships (Philippe et al., 2005; Delsuc et al., 2005), to the influences of polyploidy (Bowers et al., 2003; Byrne and Wolfe, 2005) and horizontal gene transfer (Ge et al., 2005; Simonson et al., 2005) on genome content and structure (Wolf et al., 2002), to the reconstruction of ancestral genome characteristics (Blanchette et al., 2004).
The fundamental nature of inferences drawn from all of these applications underscores the growing importance of genomic and sub-genomic investigations of species covering the spectrum of organismal diversity. Phylogenomic analyses, defined here broadly as the integration of phylogenetic and genomic analysis (Eisen and Fraser, 2003), place genome sequence, gene expression (Gu and Gu, 2003; Gu, 2004; Gu et al., 2005; Duarte et al., 2006) and functional data in a historical context and thereby help to elucidate those processes shaping the structure and function of genes, genetic systems and whole organisms. The development and refinement of searchable phylogeny databases such as TreeBase (Piel et al., 2003) or gene tree databases (Duret et al., 1994; Sjölander, 2004; Roth et al., 2005; Hartmann et al., 2006; Li et al., 2006) is an important step in the advancement of phylogenomics, but only a minuscule fraction of published phylogenies are currently deposited in a database. What is worse, the alignments for many published phylogenies are not easily accessible, and methods of analysis are not adequately described. These are serious impediments to those wanting to test the robustness of published phylogenies, conduct cross-study comparisons of phylogenetic inferences, or draw new inferences from meta-analyses.
Accurate phylogenetic trees provide a valuable historical context for a variety of comparative analyses, and can be applied to a host of biological questions unforeseen by the original authors. This is particularly true in phylogenomics, where many applications require the investigation of phylogenetic trees for a large number of independent gene/protein families. If inadequately documented, however, even the most carefully constructed phylogenetic analysis will languish in the pages of a journal. Thus, a key step in the continued ability of phylogenomics to take full advantage of the rapidly expanding volume of sequence data will be the development of reporting standards for phylogenetic analyses, along with databases from which these metadata can easily be retrieved. In this paper, we propose a roadmap to develop a set of reporting standards for phylogenetic analyses. Using the MIAME standard (Brazma et al., 2001) as a model, we call for a community-wide effort to develop a Minimal Information About a Phylogenetic Analysis (MIAPA) standard.
The papers in this special issue constitute a series of case studies on the importance of standard practices for reporting the results of various types of experiments in a way that facilitates the ability of scientists to use these data in subsequent studies (Field and Sansone, 2006). The motivating question behind the MIAME standard for microarray experiments was as follows: “What is the minimum information necessary for an independent scientist to carry out an independent analysis of the data?” (Quackenbush, 2005).
The motivation for the MIAPA standard is the same, as is the challenge: minimizing the reporting requirements while maximizing the information available to those interpreting the results of a study (Brazma, 2001; Brazma et al., 2001; Ball and Brazma, this issue). The phylogenetics community is coming together to develop this standard with careful consideration of the types of future analyses that are likely to be performed and the data required. For example, systematists may combine pre-existing phylogenies into supertree analyses (Davies et al., 2004; Page, 2005), while genomicists may combine them to investigate the timing of genome duplication events (Chapman et al., 2004). At the same time, investigators may require access to the alignments and component sequences used to build the selected phylogenies in order to perform independent phylogenetic analyses on single or combined datasets. Thus, just as the MIAME standard was designed to accommodate the nested organization of gene expression levels derived from signal quantification matrices derived in turn from raw image data (Brazma et al., 2001), the MIAPA standard would need to accommodate phylogenies derived from analysis of alignments derived in turn from raw sequence data.
A decision that was integral to the development and success of the MIAME standards was that they should be applicable to a wide variety of microarray technologies, with no one platform or hybridization protocol prescribed. Similarly, we suggest that the MIAPA standard should be agnostic concerning methods of alignment and phylogenetic reconstruction. The diversity of methods of phylogenetic inference is perhaps even greater than the diversity of applications to which phylogenies may be applied (Swofford et al., 1996; Felsenstein, 2004; Delsuc et al., 2005) and novel methods are likely to be developed in the future. Parsimony, likelihood, Bayesian and distance-based approaches have all been adapted for analyses of the various data types relevant to phylogenomics, including aligned nucleotide and protein sequences, gene structure (insertions and deletions), gene content, motif frequencies (Qi et al., 2004) and gene order (Moret et al., 2001). Multiple sequence alignment has its own diverse set of methodologies, and, in some approaches, a multiple sequence alignment and phylogenetic tree are constructed simultaneously (Gladstein and Wheeler, 1997; Edgar and Sjölander, 2003; Lunter et al., 2005; Fleissner et al., 2005). The relative performance of these different methods is an area of active research, but it is clear that no single method is optimal for all data sets (Swofford et al., 2001; Spencer et al., 2005). Benchmark datasets have been compiled for comparing the performance of alignment algorithms (van Walle et al., 2004; Thompson et al., 2005) but there are few comparable benchmarks for phylogenetic algorithms, and so comparisons have relied largely on analyses of simulated or contrived data sets (Huelsenbeck, 1995; Swofford et al., 2001; Spencer et al., 2005, but see Hillis et al., 1992; Cunningham et al., 1997). Thus, for a variety of reasons, methodological diversity in phylogenetics is likely to be the state of affairs for the foreseeable future.
No matter how phylogenies are constructed, however, a comprehensive description of how a set of sequences was aligned, and how phylogenetic trees were derived from an alignment would allow researchers to evaluate their confidence in a phylogeny and run their own analyses if they see fit.
The six required components of the MIAME standards proposed in 2001 (Brazma et al., 2001) included descriptions of (1) the experimental design for a complete study; (2) the design of each array and the identity of each spot on the arrays used in the study; (3) the biological sample extraction preparations and labelling procedures used for each hybridization; (4) the hybridization protocols; (5) the measurements, including imaging and signal quantification parameters; and (6) the normalization and control information. At this stage, it would be premature to specify the details of the MIAPA standard, but Figure 1 offers a starting point for considering the essential components that it might include. By analogy with the MIAME standards, minimum reporting standards for phylogenetic analyses are likely to include (1) a description of the objectives of the phylogenetic analysis and the component trees included in a study (many phylogenetic studies produce multiple trees based on different data sets or analytical methods); (2) the raw sequences or character descriptions; (3) sample voucher information; (4) a description of procedures for establishing orthology of characters (e.g., sequence alignment); (5) the sequence alignment or some other character matrix; (6) a detailed description of the phylogenetic analysis, including search strategies and parameter values (specific commands for the analysis program would be optimal); and (7) the phylogenies, including branch lengths and support values (e.g., bootstrap). The schematic shown in Figure 1 is likely to be incomplete. For example, it is not clear whether or how to report measures of node support, such as bootstrap values, and phylogenetic analyses are often performed on data matrices other than nucleotide and protein sequence alignments.
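To make the seven candidate components listed above more concrete, the following is a minimal sketch of how such a record might be represented and checked for completeness. It is an illustration only, not a proposed MIAPA format: the class name `MiapaRecord`, its field names, the helper `missing_components`, and the accession-like identifiers are all hypothetical and were chosen for this example.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class MiapaRecord:
    """Hypothetical container mirroring the seven candidate components."""
    objectives: str = ""                 # (1) objectives and component trees
    raw_sequences: List[str] = field(default_factory=list)  # (2) sequence IDs or character descriptions
    voucher_info: str = ""               # (3) sample voucher information
    orthology_procedure: str = ""        # (4) how orthology of characters was established
    alignment: str = ""                  # (5) alignment or other character matrix
    analysis_description: str = ""       # (6) search strategy, parameters, program commands
    phylogenies: List[str] = field(default_factory=list)    # (7) trees with branch lengths and support


def missing_components(record: MiapaRecord) -> List[str]:
    """Return the names of components that are empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative record with made-up identifiers and a toy alignment.
record = MiapaRecord(
    objectives="Species tree for three hypothetical taxa",
    raw_sequences=["ACC0001", "ACC0002", "ACC0003"],
    alignment=">A\nACGT\n>B\nACGA\n>C\nAGGA\n",
    phylogenies=["((A:0.1,B:0.2)90:0.05,C:0.3);"],
)

print(missing_components(record))
```

A check of this kind is one way a submission tool could flag incomplete reports before they reach a journal or database; which components are truly required remains a question for the community.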
If the reporting standard were focused on sequence data, referencing an external database for the unaligned and unmasked sequences would require that all sequence identifiers in a database such as GenBank remain stable over the long term. If the standard were to extend to phylogenetic analyses of morphological characters, character descriptions and data matrices could be deposited in MorphBank (www.morphbank.com) or MorphoBank (www.morphobank.org). Following the MIAME model (Brazma et al., 2001), the scheme in Figure 1 relies on an external database (e.g., the taxonomy database at NCBI) for information about the taxonomic placement of the studied organisms. However, it might be better to require the full taxonomy of the studied organisms to be reported in order to allow a full search of the taxonomic hierarchy (Page, 2005). We suggest that sample voucher information be included in the reporting standard so that future combined data matrices and supertrees can be properly synthesized. The phylogenetics community will have to grapple with these issues and more as we formalize the reporting standard, and we reiterate that Figure 1 is presented simply as a starting point for deeper consideration.
The nearly universal adoption of the MIAME standards and their impact on all aspects of microarray-based expression profiling was driven by necessity. The deliberate process by which they were constructed started with an international meeting of what became the Microarray Gene Expression Data Society (www.mged.org) in 1999 and culminated in an open letter first published in Nature Genetics in 2001. There has been continued refinement of the standards at annual meetings (Ball and Brazma, this issue). Much of the success of MIAME must also be attributed to the fact that MGED engaged commercial interests and database managers. The compliance of microarray databases (Parkinson et al., 2005; Barrett et al., 2005) was facilitated by the development of formal protocols for data exchange, namely the microarray gene expression object model (MAGE-OM) implemented in XML (MAGE-ML) (Spellman et al., 2002). User-friendly systems for submission of expression data and metadata that built upon these protocols (Mukherjee et al., 2005) have further promoted widespread compliance with MIAME guidelines among investigators.
Development of the MIAPA standard must also involve developers of phylogenetic analysis software (Felsenstein, 2005; Goloboff et al., 2004; Kumar et al., 2004; Ronquist and Huelsenbeck, 2003; Swofford, 2001; Roshan et al., 2004; www.phylo.org), existing public databases for organismal (Piel et al., 2003) and gene family phylogenies (Duret et al., 1994; Sjölander, 2004; Roth et al., 2005; Hartmann et al., 2006; Li et al., 2006), as well as editors of the journals in which phylogenetic analyses are published. A well-defined protocol for saving and transferring phylogenetic metadata should be considered, one that would complement existing formats such as New Hampshire (or Newick) and PhyloXML (www.phyloxml.org).
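As a rough illustration of how metadata might travel alongside a tree in an existing format, the sketch below wraps a Newick string and a small metadata dictionary in a single JSON document. This is an assumption made purely for illustration: JSON is not proposed in this paper, the function names (`looks_like_newick`, `bundle_tree_with_metadata`, `unbundle`) and the metadata keys are hypothetical, and the Newick check is a shallow sanity test rather than a parser.

```python
import json
from typing import Dict, Tuple


def looks_like_newick(text: str) -> bool:
    """Shallow sanity check: ends with ';' and parentheses are balanced."""
    stripped = text.strip()
    if not stripped.endswith(";"):
        return False
    depth = 0
    for ch in stripped:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close before its matching open
                return False
    return depth == 0


def bundle_tree_with_metadata(newick: str, metadata: Dict[str, str]) -> str:
    """Serialize a tree and its metadata together as one JSON document."""
    if not looks_like_newick(newick):
        raise ValueError("tree does not look like a Newick string")
    return json.dumps({"newick": newick, "metadata": metadata},
                      sort_keys=True, indent=2)


def unbundle(text: str) -> Tuple[str, Dict[str, str]]:
    """Recover the tree string and metadata from the JSON document."""
    doc = json.loads(text)
    return doc["newick"], doc["metadata"]


# Toy tree and hypothetical metadata keys.
tree = "((A:0.1,B:0.2)90:0.05,C:0.3);"
meta = {"method": "maximum likelihood", "alignment_program": "unspecified"}
packed = bundle_tree_with_metadata(tree, meta)
print(unbundle(packed) == (tree, meta))
```

The point of the sketch is only that the tree itself can stay in a format analysis programs already read, while the description of how it was produced is carried in a structured, machine-readable wrapper.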
Following the example of MGED, development of the MIAPA standard could be advanced through an international conference of representative stakeholders in conjunction with open discussions across the phylogenetics community. We will be soliciting involvement in an organizational conference at scientific meetings this coming summer and publishing proposals for the MIAPA standard in the journals most read by the phylogenetics community. We anticipate these efforts will culminate in an open letter to the editors of all journals publishing phylogenies in which MIAPA will be described in detail. In addition, the standard would be most viable if accompanied by software and database tools that make it easy to use and encourage widespread compliance.
These are ambitious objectives, but the time is ripe for the development and implementation of minimal reporting standards for phylogenetic analyses. Widespread recognition of the importance of phylogenetics to genome biology comes at a time when recent advances have increased the rate of sequence generation by orders of magnitude (Margulies et al., 2005). Increases in sequencing capacity and concomitant cost decreases are spurring a rapid expansion in the availability of whole genome sequences (Liolios et al., 2006) or subgenomic sequence data (Lee et al., 2005). Beyond doubt, this flood of sequence data will spur a corresponding flood of comparative analyses in which phylogenetic trees play a central role. Indeed, many of the computational and statistical methods being developed for functional genomic analysis are, to a greater or lesser degree, phylogeny-based. When reporting of phylogenetic analyses is brought more fully into the informatics age, it will have manifold beneficial effects on the utility and impact of phylogenomics.
We thank Dawn Field and Peter Sterk for organizing the “Cataloguing our Current Genome Collection” workshop at EBI, where some of these ideas were first proposed. We also thank Dawn Field, Susanna Sansone, and Eugene Kolker for their roles in putting together this special issue.