|Home | About | Journals | Submit | Contact Us | Français|
Evolutionary retention of duplicated genes encoding transcription-associated proteins (TAPs, comprising transcription factors and other transcriptional regulators) has been hypothesized to be positively correlated with increasing morphological complexity and paleopolyploidizations, especially within the plant kingdom. Here, we present the most comprehensive set of classification rules for TAPs and its application for genome-wide analyses of plants and algae. Using a dated species tree and phylogenetic comparative (PC) analyses, we define the timeline of TAP loss, gain, and expansion among Viridiplantae and find that two major bursts of gain/expansion occurred, coinciding with the water-to-land transition and the radiation of flowering plants. For the first time, we provide PC proof for the long-standing hypothesis that TAPs are major driving forces behind the evolution of morphological complexity, the latter in Plantae being shaped significantly by polyploidization and subsequent biased paleolog retention. Principal component analysis incorporating the number of TAPs per genome provides an alternate and significant proxy for complexity, ideally suited for PC genomics. Our work lays the ground for further interrogation of the shaping of gene regulatory networks underlying the evolution of organism complexity.
The regulated expression of genes is essential for defining morphology, functional capacity, and developmental fate of both solitary living cells as well as cells inhabiting the social environment of a multicellular organism. In this regard, the regulation of transcription, that is, the synthesis of messenger RNA from a genomic DNA template, plays a crucial role. It contributes to the control of temporal and spatial RNA and protein levels in a cell and therefore has an essential function in all living organisms. Transcriptional regulation is primarily achieved by transcription-associated proteins (TAPs, comprising transcription factors [TFs] and other transcriptional regulators [TRs]), which are especially attractive for the investigation of gene regulatory networks. The evolutionary development of organisms throughout all kingdoms of life seems to be tightly linked to the evolution and expansion of TAP gene families (Hsia and McGinnis 2003; Levine and Tjian 2003; Gutierrez et al. 2004; Carroll 2005). It has been proposed before that there is a direct correlation between the genomic fraction of TAPs and the morphological complexity of an organism (e.g., Levine and Tjian 2003) and that expansions of TAP families contribute to the evolution of morphological diversification (Lespinet et al. 2002; Richardt et al. 2007).
There are multiple lines of evidence suggesting that many, if not most, eukaryotic genomes underwent one to several large-scale duplication events in their evolutionary history (e.g., Paterson et al. 2006; Edger and Pires 2009; Van de Peer et al. 2009). Which genes are retained after such an event seems to be critical in terms of gene dosage balance, especially among members of signaling and regulatory networks. TAPs and DNA-binding TFs, in particular, belong to the functional classes of genes that have been found to be retained preferentially after duplication in most studies investigating large-scale and other “balanced duplications” events (Edger and Pires 2009). This finding has been discussed frequently as an indication for the importance of genome duplication events for the observable gain of morphological complexity in the animal and plant lineages (Freeling and Thomas 2006). Although the basic transcription machinery in different eukaryotes is essentially similar, the gene families regulating this machinery often show lineage-specific expansions (Lespinet et al. 2002) and are therefore ideally suited for analyzing taxonomic diversity (Coulson and Ouzounis 2003). In plants, due to the above-mentioned higher retention rate of TAPs after duplication events (Lespinet et al. 2002; Shiu et al. 2005), the contribution of TAPs to the total number of genes is more pronounced than in other eukaryotes (Riano-Pachon et al. 2008). Hence, especially in plants, the cross-species comparison of TAPs is expected to yield interesting insights into the evolution of regulatory networks.
Phylogenetic comparative methods (PCMs) allow the identification of evolutionary correlations across taxa (Quader et al. 2004). Statistical correlation of traits between different species without incorporation of their evolutionary relationships suffers from the problem of phylogenetic nonindependence as taxa may be similar simply due to shared ancestry. Thus, comparative data often violate the statistical assumption of independence. Given a species tree, PCMs can be used to correct comparative data for phylogenetic nonindependence (Felsenstein 1985; Pagel 1994; Garland and Ives 2000; Martins 2000). Although the hypothesis of a direct evolutionary correlation between TAPs and organismal complexity seems very intuitive and indirectly supported by findings from other fields of research, like, for example, the specific paleolog (i.e., paralogs retained after a paleoduplication event) retention pattern described earlier, it has not yet been put to test using comparative phylogenetic methods. Moreover, it is important for our understanding of TAP gene family evolution to elucidate whether all or only certain TAP families are correlated with complexity.
The availability of several annotated genomes covering most of the major clades along the red/green lineage (the Archaeplastida or Plantae; which have acquired their plastid by primary endosymbiosis; [Cavalier-Smith 1998; Adl et al. 2005]) now allows us to evaluate which traits have been important for the observed gain of morphological complexity. The findings reviewed above suggest that the TAP complement is an important trait to test in this context. Necessary prerequisites for these analyses are a detailed classification of the TAP complements of the genomes under investigation, a species phylogeny with branch lengths, divergence time estimates, and traits to describe organismal complexity.
TAPs can be divided into 1) TFs binding to cis-regulatory DNA elements in a sequence-specific manner, directly enhancing or repressing the transcription of their target genes; 2) TRs with indirect regulatory functions, assisting in the assembly of the RNA polymerase II complex (general TFs), functioning as scaffold proteins in enhancer/repressor complexes or controlling the chromatin structure by, for example, modifying histones or the DNA methylation state, respectively; and, finally, 3) putative TAPs (PTs) have so far not been functionally investigated but in silico prediction suggests a role in transcriptional regulation. For all groups, numerous members have been identified and described in different organisms, for example, there are extensive studies dealing with TAPs in the flowering plants Arabidopsis thaliana (Riechmann et al. 2000; Guo et al. 2005), Oryza sativa (Gao et al. 2006), Populus trichocarpa (Zhu et al. 2007), and Nicotiana tabacum (Rushton et al. 2008). Those studies revealed functional networks of transcriptional regulation in a particular system. However, for comparative evolutionary analyses as well as for gaining insight into the development of a wide variety of TAP families, a broader approach, covering a larger set of divergent species, is necessary. In order to identify TAPs in genomes, we classified proteins into families involved in transcriptional regulation. To this end, we applied an approach that exploits the domain structure of the protein for its characterization. Because protein domains fulfill a crucial role, mutations within domains are often deleterious. Therefore, such regions are strongly conserved, leading to highly similar sequences originating from a common evolutionary ancestor. Previous studies already applied a domain-based approach for defining protein families (Riano-Pachon et al. 2007; Guo et al. 2008). Domains relevant for TAP family classification were retrieved from those publications and used to establish rules defining domains mandatory or forbidden in proteins of a certain family. Further relevant domains and rules were extracted from PlanTAPDB (Richardt et al. 2007) as well as from literature (Fernandez-Silva et al. 1997; Shuai et al. 2002; Hackbusch et al. 2005; Whitcomb et al. 2007; Yamada et al. 2008). By combining the existing rules and resolving potential conflicts between rules from different sources, we were able to enlarge the previously used sets of TAP domains and families. Here, we report the so far most extensive comparative phylogenetic analysis of TAP gene families in land plants and algae, revealing insights into the evolution of transcriptional regulation from unicellular to highly complex photosynthetic organisms.
Organismal or morphological complexity is often measured by the number of cell types or tissue types (Bell and Mooers 1997; Adami 2002; Hedges et al. 2004). However, the publication record of exact cell and tissue type estimates is scarce (Bell and Mooers 1997). Extended literature searches revealed no peer reviewed, experimentally determined absolute cell type or tissue type numbers for any of the sequenced green model species. Online resources like Plant Ontology (http://www.plantontology.org/) or BioNumbers (http://bionumbers.hms.harvard.edu) are beginning to conquer this dilemma but cannot yet guarantee completeness and accuracy. Thus, for more detailed taxon-rich analyses, an alternative proxy for organismal complexity is needed.
Using the phylogenetic framework of a 20 species phylogeny (based on a concatenated alignment of 14 nuclear-encoded markers) and subsequent molecular divergence time estimates, we here present the first comparative phylogenetic approach to better understand the evolutionary relationship and dependence of transcriptional regulation and morphological complexity in Viridiplantae. Employing a combination of principal component analysis (PCA) and PCMs, we derived a novel proxy for organismal complexity that allows us to assess more detailed evolutionary questions on broader taxonomic scale like, for example, which particular TAP families did expand in correlation with the general increase in morphological complexity.
The rules for classifying the investigated proteins into TAP families define mandatory (“should”) as well as forbidden (“should not”) conserved protein domains in those families. The initial set of rules was adopted from three previous publications/databases, that is, PlantTFDB (Guo et al. 2008), PlnTFDB (Riano-Pachon et al. 2007), and PlanTAPDB (Richardt et al. 2007). In the latter case, domains present in more than 95% of the members of a given family were defined as a mandatory domain. Potential conflicts between the three sources were manually evaluated and subsequently solved, for example, via literature research, as in the case of PlnTFDB C2C2-GATA (tify; could) versus PlantTFDB (tify; should not). As the tify domain has been described to appear in GATA members (Reyes et al. 2004), the PlnTFDB rule was preferred. The sources for all rules are listed in supplementary table 1 (Supplementary Material online). The resulting combined set of preliminary classification rules was used in a test run with the A. thaliana TAIR 7 protein set. Comparison with the results of the above mentioned publications as well as reduction of the number of double-classified proteins by adding exclusion rules (see below) were steps to refine the rules and to create a unified classification set. This set was subsequently expanded based on literature reports of recently defined families or subfamilies. Eleven additional rules describing nine TAP families were derived from the literature (Andrianopoulos and Timberlake 1991; Burglin 1991; Kagoshima et al. 1993; Muller et al. 1995; Fernandez-Silva et al. 1997; Hackbusch et al. 2005; Da et al. 2006; Duncan et al. 2007; Whitcomb et al. 2007; Yamada et al. 2008). Furthermore, the examination of selected proteins from the Uniprot and PFAM databases led to the definition of five new families implemented via 12 rules. The sources of all rules are shown in detail in supplementary table 1 (Supplementary Material online). In cases where either 1 out of 2 domains was necessary and sufficient for assignment to the respective family (bZIP, HD-Zip, and GARP_ARR-B), an “OR” rule was applied (five rules in total). To render the rule set more robust, we furthermore implemented rules reducing the number of proteins classified into any two independent families. Based on the test run against A. thaliana and subsequent literature surveys, 30 rules were thus added. The rule set for each family consists of at least one entry defining a should rule, that is, a domain mandatory for that particular family. Additional entries may define further should or should not (forbidden) domains (fig. 2).
All domains relevant for classifying the TAPs are represented by a “ls” (“glocal”) hidden Markov model (HMM), that is, global with respect to the profile and local with respect to sequence (as compared with “fs” HMMs that are local with respect to both). Thus, the ls HMMs identify whole domains only. Because we used only fully sequenced genomes, identifying fragments of the relevant domains was neither necessary nor suitable for our approach. Accordingly, the use of ls HMMs reduced the number of false positive hits obtained during the search by limiting the identification of truncated domains (Perez-Rodriguez et al. 2010). If available, the HMMs were retrieved directly from the “PFAM_ls” database (Finn et al. 2008). For the remaining domains, HMMs were custom made using multiple sequence alignments (MSAs) to identify the conserved domains of interest. The MSAs used for creating the custom-HMMs were downloaded from PlnTFDB (Riano-Pachon et al. 2007). For domains not represented in this database, MSAs were created as follows. Blast searches with a protein query containing the respective domain yielded homologous hits defined by having at least 30% sequence identity with the query over a minimum length of 80 amino acids (Rost 1999). Hits were aligned using MAFFT (Katoh et al. 2005) and manually curated using Jalview (Clamp et al. 2004). The conserved domain of interest was extracted and the HMM calculated with HMMER 2.3.2 (http://hmmer.janelia.org/) using “hmmbuild” with the default parameters to generate ls HMMs and subsequently “hmmcalibrate” with the option “--seed 0” for gaining reproducible results. Gathering (GA) cutoff values were defined for each custom-HMM. The GA was set as the lowest score of a domain-containing protein (true positive) after a “hmmpfam” search (using an E value cut off of 1 × 10−5) against the full proteome sets of several different species and considering the alignments of all hits. In two cases (NOZZLE and NAC, substituting NAM), custom-HMMs were used despite their availability in the PFAM databases. Both custom-HMMs are able to detect more members of the corresponding families than the publicly available ones. All custom-HMMs are given in supplementary file 1 (Supplementary Material online).
In order to avoid sampling bias, only fully sequenced genomes were used in this study. For each organism, the complete set of proteins derived from conceptual translation of the nuclear gene models (using the filtered/selected model per locus) were combined with the proteins encoded by the respective mitochondrial and plastidal genomes, if available. All proteins can be unambiguously identified via their fasta id. We used a unique five-letter code for each organism (table 1) followed by “mt” (mitochondrial), “pt” (plastid), or “pl” (plasmid), if applicable, and the accession number of the gene model. In the case of splice variants (A. thaliana, O. sativa, Medicago truncatula, Glycine max, Zea mays, and Carica papaya), the model with the lowest index number per locus was chosen, as it usually represents the first model determined for that locus and therefore has the highest level of accuracy. The organisms, numbers of encoded proteins, genome versions/download times, institutions, and download links are mentioned in table 1.
Using all proteins of the investigated organisms as query, hmmpfam searches (from the HMMER v2.3.2 package, http://hmmer.janelia.org/) were performed against an HMM library containing all 124 domains necessary for the TAP classification (supplementary table 2, Supplementary Material online). The tool hmmpfam considers the HMM collection as the database to search against. Because we were focusing on TAPs, a restriction of the HMMs to the ones of interest reduced the number of observations to a reasonable size. Furthermore, the number of HMMs employed remained constant, in contrast to the number of proteins encoded by the genomes, therefore yielding comparable E values. In order to obtain a high level of specificity, hmmpfam searches were performed using GA (cutoff) values as the score cut off for domain hits. The PFAM GA is manually curated throughout the HMM building process. Besides the GA, two additional score cut offs are defined during the creation of a PFAM HMM. The noise cut-off (NC) represents a very relaxed criterion, whereas the trusted cut-off (TC) is very stringent. Those cut offs would lead to false positive assignments (NC) or the loss of identified true positive domains (TC), respectively. Therefore, TC and NC are regarded not suitable for genome-wide domain assignments (HMMer user guide). Because an arbitrary E value cut off would not yield trustworthy results as well, the GA is considered the best choice for our approach. GA values were either provided with the “PFAM” HMMs or defined as described above. The classification rules (fig. 2) were subsequently applied to all proteins for which at least one significant domain hit was found. In cases where the domain composition of a protein matched more than one classification rule, the should rule with the highest score determined the family into which the protein was categorized. Highly similar domains, which are often found in the same or overlapping regions of a protein, were treated in similar fashion, that is, the domain with the higher score was used for subsequent classification. This procedure was necessary in four cases, namely 1) Myb_DNA-binding and G2-like_Domain, 2) NF-YB, NF-YC, and CCAAT-Dr1_Domain, 3) PHD and Alfin-like, and 4) GATA and zf-Dof (fig. 2). In addition, a Boolean OR rule was applied to three families (bZIP, HD-Zip, and GARP_ARR-B) (fig. 2). In these cases, either 1 out of 2 domains was found to be necessary and sufficient for a protein to be classified into the corresponding family.
Significant expansion of individual families between different groups of organisms was analyzed using standard T-test with subsequent false discovery rate correction (Benjamini and Hochberg 1995). T-tests and PCA for figure 4 were performed using Expressionist Analyst v5.3.5 (GeneData).
The predicted nuclear proteomes of the 20 Plantae species were clustered using BlastClust (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) requiring 50% sequence identity and 70% coverage. Resulting clusters were filtered, selecting for clusters with only one gene in the genomes without evidence of recent polyploidization events (Arath, Carpa, Chlre, Volca, Ostta, Ostlu, Micp1, Micp2, Cyame, Poptr, Ricco, Vitvi, Phypa, Selmo, Sorbi, Orysa; see table 1 for full species names and supplementary table 5 (Supplementary Material online) for a detailed taxonomic profile of all 14 gene families). The protein sequences resulting in 13 clusters (supplementary table 5, Supplementary Material online) were aligned using M.A.F.F.T. L-INSI (Katoh et al. 2009) and manually curated using Jalview (Waterhouse et al. 2009). Based on individual genes, trees were inferred by Neighbor-Joining as implemented in quicktree-SD (Howe et al. 2002; Frickenhaus and Beszteri 2008) using the ScoreDist distance matrix (Sonnhammer and Hollich 2005) with 1,000 bootstrap replicates and rooted at the longest internal branch. Possible in-paralogs within the polyploid species were reduced to one representative sequence based on evolutionary distance. For the small subunit (SSU) alignment, available RNA alignments were downloaded from the SILVA rRNA database (Pruesse et al. 2007). The alignments (available from TreeBase at http://purl.org/phylo/treebase/phylows/study/TB2:S10409) were combined into a partitioned data set which was used to infer the final species topology and branch lengths using MrBayes (Ronquist and Huelsenbeck 2003) with a mixed model (all: ratepr = variable; SSU:GTR rates = invgamma ngamma = 8 statefreqpr = dirichlet(1,1,1,1); proteins: rates = invgamma ngamma = 8; aamodelpr = mixed). The topology was constrained to reflect relationships based on current taxonomic literature constraining Brassicales (Arath, Carpa), Malpighiales (Poptr, Ricco), Fabidae (Poptr, Ricco, Medtr, Glyma), and tracheophytes (Selmo, Orysa, Zeama, Sorbi, Vitvi, Arath, Carpa, Poptr, Ricco, Medtr, Glyma).
The resulting species tree (available from TreeBase at http://purl.org/phylo/treebase/phylows/study/TB2:S10409) was used to estimate divergence times with the r8s software (Sanderson 2003), employing the procedures and recommendations as previously described ([Sanderson et al. 2004; Bell and Donoghue 2005; Hug and Roger 2007; Magallon and Castillo 2009; Wang et al. 2009], r8s manual). Age constraints were derived from previous analyses or reviews of fossil records (Bowe et al. 2000; Zimmer et al. 2007; Lang et al. 2008; Wang et al. 2009): red/green (1,000–1,400 Ma), chlorophytes/streptophytes (500–1,200 Ma), bryophytes/tracheophytes (400–700 Ma), lycophytes/spermatophytes (423–475 Ma), rosids (minimum age 89.5 Ma), and Liliopsida/eudicotyledons (125–300 Ma). Divergence times and 95% confidence intervals (CIs) as presented in figure 5 and supplementary figure S1 (Supplementary Material online) are based on the application of several methods implemented in r8s (LF, PL Powell, PL NPRS) including variants with fixed root ages (1,200–1,500 Ma) and fossil-based cross-validation for model selection using penalized likelihood. The output of all methods was combined to calculate mean divergence times and 95% CIs, which are shown in supplementary figure S1 (Supplementary Material online). The individual divergence time estimates of all applied strategies and summary statistics including CIs are listed in supplementary table 6 (Supplementary Material online).
The character matrix used for the PCM analysis is provided in supplementary table 7 (Supplementary Material online) and has been submitted to BioNumbers (bion 105322). miRNA (family) annotations are based on miRBase release 13.0 (http://www.mirbase.org). Drawing of phylo- and chronograms, ancestral state reconstruction, PCM, and PCA were carried out in R (http://www.r-project.org) using the R packages APE (Paradis et al. 2004) and GEIGER (Harmon et al. 2008). Pairwise PCM comparisons of single traits were carried out by Pearson correlation of phylogenetically independent contrasts (PICs) (Felsenstein 1985) and the application of the Brownian, Martens, Grafen, and PIC generalized least square (GLS) models as implemented by APE and GEIGER using linear, linear/log, and log/log transformed data. The best model was selected for each comparison using the Akaike Information Criterion as implemented in APE. In cases of missing data, the species tree was truncated using the drop.tip function of APE. Scaled PICs were used to derive principal components in the search for a proxy of organismal complexity. Ancestral states were reconstructed for all traits of the character matrix (supplementary table 7, Supplementary Material online) comparing the GLS, PIC, and maximum likelihood methods implemented in APE. The reconstructed ancestral states of all methods were plotted together with the extant states at the nodes of the species tree (e.g., supplementary files 2 and 3, Supplementary Material online) and analyzed manually to derive the final set of gains and losses presented in figure 5.
For classifying proteins into TAP families, we applied a set of rules defining whether a certain domain is mandatory (should) or forbidden (should not) in a given family. The majority of these rules were extracted from three publications/databases dealing with TAPs in plants (Riano-Pachon et al. 2007; Richardt et al. 2007; Guo et al. 2008). Rules describing optional domains were excluded for brevity. In total, 83 rules describing 63 TAP families could be extracted from PlantTFDB (Guo et al. 2008), whereas 103 rules defining 68 families were taken from PlnTFDB (Riano-Pachon et al. 2007). PlantTAPDB (Richardt et al. 2007) was employed as the third major source of rules. TAP family members from this database were examined for the occurrence of domains present in more than 95% of the respective family members. Using this approach, 51 rules representing 48 families were obtained. The contribution of each source to the complete rule set (fig. 1) emphasizes the presence of common as well as unique rules derived from the different sources.
As a general meta-rule, we assigned higher importance to sequence-specific DNA-binding domains present in TFs than to domains classifying TRs. Thus, whenever a combination of domains leads to multiple possible family classifications, the TF family is favored over TR and PT, based on the evolutionary perspective that the majority of cis-specific TF domains have been acquired as differentiations to introduce DNA specificity to more generally acting TRs with protein-interaction domains. This was encountered in 14 cases, resulting in 14 rules. For the homeobox TFs, several subfamilies (HB, HB-KNOX, HD-Zip) were defined by another nine rules. Taken together, the complete rule set (fig. 2) defines 111 TAP families by using 223 rules comprising 134 “mandatory” and 89 “forbidden” rules. In total, 124 domain HMMs are being used, 16 of which are custom made (supplementary file 1, Supplementary Material online), and 108 obtained from the PFAM database (supplementary table 2, Supplementary Material online).
In order to assess the quality of our identification and classification approach (detailed in Materials and Methods), we compared our results with selected publications in which detailed phylogenetic analyses had been carried out for TAP families of the plant species A. thaliana and O. sativa and of the green alga Chlamydomonas reinhardtii. We refer to the reported TAP classification as the “gold standard” in the following. In addition, we included results from a recent report describing TAPs in the two Micromonas strains so far fully sequenced (Worden et al. 2009). In cases where the genome annotation version was the same as the one employed in the current study, we compared the coincidences of protein IDs between the respective data sets (ours and the gold standard). In the case of deviating versions, we compared the actual protein sequences. Following these guidelines, we were able to compute the sensitivity and the positive predictive value (PPV) of our approach, as previously described (Iida et al. 2005; Riano-Pachon et al. 2007). The comparison of the classification results presented in this study with those chosen as gold standard is shown in supplementary table 3 (Supplementary Material online). For the two seed plants, A. thaliana and O. sativa, the average sensitivity is 0.94 and 0.86, respectively. For the green alga C. reinhardtii, it is 0.93, whereas for the two Micromonas strains, it is only 0.65 and 0.56, respectively. This might be due to differences in the methodology employed and the relatively small average size of the Micromonas TAP families, resulting in a comparatively large amplitude of sensitivity and PPV even if only a few classifications differ.
The summary of TAP classification results for (formerly) plastid-bearing organisms (supplementary table 4, Supplementary Material online) enables us to identify trends during the evolution of photosynthetic eukaryotes. The absolute amount of TAPs per genome shows an extensive expansion of TAPs between algae and land plants and between the nonseed and seed plants (fig. 3A), which is congruent with previous results based on fewer organisms (Richardt et al. 2007). The haptophyte Emiliania huxleyi appears to be an exception from this trend. However, when displaying the data relative to the coding potential of the genome (fig. 3B), thus smoothing effects that might be due to large-scale gene duplication events, the E. huxleyi TAP complement seems to follow the mentioned trend as well. Furthermore, it can be deduced that TFs in particular were subject to expansion during plant evolution, which has been suggested earlier (Richardt et al. 2007). PCA of the phylogenetically uncorrected TAP family sizes results in separation according to taxonomic groups, as shown, for example, for seed and nonseed plants, or algae derived from primary and secondary endosymbiosis, respectively (fig. 4). Based on a smaller data set, a trend correlating multicellularity and global TAP amount had been suggested before (Richardt et al. 2007). Using the uncorrected, phylogenetically dependent comparative data, there is no clear-cut trend of this kind in any of the present visualizations.
To ensure a reliable framework for phylogenetic comparative (PC) analyses, we derived a novel marker set of 13 nuclear single-copy, protein-coding orthologs, which was employed together with the small ribosomal subunit DNA (SSU; supplementary table 5, Supplementary Material online) to infer a phylogenetic tree of the Plantae (red/green lineage; [Cavalier-Smith 1998; Adl et al. 2005]) comprising the 20 sequenced species analyzed in this study. Subsequently, the nodes on the tree were dated by relaxed molecular clock approaches (Sanderson 2003) using fossil constraints and protocols from the literature estimating CIs by averaging over time points obtained from multiple methods (supplementary table 6, Supplementary Material online; [Sanderson et al. 2004; Bell and Donoghue 2005; Hug and Roger 2007; Magallon and Castillo 2009; Wang et al. 2009]). The divergence time estimates of the resulting chronogram (supplementary fig. 1, Supplementary Material online) are in good agreement with previous reports (Kenrick and Crane 1997; Bowe et al. 2000; Hedges et al. 2004; Yoon et al. 2004; Zimmer et al. 2007; Lang et al. 2008; Wang et al. 2009). The tree was used to trace and visualize extant and ancestral character states of TAP gene family evolution in Plantae. The total number of TFs, TRs, and PTs encoded by the respective extant and ancestral genomes are visualized in figure 5. Manual inspection of the individual ancestral state reconstructions for all individual TAP families resulted in gain, loss, and expansion estimates for the individual nodes, which were integrated into a global view providing a detailed timeline of TAP gene family evolution in the green lineage (fig. 5). The data from the sole red alga Cyanidioschyzon merolae was used as an outgroup to elucidate TAP evolution in the green lineage (Viridiplantae). A total of 21 TAP families (of which 16 are TF) arose within the earliest land plants (500 Ma, megaannum) or in their aquatic ancestor. Three further TAP families arose in the last common ancestor (LCA) of vascular plants 470 Ma (TFs: BBR/BPC and DBP; PT: DUF246), and three more TFs (C2C2_YABBY, GeBP, and ULT) in the LCA of extant angiosperms (or seed plants) 210 Ma. Two TF families specifically arose within eudicotyledons (NZZ and SAP; 143 Ma), and one is an invention of the lineage leading to the extant Volvocales (VARL; 47 Ma). As has been noted before (Riano-Pachon et al. 2008), the TF families, CAMTA and Trihelix, were secondarily lost from the genomes of all algae around 600 Ma. Red algae and prasinophytes share the loss of Alfin-like, Argonaute, and MBF1 genes. Green algae and prasinophytes have apparently lost the TR families DDT, SWI/SNF_SWI3, and the green algae have lost the TF family LIM.
In terms of expansions, a total of 44 TAP families are larger in size in land plants than in algae (fig. 5). The size of three TF families (EIL, GRF, and SRS) increased with the onset of vascularity; 23 TAP families (of which 18 are TF) are expanded in angiosperms (or possibly seed plants). Within the green algae, the TF families, RWP-RK and SBP, were expanded and the TR family TRAF in the Volvocales. Among the angiosperms, the TR family Sin3 was found to be expanded among the rosids, the TR families TRAF and OFP in Poales, the TF family Alfin-like in the Panicoideae, and the MADS TF family in the Brassicales and Fabales.
Both, the emergence and expansion of TAP families during land plant evolution, suggest a clear trend of increasing transcriptional complexity along with morphological complexity. An obvious and commonly used proxy for organismal complexity is the number of cell types (Bell and Mooers 1997; Carroll 2001; Hedges et al. 2004; McCarthy and Enquist 2005; Vogel and Chothia 2006; Xia et al. 2008). We were able to gather cell type count estimates for 12 of the organisms under study from literature, public databases, and through personal communications (supplementary table 7, Supplementary Material online). The application of this reduced taxon set to answer the question of evolutionary correlation between TAPs and morphological complexity (i.e., number of cell types), using both PC correlation analysis with PICs and regression analysis with the best of several phylogenetic GLSs models, confirms the initial hypothesis. The evolutionary pattern of the overall number of TAPs as well as the numbers of TFs and PTs show significant positive correlation with the number of cell types (TAPs: R = 0.95, P value best GLS model and correlation <<0.01; TF: R = 0.96, both P values << 0.01; PT: R = 0.94, P << 0.01). TRs on the other hand show only weak correlation with the number of cell types (R = 0.68, P < 0.05).
Like multicellularity, increases in mean and maximal organismal size have occurred multiple times in the evolution of uni- and multicellular life forms (Carroll 2001), which has often been discussed as a general evolutionary trend toward an increase of body size. Although this trend could only be proven within some lineages (Carroll 2001), body size has been shown to be positively correlated with the number of cell types in metazoans (McCarthy and Enquist 2005) and therefore might provide an indirect proxy for complexity. Initial correlation analysis confirms this trend for the reduced Plantae set with available cell type estimates (R = 0.84, P < 0.001). If the reported maximum size is used as a proxy for complexity it mirrors the above demonstrated correlation of TAPs and number of cell types (R = 0.83, P < 0.05). However, if the analysis is extended to the full set of 20 Plantae taxa, this relationship is not significant anymore (e.g., TAPs R = 0.31, P = 0.2).
It would certainly be premature to falsify the long-standing hypothesis based on this data only. Instead, we needed an alternative proxy that allowed us to cover the entire taxonomic range of Plantae. To achieve this, we combined PC analysis with PCA, which allows combining several, possibly correlated traits into a smaller number of uncorrelated variables. To define traits suitable for establishing an improved proxy for organismal complexity, we first carried out PCA on the reduced data set using PICs of the following traits: cell types, numbers of TAPs, TFs, TRs, or PTs, genome size, number of genome duplications, and reported maximum body size (and length of the sporophyte). The number of cell types as well as the numbers of TFs, PTs, and TAPs contributed most to the variance represented by the first component (>95%). In the next step, the (difficult to determine) cell type trait was excluded from the PCA, and the first component was employed to test for correlation with the number of cell types. In fact the number of cell types shows strong correlation with the PCA proxy (R = 0.95, P << 0.01).
For the general applicability of this proxy, a reduction of incorporated variables would be desirable. Therefore, we repeated the PCA only with the three traits that contributed most to the first principal component (TAPs, TFs, and PTs). In this case, the positive correlation with the number of cell types is significant as well (R = 0.95, P << 0.01). Thus, the first component (combining the PICs of the total number of TAPs, TFs, and PTs) provides an excellent proxy for organismal complexity, ideally suited for PC genomics approaches on a larger taxonomic scale. If we apply this proxy to look for patterns of correlated evolution of TAP gene families and complexity on the complete Plantae data set, we find that all but 10 TF and 8 TR families show significantly correlated evolution with complexity (q ≤ 0.05; supplementary table 8, Supplementary Material online). Among the TF families that seem to deviate from the general trend are the ABI3/VP3, CCAAT Dr1, and VARL families. Examples for TRs without obvious correlation are Dicer, Tfb2, and Sin3.
By employing the first principal component as a proxy for complexity, we find a significant correlation (R = 0.78, P = 0.0046) for the relationship between whole-genome duplication (WGD) events and organismal complexity among Plantae. Along the individual plant lineages, the expansion patterns of 27 TAP families (22 TFs, 3 TRs, and 2 PTs; supplementary table 9, Supplementary Material online) display significant correlation with the number of paleoploidy events (q < 0.01). These finding clearly backs previous predictions about the importance of WGDs for organismal complexity and diversity (Crow and Wagner 2006; Freeling and Thomas 2006; Freeling 2008; Edger and Pires 2009; Soltis PS and Soltis DE 2009; Van de Peer et al. 2009). However, the number of WGDs alone was below the inclusion cut off for the PCA to derive a complexity proxy, and the relationship in the correlation analysis was found to be relatively weak (as compared with TAPs).
Previous comparative studies revealed important information about TAP evolution. They were, however, facing different problems. Due to the lack of sufficient numbers of fully sequenced genomes, such analyses could only be performed with a small number of organisms (Riano-Pachon et al. 2007; Richardt et al. 2007). Additionally, in some cases, it was necessary to retrieve sequences from expressed sequence tag databases (Richardt et al. 2007; Guo et al. 2008), which may lead to biased TAP classifications because of lacking sequence data for such organisms. Recently, however, many genomes of plants and algae became available providing a solid basis for the study of gene family evolution. The combined and updated TAP classification rule set presented here (fig. 2) is the most comprehensive described for plants so far. Our rigorous classification procedure, employing a set of PFAM domain-specific models and manually curated GA cut offs performs with high specificity and sensitivity, as shown by comparison with data reported for well-annotated Plantae genomes. The rule set presented here is expected to yield accurate results with regard to species of the red/green lineage. As there are no well-analyzed genomes of nongreen algae with regard to their TAP complement, no detailed comparison is possible yet. The uncorrected, phylogenetically dependent comparative studies show that there is an extensive expansion of TAPs, mainly TFs, during evolution from algae to land plants and from nonseed to seed plants (fig. 3A). PCA of the data is able to correctly separate taxonomic groups (fig. 4). However, these analyses fail to clearly correlate multicellularity with TAP complexity. The same conclusion was reached in a more detailed comparison of the TAPs encoded by the genome of the multicellular brown alga Ectocarpus siliculosus with those of unicellular heterokonts (Cock et al. 2010) and of the multicellular green alga Volvox carteri as compared with its unicellular sister taxa (Prochnik et al. 2010).
The set of nuclear orthologs developed and employed here allows the robust dating of divergence times among Plantae. This data set can be expanded as further genomes become available and thus represents a valuable basis for acquiring divergence time estimates. The timeline of TAP evolution in the Viridiplantae (gain, loss, and expansion analyses, fig. 5) demonstrates that the major bursts of TF expansion occurred in the LCA of angiosperms 210 Ma. Although the lack of data for, for example, ferns and gymnosperms might convolute the picture, the interlinked increase of TF and flower complexity might well help to explain Darwin’s “abominable mystery” (Busch and Zachgo 2009).
Unfortunately, the exact number of cell types is not yet available for most of the organisms under study here. For most of the vascular plants (except O. sativa, Z. mays, and A. thaliana), we failed to get any estimates at all. Therefore, the resulting data set might provide a biased picture of complexity. Yet, by applying PC genomics, we can demonstrate for the first time that the total complement of TAPs is positively correlated with morphological complexity as measured by the number of cell types. This is in contrast to the analysis of the phylogenetically dependent, uncorrected data mentioned above (fig. 4) and demonstrates that genomic-scale PCM are necessary in order to detect otherwise convoluted evolutionary signals.
While TFs mirror this correlation, TRs do not. Therefore, paleolog retention of TFs (that bind in sequence-specific fashion to cis-regulatory elements) occurs more often than within TRs (that interact with DNA and proteins, including TFs, in order to regulate transcription). The fact that the size of the PT families described here is also positively correlated with the number of cell types suggests that they actually include TFs. As mentioned above, the number of cell types as a proxy for organism complexity is difficult to track down, and the maximum and average body size apparently represent a less than optimal proxy. Here, we can show that a principal component comprising the total number of TAPs, TFs, and PTs (as defined in this study) can be used as a proxy for organismal (morphological) complexity, as they are significantly positively correlated with the number of cell types in those organisms where data are available. Therefore, genome-wide determination of the TAP complement can serve as an indicator of morphological complexity, allowing to trace this trait in case it is hidden or, for example, in metagenomic studies where the identity of the contributing species is not always known.
Although there is an observable gain of morphological complexity in the evolution of Plantae and Metazoa, there have been discussions whether this actually represents a global, unidirectional, upward trend. Our ancestral state reconstructions indicate cases of both, increases and decreases. For example, the ancestral states of the maximal reported organism size and genome size (supplementary files 2 and 3, Supplementary Material online) suggest secondary reductions along the lineages leading to the different unicellular algae under study here.
In terms of TAP gain and expansion patterns, no clear marker for multicellularity emerges from our analyses. However, the pattern of initial expansion concomitant with the development of multicellularity might be obscured by subsequent expansions within the multicellular lineages as they developed more tissues and cell types. Yet, some families, upon close scrutiny, might offer hints for the development of land plant multicellularity, such as TAP families that are not encoded by the genomes of the green algae and prasinophytes analyzed in this study (fig. 5). The TR family DDT is not well characterized, it contains the DDT (DNA-binding homeobox and different TFs) domain that has been proposed to bind to DNA (Doerks et al. 2001). DDT is encoded in single copy by the genomes of some unicellular organisms (e.g., C. merolae and several heterokonts). The trihelix TFs appear to be involved in a plethora of specialized functions in seed plants, for example, abiotic stress tolerance (Xie et al. 2009), ploidy-dependent cell growth (Breuer et al. 2009), repression of seed maturation (Gao et al. 2009), and perianth architecture (Brewer et al. 2004). Next to the lack of these gene families in the green algal genomes studied here, the only genomes that encode more than one DDT and trihelix gene, respectively, are those of land plants and multicellular animals, suggesting a possible involvement in transcriptional regulation of cell-to-cell interactions within these groups of multicellular organisms.
The comparatively weak correlation of WGD with organism complexity might be due to the current genome sampling bias that excludes major lineages like ferns, gymnosperms, and charophytes. However, it might also suggest that not only large-scale events but also small-scale or balanced segmental duplications are apparently important driving forces in the evolution of transcriptional regulation and complexity in Plantae. This ambivalence has been reported earlier (Crow and Wagner 2006) and might be a reflection of what is observed in animal evolution, where the inclusion of fossil taxa does not provide support for hypotheses linking genome duplications to the evolution of complexity in vertebrates (Donoghue and Purnell 2005). In Plantae, WGDs have been implicated before to be positively correlated with the rise of morphological complexity and the adaptive radiation of angiosperms (reviewed, e.g., by Soltis PS and Soltis DE ; Van de Peer et al. ). Moreover, they have been suggested to be correlated with geological upheaval periods such at the Cretaceous–Tertiary boundary (Fawcett et al. 2009), although this hypothesis might be disputable (Soltis and Burleigh 2009). Both might be realized through the potential for subfunctionalization and neofunctionalization that a WGD event allows for. Here, we show a significant positive correlation between the size of the TAP complement and the number of paleopolyploidizations that supports this hypothesis. The application of the TAP PCA proxy also reveals a weak correlation between genome size and complexity (R = 0.69, P < 0.1), which might be related to the trend observable for WGDs. The mitochondrial (R = −0.33, P = 0.29) and plastid (R = 0.08, P = 0.77) genome sizes do not follow this trend, neither do the reported maximum body sizes (R = 0.28, P = 0.24) nor the more detailed maximum sizes of the sporophyte (R = 0.28, P = 0.24), respectively, gametophyte (R = −0.08, P = 0.83).
The phylogenetic framework described here opens the door to address additional important evolutionary questions. For example, there is a growing body of evidence that miRNAs (which often target TAPs) are also a viable causal factor for the increase in morphological complexity (Li and Mao 2007; Lee et al. 2007; Heimberg et al. 2008). The evolutionary pattern of miRNA families was shown to coincide with the advent of morphological complexity in vertebrates (Heimberg et al. 2008). We find initial phylogenetic evidence for this pattern to be true for the evolution of Plantae as well. The number of miRNAs (supplementary table 6, Supplementary Material online) correlates with organismal complexity (R = 0.93, P < 0.05) and with the complement of TAPs (R = 0.93, P < 0.05). The data provide initial evidence only because miRNA annotations for the genomes under investigation vary in their completeness and do not provide the same level of coverage as we now have for the TAP gene families.
Pairwise comparisons of the evolutionary pattern of individual miRNA and TAP families and other traits like, for example, sporophyte size can provide interesting hypotheses for further experiments. Examples for this are the miRNA family MIR390 and the TF families ABI3/VP3 and BBR/BPC, which show significant correlated evolution (R > 0.6, P < 0.01) with the size of the sporophytes. Furthermore, patterns of correlated evolution between gene families can be indicative of functional relationships (e.g., members of a protein complex) or regulatory roles (e.g., repressor or activator functions). As an example for such a correlation, we find a significant correlation between expansion patterns of the gene families coding for Aux/IAA TRs and ARF TFs (R = 0.97, best model and correlation P << 0.01), which are known to dimerize to regulate the transcription of auxin-responsive genes in plants (Paponov et al. 2009).
Using a comprehensive set of classification rules and genomes, we show for the first time that the observable increase in morphological complexity in Plantae is positively correlated with the expansion of their TAP complement and especially TFs. Large-scale or WGD events are confirmed as major driving forces behind transcriptional and morphological complexity. The evolutionary pattern of miRNAs, which also act as important TRs and often regulate TFs, reveals correlated evolution with TAPs and morphological complexity. It will be exciting to test whether this pattern also holds true for the evolution of cis-regulatory elements and other proteins involved in signalling cascades. Together with the wealth of available and upcoming plant genome sequences, the TAP classification scheme and the phylogenetic framework developed in this study provide a powerful resource to address a plethora of evolutionary questions on a genome-wide scale. Yet, the currently available taxon sampling in terms of completely sequenced genomes is biased toward angiosperms, green algae, prasinophytes, and some groups within the heterokonts. In order to unravel how the lineage leading to extant land plants managed (in terms of transcriptional regulation) to become multicellular, we are in need of sequences from other (multicellular) algal genomes that are more closely related to extant land plants, that is, of Charophyta. In addition, other huge gaps have to be closed, namely within the red algae, liverworts, hornworts, ferns, and gymnosperms.
We are indebted to the consortia that have made the current wealth of genome drafts available. Funding by German Research Foundation (RE 837/10-2 to R.R. and S.A.R.) and German Federal Ministry of Education and Research (Freiburg Initiative for Systems Biology, FKZ 0313921 to R.R. and S.A.R.; GoFORSYS, FKZ 0313924 to B.M.R.) is gratefully acknowledged. D.M.R.P. acknowledges financial support by the German Federal Ministry of Education and Research (GABI-FUTURE, FKZ 0315046). We are indebted to Jo Ann Banks, Virginia Walbot, and Paul Rushton for their help on the cell type number estimates.