|Home | About | Journals | Submit | Contact Us | Français|
Whole genome duplication (WGD) is a special case of gene duplication, observed rarely in animals, whereby all genes duplicate simultaneously through polyploidisation. Two rounds of WGD (2R-WGD) occurred at the base of vertebrates, giving rise to an enormous wave of genetic novelty, but a systematic analysis of functional consequences of this event has not yet been performed.
We show that 2R-WGD affected an overwhelming majority (74%) of signalling genes, in particular developmental pathways involving receptor tyrosine kinases, Wnt and transforming growth factor-β ligands, G protein-coupled receptors and the apoptosis pathway. 2R-retained genes, in contrast to tandem duplicates, were enriched in protein interaction domains and multifunctional signalling modules of Ras and mitogen-activated protein kinase cascades. 2R-WGD had a fundamental impact on the cell-cycle machinery, redefined molecular building blocks of the neuronal synapse, and was formative for vertebrate brains. We investigated 2R-associated nodes in the context of the human signalling network, as well as in an inferred ancestral pre-2R (AP2R) network, and found that hubs (particularly involving negative regulation) were preferentially retained, with high connectivity driving retention. Finally, microarrays and proteomics demonstrated a trend for gradual paralog expression divergence independent of the duplication mechanism, but inferred ancestral expression states suggested preferential subfunctionalisation among 2R-ohnologs (2ROs).
The 2R event left an indelible imprint on vertebrate signalling and the cell cycle. We show that 2R-WGD preferentially retained genes are associated with higher organismal complexity (for example, locomotion, nervous system, morphogenesis), while genes associated with basic cellular functions (for example, translation, replication, splicing, recombination; with the notable exception of cell cycle) tended to be excluded. 2R-WGD set the stage for the emergence of key vertebrate functional novelties (such as complex brains, circulatory system, heart, bone, cartilage, musculature and adipose tissue). A full explanation of the impact of 2R on evolution, function and the flow of information in vertebrate signalling networks is likely to have practical consequences for regenerative medicine, stem cell therapies and cancer treatment.
Most genes belong to gene families which are derived through consecutive cycles of gene duplication. In animals, in the absence of horizontal gene transfer, gene duplication is the most important source of evolutionary novelty. While most duplications are of single genes, predominantly in tandem arrangements, whole genome duplication (WGD) is a special case whereby all genes duplicate simultaneously through polyploidisation. A WGD is followed by the loss of the majority of duplicated genes in the process of rediploidisation. Over large evolutionary time scales, rediploidisation results in the formation of a paleopolyploid species such as Saccharomyces cerevisiae. S. cerevisiae was shown in a pioneering study to derive from a WGD which took place after the divergence of Saccharomyces from Kluyveromyces . Pairs of genes derived from this WGD were shown to constitute about 13% of the yeast coding gene set .
Evidence has accumulated that single-gene duplications and WGDs result in preferential retention of different functional gene classes. In particular, WGDs may facilitate coevolution of interacting proteins, which are likely to resist single-gene duplication because of sensitivity to gene dosage [2,3]. Ohnologs, which are defined as paralogs derived from a WGD, were shown to be enriched in transcription factors (TFs) and signalling genes in animals, plants, yeast and Paramecium (reviewed in ). However, no detailed analysis of the consequences of these trends for functionality of signal transduction pathways and signalling networks has been undertaken for any of the known WGD events in plants, animals or protozoans.
The modern version of the two rounds (2R) hypothesis (recently reviewed in ) proposes that two WGDs occurred at the base of vertebrates after the divergence of urochordates and before the radiation of gnathostomes, most likely even before the cyclostome-gnathostome split . Recent genomic studies have provided overwhelming support for the 2R hypothesis. In particular, strong evidence was derived through sequencing and analysis of genomes of the human , the fish Tetraodon nigroviridis  and the lancelet Branchiostoma floridae . An important methodological advance was made by Dehal and Boore , who used an ingenious approach of mapping 2R paralogons by first identifying descendants of gene duplications mapping to the base of vertebrates by phylogenetic timing. Furthermore, successful attempts have been made recently at the reconstruction of the ancestral vertebrate genome before 2R-WGD .
Herein we present the first systematic analysis of the functional consequences of 2R-WGD using state-of-the-art methods for inference of orthology and duplications. We establish a gene retention percentage for each WGD, analyse preferences in types of retained genes and tissue expression signatures, contrast trends detected for 2R-WGD with those observed for tandem or segmental duplications and view gene family data through the lens of signalling network evolution.
We performed a comprehensive computational screen of duplication patterns in the TreeFam database of metazoan gene families (see Methods). Table Table11 shows the distribution of inferred duplication events associated with different taxonomic units (sorted according to rank). The highest number of inferred duplications was associated with the emergence of vertebrates (7,701 duplication nodes), more than twice as many as for the second most abundant taxonomic unit, the Bilateria (3,313 nodes). Human 2R-ohnologs (2ROs) were mapped to 9,958 unique Entrez Genes (Additional file 1, Table S1), which were then placed on the genome defining a linear pattern of paralogons covering 83% of the length of human chromosomes (proving that they originated through whole genome duplication; see Methods, Identification of paralogons in the human genome).
Next, we calculated the number of ancestral pre-2R (AP2R) chordate genes as 3,545 by analysing the topology of TreeFam trees. 2R-WGD was assumed to initially result in a fourfold increase minus genes lost later through rediploidisation. Thus, the overall retention percentage was estimated as 9,958 ÷ (3,545 × 4) × 100% = 70.2%. However, this was an overestimate, as 4,231 tandem or segmental duplications, identified as nodes younger than 2R-WGD, inflated the number of ohnologs. One also needs to take into account the number of gene families without ohnologs, estimated from the total of human genes in TreeFam minus 2ROs (14,892 - 9,958 = 4,934), or total of Ensembl-predicted human protein-coding genes minus 2ROs (23,438 - 9,958 = 13,480). The upper-bound estimate was therefore calculated as (9,958 - 4,231) ÷ (3,545 × 4 + 4,934) × 100% = 30% and the lower bound as (9,958 - 4,231) ÷ (3545 × 4 + 13,480) × 100% = 20.7%.
The number of duplication nodes which could be placed at the very base of vertebrates was 3,545. An additional 3,263 duplication nodes were children to these nodes, yet were still assigned to vertebrates by phylogenetic timing. These two waves of duplications, in close temporal succession, and of similar quantitative contributions, should be interpreted as differential signatures of two consecutive rounds of genome doubling of 2R-WGD, with similar retention rates (approximately 10%-15%, that is, half of the overall 2R-WGD retention rate).
A key set of uniquely important conserved genes, known as the shared toolkit, control development in all animals. To better understand the evolution of the toolkit, we investigated duplication patterns of the eight key signalling pathways, namely, G protein-coupled receptors (GPCRs), receptor tyrosine kinases (RTKs), Wnt, Notch, transforming growth factor (TGF)-β, Janus kinase/signal transducer and activator of transcription (JAK/STAT), Hedgehog and nuclear hormone receptors. These duplication patterns are illustrated in Table Table11 with two clear major waves of diversification: one at the emergence of Bilaterians and the other tied to the emergence of vertebrates (2R-WGD). Apart from these two waves, the toolkit was strongly conserved throughout vertebrates, although a few additional modifications were associated with teleosts (potentially reflecting a fish-specific genome duplication (FSGD)), tetrapods and mammals. GPCRs were perhaps most dynamic, particularly those involved in sensory information processing, which was likely a sign of environmental adaptation. For example, Canis familiaris (dog) was associated with 222 species-specific GPCR duplications (the majority of which map to families of olfactory receptors, such as TF344049, TF337111, TF337295, TF343679, TF337210 and TF336833).
Gene families expanded during the course of 2R included predominantly TFs and signalling genes. Table Table22 lists the top 20 expanded families. The highest number of 2R duplications was assigned to the T-box transcriptional factor family (19 gene duplication nodes); followed by integrin-α (13 nodes); GPCRs of the gonadotropin-releasing hormone/vasopressin/oxytocin family (11 nodes); and Cdc42, Wnt ligand, annexin and PDZ/LIM domain family (10 gene duplications each). The integrin-β repertoire (which pair with α-integrins in a combinatorial fashion) also underwent substantial expansion in the course of 2R (six duplication events).
We searched for over- and underrepresented functional categories associated with 2ROs. Table Table33 lists the top 20 overrepresented gene ontology (GO) biological process (BP) terms. Some terms were related, as the GO classification used was a mixture of all hierarchy levels. No ontology should be seen in isolation; instead, specific functions should always be viewed in the context of higher-level functions. Signal transduction (GO:0007165) was the top overrepresented term, with 853 of the total pool of 1,160 human-associated genes (that is, 74%) being 2ROs. The full list of overrepresented BP terms is given in Additional file 2, Table S2_bp, while overrepresented molecular function (MF) and cellular component (CC) terms are given in Additional file 3, TableS2_mf, and Additional file 4, TableS2_cc, respectively.
Specific GO terms revealed signalling pathways preferentially affected by 2R-WGD, that is, GPCRs, Ras and its regulators, Wnt pathway, and RTK-associated signalling (Additional file 2, Table S2_bp). Several terms also pointed to signalling associated with the cytoskeleton and cellular attachment (Additional file 2, Table S2_bp). Vertebrate evolutionary novelties could be associated with a high proportion of 2ROs. For example, a number of overrepresented terms could be linked with the muscular upgrade: muscle contraction, skeletal muscle development and myoblast differentiation (Additional file 2, Table S2_bp). Higher-level terms indicated a general trend towards greater organismal complexity, as well as expanded locomotory and sensory abilities characteristic of vertebrates (Additional file 2, Table S2_bp).
BP terms underrepresented among 2ROs (Additional file 5, Table S3_bp) were dominated by basic cellular functions strongly conserved throughout Eukaryota, such as translation, DNA repair, RNA splicing, DNA replication, protein folding, DNA recombination, cellular respiration, mRNA transport or ubiquitin-dependent protein catabolic process. Underrepresented MF and CC terms are given in Additional file 6, TableS3_mf, and Additional file 7, TableS3_cc, respectively.
It was also intriguing to directly compare over- and underrepresented CC terms. The former (Additional file 4, TableS2_cc) centered on the membrane and cellular skeleton, including plasma membrane, synapse, actin cytoskeleton, cell junction, postsynaptic membrane and contractile fiber. The latter (Additional file 7, TableS3_cc), in contrast, centered on organelles, cytoplasm and the nucleus, including mitochondrion, spliceosome, ribosomal subunit, proteasome complex, nucleolus, chromosome and cytoplasm.
Overall, these findings boldly underlined the conclusion that signal transduction genes were preferentially retained after 2R-WGD. In stark contrast, very different functional trends characterised duplications younger than 2R, corresponding to tandem or segmental duplications (mapped to 5,495 unique human Entrez Genes). Overrepresented BP terms associated with these genes (Additional file 8, Table S3_not2R-over) were strongly biased towards immune functions and DNA/nucleosome/chromatin packaging. Crucially, terms associated with cell communication, development and cell cycle were strongly underrepresented (Additional file 9, Table S3_not2R-under).
Investigation of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways overrepresented in gene duplications associated with 2R (Additional file 10, Table S4) revealed four classes of pathways: (1) canonical signalling pathways (calcium signalling, mitogen-activated protein kinase (MAPK) signalling, Wnt signalling, insulin signalling, ErbB signalling, TGF-β signalling); (2) pathways associated with vertebrate novelties (neuroactive ligand-receptor interaction, axon guidance, melanogenesis, leukocyte transendothelial migration, adipocytokine signalling pathway, vascular endothelial growth factor signalling pathway, B cell receptor signalling pathway); (3) pathways associated with the cellular cytoskeleton, cell-cell and cell-extracellular matrix (ECM) interactions (regulation of actin cytoskeleton, focal adhesion, adherens junction, tight junction and ECM-receptor interaction); and (4) disease-associated pathways (renal cell carcinoma, chronic myeloid leukemia, long-term depression, colorectal cancer, type 2 diabetes mellitus, small cell lung cancer, glioma, pancreatic cancer).
All Pfam domains overrepresented among 2ROs were related to signal transduction (Table (Table44 and Additional file 11, Table S5). The most obvious were typical signalling domains, such as the protein tyrosine kinase domain, the serine/threonine kinase domain (harboured by cyclin-dependent kinases among others), Ras family signature, RhoGEF domain, RhoGAP domain, protein kinase C, protein-tyrosine phosphatase, neurotransmitter-gated ion channel domains and two types of 7TM domains (rhodopsin and secretin families). Several further domains were associated with transcriptional factors integrated with signalling pathways, that is, homeobox domain, helix-loop-helix, or ligand-binding domain of nuclear hormone receptor.
Figure Figure11 is a heatmap illustrating the relationship between the relative timing of gene duplications (as inferred by phylogenetic timing) and the spatial expression domain of progeny genes (as determined by mRNA levels). For example, a label "Homo sapiens" on the vertical axis signifies recent human-specific duplications. "Vertebrata" signifies duplications at the base of vertebrates linked with 2R-WGD. Dark red colour indicates preferential expression in a given tissue, while light yellow colour indicates a tendency for the exclusion of a given tissue from the expression domain of progeny genes. Taxons and tissues were ordered using the hierarchical clustering. It is striking and has never been previously reported that taxons comprise four chronological groups which can be aligned with major evolutionary transitions that have occurred in the course of animal evolution:
(a) Eukaryota and Metazoa: The emergence of the nucleated cell and establishment of multicellularity
(b) Bilateria, Chordata and Vertebrata: Bilateral symmetry and complex body plans
(c) Tetrapoda, Amniota, Mammalia, Eutheria and Theria: Diversification of vertebrates
(d) Homo/Pan/Gorilla, Homo sapiens and Catarrhini: The emergence of primates
The preferential expression of 2ROs in neuronal tissues demonstrated in Figure Figure11 is particularly exciting. Additional file 12, Table S6, lists 349 2ROs preferentially expressed in brain (preferential expression measure >0.4). GO terms (BP, CC and MF, respectively) preferentially associated with these genes are listed in Additional file 13, Table S7_bp; Additional file 14, Table S7_cc; and Additional file 15, Table S7_mf. The top three overrepresented terms in each category were as follows: synaptic transmission, neurological system process and cell communication (BP); synapse, cell junction and plasma membrane (CC); and calcium ion binding, transporter activity and calmodulin binding (MF).
In the next step, we analysed expression divergence between paralogs and found that mRNA expression similarity decays steadily over evolutionary time. This is proven by the falling average expression correlation between pairs of paralogs grouped by increasing age, detected using a variety of measures of expression distance (Figure (Figure2a)2a) (Pearson product-moment correlation coefficient, Additional file 16, Figure S1; Kendall correlation coefficient, Additional file 17, Figure S2; Spearman correlation coefficient, Additional file 18, Figure S3; Manhattan distance, Additional file 19, Figure S4, simple difference in breadth of expression). No specific signature could be discerned for 2ROs: the overall rate of expression divergence was similar between tandem and WGD duplicates, at least over large evolutionary time scales.
Similar trends were inferred with relative protein concentration data derived from a large-scale antibody screen of human cell lines and tissues . However, relative protein abundance between paralogs diverge with somewhat different temporal dynamics when tissue distribution (tissue microarray (TMA), overall; tissue microarray focused (TMAf), focused on positive regions) rather than cell line distribution (CMA) is considered (Figure (Figure2b).2b). CMA diverges gradually, reaching a plateau for duplications dated to mammals and older, similar to mRNA divergence (Figure (Figure2a).2a). In contrast, TMA and TMAf divergence is low for human-specific duplications, but subsequent divergence is extremely rapid. One possible explanation is that tissue protein levels are stabilised posttranscriptionally by a mRNA-level network of regulation, for example, through microRNA (miRNA), and this step is subtly influenced by the tissue microenvironment. The stabilising effect, however, decays rapidly over evolutionary time, as noncoding regions of paralogs, being under little selective constraint, rapidly diverge in sequence.
Evolution of the Ras family (TreeFam family TF312796) was shaped predominantly by the 2R event. Three vertebrate co-orthologs of Ras, K-Ras, N-Ras and H-Ras, originated from 2R. Two further duplications in the Ras family could be mapped to 2R: (1) Ras-related protein Ral-A (RALA; located on chromosome 7) and Ras-related protein Ral-B (RALB; chromosome 2) and (2) Ras-related protein (RRAS; chromosome 19) and Ras-related protein 2 (RRAS2; chromosome 11).
RasGAPs are GTPase-activating proteins. There are two subfamilies of RasGAPs in TreeFam: TF105303 and TF105302. In TF105303, RAS protein activator-like 3 (located on chromosome 19), RAS protein activator-like 2 (chromosome 1), disabled homolog 2-interacting protein (DAB2IP; chromosome 9), and SYNGAP1 (chromosome 6) are 2ROs. DAB2IP acts as a tumor suppressor gene and is inactivated by methylation or polycomb Ezh2 complex and histone deacetylase in prostate cancer . SynGAP is a synaptic-specific GTPase-activating protein . There is only one RasGAP of that subfamily in fly (CG42270) and worm (gap-2). In the second subfamily of RasGAPs (TF105302), two duplications can be mapped to 2R: Ras GTPase-activating protein 4 (RASA4; chromosome 7) and RasGAP-activating-like protein 1 (RASAL1; chromosome 12), as well as Ras GTPase-activating protein 2 (RASA2; chromosome 3) and Ras GTPase-activating protein 3 (RASA3; chromosome 13).
Most cyclins, including key cell cycle-regulating groups A, B and D, underwent diversification at the base of vertebrates and are represented by two to four vertebrate-specific paralogs (Table (Table5).5). Analysis of the TreeFam database indicated that the following genes were also 2ROs: cyclin-dependent kinases CDK2 and CDK3 (TF300619), CDK4 and CDK6 (TF328559) and cyclin-dependent kinase inhibitors p21 and p27 (TF101038), as well as p18 and p19 (TF333311) and finally orthologs of the WEE1 (S. pombe) inactivator of the CDK/cyclin complex, namely, Wee1 and Wee2 (TF315075).
Neurotrophins are key growth factors influencing proliferation, differentiation, survival and death of neuronal cells. Human neurotrophins (TF106463) include NTF4 (located on chromosome 19), brain-derived neurotrophic factor (BDNF; chromosome 11), NTF3 (chromosome 12) and nerve growth factor (NGF; chromosome 1). NTF4, BDNF, NTF3 and NGF are evidently ohnologs deriving from the 2R-WGD (Figure (Figure3).3). Interestingly, worm and fly lack orthologs of neurotrophins . The expanded neurotrophin family is most likely involved in the sculpturing of characteristically complex vertebrate nervous systems, with a large, centralized and multicompartmental brain.
The family of class II histone deacetylases (HDAC; TF106174) includes four vertebrate genes, that is, HDAC4 (located on chromosome 2), HDAC5 (chromosome 17), HDAC7 (chromosome 12) and HDAC9 (chromosome 7). There is only one HDAC in fly, worm and C. intestinalis (Figure (Figure4).4). Vertebrate HDACs have cell type-specific expression patterns and link through an N-terminal extension to TFs from a number of signalling pathways . There are links with vertebrate novelties in the skeletal, circulatory and muscular systems. HDAC4 is a corepressor controlling bone development ; HDAC5 and HDAC9 have been shown to suppress cardiac stress signals and control cardiac development ; and HDAC9 also couples neuronal activity to muscle chromatin acetylation and gene expression .
First, we identified the total overlap between protein members of the 1,625-node human cancer signalling map (HCSM) of Cui et al.  and paralogous genes of different ages. (HCSM is not limited to cancer; it describes the entire human signalling network and is suitable for the evolutionary analyses of the type we have undertaken.) The overlap consisted of 237 nodes for duplications mapped to Tetrapoda and younger, 1,096 nodes for 2ROs, 194 nodes for duplications linked with the emergence of chordates and 620 for those linked with the emergence of bilaterians (Table (Table6).6). Second, separate overlaps were identified for subnetworks consisting of positive, negative and scaffolding edges alone. (Note that these categories are not mutually exclusive, as many nodes are linked to more than one edge type.)
Table Table66 illustrates the overall results of the analysis of network connectivity, suggesting that highly connected negative regulators were preferentially retained after 2R-WGD. All statistically significant differences corresponded to increases in connectivity and/or betweenness centrality. These statistically significant differences were mostly associated with either 2ROs or the emergence of Bilaterians. It should be noted that the degree of a node and its betweenness centrality were highly correlated (Pearson's r correlation, 0.91), and we cannot be certain whether the simple number of interacting partners (degree) or the amount of information flowing through the given node (betweenness centrality) correspond more closely to the biological properties of the network which were selected for in evolution.
We compared degree difference and edge conservation, following waves of duplications of different ages. Degree difference, that is, the absolute value of degree subtraction between two nodes, is a metric of paralog connectivity divergence. The edge conservation concept is considered in the Discussion section. Table Table77 shows that these two metrics were inversely correlated, that is, the lower the average degree difference, the higher the percentage of conserved edges. Furthermore, edge conservation was higher and degree difference was lower between pairs of nodes linked with 2ROs than those linked with gene duplications mapping to the base of Chordates or Bilaterians. The differences in conservation of edges were highly statistically significant. For example, when the entire signalling network was considered, ratios of conserved edges to the total number of edges, calculated for each paralogous pair separately and then averaged, were (1) 0.281, (2) 0.083 and (3) 0.141, using Wilcoxon rank-sum test P values for pairwise comparisons 1 versus 2 and 1 versus 3 were 6.163e-08 and 1.235e-13, respectively. The differences were even more pronounced when distribution characteristics were considered: The percentages of paralogous pairs with more than one third of conserved edges were 39%, 9% and 17%, for (1), (2) and (3), respectively. The percentages of paralogous pairs which had at least one conserved edge were 65%, 35% and 47%, respectively.
It should be noted that the number of random node pairs with shared edges is much lower than the values observed among paralogs. By random sampling, we approximate that the expected fraction of shared ages under the null hypothesis is only 0.6% (versus 25.63% for 2ROs), while only 4% of random node pairs have any shared edges (versus 65% for 2ROs). This strongly supports the conclusion that the presence of shared edges between duplicate nodes stems from their shared evolutionary ancestry.
Interestingly, for all age groups, conservation of regulatory edges with negative impact was higher than those with positive impact. This effect was strongest for pairs of paralogous nodes linked with 2ROs: 658 (23.92%) of 2,750 positive edges and 348 (44.44%) of 783 negative edges were conserved. Corresponding percentage figures for Chordates and Bilaterians were 13.66% versus 23.52% and 11.93% versus 15.89%, respectively (Table (Table77).
We further subdivided 2RO-linked regulatory conserved edges into those originating from the shared interaction node and directed towards the paralogous pair (conserved incoming edges, or CIEs) and those originating from the paralogous pair and directed towards the shared interaction node (conserved outgoing edges, or COEs), with similar percentage frequencies (22.28% versus 26.91% for positive edges only and 48.42% versus 45.54% for negative edges alone).
Finally, it should also be noted that the fraction of conserved edges for duplications mapping to Tetrapoda and younger is not significantly different from that calculated for 2ROs (62/275 = 22.54% versus 1,728/6,741 = 25.63%, respectively; Wilcoxon P value = 0.808), suggesting that there is no dominant linear correlation between duplicate age and edge conservation and that greater edge conservation among 2ROs than older single-gene duplications can be attributed in large measure to the inherent properties of genome duplication.
A bridging edge is an edge directly linking the paralogous node pair. Bridging interactions were preferentially of the scaffolding type. We identified 48 bridged 2RO pairs, with 35 scaffolding, 19 stimulatory and 5 inhibitory bridging edges (Additional file 20, Table 2ROs, bridged). The apparent excess of activatory links was simply a reflection of the general bias in network composition. Bridged pairs were also characterized by higher average number of edges, 26.7 per pair versus 15 per pair in the total set (Wilcoxon rank-sum test P value = 0.000196), and had a similar fraction of conserved edges (0.291 versus 0.281, t-test P value = 0.77).
The fly ancestral ortholog set (FAOS) is a set of unique Drosophila melanogaster orthologs of human ohnolog pairs inferred from TreeFam with high confidence (bootstrap >75%). To investigate associated topology features, the FAOS was linked to the fly PPI network described by Giot et al. . We found that FAOS nodes were characterised by increased degree (6.28 compared to the average of 5.52 for 1,000 randomly sampled subnetworks; P = 0.005). Nodes linked with mammalian and chordate duplications exhibited even higher-degree biases: 7.75 and 8.74 (P = 0.001 and 0.002, respectively).
We combined the FAOS with Fly Expression Atlas (FEA)  to infer AP2R expression states, as well as patterns of expression sub- and neofunctionalisation. The following FEA tissues were considered analogous to human tissues: brain, midgut, hindgut, heart, ovary, testis and fat. However, we found good correlation between the FEA and the GEA only for brain and not for the other six tissues (data not shown). On top of that, in the past we have found that brain had the highest number of uniquely expressed genes and was robust in comparative expression tests . Therefore, we concentrated on brain expression.
To investigate the evolution of AP2R expression states, we identified gene triads consisting of a unique fly ortholog (member of the FAOS) and a pair of human paralogs, where the three genes could be linked to expression data in the FEA or the GEA. We then established whether these genes were preferentially expressed in brain. Preferential brain expression (PBE) was defined as brain signal higher than the average signal for all tissues. "b" denotes the PBE, while "nb" denotes the lack thereof. Triads in each of six possible configurations were then quantified and interpreted (Table (Table8).8). For example, "b: b & b" and "nb: nb & nb" correspond to straightforward conservation of brain expression status. More interestingly, "b: b & nb" signifies subfunctionalisation through loss of the PBE in one of the paralogs, while "nb: nb & b" is interpreted as neofunctionalisation through gain of the PBE.
The first interesting observation was that subfunctionalisation was two to three times more common relative to neofunctionalisation among 2ROs than among other duplicates (Table (Table8).8). Clearly, subfunctionalisation, as a faster process, was well suited for network remodeling following 2R-WGD. The second important observation was that expression neofunctionalisation into brain was very rare among duplicates dating to Tetrapoda and younger, supporting the interpretation of 2R as the formative event for vertebrate brains.
2R-WGD occurred more than 450 million years ago, and most resulting gene duplicates were lost, leading to rediploidisation. Here we set out to functionally characterize retained 2ROs. First, we found that signal transduction was the most enriched GO term (in stark contrast to tandem or segmental duplications, where this term was underrepresented). In total, 74% of human signalling genes were descendants of 2ROs. Foreshadowing later findings, several GO terms were associated with the nervous system: neurogenesis, synaptic transmission, axon guidance, nervous system development and neuron differentiation (Additional file 2, Table S2_bp). Next, we searched for protein domains enriched among 2ROs and found many classic signalling domains, as well as well-known protein interaction (PI) domains, such as Src homology 2 (SH2), Src homology 3 (SH3), phosphotyrosine-binding domain (PTB) and PDZ (reviewed in ). The PI domains aid signalling by enabling dynamic formation of signalling protein complexes. For example, SH2 and PTB selectively recognise phosphorylated tyrosines, while SH3 binds proline-rich sequences with a characteristic motif Pro-X-X-Pro. SH2 proteins frequently form membrane-attached signal-processing complexes at autophosphorylated receptors and participate in positive and negative feedback loops of phosphorylation cascades. PTB-bearing proteins, in turn, are predominantly adaptors and docking stations, frequently anchored in the cell membrane (sometimes by means of a lipid-binding PH domain), and promoting assembly of large signalling complexes at autophosphorylated tyrosine kinases. Finally, PDZ domains recognise internal valine or leucine residues and are abundant in synapses, serving as scaffolds for the assembly of large signalling complexes involved in neurotransmission.
To better understand the evolutionary dynamics of 2R-WGD, we investigated the relationship between relative timing of gene duplication and spatial expression domain of progeny genes. The heatmap in Figure Figure11 revealed 2R's expression signature in the broader context of animal evolution. Significantly, a trend could be observed for brain and nervous tissue expression (amygdala, thalamus, caudate nucleus, corpus callosum, spinal cord, fetal brain, cerebellum, cortex and whole brain) to map to the taxonomic cluster (b), Bilateria, Chordata and Vertebrata, while being excluded from younger clusters (c) and (d). These expression patterns, taken together with the results of GO analysis, suggested that the molecular machinery of the vertebrate neuron was defined in the 2R event and strongly conserved thereafter. A previous focused study of fly and mouse noted that vertebrate synapses were far more complex than those of invertebrates , but the scale, the mechanism and the precise timing of this key evolutionary transition was hitherto unknown.
Development of large multicompartmentalised vertebrate brains is shaped by three layers of control : (1) establishment of patterning centres that secrete diffusible signalling ligands, such as WNTs, BMPs and soluble bone morphogenetic protein (BMP) antagonists; (2) brain-specific transcriptional regulatory networks involving TFs such as paired box proteins (PAX) and forkhead box protein (FOXP); and (3) extensive neuronal apoptosis shaping the fine detail of brain structures and compartments. For example, in a direct mechanistic demonstration, mice deficient in cysteine-aspartic acid protease 3 (CASP3) exhibited decreased neuronal apoptosis and hyperplasia, resulting in gross brain abnormalities . How important was 2R-WGD for the definition of this developmental toolkit? We found that multiple WNT ligands (TF105310), PAX2/5/8 and PAX1/9 (TF315397), FOXP1/2/3/4 (TF326978) and CASP3/7 (TF102023) were 2ROs. Previously, we showed that the evolution of the BMP/TGF-β pathway was guided almost entirely by 2R-WGD . In conclusion, we identified most of the vertebrate brain developmental toolkit as 2ROs.
The exclusion of nervous tissue from the expression domain of newly formed mammalian and primate genes contradicts intuition. However, anatomical differences between vertebrate nervous systems can be sufficiently explained by changes in developmental expression patterns of existing regulatory and structural genes of the neuron. Higher complexity of mental functions in certain vertebrate lineages (for example, in primates, some birds, and dolphins) is likely to stem from these anatomical differences, as well as more complex ways in which neurons are connected, as demonstrated by the rising area of connectomics.
Uniquely in animal evolution, and in stark contrast to other basic cellular functions, 2R-WGD expanded the cell cycle machinery, in particular cyclins A and B, and the interface with signalling made up by cyclins D1-D3, CDK4/6, p21/p27 and p18/p19. Cyclin D levels (unlike cyclins A and B) do not correlate with cell-cycle phases but with extracellular mitogens, cytokines, hormones and juxtacrine ligands. Signalling pathways induce expression of cyclin D, which pairs with cyclin-dependent kinases (CDKs) of types 4 and 6, stimulating the cell to enter the cycle from G1. (This progression can be inhibited by cyclin-dependent kinase inhibitors p21/p27 and p18/p19.) We identify all four sets of genes involved (that is, cyclins D1-D3, CDK4/6, p21/p27 and p18/p19) as 2ROs.
Arguably, the cyclin/CDK engine might be a relatively late evolutionary invention, taking over from ancient kinases , and with the inherent tendency for redundancy characteristic of an integrating system . However, cyclin/CDK signalling is very well documented in yeast. Regardless of the controversy regarding the nature of primordial cell-cycle regulators, the results presented here suggest that control over cell cycles became more important in large and long-lived animals and that expansion of the cyclin/CDK network, which occurred through genome duplication, facilitated fine-tuning of that control. No such regulatory upgrade was required for other basic cellular functions (such as translation, replication, splicing and recombination). We hope to open a new area of investigation into the differences of cell-cycle control between vertebrates and model species such as fly, worm and yeast, with important consequences for both basic and applied science. As cyclin D1-D3/CDK4/6 complexes have at least partially overlapping phosphorylation targets, the apparent functional redundancy serves to integrate multiple upstream signals. In other words, 2R-WGD most likely resulted in retention of duplicates with different signalling inputs but similar outputs. Kinetic modeling, protein interaction and target screens focused on differences between invertebrate and vertebrate cyclin/CDK networks should yield the first clues.
The next question we decided to ask was whether signalling network nodes linked with 2ROs exhibited some characteristic features, such as the degree (that is, the number of interaction partners) or betweenness centrality (that is, the amount of network traffic, or information, flowing through a given node). The degree of human 2RO nodes was significantly increased, with the strongest effect on outdegree of negative regulation (Table (Table6).6). This suggested that highly connected nodes, that is, network hubs, in particular those involving negative regulators, were preferentially retained. Enrichment of 2ROs in PI domains, as shown by PFAM analysis, also suggested higher interconnectedness of the post-2R network. The likely biological result of this trend towards greater network complexity was increased signalling robustness and cross-talk. Negative feedback loops, on the other hand, were likely to mediate inducible and temporary biological responses invoked by external stimuli or network oscillations facilitating spatiotemporal patterning during vertebrate development.
However, was high connectedness driving preferential retention, or was it merely a consequence of rediploidisation? If only we could sequence the genome of the AP2R animal! This is, of course, impossible, but some features can be inferred from extant species. To this end, we compared fly and human and found that hubs were already enriched in genes ancestral to 2ROs. High connectedness was therefore a factor contributing towards preferential retention. Interestingly, ancestral nodes associated with mammalian and chordate duplications exhibited even higher connectivity biases, but the progeny of these genes were not associated with human hubs. This could be explained by the evolutionary model in which all duplications preferentially target highly connected nodes but WGDs preserve their status as hubs, while tandem and segmental duplications remodel them towards reduced connectivity.
Do gene duplications conserve interactions or rewire duplicates with novel interaction partners? We must first define a few concepts which will help us approach network topology from the evolutionary perspective, with a focus on gene duplication. Let us define shared edges as a pair of edges extending between two nodes and an identical third node. A conserved edge, on the other hand, corresponds to an ancestral interaction in the ancestral network, which is still present in the extant network. We can see that shared edges between a pair of 2ROs are parsimoniously explained as conserved edges (derived from an ancestral interaction in the AP2R network), as the probability of gaining shared edges through convergent evolution is extremely low. Finally, a bridging edge is an edge directly linking the paralogous node pair, suggesting sophisticated forms of regulatory feedback and information processing between duplicates . The bridging edge is an evolutionary novelty created as a consequence of duplication, possibly but not necessarily associated with ancestral proteins prone to homodimerisation.
When the concepts of shared, conserved and bridging edges are applied to HCSM (Table (Table7),7), a number of observations emerge: (1) the fraction of conserved edges is higher for 2ROs than for paralogs mapping to Chordates or Bilaterians, (2) the fraction of conserved regulatory edges with negative impact is higher than those with positive impact, and (3) complex novel network motifs are formed by bridged hubs (Figure (Figure5).5). Figure Figure55 shows a graph representation of a HCSM subnetwork focusing on the apoptosis pathway featuring three bridged 2RO pairs. Overall, bridged pairs are extremely rich in signalling hubs, with twice the average number of interacting partners. In terms of the broader evolutionary impact, we propose that 450 million years ago, at the time of 2R, instantaneous doubling of the signalling network through WGD not only immediately expanded the available space of network states but also kick-started rapid coevolution of nodes into novel topologies. The cumulative effect was that of greatly increased phenotype space, enabling adaptation to an expanded range of physiological parameters, such as temperature, osmotic pressure, availability of nutrients and growth factors. Greater organismal adaptability facilitated, in turn, colonisation of novel environments or ecological niches. 2R-WGD was most likely an instantaneous speciation, in itself an extraordinary evolutionary event, somewhat contrary to the classic Darwinian view of gradual evolution. It probably took place under stress conditions on the fringes of the normal ecological range of the parental species. Few "hopeful monsters", with duplicated genomes, must have had an instant adaptability advantage to compete with AP2R parental populations, despite the increased costs of DNA replication, chromatin remodeling and chromosome segregation associated with polyploidy. For example, Conant and Wolfe  proposed that yeast WGD conferred an immediate selective advantage for growth in high-glucose environments through the increase of dosage of genes in the glycolytic pathway. In the longer term, as proven by our GO analysis, 2R-WGD likely also provided a drive for increased morphological complexity  and conferred greater evolvability, facilitating the emergence of vertebrate novelties.
Herein we present the first global analysis of functional trends among 2R-WGD-retained genes using state-of-the-art methodology and a high-quality data set verified through manual curation. In a methodological advance, 2R ohnologs were identified using detailed phylogenetic trees on the basis of a tree-merging algorithm implemented in the TreeFam database. We found that 2R-WGD was the paramount source of novelty in vertebrate evolution, affecting an overwhelming majority (74%) of signalling genes, in particular developmental pathways involving receptor tyrosine kinases, Wnt and TGF-β ligands, GPCRs and apoptosis pathway. Moreover, 2R-WGD redefined vertebrate synapses and facilitated the formation of centralised brains. We show that 2R-WGD preferentially retained genes associated with higher organismal complexity (for example, locomotion, nervous system, morphogenesis), while genes associated with basic cellular functions (for example, translation, replication, splicing and recombination, with the notable exception of cell cycle) tended to be excluded. In conclusion, 2R-WGD left an indelible imprint on the vertebrate signalling network (including the interface with cell-cycle machinery) and set the stage for the emergence of key vertebrate functional novelties, facilitating the evolutionary success of this taxonomic group. Finally, we link observed functional trends to signalling network and expression evolution, investigating the human signalling network and the inferred AP2R network, a comparison that has never previously been performed.
TreeFam (Tree Families; http://www.treefam.org/) is a database of phylogenetic trees of animal genes . TreeFam relies on a multistage computational pipeline and a tree-merging algorithm implemented in the TreeBeST phylogenetic engine. Several types of trees are utilized: (1) a maximum likelihood tree built using PHYML with the WAG model, (2) a maximum likelihood tree built using PHYML with the HKY model, (3) a neighbor-joining tree using P distance, (4) a neighbor-joining tree using Ka distance, and (5) a neighbor-joining tree using Ks distance. The merging procedure is implemented in the TreeBeST phylogenetic engine. In the first step of the merging procedure, the set of permitted branches and nodes is constructed, given the input set of trees. In the second step, the tree which optimizes the objective function is found. The objective function measures the similarity between a gene tree and the species tree (by minimizing the number of inferred gene duplications and losses), as well as the overall bootstrap support. TreeBeST has been tested extensively against knowledge of biologists, including manual curation, within the TreeFam and Ensembl databases .
TreeFam release 6 database was downloaded as a set of SQL instructions and data files, and reconstituted as a local MySQL database. The total data set extracted from TreeFam6 consisted of 11,635 trees containing 391,730 genes and assigned to 64 different taxonomic categories spanning the animal kingdom with S. cerevisiae and A. thaliana as outgroups. Perl scripts based on the provided TreeFam API http://treesoft.sourceforge.net/tf-perl-api.shtml were used to extract the data.
Figures Figures33 and and44 were produced using a locally installed version of the TreeBeST pipeline. Trees were rooted on time. A speciation and duplication inference (SDI) algorithm, based on the reconciliation of the gene tree with a trusted species tree , was used to infer orthology, paralogy, speciation nodes and gene duplication events. However, inferred duplication events with no species intersection support (SIS = 0) were attributed to locally incorrect gene tree topology.
A hypergeometric test implemented in the Bioconductor package GOstats was used to detect enrichment in GO categories, KEGG pathways, PFAM domains or spatial clustering along human chromosomes. The test is implemented in function hyperGTest, which enables testing for both over- and underrepresentation of terms, and conditional correction taking into account the hierarchical structure of GO ontologies was used .
Expression data are derived from a Gene Expression Atlas , a collection of Affymetrix readings from 47 human tissues and cell lines (see Additional file 21, Table S8, and http://expression.gnf.org/human_annot.html). Mapping of Affymetrix clusters to Entrez Genes was derived from the R environment table hgu95aENTREZID. In case of multiple probes mapping to the same Entrez Gene, average values across the probes were taken. Original data from the Gene Expression Atlas were in the form of average difference (AD) values, representing the fluorescence assigned to a given gene probe after subtraction of background calculated from its mismatch control. The cutoff accepted in the GEA for a gene to be regarded as switched-on or expressed was AD = 200. The AD values were converted to preferential expression measures (PEMs). PEM is log10(S/A), where S is the Affymetrix signal for a given gene in a specific tissue and A is the arithmetic mean signal for the gene across all tissues. To avoid biasing PEM values by genes without detectable expression in any of the GEA tissues or by genes with very high average expression, data points with ADavg < 50 or ADavg > 999 were omitted (2,130 and 545 of the total 6,837 human genes mappable to the GEA data set). The actual number of genes with expression data corresponding to each taxonomic unit and average PEM values across 47 human GEA tissues are given in Additional file 22, Table S9. The heatmap in Figure Figure11 was generated using the standard R function and the brewer.pal palette from the RColorBrewer package. Tissues (horizontal axis) and taxons (vertical axis) were ordered using a simple hierarchical clustering algorithm and visualized using dendrograms.
For the analysis of tissue specificity of expression, two separate measures of tissue specificity were used: percentage breadth of expression and PEMMAX. Percentage breadth of expression was defined as the percentage of the 47 human tissues studied in which a given gene was expressed above the threshold level (AD = 200). PEMMAX is the maximal value of the PEM across the 47 tissues for a given gene.
We identify 2ROs in the human genome with high certainty by the combination of phylogenetic timing and paralogon detection. By phylogenetic timing, we mean the inference of relative evolutionary timing of gene duplications on the basis of the patterns of presence or absence of orthologs and paralogs in genomes of extant species, assuming a known and trusted species tree.
Paralogons (or multiplicons) are pairs of mutually paralogous chromosome regions in which gene content and gene order are conserved. Coverage of the majority of the genome by multiplicons is an unmistakable signature of a past WGD. It should be noted that in the case of ancient WGDs, such as the 2R event, a large number of positional rearrangements is bound to have occurred and the conservation of synteny is imperfect or sometimes even marginal. Various algorithms aiming at increased sensitivity have been developed, including the Automatic Detection of Homologous Regions (ADHoRe), which combines information across multiple homologous segments. In our analysis, multiplicon detection was performed using the i-ADHoRe 2.4  implementation of the ADHoRe algorithm. The following parameter values were used: 30 intervening genes were allowed between anchor points, a minimum of five anchor points were required to define a multiplicon, a quality value of 0.8 and a probability cutoff of 0.00001. Multiplicon detection was used to verify gene duplications assigned to the 2R-WGD by phylogenetic timing. Positional mapping of human genes descending from the duplication node mapping the base of vertebrates yields a characteristic pattern of paralogous regions covering 83% of the length of the human genome, constituting a definitive signal of an ancient WGD (Additional file 23, Table S10, contains the list of genes with chromosomal location; Additional file 24, Table S11, defines the paralogy relationships between human 2ROs). Similarly to the analysis of Dehal and Boore , chromosomes 18, 21 and Y were found not to harbour any paralogous regions, suggesting that they originated after 2R-WGD (Additional file 25, Table S12, lists fractions contained in multiplicons by chromosome breakdown).
Additional confirmation can be derived from the fact that no simple spatial clustering in chromosomal gene order can be detected for the list of 2ROs, which would be expected if these paralogs arose in the standard fashion, that is, by tandem duplications. In contrast, such chromosomal clusters can be found for descendants of gene duplications mapped by phylogenetic timing to other taxons, such as Chordata (Additional file 26, Table S13), Amniota/Tetrapoda (Additional file 27, Table S14), Mammalia/Eutheria/Theria (Additional file 28, Table S15) or identified as human-specific (Additional file 29, Table S16).
A manually curated signalling network consisting of 1,634 nodes, termed the human cancer signalling map (HCSM), was derived by Cui et al.  from the integration of several data sets: the BioCarta database http://www.biocarta.com/, the Cancer Cell Map http://cancer.cellmap.org/cellmap/ and the data set described by Ma'ayan et al. . The integrated network consists of 5,059 edges divided into undirected 1,915 scaffolding links (neutral physical interactions) and directed 2,403 activator and 741 inhibitory links. The network was retrieved as a Microsoft Excel file (Microsoft Corp., Redmond, WA, USA) and processed into two tab-delimited files containing data on nodes and edges, respectively. These files were then processed into a format compatible with the R class graphNEL and analysed using the R package RBGL (an interface to the popular Boost C++ library of graph algorithms). Protein members of the signalling network were mapped to Entrez Gene Ids. Custom R functions were written to analyze the overlap between human gene sets mapped to gene duplications of different phylogenetic age and the protein nodes of the imported human signalling network. Nodes were compared using standard RBGL functions, with regard to degree (including indegree and outdegree), and betweenness centrality (function brandes.betweenness.centrality, method relative.betweenness.centrality.vertices) as shown in Table Table6.6. P values were calculated by comparison with distributions of relevant function values calculated for random subgraphs of the same size as that obtained through mapping to a given test set. Random subgraphs were derived through permutation of node labels using sampling without replacement.
An analysis of conservation of network edges following gene duplications of different ages was also performed (Table (Table7).7). The number of identical edges was compared against the total number of edges linked to the pair of nodes identified as paralogs. The percentage of conserved edges was calculated and is shown in Table Table7.7. An undirected graph representation of the human signalling network was used. The average difference in degree between paralogous nodes, and the number of paralogous nodes connected by a bridging edge, were also calculated (Table (Table77).
2R-WGD: two rounds of whole genome duplication at the base of vertebrates; 2ROs: 2R-ohnologs; AP2R: ancestral pre-2R; GPCRs: G protein-coupled receptors; RTKs: receptor tyrosine kinases; WGD: whole genome duplication.
LH performed the analysis and wrote the manuscript. CHH contributed to writing the manuscript. LH and CHH together designed the study and interpreted the results.
LH is a senior postdoctoral fellow in evolutionary systems biology at Ludwig Institute for Cancer Research, Uppsala Branch (LICR-UPP), funded by the ENFIN Network of Excellence for Systems Biology to conduct interdisciplinary research on the molecular evolution of animal signalling. CHH is a molecular biologist, director of LICR-UPP, and Swedish representative to the European Research Council.
TableS1. 2ROs mapped to Entrez Genes.
TableS2_bp. 2RO overrepresented BP terms.
TableS2_mf. 2RO overrepresented MF terms.
TableS2_cc. 2RO overrepresented CC terms.
TableS3_bp. 2RO underrepresented BP terms.
TableS3_mf. 2RO underrepresented MF terms.
TableS3_cc. 2RO underrepresented CC terms.
TableS3_not2R-over. Tandem/segmental duplication overrepresented BP terms.
TableS3_not2R-under. Tandem/segmental duplication underrepresented BP terms.
TableS4. 2RO overrepresented KEGG pathways.
TableS5. 2RO overrepresented PFAM domains.
TableS6. 349 2ROs preferentially expressed in brain.
TableS7_bp.html. 2ROs preferentially expressed in brain, overrepresented BP terms.
TableS7_cc.html. 2ROs preferentially expressed in brain, overrepresented CC terms.
TableS7_mf.html. 2ROs preferentially expressed in brain, overrepresented MF terms.
FigureS1. Duplication timing and expression divergence (Kendall correlation).
FigureS2. Duplication timing and expression divergence (Spearman's rank correlation).
FigureS3. Duplication timing and expression divergence (Manhattan distance).
FigureS4. Duplication timing and expression (difference in breadth of expression).
2ROs-bridged. Human 2RO bridged pairs.
TableS8. 47 human tissues and cell lines in Gene Expression Atlas.
TableS9. Matrix of average PEM values across tissues and taxons.
TableS10. Human genes with chromosomal location.
TableS11. Paralogy relationship between human 2ROs.
TableS12. Fractions contained in 2RO-defined multiplicons by chromosome breakdown, paralog regions cover 83% of the human genome proving a WGD.
TableS13. Chromosomal clusters for gene duplications mapped to Chordata.
TableS14. Chromosomal clusters for gene duplications mapped to Amniota/Tetrapoda.
TableS15. Chromosomal clusters for duplications mapped to Mammalia/Eutheria/Theria.
TableS16. Chromosomal clusters for human-specific duplications.
This work was supported by ENFIN, a Network of Excellence funded by the European Commission FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LSHG-CT-2005-518254.