Gene duplication has been described as an opportunity to explore forbidden evolutionary space [2]: the idea that duplicated genes operating under temporary conditions of relaxed selection provide the raw material for the evolution of new gene functions. While whole-genome duplication events are critical in shaping broader genome architecture, gene duplications, particularly tandem events, represent more recent and potentially adaptive signatures of evolution [34], which are expected to differ among vertebrate lineages [23]. Indeed, a study using zebrafish as its model [36], and others, have shown evidence that evolutionary rates of duplicated genes in teleost fish far outstrip those of the mouse lineage. These differences, aside from their adaptive consequences, can have profound effects on the degree of shared ancestry and synteny among vertebrate genomes. For example, only 50% of duplicated genes in zebrafish, and 70% in Tetraodon, have their origin in 1R/2R WGD events, compared to over 80% in mammalian, avian, and amphibian lineages. The remaining fraction comes from FSGD and species-specific events [30]. Clearly, patterns of teleost gene duplication deserve closer scrutiny to better understand how this process continues to shape genome evolution. Therefore, here we examined the nature and extent of gene duplication in four model teleosts: zebrafish, medaka, stickleback and Tetraodon.
Our approach divided duplicated genes into sets based on duplication type and captured larger gene families as well as smaller, recent duplications. From the outset of our analysis, zebrafish stood out from the other three model species by most measures, with a larger percentage of sets involved in tandem and intra-chromosomal arrangements and numerous small duplication sets (Table, Figure). Our analysis of the mutational distance between duplicate pairs (Ks) across the teleost species (Figure), however, produced the most striking illustration of different patterns of duplication and retention. Over 24% of duplicate pairs in zebrafish had Ks values of ≤1.0, compared to around 1% or less in the other three species. These results are supported by previous studies that noted high evolutionary rates and duplicate retention rates in zebrafish [29]. The abundance of low-Ks duplicate pairs in zebrafish may stem from a greater number of birth events or fewer gene loss events among young duplicates. Although homogenization through gene conversion is a possibility [2], the low Ks values are mostly associated with tandem duplicates, suggesting recent gene duplications.
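The proportion of young duplicate pairs quoted above is a simple threshold count over per-pair Ks estimates. The following is a minimal sketch of that calculation, not the pipeline used in this study: the function name `fraction_young` and the toy Ks values are hypothetical and serve only to illustrate the ≤1.0 cutoff.

```python
def fraction_young(ks_values, threshold=1.0):
    """Return the fraction of duplicate pairs whose Ks is <= threshold.

    ks_values: iterable of per-pair Ks estimates; None marks a pair
    for which no estimate is available and is ignored.
    """
    ks = [k for k in ks_values if k is not None]
    if not ks:
        raise ValueError("no usable Ks values supplied")
    return sum(1 for k in ks if k <= threshold) / len(ks)


# Hypothetical toy values (NOT data from this study).
toy_ks = [0.1, 0.4, 0.9, 1.0, 1.6, 2.3, 2.8, 3.5]
young_fraction = fraction_young(toy_ks)
print(f"pairs with Ks <= 1.0: {young_fraction:.1%}")
```

Applied per species, the same count would give the kind of contrast described above (over 24% in zebrafish versus around 1% or less elsewhere), provided Ks is estimated in the same way for every genome.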
Our approach focused on surveying the broader architecture of duplication in the teleost genomes rather than relying on cross-species phylogenetic analysis to identify orthologous relationships. Our analyses are therefore limited in their ability to distinguish between rapid lineage-specific gains in zebrafish and excessive gene loss in other teleosts for particular duplicate sets. Nevertheless, the bias of the low-Ks duplicate pairs in zebrafish toward tandem duplication (Figure) supports these being recent duplication events. Close to 65% of the zebrafish duplicate pairs with Ks ≤1.0 are found in tandem arrangements, compared with ~15% of total duplicated sets (data not shown). In addition, gene ontology analysis revealed a bias in these duplicates toward physiological functions previously associated with rapid evolution and adaptation [28]. Indeed, the enriched categories (olfactory receptors, MHC) are well known for their rapid diversification through duplication, recombination, and gene conversion [39]. Taken together, our results suggest strikingly rapid evolution and high retention of recent duplicates in zebrafish, in a manner likely to result in specialization of immune and sensory mechanisms.
The differences observed in Ks distributions among the four teleost species (Figures) raised several intriguing questions for further research: What is the effect of life history on the genome architecture of fish, and is there a link between genome size and duplication rate/retention rate in fish? Shiu et al. (2006) examined similar lineage-specific patterns when comparing human and mouse duplicates, suggesting that the larger population size and shorter generation interval in murine species could account for more effective natural selection and retention of duplicated genes. Among the four teleost genomes investigated, zebrafish and medaka share similar life history patterns (generation intervals of 7–9 weeks and large effective population sizes) and similar Ks distributions (excluding Ks <1.0). In contrast, Tetraodon and stickleback, with generation intervals of 1–2 years and smaller effective population sizes, showed a notable absence of young (low Ks) duplicates and shared remarkably similar Ks distributions (Figure) across their duplicated genes. These patterns of duplication rate and retention have been explored in the light of population size using genome sequence information in invertebrates [43] and, previously, on a more theoretical basis [44]. Previous observations of correlations between spontaneous duplication/deletion rates and effective population size, and of increasing retention of linked (tandem) duplicates at intermediate population sizes, appear to support the connection between life history and duplication profiles suggested by our data. Another pattern deserving further attention as additional teleost genomes become available is a potential association between duplication timing/retention rates and genome size. Based on the limited data available from the four model genomes here, patterns of duplication rate (especially as reflected by those pairs with Ks ≤1.0) track genome size, with zebrafish having the largest genome at 1.5 Gb, followed by medaka (700 Mb), stickleback (446 Mb) and Tetraodon (… Mb). The drastically differing patterns of duplicate formation and retention detected here and by Blomme (2006) may be reflected in the evolution of non-coding elements as well [29] and, together, could contribute to significantly higher genic content and associated genome size, as observed in zebrafish [46].
The observed differences in the age of duplicated genes, as reflected in Ks values, could also result from errors in the genome sequence assemblies of medaka, stickleback and Tetraodon. Because these genomes were sequenced using shotgun approaches, sequence assembly could have underestimated the number of segmentally duplicated genes. In other words, the most similar paralogues could have been assembled as one gene when they are truly two or more genes in the genome. In this scenario, the missing segmental duplications would affect the assessment of the age of duplications [47]. This problem cannot be easily addressed directly. To determine whether such a possibility could have caused the major differences in Ks values between zebrafish and the other three fish species, we conducted simulations using zebrafish chromosome 1. The whole-genome sequence assembly of zebrafish chromosome 1 was “segmented” into 500 bp pieces, and de novo assembly was then conducted using 10X sequence coverage. This assembly yielded a large number of contigs (37,396). The large number of contigs apparently resulted from interspersed repetitive segments, most notably the TC1-like transposons. We then mapped the assembled contigs in silico to the reference genome sequence of zebrafish chromosome 1. Over 99.7% of the assembled contigs mapped to chromosome 1 sequences, suggesting that the “shotgun” approach did not affect the identification of paralogs. Therefore, we believe that the differences in Ks values were likely not caused by sequence assembly errors in medaka, stickleback and Tetraodon, although all of these genomes were sequenced using whole-genome shotgun sequencing.
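The first step of the simulation described above, cutting a reference sequence into 500 bp pieces at roughly 10X coverage, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: fragment start positions are assumed to be drawn uniformly at random, the names `simulate_reads` and `mean_coverage` are hypothetical, and a random toy sequence stands in for zebrafish chromosome 1. The de novo assembly and in silico mapping steps depend on external tools and are not shown.

```python
import random


def simulate_reads(seq, read_len=500, coverage=10, seed=0):
    """Sample fixed-length fragments from seq at the requested mean coverage.

    Assumption: start positions are uniform at random; the source text
    does not state how the 500 bp pieces were actually generated.
    Returns a list of (start, fragment) tuples.
    """
    if len(seq) < read_len:
        raise ValueError("sequence is shorter than the read length")
    rng = random.Random(seed)
    n_reads = (coverage * len(seq)) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((start, seq[start:start + read_len]))
    return reads


def mean_coverage(reads, seq_len):
    """Total sampled bases divided by the reference length."""
    return sum(len(fragment) for _, fragment in reads) / seq_len


# Toy stand-in for a chromosome sequence (random bases, NOT real data).
toy_rng = random.Random(1)
toy_seq = "".join(toy_rng.choice("ACGT") for _ in range(20_000))

reads = simulate_reads(toy_seq)
print(len(reads), "fragments, mean coverage", mean_coverage(reads, len(toy_seq)))
```

In a full replication, the sampled fragments would be handed to an assembler and the resulting contigs mapped back to the reference, with the mapped fraction (over 99.7% in the simulation reported above) serving as the readout.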
Previously, we highlighted the low levels of alternative splicing detected in zebrafish (17% of mapped genes) compared with the other model teleost species [31]. By contrast, the compact genome of Tetraodon showed alternative splicing in 43% of mapped genes. In that study, an inverse correlation between genome size and alternative splicing was observed. Researchers have previously suggested an inverse relationship between rates of gene duplication and alternative splicing in animals [48] and, more recently, in plants [49], based on single-gene or gene-family investigations. Our previous analysis of alternative splicing, combined with our present examination of gene duplication in the same teleost species, appears to support this connection on a genome scale. Further study is warranted to investigate whether the recent duplicates of zebrafish can provide the functional repertoire generated through alternative splicing in other, smaller teleost genomes.
Our findings indicate that varying rates of gene duplication and retention can have a dramatic impact on the ancestry and architecture of teleost genomes and contribute to functional diversification and divergence of important physiological processes. These patterns may reflect differences in life history across the teleost radiation and may ultimately influence genic content and genome size. Further analyses of the genomes of additional key teleosts (e.g., catfish, carp) in the near future will allow us to test these theoretical relationships and to analyze the particularities of the zebrafish genome in the context of more recently diverged species.
In Brown’s paper, the copy number variation elements (CNVEs) appeared to be consistent with extensive population substructuring (i.e., local adaptation) among zebrafish populations, with 4,199 (69%) of the identified CNVEs unique to one strain and only 457 (7.5%) common to all four groups [50]. Given this large amount of genome variation among zebrafish populations, analysis of genomes from additional zebrafish populations may reveal differences in gene copy numbers within a given duplication set. This would be of great interest in helping to establish the rate of gene birth in zebrafish. However, only the reference genome sequences were available for the present analysis. In addition, large copy number differences have mostly been associated with anonymous genomic segments rather than protein-encoding genes.
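As a quick consistency check on the CNVE figures quoted above, the two count/percentage pairs can each be used to back out the implied total number of CNVEs. The sketch below uses only the numbers stated in the text; the variable names are illustrative. Reading the remainder as CNVEs shared by two or three groups is an assumption that every CNVE falls into exactly one of the three sharing classes.

```python
# Counts and percentages as quoted in the text from [50].
unique_count, unique_pct = 4199, 69.0   # CNVEs unique to one strain
shared_count, shared_pct = 457, 7.5     # CNVEs common to all four groups

# Each pair implies a total; the two estimates should roughly agree.
total_from_unique = unique_count / (unique_pct / 100)
total_from_shared = shared_count / (shared_pct / 100)
relative_gap = abs(total_from_unique - total_from_shared) / total_from_unique

# Assumption: the remainder corresponds to CNVEs shared by two or three groups.
remainder_pct = 100 - unique_pct - shared_pct

print(f"implied totals: {total_from_unique:.0f} and {total_from_shared:.0f}")
print(f"remaining share: {remainder_pct:.1f}%")
```

Both pairs point to a total of roughly 6,100 CNVEs, so the quoted counts and percentages are mutually consistent to within rounding.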