Given the importance of gene duplication to the origin of biological innovations, a deeper understanding of the evolutionary process might be gained from investigating the differential contributions, if any, of gene duplication to the genome architecture within diverse lineages. Genomes can be variably shaped by the mutational input of duplicate sequences (the frequency and the flavor of redundant genetic sequences being generated) and their differential preservation/degeneration dictated by the strength of natural selection and random genetic drift. Some effort has been made towards such comparative genomic analyses of the gene duplication process, at the level of both closely and distantly related eukaryotic genomes (for example, [
30-
42]). In a similar vein, this study analyzes various structural and genomic features of gene duplicates in the
S. cerevisiae genome and aims to contrast these with gene duplicates with low synonymous divergence in the genome of a multicellular eukaryote,
C. elegans, as well as compare evolutionarily recent gene duplications with evolutionarily older gene duplicates with low synonymous divergence in
S. cerevisiae.
Most of the
S. cerevisiae duplication events (approximately 69%; 47 of 68) analyzed here are thought to have originated from a WGD in the distant past [
23]. This paucity of extant gene duplicates with low synonymous divergence in the
S. cerevisiae genome led Gao and Innan [
27] to conclude an extremely low gene duplication rate of approximately 0.001 to 0.006% per gene per million years for this species. However, a recent study utilizing multiple mutation accumulation lines of
S. cerevisiae conclusively demonstrates that the spontaneous rate of gene duplication is high, at 1.5 × 10⁻⁶ per gene per cell division [
43]. This experimental measure in conjunction with the low incidence of extant evolutionarily young gene duplicates in the yeast genome suggests that the fate of most newly spawned gene duplicates in the yeast genome is loss. The large effective population size (
Ne) achieved in yeast cultures dictates that new gene duplicates with even slightly deleterious selection coefficients may be subject to loss by purifying selection due to the efficacy of natural selection within the yeast genome. The role of effective population size (and, hence, strength of selection) in influencing patterns of genomic sequence evolution has been recently championed by Lynch and colleagues [
44-
46], although the associated theoretical underpinnings in relation to molecular sequence evolution can be traced back to the proponents of the neutral theory [
47,
48].
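The contrast between the two rate estimates quoted above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the number of cell divisions per year in natural yeast populations is not given in this study, so `divisions_per_year` takes hypothetical values, and treating the long-term estimate as the rate at which duplicates are retained is a simplifying assumption.

```python
# Rates quoted in the text.
SPONTANEOUS_RATE = 1.5e-6            # duplications per gene per cell division [43]
LONG_TERM_RATE_PCT = (0.001, 0.006)  # percent per gene per million years [27]


def implied_retained_fraction(divisions_per_year):
    """Return (low, high) ratio of the long-term rate to the spontaneous rate.

    divisions_per_year is a HYPOTHETICAL parameter, not a value from the study.
    """
    # Spontaneous duplications expected per gene per million years.
    spontaneous_per_my = SPONTANEOUS_RATE * divisions_per_year * 1e6
    # Convert percent to a proportion before taking the ratio.
    low, high = (pct / 100 / spontaneous_per_my for pct in LONG_TERM_RATE_PCT)
    return low, high


# Hypothetical generation counts spanning two orders of magnitude.
for divisions in (10, 100, 1000):
    low, high = implied_retained_fraction(divisions)
    print(f"{divisions:>5} divisions/year: ratio {low:.1e} to {high:.1e}")
```

Under any of these assumed values the ratio is far below one, which is in line with the inference that most newly arisen duplicates are lost.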
The extant group of gene duplicate pairs with low synonymous divergence in the
S. cerevisiae genome comprises a mixed population. Most of these pairs (approximately 69%) are derived from evolutionarily older duplications wherein sequence divergence between paralogs has been curbed by selection for codon usage bias, sometimes in conjunction with gene conversion [
19,
27,
28], whereas a smaller subset of gene duplicates (approximately 31%), referred to as non-ohnologs in this study, is thought to be of relatively more recent origin, probably arising subsequent to the WGD event. Furthermore, selection for codon usage bias and/or gene conversion appears to have affected sequence evolution in some of these non-ohnologs as well, given that different paralogous pairs within the same linked set (presumably arising from the same duplication event) have extremely divergent K
S values (Table ). For these reasons, K
S values between gene paralogs cannot be taken as a blanket proxy for estimating the evolutionary age of all gene duplicates, at least in the
S. cerevisiae genome. The mixed nature of this population of yeast gene duplicates is also apparent in sequence alignments of ribosomal protein paralogs containing at least one intron. Twenty-four pairs of ribosomal protein duplicates in the ohnolog class have no discernible sequence identity over most of their intronic regions (barring small sequence tracts of 1 to 10 bp at their splice junctions), despite relatively low levels of synonymous divergence in their coding sequences. This lends credence to the view that these previously classified ohnologs are indeed of older evolutionary origin [
19,
23]. Given the presence of ancient gene duplicates with low degrees of synonymous divergence in the
S. cerevisiae genome, it is reasonable to question whether gene duplicates with low synonymous divergence in other genomes are necessarily young, evolutionarily speaking. A preceding study applied statistical tests for detecting gene conversion to a subset of gene duplicates in the
C. elegans genome and found that most gene conversion events were restricted to members of large gene families [
49], suggesting that the degree of synonymous divergence may be an accurate indicator of evolutionary age for paralogs belonging to small gene families in this genome. Therefore, the worm and yeast genomes may differ in the degree to which concerted evolution or selection for codon usage bias effectively homogenizes gene paralogs, depending on the size of the gene family and the effective population size of the species (and, hence, the strength of natural selection).
We charted out the extent of homology between two paralogs by aligning their genic as well as upstream and downstream flanking regions, thereby calculating a minimal estimate of the extent of duplication by visual inspection. For evolutionarily older duplicates, erosion of sequence homology in the intergenic regions would lead us to underestimate the original duplication span. This expectation is borne out by the fact that 56 of the 93 duplicate pairs in our data set appear to involve the duplication of a single locus. Yet, preceding studies have identified 46 of these 56 gene duplicate pairs as ohnologs. The remaining 37 duplicate genes were generated by 12 duplication events referred to as 'linked sets' (16% of all duplications in this data set) that involved the simultaneous duplication of multiple gene loci (range two to seven genes). Interestingly, only one of these twelve duplication events is thought to have originated from the WGD, suggesting that duplication of lengthier DNA segments encompassing multiple loci is an ongoing process in the yeast genome. Indeed, gene duplication during experimental evolution in yeast frequently involves large chromosomal blocks comprising multiple loci [
43,
50,
51]. Segmental duplications in
C. elegans encompassing more than one locus, on the other hand, only comprise 7.1% of all observed duplications [
34]. This contrast in the patterns of segmental duplication between worm and yeast suggests that duplication events spanning multiple loci occur with a greater frequency and/or are selectively advantageous in the yeast genome relative to
C. elegans.
Based on the extent of sequence homology visible between yeast paralogs in their flanking regions, we calculated the minimum duplication span for each duplicate pair and also determined the minimum number of loci that appear to be duplicated. In the majority of cases, the duplications appear to span only a single locus (approximately 82%; 56 of 68), and the median duplication span for the cumulative data set comprising both ohnologs and non-ohnologs in yeast is 1,004 bp, slightly lower than the median duplication span of 1.4 kb for C. elegans gene duplicates. These results appear paradoxical when we consider that the majority of yeast duplicate pairs comprising this data set (69%; 47 of 68) originated via a WGD event, which by definition duplicates entire chromosomes rather than single loci. The median duplication span for ohnologs is significantly lower than that for non-ohnologs (958 bp and approximately 2,500 bp, respectively). Furthermore, ohnolog duplication spans are far more restricted in their size range than those of non-ohnologs. This shorter span of duplication for gene duplicates arising from a WGD is in accord with an older evolutionary age for ohnologs, in conjunction with the erosion of sequence homology in their intergenic regions over evolutionary time due to sequence divergence, deletions and/or local rearrangements.
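The counts reported in the preceding two paragraphs can be cross-checked with a small tally. The sketch below only restates numbers given in the text; the variable names and the `percent` helper are illustrative.

```python
# Duplication events by type, as reported in the text.
single_locus_events = 56    # duplicate pairs involving a single locus
linked_set_events = 12      # events duplicating two to seven loci at once
pairs_in_linked_sets = 37   # duplicate pairs generated by the linked sets

# Events attributed to the WGD (ohnologs) within each type.
single_locus_ohnologs = 46
linked_set_ohnologs = 1

total_events = single_locus_events + linked_set_events
total_pairs = single_locus_events + pairs_in_linked_sets
ohnolog_events = single_locus_ohnologs + linked_set_ohnologs
non_ohnolog_events = total_events - ohnolog_events


def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


print(f"events: {total_events}, pairs: {total_pairs}")
print(f"single-locus events: {percent(single_locus_events, total_events)}%")
print(f"ohnolog events: {percent(ohnolog_events, total_events)}%")
print(f"non-ohnolog events: {percent(non_ohnolog_events, total_events)}%")
```

The tally recovers the 68 duplication events and 93 duplicate pairs, along with the 82%, 69% and 31% figures quoted in the text.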
Yeast paralogs were characterized as possessing complete, partial or chimeric structural homology based on the extent of sequence homology using techniques previously described for
C. elegans paralogs [
11]. The genomes of these two eukaryotes are in stark contrast with respect to the frequency of these three structural categories of gene duplicate pairs. The
C. elegans genome has a high frequency of structurally heterogeneous gene duplicates, with approximately 50% of all evolutionarily young gene duplicate pairs categorized as partials or chimerics [
11].
S. cerevisiae, on the other hand, has a preponderance of complete duplicates, a handful of chimeric duplicates and a complete absence of partial duplicates. When yeast duplicates are partitioned based on their mechanism of duplication, ohnologs and non-ohnologs are found to be similar with respect to the frequencies of these three structural categories of duplicates. Several factors in combination probably contribute to the paucity of structurally heterogeneous duplicates in the yeast genome. Given a WGD origin for the majority of these duplicates, they are likely to have originated as structural replicas of the ancestral copy with concomitant inheritance of the full repertoire of ancestral
cis-regulatory elements. Evolutionarily older duplicates such as the ohnologs in this data set are likely to have experienced local rearrangements, insertions or deletions that could alter one or both paralogs such that the pair appears structurally heterogeneous. However, we observe a remarkable level of structural preservation between evolutionarily older paralogs in
S. cerevisiae, suggesting purifying selection against mutations modifying ancestral ORF structure and/or pervasive gene conversion leading to structural homogeneity. Indeed, gene conversion is known to operate at an appreciable frequency in the yeast genome and is commonly invoked as one of the factors responsible for the low synonymous divergence among
S. cerevisiae ohnologs [
19,
27,
28].
Despite the fact that both yeast non-ohnologs and
C. elegans gene duplicates resulted from SSD events, it is interesting to note that the genomes of these two species differ with respect to the degree of structural homogeneity observed between paralogs. Approximately 82% of yeast non-ohnologs are structurally homogeneous compared to only 40% of gene duplicate pairs with low synonymous divergence in the
C. elegans genome [
11]. This difference may be attributed to an interplay between the median gene length, median duplication span and the strength of natural selection in these two genomes. The median gene length in
S. cerevisiae and
C. elegans is 1.1 and 1.4 kb, respectively. The median duplication span for extant
S. cerevisiae (minimal discernible estimate and excluding ohnologs) and
C. elegans duplicates is 2.5 and 1.4 kb, respectively. If the median duplication span of extant yeast duplicates accurately approximates that of the entire population of gene duplicates (both preserved and extinct), an SSD event in
S. cerevisiae, on average, is more likely to encompass the entire ORF of the ancestral copy relative to
C. elegans. It is also possible that the average length of an SSD event in
S. cerevisiae may be much shorter than that of extant duplicates. If newly originated duplicates are mildly deleterious because they lack structural and functional redundancy with the progenitor copy, they may be rapidly weeded out in the yeast genome owing to the greater efficacy of natural selection. However, a recent study demonstrates that most spontaneous duplications in yeast experimental lines tend to be fairly large [
43]. A smaller
Ne for
C. elegans relative to yeast means that such structurally heterogeneous gene duplicates, if mildly deleterious, may be more likely to persist in the worm genome due to an attenuated strength of natural selection.
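The argument about effective population size can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation, which is standard population-genetic theory rather than an analysis performed in this study. In the sketch below, the selection coefficient and both Ne values are hypothetical placeholders chosen only to contrast a smaller and a larger population; a diploid population with census size equal to Ne is assumed.

```python
import math


def fixation_probability(s, ne):
    """Kimura's diffusion approximation for a single new mutation.

    s  : selection coefficient (negative = deleterious); hypothetical input
    ne : effective population size, assumed equal to the census size (diploid)
    """
    p0 = 1.0 / (2.0 * ne)  # initial frequency of one new copy
    if s == 0:
        return p0  # neutral expectation
    # (1 - exp(-4*Ne*s*p0)) / (1 - exp(-4*Ne*s)), using expm1 for accuracy
    return math.expm1(-4.0 * ne * s * p0) / math.expm1(-4.0 * ne * s)


def relative_to_neutral(s, ne):
    """Fixation probability expressed as a multiple of the neutral value."""
    return fixation_probability(s, ne) / fixation_probability(0, ne)


S = -1e-5  # HYPOTHETICAL mildly deleterious selection coefficient
for label, ne in (("smaller Ne", 1e4), ("larger Ne", 1e7)):  # HYPOTHETICAL sizes
    print(f"{label} ({ne:.0e}): {relative_to_neutral(S, ne):.2e} x neutral")
```

With these assumed values, the same mildly deleterious duplicate fixes at close to the neutral rate in the smaller population but has an effectively zero chance of fixation in the larger one, which is the qualitative contrast drawn in the text.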
The genomic location of paralogs relative to one another can provide clues to the mechanism(s) of duplication and the general patterns of their genomic movement subsequent to their origin. Overall, 82% of duplicate pairs in this yeast data set comprise paralogs located on different chromosomes, a pattern that is not surprising given that the vast majority of these gene duplicates are ohnologs that owe their origin to the WGD. Barring the possibility of misidentification of non-ohnologs as ohnologs, the presence of ohnologs with both copies residing on the same chromosome can probably be explained by the secondary movement of one paralog in proximity to its sister copy in the post-duplication period. Interestingly, ohnologs and non-ohnologs display no significant differences with respect to the chromosomal location of paralogs (same versus different chromosomes). While genome- or chromosome-wide duplication events are expected to initially yield paralogs residing on different chromosomes, SSD events do not necessitate such a pattern of paralog location. While approximately 90% of newborn gene duplicates in the
C. elegans genome comprise both copies residing on the same chromosome [
11], only 29% of yeast non-ohnologs are in such close genomic proximity. If retrotransposition is a frequent mechanism of gene duplication in the yeast genome due to the presence of
Ty elements [
43,
52-
55], there should be a further decrease in the likelihood of a paralog originating on the same chromosome as the ancestral locus. However, we have no evidence for the origin of gene duplicates via retrotransposition in this yeast data set. That is, wherever introns are present, both yeast paralogs bear them, whereas a retrotransposed copy would be expected to lack introns. Duplications in experimental yeast populations are frequently translocative [
43,
50]. Furthermore, there is evidence that translocated segmental duplicates in yeast have enhanced stability relative to tandem duplications [
56]. Both of these factors likely contribute to the preponderance of yeast non-ohnologs residing on different chromosomes.
Functional diversification between paralogs can be effected by both coding and regulatory sequence divergence. Studies focusing on the absence/presence of a correlation between coding sequence divergence and expression divergence across a breadth of model organisms have yielded contrasting results, reporting the two variables as coupled (for example, [
36,
57-
59]) as well as decoupled [
35,
60-
62]. High levels of gene conversion and/or codon usage bias, which serve to homogenize the coding sequences of paralogs, would restrict the potential for expression and functional divergence between them if coding sequence evolution were the only contributing factor to functional diversification. Given these regimes of pervasive gene conversion and/or codon usage bias in the yeast genome,
cis-regulatory sequence divergence can greatly facilitate functional diversification of paralogs, independent of coding sequence divergence. Papp and colleagues [
63] demonstrated a rapid reduction in the number of shared
cis-regulatory motifs between yeast duplicates as a function of increasing synonymous divergence despite constancy in the total number of regulatory motifs. Our analysis of the 5' and 3' flanking regions of yeast paralogs reveals extremely limited sequence preservation, for ohnologs and non-ohnologs alike; 80% and 86% of yeast gene duplicate pairs have detectable sequence homology of only 0 to 10 bp in their 5' and 3' flanking regions, respectively. This diminished sequence identity between paralogs in their flanking regions can be explained by (i) sequence divergence of initially paralogous regions through mutational saturation over evolutionary time, (ii) deletions and other rearrangements, or (iii) a failure to inherit ancestral regulatory elements during the duplication process. Given that many of these gene duplicate pairs are thought to have arisen from a WGD event, the first two scenarios are the most likely explanations for the limited flanking region homology between putative ohnologs comprising this data set. Irrespective of the specific mechanism driving the divergence of flanking regions of
S. cerevisiae paralogs, there exists an appreciable potential for functional diversification between paralogs due to the lack of shared regulatory elements despite complete sequence homology across their ORFs. The causes for the lack of shared flanking region sequence between yeast paralogs are likely to differ for the ohnolog and non-ohnolog classes (rapid molecular divergence versus limited duplication span). However, the sequence divergence in flanking regions of both classes of yeast duplicates is likely to play an important role in driving expression divergence between yeast paralogs, despite the maintenance of sequence homology in their coding regions. Interestingly, ohnologs and non-ohnologs show both similarities and disparities with respect to their flanking region homology. Ohnologs and non-ohnologs were not found to be statistically different with respect to the extent of 5' sequence homology. These results are not in agreement with a previous study that found ohnologs to have more diverged upstream regulatory regions relative to non-ohnologs [
25], although this discrepancy between the two studies could be due to differences in both sample size and methodology. In contrast to our 5' flanking region results, there exists a significant difference in the extent of 3' sequence homology between these two classes of yeast duplicates, with ohnologs displaying far more restricted 3' flanking sequence homology relative to non-ohnologs. It is reasonable to suggest that this highly limited extent of homology in the downstream flanking regions of ohnologs is due to weaker selection for sequence conservation in this region relative to the upstream flanking sequence.