Full-length duplicate genes typically carry promoters, UTRs, RNA stability sites, and cellular targeting signals that are identical to their original singleton parents. As such, they will often emerge with an expression profile that is identical or highly similar to that of their ancestral sequences. Although dispersed duplicates and retrogenes offer a greater chance to recruit novel regulatory elements, their ability to change within-transcript regulatory elements may still be limited. Chimeric genes, however, combine portions of different transcripts, and as such they can shuffle regulatory elements and emerge with mRNA profiles that are distinct from their parental genes. This ability to generate regulatory novelty in addition to peptide changes can offer a wider range of phenotypic and adaptive outcomes than gene duplication.
We describe expression profiles, cellular targeting, protein domains, and population genetics of the chimeric and parental genes to determine the factors that influence the selective impacts of chimeric genes in D. melanogaster.
Expression
We previously identified 14 chimeric genes in
D. melanogaster. Seven of these genes are exceptionally young (
dS < 0.03) and are specific to
D. melanogaster; the remaining seven are older and appear to have been incorporated stably into the genome (
Rogers et al. 2009). Among the youngest chimeric genes, we can determine the expression consequences of chimeric gene formation by comparing the chimera with its parental genes. These young genes are newly formed and have not had sufficient evolutionary time to accumulate substantial nucleotide changes after formation. All regulatory material is inherited from either the 5′ or 3′ parental gene, and hence chimera expression patterns should reflect those of the parental genes. However, through the shuffling of different promoters, enhancers, and RNA stability sites, it is possible to create a number of distinct expression profiles.
Some chimeric genes display expression patterns that closely resemble only one parental gene. For example,
CG18853 has an expression profile that closely resembles that of the parental gene that donated the 5′ end and promoter. Here, chimeric gene formation resulted in a novel peptide that now appears in parallel with one parental gene, and the 3′ parental gene contributes little to expression patterns. Similarly,
CG32318 takes a portion of the 3′ parental gene,
CG9187, an S-phase regulator, and places it in an expression profile that mimics
CG9191, a kinesin protein involved in microtubule movements. The peptide is placed in a more limited context than its 3′ parent, thus allowing for specialization. The change appears to be neutral in this particular case, but the general phenomenon could have broad impacts on pleiotropic and selective constraints.
Table 1. Expression Pattern of Chimeric Genes
In some cases, however, chimeric genes can create fully distinct expression profiles through the shuffling of regulatory elements in the 5′ and 3′ ends. The young chimeric gene
CG12592 pulls the majority of its genetic material from the parental gene
CG12819. Across tissues, it is expressed similarly to the parental gene that formed its 5′ end,
CG18545, but its expression pattern across developmental time points is identical to that of the gene that donated the 3′ segment,
CG12819 or
sle, a peptide necessary for brain development. Here, we find that the peptide sequence belonging to
sle has changed context across tissues while maintaining its expression profile across time points. Hence, enhancers or stability sites that govern expression in male tissues must act independently of sites governing expression during development. Additionally,
Qtzl, a chimeric gene that was involved in a recent selective sweep, has an expression profile that is distinct from either parental gene. The expression profile largely mimics that of the 3′ parent,
escl, but due to the novel combination of regulatory elements, it is expressed in the heads of adult males as well as in late embryonic stages (
Rogers et al. 2010;
supplementary table S4,
Supplementary Material online).
Finally, one chimeric gene is expressed in cases where both parental genes are silenced.
CG11961 shows expression in the testes, late larvae, and whole adult females, in contrast to both parentals. It is expressed in every tissue and life stage examined, with the exception of newly fertilized embryos less than 2 h old (
supplementary table S5,
Supplementary Material online). Assuming that this chimeric gene formed through tandem duplication (
Rogers et al. 2009), the gene that donated the 3′ end of the gene has relocated, giving
CG11961 new upstream material and
CG30049 new downstream material. At present
CG30049 is expressed only in pupae and in the carcass of adult males (
supplementary table S5,
Supplementary Material online).
CG9416, which donated the 5′ end of the gene, is expressed across most stages and tissues but is not found in testes, late larvae, or whole adult females (
supplementary table S5,
Supplementary Material online). Thus,
CG11961 has an expression profile that differs from those of its parental genes in several respects. Whether this expression represents neofunctionalization or partitioning of ancestral expression patterns cannot be determined from present data.
As a comparison for these newly formed chimeras, we used the modENCODE unique mapping data set to identify mRNA profiles for our 37 duplicate gene pairs with
dS < 0.03. Seven of these did not have modENCODE expression data because of changes in gene annotations, possible gene silencing, or lack of uniquely mapping reads. Of the remaining 30, only one duplicate gene pair shows evidence for qualitative differential expression comparable to that observed in chimeric genes, although even this change involves a relatively minor difference in the timing of expression (
supplementary table S14,
Supplementary Material online). A Fisher's exact test of these ratios yields
P = 0.0388. Hence, although duplicate genes may be able to produce quantitative changes in gene expression, their ability to generate novel expression profiles is extremely limited in comparison with chimeric genes. Thus, chimeric genes may be a richer source of genetic novelty that can influence evolutionary outcomes in profound ways.
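The comparison above can be illustrated with a minimal sketch of a two-sided Fisher's exact test in Python. Only the duplicate-gene counts (1 of 30 pairs) are taken from the text. The chimeric-gene counts used here (3 of 7) are an assumption made purely for illustration, because this section does not state the exact contingency table that was tested, so the output is not expected to match the reported P value. The function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

# ASSUMED counts for chimeras (3 novel profiles of 7); duplicates from the text (1 of 30).
p_value = fisher_exact_two_sided(3, 4, 1, 29)
print(round(p_value, 4))  # about 0.0164 for these assumed counts
```

With a different assumed split for the chimeric genes the P value changes, which is why the table itself, and not only the P value, is needed to reproduce the published test.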
Targeting
Beyond changes in transcription and RNA stability, different peptides can be targeted to different compartments within the cell, opening up a greater diversity of profiles and functions. For example,
Qtzl inherits a mitochondrial target sequence from the 5′ parental gene, allowing this new sequence to be targeted to the organelle (
Rogers et al. 2010). Based on TargetP predictions, several other chimeric genes have experienced targeting changes.
The chimeric gene
CG31687 appears to inherit a mitochondrial target signal from the 5′ gene
CG2508, whereas its 3′ parental
CG31688 is targeted to the cytoplasm. Conversely, the chimeric gene
CG18217 appears to be targeted to the cytoplasm, whereas the 3′ parent
CG4098 is predicted as a secreted peptide, a signal that likely reflects a nuclear targeting signal for the
CG4098 DNA repair peptide. The 5′ parent of
CG18217, designated
CG17286, is also targeted to the cytoplasm. These changes can broaden or narrow the cellular context of a particular peptide, influencing phenotypic outcomes as well as subsequent evolution.
Some of these chimeric genes appear to be selectively favored, whereas others are consistent with neutral processes. Regardless of the selective impacts of individual genes, the ability of chimeric genes to modify the cellular context of a peptide and to target a sequence to various cell compartments should allow for a diverse range of phenotypes as a consequence of mutations. As typical duplicate genes carry the entire protein sequence of their singleton ancestors, they will be unable to effect similar changes in cellular targeting. Hence, chimeric gene formation should be able to affect a wider range of phenotypic outcomes than gene duplication.
Shuffling Membrane-Bound Domains
Classic views on exon shuffling have focused largely on recombination of whole conserved protein domains. However, changes on a finer scale below functional domains may be equally important in the development of novel peptide structures. Membrane-bound domains provide short modular units whose presence or absence can significantly impact peptide functions (
Tusnady and Simon 1998). We used the HMMTop webserver to identify membrane-bound domains in each of our chimeric and parental genes to explore the potential for changes in membrane anchoring and orientation.
CG31904 contains a major part of
Acp1, but with orientation reversed.
CG13796, the other parental gene, is predicted to have neurotransmitter transporter activity and has a total of 12 predicted transmembrane helices. Adult cuticular protein 1 (Acp1) is a cuticular protein component expressed in the heads and thorax of adult
D. melanogaster (
Qiu and Hardin 1995). The chimera inherits three transmembrane helices from the neurotransmitter transporter and one transmembrane domain from
Acp1. All three proteins are predicted to have an N-terminus inside the cell. The resulting protein carries the majority of
Acp1 now oriented inside, rather than outside, the cell membrane.
Based on a worldwide sample of D. melanogaster, the chimeric gene appears to be absent in many lines, and measures of nucleotide diversity and site frequency spectra suggest that these particular changes were likely neutral, or nearly neutral, consistent with the general inert properties of cuticular peptides. We see no evidence of pseudogenization that might suggest ancient origins. Yet again, this particular type of change where portions of proteins change transmembrane status and orientation through the combination of different proteins could in some cases produce structures with unique functional attributes, especially when modifying more biochemically active proteins.
Mid-domain Breaks
Similarly, much of the exon-shuffling literature asserts that recombination between domains is far more likely to be favorable than mid-domain breaks. However, recent work has shown that breakpoints within domains can produce functional peptides as well (
Mody et al. 2009). We examined chimeric genes to assess their propensity to generate and tolerate breakpoints within conserved protein domains rather than whole-domain shuffling.
Of the seven youngest chimeric genes, we have found three where breakpoints occur within, rather than outside of, protein domains.
CG31904 disrupts a sodium neurotransmitter domain and pairs this segment with
Acp1. As discussed above, this change has altered the membrane orientation of the protein, reflecting protein modularity in secondary structure below the domain level. Similarly,
CG18853 displays a mid-domain break that disrupts an uncharacterized conserved domain as well as an FAD-binding segment. Finally,
CG32318 breaks apart a kinesin domain, combining this 5′ end with a portion of a cell regulatory peptide. All seven young chimeric genes form from unrelated peptide sequences that house fully distinct domains. These results suggest that the functional units of peptide modularity lie at a smaller scale than previously thought and that peptide structures may be fairly amenable to modification.
When we examine the older chimeric genes that are stably incorporated into the genome, we see a very different pattern. All of the preserved chimeras where protein domain data are available in Pfam have formed from parental proteins that display amino acid similarity to the same conserved domains, and all align in a BLASTp search. Many of these chimeras and parental genes have distinct expression profiles that prevent total functional overlap. However, conserved domains appear to be identical, and the locations of membrane-bound helices predicted by HMMTop appear to be the same. Using a Bayesian binomial approach, we can determine that the one-sided 95% confidence interval (CI) for the rate at which chimeras form from related parental genes must lie below $p = 0.26242$. The probability of choosing seven of these to be retained and choosing none of the other types is $P < 0.26242^{7} \approx 8.6 \times 10^{-5}$. Hence, the overrepresentation of chimeras formed from similar peptides among the preserved chimeras is highly significant.
All preserved chimeric genes where
dN and
dS estimates are available show strong constraint in amino acid substitutions on multiple branches of the tree, suggesting that chimeras are not entirely functionally redundant with their parental sequences. One chimera,
CG31688, may show signs of higher substitution rates on the most recent branches, although alignment is ambiguous for short sections of the gene in
D. simulans and
D. sechellia, possibly inflating
dS. The gene still displays constraint on all ancestral branches.
The methods we used to identify chimeric genes are biased against new genes that subsequently duplicate to form their own gene families. As such, adaptive chimeric genes that have proliferated within the genome may be underrepresented, partially explaining the discordance. We modified our search of chimeric genes to allow for subsequent duplication and found that all of the older chimeric genes still form from highly similar parents (
supplementary text,
Supplementary Material online). Hence, the disparity is not due to this particular aspect of our chimeric gene identification methods.
Selection
If the formation of chimeric genes is indeed a key contributor to adaptive evolution, then we should observe signals of positive selection surrounding the youngest chimeric genes. After selective sweeps, where a favored sequence spreads quickly through the population, we should observe statistical signals that include reduced nucleotide diversity and highly skewed site frequency spectra (
Tajima 1989).
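As a rough illustration of the statistic referenced above, the following Python sketch computes Tajima's D (Tajima 1989) from a set of aligned sequences by comparing mean pairwise diversity with the diversity expected from the number of segregating sites. The function name and the toy alignments are hypothetical, and the sketch assumes gap-free sequences of equal length; it is not the analysis pipeline used in this study.

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of equal-length, gap-free aligned sequences.

    Returns None when the statistic is undefined (fewer than four
    sequences or no segregating sites).
    """
    n = len(seqs)
    length = len(seqs[0])
    # Number of segregating (polymorphic) sites.
    seg = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    if n < 4 or seg == 0:
        return None
    # Mean number of pairwise differences across all sequence pairs.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)
    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))

# Toy alignment in which every variant is a singleton (excess of rare alleles).
singletons = ["".join("T" if i == j else "A" for j in range(10)) for i in range(10)]
# Toy alignment with variants at intermediate frequency.
balanced = ["AAAA"] * 5 + ["TTTT"] * 5
print(tajimas_d(singletons) < 0, tajimas_d(balanced) > 0)  # True True
```

An excess of rare variants, as expected after a recent sweep, drives the statistic negative, whereas variants at intermediate frequency drive it positive.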
The chimeric gene
CG18217 appears to have formed recently in
D. melanogaster and is not shared with any other
Drosophila species. In spite of having formed very recently, it appears to have risen to high frequency worldwide. It is found in 9/10 African strains and 11/12 strains from a worldwide collection. Furthermore, 37/37 Raleigh strains from the DPGP release 1.0 show sequencing reads that span the unique chimera boundary. Assuming that presence or absence in each strain is an independent Bernoulli trial, and given a uniform prior distribution on population frequency, we estimate that the worldwide frequency of
CG18217 falls between 0.8847 and 0.9896 (95% two-sided CI).
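The interval above can be approximated with a short Python sketch. It assumes that the three samples are pooled (57 of 59 strains carrying the chimera) and that the interval is equal-tailed; both are assumptions about how the estimate was obtained. With a uniform prior, the posterior is Beta(58, 3), and for integer parameters its cumulative distribution can be written as a binomial tail, which avoids any dependence on external libraries. Function names are hypothetical.

```python
from math import comb

def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(a, n + 1))

def beta_quantile(q, a, b, tol=1e-10):
    """Quantile of Beta(a, b) found by bisection on the CDF."""
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if beta_cdf_int(mid, a, b) < q:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# Counts from the text, pooled across samples (pooling is an assumption).
present = 9 + 11 + 37
sampled = 10 + 12 + 37
# Uniform prior Beta(1, 1) updated with Bernoulli observations.
a, b = present + 1, sampled - present + 1
lower = beta_quantile(0.025, a, b)
upper = beta_quantile(0.975, a, b)
print(round(lower, 3), round(upper, 3))
```

Under these assumptions the bounds come out close to the reported values of 0.8847 and 0.9896.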
CG18217 also lies near the bottom of a wide valley in diversity on chromosome 3L. Tajima's
D in the region approaches −2.5, indicating highly skewed site frequency spectra. The reduction in diversity spans roughly 40 kb, and the chimeric gene lies at the center of the sweep. The lower boundary is abnormally flat and wide, with sharp slopes, a product of the low recombination rate within this region (
Singh et al. 2005). Fitting the Kaplan–Langley equations, which assume a single sweep and simple demography, the sweep appears to have occurred around 20,000 years ago just prior to the migration out of Africa with a selective coefficient of 0.6%.
CG18217 has been changed to a pseudogene annotation in the most recent
D. melanogaster genome releases. Yet the gene clearly aligns to known expressed sequence tags, and transcripts amplified from polyA preparations show the presence of correctly spliced introns. Furthermore, the associated coding sequence contains no premature stop codons. Considering the current evidence, as well as its presence in a region that has experienced a selective sweep, we expect that this gene could well be functional. The 5′ end of
CG18217 is derived from
spd-2, an essential component of the centrioles required for formation of the spindle. It is active during the earliest stages of mitosis and meiosis (
Giansanti et al. 2008). The
spd-2 mRNA is strongly expressed in developing embryos, pupae, and adult females. It appears to be moderately expressed in whole adult males (
supplementary table S1,
Supplementary Material online).
CG18217 contains a NUDIX DNA repair domain in its 3′ end, which is derived from the parental gene
CG4098.
CG4098 is most strongly expressed in pupae and adult females (
supplementary table S1,
Supplementary Material online). We were unable to amplify
CG4098 from cDNA derived from adult male testes or carcasses, although it was successfully amplified from cDNA from adult heads (
supplementary table S1,
Supplementary Material online).
P-element insertions for
CG18217 are listed in FlyBase as viable and fertile, as would be expected for a newly formed gene with partial redundancy in the genome.
It is entirely possible that the combination of this DNA-repair domain with a regulatory element that functions just before cell division could be advantageous in preventing cellular errors. NUDIX domains have also been implicated in small molecule signaling (
McLennan 1999), which could produce new phenotypic effects. Alternatively, epistatic interactions between the separate sections of the peptide, a common consequence of domain tethering (
Bashton and Chothia 2007), could result in a new function.
Another chimeric gene,
CG18853, also lies in a valley of reduced diversity and shows skewed site frequency spectra on chromosome 2R. The reduction in diversity appears to span roughly 45 kb. Such a reduction in diversity is consistent with a single sweep occurring around 200,000 years ago with a selection coefficient of 0.25%.
CG18853 houses portions of two protein domains but is characterized by an unusual breakpoint that lies within two domains. Whether or not this gene is functional remains uncertain. Transposable element insertion lines are listed in FlyBase as viable and fertile; however, this is again expected for a newly formed gene.
The parental peptide
CG12822 carries a conserved domain of unknown function that is found in vertebrates as well as in multiple bacteria. The human ortholog of
CG12822, Nef-associated protein 1, is a thioesterase that interacts with HIV protein Nef (
Liu et al. 1997). The remainder of the peptide contains an FAD-binding domain derived from a photolyase (
phr). The boundary of chimera formation falls within these two domains, resulting in a chimera that combines portions of domains rather than whole-domain shuffling (fig. ). How the different portions of these domains interact is not known. Still, resistance to viruses and similar pathogens could create an opportunity for an evolutionary arms race that might generate strong selection to fix new genes but result in selective pressures that are transient, consistent with the patterns observed in chimeric genes.
In each of these cases, we cannot be entirely certain that the locus of selection lies in the chimeric gene, a common problem with scans of selection. Furthermore, recent fixation of tightly linked duplicate genes through neutral processes can cause moderate reductions in diversity and Tajima's
D (
Thornton 2007). These types of effects may explain a portion of the signals that we see, but would be insufficient to produce reductions of this breadth or magnitude.
Two other chimeras,
CG12592 and
CG31668, display a less drastic reduction in diversity and somewhat skewed site frequency spectra, but both are several kilobases away from the local minimum and are not strong candidates for selective sweeps.
Table 3. Tajima's D for Chimeric and Parental Genes
Qtzl,
CG18217, and
CG18853 are all newly formed chimeric genes that are found at the center of selective sweeps in
D. melanogaster. We used a resampling approach, choosing seven genes at random from the two
D. melanogaster autosomes, to account for the likelihood of finding three selective sweeps among seven genes. Using a fairly liberal cutoff of Tajima's
D < −1.8, we found that 39 of 10,000 replicates had three or more genes that might potentially be involved in selective sweeps. This cutoff is far less stringent than that applied to any of our chimeric genes. It does not require that the gene be an outlier with respect to the region around it, and few of these supposed sweeps have the same breadth as those found at our chimeric genes. Thus, the probability of obtaining similar results by chance must be exceedingly low and is most certainly
P < 0.0039.
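The resampling procedure described above can be sketched in Python as follows. The function name, the default seed, and the input list are hypothetical; in practice the input would hold one Tajima's D value per autosomal gene, which is not reproduced here. The sketch draws seven genes without replacement in each replicate and reports the fraction of replicates in which at least three fall below the cutoff.

```python
import random

def resample_sweep_fraction(d_values, n_genes=7, min_hits=3,
                            cutoff=-1.8, replicates=10_000, seed=0):
    """Fraction of random gene sets with at least `min_hits` genes
    whose Tajima's D lies below `cutoff`."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    extreme = 0
    for _ in range(replicates):
        sample = rng.sample(d_values, n_genes)  # draw without replacement
        hits = sum(1 for d in sample if d < cutoff)
        if hits >= min_hits:
            extreme += 1
    return extreme / replicates

# Hypothetical input: a background in which 5% of genes fall below the cutoff.
toy_background = [-2.0] * 50 + [0.0] * 950
print(resample_sweep_fraction(toy_background))
```

The reported value of 39 in 10,000 corresponds to a returned fraction of 0.0039 when the genome-wide values are supplied.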
Comparing each of these chimeric genes with their parental sequences, we find that they each have
dS < 0.03. During this time frame, an estimated 15.5 chimeric genes will have formed (
Rogers et al. 2009), suggesting that 19.3% of chimeric genes are subject to selective sweeps just after formation. This contrasts with a frequency of preservation of 1.4% (
Rogers et al. 2009). This disparity between the frequency of fixation due to selective sweeps and the frequency of preservation, combined with the disparity in domain structures for newly formed and preserved chimeric genes, strongly suggests that adaptation and gene preservation are largely distinct phenomena (see Discussion).
In contrast, out of 37 pairs of duplicate genes with
dS < 0.03 not located in regions where chimeric genes formed, we found four pairs with Tajima's
D < −1.8. In this time frame, an estimated 104.1 duplicate genes will have formed (
Rogers et al. 2009), suggesting that the frequency with which new duplicate genes are involved in selective sweeps is only 3.8%. Again, this requirement is far less stringent than the criteria used for selective sweeps on chimeric genes and may overestimate the contribution of young duplicate genes in adaptation. We performed 10,000 jackknife replicates, choosing seven genes without replacement from the list of duplicate genes with
dS < 0.03. Of these replicates, 168 had three or more pairs with Tajima's
D < −1.8, indicating that
P < 0.0168. Hence, the overrepresentation of chimeric genes in regions associated with selective sweeps in comparison with duplicates is significant. Thus, chimeric genes are substantially more likely to be involved in selective sweeps than young duplicate genes and therefore offer a substantially richer source of genetic material for adaptation in the near term.
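Because the draws are made without replacement from a finite list, the resampled proportion can be cross-checked against an exact hypergeometric tail. The Python sketch below uses only the counts given in the text (37 duplicate pairs, 4 with Tajima's D below the cutoff, 7 drawn, at least 3 hits); the function name is hypothetical.

```python
from math import comb

def hypergeom_tail(population, successes, draws, min_hits):
    """Probability of at least `min_hits` successes when drawing
    `draws` items without replacement from `population` items,
    `successes` of which are successes."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(min_hits, upper + 1)
    )
    return favorable / total

p_exact = hypergeom_tail(37, 4, 7, 3)
print(round(p_exact, 4))  # 0.0164
```

The exact value of about 0.0164 agrees with the resampled proportion of 168 in 10,000 to within sampling error.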