The most common form of protein-coding gene overlap in eukaryotes is a simple nested structure, whereby one gene is embedded in an intron of another. Analysis of nested protein-coding genes in vertebrates, fruit flies and nematodes revealed substantially higher rates of evolutionary gains than losses. The accumulation of nested gene structures could not be attributed to any obvious functional relationships between the genes involved and represents an increase of the organizational complexity of animal genomes via a neutral process.
Eukaryotes are typically more complex than prokaryotes on the molecular, systems and phenotypic scales of biological organization. In particular, genomes of multicellular eukaryotes possess a complex architecture that involves substantial overlapping of their transcribed regions [1–4] and protein-coding genes [5–7], forming an interleaving mosaic of exon and intron sequences. Although it is clear that such complex genome organization is made possible by the presence of introns, the rates and mechanisms of evolutionary events leading to gains and losses of overlapping gene arrangements have not been studied previously.
Previous studies of the evolution of genome complexity have primarily relied on correlations between the abundances of various genomic elements (introns, transposons, gene size, etc.) and the product of the effective population size and mutation rate [8–13]. However, claims of causality based on such correlative analyses are always inconclusive, because other potentially important factors can never be excluded. In an attempt to circumvent some of the limitations of the correlative approach, we explored the evolution of genomic complexity in a more direct manner, by tracing the evolutionary dynamics of nested pairs of protein-coding genes in animals. This study covers only one, perhaps not even the most common, class of interleaved gene arrangements because we left out the numerous intron-contained small RNA genes. Nevertheless, even this limited analysis clearly reveals the ongoing increase of the organizational complexity of animal genomes and suggests that this process occurs via a nonselective route.
The most common form of overlap between protein-coding genes in eukaryotes is a nested gene structure, and in a majority of such structures, the internal gene lies entirely within one intron of the external gene [6,7]. Thus, we investigated the evolution of this class of nested gene structures in vertebrates, Drosophila and Caenorhabditis. A search of NCBI annotation records yielded 428, 815, 440 and 608 nested gene pairs in H. sapiens, D. melanogaster, C. elegans and C. briggsae genomes, respectively. After eliminating gene pairs that might have been misannotated (see Supplementary Material), we arrived at sets of 128, 792, 429 and 233 nested gene pairs, respectively. Only a small minority of the protein sequences encoded by internal genes from each of these three major taxa show significant sequence similarity to internal gene products in the other two taxa (data not shown), suggesting either that these structures emerged independently and relatively late during evolution or that they were extensively and repeatedly lost.
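The class of nested structure studied here, an internal gene lying entirely within one intron of an external gene, can be expressed as a simple interval test. The following is a minimal sketch only: the `Gene` record, the `host_intron` helper and all coordinates are hypothetical and invented for illustration, and it does not reproduce the NCBI annotation search or the misannotation filtering described above.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int      # first base of the gene (inclusive)
    end: int        # last base of the gene (inclusive)
    introns: list   # (start, end) intron intervals, inclusive

def host_intron(internal: Gene, external: Gene):
    """Return the intron of `external` that fully contains `internal`, else None."""
    for i_start, i_end in external.introns:
        if i_start <= internal.start and internal.end <= i_end:
            return (i_start, i_end)
    return None

# Toy coordinates, invented purely for illustration.
outer = Gene("outer", 100, 5000, [(300, 2000), (2500, 4500)])
inner = Gene("inner", 2600, 3900, [])
straddler = Gene("straddler", 1900, 2600, [])  # overlaps an exon of `outer`

print(host_intron(inner, outer))      # the second intron contains `inner`
print(host_intron(straddler, outer))  # no single intron contains `straddler`
```

A gene that spans an exon of the external gene, like `straddler`, is an overlap of a different kind and is excluded by this test.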
By examining gene annotations and constructing sequence alignments, we identified the closest species with a completely sequenced genome in which each nested gene structure was absent. Absence of the nested structure in an appropriate outgroup species indicates its emergence (gain) in the respective lineage, whereas presence of the nested structure in the outgroup indicates its loss (Figure 1). Gains were found in all three taxa, with the emergence of 55 internal genes in at least 40 independent events in vertebrates, 52 internal genes in at least 48 events in Drosophila and 22 internal genes in as many events in Caenorhabditis. The rate of these acquisitions was approximately uniform throughout the course of evolution (Figure 2). By contrast, losses of nested gene structures were much rarer, with none detected in vertebrates, 17 in Drosophila and 2 in Caenorhabditis.
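The outgroup logic described above can be summarized as a small decision rule. This sketch assumes the simplest three-taxon case (a focal species, a sister species and one outgroup) and a parsimony reading of presence and absence. The function name and return labels are hypothetical, and the actual analysis relied on annotations and sequence alignments rather than boolean flags.

```python
def classify_event(in_focal: bool, in_sister: bool, in_outgroup: bool) -> str:
    """Polarize a presence/absence difference between two sister lineages."""
    if in_focal == in_sister:
        # Both lineages agree, so no event is inferred between them.
        return "no change inferred"
    if in_outgroup:
        # The structure is ancestral and was lost in the lineage lacking it.
        return "loss"
    # The structure is absent from the outgroup, so it emerged in the
    # lineage that carries it.
    return "gain"

print(classify_event(True, False, False))
print(classify_event(True, False, True))
```

With a single outgroup this is only the most parsimonious reading; additional outgroups would be needed to rule out repeated independent events.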
At least four scenarios are plausible for the formation of a nested gene structure: (i) an internal gene can evolve by insertion of a DNA sequence into an intron of a pre-existing gene, (ii) an internal gene can evolve de novo from an intronic sequence of a pre-existing gene, (iii) a gene can become internal after an adjacent gene acquires an additional exon(s) or (iv) a gene can become internal after fusion of two genes that flank it from the opposite sides (Figure 3).
By comparing the gene structures and encoded protein sequences of internal and external genes to complete gene sets from the respective species, we deduced the mechanisms of formation of nested gene structures in vertebrates (Table 1). Nearly all nested gene structures seem to have emerged by insertion of a DNA sequence, which arose by gene duplication or retrotransposition, into an intron of a pre-existing gene. The origin of an internal gene was classified as a retrotransposition when it was intronless in a given species, whereas its non-nested ortholog in a sister species contained introns. A duplication at the DNA level was inferred when both the internal gene and a non-nested ortholog in a sister species had introns. In cases where the internal gene and a non-nested ortholog were both intronless, retrotransposition and duplication at the DNA level could not be discriminated. Five internal genes in humans are candidates for de novo origin from intron sequences (see Supplementary Data), including one gene with no sequence similarity beyond apes [placenta-specific 4 (PLAC4)] and another with no similarity beyond Old World monkeys [saitohin (STH)] (Table S1). Analysis of the 12 recently sequenced Drosophila genomes showed that the majority of de novo genes originate in introns. Consistent with this observation, we found 11 internal genes in D. melanogaster with no sequence similarity to any genes in the genome of the closely related D. yakuba. We did not identify any nested gene structures that evolved via the remaining two scenarios.
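The intron-based criteria used above to distinguish retrotransposition from DNA-level duplication amount to a three-way rule. The sketch below encodes only what the text states. The function name and labels are hypothetical, and the fourth combination (an intron-containing internal gene with an intronless ortholog) is not discussed in the text, so it is left unclassified here.

```python
def classify_origin(internal_has_introns: bool, ortholog_has_introns: bool) -> str:
    """Classify how an inserted internal gene arose, from intron presence alone."""
    if not internal_has_introns and ortholog_has_introns:
        # An intronless copy of an intron-containing gene.
        return "retrotransposition"
    if internal_has_introns and ortholog_has_introns:
        # Introns retained in both copies.
        return "DNA-level duplication"
    if not internal_has_introns and not ortholog_has_introns:
        # Both intronless: the two mechanisms cannot be discriminated.
        return "ambiguous"
    # Combination not covered by the criteria in the text.
    return "unclassified"

for flags in [(False, True), (True, True), (False, False)]:
    print(flags, "->", classify_origin(*flags))
```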
At least three hypotheses could explain the parallel accumulation of nested gene structures in different taxa. First, a nested structure might confer a selective advantage because of a functional or co-regulatory relationship between its members [16–20]. Second, according to the transcriptional collision model, members of a nested gene structure could interfere with each other’s transcription [21–23], resulting in alternative expression of these genes in different tissues or during different times in development. Finally, acquisition of a nested gene structure could be a neutral process [8–13,17], driven by the presence of numerous long introns that provide niches for insertion of genes. Each of these hypotheses leads to a distinct prediction about the relationship between the expression of internal and external genes in a nested pair. The functional co-regulation hypothesis predicts a positive correlation between levels of their expression in similar tissues, the transcriptional collision hypothesis predicts a negative correlation and the neutral hypothesis predicts no correlation.
To discriminate between these three hypotheses, we analyzed gene expression data from human and D. melanogaster genomes (see Supplementary Material). We compared correlations of gene expression in 109 and 752 nested gene pairs in humans and D. melanogaster, respectively, to 1000 random sets of 109 and 752 adjacent gene pairs from corresponding genomes. There was no significant difference in mean correlation coefficients of gene expression levels between nested and adjacent genes in either human (0.33 ± 0.03 for nested and 0.33 ± 0.0008 for adjacent pairs) or D. melanogaster (0.041 ± 0.014 for nested and 0.030 ± 0.00046 for adjacent gene pairs), which is consistent with the neutral hypothesis. The observation that external genes have substantially more and longer introns than average in the respective species (Ref. and Supplementary Material) is also compatible with the neutral hypothesis. Furthermore, examination of the available functional information for nested gene pairs (Table S1) did not reveal any obvious connections. Fixation of originally neutral or even slightly deleterious sequence segments, such as introns and transposable elements, through genetic drift acting in relatively small populations is a common phenomenon in eukaryotic evolution that might be partially responsible for the evolution of complex phenotypes [8–13]. The increase in organizational complexity of intron-rich genomes via emergence of nested gene structures seems to be another facet of this process.
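The comparison of nested pairs against 1000 random sets of adjacent pairs is a resampling design, and its general shape can be sketched as follows. This is an illustration under stated assumptions, not the procedure documented in the Supplementary Material: the function names are hypothetical, the correlation values are synthetic, and the two-sided empirical p-value is one common choice of test statistic.

```python
import random
import statistics

def resample_means(adjacent_corrs, n_pairs, n_sets=1000, seed=0):
    """Mean correlation in each of n_sets random sets of n_pairs adjacent pairs."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(adjacent_corrs, n_pairs))
            for _ in range(n_sets)]

def empirical_p(nested_corrs, adjacent_corrs, n_sets=1000, seed=0):
    """Compare the nested-pair mean with the resampled adjacent-pair means."""
    observed = statistics.mean(nested_corrs)
    null_means = resample_means(adjacent_corrs, len(nested_corrs), n_sets, seed)
    center = statistics.mean(null_means)
    # Count resampled means at least as far from the center as the observed one.
    extreme = sum(abs(m - center) >= abs(observed - center) for m in null_means)
    return observed, center, (extreme + 1) / (n_sets + 1)

# Synthetic correlation coefficients, clamped to [-1, 1]; not real data.
rng = random.Random(1)
adjacent = [max(-1.0, min(1.0, rng.gauss(0.33, 0.3))) for _ in range(5000)]
nested = rng.sample(adjacent, 109)  # drawn from the same distribution

observed, center, p = empirical_p(nested, adjacent)
print(f"nested mean = {observed:.3f}, adjacent mean = {center:.3f}, p = {p:.3f}")
```

Under the neutral hypothesis the nested-pair mean should fall well inside the spread of the resampled means, whereas co-regulation or transcriptional collision would push it toward the upper or lower tail, respectively.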
The neutral hypothesis implies that the preferential evolutionary gain of nested gene structures is caused by metazoan genomes being far from neutral equilibrium with respect to birth and death of intron-contained genes. We estimated the rate of acquisition of nested gene structures as ~0.4, 0.9 and 0.2 events per million years in the H. sapiens, D. melanogaster and C. elegans lineages, respectively (see Supplementary Material). Because animal genomes currently contain ~500–800 nested gene pairs, these rates indicate that nested gene structures began to emerge ~1 billion years ago, perhaps concurrent with the substantial intron gain that apparently occurred at the onset of metazoan evolution. These results suggest that metazoan introns are still far from saturation by internal genes and that the organizational complexity of metazoan genomes will continue to increase for many millions of years via the emergence of new nested gene structures. By the time metazoan genomes reach organizational complexity equilibrium, the overlap of functional elements is expected to be much greater than what we observe in extant taxa and will probably include numerous Russian doll-like nested structures. This process has already begun in fruit flies, with the D. melanogaster genome containing six cases where a nested gene structure is nested in another gene.
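The ~1 billion year figure can be checked to order of magnitude by dividing the number of nested pairs by the acquisition rate. This sketch assumes a constant rate, negligible losses and a starting count of zero. The helper name and the pairing of counts with rates are assumptions made here for illustration; the detailed estimate is in the Supplementary Material.

```python
def elapsed_my(n_pairs: float, events_per_my: float) -> float:
    """Millions of years needed to accumulate n_pairs at a constant gain rate."""
    return n_pairs / events_per_my

# Ends of the ~500-800 pair range paired with the human and fly rates.
for n_pairs, rate in [(500, 0.4), (800, 0.9)]:
    print(f"{n_pairs} pairs at {rate} events/My: "
          f"~{round(elapsed_my(n_pairs, rate))} My")
```

Both pairings give values on the order of 1000 million years; the slower C. elegans rate of 0.2 events per million years would imply a longer interval under the same assumptions.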
We have shown that the evolution of metazoan genomes is accompanied by a steady rise in the prevalence of nested arrangements of protein-coding genes, leading to increasingly complex genome architectures. In addition to nested protein-coding genes, animal genomes contain numerous complex arrangements, including incomplete overlaps of protein-coding regions and their untranslated regions and various RNA genes [2–6,25]. In particular, a substantial fraction of microRNA and small nucleolar RNA genes are either fully contained within introns of protein-coding genes or overlap with protein-coding exons. It will be of major interest to determine whether the trend of increasingly complex genome organization reported here applies to RNA genes or incompletely overlapping gene structures.
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.tig.2008.08.003.