Using a hypothesis-driven comparative genomics approach, we detected a number of exonic remnants which, prior to the WGD in the teleost lineage, were likely bifunctional—coding exons doubling as regulatory elements or parts thereof. We corroborated this observation by showing evidence that the corresponding exons in mammals are still under both coding and non-coding selection pressure. The non-coding pressure was indicated by their significantly decreased nucleotide substitution rates and nucleotide distances of synonymous sites, when compared to neutrally evolving and protein-coding regions in the same genomic regions.
The idea that some coding exons might be under a combination of coding and non-coding selection pressure has recently received some attention. Xing and Lee (
47,
48) demonstrated that non-coding selection pressure can distort
Ka/Ks values, making the metric unsuitable for annotating some exons in the genome or estimating the functional significance of amino acid residues encoded by them. More recently, several different probabilistic models were suggested for exons under different modes of selection pressure (
4,
19,
49).
In particular, many facultative (occasionally skipped) exons were shown to have a high conservation of synonymous sites (
50,
51), presumably because the coding information overlaps regulatory signals governing inclusion or skipping of these exons during splicing. However, under our model, this explanation for the non-coding conservation component is implausible, since we explicitly detected exon remnants that lack evidence of transcription in zebrafish according to the UCSC genome browser ‘known zebrafish spliced ESTs’ and mRNA annotation (accessed 22 May 2009, ‘Materials and Methods’ section).
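The distortion of Ka/Ks by non-coding selection can be illustrated with a minimal numerical sketch. This is not the method used in this study or by Xing and Lee. The rate values and the parameter `synonymous_constraint` (the fraction of synonymous substitutions removed by non-coding purifying selection) are hypothetical, and the sketch assumes for simplicity that the non-coding constraint acts on synonymous sites only.

```python
def ka_ks(ka: float, ks: float) -> float:
    """Return the Ka/Ks ratio for given nonsynonymous (Ka) and synonymous (Ks) rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive")
    return ka / ks


def observed_ka_ks(ka: float, neutral_ks: float, synonymous_constraint: float) -> float:
    """Ka/Ks when a fraction of synonymous substitutions is removed by non-coding selection.

    synonymous_constraint is a hypothetical illustrative parameter in [0, 1).
    """
    if not 0.0 <= synonymous_constraint < 1.0:
        raise ValueError("synonymous_constraint must be in [0, 1)")
    # Non-coding purifying selection lowers the observed synonymous rate.
    observed_ks = neutral_ks * (1.0 - synonymous_constraint)
    return ka_ks(ka, observed_ks)


# Hypothetical exon under strong protein-level constraint (illustrative values).
baseline = observed_ka_ks(ka=0.05, neutral_ks=0.5, synonymous_constraint=0.0)
# Same exon, with 80% of synonymous substitutions removed by non-coding selection.
overlapped = observed_ka_ks(ka=0.05, neutral_ks=0.5, synonymous_constraint=0.8)

print(f"Ka/Ks without non-coding constraint: {baseline:.2f}")
print(f"Ka/Ks with non-coding constraint:    {overlapped:.2f}")
```

With these assumed values the ratio rises from about 0.1 to about 0.5 although the protein-level constraint is unchanged, so the exon would appear to be under much weaker coding selection than it actually is.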
These observations imply that additional (non-coding) purifying selection pressure acts on RCE regions. This does not necessarily mean that all RCEs in our set have been subject to evolutionary constraint throughout the ~500 Myr separating humans and zebrafish from their last common ancestor. While it is possible that some exonic remnants are wholly or partly unannotated non-coding RNA, and that others have only recently lost their protein-coding ability, the available sequence evidence (including the absence of most of the other exons of the ancestral gene, frequent disruption of ancestral splice sites, and lack of EST support) indicates that this is a highly unlikely explanation for the majority of detected cases.
If the RCE regions have been subject to extra purifying selection from non-coding functional components, what is their function? Like the HCNEs that function as long-range regulatory sequences for their target gene(s) (
2,
5), the RCE regions appear to be part of the same array of conserved elements around a target gene responsive to long-range developmental regulation. Many of those elements have been shown to possess enhancer activity [from 50% in mouse (
4,
42) to close to 80% in zebrafish reporter assays (
27,
52)]. The conservation of detected RCEs often extends significantly into one or both of the flanking introns in tetrapod genomes, which indicates that the whole region must have been recruited into its non-coding function at some point. The fact that (part of) the region coded for a functional part of a protein was apparently not an obstacle (
Supplementary Table S4). This does not necessarily imply that the exons that gave rise to RCEs, or their still-exonic orthologs in tetrapods, are regulatory along their entire length. The most we can claim without additional evidence is that the part of the ancestral exon retained as an exonic remnant in zebrafish most likely has a regulatory function.
Overlap between coding and regulatory sequence has been observed in genomes of bacteria (
53) and viruses (
54–57), and was explained as a way to minimize genome size. For vertebrates, where protein-coding regions make up only a small percentage of the genome, overlap between coding and regulatory sequence is unlikely to be a space-saving strategy. Even so, the number of reported individual cases of such arrangements is growing. An early study revealed that interaction of transcription factor
B-Myb with
HSS8 (a hypersensitive site mapped to exon 2 of the
Bcl-2 gene) may enhance
Bcl-2 gene expression by cooperating with its promoter (
58). Barthel and Liu (
59) computationally identified a regulatory region associated with the gene
ADAMTS5 that encompasses the entirety of the essential coding exon 2. The
APOE gene was also found to contain an enhancer in its coding region for the E4 allele, which is associated with Alzheimer’s disease (
60).
In this work, we did not attempt to find the RCEs overlapping the exons of the GRB target genes, since they cannot be detected as exonic remnants under non-coding selection. However, the high density of HCNEs in introns of target genes, together with the low rate of synonymous substitution at many of their exons, indicates that exons of GRB targets might often overlap their own regulatory elements. The recently reported ultraconserved element in Hoxa2 (
10) is one example of this. On the other hand, even though exons can be targets of RNA-mediated posttranscriptional regulation (
10,
61), this type of regulation requires the RCE to be transcribed and therefore cannot explain the selection pressure on the isolated and apparently untranscribed exonic remnants studied in this article.
Our results add support to the idea that HCNEs were recruited from existing sequences within regulatory reach of their target genes. A recent study demonstrated that a large number of repeat elements in regions that we now know as GRBs are also undergoing purifying selection (
7). These findings should provide an incentive to test experimentally the detected exon remnants in zebrafish and their orthologs in human for the presence of enhancer activity. Suitable test systems exist in zebrafish (
62), medaka (
63) and mouse (
4). If these elements prove able to drive expression in a spatiotemporal pattern that recapitulates a subset of the expression patterns of the neighboring gene, we would have to modify our view of both how protein sequences evolve and where to look for regulatory elements in vertebrate genomes. For protein sequences, it would mean that the non-coding component might mask the effect of selection at the protein level to an extent where it becomes difficult to draw conclusions about the functional importance of a part of a protein sequence based on its evolutionary conservation. For regulatory information, it would demonstrate that these exons are an integral part of the arrays of HCNEs, and that the non-coding component of the selection pressure acting on them is equivalent to the pressure that has kept HCNEs highly conserved for hundreds of millions of years. It would also suggest that the bystander genes were in place (i.e. in synteny with the neighboring HCNE target) before the HCNEs themselves appeared.