In the pre-genomic era the number of genes per genome was predicted using a variety of techniques, sometimes by extrapolation from other species adjusted for the perceived 'complexity' of a given species. Thus, the human genome was predicted to contain over 100,000 genes [39]. Surprisingly, after genome sequencing and annotation this number fell to a quarter of this estimate [40
]. In Drosophila
, the annotation indicated 13,907 protein-coding genes mapped to chromosomes (FlyBase release 5 [21
]; as accessed in October 2011), which is in between old estimates based on number of chromosomal bands and saturation mutagenesis of defined regions (5,000 genes) [41
], and those based on estimation of non-repetitive DNA content (80,000 genes) [42
], as reviewed in [43
]. How can a large number of functions be accommodated by a small number of genes? Several answers have been proposed: complex genetic interactions, alternative splicing, and non-protein-coding genes [44]. Another, less discussed possibility is that non-canonical coding sequences escape detection by genome annotation pipelines. Already an overwhelming number of RNAs in humans and mice are deemed 'non-coding' (ENCODE data). It could well be that many of these actually encode translated and functional smORFs. For example, the putatively non-coding gene HAR1F [45], which has undergone evolutionary acceleration in humans and is therefore considered a candidate to participate in the acquisition of human-specific traits, contains several smORFs (JP Couso, unpublished observations). In fact, if putative smORFs were distributed evenly every 400 bp on average, as we find in Drosophila, most non-coding RNAs would be expected to contain smORFs.
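The initial enumeration step behind such a density estimate can be illustrated with a minimal sketch. It is not the pipeline used in this study: it scans only the forward strand, and the 10 to 100 codon window, the function name and the toy sequence are all assumptions made for illustration.

```python
# Minimal sketch: enumerate putative smORFs (ATG ... stop) on the forward strand.
# The 10-100 codon window is an illustrative assumption, not the study's cutoff.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_smorfs(seq, min_codons=10, max_codons=100):
    """Return (start, end, codon_count) for each ORF in the size window.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    codon_count excludes the stop codon. Only the first ATG after each
    in-frame stop is used, so nested ORFs sharing a stop are not reported.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((start, i + 3, n_codons))
                start = None  # close the ORF and look for the next ATG
    return sorted(hits)


# Made-up toy sequence: one 12-codon ORF flanked by two bases on each side.
toy_seq = "CC" + "ATG" + "GCT" * 11 + "TAA" + "GG"
toy_hits = find_smorfs(toy_seq)
```

Dividing the length of a scanned region by the number of hits gives an average spacing that can be compared with the figure of one putative smORF every 400 bp quoted above. A raw scan of this kind says nothing about function, which is why conservation and transcription filters are applied afterwards.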
Even though programs for yeast genome annotation initially excluded sequences coding for less than 100 codons [1
], yeast remains the best-studied genome for the presence of smORF genes [46
]. These searches used computational methods for an initial screen of small ORFs and then tested whether the candidate smORFs are expressed, using a variety of expression-detection methods [18
]. More recently, Kastenmayer et al
] collected all potential smORFs from the literature with evidence of either translation or transcription and concluded that the S. cerevisiae
genome has 299 potentially functional smORFs, which would represent 5% of the genes in yeast. Of these, at least 168 (67% of those tested; Table ) were found to have evidence of translation into peptides.
Comparison of smORF searches in different eukaryotic organisms
A recent study of the genome of A. thaliana
revealed 3,241 smORFs that show evidence for either transcription or purifying selection [11
]. This estimate is an order of magnitude higher than in yeast; however, Hanada et al
] reach a number of 3,241 smORF genes by pooling together smORFs with evidence of either transcription or amino acid sequence conservation (a version of the Ka/Ks < 1 test). We would therefore consider this number a maximum estimate, similar to our 4,561 conserved smORFs. However, Hanada et al
. point out that the actual number of smORFs with evidence of both transcription and amino acid conservation is 941. This is about 5% of the Arabidopsis
genes, equal to the fraction postulated in yeast. To date no functional validation of these estimates has been reported.
The study of Frith et al
] used a modification of the CRITICA gene-detection program suite, originally designed for prokaryote genomes, to detect smORFs in the RIKEN Mus musculus
cDNA collection. CRITICA's entry point to identify coding sequences is sequence similarity between species, and in particular conservation of coding content of putative ORFs. This study is thus comparable to our own pipeline, in that sequence similarity, coding conservation and evidence of transcription are considered indicators of smORF functionality, and used as filters for their detection. It is interesting, therefore, that the volume of putative new, smORF-encoding sequences found in mouse by Frith et al
. is again very similar, at 5% of the already annotated fraction of canonical genes with ORFs longer than 100 amino acids. The total number of smORFs (1,240) upon which this fraction is based might be an underestimate, as small cDNAs tend to be eliminated from cDNA collections (as they may represent truncated cDNAs), but Frith et al
] detect no bias in the size distribution of their smORF-containing cDNAs. In that study, the authors assessed the translation of a sample of these new smORFs, and found evidence of translation for 14 out of 25 smORFs assessed (56%), a fraction very similar to that found in S. cerevisiae
(67%). It is important to consider that the methods applied (green fluorescent protein tagging in mouse and flies, and a mixture of tagging and proteomics in yeast) can still produce false negatives. Even after applying these fractions as correctors for the total number of real smORFs in eukaryotes, the numbers still hover at around 2 to 3.5% of annotated protein-coding genes, that is, several hundred new coding sequences per organism (Table ).
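The correction described above can be written out as a short calculation. This is a sketch of one reading of "applying these fractions as correctors": the raw smORF fraction is simply multiplied by the share of tested smORFs with evidence of translation. The function and variable names are illustrative; the numbers are those quoted in the text.

```python
# Figures quoted in the text: raw smORF fraction (about 5% of annotated genes)
# and the share of tested smORFs with evidence of translation.
RAW_FRACTION = 0.05
TRANSLATED_SHARE = {
    "S. cerevisiae": 0.67,   # 67% of those tested
    "M. musculus": 14 / 25,  # 14 of 25 assessed (56%)
}


def corrected_fraction(raw_fraction, translated_share):
    """Scale the raw smORF fraction by the share confirmed as translated."""
    return raw_fraction * translated_share


corrected = {
    species: corrected_fraction(RAW_FRACTION, share)
    for species, share in TRANSLATED_SHARE.items()
}

for species, value in corrected.items():
    print(f"{species}: {value:.2%} of annotated genes")
```

Under this assumption both corrected values fall inside the 2 to 3.5% range given above. Because the translation assays can produce false negatives, these corrected fractions are better read as lower bounds than as point estimates.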
The promising results obtained by these three smORF searches previously carried out in other eukaryotes (summarized in Table ) prompted us to carry out a search for smORFs in Drosophila
. The Drosophila
genome is considered one of the best annotated [20], yet our study suggests that a fraction of smORFs remarkably similar to that in other eukaryotes might have escaped detection by genome annotation efforts and might be an important source of functional genes (Table ). Programs for Drosophila
's genome annotation do not completely omit genes with smORFs from their predictions, but it had been suggested that short coding sequences might escape the cutoff thresholds of the programs [20
]. Recently, a lowering of this 100 amino acid threshold has led to the addition of some 700 putative smORFs above 50 amino acids in length (FlyBase release 5 [21
]; as accessed in May 2010), but most of these belong to isoforms or truncated cDNAs of canonical long genes, and do not overlap with the smORFs we identify here (V Pereira and JP Couso, unpublished observations). Still, we have corroborated that at least 25 of these seem to correspond to bona fide
smORFs, with proteomic evidence of translation and sizes ranging from 42 to 99 amino acids (Table ). Finally, we have recently characterized the tal
gene, which encodes smORFs of 11 to 32 amino acids, and shown that these peptides carry out important cell signaling functions during development [22
]. Adding to these examples, we provide here evidence for the existence of hundreds more functional smORFs.
How many Drosophila smORFs are actually functional?
Sequence conservation is an accepted hallmark of functionality, when subjected to appropriate controls to distinguish it from random conservation. D. pseudoobscura
and D. melanogaster
diverged some 25 to 55 million years ago, and have diverged sufficiently to scramble neutral sequences, yet not so much as to mask functional conservation [27
]. Conservation of a whole ORF, from start to stop, appears to be a reasonably stringent filter because i) the number of surviving smORFs does not increase when the initial tBLASTn search is made less stringent; ii) surviving smORFs show syntenic conservation in very high numbers; and iii) the size distribution of the 4,561 conserved smORFs is significantly different from the starting pools and from random controls, whereas it does not change after applying further filters (Additional file 1
). Some 4,500 smORFs might, therefore, constitute an upper estimate of the number of functional smORFs in Drosophila
, which would be in line with the Hanada et al
] estimate of 3,241 smORFs in Arabidopsis.
It might be argued that syntenic conservation (that is, conservation in homologous regions of the genome) should be considered the definitive criterion to distinguish real conservation (presumably due to functional significance) from simple similarity (presumably due to chance). However, the absence of synteny may be due to individual gene translocation. It is interesting that almost 50% of the 25 previously annotated and proteomics-corroborated smORFs are not conserved in syntenic positions in D. pseudoobscura (Table ). Similarly, 178 of our smORFs are conserved with start and stop codons in D. pseudoobscura, have Ka/Ks < 0.1, and have multiple evidence of transcription, yet they do not appear to be conserved in syntenic positions (Figure ). We note that these 25 smORFs, plus another 6 for which we have separate evidence of function (E Magny and JP Couso, unpublished observations), all belong to transcripts of less than 2 kb; perhaps small smORF-encoding transcripts are more mobile than their canonical counterparts. Taking these facts into account, we resolved not to consider synteny for our upper estimate of putative functional smORFs, but to include it among the criteria used to generate our conservative estimate.
Following the conservative approach, we have taken the position that a functional smORF must also show evidence of both amino acid sequence conservation and transcription. We therefore first studied the pattern of sequence conservation in the 3,144 conserved syntenic smORFs, and observed that 1,075 smORFs show a Ka/Ks score < 0.1, which distinguishes them from random conservation with a 10% FDR. Next, we assessed the correspondence between these 1,075 smORFs and transcribed regions in the Drosophila
genome. Again using a conservative criterion, we observe that 401 of these smORFs overlap transcribed fragments on more than one occasion. These 401 smORFs are thus very strong candidates to be functional, as they present evidence of both transcription and amino acid conservation (see Additional files 3
for the actual sequences). Pending manual curation and functional studies, our preliminary clustering analysis suggests that those 401 smORFs may belong to 383 different transcripts, and therefore some 380 is our lower estimate for the number of smORF-containing genes in Drosophila
. However, these may well be underestimates. Firstly, the experimental data of Manak et al
] and modENCODE that we have used here to ascertain transcription detect only transcripts expressed during embryogenesis. They leave out transcripts expressed only during larval or adult phases of the life cycle, including transcriptionally active organs rich in specific transcripts, such as the brain and the gonads. Further, high-throughput detection of transcription during embryogenesis is not absolute but subject to inevitable experimental limitations. Thus, genes expressed in only a few cells of the embryo may not be represented (E Magny and JP Couso, unpublished observations). Secondly, the Ka/Ks test is of limited usefulness with short sequences, producing up to 9% false negatives [38] with a limit of Ka/Ks < 1. In order to obtain a 10% FDR, we have had to lower the Ka/Ks limit to 0.1, which is one order of magnitude lower than the usual < 1 used for longer protein sequences, and leaves out a further 1,868 smORFs (Table ). It is also possible that the 314 smORFs with Ka/Ks > 1 are genuine translated coding sequences undergoing adaptive evolution [54]. Finally, the average length and population size distribution of the 401 smORFs with evidence of transcription, synteny and Ka/Ks < 0.1 is not significantly different from that of the 4,561 conserved smORFs that have no evidence of synteny, transcription and Ka/Ks < 0.1 (Figure ; Additional file 1
), suggesting that some of the smORFs with no current evidence of transcription or protein sequence conservation may be transcribed at low levels, or at stages of development and life history that are not yet fully explored. Altogether, our conclusion from the Ka/Ks, transcription and synteny analysis is that our estimate of some 400 functional smORFs in Drosophila
is solid but perhaps too conservative. A less conservative conclusion would be that up to 4,561 functional smORFs could exist. Between these numbers, other estimates are possible, depending on the actual filters and cutoff values considered (see Figure for a breakdown of the results of applying each filter and of the different overlaps between these filters).
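The conservative cascade described in this section (synteny, then Ka/Ks < 0.1, then overlap with more than one transcribed fragment) can be sketched as a sequence of filters. The record type, field names and toy records below are hypothetical and serve only to illustrate the logic; the counts and the gene total are those reported in the text.

```python
from dataclasses import dataclass


@dataclass
class SmORF:
    name: str
    syntenic: bool            # conserved in a syntenic position
    ka_ks: float              # Ka/Ks ratio against the orthologue
    transcript_overlaps: int  # number of overlapping transcribed fragments


def conservative_set(smorfs, ka_ks_max=0.1, min_overlaps=2):
    """Apply the three conservative filters in the order given in the text."""
    kept = [s for s in smorfs if s.syntenic]
    kept = [s for s in kept if s.ka_ks < ka_ks_max]
    return [s for s in kept if s.transcript_overlaps >= min_overlaps]


# Counts reported in the text at each stage of the funnel.
REPORTED = [
    ("conserved start to stop", 4561),
    ("syntenic", 3144),
    ("Ka/Ks < 0.1", 1075),
    ("overlapping more than one transcribed fragment", 401),
]
ANNOTATED_GENES = 13907  # protein-coding genes, FlyBase release 5 (from the text)


def retention(counts):
    """Fraction retained at each step relative to the previous step."""
    return [(label, n / prev) for (_, prev), (label, n) in zip(counts, counts[1:])]


# Made-up toy records, for illustration only.
toy = [
    SmORF("a", True, 0.05, 3),   # passes all three filters
    SmORF("b", True, 0.40, 5),   # fails the Ka/Ks cutoff
    SmORF("c", False, 0.02, 4),  # fails synteny
    SmORF("d", True, 0.08, 1),   # overlaps only one transcribed fragment
]
survivors = [s.name for s in conservative_set(toy)]
share_of_genes = REPORTED[-1][1] / ANNOTATED_GENES
```

With the reported counts, each filter retains roughly 69%, 34% and 37% of the previous set, and the final 401 smORFs correspond to about 2.9% of the 13,907 annotated protein-coding genes. Relaxing `ka_ks_max` or `min_overlaps` in the sketch moves the result between the conservative and the upper estimates, which is the sense in which intermediate estimates depend on the filters and cutoffs chosen.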
At any rate, a lower estimate of 400 smORFs represents some 3% of currently annotated protein-coding genes, and is in line with the 5% estimate in Saccharomyces
] and a conservative Arabidopsis
estimate. We therefore favor this conservative estimate for the number of functional smORFs in Drosophila.
Obtaining further proof of smORF functionality
Full genome annotation, including manual curation, of the 401 smORFs conserved in D. pseudoobscura, overlapping transcripts, and with Ka/Ks < 0.1 is a necessary future step to build the gene models they belong to, but will not offer definitive proof of their functionality or translation, which can only come from experimental verification.
We provide a limited experimental functional validation of our computational search, pooling observations from previously published studies, and the results would seem to vindicate our computational estimates. The peptides for 25 smORFs in the annotated D. melanogaster genome with sizes ranging from 42 to 99 amino acids have been isolated by proteomics methods, and offer direct and unambiguous proof of their translation. In several cases these smORFs arise after splicing of multi-exon transcripts, and this limits the number of filters we can apply. However, it can be observed that, except for the evidence of multiple overlap with transcribed sequences, many of these smORFs pass any of our filters of tBLASTn E-value < 10^-3, syntenic conservation, and Ka/Ks < 0.1 (Table ). This test of our pipeline reveals it to be stringent, and validates our estimate of 401 functional smORFs in Drosophila as conservative.
It is possible that new proteomics approaches specifically designed to detect peptides of less than 100 amino acids [55
] could offer definitive proof for the translation of these 401 smORFs. However, successful peptidomics approaches in insects, including Drosophila
, have only been reported in a few cases and can only be attempted on an organ-by-organ and stage-by-stage basis [57
]. Still, such proteomics and peptidomics studies routinely uncover high numbers of 'orphan' putative sequences that cannot be matched to annotated Drosophila
]. The problem is more serious with very short peptides, which offer fewer sequence signatures (digestion sites, and so on) for proteomic pipelines to obtain high-confidence sequences. These orphan sequences could be artifacts, or could be the products of non-annotated functional smORFs.
An alternative proof for the functionality of these smORFs in Drosophila
could be obtained by genetic analyses. Single-smORF genetic studies can accurately characterize functional smORFs and are essential to build the gene models they belong to [22
], but cannot be extended to a whole-genome sweep because of limitations in methods, cost and time. Whereas functional genetic analysis is routine in D. melanogaster
, standard genetic techniques are not suited to the analysis of hundreds of loci, which currently could only be attempted by a research consortium. Even in the yeast S. cerevisiae
, with its much greater speed and ease of manipulation, Kastenmayer et al
] similarly pooled data from previous studies and their own mutational effort and were only able to detect a mutant growth phenotype in 22 out of 247 smORF mutants. Other types of phenotypes were not assessed, but it seems likely that more detailed studies are needed to increase the fraction of smORFs with known functions in this lower eukaryote (since the fraction of canonical genes showing a similar growth phenotype is only 25% [4
]). Interestingly also, a staggering number of putative loci (16,383), defined by 18,873 mutant alleles, have still not been mapped to any Drosophila
gene (FlyBase [21
], FB2010_05 release 5 notes, accessed May 2010). Several of these mutants probably correspond to lost alleles of canonical genes. However, a significant fraction of them may in fact map to non-annotated genes and coding sequences, including smORFs.
Thus, it is likely that a specific research effort is needed to obtain experimental evidence for the translation and functionality of smORFs in Drosophila
, probably the higher eukaryote model best suited for this, because of its ease of genetic manipulation and extensive genome annotation. This enterprise would be well worth the effort, judging from the results of the analysis of a single smORF-encoding gene, tarsal-less
, which has prominent essential functions at several life stages and whose encoded peptides seem to work as cell signals [22
], and the important functions of other smORF products such as antibacterial or sex peptides [60
]. If the results of the bioinformatics search presented here are thus corroborated by further proteomic and functional studies in Drosophila
, it would justify similar functional studies in vertebrates, where expense and time are even greater constraints. Since functional translated smORFs and smORF-encoding genes have now been shown to exist in Saccharomyces
, they in all probability exist in vertebrates as well. We do not necessarily expect that individual smORFs will be conserved in vertebrates (amongst other things, their small sizes make computational identification of homologues across distant species difficult if not impossible), but we expect a new class of genes encoding smORFs to be present; a new class, adding hundreds or even thousands of extra coding sequences to current genome annotations.