Even though it is estimated that only 20,000 to 25,000 protein coding genes exist in the human genome, the transcriptome is quite complex and contains protein coding, nonprotein coding, alternatively spliced, and antisense genes [28]. RACE sequencing has provided a sensitive means for probing the human transcriptome. We found that transcripts from known gene regions often matched the known gene annotation but that many additional novel transcripts were also detected. We were also able to detect both novel and known RNA transcripts from known genes that were not previously detected in NB4 cells using genomic tiling arrays. It is thus likely that many (and possibly the majority) of known genes are expressed and spliced in human tissues and cell lines, and that multiple transcripts are produced from most gene loci, at least at a low level.
In addition to many annotated exons, high-density oligonucleotide tiling arrays have identified a large number (8,958) of novel TARs located in both intronic regions and intergenic regions distal from previously annotated genes [15, 18]. In this report, end sequencing of the 5'-RACE and 3'-RACE PCR products from novel TARs identified extensively overlapping and interconnected novel transcripts. Most of the RACE sequences from the novel TARs and the nonTX regions are unspliced. This is consistent with mouse transcriptome studies, which found the most obvious difference between coding and noncoding transcripts to be that a higher percentage (71%) of the noncoding transcripts are unspliced, single-exon transcripts, as compared with protein coding transcripts (18%) [29]. Many human RACE products do not contain long ORFs, and thus the function of these transcripts is not known. They probably represent nonprotein coding RNAs that may have structural, enzymatic, or regulatory functions; pre-mRNAs; or RNAs from genomic regions that are transcribed and present in polyA+ RNA but lack a function.
Although many of the novel RNAs do not have long ORFs, a subset of them do (about 9%). From our limited study we found 27 protein coding sequences that are not present in RefSeq but are likely to encode proteins, based on the presence of an ORF of more than 50 codons that is homologous to other proteins in GenBank. A small fraction (two out of 27) of these is spliced. Additional studies of the entire human genome are thus likely to expand the number of protein coding genes accordingly.
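The length criterion above can be illustrated with a minimal sketch. It assumes that an ORF starts at an ATG, ends at an in-frame stop codon, and is searched for in all six reading frames; these assumptions, and the names `longest_orf_codons` and `has_long_orf`, are illustrative and are not taken from the study. The homology comparison against GenBank is not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_CODONS = 50  # threshold quoted in the text: an ORF of more than 50 codons

# Translation table for complementing A/C/G/T; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_codons(seq):
    """Length in codons (start included, stop excluded) of the longest
    ATG-to-stop ORF found in any of the six reading frames.

    Assumption: ORFs that run off the end of the sequence without an
    in-frame stop codon are ignored.
    """
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = None  # codons since the current ATG; None means no open ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if run is None:
                    if codon == "ATG":
                        run = 1
                elif codon in STOP_CODONS:
                    best = max(best, run)
                    run = None
                else:
                    run += 1
    return best


def has_long_orf(seq, min_codons=MIN_CODONS):
    """True if the sequence contains an ORF of more than `min_codons` codons."""
    return longest_orf_codons(seq) > min_codons


# Synthetic example sequences (not data from the study).
long_example = "ATG" + "GCT" * 60 + "TAA"
short_example = "ATGGCTTAA"
print(has_long_orf(long_example), has_long_orf(short_example))
```

A real screen would also have to handle partial ORFs at the ends of RACE reads and then test each candidate for homology, which this sketch leaves out.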
Complementary natural antisense transcripts exert control at many steps of gene expression in prokaryotes and higher eukaryotes, from transcription to translation, including transcript initiation, elongation, mRNA processing, localization, and stability [30, 31]. Natural antisense transcripts may be involved in diverse biologic functions, such as development, adaptive response, viral infection, and genomic imprinting [32, 33]. In recent years, a large number of sense-antisense transcription events have been reported in both human and mouse. A mouse transcriptome study using reverse transcribed cDNA libraries [19] indicated that as many as 72% of all transcriptional units have an antisense transcript. In humans, 61% of all transcribed regions were suggested to possess an antisense transcript [16]. Our findings that some antisense transcripts lack consensus splice junctions and can be detected on strand-specific microarrays only in cDNA, but not in directly labeled RNA, raise the possibility that many antisense signals are artifacts resulting from reverse transcription. The conditions that we used are similar to those used by most other laboratories, suggesting that low level second strand synthesis is likely to be present in many studies. Consistent with this, while our manuscript was under review, Perocchi and coworkers reported the presence of in vitro antisense synthesis in their cDNA preparations [25]. These findings indicate that much apparent antisense transcription is due to in vitro synthesis during cDNA preparation and not to in vivo transcription, and therefore caution should be used in interpreting antisense messages. The fact that some antisense regions still hybridize to directly labeled RNA probes indicates that some antisense transcripts do exist in vivo.
RACE sequencing was able to uncover novel transcripts from nontranscribed (nonTX) regions where microarray experiments did not detect any transcription, indicating that RACE sequencing is more sensitive. This is probably because microarray signals are dampened by cross-hybridization to the short oligonucleotides on the array. This problem is especially acute for genes that have homologous pseudogenes and paralogs. RACE sequencing offers several other advantages relative to microarrays. Microarrays do not provide information about transcript structure, splicing patterns, or the ability of these regions to encode proteins. Only sequencing full-length cDNA can resolve these issues. The recent development of massively parallel sequencing technology has the potential to expedite this process greatly [34-37]. A large number of sequences (400,000 250-bp reads for the 454 sequencer [Roche Applied Science, Indianapolis, IN, USA] and >300 million approximately 30-bp reads for the Solexa sequencer [Illumina Inc., San Diego, CA, USA]) can readily be obtained in a single run. Although still relatively short, these reads have the potential to identify novel transcribed regions of the human genome, and the longer reads may help to identify new splice variants [38].
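To put the quoted read counts side by side, the short sketch below multiplies reads per run by read length for each platform. The figures are the ones given in the text; the Solexa count is stated there as a lower bound and the read length as approximate, so the result is only an order-of-magnitude estimate. The names `PLATFORMS` and `bases_per_run` are illustrative.

```python
# Read counts and read lengths as quoted in the text.
# Assumption: the Solexa figures (">300 million", "approximately 30-bp")
# are treated as exact values for the purpose of this estimate.
PLATFORMS = {
    "454": {"reads": 400_000, "read_length_bp": 250},
    "Solexa": {"reads": 300_000_000, "read_length_bp": 30},
}


def bases_per_run(reads, read_length_bp):
    """Total sequenced bases in one run: number of reads times read length."""
    return reads * read_length_bp


yields = {name: bases_per_run(**spec) for name, spec in PLATFORMS.items()}

for name, total in yields.items():
    print(f"{name}: {total:,} bases per run")
```

On these figures the short-read platform produces roughly 90 times more bases per run, whereas the 454 reads are about eight times longer, which is why the longer reads are the ones expected to help with splice variants.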
As noted above, quantitative measurements of transcript expression reveal that two known genes (SYN3 and TIMP3) are expressed at low levels even in tissues where they have no obvious role and cannot be detected by standard methods. Likewise, analysis of novel TARs and even random regions of the genome indicates that much of the genome produces transcripts that are present in polyA+ RNA, at least at a low level. Expression of these RNAs was $10^{3}$ to $10^{5}$ times lower than that of the HPRT gene. Assuming that HPRT is present at $10^{-5}$ (1 copy per 100,000 molecules) of the total RNA, the novel transcripts we detected are present at $10^{-8}$ to $10^{-10}$ of the total RNA. The finding that much of the genome is likely to be expressed has previously been reported for yeast, for which evidence also exists that the RNA is translated [39, 40]. As suggested previously, we speculate that the ability to express novel regions of the genome continuously could ultimately be useful in evolution for selecting new functions.
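The abundance estimate in this paragraph is a simple division, sketched below. The HPRT fraction is the assumption stated in the text (1 copy per 100,000 RNA molecules), and the fold differences are the reported range; the function name `fraction_of_total_rna` is illustrative.

```python
import math

# Assumption taken from the text: HPRT makes up 1 in 100,000 molecules of total RNA.
HPRT_FRACTION = 1e-5

# Reported range: novel transcripts are 10^3 to 10^5 times less abundant than HPRT.
FOLD_LOWER_RANGE = (1e3, 1e5)


def fraction_of_total_rna(reference_fraction, fold_lower):
    """Fraction of total RNA for a transcript that is `fold_lower` times
    less abundant than a reference present at `reference_fraction`."""
    return reference_fraction / fold_lower


for fold in FOLD_LOWER_RANGE:
    frac = fraction_of_total_rna(HPRT_FRACTION, fold)
    exponent = round(math.log10(frac))
    print(f"{fold:.0e}-fold below HPRT: about 10^{exponent} of total RNA")
```

The two ends of the range correspond to roughly one copy per hundred million and one copy per ten billion RNA molecules, provided the HPRT assumption holds.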
Our study highlights the enormous complexity of the human transcriptome and the vast number of RNA transcripts generated through alternative splicing and from both protein coding and nonprotein coding loci. The ability of RNA to encode protein and to serve structural and regulatory roles makes it a versatile molecule for mediating many functions. The remarkable complexity of RNAs of the human transcriptome coupled with their diverse functions may therefore help explain the dramatic increase of complexity in higher eukaryotes and phenotypic variation [41, 42].