Exon 0 and -1 are novel first exons
To examine if the two novel, putative exon -1 and extended exon 0 regions were first exons, that is if transcription start site(s) (TSS) were present, 5' RACE was performed. Exon 0-specific reverse primers and a RACE-ready panel of anchored cDNA libraries derived from 24 human tissues (OriGene, Rockville, MD) were used. Sequencing of the 5' RACE clones from two replicate, independent experiments identified two exon -1 start sites, as well as multiple transcription start sites in exon 0 (Fig. ). The cDNA sequences of the 5' RACE [GenBank:EF549561, EF549562, EF549563, EF549564, and EF549565] products are shown in Fig. . The longest 5' RACE clone identified (TSS1 in Fig. ) is expressed in adipose tissue, leukocytes and the uterus [GenBank:EF549561] and contains a 106 base pair exon -1 and a large 736 bp exon 0 (exon 0b, the complete extended exon 0 shown in Fig. ).
Cap analysis of gene expression (CAGE) tags average 20–21 nucleotides and are produced by large-scale sequencing of concatemers derived from the 5' ends of capped mRNA [
17,
18]. The CAGE method, therefore, detects the most 5' site of the mRNA transcripts – the transcription start site. Even singly, CAGE tags are considered to be reliable markers of transcription start site (TSS) locations [
19]. Intriguingly, a human CAGE tag starting site (CTSS) corresponding exactly to a 106 bp exon -1 transcription start site (Fig. ) was found via the CAGE basic viewer. Furthermore, a transcript containing a 92 bp exon -1 (exon -1b, TSS2 in Fig. ) was amplified by 5' RACE from adipose tissue [GenBank:EF549564], leukocytes and the spleen. Sequence analysis demonstrates that it splices into a 197 base pair exon 0 (197 bp from the 3' terminus of exon 0). Thus, the 100 bp conserved region, identified using Mulan and confirmed experimentally, harbours an exon -1 with two transcription start sites.
Several transcripts initiating in the 5' extended exon 0 were obtained in many of the tissues examined [GenBank:
EF549565,
EF549562, and
EF549563] including normal human testis, stomach, adipose tissue, leukocytes and ovary (TSS 3–5, Fig. ). While it is possible that the transcription start sites identified in exon 0 via 5' RACE represent truncated cDNA, the sites are likely to be genuine, as we found multiple CAGE tag starting sites in the putative, extended exon 0 of mouse ghrelin (Fig. ). While the transcription start site of the short 20 bp human exon 0 aligns with the murine start site, the TSSs of the human extended exon 0 and the putative extended murine exon 0 are quite different (Fig. ). This suggests that these exons have diverged significantly over time, resulting in considerable variation in their start sites, termed TSS turnover [
19]. TSS turnover, with the translocation of mouse start sites compared to human start sites, occurs in a number of genes [
19].
Recent studies indicate that many genes have broad transcriptional regulation with a wide distribution of proximal start sites, and not all genes are regulated by distinct start sites controlled by a TATA box [
19,
20]. While a putative TATA box flanks the originally described 20 bp exon 0, it appears to be a very weak start site [
21,
22]. The cluster of transcription start sites (TSSs) in the extended exon 0 sequence upstream of the short 20 bp exon 0 (Fig. ) contains no apparent TATA boxes (data not shown). Our study indicates that the ghrelin gene is broadly regulated and has many potential transcription start sites. This may allow the transcription of numerous tissue-specific and developmental stage-specific transcripts. Using
in silico analysis coupled with RT-PCR and 5' RACE analysis, we have demonstrated that the conserved regions upstream of exon 1 (exon -1 and extended exon 0) are transcribed and correspond to novel first exons of the human ghrelin gene.
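The search for TATA boxes mentioned above can be illustrated with a minimal sketch. It assumes the commonly cited TATAWAW consensus (W = A or T) and a hypothetical search window of 20 to 40 bases upstream of a start site; neither the motif definition nor the window is taken from this study, and the function name `find_tata_like` is purely illustrative.

```python
import re

# Assumption: the TATAWAW consensus (W = A or T) is used as a simple TATA-box proxy.
# The lookahead lets overlapping matches be reported.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_tata_like(seq, tss_index=None, window=(-40, -20)):
    """Return (position, motif) pairs for TATA-like motifs in seq.

    Positions are 0-based. If tss_index is given, only motifs starting
    inside the window (offsets relative to the TSS) are returned.
    The default window is an illustrative assumption.
    """
    s = seq.upper()
    hits = [(m.start(1), m.group(1)) for m in TATA_PATTERN.finditer(s)]
    if tss_index is None:
        return hits
    lo, hi = tss_index + window[0], tss_index + window[1]
    return [(pos, motif) for pos, motif in hits if lo <= pos <= hi]


if __name__ == "__main__":
    # Toy sequence, not ghrelin genomic sequence.
    toy = "GGGG" + "TATAAAA" + "C" * 25
    print(find_tata_like(toy))
    print(find_tata_like(toy, tss_index=34))
```

A scan of this kind only flags candidate motifs; an empty result for a region is consistent with, but does not prove, a TATA-less promoter.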
Multiple transcripts arise from alternative splicing from exons upstream of exon 1
Expression of exon -1-containing and extended exon 0-containing transcripts was examined using RT-PCR with exon-specific sense primers (for exon -1 and extended exon 0) and with an antisense primer in the 3' terminal exon 4 of the ghrelin gene. A list of exons and exon-intron boundaries of ghrelin locus-derived transcripts identified in this and previous studies, as well as ESTs, is given in [Additional file 1].
Using sense primers in exon 0, a 1064 bp product that spanned exons 0 to 4 and contained a 558 bp exon 0 [GenBank:EF549569] was found in the human stomach (Fig. ). Moreover, primer walking with sense primers further downstream in exon 0 always resulted in bands corresponding to exon 0 and 1 (data not shown). Several PCR products were amplified from the SW1353 chondrosarcoma cell line (and these are depicted in Fig. ). The difference in the sizes of the transcripts results from multiple non-canonical introns in exon 0 (data not shown), and is most likely due to promiscuous splicing in the continuous tumour cell line.
We then examined the alternative splicing of transcripts expressing exon -1 (located 2.6 kb upstream of exon 1) in human tissues and a range of human continuous cell lines. RT-PCR using sense primers to exon -1 revealed multiple transcripts in the normal human stomach [GenBank:EF549568, EF549570, EF549571, EF549572, EF549573, EF549574, and EF549575] and several other tissues and cell lines [GenBank:EF549557, EU072081, EU072082, EU072083, EU072084, EU072085, EU072086, and EU072087]. The transcripts were classified into three groups based on their sequences. The amplicons sequenced from the human stomach are depicted in Fig. , while Fig. summarises all exon -1 to 4 amplicons obtained in this study. The first two groups have exon structures that obey the GT/AG rule, while transcripts in the third group harbour canonical GT/AG intron splice sites in the antisense direction only.
In the first group (I), all amplicons include exons 1 to 4, which code for preproghrelin, and vary only in the length of the sequence upstream of exon 1 (the preproghrelin 5' UTR). The amplicons obtained from the human stomach [GenBank:
EF549571,
EF549572, and
EF549573] are depicted in Fig. . The 855 bp amplicon, demonstrated in the human stomach, was also observed in all cell lines examined (DU145, RWPE-1, RWPE-2, LNCaP, PC3 and SW1353, data not shown), as well as in the heart, brain, spleen, testis, salivary gland, leukocytes and bone marrow (see [Additional file
2]). As depicted in Fig. , sequencing of the Rapid-Scan human tissue cDNA panel also revealed a 1316 bp amplicon with a 736 bp exon 0b in the placenta [GenBank:
EU072083]. Furthermore, splice variants with an alternative exon 2 splice site (hereafter termed exon 2b), which results in loss of a glutamine residue at position 14 of the mature ghrelin peptide (termed des-Gln14-ghrelin or ghrelin-27) [
23], were also sequenced and correspond to a 1313 bp [GenBank:
EU072084] amplicon from the kidney and an 852 bp amplicon from heart tissue [GenBank:
EU072085] (Fig. ). Interestingly, a 1067 bp amplicon with a 212 bp exon 0h initiating at the start of the 736 bp exon 0, followed by the 275 bp exon 0d (separated by a 294 bp novel intron), was found in the kidney [GenBank:
EU072087]. A single nucleotide polymorphism (SNP) g.-1062G > C (nucleotide number -1 corresponds to the first nucleotide upstream of the translation start site of preproghrelin in exon 1 of the ghrelin gene) [dbSNP:rs26311] is present in base 209 of the 736 bp exon 0 (exon 0b), creating a 3' splice-site consensus sequence (CAG) [
24]. This polymorphism has recently been linked to obesity and metabolic syndrome in the Korean population and is thought to influence the ghrelin promoter, ultimately increasing preproghrelin transcription efficiency [
25]. Our findings raise the possibility that this SNP affects mRNA splicing, resulting in allele-specific transcription of a 736 bp exon 0 (exon 0b) or a 487 bp exon 0 (the latter resulting from g.-1062C-induced splicing of a 212 bp exon 0h into a 275 bp exon 0d).
The large, extended 736 bp exon 0 is extensively spliced and contains numerous non-conserved uORFs (upstream open reading frames), while exon -1 contains a single translation start site in the human sequence only (data not shown). Approximately 12% of human genes are alternatively spliced within their 5' untranslated regions [
26]. Upstream open reading frames, as well as mRNA secondary structure and other motifs in 5' UTRs, are known to regulate the translation of downstream major ORFs, particularly those of developmental genes [
27]. We suggest that the alternative transcripts identified which splice into exon 1 may be a part of such a regulatory mechanism. The 20 bp exon 0 found in human stomach and thyroid medullary carcinoma TT cells [
6,
7] is devoid of upstream open reading frames and stable secondary structure (data not shown). As a consequence, this transcript may be more efficiently translated than the group I transcripts with exon -1 and extended exon 0 which have more extensive 5' untranslated regions.
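To make the notion of an upstream open reading frame concrete, the following is a minimal sketch of how a 5' UTR sequence might be scanned for ATG-initiated ORFs that terminate before the main coding sequence. It is an illustration only and not the analysis performed in this study; the function name `find_uorfs`, the `min_codons` threshold and the toy sequence are assumptions introduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5, min_codons=2):
    """Return (start, end, n_codons) for each ATG-initiated ORF that
    ends at an in-frame stop codon within the supplied 5' UTR.

    start is 0-based, end is exclusive and includes the stop codon,
    and n_codons counts sense codons (ATG included, stop excluded).
    min_codons is an arbitrary illustrative threshold.
    """
    seq = utr5.upper().replace("U", "T")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk downstream codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    uorfs.append((start, pos + 3, n_codons))
                break
    return uorfs


if __name__ == "__main__":
    # Toy sequence, not a ghrelin 5' UTR.
    print(find_uorfs("CCATGAAATAGCC"))
```

An ATG with no in-frame stop inside the UTR is not reported by this sketch, since such a frame would either overlap or extend the main ORF rather than form a self-contained uORF.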
The second group (II) of transcripts contains splice variants which include exon -1 in various combinations with exons downstream of exon 1 (exons 2 to 4). In addition to the two 443 and 326 bp amplicons cloned from the human stomach [GenBank:
EF549574, and
EF549575] (Fig. ), we obtained a 217 base pair amplicon (
EF549557) in the PC3 human prostate carcinoma cell line corresponding to a transcript that lacks exons 0–3, but contains exon -1 and exon 4, flanked by GT/AG splice junctions (Fig. ). Furthermore, the 443 bp amplicon from the human stomach was also sequenced from the heart and spleen (data not shown) and amplicons of the expected size were observed in leukocytes and bone marrow (see [Additional file
2]). Moreover, sequencing of leukocyte-derived amplicons demonstrated an mRNA variant with a previously described 3 base pair 5' truncated exon 2 [
23], exon 2b (440 bp amplicon in Fig. ) [GenBank:
EU072086]. Given that exon 1 is skipped in all group II variants, preproghrelin and the N-terminal 'active core' (Gly-Ser-Ser-(n-octanoyl)-Phe) of the ghrelin hormone [
28] cannot be translated from them. Interestingly, analysis of these variants using SignalP V3.0 [
25] predicts a signal peptide (MFTCWWSYLRSTLAAVPGEA) in exon -1 (with a signal peptide probability of 0.57, signal anchor probability of 0.00, and a cleavage site between position 19 and 20 which was assigned a cleavage site probability score of ~0.47). Indeed, if the signal peptide is translated, the putative peptides would be in-frame with previously reported ghrelin gene derived peptides (depicted in Fig. ). The putative peptides encoded by these transcripts would include the sequence for C-ghrelin (ex -1, 2, 3, 4) (which also includes the coding region for obestatin [
10]), the hormone obestatin alone (ex -1, 3, 4) [
9], and also a novel C-terminal proghrelin peptide (exon 3-deleted proghrelin) (ex-1, 4) that is upregulated in prostate [
5] and breast [
3] cancer. The identification of several mRNA variants with coding sequence in-frame with an exon -1 encoded putative signal peptide strongly suggests that the signal peptide is translated and functional. For example, we have demonstrated the expression of a C-ghrelin mRNA variant in human heart tissue. C-ghrelin circulates at high levels in patients with heart failure and at low levels in patients with myocardial infarction, and these levels do not correspond with ghrelin levels [
10]. In rat plasma and rat tissues, C-ghrelin levels do not appear to correspond directly with ghrelin levels [
11]. Therefore, the regulation of preproghrelin and C-ghrelin could be independent and C-ghrelin could be a ghrelin gene derived hormone with distinct functions. Interestingly, a murine testis specific transcript, the ghrelin-gene derived transcript or GGDT, that codes for obestatin but not ghrelin, has previously been demonstrated and harbours a putative nuclear localisation signal [
29]. While the functions of obestatin remain somewhat controversial [
9,
30-
32], it may play a role in sleep [
33], anxiety [
34] and in cell proliferation [
35].
Natural antisense transcripts are transcribed from a gene on the opposite strand of ghrelin (ghrelinOS)
The third (III) group of alternative transcripts containing exon -1 and 4 that we identified in the human stomach (a 1176 bp amplicon [GenBank:
EF549568] and a 921 bp amplicon [GenBank:
EF549570], Fig. ) result from splicing of transcripts with exon -1 sequences of ~880 bp, termed exon -1*a, and ~625 bp, termed exon -1*b. These exons extend into intron -1 and are considerably larger than the 106 bp and 92 bp exon -1 sequences obtained in this study. Moreover, these splice variants also contain two novel intron 2-derived exons (exon 2* and 2**) and an alternative exon 4 (exon 4*) (Fig. ). As is the case in the stomach, the fragments corresponding to these unusual transcripts were observed in all cell lines examined (DU145, RWPE-1, RWPE-2, LNCaP, PC3 and SW1353, data not shown). The 1176 bp amplicon expressed in the stomach is also expressed in the heart and fetal liver (data not shown). Furthermore, an mRNA variant with a ~880 bp exon -1*a and exon 2*, but lacking exon 2** (the 1108 bp amplicon in Fig. ) was sequenced from the spleen [GenBank:
EU072081], while a 950 bp amplicon [GenBank:
EU072082] harbouring exon -1*a and exon 4* only was obtained from the kidney (Fig. ). We then determined the direction of transcription of these transcripts. GMAP [
36] and manual analyses were performed using sequenced PCR products obtained in this study, as well as expressed sequence tags (ESTs) spanning at least one intron. This analysis showed that while there are no canonical GT/AG splice junctions if the variants are transcribed from the sense (ghrelin gene) DNA strand, the reverse strand contains GT/AG intron junctions. All exons demonstrated this pattern with the exception of the intron flanking exon 2* of these fragments and -1*a/b, where the splice junction is GC/AG. GC/AG is relatively rare, but is the most common non-canonical splice site pair [
37].
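The strand inference described above, in which intron boundaries are read on both DNA strands to see which one yields GT/AG (or GC/AG) junctions, can be sketched as follows. This is a simplified illustration and not a reimplementation of GMAP; the function names and toy intron sequences are hypothetical, and introns are assumed to be supplied as sense-strand genomic sequence.

```python
# Accepted donor/acceptor pairs: canonical GT/AG plus the GC/AG variant.
ACCEPTED_PAIRS = {("GT", "AG"), ("GC", "AG")}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def intron_strand(intron):
    """Classify one intron given as sense-strand genomic sequence.

    Returns '+' if its ends are an accepted pair as written, '-' if they
    are an accepted pair on the reverse complement, otherwise '?'.
    """
    s = intron.upper()
    if len(s) < 4:
        return "?"
    if (s[:2], s[-2:]) in ACCEPTED_PAIRS:
        return "+"
    rc = reverse_complement(s)
    if (rc[:2], rc[-2:]) in ACCEPTED_PAIRS:
        return "-"
    return "?"


def transcript_strand(introns):
    """Return a strand only when all classifiable introns agree."""
    votes = [intron_strand(i) for i in introns]
    plus, minus = votes.count("+"), votes.count("-")
    if plus and not minus:
        return "+"
    if minus and not plus:
        return "-"
    return "?"


if __name__ == "__main__":
    # Toy introns: both read as CT...AC or CT...GC on the sense strand.
    print(transcript_strand(["CTGGGGACTTAC", "CTAAAAAAGC"]))
```

Requiring agreement across introns is a deliberately conservative choice in this sketch: a single junction can match by chance, whereas a consistent pattern across several introns is stronger evidence for the direction of transcription.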
To confirm the direction of transcription of the putative antisense transcripts, we employed strand-specific primers in reverse transcription (RT) reactions to specifically target either sense or antisense transcripts. RT-PCR analysis of human stomach cDNA (performed by combining the sense and antisense RT-primers) revealed that the target transcripts are transcribed in the antisense direction (Fig. ). Furthermore, Southern hybridisation, employing a nested DIG-labelled PCR probe, demonstrated a very strong signal for the antisense-direction amplicon (data not shown). The expected 201 base pair amplicon (spanning exon 4*, 2** and 2*) was isolated and sequenced to confirm its identity [GenBank:EF549558]. We have termed this antisense gene ghrelinOS (ghrelin opposite strand).
In order to firmly establish the origin of the antisense ghrelinOS mRNA transcripts, and to determine how far they extend relative to the ghrelin gene, 5' RLM-RACE using human stomach cDNA was performed. Sequencing of RACE products identified two transcription start sites (TSSs) corresponding to a 63 bp and an 86 bp exon 4* [GenBank:
EF549559, and
EF549560]. Furthermore, a CAGE tag starting site was identified (T03F009D342E) in the antisense direction corresponding to a 28 base pair exon 4*. The three TSSs of ghrelinOS transcripts are summarised in Fig. . We did not identify any potential TATA boxes; these findings therefore suggest that exon 4* contains multiple TSSs, which is typical of TATA-less promoters [
20].
Sequence analysis demonstrates that the ghrelinOS gene undergoes substantial alternative splicing, and we have identified five natural antisense transcripts (termed ghrelinOS1–5, Fig. ). The transcripts differ in the length of exon -1* (GhrelinOS1 and 3–5 vs GhrelinOS2), while in the third transcript (GhrelinOS3) exon 2* is extended and exon 2** is absent. GhrelinOS4 lacks exon 2** and harbours the canonical exon 2*. Finally, GhrelinOS5 lacks both exon 2* and 2**. The exon-intron junctions of GhrelinOS transcripts are depicted in Fig. . The analysis revealed no significant sequence similarity to any known gene, protein or to any long ORFs (data not shown), suggesting that these transcripts may function as regulatory, non-coding RNA [
38]. Natural antisense transcripts (NATs) that are transcribed from the opposite strands of the same genomic locus are termed cis-NATs [
39].
The ghrelin NATs that we have described span the non-coding 3' UTR region of exon 4 and exon -1 of mature, sense ghrelin gene-derived transcripts and overlap intron 2 sequence of ghrelin pre-mRNAs (Fig. ). Sense exons -1 and 4 are conserved when compared to mouse genomic sequence, while the degree of sequence similarity to exon -1* (the region not overlapping the sense exon -1), exon 2* and exon 2** appears to be very low (data not shown). Interestingly, most previously reported cis-NATs overlap with the sense transcript in their untranslated regions [
39]. This appears to be the case for ghrelinOS transcripts, as exons -1 and 4 correspond to the 5' and 3' UTRs of sense ghrelin transcripts encoding preproghrelin.
It has been demonstrated that the mammalian genome is often transcribed from both the sense and antisense DNA strands [
40]. Although the understanding of the mechanisms of action of natural antisense transcripts remains in its infancy, these transcripts have been shown to be involved in transcriptional and post-transcriptional regulation. NATs have been associated with a range of regulatory mechanisms, including transcriptional interference, RNA masking and dsRNA mediated gene-silencing via direct interaction between the sense and antisense transcripts [
39,
41]. Intriguingly, in a very recent study, rats were administered a short, 22 base pair ghrelin antisense oligonucleotide into the cerebrospinal fluid [
42]. The antisense oligonucleotide is complementary to sequence in the rat preproghrelin 3' UTR in exon 4 (exon 4* of putative rat ghrelinOS transcripts). The study found that the antisense oligonucleotide decreased anxiety in rats (the opposite effect to ghrelin) and may act as an antidepressant [
42]. We suggest that this preliminary evidence may provide a first glimpse of the function of endogenous ghrelin natural antisense transcripts.
It has been reported that ghrelin mRNA and protein levels are dissociated [
43-
45]. We hypothesise that this may be due to the presence of upstream open reading frames in exons 5' to exon 1 (exon 1 harbours the preproghrelin start codon); expression of ghrelin locus-derived transcripts lacking coding potential for ghrelin; or non-coding sense and/or antisense regulatory transcripts. The transcripts identified in this study may be examples of at least one of these factors. Therefore, the physiological significance of each transcript species, in particular mRNA variants encoding preproghrelin, cannot be determined based on mRNA expression data alone.