Gene selection and in silico assembly
A total of 76 annotated genes were selected for analysis based on the following criteria: (i) specific or high levels of expression in retina, (ii) a role in retina specific physiological processes or retinal development, and (iii) involvement in retinal disease. A compilation of all tested genes, including gene symbol, definition, chromosomal location and tissue/cell type of expression, is shown in Additional file 1.
cDNA and transcript sequences available in public databases (RefSeq, NCBI and Ensembl gene predictions covered by at least one EST, and the UniGene EST database) were downloaded, and new assemblies were generated using SeqMan.
We found that 5' transcript termini represented in public datasets can be readily identified by clusters of cDNA ends in the assemblies. Additionally, the information about putative TSSs was assessed in the Eukaryotic Promoter Database [17] and the Database of Transcriptional Start Sites (DBTSS) [8]. These data were compiled to create a preliminary gene model, which was used to design primers for the subsequent Cap-finder RACE experiments.
Experimental examination of TSSs
Cap-finder RACE cDNA fragments were cloned and a variable number of clones were sequenced for each gene, depending on the number and the sizes of the colony PCR products detected on the gel. We obtained products for 54 genes out of the 76 genes analysed. A summary of the results obtained with Cap-finder RACE is shown in Tables . Genes for which the promoter and TSS were already known (RHO and OPN1SW) served as internal positive controls. For each gene we detected at least one splice variant that agreed with one or more RefSeq annotated exons.
Results of the RACE experiment to determine the TSS of retina transcripts
Our strategy relies on the location of gene specific primers within internal exons. We obtained those cDNA products that covered at least one exon-exon junction and thus ruled out the possibility of amplification of genomic contamination. This strategy has enabled us to identify alternatively spliced 5' ends that arise from tissue specific gene expression and regulation.
Table lists the results of the Cap-finder RACE experiments for 54 retinal expressed genes and the corresponding RefSeq entry (database release 18). These results can be grouped into five categories with reference to RefSeq: (i) new TSSs within novel exons (8 genes), (ii) alternative splice form of the second exon (2 genes: IMPG1, SAG), (iii) extension of the annotated first exon (27 genes), (iv) shortening of the annotated first exon (4 genes), and (v) confirmation of previously annotated TSSs (13 genes). Table provides the exact nucleotide positions of the 5' termini of the Cap-finder RACE cDNA clones with reference to the UCSC Human Genome Browser (March 2006 assembly). In defining the interval where TSSs are located, we report the start and end nucleotide positions of each TSS and, when present, the most frequent internal start position. Sequences from this study have been submitted to GenBank under the accession numbers DQ067456–DQ067464, DQ426859–DQ426897 and DQ980599–DQ980621.
Retinal expressed genes with new 5' exons
For 8 genes (C1orf32, CNGA3, DHRS3, ELOVL5, KIFC3, RCV1, RDH12, SLC24A2) we have identified a new exon composition at the 5' end of the transcript and in some cases new untranslated 5' exons that locate the TSS several kilobases upstream or downstream from the annotated one.
- C1orf32. This transcribed locus on chromosome 1 was selected for its retinal expression. For this gene, whose function is still not characterised, we retrieved two new isoforms lacking the first annotated exon found in RefSeq. These isoforms contain TSSs in two new exons. One form displays a new first exon located 3 kb downstream from the previous TSS. The other form presents a first exon located 58.4 kb upstream of the former TSS, generating a new first intron that spans a locus transcribed from the opposite strand, the gene MAEL (Figure ).
Figure 1 Schematic gene structure of 3 analysed genes. Schema of the RefSeq, ESTs, exonic structure of new isoforms identified with the Cap-finder RACE of human retina mRNA and the genomic structure containing the TSSs indicated by arrows. Red arrows indicate ...
- CNGA3 (cyclic nucleotide gated channel alpha 3) codes for the α-subunit of the cone photoreceptor cGMP-gated channel, a crucial component of the cone phototransduction cascade in colour vision. Mutations in this gene cause achromatopsia. The RACE experiment confirmed the presence of 4 isoforms, all containing a spliced untranslated exon 0 located 23.4 kb upstream of exon 1 [18] (Figure ).
- RDH12 (retinol dehydrogenase 12) codes for a dual-specificity retinol dehydrogenase that metabolises both all-trans- and cis-retinols and is reported to be expressed in photoreceptors [19]. Mutations within RDH12 cause both recessive early-onset retinitis pigmentosa and Leber's congenital amaurosis [20]. In human retinal mRNA we retrieved two forms of the transcript containing a new first exon located upstream of the RefSeq TSS and a differentially spliced second exon. The in silico assembly and experimental pipeline allowed us to report three putative TSSs for this gene: the first is defined by the RefSeq annotation, the second was deduced from the most upstream transcript represented by ESTs from pooled colon, and the third is a new TSS displayed by retinal transcripts (Figure ).
- DHRS3 (dehydrogenase/reductase, SDR family, member 3) codes for an enzyme catalysing the reduction of all-trans-retinal to all-trans-retinol in the presence of NADPH [19]. The gene was included in our study for its high expression in retina. Cap-finder RACE confirmed the previous first exon and TSS. We also detected an alternative TSS in a new first exon downstream from the annotated one, which was predicted with FirstEF [6] (Additional file 2: Figure 5).
- ELOVL5 (elongation of long chain fatty acids, including docosahexaenoic acid (DHA), family member 5). This gene was recently annotated as a retinal expressed gene [23] and a target of mutation studies in retinitis pigmentosa [24]. We detected a new form of the transcript with a new first exon that was not previously annotated or described for retina (Additional file 3: Figure 6).
- KIFC3 (kinesin family member C3) codes for a retina specific microtubule-associated force-producing protein that may play a role in intracellular transport [25]. We have characterised two new isoforms of KIFC3 retinal transcripts, both of which lack the first 3 exons annotated in RefSeq. Each transcript includes a new first exon; these place the new TSSs 44 kb upstream and 27 kb downstream, respectively, from the TSS referenced in the RefSeq database. The more upstream start site locates the gene in proximity to another retina specific gene, CNGB1 (Additional file 4: Figure 7).
- RCV1 (recoverin) codes for a protein that inhibits rhodopsin kinase activity in retinal photoreceptors by reducing the binding of arrestin to rhodopsin. Deregulation of recoverin expression in certain types of cancer demonstrates a pathological role in cancer-associated retinopathy [26]. Although a previous study of the promoter was performed [27], no clear evidence of the TSS has been described. For this gene we detected three alternative transcripts: the first has the same 5' end as the previously annotated TSS (first exon length may vary from 203 bp longer to 444 bp shorter); the second, a more frequent isoform, lacks the first exon and starts 80 bp upstream from the second exon of the RefSeq; and the third has a new first exon located downstream from the annotated one (Additional file 5: Figure 8).
- SLC24A2 (solute carrier family 24, sodium/potassium/calcium exchanger, member 2) codes for a potassium-dependent sodium-calcium exchanger in cone photoreceptors [28]. Although variant alleles of the cone SLC24A2 gene have been identified, none of them is definitively associated with a specific retinal disease [29]. The new model we present for SLC24A2 predicts three putative TSSs located in two additional new exons that are alternatively spliced (Additional file 6: Figure 9).
We also investigated whether the new exons that extend the 5' end of the transcript may introduce new potentially protein-coding sequences. In no case did we observe an extension of the open reading frame beyond the annotated start codon. However, short alternate open reading frames of at least 40 codons were observed for C1orf32 (nucleotide positions 18–290 from the TSS in isoform a, and positions 164–400 in isoform b), CNGA3 (positions 166–315), DHRS3 (positions 166–315), KIFC3 (positions 4–195), and SLC24A2 (positions 55–183 in isoform a, 44–289 in isoform b). Yet the translated sequences of these short ORFs do not show homology with any protein in public databases.
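The search for short upstream open reading frames can be illustrated with a minimal sketch. It is not the procedure used in this study, which is not described in detail. It assumes a forward-strand transcript sequence beginning at the TSS, counts only ATG-initiated frames, and applies the 40-codon minimum to sense codons, excluding the stop codon. The function name `find_orfs` and its parameters are hypothetical.

```python
# Hypothetical helper: scan a 5' transcript sequence (starting at the TSS)
# for ATG-initiated open reading frames of a minimum length.
# Assumptions: forward strand only, DNA alphabet, stop codon excluded
# from the codon count.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=40):
    """Return a sorted list of (start, end, n_codons) tuples.

    start and end are 1-based, inclusive positions relative to the first
    base of seq; end is the last base of the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the first ATG in the current open frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # ATG through last sense codon
                if n_codons >= min_codons:
                    orfs.append((start + 1, i + 3, n_codons))
                start = None  # resume the search after this stop codon
    return sorted(orfs)
```

Only the most upstream ATG of each frame is reported, so nested ORFs that share a stop codon are not listed separately. An ORF that runs off the end of the sequence without a stop codon is ignored.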
Detection of novel splicing variants and shorter transcripts
Our experimental procedure revealed alternatively spliced isoforms for two genes, IMPG1 and SAG, which lack exon 2 of the RefSeq. These forms have not been annotated in the RefSeq database. We confirmed these alternatively spliced isoforms by regular RT-PCR (data not shown). The second exon of the gene SAG contains the TSS, and the presence of this alternative form, lacking the regular start site, may play a role in the regulation and further processing of the transcript. For 4 genes (CRB2, CRX, RP1, WDR17), we detected shorter transcripts that lack the annotated start codon. Since these experiments were done with the same adapter-ligated first-strand cDNA, we assume that these short transcripts are derived from true alternative TSSs. These transcripts may be preferentially amplified in the RT-PCR and may be translated from an internal initiation codon. We report in Table the detailed results for these genes.
Confirmation of results with primer extension
To provide an experimental validation of our results we undertook primer extension experiments. We performed reverse transcription of mRNA with a sequence-specific FAM-labelled primer for two genes (CNGA3, RDH12). The length of the FAM-labelled cDNA primer extension product was analysed on an ABI DNA Genetic Analyser using GeneScan software. We detected a fragment of 350 bp for CNGA3 (Figure ) and a fragment of 215 bp for RDH12 (Figure ). The sizes of these fragments confirm the presence of the transcripts that we detected with RACE.
Figure 2 Primer extension results from WERI-RB1 retina cell lines mRNA. Primer extension products obtained with the gene specific primer for CNGA3 (A) and for RDH12 (B). The blue peaks in each panel correspond to the primer extension product (FAM-labelled cDNA).
Comparison with existing annotations and databases
To assess the quality of current annotations of the 5' end of genes expressed in human retina, the sequences obtained by 5' RACE were compared with the corresponding gene annotation/prediction. Overall, RACE experiments detected 15 exons that were neither annotated nor predicted for retina transcripts; 8 exons did not have any matching experimental evidence in GenBank, while the other 7 showed different boundaries or alternative splice sites. Of these 15 un-annotated exons, 12 are first exons and can be considered the new first exon for the retinal transcripts. Of the 54 genes successfully amplified, 41 (76%) delivered 5' RACE sequences different from the annotation. Results of a parallel project, DBTSS [8], supported our results concerning 3 of these genes (CNGA3, ELOVL5), although the source of mRNA was not human retina. We extended the annotated first exon of 27 RefSeq genes by an average of 60 transcribed bases. We compared our results with genome-wide mapping of TSSs using CAGE tags [10]. We found perfect correspondence for 13 transcript isoforms; for another 6 transcripts the start site retrieved in the CAGE database is located less than 400 bp away. For 35 transcript isoforms the TSS is located in a different position (see Table ). This discrepancy may be due to the fact that the CAGE database does not include retina amongst the panel of analysed tissues and therefore lacks specific and rare transcript isoforms present in that tissue.
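The three-way comparison with CAGE-derived start sites can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it assumes that RACE and CAGE positions are given on the same chromosome, strand and genome assembly, and that a transcript with no CAGE tag at its locus counts as "different". The function names and category labels are hypothetical.

```python
from collections import Counter


def classify_against_cage(race_tss, cage_positions, near_bp=400):
    """Classify one RACE-derived TSS against CAGE TSS positions.

    Returns 'exact' for a perfect match, 'near' when the closest CAGE
    start site is less than near_bp away, and 'different' otherwise.
    Positions are assumed to share chromosome, strand and assembly.
    """
    if not cage_positions:
        return "different"  # assumption: no CAGE evidence counts as different
    distance = min(abs(race_tss - pos) for pos in cage_positions)
    if distance == 0:
        return "exact"
    if distance < near_bp:
        return "near"
    return "different"


def tally(pairs, near_bp=400):
    """Count categories over (race_tss, cage_positions) pairs."""
    return Counter(classify_against_cage(r, c, near_bp) for r, c in pairs)
```

Applied to one pair per transcript isoform, `tally` would return counts of the kind reported above (perfect correspondence, less than 400 bp away, different position).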
Shape of TSSs and conservation
After analyzing the distribution of RACE clones we could define the shape of TSSs according to the classification previously reported [10]. The different clones were clustered and, depending on the start base position of each clone within a cluster, we divided the start sites into four shapes. In the single dominant peak class (SP) the majority of clones are concentrated in no more than four consecutive start positions with a single dominant TSS. The clusters spanning a broader region are grouped into a general broad distribution (BR), a broad distribution with a dominant peak (PB) and a bi- or multimodal distribution (MU). Of the analysed genes, 22 showed a single dominant peak, 11 a broad distribution, 8 a bi- or multimodal distribution and 6 a broad distribution with a dominant peak. For some transcripts we could not make a classification because the number of clones was less than 5. We report the results of this analysis in Table (TSS shape). Figure shows a graphical view of TSSs identified for AOC2 and LRRC21 as examples of the different distributions observed. The classification of the shape of TSSs, defined by the distribution of 5' end RACE clones within a cluster, is useful for the further characterization of expression regulation. The distribution of the clones defines different elements of the core promoter and gives insights into the start of transcription. Even if broad promoters are the major class in mammals [10], 36 of the analysed transcripts present a dominant peak, highlighting the possibility that those transcripts are tightly regulated.
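The single dominant peak criterion lends itself to a short sketch. The text states only that the majority of clones fall within no more than four consecutive start positions and that transcripts with fewer than 5 clones were left unclassified. The code below therefore assumes that "majority" means more than half of the clones and tests the SP class only; the criteria separating BR, PB and MU are not specified here and are not modelled. The function name `is_single_peak` is hypothetical.

```python
from collections import Counter


def is_single_peak(clone_starts, window=4, min_clones=5):
    """Test whether RACE clone 5' ends form a single dominant peak (SP).

    clone_starts: one genomic start position per sequenced clone.
    Returns None when there are fewer than min_clones clones
    (unclassified), True when more than half of the clones start within
    `window` consecutive positions, and False otherwise.
    """
    if len(clone_starts) < min_clones:
        return None
    counts = Counter(clone_starts)
    # The densest window can always be shifted to begin at an observed
    # position, so only those positions need to be tried as window starts.
    best = max(
        sum(counts.get(pos + offset, 0) for offset in range(window))
        for pos in counts
    )
    return best > len(clone_starts) / 2
```

A cluster that returns False would still need to be assigned to one of the broader classes by inspecting its histogram.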
Figure 3 TSSs present different shapes. Histograms indicate the number of RACE clones mapping at each nucleotide position. Examples show the different patterns that we observed during the analysis of the Cap-finder RACE. A) Clone distribution for AOC2 (single ...
Although TSSs of orthologous genes do not necessarily reside at equivalent locations because of the evolution of mammalian TSSs [30], we analysed sequence conservation of the new first exons among a set of mammals (mouse, dog, cow): conservation ranges between 42 and 89%. We report the pairwise alignment percentage of identity in Additional file 7.
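Percentage identity depends on the convention used for gapped columns, which is not stated here. The sketch below assumes one common convention: identical residues divided by the number of alignment columns, with columns that are gaps in both sequences ignored and a residue opposite a gap counted as a mismatch. The function name `percent_identity` is hypothetical, and the input is assumed to be two already-aligned sequences of equal length using "-" for gaps.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage identity between two rows of a pairwise alignment.

    Assumed convention: matches / alignment columns * 100, where columns
    that are gaps in both rows are dropped and a residue opposite a gap
    counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)
```

Other conventions, such as dividing by the length of the shorter sequence or by the number of ungapped columns, give different values for the same alignment, so the denominator should be stated whenever such percentages are compared.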
Sequences residing upstream and downstream from the boundaries of the newly defined exons are regions displaying high regulatory potential, as calculated by a computational algorithm [31] integrated in the UCSC genome browser. The regulatory potential (RP) scores computed for the 500 bp sequence upstream of the TSS show that in 9 out of 11 new first exons the RP value exceeds an arbitrary threshold of 0.2 (data not shown). Considering the 500 bp downstream of the splice site of the first exon, the RP value is > 0.2 for at least 7 out of 11 first exons. This observation confirms the importance of the newly described exons for locating new regulatory elements that are important for transcription in retina.
A high level of conservation was observed for the splice donor sites of the new first exons. Five genes show an average conservation of at least 75% in the region -3/+5 spanning the splice donor site. As an example, we report an inter-species alignment of the 3' end of exon 0 of CNGA3. The sequence conservation at the level of the splice donor site highlights the possibility of a particular role for this splicing (Figure ).
Figure 4 Inter-species alignment of exon-intron boundaries of the exon 0 of CNGA3. Conserved nucleotides are labelled with colours, and a star at the bottom marks those conserved in all the analysed species (human, mouse, rat, rabbit, dog, cow, elephant and tenrec).