A total of 181,279 ESTs were obtained which passed the default quality thresholds as determined by the manufacturer (454 Life Sciences Corporation, USA). A summary of the analysis of the ESTs is presented in Table . Initially, low quality bases were trimmed from the EST ends using trim2 [12
] and the resulting sequences, with an average, minimum and maximum length of 102, 41, and 302 bp, respectively, were compared to the known and predicted human transcriptome [3
]. 140,906 (77.7%) sequences matched directly by BLAST [13
] to a specific human transcript with a p-value less than 9 × 10-7
, while 40,373 (22.3%) of the sequences did not match with any known human transcribed sequence and hence potentially identify novel transcripts at this relatively high stringency. The 140,906 ESTs mapping to a known transcript cover 1.2 MB of the annotated human transcriptome and are available through dbEST [14
Summary analysis of EST to human transcriptome/genome mapping
An histogram of the abundance of transcripts from annotated gene loci is shown in Figure . The ten most abundant transcripts are shown in Table . Further, we collected all genes detected in our library which are described by Ensembl as either being involved in cancer or expressed in the prostate and have made these available as a supplementary table [see Additional file 1
]. Of the 9,173 loci for which transcription was detected (BLAST p ≤ 9 × 10-7
) 2,199 were observed with a single EST sequence. Lowering the BLAST p-value to 0.05 allows 152,544 ESTs to map to 10,117 loci of which 2,417 have only one EST mapped to them (data not shown).
A histogram showing the number of gene loci hit by a given number of ESTs.
Top 10 most abundant transcripts in androgen-stimulated LNCaP cells by EST count.
In order to determine whether our technique quantitatively measures transcript abundance, we compared our EST library to two SAGE libraries of R1881 treated LNCaP cells [15
]. These two combined libraries have approximately 28,000 unique tags that we mapped unambiguously to 1,050 genes using DiscoverySpace [17
Calculating the Pearson coefficient for all 9,173 genes gives a correlation value of 0.40. This value increases to 0.45 if we only consider genes that had at least one SAGE tag. Of our 10 most abundant genes, 3 (ENSG00000198899, 198886, 198763) are represented in the top 15 most abundant genes in the SAGE library. By reducing our stringency on which tags are deemed ambiguous we can successfully map an additional 3 genes of our top ten genes to the top 15 most abundant genes in the SAGE library.
We studied the representation of the ESTs across known spliced transcripts (Figure ). From this experiment, we observe four types of sequencing bias. The first favours the positive strand (coding) of the transcript. The second bias is seen at the 5' and 3' ends of the transcript. This occurs because the ends of a given transcript are readily available for sequencing, even if fragmentation of the cDNA is incomplete during the sample preparation (Methods). The increased representation of sequences in the mid-range of the transcript arises, almost entirely, to transcripts having lengths shorter than 1,200 bp. We found that such transcripts generally shear near the centre of the sequence (data not shown). Lastly, there is a general bias to the 3' end of the transcript that is likely due to incomplete cDNA synthesis across the entire length of the RNA transcript.
Figure 2 A histogram showing the start of EST alignments to human transcript sequences (length > 500). Position is given as a percentage of the length of the transcript. ESTs which align to the positive or negative strands of the cDNA are shown in light (more ...)
We investigated the ability of this approach to identify alternatively spliced transcripts within the mRNA population. By using BLAT [18
] to map reads which showed good alignment to the transcriptome, but poor alignment at either end of the read, we discovered 25 (Table ) previously unreported splice junctions that begin and end in a previously annotated exon and 106 novel alternative splice variants that map from a known exon and splice into intronic sequence. For all alternate splices, save 2, the event was demonstrated by a single EST. Figure shows an alternative splicing event within the gene coding for Brain Protein I3 (BRI3
) that causes a 40 base pair insertion between exons 1 and 2. This frame-shift occurs upstream of the transmembrane regions of the protein, as predicted by Ensembl [19
], and although it does not cause a premature stop in the coding sequence, it eliminates the transmembrane domains of the protein as determined by InterProScan [20
]. Interestingly, disruption of BRI3
has been implicated in tumor necrosis factor alpha (TNF-α) induced apoptosis resistance [21
25 novel alternative splicing events in androgen-stimulated LNCaP cells
Figure 3 Alternative splicing of Brain Protein I3 (ENSG00000164713) showing a short insertion between two exons. 5' and 3' ends of two exons are shown in black text, interspaced by an intron (full sequence not shown) in orange. Base positions where the EST aligns (more ...)
In areas where the base quality, as determined by the 454 sequencer, was exceptionally high, we utilized the EST data to detect high quality discrepancies (HQDs) in the LNCaP transcriptome. In this analysis, we required that discrepancies have a phred-like score >80 in order to be considered significant. This score threshold is set a such a level that we would require, at minimum, 3 sequences to confirm the presence of a HQD. Using this stringent approach we discovered 1,479 HQDs of which 86 (5.8%) were present in Ensembl's variations database [3
], 29 showed variations at the same position but to a different base, and 1,364 were not described in Ensembl. For each HQD type, the mean and median base quality score were calculated (see Methods). The mean and median scores for confirmable HQDs are significantly higher (p ≤ 0.02) than for unconfirmable HDQs (Table ). Although we do not dismiss all of the unconfirmable 1,393 HQDs as spurious, even under such an hypothesis, a significant (p ≤ 10-45
) enrichment of characterized variations is found, as compared to random sampling.
The HQD count, mean and median phred scores and the HQD count
We also used the ESTs for the discovery of unannontated genes. Of the 40,373 ESTs that did not map to the known human transcriptome at high stringency, 19,392 (48.0%) were successfully mapped to the entire human genome sequence at high confidence (p ≤ 5 × 10-6). Of these, 9,488 (48.9%) map mostly or entirely to an intron. The remaining 9,904 ESTs map intergenically, that is, to a region that is not known or predicted to contain an ORF. Further, 1,900 (19.2%) of these ESTs map to a region where there is no existing alignment feature in Ensembl (EST gene, or EST) and 380 (3.8%) align at least 20 Kbp away from a known gene.
Of the 40,373 reads, 20,981 failed to map to the human genome at p ≤ 5 × 10-6. Figure shows the distribution of ESTs that fail to map successfully to the human genome or transcriptome at various p-value thresholds. At p ≤ 0.05, 9,585 reads remain unmapped.
A histogram of ESTs that fail to map to the human genome at various p-values.
The 9,585 ESTs that failed to map to the human genome were then aligned to sequences in GenBank-nt (Dec 5th, 2005). 7,643 failed to map with a p-value ≤ 0.05, 605 reads mapped most strongly to human sequences and 587 reads were mapped to other organisms (Table ). The 605 human sequences that failed to map to the human genome/transcriptome were missed for 2 reasons. First, because of the different nucleotide frequencies in the two databases, ESTs that have low complexity, and therefore high p-values, when mapped to the human genome (or transcriptome) have lower p-values when mapped to GenBank-nt. Second, the GenBank-nt database contained ESTs with splicing patterns not represented in the current Ensembl transcript annotation. Therefore, the ESTs that map to exon-exon junctions in these speculated genes will also have poor p-values. The 587 reads which were mapped to other organisms are a product of contamination of the sample. Although the contamination could have occurred before cDNA production, the nature of the contaminants suggests that this most likely occurred in the 454 nebulization process. The 7,643 remaining ESTs are almost certainly the product of poor quality sequencing. Although their quality values are not significantly different from the other reads, they are markedly shorter (average length of ~86 nt) which reduces the minimum possible p-value for these sequences.
Idetification of contamination of 454 EST data