Previous reports have described alternatively spliced cancer-associated genes that could serve as diagnostic and prognostic markers as well as guides to potential therapeutic targets [10-12]. These studies used a variety of tools, including analysis of publicly available databases and discovery using microarrays, both of which rely on high-throughput methods to characterize cancer-related alternative splicing [13, 20]. More recently, a high-throughput RT-PCR method has been used to measure the differential expression of 3,327 alternative splicing events in 600 cancer-related genes in ovarian and breast cancer [3, 21]. The present study provides one of the first attempts to show how whole transcriptome sequencing using massively parallel sequencing technology can be used to simultaneously profile as many as 151,486 alternative splicing events in cancer patient samples for all the human genes cataloged in the comprehensive AceView database of transcribed sequences. In our previous study [15], we used the Roche/454 next-generation sequencing system for whole transcriptome pyrosequencing of cDNA samples from 4 MPM tumors, 1 lung adenocarcinoma and 1 normal lung. We generated between 2.5 and 2.9 million shotgun sequencing reads of average length ~105 bp for each sample and focused on the identification of novel single nucleotide polymorphisms (SNPs) in the expressed sequences. Herein, we hypothesized that these same transcriptome reads could be used to detect alternatively spliced genes in the patient samples by mapping all the read sequences onto known exon junction sequences used as virtual probes. Using a subset of the same dataset [15], we have developed a software pipeline that quantifies the expression level of each exon junction by counting the number of reads that match it, in order to identify cancer-related alternative splicing patterns.
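The idea of counting reads against exon junction "virtual probes" can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The function names, the `flank` and `min_overhang` parameters, and the toy sequences are all assumptions made for illustration. The sketch also uses exact forward-strand substring matching, whereas a real pipeline would rely on a sequence aligner that tolerates mismatches.

```python
from collections import Counter


def build_junction_probe(upstream_exon, downstream_exon, flank=50):
    """Join the last `flank` bases of one exon to the first `flank` bases
    of the next. Returns (probe_sequence, junction_offset)."""
    left = upstream_exon[-flank:]
    right = downstream_exon[:flank]
    return left + right, len(left)


def count_junction_reads(reads, probes, min_overhang=8):
    """Count reads that span each junction.

    `probes` maps a junction name to (probe_sequence, junction_offset).
    A read is counted only if it matches the probe exactly and covers at
    least `min_overhang` bases on both sides of the junction (hypothetical
    rule; exact matching stands in for a real aligner).
    """
    counts = Counter({name: 0 for name in probes})
    for read in reads:
        for name, (seq, offset) in probes.items():
            start = seq.find(read)  # first exact occurrence, or -1
            if start == -1:
                continue
            end = start + len(read)
            if offset - start >= min_overhang and end - offset >= min_overhang:
                counts[name] += 1
    return counts


def reads_per_million(count, total_reads):
    """Scale a raw junction count by sequencing depth."""
    return count * 1_000_000 / total_reads


# Toy exons (made-up sequences, not real gene data).
exon1 = "TTGACCGATTGCAAGT"
exon2 = "CCGTATGGACTTAGCA"
exon3 = "GGATCGTTCAAGCTTA"

probes = {
    "E1-E2": build_junction_probe(exon1, exon2, flank=12),
    "E1-E3": build_junction_probe(exon1, exon3, flank=12),
}
reads = [
    "TGCAAGTCCGTATG",  # spans E1-E2
    "AAGTCCGTATGGAC",  # spans E1-E2
    "CAAGTGGATCG",     # spans E1-E3
    "GATTGCAAGTCC",    # only 2 bases past the E1-E2 junction: rejected
    "CCGATTGC",        # lies entirely inside exon 1: rejected
]
counts = count_junction_reads(reads, probes, min_overhang=4)
# E1-E2: 2 reads, E1-E3: 1 read
```

The minimum overhang keeps reads that barely touch a junction from being counted as evidence for it. Scaling by total read count, as in `reads_per_million`, is one simple way to compare junction counts between samples sequenced to different depths.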
In this study, we have identified several genes expressing alternatively spliced transcripts at different levels in the tumors and in the matching normal lungs. Many of these genes have been previously implicated in cancer. ACTG2 contains one 5' untranslated exon and 8 coding exons spanning 27 kb [22]. This gene has 7 known splice variants listed on the AceView website [16], but none of them has been directly related to cancer. An expression microarray analysis performed on a derivative breast cancer cell line resistant to cisplatin showed that ACTG2 expression is increased in the chemotherapy-resistant cell line compared to the normal, indicating that it may be associated with cisplatin resistance [23]. In another study, ACTG2 expression was identified as cadmium-responsive [24]. The authors concluded that repressed expression of ACTG2 following cadmium exposure may contribute to cell cycle arrest. CDK4 is well known for its role in cancer [25]. It is cyclin-dependent kinase 4, which is involved in the cell cycle and can both start and stop the cell cycle in response to proliferative or anti-proliferative signals. It has 13 known splice variants according to the AceView website [16]. Several reports have already linked CDK4 expression to mesothelioma [26, 27]. However, this is the first study showing that different CDK4 splice variants have differential expression levels in MPM and matching normal lung.
The differentially expressed splice variants of ACTG2 and CDK4 were specifically chosen for further examination using qRT-PCR in an additional 18 MPM and matched normal lung samples. This analysis suggested that the differentially expressed splice variants may provide reliable markers for disease and could be used to classify the samples with high sensitivity and specificity.
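For readers less familiar with these two measures, the following Python sketch shows how sensitivity and specificity are computed once a marker value is thresholded. The function name, the threshold, and the scores are hypothetical and chosen purely for illustration. They are not data or results from this study.

```python
def sensitivity_specificity(scores, is_tumor, threshold):
    """Call a sample 'tumor' when its marker score is >= threshold, then
    compare the calls with the known labels."""
    tp = fn = tn = fp = 0
    for score, tumor in zip(scores, is_tumor):
        called_tumor = score >= threshold
        if tumor and called_tumor:
            tp += 1
        elif tumor:
            fn += 1
        elif called_tumor:
            fp += 1
        else:
            tn += 1
    # Sensitivity: fraction of true tumors that are called tumor.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true normals that are called normal.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up marker scores for 4 tumor and 4 normal samples.
scores = [3.1, 2.4, 0.8, 2.9, 0.5, 1.2, 0.7, 1.6]
is_tumor = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(scores, is_tumor, threshold=1.5)
# sens == 0.75 and spec == 0.75 for these toy values
```

Moving the threshold trades one measure against the other, which is why both are usually reported together.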
Several of the other genes that appeared to exhibit differentially expressed splice variants in the present study have also been implicated in cancer and would be worthy of further study. CYFIP1 has been shown to be a novel tyrosine kinase substrate in a breast cancer model [28]. Interestingly, the differentially expressed transcript (CYFIP1.fAug05) is not included in the NCBI RefSeq sequence database. It has been suggested that COL3A1 could be a potential diagnostic marker for diffuse large B-cell lymphoma (DLBCL), as it shows a statistically significant difference in expression between DLBCL and follicular lymphoma [29]. In addition, COL3A1 expression has been related to resistance to platinum drugs in ovarian cancer [30]. TXNRD1 is a key enzyme in the regulation of the intracellular redox environment [31]. Transcription of TXNRD1 involves alternative splicing, leading to a number of transcripts. In particular, expression of the TXNRD1_v3 transcript has been found in several cancer cell lines [32]. Recently, its locus has been associated with advanced colorectal adenoma by epidemiologic and animal studies [33].
In this pilot study, we were not able to establish statistically significant differences between differentially expressed exon junctions because of the small sample size (only 4 MPM and 1 normal lung sample), nor were we able to compare the results across different platforms. In addition, not all exon junctions predicted to be differentially expressed proved to be so to the same extent when examined with qRT-PCR. This is likely due to the limited number of specimens examined. Furthermore, the signal-to-noise ratio may have been over-amplified so as not to miss any potential differentially expressed candidates. After all, the EJEI was designed to magnify the differential expression and is not on the same order of magnitude as the actual expression of the exon junctions. Other potential limitations may be due to sequencing artifacts, insufficient sequencing depth, SNPs near the EJ and an incomplete database of possible exon junctions. These limitations may be avoided by using other next-generation sequencing platforms, such as Helicos True Single Molecule Sequencing without amplification, Illumina or SOLiD [34], or by increasing the sequencing depth. Nevertheless, at least 2 of the top 10 exon junctions prioritized for analysis remained differentially expressed in most tested specimens, supporting the utility of this approach.
The present study provides an example of a possible application of advanced sequencing technologies in cancer research. Current sequencing technologies are capable of generating millions of shotgun transcriptome reads in a matter of days. For example, the latest Roche/454 GS FLX Titanium system generates over 1,000,000 reads of average length 400 bp in one 10-hour sequencing run, providing orders of magnitude improvement in speed and cost over conventional Sanger-based sequencing. One of the great virtues of the shotgun transcriptome sequencing process is that there is no need to impose any bias toward known genes, exons, or splice junctions, as is required, for example, with exon microarrays. As these technologies continue to improve their throughput and read lengths and lower their costs, they promise to revolutionize gene expression analysis by simultaneously providing information about expression levels, transcript variants, and SNPs.
The greater challenge for the successful application of these technologies to our understanding of health and disease will be the analysis and interpretation of the data. Here we have introduced a data analysis pipeline to map 13,274,187 transcriptome reads from patient cDNA samples onto 151,486 known splice junctions cataloged in the comprehensive AceView database of transcribed sequences. However, in short order we can expect that the competing next-generation sequencing technologies will be generating several orders of magnitude more transcriptome and genomic sequencing data for a wide variety of human diseases, including cancer. Further advances in the bioinformatics analysis of this flood of data are clearly required, for example, to map sequencing reads directly to the human genome in order to identify novel transcribed sequences and genes, alternative exons and splicing events, and possible gene fusions in patient samples.