A genomic era of cancer studies is developing rapidly, fueled by the emergence of next-generation sequencing technologies that provide exquisite sensitivity and resolution. This article discusses several areas within cancer genomics that are being transformed by the application of new technology, and in the process are dramatically expanding our understanding of this disease. Although we anticipate that there will be many exciting discoveries in the near future, the ultimate success of these endeavors rests on our ability to translate what is learned into better diagnosis, treatment and prevention of cancer.
In this past year, remarkable advances in our understanding of the mutational profiles and other disease-specific alterations of cancer genomes have been reported. In general, the field of cancer genomics has been impacted most profoundly by the application of next-generation sequencing technology, which has tremendously accelerated the pace of discovery while dramatically reducing the cost of data production. Hence, there has been a rapid progression from targeted gene re-sequencing using PCR and Sanger sequencing to targeted, whole genome, or whole transcriptome sequencing using these massively parallel sequencing platforms, coupled with the requisite bioinformatics-based approaches to analyze the data. Within this brief timeframe, reported studies have ranged from those examining all known genes in a few samples, to those examining hundreds of genes in hundreds of samples, to whole genome sequencing and analysis of a matched tumor/normal pair. There remains much to be learned about this complex disease, of course, but our fundamental understanding of which genes are mutated in cancer cells, the pathways that are impacted by these mutations, and how these data inform our models of cancer biology will undoubtedly evolve rapidly in the near future.
A well-known characteristic of cancer genomes is that they are frequently altered in their gross chromosomal structure by amplification, deletion, translocation and/or inversion of chromosomal segments. Such alterations often, of course, concomitantly alter genes in a number of ways that may be critical to cancer onset or progression. As such, important developments in obtaining increasingly more detailed genome-wide characterizations of structural variation (SV) in tumor genomes have been described recently. Initially, these studies were conducted using signal strength-based analyses on high-density SNP array data sets, where tumor and normal genomic DNA were compared and any large-scale amplification or deletion signals were detected as continuous blocks of SNPs with higher than (amplification) or lower than (deletion) the normalized signal strength (1). The genes in these regions often are re-sequenced to identify mutations or are assayed for evidence of altered gene expression levels that correlate with a detected copy number alteration. Weir et al. (2) provided a powerful example of this approach using 384 lung adenocarcinoma samples in which they identified a novel candidate proto-oncogene (NKX2-1/TITF1) in an amplified region of chromosome 14.
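The signal strength-based detection described above can be illustrated with a minimal sketch. It assumes the array data have already been reduced to one tumor/normal log2 ratio per SNP, ordered along a chromosome. The function name `call_segments` and the thresholds (`gain`, `loss`, `min_snps`) are hypothetical choices for illustration, not values taken from the cited studies, which used more sophisticated segmentation methods.

```python
from typing import List, Tuple


def call_segments(log2_ratios: List[float],
                  gain: float = 0.3,
                  loss: float = -0.3,
                  min_snps: int = 3) -> List[Tuple[int, int, str]]:
    """Find continuous blocks of SNPs with elevated or reduced signal.

    Returns (start_index, end_index_exclusive, label) for each run of at
    least `min_snps` consecutive SNPs whose log2 ratio is >= `gain`
    (amplification) or <= `loss` (deletion). Thresholds are illustrative.
    """
    def label(value: float) -> str:
        if value >= gain:
            return "amplification"
        if value <= loss:
            return "deletion"
        return "neutral"

    segments: List[Tuple[int, int, str]] = []
    start = 0
    for i in range(1, len(log2_ratios) + 1):
        # Close the current run when the label changes or the data ends.
        at_end = i == len(log2_ratios)
        if at_end or label(log2_ratios[i]) != label(log2_ratios[start]):
            state = label(log2_ratios[start])
            if state != "neutral" and i - start >= min_snps:
                segments.append((start, i, state))
            start = i
    return segments


ratios = [0.0, 0.1, 0.5, 0.6, 0.7, 0.0, -0.5, -0.6, -0.4, -0.7, 0.05, 0.4]
segments = call_segments(ratios)
```

Requiring a minimum run length is one simple way to avoid calling isolated noisy probes; the genes inside each reported block would then be candidates for re-sequencing or expression analysis, as the text describes.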
Complementary to array-based methods, next-generation sequencing-based approaches are being applied to the SV problem at a higher level of resolution and complexity. Korbel et al. (3) first demonstrated that paired-end reads from next-generation sequencing platforms can be aligned to the genome and examined algorithmically to identify putative SV. Their approach was based on the identification of anomalously mapping read pairs that align several standard deviations outside the well-defined size range of the library itself. Read pairs that mapped too close together, too far apart, in an unpredicted orientation, or across chromosomes gave the indication of potential insertions, deletions, inversions or translocations in the sequenced genome. By these methods, we can obtain a much more precise view of genome-wide SV than by array-based analysis methods. Several groups have recently described advanced implementations of this approach, utilizing low coverage of a cancer genome with paired-end reads (4,5). These methods fit nicely into a paradigm of whole genome sequencing followed by mutation discovery. Here, a small investment in paired-end reads at light coverage can profile the extent of SV across a large number of tumor samples as a first step. This type of analysis not only identifies common copy number and structural variant loci, but also allows a calculation of the deeper sequence coverage that will be required to characterize focal mutations (e.g. single nucleotide and small in/dels) in each tumor genome, since large-scale amplification (for example) will inflate the sequence coverage requirement. One can then obtain this deeper coverage with the same libraries used to produce the initial data set.
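The read-pair logic described above can be sketched as a small classifier. This is a simplified illustration, not the algorithm of Korbel et al. or of any published SV caller. The `ReadPair` fields, the `classify_pair` function and the default cutoff of three standard deviations are assumptions, and the distance between mapped positions is used as a stand-in for fragment size. All unexpected orientations are grouped as inversion candidates, following the text, although real callers distinguish further cases.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two ends of one sequenced fragment."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str  # "+" or "-"


def classify_pair(pair: ReadPair, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0) -> str:
    """Label a read pair by the kind of SV its mapping pattern suggests."""
    # Ends on different chromosomes suggest a translocation.
    if pair.chrom1 != pair.chrom2:
        return "translocation"

    # Order the two ends by position; the expected orientation is
    # forward strand on the left, reverse strand on the right.
    left, right = sorted([(pair.pos1, pair.strand1),
                          (pair.pos2, pair.strand2)])
    if (left[1], right[1]) != ("+", "-"):
        return "inversion"

    # Compare the mapped span with the library's size distribution.
    span = right[0] - left[0]
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"   # sample lacks sequence present in the reference
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # sample carries sequence absent from the reference
    return "concordant"


labels = [
    classify_pair(ReadPair("chr1", 1000, "+", "chr1", 1400, "-"), 400, 30),
    classify_pair(ReadPair("chr1", 1000, "+", "chr1", 5000, "-"), 400, 30),
    classify_pair(ReadPair("chr1", 1000, "+", "chr1", 1100, "-"), 400, 30),
    classify_pair(ReadPair("chr1", 1000, "+", "chr1", 1400, "+"), 400, 30),
    classify_pair(ReadPair("chr1", 1000, "+", "chr9", 1400, "-"), 400, 30),
]
```

In practice a single anomalous pair is weak evidence; callers typically require several independent pairs supporting the same breakpoint before reporting a putative SV.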
The combination of PCR and Sanger sequencing to discover mutations in tumor genomes has proven a powerful initial approach, as evidenced by several recent studies that we describe below. Although studies using this method have targeted limited numbers of genes and successfully identified key somatic mutations in cancer genomes, the method recently has been applied to characterize hundreds of genes as well as the entire ‘exome’ (all known protein coding exons). In particular, two articles published in the same 2008 issue of Nature (6,7) demonstrated how targeted gene re-sequencing and variant detection can contribute significantly to our understanding of the types of genes carrying somatic mutations in a given cancer type [here, lung adenocarcinoma and glioblastoma multiforme (GBM)] by discovering novel genes mutated in each tumor type. In these studies, by virtue of sequencing large numbers of the same tumor type (based on pathological examination, tumor stage and grade), the results highlighted the cellular pathways putatively impacted by these mutations. Both articles arrived at important correlative conclusions by integrating the somatic mutation data with the results from other genome-wide characterizations of the same samples, such as array-based gene expression data, genome structure perturbation data [e.g. loss-of-heterozygosity (LOH), amplification or deletion of large chromosomal segments], and clinical data elements (e.g. outcome, response to therapy, etc.). For example, MAPK signaling, P53 signaling, cell cycle regulation and mTOR pathways are targeted in lung adenocarcinoma samples by combinations of point mutation, copy number amplification and deletion and LOH (7).
Similarly, Vogelstein and colleagues have extended their initial efforts to characterize mutations by screening most of the known coding genes in the genome in several tumor types (8,9), to also include information about gene expression using next-generation sequencing of serial analysis of gene expression tags, and about genome copy number alterations from genotyping arrays. Their analyses combine data about somatically mutated genes with data about copy number alterations to identify candidate cancer genes (‘CAN-genes’), thereby generating evidence for mutations that are driving carcinogenesis (‘drivers’) versus having no impact on tumor growth (‘passengers’). Gene expression data inform the pathways analysis, by reflecting epigenetic alterations not detectable by sequencing or copy number analyses.
This combined approach, in a study of GBM samples, resulted in the discovery of several commonly mutated genes, some impacting novel pathways. Among these was the surprising identification of an IDH1 mutation that was found in 18/149 (12%) cases, all occurring at the same residue (R132) (10). Using clinical data, several interesting correlations regarding the IDH1 mutation were made; namely that this mutation was more prevalent in younger GBM patients (mean age of 33 versus 53 years of age), more prevalent in patients developing secondary GBMs (that develop from low grade gliomas) and predicted a significantly improved prognosis (median overall survival of 3.8 versus 1.1 years). In a follow-on study, this group evaluated the IDH1 R132 and related IDH2 R172 mutation prevalence in a much wider range of tumor types that included 445 central nervous system (CNS) tumors and 494 non-CNS tumors (11). Here, the previously observed improved outcome for GBM patients carrying the IDH1 mutation was confirmed and extended to those carrying mutated IDH2 (median overall survival of 31 versus 15 months, at P = 0.002), and for patients with anaplastic astrocytomas (median overall survival of 65 versus 20 months, P < 0.001). An evaluation of the impact of one IDH1 mutation (R132H) and three IDH2 mutations (R172G, K and M) on the function of the resulting proteins showed severely diminished activity in NADPH production relative to the wild-type enzymes.
As more detailed profiling of the cancer genome has developed, the need for a full understanding of how these somatic alterations are manifest in the genes expressed by tumors has become pertinent. As in genome characterization, the use of next-generation sequencing of RNA extracted from tumor cells (‘RNA-seq’) produces a comprehensive data set for complete transcriptome characterization, as well as correlation to known genomic changes such as structural and copy number alterations, focused in/dels and single nucleotide mutations. Not only does this approach greatly expand the dynamic range of gene expression level data beyond the sensitivity limits of microarrays (12), but also it provides data that can be further mined in a number of ways (13) to enhance the understanding of the transcriptome in cancer. For example, RNA-seq data can identify allele-specific expression in the context of known mutations, verify the impact of a nonsense mutation, or provide a means of finding mutations in tumors as illustrated recently in ovarian tumors (14). Here, four granulosa-cell tumors (GCT) of the ovary were analyzed using whole transcriptome paired-end RNA sequencing, demonstrating that all four GCTs had a missense point mutation in the FOXL2 gene. This gene encodes a transcription factor known to be crucial in granulosa cell development, and since the same mutation was determined to be present in additional GCTs of the same adult-type tumors, it is a potential driver mutation.
These data also can be analyzed to detect alternative splice isoforms and fusion transcripts (15), as illustrated recently in a very clever approach by Maher et al. (16) that identified both known and novel fusion transcripts in prostate cancer samples. This approach utilized a combination of two next-generation platforms to produce sequence reads that were combined to identify fusion transcripts from cancer cell lines. In particular, RNA-seq data from a longer-read technology (Roche/454) first identified putative fusion transcripts by virtue of their alignment characteristics to the transcriptome, and then a second RNA-seq data set from a short-read platform (Illumina Genome Analyzer) was aligned to the putative fusion transcript reads to provide support for their presence. Using this paradigm, Maher et al. successfully identified known and novel fusion transcripts in the prostate cancer cell lines LNCaP and VCaP, and subsequently in RNA from several prostate tumor samples.
RNA-seq also can build evidence for novel genes that previously have not been annotated due to lack of ESTs or were missed by in silico prediction (13,17). Hence, further development of methods that elucidate the complexity of the transcriptome in cancer will both support and enrich our understanding of the cancer genome and cancer biology.
In addition to mRNA, the study of microRNAs (miRNAs) and their roles in regulating the expression of specific genes in both healthy and cancerous cells is rapidly expanding our comprehension about this aspect of cell biology (18). A recent study by Uziel et al. (19) demonstrated the interaction between miRNA overexpression and a well-characterized signaling pathway, Sonic Hedgehog/Patched (SHH/PTCH) in medulloblastoma (MB). Having determined the overexpression of nine genes in the miR-17–92 cluster in an MB mouse model with constitutively activated SHH/PTCH signaling pathway, this group then tested and demonstrated similar miR-17–92 cluster upregulation in a subset of human MB tumors with constitutively activated SHH/PTCH. This study provided the first evidence that the SHH/PTCH signaling pathway and miR-17–92 functionally interact and contribute to both murine and human MB development.
Similarly, Wyman et al. (20) and Nygaard et al. (21) demonstrated detection of novel miRNAs and miRNAs with differential expression in ovarian and breast cancer, respectively, using Roche/454 sequencing and miRNA discovery bioinformatics pipelines. Building upon these studies and others, numerous groups are now proposing miRNAs as prognostic or diagnostic markers for a variety of cancer types (22–25).
The most significant impact of next-generation sequencing on cancer genomics has been the ability to re-sequence, analyze and compare the matched tumor and normal genomes of a single patient. With the significantly reduced cost of sequencing and tremendously enhanced throughput, it is now within the realm of possibility to sequence multiple patient samples of a given cancer type. Such efforts require not only data generation, but also the careful development of analytical tools and pipelines, supported by validation efforts that feed back into the analytical process, to enhance the sensitivity and specificity of variant discovery. Due to the complex nature of genome variation, the entire spectrum of potential mutations requires consideration, including germline susceptibility loci, somatic single nucleotide and small indel mutations, copy number alterations and structural variants. To date, one publication has outlined such a study, describing the results obtained from sequencing and analysis of an acute myeloid leukemia genome (26). Several key concepts have emerged from this approach, including the use of high-density SNP genotype data to estimate genome sequence coverage by tracking the accuracy of sequence-based SNP calls at heterozygous loci, a step-wise approach to somatic single nucleotide variant discovery, and the use of read counts to establish the prevalence of somatic variants in the tumor cell population. The basic analytical approach aligned tumor (~21-fold haploid coverage) and normal (~14-fold haploid coverage) sequence reads to the reference human genome using the Maq alignment algorithm (27). As coverage accumulated during the generation of tumor and germline reads, Maq was used to call variant positions across the genome, and those calls were compared with the heterozygous loci determined from the overlapping set of SNP array genotype calls identified by both Illumina and Affymetrix genotyping arrays.
Sequence coverage was considered sufficient for mutation discovery once heterozygous calls from sequence data were made for >95% of these orthogonally determined heterozygous SNP positions. This approach toward monitoring genome coverage is now a cornerstone of our cancer genome re-sequencing pipeline.
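The coverage check described above amounts to a concordance calculation, sketched below under simplifying assumptions. Genotypes are represented as two-letter strings keyed by (chromosome, position). The function names are hypothetical, and only the >95% threshold comes from the text.

```python
from typing import Dict, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def het_concordance(array_genotypes: Dict[Site, str],
                    sequence_genotypes: Dict[Site, str]) -> float:
    """Fraction of array-defined heterozygous sites that the sequence
    data also call with the same two alleles (allele order ignored)."""
    het_sites = [site for site, gt in array_genotypes.items()
                 if gt[0] != gt[1]]
    if not het_sites:
        return 0.0
    detected = sum(
        1 for site in het_sites
        # A site with no sequence call yields "" and never matches.
        if sorted(sequence_genotypes.get(site, "")) ==
        sorted(array_genotypes[site])
    )
    return detected / len(het_sites)


def coverage_sufficient(array_genotypes: Dict[Site, str],
                        sequence_genotypes: Dict[Site, str],
                        threshold: float = 0.95) -> bool:
    """Apply the >95% criterion from the text."""
    return het_concordance(array_genotypes, sequence_genotypes) > threshold


array_calls = {("chr1", 100): "AG", ("chr1", 200): "CC",
               ("chr2", 50): "CT", ("chr2", 90): "GT"}
seq_calls = {("chr1", 100): "GA", ("chr2", 50): "CC", ("chr2", 90): "GT"}
fraction = het_concordance(array_calls, seq_calls)
```

Heterozygous sites are the informative ones here because detecting both alleles requires reads from both chromosomes, so the fraction recovered tracks how completely the diploid genome has been sampled.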
Somatic mutation discovery requires a number of steps to eliminate from consideration all known sequence variants, typically by (1) comparison with other sequenced genomes (via dbSNP) and with other resources for variant discovery such as the 1000 Genomes Project (www.1000genomes.org), followed by (2) comparison at remaining variant sites between the tumor and the normal genome. The approach also takes into consideration two primary measures of quality in order to distinguish high- from low-quality variants in the latter comparison. These primary measures include first, a cumulative base-calling quality value that is summed from the individual quality values of each base identifying the putative variant (assigned by the Illumina analysis pipeline) and second, a mapping quality value assigned by Maq that indicates the genome-wide uniqueness of each aligned read. Nonetheless, false positives do occur in this analysis, as do false negatives. False positives tend to result from incorrect interpretation of one or more data elements considered by the multicomponent analysis algorithm, often due to non-unique read placement or to a missing variant call in the matched normal sequence. The false negatives are harder to evaluate, but mainly appear to be due to lack of sufficient read support for a true variant in the tumor. On one hand, a reasonably high false positive rate is tolerated so that true mutations are not missed, but on the other, it is important to know which predictions are incorrect. Because of this, an orthogonal validation step using PCR-directed sequencing or genotyping should be performed to distinguish true from false positives for all putative somatic variants in genes or in regulatory/conserved regions of the genome.
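The step-wise filtering described above can be sketched as follows. This is an illustration of the logic only, not the published pipeline. The `Call` record, the `candidate_somatic` function and both quality thresholds are hypothetical, and the known-variant and normal-genome call sets are assumed to be available as simple in-memory sets.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

Variant = Tuple[str, int, str]  # (chromosome, position, variant allele)


@dataclass
class Call:
    chrom: str
    pos: int
    alt: str
    sum_base_quality: int  # summed base qualities of variant-supporting bases
    mapping_quality: int   # aligner's confidence in unique read placement


def candidate_somatic(tumor_calls: Iterable[Call],
                      known_variants: Set[Variant],
                      normal_variants: Set[Variant],
                      min_sum_base_quality: int = 60,  # illustrative value
                      min_mapping_quality: int = 40    # illustrative value
                      ) -> List[Call]:
    """Return tumor calls that survive each filtering step in turn."""
    survivors: List[Call] = []
    for call in tumor_calls:
        key = (call.chrom, call.pos, call.alt)
        # Step 1: drop variants already catalogued in public resources.
        if key in known_variants:
            continue
        # Step 2: drop variants also seen in the matched normal genome.
        if key in normal_variants:
            continue
        # Quality measures: drop weakly supported or ambiguously mapped calls.
        if call.sum_base_quality < min_sum_base_quality:
            continue
        if call.mapping_quality < min_mapping_quality:
            continue
        survivors.append(call)
    return survivors


tumor = [
    Call("chr1", 100, "T", 90, 60),  # passes every filter
    Call("chr1", 200, "G", 90, 60),  # known variant
    Call("chr2", 300, "A", 90, 60),  # present in matched normal
    Call("chr3", 400, "C", 20, 60),  # low summed base quality
    Call("chr4", 500, "T", 90, 10),  # low mapping quality
]
result = candidate_somatic(tumor,
                           known_variants={("chr1", 200, "G")},
                           normal_variants={("chr2", 300, "A")})
```

The surviving calls are candidates only; as the text notes, they still require orthogonal validation, and a variant missed in the normal sample would pass step 2 as a false positive.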
One of the key aspects of evaluating somatic mutations in cancer genomes is that the collective sequencing read pool represents a census of the genomic DNA contributed from all cancer cells used for DNA isolation. One challenge of this pooled approach is to determine what proportion of those cells carried each identified mutation. Information about the prevalence of any mutation in a cell population allows one to infer how early in the path toward cancer development that particular mutation occurred. The digital nature of next-generation sequencing allows us to evaluate this prevalence, since each read in the sequenced pool of fragments represents a single original DNA fragment from that cancer cell census. For example, since many mutations will present as heterozygous, we expect that 50% of the reads in a pure tumor cell population will contain the variant. Obviously, this proportionality will be influenced by the percentage of tumor cells in a sample, so a correction factor is applied based either on estimates from pathology review or by a more precise measure that calculates the percentage of normal reads present in the tumor read population at known/validated somatic sites in that tumor genome (L. Ding, personal communication). This type of analysis was applied to the first AML genome sequence, demonstrating that all somatic mutations were found in virtually all of the cells of the tumor, except for the FLT3 internal tandem duplication (Fig. 1), which is known from mouse models to not be an initiating mutation in AML (28).
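The read-count reasoning above can be made concrete with a small calculation. The sketch assumes a heterozygous mutation in a diploid region with no copy number change, so that a mutation present in every tumor cell of a pure sample appears in half of the reads. The function name and the simple purity correction are illustrative assumptions, not the authors' exact method.

```python
def mutation_prevalence(variant_reads: int, total_reads: int,
                        tumor_purity: float = 1.0) -> float:
    """Estimate the fraction of tumor cells carrying a heterozygous mutation.

    Assumes a diploid locus without copy number change, so the expected
    variant read fraction is (tumor_purity * prevalence) / 2.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0.0 < tumor_purity <= 1.0:
        raise ValueError("tumor_purity must be in (0, 1]")
    variant_fraction = variant_reads / total_reads
    # Invert the expected-fraction formula; cap at 1.0 because sampling
    # noise can push the raw estimate slightly above 100%.
    return min(1.0, 2.0 * variant_fraction / tumor_purity)


founder = mutation_prevalence(50, 100)          # pure sample, half the reads
subclonal = mutation_prevalence(10, 100)        # pure sample, 10% of reads
corrected = mutation_prevalence(20, 100, 0.8)   # 80% tumor cells in sample
```

Under these assumptions, a mutation estimated near 100% prevalence is consistent with an early event shared by virtually all tumor cells, whereas a clearly lower estimate suggests a later, subclonal event.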
We recently published our findings from sequencing a second AML genome and matched normal (29), where we employed the aforementioned concepts, identifying nine single nucleotide somatic variants in genes, two genic indels, and 54 somatic single nucleotide variants in known regulatory or highly conserved regions of the genome. Although none of the novel somatic variants identified in the first AML genome were recurrent among 187 other AML tumor genomes tested, one mutation found in the second AML genome analysis proved to be recurrent in 8.2% of those samples. This gene was IDH1, mutated at the exact R132 site also identified in GBM (10), as described earlier. Unlike Parsons et al., however, our correlation analysis among the 187 AML patients, combined with the clinical data, indicated that in AML, the IDH1 mutations portend a significantly worse outcome by Kaplan–Meier analysis for those patients who have normal cytogenetics and lack the NPMc and FLT3 mutations (Fig. 2). This finding demonstrates the power of the genomics approach, and highlights how new insights into cancer biology will result from further cancer genome sequencing.
One clear trend in cancer genome sequencing is that the continuing advance of next-generation technology in terms of data capacity per instrument run and read length will accelerate the rate of sequencing whole genomes, at ever-decreasing costs. Since next-generation platforms can produce data to characterize gene expression, methylation, histone packaging, transcription factor and other regulatory protein binding positions, and so on, we can build data sets that quite comprehensively characterize a broad spectrum of genomic alterations among sets of tumor samples.
A key question is what the planned sequencing of hundreds of tumors might reveal. For example, it is not yet clear whether the cancer-critical somatic alterations we identify will be found to recurrently affect specific genes, or if the combination of recurrent and ‘private’ mutations will define each cancer genome and hence, its treatment. We also need to understand the potential role of inherited genomic variation in shaping the onset of cancer and its outcomes, which is one reason sequencing a matched normal sample from each patient is so important. Determining the genomic landscape of hundreds of tumors ultimately will dictate whether each cancer genome will require a full genome variation profile as a diagnostic component of individualized treatment. It is imperative also to focus some genome characterization efforts toward elucidating the genomic changes that distinguish primary from metastatic disease.
Once we understand the genomic landscape of cancer, what should follow? Whereas genome-wide characterization of tumors likely will yield important clues about the genes that play a role in carcinogenesis or metastasis, we must be prepared to follow up on these clues by carrying out functional screens of altered genes with commensurately high-throughput capabilities. Functional screening would aim to identify those somatic alterations that are initiating carcinogenesis, or promoting metastasis, thereby establishing candidate genes and their protein products for targeted therapy development or testing, as well as for diagnostic/prognostic assay development. Luo et al. (30) have published one such approach, employing pooled short hairpin RNA (shRNA) screening paradigms of cancer cell lines that identified genes essential for growth and related phenotypes in these cells, as well as genes involved in the response of cancer cells to tumoricidal agents. Lynda Chin and colleagues (31) recently published an elegant example of a complete genomics-to-function paradigm, first identifying a genomic region at 5p13 that was commonly amplified in several cancer types (lung, ovarian, prostate, breast, melanoma), and then using integrated analysis of this region to pinpoint the Golgi-associated protein GOLPH3 for further study. Using a variety of clues, from the results of in vitro shRNA knock-down of GOLPH3 in cell lines that either did or did not contain the 5p13 amplification, to in vivo GOLPH3 overexpression in these same cell lines, to clues from yeast genetics that linked GOLPH3 to the trans-Golgi network and ultimately identified it as a determinant of rapamycin sensitivity and a regulator of mTOR, the study established GOLPH3 as a first-in-class Golgi oncoprotein. This result further emphasizes the need for multiple lines of evidence to support functional and mechanistic roles for the genomic alterations we are finding in cancer genomics today.
We thank the National Human Genome Research Institute for support of this research via U54-HG003079 (R.K.W.).
The authors wish to acknowledge Drs Devin Locke and Li Ding, for their critical reading of the manuscript.
Conflicts of Interest statement. None declared.