The systematic karyotyping of bone marrow cells was the first genomic approach used to personalize therapy for patients with leukemia. The paradigm established by cytogenetic studies in leukemia (from gene discovery to therapeutic intervention) has the potential to be rapidly extended through whole-genome sequencing approaches for cancer, which have recently become possible. We are entering a period of exponential growth in cancer gene discovery that will provide many novel therapeutic targets for a large number of cancer types. Establishing the pathogenetic relevance of individual mutations is a major challenge that must be solved. However, after thousands of cancer genomes have been sequenced, the genetic rules of cancer will become known and new approaches for diagnosis, risk stratification and individualized treatment of cancer patients will surely follow.
Identifying all the genetic events that lead to cancer development will expedite the discovery of novel therapeutics. Cancer genetics went through a revolutionary period with the advent of cytogenetic analysis of chromosomes 50 years ago, and we are poised to undergo another revolution in cancer gene discovery with the recent advances in DNA sequencing technology that are available today. As the discovery of genes that are mutated in cancer continues to grow with the use of unbiased genomic platforms (array-based and next-generation sequencing), many of the lessons learned from studying the genetics of leukemia over the past 50 years remain pertinent for this new era of discovery.
The first clue that cancer was a genetic disease came from the German biologist David von Hansemann, who observed aberrant mitotic figures in carcinoma cells in 1890. The study of chromosomal abnormalities in cancer underwent a paradigm shift when Nowell and Hungerford described the Philadelphia chromosome in patients with chronic myelogenous leukemia (CML) in 1960. The subsequent discovery that the Philadelphia chromosome was caused by a balanced translocation between chromosomes 9 and 22, t(9;22), not only provided a diagnostic test (a chromosomal rearrangement) that was specific for CML, but ultimately led to the development of targeted therapy in CML, since the fusion gene created by the translocation is the initiating mutation for the disease. The karyotyping methods used to identify t(9;22) set the stage for the landmark discoveries of other recurrent chromosomal translocations in leukemias and lymphomas in the 1970s and 1980s, which also led to improved treatment and outcomes for many patients [4–8].
The meticulous cataloging of translocations and chromosomal abnormalities in acute myeloid leukemia (AML) over the past several decades has allowed cytogenetics to become one of the most powerful diagnostic and prognostic tools available for AML patients [9,10]. Patients can now be classified into three main cytogenetic groups: those with favorable, intermediate, or adverse risk karyotypes, which are predictive of overall survival (Figure 1). Although the cytogenetic classification system does risk-stratify patients with AML, it is not perfect; there is considerable heterogeneity in overall survival within each cytogenetic group, indicating that additional factors are relevant for defining prognosis. Some of these have recently been uncovered in candidate gene resequencing studies and genome-wide array-based studies, as discussed below.
To identify small (<5 Mb in size) subcytogenetic amplifications and deletions (as well as the genes contained in those regions), array-based comparative genomic hybridization (CGH) and SNP genotyping platforms (which can also be used to deduce copy number alterations) have recently been used to screen AML genomes for acquired (somatic) copy number alterations [11–13]. The most recent of these studies, which employed the extensive validation of putative calls, suggested that there are, in fact, very few recurrent acquired copy number changes in most AML genomes. Furthermore, these data have not yet improved our ability to predict the prognosis for AML patients beyond what was known using standard cytogenetics [14,15].
The cytogenetic classification system has been improved by incorporating the mutation status of genes that are commonly mutated in AML (e.g., FLT3, NPM1, NRAS, CEBPA and MLL) [16,17] and the mRNA expression levels of single genes that are sometimes expressed in AML cells [17–23]. However, these classification systems are also imperfect, again suggesting that crucial genetic and/or epigenetic events in AML remain to be discovered.
As noted above, the discovery of common translocations and of the genes located at translocation breakpoints ultimately led to the development and use of treatments that targeted the mutated genes (e.g., ABL in CML and RARA in acute promyelocytic leukemia) [24,25]. In the past, it took several decades from the discovery of a cytogenetic abnormality to the identification of the mutated genes. Today, the timeline for cancer gene discovery is greatly compressed owing to the development of genomic platforms that can provide nucleotide level resolution of the entire cancer genome in only a few weeks, the timeframe required to make treatment decisions. A more precise genetic classification of AML and other cancers may well be possible using next-generation sequencing approaches, which can simultaneously identify all of the structural chromosomal abnormalities and specific gene mutations in a cancer genome; it is possible that many or all of these will need to be understood to move forward with personalized therapies.
The ultimate goal of this work is to comprehensively and systematically define the critical genes that are altered in each patient with cancer and then to develop personalized therapy for that patient. The success of this approach will rely on the rigorous experimental design and cataloging of mutations in cancer genomes. This will ultimately be possible by generating the complete genetic landscape from thousands of individuals with a specific cancer, which will identify genetic subgroups of patients with a greater homogeneity than was previously possible using cytogenetics.
Cancer is a genetic disease, but epigenetic (see below) and non-cell intrinsic factors (e.g., angiogenesis, stromal interactions, immune responses and so on) are also important. Sequence-based studies will not capture all of this complexity, but it is the genetic factors that have so far had the greatest impact on AML diagnosis, risk stratification and tailored therapy. Therefore, there is a strong rationale for capitalizing on the recent developments in sequencing technology to completely characterize cancer genomes.
Acute myeloid leukemia was a very attractive disease for our initial studies of cancer genome sequencing for several reasons:
First, AML is a very serious cancer and is not rare. In 2008, approximately 13,000 individuals in the USA developed AML and nearly 9,000 died from the disease.
Second, improving our ability to predict outcomes using key mutations should have an immediate impact on how we treat patients. As noted above, we already tailor our initial approach to therapy, which is based on a low-resolution genomic screen (cytogenetics). Refining this with the knowledge of all the mutations in the genome may well improve our ability to predict outcomes, and will allow us to treat our patients more precisely based on an accurate upfront risk assessment. Novel drugs will not need to be developed for these data to have an immediate impact on how we treat our patients.
Acute myeloid leukemia cells are also easy to access with nonsurgical procedures (peripheral blood sampling and/or bone marrow aspiration) and a large fraction of AML samples are only minimally contaminated with normal cells. Therefore, serial sampling of the diseased tissue is straightforward and tumor cell purification is generally not required.
In addition, nearly half of all AML genomes are cytogenetically normal. Furthermore, the study of AML genomes with array-based, high-resolution CGH approaches has revealed that many have no detectable copy number alterations at 35-kb resolution (compared with the 5-Mb resolution of cytogenetics). The study of diploid genomes using our initial whole-genome sequencing approach greatly simplifies the assessment of coverage (see below), and also simplifies the analysis and interpretation of whole-genome sequencing data. From preliminary studies performed by our group and others, it appears that cytogenetically normal genomes may contain fewer mutations than cancer genomes that are highly aneuploid (data not shown); therefore, mutations in diploid genomes may be more likely to be pathogenetically relevant.
Finally, many of the mutations found in AML genomes may also be relevant for other cancer types. While AML genomes do have some mutations that seem to be restricted to this disease (e.g., NPMc and FLT3 internal tandem duplication mutations), others (e.g., RAS and IDH1) are known to be important for other cancer types as well.
Therefore, our initial studies with whole-genome sequencing have been performed with relatively simple AML genomes, with the hope that the experience gained and mutations discovered would guide our work with aneuploid genomes (that may have many more passenger mutations, and may therefore be more difficult to understand).
Prior to the introduction of next-generation sequencing, the cost of sequencing whole cancer genomes was simply beyond the reach of any laboratory or institution; although large sections of the exome were successfully sequenced in breast and colorectal cancer by Velculescu and colleagues, less than 1% of the total genome was actually sequenced in this study. Based on the considerations listed below, as recently as 3 years ago it would have cost nearly US$90 million to completely sequence a cancer genome and its matched normal counterpart. However, with the advent of next-generation sequencing technologies, this cost has already fallen to approximately US$100,000 and it is expected to fall further in the very near future. Therefore, until very recently, the biggest obstacle for sequencing cancer genomes was cost.
Why is cancer genome sequencing so expensive? First, massive amounts of sequence data are required to discover all the mutations in a cancer genome since the human species is out-bred – each person has 3–4 million sequence variants (and hundreds of copy number variants) that need to be assessed for their possible contributions to the cancer phenotype. In addition to this, because each human genome has so many genomic variations, both the tumor genome and the genome of a matched normal tissue sample (generally skin for hematologic malignancies, or blood for solid tumors) must be sequenced from each patient in order to define whether the genomic variants are inherited or acquired. Since we expect that most somatic mutations will be heterozygous, both alleles must also be sampled at every position in the genome to obtain adequate coverage for mutation discovery (this will henceforth be referred to as ‘diploid’ coverage). While only sixfold coverage (i.e., 18 billion base pairs of sequence for a 3 billion base pair genome) is required to solve a genome's primary structure, at least four-times that (~25-fold coverage) is required to achieve adequate diploid coverage for comprehensive mutation discovery. Finally, tumor genomes do not only contain point mutations, but also structural variations that could be relevant for pathogenesis.
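The coverage arithmetic above can be checked with a short sketch. The 3 billion bp genome size and the sixfold and 25-fold figures come from the text; the binomial model of allele sampling (a fixed read depth at each site, with each read drawn from either allele with probability 0.5), the two-read threshold and the function names are illustrative assumptions, not the method used in these studies.

```python
from math import comb

GENOME_SIZE_BP = 3_000_000_000  # approximate genome size quoted in the text


def bases_required(fold_coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total bases of sequence needed for a given fold coverage."""
    return fold_coverage * genome_size_bp


def prob_variant_seen(depth, min_variant_reads=2, variant_fraction=0.5):
    """Probability that a heterozygous variant is supported by at least
    `min_variant_reads` reads at a site covered by exactly `depth` reads.

    Assumption (not from the text): reads are independent and each one
    samples the variant allele with probability `variant_fraction`.
    """
    p_fewer = sum(
        comb(depth, k)
        * variant_fraction ** k
        * (1 - variant_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    for fold in (6, 25):
        print(fold, bases_required(fold), round(prob_variant_seen(fold), 4))
```

Under these simplified assumptions, about 11% of heterozygous sites would have fewer than two variant-supporting reads at sixfold depth, whereas almost none would at 25-fold depth. Real read depth varies from site to site, so the true sensitivity at any nominal coverage would be lower than this idealized estimate.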
To discover all of the genomic alterations that could be relevant for cancer pathogenesis, a shotgun sequencing method, using ‘paired-end’ reads on massively parallel sequencing devices, is currently the method of choice in most genome centers. Transcriptome sequencing is also being performed to sample the expressed part of the genome and represents an important adjunct to whole-genome sequencing (see below). Importantly, libraries of DNA fragments from tumor samples can now be made with very small amounts of input DNA (using as little as 100 ng is now routine), which is critical since sample abundance is rate-limiting for many tumors and normal matched control samples.
Our initial work with cancer genome sequencing utilized short fragment reads (30–35 bp in length). Because of the short length of these reads, many could not be unambiguously mapped back to the reference genome. However, longer sequence reads (now up to 100 bp) from both ends of small DNA fragments (e.g., 250 bp up to several kilobases in size) are now being routinely performed in the large genome centers. The longer read lengths, coupled with paired-end reads, dramatically increase mapping efficiency and accuracy, and allow for unambiguous sequence assignment even in regions that are highly repetitive. In addition, this technology dramatically increases our ability to identify structural variants, including deletions and amplifications, translocations and inversions. Now that it is possible to make libraries with DNA fragments that are more than 1 kb in size, translocations are routinely detected. Finally, this technology requires fewer sequencing runs to achieve adequate diploid coverage for each genome, which results in additional cost savings.
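The way paired-end reads point to structural variants can be illustrated with a toy classifier. This is a minimal sketch under stated assumptions: the `ReadPair` type, the 250 bp expected fragment size, the tolerance and the category labels are all hypothetical, and read orientation (needed to recognize inversions) is deliberately left out. Production structural-variant callers are considerably more involved.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Mapped positions of the two ends of one sequenced DNA fragment."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def classify_pair(pair, expected_fragment_bp=250, tolerance_bp=100):
    """Label a read pair by how its two ends map to the reference.

    Hypothetical rules for illustration only:
    - ends on different chromosomes suggest a translocation;
    - ends much farther apart than the library fragment size suggest
      that reference sequence between them is deleted in the sample.
    """
    if pair.chrom1 != pair.chrom2:
        return "possible_translocation"
    span = abs(pair.pos2 - pair.pos1)
    if span > expected_fragment_bp + tolerance_bp:
        return "possible_deletion"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        ReadPair("chr9", 1_000, "chr9", 1_240),
        ReadPair("chr9", 5_000, "chr22", 7_500),
        ReadPair("chr10", 2_000, "chr10", 9_000),
    ]
    for p in pairs:
        print(p, classify_pair(p))
```

A single discordant pair is weak evidence; in practice a candidate rearrangement would be expected to be supported by several independent pairs and then validated.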
DNA fragment capture techniques are also being evaluated as a strategy to reduce costs and increase the yield of sequence-based studies by focusing on the exons themselves for mutation discovery. Although there are some advantages of this approach (i.e., large numbers of patients can potentially be screened more rapidly and deep read counts for each captured exon would assure high sensitivity for mutation discovery), there are disadvantages as well. All exon capture approaches are biased towards predefined regions of the genome, and the efficiency of capture methods is still far from perfect. For the time being, whole-genome sequencing is our method of choice for cancer mutation discovery.
Over the past 2 years, we have successfully sequenced the matched tumor and normal skin genomes of two individuals with the M1 subtype of AML [28,29]. Both individuals had essentially diploid tumor genomes and we were able to estimate that each genome contained approximately 500–1000 somatic point mutations. However, current modeling estimates suggest that only five to ten mutations are required to cause most cancers [30,31]. So, why are there so many mutations in these genomes? And which ones are important?
Of all the mutations found in each tumor, only a small number were nonsynonymous; in the first AML genome we found ten mutations, and in the second, just 12. By performing deep read counts of the variant allele frequencies for each mutation, we were able to establish that all of these nonsynonymous mutations were present in virtually all of the tumor cells (assuming that the mutations were heterozygous). A total of four of these mutations were found in at least one other AML genome (out of 188 genomes tested in total), strongly suggesting that they are indeed important for pathogenesis because they are recurrent. In the second AML genome, we validated 54 mutations that were not in coding sequences, but that did fall in highly conserved regions of the genome (and/or regions with potential regulatory importance). Using deep count analysis, we demonstrated that all of these mutations were likewise present in virtually every tumor cell.
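The reasoning from deep read counts to "present in virtually all tumor cells" can be sketched as follows. For a heterozygous mutation in a diploid region, one of the two alleles in each mutated cell carries the variant, so the fraction of cells carrying it is roughly twice the variant allele frequency. The function names and the example read counts are hypothetical, and the sketch assumes a sample with negligible normal-cell contamination and no copy number change at the site.

```python
def variant_allele_frequency(variant_reads, total_reads):
    """Fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def fraction_of_cells_mutated(vaf, mutated_copies=1, ploidy=2):
    """Estimate the fraction of cells carrying a mutation from its VAF.

    Assumes every cell has `ploidy` copies of the locus and that mutated
    cells carry the variant on `mutated_copies` of them (1 of 2 for a
    heterozygous mutation). Capped at 1.0 to absorb sampling noise.
    """
    return min(1.0, vaf * ploidy / mutated_copies)


if __name__ == "__main__":
    # Hypothetical deep read counts at one mutated position.
    vaf = variant_allele_frequency(variant_reads=480, total_reads=1000)
    print(vaf, fraction_of_cells_mutated(vaf))
```

A variant allele frequency close to 0.5 is therefore what would be expected for a heterozygous mutation carried by essentially every cell in the sample, whereas a markedly lower value would point to a subclone or to contaminating normal cells.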
How can a tumor ‘retain’ that many acquired mutations? Are they all important? Probably not. We feel that the most likely explanation is that many of them are irrelevant mutations that were already present in the hematopoietic stem cell, which was transformed by the acquisition of one or more key initiating mutations that altered the growth and/or developmental fate of that cell. A model for this hypothesis is presented in Figure 2. We suggest that long-lived hematopoietic stem cells normally acquire a number of benign mutations that do not alter the function of these cells during the life of an individual. Even though most of them are irrelevant, they are all present in the individual cell when it acquires the critical mutation that sets the cancer in motion. Additional mutations then cause the transformed cell to progress to overt leukemia. When this transformed cell expands clonally to produce the tumor, and is sequenced with our existing technologies, we are really examining the genomic history of the initiating cell and are defining all of the mutations that may have been present in that cell from the time of transformation onwards (other potential scenarios exist to explain these data, but we currently favor this one). The greatest challenge in cancer genomics is to distinguish the benign, pre-existing ‘passenger’ mutations from those that are relevant for pathogenesis – the so-called ‘drivers’.
Since hundreds to thousands of mutations will be present in any given cancer genome, the prospect of assigning relevance to each is a daunting task. Furthermore, assembling all of the potentially relevant mutations in a test cell or organism to assess the role of each ‘in context’ is an enormous challenge.
How can we begin to define the truly relevant mutations in any cancer genome? A number of strategies can be employed, and together they should provide powerful clues that will be helpful in the short term.
First and most importantly, recurring mutations in a gene, or at an individual nucleotide position in the genome, are likely to be important for pathogenesis. The likelihood of having an identical mutation at the same position in 188 genomes (the number chosen for our initial studies given the statistical and cost constraints) with a mutation frequency of 1 in 4 million bp is only approximately 1 × 10⁻⁹; therefore, the vast majority of recurring mutations will be important. To identify these recurring mutations, tens of thousands of cancer genomes will need to be sequenced over the next several years. Initially, much of the sequencing will be performed with highly annotated samples in the academic sector. However, in the future, many cancer genomes will probably be sequenced by commercial entities. We feel that it will be critical to capture data generated by these commercial sources for cancer genome databases, so that recurring mutations can be more rapidly identified.
In nongenic regions of the genome that are minimally annotated (i.e., most of the genome), finding recurring mutations may be extremely helpful in identifying functional regions that are currently of unknown significance. In addition, some studies have suggested that large portions of intergenic regions are transcribed, adding to their potential to be functionally relevant. For example, we found a recurring point mutation in a conserved (but nongenic) region of chromosome 10 in two AML cases. Finding recurring mutations in nonannotated parts of the genome may greatly improve our understanding of the genome in general, and will undoubtedly uncover new mechanisms for cancer pathogenesis.
Needless to say, the gold standard for assessing the importance of mutations will continue to be functional validation in tissue culture cells, or in a model organism. However, assembling multiple mutations into a single cell or model organism will probably be required to discern the importance of individual mutations for cancer pathogenesis. This represents an extraordinarily daunting challenge that will require new approaches for its execution.
The analysis of cancer genomes from mice and other model organisms may also be very useful for identifying potentially relevant human cancer mutations. Since mice have a relatively short lifespan, they may accumulate fewer somatic mutations in the time it takes to develop a cancer. Since many laboratory mice are maintained as homozygous strains, the DNA sequence coverage required to identify acquired mutations will be far less than that required for human genomes, and matched normal DNA may not need to be fully sequenced for each mouse tumor analyzed. In inbred strains of the worm Caenorhabditis elegans, only 8× coverage was required to routinely detect somatic mutations, and similar sequence coverage should be reasonable for identifying mutations in inbred mouse genomes. Furthermore, mutations found in genes that are associated with cancers in both mice and humans will almost certainly be pathogenetically relevant.
Currently, there are several array-based platforms that are utilized to assess the epigenetic contributions to cancer pathogenesis. Expression studies are routinely performed on arrays. Chromatin immunoprecipitation (ChIP) studies and methylation studies are now routinely performed using array-based technologies as well. Although these studies are relatively robust, it is clear that next-generation sequencing technologies will further increase the quality of data generated by these strategies.
Next-generation sequencing can be used to quantitate mRNA abundance by sequencing cDNA libraries derived from tumors. The read counts of different mRNA species can be used to obtain a highly accurate digital representation of mRNA abundance in a tumor sample. Furthermore, since primary sequence data are generated as well, point mutations and fusion transcripts can also be identified [34–36]. Although transcriptome analysis with next-generation sequencing is powerful, this technology will miss mutations that inactivate genes (such as deletions, mutations that cause nonsense-mediated decay and regulatory mutations) and these would be found with whole-genome sequencing. Many aspects of this technology are still evolving, but it will certainly be an important adjunct to whole-genome sequencing for the foreseeable future.
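The idea of a "digital" expression measurement from read counts can be sketched in a few lines. Scaling to reads per million is a common convention that is assumed here rather than taken from the text; the gene names are used only as labels, the counts are invented, and the sketch ignores transcript length and other corrections that a real pipeline would apply.

```python
def reads_per_million(read_counts):
    """Scale raw per-transcript read counts so they sum to one million.

    `read_counts` maps a transcript or gene name to the number of
    sequencing reads assigned to it. Scaling makes samples sequenced
    to different depths comparable.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    return {name: count * 1_000_000 / total for name, count in read_counts.items()}


if __name__ == "__main__":
    # Invented counts for illustration only.
    counts = {"FLT3": 50, "NPM1": 150}
    for name, value in reads_per_million(counts).items():
        print(name, value)
```

Because the same reads also carry primary sequence, the mapped reads behind these counts are what would additionally be inspected for point mutations and fusion transcripts.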
Chromatin immunoprecipitation is a powerful tool for determining the DNA sequences with which proteins interact in the genome. This technology has been greatly aided by array-based approaches, but most arrays do not represent the entire genome and are therefore biased to varying extents. By coupling ChIP with next-generation sequencing techniques (ChIP-Seq) [37,38], an even greater level of precision will be obtained and will complement and extend ChIP studies in the future. Likewise, the evaluation of methylated regions of the genome may be highly relevant for understanding cancer pathogenesis. Whole-genome methylation studies have been successfully performed using array-based technologies, but the sequencing of methylated regions of the genome on next-generation platforms (Methyl-Seq) will improve the quality of data.
As noted above, cancer genomics requires an understanding of not only the structure of primary cancer genomes, but also of their function. One of the greatest technical challenges facing the field is the integration of different kinds of genomic data (generated on different platforms) into a seamless whole. When this is routine, an even greater intellectual challenge will exist: how to integrate alterations occurring at the level of genes versus alterations occurring at the levels of pathways. Based on the analysis of the first few cancer genomes, it seems possible that the combination of mutations associated with a given cancer may be extremely large; however, the pathways affected by these mutations may be limited and obey a set of rules that remain to be defined. Integrating mutational and pathway data remains an enormous challenge for the field, but this problem will hopefully be solved when thousands of cancer genomes have been successfully sequenced.
Cancer genome sequencing will reveal all of the inherited variants present in each individual with cancer. It is highly likely that some of these inherited variants may provide a critical ‘substrate’ for the acquired mutations that have been the focus of this perspective thus far. However, the role of inherited variants will ultimately need to be integrated into the sequencing analysis in order to fully understand the cancer genome. Characterizing these inherited factors is not only of great interest scientifically, but it will also have relevance for genetic counseling and screening, and potentially, for cancer prevention.
Inherited variants with established roles in AML susceptibility occur in two genes associated with rare familial leukemia syndromes and several others that cause more general cancer predisposition syndromes. The best-characterized familial leukemia syndrome is the familial platelet disorder with a predisposition to AML (FPD/AML), which is caused by inherited mutations in RUNX1. A smaller number of families with germline mutations in CEBPA have also been reported [40,41]. These individuals develop AML with nearly complete penetrance. The RUNX1 and CEBPA pedigrees display features that are typical of single-gene cancer predisposition syndromes, including multiple affected first-degree relatives, early onset of disease compared with sporadic cases and low prevalence. Other cases of familial leukemia have been described in which linkage to RUNX1 and CEBPA has been excluded, implying that there are additional susceptibility loci that have not yet been identified [42,43].
Several classical cancer predisposition syndromes are associated with an increased risk of AML in addition to other tumors. The incidence of leukemia is increased in Li–Fraumeni families (caused by mutations in TP53), although myelodysplastic syndromes (MDS) or AML probably occur in fewer than 5% of affected individuals. Children with neurofibromatosis type I (caused by NF1 mutations) have a greatly increased risk of developing MDS/AML. Other cancer predisposition syndromes associated with excess leukemia cases include ataxia telangiectasia (exclusively associated with ALL) and Wiskott–Aldrich syndrome [46,47].
Several genetic syndromes causing bone marrow failure are also associated with a predisposition to MDS/AML. In some of these, the genetic lesion causes loss of genomic integrity that presumably contributes directly to tumor formation (e.g., Fanconi's anemia, Bloom's syndrome and dyskeratosis congenita). In others, the mechanism of leukemia predisposition is less clear but may be an indirect consequence of chronic bone marrow ‘stress’ (e.g., severe congenital neutropenia, Diamond–Blackfan anemia or Shwachman–Diamond syndrome).
Perhaps most importantly for the general population, sporadic AML may also have an inherited component. From twin studies, the inherited contribution to leukemia susceptibility was estimated to be approximately 20%. The cumulative impact of all of the syndromes discussed above falls far short of this figure, implying that most of the inherited factors that are important for leukemia susceptibility have not yet been discovered. In nonsyndromic AML, familial aggregation of cases is rare. This suggests that inherited susceptibility alleles have modest effects, that they may arise spontaneously in affected individuals (de novo mutations), or that AML in the general population does not have a significant inherited component. This will be an extremely challenging problem to resolve, given the large number of polymorphisms that have been detected in the small number of genomes that have been systematically analyzed to date (~3.5 million SNPs and hundreds of copy number variants per genome). Genome-wide association studies using array-based platforms will fail to detect de novo or rare variants that are not in linkage disequilibrium with markers included in the standard panels. As consensus builds around the idea that AML, like other sporadic cancers, may not conform to the ‘common variant, common disease’ paradigm (i.e., AML may not be a consequence of polymorphisms that occur frequently in the general population) [49,50], it will become increasingly important to examine these rare variants in association studies.
Comprehensive analysis of DNA from individuals with cancer using next-generation sequencing approaches provides an unbiased view of all variants present in these genomes. One of the important lessons from our studies of somatic mutations in AML is that the sequence of a paired sample from nonmalignant tissue (e.g., skin) obtained from each patient is an essential component of the analytical pipeline. This strategy allows for the unambiguous classification of nucleotide variants detected in the tumor sample as acquired versus inherited mutations. When the goal of the study is to identify acquired mutations, those changes detected in both the tumor and nonmalignant sample are ‘discarded’ as inherited. For studies of susceptibility, the complete complement of germline variants (e.g., single nucleotide variants, insertion/deletions and copy number variants) must be considered. What must occur next is a comparison between the genomes of affected (with AML) and unaffected (without AML) individuals from similar genetic backgrounds. The power of such a study could, in principle, be improved by choosing cases with a higher a priori probability of harboring novel susceptibility alleles (e.g., early disease onset, exclusion of known predisposing factors, excess first-degree relatives with AML, other hematologic malignancy, or other cancers). Discovery of alleles that play a role in AML susceptibility will require a comparison with thousands of cases to appropriately matched controls. As daunting as this task may seem, these studies will be feasible in the near future, and should lead to novel prevention and control strategies for AML.
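The tumor versus normal comparison described above amounts to a set operation on the variants called in each sample. The following is a minimal sketch, assuming variants are represented as hypothetical `(chromosome, position, reference, alternate)` tuples with invented coordinates; it ignores the coverage and error-rate checks that would be needed before trusting that a variant is truly absent from the normal sample.

```python
def classify_variants(tumor_variants, normal_variants):
    """Split variant calls into acquired and inherited sets.

    Variants seen only in the tumor are treated as acquired (somatic);
    variants seen in both tumor and matched normal tissue are treated
    as inherited. The full set from the normal sample is kept as the
    germline complement for susceptibility studies.
    """
    tumor = set(tumor_variants)
    normal = set(normal_variants)
    return {
        "acquired": tumor - normal,
        "inherited": tumor & normal,
        "germline": normal,
    }


if __name__ == "__main__":
    # Invented example calls: (chromosome, position, ref, alt).
    tumor = [("chr2", 1000, "C", "T"), ("chr5", 2000, "G", "A"), ("chr13", 3000, "A", "G")]
    skin = [("chr5", 2000, "G", "A"), ("chr13", 3000, "A", "G")]
    result = classify_variants(tumor, skin)
    print(sorted(result["acquired"]))
    print(sorted(result["inherited"]))
```

For a study of acquired mutations only the `acquired` set would be carried forward, while a susceptibility study would instead start from the `germline` set and compare it across affected and unaffected individuals.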
Next-generation sequencing technologies have the potential to revolutionize our understanding of cancer. The basic rules for using whole-genome screens to understand cancer pathogenesis were established with the use of karyotyping (a low-resolution genomic screen) in AML cases more than 30 years ago. By using many of the paradigms established in those studies, the scientific community should be able to use whole-genome sequencing data (and associated epigenetic data) to establish a clear picture of the mutations that cause each cancer in the not too distant future. This will require sequencing tens of thousands of cancer genomes and the careful analysis of these genomes for recurrent mutations that impact the pathways relevant for cancer pathogenesis.
In our view, there is no question that these data will dramatically change our understanding of cancer and lead to new ways to diagnose and classify this large group of diseases. Some of the information will have an immediate impact on the care of patients, and some will lead to the development of novel drugs and strategies for the treatment of individual patients in the future. Regardless, this revolutionary technology will permanently change our understanding of cancer by providing a complete picture of the mutations that cause it. Only then can we realistically hope to find the novel approaches that will lead to increased therapeutic success.
Whole-genome sequencing (structural genomics) and epigenomic studies (functional genomics) are important new tools for understanding cancer and improving its treatment. When the cost of these studies reaches a critical threshold (probably a few thousand dollars per patient), we suggest that most, if not all, cancer patients should have high resolution genomic studies performed as part of their initial evaluation. It is highly likely that many genomic studies will be performed by commercial entities, with interpretation provided by trained pathologists and oncologists. In the short term, we believe that data from individual patients will be used to provide more accurate diagnostic and prognostic information, and that it will heavily influence treatment decisions.
Although many great challenges remain, the information gained from next-generation sequencing platforms should ultimately lead to the discovery of novel drugs that more effectively target the key genes and pathways that cause cancer. The marriage of genomics and targeted therapies will hopefully lead to therapeutic successes akin to that achieved for chronic myelogenous leukemia patients with imatinib, where the drug directly targets the initiating oncogene. Clearly, this is the ultimate goal for all of this work.
Financial & competing interests disclosure: The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
No writing assistance was utilized in the production of this manuscript.