New-generation sequencers have been developed with a strong impact on genomics. These sequencers are based on a principle different from the Sanger method and can sequence one to several million templates in a single run, although read lengths are relatively short. The current large-scale efforts are: 1) complete genome sequencing of 1,000 individuals, the primary objective of which is identification of rare SNP variants not identified by the international HapMap project; 2) large-scale sequencing of cancer genomes to construct a complete catalog of genomic changes. These sequencers are also being applied to the identification of new infectious agents. The steady increase in data production capacity and decrease in cost will definitely make the sequencers a powerful diagnostic tool, especially for screening of all genetic diseases. However, statistical problems inherent to large data sets need to be solved before application to specific problems in medical science.
In the field of genomics, the next-generation DNA sequencer is currently the hottest topic. These new sequencers can produce over 100 times more data than the most sophisticated capillary sequencers based on the Sanger method. Rapid developments in machines and bioinformatics are driving the field toward the goal of the "1,000 dollar genome", i.e., sequencing individual human genomes at a cost of $1,000 each. The entire scene of biomedical science may change when this goal is reached.
In this review, I summarize the principle of the next-generation sequencers, their current applications, and their future prospects in medical science. The first-generation sequencers are those based on the Sanger method; the second-generation sequencers are those based on massively parallel analysis; and the third-generation sequencers are those based on single-molecule sequencing in addition to massively parallel analysis. Because the current excitement comes from the second-generation sequencers, I will describe their basic principle first.
Three second-generation sequencers are commercially available: Roche FLX, Illumina Genome Analyzer (GA), and Life Technologies' SOLiD [3, 4]. These machines are widely distributed, and their performance has been well characterized. All three sequencers are based on a similar principle.
A schematic representation of the representative sequencing principle is shown in Figure 1. Each sequencer employs a different reaction principle, including:
The current benchmarks of the sequencers are summarized in Table 1. In brief, FLX produces long reads (~400 bases), but the number of templates per run is moderate (~1,000,000). GA and SOLiD produce short reads (50~75 bases) but are characterized by a large number of templates per run (85,000,000~100,000,000). Their performance is increasing rapidly.
Pacific Biosciences Inc. is developing a sequencer based on a new principle, which should be categorized as third generation. This DNA sequencer uses single DNA molecules as templates. Its main characteristic is real-time monitoring of nucleotide incorporation by DNA polymerase. The major drawback of the second-generation sequencers compared with the Sanger method is the short read length; unlike the second-generation sequencers, this sequencer can obtain reads of several kilobases from a single template. The sequencer is based on the following three technical components.
Because the third-generation sequencer has yet to be commercialized, this review focuses on the second-generation sequencers. It should be noted that there is plenty of room for improvement in the throughput of GA and SOLiD: with these systems, templates can be accumulated at a much higher density. In contrast, Roche FLX has a limitation. Because the light-emitting reaction product, oxyluciferin, diffuses into the reaction solution, each template bead must be separated in an individual well, which limits the template density.
The situation surrounding the second-generation sequencers differs from that of the first-generation sequencers. The most important factor is the completion of the human genome project. As noted above, a major drawback of the next-generation sequencers relative to the previous sequencers is the short read length: 350 bases (FLX) and 50-75 bases (GA, SOLiD), compared with >800 bases for the first-generation sequencers. The short read length is a considerable disadvantage for de novo sequencing, in which a complete sequence must be constructed from a large number of short sequence pieces. If read lengths are short, the pieces make only small overlaps, making it difficult to construct contigs. Thus, the second-generation sequencers, especially GA and SOLiD, are not intended for de novo sequencing. In the human genome, however, the short pieces can be assembled into large sequences by matching them to the reference human genome sequence. In this way, the second-generation sequencers can produce complete genome sequences of individuals. The major genome centers are now pursuing two targets: the genomes of individuals and cancer genomes.
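The role of overlaps in contig construction can be made concrete with a toy, purely illustrative greedy assembler (not any production algorithm): it repeatedly merges the pair of reads with the longest suffix-prefix overlap. When reads are short, overlaps shrink and repeats become ambiguous, which is why de novo assembly from short reads is difficult.

```python
def max_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the pair of reads with the largest overlap
    until no pair overlaps; the remaining strings are the contigs."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = max_overlap(reads[i], reads[j])
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:      # reads too short to overlap: contigs stay apart
            break
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads)
                 if n not in (best_i, best_j)] + [merged]
    return reads

# two reads sharing a 4-base overlap reconstruct the toy "genome"
print(greedy_assemble(["ACGTACGGAT", "GGATCCTTAGC"]))
# -> ['ACGTACGGATCCTTAGC']
```

With realistic short-read data the all-pairs search above would be far too slow, and repeats longer than the read length would make the merges ambiguous; real assemblers use index structures and graph models instead.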
For several years, single nucleotide polymorphisms (SNPs) and their application to human genetics have been the most active area in genomics. SNPs were at first collected intensively using sequences obtained during the human genome project. These SNPs (roughly 100 million) were organized into haplotypes identified by the international HapMap project. Consequently, about 50,000 tag SNPs representing haplotypes were obtained. Genetic loci associated with a number of common diseases have been identified using this tag SNP set through genome-wide association studies (GWAS). Accumulating results, however, show that GWAS has generally failed to identify most of the genetic background of common diseases. A series of articles reviewing the results from various viewpoints has recently been published [13-15]. There is now considerable discussion about the research direction, i.e., continuation of GWAS or a shift to complete sequencing of individual human genomes. Because the SNP markers used in GWAS are based on the international HapMap project, they detect allele variants whose frequencies are over 5%. Therefore, rare variants (0.1-5%) cannot be detected in GWAS. Proponents of genome sequencing argue that genetic associations may be found with rare variants not detected by the current tag SNPs, and that complete genome sequences of a large number of individuals will uncover a more detailed view of variation. Currently, the "1,000 genomes" project (http://www.1000genomes.org), an international project to sequence the genomes of 1,000 individuals, is ongoing. The outcome of the project will be an important resource on human genome variation, but its direct objective is identification of rare variants to extend current GWAS.
It is important to confirm whether the second-generation sequencers can identify SNPs as well as the Sanger method does. Two Caucasian individual genomes were determined before the "1,000 genomes" project. One, obtained by the Sanger method, identified 2.8 million known SNPs and about 0.74 million novel SNPs. The other, sequenced with GS20, a previous model of FLX, identified 2.72 million known and 0.61 million novel SNPs. Pilot experiments of the 1,000 genomes project determined the genomes of two individuals with GA [18, 19]. The sequence of a male Yoruba identified 3.8-4.1 million SNPs, 73.6% of which were in dbSNP. The sequence of an Asian individual identified 3 million SNPs, 73.5% of which were in dbSNP. Recently, a new study compared the second-generation sequencers and a Sanger sequencer from the viewpoint of GWAS. In general, the second-generation sequencers had very high sensitivity, i.e., in identification of SNPs, but relatively low specificity. This tendency was more prominent with GA and SOLiD because of their short sequence reads: errors were more common in repeated sequence regions, probably arising during sequence assembly. The other obstacle is bias in representation among genomic regions; to obtain complete coverage of a genomic region, more reads are necessary. These results suggest that the next-generation sequencers are useful for SNP studies if enough reads are obtained.
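The link between read depth and SNP sensitivity/specificity can be sketched with a naive genotype caller that looks at the bases observed at one position across overlapping reads (a "pileup"). The depth and allele-fraction thresholds below are arbitrary choices for illustration, not the pipelines used in the cited studies.

```python
from collections import Counter

def call_genotype(bases, min_depth=10, min_frac=0.2):
    """Naive genotype call from the bases seen at one genomic position.
    Alleles below min_frac are treated as sequencing error; positions
    covered by fewer than min_depth reads get no call at all."""
    if len(bases) < min_depth:
        return None                      # too few reads to call anything
    counts = Counter(bases)
    alleles = sorted(a for a, n in counts.items()
                     if n / len(bases) >= min_frac)
    return "/".join(alleles)

print(call_genotype("AAAAAAAAGGGGGGG"))  # 8 A + 7 G reads -> heterozygous A/G
print(call_genotype("AAAAAAAAAAAAAAG"))  # lone G treated as error -> A
print(call_genotype("AAAG"))             # depth 4: no call (None)
```

The sketch shows why coverage matters: with too few reads a true heterozygous allele is indistinguishable from a sequencing error, so uneven representation among genomic regions directly degrades SNP calling.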
Complete human genome sequencing is still expensive and requires a huge computational load. Instead, sequencing of all protein-coding regions, termed the "exome", is regarded as a cost-effective approach. SNPs or mutations in coding regions are more informative and more likely to be linked to diseases than those in non-coding regions. One example is a study on pancreatic cancer described below.
The objective of projects that sequence cancer genomes, such as The Cancer Genome Atlas (http://cancergenome.nih.gov), is a complete list of the genomic changes contributing to carcinogenesis. These projects hypothesize that there are undiscovered genes contributing to carcinogenesis and that these will be accompanied by genomic changes such as mutations, copy number variations, and translocations. Epigenetic events are also known to contribute to carcinogenesis and may be incorporated into the projects. Unbiased exploration of such events would substantially contribute to the understanding of cancer and lead to the identification of new target molecules.
Several pilot experiments using the first-generation sequencers have been performed. Due to their limited throughput, several early studies focused on specific gene families, such as tyrosine kinases, which are often activated by somatic mutations. An organized study was performed at the Wellcome Trust Sanger Institute. In that study, somatic mutations were classified into "driver" and "passenger" mutations: "driver" mutations are those conferring a growth advantage, and "passenger" mutations are those without any biological effect. The overall selection pressure exerted by all the substitution mutations was calculated as 1.29 (95% confidence interval, 1.10-1.51; P=0.0013). The other study examined the majority of the transcribed genes (18,191 genes) in eleven breast and eleven colorectal cancer tissues. This study revealed a large number of mutations with rare incidence, in addition to a small number of genes with mutations of high incidence. Both studies suggested that known somatic mutations are only a small fraction of the mutations in cancer genomes, and that more systematic analysis of the cancer genome, i.e., complete genome sequencing of a large number of cancer tissues, is necessary. These studies were followed by two studies on glioma [25, 26]. Both included measurements of copy number variation by genome arrays and gene expression profiling by microarrays or SAGE. One of the studies found recurrent mutations at the active site of isocitrate dehydrogenase 1 (IDH1) in 12% of glioblastoma patients. This result suggests that there are additional important mutations not yet discovered.
Comparison of a cancer genome with the corresponding germline genome is very informative. One study analyzed the whole genomes of malignant cells and normal cells from a single acute myelogenous leukemia (AML) patient. The whole-genome analysis revealed that the AML genome had only eight heterozygous, non-synonymous somatic mutations, all of which were novel. Another study, which sequenced all coding regions of a familial pancreatic cancer genome, identified mutations in PALB2 as responsible for the disease, validated with 96 additional samples. Both studies could pinpoint a small number of candidate genes, demonstrating the accuracy and thoroughness of the whole-genome approach.
The above early studies strongly suggest that the large-scale cancer genome projects will definitely contribute to our understanding of genetic changes in cancer. However, contribution to medicine is a different problem. The rationale justifying the large investments in these projects is the identification of molecular targets and the subsequent development of anti-cancer drugs. Proponents of the projects argue that newly identified mutations will be effective targets for anti-cancer drug development. This reflects the current trend in anti-cancer drug development: a large number of molecular-target drugs are now being developed or are in clinical trials, with expectations of improving cancer therapy. However, by the time the cancer genome projects are finished, the current trend and enthusiasm may have passed. Already, there is controversy among scientists over the future of molecular-target drugs [28, 29]. So far, all molecular-target drugs except imatinib extend overall survival by only several months. Molecular-target therapy might turn out to be less attractive than it now appears, and pharmaceutical companies might lose interest. In any case, the resulting data will be valuable as a resource for cancer research.
The third important application of the second-generation sequencers is the identification of infectious agents. RNA or DNA from human tissues or cells infected by an agent such as a virus or bacterium contains human genome sequences as well as sequences of the agent. When a large number of RNA or DNA pieces from an infected sample are sequenced, the resulting sequences include those derived from the infectious agent as well as from the human genome. Now that the complete human genome sequence is available, subtraction of the human sequences should theoretically yield the sequences of the infectious agent. This idea is not new. In 2002, a computational experiment was performed in which expressed sequence tags (ESTs) of human origin from the public database were searched against the human genome sequence. Among the sequences not matching the human genome, more than 50 matching virus genomes were identified. The same group performed a model experiment with tissues of post-transplant lymphoproliferative disorder (PTLD) and successfully recovered Epstein-Barr virus sequences, the known agent of PTLD. These studies demonstrated the plausibility of the above experimental strategy.
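The subtraction step can be sketched in a few lines. This toy version discards any read sharing a k-mer (substring of length k) with the host genome; real pipelines use full alignment tools, so the function names, sequences, and parameters here are illustrative assumptions only.

```python
def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract_host(reads, host_genome, k=12):
    """Keep only reads that share no k-mer with the host genome --
    a crude stand-in for alignment-based computational subtraction."""
    host = kmers(host_genome, k)
    return [r for r in reads if not (kmers(r, k) & host)]

host = "ACGT" * 10                        # toy stand-in for the human genome
reads = ["ACGTACGTACGTACGTACGT",          # host-derived read: subtracted
         "TTTTGGGGCCCCAAAATTTT"]          # candidate pathogen read: kept
print(subtract_host(reads, host))
# -> ['TTTTGGGGCCCCAAAATTTT']
```

The surviving reads are then compared against known pathogen genomes; in the studies cited below, sequences left after subtraction matched arenavirus and polyomavirus genomes.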
In spite of the potential strength of the strategy, the high cost of DNA sequencing prevented real application. With the decreased cost of sequencing provided by the second-generation sequencers, two studies using FLX appeared in 2008. One focused on patients who died of febrile illness after visceral organ transplantation. Unbiased transcript sequencing from liver and kidney, and subsequent data analysis, revealed infection with a new arenavirus. The other focused on Merkel cell carcinoma, a rare type of skin cancer. Sequencing of nearly 400,000 transcripts identified sequences similar to known polyomaviruses, and further analysis revealed a new polyomavirus, named Merkel cell polyomavirus.
The sequencers can also be applied to gene expression profiling, i.e., genome-scale analysis of gene expression. Sequencing a large number of transcripts purified from a tissue or cell, and then matching them to the human reference genome, reveals the identity of each transcript. The expression level of a gene can be determined from the number of times its sequence appears. This approach has been named digital gene expression profiling, and it was originally initiated in the early stage of the human genome project. Later, a new technique named serial analysis of gene expression (SAGE) appeared. In SAGE, a small tag (SAGE tag) of 9 to 21 bases is obtained from each transcript, and tens of tags are concatemerized and read with a sequencer, so frequency information on tens of transcripts can be obtained from a single read. Even with SAGE, however, it was not practical to process a large number of samples because of the low throughput of the Sanger-based sequencers. With the next-generation sequencers, digital expression profiling has finally become a plausible method comparable to microarrays. Its major advantage over microarrays is straightforward standardization of the data: in digital expression profiling, the data are simply molecular counts. In contrast, the data obtained by microarray analysis are expression levels relative to some standard, making it difficult to compare data from different experimental series. However, for laboratory use, i.e., comparison of global gene expression among samples of interest, digital expression profiling has no clear advantage over microarrays.
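A minimal sketch of digital expression profiling, assuming reads have already been mapped to genes: the data really are just molecular counts, here scaled to counts per million (CPM, one common normalization, an assumption of this example rather than something prescribed in the text) so that libraries of different sequencing depth can be compared directly.

```python
from collections import Counter

def digital_expression(genes_per_read):
    """Count how many sequenced reads map to each gene and scale the
    counts to counts per million (CPM) of the total library size."""
    counts = Counter(genes_per_read)
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

# the gene each of 100 sequenced transcripts mapped to (toy data)
mapped = ["TP53"] * 30 + ["GAPDH"] * 60 + ["MYC"] * 10
print(digital_expression(mapped))
# -> {'TP53': 300000.0, 'GAPDH': 600000.0, 'MYC': 100000.0}
```

Because the output is a count scaled only by library size, two laboratories sequencing the same sample to different depths obtain directly comparable numbers, which is the standardization advantage over microarray intensities described above.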
In this review, the principle of the next-generation sequencers and their major research areas have been described. As shown above, current applications center on the continuation of work already started before the appearance of the second-generation sequencers and are mainly restricted to experts in genomics. However, one of the most important aspects of this technical revolution should be easy access to large sequence data by scientists in other areas and by physicians. For widespread use, sequence data will soon be available from outsourcing companies, but data analysis will remain a difficult task. Development of software systems easily accessible to non-experts is essential for utilization of large sequence data.
Considering the steady increase in sequencing capacity and decrease in cost, application to diagnosis will be realized in the near future. Routine neonatal diagnosis may be replaced by routine sequencing of the entire genome or exome. This new type of diagnosis would reveal affected alleles of all known genetic diseases, including the genes currently screened in postnatal diagnosis. From the data of a couple, the risks of genetic diseases in their children could be accurately predicted.
Application to diagnostics for genetic diseases is easily predictable, but how to apply the next-generation sequencers to medical science more broadly is difficult to foresee. Large-scale data production efforts, such as the human genome project, the "1,000 genomes" project, and The Cancer Genome Atlas, are productive as far as construction of resources is concerned. However, using large data sets to solve a specific problem is usually difficult, as exemplified by GWAS. The obstacles are mainly statistical problems inherent to large data sets. One is multiplicity in statistical testing. In general, there is no single method of choice for controlling multiplicity; the method is chosen through practical application. For example, Bonferroni correction is used in GWAS, and the q-value or false discovery rate (FDR) is used in gene expression analysis. Although the validity of each method has been confirmed through repeated use, it should be noted that some true positives must be excluded. In particular, GWAS has detected loci representing only a fraction of the genetic background of common diseases; one possibility is that true positive loci are excluded by the stringent criteria set by Bonferroni correction.
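The two corrections mentioned above can be sketched as follows; the p-values and test counts are invented for illustration. Bonferroni divides the family-wise alpha by the number of tests, while the Benjamini-Hochberg step-up procedure controls the FDR and is less conservative, keeping some borderline results that Bonferroni would discard.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure, which controls the false discovery rate."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank                 # largest rank passing the step-up test
    return sorted(order[:k])

# a GWAS testing 500,000 SNPs at family-wise alpha = 0.05:
# each SNP must reach p < 1e-7 to survive Bonferroni
print(bonferroni_threshold(0.05, 500_000))

# with four invented p-values, Bonferroni (threshold 0.05/4 = 0.0125)
# rejects only the first, while BH also keeps the borderline second test
print(benjamini_hochberg([0.001, 0.02, 0.039, 0.2]))
# -> [0, 1]
```

The severity of the Bonferroni threshold for half a million tests illustrates the point in the text: a genuinely associated locus with a p-value of, say, 1e-6 would be discarded.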
Another problem is the "curse of dimensionality": the exponential increase in volume associated with adding extra dimensions to a space, which increases the number of samples needed for analysis. In the cancer classification problem, where cancer samples are classified into two classes by gene expression profiling, the "curse of dimensionality" is avoided by dimension reduction, i.e., reduction of the number of genes by gene selection. For example, when the initial data set contains 10,000 genes, the problem is to classify cancers in a 10,000-dimensional space. By selecting differentially expressed genes, classification is usually performed in a reduced-dimensional space, requiring an adequate number of samples. In contrast, it is impractical to reduce the number of SNP markers in GWAS. Thus, the aim of GWAS is limited to the discovery of individual loci associated with a disease; GWAS cannot identify association of a disease with a combination of two or more genes. This is due to the "curse of dimensionality", but an alternative explanation is as follows: when the number of tag SNP markers is 50,000, the number of two-gene combinations is 1,249,975,000. It is impractical to perform this huge number of statistical tests, because it would require far larger cohorts and a very low threshold p-value. The next-generation sequencers do not solve these statistical problems. Complex diseases are most likely mediated by numerous loci (both coding and non-coding) that interact with many environmental factors. Some argue that whole-genome sequencing would be useful for identifying such loci, but the real obstacle would be the statistical problem, which implies a requirement for a huge number of samples. This would also be the case in the cancer genome projects.
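The combination count quoted above is simply "50,000 choose 2" and can be checked directly:

```python
from math import comb

tag_snps = 50_000
pairs = comb(tag_snps, 2)     # unordered two-locus combinations, n*(n-1)/2
print(pairs)
# -> 1249975000, matching the figure in the text
```

A Bonferroni-style correction over this many pairwise tests would push the per-test threshold down by more than four orders of magnitude relative to single-locus GWAS, which is why far larger cohorts would be needed.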
Probably the most reasonable application is the identification of genes responsible for familial disorders. Familial disorders with small pedigrees, which cannot be subjected to linkage analysis, would be good targets; the study of familial pancreatic cancer described above is a good example.