The 1000 Genomes Project aims to provide detailed genetic variation data on over 1000 genomes from worldwide populations using next-generation sequencing technologies. Some of the samples utilized for the 1000 Genomes Project are the International HapMap samples, which are composed of lymphoblastoid cell lines derived from individuals of different world populations. These same samples have been used in pharmacogenomic discovery and validation. For example, a cell-based, genome-wide approach using the HapMap samples has been used to identify pharmacogenomic loci associated with chemotherapeutic-induced cytotoxicity, with the goal of identifying genetic markers for clinical evaluation. Although the coverage of the current HapMap data is generally high, the detailed map of human genetic variation promised by the 1000 Genomes Project will allow a more in-depth analysis of the contribution of genetic variation to drug response. Future studies utilizing this new resource may greatly enhance our understanding of the genetic basis of drug response and other complex traits (e.g., gene expression) and, therefore, help advance personalized medicine.
Although environment, diet, age, gender and lifestyle as well as other nongenetic factors (e.g., socioeconomic status) can influence a patient’s response to therapeutic treatments, understanding an individual’s genetic makeup is believed to be the key to realizing personalized medicine, which aims to maximize drug efficacy and minimize adverse side effects. Personalized medicine is particularly appealing to oncologists because of the severity of adverse events and the high likelihood of mortality associated with nonresponse. Side effects can include nephrotoxicity, neurotoxicity and ototoxicity [1,2], and this toxicity is often observed earlier than the therapeutic effect. The fundamental problem is that many anticancer drugs have a narrow therapeutic index; thus, small changes in dosage can cause unacceptable toxic responses that, in extreme cases, could lead to fatalities. Pharmacogenomics holds the promise to advance individualized medicine so that therapeutic decision-making would be based on each individual patient’s own genetic makeup.
Similar to complex health-related phenotypes, such as the risks of some common diseases (e.g., cancer, diabetes, heart disease and stroke), drug response is often affected by multiple genes in addition to other nongenetic factors. Genes responsible for drug intake, drug metabolism and drug excretion can be involved in an individual’s response to therapeutic agents. Previous studies on some well-characterized candidate genes (e.g., drug-metabolizing enzymes) have suggested that DNA variations, especially those in the form of SNPs, in genes that code for these enzymes can influence their ability to break down, convert and efficiently eliminate drugs from the body. A recent example is the observation that reduced cytochrome P450 2D6 activity leads to the therapeutic failure of tamoxifen in the prevention and treatment of breast cancer, because the prodrug is not converted to its active forms. Some classical examples include the identification of genetic polymorphisms in thiopurine S-methyltransferase (TPMT), which lead to decreased TPMT enzyme activity and subsequently increased 6-mercaptopurine toxicity, and the reduced-activity UGT1A1*28 polymorphism, which is associated with an increased risk of irinotecan treatment-associated neutropenia. With the launch of the Human Genome Project and several parallel research efforts, including the International HapMap Project, which aimed to develop a haplotype map of the human genome to describe the common patterns of human DNA sequence variation [7,8,101], it is now possible to scan the entire human genome or targeted genomic regions to identify genetic determinants responsible for drug-induced effects. Unlike the traditional candidate-gene approach, genome-wide association studies (GWAS) do not require a priori assumptions about which genes are involved and, therefore, offer an unbiased approach.
Notably, pharmacogenomic approaches have been applied to identifying candidate loci associated with response to therapeutic treatments for various diseases (e.g., asthma, psychiatric disorders, leukemia [11,12], stroke and cardiovascular disease). Particularly, during the past few years, GWAS using the Epstein Barr Virus-transformed lymphoblastoid cell lines (LCLs; e.g., the HapMap samples) have demonstrated the feasibility of integrating whole genome gene expression [15,16] and genotypic data (e.g., >3.1 million SNPs) to identify genes and/or genetic variants associated with the cytotoxicities of anticancer agents, for example 5-fluorouracil [18,19], docetaxel, etoposide, cisplatin, daunorubicin, carboplatin, cytarabine [24–26] and gemcitabine [25,26] (for reviews see [27–29]). Although the current LCL-based model has been demonstrated to be useful in pharmacogenomic discovery, we will provide our perspective on how the ongoing 1000 Genomes Project, which will generate a much more detailed map of genetic variation using over 1000 LCL samples from diverse populations, could help overcome some of its limitations (e.g., untyped SNPs in the HapMap samples) and, therefore, benefit the next wave of pharmacogenomic discovery.
The LCLs, particularly the HapMap samples, have proved to be a very useful model for pharmacogenomic discovery, and may represent the best model for hematologic toxicities associated with chemotherapeutic agents [28,29]. Major advantages of this cell-based model are that it avoids giving chemotherapy to unaffected family members for genetic studies and that an enormous amount of genotypic (e.g., SNPs) and phenotypic (e.g., gene expression) data on these samples is publicly available (National Institute of General Medical Sciences Cell Repository). Another major advantage of this model is that the HapMap samples were derived from three major geographical populations (Caucasian residents with northern and western European ancestry from UT, USA [CEU]; Yoruba people in Ibadan, Nigeria [YRI]; and Asian samples, including Japanese in Tokyo [JPT] and Han Chinese in Beijing, China [CHB]), thereby allowing for inter-ethnic comparisons in cellular sensitivity to drugs. In fact, previous studies have shown significant differences in cytotoxicities to certain anticancer drugs between human populations.
However, there are currently some important limitations and challenges associated with the HapMap LCL samples. Some limitations are ‘intrinsic’. For example, the LCLs represent only one tissue type from apparently healthy individuals; therefore, they may not reflect tumor response or sensitivity of target tissue of known toxicity. In addition, only approximately 50–60% of human genes are estimated to be expressed in LCLs. Obviously, a more comprehensive understanding of the genetic basis of drug response will need to consider other tissues or possibly tumors.
Fortunately, some of the limitations and challenges of the current model will be addressed with advancement of technologies, development of new algorithms and better study design. Since the original HapMap panel comprises 90 CEU (30 parents–offspring trios), 90 YRI (30 parents–offspring trios), 45 unrelated CHB and 45 unrelated JPT samples, there may not be enough statistical power to identify genetic variants with small-to-medium effect sizes. A recent study showed that there is significant genetic variation between the two Asian samples (CHB and JPT), suggesting that simply combining these samples as a single Asian population in studies might lead to spurious associations. This limitation may especially exacerbate the power issue for the Asian samples (45 samples in each population), because over 55 samples would be needed to attain 80% power to identify a variant with medium effect size (e.g., 0.15). On the other hand, although the coverage of the HapMap Project data is believed to be generally high, comparison studies have shown that the HapMap genotypic data may not be able to capture a substantial proportion of untyped SNPs [34,35]. For example, Tantoso et al. showed that the SNPs from the HapMap YRI samples capture only approximately 30% of the variants when compared with a deep-resequencing project from the NIEHS Environmental Genome Project, and overall, the HapMap SNPs were not robust enough to capture the untyped variants for most of the genes they surveyed. This is not surprising, because the efforts of the International HapMap Project have been focused on characterizing common genetic variants with allele frequencies of greater than 5% [7,8,101]. Thus, for example, untyped or unknown rarer SNPs with large effects cannot be identified using the currently available data on these samples.
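The sample-size figure quoted above can be roughly reproduced with a short calculation. The sketch below is illustrative only and is not the method used in the cited studies: it assumes that the effect size of 0.15 is Cohen's f² for a single-SNP regression, converts it to the equivalent correlation, and applies the Fisher z normal approximation with a two-sided α of 0.05. All function names are hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def f2_to_r(f2):
    """Convert Cohen's f^2 to a correlation, using f^2 = r^2 / (1 - r^2)."""
    return sqrt(f2 / (1.0 + f2))


def approx_power(n, f2, alpha=0.05):
    """Approximate two-sided power with n samples (Fisher z approximation).

    Assumption: a single predictor (one SNP) and a quantitative phenotype.
    """
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    signal = atanh(f2_to_r(f2)) * sqrt(n - 3)
    return _STD_NORMAL.cdf(signal - z_crit)


def approx_required_n(f2, power=0.80, alpha=0.05):
    """Smallest n reaching the requested power under the same approximation."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(((z_crit + z_power) / atanh(f2_to_r(f2))) ** 2 + 3)


if __name__ == "__main__":
    print("approximate n for 80% power:", approx_required_n(0.15))
    for n in (45, 90):
        print(f"n={n}: approximate power {approx_power(n, 0.15):.2f}")
```

Under these assumptions the approximation calls for somewhat more than 55 samples, and a panel of 45 falls short of 80% power, which is in line with the statement above. An exact calculation based on the noncentral F distribution may give a slightly different number.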
In addition, although the HapMap LCL model is of tremendous value in the discovery stage, before being tested in clinical trials, these identified pharmacogenomic loci will need to be thoroughly validated in independent replication sets and/or their functions may need to be determined. This raises a challenge to the current HapMap LCL model (i.e., sample coverage of only three major populations), which, although allowing limited cross-validation between the three populations, does not accommodate validations either in the same population or across more geographical populations. Furthermore, the associated loci from the current studies may simply be proxies for underlying causal genetic variants, which may not be genotyped in the HapMap data. Theoretically, these ‘nonintrinsic’ limitations or challenges could be overcome by, for example, a large-scale deep-resequencing project that aims to provide a much more detailed map of human genetic variation on a much larger number of world-wide samples. The 1000 Genomes Project aims to do just that.
The success of large-scale sequencing efforts depends on the capability to process a large number of samples in parallel and obtain reliable sequencing data in an acceptable turnaround time and at a sustainable cost. Recently, several new sequencing instruments referred to as ‘next-generation’ or ‘massively parallel’ sequencing platforms have become available for the fast, inexpensive sequencing of whole genomes [36,37]. Some relatively mature platforms include the GS-FLX™ (454) sequencer (Roche, CT, USA), the Genome Analyzer (Illumina, CA, USA) and the Sequencing by Oligo Ligation and Detection (SOLiD™; Applied Biosystems, CA, USA), as well as the so-called single-molecule sequencing technologies [38,39] from Helicos Biosciences (MA, USA) and Pacific Biosciences (CA, USA). In contrast to conventional capillary-based sequencing, these next-generation sequencers are able to process millions of sequencing reads in parallel rather than 96 at a time, although individual platforms have different performance characteristics.
Using next-generation sequencing technologies (i.e., the Illumina platform and the Applied Biosystems SOLiD), a deep-resequencing project launched in 2008, the 1000 Genomes Project, ambitiously aims to provide the most detailed map of human genetic variation yet by sequencing at least 1000 human genomes from world-wide populations. As the 1000 Genomes Project focuses on samples for which consent has been obtained for open access on the web without needing approval for each use, these requirements have led to the choice of the HapMap samples (HapMap Phase 3 panel). Besides the original samples from the International HapMap Project (CEU, YRI, JPT and CHB) [7,8,101], the following seven populations are also included in the study: Luhya in Webuye, Kenya (LWK); Maasai in Kinyawa, Kenya (MKK); Toscani in Italy (TSI); Gujarati Indians in Houston, TX, USA (GIH); Chinese in Metropolitan Denver, CO, USA (CHD); people of Mexican ancestry in Los Angeles, CA, USA (MEX); and people of African ancestry in the southwestern USA (ASW). Particularly, the Epstein Barr Virus-transformed LCLs and DNA samples derived from these individuals are available through the NIGMS Human Genetic Cell Repository (for the CEU samples) and the NHGRI Sample Repository for Human Genetic Research (for the other ten populations). The specified aims of this project are to identify over 95% of the variants with allele frequencies of more than 1% in parts of the human genome that can be sequenced, over 95% of the variants with allele frequencies over 0.1–0.5% in exons, as well as structural variants, such as copy-number variants (CNVs), other insertions and deletions, and inversions, including sequence-level understanding of breakpoints.
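To give a sense of why sample size matters for these allele-frequency targets, the following sketch computes the probability that a variant with a given allele frequency is carried by at least one individual in a sample. It is a simplified illustration and not part of the project's stated design: it assumes unrelated diploid individuals whose chromosomes are drawn independently, and it ignores sequencing depth and variant-calling errors, so it should be read as an upper bound on what sequencing could detect. The function name is hypothetical.

```python
def prob_variant_present(allele_freq, n_individuals):
    """Probability that at least one of 2n sampled chromosomes carries the variant.

    Assumptions (illustrative only): unrelated diploid individuals and
    independent sampling of chromosomes from one population.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # Compare a HapMap-sized unrelated panel with a 1000-genome panel.
    for freq in (0.05, 0.01, 0.001):
        for n in (60, 1000):
            p = prob_variant_present(freq, n)
            print(f"allele frequency {freq:.3f}, n={n}: P(present) = {p:.3f}")
```

Under these assumptions, a variant with a 1% allele frequency is almost certain to be present among 1000 individuals, whereas a variant with a 0.1% allele frequency is present with a probability of roughly 0.86.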
In December 2008, the 1000 Genomes Project announced the release of the first set of SNP calls for four individuals (three samples from a CEU parents–child trio and one YRI sample) that are part of the high-coverage pilot project (>20×). In addition to these four samples, the current data release (accessed on 28 July 2009) covers more than 700 samples of the low-coverage project (2×). In addition to SNP genotypic calls, for each LCL sample, the raw project data, including FASTQ files (nucleotides and quality assessments), Binary Alignment/Map (BAM) files, and FASTA files for the human genome reference assembly, have also been released through the Short Read Archive at the NCBI. The NCBI Short Read Archive is specifically designed for short read data, and will be making the complete project data available in the future. Currently, indel calls are available for the trio children NA12878 (CEU) and NA19240 (YRI; May 2009). Additional updates of the 1000 Genomes Project data are expected to be released regularly.
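For readers unfamiliar with the FASTQ format mentioned above, the sketch below reads four-line FASTQ records and converts each quality string to Phred scores. It assumes the Sanger convention (ASCII offset 33) and unwrapped four-line records; files that use another offset or wrap sequence lines would need adjusting. The function names and the example read are made up for illustration.

```python
PHRED_OFFSET = 33  # assumption: Sanger-style (Phred+33) quality encoding


def parse_fastq(lines):
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, "").strip()
        separator = next(it, "").strip()
        quality = next(it, "").strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield header[1:], sequence, scores


def error_probability(phred_score):
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]  # made-up read
    for read_id, sequence, scores in parse_fastq(example):
        print(read_id, sequence, scores)
```

In practice, `parse_fastq` can be passed an open file handle directly, since it only iterates over lines.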
Owing to the huge amount of data and the new data types, the analysis of the 1000 Genomes Project data poses formidable informatics challenges. The 1000 Genomes Project provides a web-based browser to facilitate immediate analysis of the 1000 Genomes data (December 2008) by the whole scientific community. This Ensembl-based browser integrates the SNP calls and read coverage for the four genomes (three CEU and one YRI) in the high-coverage pilot 2. The current version (accessed 7 October 2009) supports the viewing of the consequences of sequence variation at the level of each transcript in the genome (Transcript SNP View) as well as showing read-depth data alongside SNPs (SeqAlign View) relative to the NCBI Build 36 reference (October 2005). Other bioinformatics tools, such as EagleView [40,109] and MapView [41,110], which were tailored for the next-generation sequencing technologies, would also be useful for visualizing and analyzing these new data.
In addition, a pharmacogene database enhanced by the 1000 Genomes Project was built for the community to immediately evaluate and utilize these newly released data [42,111]. Particularly, this database can be used to access SNP genotypic calls (both novel SNPs and known SNPs based on the dbSNP v129) of 39 pharmacogenetic candidate genes, maintained by the Very Important Pharmacogenes (VIP) project of the Pharmacogenetics Knowledge Base (PharmGKB), on 35 HapMap CEU and 26 HapMap YRI samples (April 2009) [43,112]. The VIP project is an initiative to provide annotated information regarding genes, variants, haplotypes and splice variants of particular relevance for pharmacogenetics and pharmacogenomics. A major advantage of this pharmacogene database [42,111] is that it allows the convenient extraction of genotypic calls on novel SNPs that have not been genotyped in the previous HapMap Phase 2 data [17,101]. Table 1 shows the summary of identified novel (i.e., SNPs not recorded in dbSNP v129) and known SNPs (i.e., SNPs included in dbSNP v129) in the 21 VIPs expressed in the CEU samples (based on criteria in Zhang et al.). It is clear from Table 1 that there is a substantial number of novel SNPs in many of these candidate genes (e.g., AHR has 25 novel SNPs vs 29 known SNPs). Convenient links to resources, such as the PharmGKB web-site, the Database of Genomic Variants, Gene Ontology and the SNP and CNV Annotation Database (SCAN) [46,113], are also provided to allow researchers to access important and relevant information on these genes. Even at this early stage of the 1000 Genomes Project, this pharmacogene database demonstrates the potential impact of these data on pharmacogenomic discovery [42,111]. For example, a novel common SNP located in dihydropyrimidine dehydrogenase (DPYD) was found to be associated with hydroxyurea response in the CEU samples.
To leverage the available resources, the design of this pharmacogene database was made compatible with the PharmGKB, thereby allowing it to be integrated into the PharmGKB in the future [43,112].
The 1000 Genomes Project will greatly expand the sample size and target populations compared with the HapMap Project [7,8,101]. The availability of 11 diverse populations from the HapMap Phase 3 panel, therefore, offers unprecedented opportunities to compare complex phenotypes (e.g., gene expression and drug response) across important current human populations that are relevant to real-world patient demography (e.g., African–Americans and Mexican–Americans in the USA). By contrast, previous studies on gene expression and drug response [28,29] have primarily focused on the three original HapMap populations, which, although supposedly representing a large proportion of world populations, may not reflect the complex genetic structure of certain populations. For example, the ancestry of African–Americans is predominantly from Niger–Kordofanian (~71%), European (~13%) and other African (~8%) populations, although admixture levels vary considerably among individuals.
Microarray platforms have proved to be powerful tools for profiling whole-genome gene expression. Taking advantage of the 1000 Genomes Project genotypic data, the profiling of gene expression in some HapMap 3 samples could provide novel insights into questions surrounding population differences in gene expression as well as the genetic architecture of gene expression. Although previous studies using the HapMap samples demonstrated some important findings (e.g., common SNPs that account for gene expression variation between populations), only approximately 30% of differentially expressed genes were found to be accounted for by allele frequency differences of either cis- or trans-acting SNPs [15,48,49]. Although it is possible that other gene regulatory mechanisms, such as miRNA and epigenetics (e.g., DNA methylation), could be responsible for the gene expression variation, no doubt a more detailed map of genetic variation from the 1000 Genomes Project can be used to comprehensively evaluate the contribution of both common and relatively rarer SNPs to gene expression variation. Since gene expression is an intermediate phenotype that sits between DNA sequence variation and higher-level cellular or whole-body phenotypes (e.g., disease susceptibility and individualized drug response), novel insights into gene variation and regulation could enhance our understanding of the observed differences in complex traits including drug response between individual patients and different human populations.
Current pharmacogenomic studies were designed to identify common genetic variants, especially SNPs, which have been found to account for up to 30–50% of the observed variation in drug response [27–29]. Although other mechanisms, such as CNVs and nongenetic factors, could contribute to the remaining fraction of drug response variation, a more thorough evaluation of the contribution of genetic variants to drug response could be achieved using the more comprehensive genotypic data from the 1000 Genomes Project. Particularly, given the current sample size and design of the 1000 Genomes Project (e.g., <100 samples for a population), these data may allow identification of certain rare variants with relatively large effects. As the statistical power to identify associations depends on allele frequency, sample size and effect size, the current sample size may still not provide enough power to identify rare variants with only moderate or minor effects. The current GWAS using the HapMap data largely ignored the contribution of rare SNPs, although their effects on drug response have been appreciated in previous studies [53,54]. The current findings from association studies are likely to be just proxies for the causal genetic variants. Functional validations of these associated SNPs have often been very challenging. A more detailed map of human genetic variation offers the possibility of locating the true causal variants. Therefore, by integrating the 1000 Genomes Project data with the systematic phenotyping (e.g., mRNA expression profiling, miRNA expression profiling, and drug response phenotyping) of the new samples covered by the project, it will be possible to identify new candidate loci (both previously untyped common and rare ones) and pinpoint causal variants.
In medicine, clinical response to drugs may range widely within and among human populations. Personalized medicine has the benefits of providing safer dosing options, assisting physicians to make better treatment decisions, and facilitating clinical trials in target patient populations. Although they are still in their early stages, pharmacogenomic studies using the HapMap LCL model have demonstrated the potential and promise of this model to identify genetic variants associated with drug cytotoxicity [27–29]. For the next wave of pharmacogenomic discovery and follow-up translational research, however, challenges raised from the current cell-based models must be addressed. The 1000 Genomes Project, aiming to provide a detailed map of genetic variation in over 1000 individuals worldwide, could greatly expand the scope and depth of the current studies by increasing sample size, number of representative populations and the coverage of both common and rare genetic variants. One potential limitation for using these data immediately is that the 1000 Genomes Project is still at its early stage. For example, some scientists question how accurate the finished genomes will be (e.g., missing genomic regions and rare variants), given the low coverage of the early-stage data (mostly ~2×) and the project's short timeline and low budget, and point to the lack of phenotypic information (e.g., medical records and basic data such as weight and height). Although some criticisms (e.g., lack of medical information) are arguable, the 1000 Genomes Project has plans to evaluate the effect of coverage depth and perform regional comparisons with other deep resequencing projects such as the Encyclopedia of DNA Elements (ENCODE) project [56,114].
Therefore, reasonable expectations are that, after some careful quality control and with progress in the next phase of the 1000 Genomes Project (e.g., ~20× coverage to be used to sequence some protein-coding regions), the final 1000 Genomes Project data will have an acceptable level of genomic coverage as well as high accuracy of allele calls.
Understandably, as a result of the amount of data (estimated at >2 TB) that will be made available from the 1000 Genomes Project, efficient bioinformatics tools will be needed to store and accommodate these data to facilitate their use in pharmacogenomic studies. For example, a pharmacogene database enhanced by the early release of the 1000 Genomes Project data offers a convenient way to utilize these data on some well-characterized candidate pharmacology-related genes [42,111]. In addition, previous pharmacogenomic studies have focused on common genetic variants (individual SNPs). To take advantage of the more detailed 1000 Genomes Project data, novel statistical algorithms or data analysis approaches may be necessary to study the contribution of relatively rare variants, CNVs and indel calls from the 1000 Genomes Project to drug response. Pharmacogenomic studies focusing beyond common SNPs will improve our understanding of the genetic basis of drug response. The diverse sample coverage of the 1000 Genomes Project [102,106] can also allow comparisons of genetic factors responsible for drug response between human populations and, therefore, may potentially help identify any race- or population-specific genes and/or variants important for drug response. Furthermore, because of its comprehensiveness, the 1000 Genomes Project data can potentially be used to impute genotypes that are missing from the currently available HapMap genotypic data. For example, genotypic data on only approximately 1 million SNPs are available for the HapMap Phase 3 samples. Untyped SNPs in these samples may, therefore, be imputed using the 1000 Genomes Project data, thus providing more comprehensive genomic coverage for future studies using these samples, including pharmacogenomic studies.
Prospectively, besides gene expression, whole-genome profiling of other molecular targets (e.g., DNA methylation and miRNA expression) on these samples could help build a much more comprehensive drug response model by integrating various ‘-omics’ data. Finally, to leverage the power of existing tools, it is expected that findings based on the 1000 Genomes Project (e.g., a pharmacogene database [42,111]) will be fully integrated into resources such as the PharmGKB and UCSC Genome Browser [58,115]. In summary, the 1000 Genomes Project will be an important resource for the next wave of pharmacogenomic discovery, will greatly enhance our understanding of the genetic basis of drug response, and will prove to be a major step forward on the road to personalized medicine.
Financial & competing interests disclosure
Some of the research described in this paper was funded by the Pharmacogenetics of Anticancer Agents Research (PAAR) Group (www.pharmacogenetics.org) grant NIH/NIGMS U01GM61393, the University of Chicago Breast Cancer SPORE grant NIH/NCI P50CA125183 and NCI CA136765. The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.