Motivation: Widespread availability of low-cost, full genome sequencing will introduce new challenges for bioinformatics.
Results: This review outlines recent developments in sequencing technologies and genome analysis methods for application in personalized medicine. New methods are needed in four areas to realize the potential of personalized medicine: (i) processing large-scale robust genomic data; (ii) interpreting the functional effect and the impact of genomic variation; (iii) integrating systems data to relate complex genetic interactions with phenotypes; and (iv) translating these discoveries into medical practice.
Supplementary information: Supplementary data are available at Bioinformatics online.
We are on the verge of the genomic era: doctors and patients will have access to genetic data to customize medical treatment. Consumers can already get 500 000–1 000 000 variant markers analyzed with associated trait information (Hindorff et al., 2009), and soon full genome sequencing will cost less than $1000 (Drmanac et al., 2010). One group has performed a complete clinical assessment of a patient using a personal genome (Ashley et al., 2010), and the 1000 Genomes Project is sequencing 1000 individuals (1000 Genomes Project Consortium et al., 2010). In the coming years, the bioinformatics world will be inundated with individual genomic data. This flood of data introduces significant challenges that the bioinformatics community needs to address. This review outlines the developments that led to these challenges, the previous work that can address them and the need for new methods to address them. The challenges fall into four main areas: (i) processing large-scale robust genomic data; (ii) interpreting the functional impacts of genomic variation; (iii) integrating data to relate complex interactions with phenotypes; and (iv) translating these discoveries into medical practices.
In the last decade, molecular science has made many advances to benefit medicine, including the Human Genome project, International HapMap project and genome wide association studies (GWASs) (International HapMap Consortium, 2005). Single nucleotide polymorphisms (SNPs) are now recognized as the main cause of human genetic variability and are already a valuable resource for mapping complex genetic traits (Collins et al., 1997). Thousands of DNA variants have been identified that are associated with diseases and traits (Hindorff et al., 2009). By combining these genetic associations with phenotypes and drug response, personalized medicine will tailor treatments to the patients' specific genotype (Fig. 1). Although whole genome sequences are not used in regular practice today (McGuire and Burke, 2008), there are already many examples of personalized medicine in current practice. Chemotherapy medications such as trastuzumab and imatinib target specific cancers (Gambacorti-Passerini, 2008; Hudis, 2007), a targeted pharmacogenetic dosing algorithm is used for warfarin (International Warfarin Pharmacogenetics Consortium et al., 2009; Sagreiya et al., 2010) and the incidence of adverse events is reduced by checking for susceptible genotypes for drugs like abacavir, carbamazepine and clozapine (Dettling et al., 2007; Ferrell and McLeod, 2008; Hetherington et al., 2002).
Despite all of these advances, many challenges need to be addressed to make personalized medicine a reality. Today, a patient's genetics are consulted only for a few diagnoses and treatment plans and only in certain medical centers. Even if doctors had access to their patients' genomes today, only a small percentage of the genome could even be used (Yngvadottir et al., 2009). Many of the annotations come from association studies, which tend to identify variants with small effect sizes and have limited applications for healthcare (Moore et al., 2010). By addressing the challenges outlined in this review, bioinformatics will create the tools to tailor medical care to each individual genome, rather than rely on blanket therapies (Ginsburg and Willard, 2009).
Sequencing technologies are becoming affordable and are replacing the microarray-based genotyping methods, which were limited to interrogating regions of known variation (Ng et al., 2010). Now a whole genome or a few dozen exomes can be sequenced in <2 weeks with an error rate of ~1 error per 100 kb (Drmanac et al., 2010). Even such low error rates can lead to a significant number of errors; a 3 Gb human genome would have ~30 000 erroneous variant calls.
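The arithmetic behind that estimate can be checked with a minimal sketch. It assumes errors are spread uniformly and independently along the genome; the function name and constants are illustrative only and do not come from any sequencing pipeline.

```python
# Back-of-the-envelope estimate of erroneous calls per genome.
# Assumption: errors are uniform and independent along the genome.
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bases in a human genome (from the text)
BP_PER_ERROR = 100_000          # ~1 error per 100 kb (from the text)


def expected_errors(genome_size_bp, bp_per_error):
    """Expected number of erroneous calls for a genome of the given size."""
    return genome_size_bp / bp_per_error


print(expected_errors(GENOME_SIZE_BP, BP_PER_ERROR))  # 30000.0
```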
The error rate from these technologies is a source of significant challenges in applications, including discovering novel variants. Each newly sequenced genome is expected to have between 100 000 and 300 000 previously undiscovered SNPs and <1000 somatic mutations per generation (1000 Genomes Project Consortium et al., 2010). The number of expected mutations may decrease as new genomes are sequenced; however, such a high number of errors turns variant discovery into a ‘needle in a haystack’ problem. Whenever a novel variant is identified, it will still have to be verified due to this false positive rate. In addition, other classes of variation, such as short insertion–deletion variants (indels), as well as copy number variants (CNVs) and structural variants (SVs), are even more difficult to detect using high-throughput sequencing. New algorithms for calling indels, CNVs and SVs from read data will be crucial in detecting these types of variations for clinical applications.
Even high-quality sequence reads must be placed into their genomic context to identify variants, which is an active area of research since, for example, different mapping and alignment algorithms often yield different results. Because de novo assembly (Shendure and Ji, 2008) is slow and complicated by repetitive elements, sequences are usually mapped to a genomic reference sequence instead. Algorithms such as BLAST (Altschul et al., 1990) or Smith–Waterman (Smith and Waterman, 1981) have been traditionally used, but their execution speed depends on the genome size. While individual queries may only take seconds per CPU, aligning 100 million of them would require more than 3 CPU years.
As a result, new algorithms are being developed to address this problem. BLAT is similar to standard sequence alignment, but also incorporates an indexed version of the genome instead of linear search (Kent, 2002). Many packages like BLAT have been optimized for the alignment of short reads by using hashing, prefix and suffix trees or other heuristics (Li and Homer, 2010). BWA, used for the 1000 Genomes Project, is highly accurate with <0.1% errors for simulated data and can map ~7 GB of short reads per CPU day (Li and Durbin, 2009; Li and Homer, 2010). To achieve the standard 30X coverage would still require 13 CPU days and so is ideally performed on a cluster or by using a cloud computing environment (Dudley and Butte, 2010), which can be used for efficient computational analysis of secure clinical data.
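The two compute estimates quoted above can be reproduced with a short sketch. It assumes one second per alignment query (the text says only "seconds") and a genome of 3 gigabases; the helper names are hypothetical.

```python
# Rough compute budgets for read alignment, using figures quoted in the text.
SECONDS_PER_YEAR = 365 * 24 * 3600


def cpu_years(n_queries, seconds_per_query=1.0):
    """CPU years needed to run n_queries alignments one after another.

    Assumption: one second per query.
    """
    return n_queries * seconds_per_query / SECONDS_PER_YEAR


def cpu_days_for_coverage(genome_gb, coverage, gb_per_cpu_day):
    """CPU days needed to map enough reads for the requested coverage."""
    return genome_gb * coverage / gb_per_cpu_day


print(f"{cpu_years(100_000_000):.2f} CPU years")          # 3.17 CPU years
print(f"{cpu_days_for_coverage(3, 30, 7):.1f} CPU days")  # 12.9 CPU days
```

At 30X coverage the reads total about 90 gigabases, which is where the figure of roughly 13 CPU days comes from.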
A remaining challenge for short read assemblers is reference sequence bias: reads that more closely resemble the reference sequence are more likely to map successfully than reads that contain valid mismatches. Proper care must be taken to avoid errors in these alignments, a topic discussed in a recent review (Pool et al., 2010). There is an inherent trade-off in allowing mismatches: the program must allow for mismatches without resulting in false alignments. Reference sequence bias is important when making heterozygous SNP calls and when analyzing allele-specific expression using RNA-Seq data (Degner et al., 2009). The problem is exacerbated with longer reads: allowing for one mismatch per read is acceptable for 35 bp reads, but insufficient for 100 bp reads.
When the diploid sequence is known, reference sequence bias can be avoided by mapping the reads to both strands, as can be done when mapping RNA-Seq reads to a sequenced genome. An alternative approach is to use ambiguous base codes to avoid the requirement of storing redundant sequences, such as with MOSAIK, developed by the Marth Lab (Michael Stromberg, Boston University). Using this approach, a C/T SNP can be represented as Y. This representation increases the storage requirements: because the genome is often stored in a hashed data structure, the number of keys and mappings increases to accommodate the new codes.
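The ambiguous-base idea can be illustrated with the standard IUPAC nucleotide codes. The sketch below is not MOSAIK's implementation; the function names and the position-to-allele dictionary are assumptions made for illustration.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}
# Reverse lookup: ambiguity code -> set of bases it stands for.
CODE_TO_BASES = {code: set(bases) for bases, code in IUPAC_CODES.items()}


def ambiguity_code(alleles):
    """Return the IUPAC code covering every allele in `alleles`."""
    key = "".join(sorted(set(a.upper() for a in alleles)))
    return IUPAC_CODES[key]


def encode_reference(reference, snps):
    """Replace known SNP positions in a reference with ambiguity codes.

    `snps` maps a 0-based position to the alternate allele (hypothetical format).
    """
    seq = list(reference)
    for pos, alt in snps.items():
        seq[pos] = ambiguity_code(seq[pos] + alt)
    return "".join(seq)


def base_matches(read_base, ref_code):
    """True if a read base is compatible with a (possibly ambiguous) code."""
    return read_base.upper() in CODE_TO_BASES[ref_code]


print(encode_reference("ACCGT", {2: "T"}))  # ACYGT
print(base_matches("T", "Y"), base_matches("G", "Y"))  # True False
```

A read carrying either C or T at the SNP position now matches the reference equally well, which is how this encoding removes the bias toward the reference allele.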
Another challenge is developing new methods for novel SNP discovery: while the calling of common variants can be aided by their presence in a database such as dbSNP, accurate detection of rare and novel variants will require increased confidence in the SNP call. De novo alignment methods require too much computation time to be feasible and reference alignment methods are biased. The challenge is to develop new algorithms that are computationally tractable and still avoid reference sequence bias.
Finally, there is a pressing need to improve quality control metrics. We can judge mapping and SNP call qualities by the ratio of transition (purine/purine or pyrimidine/pyrimidine) substitutions to transversion (purine/pyrimidine) substitutions. These ratios were established during previous sequencing efforts and we expect to see similar ratios (~2–2.1) for newly sequenced human genomes (Zhang and Gerstein, 2003). When working with genomes from families, we can estimate errors with the Mendelian inheritance error (MIE) rate: impossible combinations of inheritance most likely represent errors (Ewen et al., 2000). Transition/transversion ratio and MIE metrics are useful for measuring the quality of a dataset and are used by most large projects, such as the 1000 Genomes project (1000 Genomes Project Consortium et al., 2010). At the individual SNP level, we must rely on relative quality scores, so in order to confidently identify novel variants we must verify them with an independent method. Variants can be validated with targeted resequencing or genotyping arrays. Alternatively, whole genome resequencing by an orthogonal sequencing platform can be performed, but is expensive and time consuming.
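Both dataset-level metrics are simple to compute. The following is a minimal sketch, assuming SNPs are given as (reference, alternate) base pairs and genotypes as two-character strings for autosomal, biallelic sites; these input formats and function names are illustrative assumptions, not those of any particular project.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snps if ref != alt and not is_transition(ref, alt))
    return ti / tv if tv else float("inf")


def is_mendelian_error(child, mother, father):
    """True if the child's genotype cannot be inherited from the parents.

    Assumption: autosomal site, genotypes given as strings such as "AG".
    The child must receive one allele from each parent.
    """
    c1, c2 = child
    consistent = (c1 in mother and c2 in father) or (c2 in mother and c1 in father)
    return not consistent


calls = [("A", "G"), ("C", "T"), ("A", "C")]  # two transitions, one transversion
print(ti_tv_ratio(calls))                      # 2.0
print(is_mendelian_error("AA", "AG", "GG"))    # True: father cannot pass on A
print(is_mendelian_error("AG", "AA", "GG"))    # False
```

A call set whose ratio falls well below the expected ~2–2.1 would suggest an excess of false positives, since random errors favor transversions (there are twice as many possible transversions as transitions).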
After genomic data has been processed, the functional effect and the impact of the genetic variations must be analyzed. Genome-wide association studies (GWASs) have been used to assess the statistical associations of SNPs with many important common diseases (WTCC Consortium, 2007). These methods are providing new insights, but only a limited number of variants have been characterized, and understanding the functional relationship between associated variants and phenotypic traits has been difficult (Frazer et al., 2009).
In the strictest definition, a SNP is a single nucleotide variant where the allele frequency in the human population is higher than 1%. In this review, we use the term SNP in a broader sense to also include rare variants that occur in a smaller fraction of the population. Important issues for predicting the impact of SNPs are data management, retrieval and quality control. During the last few years, the number of known SNPs has increased at an exponential rate (Fig. 2). The dbSNP database (Sherry et al., 2001) is the most comprehensive repository of SNP data from different organisms. At the time of writing this review, the database contains about 20 million validated human SNPs (Build 132, September 2010). The Human Gene Mutation Database (HGMD) is a comprehensive collection of germline mutations in genes that are associated with human inherited diseases. The free version for academic and non-profit users contains more than 76 000 mutations from ~2900 genes. SwissVar is a database of manually annotated missense SNPs (mSNPs) and contains 56 000 mSNPs from >11 000 genes.
Another important resource for SNP data is the Online Mendelian Inheritance in Man (OMIM) database (Amberger et al., 2009) of human SNPs and their associations with Mendelian disorders. The PharmGKB database contains manually curated associations between genes and drugs and a catalog of genetic variations with known impact on drug response, including >40 very important pharmacogenes (VIPs) and over 3400 annotated drug–response variants. The Catalogue of Somatic Mutations in Cancer (COSMIC) at the Sanger Institute stores data on ~25 000 unique somatic mutations related to human cancer, extracted from the literature. A selection of the most significant SNP data sources is reported in Supplementary Table S1.
In the last few years, several computational methods have been developed to predict deleterious missense SNPs (Karchin, 2009; Mooney, 2005; Tavtigian et al., 2008). These methods have used different approaches such as empirical rules (Ng and Henikoff, 2003; Ramensky et al., 2002), Hidden Markov Models (HMMs) (Thomas and Kejariwal, 2004), Neural Networks (Bromberg et al., 2008; Ferrer-Costa et al., 2005), Decision Trees (Dobson et al., 2006; Krishnan and Westhead, 2003), Random Forests (Bao and Cui, 2005; Carter et al., 2009; Kaminker et al., 2007; Li et al., 2009; Wainreb et al., 2010) and Support Vector Machines (Calabrese et al., 2009; Capriotti et al., 2006, 2008; Karchin et al., 2005; Yue and Moult, 2006).
The input features of the prediction algorithms generally include amino acid sequence, protein structure and evolutionary information. The amino acid sequence features rely on the physicochemical properties of the mutated residues such as hydrophobicity, charge, polarity and bulkiness. Protein structural information describes the structural environment of the mutation and has been successfully used to predict the protein stability change upon mutation (Capriotti et al., 2004, 2005; Schymkowitz et al., 2005; Zhou and Zhou, 2002). Some of the most important features for the prediction of the impact of missense SNPs are derived from evolutionary analysis: critical amino acids are often conserved in protein families and so changes at conserved positions tend to be deleterious.
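One simple way to quantify conservation at a position is the Shannon entropy of the corresponding column in a multiple sequence alignment. The sketch below is only an illustration of that general idea and is not the scoring scheme of any predictor named in this review; the function names and the normalization by log2(20) are assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters ("-") are ignored.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation(column):
    """Conservation score in [0, 1]; 1 means a fully conserved column.

    Assumption: normalized by the maximum entropy over 20 amino acids.
    """
    return 1 - column_entropy(column) / math.log2(20)


print(conservation("LLLLLL"))            # 1.0 (invariant position)
print(round(conservation("LIVMFA"), 3))  # lower score for a variable position
```

Under this scheme, a substitution at a position scoring near 1 would be flagged as more likely to be deleterious than one at a variable position.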
New algorithms that include knowledge-based information are being developed (Alexiou et al., 2009; Calabrese et al., 2009; Kaminker et al., 2007). Methods based on evolutionary information for the prediction of mSNPs include SIFT (Ng and Henikoff, 2003) and PolyPhen (Ramensky et al., 2002). SIFT scores the normalized probabilities for all possible substitutions using a multiple sequence alignment between homologous proteins, and PolyPhen predicts the impact of mSNPs using different sequence-based features and a position-specific independent counts (PSICs) matrix from multiple sequence alignment. The PANTHER algorithm (Thomas et al., 2003) uses a library of protein family hidden Markov models to predict deleterious mutations. Recent work shows that 3D structural features improve the prediction of disease-related mSNPs (Bao and Cui, 2005; Karchin et al., 2005; Yue and Moult, 2006). Knowledge-based information has been used to increase the accuracy of prediction algorithms to over 80%. For example, SNPs&GO (Calabrese et al., 2009) is an algorithm based on functional information that takes as input log-odds scores calculated using Gene Ontology (GO) annotation terms. MutPred (Li et al., 2009) evaluates the probabilities of gain or loss of structure and function upon mutations and predicts their impact using a Random Forest-based approach. Selected methods for the prediction of deleterious mSNPs are listed in Supplementary Table S2 and more details about mSNP predictors have been recently reviewed (Cline and Karchin, 2011; Thusberg et al., 2011).
Prediction methods do not provide any information about the pathophysiology of the diseases and so experimental tests are required to validate genetic predictions. Laboratory validation is expensive and time consuming and so there is a need for fast and accurate methods for gene prioritization. Currently, the most effective strategy uses the concept of similarity to genes that are linked to the biological process of interest (guilt-by-association). The input data for the available gene prioritization methods are derived from functional annotation, protein–protein interaction (PPI) data, biological pathways and literature.
The SUSPECT algorithm prioritizes genes by comparing sequence features, gene expression data, Interpro domains and functional terms (Adie et al., 2006). ToppGene combines mouse phenotype data with human gene annotations and literature. MedSim uses functional information from human disease genes or proteins and their orthologs in mouse models (Schlicker et al., 2010). ENDEAVOUR is trained on genes involved in a known biological process and ranks candidate genes after considering several genomic data sources (Tranchevent et al., 2008). The G2D prioritization strategy is based on a combination of data mining on biomedical databases and sequence features (Perez-Iratxeta et al., 2005). PolySearch analyzes biomedical databases to build relationships between diseases, genes, mutations, drugs, pathways, tissues, organs and metabolites in humans (Cheng et al., 2008). MimMiner ranks phenotypes using text mining by comparing the human phenome and disease phenotypes (van Driel et al., 2006). PhenoPred detects gene–disease associations using the human PPI network, known gene–disease associations, protein sequences and protein functional information at the molecular level (Radivojac et al., 2008). GeneMANIA (Andersen et al., 2008) generates hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. The method takes as input genes from six organisms and analyzes them using information from different general and organism-specific functional genomics datasets. For more details about gene prioritizing tools, a recently published review (Tranchevent et al., 2010) and the Gene Prioritization Portal provide comprehensive descriptions of available predictors.
The methods for the analysis of SNPs are mainly limited to the prediction of the impact of missense SNPs. New methods are needed to evaluate the impact of insertion, deletion and synonymous SNPs. In addition, there is a need to detect functional regions in the genome so that the effect of intronic SNPs can be analyzed, such as those in promoter regions and splicing sites. For non-coding regions, conservation across species is more difficult to detect. Fortunately, with the fast growth of functionally annotated genomes our ability to predict the impact of non-coding variants will increase. For example, SNPs occurring in transcriptional motifs can affect transcription factor binding, which suggests functional consequences for variants in regulatory regions (Kasowski et al., 2010). Recently, a method to identify possible genetic variations in regulatory regions (is-rSNP) has been developed (Andersen et al., 2008). Is-rSNP combines phylogenetic information and transcription factor binding site prediction to identify variation in candidate cis-regulatory elements. The detection of variants affecting splice sites is also an important task. The Skippy algorithm (Woolfe et al., 2010) analyzes the genomic region surrounding the variant to predict severe effects on gene function through disruption of splicing. A more exhaustive description of the methods for the prediction of deleterious variants in non-coding regions has been recently published (Cline and Karchin, 2011).
Last year, the first edition of the Critical Assessment of Genome Interpretation (CAGI) was organized to assess the available methods for predicting phenotypic impact of genomic variation and to stimulate future research. In the first year of CAGI (http://genomeinterpretation.org/), the organizers provided six different sets of data for six different tasks. The majority of the participating groups submitted predictions for just two classes of experiments related to the detection of disease-related and function-modifying variants. A few groups submitted predictions for the other categories: evaluation of risky SNPs from GWAS studies, interpretation of the Personal Genome Project data, prediction of mutations to P53 function and the response of breast cancer cell lines to different drugs. Several available predictors performed well for disease and functional predictions and there were promising results in the other categories. In the future, competitions such as CAGI will improve the quality of the available prediction methods and will renew the challenge for the understanding of genomic variation data.
Given the complex phenotypes involved in personalized medicine, the simple ‘one-SNP, one-phenotype’ approach taken by most studies is insufficient. Most medically relevant phenotypes are thought to be the result of gene–gene and gene–environment interactions (Manolio et al., 2009). For example, drug response often depends on multiple pharmacokinetic and pharmacodynamic interactions, which form a robust and tolerant system with highly polymorphic enzymes and many interaction partners (Wilke et al., 2005). As a result of this complexity, a drug–response phenotype of interest is likely to depend on many genes and environmental factors.
Basic GWAS approaches for pharmacogenomics have had some success, including studies of warfarin that have linked the majority of variation in response to just two genes, CYP2C9 and VKORC1 (Limdi and Veenstra, 2008). These and other studies of warfarin have even led to a dosing algorithm that improves on the traditional clinical algorithm (International Warfarin Pharmacogenetics Consortium et al., 2009). Clopidogrel response has similarly been associated with variants of CYP2C19 (Shuldiner et al., 2009).
Despite this success, there is debate over whether or not traditional techniques will be successful for pharmacogenomics. There is concern that pharmacogenomics GWAS themselves are susceptible to many limitations: insufficient sample size, selection biases for genetic variants, environmental interactions that may affect the outcome measures and multiple gene–gene interactions that may underlie unexplained effects (Motsinger-Reif et al., 2010). These limitations become particularly difficult when researching rare events such as the pharmacogenetics of adverse events.
The methods for GWAS are designed for single marker associations and are known to have limitations in explaining the heritability of disease (Manolio et al., 2009). It is unlikely that these same methods will do any better with pharmacogenetics. In fact, if these methods are parameterized for the multiple-marker associations necessary for pharmacogenetics, then they will suffer from the ‘curse of dimensionality’ and lose a significant amount of statistical power (Bellman and Kalaba, 1959). For example, to evaluate all combinations of two SNPs for 1 million SNPs in a genome requires examining nearly 500 billion possibilities. The challenge for bioinformatics is to address this complexity by developing methods that combine multiple data sources without losing statistical power.
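The combinatorial growth is easy to verify. The sketch below counts SNP pairs and triples with Python's `math.comb` and, as an illustration of the loss of power, shows the per-test significance threshold under a Bonferroni correction; the choice of Bonferroni and of a 0.05 family-wise level are assumptions made here, not taken from the studies cited.

```python
import math

N_SNPS = 1_000_000

pairs = math.comb(N_SNPS, 2)    # all unordered pairs of SNPs
triples = math.comb(N_SNPS, 3)  # all unordered triples of SNPs

print(pairs)             # 499999500000, i.e. nearly 500 billion
print(f"{triples:.2e}")  # 1.67e+17

# Assumption: Bonferroni correction at a family-wise error rate of 0.05.
alpha = 0.05
print(f"{alpha / N_SNPS:.1e}")  # per-test threshold, single-SNP scan
print(f"{alpha / pairs:.1e}")   # per-test threshold, exhaustive pairwise scan
```

Each additional interacting marker multiplies the search space by several orders of magnitude, so the evidence required for any single combination to reach significance rises accordingly.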
Several groups have already tried to deal with this kind of complexity in GWAS for disease (Motsinger et al., 2007). Exhaustive search (Storey et al., 2005) and forward search (Consortium et al., 2007) have both been applied; however, the former can still lose statistical power and the latter may miss some associations. Model selection methods have been successful with disease and trait GWAS studies by using selection techniques to choose multifactorial models that balance the false positive rate, statistical power and computational requirements of the search (Lee et al., 2008; Wray et al., 2007; Wu and Zhao, 2009).
Given the size of the genomic datasets, dimensionality reduction methods such as principal components analysis, information gain and multifactor dimensionality reduction will be essential to make complex algorithms tractable (Hahn et al., 2003; Statnikov et al., 2005; Yeung and Ruzzo, 2001). Some of these methods have proven successful for finding multilocus associations with diseases such as hypertension and familial amyloid polyneuropathy type I (Soares et al., 2005; Williams et al., 2004). Many more feature selection techniques for bioinformatics are classified and discussed in a recent review (Saeys et al., 2007). These methods can be very effective when dealing with large datasets; however, they do not integrate with any external knowledge sources or inform the biology behind the interactions.
Systems biology and network approaches address the problem of complexity by integrating molecular data at multiple levels of biology including genomes, transcriptomes, metabolomes, proteomes and functional and regulatory networks (Kohl et al., 2010). We can view a disease or a drug–response phenotype as a global perturbation of networks from their stable state (Auffray et al., 2009). This approach integrates biological knowledge from networks to make inferences about what genes or combinations of genes and other biological markers are more likely to be associated.
Combining disparate data sources can result in novel associations and provide insight into gene–gene and gene–environment interactions. One group created a disease–gene network by combining the diseases and associated genes available in OMIM (Goh et al., 2007). Analyzing this network showed that disease genes are often non-essential and not necessarily hub genes. The same group created a drug–target network and integrated that network with a PPI network. The network shows that similar drugs cluster together, palliative and etiological drugs show different topologies, and newer and experimental drugs tend toward polypharmacology (Yildirim et al., 2007). A global mapping of pharmacological space can be made using chemical structure, disease indication and protein sequence and can be used to make predictions of polypharmacology (Paolini et al., 2006). Another suggestion is to integrate epigenetic information to further our understanding of drug phenotypes (Zhang and Dolan, 2009).
Pathway and gene set methods can also be applied to GWAS, where a set of genes is identified that is suspected to be associated. These methods are similar to Gene Set Enrichment Analysis (GSEA) for microarray expression data (Subramanian et al., 2005). Usually a standard statistical test is used to determine if a set of genes is associated (Chasman, 2008; Wang et al., 2007; Yu et al., 2009), but other more specialized metrics have been created. The SNP Ratio Test compares the number of SNPs in a pathway to permuted sets, and the Prioritizing Risk Pathways method combines pathway and genetic data into a single metric (Chen et al., 2009; O'Dushlaine et al., 2009).
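As an illustration of the kind of standard statistical test mentioned above, the sketch below computes a one-sided hypergeometric enrichment P-value for a gene set. The hypergeometric test is one common choice and is assumed here; it is not necessarily the test used in the cited studies, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, set_size, n_hits, hits_in_set):
    """P(X >= hits_in_set) under the hypergeometric distribution.

    total_genes: number of genes tested genome-wide
    set_size:    number of genes in the pathway or gene set
    n_hits:      number of genes carrying an associated variant
    hits_in_set: number of those associated genes that fall in the set
    """
    upper = min(set_size, n_hits)
    numerator = sum(
        comb(set_size, k) * comb(total_genes - set_size, n_hits - k)
        for k in range(hits_in_set, upper + 1)
    )
    return numerator / comb(total_genes, n_hits)


# Hypothetical example: 20 000 genes, a 100-gene pathway, 200 associated genes,
# 8 of which fall in the pathway (1 would be expected by chance).
p = hypergeom_enrichment_p(20_000, 100, 200, 8)
print(f"{p:.2e}")
```

A small P-value suggests the pathway contains more associated genes than expected by chance, though this simple test ignores gene length and linkage disequilibrium, which permutation-based methods such as the SNP Ratio Test are designed to account for.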
Many groups hypothesize that the integrative approach of systems biology will successfully link genomic measurements with clinical applications (Atkinson and Lyster, 2010; Berg et al., 2010; Hopkins, 2007). Indeed, one group has integrated chemical similarity metrics, pharmacogenomic interactions and PPI into a predictive method for pharmacogenes (Hansen et al., 2009). Another group has used similarity of drug ligand sets to predict and validate novel ‘off-target’ interactions (Keiser et al., 2007).
These systems approaches are encouraging, but bioinformaticians need to be careful of a few pitfalls as they proceed. Methods need to be based on high-quality data to avoid the ‘garbage-in, garbage-out’ phenomenon, especially when one incorrect assumption can propagate through multiple data sources and magnify the error. For example, transferring annotations based on similarity works sometimes, but could easily associate a paralog with an incorrect function. Chemical similarity poses the same risk; two similar molecules may behave very differently biochemically. Finally, assumptions must also be examined carefully; for example, a method that relates gene expression with drug targets must bear in mind that most drugs bind proteins, not DNA or RNA.
The ultimate challenge for this research is to apply the results for improved patient care. Much of this research has yet to be translated to the clinic. In fact, many physicians are unprepared to incorporate personal genetic testing into their practice and it is unclear how to best apply research results to improve patient care (McGuire and Burke, 2008). One of the areas where bioinformatics can have the greatest clinical impact is in pharmacogenomics.
Most pharmaceutical development addresses medical problems with a ‘one drug fits all’ approach. Genetic variation has been shown to influence drug selection, dosing and adverse events (Giacomini et al., 2007), and the therapeutic benefits of taking a genetically tailored approach to drug development are now recognized (Foot et al., 2010; Roses, 2004). One study found that a hypothetical pharmacogenetically driven clinical trial of the anticoagulant warfarin could save up to 60% of the cost and reduce possible adverse events (Ohashi and Tanaka, 2010). There are already many examples of drugs which have retrospectively been found to have strong pharmacogenomic interactions, including thiopurines for cancer (Weinshilboum, 2001) and the antiplatelet agent clopidogrel (Shuldiner et al., 2009).
A trial for using rosiglitazone, an approved Type II diabetes drug, for Alzheimer's disease is an early example of prospective application of pharmacogenomics. The hypothesis was that ApoE4 non-carriers would have a better response than ApoE4 carriers. The initial Phase II pharmacogenetic-based results appeared to show that non-ApoE4 carriers showed improvement over placebo (Roses, 2009). A later study of ApoE4-stratified patients showed no significant benefits; however, the idea of prospective gene-based stratification for drug trials still holds future promise (Gold et al., 2010). Prospective gene-stratification hypotheses need to be generated for future trials and will require new bioinformatics methods (Roses, 2009). Since new drugs will not have any known gene interactions, tools for predicting drug–target or drug–gene interactions will be essential (Hansen et al., 2009; Keiser et al., 2009).
Pharmacogenomics has already been successful in improving drug prescription and dosing. Most prescriptions are written with a ‘one dose fits all’ approach with adjustments based on gender, weight, liver and kidney functions or allergies. Some drugs have more laborious dosing calculations such as the anticoagulant warfarin (Gage and Lesko, 2008; Wysowski et al., 2007). Warfarin dosing is traditionally determined by a time-intensive ‘guess and test’ method, until the coagulation tests stabilize. Pharmacogenomics identified several SNPs affecting dosing, including variants in CYP2C9 and VKORC1 (Higashi et al., 2002; Rieder et al., 2005; Rost et al., 2004). Similar studies have been applied to clopidogrel, tramadol, anti-psychotics and many other drugs (Wilffert et al., 2011). Ultimately, pharmacogenomic prescription and dosing algorithms need to be accessible to physicians, like the new warfarin dosing algorithms from the International Warfarin Pharmacogenetics Consortium (IWPC) (International Warfarin Pharmacogenetics Consortium et al., 2009). Moreover, the current state of medical practice needs to be updated to include routine pharmacogenetic testing, education and training of physicians in personalized medicine, and further clinical trials to prove the efficacy of pharmacogenetic-based prescriptions.
Bioinformatics also translates discoveries to the clinic by disseminating discoveries through curated, searchable databases like PharmGKB, dbGaP, PacDB and FDA AERS (Gamazon et al., 2010; Mailman et al., 2007; Thorn et al., 2010). A major bottleneck for these databases is manual curation of the data. Biologically and medically focused text mining algorithms can speed the collection of this structured data, such as methods that use sentence syntax and natural language processing to derive drug–gene and gene–gene interactions from scientific literature (Coulet et al., 2010; Garten et al., 2010). These databases and methods need to be developed and used carefully. All these data sources are susceptible to errors and so validation of data is essential, especially before the information is applied in the clinic.
Finally, there are challenges and opportunities for bioinformatics to integrate with the electronic medical record (EMR) (Busis, 2010). For example, the BioBank system at Vanderbilt links patient DNA with deidentified EMRs to provide a rich research database for additional translational research in disease–gene and drug–gene associations (Denny et al., 2010; Roden et al., 2008). Some health care companies and HMOs have also begun to collect genetic information from their patients. To implement such genome-based systems at all, the medical infrastructure will have to shift from paper to electronic medical records so that it is compatible with bioinformatics portals for data delivery and interpretation. Ultimately, bioinformatics needs to develop methods that interrogate the genome in the clinic and allow physicians to use personalized medicine in their daily practice.
E.C. would like to acknowledge Dr Laura Kerov-Ghiglianovich who helped to draw Figure 1.
Funding: Training grant (NIH LM007033 to G.H.F. and K.J.K.); Stanford Medical Scholars (to R.D.); Marie Curie International Outgoing Fellowship program (PIOF-GA-2009-237225 to E.C.). LM05652 and the NIH/NIGMS Pharmacogenetics Research Network and Database and the PharmGKB resource (NIH U01GM61374 to R.B.A.).
Conflict of Interest: none declared.