Over 100 years ago, William Bateson provided, through his observations of the transmission of alkaptonuria in first cousin offspring, evidence of the application of Mendelian genetics to certain human traits and diseases. His work was corroborated by Archibald Garrod (Garrod AE. The incidence of alkaptonuria: a study in chemical individuality. Lancet 1902;ii:1616–20) and William Farabee (Farabee WC. Inheritance of digital malformations in man. In: Papers of the Peabody Museum of American Archaeology and Ethnology. Cambridge, Mass: Harvard University, 1905; 65–78), who recorded the familial tendencies of inheritance of malformations of human hands and feet. These were the pioneers of the hunt for disease genes that would continue through the century and result in the discovery of hundreds of genes that can be associated with different diseases. Despite many ground-breaking discoveries during the last century, we are far from having a complete understanding of the intricate network of molecular processes involved in diseases, and we are still searching for the cures for most complex diseases. In the last few years, new genome sequencing and other high-throughput experimental techniques have generated vast amounts of molecular and clinical data that contain crucial information with the potential of leading to the next major biomedical discoveries. The need to mine, visualize and integrate these data has motivated the development of several informatics approaches that can broadly be grouped in the research area of ‘translational bioinformatics’. This review highlights the latest advances in the field of translational bioinformatics, focusing on advances in computational techniques to search for and classify disease genes.
More than 100 years ago, Archibald Garrod confirmed, with his study of the incidence of alkaptonuria in men, the Mendelian laws of inheritance of this disorder. Dr William Bateson, a keen follower of Mendel, had previously hypothesized that alkaptonuria in offspring resulting from mating of first cousins might be due to the fact that ‘first cousins will frequently be the bearer of similar gametes’, dispelling the previous notion that mating of first cousins in general might lead to the disease, and hypothesizing that the disease follows inheritance laws similar to those observed by Mendel in plants. Just after the terms genotype and phenotype were coined, in 1905, William Farabee, a recognized anthropologist, recorded the familial tendencies of inheritance for malformations of human hands and feet and also recognized the Mendelian patterns of inheritance for those anomalies.
It would take over 90 more years of genetic research to identify mutations in the BRCA1 gene with clear relationships to familial breast cancer . This breakthrough knowledge has had important implications for the diagnosis and prognosis of cancer and familial forms of other complex diseases. However, we are still far from resolving the subtleties involved in the intricate pathways and molecular relationships responsible for these disorders, in particular for complex diseases, and, most importantly, we are still unable to deliver a cure for most diseases. The shift to large-scale sequencing of individual human genomes and the availability of new techniques for probing thousands of genes provide new sources of meaningful medical insights. The informatics issues related to the accession, integration, visualization and representation of this knowledge in a systematic manner are quite challenging. On the other hand, the stakes are high; for instance, by identifying molecular patterns that characterize each individual genome and discerning which of these individual variations is related to a particular disease or response to treatment, bioinformaticians could provide the foundations for the development of tools for the diagnosis, prognosis and personalized treatment of diseases.
Translational bioinformatics is an emerging field addressing the computational challenges in biomedical research and the analysis of the vast amount of clinical data generated from it. It is difficult to define such a broad field and, due to the inherently interdisciplinary nature of the research, impossible to detach translational bioinformatics from other related fields. The American Medical Informatics Association (AMIA) has defined the field of Translational Bioinformatics as:
the development of storage, analytic, and interpretive methods to optimize the transformation of increasingly voluminous biomedical data, and genomic data in particular, into proactive, predictive, preventive, and participatory health. Translational bioinformatics includes research on the development of novel techniques for the integration of biological and clinical data and the evolution of clinical informatics methodology to encompass biological observations. The end product of translational bioinformatics is newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients .
The combination of novel experimental techniques with the emergence of translational bioinformatics has changed how the search for disease genes is performed. In the past, searching for disease genes was done mainly using positional cloning. In modern approaches, bioinformatics is an integral part of the search for disease-associated genes.
Besides the potential impact on personalized medicine, the field of translational bioinformatics provides a wide range of tools and resources that are invaluable in biomedical research. Due to the space limitations, however, this review will focus on the latest accomplishments in the hunt for disease genes.
This review provides a summary of the computational approaches related to the search for disease genes and it is divided into three parts. The first section is focused on the study of the properties and characteristics of disease genes. The second section provides a description of the methodologies and the available resources for the identification of disease genes. Finally, the last section highlights the advances of the study of specific gene disruptions associated with diseases, i.e. the analysis of human single nucleotide polymorphisms (SNPs) and structural variations.
The two main intrinsic properties of the genes that hamper the study of their functions and their associations with diseases are:
In addition, environmental factors also make disease traits difficult to detect and complicate the search for the genes responsible for such traits. The use of medication or xenobiotic substances is an example of an environmental variant. For example, it is difficult to detect whether alcohol-induced toxicity is normal to the phenotype, or whether it is a result of the individual variation in human liver alcohol dehydrogenase and other enzymes. Other environmental factors of great relevance to human disease are viruses and other infectious agents [9, 10]. Environmental factors, in conjunction with epigenetic regulation of the genes, might also be responsible for the low penetrance of certain alleles.
Disease genes are those genes involved in the causation of, or associated with [e.g. in genome-wide association studies (GWAS)], the disease. For example, for cystic fibrosis (CF), the gene CFTR was mapped to chromosome 7q31-q32 by linkage analysis in 1985 and later cloned by Francis Collins and co-workers. The deletion of three base pairs in CFTR's nucleotide sequence results in the absence of a phenylalanine residue at position 508 of the protein. In the endoplasmic reticulum, CFTR proteins with this deletion are targeted for degradation. As a result, there is an imbalance of the sodium and chloride ion concentrations that creates a thick, sticky mucus layer that leads to chronic infections. Environmental and genetic factors influence this disease, and as a result, individuals with the same mutation might have different disease outcomes. Despite advances in understanding CF, there is still much to learn about this disease. In particular, the mechanism for lung disease in CF patients, which is lethal, is still unknown.
The number of disease genes discovered has been steadily increasing throughout the years. Figure 1 depicts the growth of disease gene data from 1981 to 2009 (D.Magglot and J.Amberger, personal communication). The analyses of the characteristics that differentiate disease from non-disease genes have been used to develop disease classifiers, a research area of major importance due to the medical relevance of disease genes.
Proteins derived from disease genes have been found to have properties that distinguish them from those of other genes: they are longer, more conserved, phylogenetically extended and without close paralogs. In addition, when compared against housekeeping genes, they present different patterns of conservation, function and DNA coding lengths. Since inherited disease genes are more likely to be non-essential, one could hypothesize that they arrived later in the evolution of the human species. Surprisingly, the studies of Domazet-Loso and Tautz showed that non-essential disease genes are of ancient origin. In agreement with previous findings about disease gene length, their analysis also showed that ancient genes tend to be longer. The authors, confirming what others had reported, found no significant differences in the rates of evolution of disease versus non-disease genes. The question of evolutionary divergence of disease genes, however, remains open, with some findings indicating that there is a higher rate of non-synonymous substitutions than synonymous ones and others the opposite [14, 17].
If disease genes were to evolve at higher speed, could this be just an effect of the weak dominance of these disease genes? To answer this question, Osada et al.  compared evolutionary rates and the degree of polymorphism of the dominant and recessive disease genes. They found a higher rate of non-synonymous polymorphisms in recessive genes. In their analysis, the differences in selection intensity are still significant even after taking into account the dominance, suggesting that there are significant differences in the deleterious effect of the dominant and recessive genes.
The interaction network of disease genes has been the subject of many studies [19, 20]. The relevance of protein interactions in diseases and the development of computational tools applied to disease gene identification have been previously discussed by Kann. Protein interaction networks have been studied for Alzheimer's disease, ataxias and disorders of Purkinje cell degeneration, and for cancer genes. Genes involved in the same disease have been found to form subnetworks. Finding functional modules associated with each disease could reveal important aspects of the disease mechanisms and aid in disease classification. Feldman et al. compared disease genes against essential genes (from mouse orthologs of human genes) in terms of their connectivity and found the two sets of genes to be clearly distinct. In their analysis, the authors used a network of interactions derived from the analysis of hundreds of articles obtained with the GeneWays natural language system. The resulting network includes almost 13 000 physical interactions and 4458 genes. This work exemplifies the impact of text mining in the field of translational bioinformatics for the gathering of data from millions of existing manuscripts.
In addition to the extensive study of the structured regions of the protein and their effect on disease, the study of intrinsically disordered regions of the proteins from disease genes has recently led to ‘unfoldomics’, or the mapping of disordered proteins to human diseases [28, 29]. A number of intrinsically disordered proteins have been shown to be associated with cancer, cardiovascular disease, diabetes, neurodegenerative diseases and other human diseases [33, 34].
To summarize, not all the properties that characterize disease genes have been probed or can be easily explained. Previous studies require functional, evolutionary and statistical hypotheses to explain the observations about disease genes. Disease genes might need to interact with each other and might also need to be co-expressed as they participate in the same functional pathways. Fewer paralogs within the human genome might explain the inability of the system to compensate for disruptions created when these genes are modified. Longer genes can be explained statistically as they will have more possible sites for mutations. Figure 2 depicts the distribution of lengths of disease and non-disease genes (from OMIM and RefSeq, respectively). Based on the two-sample Kolmogorov–Smirnov test statistic, the distributions of lengths for the disease and non-disease genes are significantly different, with a P-value of 3.0e-21. In addition, we estimated the number of protein domains (from CDD) of disease genes (from OMIM) to be higher, on average, than that of non-disease genes (Kolmogorov–Smirnov test statistic with a P-value of 1e-6).
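The two-sample Kolmogorov–Smirnov comparison mentioned above can be sketched in a few lines. This is a minimal illustration, not the analysis behind Figure 2: the function name `ks_two_sample` and the toy length values are assumptions made for the example, and the P-value uses the asymptotic Kolmogorov series with a commonly used correction to the effective sample size, so it is only approximate for small samples.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest gap between the two empirical CDFs.
    p is an asymptotic approximation (assumption: large-sample formula).
    """
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # The supremum of |F_a - F_b| is reached at one of the observed values.
    d = 0.0
    for x in set(a) | set(b):
        gap = abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        d = max(d, gap)

    # Effective sample size with a small-sample correction term.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series below converges slowly here; the true value is ~1.
        return d, 1.0

    # Kolmogorov series: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical protein lengths, for illustration only (not OMIM/RefSeq data).
disease_lengths = [820, 1050, 960, 1430, 1210, 990, 1720, 1340]
other_lengths = [310, 450, 520, 280, 610, 390, 700, 480]
statistic, p_value = ks_two_sample(disease_lengths, other_lengths)
print(statistic, p_value)
```

For real data sets, an established statistics library would normally be preferred over a hand-written routine; the sketch only makes explicit what the test statistic measures.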
Large-scale experiments generate lists of several hundreds of disease gene candidates, and it is still a challenge to identify the disease genes among them. Certain gene properties, as described above, differentiate disease genes and have been used as the bases for computational tools to prioritize disease gene candidates derived from these experiments. Table 1 provides a sample of the most recent publicly available sites that offer tools to rank disease gene candidates. All of these approaches are based on the integration of different sources. A summary of the data sources used by these methods and a brief description of the results are provided subsequently.
Protein interaction (PPINT): It has been observed that disease genes are highly connected with other genes from the same disease. Differences in the network properties, such as higher connectivity, have been used to generate several gene-prioritization tools [12, 37–45]. PPINT is a feature that has been integrated with other gene properties into most of the tools highlighted in Table 1.
Gene function (gene ontology): Disease genes are expected to share common functional properties, as annotated in the gene ontology (GO). This hypothesis was tested and validated in a set of disease genes from OMIM for each of the three branches of GO, namely biological process, cellular component and molecular function. Goh et al. showed that the GO homogeneity in each branch of GO is significantly higher for each disorder compared with random. Therefore, most methods for gene prioritization will increase the score of candidate genes that share GO annotations with other genes from the same disease.
Pathway (PATH): As with functional annotation, disease genes are most likely to share common pathways, as annotated in KEGG, Reactome, BioCarta, BioCyc, GenMAPP, MSigDB and others.
Gene expression (GEXP): Disease genes are expected to be co-expressed; thus gene expression data can be used in combination with GO, PPINT and other features described in this section to increase the performance of gene-prioritization methods. In addition, the availability of gene-expression data with clinical phenotypes has generated many approaches for the integration of these data, all with enormous potential for the diagnosis and prognosis of cancer and other complex diseases [55–63]. A full description of the advances in microarray analysis is beyond the scope of this manuscript.
Protein domain (PDOM): Candidate disease genes might have functions that are more similar to those of known disease genes. The function affected might be due to the protein domains, which represent the functional units of the proteins. The presence of a certain domain, when genes with that particular domain are enriched in the disease, has been used as an indication of the association of that gene with the disease.
Gene regulation (REG and TFBS): Genes within the same gene-regulation network are expected to affect similar diseases. Thus, similarities in transcription factor binding sites (TFBSs) have also been incorporated into several of the approaches highlighted in Table 1.
Sequence properties (SEQ: LEN, GSTRU): Sequence properties such as gene and protein length or structure could distinguish disease genes from non-disease genes (see previous section). Sequence similarity has also been incorporated into the ENDEAVOUR gene-prioritization tool.
Expression and phenotypic data from orthologs (ORTH, MOUSE): Functional information about genes in other species is the only source of functional information when human data are unavailable or impossible to produce. Thus, studies on model organisms are key to biomedical progress. ToppGene incorporates mouse data, and van Driel et al. include several other species in the GeneSeeker method.
Other ontologies used: The eVOC anatomical ontology and the mammalian phenotype ontology (MP) are used in CAESAR. The disease ontology (DO) information (http://diseaseontology.sourceforge.net) provides a hierarchical organization of disease types based on the Unified Medical Language System (UMLS). The DO was incorporated into PhenoPred to cluster similar diseases into higher levels of aggregation, improving the confidence in PhenoPred predictions.
Text mining (TXT): There are over 19 million biomedical records in PubMed today, and this repository constitutes one of the best sources of information about disease genes. In addition to CAESAR (see below), several other approaches have been used to integrate text-mining tools with disease ontologies to derive gene–disease associations [43, 72].
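As an illustration of how one of the data sources listed above can be turned into a ranking score, the following sketch scores each candidate by the fraction of its interaction partners that are already known disease genes (a simple "guilt by association" heuristic for the PPINT feature). It is not the scoring scheme of any tool in Table 1; the function names and gene identifiers are hypothetical.

```python
def neighbor_score(candidate, interactions, known_disease_genes):
    """Fraction of a candidate's interaction partners that are known disease genes."""
    partners = set()
    for gene_a, gene_b in interactions:
        if gene_a == candidate:
            partners.add(gene_b)
        elif gene_b == candidate:
            partners.add(gene_a)
    partners.discard(candidate)  # ignore self-interactions
    if not partners:
        return 0.0
    return len(partners & set(known_disease_genes)) / len(partners)


def rank_candidates(candidates, interactions, known_disease_genes):
    """Order candidates from highest to lowest score (ties broken by name)."""
    scored = [(neighbor_score(g, interactions, known_disease_genes), g)
              for g in candidates]
    return [gene for score, gene in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Hypothetical toy network: D1 and D2 are known disease genes.
toy_interactions = [("G1", "D1"), ("G1", "D2"), ("G2", "D1"),
                    ("G2", "X"), ("G3", "X")]
print(rank_candidates(["G3", "G2", "G1"], toy_interactions, {"D1", "D2"}))
```

In practice such a score would be combined with the other features (GO, pathway, expression and so on), and its reliability depends directly on the completeness of the interaction data.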
Methods for gene prioritization rely on the information provided by one or more of the experimental techniques described above. Therefore, the amount and quality of the available experimental data generated by these techniques is a major limitation of the gene-prioritization techniques. For instance, protein–protein interaction-based methods suffer from the incompleteness and low quality of the data currently available for interaction networks in mammals. Another source of uncertainty is the disease mapping information used to train and evaluate the computational methods, for it is of variable resolution and expected to contain large numbers of false positives. Furthermore, gene-prioritization methods have been hampered by the complexity and difficulty of creating functional and disease ontologies. Methods that rely on text mining also face the difficulties inherited from natural language processing, such as issues related to extracting gene names from the biomedical literature.
Prioritizer, developed by Franke et al., is available for download from their site (Table 1). Franke et al. studied the effect of using three different gene networks (GO, PPINT and GEXP) to correctly rank the disease genes for a set of 96 disorders with 409 known disease genes. Combining PPINT and GEXP achieved a better ranking than would be obtained randomly. The method showed considerable improvement (represented by an increase in the area under the ROC curve) when GO was added. The best ranking of disease genes was reached when the three types of data were combined. The authors used the combination of all sources to prioritize genes in artificial susceptibility loci and found a 2.8-fold increase in the chance of detecting disease genes with respect to random selection.
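The area under the ROC curve used above as a performance measure has a simple interpretation: it is the probability that a randomly chosen true disease gene receives a higher score than a randomly chosen non-disease gene. The sketch below computes it directly from that definition; the function name and the example scores are illustrative assumptions, not values from the Prioritizer study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores: prioritization scores (higher means more likely a disease gene).
    labels: truthy for known disease genes, falsy otherwise.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four candidate genes; 1 marks a known disease gene.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every disease gene above every non-disease gene.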
Another tool, PROSPECTR, uses an alternating decision tree that has been trained to differentiate between genes ‘likely to be involved in disease’ and genes ‘unlikely to be involved in disease’. The method uses gene properties that are characteristic of disease genes (see previous section) to provide each gene with a score, which is a measure of confidence in the classification. In a test set of 675 genes from the Human Gene Mutation Database and 675 genes picked at random from Ensembl that had no association with disease, PROSPECTR performed with a sensitivity of 0.71 and a specificity of 0.58. SUSPECTS adds an extra layer to PROSPECTR, using GO annotation, protein domain and gene expression data to rank and score each gene. The program scores each gene of the test set based on its relationship to the networks in the training set. Similarly, ENDEAVOUR uses a larger collection of more than 12 data sets (listed in Table 1) and order statistics to score and rank genes from the test set. The test set of disease genes used to benchmark SUSPECTS was, on average, within the top 13% of the candidates, while results using ENDEAVOUR indicate that disease genes from a test set of 200 genes ranked 13 on average, representing a 7- and 9-fold enrichment over random classifiers for each method, respectively.
GenTrepid uses two methods: common module profiling (CMP), based on similarity of protein-domain composition, and common pathway scanning (CPS), based on common protein interactions and metabolic pathways among disease genes. George et al. found the two methods to be complementary, with the combination of the two approaches yielding the best performance. However, a meta-analysis combining both methods into a consensus would have decreased the performance compared to using both methods independently. When used side by side, the two methods were reported to have a sensitivity of 0.52 and a specificity of 0.97 in a benchmark of 170 genes (29 diseases), representing a 13-fold enrichment in disease genes. In other words, a list of 100 gene candidates could be reduced to 8, with a significant reduction in the cost and time of the subsequent experimental analysis of these candidate genes.
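The sensitivity and specificity figures quoted for these tools follow from a simple count of correct and incorrect calls. The sketch below shows the calculation on a made-up set of predictions; the function name and data are assumptions for illustration and do not reproduce any of the cited benchmarks.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for binary disease-gene calls.

    predicted: truthy where the method calls a gene a disease gene.
    actual:    truthy where the gene is a true disease gene.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both disease and non-disease genes")
    sensitivity = tp / (tp + fn)  # fraction of true disease genes recovered
    specificity = tn / (tn + fp)  # fraction of non-disease genes rejected
    return sensitivity, specificity


# Hypothetical calls for six genes, two of which are true disease genes.
calls = [1, 0, 1, 0, 0, 0]
truth = [1, 1, 0, 0, 0, 0]
print(sensitivity_specificity(calls, truth))
```

High specificity with moderate sensitivity, as reported for GenTrepid, means that few non-disease genes survive the filter, at the price of missing some true disease genes.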
CAESAR, developed by Vision and co-workers, is a tool primarily based on text mining of disease information, mainly from review articles, and the integration of other gene data (Table 1). The authors have addressed the challenge of analyzing complex traits. In their study of 18 complex human trait susceptibility genes, CAESAR placed 7 of the genes within the top 2% of the ranked genes. From almost 15 000 genes, 16 of the 18 genes were ranked with a median rank of 549.5. This represents a 67-fold average enrichment.
PhenoPred, developed by Radivojac et al., studies the network of interactions and functional relationships of the target protein and detects its local neighborhoods in the network. The authors devised a supervised approach to find local signatures of the disease and find new candidate genes that are not necessarily in close proximity to the known disease genes. PhenoPred can be queried using a gene or a disease name. If the input is a gene, the program will return a list of diseases that the gene could be associated with (based on the network properties of the genes known for that disease). If starting with a disease name, the program retrieves all the genes predicted to be related to the disease. In both cases, PhenoPred provides a similarity score that represents the chance of the gene–disease association being true. The authors showed that this approach works best when combined with the molecular function of the query gene and physicochemical properties of its protein product.
ToppGene Suite integrates a vast amount of genomic data from humans and mice (Table 1). This state-of-the-art resource includes the ToppFun and ToppGene methods, which can be used for the analysis of gene functional enrichment and for the prioritization of disease gene candidates, respectively. It uses a fuzzy-based similarity measure between the genes in the training and test sets based on their semantic annotation. It also derives the probability (P-value) that each annotation is related to the gene in question, using random sampling of the whole genome. The authors analyzed 20 gene–disease associations from five disorders (from recently reported GWAS) and found that ToppGene ranked 19 of 20 candidate genes within the top 20%. The mean rank for ToppGene was 6.8 (excluding diseases that lacked interaction data).
Lastly, CGPRIO, a tool recently developed by Furney et al., uses gene properties such as length and structure to identify the features that characterize cancer genes. Based on these distinguishing features, a naïve Bayes model is used to classify genes as proto-oncogenes or tumor suppressor genes (Table 1).
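To make the idea of a naïve Bayes classifier over gene properties concrete, the following is a minimal sketch of a Gaussian naïve Bayes model. It is not CGPRIO's implementation: the Gaussian form of the likelihood, the function names and the two numeric features in the toy data are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Estimate a class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest posterior log-probability."""
    best_label, best_logp = None, -math.inf
    for label, (prior, stats) in model.items():
        logp = math.log(prior)
        # "Naive" assumption: features are independent given the class.
        for x, (mean, var) in zip(row, stats):
            logp += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical features per gene (e.g. two scaled sequence properties).
features = [[1.0, 10.0], [1.2, 11.0], [5.0, 2.0], [5.5, 3.0]]
classes = ["proto-oncogene", "proto-oncogene",
           "tumor suppressor", "tumor suppressor"]
nb_model = train_gaussian_nb(features, classes)
print(predict(nb_model, [1.1, 10.5]))
```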
From the user's perspective, the most desirable features for these methods are:
In summary, the methods for disease gene prioritization have led to an improvement in the detection of disease genes and to an increase in our knowledge about the integration of the several data sources for gene function and disease association. However, these methods can only be as accurate as the data they are based upon, which is an important issue, given the low quality of some of the experimental data on which they rely (e.g. protein–protein interaction data is incomplete and unreliable). Producing good ontologies for complex processes and improving the methods for mining and integrating the multisource data are difficult tasks that, unless addressed, will continue to severely limit the progress of gene-prioritization techniques.
Recent advances in sequencing techniques are generating data about individual human genomes at a relatively low cost. The identification of disease-related SNPs derived from large-scale techniques has the potential to create personalized tools for the diagnosis, prognosis and treatment of diseases. Mutations in the genomic code often produce changes in the protein sequence, leading to diseases. The key to approaches that identify disease mutations lies in distinguishing SNPs that are functionally relevant from those that are not. For the non-synonymous SNPs within coding regions (coding nsSNPs), methods rely on the study of the functional disruptions produced in the protein. An in-depth discussion of the online resources available for the analysis of SNPs can be found in Karchin's recent review article. Here, I discuss the recent advances in computational methodologies developed for the analysis of coding nsSNPs and, briefly, for the analysis of structural variants.
Large-scale GWAS and human-sequencing projects are producing hundreds of SNPs with putative relevance to cancer  and other diseases (see review ). Some of these sequence disruptions in the protein produce changes in the stability, regulation, ability to interact or to be modified, and are ultimately associated with the disease. Computational approaches developed to prioritize SNPs can reduce the number of experimental trials by focusing on sites that are functionally relevant. Ideally, one would also like to deduce from the analyses of SNPs the mechanistic changes produced by the mutation and the cause of the disease. Methods used to predict whether a mutation is deleterious combine structural, conservation and/or other sequence properties that identify the mutational site as a potential site. The properties used in these approaches are highlighted below.
Disease mutations have been found to affect the stability of the proteins or to cause protein aggregation. It has been shown by several authors that the impact of the coding nsSNPs can be investigated by studying the 3D structure of the protein [88–94]. PolyPhen, a method developed by Sunyaev and colleagues, relies on functional annotation and structure predictors for evaluating the deleteriousness of the SNPs.
Early studies showed that disease mutations are located in conserved sites [94, 96]. Conservation across species is often an indication of functional relevance. One of the earlier approaches, SIFT, combined conservation with physicochemical properties of the amino acids to produce a list of mutations that are not tolerated at a particular protein site.
Location of the mutation within a particular protein domain is also critical to predicting deleterious effects. Clifford et al. incorporated a score based on the protein domain's position specific scoring matrix (PSSM). The score, or logR.E-value, is calculated as log10(E-value_variant/E-value_canonical), where the E-values are generated from the domain's alignment of the variant and canonical proteins using HMMER. The logR.E-value is a measure of how a particular mutation affects the total score of the alignment to the domain's PSSM. The authors found that this measure is a good predictor of whether or not the SNP is deleterious. Recently, Kann and co-workers have mapped all human SNPs and disease mutations (from OMIM and Swiss-Prot) to their corresponding protein domain sites. We have created a freely available resource for the domain mapping of disease mutations, the DMDM site. A screenshot of the DMDM protein domain webpage for the DNA-binding homeodomain is depicted in Figure 3. DMDM aggregates all the information about human mutations and provides coordinates of all mutations within the human domains. DMDM is available at http://bioinf.umbc.edu/DMDM and can be used to identify domain sites with high incidence of disease mutations.
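The logR.E-value defined above is a one-line calculation once the two E-values are available. The sketch below assumes the E-values have already been obtained from the domain alignments; the function name and the example numbers are illustrative and are not taken from Clifford et al.

```python
import math


def log_re_value(e_value_variant, e_value_canonical):
    """log10 ratio of the variant's domain E-value to the canonical protein's.

    A positive result means the variant aligns less significantly to the
    domain model than the canonical sequence (its E-value is larger).
    """
    if e_value_variant <= 0 or e_value_canonical <= 0:
        # Assumption: E-values reported as 0 (underflow) must be handled upstream.
        raise ValueError("E-values must be positive")
    return math.log10(e_value_variant / e_value_canonical)


# Hypothetical E-values for a canonical protein and a variant carrying one nsSNP.
print(log_re_value(1e-20, 1e-30))
```

Because the score is a ratio of E-values obtained against the same domain model, it reflects the change caused by the substitution rather than the absolute strength of the domain match.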
Mutations that affect post-translational modifications might produce a gain or loss of function causative of disease. In a recent study of cancer mutations, Radivojac et al.  found that mutations predicted to have an effect on phosphorylation function are enriched in somatic cancer data. These results suggest that both gain and loss of phosphorylation might be important features for identifying cancer mutations, especially drivers. This approach was generalized to incorporate other post-translational modifications (methylation, glycosylation, ubiquitination) together with functional site predictors (e.g. catalytic residues, DNA-binding residues) towards probabilistically identifying molecular mechanisms of disease .
Modern approaches integrate multiple molecular features and have been applied to genes from several diseases [102–107]. However, there are still discrepancies among the predictions from the different approaches. In addition, with the exception of the phosphorylation function, these approaches are unable to provide hypotheses for the actual cause of the disease.
In addition to the study of coding SNPs, other computational approaches (only briefly mentioned here) focus on SNPs within the non-coding regions of the genome. Non-coding SNPs could be located within TFBSs, microRNA-binding motifs, regulatory-potential sequences or splice sites, and account for most of the human variation found in GWAS. Because SNPs located within the TFBS of a gene may affect the level or timing of the gene expression, computational methods that identify the TFBS-SNPs are valuable resources for selecting candidate regulatory polymorphisms of biomedical significance. An example of such an approach is RAVEN, which combines phylogenetic footprinting and TFBS prediction to identify variations in candidate cis-regulatory elements. Other methods like UTRScan and FASTSNP also focus their analysis on SNPs within the non-coding regions. UTRScan can be used for the analysis of the 5′- and 3′-UTRs of eukaryotic mRNAs. UTRScan relies on user-submitted experimental data about the biological activity of functional patterns of UTR sequences to predict whether a particular UTR SNP has functional relevance. FASTSNP has been used to identify intronic SNPs that may lead to defects in RNA and mRNA processing. FASTSNP uses a decision tree to predict whether the SNP has an effect on the TFBS of the gene. SNPs located within two base pairs of an intron–exon junction, or at exonic splicing enhancer (ESE) or exonic splicing silencer (ESS)-binding sites, may disrupt mRNA splicing and severely affect protein function. Methods such as ESEfinder or RESCUE ESE can predict ESE motifs [112, 113]. ESS sites can also be predicted using the FAS-ESS method. An excellent review of the different approaches used to predict and identify functional polymorphisms within microRNA-binding sites was provided by Chen et al.
The study of variations of the human genome is not limited to the analysis of SNPs. Other structural variants can also be linked to diseases (see articles [116, 117] and reviews [118, 119]). These structural variants include duplications, inversions and deletions that can currently be identified by array comparative genomic hybridization (aCGH) and paired-end mapping [120–127]. Addressing the need for a common framework of reference for structural variant comparison, Raphael and co-workers proposed a computational approach for the localization of the breakpoints of these modifications. They introduced an algorithm for the identification of data from aCGH and paired-end mapping and provided a framework for comparing structural variants across the different techniques. Advances in the analysis of structural variants will have great implications for the analysis of the human-genome and cancer-genome sequencing projects in the near future.
The study of disease genes has evolved from the basic assumption that genes follow Mendelian laws to modern computational techniques that are capable of providing insight into hundreds of genes and of discriminating particular mutations associated with diseases. The major breakthroughs in the field have led to general knowledge of the functional, networking and evolutionary properties of disease genes as well as to the identification of genes for specific diseases. Our understanding of the molecular interactions within systems and the phenotypes they are capable of causing could still change dramatically, e.g. by elucidating the role of microRNAs in normal regulation and in disease. Bioinformaticians are addressing the challenges created by the availability of molecular and clinical data produced by new techniques. Integration of data on regulation, interaction and other functional activity of the genes has become essential in medical research.
Ideally, one would like to create a framework for the integration of experimental, biological and clinical data with existing molecular data and to provide experimental validation of the computational findings. An example of such an approach is the work by Leach et al., who introduced a knowledge-based system that combines reading, reasoning and reporting methods to facilitate analysis of experimental data, and then applied it to the analysis of large-scale gene expression array data sets relevant to craniofacial development. Their tool, Hanalyzer, provided functional hypotheses regarding the role of four genes (Apobec2, E430002G05Rik, Hoxa2, Zim1) in the development of the murine tongue. Experimental validation of these results indicated that all four were expressed in the tongue. Further analysis will be required to determine if these genes have specific roles in tongue development and function, and if they act as specific markers for individual components of the intrinsic and extrinsic tongue musculature.
Recently, the relations between diseases, phenotypes and mechanisms have been exploited in an attempt to identify potential new applications for already approved drugs, which could accelerate drug development and reduce overall costs. Data from several pharmaceutical knowledge sources (DrugBank, Anatomical Therapeutic Chemical Classification) and molecular networks (BIND, BioGRID, KEGG, HPRD) were aggregated using the Resource Description Framework (RDF) standard. The linked ensemble was analyzed for all associations between genes, phenotypes, diseases, clinical symptoms, drug mechanisms and indications; relations showing substantially greater disease-to-drug associations via phenotypes and mechanisms were ranked and investigated in more detail. One strong association found was between systemic lupus erythematosus and the breast cancer drug tamoxifen.
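The idea of ranking disease-to-drug associations that run through shared phenotypes and mechanisms can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the study described above. The triples are plain tuples rather than a real RDF store, the predicate names are invented, every identifier is a made-up placeholder, and the score is a simple count of shared intermediate nodes.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (subject, predicate, object) triples; all identifiers are
# placeholders and do not refer to real diseases, drugs or phenotypes.
triples = [
    ("disease:D1", "hasPhenotype", "phenotype:P1"),
    ("disease:D1", "hasPhenotype", "phenotype:P2"),
    ("disease:D1", "involvesMechanism", "mechanism:M1"),
    ("disease:D2", "hasPhenotype", "phenotype:P2"),
    ("drug:X1", "affectsPhenotype", "phenotype:P1"),
    ("drug:X1", "affectsPhenotype", "phenotype:P2"),
    ("drug:X1", "targetsMechanism", "mechanism:M1"),
    ("drug:X2", "affectsPhenotype", "phenotype:P2"),
]


def rank_disease_drug_pairs(triples):
    """Rank (disease, drug) pairs by the number of shared intermediate nodes."""
    diseases_by_node = defaultdict(set)
    drugs_by_node = defaultdict(set)
    for subject, _predicate, obj in triples:
        if subject.startswith("disease:"):
            diseases_by_node[obj].add(subject)
        elif subject.startswith("drug:"):
            drugs_by_node[obj].add(subject)

    # Collect, for each pair, the phenotypes or mechanisms linking both sides.
    shared = defaultdict(set)
    for node in diseases_by_node.keys() & drugs_by_node.keys():
        for disease, drug in product(diseases_by_node[node], drugs_by_node[node]):
            shared[(disease, drug)].add(node)

    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(
        ((pair, len(nodes)) for pair, nodes in shared.items()),
        key=lambda item: (-item[1], item[0]),
    )


for (disease, drug), score in rank_disease_drug_pairs(triples):
    print(disease, drug, score)
```

A raw count favors well-annotated diseases and drugs, so a practical ranking would need some normalization against the background number of links per entity.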
Studies in yeast and other model organisms have led to the development of techniques for the integration of functional data in humans. Troyanskaya and colleagues have recently introduced a Bayesian integration system to provide functional maps for human data. The functional maps are available at http://function.princeton.edu/hefalmp and allow for interactive visualization of large-scale experimental data. Another example of integration of experimental data is the work of Califano and co-workers on the identification of post-translational modulators of transcription factor activity and the integration of networks from multiple sources. For this purpose, the authors created the Modulator Inference by Network Dynamics (MINDy) algorithm and the interactome dysregulation enrichment analysis (IDEA) algorithm. MINDy was recently applied to analyze the interface between signaling pathways and transcriptional networks in human B cells. The IDEA algorithm is focused on the search for interactions (instead of genes) that might affect the disruption causing the diseases, and integrates data from different sources, including protein interactions. It has been successfully used to predict oncogenes and molecular perturbation targets in B-cell lymphomas.
The completion of the human genome has changed the way the search for disease genes is performed. In the past, the approach was to focus on one or a few genes at a time. Now, projects like The Cancer Genome Atlas exemplify the efforts to systematically analyze all the gene alterations involved in different cancer types. The next step is to produce a complete picture of the mechanistic aspects of the diseases and to design drugs against them. For that, a combination of two approaches will be needed: a systematic search and an in-depth study of each gene.
The future of the field will be defined by new techniques to integrate large bodies of data from different sources and to incorporate functional information into the analysis of large-scale data. The response of bioinformatics to new experimental techniques brings a new perspective into the analysis of the experimental data, as demonstrated by the advances in the analysis of data from microarray and other technologies. It is expected that this trend will continue with novel approaches to respond to new techniques, such as next-generation sequencing technologies. For instance, the availability of large numbers of individual human genomes will promote the development of computational analyses of rare variants, including the statistical mining of their relations to lifestyles, drug interactions and other factors.
Biomedical research will also be driven by our ability to efficiently mine the large body of existing and continuously generated biomedical data. Text-mining techniques, in particular, when combined with other molecular data, can provide information about gene mutations and interactions and will become crucial to stay ahead of the exponential growth of data generated in biomedical research. Another field that is benefiting from the advances in mining and integration of molecular, clinical and drug analysis is pharmacogenomics [135–137]. In silico studies of the relationships between human variations and their effect on diseases will be key to the development of personalized medicine.
In summary, translational bioinformatics has already transformed the search for disease genes and has the potential to become a crucial component of other areas of medical research.
National Institutes of Health, National Cancer Institute [grant number: 1K22CA143148 - 01].
Thanks to all the scientists (included or not in this review) who contributed with their excellent work to the field of research reviewed here. Many thanks to Eric Neumann, Mileidy Gonzalez, Simone Gupta and Predrag Radivojac for their comments on the manuscript; to Richard Blissett for his editorial work; and to Attila Kertesz-Farkas and Tom Peterson for their help with the figures. Thanks to Donna Maglott (NCBI) and Joanna Amberger (OMIM) for kindly providing the data for Figure 1 and to anonymous reviewers for their helpful comments.
Maricel G. Kann is an assistant professor at the University of Maryland, Baltimore County. Her research interests include methods for alignment of protein sequences, predictors of protein–protein interactions and the study of protein domains and their associations with disease. She has co-chaired several sessions at international bioinformatics conferences related to the field of translational bioinformatics.