Proteomics and the study of protein–protein interactions are becoming increasingly important in our effort to understand human diseases on a system-wide level. Thanks to the development and curation of protein-interaction databases, up-to-date information on these interaction networks is accessible and publicly available to the scientific community. As our knowledge of protein–protein interactions increases, it is important to give thought to the different ways that these resources can impact biomedical research. In this article, we highlight the importance of protein–protein interactions in human genetics and genetic epidemiology. Since protein–protein interactions demonstrate one of the strongest functional relationships between genes, combining genomic data with available proteomic data may provide us with a more in-depth understanding of common human diseases. In this review, we will discuss some of the fundamentals of protein interactions, the databases that are publicly available and how information from these databases can be used to facilitate genome-wide genetic studies.
Sequencing complete genomes was a dream that has now become a reality. However, we are still striving to gain a complete understanding of the hundreds of thousands of proteins that are encoded by the fewer than 30,000 genes that comprise the human genome. The word ‘proteome’ was first introduced in 1995 and described as the ‘total protein complement of a genome’ , and it was then proposed that mapping the complete human proteome would help us to dissect the biochemical and physiological systems that influence complex human diseases at the molecular level. The concept of the proteome and the field of proteomics are rapidly developing, as new technology and high-throughput techniques are making the mapping of the entire human proteome seem less like a dream – as the complete sequencing of genomes once was – and more of a possible reality.
Similarly, the advent of proteomics did not put a halt to technological progress in the field of genomics. Measuring 1 million or more single nucleotide polymorphisms (SNPs) across the entire human genome is now technically and economically possible. As a result, the domain of human genetics is experiencing an information explosion; yet, at the same time, there is a lack of methods to analyze this information. That is, our ability to generate genetic data is far outpacing our ability to make sense of it. It is true that there are several important challenges that must be overcome before we can take full advantage of genome-wide data in order to decompose the genetic architecture of a complex trait into its interacting components .
The idea that the proteome is a complement of the genome is important and suggests that there is a strong relationship between the two. One criticism against genomics has been that sequence information provides us with a basic ‘snapshot’ of how a cell might utilize its genes, when, in reality, the cell is a dynamic entity that reacts and responds to its environment . To follow with the idea of complement, we feel that the field of genomics could benefit greatly by being complemented by proteomic data, specifically by providing a means to deal with the widening gap in understanding that the field is experiencing. To our advantage, information on protein–protein interactions (PPIs) is right at our fingertips in a number of publicly available databases.
The challenges we face conducting genetic analyses are derived from the complexity of the genotype- and phenotype-mapping relationship that results from phenomena such as epistasis (gene–gene interactions) and plastic reaction norms (gene–environment interactions) . First, there is the need for powerful statistical methods to model the relationship between combinations of SNPs and disease susceptibility because traditional methods, such as logistic regression, are not effective when genotype combinations increase exponentially with the number of SNPs being analyzed, and there are not enough subjects to represent each combination. Second, there is the challenge of selecting which SNPs should be included in the analysis, because analyzing all pair-wise and multiple-way SNP combinations in a genome-wide genetic study with thousands of SNPs is computationally infeasible. A third challenge is the interpretation of statistical gene–gene interaction models in the context of human biology, so that they may be used for the benefit of disease treatment and prevention.
The goal of this article is to explore how PPIs may play an important role in genetic research by providing a source of expert knowledge that can be used to guide genome-wide association studies (GWAS), in an effort to ease the computational burden of detecting and characterizing interactions while, at the same time, facilitating biological interpretation of the data. It is already known that the location of SNPs in the genome can have an effect on protein structure by changing amino acid sequences. Whether they are in a coding region of the genome or not, these SNPs may affect PPI, protein expression, alternative splicing, stability, folding, ligand binding or catalysis, thus inducing or influencing the disease state in individuals [4,5]. Several studies have led to the development of tools aimed at understanding the functional effect of SNPs by investigating their impact on protein structure or functional sites and even functional sites at the DNA level [6–8]. We will focus on the role of information extracted from PPI databases for improving the computational efficiency of genome-wide studies of epistasis.
To begin to understand how protein interactions may play an important role in human genetics research, we need first to understand the concept of epistasis and the challenges encountered when trying to detect epistasis in large-scale studies. Recognized for many years, epistasis has been described from two different perspectives: biological and statistical [9,10]. Biological epistasis, as defined by Bateson who coined the term, results from physical interactions among biomolecules in gene-regulatory networks and biochemical pathways at the cellular level in an individual . Statistical epistasis, as defined by Fisher, is deviation from additivity in a linear mathematical model that describes the relationship between multilocus genotypes and phenotype variation at the population level . While determining the relationship between the two remains a challenge, it is an important endeavor if we wish to infer biological conclusions from statistical results.
Although epistasis has acquired different meanings over the years, it is a concept that exhibits a clear contrast to Mendelian, or single-gene, traits that have been characterized, such as multiple endocrine neoplasia type 1 (MEN1) or Marfan syndrome. MEN1 is a rare disorder that displays an autosomal-dominant pattern of inheritance; that is, each patient has a 50% probability of passing the gene defect to their progeny. MEN1 patients present with symptoms such as parathyroid adenoma, gastrinoma and prolactinoma, or other endocrine and nonendocrine tumors, which develop after the inactivation of both MEN1 gene copies at chromosome 11q12–13 at the tissue level. Marfan syndrome is also an autosomal-dominant disorder, one of connective tissue, that affects the skeletal, ocular and cardiovascular systems. Marfan syndrome is primarily caused by a mutation in the fibrillin-1 gene (FBN1), which produces abnormal fibrillin proteins that normally act as extracellular matrix fibrillar components with functions in elastic and nonelastic tissues.
Epistasis, along with other phenomena, such as locus heterogeneity, phenocopy and gene–environment interactions, is a major source of complexity in the mapping relationship between genotype and phenotype in common human diseases. However, earlier studies of biological epistasis began in model organisms, such as Caenorhabditis elegans, that are still being used to order genetic pathways by means of crossing two genetic knockouts for a particular function and observing the resulting phenotypes [15,16]. High-throughput studies that involved pairwise analysis of genes to determine combinations that conferred a phenotype began to take place approximately a decade ago; one of the first used synthetic genetic array analysis in yeast to examine the interactions of a series of different deletion mutants against more than 4000 deletion backgrounds. Additional approaches have used RNAi libraries to systematically knock down genes in C. elegans, given that deletion libraries for yeast are unlikely to be widely applicable to multicellular organisms. While detecting epistasis on the molecular level with high-throughput approaches has led to a greater understanding of complex gene networks, detecting and characterizing epistasis on the statistical level to gain an understanding of the genetic susceptibility to these diseases is not a simple task.
Moore argues that epistasis is a ubiquitous component of common human disease based on the concept that epistasis has been prominent in literature for more than 100 years and many single locus results from linkage and association studies have demonstrated a further lack of replication [19–21]. The completion of the Human Genome Project in 2003 and the International HapMap Project in 2005 gave rise to an abundance of research tools, such as genome-wide genotyping, that allow researchers to conduct GWAS for detecting genetic variants which confer increased or decreased susceptibility to disease and have allowed for the detection of epistasis in a number of diseases, such as diabetes , bipolar disorder [23,24] and coronary artery disease . Although the technical details of measuring a large representative set of SNPs in an accurate and efficient manner are now well established , the analytical methods for determining which SNPs are important are in their infancy and are based on important assumptions, such as each SNP having a large and independent effect on disease risk . Traditional methods of analysis, such as linear and logistic regression, have had limited success owing to the sparseness of data in high dimensions. For example, when interactions among multiple SNPs are considered, there are many multilocus genotype combinations that may have very few or no data points. This can lead to an increase in type I and type II errors owing to parameter estimates with very large standard errors. It is evident that we need research strategies that embrace, rather than ignore, the complexity of the relationship between genotype and phenotype [10,28–33].
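The sparseness problem described above can be made concrete with a small calculation. The sketch below assumes biallelic SNPs (three genotypes each) and a hypothetical study of 1000 subjects; the function names are illustrative only. It reports the average number of subjects per multilocus genotype cell, which is an optimistic figure because real genotype frequencies are uneven and many cells would hold fewer subjects than the average.

```python
def genotype_cells(num_snps: int) -> int:
    """Number of multilocus genotype combinations for biallelic SNPs."""
    return 3 ** num_snps


def mean_subjects_per_cell(num_subjects: int, num_snps: int) -> float:
    """Average number of subjects available per genotype combination."""
    return num_subjects / genotype_cells(num_snps)


if __name__ == "__main__":
    NUM_SUBJECTS = 1000  # hypothetical sample size, not taken from any study
    for k in range(1, 6):
        cells = genotype_cells(k)
        mean = mean_subjects_per_cell(NUM_SUBJECTS, k)
        print(f"{k} SNPs: {cells} cells, {mean:.1f} subjects per cell on average")
```

With five SNPs there are already 243 cells, so a 1000-subject study averages about four subjects per cell, which is why parameter estimates from regression models become unstable.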
There have been many methods proposed to identify SNP interactions and their association with diseases. These proposed methods include multifactor dimensionality reduction (MDR) , combinatorial partitioning method , symbolic discriminant analysis , Monte Carlo logistic regression , recursive partitioning method , focused interaction testing framework , backward genotype–trait association , Bayesian epistasis association mapping , a forest-based approach , penalized logistic regression , grammatical evolution neural networks  and MegaSNPHunter . Each method has its advantages and disadvantages, and while some have shown improved computational efficiency over others, analyzing high-order interactions with any method is a computational burden that has yet to be overcome. To illustrate this, we next discuss MDR, which has been widely applied to detect gene–gene interactions in a number of common diseases.
Multifactor dimensionality reduction was developed as a nonparametric and model-free data mining method for detecting, characterizing and interpreting epistasis in the absence of significant independent effects in genetic and epidemiologic studies of complex traits such as disease susceptibility [33,34,46–50]. The goal of MDR is to change the representation of the data using a constructive induction algorithm to make nonadditive interactions easier to detect using any classification method, such as naive Bayes or logistic regression [49,50]. This is accomplished by first labeling each genotype combination as high or low risk using some function of a discrete end point, such as case–control status. A new MDR variable with two levels is constructed by pooling all high-risk genotype combinations into one group and all low-risk combinations into another group. Traditionally, MDR-constructed variables have been evaluated with a probabilistic naive Bayes classifier that is combined with tenfold cross validation to obtain an estimate of predictive accuracy or generalizability of epistasis models. While the MDR method has proven to be an effective way of detecting epistasis in a number of diseases, to analyze all of the combinations of SNP interactions in large datasets or genome-wide studies would be impractical, even with access to the largest and most powerful computers available.
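The constructive induction step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published MDR software: genotypes are assumed to be coded 0, 1 or 2, status is 1 for cases and 0 for controls, a cell is called high risk when its case:control ratio exceeds the overall ratio, and the cross-validation step is omitted so that only training accuracy is reported. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mdr_variable(genotypes, status, snp_idx):
    """Pool multilocus genotypes at the SNPs in snp_idx into one binary variable.

    genotypes: list of tuples, one per subject, each entry coded 0, 1 or 2.
    status:    list of 1 (case) or 0 (control), one per subject.
    Returns a list with 1 for high-risk and 0 for low-risk subjects.
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        key = tuple(g[i] for i in snp_idx)
        (cases if s == 1 else controls)[key] += 1
    # A cell is high risk when its case:control ratio exceeds the overall ratio.
    # Cross-multiplying avoids dividing by zero in cells without controls.
    high_risk = {
        key
        for key in set(cases) | set(controls)
        if cases[key] * n_controls > controls[key] * n_cases
    }
    return [1 if tuple(g[i] for i in snp_idx) in high_risk else 0 for g in genotypes]


def training_accuracy(predicted, status):
    """Fraction of subjects whose high/low-risk label matches case status."""
    return sum(p == s for p, s in zip(predicted, status)) / len(status)


def best_pair(genotypes, status):
    """Exhaustively score every SNP pair and return the most accurate one."""
    n_snps = len(genotypes[0])
    return max(
        combinations(range(n_snps), 2),
        key=lambda idx: training_accuracy(mdr_variable(genotypes, status, idx), status),
    )


# Toy data: disease status depends on SNPs 0 and 1 jointly (an XOR pattern);
# neither SNP has a marginal effect, and SNP 2 is uninformative.
toy_genotypes = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)] * 5
toy_status = [0, 1, 1, 0] * 5

if __name__ == "__main__":
    print(best_pair(toy_genotypes, toy_status))
    single = mdr_variable(toy_genotypes, toy_status, (0,))
    print(training_accuracy(single, toy_status))
```

In the toy data, each SNP taken alone classifies no better than chance, while the pooled two-SNP variable separates cases from controls, which is the situation MDR was designed for. The exhaustive loop in `best_pair` is also what makes the approach expensive as the number of SNPs grows.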
To illustrate the scope of such an analysis, consider a recent report from the International HapMap Consortium that suggests that approximately 300,000 carefully selected SNPs may suffice to represent all of the relevant genetic variations across the human Caucasian genome. If this is to be regarded as the lower limit of a GWAS, then approximately 4.5 × 10^10 pairwise combinations (300,000 choose two) and 4.5 × 10^15 three-way combinations (300,000 choose three) would need to be exhaustively analyzed to detect low-order epistasis using MDR. If 10^6 MDR evaluations can be computed each second, then evaluating all 300,000 SNPs individually would require less than 1 s of computer time. However, computing all two-way and three-way MDR models would require more than 52,000 days of computer time. Access to a 1000-processor supercomputer might reduce this to 52 days, which is within the realm of possibility. However, extending this to all four-way combinations is not computationally feasible. This adds to the many challenges of detecting epistasis on a genome-wide scale [2,32].
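The figures in the preceding paragraph can be checked directly. The sketch below uses only the numbers stated in the text (300,000 SNPs, one million evaluations per second, 1000 processors); the variable and function names are illustrative, and perfect parallel scaling is assumed.

```python
from math import comb

N_SNPS = 300_000
EVALS_PER_SECOND = 10**6
SECONDS_PER_DAY = 86_400
PROCESSORS = 1000  # assumes perfect parallel scaling


def days_needed(n_models: int, processors: int = 1) -> float:
    """Days of computing needed to evaluate n_models at the stated rate."""
    return n_models / EVALS_PER_SECOND / SECONDS_PER_DAY / processors


pairs = comb(N_SNPS, 2)
triples = comb(N_SNPS, 3)
quads = comb(N_SNPS, 4)

if __name__ == "__main__":
    print(f"pairwise combinations:  {pairs:.2e}")
    print(f"three-way combinations: {triples:.2e}")
    print(f"two- and three-way, one processor: {days_needed(pairs + triples):,.0f} days")
    print(f"two- and three-way, {PROCESSORS} processors: "
          f"{days_needed(pairs + triples, PROCESSORS):,.0f} days")
    print(f"four-way, {PROCESSORS} processors: {days_needed(quads, PROCESSORS):,.0f} days")
```

The four-way figure comes out in the millions of days even on 1000 processors, which is the sense in which exhaustive four-way analysis is not feasible.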
As previously mentioned, an important and understandably difficult goal in human genetics is to determine which of the many thousands of SNPs are useful for predicting who is at risk for common diseases. It was close to a decade ago that Risch and Merikangas first seriously proposed the testing of all known SNPs in the human genome for disease association either directly or by linkage disequilibrium with other SNPs , and today, this ‘genome-wide’ approach is expected to revolutionize the genetic analysis of common human diseases [32,53–57]. Currently, it is possible to measure more than 1 million SNPs with a genome-wide human SNP array available from Affymetrix, and Illumina® has released the Human1M DNA Analysis BeadChip, which is capable of profiling 1 million SNPs on a single array across the human genome.
Certainly, with these technologies now available to the scientific community at more affordable costs, it is easy to see how such large amounts of data are being rapidly produced. Unfortunately, owing to the lack of logical methods to summarize this quantity of information within a biological context, investigators are at a loss when they reach the analysis stage in their research. In fact, our ability to measure genetic information, and biological information in general, is far outpacing our ability to interpret it. It is probable that most of these large-scale studies harbor a wealth of information concerning susceptibility genes that can be used to improve the prevention, diagnosis and treatment of common diseases. However, to take full advantage of this information, we need to address the specific technical challenges that confront researchers in the analysis process, such as the computational limitations of large-scale genetic analysis with methods such as MDR and the difficulty of interpreting the results. We believe that we can overcome these limitations by utilizing expert knowledge from sources such as PPI databases, which can be used to help guide our analysis as well as to help us to interpret our results in a biological context.
Expert knowledge can be defined as existing biological or statistical information regarding the problem at hand that can be incorporated into the analysis process to guide an algorithm in a more directed fashion. For example, when considering SNPs or genes in a genetic analysis, biological expert knowledge may be derived from what is known about the function of biochemical pathways, the gene ontology (GO) or expression information for that gene. Emily et al. used experimental knowledge concerning biological networks to narrow the search for two-locus epistases that confer susceptibility to Crohn’s disease, bipolar disorder, hypertension and rheumatoid arthritis. This group utilized the protein-interaction database, Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), which we discuss later, and expert knowledge from protein interactions to guide their search for epistasis. They were able to identify 71,000 high-confidence potential PPIs in the database and, from there, identified all of the SNPs that corresponded to the genes for the relevant proteins. Subsequently, this interaction information was used to prioritize SNPs in a large genomic dataset from the Wellcome Trust Case–Control Consortium that covered a number of diseases. They were able to identify four significant cases of epistasis between unlinked loci in all four diseases . Another approach that similarly utilizes expert knowledge is a pathway-based approach or analysis, which employs computational methods to define sets of genes based on common biological attributes, such as GO or biological pathways. This information is used to define a measure of enrichment of each gene set among disease-associated markers . Shriner et al. explained how, in linkage studies of complex traits, testing each candidate gene from every region is a computational challenge, as similarly described for genome-wide epistasis studies. 
They demonstrate this concept of pathway-based analysis with their commonality of functional-annotation method, which operates by testing individual GO terms for enrichment in candidate gene pools and ultimately ranks genes based on the number of quantitative trait loci regions where genes with such annotations are found. When this method was applied to published linkage studies that examined the relationship between age of onset of Alzheimer’s and BMI, new candidate genes, as well as previously published candidate genes, were identified .
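One common way to quantify the enrichment of a gene set in a pathway-based analysis is a hypergeometric (one-sided Fisher) over-representation test. The sketch below is a generic illustration of that test and is not the commonality of functional-annotation method of Shriner et al.; the function name and its parameters are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes by chance.

    population: total number of genes considered.
    annotated:  genes in the population carrying the annotation (e.g. a GO term).
    sample:     size of the candidate gene pool.
    hits:       candidate genes carrying the annotation.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 genes, 200 with the term,
    # 50 candidate genes of which 5 carry the term.
    print(hypergeom_enrichment_p(20_000, 200, 50, 5))
```

A small p-value suggests that the annotation is over-represented among the candidate genes; when many annotations are tested, the p-values would need correction for multiple testing.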
In addition to biological information, it is also useful to use prior statistical knowledge to help guide an epistasis analysis. For example, logarithm (base 10) of odds scores from a prior linkage analysis could be used to weight SNPs in certain chromosomal regions more heavily during a combinatorial epistasis analysis. That is, SNPs from a certain pathway or chromosomal region would be evaluated for interactions with a higher probability than others in the dataset. Statistical knowledge could also come from filter algorithms that explicitly assess the quality of a SNP based on its relationship with the clinical end point. The Tuned ReliefF (TuRF) algorithm is an example of an algorithm that can assign high-quality scores to SNPs involved in complex interactions. The TuRF algorithm uses a nearest-neighbor approach to assess SNP quality and, thus, does not suffer from the computational limitations of an algorithm that explicitly considers combinations of SNPs. As such, it is very useful for preprocessing the data prior to analysis. Once computed, the TuRF scores can be used to select some reduced number of SNPs for combinatorial analysis or can be used to help guide a computational search algorithm. While both biological and statistical expert knowledge may be useful for facilitating genome-wide studies, we wish to exploit the strong relationship between genes and proteins for these purposes. Statistical epistasis may be indicative of biological epistasis and vice versa, making this a valid application as well as a potential method to gain an understanding of the relationship between the two.
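The idea of evaluating some SNPs "with a higher probability than others" can be sketched as weighted sampling of SNP pairs. In the minimal example below, the per-SNP scores stand in for any prior knowledge (for instance linkage scores or TuRF-style quality scores); the scores, SNP identifiers and function name are hypothetical, and SNPs with a score of zero or less are simply never drawn.

```python
import random
from math import comb


def sample_snp_pairs(scores, n_pairs, seed=0):
    """Draw distinct SNP pairs, favoring SNPs with higher prior scores.

    scores:  dict mapping SNP identifier to a prior score.
    n_pairs: number of distinct unordered pairs to return.
    """
    snps = [s for s, w in scores.items() if w > 0]
    if n_pairs > comb(len(snps), 2):
        raise ValueError("not enough positively weighted SNPs for that many pairs")
    weights = [scores[s] for s in snps]
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.choices(snps, weights=weights, k=2)
        if a != b:  # a SNP cannot interact with itself
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical prior scores for five SNPs.
    prior_scores = {"rs0001": 4.2, "rs0002": 0.3, "rs0003": 2.8, "rs0004": 0.0, "rs0005": 1.1}
    for pair in sample_snp_pairs(prior_scores, n_pairs=4):
        print(pair)
```

The sampled pairs could then be passed to a combinatorial method such as MDR in place of the exhaustive list, trading completeness for a search that concentrates on the regions the prior knowledge favors.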
Proteomics and the study of PPIs are becoming increasingly important in our effort to understand human diseases on a system-wide level. While mass spectrometry has been a useful technology applied to the discovery of components in protein complexes , PPIs have traditionally been measured using a variety of assays, such as immunoprecipitation and yeast two-hybrid (Y2-H). To reconstruct the entire network of PPIs within cells remains a challenge; yet, this is becoming a more approachable problem. The field of proteomics is advancing, and the aforementioned techniques to detect PPIs have been scaled up to measure interactions on a genome-wide level. High-throughput techniques have also been developed to identify protein complexes using affinity pull-down followed by mass spectrometry , and systematically constructed double-knockout strains in yeast have proven to be useful for constructing a large-scale view of genetic-interaction networks .
To complement these experimental techniques, a number of computational methods have been developed that are capable of predicting interactions. Originally, the computational prediction of protein interactions was limited to proteins whose 3D structure had been determined; now, on account of complete genome sequencing, prediction methods can extend to the genomic level. Many of the available systems for predicting PPIs are based on gene-coexpression data, given that two proteins must be spatially and temporally coordinated for an interaction to be meaningful, while others have been founded on the concept that if two proteins have co-evolved, then these proteins have a higher probability of interacting within a cell [60,65–67]. One approach that makes use of sequence information is based on the understanding of domain–domain interactions in proteins, and many computational methods have been developed in the past 8 years to infer domain–domain interactions from PPI databases; these inferred interactions are subsequently used to predict new PPIs. Other recent and common approaches mine the literature: co-occurrence approaches, which infer a biological PPI between two proteins from the frequency with which they co-occur in text, and rule-based approaches, which employ predefined phrase-pattern rules. For example, the Protein Interaction Information Extraction system was developed as an online system for predicting PPIs from text. Since it has been shown that PPI information exhibits a pattern in articles, and machine learning has proven useful for discovering hidden patterns in data, machine-learning techniques have been used in many PPI prediction methods. This system uses both co-occurrence and rule-based approaches in a machine-learning framework.
Additional prediction techniques are reviewed by Ta et al.; these include the association method, as previously mentioned, which determines pairs of correlated domains that co-occur in the PPIs more frequently than by chance, the Maximum Likelihood Estimation technique, used to calculate the interaction probability for all possible domain pairs observed in a PPI dataset, and a parsimony-explanation approach, which uses a specific type of programming to derive a statistical score for domain–domain interactions . Owing to the development and curation of protein-interaction databases, which will be elaborated upon later, up-to-date information on these interactions, both experimentally determined and predicted, is accessible and publicly available to the scientific community.
Some of these techniques have been actively employed in the study of the pathogenesis of Huntington’s disease (HD), an autosomal-dominant neurodegenerative disorder that causes cognitive impairment, psychiatric problems and motor dysfunction. This inherited disease is caused by the expansion of a polyglutamine tract in the huntingtin (htt) protein. Although this protein was discovered more than a decade ago, constructing the protein-interaction network that it belongs to is still an ongoing process that is providing clues on the function of htt and its role in HD pathogenesis. Many interaction partners for mutant and wild-type htt have been elucidated over the past decade by Y2-H, affinity chromatography and immunoprecipitation. To follow these studies, it was recently hypothesized that genetic modifiers of HD neurodegeneration should be enriched among htt protein interactors, and to test this, both high-throughput Y2-H screening and affinity pull-down followed by mass spectrometry were utilized. This group was able to identify 104 htt interactions with Y2-H and 130 interactions with their pull-down method. To elucidate the biological relevance of these interactions, using a high-content validation assay, they also tested a set of 60 genes encoding htt-interacting proteins for their ability to act as genetic modifiers of neurodegeneration in the HD Drosophila model. Results showed that 45% of these genes were high-confidence genetic modifiers (much higher than the 1–4% observed in unbiased genetic screens), and that these genes were similarly represented among proteins discovered with their Y2-H and pull-down and/or mass spectrometry methods. These results demonstrate that these methods are equally useful for identifying biologically relevant interactions.
Making use of information available in PPI databases, such as Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Human Protein Reference Database (HPRD), along with PubMed publications, another group evaluated the commonality of molecular pathogenic mechanisms of neurodegenerative disorders, including HD, along with Alzheimer’s disease, Parkinson’s disease, dentatorubral–pallidoluysian atrophy and prion disease, as well as amyotrophic lateral sclerosis. The investigators examined the PPI networks associated with causative proteins, such as htt, and found 19 proteins common to all diseases from literature, as well as 81 new common proteins from their network constructed using database information. Many of these identified proteins were previously characterized as being associated with the respective diseases. A relatively high correlation between all diseases for all of their analysis was seen, including commonality in characteristic protein domains. They concluded that the interactions found in this study in silico may serve to function in the common pathogenic mechanisms among neurodegenerative disorders .
As the effort continues to reconstruct the entire proteome, it would be to our advantage to exploit the breadth of knowledge contained in PPI databases. Not only will we gain a greater understanding of numerous biological processes but also, presumably, be able to apply this knowledge to advance other fields of research, such as drug discovery, disease prognosis and the study of disease susceptibility. High-throughput molecular-profiling approaches, such as microarray technology, have already been successful in the advancement of these fields; yet, similarly, a rate-limiting step has been the ability to interpret the biological meaning of the data. Often, this problem has been approached from a pathway perspective that involves investigating which pathways are perturbed in a case–control population, which pathways determine a good or bad prognosis, or which pathways are activated or repressed in response to certain stimuli or compounds [57,62]. Such an approach can simplify the analysis and interpretation of genome-wide or large-scale datasets. Similarly, the abundance of information available in protein-interaction databases can be used for similar purposes in the field of human genetics, which we will discuss in greater depth later in this review.
Currently, there exist numerous publicly available protein-interaction databases that contain information regarding human-specific interactions (Table 1). The majority of PPIs in these databases are from curation of the literature by biologists; however, some are incorporated by direct deposit prior to publication by the investigator. In a majority of the PPI databases, the user will enter a protein of choice by either protein name or accession number, according to RefSeq, GenBank, Online Mendelian Inheritance in Man (OMIM), SwissProt or Entrez Gene, and, in return, receive a list of protein interactors, information pertaining to the experimental evidence for that interaction, as well as information concerning the protein itself. Another common feature of most databases is the ability to visualize the network of the queried protein and its interactors. We will touch on some of the key features of certain databases. For a more comprehensive review and additional information on these databases and others, please see [73–75].
One of the largest publicly available databases is the HPRD, which, to date, has more than 38,000 PPIs, more than 270,000 PubMed links and access to curated pathways, as well as information on post-translational modifications (PTMs), domain architecture, protein functions, enzyme–substrate relationships, subcellular localization, tissue expression and disease association of genes. An interesting feature of this database is the Protein Distributed Annotation System, which enables researchers to annotate proteomic information in the context of HPRD data so that it is easily shared with the rest of the scientific community. Another large and growing database that has similar components is the BioGrid, which currently houses approximately 42,800 human PPIs, but, altogether, contains more than 200,000 interactions from Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Mus musculus and Drosophila melanogaster, in addition to Homo sapiens.
Other available databases that are smaller than the HPRD and BioGrid yet offer additional unique features are the Biomolecular Interaction Network Database (BIND), which is a component of the Biomolecular Object Network Database, the Molecular Interaction database (MINT), the Database of Interacting Proteins (DIP) and Reactome. For example, BIND and MINT provide confidence scores for interactions (in BIND, specifically for those from Y2-H experiments). In MINT, this score is based on the number of interactions, the number of citations and the type of experiment conducted to detect that interaction, while in BIND, the score is based on shared or related GO annotations, phenotypic profiling, homologous interactions, domain structure and the number of publications. MINT also contains information pertaining to protein interactions with promoter regions and mRNA. Unique to DIP, the user can select to have certain PPIs evaluated based on paralogous interactions or common expression profiles of interactors or through domain-interaction preferences.
Reactome is not specifically a PPI database, but a curated resource for human-pathway data based on biologic reaction networks. Reactome reactions are described as taking place between ‘physical entities’, which include not only proteins but also nucleic acids, single small molecules, macromolecular complexes and even subatomic particles. All proteins, genes and reactions are cross referenced to a variety of widely used databases, such as Entrez Gene, Online Mendelian Inheritance in Man (OMIM) and KEGG, and each reaction is supported by evidence from biomedical literature, as well as documented with approved citations . The user has the ability to search the database using a reaction name, gene name, protein name or any of several alternative identifiers. Reactions in the output are represented graphically, and the user has the option to click on ‘top-level’ pathways to delve deeper into the hierarchy with increasing detail at each level. Additionally, one can select for nonhuman species, and all accession numbers for all genes and proteins involved can be downloaded. For a more in-depth review of this database, please see .
Resources such as STRING, Unified Human Interactome (UniHI) and GeneNetwork access a number of the reviewed databases to integrate protein-interaction information. The newest version of STRING, 8.0, covers approximately 2.5 million proteins from 630 different organisms and incorporates PPI information from a number of interaction databases, such as HPRD, BioGrid, MINT, BIND and DIP, and also imports known reactions from Reactome and KEGG pathways. Recent additions to this database incorporate interactions from IntAct, EcoCyc, NCI-Nature Pathway Interaction Database and GO. Automated text mining of PubMed abstracts, OMIM and information from other databases, such as the Saccharomyces Genome Database, Wormbase and the Interactive Fly, supplement this information [78,79]. For interactions that have not been confirmed experimentally in a given organism, STRING is capable of running a set of prediction algorithms and transferring known interactions from model organisms to other species based on the predicted orthology of those proteins. The user, however, has the option to select which organism the queried protein and its interactors will pertain to. Each interaction is given a numerical confidence score based on the experimental evidence and orthologous evidence behind that interaction, which allows the user to filter networks according to a desired confidence threshold.
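Filtering a network by a confidence threshold, as described above, can be sketched in a few lines of Python. This is only an illustration: the `Interaction` record, the 0 to 1 score scale, the protein labels and the scores are all assumptions made for the example and are not taken from STRING.

```python
from typing import NamedTuple


class Interaction(NamedTuple):
    protein_a: str
    protein_b: str
    score: float  # assumed confidence on a 0-1 scale


def filter_by_confidence(interactions, threshold):
    """Keep only interactions whose score meets or exceeds the threshold."""
    return [i for i in interactions if i.score >= threshold]


# Toy network with placeholder proteins; the scores are invented.
network = [
    Interaction("P1", "P2", 0.92),
    Interaction("P1", "P3", 0.81),
    Interaction("P2", "P4", 0.45),
]

high_confidence = filter_by_confidence(network, threshold=0.7)
print([(i.protein_a, i.protein_b) for i in high_confidence])
```

Raising the threshold trades coverage for reliability, so the appropriate cutoff would depend on how the filtered network is to be used.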
Unified Human Interactome not only integrates PPIs from large Y2-H screens and curated databases, such as HPRD, DIP, BIND and Reactome (as well as others we have not discussed), but also predicts interactions based on orthology and computational text-mining approaches. This database also provides detailed information on each interaction, including statistical-interaction validation by gene coexpression data and validation by shared path length according to GO co-annotation hierarchy. The sources of the interactions are also documented and provided, along with links to access more information about that particular source of evidence. A useful feature of UniHI is that it allows for a highly targeted search, by which the user can exclude certain mapping approaches, such as Y2-H, display only proteins that are common interaction partners to multiple proteins in a query, display only interactions that occur in multiple maps, or display only direct interactions.
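The targeted-search options described above amount to simple set operations over interaction maps. The following Python sketch illustrates the idea under stated assumptions: the map names, protein labels and function names are hypothetical and do not reflect how UniHI is implemented.

```python
from collections import defaultdict

# Hypothetical interaction maps: map name -> set of unordered protein pairs.
maps = {
    "y2h_screen": {frozenset(p) for p in [("A", "X"), ("B", "X"), ("A", "Y")]},
    "curated_db": {frozenset(p) for p in [("A", "X"), ("B", "X"), ("B", "Z")]},
}


def merged_network(maps, exclude=()):
    """Count how many maps support each interaction, skipping excluded maps."""
    support = defaultdict(int)
    for name, pairs in maps.items():
        if name in exclude:
            continue
        for pair in pairs:
            support[pair] += 1
    return support


def partners(support, protein):
    """Direct interaction partners of a protein in the merged network."""
    return {q for pair in support if protein in pair for q in pair if q != protein}


def common_partners(support, query):
    """Proteins that interact with every protein in a non-empty query."""
    return set.intersection(*(partners(support, p) for p in query))


support = merged_network(maps)
in_multiple_maps = {pair for pair, n in support.items() if n >= 2}
shared = common_partners(support, ["A", "B"])
print(sorted(shared), len(in_multiple_maps))
```

Passing `exclude=("y2h_screen",)` to `merged_network` mimics excluding a mapping approach before the other filters are applied.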
GeneNetwork comprises known interactions from BIND, HPRD, Reactome and KEGG. Similar to STRING and UniHI, GeneNetwork supplies predicted interactions based on biological process and molecular function annotation from the GO database. Additional experimental data are incorporated, such as coexpression data from approximately 450 microarrays from the Stanford Microarray Database and the National Center for Biotechnology Information Gene Expression Omnibus. Human Y2-H interactions and interactions based on orthologous high-throughput PPIs from lower eukaryotes are also included. After submitting a query for a given gene, the user is returned a list of interactors, each of which has an overall likelihood score, along with likelihood scores for that interaction based on microarray coexpression, human PPI prediction and orthologous PPI predictions. Positive evidence of known interactions from HPRD, BIND, KEGG and Reactome is indicated in additional columns. A recent study used this database to rank the best positional candidates in susceptibility loci on the basis of their interactions using a method they developed known as the ‘Prioritizer’.
A recent study examining PPI networks for human inherited neurodegenerative disorders characterized by ataxia (i.e., loss of balance or coordination) illustrates how these databases have been used to help us better understand pathogenic mechanisms underlying human diseases. Lim et al. examined protein-interaction networks involved in cerebellar Purkinje cell degeneration, which is the primary cause of coordination and balance loss in inherited ataxias. They first developed a network for 54 proteins involved in 23 ataxias by Y2-H screens and then expanded this network based on information from literature-curated and evolutionarily conserved interactions. Relevant direct PPIs were added from available interaction networks developed by Rual et al. and Stelzl et al. [83,84], and binary interactions were identified for the 54 ataxia-associated baits and 561 interacting prey proteins using literature-based information from BIND, HPRD, DIP, MINT and the mammalian PPI database. Furthermore, 1527 potential human interlogs (i.e., potentially evolutionarily conserved interactions) were identified from more than one species using the InParanoid database. Since 68 and 63% of literature-curated and interlog interactions, respectively, are annotated to similar GO compartments, this group suggests that these identified interactions are of similar quality to the interactions they identified in their Y2-H screens. The network demonstrated that several ataxia proteins interact and that there are shared pathways and mechanisms in this class of diseases. This study by Lim et al. will hopefully be able to provide additional knowledge about individual protein function and candidate genes for other diseases with similar phenotypes.
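The notion of an interlog can be made concrete with a short Python sketch. It is not the procedure used by Lim et al.; the function names, protein labels and ortholog tables are hypothetical, and the ortholog mapping is assumed to come from a resource such as InParanoid.

```python
def predict_interlogs(model_interactions, orthologs):
    """Transfer model-organism interactions to human via an ortholog table.

    model_interactions: iterable of (protein_a, protein_b) pairs
    orthologs: dict mapping a model-organism protein to a set of human orthologs
    """
    predicted = set()
    for a, b in model_interactions:
        for human_a in orthologs.get(a, ()):
            for human_b in orthologs.get(b, ()):
                if human_a != human_b:  # design choice: skip self-interactions
                    predicted.add(frozenset((human_a, human_b)))
    return predicted


def supported_by_multiple_species(per_species_predictions, min_species=2):
    """Keep predicted pairs that were transferred from at least min_species."""
    counts = {}
    for predictions in per_species_predictions:
        for pair in predictions:
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= min_species}


# Toy data with invented identifiers.
yeast = predict_interlogs(
    [("y1", "y2"), ("y3", "y4")],
    {"y1": {"H1"}, "y2": {"H2"}, "y3": {"H3"}},
)
worm = predict_interlogs(
    [("w1", "w2"), ("w5", "w6")],
    {"w1": {"H1"}, "w2": {"H2"}, "w5": {"H5"}, "w6": {"H6"}},
)
conserved = supported_by_multiple_species([yeast, worm])
print(conserved)
```

Requiring support from more than one species, as in the second function, is one way to favor interactions that are more likely to be evolutionarily conserved.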
These are just a few of the more widely used publicly available databases that provide information on PPIs. Certainly, each one has unique features that allow researchers to gain access to vast amounts of useful biological information that can be broadly applied. In particular, we believe that this information would be extremely useful when applied to genome-wide studies that aim to detect epistatic or gene–gene interactions. There are many challenges when it comes to detecting epistasis, and we propose that we will be able to utilize this abundance of information to not only improve the computational efficiency of genome-wide studies of epistasis but also to facilitate the biological interpretation of the results.
We wish to exploit protein-interaction data to improve the genetic analysis of common human diseases. As we have illustrated, the information available to the scientific community in curated protein-interaction databases is abundant, and we certainly believe that proteomic data will be a useful complement to genetic data and that there is a valuable relationship between the two. We can begin to dissect this relationship by simply examining whether epistatic interactions detected statistically, such as with MDR or other methods, are also found to exist at the protein level in the interaction databases previously mentioned.
To illustrate this, we used a number of databases to query the protein interactions represented by significant SNP interactions in three genetic-association studies. Coutinho et al. used MDR to analyze seven candidate genes in the serotonin metabolic and neurotransmission pathways mapping to autism linkage regions and reported a significant interaction between polymorphisms in the 5-hydroxytryptamine (serotonin) receptor 5A (HTR5A), integrin-β3 precursor (ITGB3), and sodium-dependent serotonin transporter (SLC6A4; p = 0.001). Evidence for a physical interaction between SLC6A4 and HTR5A was found in the STRING database when querying HTR5A and was based on text mining. When querying SLC6A4 or ITGB3 in STRING, evidence for an interaction between these two genes was provided and was also based on text mining. Both interactions cite this specific paper, along with others, as sources of evidence. Asselbergs et al. analyzed interactions among polymorphisms influencing levels of tissue plasminogen activator and plasminogen activator inhibitor 1, which influence the risk of arterial thrombosis. Using a two-way analysis-of-variance statistical test, the investigators found significant interactions between a polymorphism in the bradykinin B2 gene (BDKRB2) and the angiotensin-converting enzyme (p = 0.003) on tissue plasminogen activator in females, and between polymorphisms in bradykinin B2 and angiotensin II type 1 receptor (AT1R/AGTR1) on tissue plasminogen activator in males (p = 0.006). This latter interaction was also significant for plasminogen activator inhibitor 1 levels in both males and females. Strong evidence for interaction among all three of these genes is seen when querying STRING (Figure 1) and is supported by both experimental evidence, based on in vivo assays, and text-mining evidence.
One or more of these interactions are found in the databases that STRING integrates, such as HPRD, BIND, Reactome, MINT, BioGrid, DIP and KEGG annotated pathways. Another, more recent study examined SNPs in topoisomerase 3-α, RECQ-mediated genome instability 1 protein (RMI1) and Bloom syndrome protein (BLM) and their association with cancer risk in acute myeloid leukemia/myelodysplastic syndromes, malignant melanoma, breast cancer and bladder cancer. Since mutations in BLM are known to be associated with elevated cancer risk, it was reasoned that genetic variants of BLM and the proteins that complex with it might play a role in influencing the risk for different cancers. It was determined that variant interactions in topoisomerase 3-α and BLM showed increased risk in all four cancers. While this study did not show statistical evidence of interaction with RMI1, it was shown to confer increased risk of acute myeloid leukemia/myelodysplastic syndromes and malignant melanoma. STRING showed that all three of these proteins interacted according to both text mining and experimental evidence based on co-immunoprecipitation, molecular sieving and fluorescence-imaging assays. While this provides stronger evidence that PPI databases can be useful as expert knowledge, we need to find a logical way to incorporate this information into the analysis process.
Similar to the approaches developed by Emily et al. (2009), one possible method would be to identify all of the genes associated with the SNPs in a dataset whose protein products have evidence of direct interaction with each other and filter that dataset accordingly. Filtering based on the direct interactions may prove to be a simple solution, but doing so may ignore potentially important biological information. Interactions do not have to be direct, and it may be beneficial to include the SNPs and genes to a certain level according to their indirect interactions, in other words, by taking a more pathway-based approach, as did Shriner et al. (2008) and Askland et al. (2009). Another option would be to utilize or develop a confidence score for present interactions based on information available from a PPI database or even multiple databases. As mentioned, MINT, BIND, STRING, UniHI and GeneNetwork all provide a confidence score for interactions based on information such as the type of experiment conducted to detect that interaction and the supporting literature for that interaction. Specific metrics could be developed that would allow all SNPs or genes to be prioritized or weighted based on biological information on their interactions or allow investigators to filter SNPs or genes based on a determined interaction confidence threshold.
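The direct and indirect filtering strategies outlined above can be sketched as a bounded breadth-first search over a PPI graph. The Python example below is a minimal illustration, not a reimplementation of any cited method; the adjacency-dictionary graph, the SNP-to-gene map and all identifiers are hypothetical.

```python
from collections import deque
from itertools import combinations


def hop_distance(graph, source, target, max_hops):
    """Breadth-first search: hops between two proteins, or None if > max_hops."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the allowed number of hops
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def candidate_snp_pairs(snp_to_gene, graph, max_hops=1):
    """SNP pairs whose genes lie within max_hops of each other (1 = direct).

    SNPs mapped to the same gene have distance 0 and are therefore kept.
    """
    kept = []
    for snp_a, snp_b in combinations(sorted(snp_to_gene), 2):
        gene_a, gene_b = snp_to_gene[snp_a], snp_to_gene[snp_b]
        if hop_distance(graph, gene_a, gene_b, max_hops) is not None:
            kept.append((snp_a, snp_b))
    return kept


# Toy undirected network G1 - G2 - G3, with G4 isolated.
graph = {"G1": {"G2"}, "G2": {"G1", "G3"}, "G3": {"G2"}, "G4": set()}
snp_to_gene = {"snp1": "G1", "snp2": "G2", "snp3": "G3", "snp4": "G4"}

direct_only = candidate_snp_pairs(snp_to_gene, graph, max_hops=1)
within_two = candidate_snp_pairs(snp_to_gene, graph, max_hops=2)
print(direct_only, within_two)
```

Increasing `max_hops` moves the filter from strictly direct interactions toward a more pathway-like neighborhood, at the cost of retaining more SNP pairs.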
If one were to take any of these approaches, it would appear that the vital information to extract from these databases would be the direct and indirect interaction partners found in the dataset (to a certain level) as well as the evidence or confidence score to support those interactions. While this also seems to be a rather simple approach, one needs to consider that the number of databases available that could provide this information is abundant and that this information may not be consistent between databases. For example, Mathivanan et al. thoroughly reviewed the features of a number of databases, including MINT, BIND, HPRD, DIP and Reactome, and concluded that while there may be good overlap at the protein level between these databases, the level of overlap between PPIs is not as great. They also find that for PPIs that do overlap between databases, there exists a difference in annotation, partly on account of differences that arise according to how biologists interpret the experimental results. This presents an obstacle when attempting to apply this expert knowledge from multiple databases and may lead to the exclusion of important interactions or the inclusion of noninfluential interactions in a dataset. Considering this, it may be beneficial to use an integrated database, such as STRING, UniHI or GeneNetwork, which have, in their own respective ways, brought together the various information in a number of databases.
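The distinction between overlap at the protein level and overlap at the interaction level can be illustrated with a Jaccard index. The following Python sketch uses two invented toy databases; it is not the comparison method of Mathivanan et al., and the Jaccard index is simply one assumed way to quantify overlap.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proteins_in(interactions):
    """All proteins that appear in at least one interaction."""
    return {protein for pair in interactions for protein in pair}


# Two toy databases covering the same proteins but different interactions.
db_one = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("C", "D")]}
db_two = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "D")]}

protein_overlap = jaccard(proteins_in(db_one), proteins_in(db_two))
interaction_overlap = jaccard(db_one, db_two)
print(protein_overlap, interaction_overlap)
```

In this toy case the two databases share every protein but only one of five distinct interactions, which mirrors the kind of discrepancy described above.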
We believe the methods we have discussed to bring expert knowledge into genome-wide studies to guide an analysis are one way to deal with the computational infeasibility of these large studies. Indeed, Bush et al. and Saccone et al. have shown that using biological knowledge to guide genetic association studies may provide more meaningful results [88,89]. Yu et al. provide a hypothesis-testing framework for combining multiple SNPs from the same gene or from multiple genes in a pathway-based manner. As mentioned earlier, Askland et al. recently showed that patterns of SNPs in biological pathways are more likely to replicate than individual SNPs in GWAS. Wilke et al. have suggested that we should not even begin to analyze a GWAS until we have exhaustively studied each candidate gene and each pathway. Only then will we have the appropriate knowledge base to make sense of GWAS results. As Moore noted, there is a major shift in the field of genetic epidemiology away from purely statistical approaches to these problems and toward a more bioinformatics-based approach that considers knowledge on gene function, gene networks and biochemical pathways. This year may mark the turning point toward more of a systems approach that recognizes the role of epistasis and other complexities in the genetic architecture of common diseases.
While current work has demonstrated promise in the idea of combining many types of data as an analysis strategy, narrowing the evaluation to gene combinations that have been shown to interact experimentally provides a biologically concise reason why those two genes may be detected together statistically. However, it is important to acknowledge the potential limitations of using PPI databases for this purpose. We first must acknowledge the dynamics of PPIs and the fact that, as mentioned earlier, although a database may claim an existing interaction, for that interaction to be meaningful for the disease at hand, these two proteins must be both spatially and temporally coordinated. PPIs are largely context dependent and require the appropriate cellular conditions for certain structural modifications that enable the interaction to occur. While mammalian two-hybrid systems that allow for assayed proteins to undergo these modifications in the appropriate cellular context have been developed to complement Y2-H systems, these tools are still under development and optimization. Moreover, a large number of the available PPI networks represent a static rather than a dynamic network.
In addition to PPI dynamics, it is important to recognize that bias may exist across all databases, as well as the fact that genes and SNPs in the dataset may be unannotated or anonymous. How does one deal with anonymous SNPs or SNPs that are not in coding regions? It may be that a researcher wishes to consider such a SNP as part of the gene that it is closest to, or perhaps they may consider which annotated SNPs are in linkage disequilibrium with it. Furthermore, there are many proteins in the human proteome that have not been studied thoroughly or even studied at all and may be under-represented or nonexistent in these databases. To add to this, a bias of experimental methods for capturing certain interactions exists; for example, Y2-H experiments are not entirely adequate for detecting interactions with integral membrane proteins. It is important to bear in mind the point that even if an interaction is not detected on the biological level, this does not mean that this interaction does not exist or will not be seen at the statistical level, and conversely, to remember that what is detected statistically may not have any biological relevance. Therefore, we must be aware of, and concerned about, the amount of important information we may potentially be missing owing to bias and lack of annotation, and what types of studies this expert knowledge is appropriate for.
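Assigning an anonymous SNP to its closest gene, one of the options raised above, could be sketched as follows in Python. The gene names, coordinates and the optional `max_distance` cutoff are all hypothetical, and the sketch assumes that the SNP and the genes lie on the same chromosome.

```python
def nearest_gene(snp_position, genes, max_distance=None):
    """Return the name of the gene closest to a SNP, or None if too far away.

    genes: list of (name, start, end) tuples on the same chromosome as the SNP
    max_distance: optional cutoff in base pairs (an assumed parameter)
    """
    best_name, best_dist = None, None
    for name, start, end in genes:
        if start <= snp_position <= end:
            dist = 0  # the SNP falls inside the gene
        else:
            dist = min(abs(snp_position - start), abs(snp_position - end))
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if max_distance is not None and best_dist is not None and best_dist > max_distance:
        return None
    return best_name


# Toy annotation with invented coordinates.
genes = [("GENE_A", 1000, 2000), ("GENE_B", 5000, 6000)]

print(nearest_gene(1500, genes))
print(nearest_gene(4000, genes))
print(nearest_gene(3000, genes, max_distance=500))
```

A distance cutoff of this kind leaves distant intergenic SNPs unassigned instead of forcing them onto a gene, which is one way to limit spurious annotations.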
Similarly, some may argue that using expert knowledge is biased in and of itself, regardless of any bias in the databases. The ability to conduct a GWAS has been said to ‘relax’ the need for a strong prior hypothesis because the whole genome can be analyzed at once. Individuals in support of ‘genomic agnosticism’ believe that, when conducting a genome-wide analysis, every SNP in the genome should be assumed to be equally functional. This brings us back to the issue of what information we may be missing by applying expert knowledge. While the benefits of conducting an unbiased GWAS are valid, such as having no prior hypothesis, elimination of bias and inclusion of all information, we are still at a loss for computational power to conduct these studies and fully explore all interactions.
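The scale of the computational problem follows directly from counting combinations. The short Python sketch below uses an assumed panel of 500,000 SNPs and an assumed filter that retains 5,000 SNPs; neither figure comes from a specific study.

```python
from math import comb


def n_combinations(n_snps, order=2):
    """Number of distinct SNP combinations of a given order."""
    return comb(n_snps, order)


n_snps = 500_000  # assumed genome-wide panel size
pairs = n_combinations(n_snps, 2)
triples = n_combinations(n_snps, 3)
print(f"{pairs:,} pairwise and {triples:,} three-way combinations")

# Assumed example: a knowledge-based filter that retains 5,000 SNPs.
filtered_pairs = n_combinations(5_000, 2)
print(f"{filtered_pairs:,} pairwise combinations after filtering")
```

Because the number of combinations grows roughly with the square of the number of SNPs for pairs and with the cube for triples, even a modest reduction in the number of SNPs considered yields a large reduction in the number of tests.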
These issues demonstrate the need for a method to evaluate the metrics we develop from PPI expert knowledge and information we extract from these databases. Making use of data that are available and that have already been evaluated for interactions would aid this process. With access to data where the biological importance of the results and the interactions are known, we could determine if this same information is retained after applying expert knowledge. We may therefore be able to gauge the amount of information we gain or, on the other hand, lose by applying expert knowledge. Since we are also concerned with the efficiency of evaluating genome-wide studies, we may want to explore simulated large-scale datasets that are already embedded with known epistatic interactions, both physical and statistical. We would then be able to compare metrics we develop based on information from these databases. This will, hopefully, allow us to achieve the most meaningful results in the most efficient manner.
The field of proteomics is expanding with the availability of high-throughput methods to detect and characterize protein interactions, and there is continuing development of curated protein-interaction databases to provide the scientific community with access to this information. Meanwhile, in the field of human genetics, the availability of high-dimensional datasets from genome-wide studies is making it computationally expensive and impractical to carry out a genetic analysis study utilizing data-mining methods such as MDR. Since there is no indication that technological advancements in either field will come to a halt any time soon, the amount of valuable genetic and proteomic data produced will continue to grow.
We have proposed that expert knowledge extracted from protein-interaction databases may reduce the computational burden of large-scale and genome-wide studies, as well as facilitate the biological interpretation of the data. We foresee that, in the future, the methods that we propose will be applicable not only to SNP studies but also to studies involving other forms of genetic variation (e.g., copy number variation and sequence repeats) that may run into similar problems with large-scale data analysis. The exploitation of expert knowledge will eventually not only ease our computational burden but also aid in understanding the relationship between biological and statistical epistasis, as well as help us to better understand the relationship between the proteome and the genome.
To fully exploit the knowledge in PPI databases, we need to develop a logical method to evaluate the information in these databases and the metrics developed from this information in order to incorporate this type of expert knowledge into our analysis. While we do not expect this to be a simple task, success in similar endeavors has assured us that it is an important and worthwhile task that needs to be explored [10,58,59]. Once we are able to successfully develop these methods, not only will we improve the ease with which we will be able to identify important epistatic interactions in genome-wide studies, but we will gain an understanding of the physical biology that underlies these interactions and perhaps their role in a given disease. We expect these expert knowledge-based methods to enhance our comprehension of common human diseases and eventually lead to an improvement in the prevention, treatment and diagnosis of these diseases.
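One way to gauge what is gained and lost by a knowledge-based filter on simulated data is to measure how many embedded interactions survive the filter and how much the search space shrinks. The Python sketch below assumes a toy dataset of five SNPs with two embedded interacting pairs; the function name, the metrics and the data are illustrative assumptions only.

```python
from itertools import combinations


def evaluate_filter(all_pairs, kept_pairs, true_pairs):
    """Summarize how a filter treats known (simulated) interacting pairs."""
    all_pairs, kept_pairs, true_pairs = set(all_pairs), set(kept_pairs), set(true_pairs)
    retained = kept_pairs & true_pairs
    recall = len(retained) / len(true_pairs) if true_pairs else 0.0
    reduction = 1 - len(kept_pairs) / len(all_pairs) if all_pairs else 0.0
    return {
        "recall": recall,                 # fraction of known pairs retained
        "reduction": reduction,           # fraction of the search space removed
        "lost": true_pairs - kept_pairs,  # known pairs the filter discarded
    }


# Toy simulated dataset: five SNPs, two embedded interacting pairs.
snps = ["s1", "s2", "s3", "s4", "s5"]
all_pairs = list(combinations(snps, 2))
true_pairs = [("s1", "s2"), ("s3", "s4")]
kept_pairs = [("s1", "s2"), ("s1", "s3"), ("s2", "s5")]  # output of some filter

result = evaluate_filter(all_pairs, kept_pairs, true_pairs)
print(result)
```

Comparing these two quantities across candidate metrics would show which filters remove the most tests while discarding the fewest known interactions.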
Financial & competing interests disclosure
This work was supported by NIH grants LM009012, LM010098 and AI59694. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.
Kristine A Pattin, Computational Genetics Laboratory and Departments of Genetics, Dartmouth Medical School, Lebanon, NH, USA. Dartmouth-Hitchcock Medical Center, Norris Cotton Cancer Center, 1 Medical Center Drive, Lebanon, NH 03756, USA Fax: +1 603 653 9900 ; Email: email@example.com.
Jason H Moore, Computational Genetics Laboratory, Departments of Genetics, Community and Family Medicine, and Norris-Cotton Cancer Center, Dartmouth Medical School, Lebanon, NH, USA. Department of Computer Science, University of New Hampshire, Durham, NH, USA. Department of Computer Science, University of Vermont, Burlington, VT, USA. Translational Genomics Research Institute, Phoenix, AZ, USA. 706 Rubin Building, HB 7937, 1 Medical Center Drive, Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA Tel.: +1 603 653 9939, Fax: +1 603 653 9900 ; Email: firstname.lastname@example.org, www.epistasis.org.