Increasing knowledge about the organization of proteins into complexes, systems, and pathways has led to a flowering of theoretical approaches for exploiting this knowledge in order to better learn the functions of proteins and their roles underlying phenotypic traits and diseases. Much of this body of theory has been developed and tested in model organisms, relying on their relative simplicity and genetic and biochemical tractability to accelerate the research. In this review, we discuss several of the major approaches for computationally integrating proteomics and genomics observations into integrated protein networks, then applying guilt-by-association in these networks in order to identify genes underlying traits. Recent trends in this field include a rising appreciation of the modular network organization of proteins underlying traits or mutational phenotypes, and how to exploit such protein modularity using computational approaches related to the internet search algorithm PageRank. Many protein network-based predictions have recently been experimentally confirmed in yeast, worms, plants, and mice, and several successful approaches in model organisms have been directly translated to analyze human disease, with notable recent applications to glioma and breast cancer prognosis.
Model organisms have proven invaluable for better understanding protein function and interactions, both for enabling studies of single proteins via genetic and biochemical tractability, as well as for enabling global surveys of thousands of proteins. Large-scale maps of pair-wise protein interactions [1–3], protein complexes [4–7], genetic interactions [8,9], transcription factor-target interactions [10–13], protein localization , and other complementary datasets have accelerated the characterization of protein function on a proteome-wide scale. The bulk of these studies have occurred in yeast and the nematode C. elegans, but increasingly in Arabidopsis, mouse, and fly, leading the way for applications to human cell culture.
These rapidly accumulating large-scale biological data have necessitated a corresponding growth in theoretical methods for interpreting them. One major goal has been to translate such knowledge of proteomic organization into models capable of associating genetic changes with changes to measurable traits, phenotypes, and diseases. In principle, such models will help to better interpret the rapidly growing genotype-based and genome sequence-based data characterizing genetic variation among individuals, whether individual humans or individuals of another species altogether. Models exploiting proteomic organization in order to relate genetic changes to altered traits could, for example, guide the identification of new disease genes, of genes underlying susceptibility to infection, and of genes underlying major crop traits or many other naturally occurring traits with genetic components. Here, again, model organisms are leading the way: over the past few years, a variety of computational methods, most exploiting large-scale proteomic and genomic datasets, have begun to show strong advances in predicting the phenotypic consequences of mutations and in successfully identifying genes underlying mutational traits.
One powerful strategy has been to exploit the principle of guilt-by-association (GBA) in protein networks (reviewed in ). In this scheme, the function of a protein will more often than not resemble the functions of those proteins with which it interacts, or is co-expressed, or is co-localized, and so on. Thus, knowledge of a few proteins’ functions, in combination with large-scale maps of protein-protein associations, provides substantial traction for characterizing the portions of the proteome that are as yet poorly understood. Interestingly, this strategy, developed initially for inferring protein function , has recently proved to be a powerful approach for linking genes to phenotypic traits and diseases.
That such an approach might work can be seen intuitively from an examination of arguably one of the simplest “diseases”: lethality of a yeast cell following deletion of an essential gene. Comparisons of the components of yeast protein complexes, measured at large-scale using mass spectrometry [4–7], with genes known from systematic gene deletion experiments to be essential for growth in standard laboratory medium , have shown that proteins encoded by essential genes tend to co-occur in the same physical complexes [18,19] (Fig. 1a). There is a general trend for proteins in the same physical complex to be encoded either mostly by essential genes or mostly by non-essential genes; complexes are systematically depleted for intermediate mixtures of essential and non-essential genes  (Fig. 1b). Thus, essentiality appears to be a function of the complex — the intact molecular “machine” — rather than the individual gene. Recent observations have further shown that essential complexes tend to be larger than non-essential complexes [20,21], which provides a physical explanation for a long-standing observation about gene essentiality, that proteins encoded by essential genes tend to have more interaction partners than non-essential ones (the “centrality-lethality rule”) .
From a practical perspective, the observation that proteins in the same physical complexes tend to be linked to similar mutational phenotypes suggests that knowing physical complexes and a subset of genes linked to a trait, one could confidently predict additional genes relevant to that trait based upon their interaction partners. Indeed, this strategy works reasonably well [23–25]; even better predictive performance comes from considering broader biological pathways and functional associations, for which this trend also appears to hold, rather than considering physical complexes alone [24,26]. In fact, studies have shown that highly but transiently connected proteins (e.g. kinases) often play key roles in complex disease . Thus, a general consideration of functional associations, whether restricted to the same physical complex or not, appears to be a reasonable strategy for linking genes to traits. This general strategy – exploiting the tendency for genes underlying the same trait to encode functionally associated proteins – has proven generally applicable and has now been tested for a wide variety of traits and phenotypes, and even human diseases (e.g., [23–26,28], among others).
Just as for the initial collection of the underlying large-scale proteomic data, model organisms have served as a productive test-bed for these approaches. Here, we discuss several of the major approaches for computationally integrating proteomics and genomics observations into protein networks, then applying these networks in order to identify genes underlying traits. While much work has gone into identifying causal genes for traits, e.g. by association studies (reviewed in ) or integrating genotypic and genomic data (e.g., [30–32]), we primarily focus here on GBA methods, reviewing the basic approaches, discussing recent bioinformatics developments (such as a growing recognition of the relationships between these methods and Google’s web search algorithm PageRank), and presenting recent examples of experimental validation of these approaches at linking genes to traits in organisms ranging from yeast to mammals to plants.
For the purposes of GBA, protein networks can consist solely of direct measurements of protein-protein or genetic interactions, such as might come from a single large-scale yeast two-hybrid or mass spectrometry assay. Typically, though, many such measurements are first computationally integrated into a composite network. Diverse methods exist for constructing integrated protein networks from mixed proteomic and genomic data sets, and rapidly increasing amounts of data from high-throughput experiments have pushed this field to be very active. Networks are increasingly used for modeling a wide variety of biological relationships, often requiring complex construction strategies. Given that we wish to focus here on the value of the resultant networks for discovering protein function and relevance to traits, we provide only a brief high-level overview of the different strategies of network construction (Fig. 2). For more in-depth reviews of this field, see [33–37]; in some cases, step-by-step instructions are available (e.g., [26,38]).
Building a network begins with the selection of relevant data. Generally, any data which suggest relationships between pairs of proteins can be used. For example, proteins whose corresponding mRNAs exhibit correlated expression levels across different cellular conditions are often likely to be functionally associated [16,39]. Similarly, protein-protein interaction data, such as from mass spectrometry of purified complexes or yeast two-hybrid assays, can often provide strong support for proteins to function together. Such data might be used directly for inferring protein function or organization (e.g., as in Refs. [40,41]), but can also be logically combined with other types of proteomic or genomic data. Naturally, the motivation of the network guides the choice of data used. For example, in order to study host-pathogen interactions between humans and the bacterium H. pylori, Tyagi et al. used predictions of transmembrane protein-protein interactions and virulence factors . In another recent twist, Park et al. applied population-level disease patterns to study disease co-morbidity . For basic protein networks, a common strategy is to integrate a mixture of proteomics and genomics datasets in order to infer protein-protein associations, as the resultant networks tend to be more complete and robust [26,28,38,44–55]. In an early example of a large-scale integrative network, Troyanskaya et al. demonstrated how the combination of various data sources provides for better coverage and accuracy of predicted functional relationships .
The integration of multiple types of data may be approached from many avenues, with two general frameworks used most commonly: correlative and causal networks. In the former case, network edges are undirected and represent functional coupling between pairs of associated proteins. For example, such a network might summarize evidence for pairs of proteins to physically interact, but would not indicate which protein in such a pair is upstream in the biological pathway. Methods for constructing correlative networks include combining clustering coefficients of data of varying weights  and training support vector machines to identify co-complexed protein pairs . In naïve Bayesian networks, likelihood scores of proteins participating in the same pathway are calculated for each line of evidence and then combined as a weighted sum , providing a (weighted) Bayesian estimate for linked proteins to participate in the same cellular processes. Caveats to this approach include the reliance on current annotations (e.g., the Gene Ontology Consortium ) for training the networks. As annotations are often incomplete and may serve to propagate errors, there is a danger of introducing circularity. Nonetheless, many approaches may be taken to minimize this, such as using independent annotation test sets, benchmarking, and weighting of data (e.g., [30,44,45,61]). Importantly, networks constructed in this manner have proven strongly predictive for gene functions that lie outside of the annotated gene set, as discussed below (e.g., in the MouseFunc contest ).
Alternatively, in constructing causal networks describing causal relationships between genes, common strategies include ordering molecular events temporally or incorporating prior knowledge as to cause and effect (e.g., DNA mutations may alter a gene’s expression, but the reverse is unlikely). Zhu et al. demonstrated this technique by integrating transcription factor binding and expression QTL data into a probabilistic causal network in yeast . They first estimated joint probability distributions for various models of causality between loci and traits, then identified the most likely model which fit their observed data. Alternatively, Bonneau et al. used time-series DNA microarray measurements of the transcriptional response to different environmental perturbations to construct causal models of interactions between environmental factors and transcription factors . Though often computationally costly in construction, causal networks are of value for simplifying models of complex protein relationships, and in principle can identify early events in regulatory cascades, thereby guiding the selection of useful points of intervention for blocking such cascades (e.g., ). To accelerate progress in this field, an annual contest (the DREAM contest) is dedicated to testing and improving the algorithms for deriving regulatory gene networks . For a more in-depth discussion of causal networks, see [66–68]. Both correlative and causal models capture a wide variety of molecular interactions, ranging from stable physical interactions to transient interactions to genetic, non-physical interactions, all of which may be functionally relevant to the pathway of interest.
Much of the real value of gene and protein networks lies in their utility for elucidating protein function. Thus, much work has been devoted to forming accurate predictions of protein functions using network information and previously known protein functions, generally by propagating annotations across network edges. Given that all current functional annotations for proteins are incomplete, sometimes woefully so, and that the networks represent an attempt to objectively reconstruct functional relationships among proteins, propagating annotations across a network’s edges is often useful, suggesting new functions for under-annotated proteins. These functional assignments are rarely unambiguous. Instead, functional prediction methods typically produce a score or rank representing how likely each protein is to be involved in the function. In order to provide some intuition for the relative merits of such approaches, we next introduce several methods and compare their abilities to annotate proteins in correlative networks. We focus on a small number of methods which we consider to be distinct and interesting; more comprehensive reviews of methods are available (e.g., ).
One of the most straightforward approaches to predict protein function via a protein network is that of neighbor counting (NC) . In this approach, for a particular protein function or pathway, the proteins with the most neighbors associated with that function are themselves deemed the most likely to share that function. In a slightly more sophisticated variation on this method, naïve Bayes label propagation (NB), the sum of the network edge-weights to implicated neighbors is used, rather than the count of interactions [26,38,46]. This latter approach relies on network edge-weights that correspond to log likelihood scores for proteins to participate in the same process; thus, the sum of edge-weights to a set of genes of interest corresponds to a naïve Bayes estimate for a gene to also belong to the gene set of interest. NC and NB are both limited in that they only score direct neighbors of annotated proteins. However, as discussed later, these simple methods have made many experimentally verified predictions.
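The difference between NC and NB can be made concrete with a minimal sketch. The network, protein names, and edge weights below are made up for illustration, and the function names are hypothetical rather than taken from any published implementation; the edge weights stand in for the log likelihood scores described above.

```python
# Toy weighted network (hypothetical proteins, made-up edge weights).
edges = {
    ("A", "B"): 2.0,
    ("A", "C"): 0.5,
    ("B", "D"): 1.0,
    ("C", "D"): 3.0,
    ("D", "E"): 1.0,
}

def build_adjacency(edges):
    """Return a symmetric adjacency map {node: {neighbor: weight}}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def neighbor_counting(adj, seeds):
    """NC: count the direct neighbors of each node that are seed proteins."""
    return {n: sum(1 for m in nbrs if m in seeds) for n, nbrs in adj.items()}

def naive_bayes_sum(adj, seeds):
    """NB: sum the edge weights from each node to its seed neighbors."""
    return {n: sum(w for m, w in nbrs.items() if m in seeds)
            for n, nbrs in adj.items()}

adj = build_adjacency(edges)
seeds = {"A", "D"}  # proteins already annotated with the function of interest

nc = neighbor_counting(adj, seeds)
nb = naive_bayes_sum(adj, seeds)

# Rank the non-seed candidates by NB score, highest first.
ranked = sorted((n for n in adj if n not in seeds),
                key=lambda n: nb[n], reverse=True)
```

In this toy case NC cannot separate B from C (each has two seed neighbors), whereas NB ranks C first because its edges to the seeds carry more weight. Neither method assigns a nonzero score to a protein that lacks a direct seed neighbor.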
Alternatively, “network diffusion” methods have been developed in order to effectively diffuse information throughout a network, thus overcoming a major limitation of methods such as NC and NB that consider only direct associations. Here we consider two methods which diffuse information from a single node to directly and indirectly connected nodes over the course of several iterations. In the first, which we term iterative ranking (IR), the score for a protein to be linked to a particular function or phenotype consists of an initial score and the normalized scores or weighted “votes” of each neighbor (e.g., ). As scores are updated in successive iterations, information about proteins relevant to the function of interest propagates across network edges, “smearing” the initial functional assignments across the network. Another diffusion method we consider is Gaussian field label propagation, which we refer to here simply as Gaussian smoothing (GS). In GS the minimization of two distances is computed: the difference between a protein’s initial and final scores and the weighted score difference between the protein and each neighbor . Box 1 and Fig. 3 present more detailed explanations of network diffusion algorithms, and should provide some intuition regarding their uses.
Diffusion algorithms are common in the growing body of work applying GBA in functional networks to predict protein function. Generally, network-based prediction algorithms utilize two pieces of information: f0, an initial vector of scores representing each protein's prior known association with a function, and W, the network topology matrix. The initial scores are propagated throughout the network, resulting in a set of final scores indicating each protein's predicted association to a function. The basic format for a diffusion algorithm, with a weighting parameter α between 0 and 1, is
$$f = \alpha X + (1 - \alpha) Y$$
where the vector of scores f is a combination of X, which contains initial scores for nodes, and Y, which contains information on the network topology. This convex combination allows the two components to be weighted differentially, along with some other mathematical conveniences. In contrast, in simpler neighbor counting and naïve Bayes methods, where a protein's score is the count or sum of edge-weights to seeds, the two components are combined (Table 1, Fig. 3b). Diffusion algorithms are advantageous because they assign scores to indirectly connected nodes. Additionally, the final scores are often readily computed. Below we describe two diffusion methods that have been successfully employed in various studies.
The iterative ranking (IR) algorithm was first developed in 1941 to model the input–output flows of economic industries by Nobel Prize winner Wassily Leontief . It was more recently popularized by Larry Page and Sergey Brin as a method (PageRank) to rank internet search query results using the link topology of the web . With minor adjustments, IR has been applied to numerous biological problems, including prioritizing functionally associated proteins [97–100], identifying protein clusters [101,102], identifying genes responsible for adverse drug reactions , and improving protein identification in high-throughput methods [71,104]. In the context of predicting protein function, the IR score of a protein is the combination of the initial seeds and the weighted average of IR scores of the protein's neighbors. Since each protein's score depends on that of its neighbors, the computation is iterative:
$$f_{t+1} = \alpha f_0 + (1 - \alpha)\, U f_t$$
where ft are the scores at time t and U is the matrix of normalized network edges. The final scores are obtained when f stabilizes to within some threshold, or as the solution to the linear equation (Table 1, Fig. 3c). The type of network edge normalization performed is dependent on the application. In ranking internet search results, where each edge is equal weight, a page which points to a multitude of other sites, such as a home page, is likely unspecific in topic. Thus, each edge is normalized by the total number of outgoing edges from the node. In predicting protein function, where network edges are usually weighted, we wish to normalize each neighbor's contribution to a node's score. Thus, edges here are normalized by the total weight of incoming edges to the node. A more in-depth description of normalization differences is available .
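As a minimal sketch of the IR update, the following code iterates f ← αf0 + (1 − α)Uf on a toy three-protein chain until the scores stabilize. It assumes the normalization described above for protein networks (each neighbor's contribution is divided by the total weight of edges at the receiving node); the function name, the value α = 0.5, and the network itself are illustrative assumptions.

```python
def iterative_ranking(adj, f0, alpha=0.5, tol=1e-12, max_iter=1000):
    """Iterate f <- alpha*f0 + (1-alpha)*U f, where U averages neighbor
    scores weighted by edge weight (normalized per receiving node)."""
    f = dict(f0)
    for _ in range(max_iter):
        new_f = {}
        for i, nbrs in adj.items():
            total = sum(nbrs.values())
            # Weighted average of the neighbors' current scores.
            avg = sum(w * f[j] for j, w in nbrs.items()) / total if total else 0.0
            new_f[i] = alpha * f0[i] + (1 - alpha) * avg
        delta = max(abs(new_f[i] - f[i]) for i in adj)
        f = new_f
        if delta < tol:  # scores have stabilized
            break
    return f

# Toy chain A - B - C with unit edge weights; only A is a seed.
adj = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0},
}
f0 = {"A": 1.0, "B": 0.0, "C": 0.0}

ir_scores = iterative_ranking(adj, f0, alpha=0.5)
```

Protein C has no direct edge to the seed, yet it receives a nonzero score through B, which is the behavior that separates diffusion methods from NC and NB. For this chain the fixed point can be checked by hand: the scores for A, B, and C converge to 7/12, 1/6, and 1/12.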
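The Gaussian field label propagation (GS) algorithm  minimizes the Euclidean distance between (1) the initial and final scores of a protein and (2) a protein's score and that of each of its neighbors:
$$f = \arg\min_{f} \left[ \alpha \sum_i \left(f_i - f_{0,i}\right)^2 + (1 - \alpha)\, \frac{1}{2} \sum_{i,j} w_{ij} \left(f_i - f_j\right)^2 \right]$$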
where wij is the edge weight between protein i and its neighbor j. This can be derived from the assumptions that the error between initial and final scores f – f0 is normally distributed, f follows a multivariate normal distribution, and the covariance matrix Σ is equivalent to the inverse graph Laplacian matrix:
$$\Sigma = (D - W)^{-1}, \qquad D_{ii} = \sum_j w_{ij}, \quad D_{ij} = 0 \ \ (i \neq j)$$
Network edge normalization may also be useful when implementing this algorithm. The authors of GS normalize each edge by the square root of the sum of incoming edges and the square root of the sum of outgoing edges for each node . Similar to IR, the GS score of a protein depends on an initial score and the scores of neighboring proteins. The solution to this minimization problem also reduces to a linear equation (Table 1, Fig. 3d).
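The following sketch illustrates how a GS-style score can be computed. It assumes the objective α(f − f0)ᵀ(f − f0) + (1 − α)fᵀ(D − W)f with an unnormalized Laplacian; the weighting by α, the absence of edge normalization, and the function name are assumptions made for illustration and may differ from the published method. Setting the gradient to zero gives the linear system [αI + (1 − α)(D − W)]f = αf0, which is solved here by Jacobi iteration so that no linear algebra library is needed.

```python
def gaussian_smoothing(adj, f0, alpha=0.5, tol=1e-12, max_iter=10000):
    """Solve [alpha*I + (1-alpha)*(D - W)] f = alpha*f0 by Jacobi iteration.

    The system matrix is strictly diagonally dominant for alpha > 0,
    so the iteration converges.
    """
    f = dict(f0)
    for _ in range(max_iter):
        new_f = {}
        for i, nbrs in adj.items():
            degree = sum(nbrs.values())                     # D_ii
            pull = sum(w * f[j] for j, w in nbrs.items())   # (W f)_i
            new_f[i] = (alpha * f0[i] + (1 - alpha) * pull) / (
                alpha + (1 - alpha) * degree)
        delta = max(abs(new_f[i] - f[i]) for i in adj)
        f = new_f
        if delta < tol:
            break
    return f

# Toy chain A - B - C with unit edge weights; only A is a seed.
adj = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0},
}
f0 = {"A": 1.0, "B": 0.0, "C": 0.0}

gs_scores = gaussian_smoothing(adj, f0, alpha=0.5)
```

Under these assumptions the toy chain has the exact solution 0.625, 0.25, and 0.125 for A, B, and C. As with IR, the indirectly connected protein C receives a nonzero score.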
|Algorithm|X term|Y term|U|f_final|
|IR|f_0|U f| |α[I − (1 − α)U]^−1 f_0|
|GS|(f − f_0)^T (f − f_0)|f^T U f|D − W, with D_ii = Σ_j w_ij and D_ij = 0 for i ≠ j| |
A wide variety of approaches can be imagined for propagating information across a network, and many such approaches, often initially developed in fields outside of biology, are proving useful for linking proteins to functions or traits. For example, Markov clustering (MCL) groups nodes based on simulation of stochastic flow in the network . This method was originally applied to predict protein families based on sequence similarity. While this method is useful for identifying clusters of functionally related proteins, it does not directly identify proteins of a particular function. Instead, it identifies clusters containing proteins of interest, but does not rank candidate genes within each cluster. However, proteins can then be prioritized by a variety of other approaches, such as by considering the sum of a protein’s edge-weights within a cluster relative to all of its edge-weights, with larger sums indicating more relevance to the functions captured by that cluster.
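The within-cluster prioritization mentioned above can be sketched as follows. The cluster is assumed to have been produced by a clustering step such as MCL, which is not implemented here; the network, the cluster membership, and the function name are hypothetical.

```python
def within_cluster_fraction(adj, cluster):
    """For each protein in a cluster, return the fraction of its total
    edge weight that stays inside the cluster."""
    scores = {}
    for node in cluster:
        total = sum(adj[node].values())
        inside = sum(w for nbr, w in adj[node].items() if nbr in cluster)
        scores[node] = inside / total if total else 0.0
    return scores

# Toy network: A, B, C form a tight group; C also links out to D.
adj = {
    "A": {"B": 2.0, "C": 1.0},
    "B": {"A": 2.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0, "E": 3.0},
    "E": {"D": 3.0},
}
cluster = {"A", "B", "C"}  # assumed output of a clustering step

cluster_scores = within_cluster_fraction(adj, cluster)
```

Here A and B score 1.0 because all of their edge weight lies within the cluster, while C scores 2/3 because one third of its edge weight leaves it, so C would be ranked as the least specific member.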
Finally, another interesting method analyzes the flow through a network using concepts from electric circuit analysis . In this circuit-based method (CB), the protein network is represented by an electrical circuit, where edge-weights are analogous to conductance (1/resistance) and implicated proteins are assigned as ground nodes. A current is simultaneously applied to each protein, and the nodes emerging with the highest current flowing through are predicted to be most likely to be associated with the ground nodes.
Because strongly connected nodes in a functional protein network are likely to work together in the same biological processes, they are also likely to share similar loss-of-function phenotypes. This can be demonstrated using correlative functional networks available for C. elegans  and S. cerevisiae  along with 318 RNAi phenotype gene sets available from WormBase , 100 loss-of-function phenotype sets from McGary et al. , and statistics of 282 morphological parameters for 4718 yeast gene deletion mutants from the Saccharomyces Cerevisiae Morphological Database (SCMD) . In order to analyze the quantitative data from SCMD, we assigned the genes corresponding to the 40 largest and smallest values for each morphological feature as phenotype sets, resulting in 564 total sets. Fig. 4 illustrates the relative performance of the various algorithms discussed above at identifying genes underlying traits, focusing on RNAi knockdowns in C. elegans (Fig. 4b) and loss-of-function mutational phenotypes (Fig. 4c) and morphological phenotypes (Fig. 4d) in yeast. A standard strategy for evaluating such algorithms is to perform 10-fold cross-validation, separating known examples into distinct training and test sets of proteins. Using this approach, for each phenotype, we calculated the true-positive rates (TPR) and false-positive rates (FPR) as a function of a method’s score or rank and plotted the corresponding ROC curve. Fig. 4a provides an example ROC curve illustrating the predictive ability of each method for correctly identifying genes responsible for abnormal locomotion of C. elegans following RNAi knockdown. The area under a ROC curve (AUC) provides a convenient summary of a method’s predictive ability on that phenotype; a curve along the diagonal line and an AUC near 0.5 has no predictive ability, while one pushed to the top left of the plot with an AUC closer to 1 has strong predictive ability.
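To make the evaluation step concrete, the sketch below computes an AUC directly from a set of scores and a set of held-out positives. It uses the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive outranks a randomly chosen negative, with ties counted as one half. The gene names and scores are made up, and the function name is hypothetical.

```python
def roc_auc(scores, positives):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive gene has the higher score; ties count as one half."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up scores for five candidate genes in one cross-validation fold.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.4, "g4": 0.3, "g5": 0.1}
held_out = {"g1", "g3"}  # genes withheld from the training seeds

auc = roc_auc(scores, held_out)
```

In this example five of the six positive-negative pairs are ordered correctly, giving an AUC of 5/6. In a cross-validation setting, the same calculation would be repeated for each fold, with the fold's withheld genes as the positives.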
The overall performance of each algorithm at predicting loss-of-function phenotypes in worm and yeast is shown as distributions of AUC values in Fig. 4b–d. NC and NB methods perform quite similarly, presumably because a minimum edge weight threshold applied in network construction [26,46] causes the NC method to return similar rankings to the NB method. The MCL and CB methods, originally developed for different purposes, did not adapt well to the task of phenotype prediction. However, MCL performance would most likely improve with further refinements to the ranking of proteins within clusters and to the optimization of clustering parameters. Overall, the two diffusion methods outperform the others by a notable margin. These analyses also suggest some guidelines for effective use. In some cases, the diffusion methods perform poorly for small FPR and extremely well for higher FPR. Therefore, when choosing a predictive algorithm, the false-positive cost for the particular experiment should be considered; NC or NB methods are appropriate when false-positives are costly and diffusion methods suit cases where a more exhaustive set of predictions is desired. Notably, the relative performance of each method was robust to choice of organism and test set. The tests in C. elegans yielded higher average AUCs and a stronger performance boost from diffusion methods. These methods can also be further adapted to predict quantitative gene-pathway based traits (e.g., predicting quantitative yeast phenotypes with a modified NB method ). Finally, a general caveat is merited: we have observed that the various network label propagation methods tend to perform differently for different applications, and it is often advisable to test several to see which performs best for a particular test of interest.
While much work on the computational analysis of gene networks has relied on computational tests, such as the cross-validation employed above, the last few years have seen increasing direct experimental validation of network-based predictions of protein functions and involvement in phenotypes. Here again, model organisms have proven invaluable for enabling rapid in vivo tests of the validity of these methods. In some studies, simple loss-of-function experiments have confirmed very striking phenotypes. For example, protein networks have successfully predicted genes whose RNAi knockdown suppresses the loss of the retinoblastoma tumor suppressor, validated in C. elegans (Fig. 5a) . In A. thaliana, novel regulators of drought sensitivity and lateral root development were discovered using NB predictions (Fig. 5b) . Similarly, in E. coli, many proteins were predicted to play roles in cell envelope biogenesis. Cells in which these proteins were deleted exhibited differential sensitivity to peptidoglycan assembly inhibitors . In another study, Qi et al. constructed a yeast synthetic lethal genetic interaction network in order to predict pathway memberships and genetic interactions. Using a diffusion method similar to IR (described above), they identified and confirmed 18 novel genetic interactions for the transcriptional cofactor Ada2 and 20 for Esa1, a subunit of the histone acetyltransferase complex .
Beyond the simple scheme of mapping single genes to single phenotypes is the goal of understanding how complex phenotypes arise. For example, a time-dependent yeast protein interaction network revealed the role of Cdk1 in protein complex formation throughout the cell cycle . In a separate study employing an integrative yeast network, Hess et al. confirmed 140 of 235 predicted mitochondrial biogenesis genes (one such example is reprinted in Fig. 5c) . In another study, a tissue-specific functional interaction network was constructed in order to study tissue-specific regulation patterns in worm. Several genes predicted to be expressed in the hypodermis, muscle, or neurons were confirmed using promoter-GFP constructs (Fig. 5e) . Similarly, using a co-expression gene network, Ghazalpour et al. predicted factors which influence the body weight of mice . The addition of non-proteomics datasets, including genome sequence or genotype data, has often extended the scope of studies considerably. Applying genome-wide association data to a yeast functional network facilitated the identification of thousands of genetic interactions between protein complexes , and a model of protein dosage sensitivity . Using an extensive collection of chemical genetics datasets, Venancio et al. built a chemical-protein complex network and identified potential interactions between drugs and protein complexes .
Efforts to exploit proteomics and genomics data in order to better annotate model organism genes recently culminated in an international contest, MouseFunc, to annotate mouse genes with Gene Ontology (GO) annotations. Nine teams from around the world independently developed computational methods to predict gene function from a large collection of M. musculus data . The data included protein sequence pattern annotations, experimentally determined protein-protein interactions, mRNA expression across multiple tissues, gene-phenotype associations, disease associations of human orthologs, and phylogenetic distributions of mouse genes. Each team predicted blinded GO annotations for mouse genes, and was assessed on withheld annotations and annotations newly identified since the start of the contest. The strengths and weaknesses of each algorithm were assessed, and several methods emerged as strong performers. For example, the Gaussian smoothing algorithm GeneMANIA (Box 1, ) performed well. Performance using a computational approach known as support vector machines, combined with GO annotations over a Bayesian framework, was robust to the number of genes in the test set . Finally, Funckenstein, composed of two methods combined by logistic regression, produced high precision predictions for a wide range of GO annotations . This method uses guilt-by-association in gene networks in addition to guilt-by-profiling, which exploits the correlation between gene function and other gene characteristics. The ultimate result of this contest was a unified set of predictions over all teams’ approaches that averaged 41% precision over all GO annotations. Moreover, 26% of GO terms achieved a precision > 90%. Many new predictions emerged for 5000 previously uncharacterized genes. Predictions for one of these, the gene Fuz, implicated in vertebrate birth defects, were recently confirmed experimentally in transgenic mice and knockdown experiments in frogs (Fig. 5d) .
Overall, the work in integration of large-scale data sets and functional prediction in model organisms builds toward the ultimate goal of understanding complex human traits and phenotypes. Linding et al. approached this goal on a signaling level and developed an in vivo phosphorylation network, modeling kinase and phosphoprotein relationships . They identified substrates of kinases previously overlooked by motif-based methods alone, including those of ATM (a primary regulator of DNA damage response) and CDK1 (a driver of cell cycle progression). On a different level, genes or proteins associated with human diseases can be predicted through the network propagation methods discussed earlier, and such methods are increasingly being applied to human protein networks. For example, Ostlund et al. recently discovered genes with abnormally high connectivity to cancer genes and defined a novel method for ranking these new cancer gene candidates . Other studies have focused on specific diseases and understanding certain properties of interest. One commonly used approach is to identify network characteristics which map to such properties. For example, Huttenhower et al. first build small networks of biological relationships between genes, then extract disease level information from these models . Sun et al. identified modules of genes which they predicted to be responsible for metastasis of oral cavity tumors .
In two striking recent examples, the construction of gene interaction networks has led directly to predictions in patient prognosis. First, Carro et al. built a transcriptional regulatory network which models glioma cancer cell transition into an aberrant mesenchymal phenotype . Using gene expression profiles and array comparative genomic hybridization of 76 high-grade gliomas, they were able to infer C/EBPβ and STAT3 to be transcription factors responsible for initiating and regulating mesenchymal transformation. As a validation of their model, they found that patients with tumors double-positive for C/EBPβ and STAT3 were associated with worse clinical outcome than patients with either single- or double-negative tumors (Fig. 6a). Second, Taylor et al. studied breast cancer patient outcome by analyzing the hub proteins in a human protein interaction network in the context of genome-wide expression data in 79 human tissues . They identified two classes of protein network hubs: intermodular hubs, which display low correlation of co-expression with neighbors, and intramodular hubs, which display high correlation of co-expression with neighbors. Mutations of intermodular hubs were more strongly associated with cancer phenotypes. Using a cohort of breast cancer patients, they defined correlation of co-expression signatures corresponding to good and poor prognosis patients. The model strongly predicts patient outcome, as demonstrated in the Kaplan-Meier survival curves (Fig. 6b). On a broader level, patient records have recently been integrated into models in order to build disease networks [43,94]. These networks elucidate disease-disease relationships and offer insight into disease progression and co-morbidity, and it is reasonable to expect that such models can be usefully integrated with protein association networks to better characterize the genetic basis for human diseases.
The preceding network models and network methods reveal just small portions of the black box of interactions underlying complex phenotypes, but models that integrate multiple types of experimental datasets (e.g., combinations of proteomics and gene expression data) clearly perform best. Importantly, the wide availability of proteomics and genomics data is significantly boosting the ability to link genes to traits. These methods, proven initially in model organisms, are beginning to show utility for human diseases, in spite of a still significant lack of human proteomics data. As a result, most models are verified using high-throughput experiments on model organisms or cell lines, various annotation databases, and occasionally cohort data. Given reasonable expectations for technology developments in proteomics and genome and transcript sequencing to improve data quality and reduce the cost barriers to producing data, it seems safe to expect that the proteomics efforts proven in model organisms will increasingly translate into human studies, dramatically improving our ability to link genes to human diseases and traits.
We thank Martin Blom and Smriti Ramakrishnan for helpful discussions. This work was supported by grants from the Texas Advanced Research Program, the N.S.F., N.I.H., the Welch Foundation (F-1515), and a Packard Fellowship. The SCMD database has been provided freely by the University of Tokyo for use in this publication/correspondence only.