Normal physiology, disease processes, and response to drug treatment all involve complex interactions among genes and between genes and environmental factors. New high-throughput functional genomics technologies such as gene expression microarrays provide an enormous amount of data on how genes respond to genetic and environmental perturbations. Network and pathway methods, in which nodes represent genes and edges (links) between two nodes indicate a relationship between the corresponding genes, provide a useful framework for extracting and organizing information from such data. One of the primary aims in reconstructing reliable gene networks is to predict which genes respond directly to a stimulus (primary gene changes), as opposed to those genes that respond to changes in the primary genes (secondary gene changes).
Some network reconstruction methods are based on pairwise relationships among genes, while others, such as Bayesian network reconstruction methods, explicitly examine interactions involving more than two genes, attempting to separate direct from indirect influences. Depending on the method, an edge between two genes may indicate that the corresponding expression traits are correlated in a population of interest, or it may indicate that changes in the activity of one gene led to changes in the activity of the other gene [1]. Ideally, a network will allow us to predict the system's response (or the probability of various responses) to any given perturbation.
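The distinction between pairwise and multi-gene methods can be illustrated with a toy example. The sketch below is not one of the reconstruction methods discussed here; it uses partial correlation as a simple stand-in for the idea of separating direct from indirect influences. It assumes a hypothetical three-gene linear chain (g1 drives g2, g2 drives g3) with Gaussian noise; the function names, effect sizes, and sample size are all illustrative choices.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate_chain(n=2000, seed=1):
    """Hypothetical chain g1 -> g2 -> g3 with unit effects and unit noise."""
    rng = random.Random(seed)
    g1 = [rng.gauss(0, 1) for _ in range(n)]
    g2 = [a + rng.gauss(0, 1) for a in g1]
    g3 = [b + rng.gauss(0, 1) for b in g2]
    return g1, g2, g3


if __name__ == "__main__":
    g1, g2, g3 = simulate_chain()
    # A pairwise method would see a clear g1-g3 association ...
    print("corr(g1, g3)           =", round(pearson(g1, g3), 3))
    # ... which largely disappears once the intermediate gene is accounted for.
    print("partial corr given g2  =", round(partial_corr(g1, g3, g2), 3))
```

In this toy setting a purely pairwise method would draw an edge between g1 and g3, whereas conditioning on g2 suggests that the association is indirect.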
Here we represent biological networks of genes as Bayesian networks [2], which have successfully represented some biological systems [3]. The edges in Bayesian networks have direction, and the topology of a Bayesian network defines conditional independence relationships among the nodes. That is, given the states of the parent nodes (the nodes with edges that point to a node of interest), one can predict (probabilistically) the state of the node of interest. Cycles (paths that return to a starting node) are not allowed, meaning that certain types of feedback cannot be represented by Bayesian networks. Ideally, we would like to find the network that best explains the observed data, in the sense of maximizing a probability function on the network given the data (see Methods), but this presents several problems. First, the number of possible networks grows super-exponentially with the number of genes under consideration. This makes it impossible to examine all possible networks, so heuristic searches are used. Second, even if we could examine all possible networks, we face an underdetermined problem: the number of samples available in most microarray experiments is much smaller (often orders of magnitude smaller) than the number of genes, so many networks explain the observed data equally well. In particular, because Bayesian networks represent multivariate probability distributions (see Methods), the direction of many of the edges in such networks can be changed without affecting how well the model fits the data (Markov equivalence). Thus, both the data and the reconstruction method limit our ability to make inferences about causal relations among genes.
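To give a sense of how quickly the search space grows, the number of directed acyclic graphs on n labeled nodes can be computed with Robinson's recurrence, a standard combinatorial result that is not part of the methods described here. The sketch below is illustrative only; the function name `count_dags` is our own.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of directed acyclic graphs on n labeled nodes.

    Robinson's recurrence:
        a(0) = 1
        a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    for genes in (2, 3, 4, 5, 10):
        print(genes, "genes:", count_dags(genes), "candidate networks")
```

Two genes admit 3 acyclic networks, three genes 25, and four genes 543; by ten genes the count already exceeds 10^18, which is why exhaustive enumeration is not an option for realistic gene sets.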
These limitations raise the question of whether and how network reconstruction can be improved by including other types of data. In segregating populations arising naturally or from experimental crosses, genetic information (e.g., genotypic data) can provide important information about which genes interact and can identify the relationships among interacting genes [5]. Different alleles for a given gene are often associated with systematic differences in transcript abundances for the gene, as has been shown in several species [5]. In the context of segregating populations, significant differences between allele-specific transcript levels can be detected as expression quantitative trait loci (eQTL) [9]. Gene expression traits driven by common eQTL provide the structural information needed to identify which genes are likely to influence other genes, and this information can be used to bias the search for relationships among gene expression traits and between gene expression and other complex traits [1]. Importantly, the genetic data provide information as to which of a pair of interacting genes is causal (upstream) and which is reactive (downstream). Therefore, links in the reconstructed networks that are based on genotypic data have much stronger indications of causality than links based only on correlation information.
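The way genotypic data can resolve edge direction can be seen in a toy simulation. The sketch below is not the reconstruction procedure used in this work; it is a minimal illustration under strong assumptions: a single hypothetical locus L with genotypes 0/1/2 in F2-like proportions, a true chain L -> A -> B with linear effects and Gaussian noise, and models scored by their maximized Gaussian log-likelihood. All names, effect sizes, and the sample size are illustrative.

```python
import math
import random


def residual_variance(y, x=None):
    """ML residual variance of y, optionally after linear regression on x."""
    n = len(y)
    my = sum(y) / n
    if x is None:
        return sum((v - my) ** 2 for v in y) / n
    mx = sum(x) / n
    sxx = sum((u - mx) ** 2 for u in x)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    slope = sxy / sxx
    return sum((v - my - slope * (u - mx)) ** 2 for u, v in zip(x, y)) / n


def loglik(y, x=None):
    """Maximized Gaussian log-likelihood of y (given parent x, if any)."""
    n = len(y)
    return -0.5 * n * (math.log(2 * math.pi * residual_variance(y, x)) + 1)


def simulate_cross(n=500, seed=0):
    """Hypothetical data from the true chain L -> A -> B."""
    rng = random.Random(seed)
    L = [rng.choice((0, 1, 1, 2)) for _ in range(n)]  # 1:2:1 genotype ratio
    A = [g + rng.gauss(0, 1) for g in L]              # expression of gene A
    B = [a + rng.gauss(0, 1) for a in A]              # expression of gene B
    return L, A, B


def compare_directions(n=500, seed=0):
    """Score both edge directions with and without the genotype node."""
    L, A, B = simulate_cross(n, seed)
    expr_only = {
        "A->B": loglik(A) + loglik(B, A),
        "B->A": loglik(B) + loglik(A, B),
    }
    # The marginal term for L is identical in both models, so it is omitted.
    with_genotype = {
        "L->A->B": loglik(A, L) + loglik(B, A),
        "L->B->A": loglik(B, L) + loglik(A, B),
    }
    return expr_only, with_genotype


if __name__ == "__main__":
    expr_only, with_genotype = compare_directions()
    print("expression only:", expr_only)
    print("with genotype:  ", with_genotype)
```

With expression data alone, the two directions are Markov equivalent and receive the same score. Because the genotype can act only as a parent (expression does not alter germline DNA), adding it breaks the tie, and in this simulation the model with the locus upstream of the truly causal gene scores higher.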
We have previously demonstrated that a network reconstructed using both gene expression and genetic information allows better prediction of the effect of experimental perturbation of a particular gene [5] than a network reconstructed using gene expression alone. Here we more formally assess the utility of integrating genotypic data to reconstruct gene networks by simulating genetic and gene expression data from biologically realistic networks and by quantifying the improvement in network reconstruction achieved using the combined data, compared with reconstruction using gene expression data alone. By reconstructing networks based on simulated datasets in which the number of samples was allowed to vary, we are able to estimate the incremental benefit of collecting more samples as well as the benefit of incorporating genotypic data. We conclude that our integrative genomics approach to reconstructing networks not only leads to more predictive network models, but may also save time and money by decreasing the amount of data that must be generated under any given condition of interest to achieve a desired level of accuracy.