A second approach to modeling biological networks also holds tremendous potential for advancing knowledge of biology – namely, statistical influence networks. The growing amounts of microarray gene expression data along with improvements in data fidelity are now making it possible to make robust statistical systems-level inferences about the structure and dynamics of biomolecular control mechanisms, such as transcriptional regulatory networks.
Many approaches attempt to infer relationships between gene expression measurements using deterministic or stochastic formalisms. The fundamental idea behind these approaches is that models that faithfully capture such relationships can predict system behavior and can be used to gain insight into system-wide properties, such as steady-state behavior or responses to perturbations or specific stimuli. Such relationships can be represented in a number of ways, in both the discrete and continuous domains.
One popular modeling approach that captures the nonlinear multivariate relationships exhibited by biological control circuits, such as gene regulatory networks, is the class of Boolean networks, which owes its inception to the work of Stuart Kauffman in the late 1960s [30]. In the Boolean network model, the variables (e.g., genes or proteins) are binary-valued, meaning that their states can be either on or off, and the relationships between the variables are captured by Boolean functions. Each (target) gene is assigned a Boolean rule that determines its value as a function of the values of a set of other (predictor) genes, possibly including the target gene itself. System dynamics are generated by updating the Boolean functions, either synchronously or asynchronously, causing the system to transition from state to state in accordance with its Boolean update rules, where a state is a binary representation of the activities of all of the variables in the system (i.e., a binary vector representing the genes that are on or off at any given time).
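The synchronous update scheme described above can be sketched in a few lines of Python. The three-gene network and its rules below are hypothetical, chosen only for illustration and not taken from any of the cited models. The sketch enumerates all states and reports the attractors (fixed points or cycles) that the dynamics settle into.

```python
from itertools import product

# Hypothetical Boolean rules for three genes A, B, C (illustrative only).
RULES = {
    "A": lambda s: s["B"] and not s["C"],  # A is on if B is on and C is off
    "B": lambda s: s["A"],                 # B copies A
    "C": lambda s: s["A"] or s["C"],       # C is switched on by A and then stays on
}
GENES = sorted(RULES)  # fixed gene order: A, B, C


def step(state):
    """Synchronous update: every rule reads the same current state."""
    current = dict(zip(GENES, state))
    return tuple(int(bool(RULES[g](current))) for g in GENES)


def attractor(state):
    """Iterate from `state` until a state repeats; return the cycle that is reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]


def all_attractors():
    """Collect the distinct attractors reached from every possible initial state."""
    found = set()
    for start in product((0, 1), repeat=len(GENES)):
        cycle = attractor(start)
        i = cycle.index(min(cycle))  # rotate so each cycle has one canonical form
        found.add(tuple(cycle[i:] + cycle[:i]))
    return found


if __name__ == "__main__":
    for cycle in sorted(all_attractors()):
        print(cycle)
```

In this toy network every trajectory ends in one of two fixed points. Exhaustive enumeration is only feasible for small networks, since the state space grows as 2^n in the number of genes.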
Boolean network models have been constructed and analyzed for a number of developmental and physiological processes. For example, Albert et al. constructed a Boolean network model for a subset of genes of the fruit fly Drosophila melanogaster, which describes different stable gene expression patterns in the segmentation process of the developing embryo [31]. The steady-state behavior of this model was in excellent agreement with experimentally observed expression patterns under wild type and several gene mutation conditions. This study highlighted the importance of the network topology in determining biologically correct asymptotic states of the system. Indeed, when the segment polarity gene control network was modeled with more detailed kinetic models, such as systems of nonlinear differential equations, exceptional robustness to changes in the kinetic parameters was observed [32].
Boolean networks have also been used to model the yeast and mammalian cell cycle [33]. Li et al. demonstrated that the cell cycle sequence of protein states, which is a globally attracting trajectory of the dynamics, is extremely robust with respect to small perturbations to the network. The Boolean network formalism was also recently used to model systems-level regulation of the host immune response, which resulted in experimentally validated predictions regarding cytokine regulation and the effects of perturbations [35]. Boolean rules can be learned from gene expression data using methods from computational learning theory [36] and statistical signal processing [37].
A limitation of the Boolean network approach is its determinism. Because gene expression is inherently stochastic, and because measurements are uncertain owing to experimental noise and possible interacting latent variables (e.g., protein concentrations or activation states that are not measured), the inference of a single deterministic function may result in poor predictive accuracy, particularly when the sample size (e.g., the number of microarrays) is small relative to the number of genes.
One approach to “absorb” this uncertainty is to infer a number of simple functions (having few variables), each of which performs relatively well, and probabilistically synthesize them into a stochastic model, called a probabilistic Boolean network (PBN) [38]. The contribution of each function is proportional to its determinative potential as captured by statistical measures such as the coefficient of determination, which are estimated from the data [37]. The dynamical behavior of PBNs can be studied using the theory of Markov chains, which allows the determination of steady-state behavior as well as systematic intervention and control strategies designed to alter system behavior in a specified manner [39]. The PBN formalism has been used to construct networks in the context of several cancer studies, including glioma [42], melanoma [41], and leukemia [40]. PBNs, which are stochastic rule-based models, bear a close relationship to dynamic Bayesian networks [43], a popular model class for representing the dynamics of gene expression.
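To make the Markov chain view of a PBN concrete, the following minimal sketch builds the state transition matrix of a hypothetical two-gene PBN and estimates its steady-state distribution by power iteration. Everything specific here is an assumption made for illustration: the candidate functions, the selection probabilities (which stand in for normalized coefficients of determination), and the simplified per-gene random perturbation `P_FLIP`.

```python
from itertools import product

N = 2  # number of genes
# Hypothetical PBN: for each gene, a list of (Boolean function, selection probability).
# The probabilities are made up; in practice they would be estimated from data.
PREDICTORS = [
    [(lambda x: x[1], 0.7), (lambda x: x[0], 0.3)],  # gene 0: two candidate rules
    [(lambda x: 1 - x[0], 1.0)],                     # gene 1: a single rule (NOT gene 0)
]
P_FLIP = 0.01  # assumed probability that a gene is randomly flipped after each update
STATES = list(product((0, 1), repeat=N))


def transition_matrix():
    """Return T where T[i][j] is the probability of moving from STATES[i] to STATES[j]."""
    index = {s: i for i, s in enumerate(STATES)}
    T = [[0.0] * len(STATES) for _ in STATES]
    for s in STATES:
        # Enumerate every joint choice of one function per gene.
        for choice in product(*PREDICTORS):
            prob = 1.0
            nxt = []
            for func, weight in choice:
                prob *= weight
                nxt.append(int(func(s)))
            # Spread the probability over all outcomes of the random perturbation.
            for t in STATES:
                flips = sum(a != b for a, b in zip(nxt, t))
                T[index[s]][index[t]] += prob * P_FLIP**flips * (1 - P_FLIP) ** (N - flips)
    return T


def steady_state(T, iters=2000):
    """Approximate the stationary distribution by repeatedly applying T."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


if __name__ == "__main__":
    for state, p in zip(STATES, steady_state(transition_matrix())):
        print(state, round(p, 4))
```

The perturbation term makes every transition possible, so the chain has a single stationary distribution regardless of the initial state. As with Boolean networks, the explicit matrix has 2^n rows, so this direct construction only suits small networks.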
Bayesian networks are graphical models that have been used to represent conditional dependencies and independencies among the variables corresponding to gene expression measurements [44]. One limitation of Bayesian networks for modeling genetic networks is that these models must be in the form of directed acyclic graphs and, as such, are not able to represent feedback control mechanisms. Dynamic Bayesian networks, on the other hand, are Bayesian networks that are capable of representing temporal processes [45] that may include such feedback loops. Since not all causal relationships can be inferred from correlation data, meaning that there can be different directed graphs that explain the data equally well, intervention experiments where genes are manipulated by overexpression or deletion have been proposed to learn networks [47]. The Bayesian network formalism has also been used to infer signaling networks from multicolor flow cytometry data [48].
There exist a number of other approaches for inferring large-scale molecular regulatory networks from high-throughput data sets. One example is a method, called the Inferelator, that selects the most likely regulators of a given gene using a nonlinear model that can incorporate combinatorial nonlinear influences of a regulator on target gene expression, coupled with a sparse regression approach to avoid overfitting [49]. In order to constrain the network inference, the Inferelator performs a preprocessing step of biclustering using the cMonkey algorithm [50], which results in a reduction of dimensionality and places the inferred interactions into experiment-specific contexts. The authors used this approach to construct a model of transcriptional regulation in Halobacterium that relates 80 transcription factors to 500 predicted gene targets.
Another method that predicts functional associations among genes by extracting statistical dependencies between gene expression measurements is the ARACNe algorithm [51]. This information-theoretic method uses a pairwise mutual information criterion across gene expression profiles to determine significant interactions. A key step in the method is the use of the so-called data processing inequality, which is intended to eliminate indirect relationships in which two genes are co-regulated through one or more intermediaries. Thus, the relationships in the final reconstructed network are more likely to represent the direct regulatory interactions. The ARACNe algorithm was applied to 336 genome-wide expression profiles of human B cells, resulting in the identification of MYC as a major regulatory hub along with newly identified and validated MYC targets [52].
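The pruning step based on the data processing inequality can be illustrated with a simplified sketch. This is not the published ARACNe implementation: it assumes expression profiles that are already discretized, uses a plain plug-in estimate of mutual information, and applies a fixed threshold in place of a statistical significance test. The gene names and profiles are made up so that X influences Z only through Y.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete profiles."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def aracne_like(profiles, threshold=0.0):
    """Keep edges with MI above `threshold`, then prune with the data processing inequality."""
    genes = sorted(profiles)
    mi = {
        frozenset((a, b)): mutual_information(profiles[a], profiles[b])
        for a, b in combinations(genes, 2)
    }
    edges = {e for e, value in mi.items() if value > threshold}
    removed = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(e in edges for e in triangle):
            # In a fully connected triplet the weakest edge is treated as indirect.
            # Ties are broken arbitrarily in this sketch.
            removed.add(min(triangle, key=mi.get))
    return edges - removed


# Hypothetical discretized profiles: Y is a noisy copy of X, Z a noisy copy of Y.
PROFILES = {
    "X": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    "Y": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "Z": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
}

if __name__ == "__main__":
    print(sorted(sorted(e) for e in aracne_like(PROFILES)))
```

Here all three pairs share some mutual information, but the X–Z edge is the weakest in the triplet and is removed, leaving the chain X–Y–Z.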
A method related to the ARACNe algorithm, called the context likelihood of relatedness (CLR), also uses the mutual information measure but applies an adaptive background correction step to eliminate false correlations and indirect influences [53]. CLR was applied to a compendium of 445 E. coli microarray experiments collected under various conditions and compared to other inference algorithms on the same data set. The CLR algorithm had superior performance as compared to the other algorithms, which included Bayesian networks and ARACNe, when tested against experimentally determined interactions curated in the RegulonDB database. It also identified many novel interactions, a number of which were verified with chromatin immunoprecipitation [53].
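The background correction idea can be sketched as follows, under stated assumptions. Each mutual information value is compared with the distribution of values involving either gene, converted to a z-score that is clipped at zero, and the two z-scores are combined into one score. Details such as excluding the diagonal and using the population standard deviation are choices made for this sketch, and the input matrix is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def clr_scores(mi):
    """Background-corrected scores from a symmetric mutual information matrix."""
    n = len(mi)
    mu, sd = [], []
    for i in range(n):
        background = [mi[i][j] for j in range(n) if j != i]  # gene i against all others
        mu.append(mean(background))
        sd.append(pstdev(background))

    def z(i, j):
        # How far MI(i, j) stands above gene i's own background, clipped at zero.
        if sd[i] == 0:
            return 0.0
        return max(0.0, (mi[i][j] - mu[i]) / sd[i])

    return [
        [0.0 if i == j else sqrt(z(i, j) ** 2 + z(j, i) ** 2) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical mutual information matrix for four genes (symmetric, zero diagonal).
MI = [
    [0.00, 0.90, 0.80, 0.85],
    [0.90, 0.00, 0.10, 0.10],
    [0.80, 0.10, 0.00, 0.70],
    [0.85, 0.10, 0.70, 0.00],
]

if __name__ == "__main__":
    for row in clr_scores(MI):
        print([round(v, 3) for v in row])
```

A pair whose mutual information lies below the background of both genes receives a score of zero, while a pair that stands out against both backgrounds scores highest.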