The topology and the dynamic realization of genetic networks often play a dominant role in phenotype formation. To understand the cause of a disease and/or develop effective therapeutics, it is important to understand the function and regulation of the underlying biological network. In recent years, studies on this problem have focused on data-based (reverse engineering) approaches, i.e., modeling a biological network from experimental data and prior knowledge using machine learning algorithms, such as learning a genetic regulatory network (GRN) from microarray data using a Bayesian network.
Friedman et al. were the first to use Bayesian networks to identify a regulatory network structure [17]. In that work, the best Bayesian model structure is learned from gene expression data by maximizing its posterior probability given the data. The ability of the model to reproduce certain known regulatory interactions is validated against real experiments, and the BN model can also predict new regulatory relationships. In general, Bayesian network inference uses two kinds of scoring functions, the BIC and BDe scores, to learn the BN structure. Following this line, many works have been proposed to learn genetic regulatory networks and protein-protein interaction networks by analyzing various data resources, such as gene expression data, ChIP-chip data, and protein expression data [15]. In addition, time-dependent gene activities and their relationships can be inferred from microarray time series data using dynamic Bayesian networks [29]. Other works have used data integration schemes to combine different kinds of data with prior knowledge in the learning task [22]. Moreover, some works have addressed reconstructing the genetic regulatory network in the presence of hidden factors and missing observations [2]. All of the above methods and applications inevitably encounter the same problem: there are not enough data samples to learn the network structure at the given dimensionality.
In reality, because the amount of experimental data available is small relative to the size of the genetic regulatory network, the learned network often contains only a small number of reliable (confident) edges. In addition, the conventional Bayesian network cannot capture the cyclic structures found in real biological systems, which often results in inaccuracy and/or error. Algorithms for learning cyclic structures from microarray data with dynamic Bayesian networks have been proposed [35]. However, these algorithms often need a large amount of time series data, which is not necessarily available. Moreover, biological networks consist of various interactions, such as protein-protein and protein-DNA interactions. Because the techniques used to generate these data vary, discrepancies between experiments and between types of data often make the data-driven approach difficult.
However, there are plenty of qualitative statements in the literature. For example, TGFβ stimulates tumor invasion and metastasis. This statement indicates a direct functional relationship, stimulation, between a cytokine, TGFβ, and phenotypes, tumor invasion and metastasis. Such a qualitative statement lacks quantitative information, e.g., how strongly TGFβ regulates those phenotypes, and its reliability depends on the biological experiments that support it. Nevertheless, the statement is a concrete conclusion supported by evidence obtained from different experimental measurements, including microarray, ELISA, and northern blot experiments. A qualitative statement is often a summary of the most prominent and consistent observations across multiple studies, and it should therefore be treated as the most confident information in modeling the underlying biological network. Other links, which are less reliable than the qualitative statements emphasized in the literature, may be erroneously captured (false positives) in a Bayesian model learned using the data-based reverse engineering approach.
Consequently, it is very important in systems biology to develop methods that use highly confident qualitative statements in the literature (with no quantitative experimental data involved) to establish a genetic network for a specific phenotype (e.g., cancer). In such networks, vertices indicate cellular molecules at multiple levels, such as proteins and RNA molecules. Directed edges from any node(s) to other node(s) in the network represent direct functional regulation of the child node(s) by the parental node(s). Given this genetic network, it is imperative to parameterize its structure. We can then use it to interrogate new genetic programs and discover new knowledge about this network and its associated biological phenotypes.
Unfortunately, a major hurdle in developing this knowledge-based approach is that qualitative statements lack the quantitative parameterization information that is crucial for performing quantitative inference. Thus, the problem boils down to constructing parameters from the qualitative statements and encoding this parameter and structure information into a mathematical model for quantitative manipulation. In this paper, we propose a knowledge-based predictive framework for modeling recurrent genetic networks with dynamic Bayesian networks given qualitative knowledge; the resulting model can then perform quantitative inference.
(Dynamic) Bayesian networks (DBNs) are a popular class of probabilistic graphical models motivated by Bayes’ theorem [1]. A DBN represents a joint probability distribution over a set of variables. Once known, this joint distribution can be used to calculate the probability of any configuration of the variables. In Bayesian probabilistic inference, the conditional probabilities for the values of a set of unconstrained variables are calculated given fixed values of another set of variables, called observations or evidence. Bayesian models have been widely used for efficient probabilistic inference and reasoning [32]. Numerous algorithms for learning the Bayesian network structure and parameters from data have been proposed [23]. However, as discussed above, although the maximum a posteriori approximation, i.e., the selection of a single Bayesian network model from the data by learning, is useful for large data sets, independence assumptions among the network variables often make this single model vulnerable to overfitting. In realistic problems, the available data are often very sparse and hardly sufficient to select one adequate model, i.e., there is considerable model uncertainty. Selecting a single Bayesian model can then lead to strongly biased inference results.
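As a minimal illustration of the inference step described above, the following sketch computes conditional probabilities by summing a joint distribution over all configurations consistent with the evidence. The three-node chain (TGFβ → invasion → metastasis) and every probability value in it are hypothetical numbers chosen only for illustration; they are not parameters from this paper, and exhaustive enumeration is used here only because the toy network is tiny.

```python
from itertools import product

# Hypothetical conditional probability tables for a toy binary chain
# TGFb -> Invasion -> Metastasis. All numbers are illustrative assumptions.
P_TGFB_ACTIVE = 0.3
P_INV_GIVEN_TGFB = {1: 0.8, 0: 0.1}   # P(Invasion=1 | TGFb)
P_MET_GIVEN_INV = {1: 0.6, 0: 0.05}   # P(Metastasis=1 | Invasion)

NAMES = ("tgfb", "invasion", "metastasis")


def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(tgfb, invasion, metastasis):
    """Joint probability factorized along the chain."""
    return (bernoulli(P_TGFB_ACTIVE, tgfb)
            * bernoulli(P_INV_GIVEN_TGFB[tgfb], invasion)
            * bernoulli(P_MET_GIVEN_INV[invasion], metastasis))


def posterior(query, evidence):
    """P(query = 1 | evidence), by enumerating all configurations."""
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(NAMES)):
        config = dict(zip(NAMES, values))
        if any(config[name] != val for name, val in evidence.items()):
            continue  # skip configurations that contradict the evidence
        p = joint(*values)
        denominator += p
        if config[query] == 1:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print(posterior("metastasis", {"tgfb": 1}))   # predictive query
    print(posterior("tgfb", {"metastasis": 1}))   # diagnostic query
```

The same function answers both a predictive query (effect given cause) and a diagnostic query (cause given effect), which is the sense in which a known joint distribution supports inference over any configuration of the variables.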
Besides Bayesian networks, other state-of-the-art statistical and deterministic methods have been proposed to infer the genetic regulatory network from data. These methods can analyze the full range of the behaviors and dynamics of a system under different conditions. (Probabilistic) Boolean networks were initially used to analyze network stability in the yeast transcriptional regulatory network [27] and to study the dynamics of cell cycle regulation in yeast [33]. Boolean networks can provide important insights into the existence and nature of network steady states and robustness. However, a Boolean network is largely limited by its level of modeling detail and by the computational expense of analyzing the dynamics of large networks, as the number of global states is exponential in the number of entities [26]. Petri nets are used to analyze the transition sequence of a network from one global state to another. Moreover, Petri nets have been used to analyze the dynamics of regulatory networks and large-scale metabolic networks [7]. Module networks were introduced to infer the regulation logic of gene modules from gene expression data. A regulation logic is represented by a decision tree, in which a path from the root to a leaf is determined by the up- or downregulation of regulatory modules, and a leaf determines the expression level of the corresponding genes. Module networks were tested with experimental data and correctly predicted some regulatory modules [43]. Other successful models can predict the genetic regulatory network based on mutual information [34].
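To make the remark on Boolean networks concrete, the sketch below enumerates every global state of a toy three-gene network under synchronous updates and collects its attractors (steady states or cycles). The update rules are hypothetical and chosen only for illustration; the point is that the outer loop visits 2^n global states, which is why this kind of exhaustive analysis does not scale to large networks.

```python
from itertools import product


def step(state):
    """One synchronous update of a hypothetical three-gene ring.

    Assumed rules (illustrative only): A is repressed by C,
    B is activated by A, and C is activated by B.
    """
    a, b, c = state
    return (int(not c), a, b)


def attractors(n_genes=3):
    """Follow every global state until it revisits a state; collect the cycles."""
    found = set()
    for start in product((0, 1), repeat=n_genes):  # 2**n_genes start states
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = step(state)
        cycle = seen[seen.index(state):]  # the recurrent part of the trajectory
        found.add(frozenset(cycle))
    return found


if __name__ == "__main__":
    for cycle in attractors():
        print(len(cycle), sorted(cycle))
```

For these assumed rules, the eight global states fall into one cycle of length six and one cycle of length two, with no fixed point.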
As discussed above, a quantitative data set is the sole resource for all these conventional methods. Therefore, the performance of these methods is inevitably limited by the availability and quality of the data. In particular, their performance will be severely undermined in any of the following cases: 1) the data contain few samples (compared to the number of predictors/features/random variables of the system); 2) the data are contaminated by relatively high-level noise; 3) the data contain no functional measurements. In our method, by contrast, we model the genetic regulatory network structure and parameters and predict the system behavior based solely on prior qualitative statements. Qualitative knowledge about a physical interaction is usually established by a combination of direct binding and functional regulation experiments. The qualitative knowledge thus provides a high-confidence landscape of the network structure. The major advantage of our proposed method is that we avoid using noisy data yet still construct a confident network structure.
In this paper, we employ a qualitative knowledge model [5] to map the major types of genetic interactions, i.e., 1) transcription factor-DNA regulations and 2) protein-protein interactions, to a group of constraints over the structure and parameter space of the dynamic Bayesian network. In particular, the qualitative properties of the statements are handled by transforming the fuzziness of these statements into a set of prior joint probability distributions over the nodes in the dynamic Bayesian network. The genetic networks are restricted to the subset of models that are consistent with a body of qualitative knowledge. All dynamic Bayesian models satisfying the constraints over the joint probability space are considered candidates for the underlying biological network. In this way, we take model uncertainty into account instead of basing our prediction on a single “best” model. With a full Bayesian approach, i.e., model averaging, this class of consistent models is used to perform quantitative inference, which can be approximated by Monte Carlo methods. This knowledge-based quantitative Bayesian network modeling algorithm preserves the actual network topology derived from knowledge and is able to capture both “correlation” (joint probability) and “causal/influence” (conditional probability) relations in the Bayesian network. When we combine qualitative statements from various studies, statements targeting the same genetic interaction may be inconsistent. In this case, they can be integrated into a unified representation by calculating a prior distribution over the statements [5].
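The idea of averaging over all models consistent with a qualitative statement can be sketched in a few lines. The example below is a deliberately simplified assumption, not the formulation developed in Section 2: the statement “A stimulates B” is read as the inequality P(B=1 | A=1) > P(B=1 | A=0), parameters are drawn from a uniform prior, inconsistent draws are rejected, and the predictions of the surviving models are averaged. The function names and the uniform prior are hypothetical choices made for illustration.

```python
import random


def sample_consistent_model(rng):
    """Draw one model (theta0, theta1) that satisfies 'A stimulates B'.

    theta0 = P(B=1 | A=0), theta1 = P(B=1 | A=1). The constraint
    theta1 > theta0 is an assumed encoding of the qualitative statement.
    """
    while True:
        theta0 = rng.random()
        theta1 = rng.random()
        if theta1 > theta0:  # keep only models consistent with the statement
            return theta0, theta1


def averaged_prediction(n_models=20000, seed=0):
    """Monte Carlo model averaging over the class of consistent models."""
    rng = random.Random(seed)
    total_inactive = 0.0
    total_active = 0.0
    for _ in range(n_models):
        theta0, theta1 = sample_consistent_model(rng)
        total_inactive += theta0
        total_active += theta1
    # Averaged P(B=1 | A=0) and P(B=1 | A=1) across all sampled models.
    return total_inactive / n_models, total_active / n_models


if __name__ == "__main__":
    p_b_given_a0, p_b_given_a1 = averaged_prediction()
    print(p_b_given_a0, p_b_given_a1)
```

Under this uniform-prior assumption the averaged predictions approach 1/3 and 2/3 as the number of sampled models grows, so a purely qualitative statement already yields a quantitative, if coarse, prediction without committing to any single parameter setting.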
In summary, our method demonstrates that good predictions of biological network behavior can be achieved from qualitative statements alone, without any quantitative data. In Section 2, we present the quantitative inference methods with a dynamic Bayesian model based on a set of qualitative statements. In Section 3, we apply our framework to model the cell proliferation network in normal and cancerous breast cells and to predict cell growth given regulatory interventions on the network. Conclusions are drawn in Section 4.