Random arcs-and-nodes model
To address formally the problem that we just outlined, we suggest a random arcs-and-nodes graph model—a modified version of a Bayesian network.
The classical Bayesian network formalism was invented to address tasks resembling automated medical diagnosis. A typical Bayesian network had a random variable associated with each node, while the directed arcs of the graph depicted the conditional dependencies between the nodes. For each pair of nodes connected by a directed arc, the node with the outgoing arc was called a parent of the node with the incoming arc. The state of each node was assumed to be conditionally dependent on the states of that node's parents, and conditionally independent (given the states of parental nodes) from the remainder of the graph nodes. By design, in these models, the not yet observed states of nodes (for example, the unknown disease state that causes the observed symptoms) were of predominant interest. The arcs, that is, the probabilistic relations between the node variables, were assumed known and immutable.
Returning to the problem that we outlined in the introduction, we can see that, when we are dealing with a large collection of statements generated by a diverse set of sources of unequal quality, internal conflicts between the states of numerous arcs and nodes are inescapable. Therefore, it might be useful to allow the arcs themselves to be associated with random variables, and to quantify arc and node-associated uncertainty simultaneously. These new arc-related random variables can represent the strength of the experimental support for individual molecular interactions. We can then update both arc and node distributions, following standard probability calculus, to improve the overall consistency of the model.
Here we suggest a model, a simple generalization of a Bayesian network, where both arcs and nodes represent random variables. As in the classical Bayesian network applied to molecular-biology data, the allowed values for node variables can be defined as active/present and inactive/absent, which describe the possible states of a molecule in a cell or a tissue. (Alternatively, instead of having only two admissible values per node, we could assume three values: active, inactive, and absent. For the sake of simplicity, we have chosen to treat the states inactive and absent as indistinguishable.) Deviating from the classical Bayesian network formalism, we define arc variables, each with allowed values inhibit, activate, and no effect. The intuition behind this formulation is to provide a mechanism for the arc variables to change their values depending on states of the surrounding nodes, in addition to the traditional probabilistic dependencies between the parent- and child-node variables. (If we assume that the arc variables are conditionally independent of each other and of the node variables, our model reverts to the classical Bayesian network model.) Our goal here is to estimate both the joint and the reconciled marginal distributions over nodes and arcs, given partial prior marginal probabilities on the nodes and arcs and a partial set of conditional probabilities. (To satisfy classic probability calculus, P(A = −1) + P(A = 0) + P(A = 1) = 1 for each arc variable A, where inhibit, no effect, and activate are encoded with integers −1, 0 and 1, respectively, and P(V = 1) + P(V = 0) = 1 for each node variable V, where we write V = 1 and V = 0 for active/present and inactive/absent values of V, respectively.) We can view the reconciled marginal distributions of arcs and nodes in our model as experimentally testable hypotheses.
Random variables associated with arcs can be particularly useful to express general knowledge about molecular events—when it is known that an interaction between two substances is possible, but no precise specification of the condition is given. Node-specific random variables can be useful to express experimental conditions for a specific cell, cell state, tissue, or organ. The initial information about data in our model is expressed as marginal prior probabilities over nodes and arcs. We also define conditional probabilities of nodes given arcs, and of arcs given nodes (see Mathematical Box). We use an analog of the stochastic-integration procedure to compute the joint probability over all random variables. As is common in applications of Bayesian networks to real data, we assume that our molecular-interaction model has no directed cycles.
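To make the idea of reconciliation concrete, the following minimal Python sketch enumerates the joint distribution for a single arc between a parent node and a child node. It is an illustration only and not the procedure from the Mathematical Box: the function `compatibility`, the penalty parameter `eps`, and all prior values are hypothetical, and exhaustive enumeration stands in for the stochastic-integration analog that a graph of realistic size would require.

```python
from itertools import product

ARC_STATES = (-1, 0, 1)   # inhibit, no effect, activate
NODE_STATES = (0, 1)      # inactive/absent, active/present


def compatibility(parent, arc, child, eps=0.1):
    """Hypothetical weight: 1 for a logically consistent triple, eps otherwise."""
    if parent == 0 or arc == 0:
        return 1.0                      # an inactive parent or a silent arc constrains nothing
    expected_child = 1 if arc == 1 else 0
    return 1.0 if child == expected_child else eps


def reconcile(prior_parent, prior_arc, prior_child, eps=0.1):
    """Return reconciled marginals (parent, arc, child) by brute-force enumeration."""
    joint = {}
    for p, a, c in product(NODE_STATES, ARC_STATES, NODE_STATES):
        joint[(p, a, c)] = (prior_parent[p] * prior_arc[a] * prior_child[c]
                            * compatibility(p, a, c, eps))
    z = sum(joint.values())             # normalizing constant
    joint = {k: v / z for k, v in joint.items()}
    marg_parent = {s: sum(v for (p, _, _), v in joint.items() if p == s) for s in NODE_STATES}
    marg_arc = {s: sum(v for (_, a, _), v in joint.items() if a == s) for s in ARC_STATES}
    marg_child = {s: sum(v for (_, _, c), v in joint.items() if c == s) for s in NODE_STATES}
    return marg_parent, marg_arc, marg_child


# Conflicting (made-up) priors: both nodes probably active, arc probably "inhibit".
prior_parent = {0: 0.1, 1: 0.9}
prior_arc = {-1: 0.8, 0: 0.1, 1: 0.1}
prior_child = {0: 0.1, 1: 0.9}
marg_parent, marg_arc, marg_child = reconcile(prior_parent, prior_arc, prior_child)
```

With these assumed inputs, the reconciled probability of inhibit falls below its prior value while the probability of activate rises, which mirrors the qualitative behavior described for conflicting priors.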
As will become clear from analysis of examples later in the paper, disparities between prior probabilities and reconciled marginal probabilities emerge when there are substantial conflicts among the prior probabilities for the variables.
Toy and not-so-toy examples
To support our contention that application of our model can lead to intuitive and potentially useful results, we clarify the relevant concepts with three toy examples. From these toy examples it is easy to see that the reconciled marginal distributions correspond to internally consistent pathway graphs. Furthermore, a large change in entropy (loss or gain of information) between the prior and reconciled marginal distributions of random variables is directly attributable to conflicts and agreements among statements in the model. After describing the toy examples we step through a larger, realistic pathway.
For our toy example we have chosen an X-shaped directed graph, shown in the accompanying figure. We look at three different prior variable distributions for the same topology. Panel (A) has logically consistent prior distributions over the variables. The most likely states of nodes G, B, and C are active/present; consistent with that, G and B both, most probably, activate C. Similarly, node C (most probably) inhibits node E and activates node D, a situation consistent with the probable states of nodes E and D, respectively. The reconciled marginal distributions for the same variables (panel (A), marginals) are visually similar to the corresponding prior distributions. However, the reconciled marginal distributions on average became more informative: their overall entropy drops by 0.45 bits for the node variables and by 2.14 bits for the arc variables, in comparison to the prior distributions. (The Shannon entropy of a random variable with just two states, 0 and 1, is defined as −p0 log2 p0 − p1 log2 p1, where p0 and p1 are the probabilities that we will find the variable in state 0 or 1, respectively. A similar expression with three terms in the sum defines the entropy of a three-state random variable. The Shannon information is defined as a difference between two values of entropy for the same system; information is gained when entropy decreases and is lost when entropy grows.) In other words, if we start with a set of logically consistent prior distributions over variables in a graph, we can gain information by computing the joint distribution over all variables, because consistent parts of the random graph reinforce one another and make the reconciled marginal distributions sharper (more informative).
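The entropy definition given in the parenthetical above can be written as a short Python sketch. The function names `shannon_entropy` and `information_gain` and the example probabilities are illustrative assumptions, not values from the toy graph.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits; states with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(prior, marginal):
    """Positive when the reconciled marginal is sharper (lower entropy) than the prior."""
    return shannon_entropy(prior) - shannon_entropy(marginal)


# Two-state node variable: (p0, p1). A uniform prior carries exactly 1 bit of entropy.
uniform_node = [0.5, 0.5]
h_uniform = shannon_entropy(uniform_node)          # 1.0

# Three-state arc variable: (inhibit, no effect, activate), made-up numbers.
arc_prior = [0.3, 0.3, 0.4]
arc_marginal = [0.05, 0.05, 0.9]
gain = information_gain(arc_prior, arc_marginal)   # positive: information gained
```

Summing `information_gain` over all node variables, and separately over all arc variables, gives totals of the kind quoted in the text for each panel.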
Computation of marginal distributions for all variables (arcs and nodes) of a hypothetical toy graph.
The inconsistent prior distributions for the same variables (panels (B) and (C)) lead to quite different properties of the reconciled distributions. In the graph shown in panel (B), node B is active and is believed to inhibit node C, yet C is believed to be active. In addition, node C is believed to activate node D, yet node D is most likely inactive/absent. The corresponding reconciled marginal distributions for arcs and nodes are no longer inconsistent: node D becomes activated, while arc A_BC changes its most likely value from inhibit to activate. However, this improvement in consistency is achieved at a price: loss of certainty in the reconciled marginal distributions. That is, the entropy for the reconciled distributions increases by 1.41 bits for nodes and by 0.32 bits for arcs. The example in panel (C) has an apparent conflict between the states of arcs A_GC and A_BC (both arcs are, most likely, in the state inhibit) and the active/present states of nodes G, B, and C. In addition, node E is originally believed to be activated by node C, but its most likely state is inactive. As with the previous examples, the reconciled marginal distributions are free of the inconsistencies observed in the prior distributions, but at the expense of an increase in entropy (loss of information, by 0.125 bits for nodes and 0.53 bits for arcs). A larger, realistic pathway graph can have both consistent and contradictory parts.
To get a large, experimentally grounded data set, we used data from a large-scale text-mining project that provided access to experimental results described in hundreds of thousands of published research articles. These data closely match the imaginary situation described earlier, where researchers at numerous laboratories ran experiments unaware of each other's results. We decided to compile and analyze a set of human molecular interactions among genes that are suspected to harbor genetic polymorphisms predisposing to one of four major neurological disorders: autism, Alzheimer's disease, bipolar disorder, and schizophrenia. We present here analysis of 3,161 full-text articles (we used 6,724 unique sentences from these articles to extract molecular interactions) from 64 major scientific journals (see Supporting Information for detailed information on sources of data). The molecular network that we analyzed with our method was devoid of directed cycles; to generate a loopless graph, in each directed cycle of the original literature-derived network model, we removed the weakest (least supported) arc, striving to minimize the overall number of deleted arcs. To collect information on the brain-specific expression of genes in our molecular network, we examined 910,221 journal abstracts that specifically referred to brain tissues; 14,780 of these abstracts mentioned genes that we selected for our example (see Supporting Information for more detail). The result of this analysis was a molecular network that comprised 288 nodes and 353 arcs; each arc was represented by multiple statements and types of interactions from the literature. (We could have analyzed a much larger network, but the results would not have been amenable to a compact representation easily accessible to a reader; nonetheless, our current pathway model, presented in the accompanying figures, is much larger than a typical pathway described in a comprehensive review article.)
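The cycle-removal step described above can be sketched in Python as a simple greedy loop: find any directed cycle, delete its least-supported arc, and repeat until none remain. This is an assumed reading of the procedure, not the authors' implementation; the `support` scores and node names are hypothetical, and a greedy pass does not guarantee the smallest possible number of deleted arcs.

```python
def find_cycle(arcs):
    """Return the arcs (u, v) of one directed cycle, or None if the graph is acyclic."""
    graph = {}
    for u, v in arcs:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def dfs(u):
        color[u] = GRAY
        path.append(u)
        for v in graph[u]:
            if color[v] == GRAY:          # back edge closes a cycle
                nodes = path[path.index(v):] + [v]
                return list(zip(nodes, nodes[1:]))
            if color[v] == WHITE:
                found = dfs(v)
                if found:
                    return found
        path.pop()
        color[u] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None


def break_cycles(support):
    """support maps (source, target) -> support score; returns (kept arcs, removed arcs)."""
    kept = dict(support)
    removed = []
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept, removed
        weakest = min(cycle, key=lambda arc: kept[arc])
        removed.append(weakest)
        del kept[weakest]


# Hypothetical support scores (for example, counts of supporting statements).
support = {("A", "B"): 5, ("B", "C"): 1, ("C", "A"): 3, ("C", "D"): 2}
kept, removed = break_cycles(support)
```

In this made-up example the only cycle is A, B, C, and the arc from B to C is removed because it has the lowest score.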
Distributions for all arc and node variables in a large human molecular network.
Figure 3. Difference and entropy-change graphs for the large human molecular network.
In this large molecular network, we defined the prior distributions for the node variables using published statements about tissue-specific expression of individual genes. We computed the prior distributions for the arcs using the individual relationships between molecules extracted from the literature combined with the estimated confidence in the quality of the extraction of the individual relations (see Mathematical Box and Supporting Information for details).
We visualized the prior and reconciled marginal distributions side by side in the large-network figure to facilitate their comparison, and showed the absolute difference between them in Figure 3 (A). Additionally, we computed the change in entropy between the prior and reconciled distributions for each individual random variable (Figure 3 (B)). The difference in entropy highlights the consistent and inconsistent parts of the graph: the blue-spectrum nodes and arcs increased their entropy (lost information), while the red-spectrum variables lost entropy (gained information). The blue-spectrum variables are the best candidates for further experimental corroboration or refutation.
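A per-variable comparison of this kind could be sketched in Python as follows. The variable names, the probabilities, and the choice of the largest per-state absolute difference as the summary of "difference" are all assumptions made for illustration; the paper's figures may summarize the difference in another way.

```python
import math


def entropy_bits(dist):
    """Shannon entropy (bits) of a dict mapping state -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def rank_by_entropy_change(prior, marginal):
    """Return (name, max absolute shift, entropy change) sorted by entropy change.

    Variables at the top gained entropy (lost information) and would be the
    blue-spectrum candidates for experimental follow-up.
    """
    rows = []
    for name, prior_dist in prior.items():
        shift = max(abs(prior_dist[s] - marginal[name][s]) for s in prior_dist)
        delta_h = entropy_bits(marginal[name]) - entropy_bits(prior_dist)
        rows.append((name, shift, delta_h))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical variables: one node (states 0/1) and one arc (states -1/0/1).
prior = {
    "node_D": {0: 0.9, 1: 0.1},
    "arc_B_C": {-1: 0.8, 0: 0.1, 1: 0.1},
}
marginal = {
    "node_D": {0: 0.5, 1: 0.5},
    "arc_B_C": {-1: 0.98, 0: 0.01, 1: 0.01},
}
ranked = rank_by_entropy_change(prior, marginal)
```

Here `node_D` ranks first because its reconciled distribution is flatter than its prior, while `arc_B_C` ranks last because its distribution became sharper.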
We begin the analysis of our realistic pathway example by observing that the hypothetical example that we posited in the introduction exists in the real-life example. According to published statements, gene WNT1 is inhibited by both HBP1 and EMX2. Therefore, the pathway, as represented by the set of prior distributions over variable values, is inconsistent.
One of the arcs whose reconciled marginal probability of the activate state decreased is the one involving SRF (see the accompanying figures). It also shows a loss of information (it has a blue connecting line in Figure 3 (B)). If we trace the arc support back to the source papers, we find that this particular arc is supported by a single sentence that formulates a hypothesis: “The combination of increased JNK activity and up-regulation of c-JUN and related proteins may activate gene transcription via interactions between c-JUN, SRF, and the trans-activation domain of SP1.”
Some of our reconciled marginal distributions for arcs appear to conflict with the published data. One of the prominent examples of this kind in our figure is the interaction between TP53 (a notorious transcription factor participating in a number of cancer- and cell-death-related pathways) and PSEN1 (a human gene that is believed to harbor polymorphisms predisposing the bearer to Alzheimer's disease). Our prior distribution for this arc indicated that TP53 most likely inhibits PSEN1. Yet our prior distributions for the nodes TP53 and PSEN1 were strongly biased towards the active/present state. Furthermore, according to our compiled graph, both TP53 and PSEN1 are activated by a number of other genes (TP53 is activated by EGR1, while PSEN1 is activated by e-CADHERIN and BCL-2), further supporting the hypothesis that both genes are active. As a result, the reconciled distribution for the arc between TP53 and PSEN1 has a larger probability for activate than for inhibit. This apparent inconsistency can be explained and resolved in a number of ways. The interaction between TP53 and PSEN1 may in reality be mediated by a third gene that is inactive in the neurons. An alternative explanation is that TP53 and PSEN1 are indeed active in the same neuronal cells, but not at the same time. This can be tested by looking at experimental time series reflecting changes in the states of genes, proteins, and other molecules in a cell.
Our computational approach identified inconsistencies in the states of approximately 10% of arcs and 8% of nodes within the realistic pathway graph (see the accompanying figures). We hypothesize that these estimates reflect the overall level of inconsistency among the published statements about molecular interactions.
The figures point to dozens of experimentally testable hypotheses that, we hope, the reader will be tempted to examine. Using the proposed methodology and currently accessible computational resources, we can scale the computation up to thousands or even millions of statements and, potentially, to the complete set of human molecular interactions.
Extensions and conclusion
A natural next step is to use our model to integrate results from large-scale wet-laboratory experiments with statements derived from text-mining analyses. We hope to expand our methodology by incorporating the ability to handle directed cycles, which are critically important in biological pathways. We can significantly improve the model for assigning the prior probabilities for nodes and arcs, at the cost of making it more complicated. For example, we can use a probabilistic model of the scientific publication process to take into account the type and amount of experimental support behind the published statements. A more long-term goal is to assemble and cross-validate a reliable and comprehensive map of human molecular interactions, to enable diagnosis and treatment of complex human disorders. Since molecular networks of distinct species interact with each other, as is clear in the case of pathogens and various allergy-inducing agents in humans, it is not unimaginable to attempt computing a reconciled model of the whole of current knowledge about molecular interactions. Finally, we can imagine a futuristic environment where new molecular-interaction hypotheses are automatically tested for consistency against the set of currently available facts.
Once a proper mapping of arc and node variables is defined, our model is immediately applicable to a diverse set of problems outside of molecular biology. For example, in ecology the node variables can represent the presence or absence of a species in a geographic location, while arcs can represent predator-prey, host-parasite, mutualism, or synergism relations between species. In sociology the nodes can represent individuals present or absent in different groups, while arcs can represent dependencies or associations between people. In political science the nodes can represent countries, and the arcs their interactions in the context of local conflicts and economic competition. In economics, the graph nodes map to companies, which may be either active or inactive in various markets, and the arcs depict collaboration, competition, or dependence between the various businesses. The common feature unifying all these disparate networks is that each of them has to be assembled from a rapidly growing avalanche of conflicting observations of unequal quality that need to be reconciled at a large scale.