The training set was composed of measurements of activating phosphorylation events on seven phosphorylated proteins or groups of protein isoforms [the kinase Akt; the mitogen-activated protein kinase (MAPK) family members ERK1 and -2, which are detected with the same antibody and denoted as ERK1/2; JNK1, -2, and -3, which are detected with the same antibody and denoted as JNK; p38; the MAPK kinase MEK1; inhibitor of nuclear factor κB (NF-κB) (denoted as IKB); and heat shock protein HSP27] observed at three time points (0, 30, and 180 min) after stimulation by one of four ligands [transforming growth factor–α (denoted as TGFa), insulin-like growth factor 1 (IGF1), tumor necrosis factor–α (denoted as TNFa), or interleukin-1α (denoted as IL1a)] in human hepatocellular liver carcinoma HepG2 cells. Measurements were obtained with and without pretreatment of cells with potent and relatively specific small-molecule inhibitors of cytosolic kinases (p38i, MEKi, PI3Ki, and IKKi, where “i” denotes inhibitor) that inhibited p38, MEK1, phosphatidylinositol 3-kinase (PI3K), or inhibitor of nuclear factor κB (IκB) kinase (IKK), as described (6). Participants attempted to predict phosphorylation measurements of the same seven proteins at 30 min after stimulation by various individual and pair-wise combinations of the ligands, in the presence of pair-wise combinations of the inhibitors (12). The experimental conditions comprising the training set were mutually exclusive with those comprising the test set. The complete challenge description and the data can be obtained from the DREAM project Web site (http://www.the-dream-project.org).
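To make the combinatorial design concrete, the following minimal sketch enumerates test conditions of the kind described. The ligand and inhibitor names follow the challenge’s abbreviations, but the exact set of conditions used is specified on the DREAM site, so the count printed below is illustrative only:

```python
from itertools import combinations

ligands = ["TGFa", "IGF1", "TNFa", "IL1a"]
inhibitors = ["p38i", "MEKi", "PI3Ki", "IKKi"]

# Individual and pair-wise ligand stimulations
ligand_sets = [(l,) for l in ligands] + list(combinations(ligands, 2))
# Pair-wise inhibitor pretreatments
inhibitor_pairs = list(combinations(inhibitors, 2))

# Cross the two: each ligand set under each inhibitor pretreatment
conditions = [(ls, ip) for ls in ligand_sets for ip in inhibitor_pairs]
print(len(conditions))  # (4 + 6) ligand sets x 6 inhibitor pairs = 60
```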
The DREAM4 Challenge and resulting network
In addition to the training set, participants received a prior knowledge network (PKN; a directed graph with edges specified as activating or inhibitory) compiled from the scientific literature on the basis of the Ingenuity Systems (Redwood City, California) database and encompassing the pathways known to be responsive to the ligands used in the challenge. In addition to the prediction task, the challenge entailed adding edges to and removing edges from the PKN to capture those interactions that were essential to explain the training data. This task encouraged participants to go beyond “black box” prediction algorithms and enable some mechanistic interpretation of the quantitative models used to predict the test data set. Although some participants applied models that were interpretable as a network, others focused on the prediction task only and did not attempt to interpret their model in terms of a network. Anecdotally, the team with the highest prediction score used a model that was not readily interpretable as a network, suggesting that maximizing the mechanistic interpretability of a model might compromise its predictive accuracy.
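A PKN in this sense is naturally represented as a signed, directed graph. The sketch below shows one minimal encoding on which the edge-editing task can operate; the two edges are illustrative examples, not entries from the actual challenge PKN:

```python
# Signed, directed edge list: (source, target) -> +1 (activating) or -1 (inhibitory)
pkn = {
    ("TNFa", "IKK"): +1,  # illustrative: ligand activates the kinase
    ("IKK", "IKB"): -1,   # illustrative: the kinase inhibits its substrate
}

def add_edge(net, src, dst, sign):
    """Add (or overwrite) a signed edge."""
    net[(src, dst)] = sign

def remove_edge(net, src, dst):
    """Remove an edge if present; silently ignore missing edges."""
    net.pop((src, dst), None)
```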
The NRSS was evaluated separately for each of the seven proteins because phosphorylation measurements are not directly comparable between proteins, owing to differences in the affinities of the antibodies for their targets and to variation in protein abundances. The seven P values, one for each measured phosphoprotein, represent the probability that the prediction accuracy on the test set is better than that of a naïve prediction assembled by randomly sampling from the phosphorylation status in the training set. The “Prediction Score” for a team summarizes the team’s overall predictive performance and was defined as the negative of the log10 of the geometric mean of the P values obtained by that team across all the predicted proteins (Eq. 2).
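For concreteness, here is a minimal sketch of the Prediction Score as defined above. The per-protein P values are placeholders; computing them from the NRSS null distribution is described in the challenge scoring methodology:

```python
import math

def prediction_score(p_values):
    """-log10 of the geometric mean of the per-protein P values,
    which equals the mean of the individual -log10(P) values."""
    # Work in log space to avoid underflow for very small P values.
    return -sum(math.log10(p) for p in p_values) / len(p_values)

# Seven illustrative per-protein P values (one per phosphoprotein)
p = [1e-12, 3e-9, 2e-7, 5e-5, 1e-3, 4e-2, 0.3]
print(round(prediction_score(p), 2))  # ~5.21
```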
Table 1. Performance scores for each of the phosphoproteins predicted by the teams and assessed in the DREAM4 phosphoproteomics challenge. The blue entries are the most significant predictions, with P values (PVAL) smaller than 10^−10. The red entries …
A high prediction score corresponds to high statistical significance for the accuracy of the prediction (a low average P value).
In prediction problems, a model with a number of fitted parameters that is smaller than the number of constraints in the problem (for example, the number of experiments) is generally preferred to a model with more parameters than constraints, because the former is more parsimonious, less prone to overfitting, and typically more interpretable (7). Also, empirical evidence suggests that biological networks are sparse (13); that is, the number of edges is of the order of N (the number of nodes) rather than of the order of N^2. We imposed a sparseness criterion on the selection of the best performer, using a cost function that rewards prediction accuracy and penalizes densely connected model networks to calculate the “Overall Score” for each team (Eq. 3).
The cost per edge was calibrated to the actual prediction scores and networks of the teams by taking the minimum of (Prediction Score/Number of Edges) over all teams. The most accurate team was third by this criterion, whereas the second most accurate participant was first. [For the methodology used by the best-performing team, Team 1, see (14).] This Overall Score cost function is ad hoc, and other formulations could rank the teams differently. One take-home message is that predictive accuracy, as measured by the Prediction Score and considered without regard to model complexity, model interpretability, or mechanistic plausibility, may be valuable in some tasks but not necessarily in the task of network inference. Indeed, the correlation between edge count and prediction score was low (0.03), indicating that increasing the number of edges does not automatically improve a model’s predictive power. For a Boolean model, we previously showed that removing edges with no empirical support improved predictive accuracy (12). Networks with sparse connectivity, therefore, might be expected to score better than highly connected ones. However, it remains an open problem to design a cost function that rewards desirable attributes and penalizes undesirable attributes in a model network.
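One plausible reading of this scheme, with the per-edge cost calibrated as described above, is sketched below; the numbers are invented to reproduce the kind of rank reversal reported, not the teams’ actual scores:

```python
def overall_scores(pred_scores, edge_counts):
    """Overall Score sketch: prediction score minus a per-edge cost,
    with the cost calibrated as the minimum (Prediction Score /
    Number of Edges) ratio over all teams, per the text's description."""
    cost_per_edge = min(s / e for s, e in zip(pred_scores, edge_counts))
    return [s - cost_per_edge * e for s, e in zip(pred_scores, edge_counts)]

# Illustrative: the team with the best prediction score uses the densest
# network and drops in rank once the sparseness penalty is applied.
scores = [8.0, 7.5, 6.0]  # prediction scores; team 1 is most accurate
edges = [80, 40, 30]      # edges in each team's model network
print(overall_scores(scores, edges))  # [0.0, 3.5, 3.0] -> team 2 ranks first
```

Note that, by construction, the team attaining the minimum score-to-edges ratio receives an Overall Score of zero, which is one way this calibration penalizes dense networks most heavily.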