In this section, we describe our assessment of the predictions supplied by the community. Our dual goals are to identify the best-performers in each challenge and to characterize the efficacy of the community as a whole. We highlight the best-performer strategies and comment on some of the sub-optimal strategies. Where possible, we attempt to leverage the community intelligence by combining the predictions of multiple teams into a consensus prediction.
Best-performers in each challenge were identified by statistical significance with respect to a null model combined with a clear delineation from the rest of the participating teams (e.g., an order of magnitude lower p-value compared to the next best team). Occasionally, this criterion identified multiple best-performers in a challenge.
Signaling Cascade Identification
Seven teams submitted predictions for the signaling cascade identification challenge as described in the Introduction. Submissions were scored based on the probability that a random solution to the challenge would achieve at least as many correct protein identifications as the submitted solution.
Five of seven teams identified two of the four proteins correctly (though not the same pair) (). One team identified only one protein correctly and one team did not identify any correctly. The p-value for a team identifying two or more proteins correctly is 0.11, as described in the Introduction. On the basis of this p-value, this challenge did not have a best-performer. However, in the days following the conference, follow-up questions from some of the participants to the data provider revealed a misrepresentation in how the challenge was posed, which probably negatively impacted the teams' performances. The source of the confusion is described below.
Table 4. Results of the signaling cascade identification challenge.
Although no individual team gained much traction in solving this challenge, the community as a whole seemed to possess intelligence. For example, five of seven teams correctly identified two proteins (though not the same pair). While such a performance is not significant on an individual basis, the event of five teams correctly identifying two proteins is unlikely to occur by chance. Under the binomial distribution, assuming independent teams, the probability of five or more teams correctly identifying two or more proteins is

.
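The binomial tail probability described above can be reproduced in a few lines. This is a minimal sketch assuming, as the text does, seven independent teams and a per-team probability of 0.11 of identifying two or more proteins by chance; the function name `binomial_tail` is ours and is not part of the challenge materials.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Inputs taken from the text: 7 teams, at least 5 succeed, and a chance
# probability of 0.11 that a single team gets two or more proteins right.
p_community = binomial_tail(n=7, k=5, p=0.11)
print(f"{p_community:.2e}")
```

The result is far smaller than the individual p-value of 0.11, which is the sense in which the community result is unlikely under the independence assumption.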
Summing over the predictions of all the teams we obtain . For example, five of seven teams correctly identified x1 as the kinase. The probability that five or more teams would pick the same table entry is

. Similarly, the probability of three or more teams identifying the same pair of proteins (e.g., kinase, phosphoprotein) is

.
The assumption of independence is implicit in the null hypothesis underlying these p-values. Rejection of the null hypothesis on the basis of a small p-value indicates that there is a correlation between the teams. This correlation can be interpreted as a shared success within the community. In other words, the community exhibits some intelligence not evidenced in the predictions of the individual teams. Based on this assessment of the community as a whole, we conclude that some structural features of the signaling cascade were indeed identified from flow cytometry data.
The community assessment suggests that a mixture of methods may be an advantageous strategy for identifying signaling proteins from flow-cytometry data. A simple strategy for generating a consensus prediction is illustrated by in which the total number of predictions made by the community for each possible assignment is indicated along with the corresponding p-values indicating the probability of such a concentration of predictions in a single table entry. The kinase and phosphorylated protein are the only identifications (individually) significant at

. This analysis also reveals clustering of incorrect predictions—the phosphatase was most often confused with the activated phosphatase, and the phosphorylated protein was most often confused with the phosphorylated ligand-receptor complex—but these misidentifications were not significant.
Mea culpa: a poorly posed challenge
There are three conjugate pairs of species in the signaling pathway: complex/phospho-complex, protein/phospho-protein, and phosphatase/activated phosphatase. The challenge description led participants to believe that each measured species (x1,…, x4) may match one of the six individual species. In fact, measurement x3 corresponded to total protein (inactive and active forms). Likewise, measurement x2 corresponded to total phosphatase (inactive and active forms). It would be highly unusual for an antibody to target one epitope of a protein to the exclusion of a phosphorylated epitope. That is, it would be difficult but not impossible to raise an antibody that reacted with only the unphosphorylated version of a protein. This serious flaw in the design of the challenge did not come to light until after the scoring was complete.
The simultaneous identification of the upstream kinase and the downstream phosphorylated protein () can be explained in light of the confusion surrounding precisely what the measurements entailed. The measurements corresponding to the kinase and phosphoprotein were accurately portrayed in the challenge description whereas the total protein and total phosphatase were not.
Signaling Response Prediction
Four teams participated in the signaling response prediction challenge. The phosphoprotein subchallenge received three submissions, as did the cytokine subchallenge. As described in the Introduction, the task was to predict measurements of proteins and/or cytokines, in normal and cancerous cells, for combinatorial perturbations of stimuli and inhibitors of a signaling pathway. Submissions were scored by a metric based on the sum of the squared prediction errors ().
In the phosphoprotein subchallenge two teams achieved a p-value orders of magnitude lower than the remaining submission (). In the cytokine subchallenge one team had a substantially smaller total prediction error than the next best team. On this basis, the best-performers were:
- Genome Singapore (phosphoprotein and cytokine subchallenges): Guillaume Bourque and Neil Clarke of the Genome Institute of Singapore, Singapore
- Vital SIB (phosphoprotein subchallenge): Nicolas Guex, Eugenia Migliavacca, and Ioannis Xenarios of the Swiss Institute of Bioinformatics, Switzerland
Table 5. Results of the signaling response prediction challenge.
There are two main types of strategies that could have been employed in this challenge: to explicitly model the underlying signaling network, or to model the data statistically. Both of the best-performers took a statistical approach. Vital SIB approached it as a missing data problem and used multiple imputation to predict the missing data. This involved learning model parameters by cross-validation, followed by prediction of the missing data [20]. Genome Singapore identified the nearest-neighbors of missing measurements based on similarity of the measurement profiles [21]. To predict the measurements for an unobserved stimulus or inhibitor, they took into consideration the values observed for the nearest neighbor. Neither team utilized external data sources, nor did they invoke the concept of a biological signaling network.
Surprisingly, one team in the cytokine subchallenge had a significantly larger total error than random. We investigated this strange outcome further. This team systematically under-predicted the medium and large intensity measurements (data not shown). This kind of systematic error was heavily penalized by the scoring metric. Nevertheless, the best-performer would have remained the same had linear correlation been used as the metric. Due to the low participation level from the community, we did not perform a community-wide analysis.
Gene Expression Prediction
Nine teams participated in the gene expression prediction challenge as described in the
Introduction. The task was to predict the expression of 50 genes in the

strain of
S. cerevisiae at eight time points. Participants submitted a spreadsheet of 50 rows (genes) by eight columns (time points). At each time point, the participant ranked the genes from most induced to most repressed compared to the wild type values at time zero. Predictions were assessed by Spearman's correlation coefficient and its corresponding
p-value under the null hypothesis that the ranks are uniformly distributed.
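To make the rank-based assessment concrete, the following minimal sketch computes Spearman's correlation coefficient between a predicted ranking and a gold-standard ranking. It assumes rankings without ties, uses made-up ranks for five genes rather than challenge data, and does not compute the associated p-value; the function name `spearman_rho` is ours.

```python
def spearman_rho(predicted_ranks, gold_ranks):
    """Spearman's rank correlation for two rankings without ties."""
    n = len(gold_ranks)
    if len(predicted_ranks) != n or n < 2:
        raise ValueError("rankings must have the same length (at least 2)")
    # Sum of squared rank differences between prediction and gold standard.
    d2 = sum((p - g) ** 2 for p, g in zip(predicted_ranks, gold_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks for five genes at one time point (1 = most induced).
gold = [1, 2, 3, 4, 5]
rho_perfect = spearman_rho([1, 2, 3, 4, 5], gold)
rho_swapped = spearman_rho([2, 1, 3, 4, 5], gold)
rho_reversed = spearman_rho([5, 4, 3, 2, 1], gold)
print(rho_perfect, rho_swapped, rho_reversed)
```

In the challenge each such vector had 50 entries (one per gene) and the calculation was repeated for each of the eight time points.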
The p-values (based on Spearman correlation coefficient) computed over the set of 50 test genes at each of the eight time-points are reported in . Some trends are readily identifiable. Across the community, the least significant predictions were those at time zero. Relatively more significant predictions were made at 10, 20, 45, and 60 minutes, and comparatively less significant predictions were made at 30 and 90 minutes. This analysis identified the teams that predicted well (over the 50 test genes) at each time point. We computed a summary statistic for each team using the geometric mean of the eight p-values for the individual time points.
Table 6. Time-profile p-values of the gene expression prediction challenge.
In the above analysis, each of the eight time points was analyzed as a 50-dimensional vector. An alternative viewpoint is to consider each of the 50 genes as an eight-dimensional vector. We also performed this analysis using Spearman's correlation coefficient computed for each gene. We computed a summary statistic for each team using the geometric mean of the 50 p-values for the individual genes (not shown). Correlation coefficients and p-values for the gene-profiles are published on the DREAM website [12].
Summary statistics from the time-profile analysis and the gene-profile analysis are reported in . Weaker significance of gene-profile p-values compared to time-profile p-values may be due to the fact that the former are eight-dimensional vectors while the latter are 50-dimensional vectors. Best-performers were identified by an overall score based on the time-profile and gene-profile summary p-values. A difference of one in the overall score corresponds to an order of magnitude difference in the p-value. Two teams performed more than an order of magnitude better than the nearest competitor at

.
- Gustafsson-Hornquist: Mika Gustafsson and Michael Hornquist of Linköping University, Sweden
- Dream Team 2008: Jianhua Ruan of the University of Texas at San Antonio, USA
Table 7. Results of the gene expression prediction challenge.
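The summary statistics used in this challenge can be sketched as follows. The geometric mean follows directly from the text. The form of the overall score is an assumption on our part: we take it to be the negative base-10 logarithm of the geometric mean of the two summary p-values, which is consistent with the statement that a difference of one corresponds to an order of magnitude in p-value. The p-values below are invented for illustration.

```python
from math import exp, log, log10

def geometric_mean(p_values):
    """Geometric mean, computed in log space to limit underflow."""
    return exp(sum(log(p) for p in p_values) / len(p_values))

def overall_score(p_time, p_gene):
    """ASSUMED form: -log10 of the geometric mean of two summary p-values."""
    return -(log10(p_time) + log10(p_gene)) / 2

# Hypothetical per-time-point p-values for one team (not challenge data).
time_profile_p = [1e-2, 1e-4, 1e-6, 1e-4]
summary_p = geometric_mean(time_profile_p)

score_a = overall_score(1e-4, 1e-2)
score_b = overall_score(1e-5, 1e-3)
print(summary_p, score_a, score_b)
```

Under this assumed form, improving both summary p-values by a factor of ten raises the score by exactly one.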
We used hierarchically clustered heat maps to visualize the teams' predictions (gene ranks from 1 to 50) relative to the gold standard (). The two best-performers were more similar to each other than either was to the gold standard. The Spearman correlation coefficient between Gustafsson-Hornquist and Dream Team 2008 is 0.96, while the correlation between either team and the gold standard is 0.67. One could reasonably presume that substantially similar methods were employed by both teams. That turns out not to be the case.
Team Gustafsson-Hornquist used a weighted least squares approach in which the prediction for each gene was a weighted sum of the values of the other genes
[22]. The particular linear model they employed is called an elastic net, which is a hybrid of the lasso and ridge regression
[23]. They incorporated additional data into their model, taking advantage of public yeast expression profiles and ChIP-chip data. The additional expression profiles provided more training examples from which to estimate pairwise correlations between genes. The physical binding data (ChIP-chip) was integrated into the linear model by weighting each gene's contribution to a prediction based on the number of common transcription factors the pair of genes shared.
Dream Team 2008 did not use any additional data beyond what was provided in the challenge. Rather, they employed a $k$-nearest neighbor (KNN) approach to predict the expression of a gene based on the expression of other genes in the same strain at the same time point [24]. The Euclidean distance between all pairs of genes was determined from the strains for which complete expression profiles were provided. The predicted value of a gene was the mean expression of the $k$ nearest neighbors. The parameter $k$ was chosen by cross-validation;

was used for prediction.
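The nearest-neighbor idea described above can be sketched as follows. This is not the team's implementation; it is a minimal illustration under our own assumptions about data layout, with hypothetical gene names and values, and it takes the number of neighbors as given rather than choosing it by cross-validation.

```python
from math import dist

def knn_predict(target_profile, neighbor_profiles, neighbor_values, k):
    """Predict a gene's value as the mean over its k nearest neighbor genes.

    target_profile: the target gene's expression in the fully observed strains.
    neighbor_profiles: {gene: profile over the same fully observed strains}.
    neighbor_values: {gene: measured value in the strain/time point to predict}.
    """
    # Rank candidate genes by Euclidean distance to the target's profile.
    ranked = sorted(
        neighbor_profiles,
        key=lambda g: dist(target_profile, neighbor_profiles[g]),
    )
    nearest = ranked[:k]
    return sum(neighbor_values[g] for g in nearest) / len(nearest)

# Hypothetical data: three candidate neighbor genes, three training conditions.
profiles = {"g1": [1.0, 2.0, 3.0], "g2": [1.1, 2.1, 2.9], "g3": [5.0, 5.0, 5.0]}
values = {"g1": 2.0, "g2": 4.0, "g3": 10.0}
prediction = knn_predict([1.0, 2.0, 3.0], profiles, values, k=2)
print(prediction)
```

Here the two closest genes are g1 and g2, so the distant gene g3 does not influence the prediction.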
Does the community possess an intelligence that trumps the efforts of any single team? To answer this question we created a consensus prediction by summing the predictions of multiple teams, then re-ranking. The results of this analysis are shown in which traces the overall score of the consensus prediction as lower-significance teams are included. The first consensus prediction includes the best and second-best teams. The next consensus prediction includes the top three teams, and so on.
The consensus prediction of the top four teams had a higher score than the best-performer, which is counter-intuitive since the third and fourth place teams individually scored much lower than the best-performer (). Furthermore, the inclusion of all teams in the consensus prediction scored about the same as the best-performer. This result suggests that, given the output of a collection of algorithms, combining multiple result sets into a consensus prediction is an effective strategy for improving the results.
We assigned a difficulty level to each gene based on the accuracy of the community. For each gene, we computed the geometric mean of the gene-profile p-values over the nine teams, which we interpreted as the difficulty level of each gene. The five best-predicted genes were: arg4, ggc1, tmt1, arg1, and arg3. The five worst-predicted genes were: srx1, lee1, sol4, glo4, and bap2. The relative difficulty of prediction of a gene was weakly correlated with the absolute expression level of that gene at $t = 0$, but many of the 50 genes defied a clear trend. The five best-predicted genes had an average expression of 42.7 (arbitrary units, log scale) at $t = 0$, whereas the five worst-predicted genes had an average expression of 3.7. It is known that low intensity signals are more difficult to characterize with respect to the noise. It is likely that the absolute intensity of the genes played a role in the relative difficulty of predicting their expression values.
In Silico Network Inference
Twenty-nine teams participated in the in silico network inference challenge as described in the Introduction, the greatest level of participation by far of the four DREAM3 challenges. The task was to infer the underlying gene regulation networks from in silico measurements of environmental perturbations (dynamic trajectories), gene knock-downs (heterozygous mutants), and gene knock-outs (homozygous null-mutants). Participants predicted directed, unsigned networks as a ranked list of potential edges in order of the confidence that the edge is present in the gold standard network. Predictions for 15 different networks of various “real-world” inspired topologies were solicited, grouped into three separate subchallenges: the 10-node, 50-node, and 100-node subchallenges. The three subchallenges were evaluated separately.
Each predicted network was evaluated using two metrics, the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR). To provide some context for these metrics we demonstrate the ROC and P-R curves for the five best teams in the 100-node subchallenge (). These complementary assessments enable valuable insights about the performance of the various teams.
Based on the P-R curve, we observe that the best-performer in this subchallenge actually had low precision at the top of the prediction list (i.e., the first few edge predictions were false positives), but subsequently maintained a high precision (approximately 0.7) to considerable depth in the prediction list. By contrast, the second-place team had perfect precision for the first few predictions, but precision then plummeted. In another example of the complementary nature of the two assessments, consider the fifth-place team. On the basis of the ROC, the fifth-place team is scarcely better than random (diagonal dotted line); however, on the basis of the P-R curve, it is clear that the fifth-place team achieved better precision than random at the top of the edge list. The two types of curves are non-redundant and enable a fuller characterization of prediction performance than either alone.
ROC and P-R curves like those shown in were summarized using the area under the curve. The details of the calculation of the area under the ROC curve and the area under the P-R curve are described at length in
[10]. Probability densities for AUPR and AUROC were estimated by simulation of 100,000 random prediction lists. Curves were fit to the histograms using Equation 2 so that the probability densities could be extrapolated beyond the ranges of the histograms in order to compute
p-values for teams that predicted much better or worse than the null model. demonstrates the teams' scores in the reconstruction of the gold standard network called
InSilico_Size100_Yeast2. The best-performer made an exceedingly significant network prediction (identified by an arrow) whereas many of the teams predicted equivalently to random.
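The two area metrics and the random null model can be illustrated with a short sketch. The area calculations here are common textbook estimates (rectangle rule for ROC, average precision for P-R) and may differ in detail from the procedure in [10]. The p-value below is purely empirical; the curve fitting used to extrapolate beyond the simulated range is not reproduced. Edge names are hypothetical.

```python
import random

def roc_pr_areas(ranked_edges, gold_edges):
    """Return (AUROC, AUPR) for a complete ranked list of candidate edges."""
    n_pos = len(gold_edges)
    n_neg = len(ranked_edges) - n_pos
    tp = fp = 0
    auroc = aupr = 0.0
    for edge in ranked_edges:
        if edge in gold_edges:
            tp += 1
            aupr += tp / (tp + fp) / n_pos   # average-precision step
        else:
            fp += 1
            auroc += (tp / n_pos) / n_neg    # rectangle: width 1/N, height TPR
    return auroc, aupr

def empirical_p_value(observed_auroc, edges, gold_edges, n_sim=1000, seed=0):
    """Fraction of randomly ordered lists scoring at least as well."""
    rng = random.Random(seed)
    shuffled = list(edges)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if roc_pr_areas(shuffled, gold_edges)[0] >= observed_auroc:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# Hypothetical four-edge example with two true edges.
ranked = ["a", "b", "c", "d"]
gold = {"a", "c"}
auroc, aupr = roc_pr_areas(ranked, gold)
p_emp = empirical_p_value(auroc, ranked, gold)
print(auroc, aupr, p_emp)
```

With only a handful of candidate edges the empirical p-value is coarse; the 100,000 simulated lists mentioned in the text serve the same purpose at much finer resolution.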
Best-performers in each subchallenge were identified by an overall score that summarized the statistical significance of the five network reconstructions composing the subchallenge (Ecoli1, Ecoli2, Yeast1, Yeast2, Yeast3). The AUROC p-values for the 100-node subchallenge are indicated in . The complete set of tables for the other subchallenges is available on the DREAM website [12]. A summary p-value for AUROC was computed as the geometric mean of the five p-values. Likewise, a summary p-value for AUPR was computed (not shown). Finally, the overall score for a team was computed from the two summary p-values according to Equation 4 (). A difference of one in the score corresponds to an order of magnitude difference in p-value; the higher the score, the more significant the prediction. On the basis of the overall score, the same team was the best-performer in the 10-node, 50-node, and 100-node subchallenges:
- B Team: Kevin Y. Yip, Roger P. Alexander, Koon-Kiu Yan, and Mark Gerstein of Yale University, USA
Table 8. P-values for the area under the ROC in the in silico size 100 network inference challenge.
Table 9. Results of the in silico size 100 network inference challenge.
Runners-up were identified by scores that were orders of magnitude more significant than the community at large, but not as significant as the best-performer:
- USMtec347 (10-node, 50-node): Peng Li and Chaoyang Zhang of the University of Southern Mississippi, USA
- Bonneau (100-node): Aviv Madar, Alex Greenfield, Eric Vanden-Eijnden, and Richard Bonneau of New York University, USA
- Intigern HSP (100-node): Xuebing Wu, Feng Zeng, and Rui Jiang of Tsinghua University, China
The overall p-values for the 100-node subchallenge () demonstrate that the best teams predicted significantly better than the null model, a randomly sorted prediction list. However, the majority of teams did not predict much better than the null model. In the 10-node subchallenge, twenty-six of twenty-nine teams did not make statistically significant predictions on the basis of the AUROC (

). Fourteen of 27 teams in the 50-node subchallenge did not make significant predictions (AUROC

). Eight of 22 teams in the 100-node subchallenge did not make significant predictions (AUROC

). This is a sobering result for the efficacy of the network inference community. In Conclusions we discuss some reasons for this seemingly distressing result.
Some teams' methods were well-suited to smaller networks, others to larger networks (). This may have less to do with the number of nodes and more to do with the relative sparsity of the larger networks since the number of potential edges grows quadratically with the number of nodes (i.e.,

).
Table 10. Comparison of scores in the 10, 50, and 100-node subchallenges.
B Team used a collection of unsupervised methods to model both the genetic perturbation data (steady-state) and the dynamic trajectories [25]. Most notably, they correctly assumed an appropriate noise model (additive noise), and characterized changes in gene expression relative to the typical variance observed for each gene. It turned out that this simple treatment of measurement noise accounted for much of their exemplary overall performance. This conclusion is based on our own ability to recapitulate their performance using a simple method that also uses a noise model to infer connections (see analysis of null-mutant z-scores below). Additionally, B Team employed a few formulations of ODEs (linear functions, sigmoidal functions, etc.) to model the dynamic trajectories. In retrospect, their efforts to model the dynamic trajectories probably had a minor effect on their overall performance. Team Bonneau applied and extended a previously described algorithm, the Inferelator [26], which uses regression and variable selection to identify transcriptional influences on genes [27]. The methodologies of B Team and the other best-performers are described in separate publications in the PLoS ONE Collection.
A simple method: null-mutant z-score
We investigated the utility of a very simple network inference strategy which we call the null-mutant z-score. This strategy is a simplification of conditional correlation analysis [28]. Suppose there is a regulatory interaction which we denote A → B. We assume that a large expression change in B occurs when A is deleted (compared to the wild-type expression). We compute the z-score for the regulatory interaction A → B

$$z_{A \rightarrow B} = \frac{x_B^{\Delta A} - \mu_B}{\sigma_B},$$

where $x_B^{\Delta A}$ is the value of B in the strain in which A was deleted, $\mu_B$ is the mean value of B in all strains (WT and mutants), and $\sigma_B$ is the standard deviation of B in all strains. This calculation is performed for all directed pairs (A, B). We assume that $\mu_B$ represents baseline expression (i.e., most gene deletions do not affect expression of B) and that deletion of direct regulators produces larger changes in expression than deletion of indirect regulators. Then, a network prediction is achieved by taking the absolute value of the z-score and ranking potential edges from high to low values of this metric. Of note, the z-score prediction would have placed second, first, and first (tie) in the 10-node, 50-node, and 100-node subchallenges, respectively.
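The null-mutant z-score ranking can be sketched in a few lines. The data layout (a dictionary keyed by strain, with the wild type labeled "wt" and each deletion strain named after the deleted gene), the use of the population standard deviation, and the toy expression values are all our assumptions for illustration.

```python
from statistics import mean, pstdev

def null_mutant_ranking(expression):
    """Rank directed edges (A, B) by the absolute null-mutant z-score.

    expression[strain][gene] is the steady-state level of `gene` in `strain`.
    Strains are "wt" plus one deletion strain named after each deleted gene.
    """
    genes = [s for s in expression if s != "wt"]
    scores = {}
    for b in genes:
        # Mean and standard deviation of B over all strains (WT and mutants).
        levels = [expression[s][b] for s in expression]
        mu, sigma = mean(levels), pstdev(levels)
        if sigma == 0:
            continue
        for a in genes:
            if a != b:
                scores[(a, b)] = abs(expression[a][b] - mu) / sigma
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: deleting A strongly changes B; other effects are small.
expression = {
    "wt": {"A": 1.0, "B": 1.0, "C": 1.0},
    "A": {"A": 0.0, "B": 3.0, "C": 1.0},
    "B": {"A": 1.0, "B": 0.0, "C": 1.05},
    "C": {"A": 0.95, "B": 1.0, "C": 0.0},
}
ranked_edges = null_mutant_ranking(expression)
print(ranked_edges[0])
```

With only a few strains the self-deletion value of each gene noticeably shifts its mean and standard deviation; in the 50-node and 100-node networks this effect is diluted by the many other strains.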
We do not imply that ranking edges by z-score is a superior algorithm for inferring gene regulation networks from null-mutant expression profiles in general, though conditional correlation has its merits. Rather, we interpret the efficacy of z-score for reverse-engineering these networks as a strong indication that an algorithm must begin with exploratory data analysis. Because additive Gaussian noise (i.e., simulated measurement noise) is a dominant feature of the data, z-score happens to be an efficacious method for discovering causal relationships between gene pairs. Furthermore, z-score can loosely be interpreted as a metric for the “information content” of a node deletion experiment. Below, we invoke this concept of information content to investigate why some network edges remain undiscovered by the entire community.
Intrinsic impediments to network inference
Analysis of the predictions of the community as a whole sheds light on two important technical issues. First, are certain edges easy or difficult to predict and why? Second, do certain network features lead teams to predict edges where none exist? We call the former concept the identifiability of an edge, and we call the latter concept systematic false positives. A straightforward metric for quantifying identifiability and systematic false positives is the number of teams that predict an edge at a specified cutoff in the prediction lists. In the following analysis, we used a cutoff of 2P (i.e., twice the number of actual positives in the gold standard), which means that the first 2P edges were thresholded as present (positives). Incomplete prediction lists were completed with a random ordering of the missing potential edges prior to thresholding.
We grouped the gold standard edges into bins according to the number of teams that identified the edge at the specified threshold (2P). We call the resulting histogram the identifiability distribution (). A community composed of the ten worst-performing teams has an identifiability distribution that is approximately equivalent to that of a community of random prediction lists—the two-sample Kolmogorov-Smirnov test
p-value is 0.89. By contrast, a community composed of the ten best teams has a markedly different identifiability distribution compared to a random community—the two sample K-S test
p-value is

.
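The identifiability distribution can be computed as sketched below. The sketch assumes every team supplies a complete ranked list (the random completion of incomplete lists is omitted) and uses a hypothetical three-node network with three invented team submissions.

```python
from collections import Counter

def identifiability_distribution(team_lists, gold_edges):
    """Histogram: number of gold-standard edges found by exactly m teams.

    team_lists: one complete ranked edge list per team.
    gold_edges: set of true edges; the cutoff is 2P, twice its size.
    """
    cutoff = 2 * len(gold_edges)
    found_by = Counter()
    for ranked in team_lists:
        for edge in set(ranked[:cutoff]) & gold_edges:
            found_by[edge] += 1
    hist = Counter(found_by[e] for e in gold_edges)
    # Index m holds the number of true edges identified by exactly m teams.
    return [hist[m] for m in range(len(team_lists) + 1)]

gold = {("A", "B"), ("B", "C")}
team1 = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "A"), ("B", "A"), ("C", "B")]
team2 = [("A", "C"), ("A", "B"), ("C", "A"), ("C", "B"), ("B", "C"), ("B", "A")]
team3 = [("C", "B"), ("B", "A"), ("A", "C"), ("C", "A"), ("A", "B"), ("B", "C")]
distribution = identifiability_distribution([team1, team2, team3], gold)
print(distribution)
```

The first entry of the returned list is the zero column, that is, the number of true edges that no team identified at the cutoff.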
The zero column in the identifiability distribution corresponds to the edges that were not identified by any team. We hypothesized that the unidentified edges could be due to a failure of the data to reveal the edge, that is, the problem of insufficient information content of the data. Using the null-mutant z-score as a measure of the information content of the data supporting the existence of an edge, we show that unidentified edges tend to have much lower absolute z-scores compared to the edges that were identified by at least one team (). This can occur if expression of the target node does not significantly change upon deletion of the regulator. For example, a target node that implements an OR-gate would be expected to have little change in expression upon the deletion of one or another of its regulators. Such a phenomenon is more likely to occur for nodes that have a higher in-degree. Indeed, the unidentified edges have both lower z-score and higher target node in-degree than the identified edges ().
We investigated whether certain structural features of the gold standard networks led the community to incorrectly predict edges where there should be none. When multiple teams make the same false positive error, we call it a systematic false positive. The number of teams that make the error is a measure of confusion of the community. An ever-present conundrum in network inference is how to discriminate direct regulation from indirect regulation. We hypothesized that two types of topological properties of networks could be inherently confusing, leading to systematic false positives. The first type is what we call shortcut errors, where a false positive shortcuts a linear chain. A second type of direct/indirect confusion is what we call a co-regulation error, where co-regulated genes are incorrectly predicted to regulate one another (see schematic associated with ).
We performed a statistical test to determine if there is a relationship between systematic false positives and the shortcut and co-regulated topologies (). Fisher's exact test is a test of association between two types of classifications. First, we classified all negatives (absence of edges) by network topology as either belonging to the class of shortcut and co-regulated node pairs, or not. Second, we classified negatives by the predictions of the community as either systematic false positives, or not. Finally, we constructed the $2 \times 2$ contingency table, which tabulates the number of negatives classified according to both criteria simultaneously.
There is a strong relationship between systematic false positives and the special topologies that we investigated. The systematic false positives are concentrated in the shortcut and co-regulated node pairs. This can be seen by inspection of each $2 \times 2$ contingency table. For example, systematic false positives (the most common false positive errors in the community) have a ratio of 1.09 (51 special topologies to 47 generic topologies) whereas the less common false positive errors have a ratio of 0.11 (920 special topologies to 8757 generic topologies). This is a profound difference in the topological distribution of false positives depending on whether many teams or few (including none) made the error. Direct-indirect confusion of this kind explains about half of the systematic false positives in the Ecoli1 network, and more than half in the other 100-node networks.
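The association in the example above can be checked numerically. The sketch below uses the four counts quoted in the text and a one-sided Fisher's exact test implemented directly from the hypergeometric distribution; the choice of a one-sided test and the function name are our assumptions, since the variant used is not stated here.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Exact integer arithmetic for the tail sum, one division at the end.
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(total, row1)

# Counts from the text: rows are systematic / other false positives,
# columns are special (shortcut or co-regulated) / generic topologies.
a, b, c, d = 51, 47, 920, 8757
odds_ratio = (a * d) / (b * c)
p_value = fisher_one_sided(a, b, c, d)
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.1e}")
```

The odds ratio is simply the ratio of the two ratios quoted in the text (51/47 against 920/8757).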
Community intelligence
Does the community possess an intelligence that trumps the efforts of any single team? To test this hypothesis we experimented with various ways of combining the predictions of multiple teams into a consensus prediction. Based on simplicity and performance, we settled on the rank sum. The order of the edges in a prediction list is a ranking. We summed the ranks for each edge given by the various teams, then re-ranked the list to produce the consensus network prediction. Depending on which teams are included, this procedure can boost the overall score. For example, combining the predictions of the second and third-place teams achieved a better score than the second-place team (). This result seems to indicate that the predictions of second and third-place teams are complementary; probably these teams took advantage of different features in the data. However, combining predictions with those of the best-performer only degraded the best score. Obviously, if the best prediction is close to optimal, combination with a suboptimal prediction degrades the score.
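The rank sum consensus is simple enough to state in code. This minimal sketch assumes every team ranks the same set of candidate edges; the tie-breaking rule (alphabetical) is an arbitrary choice of ours, and the edge labels are hypothetical.

```python
def rank_sum_consensus(team_lists):
    """Combine complete ranked lists by summing ranks, then re-ranking."""
    totals = {}
    for ranked in team_lists:
        for rank, edge in enumerate(ranked, start=1):
            totals[edge] = totals.get(edge, 0) + rank
    # Lowest rank sum first; ties broken alphabetically (arbitrary choice).
    return sorted(totals, key=lambda e: (totals[e], e))

team_a = ["e1", "e2", "e3", "e4"]
team_b = ["e2", "e1", "e4", "e3"]
team_c = ["e2", "e3", "e1", "e4"]
consensus = rank_sum_consensus([team_a, team_b, team_c])
print(consensus)
```

An edge that several teams place near the top accumulates a small rank sum, whereas an edge ranked highly by only one team is pulled down by the others.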
Starting with the second-place team and including progressively more teams, the rank sum prediction score degrades much more slowly than the score of the individual teams (). This is reassuring since, in general, given the output of a large number of algorithms, we may not know which algorithms have efficacy. The rank sum consensus prediction is robust to the inclusion of random prediction lists (the worst-performing teams' predictions were equivalent to random). It seems to be efficacious to blend the results of a variety of algorithms that approach the problem from different perspectives. We expect hybrid strategies to become more common in future DREAM challenges.
Lessons for experimental validation of inferred networks
This challenge called for the submission of a ranked list of predicted edges, ordered from most confident to least confident that an edge is present in the gold standard. Ranked lists are common for reporting the results of high-throughput screens, whether experimental (e.g., differential gene expression, protein-protein interactions, etc.) or computational. In the case of computational predictions, it is typical to experimentally validate a handful of the most reliable predictions. This amounts to characterizing the precision at the top of the prediction list. The outcome of the in silico network inference challenge reveals two reasons why a “top ten” approach to experimental validations is difficult to interpret.
Experimental validations of the handful of top predictions of an algorithm would be useful if precision were a monotonically decreasing function of the depth of the prediction list. The actual P-R curves illustrate that this is not the case. In the 100-node subchallenge, the best-performer initially had low precision which rose to a high value and was maintained to a great depth in the prediction list. The second-best-performer initially had high precision, which plummeted abruptly with increasing depth. Validation of the top ten predictions would have been overly pessimistic in the former case, and overly optimistic in the latter case. Unfortunately, since precision is not necessarily a monotonically decreasing function of depth, a small number of experimental validations at the top of the prediction list cannot be extrapolated.
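The point about non-monotonic precision can be illustrated with two invented prediction lists that mimic the two behaviors described above. The lists and edge labels are hypothetical and are not taken from any team's submission.

```python
def precision_at_depth(ranked_edges, gold_edges):
    """Precision after each of the first k predictions, for k = 1..len(list)."""
    hits, curve = 0, []
    for k, edge in enumerate(ranked_edges, start=1):
        hits += edge in gold_edges   # True counts as 1
        curve.append(hits / k)
    return curve

gold = {"e1", "e2", "e3", "e4", "e5", "e6"}
# Starts with two false positives, then a long run of true edges.
slow_start = ["x1", "x2", "e1", "e2", "e3", "e4", "e5", "e6"]
# Starts with two true edges, then only false positives.
fast_start = ["e1", "e2", "x1", "x2", "x3", "x4", "x5", "x6"]

curve_slow = precision_at_depth(slow_start, gold)
curve_fast = precision_at_depth(fast_start, gold)
print(curve_slow[1], curve_slow[-1])
print(curve_fast[1], curve_fast[-1])
```

Validating only the top two predictions would suggest that the first list is useless and the second is perfect, which is the opposite of their performance over the full list.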
Year-over-year comparison
We would like to know if predictions are getting more accurate from year to year, and if teams are improving. With only two years of data available, no definitive statement can be made. However, there is one interesting observation from the comparison of individual teams' year-over-year scores. We compared the results of the 50-node subchallenge of DREAM3 to the results of the 50-node subchallenge of DREAM2 (the subchallenge that was substantially similar from year to year). It is a curious fact that teams that scored high in DREAM2 did not score high in DREAM3. There can be many reasons for the counter-trend. The in silico data sets were generated by different people from year to year. Furthermore, the topological characteristics of the networks were different. For example, all of the DREAM3 networks were devoid of cycles whereas the DREAM2 networks contained more than a few. The dynamics were implemented using different, though qualitatively similar, equations. Finally, the current year data included additive Gaussian noise, whereas the prior data sets did not. Given the efficacy of directly acknowledging the measurement noise in the reverse engineering algorithm (e.g., null-mutant z-score described above), any team that did not acknowledge the noise would have missed an important aspect of the data. We interpret the year-over-year performance as an indication that no algorithm is “one-size-fits-all.” The in silico network challenge data was sufficiently different from year to year to warrant a custom solution. As a final note, teams may also have changed their algorithms.
Survey of methods
A voluntary survey was conducted at the conclusion of DREAM3 in which 15 teams provided basic information about the class of methods used to solve the challenge (). The two most common modeling formalisms were Bayesian and linear/nonlinear dynamical models, which were equally popular (7 teams). Linear regression was the most popular data fitting/inference technique (4 teams); statistical (e.g., correlation) and local optimization (e.g., gradient descent) were the next most popular (2 teams). Teams that scored high tended to enforce additional constraints, such as minimization of the L1 norm (i.e., a sparsity constraint). Also, high-scoring teams did not ignore the null-mutant data set. The main conclusion from the survey of methods is that there does not seem to be a correlation between methods and scores, implying that success is more related to the details of implementation than the choice of general methodology.