We have evaluated the performance of several gene regulatory network construction methods on simulated datasets of various sizes and sample numbers. From this, a few conclusions can be drawn.
First, WGCNA and ARACNE performed well in constructing the global network, while SPACE did well in identifying a few connections with high specificity. GeneNet performed well in both respects, but it is not suitable for identifying hub genes, which are often of biological interest. In the simulation study, SPACE performed well in identifying the hub genes, as shown in . Since no single method outperforms the others in all respects, users should choose a method appropriate to the purpose of the study.
In applying these methods to the real E. coli data, WGCNA and ARACNE performed best, which may indicate that these two methods are relatively more robust. Overall, the performance on real data was worse than in the simulation study, for several possible reasons: (1) the real biological network is much more complex than the simulated ones; (2) many true connections in this network are still unknown; (3) some of the connections in RegulonDB may not be supported by gene expression data.
Surprisingly, SPACE performed poorly in constructing the global network, because the SPACE algorithm uses an L1 penalty that shrinks most of the partial correlations to zero. When we manually decreased the penalty term, performance improved because fewer partial correlations were shrunk to zero, but the computation also became much more intensive. In this study, we used the default parameters or recommended settings for each method whenever possible to ensure a fair comparison, so we report the results of SPACE under its default settings.
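The effect of the L1 penalty can be illustrated with a soft-thresholding sketch: entries whose magnitude falls below the penalty are set exactly to zero, so a larger penalty yields a sparser network. The partial-correlation values below are illustrative placeholders, not estimates from the study.

```python
import numpy as np

def soft_threshold(x, lam):
    # L1 (lasso-style) shrinkage: values with |x| <= lam become exactly zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Hypothetical partial-correlation estimates for five gene pairs
rho = np.array([0.55, -0.32, 0.08, -0.05, 0.02])

strong_penalty = soft_threshold(rho, 0.30)  # most entries shrunk to zero
weak_penalty = soft_threshold(rho, 0.03)    # more edges survive

print(strong_penalty)
print(weak_penalty)
```

With the strong penalty only the two largest correlations survive as edges; lowering the penalty recovers weaker edges, at the cost of estimating many more nonzero parameters.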
Another conclusion is that accuracy increases with sample size. Over the range of sample sizes tested (20–1,000), the largest performance gains came at the low end; the gains began to saturate as the number of samples approached 1,000. This suggests that collecting thousands of samples may not offer significant further performance improvements.
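This diminishing-returns pattern is consistent with the sampling error of a correlation estimate, which shrinks roughly as one over the square root of the sample size. A minimal sketch using the standard error of the Fisher z-transformed correlation (a standard result, not a quantity computed in the study):

```python
import math

def fisher_z_se(n):
    # Standard error of the Fisher z-transformed sample correlation
    # for n observations: 1 / sqrt(n - 3)
    return 1.0 / math.sqrt(n - 3)

for n in (20, 100, 500, 1000):
    print(n, round(fisher_z_se(n), 3))
```

Going from 20 to 100 samples cuts the error by more than half, while going from 500 to 1,000 samples barely changes it, mirroring the saturation observed in the benchmarks.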
This study also demonstrates that current techniques can generate accurate, informative networks even with dozens or hundreds of genes. Several algorithms scaled well to networks of this size without requiring sophisticated computational resources.
One disadvantage of probabilistic-network-based methods is the need to discretize the data. When using probabilistic networks, it is generally preferable to discretize into a small number of “buckets” that directly represent an underlying biological observation. To this end, data are typically discretized into binary buckets (a gene is either “on” or “off”) or ternary buckets (“under-expressed,” “normally expressed,” and “over-expressed”). Unfortunately, fitting the data into any reasonable number of buckets results in substantial information loss.
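A common way to obtain ternary buckets is to threshold z-scores of the expression values. The sketch below uses a one-standard-deviation cutoff; the threshold and the toy expression vector are illustrative assumptions, not choices made in the study.

```python
import numpy as np

def discretize_ternary(expr, z=1.0):
    # Map expression values to -1 (under-expressed), 0 (normally
    # expressed), or +1 (over-expressed) using z-score cutoffs.
    mu, sd = expr.mean(), expr.std()
    out = np.zeros(expr.shape, dtype=int)
    out[expr > mu + z * sd] = 1
    out[expr < mu - z * sd] = -1
    return out

expr = np.array([0.1, 0.2, 2.5, 0.15, -2.0, 0.05])
print(discretize_ternary(expr))  # [ 0  0  1  0 -1  0]
```

The information loss mentioned above is visible here: values just inside the cutoffs are collapsed into the “normal” bucket regardless of how close they were to the boundary.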
Finally, we found that the Bayesian methods did not scale well to larger networks. Because of their computational complexity and memory requirements, these methods, as currently implemented, are not an ideal choice for such large networks. WGCNA, GeneNet, ARACNE, and SPACE, on the other hand, were designed to construct gene networks at very large scales. It is also worth mentioning that the WGCNA package provides several useful tools to facilitate the analysis and visualization of the resulting networks, including tools to identify sub-networks and an interface to Cytoscape. The WGCNA package can be used not only for constructing gene networks but also for detecting modules/sub-networks, identifying hub genes, and selecting candidate genes as biomarkers.
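Once a network has been constructed, a simple criterion for nominating hub genes is node degree in the adjacency matrix. The sketch below uses a hypothetical 5-gene adjacency matrix with placeholder gene names; it is a generic degree-based heuristic, not the specific procedure of any package discussed here.

```python
import numpy as np

# Hypothetical symmetric adjacency matrix for a 5-gene network
# (1 = inferred edge); gene names are illustrative placeholders.
genes = ["g1", "g2", "g3", "g4", "g5"]
adj = np.array([
    [0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
])

degree = adj.sum(axis=1)             # number of connections per gene
hub = genes[int(np.argmax(degree))]  # highest-degree node = candidate hub
print(hub, list(degree))
```

Here g1 connects to all other genes and would be flagged as the hub; in practice, weighted connectivity or module membership scores (as provided by packages such as WGCNA) refine this basic idea.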