In this work, we present an algorithm to identify robust regulatory networks from time profiles of gene expression data. Our identification algorithm is primarily developed on the hypothesis of sparsity of biological network connections. In our earlier work we established the validity of the hypothesis of sparsity using a simplified linear ode representation of gene expression dynamics in a deterministic system. Herein we further advance the algorithm by incorporating more realistic non-linear representation using an S-system formulation of gene expression dynamics. The identification algorithm is formulated as a bi-level optimization problem in which the upper level solves an integer programming problem while the lower level is a continuous parameter identification problem. Furthermore, we propose a framework to incorporate noisy experimental data towards identification of a robust regulatory network. This is done by first generating artificial experimental repeats using the bootstrapping technique, followed by solving the identification formulation at each of the bootstrap data sets. From this library of identified prospective networks we isolate the most-repeated network connections which we hypothesize to be a robust connection, having low variability to experimental noise.
The upper level integer programming problem is solved using GA. There are several advantages of using GA to solve the above problem, the most important being that it does not require gradient evaluation. This is a significant advantage for the above problem with non-linear ode as constraint function. In addition, GA starts its search not from a single point in the feasible parameter space, but from multiple locations specified in the starting population. Hence, it holds the chance of converging at global minima, although such convergence cannot be guaranteed with GA. However, it also suffers from the disadvantage of increased computational cost. All the computations reported here have been carried out on 2.66 Ghz processer and 16
GB RAM server. The computational time for the five gene network without noise was 1
hour and the same network with noise was 2.5
hours. The computational time for the experimental data was 3
hours. For the 10 gene network, the genetic algorithm needed more generations to converge, resulting in computational time of 11
hours. Hence, extension of the current solution procedure to a much larger data set will be expensive. While the same formulation will still be applicable in a larger system, alternate solution procedures are currently being investigated for its extension to larger networks.
In the formulation presented in equation (2), the only user defined parameter is the value of the tolerance which dictates how closely the model prediction must agree with experimental dynamics in order for the network to be considered in the overall algorithm. While for an in silico case study without noise the tolerance may not play a vital role, it will be relevant when evaluating noisy scenarios. Specifying a low tolerance value (10-3) in our algorithm under noisy data failed to identify any network, as would be expected. Moreover, using a low tolerance is not advisable when using data sets with noisy replicates since we are not targeting a profile which exactly fits the noisy data; the target is to identify network profiles which describe all the noisy scenarios relatively well. On the other hand, a relaxed tolerance runs the risk of compromised prediction quality. In order to quantitatively evaluate the effect of specified tolerance on the identified network structure, the bootstrap/ bi-level optimization algorithm was repeated on the same 5-gene dataset with different tolerance values. Table illustrates how the precision and recall of the identified network changes with altered tolerance values. Quite interestingly it is observed that precision is relatively insensitive to the network tolerance, while recall worsens with increased tolerance. This is very encouraging since this implies that even with relaxed tolerance the identified network does not have false positive connections, although false negative connections increase. Increase in false negatives can be explained by the nature of the objective function which tries to minimize the number of connections. Hence, relaxed tolerance will always lead to a sparser network, as seen in Table . This analysis indicates that even for a relaxed constraint the algorithm may fail to identify all the connections but the identified connections will always be accurate with low probability of false positivity.
Effect of error constraint on 5-gene network identification, 5% noise
The performance of the developed robust identification formulation is illustrated using three different systems. The first two case studies are based on in silico
data which allows for detailed analysis of the performance of the algorithm. Overall the algorithm was found to demonstrate excellent predictive capability both in the small 5-gene network along with larger 10-gene network. The proposed bootstrapping scheme was found to adequately capture the precise network from the noisy data as well. Encouraged by the in silico
results, we applied our algorithm to dynamic experimental data of a 6-gene network responsible for DNA damage repair in E. Coli
]. While verification of the identified network will be difficult for this system, the time profile of gene expression data predicted by the identified network is in good agreement with the experimental data set. A thorough literature search for existing knowledge of network interactions revealed that quite a few of the predicted connections have been reported in parallel studies. Our algorithm inferred the regulation of recAumuD
, which is consistent with the findings reported earlier [12
]. Another interesting finding is that our results suggest that polB
does not influence any of the other genes in the system (pol B
does not up- or down-regulate any other gene), a finding which was also reported by Kumura et al.
Furthermore, our identified network shows the self-regulation of recA
. This protein is the main factor responsible for sensing DNA damage, and has been reported to promote the transcription of itself, thereby promoting damage recognition, and other repair genes [27
The current approach offers an improvement on existing algorithms. Numerous studies have used the 5-gene network (the current case study I) to test the accuracy and efficiency of their network identification methods. A comparison between the methods is presented by Kimura et a
] for the five gene network without noise. While most studies do not report the metrics of precision and recall, the accuracy of the results is still commented on. Most methods have a shorter computational time than the proposed method. However, our algorithm is able to predict a perfect network (recall and precision of 1), while the other algorithms deviate from this. Therefore, there is a trade-off between computational time and accuracy, and selection of the most appropriate method for the system of interest should be chosen judiciously. Nevertheless, this comparison shows that recall and precision are an improvement over many existing algorithms when analyzing the 5-gene network. Additional improvements could be made on the current approach to decrease computation time, such as parallel programming, or by altering the formulism (e.g. avoiding direct integration of the system of ode).