PathRNet was first validated on simulated data. The data were constructed to mimic a microarray experiment that measures the expression profiles of 200 genes at 9 sample time points. Eight temporal transcription modules were assumed to exist in the expression data, each of which contains 30 to 50 genes and may share the same subset of genes. Genes in a module were assumed to be co-expressed and to follow a coherent expression pattern [20]. A simple PathRNet (Fig. 4) was devised with each node corresponding to a molecule. Moreover, only nodes 1, 2, and 8 were assumed functional in the simulated expression data. The function of each node and its constituent genes were defined in a simulated prior knowledge database; the database contains a set of assumed pathways and TF-regulated gene sets, whose members may be the genes in the modules. Specifically, in the prior knowledge database, node 8 was assumed to be regulated by nodes 1-5. Note that only nodes 1 and 2 are functional nodes in the context-specific PathRNet that underlies the simulated expression data.
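The decision of whether a pathway or TF node is functional in a co-expression module rests on an enrichment test between the module's genes and the node's annotated gene set. As a minimal sketch of this idea (the function name and parameter values are ours, not from the paper), a hypergeometric upper-tail test can score the overlap:

```python
from math import comb

def enrichment_pvalue(overlap, module_size, category_size, n_genes):
    """Hypergeometric upper-tail p-value P(X >= overlap) when drawing
    module_size genes from n_genes, of which category_size belong to
    the annotated category (pathway or TF-regulated gene set)."""
    total = comb(n_genes, module_size)
    p = 0.0
    for k in range(overlap, min(module_size, category_size) + 1):
        p += comb(category_size, k) * comb(n_genes - category_size, module_size - k) / total
    return p

# Toy check with the simulation's scale (200 genes): a 40-gene module
# sharing 30 genes with a 50-gene category is strongly enriched, while
# an overlap of 10 is about what chance alone would give.
p_enriched = enrichment_pvalue(30, 40, 50, 200)
p_random = enrichment_pvalue(10, 40, 50, 200)
print(p_enriched < 1e-6, p_random > 0.05)  # True True
```

A node whose gene set passes such a test against some module would be declared functional in the context-specific network.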
Figure 4. Simulated PathRNet. The simulated PathRNet consists of 8 nodes and 5 edges in the generic network, which correspond to 8 pathways or TFs and 5 regulations. Node 8 is assumed to be regulated by nodes 1-5 according to prior knowledge but only regulated by nodes 1 and 2 in the context-specific network.
The data pattern of each module was generated according to an AR(1) model: X_mt = φ X_m,t-1 + ε_mt, where X_mt represents the expression level of module m at time t, and ε_mt is a Gaussian innovation term following Normal(0, σ²). Then, the expression level of gene g in module m can be simulated as x_gt = X_mt for g ∈ G_m and t ∈ T_m, where G_m denotes the set of genes in module m and T_m is the duration of module m.
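The module-pattern generation and its replication across member genes can be sketched as follows; the AR(1) coefficient `phi` and innovation scale `sigma` are illustrative choices, not values specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_module_pattern(n_time=9, phi=0.9, sigma=0.1):
    """AR(1) pattern X_{m,t} = phi * X_{m,t-1} + eps_{m,t}, with
    eps_{m,t} ~ Normal(0, sigma^2).  phi and sigma are assumptions."""
    x = np.empty(n_time)
    x[0] = rng.normal()
    for t in range(1, n_time):
        x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

# Every gene g in module m inherits the module pattern over its duration.
pattern = simulate_module_pattern()
module_genes = 30
module_data = np.tile(pattern, (module_genes, 1))  # shape (30, 9)
print(module_data.shape)  # (30, 9)
```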
In order to simulate the regulatory impact, the pattern of node 8 is modelled as a linear combination of the patterns of nodes 1 and 2: X_8t = β1 X_1t + β2 X_2t, where β1 and β2 are the regulatory coefficients and follow an independent standard normal distribution Normal(0, 1²) in our experiments. Since node 8 is known to be regulated by nodes 1-5 according to the prior knowledge, it can be equivalently expressed as X_8t = β1 X_1t + β2 X_2t + β3 X_3t + β4 X_4t + β5 X_5t, where β3 = β4 = β5 = 0. Then, each expression level in a module is multiplied by a random number drawn from Normal(0, 1²); this simulates the difference in expression strength among genes. Finally, Gaussian additive noise was added and the data were arranged in a matrix resembling a microarray; the rows of the data matrix were also shuffled to mimic real biological data.
The simulated system is similar to the one used in [14], except that additional network structure was specified. The goal of PATTERN is to reconstruct the embedded network structure and estimate the regulatory coefficients βi given the prior knowledge and the gene expression data. In the following experiments, we evaluated the impacts of noise and prior knowledge on the performance of the proposed PATTERN.
In the first experiment, PATTERN was evaluated under different noise variances by checking whether the correct network model was recovered and calculating the percentage of correct model predictions. A network model is considered correctly identified when all the nodes within the network are predicted correctly, i.e., all the functional nodes are predicted to be functional and all the nonfunctional nodes are predicted to be nonfunctional. For instance, in the network shown in Fig. 4, the network structure is correctly predicted if and only if nodes 1, 2, and 8 are predicted to be functional while nodes 3-5 are predicted to be nonfunctional. (Nodes 6 and 7 are independent nodes and thus are not considered.) The percentage of correct model predictions is defined as the ratio of the number of correct model predictions to the total number of experimental trials. For example, if models are predicted correctly 80 times out of 100 trials, the percentage of correct model predictions is 80%.
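This all-or-nothing scoring rule is straightforward to state in code; the following sketch uses the node numbering of Fig. 4 with hypothetical trial outcomes:

```python
def model_correct(predicted_functional, true_functional, considered):
    """A trial counts as correct only if every considered node's
    functional/nonfunctional call matches the ground truth."""
    return all((n in predicted_functional) == (n in true_functional)
               for n in considered)

# Nodes 6 and 7 are independent and excluded from scoring.
considered = {1, 2, 3, 4, 5, 8}
truth = {1, 2, 8}

# Four hypothetical trials: two exact recoveries, one false positive
# (node 3), one miss (node 2).
trials = [{1, 2, 8}, {1, 2, 3, 8}, {1, 2, 8}, {1, 8}]
pct = 100 * sum(model_correct(t, truth, considered) for t in trials) / len(trials)
print(pct)  # 50.0
```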
The result of the first experiment is shown in Fig. 5. It can be seen that PATTERN was able to identify the network structure correctly when the noise is small, and its performance decreases as the noise increases. To further evaluate the ability of PATTERN to estimate the regulatory coefficients βi, the mean squared error of the estimated βi was calculated and compared with that of a direct method. The direct method considers the complete model, i.e., it treats all the nodes as functional. For the network shown in Fig. 4, the direct method includes all five parent nodes 1-5 of node 8, and the actual regression model is y = x1β1 + x2β2 + x3β3 + x4β4 + x5β5 + ε. The direct method mimics the common practice of Bayesian network based reconstruction approaches and infers the regulatory relationships without considering node enrichment, thus involving all the nodes in the inference. The comparison is shown in Fig. 5. Compared with the direct method, PATTERN greatly improves the estimation accuracy of βi under all tested noise conditions, which demonstrates the capability of PATTERN to identify the correct network structure and accurately estimate βi.
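The advantage of the reduced model over the direct (full) model can be reproduced in a small Monte Carlo sketch: both are fitted by ordinary least squares, but the direct method spends its nine observations estimating five coefficients, three of which are truly zero. The noise level and trial count below are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_time, sigma = 9, 0.5
beta_true = np.array([rng.normal(), rng.normal(), 0.0, 0.0, 0.0])

mse_direct, mse_reduced = [], []
for _ in range(200):
    X = rng.normal(size=(n_time, 5))          # patterns of nodes 1-5
    y = X @ beta_true + rng.normal(0.0, sigma, n_time)
    # Direct method: regress on all five candidate parents.
    b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    # PATTERN-style: regress only on the enriched parents (nodes 1 and 2),
    # fixing beta3 = beta4 = beta5 = 0.
    b_red = np.zeros(5)
    b_red[:2], *_ = np.linalg.lstsq(X[:, :2], y, rcond=None)
    mse_direct.append(np.mean((b_full - beta_true) ** 2))
    mse_reduced.append(np.mean((b_red - beta_true) ** 2))

# Fewer free parameters on the correct support -> lower estimation error.
print(np.mean(mse_reduced) < np.mean(mse_direct))
```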
Figure 5. PATTERN on the small synthetic PathRNet. (a) The vertical axis represents the percentage of correctly predicted network structures and the horizontal axis denotes the noise standard deviation. The capability of PATTERN to recover the correct network structure …
Note that in PATTERN the enrichment analysis serves as a model selection approach. The performance improvement of PATTERN over the direct method in βi estimation demonstrates the importance of model selection. As seen in this example, to recover the true network structure, we need to 1) remove the nonfunctional nodes 3-5 from the network, and 2) decide whether nodes 1 and 2 regulate node 8. In PATTERN, objective 1) is achieved by the enrichment analysis in the context-specific PathRNet construction module, and objective 2) is tackled by the BIC model selection in the regulatory PathRNet construction module. In the direct method, however, objective 1) is not addressed.
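The BIC-based selection of objective 2) can be sketched as an exhaustive search over regulator subsets, scoring each by a standard Gaussian BIC; the exact scoring used by PATTERN may differ, and the helper name and data below are ours:

```python
import numpy as np
from itertools import combinations

def bic_select(y, X, candidates):
    """Pick the regulator subset minimizing BIC = n*log(RSS/n) + k*log(n).
    A sketch of the model-selection idea, not the paper's exact criterion."""
    n = len(y)
    best, best_bic = (), np.inf
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            Xs = X[:, list(subset)]
            b, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ b) ** 2)
            bic = n * np.log(rss / n + 1e-12) + k * np.log(n)
            if bic < best_bic:
                best, best_bic = subset, bic
    return best

# After enrichment has pruned nodes 3-5, only nodes 1 and 2 (columns 0, 1)
# remain as candidate regulators of node 8.
rng = np.random.default_rng(3)
X = rng.normal(size=(9, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0.0, 0.05, 9)
print(bic_select(y, X, [0, 1]))  # (0, 1)
```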
Next, we evaluated the impact of the annotation database on performance. Even though great effort has been made to construct and improve various prior knowledge databases, the existing databases are still very noisy; annotation is often incomplete and contains errors. This scenario was simulated by first assigning a function category to only a fraction of the genes in an embedded module, leaving the rest without any annotation, and then introducing randomly selected genes into the function category. The validation result is shown in Fig. . It can be seen from the figure that the ability of PATTERN to identify the correct network structure and estimate the coefficients was not significantly affected as long as more than 50% of the coregulated genes are correctly annotated. PATTERN outperforms the direct method under all tested conditions. These results imply that PATTERN is robust to the incomplete and erroneous annotations in prior knowledge databases. In this experiment, in order to maintain the network structure and obtain a comparable result, the annotations of genes that are shared by several pathways (and thus determine the edges and edge directions between nodes) were kept unchanged; only the annotations of the remaining genes were allowed to change.
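The annotation-corruption procedure can be sketched as follows; the fraction kept and the number of spurious genes are illustrative parameters:

```python
import random

def corrupt_annotation(category, all_genes, keep_frac=0.5, n_spurious=10, seed=0):
    """Keep only a fraction of the true annotations (incompleteness) and
    add randomly chosen outside genes (errors), mimicking a noisy
    prior knowledge database."""
    rng = random.Random(seed)
    kept = set(rng.sample(sorted(category), int(keep_frac * len(category))))
    outside = sorted(set(all_genes) - set(category))
    kept |= set(rng.sample(outside, n_spurious))
    return kept

# 40 truly co-regulated genes out of 200; keep half, inject 10 impostors.
true_category = set(range(40))
noisy = corrupt_annotation(true_category, range(200), keep_frac=0.5, n_spurious=10)
print(len(noisy & true_category), len(noisy - true_category))  # 20 10
```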
Additional experiments were conducted to test the performance of the proposed algorithm on a more complicated network. The algorithm was applied to a large dataset with 4000 genes and 14 samples, in which a synthetic network with 60 nodes (30 functional nodes and 30 nonfunctional nodes) and around 175 random edges (0.05 connectivity) was embedded; the regulatory coefficients follow an independent standard normal distribution. The size of this network mimics a rather realistic situation. The precision-recall curves of the reconstruction are plotted in Fig. 6.
Figure 6. Precision-recall curves of (a) nodes and (b) edges of PATTERN on the large synthetic PathRNet. PATTERN was applied to a large simulated dataset, which consists of 4000 genes and 14 time samples. The embedded PathRNet consists of 60 nodes (30 functional and 30 nonfunctional) …
It can be seen from the figure that the proposed algorithm PATTERN achieves relatively good precision and recall at low noise levels; its performance decreases as the noise variance increases. In particular, for a noise variance of 0.35, the precisions of nodes and edges drop below 1 only when the respective recalls are above 0.8 and 0.65.
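Precision and recall over sets of predicted nodes or edges are computed in the usual way; a small sketch with hypothetical predictions:

```python
def precision_recall(predicted, truth):
    """Precision and recall of a predicted set of nodes or (parent, child)
    edge tuples against the ground-truth set."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Hypothetical example: both true edges recovered plus one false edge.
true_edges = {(1, 8), (2, 8)}
pred_edges = {(1, 8), (2, 8), (3, 8)}
print(precision_recall(pred_edges, true_edges))  # precision 2/3, recall 1.0
```

Sweeping a confidence threshold over the predicted nodes or edges traces out the precision-recall curves shown in the figures.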
To evaluate the proposed algorithm under different types of noise, three noise distributions (Gaussian, Laplacian, and uniform) were compared, and the results are shown in Figs. 7 and 8.
Figure 7. Precision-recall performance of (a) nodes and (b) edges of PATTERN for the different noise distributions (small noise variance). PATTERN was applied to a simulated dataset which consists of 400 genes and 10 samples. The embedded PathRNet is shown in Fig. 4.
Figure 8. Precision-recall performance of (a) nodes and (b) edges of PATTERN for the different noise distributions (large noise variance). PATTERN was applied to a simulated dataset which consists of 400 genes and 10 samples. The embedded PathRNet is shown in Fig. 4.
It can be seen from the figures that, when the noise variance is small, the proposed algorithm PATTERN performs similarly on the datasets for the three noise distributions. When the noise variance is large, however, PATTERN performs much better on the Laplacian noise contaminated datasets than on the other two, which indicates that the proposed approach is more robust to heavy-tailed noise distributions when the noise standard deviation is the same.
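Matching the three distributions on standard deviation rather than on their natural scale parameters requires a small conversion; a sketch of generating variance-matched Gaussian, Laplacian, and uniform noise (the value of sigma is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n = 0.5, 200_000

gaussian = rng.normal(0.0, sigma, n)
# A Laplace distribution with scale b has std b*sqrt(2), so use b = sigma/sqrt(2).
laplacian = rng.laplace(0.0, sigma / np.sqrt(2), n)
# Uniform on [-a, a] has std a/sqrt(3), so use a = sigma*sqrt(3).
uniform = rng.uniform(-sigma * np.sqrt(3), sigma * np.sqrt(3), n)

# All three empirical standard deviations should be close to sigma.
for name, x in [("gaussian", gaussian), ("laplace", laplacian), ("uniform", uniform)]:
    print(name, round(float(np.std(x)), 2))
```

The heavier tails of the Laplacian then show up as occasional large outliers despite the identical standard deviation.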