To test the hypothesis that pathway activity analysis can identify changes that emerge at the pathway level that cannot be identified at the individual gene expression level, a synthetic pathway consisting of 45 genes was constructed and data representative of circadian pattern is generated at different noise levels. Subsequently we compared the significance of the event when 90, 50 and 10% of the genes within the synthetic pathway are circadian. These results are compared with the significance of the synthetic pathway showing circadian pattern in its pathway activity level in Figure . For either method, a significance value close to unity indicates that the event is highly likely. A typical threshold used to consider the significance of an event is 0.95. The purpose of this analysis is to evaluate the effect of noise level on the number of genes showing circadian pattern within the pathway.
Figure 1 Effect of noise level on the circadian dynamics of the synthetic pathway. As the noise level is increased, the significance (1-p-value) of the event that synthetic pathway is circadian and the events that 10, 50 and 90% of the genes within the synthetic (more ...)
From Figure , we observe that at low noise levels (0 < i < 6) we are confident that at least 90% of the genes within the synthetic pathway are circadian. However, the confidence level of detecting 90% of the genes is circadian decreases sharply as we increase the noise level. At this noise level, the underlying circadian pattern can be identified via both evaluating the circadian genes and pathway activity levels. At a noise level of 17, we can confidently conclude that only 50% of the genes are circadian. At higher noise levels, i.e. i = 30, we cannot even conclude that 10% of the genes are circadian (p-value > 0.05). Thus gene expression alone will not be able to provide information about the significant circadian pattern at this noise level. However, pathway activity analysis predicts with high confidence level (p-value < 0.0001) that there is an underlying circadian pattern within the synthetic pathway at this noise level (i = 30). Therefore, pathway activity levels are more robust than the gene expression levels in identifying underlying expression pattern within a pathway.
Nevertheless, a critical issue arises when we consider whether the variation captured by PALP (t) can represent the overall gene expression within a pathway. While we can be confident that a circadian pattern does exist, we cannot be confident that this pattern is real or due to random variations. To address this issue of random noise in the data vs. real gene expression changes, we evaluated the significance of the PALP (t) (presented in Figure at different noise levels). Even though PALP (t) might predict confidently a circadian pattern, that event could be the results of random variability in the data, as quantified by the significance of PALP (t). For example, at α = 10, the significance of the synthetic pathway being circadian is high; however, the significance of PALP (t) is considerably lower. This result indicates that the observed pattern cannot be solely attributed to the underlying structure of the data. Therefore, determining significance level PALP (t) is necessary for a reliable representation of circadian pathways.
Effect of noise level on the significance of PAL. As the noise level is increased, the significance (1-p-value) of the event that synthetic pathway is circadian and the significance of PAL are illustrated.
Circadian Signatures of Pathways in Rat Liver
We analyzed a rich time series of transcriptional profiling in rat liver where the rats were maintained in 12:12 hours light/dark cycle and exposed to the least possible environmental disturbances to minimize stress. We evaluated pathway activity level analysis on the microarray data and following applied a clustering analysis of the pathway activity levels.
As a result of the significance analysis f
P 486 of the 638 defined pathways in MSigDB are considered for further analysis. Having eliminated the pathway activity levels that do not exhibit a significant change over time (ANOVA, p-value < 0.01), the clustering analysis yielded five significant patterns of pathway activity levels (Figure ). We follow an unsupervised approach and identify the emergent pathway activity level patterns that appeared to have sinusoidal circadian patterns. The significant clusters represent the most populated pathway activity levels patterns within the data, whereas the rest of the data can be associated with random deviations. To quantify the characteristics of the circadian patterns, we perform the approximation of the centroid of each cluster to a sinusoidal function. The correlation between the centroid of each cluster and the associated fitted sinusoidal model exhibit high correlation (correlation = > 0.96, given on top of each graph in Figure ). The outline of this analysis is depicted in Figure .
Figure 3 The five significant clusters identified by a consensus clustering analysis using δ = 0.65. The pathway activity level (PAL) of pathways represents the presented curves and the exact reverse curves; PAL = (-) PAL. The signs of PAL are chosen (more ...)
Figure 4 The outline for clustering analysis of pathway activity levels. Pathway activity analysis begins with mapping gene expression onto known pre-defined groups of genes, pathways. Subsequently, the pathway activity levels are calculated using SVD and the (more ...)
The peak and nadir points are referred as the turning points. Cluster 1, Cluster 2 have their turning points around the middle of the light period (~6th-8th hours of the 24 hour cycle) and around the middle of the dark period (~18th and 20th hours 24 hour cycle). Cluster3, Cluster 4 and Cluster 5 have their turning points around the transition between the light and the dark period (~10th-13th hours of the 24 hour cycle) and their the turning points around the beginning of the light period and at the end of the dark period (~1st -2nd hours and ~20th and 22nd of the 24 hour cycle).
Evaluating pathway activity levels resulted cases where two pathways have similar fraction of overall gene expression captured by PALP (t), fP values, however the associated p-values, vary significantly. In example, fP MAPK Pathway, Nicotinate and nicotinamide metabolism and glycine, serine and threonine metabolism pathway are 0.23, 0.21 and 0.22 respectively (top panel of Figure ). On the other hand, their associated p-values are rather different; 0.66, 0.12 and 0, respectively (top panel of Figure ). Depending on the size of the pathways, which is number of the genes within a pathway, fP value can be obtained from random variations. Therefore, fP value itself is not an objective feature to identify whether the information captured overall gene expression by PALP (t)is significant. The significance analysis of PALP (t) enables us to filter out pathways that exhibit circadian rhythms by chance. For example, MAPK pathway and Nicotinate and nicotinamide metabolism may be identified as exhibiting circadian pattern without the significance analysis of PALP (t) because PALP (t) of MAPK Pathway and Nicotinate and nicotinamide metabolism exhibit high correlation with the fitted sinusoidal model (bottom left and bottom middle panels in Figure ).
Figure 5 Pathway activity levels for select pathways. A) The comparison of the fp to the permutated fp for MAPK Pathway, nicotinate and nicotinamide metabolism and glycine, serine and threonine metabolism pathway. The mean and the standard deviation interval of (more ...)
Glycine, serine and threonine metabolism exhibit both significant PALP (t) and high correlation with the fitted sinusoidal model (top right and bottom right panels in Figure ). To study the effect of individual gene expression on the pathway activity level, we depict the relationship between the weights and the correlation of the individual genes (the correlation between gene expression levels and the fitted sinusoidal model that represent the circadian pattern) in glycine, serine and threonine metabolism pathway Figure . The weight of a gene characterizes its contribution to the pathway activity level compared to the rest of the genes in the pathway.
Figure 6 The relationships between weight and the correlation of the genes within glycine, serine and threonine metabolism. The correlation is between gene expressions and the fitted sinusoidal models and is set to identify circadian genes. The threshold for circadian (more ...)
It can be seen from Figure , that Gldc, Cth, Chka, Chkb, Cbs, Bhmt and Shtm1 exhibit circadian patterns (correlation > 0.8) and also their weights are among the highest (weight > | -0.25|). In addition, the genes, which correlation is slightly under the threshold (correlation ~> 0.7) such as Gatm, Shtm2 and Alas1, have comparably higher absolute weights (weight ~> | -0.25|). The positive and negative values of weights indicate the direction of the gene expression when compared to the pathway activity level. In example, the genes that have negative weights have their peak in the early light period and their nadir in the early dark period (e.g. Chka, Cth), whereas the genes that have positive values have their nadir in the early light period and peak in the early dark period (e.g. Shmt1) (Figure .). The pathway activity levels of glycine, serine and threonine metabolism (bottom right panel in Figure ) follow the genes that have the positive weight value (e.g. Chka, Cth) and have its turning point in the early light period. The sign (positive or negative) of the weights can be chosen to represent pathway activity level as pathway activity levels indicate the overall orchestrated significant change in the gene expression within a pathway. Furthermore, we observe that there are genes, which correlation is slightly under the threshold (correlation ~> 0.7) but they have low absolute weights (weight ~< 0) such as Atp6voc and Sardh. The expression pattern of these genes, (as an example we depicted the expression pattern of Atp6voc in Figure ) does not coincide with the rest of the genes that have higher absolute weights, therefore do not contribute to the pathway activity level as much and has low weights.
Figure 7 Selected gene expressions within glycine, serine and threonine metabolism. The correlation between the gene expression levels and the fitted sinusoidal models and the weights, which are evaluated via SVD analysis, of the genes are given on top of each (more ...)
By applying SVD, a number of possible correlated variables (gene expressions) are mapped onto a smaller number of uncorrelated variables (the rows of
in Eq. (1). Pathway activity is denoted as the most significant data pattern which corresponds to the first row of
(Eq.(2))as the elements of SP
) are sorted from the highest to the lowest (Additional File 1
). The latter rows correspond to the other patterns which significances are determined with the associated eigenvalues. The matrix
is orthonormal matrix; therefore the rows represent different data patterns. The two sets of circadian patterns in glycine, serine and threonine metabolism (Figure ) are retrieved via the first two rows of
have high correlation with fitted sinusoidal model (Additional File 2
.). The p-value of
is statistically significant whereas the p-value of
is not statistically significant.
Table provides the detailed list of identified pathways in each cluster. In total, there are 78 pathways in five clusters. The list of genes in these pathways, associated gene expressions, the weights, the correlation between fitted sinusoidal model and the individual gene expressions can be found in Additional File 3
. The identification of the circadian signatures at the pathway level identified biologically relevant processes. As such, gene expression, metabolite concentration and enzyme activity in energy metabolism (e.g. glycolysis and gluconeogenesis), amino acid metabolism (e.g. lysine degradation, urea cycle) [23
], lipid metabolism (e.g. fatty acid biosynthesis) [25
] and DNA replication and protein synthesis (e.g. DNA replication reactome, Purine metabolism) [26
] exhibited having circadian dynamics in mammals liver.
Circadian pathways and associated cluster numbers.
In addition, we evaluated the enrichment of the pathways with the genes that exhibited circadian patterns in [9
]. MSigDB database [18
] offers an annotation tool that explore gene set annotations to gain further insight into the biology behind a gene set in question. The end result is a p-value indicating the significance of the overlap of the genes with a pathway http://www.broadinstitute.org/gsea/msigdb/annotate.jsp
The genes that exhibit circadian dynamics in [9
] have been mapped to 34 pathways (Additional File 4
), nine of which have significant p-value < 0.05.
To further explain the biological significance of the pathway activity level analysis, we studied the coordination between different pathways that is another level of organization in cellular processes, especially in cases where the product of one pathway is the substrate of another pathway. One classic example is the production of bile acids and it needs cholesterol as its starting material. Previous studies have shown that the pathways for steroid and bile acid biosynthesis are coordinated and coupled with cholesterol biosynthesis pathway for maximizing the efficiency of these processes. It has been established that bile acid levels are tightly controlled to ensure appropriate cholesterol catabolism, and promote optimal solubilization and absorption of fat and other essential nutrients [25
]. Figure shows the fitted sinusoidal models of PAL curves for cholesterol and bile acids biosynthesis. From the Figure , we could see that both pathways shows circadian rhythmicity with the phase of oscillations for cholesterol biosynthesis with a peak reaching at 15 hours after lights on, but the bile acid biosynthesis pathway shows a slight time lag in its oscillation with the peak occurring at 17 hours after lights on. In the figure, the PAL curves reach its peak during the mid-dark period and nadir during the mid-light period. As mentioned previously, the peak and nadir of PAL curves represent the maximum variation in the temporal gene expression in the pathway and the exact reverse of the PAL curve is mathematically same as the PAL curve itself (PAL-PAL). But from the literature, we know that these pathways peak during the dark period when the animals are actively feeding. Furthermore, the circadian oscillations in expression of many of the genes involved in the pathway (including the rate limiting genes like HMGCR for cholesterol biosynthesis [16
] and CYP7A1 for bile acid biosynthesis [28
] peaks during the dark/active period in the 24 hours light/dark cycle. So to deduce the biological significance of the PAL curve, along with the PAL curve pattern one should take into account of the oscillation patterns of the individual gene expression (including the rate limiting genes) along with any existing knowledge about the biological function and regulation of a given pathway. Additional file 5
provides the expression of individual genes in these pathways. Similar coupling of pathways are observed such as folate biosynthesis and one carbon pool by folate are coupled with purine and pyrimidine metabolism [29
Fitted sinusoidal models of pathway activity levels for cholesterol biosynthesis and bile acid biosynthesis.