The limitation of gene-phenotype approaches on tumor samples
The biological interpretations of microarray experiments are often performed by gene- or gene set-based approaches according to the gene-phenotype correlations. However, such correlation is barely detectable in heterogeneous genomics data sets (such as tumor transcriptomes) so that the detection power of either approach is decreased. To illustrate the influence of sample heterogeneity on data analysis, we selected data sets from three different liver pathological stages (Sci: cirrhosis, Sve: very early HCCs and Sva: very advanced HCCs in HCC2 experiment) to present the different levels of transcriptomic changes and heterogeneity. Six samples of each stage were compared with six normal liver samples to identify changes in gene expression. As shown in Figure , the transcriptomic changes increase along with the HCC progression. Sva have the greatest changes from normal liver (average Pearson dissimilarity: 0.049 ± 0.013), indicating many genes are de-regulated. The other early stages have relatively small changes and the change levels of Sci and Sve are similar (Sci: 0.021 ± 0.004 and Sve: 0.025 ± 0.003, Figure ).
Figure 3 Comparisons of homogeneous and heterogeneous data sets with their corresponding null data sets. (A-C) The t distributions for Sci, Sve and Sva data sets are compared with the corresponding null data set (gray line). (D) The t distributions of four test (more ...)
The transcriptomes of cirrhosis samples (§ in Figure ) are homogeneous and their transcriptome profiles (blue circle) as well as that of normal livers (open circle) are closely self-clustered in MDS plot (Figure ). Both very early HCC (# in Figure ) and very advanced HCC samples (‡ in Figure ) are heterogeneous and their transcriptome profiles are dispersed (very early HCC: green circle, very advanced HCC: red circle; Figure ). We also added an extreme case, 4 normal liver and 4 normal spleen samples in the analysis. The gene expression patterns of normal tissues are tightly controlled by tissue-specific regulation. The high homogeneity is shown in Figure and ¤ in Figure . A huge transcriptome difference (0.120 ± 0.009, Sts: Tissue comparison in Figure ) can be detected between the two tissue types.
To estimate background variations for each data set, we compared the t-distributions of these four data sets (Sci, Sve, Sva and Sts) with their corresponding null data sets (Snull), which have the same sample number but their gene signals were randomly sampled from normal distribution (μ = 0, σ2 = 1). In such comparison, heavier tail distribution indicates more genes have significant gene-phenotype correlations, i.e., data sets with heavy tails are more readily being analyzed by conventional t value-based methods. As an extreme case, Sts showed the most heavy tail distributions (black line in Figure ) among all tested data sets. We then compared two data sets with similar transcriptomic change level (Sci and Sve, Figure ). The tail distributions of Sci (blue line, Figure ) are heavier than those of Sve (green line, Figure ). Even in Sva (red line, Figure ), which contains over 2-fold dissimilarity than Sci (Figure ), the tail distributions (red line, Figure ) were still smaller than those of Sci (blue line, Figure ). The variation of t value is less informative in heterogeneous data sets. Classical approaches are therefore more applicable to homogenous data sets, but not heterogeneous ones (such as clinical tumor samples).
In IGA, the function annotation enrichment analysis is based on the most significant de-regulated genes (Figure ). These genes are selected from both ends of the t-statistic distribution, which are the most informative regions. Hence, IGA is expected to be a sensitive method for heterogeneous samples. However, de-regulated genes can barely be identified from heterogeneous and mild-changed samples, such as Sve (Figure and ). The selection of thresholds presents researchers with the dilemma of obtaining accurate interpretations but only few significant DEGs or obtaining more DEGs by reducing selection threshold (i.e., by reducing statistic powers) to expand the scope of analysis.
Figure 4 Threshold insensitivity and robustness of RE-based analysis. (A) The RE frequency distributions for three liver data sets (blue: Sci, green: Sve and red: Sva) and corresponding null data set (gray dash line: Snull). The x-axis represents the gene RE frequency (more ...)
In gene set-based analysis (GSA), the GS score, which is summarized from gene-phenotype correlations of all members in a gene set, is used to estimate functional changes (Figure ). Because GSA bypasses the threshold selection step, it can avoid the dilemma mentioned above. However, we wondered if such algorithms, based on the gene-phenotype correlation, also decrease their detection power in heterogeneous samples. We calculated the average t-statistic for each gene set (GS) to get GS scores. The increasing or decreasing of GS score represents the degree of gene set correlation with phenotype. We compared scores distribution of all data sets in Figure . Sts shows the most significant GS score variation (-5.5~22.7), which represents the obvious function difference between liver and spleen. The score variations of Sci, Sve and Sva are much smaller than Sts since they have similar tissue background when compared with reference samples, the normal liver. Although Sci has the smallest transcriptomic changes (Figure ), Sci shows a more significant GS score variation (-5.22~-6.20) than the other data sets do (Sve: -3.35~4.52, Sva: -4.75~4.09). The result implies that the heterogeneity of tumor samples reduces the detection power of GSA too.
Threshold insensitivity and robustness of eGSA
Recognizing the limitations of current gene- or gene set-based methods on heterogeneous samples, we alternatively analyzed transcriptomic changes by using regulatory events. Regulatory event is defined as a gene signal change of a test sample compared with a reference sample pool (see Materials and Methods). The distribution of regulatory event frequency in all data sets is clearly distinguishable from null data sets (Figure ) and the gradual increasing of tail distributions correlates with their transcriptomic change levels (Figure ). This result implies regulatory event is a better approach for analyzing heterogeneous samples. Based on this observation, we designed a novel regulatory event-based strategy, which we called the regulatory event-based Gene Set Analysis (eGSA), for the exploration of significant gene sets and compensate for the insufficiency of traditional method in heterogeneous samples.
One might argue that the regulatory event determination is still under an arbitrary cut-off threshold and its results also suffer from the same dilemma of threshold selection as IGA does. This argument can be partially answered by the insensitivity of regulatory event detection in different threshold values (see Figure ). The number of differentially expressed genes (DEGs) is sensitive to the threshold selection and is dramatically decreased when a stringent threshold is applied. Under a strict threshold (q-value < 1.0E-3), 1204 genes could be detected in homogenous data set (Sci, Figure ). However, only a few differentially expressed genes could be detected in heterogeneous data sets (Sve: 1 and Sva: 229, Figure ). In contrast, even under the most stringent threshold in our test (q-value < 1.0E-7), thousands of regulatory events could be detected robustly. Moreover, the event numbers of those data sets also correlated with the level of transcriptomic changes (Figure ).
The threshold insensitivity of regulatory event detection also reflects the robustness of eGSA. We selected the lipid metabolic process (GO:0006629) category in the GO databases as an example because both liver cirrhosis and carcinogenesis cause the defective lipid metabolism [11
]. Obvious down-regulation of this functional category is expected in all Sci
data sets (red line, Figure ). In IGA, hypergeometric test p
-values increase heavily toward the background values of the opposite regulation when higher threshold stringency was used in heterogeneous data set (dash line, Figure ). IGA even loses the detection power due to no annotated de-regulated genes (dash line, Figure ). Although the homogeneous data set, Sci
, showed higher tolerance to stringent thresholds, the p
-values appeared unstable when different thresholds were selected (red dash line, Figure ). Nevertheless, in eGSA the p
-values are more stable in both homogeneous and heterogeneous data set (red solid line in Figure ) and could be distinguished from the background values in all tested threshold values. Conclusively, measuring regulatory events is insensitive to threshold selection and can be performed more robustly in hypergeometric test.
Precise and broad biological interpretation of transcriptomechanges by eGSA
We further compared the accuracy and analysis scope between eGSA and IGA. Using 1634 gene sets defined by GO terms, we calculated the significance levels of gene sets based on the up-regulated genes or regulatory events. To make these two results comparable, we ranked gene sets by their p-values and then present their ranking orders in a scatter plot (Figure and ). For every gene set, we also calculated the ratio of up-regulated regulatory events over the total regulatory events within the gene set to indicate their regulatory directions.
Figure 5 Correlations between eGSA and IGA. (A) and (B), Scatter plots of GS significance ordered by eGSA (y-axis) and IGA (x-axis). Each circle represents one GS, and green triangles represent missed GSs in IGA. The color represents the percentage of up-regulated (more ...)
In Sts, the most distinctive data set (Figure ), the ranking orders of gene set are highly correlated between eGSA and IGA (Figure ), suggesting that eGSA has comparable detection power to that of IGA in homogeneous samples. This is due to the fact that most of the consistent gene regulations in Sts can be detected by both strategies. In Sve, the most indistinctive data set (Figure ), the top- and bottom-ranked gene sets of IGA and eGSA are correlated and are aligned in the diagonal corners (Figure ). This is because these gene sets represent the most consistent changes. However, eGSA detected more up-regulatory gene sets that were under-estimated (red nodes in the 2nd quadrant) or missed (green triangle) by IGA. This data demonstrates that eGSA has a broader scope of analysis than IGA. Moreover, several down-regulated GSs (blue nodes in the 4th quadrant) are misidentified as up-regulated GSs by IGA (Figure ).
We selected two extremely contradictory gene sets to demonstrate the different preferences between these two methods. The first gene set, the regulation of signal transduction (GO:0009966; indicated by a red arrow in Figure ), is highly-ranked in eGSA (170th) but not in IGA (1554th). The second gene set, the amino acid metabolic process (GO:0006520; blue arrow in Figure ), is highly ranked in IGA (94th) but not in eGSA (1359th). The details of both gene sets, including their differentially expressed genes, regulatory events and signal values, are shown in Figure and . In the first gene set, regulatory event-based analysis shows that 20% of genes (total regulatory events/(total genes × total test samples)) are up-regulated, which is in agreement with the observation in sample signal values. However, only 4% up-regulatory de-regulated genes can be detected due to the lack of consistency in gene expression level (Figure ). This also suggests that IGA may sometimes under-estimate the significance of a gene set. In the second GS, IGA detects a certain number of consistently up-regulatory de-regulated genes and conclude this gene set as a significantly up-regulated one (p = 7.99E-05). But this doesn't perfectly present the fact that there are actually more down-regulatory (24%) than up-regulatory (15%) expression changes in this gene set (Figure ). We conclude that for heterogeneous samples, since eGSA takes into account all REs in a gene set, this novel strategy can overcome certain limitations of IGA.
Functional patterns of very early HCC
To demonstrate the advantage of eGSA, we applied eGSA on the discovery of biological functions associated with liver cancer tumorigenesis. Hepatocellular carcinoma (HCC) is characterized as an obvious multistage process, just like many other types of human tumors. Understanding the de-regulated biological functions in early stage HCC will help to reveal the initial processes of hepatocarcinogenesis. The insights of early HCC functional changes and regulatory mechanisms are far from clear due to the lack of common oncogenes and tumor suppressors. One of the major difficulties in identifying common regulatory mechanisms is the genetic heterogeneity of HCCs [13
]. To overcome this heterogeneity issue, we applied eGSA on the analysis of the initial functional changes of HCV-induced HCCs (HCC2
In the original paper of HCC2
only 104 differentially expressed genes were identified when dysplasia samples (n = 17) were compared to early HCCs (n = 18) [15
]. The authors interpreted the biological function of early HCC based on gene functional classification. As a result, only a few gene sets were analyzed and their statistical significance had not been estimated yet. To provide a more comprehensive view on the initial processes of hepatocarcinogenesis, we applied eGSA on the earliest stage of HCCs (very early HCC in Figure , n = 6). Owing to the insensitivity of eGSA to transcriptomic heterogeneity, we can now identify more altered biological functions with statistical significance in very early HCC. To avoid potential errors from name-space issues [2
], we carefully removed the gene sets containing high percentage of duplicative genes (>10%). The filtered gene sets were functionally clustered into branches of an acyclic network according to their GO term relationships (Figure ). The GS ranking orders are presented in a color scale (red: up-regulated; green: down-regulated). The top 100 up- or down-regulated gene sets are listed in Additional file 2
Figure 6 Major functional changes in very early HCCs. The top ranked GSs (the dash box in Figure 5D) are organized into an acyclic network according to their GO term relationships. Each node represents one GS, and node color represents its ranking order by eGSA (more ...)
In the eGSA results, the major functional changes are basically consistent with the original observations and previous descriptions on HCC [11
]. Moreover, eGSA provides new insights into the initial processes of HCC. Gene sets related to major hepatic metabolisms, including cholesterol, steroid, hormone and amino acid metabolisms, are significantly down-regulated (see section B of Additional file 2
). The down-regulation of these biological functions may be due to the failure of normal hepatic functions. On the contrary, nucleoside and protein metabolisms required for rapid cell growth are significantly up-regulated. These include pyrimidine nucleoside metabolism (p
value is 2.3E-5, ranking order is 89th
), ubiquitin-mediated protein degradation (4.76E-5, 37th
), tRNA aminoacrylation (3E-3, 98th
) and ribosome biogenesis and assembly (9.52E-4, 76th
) (see section A of Additional file 2
The down-regulation of lipid metabolisms has been frequently reported in liver diseases [11
]. In the initial stage, fatty acid metabolism (2.6E-5, 22th
), steroid biosynthesis (6.5E-3,77th
) and sterol related metabolisms, e.g. hormone metabolism (6.4E-3,76th
), are all down regulated since several common enzymes are shared within these metabolic processes (see B section in Additional file 2
). Although lacking a consistent conclusion, several studies did report that a sterol hormone, androgen, was associated with HCC development: high level serum androgen was found as risk factor for HCC [18
], and androgen receptor has also been reported to promote cell growth and anti-apoptosis in hepatic cell line model [19
]. Hence, the linkages between the down-regulations of sterol metabolisms and the abnormal hormone metabolisms, and their effects on HCC initiation could be interesting and worthy of further study.
Escaping and surviving from attacks by the immune system are two other significant functional changes in HCC initiation. Certain immune functions, such as the inflammatory response (8.0E-3, 81th
), innate immune response (3.5E-3, 64th
) and immune effector process (9.2E-3, 86th
), are down-regulated while the survival pathway, the regulations of IκkB/NFκB cascade [20
] (2.6E-3, 93th
), is up-regulated (Additional file 2
). These observations are supported by many studies. For example, the number of different types of inflammatory cells are decreased in HCC tissue sections [22
]; the suppression of T cell-mediated immune responses is found in HCC patients and is associated with poor prognosis [23
]. To understand the regulatory mechanisms, we analyzed the biological pathways involved in those immune-related gene sets. Genes with high regulatory event frequency (up + down > 0.8) were annotated with their involved KEGG pathways [25
], and the most annotated pathways are listed in Additional file 3
. Several biological signaling pathways are highlighted, including the Toll-like receptor, JAK-STAT, MAPK and T cell receptor signaling pathway. Besides, a number of immune-related processes are also highlighted, such as the cell adhesion molecules, cytokine-cytokine receptors, antigen-processing and presentation, and NK cell mediated cytotoxicity (see Additional file 3
It is not surprising that cell cycle is up-regulated at the initial stage of HCC. However, in an advanced comparison of cell cycle phases, we identified that M phase (4E-3, 111th
) is the most significant phase over G1 (2.4E-1, 554th
) and S (2.5E-2, 180th
). M phase related gene sets, such as regulation of mitosis (1E-2, 94th
) and mitotic cell cycle checkpoint (1E-3, 79th
), are both significantly up-regulated (Figure and section A of Additional file 2
). To further illustrate the early changes of M phase regulation, we mapped the up-regulated regulatory event frequencies of four stages (very early, early, advance and very advance HCC stages) on a KEGG cell cycle pathway (hsa04110). As shown in Figure , most G2/M key regulators (CCNA2, CCNB and CDK1) are up-regulated at the first stage (also see Addtional file 2, the dashed box in panel A). These genes appear earlier than many G1/S key regulators (RB, RBL1, CDK2, CDK4 and CDK6) (Figure ). The early up-regulation of G2/M phase can also be supported by the appearance of the corresponding down-stream genes. For example, the regulators of mitosis and chromosome segregation (BUB1, BUB1B, BUB3, MAD2L1 and CDC20) were frequently up-regulated at the first stage, while the regulators of DNA replication (ORC and MCM complex) are up-regulated only at the later stages (Figure ). In another sample set (HCC1
), although lacking in stage information, we still found that these G1/M regulators were more frequently and synchronously up-regulated (see panel B of Additional file 4
). On the contrary, G1/S regulators, which are accompanied by DNA synthesis, are seldom found (panel B of Additional file 4
). Our findings are consistent with the previous protein and kinase activity studies. The enhancement of CCNA/CDK1 protein expression and kinase activity are observed in early HCC stage and non-tumorous cirrhosis tissues, but not in normal livers. G1/S regulators, CCND1, CCNE1 and CDK4, are higher in poorly differentiated HCC and advanced HCCs [26
Figure 7 The gene RE frequencies in the cell cycle pathway. The gene up-RE frequencies at each HCC stage are mapped as a grid on the cell cycle pathway (KEGG: hsa04110) by using a public software PathwayExplorer . In a grid, columns represent very early (VE), (more ...)
CCNA2 is considered as an important factor leading to cancer because of its dual roles in both S and M phase [27
]. The increased expressions of CCNA2 mRNA and protein were observed in many clinical HCC studies [27
]. However, in cell model experiments, over-expression of CCNA2 alone could not promote cell cycle but delayed metaphase/anaphase onset [31
]. These findings suggest that the high expression of CCNA2 might simply reflect high tumor proliferation rate rather than promote tumorigenesis [27
In our study, CCNA2 appears as a most frequent and earliest up-regulated cyclin at the initial stage. This strongly suggests that CCNA2 might play a crucial role in HCC initiation. Accompanying CCNA2, several mitosis-related regulators, such as CDC2, CCNB1/2 and CDC20, are synchronically up-regulated (Figure ), indicating that all these mitosis genes are also required for HCC initiation. However, previous individual gene studies did not test the gene combination (see Additional file 4
). Besides, an interesting supportive mechanism, the up-regulation of ubiquitin mediated proteolysis activity (4.76E-05, 37th
), could also potentially contribute to the rapid turn-over of cyclin and other mitosis components, and then help to complete the cell cycle (see section A of Additional file 2