Through the interactions of different co-regulators, specific transcription factors can regulate different cellular processes which sometimes lead to opposite downstream effects [32
]. SMAD transcription factors rely on transcription cofactors for appropriate activation or repression of target genes in response to TGF-β. TGF-β/SMAD signaling pathway is important for growth inhibition in normal ovarian epithelium, thus signaling disruption may lead to ovarian tumorigenesis [13
]. Although primary ovarian cancer retains TGF-β-mediated growth inhibition, studies demonstrated that most ovarian cancer are non-responsive to TGF-β signaling pathway [15
]. The cell line that we used in this study is responsive to TGF-β thus allowing us to evaluate the disruption of TGF-β signaling pathway in ovarian cancer. While most of the recent studies used expression microarrays to interrogate TGF-β targets in a particular system, those approaches cannot differentiate between direct and in-direct TGF-β/SMAD4 targets [35
]. In our study, we combined both ChIP-chip and expression microarray data to identify TGF-β/SMAD4 targets. This approach provides important information regarding the direct TGF-β/SMAD4 targets as they relate to ovarian biology. The regulatory module that we identified may also be important in understanding the disruption of TGF-β signaling in ovarian cancer. Only 17 out of the 150 TGF-β/SMAD targets were previously known to respond to TGF-β in various systems (See Additional File 4
– Table S1). Thus, more than 90% of the targets identified in this study are novel targets.
To gain further insight into the potential biological relevance of these newly identified TGF-β/SMAD targets, we classified the 150 targets by biological functions. This analysis revealed that the majority of the targets are related to either signaling pathways or play a role in transcriptional regulation (See Additional File 7
– Figure S4). For example, expression of the kit ligand, KITLG, is downregulated after addition of TGF-β, a result consistent with a previous study demonstrating that treatment of rat ovarian surface epithelial cells with TGF-β results in KITLG downregulation [37
]. Interestingly, KITLG is an important regulator of ovarian surface epithelial cell growth, and up-regulation of KITLG expression has been reported in ovarian cancer [38
]. Taken together, these observations suggest that disruption of TGF-β signaling pathway may lead to altered KITLG expression, which in turn could contribute to ovarian cancer carcinogenesis. An exciting extension is the possibility of KITLG activation in ovarian cancer initiating cells (OCICs), as we recently reported upregulation of the KITLG receptor, c-kit/CD117, in this highly tumorigenic subpopulation of cells in human ovarian adenocarincomas [40
It is interesting to note that TGFBR2 and SMAD3 were among the down-regulated targets, suggesting a possible negative feedback loop in the pathway. Since our microarray platform spans only the promoter regions, the number of binding sites in the present study is likely to be an underestimate at the whole genome level as recent finding indicates that the transcription factor binding happens throughout the genome [41
]. On the other hand, we have identified a total of 1946 SMAD4 targets by ChIP-Chip, but only 150 shows expression changes. The other 1796 targets, however, may show expression changes after 12 hours of TGF-β stimulation but could not be detected in the current setting. Alternatively, those targets may require other transcription factor(s) for initiation. Although the presence of false positive cannot be excluded, our experimental design, in which the ChIP-chip data is subtracted from 3 hr after addition of TGF-β to 0 hr, should have minimized this possibility.
In order to decipher complex gene regulatory networks associated with signaling pathways that play critical roles in normal and aberrant cell behavior, accurate prediction of transcription factor co-regulatory modules is essential [42
]. Computational analyses that rely solely on motifs derived from position weight matrix scanning are considered far from perfect and known to produce both false-positive and false-negative results [26
]. Phylogenetic footprinting, can be used to identify conserved sequences between distantly related species thereby improving module discovery [43
]. However, this comparative genomics approach can only partially improve prediction accuracy due to the lack of conserved binding sites among species and the unavailability of human gene counterparts in other organisms for comparative genomic analysis. With the advance of ChIP-chip technologies, we can now computationally interrogate the interactions between cis
-acting elements and transcription factors using experimental data [44
]. Recently, computational approaches that combine seemingly disparate experimental data have been successful in developing concise pathway models and transcriptional modules [45
]. RFs have been receiving increased attention in the data-mining field as a means of variable selection in many classification tasks in computational biology, including the selection of a subset of genetic markers and genes in microarray data analysis relevant for the prediction of a certain disease [47
]. Here, we have used an integrative modeling approach that combines CART and RF to classify different SMAD target promoters with reasonably good classification accuracy and reduced instability. Other popular classification methods, such as Naïve Bayes Tree, Logistic Model Tree, Bagging and LogitBoost (reviewed in [51
]) or combination of these algorithms with RF may give different performance results and derive different SMAD modules, which needs a systematic testing. Although the main goal in classification is to build a model with minimal mis-classification error in cross-validation (Table ), in this application we are equally interested in identifying TFBSs as highly important discriminating variables. One of the main goals of our analyses is to select potential SMAD interacting transcription factors from a large feature space (>150 transcription factors from Transfac database) in order to build SMAD modules. RF algorithm generates internal estimates of the decrease in the classifier's overall accuracy if that particular variable was not used in building the classifier. Thus, variables (TF binding sites) with larger importance measures can be deemed to have more power in discriminating different groups. A notable fact about our RF feature selection procedure is that more than one third of the transcription factors in the top ranking variables (Figures and ) are previously known to synergistically interact with SMAD in regulating the target promoter (Table ). Conversely, a substantial number of the known SMAD co-regulators appeared as the most important variables (See Additional File 5
– Table S4). This demonstrates the power of RF feature selection procedure and indicates that other top ranking transcription factors could be novel partners of SMAD, resulting in different transcriptional outcomes.
We first built a large number of RFs to identify and rank TFBSs of importance; and then supplied the resultant TFBSs as a relatively smaller set of predictor variables to CART for classification, using step-wise forward selection procedure. Based on our original microarray data, this process dramatically improved the misclassification error rate compared to using CART or RF analysis alone. By running a large number of RFs, we obtained a stable rank for the most important variables, which could not be achieved with a single RF run. When fitting the CART model, a series of models were built, starting with the most important TFBS as the predictor variable, followed by systematically adding more TFBSs from the variable reservoir. As expected, the overall misclassification error rate (defined as the sum of the error rates for the individual groups) first decreased and then increased again (See Additional File 8
– Figure S5). The one at the bottom of the decreasing trend is the best model, overcoming the limitation of using pre-set arbitrary cutoff values for variable selection in other CART models [42
By computational prediction, 83% of the target promoters contained SMAD4 consensus sequences, a significant enrichment compared to a random set of sequences. The consensus sequence of the SBE, nevertheless, contains only a weak signal. To ensure the binding of the SMAD complex, the presence of the binding sites of the co-regulators is equally important. Based on the ChIP-chip data, the combined classification tree analysis accurately predicted previously known TGF-β/SMAD co-regulators, including LEF1 [52
], ELK1 [53
], COUPTF [54
], E2F [55
], and P53 [56
]. The transcription factors that recognize the DR1 site, PPAR, HNF-4, COUPTF and RAR are all known SMAD partners [54
]. The combined RF and CART analysis also uncovered a novel co-regulator, PAX4, a paired-homeodomain transcription factor and important regulator of pancreas development [60
]. Previous studies have demonstrated regulation of PAX4 expression by activin A, a TGF-β superfamily member, and transcriptional regulation via interactions between paired domain transcription factors PAX8 and PAX6 and SMAD [62
]. Therefore, it seems reasonable to suggest that our computational prediction of a PAX4-SMAD interaction and subsequent target gene co-binding could contribute to gene up-regulation (Table S5). Furthermore, as a potential tumor suppressive function for PAX4 has recently been reported [64
], we speculate that disruption of PAX4 could compromise TGF-β-mediated growth inhibition and contribute to ovarian carcinogenesis.
Our integrative computational modeling and Ingenuity Pathway Analysis suggests that SMAD and LEF1 co-regulate some of the up-regulated SMAD responsive genes. It was shown that the activation of MSX2 gene was mediated via the cooperative binding of SMAD4 at two SBEs and of LEF1 at two Lef1/TCF binding sites [65
]. However, these predicted SMAD regulatory modules need to be confirmed by biological experiments. First, quantitative ChIP-PCR with the antibody against a TF can corroborate the recruitment of the TF to the promoter region of the target gene. Second, a promoter of a target gene with a TFBS deleted can be compared to a wild-type promoter to see if the TFBS confers any biological activity in a promoter-reporter assay setting. Third, comparison of target gene expression levels in cells transfected with siRNAs against specific SMAD proteins, or against a TF predicted to be a SMAD co-regulator, or against both, can reveal if there is any synergistic action between the two interacting partners.