Mouse embryonic stem cells (mESCs) are derived from the inner cell mass of the developing blastocyst and can be cultured indefinitely in vitro. Their defining features are the ability to self-renew and the ability to differentiate into all adult cell types, including the germ line. These features make mESCs well suited to applications in basic research and translational medicine. To harness their full potential, a better understanding of the molecular mechanisms that maintain mESC self-renewal and pluripotency is critical, and genes that are essential to maintaining mESC self-renewal are therefore of particular interest to the stem cell research field. In the past decade, significant steps have been made toward identifying and characterizing the genes and regulatory networks that compose the self-renewal machinery. An mESC stemness membership gene (MSMG) signature has been proposed through high-throughput profiling approaches, such as mRNA expression microarrays combined with advanced computational analyses, as well as through detailed low-throughput functional studies [1]. Genes that are predominantly expressed in mESCs are considered putative MSMG candidates. Nevertheless, the overlap among candidate MSMGs across different studies is surprisingly small, and the full set of MSMGs, the genes responsible for self-renewal and pluripotency, remains largely unidentified.
Fuelled by the growing volume, diversity and complexity of genome-wide profiling data generated by high-throughput biotechnologies, advanced computational approaches such as machine learning have been used to analyze multi-dimensional experimental data and to integrate results from many studies [4]. The support vector machine (SVM) is a popular supervised machine learning method based on statistical learning theory [11]. SVMs have been widely applied as classification tools to address biological questions such as gene function prediction [4], protein homolog identification [5], and disease diagnosis [6]. For example, previous studies used SVMs with gene expression data for gene function classification [7] and cancer tissue sample classification [8]. Such studies relied on a single type of experimental data. More recently, Zhu et al. developed a network-based SVM approach that combines prior knowledge with microarray data to improve predictive performance in cancer tissue diagnostics [9]. In another study, SVM-based predictions were used to infer gene function by concatenating normalized features from diverse datasets [10]. Hence, there is a trend toward combining heterogeneous data types to improve classification, with the SVM as the computational method of choice. Here we use this approach to predict MSMGs from two types of high-throughput data, combining several independent studies.
We hypothesized that data from mESC-related mRNA microarray profiling and genome-wide transcription factor binding profiling (ChIP-seq) can be used to classify genes that are important for ES cell self-renewal and pluripotency (MSMGs). We reasoned that these datasets contain subtle patterns from which a gene's involvement in self-renewal and pluripotency, framed as a yes-or-no question, can be inferred. We employed an SVM-based approach to construct a classifier that predicts class membership (MSMG or non-MSMG) for genes by combining genome-wide mRNA expression profiling data with ChIP-seq data. The accuracy and generality of the classifier were evaluated using leave-one-out cross-validation (LOOCV). We also compared the SVM classifier with other machine learning classification methods, including a linear discriminant classifier, decision trees, and artificial neural networks. Furthermore, we tested the ability of the SVM classifiers to predict the class membership of positive and negative gene lists from two genome-wide RNAi screens, to demonstrate how such a classification approach can help prioritize hits from such screens.
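To make the overall idea concrete, the following is a minimal sketch of the general technique described above: per-gene expression features and binding features are concatenated, a linear SVM is trained, and performance is estimated by LOOCV. It is not the classifier used in this study. The toy values, the gene count, the names `expression`, `binding` and `labels`, the hyperparameters, and the simple subgradient-descent trainer (standing in for a full SVM solver) are all illustrative assumptions.

```python
from statistics import mean, pstdev

# Hypothetical toy data: one row per gene (values are invented for illustration).
expression = [[2.0, 1.8], [1.7, 2.2], [2.3, 1.9], [1.9, 2.1],       # MSMG-like
              [-0.2, 0.1], [0.3, -0.1], [0.0, 0.2], [-0.1, -0.3]]   # non-MSMG-like
binding = [[1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
           [0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
labels = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = MSMG, -1 = non-MSMG

# Combine the two data types by concatenating their feature vectors per gene.
features = [e + b for e, b in zip(expression, binding)]


def zscore_fit(rows):
    """Return per-column means and standard deviations (1.0 if a column is constant)."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def zscore_apply(row, mus, sds):
    """Standardize one feature vector with previously fitted means and deviations."""
    return [(x - m) / s for x, m, s in zip(row, mus, sds)]


def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]   # gradient of the regularization term
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # only margin violators contribute
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def loocv_accuracy(X, y):
    """Hold out each gene once; fit scaling and the SVM on the remaining genes only."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        mus, sds = zscore_fit(X_tr)
        w, b = train_linear_svm([zscore_apply(r, mus, sds) for r in X_tr], y_tr)
        if predict(w, b, zscore_apply(X[i], mus, sds)) == y[i]:
            correct += 1
    return correct / len(X)


acc = loocv_accuracy(features, labels)
print("LOOCV accuracy on toy data:", acc)
```

In this sketch the normalization parameters are fitted inside each fold so that the held-out gene does not influence the scaling. A real analysis would likely use an established SVM implementation and a kernel and regularization setting chosen for the data, rather than this hand-written linear trainer.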