The iBBiG bi-clustering algorithm is optimized for module discovery in sparse, noisy binary genomics data and can be used for meta-GSA of multiple genomics datasets to discover modules: groups of phenotypes whose differential gene expression profiles are enriched in the same gene sets. Data integration is made tractable by transformation of continuous ‘noisy’ gene expression data (with different probes/genes in each study) into profiles of differentially enriched gene sets (which are common to all studies). We examined two GSA approaches: GSEAlm, which tests for enrichment of gene sets in genes that are differentially regulated between conditions, and GSVA, which is a single-sample GSA method. The former can be applied to identifying gene sets or pathways associated with known clinical covariates; the latter is a pure discovery approach that ignores prior sample knowledge.
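As a minimal sketch of this transformation, the Python snippet below binarizes per-comparison gene set P-values and stacks the comparisons on their shared gene sets. The 0.05 cutoff, the function name `binarize_studies`, and the toy P-values are illustrative assumptions and are not taken from this study.

```python
from typing import Dict, List, Tuple

def binarize_studies(
    studies: Dict[str, Dict[str, float]],
    cutoff: float = 0.05,  # assumed significance cutoff, for illustration only
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn {phenotype: {gene_set: p_value}} into a binary phenotype x gene-set matrix."""
    phenotypes = sorted(studies)
    # Keep only the gene sets that were tested in every phenotype comparison.
    common = set.intersection(*(set(pvals) for pvals in studies.values()))
    gene_sets = sorted(common)
    # An entry is 1 when the gene set is called enriched for that phenotype.
    matrix = [
        [1 if studies[ph][gs] < cutoff else 0 for gs in gene_sets]
        for ph in phenotypes
    ]
    return phenotypes, gene_sets, matrix

# Toy input with made-up P-values (hypothetical, not real results).
toy = {
    "study1_comparison": {"CELL_CYCLE": 0.001, "IMMUNE": 0.40, "HYPOXIA": 0.03},
    "study2_comparison": {"CELL_CYCLE": 0.010, "IMMUNE": 0.20, "HYPOXIA": 0.70},
}
rows, cols, binary = binarize_studies(toy)
```

Because every study is reduced to the same gene set columns, rows from studies that used different probes or platforms can be placed in one matrix.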
We designed iBBiG to have high specificity and thereby minimize the false-positive rate when discovering new classes, but the iterative approach employed in iBBiG ensures it is sufficiently sensitive to discover weak signals, even if they are potentially masked by stronger ones. When applied to simulated data, it outperforms widely used global clustering approaches (K-means, hierarchical cluster analysis) and newer bi-clustering approaches (Bimax, FABIA and COALESCE) and is able to find overlapping gene set modules of varying sizes. iBBiG was able to identify all clusters in a simulated meta-GSA dataset with high levels of specificity and sensitivity. An advantage of iBBiG relative to other methods is that it does not require a priori knowledge of the true number of clusters. Following the application of iBBiG, the number of true clusters can be estimated from the weighted scores of the extracted modules. In some cases, we observed that a module may represent the residue or remaining signal of a stronger, previously extracted module. This residue remains because iBBiG only removes information from the data matrix that is actually used for the entropy-based score in a module. However, we do not consider these residual modules to be a shortcoming of the method, as their existence facilitates discovery of the true overlap between modules and, further, these modules can be easily detected by looking at the overlap of phenotypes and gene sets.
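One way such an overlap check might be carried out is sketched below in Python. The use of the Jaccard index, the `Module` type and the toy memberships are assumptions made for illustration; the text above only states that residual modules are detected from the overlap of phenotypes and gene sets.

```python
from typing import Set, Tuple

# A module is represented here as (phenotypes, gene sets); this layout is hypothetical.
Module = Tuple[Set[str], Set[str]]

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def module_overlap(m1: Module, m2: Module) -> Tuple[float, float]:
    """Return (phenotype overlap, gene set overlap) for two modules."""
    return jaccard(m1[0], m2[0]), jaccard(m1[1], m2[1])

# Toy modules: the second shares most phenotypes with the first but few gene sets.
strong = ({"p1", "p2", "p3", "p4"}, {"gs1", "gs2", "gs3"})
residual = ({"p1", "p2", "p3"}, {"gs3", "gs4"})
pheno_overlap, gene_set_overlap = module_overlap(strong, residual)
```

In this toy case the phenotype overlap is high and the gene set overlap is low, which is one pattern that could flag the second module as a candidate residue of the first.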
Although iBBiG includes several parameters, we have shown that most impact only computation time and do not affect cluster discovery. The only parameter that had an impact on cluster discovery was α, which regulates the weighting between cluster homogeneity and the number of phenotypes. This parameter is useful in fine-adjustment of the sensitivity–specificity ratio.
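The trade-off controlled by α can be pictured with a deliberately simplified toy score in Python. This is not the iBBiG scoring function: the form `homogeneity ** alpha * n_phenotypes`, the function name `toy_score` and the example values are all assumptions chosen only to show how an exponent can shift preference between a small homogeneous cluster and a larger noisier one.

```python
def toy_score(homogeneity: float, n_phenotypes: int, alpha: float) -> float:
    """Hypothetical trade-off: homogeneity raised to alpha, scaled by cluster size."""
    return (homogeneity ** alpha) * n_phenotypes

# Two made-up candidate clusters: (homogeneity in [0, 1], number of phenotypes).
small_clean = (0.95, 10)
large_noisy = (0.70, 20)

preferred = {}
for alpha in (0.3, 3.0):
    s_small = toy_score(*small_clean, alpha)
    s_large = toy_score(*large_noisy, alpha)
    # Record which candidate the toy score favors at this alpha.
    preferred[alpha] = "small_clean" if s_small > s_large else "large_noisy"
```

Under these assumptions a low α favors the larger cluster (more sensitive) and a high α favors the more homogeneous one (more specific), which mirrors the sensitivity–specificity adjustment described above.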
One major advantage of iBBiG is its robustness in the presence of noise and its tolerance of missing data. It can tolerate high levels of noise because the entropy-derived fitness score adds members to a bi-cluster once the number of associations exceeds 50%. We demonstrate that iBBiG performs well even in the presence of false-positive associations and noise in both signal (20%) and background (40%). A second advantage is that iBBiG does not require a gene set to be associated with all phenotypes in a bi-cluster, which is an attractive feature in complex biological data where biological processes may be redundant or regulated by multiple factors concurrently. Many other bi-clustering algorithms, including Bimax and the recently described BiBiT (Rodriguez-Baena et al., 2011), discover only homogeneous bi-clusters and have low tolerance to noise and missing data. Bimax identified a large number of mini-bi-clusters and was unable to identify large clusters in our simulated dataset. In practice, application of Bimax to genomics data requires post-processing of bi-cluster results in order to either join or visualize overlapping bi-clusters (Santamaria et al., 2008).
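The 50% behavior described above can be illustrated with binary entropy, as in the Python sketch below. The exact iBBiG fitness score is not given in this section, so `column_fitness` is a simplified stand-in under that assumption: it scores one gene set across the phenotypes of a candidate bi-cluster as one minus the binary entropy, and only when more than half of the entries are 1.

```python
import math
from typing import Sequence

def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def column_fitness(column: Sequence[int]) -> float:
    """Simplified, hypothetical per-gene-set score within a candidate bi-cluster."""
    if not column:
        return 0.0
    p = sum(column) / len(column)  # fraction of phenotypes associated with the gene set
    if p <= 0.5:
        return 0.0  # not a member until associations exceed 50%
    return 1.0 - binary_entropy(p)

perfect = column_fitness([1, 1, 1, 1])      # all phenotypes associated
noisy = column_fitness([1, 1, 1, 1, 0])     # one missing association is tolerated
absent = column_fitness([1, 0, 1, 0])       # exactly 50%, so no contribution
```

A gene set with a few missing associations still contributes a positive score here, which is the kind of tolerance that algorithms requiring fully homogeneous bi-clusters lack.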
We applied iBBiG to the discovery of new modules among 1008 primary breast tumors and discovered 13 modules in an iBBiG-GSVA analysis. Each module contained samples from multiple studies, demonstrating successful data integration. While the largest, highest-confidence modules (M1–M4) recovered molecular subtypes known to be important in breast cancer, the smaller modules represented subsets of these subtypes, supporting recent evidence that there are subtypes within each of the principal breast cancer molecular subtypes (Curtis et al., 2012). The module M10 was characterized by gene sets associated with angiogenesis in response to hypoxia (or HIF1A deregulation) and was a strong predictor of recurrence in Luminal and ERBB2-amplified breast cancer. We uncovered different modules (n = 8) associated with pairwise tests of breast cancer clinical covariates in a meta-GSA of 21 breast cancer gene expression datasets. Five of the eight modules, including the first and largest module, were strongly associated with tumor grade. Most high-grade tumors were characterized by increased cell cycle (B1/B8) and those with fewer metastases had significant regulation of immune response genes (B4). In a meta-analysis of two datasets, Shi et al. (2010) also reported an up-regulation of proliferation genes and down-regulation of cell adhesion genes in high-grade breast tumors. Although they had insufficient numbers of patients to establish statistical significance, they observed that high levels of immune genes were an indicator of good prognosis in high-grade breast cancer patients. Our analysis also suggests that the B4 immune response is associated with better outcome. When we examined which genes most frequently appeared in module B4 GeneSigDB gene signatures, we found several chemokines including CCL5 (RANTES), a key regulator of T-cell immune response that is highly expressed in breast cancer and reported to be associated with metastases and progression (Soria and Ben-Baruch, 2008; Zhang et al., 2009). However, our analysis does not fit this prevailing hypothesis and suggests CCL5 is associated with good prognosis in high-grade breast cancer patients.
In this study, we have used iBBiG to discover clusters in matrices of discretized P-values from GSA of gene expression data; however, the method can also be easily applied to GSA of other data types, including SNP data (Cantor et al., 2010; Raychaudhuri et al., 2009). iBBiG can also be applied to non-gene-set data. For example, to demonstrate the application of iBBiG to an extremely sparse matrix (<0.3% of entries non-zero) in which small clusters are expected, iBBiG was applied to discretized data from the NHGRI genome-wide association study (GWAS) catalog (Hindorff et al., 2009). As the weighted scores were low for the modules identified, we averaged results over 100 runs of iBBiG and chose 10 robust modules. Only genes and traits that occurred in at least 65/100 runs were included (Supplementary Fig. S27). These modules are provided in Supplementary Table S22. iBBiG discovered a possible link of triglycerides, HDL cholesterol and waist circumference with the genes GCKR, LPL, BUD13 and ZNF259. Although LPL, BUD13 and ZNF259 have been implicated previously, this module suggested a new link between an expanding waistline and GCKR. While GSA requires input gene sets, it is not restricted to databases of curated gene sets and can use gene sets deduced through text mining from the published literature (Jelier et al., 2011; Krallinger et al., 2010; Raychaudhuri et al., 2009). We anticipate iBBiG will be useful in integrated data analysis of multiple data types. iBBiG can be performed on any binary matrix and could be applied to binary protein–protein interaction or RNAi data; we would like to extend it to other data types, including categorical data. An attractive feature of iBBiG compared to other methods, such as the recently described logistic regression meta-GSA approach (Montaner and Dopazo, 2010), is its ability to perform integrative analysis using dozens of datasets.
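The run-frequency filter used for the GWAS catalog modules (members retained if present in at least 65 of 100 runs) could be sketched in Python as follows. How modules were matched across runs is not described in this section, so the sketch assumes each run has already been reduced to the membership set of the same module; the function name `consensus_members` and the placeholder member names and frequencies are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set

def consensus_members(runs: Iterable[Set[str]], min_runs: int = 65) -> List[str]:
    """Keep members (genes or traits) that appear in at least `min_runs` of the runs."""
    counts = Counter(member for run in runs for member in run)
    return sorted(m for m, c in counts.items() if c >= min_runs)

# Placeholder data: how many of 100 runs each made-up member appears in.
frequency = {"geneA": 100, "geneB": 90, "geneC": 65, "geneD": 40}
runs = [{g for g, k in frequency.items() if i < k} for i in range(100)]

robust = consensus_members(runs, min_runs=65)
```

Counting occurrences as integers avoids any floating-point ambiguity at the 65-run boundary, so a member seen in exactly 65 runs is retained.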
In summary, iBBiG provides a simple, robust, rapid and scalable method for meta-GSA. When applied to simulated data it outperforms commonly used clustering and bi-clustering approaches and iteratively discovers gene set modules made up of both strong and weaker signals. Meta-GSA using iBBiG constitutes a new approach for discovery of pathway and gene set behavior across multiple studies and provides a higher-level understanding of gene and cellular function.
Funding: This work was supported by the Claudia Adams Barr foundation and grant 1U19CA148065 from the National Cancer Institute of the US National Institutes of Health. We are grateful to Prof. Curtis Huttenhower for his invaluable expertise and generous help in running COALESCE. We thank Benjamin Haibe-Kains and Stephan Winkler for valuable comments during the development of the algorithm, Markus Schröder for offering his expertise on implementing the R functions as C libraries and Prof. Daniel Silver for assistance in biological interpretation of results of the analyses.
Conflict of Interest: none declared