In recent years, genomic profiling of multiple data types in the same set of tumors has gained prominence. In a breast cancer study relating DNA copy number to gene expression, Pollack et al. estimated that 62% of highly amplified genes demonstrate moderately or highly elevated gene expression, and that DNA copy number aberrations account for ~10–12% of the global gene expression changes at the messenger RNA (mRNA) level. Hyman et al. observed similar results in breast cancer cell lines. MicroRNAs, which are small non-coding RNAs that repress gene expression by binding mRNA target transcripts, provide another mechanism of gene expression regulation. Over 1000 microRNAs are predicted to exist in humans, and they are estimated to target one-third of all genes in the genome (Lewis et al.). The NCI/NHGRI-sponsored Cancer Genome Atlas (TCGA) pilot project is a coordinated effort to explore the entire spectrum of genomic alterations in human cancer to obtain an integrated view of such interplays. The group recently published an interim analysis of DNA sequencing, copy number, gene expression and DNA methylation data in a large set of glioblastomas (TCGA, 2008).
In this study, we refer to any genomic dataset involving more than one data type measured in the same set of tumors as multiple genomic platform (MGP) data. Identifying tumor subtypes by simultaneously analyzing MGP data is a new problem. The current approach to subtype discovery across multiple data types is to cluster each data type separately and then to integrate the results manually. An ideal integrative clustering approach would allow joint inference from MGP data and generate a single integrated cluster assignment by simultaneously capturing patterns of genomic alterations that are: (i) consistent across multiple data types; (ii) specific to individual data types; or (iii) weak yet consistent across datasets, emerging only when levels of evidence are combined. The goal of this study is therefore to develop such an integrative framework for tumor subtype discovery.
There are two major challenges to the development of a truly integrative approach. First, to capture both concordant and unique alterations across data types, separate modeling of the covariance between data types and the variance–covariance structure within data types is needed. Most of the existing deterministic clustering methods cannot be easily adapted in this way. For example, Qin (2008) performed a hierarchical clustering of the correlation matrix between gene expression and microRNA data. Similarly, Lee et al. applied a biclustering algorithm to the correlation matrix to integrate DNA copy number and gene expression data. In both cases, the goal was to identify correlated patterns of change given the two data types. While identifying correlated patterns is sufficient for studying the regulatory mechanism of gene expression via copy number changes or epigenomic modifications, it is not suitable for integrative tumor subtype analysis, where both concordant and unique alteration patterns may be important in defining disease subgroups. The importance of capturing both concordant and unique alterations across data types will be demonstrated in our data examples. In addition, properly separating covariance between data types and variance within data types facilitates probabilistic inference for data integration.
Second, dimension reduction is key to the feasibility and performance of integrative clustering approaches. Methods that rely on pairwise correlation matrices are computationally prohibitive with today's high-resolution arrays. Dimension reduction techniques such as principal component analysis (PCA; Alter et al.; Holter et al.) and non-negative matrix factorization (NMF; Brunet et al.) have been proposed for use in combination with clustering algorithms. These methods work well for a single data type. However, simultaneous dimension reduction of multiple correlated datasets is beyond the capabilities of these algorithms.
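To illustrate the single-data-type workflow described above, the following is a minimal sketch, assuming NumPy is available, of PCA-based dimension reduction followed by K-means clustering. The function names (`pca_scores`, `kmeans`, `simulate_two_groups`), the simulated data and all parameter values are hypothetical choices made for illustration and are not taken from any of the cited studies.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project samples (rows of X) onto the leading principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]    # sample scores


def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm applied to the low-dimensional scores Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every sample to every center
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def simulate_two_groups(n_per_group=20, n_features=500,
                        n_informative=50, shift=3.0, seed=1):
    """Hypothetical data: two sample groups that differ in a few features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_group, n_features))
    X[n_per_group:, :n_informative] += shift         # group 2 is shifted
    truth = np.repeat([0, 1], n_per_group)
    return X, truth


X, truth = simulate_two_groups()
labels = kmeans(pca_scores(X, n_components=1), k=2)
```

Because this sketch operates on one data matrix at a time, it offers no way to share information between, say, copy number and expression matrices measured on the same tumors, which is the limitation noted above.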
Tipping and Bishop (1999) showed that the principal components can be computed through maximum-likelihood estimation of parameters under a Gaussian latent variable model. In their framework, the correlations among variables are modeled through latent variables in a space of substantially lower dimension, while an additional error term models the residual variance. Using the connection between PCA and latent variable models as a building block, we propose a novel integrative clustering method called iCluster that is based on a joint latent variable model. The main idea behind iCluster is that tumor subtypes can be modeled as unobserved (latent) variables that can be simultaneously estimated from copy number data, mRNA expression data and other available data types. It is a conceptually simple and computationally feasible model that allows simultaneous inference on any number and type of genomic datasets. Furthermore, we develop a sparse solution of the iCluster model by optimizing a penalized complete-data log-likelihood using the Expectation-Maximization (EM) algorithm (Dempster et al.). A lasso-type regularization method (Tibshirani, 1996) is used in the penalized complete-data likelihood. The resulting model continuously shrinks the coefficients for non-informative genes toward zero, leading to reduced variance and better clustering performance. Moreover, a variable selection strategy emerges (since the coefficients for some of the genes will be exactly zero under the lasso penalty), which helps to pinpoint important genes.
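The two ingredients mentioned above can be sketched in a few lines, assuming NumPy is available. The first function computes the closed-form maximum-likelihood estimates for the single-data-type Gaussian latent variable model x = Wz + mu + eps of Tipping and Bishop (1999), taking the arbitrary rotation in their solution to be the identity. The last function is the soft-thresholding operator, which is the form lasso shrinkage takes in the simplest (orthonormal design) case. This is not the iCluster estimation procedure itself: the function names, the penalty value `lam` and the way the two pieces are shown side by side are assumptions made for illustration only.

```python
import numpy as np


def ppca_ml(X, q):
    """Closed-form ML fit of x = W z + mu + eps with z ~ N(0, I) and
    eps ~ N(0, sigma2 * I). X is n samples by p features; q < p."""
    n, p = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / n               # ML covariance (divides by n)
    evals, evecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending
    sigma2 = evals[q:].mean()                   # mean of discarded eigenvalues
    # W = U_q (Lambda_q - sigma2 I)^(1/2), rotation taken as the identity
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return W, mu, sigma2


def posterior_mean(X, W, mu, sigma2):
    """E[z | x] = (W'W + sigma2 I)^(-1) W'(x - mu), one row per sample."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M, W.T @ (X - mu).T).T


def soft_threshold(W, lam):
    """Shrink every coefficient toward zero by lam; small ones become zero."""
    return np.sign(W) * np.maximum(np.abs(W) - lam, 0.0)
```

Applying `soft_threshold` to a loading matrix sets the rows of weakly loading features exactly to zero, which is the variable selection behaviour described above; how the penalty enters the actual EM updates is derived later in the article.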
The article is organized as follows. In Section 2.1, we discuss the K-means clustering algorithm and a global optimal solution for the K-means problem through PCA. In Section 2.2, we formulate the K-means problem as a Gaussian latent variable model and show the maximum likelihood-based solution and its connection with the PCA solution. Then in Section 2.3, we extend the latent variable model to allow multiple data types for the purpose of integrative clustering. A sparse solution is derived in Section 2.4. We demonstrate the method using two datasets from published studies in Section 3