Genomic studies have dramatically improved our understanding of the biology of tumor formation and treatment. In part, this has been accomplished by harnessing tools that profile the genes and proteins in tumor cells, revealing previously indistinguishable tumor sub-types that are likely to exhibit distinct sensitivities to treatment methods (Golub et al.; Perou et al.; Sorlie et al.). As these tumor sub-types are uncovered, it becomes possible to develop novel therapeutics more specifically targeted to the particular genetic defects that cause each cancer (Atkins and Gershell, 2002; Bild et al.; Pegram et al.). While recent advances have had a profound impact on our understanding of tumor biology, the limits of our understanding of the molecular nature of cancer obstruct the burgeoning efforts in ‘targeted therapeutics’ development. These limitations are apparent in the high failure rate of the discovery pipeline for novel cancer therapeutics (Kamb et al.) as well as in the continuing difficulty of predicting which patients will respond to a given therapeutic. A striking example is the fact that trastuzumab, the targeted therapeutic developed to treat HER2-amplified breast cancers, is ineffective in many patients who have HER2-overexpressing tumors and yet effective in some who do not (Paik et al.). Furthermore, sub-types typically remain poorly defined—e.g. the ‘basal-like’ breast cancer sub-type, for which different studies have inferred very distinct genetic signatures (Perou et al.; Sorlie et al.; Sotiriou et al.)—and yet many patients do not fall into any known sub-type. Our belief, then, is that clinical treatment of cancer will reap considerable benefit from the identification of new cancer sub-types and genetic signatures.
One promising approach for better elucidating the common mutational patterns by which tumors develop is to recognize that tumor development is an evolutionary process and to apply phylogenetic methods to tumor data to reveal these evolutionary relationships. Much of the work on tumor evolution models flows from the seminal efforts of Desper et al. on inferring oncogenetic trees from array comparative genomic hybridization (aCGH) profiles of tumor cells. The strength of this model stems from the extraction of ancestral structure from many probe sites per tumor, potentially utilizing measurements of expression or copy number changes across the entire genome. However, this comes at the cost of overlooking the diversity of cell populations within tumors, which can provide important clues to tumor progression but are conflated with one another in tissue-wide assays such as aCGH.
Cell-by-cell approaches, such as those of Pennington et al. and Shackney et al., use this heterogeneity information but at the cost of allowing only a small number of probes per cell. In recent work, Schwartz and Shackney (2010) proposed bridging the gap between these two methodologies by computationally inferring cell populations from tissue-wide gene expression samples. This inference was accomplished through ‘geometric unmixing’, a mathematical formalism of the problem of separating the components of mixed samples, in which each observation is presumed to be an unknown convex combination of several hidden fundamental components. Other approaches to inferring common pathways include mixture models of oncogenetic trees (Beerenwinkel et al.), principal component analysis (PCA)-based methods (Höglund et al.), conjunctive Bayesian networks (Gerstung et al.) and clustering (Liu et al.).
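The convex-combination model just described can be made concrete with a short numerical sketch (a hypothetical NumPy illustration with made-up dimensions; the names F, W and M are ours, not from the cited work): each observed sample is a weighted average of hidden component profiles, with non-negative weights that sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)
g, k, n = 50, 3, 20   # probes per sample, hidden components, observed samples

# Hypothetical fundamental components: one profile per column
# (values are illustrative, not real expression or copy-number data).
F = rng.uniform(0.0, 2.0, size=(g, k))

# Mixing weights: each sample's weights are non-negative and sum to one,
# i.e. a point in the (k-1)-dimensional probability simplex.
W = rng.dirichlet(np.ones(k), size=n).T          # shape (k, n)

# Observed mixed samples: every column of M is a convex combination
# of the columns of F, hence lies inside their convex hull.
M = F @ W                                        # shape (g, n)
```

The unmixing problem is then the inverse task: given only M, recover plausible F and W.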
Unmixing falls into the class of methods that seek to recover a set of pure sources from a set of mixed observations. Analogous problems have been termed the ‘cocktail party problem’, ‘blind source separation’ and ‘component analysis’, and various communities have formalized a menagerie of models with distinct statistical assumptions. In a broad sense, the classical approach of PCA (Pearson, 1901) seeks to factor the data under the constraint that, collectively, the fundamental components form an orthonormal system. Independent component analysis (ICA; Comon, 1994) seeks a set of statistically independent fundamental components. These methods, and their ilk, have been extended to represent non-linear data distributions through the use of kernel methods (see Schölkopf and Smola, 2002; Schölkopf et al., for details), which often confound modeling with black-box data transformations. Both PCA and ICA break down as pure source separators when the sources exhibit even a modest degree of correlation. Collectively, these methods place strong independence constraints on the fundamental components that are unlikely to hold for tumor samples, where we expect components to correspond to closely related cell states.
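The orthonormality constraint of classical PCA can be seen directly in a minimal SVD-based sketch (generic synthetic data, not code from the paper): the recovered axes are forced to be mutually orthogonal unit vectors regardless of how correlated the true sources are.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))            # 100 observations, 5 features

# Classical PCA: centre the data, then take the right singular vectors
# of the centred matrix as the principal axes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # rows are the principal axes

# The axes form an orthonormal system by construction: exactly the
# constraint that is violated when the true sources are correlated.
assert np.allclose(components @ components.T, np.eye(5), atol=1e-8)
```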
The structure of our present inference problem, that of extracting multiple correlated fundamental components, has motivated the development of new methods for unmixing genetic data. Similar unmixing methods were first developed for tumor samples by Billheimer and colleagues (Etzioni et al.) to improve the power of statistical tests on tumor samples in the presence of contaminating stromal cells. Similarly, a hidden Markov model approach to unmixing was developed by Lamy et al. to correct for stromal contamination in DNA copy number data. These recent advances demonstrate the feasibility of unmixing-based approaches for separating cell sub-populations in tumor data. Outside the bioinformatics community, geometric unmixing has been successfully applied in the geosciences (Ehrlich and Full, 1987) and in hyper-spectral image analysis (Chan et al.).
The recent work by Schwartz and Shackney (2010) applied the hard geometric unmixing model (see Section 2.1.1) to gene expression data with the goal of recovering expression signatures of tumor cell sub-types, so as to facilitate phylogenetic analysis of tumors. The results showed promise in identifying meaningful sub-populations and improving phylogenetic inferences. They were, however, hampered by limitations of the hard geometric approach, particularly its sensitivity to experimental error and outlier data points caused by the simplex fitting approach. An example of simplex fitting in the plane is shown in Fig. 1, illustrating why the strict containment model used in Chan et al.; Ehrlich and Full (1987); Schwartz and Shackney (2010) is extremely sensitive to noise in the data. In the present work, we introduce a soft geometric unmixing model (see Section 2.1.2) for tumor mixture separation, which relaxes the requirement for strict containment using a fitting criterion that is robust to noisy measurements. We develop a formalization of the problem and derive an efficient gradient-based optimization method. We develop this method specifically for analyzing tissue-wide DNA copy number data as assessed by aCGH. We demonstrate the value of the soft unmixing model by comparison to a hard unmixing method on synthetic and real aCGH data. We apply our method to an aCGH dataset taken from Navin et al. and show that the method identifies state sets corresponding to known sub-types, consistent with much of the analysis performed by the authors.
Fig. 1. (A) The minimum area fit of a simplex containing the sample points in the plane (shown in black) using the program in Section 2.1.1. On noiseless data, hard geometric unmixing recovers the locations of the fundamental components at the vertices. (B) However, …
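The difference between the two fitting criteria can be sketched numerically. The toy example below is our own illustration, not the paper's algorithm: the vertex matrix V, the helper functions and the projected-gradient settings are all assumptions. It computes the squared distance from a point to a fixed simplex; under strict containment an outlier would force the fitted simplex to expand, whereas a soft criterion instead charges it a finite penalty.

```python
import numpy as np

def project_simplex(w):
    """Euclidean projection of w onto the probability simplex."""
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, w.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

def soft_residual(V, x, steps=2000, lr=0.2):
    """Squared distance from x to the simplex spanned by the rows of V,
    found by projected gradient descent on the mixing weights w.
    This residual acts as the soft-fit penalty: points outside the
    simplex are tolerated rather than forbidden."""
    w = np.full(V.shape[0], 1.0 / V.shape[0])
    for _ in range(steps):
        w = project_simplex(w - lr * (V @ (V.T @ w - x)))
    r = V.T @ w - x
    return float(r @ r)

V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # toy component vertices
inside, outside = np.array([0.2, 0.3]), np.array([1.5, 1.5])
print(soft_residual(V, inside), soft_residual(V, outside))
```

For this triangle, the interior point reconstructs with essentially zero error, while the outlier at (1.5, 1.5) pays a squared-distance penalty of about 2 instead of dragging a containing simplex outward.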