DNA copy number change is an important form of structural variation in the human genome. Somatic copy number alterations (CNAs) are key genetic events in the development and progression of human cancers, and frequently contribute to tumorigenesis (Pollack et al., 2002
). The coverage of copy number changes varies from a few hundred to several million nucleotide bases, and somatic CNAs in tumors exhibit highly complex patterns. The advance of oligonucleotide-based single nucleotide polymorphism (SNP) arrays provides a high-density and allelic-specific genomic profile and enables researchers to study copy number changes on a genome-wide scale. For instance, Affymetrix offers several DNA analysis arrays for SNP genotyping and copy number variation (CNV) analysis, and the newest Affymetrix Genome-Wide Human SNP Array 6.0 features 1.8 million genetic markers, including more than 906 600 SNPs and more than 946 000 probes for detecting CNVs or CNAs.
Quantitative analysis of somatic CNAs has found broad application in cancer research. Although molecular analysis of tumors in their native tissue environment provides the most accurate picture of their in vivo
state, tissue samples often consist of mixed cancer and normal cells, and accordingly, the observed SNP intensity signals are the weighted sum of the copy numbers contributed from both cancer and normal cells. This tissue heterogeneity inherited in the measured copy number signals could significantly confound subsequent marker identification and molecular diagnosis rooted in cancer cells, e.g. true copy number estimation, consensus region detection, CNA association studies and detection of loss of heterozygosity and homozygous deletion. Experimental methods for minimizing normal cell contamination, such as cell enrichment or purification, are prohibitively expensive, inconvenient and prone to errors (Clarke et al., 2008
Here we ask whether it is possible to computationally correct normal tissue contamination by estimating the proportions of normal and cancer cells and recovering the true copy number profiles of cancer cells, based on the observed SNP intensity signals from cell mixtures. Albeit with limited success, some initial efforts have been recently made to address the impact of normal tissue contamination in copy number analysis (Assie et al., 2008
; Lamy et al., 2007
; Nancarrow et al., 2007
; Peiffer et al., 2006
) or to estimate the fraction of normal cells in tumor samples (Goransson et al., 2009
; Yamamoto et al., 2007
). Nancarrow et al. (2007
) developed a visual inspection toolkit that allows users to determine the presence of stromal contamination. Yamamoto et al. (2007
) and Goransson et al. (2009
) proposed computational methods to estimate the proportion of normal cells by matching to the experimental or simulated histograms of different mixtures. However, given the fact that the noise level in the raw copy number data is often quite high and varies from sample to sample, neither visual inspection nor simulated histogram matching will be able to produce an accurate and stable estimate of the fraction of normal cells in the tumor sample. An additional limitation associated with these methods is the lack of rigorous statistical principles in driving algorithm development.
In this study, we report a statistically principled in silico
approach to accurately detect genomic deletion type, estimate normal tissue contamination and accordingly recover the true copy number profile in cancer cells. By exploiting the allele-specific information provided by SNP arrays, we introduce a series of definitions and theorems to illustrate the detectability and its conditions, and propose a Bayesian Analysis of COpy number Mixtures (BACOM) method. The BACOM algorithm is based on a statistical mixture model for copy number deletion segments in heterogeneous tumor samples, whose parameters are estimated using Bayesian differentiation between hemizygous deletion (hemi-deletion, where one allele is absent) and homozygous deletion (homo-deletion, where both alleles are absent) and plug-in sample averaging. Subsequently, the weighted average of estimated normal tissue fraction coefficients across multiple segments is used to estimate the true copy numbers rooted in cancer cells across all loci on the genome. As shown in the Section 4
, this method not only produces cancer-specific copy number profiles but also substantially improves significant consensus events (SCEs) detection power.
To better serve the research community, we have developed a cross-platform Java application, which implements the whole pipeline of copy number analysis of heterogeneous cancer tissues. The BACOM software instantiates the algorithms described in this report and other necessary processing steps. To take advantage of many widely used packages in R to perform DNA copy number analysis and R's powerful and versatile visualization capabilities, we also provide an R interface, bacomR, that enables users to smoothly incorporate BACOM into their specific copy number analysis or to integrate BACOM with other R or Bioconductor packages. We expect this newly developed software to be a useful tool in routine copy number analysis of heterogeneous tissues.