High-throughput SNP-based genotyping arrays have been increasingly used to identify copy number variation (CNV) and copy-neutral loss of heterozygosity (LOH), and have provided invaluable insight into the complexity of genomic variation, especially disease-related variation. The accuracy and density of genotyping arrays have improved rapidly, with current versions having a density of over one million SNPs/probes. However, new detection algorithms are needed to extract more detailed information about genome complexity from these genotyping data, and to detect genome complexity in tumor samples mixed with stromal cells, a mixture that is almost unavoidable in biopsy samples. Under the assumption that all CNV events originate from the underlying normal state, we present MixHMM, a novel HMM-based algorithm that accurately detects copy number, allelic imbalance and genotype, from either homogeneous samples or heterogeneous samples in which tumor cells are mixed with a certain proportion of stromal cells. We validated the technique using both simulated data and real tumor data, including breast cancer and melanoma.
Allelic imbalance revealed by genotyping data includes not only classical single-copy LOH and copy-neutral LOH but, in principle, other forms of imbalance such as high-copy LOH and imbalanced amplification. Such information has not typically been a focus of whole-genome analyses, but it may provide insight into differing mechanisms of amplification at specific loci, or into mechanisms that differ among individual patients. Our preliminary analyses suggest that such events do occur in tumors. Only algorithms that can use the available data to detect these events will be able to establish how prevalent such changes are and help determine their functional significance. MixHMM models multiple states for a high-copy region; for example, three states instead of one are used for a 4-copy region. This is not only more genetically meaningful but also allows detection of all forms of allelic imbalance. A further benefit of this modeling strategy is that a more meaningful genotype can be assigned to each SNP: instead of using ‘AB’ for a 4-copy heterozygous genotype, we distinguish among ‘AABB’, ‘ABBB’ and ‘AAAB’.
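The multi-state idea can be illustrated with a small enumeration. The sketch below is not taken from the MixHMM package; it assumes that a state is defined by a copy number and by how many copies come from the minor parental haplotype, and it borrows the ‘F’/‘M’ labelling used elsewhere in this text. The function name `allelic_states` is hypothetical.

```python
from typing import List, Tuple


def allelic_states(copy_number: int) -> List[Tuple[str, List[str]]]:
    """Enumerate allelic states for one copy number (illustrative assumption).

    Each state is labelled with 'F' and 'M' haplotype copies; the list that
    accompanies it holds the genotypes a germline-heterozygous SNP could show.
    """
    states = []
    # k = copies of the minor haplotype; by symmetry k runs from 0 to n // 2
    for k in range(copy_number // 2 + 1):
        major = copy_number - k
        label = "F" * major + "M" * k or "null"
        # The B allele may sit on either haplotype, giving mirrored genotypes
        genotypes = sorted({"A" * major + "B" * k, "A" * k + "B" * major})
        states.append((label, genotypes))
    return states


for label, genotypes in allelic_states(4):
    print(label, genotypes)

# Number of states when copy numbers 0 through 7 are enumerated this way
print(sum(len(allelic_states(n)) for n in range(8)))
```

For a 4-copy region this yields three states (‘FFFF’, ‘FFFM’, ‘FFMM’), in line with the three states described above. Enumerating copy numbers 0 through 7 in this way gives 20 states in total, which is consistent with the 20-state model mentioned later in the text, although the actual state definitions in MixHMM may differ from this sketch.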
Similarly to other HMMs for copy number analysis, such as wuHMM, MixHMM requires no training data. The six model parameters for each hidden state (mean and SD of LRR, mean and SD of BAF, characteristic length of regions, proportion of SNPs) are provided with the package and can be easily modified by users to adapt to special samples. We found that CNV detection is robust to the transition parameters but sensitive to the emission parameters (distributions of LRR and BAF).
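To make the role of the six per-state parameters concrete, the following sketch groups them in a small data structure and scores one SNP against a state. It is an illustration under stated assumptions, not the MixHMM implementation: the Gaussian emission form, the exponential-decay transition rule, all names (`StateParams`, `emission_loglik`, `stay_probability`) and all numeric values are hypothetical. For simplicity it scores only a single heterozygous BAF band and ignores the homozygous bands near 0 and 1.

```python
import math
from dataclasses import dataclass


@dataclass
class StateParams:
    """The six per-state parameters named in the text (values are made up)."""
    lrr_mean: float
    lrr_sd: float
    baf_mean: float        # heterozygous BAF band only, a simplification
    baf_sd: float
    char_length: float     # characteristic region length, in base pairs
    snp_proportion: float  # prior proportion of SNPs in this state


def gaussian_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def emission_loglik(lrr: float, baf: float, p: StateParams) -> float:
    """Assumed emission model: independent Gaussians for LRR and BAF."""
    return gaussian_logpdf(lrr, p.lrr_mean, p.lrr_sd) + gaussian_logpdf(baf, p.baf_mean, p.baf_sd)


def stay_probability(distance_bp: float, p: StateParams) -> float:
    """Assumed transition rule: staying in a state decays with SNP distance."""
    return math.exp(-distance_bp / p.char_length)


# Hypothetical parameter values, chosen only for illustration
normal = StateParams(0.0, 0.2, 0.5, 0.03, 1e6, 0.8)
gain = StateParams(0.3, 0.2, 0.33, 0.03, 1e5, 0.05)

print(emission_loglik(0.02, 0.49, normal) > emission_loglik(0.02, 0.49, gain))
print(stay_probability(5000.0, normal))
```

In a layout like this, shifting `lrr_mean` or `baf_mean` changes which state best explains each SNP, whereas `char_length` only modulates how readily the path switches states, which is one way to picture why detection would be more sensitive to the emission parameters than to the transition parameters.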
Mismatches between data and model may cause inaccurate state assignments. These mismatches can stem from three different sources. The first type, which is the most common, arises because normalization procedures for the original density data were developed primarily for normal samples. In cancer samples with complex CNV events, BAF and LRR values of suboptimal quality are commonly found. The suboptimal quality can manifest as asymmetric heterozygous BAF bands, characteristic LRR values for 2n shifted considerably from 0, genomic wave effects in LRR values, etc., none of which are biological in origin. In these cases, alternative normalization and preprocessing tools should be applied before CNV detection (see method 4.7). The second type of mismatch stems from a violation of our assumption that the ‘contaminating’ stromal genome is normal: some regions of the stromal genome may not be normal, being for example in the ‘F’ (one-copy deletion) state instead of the ‘FM’ state because of, for instance, inherited copy number variants. In this case, genotyping data from a paired stromal sample are needed for accurate CNV detection. The third type of data-model mismatch arises because the genomes of tumor cells are sometimes not homogeneous (i.e. cancer cells with different copy number changes are mixed together), which violates the model assumption that the input data come from a mixture of two kinds of genomes. In this case, different regions will show different apparent proportions of normal cells, and small regions with alternating CNA states tend to be detected, which can be taken as a signal of inaccurate detection. Our model is not intended to distinguish among multiple clones, because the state and proportion of the tumor component cannot always be uniquely determined from the genotyping data of the mixed sample. For example, a mixture of 50% ‘FFFM’ and 50% ‘FM’ gives BAF and LRR distributions exactly the same as those from 100% ‘FFM’ (germline CNV). Instead, we use the estimated global proportion (corresponding to the dominant clone of tumor cells) for CNV detection. Multiple regions of a tumor could be analyzed to deal more accurately with heterogeneous tumors.
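The non-identifiability in the 50% ‘FFFM’ plus 50% ‘FM’ example can be checked with a few lines of arithmetic. The sketch below assumes idealized signals: BAF is the fraction of allele copies carrying the B allele (taken here to lie on the ‘M’ haplotype), and LRR is log2 of the average copy number over 2. Real arrays show compressed, noisier LRR values, and the function name `mixture_signal` is hypothetical.

```python
import math
from typing import List, Tuple


def mixture_signal(components: List[Tuple[float, str]]) -> Tuple[float, float]:
    """Idealized (BAF, LRR) for a mixture of cell populations.

    components: (proportion, state label) pairs, labels built from 'F' and 'M'.
    Assumes the B allele of the SNP sits on the 'M' haplotype.
    """
    f_copies = sum(p * state.count("F") for p, state in components)
    m_copies = sum(p * state.count("M") for p, state in components)
    total = f_copies + m_copies
    baf = m_copies / total          # fraction of copies carrying B
    lrr = math.log2(total / 2.0)    # relative to the normal two copies
    return baf, lrr


mixed = mixture_signal([(0.5, "FFFM"), (0.5, "FM")])
pure = mixture_signal([(1.0, "FFM")])
print(mixed, pure)
print(mixed == pure)
```

Both cases average two ‘F’ copies and one ‘M’ copy per cell, so they give a BAF of 1/3 and an idealized LRR of log2(1.5), and no method based on these two signals alone could tell them apart.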
Very recently, Sun et al. developed GenoCNA to detect cancer CNV in tumor samples contaminated with stroma. We have shown, using simulated samples and dilution series of cancer cell lines, that MixHMM is significantly more accurate in detecting CNV in samples with a considerable proportion of stroma. In addition, CNV regions with copy number up to 7 can be detected effectively with the 20-state MixHMM model. Although detection of higher copy numbers will inevitably be less accurate because of saturation effects in both hybridization and scanning, it is essential to detect the highly amplified regions in some cancer samples. For example, detection of patterns of high-level amplification, termed ‘firestorms’ and reported in many breast cancer samples, may be relevant for classification and may have prognostic significance.
MixHMM is designed to detect CNV states using BAF and LRR values, which are the typical output of Illumina BeadStudio. For other SNP array platforms such as the Affymetrix chip, the original outputs need to be transformed to BAF and LRR values beforehand. Fortunately, there are tools available for such tasks. For example, the PennCNV site (http://www.openbioinformatics.org/PennCNV) provides a protocol for that transformation. Although MixHMM currently works only for CNV detection on autosomes, it can be extended to cope with the X and Y chromosomes if the LRR values are well normalized.
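For readers unfamiliar with how BAF and LRR relate to raw two-channel intensities, the sketch below follows the commonly described scheme: an allelic angle theta and a total intensity R are computed per SNP, BAF is interpolated between canonical genotype cluster centers, and LRR is log2 of observed over expected R. It is not the PennCNV protocol itself; the function names, the cluster-center format and the numeric values are assumptions for illustration, and both intensities are assumed to be positive.

```python
import math
from typing import Dict, Tuple


def theta_r(x: float, y: float) -> Tuple[float, float]:
    """Allelic angle in [0, 1] and total intensity from A (x) and B (y) signals."""
    return (2.0 / math.pi) * math.atan2(y, x), x + y


def interpolate(t: float, t0: float, t1: float, v0: float, v1: float) -> float:
    """Linear interpolation of a value between two cluster centers."""
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def baf_lrr(x: float, y: float, clusters: Dict[str, Tuple[float, float]]) -> Tuple[float, float]:
    """clusters maps 'AA', 'AB', 'BB' to hypothetical (theta, R) cluster centers."""
    theta, r = theta_r(x, y)
    (t_aa, r_aa), (t_ab, r_ab), (t_bb, r_bb) = clusters["AA"], clusters["AB"], clusters["BB"]
    if theta <= t_aa:
        baf, r_expected = 0.0, r_aa
    elif theta < t_ab:
        baf = interpolate(theta, t_aa, t_ab, 0.0, 0.5)
        r_expected = interpolate(theta, t_aa, t_ab, r_aa, r_ab)
    elif theta < t_bb:
        baf = interpolate(theta, t_ab, t_bb, 0.5, 1.0)
        r_expected = interpolate(theta, t_ab, t_bb, r_ab, r_bb)
    else:
        baf, r_expected = 1.0, r_bb
    return baf, math.log2(r / r_expected)


# Hypothetical cluster centers for one SNP
centers = {"AA": (0.05, 1.0), "AB": (0.5, 1.0), "BB": (0.95, 1.0)}
print(baf_lrr(0.5, 0.5, centers))   # balanced signal at normal intensity
print(baf_lrr(1.0, 0.5, centers))   # extra A-allele signal
```

In practice the cluster centers come from a reference panel of normal samples for each SNP, which is one reason normalization developed on normal samples can fit tumor data poorly.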
In conclusion, our novel algorithm offers several distinct advantages over previous algorithms. MixHMM allows detection of copy number variations in tumor cells from a heterogeneous sample contaminated with stromal cells, and it allows detection of higher copy numbers and richer allelic imbalance. MixHMM requires no training data, and the model can be easily adapted to special sets of samples. These features are critical components of algorithms that will fully exploit the potential of rapidly evolving genotyping platforms for the detection of genomic variation and biomarkers.