We have developed an HMM-based method to infer the probability of LOH events from tumor samples without matched normals. The method utilizes several sources of information, including intermarker distances, SNP genotyping and mapping error rates, and haplotype information. LOH inferences using only tumor samples agree well with LOH patterns determined by analysis of tumor/normal pairs in two different array types (10 K and 100 K), three different tissue types (lung, glioma, and prostate), and in both cell lines and tumors, in test and in validation datasets. The inferences are robust to model parameter specifications. LOH is resolved to about 3 Mb or 100 SNPs in 100 K array data. This method makes it feasible to use SNP array technology to map LOH in tumor samples for which normal DNA is unavailable. Given that genotyping paired normal samples constitutes up to half the cost of LOH mapping experiments, this method also makes it feasible to perform these experiments at a much lower cost per sample, at the expense of slightly reduced accuracy.
One advantage of a model-based approach over the existing tumor-only LOH inference methods [3] is its extensibility. The basic HMM was developed using average heterozygosity rates, but was readily extended to incorporate SNP-specific heterozygosity rates and haplotype information as they became available. In addition, rather than making definitive calls, the algorithm infers the probability of LOH at each marker of a sample. This SNP-specific probability can then be used in further downstream analyses, such as identifying regions of shared LOH and sample clustering [5]. For example, a high probability of LOH across many samples can indicate potential TSGs. The HMM approach can also be used to infer LOH probabilities for paired normal and tumor samples (see Protocol S1), unifying the LOH analysis for paired tumor/normal and unpaired tumor samples.
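To make the idea of a per-marker LOH probability concrete, the following is a minimal sketch of a two-state (retention/LOH) HMM evaluated with the forward-backward algorithm. It is not the dChipSNP implementation. The function name `loh_posteriors`, the emission model, the form of the distance-dependent transition, and all default parameter values (`het_rate`, `error_rate`, `prior_loh`, `scale_mb`) are assumptions chosen for illustration only.

```python
import math

def loh_posteriors(calls, positions_mb, het_rate=0.3, error_rate=0.01,
                   prior_loh=0.2, scale_mb=100.0):
    """Return P(LOH | all calls) for each SNP of one tumor sample.

    calls        : "het" or "hom" per SNP, ordered along one chromosome
    positions_mb : SNP positions in Mb, in the same order
    All default values are illustrative assumptions, not published settings.
    """
    if not calls:
        return []
    prior = (1.0 - prior_loh, prior_loh)   # states: 0 = retention, 1 = LOH
    p_het = (het_rate, error_rate)         # P(het call | state)

    def emit(call):
        # Emission probability of the observed call under each state.
        return [p if call == "het" else 1.0 - p for p in p_het]

    def trans(d_mb):
        # Assumed transition model: with probability theta the state is
        # redrawn from the prior, otherwise it is copied from the previous
        # SNP. theta grows with the intermarker distance.
        theta = 1.0 - math.exp(-2.0 * d_mb / scale_mb)
        return [[(1.0 - theta) * (i == j) + theta * prior[j] for j in (0, 1)]
                for i in (0, 1)]

    def normalise(v):
        z = sum(v)
        return [x / z for x in v]

    n = len(calls)
    emis = [emit(c) for c in calls]
    # mats[t] is the transition matrix from SNP t-1 into SNP t.
    mats = [None] + [trans(positions_mb[t] - positions_mb[t - 1])
                     for t in range(1, n)]

    # Forward pass, rescaled at every step to avoid underflow.
    fwd = [normalise([prior[s] * emis[0][s] for s in (0, 1)])]
    for t in range(1, n):
        a = mats[t]
        fwd.append(normalise([
            emis[t][j] * sum(fwd[t - 1][i] * a[i][j] for i in (0, 1))
            for j in (0, 1)]))

    # Backward pass, rescaled in the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        a = mats[t + 1]
        bwd[t] = normalise([
            sum(a[i][j] * emis[t + 1][j] * bwd[t + 1][j] for j in (0, 1))
            for i in (0, 1)])

    # Posterior probability of the LOH state at each SNP.
    post = []
    for t in range(n):
        g = [fwd[t][s] * bwd[t][s] for s in (0, 1)]
        post.append(g[1] / (g[0] + g[1]))
    return post
```

Under these assumed parameters, a long run of homozygous calls flanked by heterozygous calls receives a high LOH probability in its interior, while the heterozygous flanks stay near zero. The returned list can be passed directly to downstream steps such as shared-LOH scoring or clustering.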
At higher SNP densities, where the haplotype structure of the human genome becomes relevant, an approach that considers the dependence among multiple SNPs in a region of LD is necessary in addition to the LD-HMM. We used a haplotype correction that compared regions of inferred putative LOH to a set of reference normal samples to reduce false LOH inferences. This method works best if the reference samples have similar haplotypes to the tumor sample. Use of reference samples from a different ethnic group tends not to decrease the sensitivity of the method, but can substantially decrease its specificity.
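One possible reading of such a haplotype correction can be sketched as follows: a putative LOH region is discarded when enough reference normal samples carry exactly the same genotypes across it, since the homozygous run is then more plausibly a common haplotype than a somatic event. The matching rule, the function names, the genotype encoding ("AA", "AB", "BB"), and the `min_matches` cutoff are all hypothetical and are not taken from the method described here.

```python
def haplotype_supported(tumor, references, region, min_matches=1):
    """True if at least min_matches reference normals share the tumor's
    genotypes over the whole region (start inclusive, end exclusive)."""
    start, end = region
    segment = tumor[start:end]
    matches = sum(1 for ref in references if ref[start:end] == segment)
    return matches >= min_matches

def correct_loh_regions(tumor, references, regions, min_matches=1):
    """Keep only putative LOH regions that are NOT explained by a genotype
    pattern also seen in the reference normals."""
    return [r for r in regions
            if not haplotype_supported(tumor, references, r, min_matches)]
```

With a rule of this kind, reference samples whose haplotypes differ from the tumor's would match less often, so fewer false regions would be removed. This is consistent with the loss of specificity noted above.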
False designation of regions of LOH due to allelic imbalance may lead to paradoxical results, with recurrently amplified oncogenes seen as potential TSGs. SNP arrays, by providing signal intensity along with genotyping data, allow such regions to be identified. We can thus integrate these data to exclude regions of putative LOH with high copy numbers as likely due to allelic imbalance. At the interpretive level, our finding that LOH is often copy-neutral suggests that LOH and copy loss should be considered independently when predicting the presence of a TSG, and may best be used in conjoined analyses.
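The integration of copy number with LOH calls can be illustrated with a small filter. This is a sketch under stated assumptions: the function name, the use of the mean copy number over a region, and the `max_copy` cutoff of 2.5 are hypothetical choices, not values from this study.

```python
def exclude_high_copy_loh(regions, copy_number, max_copy=2.5):
    """Drop putative LOH regions whose mean inferred copy number is high.

    regions     : list of (start, end) SNP index ranges, end exclusive
    copy_number : per-SNP copy number inferred from signal intensity
    max_copy    : hypothetical cutoff above which a region is attributed
                  to allelic imbalance rather than LOH
    """
    kept = []
    for start, end in regions:
        segment = copy_number[start:end]
        # Empty regions are dropped; others are kept only if not amplified.
        if segment and sum(segment) / len(segment) <= max_copy:
            kept.append((start, end))
    return kept
```

Regions that pass the filter include both copy-loss and copy-neutral LOH, so the two signals remain available for separate or joint analysis.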
The ability to identify regions of LOH in tumors without paired normal DNA allows LOH mapping in the many model systems lacking paired normal DNA, including cell lines and xenografts. As such model systems are the platform for experiments aimed at understanding the biology of human tumors, it is critical that we understand their genetic relationship to real human tumors. As an example, among the prostate cancer samples, LOH at the NKX3.1 locus is more prevalent among real tumors and xenografts than among cell lines, LOH at the p53 locus is more prevalent among xenografts than among real tumors or cell lines, and LOH at the Rb locus is equally prevalent in all three groups. Larger sample numbers are required to see whether these differences are statistically significant. Such studies of the prevalence of regions of LOH across model systems compared to real tumors may indicate systematic faults in the ability of model systems to reflect in vivo cancer biology and guide the use and development of appropriate models based on genetic organization.
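A comparison of LOH prevalence across sample groups could be computed from per-SNP LOH probabilities along the following lines. The function name, the rule that a sample shows LOH at a locus when its mean probability over the locus SNPs exceeds `threshold`, and the default of 0.5 are assumptions for illustration. The sketch reports proportions only and does not test for significance.

```python
from collections import defaultdict

def loh_prevalence(samples, locus_snps, threshold=0.5):
    """Fraction of samples in each group that show LOH at a locus.

    samples    : list of (group, posteriors), posteriors being the per-SNP
                 P(LOH) values for one sample
    locus_snps : indices of the SNPs covering the locus of interest
    threshold  : hypothetical cutoff on the mean P(LOH) over the locus
    """
    counts = defaultdict(lambda: [0, 0])   # group -> [n_with_loh, n_total]
    for group, posteriors in samples:
        mean_p = sum(posteriors[i] for i in locus_snps) / len(locus_snps)
        counts[group][0] += mean_p > threshold
        counts[group][1] += 1
    return {g: hits / total for g, (hits, total) in counts.items()}
```

The group labels would be whatever categories the experiment uses, for example tumors, xenografts, and cell lines.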
SNP array analysis of cancer genomes provides a single platform for copy number and LOH analysis. As these arrays move to higher resolution (500K), accounting for the haplotype structure of the human genome in the analysis of these data will be of greater import. The methods described herein should be readily extensible to both the higher density arrays and to the increasingly detailed information describing the haplotype structure of the human genome. The software package, dChipSNP, is freely available at http://www.dchip.org