Definition of major copy proportion
The major copy proportion (MCP) of a SNP is defined as C2/(C1 + C2), where C1 and C2 are the parental copy numbers at this SNP in a sample and C1 ≤ C2. The value of MCP is between 0.5 and 1 by definition, with various values corresponding to different relative proportions of parental copy numbers. The MCP is 0.5 for normal loci or balanced copy alterations, 1 for LOH, and a value between 0.5 and 1 for allelic imbalanced copy number alterations. MCP therefore quantifies allelic imbalance and is a natural extension of LOH analysis. MCP and total copy number (C1 + C2) together provide the same amount of information as allele-specific copy numbers, while each of them is a scalar quantity that can be more efficiently estimated and conveniently used in downstream analysis.
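As an illustration only, the definition above can be written as a short Python sketch. The function names `major_copy_proportion` and `allelic_copies` are hypothetical and are not part of dChip; the second function shows how MCP and total copy number together recover the two parental copy numbers.

```python
def major_copy_proportion(c1: float, c2: float) -> float:
    """MCP = C2 / (C1 + C2), where C2 is the larger parental copy number."""
    minor, major = sorted((c1, c2))
    total = minor + major
    if total <= 0:
        raise ValueError("total copy number must be positive")
    return major / total


def allelic_copies(total: float, mcp: float) -> tuple:
    """Recover (minor, major) parental copy numbers from total copy and MCP."""
    major = total * mcp
    return total - major, major


print(major_copy_proportion(1, 1))  # normal locus or balanced alteration
print(major_copy_proportion(0, 2))  # copy-neutral LOH
print(major_copy_proportion(1, 2))  # allelic-imbalanced gain
```

The three calls correspond to the three cases named in the text: MCP of 0.5, MCP of 1, and an intermediate value.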
We describe here two examples that can benefit from estimating MCP values. First, normal sample contamination in tumors often leads to conservative "No Call" genotypes and intervening LOH or retention calls (see below for specific examples). MCP can better quantify the proportion of normal sample contamination while still identifying allelic-imbalanced regions due to LOH. Second, tumors with hyperploidy often contain allelic-imbalanced regions with both parental alleles kept. If the total copy number in such regions is close to the cell ploidy, copy number analysis will reveal a normal relative copy number and LOH analysis will show retention. However, allele-specific copy number (ASCN) or MCP analysis can identify the allelic imbalance in such regions as a genomic alteration.
Hidden Markov Model for estimating MCP
SNP-based LOH or copy number data along a chromosome are locally correlated, and a hidden Markov model (HMM) is an effective analytic method for such a data structure. HMMs have been utilized to analyze array-based copy number changes [12] and to infer LOH from unpaired tumor samples [22]. Here we used a similar HMM for the MCP inference.
The normalized probe intensity values (Figure ) were used to compute allele-specific SNP signals and raw copy numbers (see "Methods"). We then used an HMM to model the MCP correlation of neighboring SNPs on a chromosome. The MCP values to be inferred range from 0.5 to 1 and take 11 states under the default step of 0.05 (comparable to the noise level in our data). The observed data are the raw allele A proportions (RAP), defined as RA/(RA + RB), where RA and RB are the raw allelic copy numbers of the two genotype alleles (A and B) at a SNP. For a heterozygous SNP in a sample, RAP should vary by a certain noise level around the unobserved MCP (when A is the major copy allele) or 1 – MCP (when A is the minor copy allele); for a homozygous SNP, RAP is close to 1 for genotype AA and close to 0 for genotype BB, since one of RA and RB is close to 0. These considerations motivate the functional form of the HMM emission distribution (see "Methods").
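The quantities in this paragraph can be sketched as follows. This is a minimal illustration, not the dChip implementation: the function names are hypothetical, and clipping negative raw copy numbers to zero is an assumption made here so that the proportion stays between 0 and 1.

```python
def mcp_states(step: float = 0.05) -> list:
    """Discrete MCP states from 0.5 to 1 (11 states for a step of 0.05)."""
    n_steps = round(0.5 / step)
    return [round(0.5 + i * step, 10) for i in range(n_steps + 1)]


def raw_allele_a_proportion(ra: float, rb: float) -> float:
    """RAP = RA / (RA + RB) from raw allelic copy numbers."""
    ra, rb = max(ra, 0.0), max(rb, 0.0)  # assumption: clip negative raw values
    if ra + rb == 0:
        raise ValueError("both raw allelic copy numbers are zero")
    return ra / (ra + rb)


def expected_rap(mcp: float, genotype: str, a_is_major: bool = True) -> float:
    """Value around which RAP is expected to vary for a given normal genotype."""
    if genotype == "AA":
        return 1.0
    if genotype == "BB":
        return 0.0
    if genotype == "AB":
        return mcp if a_is_major else 1.0 - mcp
    raise ValueError("genotype must be 'AA', 'AB' or 'BB'")


print(len(mcp_states()))
print(raw_allele_a_proportion(1.3, 0.7))
print(expected_rap(0.65, "AB", a_is_major=False))
```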
Figure 1 The probe level data of one SNP. (A) Left: The SNP has 20 probe pairs, whose normalized intensity values in one array are displayed and connected in blue (perfect match or PM) and gray (mismatch or MM) lines. The probe set has probe pairs for both A and ...
The HMM, with its emission and transition distributions, specifies the joint probability of the unobserved MCP and the observed RAP of all SNPs in a chromosome of a sample. The Viterbi algorithm [23] was then used to obtain the most probable MCP state path as the inferred MCP values. The procedure was run separately for all chromosomes and all samples in a dataset.
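The decoding step can be illustrated with a small Viterbi sketch over the 11 MCP states. The emission and transition distributions used by dChip are given in "Methods" and are not reproduced here; the two-component Gaussian emission for heterozygous SNPs, the noise level `sd`, and the self-transition probability `stay` below are all assumptions chosen only to make the example runnable.

```python
import math

STATES = [round(0.5 + 0.05 * i, 2) for i in range(11)]


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_emission(rap, mcp, sd):
    # Assumed emission for a heterozygous SNP: RAP varies around MCP or
    # 1 - MCP with equal probability (allele A is major or minor).
    p = 0.5 * gaussian_pdf(rap, mcp, sd) + 0.5 * gaussian_pdf(rap, 1.0 - mcp, sd)
    return math.log(max(p, 1e-300))


def viterbi_mcp(raps, states=STATES, sd=0.05, stay=0.99):
    """Most probable MCP state path for the RAP values of heterozygous SNPs."""
    n = len(states)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))  # assumed uniform switching
    score = [-math.log(n) + log_emission(raps[0], m, sd) for m in states]
    back = []
    for rap in raps[1:]:
        new_score, pointers = [], []
        for j, m in enumerate(states):
            cand = [score[i] + (log_stay if i == j else log_move) for i in range(n)]
            best_i = max(range(n), key=cand.__getitem__)
            new_score.append(cand[best_i] + log_emission(rap, m, sd))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    j = max(range(n), key=score.__getitem__)
    path = [j]
    for pointers in reversed(back):  # trace the best path backwards
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [states[i] for i in path]


raps = [0.50, 0.52, 0.48, 0.50, 0.49, 0.66, 0.34, 0.68, 0.33, 0.65, 0.35]
print(viterbi_mcp(raps))
```

In this toy input the first five SNPs look balanced and the last six alternate around 0.65 and 0.35, so the decoded path switches once from the balanced state to an imbalanced one.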
We used several SNP array datasets to illustrate the analysis and visualization methods and to compare the results. (1) 10 K SNP dataset. Zhao et al. [12] generated Early Access 10 K SNP array data for 14 breast and lung carcinoma cell lines and their paired normal cell lines, as well as 4 primary lung carcinomas and their paired normals. The array contains 10,043 SNPs with an average resolution of 300 kb. This work is one of the first demonstrations of combined copy number and LOH analysis using SNP arrays. (2) 100 K SNP dataset. Zhao et al. [25] generated 100 K SNP array data for 70 primary human lung carcinoma specimens and 31 cell lines derived from human lung carcinomas. Twelve unpaired normal samples were used as reference in copy number analysis. The array contains 115,593 SNPs with an average resolution of 24 kb. LaFramboise et al. [13] further analyzed this dataset to develop allele-specific copy number analysis and generated allele-specific quantitative PCR (Q-PCR) measurements for selected loci to compare allele-specific copy numbers from Q-PCR and SNP arrays. (3) 250 K cell line dataset. Affymetrix has made freely available a 500 K SNP dataset (consisting of two 250 K SNP arrays) of 9 tumor/normal pairs derived from breast and lung cancer cell lines [27]. The average marker resolution is 5.8 kb and 85% of the human genome is within 10 kb of a SNP. In this work we used the subset of the 250 K STY array data. (4) 250 K lung tumor dataset. Weir et al. [28] generated 250 K STY SNP array data for 371 primary lung adenocarcinomas and 242 matched normal samples. We used a subset of 45 pairs of normal and tumor samples that are publicly available [29].
The dChip software [30] was used to implement the methods and visualize the analysis results. Figure compares the observed LOH view in dChip with the new MCP view using chromosome 7 of the 10 K SNP dataset. In the MCP view (Figure ), different shades of gray correspond to MCP values greater than 0.5, highlighting allelic-imbalanced regions. Compared to the raw LOH calls on the left, MCP discovers more allelic-imbalanced regions that could cause excessive No Calls (sample H128t) or intervening LOH and retention calls (sample 57588T) in genotype-based LOH analysis.
Figure 2 The LOH and MCP data views of a chromosome. The tumor samples are displayed on columns and the SNPs are ordered on rows by their chromosome positions. (A) Observed LOH calls by comparing normal (N) and tumor (T) genotypes at the same SNP. Yellow: retention ...
In another example, we compared different SNP data views of a lung cancer cell line with paired normal (sample H1395 from the 10 K dataset). The most interesting region in chromosome 18 is indicated by braces in Figure . The tumor sample contains many No Call genotypes in this region (white colors in Figure ), and the paired LOH analysis yields intervening retention, LOH and No Calls (Figure ). The raw copy numbers of this region center around ploidy, that is, the relative copy number 2 (Figure ). The raw major allele proportion curve in Figure reveals that most values are either close to 1 (corresponding to homozygous SNPs) or between 0.6 and 0.7. A likely explanation for these data is that this chromosome region has three copies and the whole genome is near triploid, which is confirmed by spectral karyotyping data [32]. Two copies of this region are from one parent and one copy is from the other, creating an underlying MCP of two thirds (0.67), close to the inferred MCP value of 0.65 (blue curve in Figure ). Interestingly, the chromosome region below this region has retention of heterozygosity, an MCP of 0.5, but copy numbers below ploidy (indicated by arrow in Figure ). This region most likely has one copy of each parental chromosome, creating a copy number decrease from ploidy while retaining heterozygosity.
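The arithmetic behind this interpretation can be checked with a short sketch. The helper names are hypothetical, and the rescaling `2 * absolute / ploidy` is an assumption used here to mimic array copy numbers that are reported relative to ploidy.

```python
STATES = [round(0.5 + 0.05 * i, 2) for i in range(11)]


def nearest_state(mcp):
    """Closest MCP state on the default 0.05 grid."""
    return min(STATES, key=lambda s: abs(s - mcp))


def relative_copy(absolute_copy, ploidy):
    """Assumed rescaling: array copy numbers are relative to ploidy (mode at 2)."""
    return 2.0 * absolute_copy / ploidy


# Braced region: three copies in a near-triploid genome, split 2 + 1.
print(nearest_state(2 / 3), relative_copy(3, 3))
# Region below: one copy from each parent in the same near-triploid genome.
print(nearest_state(1 / 2), round(relative_copy(2, 3), 2))
```

Under these assumptions the braced region has an MCP state of 0.65 at a relative copy number of 2, while the region below it has an MCP of 0.5 at a relative copy number of about 1.33, matching the description above.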
Figure 3 The LOH, copy number, genotype and MCP data views of chromosome 18 of sample H1395t. (A) LOH from paired analysis, similar to Figure 2A. (B) Raw copy numbers. (C) Genotype calls. The red, yellow, blue and white colors represent genotype AA, AB, BB and ...
Therefore, genotype-based LOH analysis suggests the middle region in Figure to be unusual, but the MCP analysis helps to pinpoint the underlying cause of the abnormality. The MCP result also reveals that the heterozygous SNPs in this region have the real genotype AAB or ABB. The standard genotyping algorithms are trained on normal samples [6] and thus make conservative No Calls or incorrect AB or AA/BB calls for these complex genotypes, leading to intervening LOH, retention or No Calls in genotype-based LOH analysis. Combining MCP and total copy number, we can thus obtain a more complete understanding of the genomic structure of tumor samples.
Comparing MCP and LOH
We then used 18 pairs of normal and tumor samples in the 10 K dataset to compare the paired LOH and MCP analysis. Fourteen of these pairs are normal and tumor cell line samples. SNPs were classified as LOH or retention SNPs based on paired LOH analysis (Figure ) and then compared with their MCP values. Figure orders these tumor samples by their increasing sample-wise LOH rates (1.2% to 75%, curve 1). In all the samples, more than 90% of LOH SNPs have MCP ≥ 0.6 (curve 2); in all but two samples, more than 80% of LOH SNPs have MCP ≥ 0.9 (curve 3). The two samples 83437 and 57588 both have LOH rates below 20%, and most of their LOH SNPs are in the intervening LOH and retention regions (the last two columns of Figure ). MCP values between 0.6 and 0.9 correctly identified these regions as allelic imbalance rather than LOH. In fact, these two samples are primary tumor samples and could contain normal sample contamination [12], which likely causes most LOH areas in pure tumor cells to become allelic imbalance in the normal-contaminated tumor samples. In contrast to the LOH SNPs, the retention SNPs seldom have MCP values ≥ 0.9 in any sample (curve 5), although in many samples more than 20% of the retention SNPs have MCP values ≥ 0.6 (curve 4). These regions often have copy number alterations that cause allelic imbalance but not LOH, leading to intervening retention and LOH calls (the last three columns of Figure ).
Comparing LOH and MCP from paired analysis. The samples are ordered on the X-axis by their sample LOH percentage from paired LOH analysis. (A) 10 K SNP data. (B) 250 K cell line data. (C) 250 K lung tumor data, where the sample names are omitted.
We also made a similar comparison using the two 250 K datasets. Figure shows the same percentages as Figure for the 250 K cell line dataset of 9 sample pairs. The percentages behave similarly to those from the 10 K data. The percentage of MCP ≥ 0.9 at paired LOH calls (curve 3) is low for the sample CRL-2338D. Inspecting the paired LOH calls in this sample reveals that most of its LOH calls are interspersed with retention calls (Figure ), indicating allelic imbalance rather than LOH. In contrast, the MCP analysis inferred smooth and moderate MCP values in this sample (Figure ), better revealing the underlying genomic alterations. Figure shows the same comparison percentages for the 250 K lung primary tumor dataset of 45 sample pairs. All samples except one have a paired LOH call percentage below 20% (curve 1). This may be because LOH events occur at a lower frequency in these primary tumors, or because the homozygous genotypes in LOH regions of tumor cells are masked by normal sample contamination of up to 30% [28]. The proportions of MCP ≥ 0.6 among paired LOH and retention calls both drop as the paired LOH percentage drops (curves 2 and 4), while the proportions of MCP ≥ 0.9 among paired LOH calls are near zero for all samples (curve 3, which overlaps curve 5), supporting the existence of normal sample contamination in most samples.
Figure 5 The LOH, MCP and copy number data of chromosome 1 of the 250 K cell line data. (A) The LOH calls from paired normal and tumor analysis. (B) The inferred MCP values. (C) The copy number of tumor samples is displayed in log2 ratio scale relative to the ...
In summary, MCP analysis is able to discover real LOH and retention events. It also discriminates allelic-imbalanced regions from LOH through intermediate MCP values between 0.5 and 1, instead of yielding intervening LOH and retention calls.
Comparing MCP and copy numbers
It is of interest in cancer research whether copy number gains or amplifications are allelic-balanced or imbalanced events, since the genes in the imbalanced events could have a variant or mutant form preferentially amplified [13]. SNP arrays have the advantage of providing both allelic-imbalance and copy number information. We have previously used the 10 K dataset to show that LOH events can involve copy number deletion, copy-neutral or amplification events, while retention mostly occurs at copy-neutral or amplification events [22]. With the inferred MCP as the extension of LOH calls, we now ask how allelic-imbalanced events correlate with copy numbers. A visual comparison of LOH, MCP and copy number using the 250 K cell line dataset is in Figure . In the p-arm of sample CRL-2324D, LOH events correspond to both copy number amplifications and deletions.
We then stratified SNPs by copy number bins and computed the distribution of MCP values at various copy numbers. In the 10 K data (mostly cell lines, Figure ) and the 250 K cell line data (Figure ), the copy numbers below 1.5 mostly correspond to MCP values ≥ 0.95 (LOH or extreme allelic imbalance). As copy number increases, the percentage of "MCP ≥ 0.95" decreases while the percentage of "MCP ≤ 0.6" (retention or near allelic balance) increases, indicating that larger copy number gains or amplifications are less frequently associated with LOH and more frequently associated with allelic-balanced and imbalanced events. In both the 10 K and 250 K cell line data, there is a noticeable drop of allelic-balanced events ("MCP ≤ 0.6") around a copy number of 3. That 3 copies can still show balanced amplifications is because the copy numbers from SNP arrays are not absolute copy numbers but are relative to the ploidy of tumor cells (see "Methods"). Similarly, there are SNPs with MCP values close to 0.5 but copy number below 1, since in hyper-diploid samples the real copy number 2 has an array-based relative copy number below 2. The peak at copy number 0.4 in Figure is likely due to small-sample variation (4 of 39 data points at the copy bin 0.4 have MCP of 0.5) or inaccurately inferred MCP values.
Figure 6 Stratifying MCP according to copy numbers. The 5-SNP smoothed copy numbers were scaled to have mode at two copies for each sample, and then were binned into copy number intervals of a width of 0.2. For example, on the X-axis, 0 indicates the copy number ...
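The stratification described in the text and in the legend of Figure 6 can be sketched as below. The function name is hypothetical, and labeling each bin by its left edge is an assumption, since the legend that defines the bin labels is truncated here.

```python
import math
from collections import defaultdict


def stratify_mcp_by_copy(copies, mcps, width=0.2):
    """Per copy number bin: fraction with MCP >= 0.95, fraction with MCP <= 0.6, count."""
    bins = defaultdict(list)
    for copy, mcp in zip(copies, mcps):
        index = math.floor(copy / width + 1e-9)  # small guard against float edges
        bins[round(index * width, 10)].append(mcp)  # assumed label: left bin edge
    summary = {}
    for left_edge in sorted(bins):
        values = bins[left_edge]
        high = sum(1 for m in values if m >= 0.95) / len(values)
        low = sum(1 for m in values if m <= 0.6) / len(values)
        summary[left_edge] = (high, low, len(values))
    return summary


copies = [1.1, 1.3, 1.25, 3.05, 3.1]   # toy smoothed copy numbers
mcps = [1.0, 0.95, 0.5, 0.5, 0.7]      # toy inferred MCP values
for left_edge, stats in stratify_mcp_by_copy(copies, mcps).items():
    print(left_edge, stats)
```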
In contrast, different patterns emerge from the 250 K tumor dataset (Figure ). The percentage of "MCP ≥ 0.95" is nearly zero at all copy numbers, and the proportion of allelic-balanced events has a single peak around a copy number of 2 and drops rapidly as copy number goes lower or higher. These patterns could be explained by potentially prevalent normal sample contamination in these primary tumors. Such contamination could attenuate aberrant copy number values toward the normal copy number of 2, and could also make the LOH or allelic-imbalanced amplification events in tumor cells appear less allelic-imbalanced in the contaminated tumor, leading to a high percentage of MCP values between 0.55 and 0.75.
Next we used the 100 K SNP array dataset with allele-specific copy numbers measured by quantitative PCR (Q-PCR) to compare the MCP based on SNP arrays and Q-PCR. We used the same SNPs to design the primers for Q-PCR and to obtain their array-based signals for comparison. Since Q-PCR measures allele-specific copy numbers rather than parent-specific copy numbers, we defined major allele proportion (MAP) as Max(A, B)/(A + B) and used it in the comparison, where A and B are Q-PCR or array-based allelic copy numbers. Table shows that most array-based MCP and the Q-PCR-based MAP values agree within a difference of 0.15 ("PCR MAP" and "Array MCP" columns). The largest difference of 0.44 (bold values in the table) occurs at SNP 589797 in sample S0515T. This SNP has homozygous genotype in the sample (both PCR and array-based MAP values are close to 1), but its multiple neighboring SNPs have heterozygous genotypes and MAP values close to 0.5 (data not shown), which contributes to the final inference of MCP 0.55 at the SNP 589797. Together with an amplified total copy number, we conclude that the DNA at the SNP is about equally amplified for both parental alleles.
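For illustration, the MAP statistic and the agreement check can be sketched as follows. The function names and the numeric inputs are hypothetical and are not taken from the table; only the formula Max(A, B)/(A + B) and the 0.15 tolerance come from the text.

```python
def major_allele_proportion(a: float, b: float) -> float:
    """MAP = max(A, B) / (A + B) from allelic copy numbers A and B."""
    total = a + b
    if total <= 0:
        raise ValueError("A + B must be positive")
    return max(a, b) / total


def agrees(pcr_map: float, array_mcp: float, tolerance: float = 0.15) -> bool:
    """True if the array-based MCP is within the tolerance of the Q-PCR MAP."""
    return abs(pcr_map - array_mcp) <= tolerance


pcr_map = major_allele_proportion(4.0, 2.0)  # hypothetical Q-PCR allelic copies
print(pcr_map, agrees(pcr_map, 0.65), agrees(pcr_map, 1.0))
```

Unlike MCP, MAP does not require knowing which parental chromosome carries each allele, so it can be computed directly from Q-PCR measurements.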
Comparison of MCP based on SNP arrays and major allele proportions (MAP, Max(A, B)/(A + B)) based on allele-specific Q-PCR or array-based allelic signals.
MCP analysis of normal contaminated samples
We next checked how well MCP can address two challenges of applying SNP arrays in cancer genomics: tumor samples frequently lack paired normal samples to perform paired LOH or MCP analysis, and tumor tissue samples often contain normal stromal cell contamination. The HMM emission distributions can flexibly use either paired normal genotypes in paired MCP analysis or a population-based normal genotype distribution in tumor-only analysis (see "Methods"). We checked how well the MCP estimated using paired normal and tumor samples agrees with the MCP estimated using only tumor samples. In the 10 K dataset, the sample-wise absolute MCP differences between the two methods in the 18 samples range from 0.0006 to 0.025, and the sample-wise standard deviations of the MCP differences range from 0.013 to 0.075. In the 250 K lung tumor dataset, these two difference measures are larger across the 45 tumors, ranging from 0.011 to 0.041 and 0.050 to 0.114 respectively. Visual inspection of the MCP inferred from the 250 K data reveals many small regions that have MCP values > 0.5 in tumor-only analysis but an MCP value of 0.5 in paired analysis. They are caused by stretches of homozygous genotypes that are in linkage disequilibrium, in a similar way to the false positives in tumor-only LOH analysis [22]. By utilizing the genotype dependence of 5 neighboring SNPs in the HMM emission probabilities of tumor-only MCP analysis (see "Methods"), the two difference measures become smaller (ranges are 0.0001 – 0.026 and 0.008 – 0.097). Overall, the differences between paired and tumor-only MCP inferences are fairly small compared to the values that MCP can take (0.5 to 1).
We next asked how tolerant the tumor-only MCP method is to normal contamination. The 10 K dataset contains a mixing experiment of paired normal and tumor samples [12]. A tumor cell line (HCC38t) was mixed with its paired normal cell line (HCC38) at various proportions and then hybridized to 10 K SNP arrays. In Figure , samples HCC38M9 to HCC38M6 are mixture samples with tumor content of 90%, 80%, 70% and 60% respectively. The LOH regions in the pure tumor sample should become allelic-imbalanced regions in the mixture samples. Figure shows a typical example of inferred LOH and MCP data using unpaired analysis (paired analysis for the column labeled with "HCC38t" in blue color). Compared to the LOH data (Figure ), MCP analysis better identified the boundaries of the allelic-imbalanced regions in all the mixture samples (Figure ). The values of estimated MCP for the LOH regions are less than 1 in the mixture samples (Figure ), corresponding to the allelic imbalance created by normal sample contamination. Interestingly, both LOH and MCP analyses perform better for the bottom LOH region than the top LOH region in the mixture samples, due to copy-neutral LOH in the bottom region and hemizygous deletion in the top region (copy number data not shown).
Figure 7 Comparing LOH and MCP using the 10 K mixing samples. The inferred LOH (A) and MCP (B, C) are displayed for chromosome 4. In the tumor-only LOH inference (the columns except column 1 in A), the inferred probability of LOH is displayed using a blue (1 – ...
Figure shows the whole-genome MCP values inferred for paired analysis (column 1) and for tumor-only analysis of tumor (column 2) and mixture samples (columns 3–6). The MCP patterns are largely preserved, but MCP values attenuate toward 0.5 as tumor content decreases. If a threshold of "MCP ≥ 0.6" is used to call allele-imbalanced SNPs (red vertical line in the shaded boxes) and we regard paired MCP analysis as the ground truth, at a tumor content of 80% (column 4) we achieve a sensitivity of 88.5% and a specificity of 88.2%. At a tumor content of 70% (column 5), however, the sensitivity and specificity drop to 82.6% and 60.4%. This shows that normal contamination of up to 20% is tolerable when calling allele-imbalanced regions in MCP analysis.
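The sensitivity and specificity calculation described here can be sketched as follows, treating the paired-analysis MCP as ground truth. The function name and the toy values are hypothetical; they do not reproduce the percentages reported above.

```python
def sensitivity_specificity(paired_mcp, tumor_only_mcp, threshold=0.6):
    """Compare allele-imbalance calls (MCP >= threshold) against paired analysis."""
    tp = fp = tn = fn = 0
    for truth_value, test_value in zip(paired_mcp, tumor_only_mcp):
        truth = truth_value >= threshold
        call = test_value >= threshold
        if truth and call:
            tp += 1
        elif truth:
            fn += 1
        elif call:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


paired = [1.0, 1.0, 0.5, 0.5, 0.8]        # toy paired-analysis MCP per SNP
tumor_only = [0.9, 0.55, 0.5, 0.65, 0.7]  # toy tumor-only MCP of a mixture sample
print(sensitivity_specificity(paired, tumor_only))
```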
The genome-wide view of inferred MCP in the mixture samples. The red vertical lines in the gray boxes represent a MCP threshold of 0.6. See the legend of Figure 2B and 3D for color schemes.
Major and minor copy alleles
One feature of the MCP algorithm is that it also infers the major and minor copy alleles for SNPs that are heterozygous in the normal sample and undergo LOH or allelic imbalance in the tumor (see "Methods"). In Figure , a SNP is colored in red or blue for major copy allele A or B if its major copy allele (MCA) can be inferred and is different from the minor copy allele, which is not displayed. In the paired MCP analysis (Figure , column "HCC38t" with blue color), the MCA is inferred to be different from the minor copy allele for many SNPs, since the normal sample contains information on heterozygous SNPs and LOH in the tumor sample provides information on the kept allele as MCA. In contrast, in the tumor-only MCP analysis (Figure , column "HCC38t" with black color), we can infer two regions of LOH (MCP is 1) as well as the MCA, but there is no information on the minor copy allele, so no SNP is colored. However, as normal contamination increases from sample HCC38M9 to HCC38M6, the minor copy allele of more and more SNPs can be estimated from the admixed normal sample, so more SNPs are colored to indicate that the MCA is different from the minor copy allele (Figure ). These results show that allelic imbalances (such as those in the mixture samples) can help to distinguish major and minor copy alleles, while LOH or allelic-balanced events cannot.
Therefore, in tumor-only MCP analysis, normal contamination at a low percentage (≤ 20%) can be beneficial. The normal-contaminated samples contain the information of both normal sample genotypes and tumor genome alterations (LOH and copy number changes). In the tumor-only MCP analysis, the normal genotype information is utilized in the form of allele-specific raw copy numbers in the HMM emission distributions (see "Methods"). As a result, the allele-imbalanced regions are identified in a way similar to paired LOH analysis via normal-tumor genotype comparison, rather than to tumor-only LOH inference, which resorts to unrelated reference normal genotypes to distinguish between LOH and homozygous haplotype blocks [22]. The normal-contaminated samples also help to provide information on both major and minor copy alleles at allelic-imbalanced regions, which can be useful in downstream analysis. If the contamination percentage can be estimated by other experimental measures, the inferred MCP or copy number from normal-contaminated samples can be adjusted proportionally to obtain the MCP/LOH and copy number values of the unavailable pure tumor samples. However, efforts should be made to obtain pure tumor samples and their paired normals for separate hybridization whenever possible, as paired MCP analysis can better recover allele-imbalance information than tumor-only MCP analysis of contaminated samples (Figure ).
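One way to make the proportional adjustment concrete is a simple linear mixing model, sketched below. This model is an assumption introduced here for illustration and is not taken from "Methods": it supposes that the sample is a mixture of tumor cells (fraction `tumor_fraction`) and diploid normal cells that carry one copy of each allele at a heterozygous SNP. The function names are hypothetical.

```python
def mixed_mcp(c1, c2, tumor_fraction):
    """Expected MCP of a mixture, given tumor parental copies c1 <= c2."""
    normal_fraction = 1.0 - tumor_fraction
    minor = tumor_fraction * c1 + normal_fraction * 1.0
    major = tumor_fraction * c2 + normal_fraction * 1.0
    return major / (minor + major)


def pure_tumor_copies(observed_total, observed_mcp, tumor_fraction):
    """Invert the mixing model to estimate the tumor's (minor, major) copies."""
    normal_fraction = 1.0 - tumor_fraction
    major = observed_total * observed_mcp
    minor = observed_total - major
    return ((minor - normal_fraction) / tumor_fraction,
            (major - normal_fraction) / tumor_fraction)


for fraction in (0.9, 0.8, 0.7, 0.6):
    # Copy-neutral LOH (0 + 2) versus hemizygous deletion (0 + 1).
    print(fraction, round(mixed_mcp(0, 2, fraction), 3), round(mixed_mcp(0, 1, fraction), 3))
```

Under this assumed model the expected MCP of an LOH region moves toward 0.5 as tumor content decreases, which is the qualitative behavior seen in the mixture samples.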
Downstream analysis and related analysis methods
MCP is an extension of LOH and contains complementary information to total copy numbers. Similar to LOH and copy number analysis of a set of tumor samples, an MCP summary score for each SNP may be computed across samples, such as the average MCP value across all samples. Then the chromosome regions can be permuted within samples, and the MCP scores computed from the permuted data can be compared to the original MCP scores to assess the significance of the latter [34]. A composite alteration score using both MCP and total copy number may also be used, such as the proportion of samples with copy > 3 and MCP > 0.65, to capture only allelic-imbalanced amplifications.
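These summary scores can be sketched as follows. The function names are hypothetical, and the permutation scheme (a random circular shift of each sample's MCP profile, compared on the maximum score) is a simplified stand-in chosen for illustration; it is not claimed to be the procedure of reference [34].

```python
import random


def mean_mcp(mcp_by_sample):
    """Average MCP per SNP across samples (rows are samples, columns are SNPs)."""
    n_samples = len(mcp_by_sample)
    return [sum(column) / n_samples for column in zip(*mcp_by_sample)]


def composite_score(copy_by_sample, mcp_by_sample, copy_cut=3.0, mcp_cut=0.65):
    """Per SNP, the proportion of samples with copy > copy_cut and MCP > mcp_cut."""
    n_samples = len(mcp_by_sample)
    scores = []
    for copies, mcps in zip(zip(*copy_by_sample), zip(*mcp_by_sample)):
        hits = sum(1 for c, m in zip(copies, mcps) if c > copy_cut and m > mcp_cut)
        scores.append(hits / n_samples)
    return scores


def permutation_p_value(mcp_by_sample, n_perm=200, seed=0):
    """P-value of the maximum mean MCP under assumed within-sample circular shifts."""
    rng = random.Random(seed)
    observed = max(mean_mcp(mcp_by_sample))
    n_snps = len(mcp_by_sample[0])
    exceed = 0
    for _ in range(n_perm):
        shifted = []
        for row in mcp_by_sample:
            k = rng.randrange(n_snps)
            shifted.append(list(row[k:]) + list(row[:k]))
        if max(mean_mcp(shifted)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


mcp = [[0.5, 0.7, 1.0], [0.5, 0.9, 0.5]]   # toy data: 2 samples, 3 SNPs
copy = [[2.0, 3.4, 1.0], [2.0, 3.2, 2.0]]
print(mean_mcp(mcp), composite_score(copy, mcp), permutation_p_value(mcp))
```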
There are several allele-specific copy number (ASCN) analysis methods for SNP arrays [13]. While total copy number plus MCP provide the same amount of information as allele-specific copy numbers, MCP extends the LOH analysis in a natural way and offers a univariate statistic to capture both LOH and allele-imbalance events. Such a univariate quantity is more efficiently estimated and more easily used in downstream analyses than a pair of allelic copy numbers. If needed, the total copy number and MCP can be combined to give an analysis equivalent to one using allele-specific copy numbers. The MCP analysis also reports major and minor copy alleles for allelic comparisons in allele-imbalanced regions. Several of the above ASCN methods also use probe sequence and restriction fragment length to adjust probe signals and improve signal-to-noise ratios. Such adjusted raw allelic copy numbers can be conveniently used as the input of the MCP analysis through the dChip software.
As with all copy number analysis of SNP arrays, we ideally need paired normal samples for MCP analysis. When such paired samples are not available and an independent set of normal samples is used for reference signals, copy number variations (CNV) in normal samples may confound tumor copy number analysis [36]. To address this issue, we have implemented a trimming method. Specifically, we assumed that in reference normal samples, for any SNP at most a certain percentage (such as 10%) of the samples have abnormal copy numbers. Then for each SNP, 5% of samples with extreme signals are trimmed from the high and low ends of the raw signal distribution, and the remaining samples are used to compute the signal mean and standard deviation of normal copy numbers at the SNP. This trimming method is designed to accommodate a small number of CNVs in reference normal samples and has proven useful in copy number analysis with unpaired or a limited number of normal samples. The same trimming method can be used to obtain raw allele-specific copy numbers in the MCP analysis.
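The trimming step can be sketched for a single SNP as below. The function name is hypothetical. Two points are assumptions made here: that 5% is trimmed from each end (10% in total, in line with the assumed upper bound on abnormal samples), and that the number of trimmed samples is rounded down, so no trimming occurs with fewer than 20 reference samples.

```python
import statistics


def trimmed_reference(signals, trim_fraction=0.05):
    """Mean and standard deviation of reference signals after trimming both tails."""
    ordered = sorted(signals)
    k = int(len(ordered) * trim_fraction)  # assumption: round down per tail
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    if len(kept) < 2:
        raise ValueError("too few reference samples after trimming")
    return statistics.mean(kept), statistics.stdev(kept)


# Toy reference signals at one SNP: 18 typical samples, one loss and one gain.
signals = [1.0] * 18 + [0.1, 5.0]
print(trimmed_reference(signals))
```

In the toy input the two extreme samples are removed before the mean and standard deviation are computed, so a rare CNV in the reference set does not distort the normal reference signal.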