The results and discussion are organized as follows. First, we demonstrate that segmentation methods used in DNA copy number analysis can be applied directly to matched tumor-normal samples to identify regions of similar allelic proportions. Next, the segmentation approach is generalized for use with unpaired tumor samples. The performance of the segmentation strategy is then evaluated against other methods using simulated as well as experimental data sets from different Illumina WGG platforms. We then describe how the segmentation approach detects allelic imbalances with high sensitivity and accurately estimates the fraction of cells affected by them. Finally, we describe how the segmentation approach can be adapted to Affymetrix WGG data.
Segmentation identifies regions of identical allelic proportions in matched tumor-normal samples
Allelic imbalances in tumor samples may conveniently be displayed using BAF plots, which illustrate the presence and location of genomic regions of apparently the same allelic proportion (Figure ). The nature of an allelic imbalance may be revealed by comparison to the corresponding copy number profile (Figure ). In conventional LOH analysis a matched normal sample is used for detection of LOH. SNPs that are homozygous in constitutional cells are non-informative for LOH analysis. For paired tumor-normal samples analyzed using WGG platforms, non-informative homozygous SNPs may be identified and removed by comparison of SNP genotype calls between the tumor and the matched normal, resulting in a tumor-specific BAF profile (Figure ). Furthermore, since alleles for SNPs are, with respect to haplotypes, arbitrarily called A or B, a set of genomically consecutive SNPs will appear in BAF plots as horizontal bands that are expected to be symmetrically positioned around 0.5. By performing a reflection of BAF data along the 0.5 axis, we obtain mirrored BAF (mBAF) estimates resembling a copy number profile (Figure ). Homozygous SNPs (AA or BB) are thus positioned at 1, while heterozygous SNPs are positioned at 0.5. A similar transformation was used in the recently reported SOMATICs algorithm [17].
Figure 1 Transformation of B allele frequency data for a paired tumor sample. (a) BAF for chromosome 8 of breast tumor 2 (data set 1). (b) Copy number profile of chromosome 8 with CBS segmentation profile superimposed in red. Gains (red bars) and losses (green ...
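The BAF-to-mBAF transformation for a paired sample can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `mirrored_baf`, the genotype encoding ('AA', 'AB', 'BB') and the toy values are assumptions made for the example.

```python
def mirrored_baf(tumor_baf, normal_genotypes):
    """Return mBAF values for SNPs heterozygous in the matched normal.

    tumor_baf: B allele frequencies (0..1) for the tumor sample.
    normal_genotypes: genotype calls for the matched normal, same SNP order
        (the 'AA'/'AB'/'BB' encoding is an assumption of this sketch).
    """
    mbaf = []
    for baf, genotype in zip(tumor_baf, normal_genotypes):
        if genotype != "AB":
            continue  # homozygous in constitutional cells: non-informative
        # Reflect around the 0.5 axis so the two symmetric bands coincide.
        mbaf.append(baf if baf >= 0.5 else 1.0 - baf)
    return mbaf


# Toy data: five SNPs, three of them heterozygous in the normal sample.
tumor_baf = [0.02, 0.30, 0.71, 0.99, 0.48]
normal_genotypes = ["AA", "AB", "AB", "BB", "AB"]
result = mirrored_baf(tumor_baf, normal_genotypes)
```

Only the three informative SNPs remain, and each value lies between 0.5 and 1, which is what allows a copy number segmentation method to be run on the profile unchanged.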
In DNA copy number analysis, segmentation methods such as CBS [18] have been extensively tested for their ability to identify CNAs [20]. CBS can be directly applied to the mBAF tumor profile in Figure to identify the breakpoints of the observed allelic imbalances (Figure ). When comparing the segmented mBAF profile (Figure ) to the copy number profile (Figure ) we find that the segmentation accurately detects regions of allelic imbalance due to copy number loss on 8p23.3 to 8p12 and 8q11.23 to 8q21.3, allelic imbalance due to copy number gain on 8p11.23 to 8p11.21 and 8q22.2 to 8q24.12, and apparent copy neutral LOH on 8q24.13 to 8q24.3. In conclusion, we find that a segmentation-based approach can be applied to Illumina WGG data to identify regions of allelic imbalance in matched tumor-normal samples.
Generalization of the segmentation approach to unpaired tumor samples
The initial step in the segmentation approach is to remove non-informative homozygous SNPs from the tumor mBAF profile. Thus, generalization of the segmentation approach to unpaired tumor samples requires identification of non-informative SNPs when a matched normal sample is not available. Since the B allele frequency is a quantitative estimate of the allelic proportion for a given SNP, expected mBAF values for different types of allelic imbalances can be calculated for diploid genomes. An estimate of the tumor content of the analyzed sample can thus be translated into a maximal obtainable expected mBAF value for different types of allelic imbalances. The highest expected mBAF value, 1, is obtained for hemizygous loss or copy neutral LOH in a sample with 100% tumor content and no tumor heterogeneity. The highest achievable expected mBAF value decreases when contaminating normal cells and/or tumor cell sub-clones are present.
An estimation of tumor content can be used for generalization of the segmentation approach to unpaired tumor samples. Based on tumor content, the maximal obtainable expected mBAF value can be calculated and SNPs above this value can be removed as in the procedure for matched tumor-normal samples. For example, SNPs informative for a hemizygous deletion are, on average, not expected to obtain mBAF values larger than 0.91 for tumor samples with 10% normal cell contamination. On the other hand, for samples of purity above approximately 95%, using a fixed mBAF threshold for removal of non-informative homozygous SNPs may be inappropriate. The reason is that the range in mBAF of SNPs homozygous in all analyzed cells is often 0.97 to 1, as seen for normal samples analyzed on Illumina BeadChips (Table , Figure ). This variation makes non-informative homozygous SNPs difficult to distinguish from SNPs affected by tumor specific allelic imbalances for pure tumor samples. Still, for tumor samples of purity below 90-95%, or tumor samples of higher purity but with tumor cell subpopulations, a fixed mBAF threshold is an effective single parameter method for removing non-informative homozygous SNPs.
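The expected mBAF values referred to above can be written out under a simple two-population mixture model. The paper's own equations are given in its Materials and methods and are not reproduced in this section, so the following is a sketch under stated assumptions: the symbol $p$ (fraction of analyzed cells carrying the imbalance) is introduced here, and the remaining cells are assumed to be diploid and heterozygous (AB) at the SNP.

```latex
% p: fraction of analyzed cells carrying the imbalance (assumed symbol).
% Each expression is (B allele copies) / (total allele copies) in the mixture.
\begin{align*}
\text{hemizygous loss (B vs. AB):}\quad
  & \mathrm{mBAF}(p) = \frac{p + (1-p)}{p + 2(1-p)} = \frac{1}{2-p} \\
\text{copy neutral LOH (BB vs. AB):}\quad
  & \mathrm{mBAF}(p) = \frac{2p + (1-p)}{2} = \frac{1+p}{2} \\
\text{single copy gain (ABB vs. AB):}\quad
  & \mathrm{mBAF}(p) = \frac{2p + (1-p)}{3p + 2(1-p)} = \frac{1+p}{2+p}
\end{align*}
```

With $p = 0.9$ (10% normal cell contamination) the first expression gives $1/1.1 \approx 0.91$, in agreement with the value quoted above for a hemizygous deletion.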
mBAF statistics for homozygous SNPs in HapMap samples analyzed on Illumina BeadChips
Figure 2 Generalization of the segmentation approach to unpaired tumor samples using a fixed mBAF threshold. (a) Histogram of mBAF values for the HapMap sample NA06991 (reference data set 4) hybridized on an Illumina Infinium 550k BeadChip. Bins with homozygous ...
Applying a maximal mBAF cut-off of 0.97 to breast tumor 2 for removal of non-informative homozygous SNPs, followed by segmentation, yields a segmentation profile (Figure ) similar to that obtained with the paired normal sample (Figure ). However, a fixed threshold may not fully remove non-informative SNPs if it is set too high. See, for example, Figure , where some SNPs with high mBAF values (mBAF >0.9) remain, in contrast to the matched case (Figure ). To remove such remaining non-informative SNPs, we first identify them by the absolute sum of the difference in mBAF between an investigated SNP and the SNPs that, in the maximal mBAF filtered data, precede and succeed the SNP. Next, SNPs whose mBAF deviates from that of their neighboring SNPs by more than a set threshold are removed. This filtering process, herein referred to as triplet filtering (see Materials and methods), is illustrated in Figure S1 in Additional data file 1. To systematically evaluate the effect of triplet filtering, we applied it to the paired urothelial tumors in data set 2. We found that the addition of triplet filtering significantly improved the removal of non-informative SNPs (Figure S1 in Additional data file 1; Additional data file 2). In conclusion, the segmentation strategy can be generalized for unpaired tumor analysis by filtering out putative non-informative homozygous SNPs based on their mBAF values. Furthermore, normal cell contamination is advantageous for the segmentation strategy in unpaired tumor analysis, as the analyzed cells are a mix of cells with allelic imbalance (tumor cells) and cells with no imbalance (matched normal cells). This mix results in a compression of BAF estimates that distinguishes tumor-specific regions of allelic imbalance from non-informative regions of homozygosity.
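Triplet filtering as described above can be sketched as follows. The exact definition is in the Materials and methods, which are not part of this section, so the deviation formula used here (the absolute value of the summed differences to the preceding and succeeding SNP) is one reading of the text, and the cut-off of 0.6 is a hypothetical value chosen only for the toy data.

```python
def triplet_filter(mbaf, cutoff):
    """Drop SNPs whose mBAF deviates strongly from both neighbors.

    Deviation is |(m[i] - m[i-1]) + (m[i] - m[i+1])|, an assumed reading of
    the description. The first and last SNP are kept because they have no
    complete triplet. Neighbors are always taken from the input profile.
    """
    kept = []
    for i, value in enumerate(mbaf):
        if 0 < i < len(mbaf) - 1:
            deviation = abs((value - mbaf[i - 1]) + (value - mbaf[i + 1]))
            if deviation > cutoff:
                continue  # isolated high-mBAF SNP, likely non-informative
        kept.append(value)
    return kept


# Toy profile: one isolated SNP near 1 surrounded by heterozygous SNPs.
profile = [0.52, 0.55, 0.95, 0.53, 0.51]
filtered = triplet_filter(profile, cutoff=0.6)
```

In this toy profile only the isolated value of 0.95 exceeds the cut-off; its neighbors are retained because their deviations are computed against the unfiltered profile and stay below 0.6.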
Calling of segmented regions as allelic imbalance
As illustrated in Figures and , segmentation can delineate regions of apparently the same allelic proportions for both paired and unpaired tumor samples. To differentiate regions of allelic imbalance from the heterozygous state, we can apply approaches similar to those used for calling CNAs from segmented data in DNA copy number analysis. In its simplest form, the segmented values are compared against a fixed mBAF threshold. If the segmented value of a genomic region is above the threshold, it is called as allelic imbalance. A fixed mBAF threshold may be given biological meaning through the equations giving expected mBAF values for different types of allelic imbalances (see Materials and methods). For example, detecting hemizygous loss in 20% of analyzed cells implies a maximum mBAF threshold of 0.56. We may also employ a sample adaptive approach for estimating the mBAF threshold as described for copy number analysis [19].
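The fixed-threshold calling step can be sketched as below. The expression 1 / (2 - p) for hemizygous loss in a fraction p of otherwise diploid, heterozygous cells is a standard mixture-model assumption (the paper's own equations are in its Materials and methods); the function names and the segment tuples are hypothetical.

```python
def hemizygous_loss_mbaf(cell_fraction):
    """Expected mBAF when `cell_fraction` of cells carry a hemizygous loss.

    Assumes the remaining cells are diploid and heterozygous (AB).
    """
    return 1.0 / (2.0 - cell_fraction)


def call_allelic_imbalance(segments, threshold):
    """Flag each (start, end, segmented_mbaf) tuple above the threshold."""
    return [(start, end, value, value > threshold)
            for start, end, value in segments]


# Threshold for detecting hemizygous loss in 20% of analyzed cells.
threshold = round(hemizygous_loss_mbaf(0.20), 2)

# Hypothetical segments: (first SNP index, last SNP index, segmented mBAF).
segments = [(0, 100, 0.51), (101, 250, 0.72), (251, 400, 0.57)]
calls = call_allelic_imbalance(segments, threshold)
```

Rounding the expected value for 20% affected cells to two decimals reproduces the threshold of 0.56 quoted in the text; the first toy segment stays below it and the other two are called.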
Figure shows a schematic overview of the analysis steps in the segmentation approach with parameters for paired and unpaired tumor analysis. Using fixed thresholds, the number of parameters to optimize is typically one for paired tumor analysis (threshold for calling allelic imbalance) and two for unpaired analysis (threshold for removing non-informative SNPs and threshold for calling allelic imbalance). For the Illumina data sets we have analyzed, we have not found that other parameters (triplet-filtering cut-off, segmentation algorithm parameters, and minimum segment size) need to be tuned. If the threshold for removing non-informative SNPs in an unpaired analysis is set too high, a large number of non-informative SNPs may, for noisier samples, remain in the tumor mBAF profile. Such SNPs may form non-informative homozygous regions detected by the segmentation and falsely identified as regions of allelic imbalance. If the threshold is not optimized properly, haplotype correction [9] or size filtering of segments with high mBAF values needs to be employed to reduce the number of such false positive calls. When the tumor content of the analyzed cells is known, false positive segments can be filtered out on the basis of their segmented mBAF values.
Figure 3 Flow chart of the analysis steps for the segmentation approach with parameters (in red) for paired and unpaired tumor analysis.
Evaluation and comparison of sensitivity and specificity using simulated Illumina data
To investigate the sensitivity and specificity of the segmentation approach compared to other methods, we created a simulated data set based on experimental 550k Illumina data for HapMap sample NA06991 (as described in Additional data file 3). Briefly, to the diploid HapMap sample we added a number of different CNAs and regions of copy neutral LOH to mimic a tumor sample. The simulated tumor sample was next diluted with normal cells creating a dilution series ranging from 0-100% tumor cell content in 5% increments. The ability to detect SNPs in allelic imbalance was evaluated for the segmentation strategy in both a paired and an unpaired setting. The performance of the segmentation strategy was compared with three published copy number variation (CNV) or allelic imbalance algorithms: PennCNV [12], QuantiSNP [13] and SOMATICs [17]. PennCNV and QuantiSNP are HMM-based methods developed for CNV analysis and should only detect allelic imbalances originating from DNA copy number gain and loss, whereas SOMATICs also detects copy neutral allelic imbalances.
First, we evaluated whether the methods identified regions of allelic imbalance regardless of whether the methods also correctly identified the type of aberration (gain, loss or copy neutral). We calculated sensitivities for each allelic imbalance and overall specificities using SNPs heterozygous in the original HapMap sample. In this analysis, the sensitivity for a simulated allelic imbalance is the fraction of its SNPs that are called as allelic imbalance, and the overall specificity is the fraction of SNPs outside of all simulated allelic imbalances that are not called.
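The two measures defined above translate directly into set operations on SNP indices. The sketch below is illustrative only; the function names and the toy regions are assumptions, and SNPs are identified by integer index.

```python
def sensitivity(called, region):
    """Fraction of the SNPs in one simulated imbalance that were called."""
    region = set(region)
    return len(region & set(called)) / len(region)


def specificity(called, regions, n_snps):
    """Fraction of SNPs outside all simulated imbalances that were not called."""
    inside = set().union(*regions)
    outside = set(range(n_snps)) - inside
    return len(outside - set(called)) / len(outside)


# Toy example: 100 heterozygous SNPs, two simulated imbalances.
regions = [range(10, 20), range(50, 60)]
called = set(range(12, 20)) | {70}   # 8 true calls plus 1 false positive

sens_first = sensitivity(called, regions[0])
sens_second = sensitivity(called, regions[1])
spec = specificity(called, regions, n_snps=100)
```

Here the first region is detected with a sensitivity of 0.8, the second is missed entirely, and the single false positive among the 80 SNPs outside both regions gives a specificity of 79/80.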
Sensitivities for detecting simulated allelic imbalances regardless of whether the correct type of aberration was identified are shown in Figure . For lower normal cell contaminations (<40%), all methods showed high sensitivity and concordance for detecting allelic imbalance originating from copy number gains and losses. For higher normal cell contaminations the segmentation strategy outperformed both PennCNV and QuantiSNP in both a paired and an unpaired analysis setting. Compared to SOMATICs, the segmentation strategy showed similar sensitivity throughout the dilution range. Even though PennCNV and QuantiSNP should not detect copy neutral events, we note that reducing the calls to a binary decision (allelic imbalance or not) causes both methods to erroneously detect copy neutral LOH regions, for example, on chromosome 5p. The overall specificity was high (>99.99%) for PennCNV, QuantiSNP and the segmentation strategy across the dilution range (Figure ). SOMATICs showed the lowest specificity across the dilution range (ranging from approximately 97% to 99%), mainly due to a large number of erroneously called SNPs in the so-called red band of the algorithm. Additionally, SOMATICs identified the largest erroneously called segments, some exceeding 500 heterozygous SNPs in size (Figure ). Hence, SOMATICs obtains sensitivities similar to the segmentation strategy at the expense of identifying a larger number of false positive regions.
Figure 4 Comparison of sensitivity for detecting ten simulated allelic imbalances for different methods. Heterozygous SNPs in NA06991 were used to estimate the sensitivity for the methods in detecting allelic imbalances in the simulated data set with increasing ...
Figure 5 Comparison of specificity for detecting simulated allelic imbalances for different methods. Heterozygous SNPs in NA06991 were used to estimate the specificity of methods for detecting allelic imbalances with increasing normal cell contamination in the ...
The detection of copy neutral imbalances using PennCNV and QuantiSNP led us to evaluate whether the methods, when they identify a region in allelic imbalance, also call the correct type of the aberration (gain, loss or copy neutral). In this second evaluation, the sensitivity for a simulated allelic imbalance is the fraction of its SNPs that are called as the correct type of imbalance. The overall specificity is calculated as in the previous evaluation with the addition that SNPs within an imbalance called as the incorrect type also contribute to lowering the overall specificity. For the segmentation strategy we used fixed cut-offs for the average log R ratio of SNPs in regions called as allelic imbalance to also call the type of aberration (see Materials and methods). The segmentation strategy had higher sensitivity than SOMATICs for correctly identifying gains and losses (Figure S2 in Additional data file 1). The CNV calling algorithm in SOMATICs repeatedly failed to call several regions of gain and loss correctly. Compared to only identifying allelic imbalance, the overall specificity for correct identification of the type of simulated allelic imbalance was considerably lower for PennCNV, QuantiSNP and SOMATICs, whereas it was high for the segmentation strategy also in this case (Figure ).
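Assigning a type to a region already called as allelic imbalance, based on its average log R ratio, might look like the sketch below. The cut-offs actually used are given in the Materials and methods and are not stated in this section, so the values of +0.1 and -0.1 are placeholders, and the function name is hypothetical.

```python
def call_aberration_type(log_r_ratios, gain_cutoff=0.1, loss_cutoff=-0.1):
    """Classify a region in allelic imbalance from its mean log R ratio.

    The default cut-offs are placeholder values for illustration only.
    """
    mean_lrr = sum(log_r_ratios) / len(log_r_ratios)
    if mean_lrr > gain_cutoff:
        return "gain"
    if mean_lrr < loss_cutoff:
        return "loss"
    return "copy neutral"


# Hypothetical log R ratios for three regions called as allelic imbalance.
gain_call = call_aberration_type([0.25, 0.30, 0.20])
loss_call = call_aberration_type([-0.40, -0.35])
neutral_call = call_aberration_type([0.02, -0.03, 0.01])
```

Because the decision rests on the centering of the log R ratio profile, any shift in that centering changes the type that is assigned while leaving the allelic imbalance call itself untouched.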
The segmentation strategy was, with the simulated data, able to detect regions of copy neutral LOH when the tumor content was only 15%. For hemizygous loss the maximum normal cell contamination that allowed detection was 75-80%, which corresponds well to the used mBAF threshold of 0.56 for calling allelic imbalance (hemizygous loss in >21% of analyzed cells). Single copy gain was detected with up to 75% normal cell contamination. Differences in sensitivity between paired and unpaired segmentation were seen for small allelic imbalances in samples of high tumor content. The low sensitivity for the 126 kb hemizygous loss on 13q13.1 for unpaired segmentation with 0-10% normal cell contamination is due to the fixed mBAF threshold of 0.97 for removing putatively non-informative homozygous SNPs (Figure ). With this threshold value several of the tumor-specific homozygous SNPs for this CNA are removed, making it difficult to detect by segmentation.
BAF and copy number profiles for the simulated data set with regions called as allelic imbalance marked for PennCNV, QuantiSNP, SOMATICs, unpaired segmentation, and paired segmentation are available as described in Additional data file 4. In conclusion, we find that the segmentation strategy can sensitively detect different types of allelic imbalances in highly heterogeneous samples and perform well compared with other published methods.
Evaluation and comparison of sensitivity using an experimental Illumina dilution series
To investigate the ability of the segmentation approach to detect allelic imbalances in experimental Illumina data, we generated a dilution series of the CRL-2324 breast cancer cell line on Illumina 370k BeadChips (data set 3). In addition to the methods applied to the simulated data (segmentation, PennCNV, QuantiSNP, and SOMATICs), we also included dChipSNP in this comparison. Since dChipSNP is a SNP genotype call-based method, it could not be applied to the simulated data, in which genotype calls were not simulated. CRL-2324 cells display a complex genetic make-up with polyploid cell populations having varying ploidy indices [21]. Aneuploidy may confound normalization and data interpretation of Illumina WGG data [6]. Normalization of Illumina WGG data in BeadStudio is made under the assumption that homozygous SNPs exist, on average, in two copies [6], an assumption that can lack validity for aneuploid tumor samples. Substantiating this concern, we observed for the CRL-2324 dilution series that BeadStudio normalization results in copy number profiles that are centered differently as the tumor content decreases (Figure ). As a consequence of this variation in centering, many of the methods will call the same type of allelic imbalance differently (gain, loss, or copy neutral) depending on how much the tumor is diluted. Therefore, we evaluated the methods using calls of allelic imbalance without regarding the type of aberrations.
Figure 6 Allelic imbalances in CRL-2324 cells used for estimation of tumor dilution percentage by segmentation. CRL-2324 breast cancer cells were hybridized on Illumina 370k BeadChips in a dilution series with matched normal DNA (data set 3). For all parts, the ...
Sensitivity was determined for eight different CNAs having BAF values in the undiluted cancer cell line consistent with presence in all tumor cells (Figure ). We found that the segmentation approach outperformed PennCNV, QuantiSNP and dChipSNP in sensitivity when tumor content was less than 50%. SNP call-based methods, such as dChipSNP, have been reported to be unable to detect regions of LOH when tumor content is less than 50% (corresponding to an mBAF of 0.66 for hemizygous loss), despite available paired constitutive DNA [14]. Aneuploidy is problematic for model-based HMM methods when detecting allelic imbalances. For example, using PennCNV and QuantiSNP, the single copy gain on chromosome 13q11-q12.3 is not detected in the pure breast cancer cell line (Figures and ). This failure is a consequence of how BeadStudio centers the copy number profile. A further investigation of the normalization of tumor samples analyzed on Illumina WGG arrays is thus warranted. In concordance with the simulated data, the segmentation approach showed sensitivity similar to that of SOMATICs with decreasing tumor content for all allelic imbalances, except for the single copy gain on chromosome 20p, which was better detected by SOMATICs (Figure ).
Figure 7 Comparison of sensitivity for detecting eight different allelic imbalances in the CRL-2324 dilution series for five methods. Lines correspond to sensitivity for PennCNV (black), QuantiSNP (green), unpaired segmentation (red), SOMATICs (blue), and dChipSNP ...
Application of the segmentation approach to experimental Illumina tumor data sets
To investigate the performance of the segmentation approach in solid tumors, we applied it to two data sets containing matched tumor-normal samples (data sets 1 and 2). By removal of SNPs homozygous in the paired normal sample we generated a tumor specific BAF profile for each sample (as in Figure ), which was transformed to an mBAF profile (as in Figure ). A method for sensitive detection of allelic imbalances in tumors should detect genomic regions containing SNPs with small but distinct differences in mBAF compared to the 0.5 mBAF baseline. Consequently, to compare methods, we calculated the number of SNPs detected as allelic imbalance across a data set for different tumor specific mBAF values (Figure ). We found that the segmentation strategy outperforms PennCNV, QuantiSNP and dChipSNP for both data sets in detecting SNPs at lower mBAF values. The segmentation strategy performs similarly to SOMATICs in both data sets down to mBAF values as low as 0.56, which was used as the cut-off to call allelic imbalance in the segmentation strategy. Paired BAF and copy number profiles for seven paired tumor samples (data sets 1 and 2) with regions called as allelic imbalance marked for PennCNV, QuantiSNP, dChipSNP, SOMATICs, and unpaired segmentation are available as described in Additional data file 4.
Figure 8 Total number of tumor specific SNPs detected as allelic imbalance in two paired tumor data sets plotted against their mBAF values for five methods. From each tumor, SNPs homozygous in the matched blood were removed. Only SNPs in segments of allelic imbalance ...
Detection of homozygous deletions using the B allele frequency alone can be challenging [22]. In the case of complete homozygous deletion in all investigated cells no genetic material remains and the BAF estimates become essentially random due to the low SNP signal intensity [22]. With an increasing fraction of normal cell contamination, BAF estimates for homozygously deleted regions will eventually become indistinguishable from regions of 2N (Figure ). However, homozygous deletions frequently occur within regions of somatic LOH in tumor specimens. Such events can create a clearly distinguishable pattern detectable by the segmentation approach (Figure ). Nevertheless, homozygous deletions are, in general, probably best detected from analyzing copy number ratios [6].
Figure 9 Detection of homozygous deletions in various tumor samples by the segmentation approach. All samples are hybridized on Illumina 300k or 370k BeadChips. For all parts, the upper panel shows the mirrored B allele frequency profile and the bottom panel shows ...
While the segmentation strategy is designed to identify LOH and allelic imbalances in heterogeneous cancer samples, germline CNVs can be either missed or detected depending on their genotype and size. Germline CNVs involving loss result in BAF profiles identical to hemizygous loss in pure tumor samples and hence may be detected due to the absence of heterozygous loci if the CNVs are sufficiently large. Small germline CNVs involving gain of genetic material are not detected if the affected SNPs only show a homozygous genotype (for example, AAA or BBB, giving mBAF values close to 1). Larger germline CNVs involving gain may be detected similarly as for tumors with gain of genetic material.
Estimating cellular composition of samples from segmented B allele frequencies
BAF values in combination with copy number status allow for a direct estimation of the proportion of cells displaying a certain allelic imbalance [22]. For a diploid genome, theoretical BAF values for allelic imbalances such as single copy gain, hemizygous loss or copy neutral LOH can be determined for varying percentages of normal cell contamination. Furthermore, knowledge of the sample purity can be used to estimate the fraction of tumor cells affected by an allelic imbalance.
Two studies have used different approaches to demonstrate how BAF data can be used to estimate normal cell contamination for tumor samples [17]. We have derived equations for how mBAF values expected for different types of allelic imbalances depend on the fraction of cells harboring the imbalance (see Materials and methods). Nancarrow et al. do not present equations, but for the different allelic imbalances in the simulated dilution series we obtain, using their software SiDCoN, theoretical BAF values identical to those obtained with our equations. We conclude that the equations we have derived are equivalent to the approach used by Nancarrow et al. Since the segmented mBAF value of a genomic region represents an average of the investigated SNPs, it can directly be used for estimation of the fraction of cells not affected by an allelic imbalance. We first evaluated the accuracy of the segmented value as a tool for estimation of tumor content in heterogeneous samples using the simulated data set. The simulated tumor content was compared to the value calculated from the observed segmented mBAF values for three different types of allelic imbalances. The segmentation approach finds the theoretical values with high accuracy and provides close estimates of the simulated tumor content (Table ). The discrepancy for the unpaired tumor setting when tumor content is above 95% is due to the fixed mBAF threshold of 0.97 used to filter out SNPs believed to be non-tumor specific.
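Estimating the affected cell fraction from a segmented mBAF value amounts to inverting the expected-mBAF relations. The paper's equations are in its Materials and methods; the sketch below instead assumes the standard mixture model in which unaffected cells are diploid and heterozygous, which gives mBAF = 1/(2 - p) for hemizygous loss, (1 + p)/2 for copy neutral LOH and (1 + p)/(2 + p) for single copy gain. The function and key names are hypothetical.

```python
def cell_fraction_from_mbaf(mbaf, kind):
    """Fraction of analyzed cells carrying an imbalance, from segmented mBAF.

    Inverts the assumed mixture-model relations for a diploid background.
    """
    if kind == "hemizygous_loss":
        return 2.0 - 1.0 / mbaf                 # from mBAF = 1 / (2 - p)
    if kind == "copy_neutral_loh":
        return 2.0 * mbaf - 1.0                 # from mBAF = (1 + p) / 2
    if kind == "single_copy_gain":
        return (2.0 * mbaf - 1.0) / (1.0 - mbaf)  # from mBAF = (1 + p) / (2 + p)
    raise ValueError(f"unknown imbalance type: {kind}")


# Segmented mBAF values that each correspond to 50% affected cells.
loss_fraction = cell_fraction_from_mbaf(1.0 / 1.5, "hemizygous_loss")
loh_fraction = cell_fraction_from_mbaf(0.75, "copy_neutral_loh")
gain_fraction = cell_fraction_from_mbaf(0.60, "single_copy_gain")
```

Single copy gain spans the narrowest mBAF range under these relations (0.5 to about 0.67), so a given error in the segmented value translates into a larger error in the estimated fraction than for loss or copy neutral LOH.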
Estimation of tumor cell content from simulated data using segmentation
To verify the accuracy of the segmented value in experimental Illumina data, we applied it to the CRL-2324 dilution series (data set 3). Three different allelic imbalances with 100% penetrance in CRL-2324 cells were selected (Figure ) for comparing the tumor content estimated by segmentation with the dilution percentage. In concordance with the simulated data, we found that the segmentation approach provides close estimates of the theoretical mBAF values and can accurately estimate tumor content in experimental Illumina data (Table ). The discrepancy for 100% tumor content is due to the fixed mBAF threshold of 0.97. Furthermore, the expected value for 0% tumor content is not in reality 0.5 due to the transformation from BAF to mBAF. The experimental CRL-2324 dilution series shows the expected linear compression of mBAF for the 13q21.31-qter copy neutral LOH region (Figure ). Tumor content appears to be best estimated from regions of hemizygous loss or copy neutral LOH, due to their larger span in mBAF (Figure ). Discrepancies between the dilution percentage and the estimated percentage from segmentation may in part be explained by uncertainty in the measured DNA content, which introduces bias in the expected dilution percentages (Table ). Such bias may explain differences seen in sensitivity between the simulated data set (Figure ) and the CRL-2324 dilution series (Figure ) for low tumor contents. Due to the chosen mBAF threshold of 0.56 for calling allelic imbalance, hemizygous loss cannot be detected below 20%, and single copy gain not below 25% tumor content.
Estimation of tumor content by segmentation in the CRL-2324 dilution series (data set 3)
Figure 10 Theoretical and observed mBAF values for different types of allelic imbalances with increasing tumor content for the CRL-2324 dilution series. (a) Linearity of segmented mBAF values for the copy neutral LOH region on 13q21.31-qter across the CRL-2324 ...
When the tumor content of the analyzed cells is known, the segmentation strategy can be used to estimate the tumor sub-clone content for allelic imbalances. A reported comparison of four different array platforms for detection of CNAs and LOH in chronic lymphocytic leukemias (CLLs) included fluorescence in situ hybridization (FISH) verifications of a number of hemizygous losses observed in tumor cell subpopulations [23]. We applied the segmentation strategy to the Illumina data from this CLL study (data set 4). Our results demonstrate that the tumor cell sub-clone content for hemizygous losses can be accurately estimated from the segmented mBAF value (Table ). Furthermore, the percentage of cells affected by copy number neutral LOH can also be estimated using the segmented value. CLL sample 7 was shown to be copy neutral for chromosome 13, besides a homozygous loss of 13q14 [23] (Figure ). Of the tumor cells, 11% were found to have hemizygous loss of 13q14 and 80% to have homozygous loss by FISH [23]. However, the mBAF profile reveals allelic imbalance of the whole chromosome, implying copy neutral LOH (Figure ). Using the segmented value for chromosome 13, excluding 13q14, we estimated the percentage of tumor cells affected by the copy neutral LOH to be 83%. Intriguingly, this estimated percentage closely matches the fraction of tumor cells shown to have the homozygous loss by FISH (80%) [23]. This observation suggests that a small fraction of tumor cells carry only the hemizygous loss of 13q14 found by FISH, while the larger population has both the bi-allelic loss on 13q14 and loss of one allele followed by duplication of the remaining allele for chromosome 13.
Estimation of tumor cell sub-clone content by segmentation for hemizygous loss in CLL samples
Estimates of tumor content are difficult to obtain and rarely available for solid tumors. Tumor content and tumor cell sub-clone content can be estimated with the segmentation approach under certain assumptions. For example, by assuming that a certain allelic imbalance occurs in all tumor cells, normal cell contamination becomes the sole driving force behind BAF compression. In this case, the tumor cell content can be estimated from the segmented value of the imbalance. Once the tumor cell content is estimated, the fraction of tumor cells affected by every other allelic imbalance can be calculated. In conclusion, we have shown that the segmentation strategy can be used to accurately estimate normal cell contamination and the fraction of cells affected by an allelic imbalance.
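The two-step reasoning in this paragraph can be sketched as follows: a clonal imbalance fixes the tumor content, and every other imbalance is then expressed relative to it. The inverse relations assume the same diploid mixture model as the expected-mBAF values (unaffected cells heterozygous); the function names and the segmented values are hypothetical.

```python
def fraction_from_cn_loh(mbaf):
    """Cell fraction with copy neutral LOH, from mBAF = (1 + p) / 2."""
    return 2.0 * mbaf - 1.0


def fraction_from_hemizygous_loss(mbaf):
    """Cell fraction with hemizygous loss, from mBAF = 1 / (2 - p)."""
    return 2.0 - 1.0 / mbaf


def subclone_fraction(cell_fraction, tumor_content):
    """Fraction of tumor cells carrying an imbalance, capped at 1."""
    if not 0.0 < tumor_content <= 1.0:
        raise ValueError("tumor_content must be in (0, 1]")
    return min(cell_fraction / tumor_content, 1.0)


# Step 1: a copy neutral LOH assumed present in all tumor cells (about 0.8).
tumor_content = fraction_from_cn_loh(0.90)

# Step 2: a hemizygous loss seen in about 40% of all analyzed cells ...
loss_cells = fraction_from_hemizygous_loss(0.625)

# ... is therefore carried by about half of the tumor cells.
affected = subclone_fraction(loss_cells, tumor_content)
```

The estimate is only as good as the assumption in step 1: if the reference imbalance is itself subclonal, tumor content is underestimated and every derived sub-clone fraction is inflated.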
Application of the segmentation approach to Affymetrix WGG data
Allelic imbalances for Affymetrix data are usually not displayed using BAF plots. BAF estimates can, however, be generated for Affymetrix WGG data in a similar fashion as for Illumina [23]. Technical variation in BAF estimates appears to differ between Affymetrix and Illumina WGG data, as observed by Gunnarsson et al. The difference is further illustrated in Figure S3 in Additional data file 1 for a urothelial carcinoma hybridized on an Illumina 370k BeadChip and on an Affymetrix 250k Nsp array. The values for the thresholds in the segmentation strategy need to be modified in order for the strategy to handle the larger variation in Affymetrix BAF estimates (Figure S4 in Additional data file 1). Due to larger variation for homozygous SNPs, both the mBAF threshold and the triplet cut-off need to be reduced to filter out non-informative SNPs. As a consequence, the sensitivity is reduced for tumor samples of high purity. Additionally, the increased mBAF variation results in increased average values for segments. By replacing the mean with the median in the CBS algorithm when determining the segmented value for a genomic region, such increases can be counteracted.
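The effect of summarizing a segment by its median rather than its mean can be illustrated with a few values. This sketch covers only the summary step and assumes the breakpoints have already been found by a segmentation algorithm; the function name, the half-open index pairs and the toy profile are assumptions.

```python
from statistics import mean, median


def segment_values(mbaf, segments, summary=median):
    """Summarize each segment, given as a half-open (start, end) index pair."""
    return [summary(mbaf[start:end]) for start, end in segments]


# Toy profile: the first segment contains one residual high-mBAF SNP.
mbaf = [0.52, 0.51, 0.55, 0.95, 0.70, 0.72, 0.71]
segments = [(0, 4), (4, 7)]

by_mean = segment_values(mbaf, segments, summary=mean)
by_median = segment_values(mbaf, segments, summary=median)
```

In the first segment the single residual SNP at 0.95 pulls the mean well above the median, which stays close to the bulk of the heterozygous SNPs; the second segment, without outliers, is essentially unaffected by the choice.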
Applying the segmentation strategy to two urothelial tumors analyzed on Affymetrix 250k Nsp arrays demonstrates how regions of allelic imbalance in solid tumors, missed by both dChipSNP and CNAG [24], can be detected (Figure ). To estimate the tumor fraction affected by specific allelic imbalances, we applied the segmentation approach to Affymetrix data for CLL cases 8, 9 and 10 (data set 4) and investigated the same hemizygous deletions as we did using the Illumina data. For the hemizygous losses we obtained the tumor content estimates 75%, 56% and 85%, respectively, which are comparable to the results for the Illumina data (Table ) and also closely match the FISH results. The percentage of tumor cells affected by copy neutral LOH on chromosome 13 in CLL sample 7 was, as for the Illumina data, estimated to be 83% using the segmented mBAF value. B allele frequency and copy number profiles for the two urothelial tumors in data set 5 with regions called as allelic imbalance marked for CNAG, dChipSNP, and unpaired segmentation are available as described in Additional data file 4. In conclusion, the segmentation approach can be applied to Affymetrix WGG data with modified parameter values to address the larger variation in BAF estimates for this platform.
Figure 11 Application of the segmentation strategy to urothelial tumors hybridized on Affymetrix 250k Nsp arrays. Bars indicate allelic imbalances detected by unpaired segmentation (red), CNAG (blue), and dChipSNP (purple). In both parts, the top panel shows BAF ...