Our normalization approach mainly consists of three steps: channel standardization, genome composition artifacts correction and recurrent genome artifacts correction. All of these three steps are necessary to improve single-cell CNV detection. The investigation of the single-cell amplification bias is covered in the 'Exploration of the amplification bias' section and the exploration of genome artifacts is covered in the 'Detection of copy number variation' section.
Exploration of the amplification bias
We first explored the amplification bias caused by the different natures of the test and reference samples with the help of graphical plots. MA, density, and quantile-quantile (QQ) plots are used to check for potential artifacts before and after normalization. The y-axis and x-axis of a MA plot represent the log2 ratios and average log2 intensities between two hybridized samples, respectively. The points of a MA plot should be randomly located around zero in the y-axis if no large aberrations or artifacts exist in the data. The density plot and QQ plot are graphical techniques to show the similarity between intensity distributions from test and reference samples. If the test sample intensities are distributed similarly to reference intensities, the density plot of two hybridized samples should overlap and the QQ plot should be located along the 45-degree line.
An obvious intensity-dependent pattern is observed in the MA plot of all single-cell array CGH experiments (Figure ; Additional file 1
). The pattern visualized using the red lowess smoothing line shows that the log2 ratio increases nonlinearly with the increase of the average intensities in the single-cell array CGH data. In contrast, the MA plot of an array CGH experiment using non-amplified genomic DNA shows no aberrant pattern (Figure ). Since both array CGH experiments were performed using the same series of Agilent 244K arrays and the only difference between them was the processing of the test samples, we suspect that the intensity-dependent pattern artifact is caused by the amplification of the single-cell DNA. This suspicion is confirmed by the larger standard deviation (SD) of the intensities in the amplified test sample compared to the non-amplified reference sample (Figure ). Consequently, the median SD of single-cell array CGH log2 ratios is 1.38, ranging from 0.85 to 1.44 across 7 arrays, whereas that of the genomic array CGH experiments is 0.28, ranging from 0.2 to 0.35 across 6 arrays. This larger SD of log2 ratios in the single-cell array CGH experiments hampers the accurate detection of CNVs at the single-cell level. It is thus necessary to remove this amplification bias.
Figure 3 MA plot of a single EBV-transformed cell. (a-c) MA plot for EBV-transformed single lymphoblastoid cell 1162 before normalization (a), genomic DNA before normalization (b), EBV-transformed single lymphoblastoid 1162 after channel standardization (c). The (more ...)
Figure 4 Density plot of a single EBV-transformed cell. (a,b) Density plot for EBV-transformed single lymphoblastoid cell 1162 before normalization (a), and after channel standardization (b). The solid line represents the reference sample and the dashed line represents (more ...)
After the data are normalized by the channel standardization step, the pattern between averaged intensity and log2 ratio disappears and the lowess curve fitted to the data is close to horizontal (Figure ). The intensity distributions of the reference and test samples are adjusted to have approximate mean zero and SD equal to 1 (Figure ). The QQ plot in Figure shows that most points after the channel clone normalization are located around the 45-degree reference line, meaning that the intensities of normalized test and reference samples follow similar distributions. We conclude that the amplification bias has been successfully removed by the channel standardization step.
Detection of copy number variation
After the exploration of the amplification bias, we checked the impact of genome composition artifacts and recurrent genome artifacts on the performance of single-cell CNV detection using the CBS algorithm [17
]. Genome composition artifacts, appearing as incorrect chromosomal aberrations, are frequently observed in the array CGH data. These artifacts are illustrated in Figure S2a,b in Additional file 2
with the low log2 ratios of the chromosome 1 p terminus and the chromosome 10 q terminus. Studies have shown that these genome composition artifacts could be caused by GC content as well as other unknown factors [18
We therefore use a genome composition correction step to correct the artifacts caused by GC content and a recurrent artifact correction step to correct unknown recurrent artifacts. For the genome composition correction step, we considered two possible methods: correction based on the GC content of (1) the probe sequence itself or (2) an enlarged window around the probe. Similarly, for the recurrent genome artifact correction we also considered two methods: (1) CBS segmented residuals followed by the recurrent genome artifact correction and (2) an artifact correction without the CBS segmentation in advance. The details of the channel clone normalization are introduced in the Materials and methods section. We compare our channel clone approach with four sub-methods to show that the combination of channel standardization, genome composition artifact correction and recurrent genome artifact correction together give the best single-cell CNV detection performance: CG (channel plus genome composition correction using enlarged window GC contents); CA (channel plus recurrent genome artifact correction without CBS segmentation); CGprobeA (channel plus genome composition correction using probe GC contents plus recurrent genome artifact correction without CBS segmentation); CGACBS (channel plus genome composition correction using enlarged window GC contents plus CBS segmented residuals followed by recurrent genome artifact correction); channel clone (channel plus genome composition correction using enlarged window GC contents plus recurrent genome artifact correction without CBS segmentation).
The genome profiles before and after genome composition correction are shown in Additional file 2
. It is obvious that the GC-content-related artifacts, appearing as a wave pattern in Figure S2a,b in Additional file 2
are adjusted after the genome composition correction shown in Figure S2c,d in Additional file 2
. Similarly, Figure shows that the CNV detection performance of CG with a TPR 0.86 and FPR 0.06, respectively, is better than for the methods that do not account for genome composition correction (for example, global loess, CGHnormaliter, and poplowess).
Barplot of true positive rate and false positive rate of 7 EBV-transformed cells. All the TPRs and FPRs were calculated after global loess, CGHnormaliter, poplowess, Haarseg, CG, CA, CGprobeA, CGACBS and channel clone normalization approaches.
Different studies have used genome composition corrections to correct the genome wave pattern [18
]. Array CGH hybridization is influenced not only by the GC content of the probe sequence but also the DNA sequences that lie in an enlarged window around the probe sequence corresponding to a DNA sequence fragment the probe hybridizes to. Diskin et al.
] used an ordinary linear regression model to regress the Log2Ratio on the GC content of a fixed 1Mb window size around the probe to correct the genome composition artifacts. Since this method was developed for single-channel arrays and cannot be directly implemented for the two-color arrays, we developed a comparable but more elaborate genome composition correction approach. To account for the GC content of the unknown genome fragments, our method extracts the GC percentage from different window sizes around each probe and elects the window size with the highest correlated GC content to the log2 ratio for the genome composition correction. Secondly, in contrast with Diskin et al
.'s method, we use a weighted linear regression model with larger weights for the GC-rich probes to avoid the overcorrection of real chromosomal aberrations. Other genome correction methods could also be valid. However, comparison of all GC correction methods is outside the scope of our study. To show that accounting for the GC content from enlarged window sizes improves the genome composition correction, we also performed the correction based only on the GC content of each probe, as proposed by the CGprobeA normalization. Figure shows that the TPR and FPR values are 0.86 and 0.015, respectively, for the CGprobeA normalization method, whereas the values for our channel normalization are 0.98 and 0.006, respectively. This comparison confirms the importance of finding the optimal GC-content window for the genome composition correction.
The impact of the recurrent genome artifact correction of each chromosome is especially explained in Additional file 3
and shown in Additional files 4
. For instance, chromosome 3 of EBV-transformed cell 1168 was experimentally confirmed to have no aberrations. However, two deletions at the location around 50 Mb and the q-arm terminal region were observed when no correction was applied (Figure S3a in Additional file 3
). The estimated common profile of chromosome 3 (Figure S3b in Additional file 3
) shows the artifacts at the same locations as in the individual profile of EBV-transformed cell 1168. Since the common profile is estimated across all the EBV-transformed samples, the artifacts observed in the common profile represent the recurrent genome artifacts existing in multiple EBV-transformed samples. Figure S3c in Additional file 3
shows that after the extraction of the estimated common profile, these two artifacts have been removed and the segmentation line of this chromosomal profile is horizontal around the zero line.
Comparison of the CG and CA methods to channel clone normalization is shown in Figure andTable . Both the CG and CA normalization methods show lower TPRs and larger FPRs for single-cell CNV detection performance. These results confirm our hypothesis that not all genome artifacts can be explained by GC content. Our channel clone normalization method removes genome composition artifacts, as well as unknown recurrent genome artifacts. Therefore, the combination of channel standardization, genome composition and recurrent genome artifact corrections, which we propose, gives the best single-cell CNV detection performance, with a TPR of 0.98 and a FPR of 0.006.
A recent study suggests that the combination of segmentation with recurrent genome artifact correction can improve aberration detection in genomic array CGH applications [16
]. We tested this CGACBS approach on our single-cell array CGH data. Table shows that the TPR and FPR of CGACBS are 0.86 and 0.02, respectively, which is outperformed by channel clone normalization, with values of 0.98 and 0.006, respectively. CGACBS uses CBS segmented residuals for genome artifact correction to avoid overcorrection of real chromosomal aberrations. However, this method also protects genome artifacts with log2 ratios comparable to real aberrations from being corrected. Consequently, it results in higher false positive calling of aberrations. Therefore, it is a trade-off between keeping real aberration signals and removing undesired genome artifacts.
Moreover, we have compared our normalization approach to global loess, CGHnormaliter, poplowess, Haarseg, and GADA methods. Using the TPR and FPR as given in Figure andTable we compared the overall CNV detection performance for global loess, CGHnormaliter, poplowess, Haarseg, and channel clone normalization. The TPR values were 0.66, 0.71, 0.71, 0.66, and 0.98, respectively, while the FPRs were 0.13, 0.09, 0.15, 0.05, and 0.006, respectively. Although the recently developed poplowess and CGHnormaliter normalization methods perform better than the original global loess normalization, they have a high FPR as well. The common feature of both methods is the separation of probes with normal log2 ratios from probes with aberrant log2 ratios, as well as the normalization of the data based on the normal probe log2 ratios; however, this is not suitable in single-cell array CGH. The reason is that many genome artifacts appear next to real aberrations caused by amplification bias in the single-cell approach. As a consequence, these genome artifacts are incorrectly segmented or clustered by the CGHnormaliter or poplowess algorithms into aberrant groups, yielding poor results.
The channel clone normalization method has shown its advantage in correcting recurrent genome artifacts across samples. Notice that CBS fails to detect a 2.22 Mb deletion at the chromosome 20 p terminus of cell 1151 after channel clone normalization (Figure S12 in Additional file 12
). The possible reason is that this deletion is located in the terminal region of a chromosome with a short length of 2.22 Mb. This aberration thus shows a pattern similar to the artifacts located at the same position and results in an overcorrection by the channel clone normalization. However, considering the large FPR caused by chromosomal artifacts from the single-cell array CGH, it is worthwhile to reduce the FPR from around 10% to 0.6%, even while missing one short aberration.
The performance of global loess, CGHnormaliter, poplowess, Haarseg and channel clone normalization on each genome profile is shown in Figures and and Additional files 4
. For instance, cell 1151 carries a known terminal 9.3 Mb duplication at the chromosome 18 p terminus (Figure ). This duplication is called after channel clone normalization, but not after the other loess-based methods. Figure illustrates that chromosome 21 of cell 1160 is expected to have no aberration. This is confirmed by SNP-array analysis that revealed no loss-of-heterozygosity for this 21q-ter segment. However, the q-terminal region of this chromosome is detected as a deletion after global loess, CGHnormaliter and poplowess normalizations, thus resulting in a false-positive CNV region.
Figure 6 Copy number variation detection in chromosome 18 of an EBV-transformed sample. (a-d) CBS segmentation of chromosome 18 from the EBV-transformed single lymphoblastoid cell 1151 using global loess normalization (a), CGHnormaliter (b), poplowess (c), and (more ...)
Figure 7 Copy number variation detection in chromosome 21 of an EBV-transformed sample. (a-d) CBS segmentation of chromosome 21 from the EBV-transformed single lymphoblastoid cell 1160 using global loess normalization (a), CGHnormaliter (b), poplowess (c), and (more ...)
Haarseg is an algorithm integrating signal smoothing, normalization, segmentation, and copy number calling [13
]. However, this algorithm performs somewhat conservatively in calling chromosomal aberrations in the single-cell array CGH data, even though it gives a lower FPR than loess-based normalization methods. We also checked the performance of GADA in the single-cell application. GADA is an iterative procedure combining normalization and segmentation by sparse Bayesian learning. Around 800 breakpoints were detected in each EBV-transformed sample by GADA (Additional file 18
). This is biologically unrealistic, and we conclude that many false positive aberrations have been detected. Although Haarseg and GADA are suitable in genomic array CGH data [13
], the implementation of these methods becomes inappropriate for single-cell array CGH data. The channel clone method outperforms these methods, having the largest TPR (0.98) and smallest FPR (0.006). Clearly, channel clone normalization improves the TPR considerably compared to these other normalization algorithms or normalization integrated algorithms for single-cell array CGH.
Recently, a unified model has been developed by the simultaneous integration of normalization, segmentation and copy number calling [16
]. This model has been shown to be efficient for genomic array CGH data. The advantage of this model is that it can incorporate existing preprocessing methods into one model. It would be attractive to enrich this model by accounting for single-cell data properties for single-cell CNV detection in the near future.