PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
IEEE/ACM Trans Comput Biol Bioinform. Author manuscript; available in PMC 2017 May 1.
Published in final edited form as:
PMCID: PMC4905595
NIHMSID: NIHMS722963

Optimization of Signal Decomposition Matched Filtering (SDMF) for Improved Detection of Copy-Number Variations

Abstract

We aim to improve the performance of the previously proposed signal decomposition matched filtering (SDMF) method [26] for the detection of copy-number variations (CNV) in the human genome. Through simulations, we show that the modified SDMF is robust even at high noise levels and outperforms the original SDMF method, which indirectly depends on CNV frequency. Simulations are also used to develop a systematic approach for selecting relevant parameter thresholds in order to optimize sensitivity, specificity and computational efficiency. We apply the modified method to array CGH data from normal samples in the cancer genome atlas (TCGA) and compare detected CNVs to those estimated using circular binary segmentation (CBS) [19], a hidden Markov model (HMM)-based approach [11] and a subset of CNVs in the Database of Genomic Variants. We show that a substantial number of previously identified CNVs are detected by the optimized SDMF, which also outperforms the other two methods.

Index Terms: Bioinformatics, array CGH, matched filtering

1 Introduction

Copy-number variations (CNV) are relatively large (≥1000 base pairs) structural changes in the genome that reflect either normal genetic heterogeneity [15], [18], [22], [23], or various pathologies [8], [16], [17], [24], [29]. CNVs vary considerably in length. Although they may be located in genomic regions that encode proteins, they are often found in regions that either contain functional non-coding DNA elements or have currently unknown function.

Genomic data are typically contaminated by various types of noise making it difficult to identify CNVs. Over the last decade various methods have been developed for CNV detection [4], [5], [7], [10], [11], [13], [19], [21], [25], [28]. A few signal processing methods have also been proposed [3], [6], [14], [20], [26]. Circular binary segmentation (CBS), a change-point detection method [19], is popular for CNV detection in array comparative genomic hybridization (aCGH) data. Other methods, such as the much faster HaarSeg segmentation [6], hold promise for robust CNV detection at low computational cost. We recently proposed a signal processing-inspired methodology (signal decomposition matched filtering (SDMF)) for improving the signal-to-noise ratio (SNR) in genomic data and consequently facilitating CNV detection [26]. As a signal processing approach, SDMF is appropriate for any signals that are either continuous or can be treated as such, e.g., aCGH and high-resolution next-generation sequencing (NGS) data.

It is anticipated that in the near future CNVs will be detected using multiple high-resolution platforms. NGS is becoming less expensive for investigating the entire genome. Consequently, the role of the lower resolution aCGH may eventually change from a detection to a validation tool, i.e., aCGH-based CNVs may be used to validate CNVs detected in NGS data. However, some genomic studies may still use aCGH since it directly estimates a relative measure between test and reference sequences (the ratio of fluorescence intensities), whereas NGS requires the identification of a reference genome for comparison to the genome of interest. The selection objectivity, estimation and robustness of this signal are important technical issues that will eventually be resolved as NGS evolves. Finally, large volumes of publicly available aCGH data, e.g., in the cancer genome atlas [1] remain valuable in genomic studies. Given the value of aCGH and promise of NGS, it is desirable to develop CNV detection methods that are applicable, robust and computationally efficient across platforms. A common methodology also eliminates method-related uncertainties that arise when distinct approaches are applied to distinct data types. Advantages and shortcomings of various methods developed for aCGH and applied to NGS data were recently shown in [9]. Overall, it is expected that signal processing methods, which are inherently based on principles of analysis of high-dimensional and correlated signals, may be more appropriate for CNV detection in aCGH and NGS data, particularly as progressively larger volumes of continuous NGS data from longer segments of the genome become available.

SDMF aims to improve the data SNR first by eliminating noisy signal components, followed by matched-filtering [27] resulting in local SNR changes. By design, SDMF compares pairs of signals to assess their waveform similarity/dissimilarity. The original study [26] focused on the development of the SDMF method and optimization of the denoising process (the first step of the method). It was tested with normal aCGH samples from the TCGA. Given that CNVs in the healthy genome are relatively sparse (often with frequencies ≤10%), typically a large number of pairwise sequence comparisons will result in SNR changes that reflect dissimilarity (mismatch). Thus, in the original SDMF, SNR changes following matched-filtering between each sequence and all others in the dataset were averaged and compared to a mismatch threshold. For relatively rare CNVs, this averaged SNR reflects the overall mismatch between each sequence in the small subset of the dataset that contains a CNV with the majority of sequences that have no CNV in a particular segment. Although this approach is reasonable in this setting, it makes SDMF indirectly dependent on CNV frequency. As the latter increases, the average of pairwise SNR changes may no longer reflect overall mismatch and may not meet the mismatch threshold. Thus, as CNV frequency increases, the sensitivity of SDMF is likely to decrease. It is more appropriate to eliminate this averaging, use the traditional application of matched filtering where SNR gain due to signal match is the relevant measure, and estimate relevant thresholds for SNR gain. Finally, for SDMF to become a computationally efficient and useful tool for analysis of genomic data, its parameters also need to be optimized, primarily the processing window length and the SNR gain threshold.

To address the limitations of the original method and expand the capability of SDMF to detect CNVs independently of their frequency of occurrence in a dataset, we have developed a modified SDMF method that uses pairwise SNR gains to identify genomic regions potentially containing CNVs in sequence pairs. Therefore, the proposed method can detect both rare CNVs that may occur in as few as one pair of sequences as well as CNVs that may occur in as many as the entire sequence dataset. In addition, we have developed a systematic approach for parameter and relevant SNR threshold selection. Using extensive simulations we have investigated the effects of noise level, processing window length, CNV number, frequency, length and distance between CNVs on SNR changes. These simulations provide important information into the effects of various parameters on the performance of SDMF. Results from these simulations may be used to select an optimal combination of a processing window length and SNR threshold for improved CNV detection. We have applied the improved SDMF method to 429 normal sequences from the TCGA, and have compared detected CNVs to i) those obtained using the original SDMF, ii) CNVs identified using CBS and the Hidden Markov modeling (HMM) approach proposed by [11] and iii) CNVs in the database for genomic variants (DGV) [2]. We have selected these two approaches based on their wide application to array CGH data for CNV detection. Finally, using simulations, we have systematically compared the computational efficiency of SDMF and CBS.

2 Methods

2.1 Overview of SDMF

SDMF is a two-step algorithm. The first step involves filtering through sequence decomposition. The empirical mode decomposition (EMD) approach [12] is used to decompose each aCGH sequence y(k) into its dominant components (modes) in an unsupervised fashion. When linearly superimposed, these modes yield a signal that is an accurate representation of the original signal, i.e., the mean square error (MAE) or reconstruction error between y(k) and the estimated signal ŷ(k) is very small. The EMD method fits spline functions through local extrema of the original signal y(k) to obtain upper and lower envelopes of the signal. At the first iteration, the mean of these envelopes m1,1(k) is subtracted from y(k) to obtain the residual h1,1(k). If this residual’s mean is zero, then h1,1 represents the first mode. Typically this is rarely the case, and the process of envelope estimation and subtraction from y(k) is repeated several times until convergence, i.e., until the mean of h(·) reaches a pre-defined threshold ε [double less-than sign] 1. The mode d1(k)= h1,q(k), estimated at iteration q is then subtracted from the original signal y(k) to obtain the residual signal r1(k) = y(k) − d1(k), which is treated as the new signal to be decomposed. This process is repeated for r2(k), …, rM–1(k), to obtain modes d2(k), …, dM(k). The decomposition stops once the difference between successive reconstruction errors, based on sets with progressively higher number of modes, becomes negligible. To minimize the MAE, in the re-synthesized signal y^(k)=i=1Mdi(k) all estimated modes need to be superimposed, irrespective of whether they may be signal- or noise-related. So, if one of these modes is noise-related and needs to be eliminated for denoising purposes, the resulting MAE of the re-synthesized signal without that mode will increase and at the same time the variance of that signal will decrease. In order to eliminate noise and/or artifact-related modes from genomic sequences without substantially affecting their signal structure, an optimization approach was proposed in [26]. Specifically, a risk function was defined as the difference between the MAE and the signal variance. As modes are progressively eliminated, the MAE increases, the signal variance decreases and the difference between these two statistics also changes. Minimization of this risk function results in an optimum number of signal components to be removed while maintaining the original structure of the signal. This process is described in detail in [26].

The second step of SDMF involves comparison of pairs of sequences through matched filtering. The match-filter (MF) maximizes SNR, i.e., the ratio 20log10<y(.)>rms<ν(.)>rms, where ν(k) is Gaussian noise and <·>rms denotes root-mean square, by linear convolution of a known signal (the template) with an unknown signal (the measurement). SNR increases in regions of template-measurement waveform match. The resulting increase in SNR following this process is referred to as SNR gain. By definition, the filter f(k) that maximizes SNR of signal y(k) at spatial location K is the space- (or time in the case of time series) reversed signal y(Kk). In the absence of a known template, a pair of signals may be convolved to assess their similarity or dissimilarity. Given that the match-filter is a quasi-optimum filter, in traditional signal processing applications the filtered data are used in all subsequent analyses. However, in the context of SDMF, the matched filtering process is only used to identify regions potentially containing CNVs, i.e., the actual matched-filtered data are not used in further analysis. Instead, this process is used to determine waveform similarity/dissimilarity of pairs of segments within genomic sequences based on matching/mismatching-induced SNR changes. For example, if a pair of segments contains a CNV, i.e., the pair’s waveforms are similar, the SNR increase (gain) following matched-filtering reflects this similarity. If only one of the two segments in the pair contains the CNV and the other segment only contains noise (in the absence of CNVs aCGH segments should in theory have zero amplitude but are typically contaminated with random noise), the SNR change following matched-filtering reflects this waveform dissimilarity.

The primary difference of the two versions of SDMF is in how they identify regions potentially containing CNVs based on SNR changes following matched filtering. For a particular genomic segment of interest, once every pair from a set of N sequences has been convolved with each other, the ratio of pre- over post-filtering SNR ΔSNR=20log10<yMF>rms<yobs>rms is computed. In the original SDMF, for each sequence, the N − 1 ΔSNRs resulting from its convolution with N − 1 sequences are averaged to obtain ΔSNR¯. A segment is further examined for potential CNVs if ΔSNR¯ is less than a pre-established threshold. The original SDMF further assumes that CNVs are relatively rare and thus the majority of sequences will contain noise and only a few will contain a CNV. Consequently, for a sequence that contains a CNV, its comparison with sequences that do not contain that CNV will result in a ΔSNR¯ that will reflect an overall mismatch. The threshold set for SNR associated with mismatch was set based on a simple simulation and is somewhat arbitrary. The limitation of the original SDMF is that the above assumptions may cause SDMF to miss CNVs with high frequencies of occurrence, since in these cases the ΔSNR¯ may reflect an overall match of a particular sequence with several others that also contain a common CNV and will not meet the threshold for mismatch. In addition to pathological CNVs, which may have high frequencies of occurrence in particular patient cohorts, even the healthy genome includes a few common CNVs with relatively high frequencies.

To address the limitations of the original SDMF and eliminate the indirect dependence of the detection on CNV frequency, the optimized SDMF does not use SNR averaging following matched filtering, but instead examines SNR changes at the sequence pair level. The proposed modification allows the detection of CNVs of any frequency, from a single pair of sequences to the entire dataset. However, it still requires setting a threshold for SNR gain. Using extensive simulations the present study investigated various potential factors affecting SNR, including the processing window length, number, length and space between CNVs. A practical approach to the selection of the SNR threshold is proposed. Fig. 1 summarizes the optimized SDMF.

Fig. 1
Schematic organization of the optimized SDMF. The first step consists of denoising individual sequences through a time-domain decomposition based on the EMD method, and elimination of noise-related components using the approach in [26]. Pairs of denoised ...

SNR gain is proportional to the amplitudes of the signals that are being compared and the length of the window over which the convolution is computed. Once thresholds have been established for detection, the raw (globally denoised) data in pairs of segments that exceed these thresholds are examined to identify specific types of aberrations (gains ( log232) or losses ( log212)). The optimal processing window length depends on CNV size which is not a priori known. Thus the data may be processed using windows of various length, each resulting in different SNR gains. In typical applications of matched filtering the filtered signal is used in all further analysis. In the absence of precise knowledge of the template waveform and length, longer processing windows may result in higher SNR but also in signals where the original waveform is distorted. Thus, there is always a trade-off between window length and signal resolution. However, in this application matched filtering is solely used to identify segments in pairs of sequences potentially containing CNVs, i.e., the filtered signal is not used in further analysis, and the issue of its waveform shape is not as relevant. Nevertheless, for different filter lengths it is important to examine segments adjacent to the one of interest for potentially spurious SNR changes not associated with signal similarity. This issue is addressed in simulations below.

2.2 Simulations

Simulations were performed using the following signal model:

x(k)={i=1Ns(k)Ki+n(k)k[k0,i:k0,i+Ki]n(k)otherwise,
(1)

where x(k) is a linear superposition of Gaussian noise n(k) and the CNV signal s(k)Ki, i.e., the log2 ratio that is non-zero in the interval [k0,i, k0,i + Ki] and zero otherwise. CNV length Ki is variable. N is the number of CNVs. Simulated data consisted of 200 sequences containing one or a combination of multiple gains and losses and 100 sequences of pure Gaussian noise, with zero mean and variable variance. All sequences were 4,000 probes long. Without loss of generality, CNV amplitude was either a single gain ( log2(32) or a single loss ( log2(12)), rather than multiple gains or 2 losses. The noise level (variance) was varied at 4 percent increments, from 4 percent of the amplitude of the gain or loss to 400 percent. Although high noise levels that completely mask the CNV are possible, they may not be frequent, particularly in sequences that have been denoised using SDMF’s first step. However, here they were simulated to demonstrate adequate performance of SDMF even in extreme cases. In all simulations each sequence was compared to all others. Consequently, pairs of sequences with similar or dissimilar levels of noise were match-filtered. This is a realistic scenario as noise may vary modestly within sequence batches but substantially across batches or arrays. Each dataset of 300 sequences with different CNV distributions was analyzed using processing windows 50, 100, 150, 200, 300 and 400 data points long. In each case, the window covered the region(s) containing the CNV(s) differently. Note that varying the window length is equivalent to varying probe density in a sequence.

Simulation 1—One CNV with fixed length and location

In this simplest case, one CNV (200 points long, in locations 2,001 to 2,200) was completely aligned in all sequences that contained it. The only variable parameter was noise level. An additional simulation with one large CNV (2,000 points long, in locations 22–2,021) was also performed. There are pathologies, such as cancer, where large genomic amplifications or deletions sometimes spanning an entire chromosome arm have been reported. The same window lengths were used as in all other simulations.

Simulation 2—One CNV of variable length but common starting location across sequences

In real data a CNV may have somewhat different lengths in individual sequences, even if its starting location is exactly the same. Depending on pairwise differences in CNV length, SNR gains following match-filtering may be substantially different than SNR gains in sequence pairs that contain a CNV with the same length. In this simulation, CNV length was increased by 2 points in each sequence, i.e., in the first sequence, CNV length was 202 points and in the 100th sequence it was 400 points. Thus, different processing windows covered each CNV differently.

Simulation 3—One CNV of variable length and starting location

Both length and starting location were simultaneously varied. CNV length was progressively increased by 2 points as before and the starting location was progressively shifted to the left by 2 points. Thus, the first sequence started at location 2,001 − 2 = 1,999 and ended at location 2,202 (2,200 + 2 points), and the 100th sequence started at location 1,801 and ended at location 2400 (2,200+200 points).

Simulation 4—Multiple CNVs

This simulation addressed the issue of SNR gain in a processing window partially/completely covering CNVs in close proximity to each other. One hundred sequences included a gain in locations 2,001–2,200, a loss in locations 2,401–2,600, a gain in locations 2,801–2,900, and a loss at locations 3,201–3,450. Another 100 sequences included the same number of CNVs and at the same locations, but with gains and losses switched in comparison to the first 100 sequences.

Simulation 5—Imprecise a priori information on CNV length

A more realistic data scenario was also simulated, where a processing window length was selected based on the data structure. Two hundred sequences contained CNVs 10, 50, 100, 120, 200, 350 and 620 points long. A simple but imprecise detector was applied to one sequence with fairly low SNR (~ −5 dB) to identify segments of consecutive data (or non-consecutive but no more than 5 data points apart) with amplitudes log2(32) or log2(12). Application of the detector to a sequence will low SNR was selected as a worst case scenario, as it led to imprecise estimates of CNV length and consequently a selection of processing window lengths that in some cases were not well correlated with CNV length. In practice it could be applied to any and all sequences in a dataset prior to the selection of processing windows. Based on this detector, we estimated CNV lengths 9, 27 (not close to any of the CNV lengths), 50, 94, 123, 203, 350 and 617 points. The data were analyzed using these as processing window lengths.

Simulation 6—Variation of CNV frequency

To compare the performance of the original and proposed SDMF methods, CNV frequency was also varied in data generated for simulation 5. A single processing window of moderate size (123 points) was selected, given that the effects of varying the window length were assessed in simulation 5. The goal of this simulation was to compare the two approaches using the same method parameters, including window length. Simulations 1–5 assumed that CNVs were in 200 of 300 sequences (frequency ~67 percent). The original SDMF is expected to perform relatively poorly for CNVs with high frequency, since the averaged SNR change for all comparisons of each sequence with all others may not meet the mismatch threshold. In this simulation an increasingly larger number of randomly selected sequences were replaced by pure noise, to progressively decrease the CNV frequency in the dataset from ~67 to ~33 percent, ~17 and 10 percent.

2.3 Real Genomic Data

The proposed and original SDMF were applied to 429 normal sequences from TCGA, also analyzed in [26]. Data from 10 chromosomes were processed. Detected CNVs were also compared to these detected by CBS and by a HMM. Similarly to SDMF which compares local data patterns (waveforms) in pairs of sequences, the HMM approach explicitly models these local patterns and was, therefore, chosen as an additional relevant method for comparison to SDMF. Without an independent validation of identified CNVs, it is not possible to assess the sensitivity and specificity of any detection method when applied to real data, despite agreement with other computational approaches. To compare the relative performance of the two versions of SDMF, CBS and the HMM for real data, a subset of CNVs (gains and losses) in the database for genomic variants was selected, from studies with ≥250 samples. Numbers of gains and losses separately estimated for each chromosome, and CNV locations that our results were compared to can be found in http://www.hsph.harvard.edu/rebecca-betensky/software. Merged-level CNV coordinates are reported, i.e., sample-level CNVs with at least 70% spatial overlap were merged in the DGV. Note that in Stamoulis& Betensky (2011), CNVs were selected from the DGV using more subjective criteria for CNV selection and no coordinate merging. No criteria based on sample demographics were used to select CNVs in the DGV as corresponding demographic information for the TCGA samples were not available.

3 RESULTS

3.1 Simulations

In all simulations true and false positive rates (TPR and FPR) were calculated for every segment covering a CNV partially or totally. Sensitivity (TPR) was calculated from 200 sequences containing a CNV, as the ratio of detected true positives over all true positives. Specificity (1-FPR) was calculated from 100 pure noise sequences as the ratio of detected true negatives over all true negatives. False negatives were calculated again from 200 sequences containing a CNV, and false positives were calculated from 100 noise sequences. Receiver operating characteristic (ROC) curves and areas under the curves (AUC) were also estimated, as the SNR gain threshold varied from 0 to ~90 dB at 1 dB increments. Fig. 2 shows SNR gain as a function of the original sequence SNR, for different template lengths in simulation 1. Each data point on the y-axis corresponds to an SNR change following matched filtering of one pair of sequences (with a total of (3002) SNRs) at individual segments containing CNVs. Given that in these simulations the noise level is a priori known, the SNR of the original sequences are known and are plotted in the x-axis.

Fig. 2
Case 1: SNR change following matched filtering as a function of the original sequence SNR. Each color corresponds to a window length: 50 (black), 100 (blue) 150 (cyan), 200 (green), 300 (yellow), and 400 (magenta) points. SNR changes for each sequence ...

SNR changes were clustered: SNR gains >20 dB irrespective of the original signal SNR were estimated for both gains and losses (loss-related SNR gains were slightly higher) when pairs of sequences both containing the CNV were match-filtered. A wide range of SNR changes were estimated (from ~ − 40 dB to ~ + 40 dB) for sequences with very low SNR (< ~ − 5 dB). SNR gains were estimated for pairs of low SNR sequences that both contained the CNV, SNR changes < 0 dB were estimated for pairs of pure noise sequences or pairs of noise and low SNR sequences. Finally, another cluster of SNR changes, including lower than ~ 20 dB gains and <0 dB changes were identified for noisier sequences matched to each other or to pure noise. These clustered SNR changes suggest that a threshold for SNR gain may be estimated to maximize CNV detection.

The ROC curves for simulations 1–4 are shown in Fig. 3. In the case of one CNV with fixed length and location the optimum processing window corresponded to CNV length. However, even for a 400 points-long window covering the entire CNV (200 points) and a 200-point noise segment, the performance of the detector was still good. In the case of the 2,000-point CNV, AUCs were 0.91, 0.94, 0.95, 0.97. 0.94 and 0.79 for window lengths 50, 100, 150, 200, 300 and 400 points, respectively. For the largest window, the segment 2,000–2,400 points only included 22 points of signal (probes 2,000–2,021), and 380 points of pure noise. Consequently the AUC for this window was lower than all others. Overall SDMF performed very well for this CNV too, with comparable performance for window lengths 50–300 points. In simulations where CNVs varied either only in length or also in location, the detector performed poorly for small windows and substantially better for larger windows (300–400 points). Finally, in simulations of multiple CNVs, both smaller (50–100 points) and large windows (400 points) resulted in similar AUCs (Table 1). Note that SNR gain following matched filtering depends on several factors. First, it is proportional to the window length, as well as the amplitudes of the signals of interest. For small processing windows and/or low-amplitude signals, SNR changes following filtering may be small. In addition, noise levels in the signals being filtered also affects SNR changes. In these simulations noise levels varied widely, and 175 of 200 sequences with CNVs had noise levels greater than equal to the CNV amplitudes (gain or loss). Although such high noise levels may be substantially higher than those encountered in real genomic data, we have included them in the simulations to demonstrate the robust performance of SDMF even in high noise levels. Nevertheless, in cases of relatively small window lengths, high noise levels and relatively low or high SNR thresholds there may be an increasing number of false positives (noise segments identified as signals) or increased number of false negatives (signals identified as noise), respectively. In these cases, the ROC curves may fall below the diagonal, highlighting the need of larger processing windows for improved detection. Overall, the simulations suggest that a modestly large processing window may be more appropriate that a small window. However, a small CNV may not be easily detectable if the window is very large and thus there should be a trade-off between the two.

Fig. 3
ROC curves for each simulation and each processing window length, from 50 to 400 points (color denotes window length). The scenario for CNV length and location simulated in each case is also schematically shown.
TABLE 1
AUC as a Function of Window Length for Each of the Four Simulations

Fig. 4 shows the ROC curves for simulation 5. Both the overall sensitivity and specificity (averaged over all segments containing CNVs) is shown (top left panel), as well as CNV-specific sensitivity and specificity. Corresponding AUCs are summarized in Table 2.

Fig. 4
ROC curves for each simulation and each processing window length, from 9 to 617 points (color denotes window length).
TABLE 2
AUC as a Function of Window Length, Overall for All CNVs (Column 2) and for Individual CNVs (columns 3–9) Where the Locations of the Seven CNVs were 401–410 (First), 1001–1100 (Second), 1681–1800 (Third), 1901–1950 ...

These results suggest that 1) small CNVs are difficult to detect even when using a small processing window, particularly when the noise level is high, and 2) medium size windows (here ~50–200 points) may be more appropriate for identifying segments potentially including CNVs and may yield similar results, and 3) in addition to window length, the way the window covers a CNV also affects SNR gain and consequently detection. For example, for some CNVs a window length of 94 points was worse than 50 and 123 points, in part due to differential CNV coverage.

AUCs for simulation 6 comparing the original and modified SDMF are summarized in Table 3. As expected, the proposed SDMF substantially outperformed the original SDMF for 67 percent CNV frequency (overall AUC was 0.9 compared to 0.42), demonstrating the shortcomings of the original method in the case of common CNVs with relatively high frequencies in a set of genomic sequences. It also outperformed the original SDMF for all CNVs >1% of the sequence length (six of seven CNVs). Interestingly, for the smallest CNV (0.25 percent of the sequence length) the AUC for the original SDMF was 0.75 compared to 0.42 AUC for the proposed SDMF, for 67 percent CNV frequency. This is the only case where the original method substantially outperformed the proposed SDMF. As previously noted, short CNVs are difficult to detect, particularly with larger processing windows, and SNR gains may be small and may not meet the thresholds. In contrast, sequence mismatch may be easier to detect in these cases. Therefore, if the goal of the detection is to identify primarily very short CNVs, it may be appropriate to apply both versions of SDMF and compare the findings. The performance of the original SDMF improved substantially with decreasing CNV frequency, though noise level in sequences with the CNV also affected the AUCs. Thus, at 33 percent frequency, the overall AUC was higher than at 17 percent frequency. However, given the random selection of sequences that are replaced by pure noise, these differences are probably due to the distribution of noise levels in the respective datasets, as AUCs for the proposed SDMF also differed by ~12 percent. Overall, the performance of the proposed SDMF varied less with decreasing CNV frequency, and since SNR changes are compared at the signal pair level, performance is independent of CNV frequency. For each CNV frequency the random selection of sequence subsets were repeated 5 times. For each CNV frequency the overall AUC varied ~10–15 percent in datasets with the same CNV frequency but different sets of selected sequences. In contrast, for the original SDMF, AUCs varied substantially in repeated simulations for the same CNV frequency by ~10–25 percent.

TABLE 3
AUC for New SDMF (New) and Original SDMF (Old) for Variable CNV Frequency

3.2 Application of SDMF to Real Data

In simulations, we can estimate a reasonable threshold for SNR gain from plots like the one shown in Fig. 2. In real data, with no independent knowledge of noise level, we only have knowledge of the raw signal variance and SNR gain following matched filtering. A similar plot may, however, be used to select a reasonable SNR threshold. We selected two SNR gain thresholds, 5 and 10 dB. As suggested by Fig. 2, a higher SNR threshold may be required for larger windows, since lower thresholds for larger windows raise the issue of specificity. Table 4 summarizes the number of gains and losses in each chromosome in the DGV, the number of CNVs estimated by SDMF (original and proposed) using two processing window lengths, 200 and 400 data points long, and the number of discrete regions estimated by CBS and by the HMM. The reported CNVs estimated by SDMF are only those that have any overlap with those in the DGV. This overlap varied substantially among CNVs, and in some cases was <1% of the CNV length. Furthermore, given that SDMF uses a sequential segmentation approach for detection, probes that met the threshold for gain or loss in different segments, but based on their genomic coordinates were part of the same CNV in the DGV, were reported as a single CNV. The threshold for mismatch in the original SDMF was set at −1.5 dB. CBS detects CNVs in individual sequences using a two-sample t-statistic to compare the mean of a data segment to the mean of the sequence outside that segment. A change in the mean is assumed to be statistically significant if the p-value is less than a threshold of 0.01. Note that each genomic segment corresponding to a particular set of breakpoints estimated by CBS or by the HMM overlapped with at least one CNV in the DGV. This is because both CBS and HMM estimated a relatively small number of discrete genomic regions covering most of the chromosome and thus had limited spatial specificity compared to SDMF. To highlight this difference and limitation of CBS and HMM-based approaches, Table 4 reports the total number of regions, defined by CBS-based breakpoints and HMM-based transition points, that overlap with the DGV. For SDMF, each estimated CNV with any overlap with the DGV is included in the list.

TABLE 4
CNVs in DGV (Partial), i.e., Subset with Lengths Detectable by the 244 Array (Column 2), CNVs Using the Original SDMF (Old) (Columns 3,6) CNVs Using the Proposed SDMF Approach (New) (Column 4,5,7,8) Both for 200- and 400-pointWindows, and Thresholds of ...

Overall, the proposed SDMF performed better that the original SDMF, particularly when a 400-point window was used with a lower SNR gain threshold. The highest numbers of CNVs overlapping with the DGV were consistently estimated with the proposed SDMF, a 400-point window and 5 dB SNR gain threshold for 9 of 10 chromosomes. For chromosome 10, a higher number of CNVs overlapping with the DGV was estimated with the smaller 200-point window. As shown in simulations, SNR gain and consequently CNV detection also depends on how the CNVs are covered by the processing window. Thus for chromosome 10, possibly smaller CNVs were better covered by the smaller window. For the 200-point window, the original SDMF performed better than the proposed SDMF with a 10 dB gain threshold, and worse for a 5 dB threshold, again highlighting the fact that a high threshold may substantially lower the sensitivity of the SDMF algorithm. Note that the mismatch threshold for the original SDMF was kept constant at −1.5 dB, and only the threshold for the proposed SDMF was varied. For the 400-point window, the choice of threshold had a smaller effect on CNV detection, i.e., only slightly higher numbers of CNVs were detected with the a lower threshold. A threshold ≥10 dB may have had a more substantial effect on the sensitivity of SDMF with this window. In regard to the other methods, the HMM-based approach estimated a higher number of discrete regions than CBS, and for chromosomes 4, 5, and 7 this number was comparable, though consistently lower, to the number of CNVs estimated with SDMF.

3.3 Performance

In contrast to other algorithms that process individual sequences, SDMF compares sequence pairs, both in its original and modified forms. Thus, as the dataset size increases the computational time also increases. The length of the processing window also affects computational time. Ten sets of 400 sequences were simulated as before, varying the additive Gaussian noise level as previously described. Each set included an increasing number of CNVs from 1 to 10. All sequences contained the CNVs. The speed of the original and optimized SDMF are expected to be approximately equal, since both involve the same decomposition (first step) and convolution process (involved in matched filtering). Thus, only the optimized SDMF and CBS, because to its widespread use in genomic research, were applied to these data. The window length affects only the performance of SDMF. However, the number of CNVs in each sequence, and thus the non-random structure of the signal could affect the performance of both SDMF and CBS. Central processing unit (CPU) time was used to measure performance. Clock time was also recorded, and the two were approximately the same (±10–20 s from each other). All simulations were ran on a Dell Precision T7500 Four-processor workstation, using a Linux operating system, and the software Matlab. Fig. 5 shows CPU time for SDMF (black curves) and CBS (green curves) as a function of number of CNVs. Each panel corresponds to a different processing window length (20, 50, 100, 200, 400, 500 points).

Fig. 5
CPU time (s) for SDMF (black curves) and CBS (green curves) as a function of number of CNVs for varying window lengths 20–500 points (top left to bottom right). All sequences were 4,000 points long.

For a 20-point window, CBS was faster than SDMF for all numbers of CNVs, although the CBS CPU time rapidly increased with increasing number of CNVs. As the window length increased, the SDMF CPU time decreased. For windows ≥100 points, SDMF was faster than CBS for sequences with ≥3–4 CNVs and substantially faster for sequences with ≥8 CNVs. Genomic sequences typically contain a large number of aberrations and thus have significant spatial structure. Thus, simulations with higher number of CNVs are more realistic. For datasets of 400 sequences containing ≥8 CNVs, SDMF was at least three times faster than CBS for windows ≥100 points. When the number of sequences was also varied from 20 to 400, SDMF was up to 100 times faster than CBS. SDMF was faster for windows 200–400 points than for longer windows. Thus, the relationship between window length and performance may not inversely proportional. These results support the selection of a medium-size window for efficient processing.

4 Discussion

In this study we aimed to modify and improve the SDMF method and propose an approach for selecting its parameters for robust CNV detection. The original SDMF indirectly depends on CNV frequency as it uses averaged SNR changes due to mismatch as the relevant measure. As shown in simulations, its performance may decrease substantially with increasing CNV frequency. To improve the sensitivity of the method independently of CNV frequency, we have proposed a modification that uses non-averaged SNR gain as the relevant measure of sequence similarity, to identify segments of pairwise match that may contain CNVs. Overall, the proposed SDMF consistently outperformed the original method for all CNV frequencies, and is thus a more flexible approach for robust CNV detection. Our simulations have also shown that it performs well even in sequences contaminated by very high noise levels. The only case where the original SDMF may perform better than the proposed method is for very short CNVs and processing windows, where small SNR gains may be difficult to detect. Finally, when detection of pathological CNVs is of interest, the proposed SDMF is more appropriate for identifying subsets of disease-related CNVs that are common in samples from patients sharing the pathology, since it compares sequences for common genomic similarity (match) rather than relatively rare mismatch in normal samples.

The modified and original SDMFs, as well as CBS and an HMM-based approach were also applied to normal samples from the TCGA. To compare their performance we assessed the overlap of detected CNVs with a subset of normal CNVs reported in DGV. Overall, the proposed SDMF consistently outperformed the original SDMF, CBS and HMM, particularly with a larger processing window and lower SNR thresholds. CBS detected relatively long segments containing CNVs but lacked the spatial specificity of SDMF, which in part depends on the choice of the processing window. The HMM approach detected a higher number of discrete segments than CBS, but consistently lower than SDMF. Based on separate simulations for SDMF parameter selection and the assessment of the relative computational cost of SDMF and CBS, and the results of the real data analysis, a medium-size processing window relative to CNV length may be optimal. Although CNV lengths are not a priori known, an adequate window may be chosen based on approximate estimates of the data structure. In fact, another advantage of SDMF is that its parameters are not hardwired and can be easily changed according to the research question or dataset of interest.

Although developed and optimized for aCGH data, SDMF may be adapted and extended to other high-dimensional genomic data, such as NGS. By design, it directly compares pairs of genomes, e.g., a reference and a target genome sequenced using NGS, and may thus be used to rapidly identify and separate regions that may contain DNA aberrations, which may be further analyzed with smaller processing windows, and regions of no further interest, which may be selectively compressed for efficient processing. Thus, SDMF may become a valuable tool both for efficient preprocessing and CNV detection in very high-dimensional data.

Acknowledgments

Funding: This work was supported by NIH Grant R03 CA121884.

Footnotes

Conflict of interest: None declared.

Software and Data

The results published here are based on data generaged by The Cancer Genome Atlas established by the NCI and NHGRI [1]. Information from the Database of Genomic Variants [2] has also been used in this study. Software associated with various simulations and the SDMF method can be found at http://www.hsph.harvard.edu/betensky/.

Personal use is permitted, but republication/redistribution requires IEEE permission.

Contributor Information

Catherine Stamoulis, Department of Radiology, Harvard Medical School and Boston Children’s Hospital, Boston, MA 02115.

Rebecca A. Betensky, Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115.

References

1. The results published here are based on data generated by The Cancer Genome Atlas established by the NCI and NHGRI.
2. Information from the Database of Genomic Variants has been used, http://cancergenome.nih.gov/
3. Alter O, Brown PO, Bostein D. Singular value decomposition for genome-wide expression data processing. Proc Nat Acad Sci USA. 2000;97:101016. http://dgv.tcag.ca/dgv/app/home. [PubMed]
4. Barros A, Delaney AD, Li HI, Nayar T, Flibotte S, Qian H, Chan SY, Asano J, Ally A, Cao M, Birch P, Brown-John M, Fernandes N, Go A, Kennedy G, Langlois S, Eydoux P, Friedman JM, Marra MA. Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data. BMC Bioinformat. 2007;8:368. [PMC free article] [PubMed]
5. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J Roy Stat Soc B. 1995;57:289–300.
6. Ben-Yacoov E, Eldar YC. A fast and flexible method for the segmentation of aCGH data. Bioinformatics. 2015;24(16):i39–i45. [PubMed]
7. Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S, Du J, Kau T, Thomas RK, Shah K, Soto H, Perner S, Prensner J, Debiasi RM, Demichelis F, Hatton C, Rubin MA, Garraway LA, Nelson SF, Liau L, Mischel PS, Cloughesy TF, Meyerson M, Golub TA, Lander ES, Mellinghoff IK, Sellers WR. Assessing the significance of chromosomal aberrations in cancer. Proc Nat Acad Sci USA. 2007;104(50):20007–20012. [PubMed]
8. de Vrie BB, Pfundt R, Leisink M, Koolen DA, Vissers LE, Janssen IM, Reijmersdal SV, Nillesen WM, Huys EH, Leeuw ND, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CM, van Kessel AG, Schoenmakers EF, Brunner HG, Veltman JA. Diagnostic genomic profiling in mental retardation. Amer J Human Genet. 2005;77(4):606–616. [PubMed]
9. Duan J, Zhang JG, Deng HW, Wang YP. Comparative studies of copy number variation detection methods for next-generation sequencing technologies. PLos One. 2013;8(3):e59128. [PMC free article] [PubMed]
10. Fridlyand J, Snidjers AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivariance Anal. 2004;90(1):132–153.
11. Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Amer Statist Assoc. 2008;103(482):485–497. [PMC free article] [PubMed]
12. Huang NE, Shen Z, Long SR, Wu MC, Shih WH, Zheng Q, Yen NC, Tung CC, Liu HH. Empirical mode decomposition and hilbert spectrum for non-linear, non-stationary time series analysis. Proc Roy Soc London A. 1998;454:903–995.
13. Hupé P, Stransky N, Thiery JP, Radvanyi F, Barillot E. Analysis of array CGH data: From signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20(8):3413–3422. [PubMed]
14. Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics. 2005;6(2):211–226. [PubMed]
15. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Detection of large-scale variation in the human genome. Nat Genet. 2004;39:949–951. [PubMed]
16. Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 2004;258(5083):818– 821. [PubMed]
17. Lupski JR. Genomic rearrangements and sporadic disease. Nat Genet. 2007;39:S43–47. [PubMed]
18. McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008;40:1166–1174. [PubMed]
19. Olshen AB, et al. Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics. 2004;5(4):557–572. [PubMed]
20. Picard F, et al. A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005;6:27. [PMC free article] [PubMed]
21. Pique-Regi R, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bionformatics. 2008;24(3):309–318. [PMC free article] [PubMed]
22. Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;44:444–454. [PMC free article] [PubMed]
23. Sebat J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305(5683):525–528. [PubMed]
24. Sebat J, et al. Strong association of de novo copy number mutations in autism. Science. 2007;316(5823):445–449. [PMC free article] [PubMed]
25. Snijders, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet. 2001;29:263–264. [PubMed]
26. Stamoulis C, Betensky RA. A novel signal processing approach for the detection of copy-number variations in the human genome. Bioinformatics. 2011;27(17):2338–2345. [PMC free article] [PubMed]
27. Turin GL. An introduction to matched filters. IRE Trans Inf Theory. 1960;6:311–329.
28. Wineinger NE, et al. Statistical issues in the analysis of DNA copy number variations. J Comput Biol Drug Des. 2008;1(4):368–395. [PMC free article] [PubMed]
29. Walsh T, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320(5875):539–543. [PubMed]