Structural variation (SV), including copy-number variation (CNV), is a major form of genetic variation in mammals (1–4). CNVs have been shown to affect gene expression in human cell lines (5) as well as in different tissues of rodents (6–8), and to play an important role in the etiology of schizophrenia (9), autism (11) and non-psychiatric diseases (13–15). In functional genomic studies, failing to account for copy-number differences can lead to errors in ribonucleic acid sequencing, chromatin immunoprecipitation sequencing, DNase hypersensitive site mapping sequencing and formaldehyde-assisted isolation of regulatory elements sequencing analyses (16). Thus, accurate detection of CNVs in a single genome is important.
Microarray technologies were the main platform for initial work in CNV characterization (18) and remain a cost-efficient choice (20). High-throughput sequencing (HTS) technologies (21–23) promise a complete catalog of SVs and could replace microarrays as a discovery platform (20). While microarray-based CNV detection analyzes probe hybridization intensities, HTS-based CNV detection uses conceptually distinct approaches: read-pair, split-read and read-depth (RD) analyses, which vary in their sensitivity and specificity depending on the sizes and classes of SVs (1). Converging evidence suggests that multiple approaches should be considered together to maximize CNV detection from HTS data. For example, the 1000 Genomes Project (1000GP) used 19 algorithms to independently identify CNVs in 185 human genomes and pooled the results according to the specificity of each algorithm (1). Recent methods [SPANNER (1), CNVer (24) and Genome STRiP (25)] integrate read-pair and RD information in the detection process in different ways.
The RD approach looks for higher or lower than expected sequencing coverage in a genomic region to infer gain or loss of DNA. RD has been computed in a variety of ways, including counting the number of fragments (25) or reads (26–29) mapped to a particular genomic region and calculating the sum of per-base coverage within a region (30). Existing CNV detection methods assume that RD follows a Poisson distribution (or a normal distribution as the large-sample approximation of the Poisson model) for a diploid genome and search for regions that diverge from this distribution. However, in practice, neither sampling nor mapping of the reads is uniform because of experimental biases. GC content can lead to certain genomic regions being over- or under-sampled (22). Repetitive DNA elements are abundant in mammalian genomes (32); consequently, the number of reads unambiguously mapped to a region can be very different from the number of reads sequenced from the region. Additional sources of bias, which are more difficult to trace (e.g. noise arising from sequencing, sequencing errors), create further variability in RD coverage. Violation of the assumed Poisson distribution entails a loss of sensitivity and specificity when detecting CNVs using RD. In studies of cancer, matched pairs of tumor- and normal-tissue samples may be used to correct biases by computing RD ratios (33–36); however, matched control samples are generally not available.
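To make the RD computation and the Poisson diagnostic concrete, the following minimal Python sketch counts mapped fragments in non-overlapping windows and computes the index of dispersion (variance-to-mean ratio). Under the Poisson assumption this index is approximately 1, so values well above 1 reflect the over-dispersion described here. The function names (`window_counts`, `dispersion_index`) and the non-overlapping-window simplification are illustrative, not part of any published method.

```python
def window_counts(read_starts, window_size, genome_length):
    """Count mapped fragments per non-overlapping window (illustrative helper)."""
    n_windows = genome_length // window_size
    counts = [0] * n_windows
    for pos in read_starts:
        w = pos // window_size
        if w < n_windows:  # ignore fragments mapped past the last full window
            counts[w] += 1
    return counts

def dispersion_index(counts):
    """Variance-to-mean ratio; ~1 under a Poisson model, >1 when over-dispersed."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / n
    return var / mean if mean > 0 else float("nan")
```

In practice, an index of dispersion substantially above 1 across windows of a diploid genome is a quick indication that a Poisson-based caller will be miscalibrated.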
Bias correction has not been adequately addressed in the literature. Some existing methods (26) adopt a two-step approach in which RD is first smoothed for GC-content differences using linear regression, and the GC-adjusted RD is then segmented. Other methods (25) account for the mapping bias of a candidate region using its effective length (e.g. the number of confidently mapped bases); however, this approach does not account for the dependence between consecutive regions or for additional sources of noise in the data. While various kinds of adjusted RD have been used as input data, nearly all methods (25) rely on the Poisson or normal-distribution assumption without subsequently evaluating its adequacy.
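The two-step GC-correction scheme can be sketched as follows. This is an illustrative simplification, assuming ordinary least squares of window counts on GC fraction followed by rescaling each window toward the genome-wide mean; it is not the exact procedure of the cited methods.

```python
def gc_adjust(counts, gc):
    """Two-step sketch: regress RD on GC fraction, then rescale each window
    by its fitted value (hypothetical simplification of published approaches)."""
    n = len(counts)
    mean_gc = sum(gc) / n
    mean_rd = sum(counts) / n
    # Ordinary least-squares slope and intercept for counts ~ gc.
    sxy = sum((g - mean_gc) * (c - mean_rd) for g, c in zip(gc, counts))
    sxx = sum((g - mean_gc) ** 2 for g in gc)
    slope = sxy / sxx
    intercept = mean_rd - slope * mean_gc
    fitted = [intercept + slope * g for g in gc]
    # Rescale so every window shares the genome-wide mean expectation.
    return [c * mean_rd / f if f > 0 else c for c, f in zip(counts, fitted)]
```

The limitation noted above is visible in the sketch: each window is adjusted independently, so dependence between consecutive windows and residual noise are left untouched.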
In this study, we first show that evaluating the distributional assumption for RD is important, as it may not hold true. Second, we show that it is important to jointly estimate copy number and the effects of confounding factors. Third, we develop a novel statistical method to accurately model RD and detect CNVs from HTS data. We measure RD by the number of sequence fragments mapped in sliding windows tiled along the genome, and we model the fragment counts with negative binomial (NB) distributions, which allow for over-dispersion and account for the effects of confounders. Furthermore, we account for the dependence of fragment counts in adjacent windows using a hidden Markov model (HMM). Known confounding factors are treated as covariates and corrected explicitly, while unknown experimental biases are accommodated by the over-dispersion parameter of the NB distribution and by an additional noise component via a mixture model. Fourth, we calibrate our method using simulation and whole-genome sequencing data from the 1000GP, and we compare our method with CNVnator (26), the best-performing RD-based CNV detection algorithm in the literature (1). Finally, to demonstrate the utility and robustness of our method, we apply it to both human and mouse HTS data sets.
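The modeling idea of NB emissions within an HMM can be illustrated with a toy example; this is a sketch, not the authors' implementation. The NB log-density is parameterized by a mean and a dispersion "size" (variance = mean + mean²/size, recovering Poisson as size grows), and a small Viterbi decoder assigns each window a copy-number state whose emission mean scales with copy number. The state space, transition probabilities and all parameter values below are hypothetical, and covariate correction is omitted.

```python
import math

def nb_logpmf(k, mean, size):
    """Negative binomial log-pmf with variance = mean + mean**2 / size."""
    p = size / (size + mean)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1 - p))

def viterbi_copy_number(counts, haploid_mean, size,
                        states=(0, 1, 2, 3, 4), stay=0.99):
    """Toy Viterbi over copy-number states with NB emissions (illustrative)."""
    n_states = len(states)
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (n_states - 1))
    # Emission mean grows with copy number; small floor for copy 0 (mismaps).
    means = [max(c * haploid_mean, 0.1) for c in states]
    prev = [nb_logpmf(counts[0], m, size) for m in means]
    back = []
    for k in counts[1:]:
        cur, ptr = [], []
        for j, m in enumerate(means):
            best_i = max(range(n_states),
                         key=lambda i: prev[i] + (log_stay if i == j else log_switch))
            ptr.append(best_i)
            cur.append(prev[best_i]
                       + (log_stay if best_i == j else log_switch)
                       + nb_logpmf(k, m, size))
        prev, back = cur, back + [ptr]
    # Backtrack from the most likely final state.
    state = max(range(n_states), key=lambda j: prev[j])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return [states[s] for s in reversed(path)]
```

With a diploid mean of 100 fragments per window (haploid mean 50), a run of windows with counts near 50 decodes as a heterozygous deletion (copy number 1) while flanking windows remain at copy number 2, despite over-dispersion in the counts.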
In summary, our method outperforms existing RD-based CNV detection algorithms and distinguishes homozygous and heterozygous deletions and high-copy duplications. Our method complements the current literature, and the concept of simultaneous bias correction and CNV detection can serve as a basis for combining RD with read-pair or split-read in a single analysis. A user-friendly and computationally efficient implementation of our complete analytic protocol is freely available at https://sourceforge.net/projects/genseng/