Large biases in fragment counts related to the GC composition of regions were found in the data sets we examined. These observed effects have a recurring unimodal shape, but varied considerably between different samples.
We have shown that this GC effect is mostly driven by the GC composition of the full fragment. Conditioning on the GC of the fragments captures the strongest bias, and removing this effect provides the best correction, compared with alternative GC windows. When single base pair predictions based on the fragment composition are aggregated, the results trace the observed GC dependence. This cannot be said about local effects that take only the reads into account. This conclusion holds for various data sets, with different fragment length composition, read lengths and GC effect shapes.
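The aggregation step can be illustrated with a minimal sketch (the function and array names are ours, not from any published implementation; `rates` stands for an estimated mean fragment count per fragment-GC value): per-position predictions are summed into bins and can then be compared with observed bin counts.

```python
import numpy as np

def predicted_bin_counts(gc_per_pos, rates, bin_size):
    """Aggregate single base pair predictions into bins.

    gc_per_pos: fragment GC value at each start position
    rates:      estimated mean fragment count per GC value
    """
    per_pos = rates[gc_per_pos]            # predicted rate at each position
    n_bins = len(per_pos) // bin_size
    trimmed = per_pos[:n_bins * bin_size]  # drop the ragged tail
    return trimmed.reshape(n_bins, bin_size).sum(axis=1)
```

If the per-position rates are well estimated, these binned predictions should trace the observed GC dependence at any bin size, which is the comparison made in the text.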
That the GC curve is unimodal is key to this analysis. In all data sets shown, the rate of GC-poor or GC-rich fragments is significantly lower than average, in many cases zero. Unimodality was overlooked by Dohm et al., probably because GC-rich regions are rare (especially in simpler organisms). Even in humans, this effect is hard to spot if counts are binned by GC quantiles instead of GC values. Nevertheless, it is this departure from linearity that allowed us to pinpoint an optimal scale: the fragment size. In this sense, unimodality gives us important clues as to the causes of the GC bias.
While we have described other sequence-related biases, we believe they do not drive the strong coverage GC biases. These include increased coverage when the fragment ends are AT rich, and location-specific fragmentation biases near the fragment ends. We have shown that the end effects, as measured at the 5′ end, are far weaker than the effect of the full fragment, and they are surprisingly negligible in the context of larger bins. Still, they might locally mitigate the fragment GC effect: the dependence of the GC curve on fragment length seems to be associated with these biases.
Our conclusions seem to complement those of Aird et al. If PCR is the major source of the GC bias, we would expect the GC of the full fragment, rather than the GC of one or both reads, to be associated with the bias. We have shown that this is indeed the case. Moreover, data sets generated with a PCR-free protocol (10) and with an optimized PCR protocol (12) both display a reduced GC bias (Supplementary Figures S4 and S5). It should be noted that even these optimized PCR protocols can still display significant biases and may require GC correction.
Our refined description of the GC effect is of practical value for GC correction. First, the non-linearity of the GC effect is a warning sign for two-sample correction methods. In the main example we study, the normal and tumor samples do not have the same GC curves, and we have seen this in additional data sets as well. Using normal counts to correct tumor counts can therefore produce GC-related artifacts, which might lead to faulty segmentations. The GC effects of both samples should be carefully examined before such corrections are made.
A single-sample correction for GC requires a model, and we demonstrate the importance of choosing the best one. Overlapping windows smaller than the fragment fail to remove the bulk of the GC effect. Similarly, using read coverage rather than fragment count hurts the correction. Instead, measuring fragment rates at single base pair positions decouples the GC modeling from the downstream analysis. This removes the lower threshold on the scale of analysis, providing single base pair estimates that can later be smoothed by the researcher as needed (or binned into uneven bins if needed). An important benefit of DNA-seq over previous technologies is that simply repeating the experiment can increase the resolution of the analysis. Our model ensures that this increased resolution does not hurt the GC correction.
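The core of the single base pair model can be sketched as follows (a hypothetical illustration under our own naming, not the published implementation; in practice rates are estimated over mappable positions genome-wide, not over one toy sequence):

```python
import numpy as np

def fragment_gc(seq, frag_len):
    """GC count of the hypothetical fragment starting at each position."""
    is_gc = np.isin(list(seq), ["G", "C"]).astype(int)
    # a cumulative sum gives the GC of each length-frag_len window in O(1)
    cs = np.concatenate([[0], np.cumsum(is_gc)])
    n = len(seq) - frag_len + 1
    return cs[frag_len:frag_len + n] - cs[:n]

def gc_rates(starts_count, gc_per_pos, frag_len):
    """Mean fragment count per position, stratified by fragment GC."""
    rates = np.zeros(frag_len + 1)
    for g in range(frag_len + 1):
        mask = gc_per_pos == g
        if mask.any():
            rates[g] = starts_count[mask].mean()
    return rates
```

A per-position correction then divides each observed count by the rate of its GC stratum; because the model lives at single base pair resolution, any downstream binning or smoothing is left to the analyst.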
Unlike other bias correction methods, such as BEADS (14), we generate weights (predicted fragment rates) for genomic locations rather than for the observed reads. Mappable genomic positions are stratified according to the GC of a hypothetical fragment, and rates per GC stratum are estimated by counting the fragments at those same positions. Estimating predicted rates for both covered and uncovered locations can help detect deletions, and these predicted rates form a natural input for downstream analysis using heterogeneous Poisson models. Another important novelty is the use of TV scores to determine the representative 'fragment length' of each data set, the one that best fits the distribution of fragment lengths and properly discards the fragment end biases. This procedure can be critical when length information is unavailable (i.e. for single-end reads). A more detailed comparison with BEADS is found in Supplementary Figures S9 and S10.
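The location-based weighting can be sketched like this (an illustrative sketch; the names and the boolean mappability mask are our own simplifications): every mappable position receives the predicted rate of its fragment-GC stratum, whether or not any fragment was observed there.

```python
import numpy as np

def predicted_rates(gc_per_pos, counts, mappable):
    """Assign each mappable position the mean count of its GC stratum.

    gc_per_pos: fragment GC value per position
    counts:     observed fragment count per position
    mappable:   boolean mask of positions used for estimation
    """
    max_gc = gc_per_pos.max()
    rates = np.full(max_gc + 1, np.nan)  # NaN marks empty strata
    for g in range(max_gc + 1):
        stratum = mappable & (gc_per_pos == g)
        if stratum.any():
            rates[g] = counts[stratum].mean()
    # unmappable positions get weight 0; covered and uncovered mappable
    # positions are treated identically, which is what exposes deletions
    return np.where(mappable, rates[gc_per_pos], 0.0)
```

Positions with a high predicted rate but no observed fragments are then candidate deletions, and the rate vector plugs directly into a heterogeneous Poisson model.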
In this work, we estimated DNA abundance from non-tumor genomes, implicitly assuming that the abundance of DNA along the genome is uniform. CN variation does occur in non-tumor sequences; however, such jumps are rare, and by random sampling we expect to average over any large CN changes. Using small windows should also reduce the dependence between GC and specific positions in the genome. In our experience, estimating GC curves from small windows turned out to be surprisingly robust to CN changes in tumor data (as shown above). Extending this method to other applications or protocols would require identifying regions in which the signal of interest is not expected to vary, and perhaps co-estimation of the abundance and the GC effect. That said, for CN purposes there is enough data to obtain stable estimates of the GC effect.
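The random-sampling idea might look like the following sketch (illustrative only; the generator seed, draw count and array names are our assumptions): many positions are drawn at random, and counts are averaged per GC value, so an occasional CN change affects only a small share of the draws in each stratum.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gc_curve(gc_per_pos, counts, n_draws=10000):
    """Estimate the GC curve from randomly sampled positions."""
    idx = rng.integers(0, len(counts), size=n_draws)
    g, c = gc_per_pos[idx], counts[idx]
    max_gc = gc_per_pos.max()
    curve = np.zeros(max_gc + 1)
    for v in range(max_gc + 1):
        hit = g == v
        if hit.any():
            curve[v] = c[hit].mean()   # mean count among sampled positions
    return curve
```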
Our prediction accounts for a large portion of the variation, but residual variation remains. Additional inhomogeneities in fragment rates include unexplained hot spots and zero counts, as well as milder low- and high-frequency variation in the counts. The first two categories may be due to errors in the annotation of the genome or to amplification artifacts. The latter points to the existence of additional factors that affect fragment rates, which is to be expected. We have discussed additional sequence-related biases, including fragmentation and AT preference. The tools developed here, primarily the total variation scores, allow analysts to investigate these effects further as needed. Nevertheless, by and large, our model successfully describes the bulk of the low-frequency variability, which confounds segmentation into CN regions.
One effect that we have not deeply explored is the relation between sequencing error probability and the GC effect. In the Supplementary Data, we show evidence that the global GC of the fragment can affect the sequencing error probability. Especially for longer reads, changing the parameterization of the mapping process can sometimes produce different mappability patterns related to GC composition. There have been reports (11) that specific sequences in reads are more prone to errors, for example the GGC sequence. A better model for reads that are harder to sequence would allow better estimation of the fragment GC effect in GC-rich regions, and would improve the accuracy of the corrections. Jointly correcting by the GC of the read as well as the GC of the fragment may be a useful approximation for this effect.
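Such a joint correction could be approximated by a two-dimensional stratification (a sketch under our own naming; the text suggests the idea but does not prescribe this implementation): rates are indexed by the pair (fragment GC, read GC), so read-composition effects on sequencing error sit on top of the fragment effect.

```python
import numpy as np

def joint_rates(frag_gc, read_gc, counts):
    """Mean count per (fragment GC, read GC) stratum; NaN if unobserved."""
    n_f, n_r = frag_gc.max() + 1, read_gc.max() + 1
    total = np.zeros((n_f, n_r))
    npos = np.zeros((n_f, n_r))
    np.add.at(total, (frag_gc, read_gc), counts)  # sum counts per stratum
    np.add.at(npos, (frag_gc, read_gc), 1)        # positions per stratum
    with np.errstate(invalid="ignore"):           # 0/0 -> NaN for empty cells
        return total / npos
```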
Our analysis focused only on DNA-seq data from human subjects, but the results of this work can be extended. GC content biases have been observed in additional experimental protocols using high-throughput sequencing. [See Supplementary Figures S7 and S8, and Ref. (14) for similar correction approaches in ChIP-seq data.] Some of these protocols focus on highly localized signals on the genome, and could also benefit from strand-specific and uneven-bin normalization. Moreover, when the length of the fragments is constrained (exon sequencing, RNA-seq), a model taking both GC and fragment length into account may prove important. Fitting the model for each application is a challenge; still, we believe that all these applications can benefit from our refined GC model.