With simulated data, both the sensitivity and specificity attained by our method were exceptionally high, although it should be emphasized that other methods have generated similarly impressive results in similar benchmarks but show, in particular, lower sensitivity with real data (26). This is unsurprising as the effects of repetitive sequences and inherent biases in sequence coverage tend to be minimized in simulations. However, for the study of heterozygous events, simulation for now provides the only realistic possibility, due to the lack of large-scale validated heterozygosity catalogs associated with individual genomes. SVM2 showed relatively poor accuracy in the detection and classification of very short heterozygous SVs. All mapping-distance-based methods are expected to suffer from this limitation as distance perturbations are diluted at heterozygous loci. In addition, our current approach uses measures of coverage, and in the case of heterozygous deletions, a reduction rather than an absence of reads in the deleted region would be expected. Conversely, reduced perturbations of BP mapping patterns are expected upstream of heterozygous insertions. These limitations might be partially addressed by some of the potential developments in the strategy that are envisaged (see later). However, in simulation at least, we note a satisfactory performance by SVM2 in the identification and classification of larger heterozygous events.
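The dilution of distance perturbations at heterozygous loci can be illustrated with a deliberately simplified model. The sketch below is not the SVM2 procedure: it assumes Gaussian insert sizes with a known standard deviation, treats the event as a simple shift in mapped distance, and ignores the extra variance introduced by mixing two alleles. The function name and all parameter values are hypothetical.

```python
import math

def expected_shift_z(event_len, n_pairs, sd_insert, allele_fraction):
    """Expected z-score of the mean mapped-distance shift at a locus.

    Assumptions (illustrative only, not taken from the source):
    - read pairs from the variant allele map `event_len` bases further
      apart (or closer) than pairs from the reference allele;
    - `allele_fraction` of the spanning pairs come from the variant allele
      (1.0 for a homozygous event, about 0.5 for a heterozygous one);
    - insert sizes are Gaussian with standard deviation `sd_insert`.
    """
    mean_shift = allele_fraction * event_len       # signal is diluted by the reference allele
    std_err = sd_insert / math.sqrt(n_pairs)       # noise of the mean under the null
    return mean_shift / std_err

# Hypothetical library: 30 spanning pairs, insert-size s.d. of 20 bp.
for size in (5, 20, 100):
    hom = expected_shift_z(size, n_pairs=30, sd_insert=20, allele_fraction=1.0)
    het = expected_shift_z(size, n_pairs=30, sd_insert=20, allele_fraction=0.5)
    print(f"{size:>4} bp event: z(hom) = {hom:.2f}, z(het) = {het:.2f}")
```

Under these assumptions the expected signal for a heterozygous event is exactly half that of a homozygous event of the same length, so the shortest events are the first to fall below any fixed detection threshold, whereas larger heterozygous events remain well separated from the null.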
In this work, SVM2 was trained to recognize hypervariable regions as distinct from SV events. In practice, few predictions of this type were made. Indeed, an examination of these predictions suggested that their specificity in the detection of SVs was similar to that of the other categories of prediction, although all validated predictions in this category corresponded to events of four bases or fewer. This is likely a function of the read mapping strategy used. Allowing up to 2 mismatches in 35 base reads tends to allow correct mapping of the majority of reads in intra-specific comparisons, and in any case, perturbations of read mapping caused by hypervariable genomic regions are expected to be extremely subtle.
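To give a rough sense of why a threshold of 2 mismatches in 35 bases retains most reads, the following sketch computes the binomial probability that a read carries no more than the permitted number of mismatches. It assumes independent mismatches at a constant per-base rate (sequencing error plus true divergence); this model and the example rates are illustrative assumptions, not values from the study.

```python
import math

def fraction_within_mismatch_limit(read_len, max_mismatches, per_base_rate):
    """P(X <= max_mismatches) for X ~ Binomial(read_len, per_base_rate).

    `per_base_rate` is a hypothetical combined rate of sequencing errors
    and true sequence differences, assumed independent across positions.
    """
    return sum(
        math.comb(read_len, k)
        * per_base_rate ** k
        * (1.0 - per_base_rate) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )

# Illustrative per-base rates: typical, elevated, and strongly divergent.
for rate in (0.01, 0.05, 0.15):
    frac = fraction_within_mismatch_limit(35, 2, rate)
    print(f"per-base rate {rate:.2f}: {frac:.3f} of reads within 2 mismatches")
```

In this simplified model the retained fraction stays high at low per-base rates and drops only when local divergence becomes large, which is consistent with hypervariable regions leaving little trace in the read map unless they are very dense in differences.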
The Bentley et al./Kidd et al. data represent one of the few cases where extensive Sanger resequencing and SV calling have been performed on an individual for which PE NGS data are also available, providing an ‘independent’ validation set. For this reason, the data set has been widely used in other studies (21) and allows immediate comparisons between methods. These considerations notwithstanding, the data set has several relevant limitations that complicate interpretation of results and merit discussion. First, the coverage by Sanger sequencing is rather limited (theoretical coverage 0.3X), suggesting that, even under the optimistic assumption that all reads were mapped correctly and uniquely, fewer than a third of the SV events between this individual and the hg18 reference could be detected. Second, the low coverage implies that the accurate annotation of heterozygous events is likely to be, at best, extremely limited. Finally, the original study of Kidd et al. only attempted to identify events of less than 100 bp in length, and although a second evaluation of these data (39) was more comprehensive, the detection of large insertions is limited by the properties of split-mapping methods. It has been suggested that the majority of intra-specific SVs are small (32), and although this generalization is almost certainly correct, our knowledge of the frequency of medium to large events remains rather limited. Our method made few predictions of insertions larger than the insert size of the library. However, this is an inherently difficult category of events to detect by any current approach and, with the available data, it is difficult to perform statistical analysis of sensitivity and specificity of tools with respect to detection of such events.
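The bound implied by 0.3X coverage can be made concrete with a short calculation. The sketch below assumes that reads are placed uniformly at random, so that the number of reads covering a site is approximately Poisson distributed with mean equal to the coverage; this is a standard simplification and is not stated in the source.

```python
import math

def fraction_covered(coverage, min_reads=1):
    """Poisson probability that a site is covered by at least `min_reads` reads.

    Assumes uniformly random read placement (an assumption of this sketch).
    """
    p_fewer = sum(
        math.exp(-coverage) * coverage ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer

at_least_one = fraction_covered(0.3, min_reads=1)   # about 0.26
at_least_two = fraction_covered(0.3, min_reads=2)   # about 0.037
print(f"sites with >= 1 read:  {at_least_one:.3f}")
print(f"sites with >= 2 reads: {at_least_two:.3f}")
```

Under this model roughly a quarter of sites are covered by any Sanger read, and fewer than 4% are covered by the two or more reads that would be the minimum needed to observe both alleles, in line with the two limitations described above.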
Taken together, these observations render the objective assessment of the overall specificity of methods, with respect to both homozygous and heterozygous SV, extremely difficult. Additionally, the probability that a proportion of the Kidd et al. and Mills et al. predictions are heterozygous complicates estimates of sensitivity with respect to homozygous events. In this context, we believe that although limited in precision, apparent sensitivity and specificity are the best available metrics for comparison of the performance of different methods. By all metrics and validation sets used, SVM2 outperformed BreakDancer in terms of sensitivity over a range of SV event sizes, attaining at least the same apparent specificity. This is perhaps not surprising given that additional mapping information, not used by BreakDancer, is used by SVM2. Perhaps more relevant is the observation that SVM2 identified a large number of small SVs that were not detected by a contemporary split-mapping method.
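For readers who wish to reproduce comparisons of this kind, the sketch below shows one possible way of computing apparent sensitivity and apparent specificity from interval overlaps. The definitions used here are assumptions of the sketch rather than the exact procedure of this study: apparent sensitivity is taken as the fraction of validated events overlapped by at least one prediction, and apparent specificity as the fraction of predictions overlapping at least one validated event. The tuple layout, the `slack` parameter and the example coordinates are hypothetical.

```python
def overlaps(a, b, slack=0):
    """True if intervals a and b, given as (chrom, start, end), lie on the
    same chromosome and overlap or fall within `slack` bases of each other."""
    return a[0] == b[0] and a[1] <= b[2] + slack and b[1] <= a[2] + slack

def apparent_metrics(predictions, validated, slack=0):
    """Return (apparent_sensitivity, apparent_specificity).

    Both values are 'apparent' because the validated set is incomplete:
    an unmatched prediction is not necessarily a false positive.
    """
    hit_validated = sum(
        any(overlaps(v, p, slack) for p in predictions) for v in validated
    )
    hit_predictions = sum(
        any(overlaps(p, v, slack) for v in validated) for p in predictions
    )
    sensitivity = hit_validated / len(validated) if validated else float("nan")
    specificity = hit_predictions / len(predictions) if predictions else float("nan")
    return sensitivity, specificity

# Hypothetical example: three predictions, two validated events.
predictions = [("chr1", 100, 110), ("chr1", 500, 520), ("chr2", 50, 60)]
validated = [("chr1", 105, 108), ("chr2", 300, 310)]
sens, spec = apparent_metrics(predictions, validated)
print(f"apparent sensitivity = {sens:.2f}, apparent specificity = {spec:.2f}")
```

Because the validated set covers only part of the genome, both figures should be read as lower bounds on agreement rather than as absolute error rates.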
One alternative to the use of individual genome Sanger resequencing as a biological validation set would be to estimate specificity by comparing genome-wide predictions to collections of validated population-level SVs [dbSNP (40), 1000 genomes project (2)], making the assumption that coincidence of a prediction with an annotated SV implies the presence of the same SV in the donor genome. However, a recent study demonstrated a relatively low overlap between the two aforementioned databases, implying that a significant fraction of human SVs remain undetected (39). It is also worth noting that the 1000 genomes set of SV events was generated from NGS data. Given that our objective was to explore the potential of this very type of data to uncover additional, previously undetected events, we consider the use of ‘independent’ data from the individual genome under study as our principal validation set to be a justified strategy. Nevertheless, comparisons of apparent specificity of different methods when ‘validated’ by Sanger or NGS-based data sets showed interesting patterns, particularly with respect to the genomic context of indel events.
The ‘elephant in the room’ of all methods to determine locations of SV from resequencing data, be they based on split mapping or on statistical approaches, is the abundance of repeated sequences in complex genomes. Sequence reads (from any technology) that fall within perfectly repeated regions cannot be unambiguously mapped. PE approaches (dependent on library insert size and repeat length) can ameliorate this problem to some extent, as can probabilistic mapping strategies (28), but the fundamental problem remains. For example, SVs within recent segmental duplications present an almost insurmountable problem for all approaches apart from read-depth methods, and even these will not be able to specify the location of the event. For now, the most promising way to address the problem of repeats may be the maximization of read length and the use of different insert-size libraries. The use of larger insert-size libraries will aid the detection of larger SV events by insert-size-based methods (and contribute to an additional loss of accuracy in the identification of small indels by such methods). Conversely, as the production of longer resequencing reads using NGS technologies becomes more commonplace, the sensitivity of split-mapping methods is expected to increase for small to medium size events, and the impact of repetitive sequences on the performance of all methods is expected to decrease. Despite these problems, we note that our analyses of genomic context of predictions and validated predictions suggest that in simple repeats and low-complexity regions, SVM2 attained higher sensitivity than other methods tested, even for small SV events. The observations that a large number of small SV events detected by Sanger resequencing, but not by PinDel (or 1000 genomes), fall in simple repeat and low-complexity regions, and that a larger proportion of validated SVM2 than PinDel predictions fall in such regions, are interesting. In this light, the similarity of overall ‘specificity’ between methods when evaluated with the Kidd et al. data or with dbSNP, and the differences in this metric with respect to the 1000 genomes database, are intriguing, particularly given the types of data used to construct these catalogs. Simple repeat/low-complexity regions represent a notable proportion of the ‘inaccessible’ genome described by the 1000 genomes consortium (2). We suggest that our method, or others based on similar principles, might be of particular use in addressing SV in such regions. Indeed, it is interesting to note that Breakpointer (29), a recently proposed method that incorporates information from read depth, mismatch profiles and split mapping to identify genomic rearrangements, also showed an increased sensitivity to SV in repetitive regions with respect to PinDel. However, Breakpointer, unlike PinDel or SVM2, is apparently not capable of identifying the very smallest (<3 bp) SV events, again emphasizing the value of using complementary approaches dedicated to the detection of different types of events.
We can envisage several potential developments to the approach presented in this study, some of which might be expected to improve the performance with respect to heterozygous SV. First, sequence coverage might be improved by using split mapping in the initial generation of read maps (herein we have used only gapless alignment). Second, additional features, for example the gapless and gapped alignment coverage for each genomic site, could be incorporated into the SVM analysis. Another possible step would be to use positional constraints (based on SVM2 predictions) in split mapping of reads as a post-processing step, both to establish additional support for events and to fine-map the positions of SVs, as currently implemented in Breakpointer (29).
In conclusion, we have shown that inclusion of more detailed information on the local patterns of read mapping can notably enhance the sensitivity of detection of SV events by non-split-mapping methodologies.
Furthermore, we showed that insert-size-based SV detectors such as SVM2 can complement split-mapping approaches in the localization of ultra-short SV events, particularly those in repetitive and low-complexity regions of the genome.