Structural variants are a major form of genetic variation [1,2]. The investigation of structural variants has provided key insights into the genetic basis of common human disease [3], most notably in neuropsychiatric disorders, where it has been well established that rare and de novo mutations confer significant risk [4].
A variety of tools have been developed to detect structural variation using genomic sequence data. Some approaches rely on information gleaned from changes in read depth [5,6], while others also include signals from outlier read pairs [7,8]. Still other methods rely on split-read signals [9] (reads that span a breakpoint). Recently, the 1000 Genomes Project undertook a thorough evaluation [10] of these and other structural variant detection methods and found wide variability in sensitivity and specificity among the methods. While some of the tested methods were clear leaders (notably GenomeSTRiP [7] for its low false discovery rate), no single method produced results with a sensitivity comparable to a careful merging of calls from all the methods.
The fact that the fusion of structural variant calls from multiple discovery methods yields better results than any single method suggests that existing tools may be individually too narrow in terms of the signals they consider or the methodology they employ. If a single method could effectively capture the collective insight provided by the variety of available tools, it might obviate the need to run all of them and carefully merge the results, saving valuable time and resources.
In this work, we present a first step in this direction by drawing on machine learning ideas to facilitate better structural variant discovery. We used previous studies as a basis for extracting the multidimensional patterns that discriminate calls that validate experimentally from those that do not.
Beyond the typical read depth and read pair signatures used by most structural variant discovery approaches, we constructed additional features from the sequencing data (described in Online Methods), thereby representing the genome in a higher-dimensional space (the feature matrix X). We trained a Random Forest classifier [11] to partition this space in a way that optimizes the classification of deletions, duplications, and false positives, whose identities are stored in the class label vector Y (Supplementary Fig. 1).
For training, the features in X were constructed from data available in the BAM files for the high-coverage trios (a total of six individuals) in the 1000 Genomes Project [12] (1KG). The structural variant class labels in Y were assigned based on an integration of the calls from refs. 10, 13 and 14. The seven class labels "deletion", "duplication", "deletion-flanking", "duplication-flanking", "false positive deletion", "false positive duplication" and "invariant" are defined in the supplementary information (Supplementary Table 1).
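To make the classification step concrete, here is a minimal sketch in Python using scikit-learn's RandomForestClassifier as a stand-in for the R implementation; the synthetic features, window counts, and reduced three-class label set below are illustrative assumptions, not the actual forestSV features or the full seven-class scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# X: one row per 100-bp genomic window ("subject"), one column per
# sequence-derived feature (e.g. read depth, read-pair signals, ...).
# Synthetic data stands in for features extracted from BAM files.
n_windows, n_features = 500, 8
X = rng.normal(size=(n_windows, n_features))

# Y: class label per window; a simplified three-class version of the
# seven labels described in the text.
Y = rng.choice(["deletion", "duplication", "invariant"], size=n_windows)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, Y)

# For new data X', the forest assigns a predicted class Y' per window;
# the maximum vote proportion serves as a confidence score.
X_new = rng.normal(size=(10, n_features))
Y_pred = clf.predict(X_new)
vote_proportions = clf.predict_proba(X_new).max(axis=1)
```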
The rows of X (and consequently the elements of Y) correspond to overlapping 100-bp windows of the genome that we call "subjects" here (Supplementary Fig. 1). When presented with new data (X'), the trained classifier assigns a predicted class (Y') to each subject. The final variant calls are then generated by a simple segmentation routine that merges consecutive subjects of the same predicted class into a single event. These events are ranked by a prediction confidence score produced by the classifier (the Random Forest vote proportion).
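The segmentation step can be sketched as follows; the 50-bp window step, the "invariant" background label, and the mean-vote scoring rule are assumptions for illustration, not the exact forestSV implementation.

```python
def segment_calls(starts, classes, scores, window_size=100):
    """Merge consecutive same-class windows into events.

    starts  : sorted window start positions
    classes : predicted class per window (e.g. "deletion", "invariant")
    scores  : Random Forest vote proportion per window
    """
    events = []
    i = 0
    while i < len(classes):
        # extend j to the last consecutive window with the same class
        j = i
        while j + 1 < len(classes) and classes[j + 1] == classes[i]:
            j += 1
        if classes[i] != "invariant":  # only variant classes become calls
            score = sum(scores[i:j + 1]) / (j - i + 1)  # mean vote proportion
            events.append((starts[i], starts[j] + window_size,
                           classes[i], score))
        i = j + 1
    return events

# Overlapping 100-bp windows with a 50-bp step (an assumed layout):
calls = segment_calls(
    starts=[0, 50, 100, 150, 200, 250],
    classes=["invariant", "deletion", "deletion", "deletion",
             "invariant", "duplication"],
    scores=[0.2, 0.9, 0.8, 0.7, 0.3, 0.6],
)
# e.g. a merged deletion spanning 50-250 scored ~0.8, and a
# single-window duplication spanning 250-350 scored 0.6
```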
[Figure caption: Mapping of features to structural variant calls by Random Forests]
The performance of forestSV was first assessed on the 1KG data and compared with calls from other widely used methods [10]. We trained the classifier on five of the six 1KG genomes and then made structural variant calls on the left-out genome, doing this for each of the six individuals in turn. We then combined the scored calls and constructed sensitivity-specificity curves (Supplementary Fig. 2) that reflect forestSV's ability to correctly prioritize the calls in a gold standard set. For comparison, we also marked the sensitivity and specificity of the "donor" call sets (method/group designations are carried over from ref. 10). For both deletion and duplication events, forestSV provided a combination of sensitivity and specificity that, in the gold standard set we examined, was superior to all of the donor call sets and matched the sensitivity and specificity of the merged and genotyped call set provided in ref. 10 (Supplementary Fig. 2). This suggests that integrative methods, such as forestSV, outperform approaches that rely on a single methodology or sequence feature for structural variant discovery. Lastly, we examined whether the inclusion of related individuals in the training set of this leave-one-out scheme might lead to a disproportionate boost in the performance of forestSV; we did not find evidence for such an effect (Supplementary Fig. 3).
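The leave-one-out scheme described above can be sketched as follows, with synthetic per-genome data standing in for the 1KG features and scikit-learn's Random Forest again serving as a stand-in for the R implementation; genome names, sizes, and the two-class labels are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Six genomes, each with per-window features and class labels
# (synthetic stand-ins for the 1KG high-coverage trios).
genomes = {f"g{i}": (rng.normal(size=(200, 5)),
                     rng.choice(["deletion", "invariant"], size=200))
           for i in range(6)}

scored_calls = {}
for held_out in genomes:
    # Train on the five other genomes...
    X_train = np.vstack([X for g, (X, _) in genomes.items() if g != held_out])
    Y_train = np.concatenate([Y for g, (_, Y) in genomes.items()
                              if g != held_out])
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train, Y_train)

    # ...then score windows of the left-out genome by the vote
    # proportion for the "deletion" class, used to rank calls.
    X_test, _ = genomes[held_out]
    del_idx = list(clf.classes_).index("deletion")
    scored_calls[held_out] = clf.predict_proba(X_test)[:, del_idx]
```

Pooling the per-genome scores and sweeping a score threshold then yields the sensitivity-specificity curves against a gold standard set.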
The characteristics of the calls generated by forestSV are in line with the consensus genotyped calls in ref. 10 (Supplementary Results, Supplementary Figs. 4-6). Event scores are related to event size for duplications (Supplementary Fig. 7), while for deletions scores are unbiased across a range of event sizes. We found that, although forestSV was trained on high-coverage data, it also performs reasonably well on low-coverage (5×) data (Supplementary Fig. 8). The accumulation of calls as the event score threshold is relaxed is characteristically different for deletions and duplications (Supplementary Fig. 9).
As an independent evaluation of our method, we applied forestSV to high-coverage genome sequence data from a five-person family currently under investigation in our laboratory, consisting of a father, a mother, monozygotic (MZ) twins concordant for autism, and an unaffected sibling. Performance of forestSV was evaluated based on patterns of Mendelian inheritance in the family and identity of genotypes between the MZ twins (Supplementary Fig. 10). "Mendelian inconsistencies" were defined as calls present in any of the children but not in either of the parents (with > 50% reciprocal overlap). "MZ discordances" were defined similarly as events detected in one twin and not found with > 50% reciprocal overlap in the other. We then plotted how rates of Mendelian inconsistency and MZ discordance varied with the event scores in the children (Supplementary Fig. 10). By both measures, error rates are very low among events scoring > 0.65, suggesting that this set is enriched for true structural variants. These results suggest that an event score threshold in the range of 0.65-0.70 will keep error rates below 5% for deletions and around 10% for duplications. At a threshold of 0.65, forestSV discovered 2,300-2,593 unique deletions and 53-72 unique duplications per individual. Clearly, sensitivity for detecting duplications is lower than for deletions, and hence a more relaxed threshold may be needed (see Supplementary Figs. 3d and 9). Because this data set has not yet been as thoroughly characterized as the 1KG high-coverage trio samples, we are unable to reliably estimate sensitivity directly.
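The 50% reciprocal-overlap matching used for these checks can be expressed compactly; the function names and the half-open interval representation below are illustrative conventions, not part of the forestSV package.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap between intervals a = (start, end) and b = (start, end):
    the overlap length as a fraction of each interval, taking the minimum."""
    ov = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(ov / (a[1] - a[0]), ov / (b[1] - b[0]))

def mendelian_inconsistent(child_call, parent_calls, threshold=0.5):
    """True if no parental call reciprocally overlaps the child call
    by more than the threshold (i.e. the call looks de novo)."""
    return not any(reciprocal_overlap(child_call, p) > threshold
                   for p in parent_calls)

parent_calls = [(1000, 2000), (5000, 5600)]
assert not mendelian_inconsistent((1100, 1900), parent_calls)  # inherited
assert mendelian_inconsistent((3000, 4000), parent_calls)      # de novo candidate
```

MZ discordance is the same test applied between the two twins' call sets rather than between child and parents.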
forestSV is particularly well suited to the detection of rare variants because it is not reliant on finding variant support in multiple individuals; it can call structural variants effectively in a single genome. This contrasts with earlier methods developed by our group [5] that require multiple genomes to be analyzed simultaneously and that favor variants with support in multiple individuals. This is a key advantage in light of the knowledge that rare structural variants, with population frequencies of ≤ 10⁻⁴, play an important role in common disease, particularly in diseases of the brain [15].
We implemented forestSV as an R package. It includes an executable that, with a single command, takes BAM files as input and produces deletion and duplication call sets. The package source code and documentation, together with a thorough technical tutorial describing its use, are available for download from our website at http://sebatlab.ucsd.edu/software. forestSV will continue to improve as additional training data become available (Supplementary Results, Supplementary Fig. 11), and we will periodically provide updates to the trained classifier on our website.