|Home | About | Journals | Submit | Contact Us | Français|
ZI, GM designed the study, developed the mathematical models, and wrote the paper; MC, ZI developed the variant discovery algorithms, designed the multi-color graph data structures, and implemented software; ZI performed simulations, and analyses for Cases 1,3,4; IT, ZI performed analyses for Case 2; PF contributed to early plans for Cortex.
Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex; the first de novo assembler capable of assembling multiple eukaryote genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variation in a high coverage human genome. Second, we identify over 3Mb of novel sequence in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from 10 chimpanzees enables accurate variant calls without a reference sequence. Finally, we estimate classical HLA genotypes at HLA-B, the most variable gene in the human genome.
Characterization of genetic variants present in an individual, population or ecological sample has been transformed by the development of high throughput sequencing (HTS) technologies. The standard approach to variant discovery and genotyping from HTS data is to map reads to a reference genome1-5, so identifying positions where the sample contains simple variant sequences. This approach has proved powerful in the study of single nucleotide polymorphisms6 (SNPs), short insertion-deletion (indel) polymorphisms3,5,7,8 and larger structural variation9-14 in well-characterised genomes, such as human15-17.
However, the mapping approach has limitations. First, the sample may contain sequence absent or divergent from the reference, for example through horizontal transfer events in microbial genomes18,19 or at highly diverse loci, such as the classical HLA genes20. In such cases, short reads either cannot or are unlikely to map correctly to the reference. Second, reference sequences, particularly of higher eukaryotes, are incomplete, notably in telomeric and pericentromeric regions. Reads from missing regions will often map, sometimes with apparently high certainty, to paralogous regions, potentially leading to false variant calls. Third, samples under study may either have no available reference sequence or it may not be possible to define a single suitable reference, as in ecological sequencing21. Fourth, methods for variant calling from mapped reads typically focus on a single variant type. However, where variants of different types cluster, focus on a single type can lead to errors, for example through incorrect alignment around indel polymorphisms6,7. Fifth, although there are methods for detecting large structural variants, using array CGH 22-25 and mapped reads11,12,14,26, these cannot determine the exact location, size or allelic sequence of variants. Finally, mapping approaches typically ignore prior information about genetic variation within the species.
Several of these limitations can potentially be solved through de novo assembly, which is agnostic with regards to variant type and divergence from any reference. However, while there are established algorithms for de novo assembly from HTS shotgun data, based on overlap27-29 or de Bruijn graphs30-32, current approaches have limitations. Notably, they focus on consensus assembly, treating the sequence as if derived from a monomorphic sample (e.g. haploid genome, inbred line or clonal population). Consequently, variation is ignored (processed in the same way as sequencing artefacts) and can lead to assembly errors. Some variation-aware de novo assembly algorithms have been developed31,33-36, but these do not represent a general solution to sequencing experiments where genetic variation is either the primary concern or unavoidable (outbred diploid samples, pooled data or ecological samples).
Current assembly methods also typically ignore pre-existing information, such as a reference sequence or known variants. Although variant discovery should not be biased by such information, nor should it be discarded. For example, in a single outbred diploid sample it is hard to distinguish paralogous from orthologous variation. However, if variation is also observed in the reference haploid genome it is most likely driven by paralogy. Finally, current implementations of de novo assembly algorithms for HTS data have very substantial computational requirements, which make them impractical for large-scale studies on eukaryote genomes.
Here, we introduce de novo assembly algorithms focused on detecting and characterising genetic variation in one or more individuals. These algorithms extend classical de Bruijn graphs37,38 by colouring the nodes and edges in the graph by the samples in which they are observed. This approach accommodates information from multiple samples, including one or more reference sequences and known variants. We show how the method can detect variation in species without a reference, combine information across multiple individuals to improve accuracy, and genotype known variants. Cortex has already contributed to public datasets as part of the 1000 Genomes Project17.
De Bruijn graphs, which represent overlap information within a set of DNA sequences, are widely used in genome assembly and underlie many popular algorithms, including AllPaths-LG 31, SOAPdenovo 32, Abyss39 and Velvet 30,40. The graph consists of a set of nodes, representing words of length k (k-mers). Directed edges join k-mers seen consecutively in the input. Variation between genomes generates new nodes and edges. In the simplest case polymorphisms appear as bubbles within unique contigs. However, more complex structures also arise, for example where a variant generates a k-mer found in a paralogous location (Supplementary Figure 1).
A colored de Bruijn graph generalises the original formulation to multiple samples embedded in a union graph, where the identity of each sample is retained by colouring those nodes present in a sample. The samples may reflect HTS data from multiple samples, experiments, reference sequences, known variant sequences or any combination. Below we outline four algorithms for variant discovery and genotyping that make use of the colored de Bruijn graph structure (Methods Section 1-4).
The simplest use of colored de Bruijn graphs is to identify variant bubbles in a single diploid individual. This approach may have a high false positive rate because of the difficulty of separating repeat-induced and variant-induced bubbles. However, inclusion of a haploid reference genome improves the reliability of the algorithm because most repeat structures will be present in the reference and any bubble in the reference colour must be a repeat (Figure 1a). A reference genome also aids the detection of variants because only the variant allele contig need be assembled, and is essential for detecting homozygous variant sites. All types of variant can induce bubbles, hence the process of bubble detection is influenced by variant type only through the graph complexity of the variants and the probability of assembling both sides of the bubble given the sequence coverage, k-mer size and error rate (see Methods Section 1 for details of a model for predicting power). Haplotypes arising from variants within k bases of each other will naturally be assembled as single compound variants. Our implementation is referred to as the ‘bubble-calling’ (BC) algorithm (Methods Section 2).
The bubble-calling algorithm relies on the detection of clean bubbles. However, for complex variants (e.g. novel sequence insertions, large deletions and inversions), the path of at least one allele is unlikely to generate a clean contig. Nevertheless, in some cases, particularly deletions, path complexity is restricted to the reference allele. Such cases can be identified by following the (known) reference path through the joint graph and detecting where it diverges from the sample graph (Figure 1b; Methods Section 2 and Supplementary Figure 2). This ‘path-divergence’ (PD) algorithm typically only identifies homozygous variants and is biased towards identifying deletions. Nevertheless, the algorithm can substantially increase power to detect some variant types and potentially identifies events of arbitrary size, irrespective of read-length.
The joint analysis of HTS data from multiple samples can improve the power and false discovery rate (FDR) of variant detection substantially. In the simplest case, samples are combined in a single colour (for example, in a pooling experiment) and the data can be analysed as above. However, by maintaining separate colours for each sample there is additional information about whether a bubble is likely to be induced by repeats (where many or all samples show coverage of both paths in the bubble; Figure 1c) or errors (where the error-carrying side of bubble will typically have low coverage). This observation leads to an approach to variant discovery even in species where there is no suitable reference. We have devised statistical methods that enable probabilistic classification of bubble structures into those arising from errors, repeat structures or variants (Methods Section 3). When a reference genome is available, this approach can still help distinguish true variants from errors and repeat structures absent from the reference.
Colored de Bruijn graphs can be used to genotype samples at known loci even when coverage is insufficient to enable variant assembly (Figure 1d). We construct a colored de Bruijn graph of the reference sequence, known allelic variants (which may include those discovered using the above methods) and data from the sample. The likelihood of each possible genotype is calculated accounting for the graph structure of both local and genome-wide sequence (Methods Section 4). The approach generalises to multiple allelic types and, because the algorithm does not require variants to form simple bubble structures, it is possible to genotype complex and compound variants such as those at classical HLA loci.
We have developed Cortex, a memory-efficient assembler for building and representing colored de Bruijn graphs and to perform variant calling and genotyping from HTS data (cortexassembler.sourceforge.net; Methods Section 5). The implementation uses an efficient hash table that implicitly encodes the graph; memory use is specified in advance according to a simple formula, and many standard operations have linear or better algorithmic complexity (Methods Sections 2, 5). Novel cleaning methods are used to increase sensitivity (Methods Section 6 and Supplementary Figure 3). Cortex is the only assembler able to handle multiple eukaryotes simultaneously, for example 1000 S. cerevisiae samples in under 64Gb of RAM, or 10 humans in under 256Gb of RAM.
We simulated high coverage (10-50x) sequencing data from a diploid human sample that carries SNPs, indels and structural variants (Methods Section 7). Data was analysed by both the BC and PD methods.
For a variant to be identified successfully, the bubble must both be assembled without gaps and be identifiable within the wider graph. Genome complexity, sequencing depth, read-length, k-mer size and error rate interact to influence both factors. As k-mer size increases, the fraction of SNP sites with unconfounded bubbles ranges from 51% with k=21 to 85% with k=75 in humans (Supplementary Figure 4). Increasing k-mer size reduces the risk of error-induced contigs confounding the graph, but also increases the probability of a k-mer containing an error. Furthermore, for a fixed per-base depth and read-length, as k-mer-size increases, the effective depth of each k-mer decreases leading to an increased probability of gaps in the assembly (Figure 2a, and Methods Section 1). Consequently, the k-mer size that maximises the sensitivity of the BC algorithm within the simulation varies with coverage and read-length; approximately 55 for 30x, 55 for 40x and 65 for 50x with 100bp reads. The loss in power relative to the theoretical maximum is, however, small. For example, with 50x coverage (k=65, 100bp reads) we identify 86% of heterozygous SNPs compared to the maximum possible of 92%. Simulation-based estimates of power closely track predictions (Figure 2a, Supplementary Figures 5, 6).
To assess the power of Cortex to detect a range of variants of different types and sizes we applied the BC and PD algorithms at a single point (30x, read length 100, k=55; see Methods Section 7). For isolated SNPs, short indels (1-100bp) and small complex combinations of SNPs and indels (1-100bp) we have 80% power to detect heterozygous sites and 90% power to detect homozygous variant sites (Figure 2b). For moderate size (100-1000bp) indels and complex variants, power is 50% and 75-80% for heterozygous and homozygous sites respectively. For large variants (1-50kb) we only have power to detect homozygous variant sites (c. 35%), entirely through PD. These sensitivity estimates are attained with an FDR of 2%.
We simulated sequence data from ten diploid individuals based on human chromosome 22 (100bp reads, 10x per individual; Methods Section 8). Data were analysed at the level of individual haplotypes, error-free reads and error-containing reads (k=55) under two cleaning thresholds (relaxed and stringent). Two filtering approaches were compared. First, we remove bubbles that are present in the reference. Second, we use a probabilistic model to classify bubbles as arising through errors, repeats or true variants (Methods Section 3).
At k=55, 10% of SNPs fail to make clean bubble structures (Supplementary Figure 4), however, coverage is sufficient that only for rare variants (count < 3) is there substantial loss of power (Figure 2c). With realistic levels of sequence error, power drops by an additional 10% under the relaxed cleaning threshold, but recovers under the more stringent cleaning threshold because confounding error contigs are removed. FDR with relaxed cleaning is 29%, but probabilistic classification reduces this to 1.5% with only a 1.7% loss in power (Figure 2d). The more stringent cleaning approach has an FDR of 2.3% before classification and 1.6% after, with a 1.0% loss in power. In contrast, removal of bubbles present in the reference has only a marginal effect on FDR (29.0% and 2.3% for the two cleaning thresholds respectively), as the majority of false calls are read-error driven.
We analysed high coverage data (26x with 100bp reads analysed at k=55; Methods Section 9) from a single individual of European ancestry (NA12878 from the CEU) for whom independent validation data are available through 3 Mb of fosmid sequence (median length 40kb), selected to contain structural variation41. This sample has been analysed using mapping-based strategies in the 1000 Genomes Project17 (63x of mostly 36bp paired-end reads), thus enabling comparison of Cortex with alternative strategies. The fosmid data enables us to estimate an upper-bound for FDR for variants of different types from the fraction of sites called as homozygous variant where the fosmid sequence contains only the reference allele. A detailed discussion of the validation results can be found in the Supplementary Methods.
After cleaning, the de Bruijn graph for NA12878 has 2,777,352,792 nodes (unique k-mers) compared to 2,691,115,653 in the reference sequence (cleaning reduces the initial number of nodes by 23%). The bubble-calling (BC) algorithm identifies 2,686,963 bubbles, of which 5.6% are removed because both sides of the bubble are also present in the reference. The path-divergence (PD) algorithm identifies 528,651 deviations from the reference, of which 39.8% were not identified by BC. The union of the BC and PD call-sets includes 2,245,279 SNPs, 361,531 short indels (insertion to deletion ratio in the 5-30bp range of 1:1.3 for BC and 1:1.7 for PD, compared with 1:3.7 for the 1000 Genomes calls) and 1,100 larger or more complex variants.
The Cortex and 1000 Genomes call sets have different properties arising from differences in experimental design and analysis approach. Only 80% of the genome is accessible to the 1000 Genomes SNP calls17, but power within these regions is high. In contrast, at k=55, over 85% of the genome is accessible to Cortex, but power is reduced (by approximately 40% at heterozygous, and <5% at homozygous, sites) due to fluctuations in coverage. Thus while the call set sizes for homozygous SNPs and short indels within the fosmid footprint are similar, Cortex calls only half the number of heterozygous sites (Table 1). Across the genome, Cortex detects variation at 87% of sites called as homozygous alternative by the 1000 Genomes Project and 67% of sites called as heterozygous (comparison to HapMap 2 sites gives equivalent figures, Supplementary Table 7).
SNP variants identified by Cortex have diagnostic properties such as transition-transversion ratio (Cortex = 2.02 (BC) / 2.1 (PD), 1000 Genomes = 2.07) and dbSNP rate (Cortex = 92.7% (BC) / 95.6% (PD), 1000 Genomes = 92.1%, dbSNP 129) that are comparable to the 1000 Genomes calls17.
Overall FDR for Cortex SNP calls is 4%, reduced to 1.5% by applying a homopolymer filter and selecting high confidence calls (25% reduction in call set; Table 1 and Methods Section 9). None of the 1000 Genomes homozygous SNP calls invalidates (Table 1). Of the 43 invalidated homozygous SNP calls from Cortex, 35 were called as heterozygous sites by the 1000 Genomes Project, hence the FDR for Cortex is probably <1%. Short indels (1-100 bp) have similar FDR (0% for high confidence set), but whereas the 1000 Genomes calls are restricted to variants under 30bp in length, both the BC and PD approaches identify indels over 100bp (Figure 3a).
Across the genome, Cortex identifies 138,262 complex variants, consisting of phased SNPs (74%), closely-sited SNPs and indels (25%) and complexes of insertions, deletions and local rearrangements (1%). FDR for complex variants is low (2.7% for BC and 1.7% for PD; Supplementary Table 1). While mapping-based approaches can call closely-sited variants of different types, these are often filtered out. However, our results indicate that Cortex can identify complex variants with FDR comparable to simple variants. Figure 3b shows examples of complex variants validated in the fosmid data.
We constructed three pooled population graphs for 164 humans (CEU, YRI and CHB/JPT) sequenced at low coverage (2-4x) in the 1000 Genomes Project17 (Methods). By including the reference sequence as a fourth colour, we identified 21,281 novel contigs of ≥100bp (<90% homology to any reference sequence), totalling 3.2Mb. The novel unique sequence load carried by a typical individual is 1.4Mb for CEU, 1.5Mb for YRI and 1.5Mb for CHB/JPT respectively. Of this, 93% is estimated to be allelic, and copy-number estimates for other sequences range up to 6.3 (Figure 4a). On average, 45kb per individual is homologous to a known gene and we see strong over-representation for matches to variants at classical HLA and KIR loci, both known to be highly variable in sequence, structure and copy-number. Some sequences show very strong differentiation between populations. For example, we find three contigs in YRI, homologous to olfactory receptor genes, that are absent from other populations (Figure 4a).
There are practical implications of these results. First, the novel sequences, particularly those matching genes or strongly differentiated between populations are candidates for functionally-relevant polymorphism. Second, the combined population graphs provide a summary of human genome diversity against which it is possible to map sequence data from future studies. We have released the novel sequence contigs, population estimates of the per-genome copy number, the combined population graph and tools for aligning reads to the graph (Methods Sections 10, 14).
We applied Cortex to data from 10 Western Chimpanzees (50 bp reads, average coverage 6x; Methods Section 11). Bubbles were identified after using the relaxed cleaning threshold and classified probabilistically. Power was estimated by comparison to previous SNP genotype data on the same samples42.
Across the genome we identify 3.5 million variants, of which 2.7 million are single nucleotide variants. The probabilistic filter classified 153,921 of these as repeats, of which 69% were bubbles in the reference. For bubbles classified as SNPs we estimate FDR from the fraction of sites that are also bubbles in the reference, here 3.5% (compared to 6.5% before classification). This estimate is substantially greater than that predicted from simulations. However, manual inspection revealed that many of these sites are segregating in the sample and therefore are either polymorphic segmental duplications (not currently considered in the classification process) or allelic variants mis-assembled as paralogous in the reference. Power compared to the SNP genotype data is 55% before classification and 54% after. Thus probabilistic classification results in a data set with low FDR at a small cost to power and can be applied to any species, regardless of reference availability. The relationship between allele count and detection rate closely follows the theoretical predictions (Figure 4b).
We applied the genotyping algorithm to both simple variants, here HapMap2 SNPs43, and complex variants, specifically HLA-B genotype, using the sequencing data from NA12878 described above and high coverage sequence data from an individual of African origin (NA19240). Both individuals have classical HLA alleles typed from a previous project44. At HapMap2 sites, we find discordance of less than 1% at high confidence sites (Methods Sections 9.4 and 12, Supplementary Tables 4, 5). Discrepancies are driven by sites called as homozygous variant by the BC algorithm and heterozygous by HapMap2; a result of stochastic loss of coverage of k-mers spanning the reference allele.
Classical HLA allele genotyping, of importance in many areas of medical genetics, is laborious and expensive. Although DNA-sequencing represents the gold-standard quality, most genotyping is performed through a mix of PCR amplification and oligo hybridisation. HTS genome-wide data has the potential to provide classical HLA sequence information, but sequence diversity, structural variation and extensive paralogy within the region currently restricts mapping-based approaches. To evaluate the performance of Cortex for genotyping HLA-B we constructed a graph containing the reference genome, all 1429 known HLA-B alleles and data from each high coverage sample as separate colours (Methods Section 12). We calculated the likelihood of all 1,021,735 possible genotypes. For NA19240, the most likely genotype (B*57:03:01, B*35:01:01) agrees at 4-digit resolution with previous data obtained using classical typing methods44 and is very strongly supported (likelihood ratio ~1023; Figure 5a). For NA12878 the most likely genotype contains B*56:01 and cannot distinguish between B*08:03, B*08:15, B*08:36, B*08:47 and B*08:13 for the second allele (Figure 5b). The lab-based genotype is B*56:01/B*08:01 which differs from the maximum likelihood estimate by 3.2 units of log-likelihood and agrees at the 2-digit level with the graph-based estimate. We note that lab-typing was based on primer amplication and oligo hybridisation, which can often lead to minor ambiguities at the 4-digit level. By sub-sampling NA12878 to generate graphs at between 2x and 20x we find that 16x coverage was required to attain 2-digit agreement with lab-based typing (Figure 5b).
We have introduced a new approach to combining de novo assembly with variant detection and genotyping from HTS data. We use colored de Bruijn graphs to represent information from multiple sources and a mixture of graph-analytic and statistical approaches to detect variants of different types and subsequently genotype. Our method is the first de novo assembly-based variant caller, although previous work has made steps towards reference-free variant calling45,46. Technically, the key advance is the development of a highly efficient de Bruijn graph implementation. This efficiency enables data from multiple samples, as well as reference sequences and known variants to be included within a single graph structure that preserves sample identity through the use of colours. For single high coverage genomes the algorithms provide power to detect and genotype simple and complex variants. However, the major strength of the approach lies in the simultaneous analysis of multiple genomes, which enables powerful and accurate approaches to variant detection without any need of a reference genome. This makes possible HTS analysis of genetic variation in any species. It could also provide an approach for detecting changes between highly related genomes, as in tumour-normal pairs in cancer genomics47 or bacteria in within transmission chains48.
We have also developed a simple mathematical model to describe de Bruijn graph assembly from HTS data, which has two practical benefits. First, the model has predictive power both in simulated and empirical data, hence can guide experimental design. Second, the model can be used to calculate the likelihood of any particular genome sequence given HTS data and an estimate of error rates; one application of which is genotyping complex variants, such as the classical HLA loci.
Finally, the Cortex algorithms have several limitations. Most notably, we do not use read-pair information to improve local assembly, which can be of substantial value around repeat sequence. However, there are established algorithms for using read-pair information to disambiguate de Bruijn graphs30,31,39,40,49,50. There is also the potential for error correction51, which can compensate for the loss of coverage caused by errors. There are, however, more fundamental challenges in using de Bruijn graphs, including the greater need for error-correction as the k-mer size increases, the lack of any natural way of encoding read-pair information and the potential for graph explosion as more individuals are included in the graph. Nevertheless, multi-colored graphs provide one solution to the obvious inadequacy of representing the genetic composition of a species by a single haploid reference.
We would like to thank the members of the 1000 Genomes Project Consortium for discussion, suggestions and sequencing data. We thank Bartu Ahiska, Adam Auton, Ewan Birney, Richard Durbin, Gerton Lunter, Jon Woolf, Daniel Zerbino for discussion, two anonymous reviewers for their comments, and members of the PanMap Project and the Genomics Core at the Wellcome Trust Centre for Human Genetics for access to sequence data. ZI is funded by a grant from the Wellcome Trust (WT086084/Z/08/Z) to GM. The sequencing of NA12878 was done by the Wellcome Trust Sequencing Core at Oxford, under a grant from the Wellcome Trust (090532/Z/09/Z).
Full details of the methods used in this paper can be found in the Supplementary Methods. These include definitions of terminology, followed by the sections: (1) Mathematical model for power to detect variants. (2)Variant calling algorithms (Bubble Caller and Path Divergence Caller). (3) Probabilistic Classification of graph structures as repeat-, error- or variant-induced. (4) Genotyping of simple and complex variants in a de Bruijn graph. (5) Description of Cortex software implementation. (6) Error-cleaning algorithms for high coverage samples and low coverage populations. (7) Single genome simulations. (8) Population simulations. (9) Case 1 - Analysis of high coverage human sample NA12878. (10) Case 2 – Pooled assembly of 164 samples from 1000 Genomes Pilot. (11) Case 3 – using probabilistic classification of bubbles on 10 chimpanzees. (12) Case 4 – Genotyping of simple and complex variants. (13) Making appropriate choice of parameters for experimental design. (14) Release of source code and 1000 Genomes population graph. Cortex is available from http://cortexassembler.sourceforge.net/.