Characterization of genetic variants present in an individual, population or ecological sample has been transformed by the development of high-throughput sequencing (HTS) technologies. The standard approach to variant discovery and genotyping from HTS data is to map reads to a reference genome1-5, so identifying positions where the sample contains simple variant sequences. This approach has proved powerful in the study of single nucleotide polymorphisms6 (SNPs), short insertion-deletion (indel) polymorphisms3,5,7,8 and larger structural variation9-14 in well-characterised genomes, such as human15-17.
However, the mapping approach has limitations. First, the sample may contain sequence absent or divergent from the reference, for example through horizontal transfer events in microbial genomes18,19 or at highly diverse loci, such as the classical HLA genes20. In such cases, short reads either cannot or are unlikely to map correctly to the reference. Second, reference sequences, particularly of higher eukaryotes, are incomplete, notably in telomeric and pericentromeric regions. Reads from missing regions will often map, sometimes with apparently high certainty, to paralogous regions, potentially leading to false variant calls. Third, samples under study may either have no available reference sequence or it may not be possible to define a single suitable reference, as in ecological sequencing21. Fourth, methods for variant calling from mapped reads typically focus on a single variant type. However, where variants of different types cluster, focus on a single type can lead to errors, for example through incorrect alignment around indel polymorphisms6,7. Fifth, although there are methods for detecting large structural variants, using array CGH22-25 and mapped reads11,12,14,26, these cannot determine the exact location, size or allelic sequence of variants. Finally, mapping approaches typically ignore prior information about genetic variation within the species.
Several of these limitations can potentially be solved through de novo assembly, which is agnostic with regard to variant type and divergence from any reference. However, while there are established algorithms for de novo assembly from HTS shotgun data, based on overlap27-29 or de Bruijn graphs30-32, current approaches have limitations. Notably, they focus on consensus assembly, treating the sequence as if derived from a monomorphic sample (e.g. haploid genome, inbred line or clonal population). Consequently, variation is ignored (processed in the same way as sequencing artefacts) and can lead to assembly errors. Some variation-aware de novo assembly algorithms have been developed31,33-36, but these do not represent a general solution to sequencing experiments where genetic variation is either the primary concern or unavoidable (outbred diploid samples, pooled data or ecological samples).
Current assembly methods also typically ignore pre-existing information, such as a reference sequence or known variants. Variant discovery should not be biased by such information, but neither should the information be discarded. For example, in a single outbred diploid sample it is hard to distinguish paralogous from orthologous variation. However, if variation is also observed in the reference haploid genome, it is most likely driven by paralogy. Finally, current implementations of de novo assembly algorithms for HTS data have very substantial computational requirements, which make them impractical for large-scale studies on eukaryote genomes.
Here, we introduce de novo assembly algorithms focused on detecting and characterising genetic variation in one or more individuals. These algorithms extend classical de Bruijn graphs37,38 by colouring the nodes and edges in the graph by the samples in which they are observed. This approach accommodates information from multiple samples, including one or more reference sequences and known variants. We show how the method can detect variation in species without a reference, combine information across multiple individuals to improve accuracy, and genotype known variants. Cortex, the software implementing these algorithms, has already contributed to public datasets as part of the 1000 Genomes Project17.
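To make the idea of colouring concrete, the following is a minimal illustrative sketch of a coloured de Bruijn graph in Python. It is not the Cortex implementation: the function names (`build_coloured_graph`, `private_nodes`), the toy sequences and the choice of k = 4 are all assumptions made for illustration, and the sketch ignores reverse complements, sequencing errors and coverage, all of which a practical assembler would need to handle.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_coloured_graph(samples, k):
    """Build a toy coloured de Bruijn graph (hypothetical helper).

    samples maps a colour name (e.g. a sample or reference label)
    to a list of sequences. Nodes are k-mers; an edge joins two
    k-mers that are adjacent in some sequence. Each node and edge
    records the set of colours in which it was observed.
    """
    node_colours = defaultdict(set)
    edge_colours = defaultdict(set)
    for colour, sequences in samples.items():
        for seq in sequences:
            previous = None
            for kmer in kmers(seq, k):
                node_colours[kmer].add(colour)
                if previous is not None:
                    edge_colours[(previous, kmer)].add(colour)
                previous = kmer
    return node_colours, edge_colours


def private_nodes(node_colours, colour):
    """Return, sorted, the k-mers seen only in the given colour."""
    return sorted(
        kmer for kmer, colours in node_colours.items()
        if colours == {colour}
    )


# Toy data (assumed): the sample differs from the reference by one SNP.
samples = {
    "reference": ["ACGTACGGA"],
    "sample1": ["ACGTTCGGA"],
}
nodes, edges = build_coloured_graph(samples, k=4)

print(sorted(nodes["ACGT"]))            # colours of a shared flanking k-mer
print(private_nodes(nodes, "sample1"))  # k-mers on the sample-only branch
```

In this toy example the flanking k-mers carry both colours, while the k-mers spanning the variant site carry only one, so the two alleles appear as diverging paths that can be told apart by colour.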