For many years, sequence assembly algorithms were mainly designed around the overlap-layout-consensus paradigm (Pevzner et al., 2001). This approach builds an assembly from the pairwise alignments of all the reads. With capillary reads, this approach was very successful, and produced many high-quality genome assemblies.
However, assembling short reads with these programs proved very costly. The redundancy of these datasets made the quadratic number of read alignments prohibitively expensive to compute, and the short length of the reads reduced the statistical significance of overlaps between reads.
This spurred the development of another category of assemblers, based on the use of the de Bruijn graph, among which EULER (Pevzner et al., 2001; Chaisson et al., 2009), ALLPATHS (Butler et al., 2008), ABySS (Simpson et al., 2009), SOAPdenovo (Li et al., in press) and Velvet (Zerbino and Birney, 2008). To cut down on computation time, these approaches simply search for exact word matches, and only at a later stage attempt to separate false positive alignments.
The main parameter which the user must set is the hash length; Velvet can generally derive all the other parameters once this one is given. One approach is to use scripts such as VelvetOptimiser (cf. Support Protocol 2) to scan a set of possible values. However, it is possible to determine a near-optimal hash length with some simple estimates.
The hash length, sometimes referred to as k-mer length or word length, corresponds to the length of the words which are stored and compared in the construction of the de Bruijn graph. The hash length must be an odd number which is shorter than most reads in the dataset. All reads shorter than the hash length are simply ignored throughout the assembly. Finding the optimal hash length is essentially striking a compromise between sensitivity and specificity.
On one hand, the longer the hash length, the less likely words will be exactly repeated in the genome being sequenced. Similarly, spurious errors are less likely to overlap with other sequences if the word length is increased, making them easier to detect and remove. A large hash length therefore leads to a simpler de Bruijn graph with fewer tangles and fewer errors.
On the other hand, the longer the hash length, the fewer times each word will be observed in the dataset. Neglecting errors and repeats, the expected number of times a k-mer is observed in a sequencing dataset, or k-mer coverage $C_k$, can be derived from the common coverage depth $C$, the average read length $L$, and the hash length $k$, using the following formula:

$$C_k = C \times \frac{L - k + 1}{L}$$

As the hash length $k$ gets close to $L$, $C_k$ can drop quickly, thus creating more coverage gaps in the assembly. A short hash length therefore gives the assembler more sensitivity to fill in gaps.
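For example, with a coverage depth of 50×, 100 bp reads, and a hash length of 31 (illustrative values), the k-mer coverage would be:

$$C_k = 50 \times \frac{100 - 31 + 1}{100} = 35$$

i.e., a k-mer coverage of 35×.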
Experience shows that optimal results are obtained when k is adjusted so that $C_k$ is close to 15×. The function which associates a contig N50 to each hash length generally has a very sharp bell shape. It is therefore usually quite easy to find the optimal value within a few iterations.
It is quite common to observe the N50 rising steadily as the hash length approaches the maximum length allowed by Velvet. If that is the case, you simply need to re-compile Velvet with a higher MAXKMERLENGTH parameter (cf. Support Protocol 1).
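To illustrate such an iteration, here is a minimal sketch which assembles the same reads with several candidate hash lengths, assuming a single short-read FASTQ file named reads.fastq (a placeholder name); the N50 values reported by velvetg on the standard output can then be compared:

for k in 21 25 29 31; do
    velveth assembly_$k $k -fastq -short reads.fastq
    velvetg assembly_$k
done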
Two more important parameters are the expected coverage and coverage cutoff. They can be determined automatically by Velvet. This behavior is triggered by adding the following options to the velvetg command line (see Basic Protocol):
velvetg directory (…) -exp_cov auto -cov_cutoff auto
Generally, this process works smoothly, but it is worth keeping an eye on Velvet's estimates. If for some reason they are flawed, then the user should override them manually.
The expected coverage is crucial, as it allows Velvet to determine which contigs correspond to unique regions of the genome and which correspond to repeats. As explained above, the expected coverage can be estimated using the probabilistic formula given earlier. However, that formula does not take into account the error rate, which decreases the effective coverage. It is generally more reliable to observe the coverage distribution after a first assembly, then enter the expected coverage manually on the velvetg command line.
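One way to observe this distribution is to look at the stats.txt file produced in the output directory after a first run. As a sketch, assuming short-read coverage is stored in the sixth tab-delimited column (check the header line of your stats.txt, as column positions may vary between Velvet versions), a crude histogram can be obtained with:

cut -f 6 directory/stats.txt | tail -n +2 | sort -n | uniq -c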
The coverage cutoff is a simple but effective way of removing many basic errors which passed the previous filters. It simply removes contigs whose average coverage falls below a given threshold. The threshold must therefore be set so that it removes more errors than genuine sequence. If the coverage cutoff is too high, correct contigs are removed from the assembly, potentially creating misassemblies. If determined automatically, this cutoff is simply set to half the expected coverage.
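For example, if the coverage distribution of a first assembly peaks around 35× (a hypothetical value), the expected coverage and coverage cutoff could be set manually as follows:

velvetg directory (…) -exp_cov 35 -cov_cutoff 17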
The main obstacle encountered by Velvet users is lack of memory. Depending on the system, the computer will either slow down significantly as it starts swapping memory onto the hard drive, or Velvet will simply stop, printing a short message on the standard output. A number of steps can be used to fit a dataset onto the available hardware.
The first step consists in eliminating unnecessary redundancy in the dataset. Velvet typically requires 15× k-mer coverage to function properly, but higher coverage is not necessarily beneficial to the assembly stage. It is therefore possible to temporarily discard part of the dataset and only assemble the remaining reads. The entire dataset can then be aligned to the assembled contigs to obtain variation data.
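As a minimal sketch of this subsampling, assuming a standard four-line-per-record FASTQ file named reads.fastq (a placeholder name), the first two million reads can be set aside for assembly as follows:

head -n 8000000 reads.fastq > subset.fastq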
The second step consists in eliminating low quality information. Erroneous sequences do not generally overlap with each other, so each individual error creates its own independent data structures. The error rate therefore has a significant impact on the memory footprint of Velvet. If memory is an issue but coverage is sufficient, you can trim the reads according to quality, to conserve a smaller but more reliable dataset.
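As one possible sketch, the fastq_quality_trimmer utility from the FASTX-Toolkit (one tool among many; refer to its documentation for the exact behavior of these flags) can trim low-quality read tails and discard reads that become too short:

fastq_quality_trimmer -Q 33 -t 20 -l 30 -i reads.fastq -o trimmed.fastq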
Finally, it is useful to take into account the effect of the hash length on memory usage. The closer the hash length is to the optimal value, the more compact the de Bruijn graph and the lower the memory consumption. A proper setting of the hash length can therefore help fit a dataset onto the available computer.
Another issue is misassemblies. In many cases, these can be linked to the parameterization. For example, the automatic measurement of coverage or insert length can fail on highly fragmented datasets, in which case it is necessary to enter the corresponding values manually (check the standard output on the screen for the relevant information). Another issue is the specificity of long-read or paired-end information. At high coverage, it may be necessary to raise the -min_pair_count or -long_mult_cutoff thresholds to filter out spurious connections between contigs. These parameters determine how many short-read pairs or long reads are necessary to connect two contigs.
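For example, to require at least ten read pairs or four long reads (hypothetical thresholds) before two contigs are connected:

velvetg directory (…) -min_pair_count 10 -long_mult_cutoff 4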
By default, Velvet automatically estimates the insert length of the different libraries present in the assembly. This approach relies on paired reads which land on the same contig, thus introducing an interval-censoring bias. In other words, if the expected insert length is longer than most contigs, then only read pairs with a short insert length will be observed on the same contig. This is why it is recommended to check the estimates as they are printed on the standard output. If you have prior knowledge indicating that these estimates are wrong, you can simply override them by setting the average insert length manually. For example, to set the insert length of the third library to 3200 bp:
velvetg directory (…) -ins_length3 3200
If you manually provide this insert length, then Velvet uses this value and assumes that the standard deviation is 10% of the expected value. If you are using several insert libraries and you know that the standard deviation of one library's insert length is larger relative to its expected value, then you can indicate this to Velvet. For example, if the above library has an insert length standard deviation of 1000 bp, then the command line becomes:
velvetg directory (…) -ins_length3 3200 -ins_length3_sd 1000
You can provide as many insert lengths and standard deviations as there were libraries (or categories) provided to velveth, changing the flag’s number accordingly: -ins_length2, -ins_length5_sd, etc.
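For example, for a first library with 300 bp inserts and a second library with 3200 bp inserts and a standard deviation of 1000 bp (all hypothetical values):

velvetg directory (…) -ins_length 300 -ins_length2 3200 -ins_length2_sd 1000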
If you are using pre-assembled contigs as long reads, then it is necessary to lower the -long_mult_cutoff parameter to 0. This is because, to reduce noise, Velvet requires a minimum number (by default 2) of long reads to connect two contigs before they are merged. In the case of pre-computed contigs, these “long reads” do not overlap each other, and the above constraint would prevent Velvet from using them to resolve repeats.
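As a sketch, assuming pre-assembled contigs in contigs.fa, short reads in reads.fastq (both placeholder names), and a hash length of 31 (hypothetical), the corresponding commands would look like:

velveth directory 31 -fasta -long contigs.fa -fastq -short reads.fastq
velvetg directory (…) -long_mult_cutoff 0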
You can also request extra outputs from Velvet to suit your pipeline. For example, if you wish to know which reads were not used in the assembly, then add the following flag to the command line:
velvetg directory (…) -unused_reads yes
This will produce an UnusedReads.fa file in the working directory, containing all the reads which were left out of the assembly.
If you wish to visualize the assembly, you can request the production of an AFG assembly file called velvet_asm.afg, which can be handled or converted by a number of tools (see Internet Resources):
velvetg directory (…) -amos_file yes
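This file can then, for instance, be loaded into an AMOS bank and browsed with the Hawkeye viewer (a sketch assuming the AMOS package is installed):

bank-transact -m directory/velvet_asm.afg -b directory/velvet.bnk -c
hawkeye directory/velvet.bnk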