Violations of the five basic rules described in the Rationale are most commonly caused not by mis-assemblies, but by statistical variation or errors in the underlying data provided to the assembler. The high-throughput biochemical processes used to sequence genomes are error-prone, leading to non-random coverage across the genome, sequencing errors, and mis-paired reads. Furthermore, experimental measurements (for example, mate-pair sizes) are inherently noisy. Separating such experimental artifacts from errors introduced by mis-assemblies is one of the main requirements of a robust validation pipeline. To reduce the effect of these errors on the analysis, multiple sources of evidence must be combined to increase the specificity of mis-assembly detection. In addition, certain types of mis-assembly can only be detected by specific methods, while the sequencing strategy employed may restrict the types of information that can be used for validation (for example, many emerging sequencing technologies do not yet generate mate-pair information). In the remainder of this section we describe our approach for assembly validation based on several measures of assembly consistency. We will describe the types of mis-assemblies detected by each of the measures and conclude with examples of how these measures are integrated to reveal potential assembly errors.
The mate-pair validation component of the pipeline separately identifies the four types of mis-mated reads: mates too close to each other; mates too far from each other; mates with the same orientation; and mates pointing away from each other. Reads with mates not present in the assembly or whose mates are present in a different contig are also reported. In order to reduce the impact of noise in the underlying data, multiple mate-pair violations must co-occur at a specific location in the assembly before reporting the presence of an error. In addition, the CE statistic described in the Rationale aids in the identification of clusters of compressed or expanded mate-pairs.
The actual size of shotgun libraries is sometimes mis-estimated by sequencing centers; therefore, a mechanism to re-estimate the library parameters on the basis of mate-pairs that are co-assembled within a contig is required. Reads that occur too close to the end of a contig may bias the distribution in favor of short mate-pairs (the mate-pairs at the upper end of the distribution would fall beyond the end of the contig and, therefore, not contribute to the calculations) and are thus ignored. Specifically, we ignore every read that is closer than μ + 3σ from the end of the contig when re-estimating the parameters of a library with mean μ and standard deviation σ. It is often necessary to iterate this process a few times until convergence. The size of a library is re-estimated only if the size of a sufficient number of mate-pairs can be estimated and only if either the mean or the standard deviation change significantly from the original estimate.
In addition to mate-pair violations, regions of inadequate depth of coverage are identified, as well as regions that are not spanned by any valid mate-pair (that is, 0X fragment coverage). The latter may represent situations where non-adjacent regions of the genome were co-assembled across a repeat. When computing fragment coverage we exclude from consideration the paired reads sequenced from each fragment. This is necessary in order to make the distinction between read and fragment coverage at a specific location. By our definition, the read coverage cannot drop below one within a contig, but the fragment coverage can be as low as zero, indicating the absence of long-range support for this region of the contig. At the typical depths of read coverage used in sequencing, each location in the genome is generally well covered by mate-pairs.
Most mis-assemblies are caused by repeats; therefore, understanding the repeat structure of a genome can aid in the validation of its assembly. Some repeats can be found by aligning the assembled contigs against each other and identifying duplicated regions. Tools like Vmatch [29
] and Tandem Repeat Finder [30
] can be used for the de novo
identification of repetitive regions in the assembly, which can then be examined for correctness. This approach, however, is not appropriate for all types of mis-assemblies. For example, the complete collapse of a two copy tandem repeat into a single copy cannot be detected by comparative means.
For validation purposes we are not simply interested in identifying the location of all repeats, rather we are trying to identify those repeats that have been assembled incorrectly, in particular those repeats that cannot be easily identified through comparative analysis. Specifically, we try to identify regions of the genome that are over-represented in the set of reads, yet appear unique when examining the consensus sequence generated by the assembler. We achieve this by comparing the frequencies of k-mers (k-length words) computed within the set of reads (KR) with those computed solely on the basis of the consensus sequence (KC). KR is the frequency of all k-mers inside the clear range of all reads; and KC is the frequency of all k-mers across the consensus sequence of the assembled contigs. The forward and reverse complements of each k-mer are combined into a single frequency. The normalized k-mer frequency, K* = KR/KC, is computed for each k-mer in the consensus, where a deviation from the expected K* (in a correctly assembled region, K* should approximately equal the average depth of coverage c) reveals those repeats likely to be mis-assembled. For example, KR measured across a two copy repeat is 2c regardless of whether the assembly is correct or not. If the repeat is correctly assembled into two distinct copies, KC = 2, and, therefore, K* = c. If instead the repeat is collapsed, then KC = 1 and K* = 2c, indicating the presence of a mis-assembly. This approach is particularly powerful when used in conjunction with the technique described below for identifying dense clusters of SNPs because the two methods are complementary. SNP based detection will find collapsed, heterogeneous repeats, while K* will reveal collapsed, identical repeats.
As described in the introduction, the collapse of a repeat results in an increase in the depth of coverage. This characteristic signature can, therefore, be used to detect the presence of mis-assemblies. For short repeats with low copy number (for example, two-copy repeats), this effect cannot be distinguished from the variation in coverage caused by the randomness of the shotgun sequencing process, limiting the applicability of this method to repeats that occur in many copies throughout the genome, or to relatively long stretches of repetitive DNA (sustained deviations from the average depth of coverage are unlikely to occur by chance). The significance of observing a certain level of over-representation, given the parameters of the shotgun process, can be calculated through statistical means (see the A-statistic used by Celera Assembler [12
Identification of micro-heterogeneities
Under the assumption of a random distribution of sequencing errors, and an independent random sampling of the genome during the shotgun process, it is unlikely that any two overlapping reads have sequencing errors at the same consensus position. While there are several examples of sequence-dependent sequencing errors that invalidate our assumption of independence between errors occurring in different reads (for example, hard-stops caused by the formation of DNA hair-pin structures, or long homopolymer regions characterized by frequent polymerase slippage), these assumptions are true for the vast majority of sequencing errors. Also, the following discussion assumes the genome being sequenced represents a single clonal organism. The assembly of non-clonal bacterial populations or heterozygous eukaryotes is characterized by frequent heterogeneities between co-assembled reads. Such situations are often known a priori and the validation pipeline can be adjusted accordingly.
As described in the introduction, mis-assemblies often result in the presence of micro-heterogeneities (SNPs) that are correlated across multiple overlapping reads. Identifying such polymorphisms can, therefore, indicate potential errors in the assembly. To identify mis-assembly induced SNPs, and distinguish them from simple sequencing errors, we take advantage of the base quality values provided by the sequencing software. The phred
quality values [31
], for example, represent the log-probability of error at every base in the sequence. Under the assumption of independence of errors across reads, we can sum these values to estimate the probability of observing multiple correlated errors at a specific location in the assembly, and mark as polymorphism those locations where this probability exceeds a specific threshold. For example, the probability of error for two reads reporting the same base, each with a quality value of 20, is equivalent to the probability of error for a single base with a quality value of 40 (P
(error) = 1/10,000). This is, in essence, the same approach used by genome assembly software in assigning quality values for the consensus sequence [32
]. For each heterogeneous column of the multi-alignment, reads are grouped into 'alleles' by which nucleotide they report. The quality values for each read in an allele are summed, and if two or more alleles have a quality value of 40 or greater (by default), the difference is marked as a SNP. For a concrete example, if two reads report a C
each with quality 25, and three reads report a G
each with quality 20, the qualities of the alleles are 50 and 60, respectively, and the difference is marked as a C/G
SNP. If, however, the quality of either allele is below 40, the difference is not marked as a SNP. In addition, our software evaluates the proximity of SNPs to further increase the confidence in our predictions; clusters of SNPs that occur within a small range in the assembly are likely indicative of a mis-assembly. By default we mark regions containing at least 2 high quality SNPs occurring within a 500 bp window.
Note that this technique for mis-assembly detection can also be applied in heterogeneous genomes, for example, by identifying regions with a significantly higher SNP density than the background rate. In such genomes, however, we expect much higher false-positive rates due to localized regions of heterogeneity, requiring this method to be combined with other validation measures.
Read breakpoint analysis
The reads provided to an assembler must be consistent with the resulting assembly. Thus, examining how the un-assembled reads (also called singletons, or shrapnel) disagree with the assembly can reveal potential mis-assemblies. To compare un-assembled reads to a consensus we use the nucmer
component of the MUMmer package [33
], and allow fragmented alignments to the consensus. For instance, a mapping that aligns the first half of a read to a different region than the second half, but at 100% identity, is preferable to a mapping that aligns the read contiguously at 80% identity. The fragmented, high identity alignment is more likely because the read sequence should be nearly identical to the consensus sequence, modulo sequencing errors. From among all alignments of a read to the genome we choose the placement that maximizes the sum of len(Ai)
over all alignment segments Ai
, where len(Ai)
are the length and percent identity of the ith
segment of alignment A
, and len(Ai)
is adjusted where necessary to avoid scoring the overlap between adjacent segments twice. This scoring function estimates the number of non-redundant bases matching the consensus, and the MUMmer utility delta-filter
computes an optimal alignment using this function and a modified version of the Longest Increasing Subsequence (LIS) algorithm [35
]. Most mappings consist of a single alignment that covers the entire read, while the fragmented mappings indicate either incorrect trimming of the read or the presence of a mis-assembly.
For fragmented alignments, the locations where the alignment breaks - boundaries of alignment fragments that do not coincide with the ends of the read - are called 'breakpoints'. Under the assumption that all reads map perfectly to the assembly, breakpoints indicate the presence of errors, either in the assembly, or in the reads themselves (for example, incomplete trimming, or chimeric fragments). Breakpoints supported by a single read are rarely cause for concern, and can often be explained by errors in the reads themselves. However, multiple reads that share a common breakpoint often indicate assembly problems. These multiply supported breakpoints are identified, after the alignment process described in the previous section, by sorting the boundaries of fragmented alignments by their location in the consensus, and reporting those that occur in multiple reads. In addition, for each read we store a vector of coordinates encoding all breakpoints in the alignment of the read to the genome. This vector allows us to determine not only if two reads share common breakpoints, but also if they have similar mappings to the consensus. For each breakpoint, we then examine the cluster of reads with similar alignment signatures to characterize different classes of mis-assemblies in much the same way mate-pairs are used to characterize collapse, inversion, and so on. But while mate-pair and coverage methods can only bound a mis-assembly to a certain region, breakpoints can identify the precise position in the consensus at which the error occurs.
Integration of validation signatures
Our validation pipeline, amosvalidate, executes the analyses described above to tag regions that appear mis-assembled. Independently, each analysis method may report many false-positives that reflect violations of the data constraints, but that do not necessarily represent mis-assemblies or incorrect consensus sequence. A common example is clusters of overlapping stretched or compressed mate-pairs caused by a wide variance in fragment sizes rather than mis-assembly. By combining multiple mis-assembly signatures we increase the likelihood that the tagged regions identify true errors in the assembly. For example, a region with a largely negative CE value is more likely to indicate the presence of a collapsed repeat if an unusually high density of correlated SNPs is also present. This particular combination is especially strong, since mate-pair and sequence data are independent sources.
Since some types of signatures do not necessarily tag the exact location of a mis-assembly, combining mis-assembly signatures requires considering not only overlapping signatures, but also those that occur in close proximity. To combine mis-assembly signatures, the pipeline identifies regions in the assembly where multiple signatures co-occur within a small window (2 Kbp by default). If multiple signatures of at least two different evidence types occur within this window, the region is flagged as 'suspicious'. Each such region is reported along with detailed information about the individual signatures, and forms the initial focus for subsequent validation and correction efforts. For manual analysis, these regions, along with the individual mis-assembly features, can be viewed alongside the assembly data in the AMOS assembly viewer, Hawkeye.
The validation modules of amosvalidate
are implemented in C++ and included as part of the AMOS assembly package [36
]. AMOS is a modular, open-source framework for genome assembly research and development, which provides integration between software modules through a centralized data store and a well defined API. This framework allows developers to focus on a particular area of interest, for example, scaffolding, without needing to develop a complete assembly infrastructure. Furthermore, AMOS can import data from common assembly programs and formats - ACE, NCBI Assembly/Trace Archives [37
], Arachne [38
], Celera Assembler [12
], PCAP [40
], Phrap [41
], Phusion [42
] and Newbler [43
], allowing for the integration of AMOS modules into existing assembly pipelines.