PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2009 May 1; 25(9): 1118–1124.
Published online 2009 March 5. doi:  10.1093/bioinformatics/btp131
PMCID: PMC2732307

A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads

Abstract

Motivation: Novel high-throughput sequencing technologies pose new algorithmic challenges in handling massive amounts of short-read, high-coverage data. A robust and versatile consensus tool is of particular interest for such data since a sound multi-read alignment is a prerequisite for variation analyses, accurate genome assemblies and insert sequencing.

Results: A multi-read alignment algorithm for de novo or reference-guided genome assembly is presented. The program identifies segments shared by multiple reads and then aligns these segments using a consistency-enhanced alignment graph. On real de novo sequencing data obtained from the newly established NCBI Short Read Archive, the program performs similarly in quality to other comparable programs. On more challenging simulated datasets for insert sequencing and variation analyses, our program outperforms the other tools.

Availability: The consensus program can be downloaded from http://www.seqan.de/projects/consensus.html. It can be used stand-alone or in conjunction with the Celera Assembler. Both application scenarios as well as the usage of the tool are described in the documentation.

Contact: ed.nilreb-uf.fni@hcsuar

1 INTRODUCTION

Gene prediction, genome comparison or phylogenetic studies rely on accurate reference sequences. These sequences are assembled in sequencing projects that require a robust and versatile consensus tool. Most of the present consensus tools are an integral part of fragment assemblers such as the Celera (Myers et al., 2000), Arachne (Batzoglou et al., 2002) or Atlas (Havlak et al., 2004) assembler. These assemblers follow the established three-phase assembly methodology: Overlap-Phase, Layout-Phase and Consensus-Phase. In this scenario, the layout module forwards for each contig a set of reads with their approximate positions to the consensus module. The consensus module then computes a multi-read alignment and a final consensus sequence. A multi-read alignment is, however, also required in other applications, including reference-guided genome assembly projects (Pop et al., 2004) and variation analyses (Denisov et al., 2008). A reference-guided genome assembly uses an already sequenced reference genome to assemble a new genome. The new genome might have insertions or deletions with respect to the reference genome. These insertions or deletions can be detected using mate pair information as shown in Figure 1. The mate pairs themselves can be used to infer the layout of the reads within an insertion. The paired-end libraries, however, exhibit high deviations from the mean mate pair library length and hence determining the correct layout for such an insert sequence is difficult. Furthermore, a consensus algorithm together with mate pair information can be used to bridge the gaps between a scaffold's contigs or to compute the consensus sequence of a resolved repetitive region. To support both, reference-guided and de novo genome assemblies, our proposed algorithm works in two running modes. The first one allows insert sequencing and the second one, as a component of the Celera Assembler, de novo assembly.

Fig. 1.
A newly sequenced genome with an unknown insertion with respect to a reference genome. The mapped reads (black lines) can be used to infer the layout of the mate pairs (grey lines). Mate pairs are indicated by arrows pointing to each other. From this ...

Ad hoc techniques that incorporate reads one-by-one into a growing multiple alignment are highly error-prone and inadequate for the data produced by the new sequencing platforms such as 454 Life Sciences (www.454.com), Illumina's Solexa sequencing technology (www.illumina.com) and Applied Biosystems SOLiD Sequencing (www.appliedbiosystems.com). In contrast to Sanger reads, next generation sequencing platforms have higher error rates and produce high volumes of short read, deep coverage data. Therefore, new consensus methods are greatly needed. Especially for many downstream analyses such as repeat resolution or variation analysis accurate multi-read alignments showing single nucleotide polymorphisms (SNPs), indels and sequencing errors are indispensable. Unfortunately, multiple sequence alignment programs such as Clustal W (Thompson et al., 1994), T-Coffee (Notredame et al., 2000), MUSCLE (Edgar, 2004) or MAFFT (Katoh et al., 2002) cannot be directly used to compute a multi-read alignment because they all assume globally related sequences or at least a shared local region. This is, of course, inappropriate for the alignment of reads where only about c reads overlap a given nucleotide with c being the assembly coverage. However, established multiple sequence alignment techniques such as consistency (Notredame et al., 2000) and progressive alignment (Feng and Doolittle, 1987) can still be used and are the basis of the proposed method.

2 METHODS

2.1 An alignment graph of sequence segments

Throughout the consensus construction, an alignment graph is used to represent a multi-read alignment as shown in Figure 2. Vertices represent non-overlapping sequence segments, edges connect vertices and represent ungapped aligned sequence segments and gaps are implicitly represented by the topology of the graph. For example, the GCTG vertex in Figure 2 has no outgoing edges (degree zero) and thus, it is aligned to gaps in all other sequences. The properties of an alignment graph G are given below.

  • For a set S={S0, S1, …, Sn−1} of n reads, the alignment graph G=(V={V0[union or logical sum]V1[union or logical sum][union or logical sum]Vn−1}, E) is an n-partite graph.
  • Each vertex vip[set membership]Vi represents a sequence segment in Si of arbitrary length. We also say that vip covers all positions of the segment. For instance, vip might cover the sequence segment Siu1,u2=siu1siu1+1siu2−1.
  • Every position in Si=si0si1i|Si|−1 is covered by one and only one vertex vip[set membership]Vi.
  • Three integers are associated with each vertex: (i) the sequence identifier it belongs to; (ii) the beginning of the segment; and (iii) the length of the segment.
  • An edge e={vip, vjq}[set membership]E with ij indicates that vertex vip can be aligned with vertex vjq. In other words, the sequence substring in Si covered by vip can be aligned without gaps to the substring in Sj covered by vjq.
  • The benefit of aligning vip with vjq is given by an edge-weight we.

The alignment graph is a compact and versatile description of an alignment. Large-scale alignments can be efficiently stored because long segments are represented by only a single vertex. Furthermore, the extension and direction of an alignment is completely defined by alignment edges. That is, the graph formulation is equally suitable to align globally related sequences or thousands of reads where only subsets are related by mutual overlaps (Fig. 3). The algorithm to convert an alignment graph G into an ordinary alignment matrix is given below.

  1. Compute the connected components Ci of G. If G is an actual alignment, each Ci[set membership]C={C0, C1, …, Ck−1} has at most one vertex from every sequence.
  2. A directed component graph GC is constructed with GC=(VC={vC0, vC1, …, vCk−1}, EC). A directed edge e=(vCu, vCv)[set membership]EC with uv is inserted into the graph if and only if there is a vertex in component Cu that precedes a vertex in component Cv in one of the sequences. Informally, the edges in GC mirror the order of the vertices along a sequence.
  3. A topological sort is computed on the vertices of GC. Note that this operation does not impose an order on adjacent indels and thus, there is a many-to-one relationship between alignment matrices and alignment graphs.

Besides representing actual alignments, the graph can also be used to store arbitrary match information as illustrated in Figure 4. The set of edges E is now solely a set of possible alignment edges and only a proper subset E[subset or is implied by]E constitutes a valid alignment. This subset E′ is called a trace (Kececioglu, 1993; Sankoff and Kruskal, 1983). The best alignment is the set of edges of maximum weight (maximal trace). In the multiple case, this problem is known to be NP-hard (Wang and Jiang, 1994) and the optimal dynamic programming algorithm has exponential complexity in the number of sequences. In the Section 2.2, we present an efficient and accurate heuristic to compute such a trace for an alignment graph of thousands of reads.

Fig. 2.
An alignment graph and the corresponding alignment matrix for three reads. Vertices represent non-overlapping sequence segments, edges represent un-gapped aligned sequence segments and gaps are implicitly represented by the topology of the graph.
Fig. 3.
The alignment on the left shows globally related sequences whereas the one on the right shows a simplified multi-read alignment. Note that the direction of the alignment is solely dependent on the edges.
Fig. 4.
A general alignment graph of two sequences with weighted match information. Only a subset of the edges can be realized in an alignment.

2.2 Multi-read alignment algorithm

The multi-read alignment algorithm has four processing steps. step 1: computation of pairwise overlap alignments; step 2: alignment graph construction; step 3: consistency extension; and step 4: a graph-based progressive alignment.

Step 1. The input of the algorithm is a set of reads in FASTA format with their estimated begin and end positions. These estimated layout positions are either forwarded from the layout module of a de novo assembler or inferred from mate pair information (Fig. 1). In the first case, the positions are estimated during the layout computation of the assembler and thus, these positions tend to be very accurate. In the second case, the layout positions are derived from a paired-end library and hence, they deviate strongly from the true positions depending on the standard deviation of the paired-end library. Because of these differences, our proposed algorithm works in two running modes.

The first running mode assumes accurate start and end positions. These positions are used to detect potential overlaps and to estimate the alignment diagonal of a banded overlap alignment with affine gap costs (Gotoh, 1982). A banded alignment initializes the dynamic programming matrix with zeros and computes only a band of size k around the estimated alignment diagonal where k is the bandwidth. Note that the bandwidth k depends on two key characteristics—the sequencing error rate and the length of the overlap. Hence, the baseline for k is a configurable parameter that is adjusted linearly based on the length of the overlap. Usually, it is unnecessary to compute all possible pairwise overlaps, especially for deep coverage sequencing projects. For that reason, we provide a parameter that adjusts how many overlaps are computed per given read. The more error-prone the reads are, the more overlaps one should compute per read. The second running mode assumes layout positions estimated from a paired-end library. In this case, the positions tend to be unreliable and hence, all overlaps of reads in a user-defined window are computed with a standard dynamic programming algorithm.

Subsequent to the computation of overlap alignments, we select all overlaps of significant quality and length. Similar to the bandwidth, both parameters are adaptable from the command line. The selected overlaps are then subdivided into ungapped alignments, called segment matches. A segment match Mij=(Siu1,u2, Sjw1,w2) represents an ungapped alignment of Siu1,u2=siu1siu1+1···siu2−1 with Sjw1,w2=sjw1sjw1+1···sjw2−1.

Step 2. All collected segment matches x2133 are used to construct an alignment graph. Unfortunately, segment matches can overlap and intersect each other. This violates the alignment graph property that sequence segments are covered by one and only one vertex. For instance, the set x2133 might contain two matches Mij and Mjk where Mij=(Siu1,u2, Sjx2,x4) and Mjk=(Sjx1,x3, Sky1,y2). Then Mij and Mjk intersect each other in sequence Sj if x1<x2<x3<x4 as illustrated in Figure 5. To avoid a greedy method that would choose one of the conflicting matches, we implemented a multiple segment match refinement algorithm (Rausch et al., 2008). This algorithm subdivides overlapping segment matches into distinct submatches so that no segment match begins or ends within another segment match as shown in Figure 5. This procedure ensures that we can ultimately divide each sequence into non-overlapping sequence segments so that every segment match begins or ends on the boundary of a sequence segment. We then define vertices for each sequence segment and we connect two vertices v1 and v2 with an edge e={v1, v2} if and only if there is a segment match between the corresponding sequence segments. The weight of this edge depends on the quality of the segment match.

Fig. 5.
The segment match refinement algorithm cuts recursively all overlapping segments until all segments are disjoint. In the above example, the two overlapping segment matches are refined into four distinct matches.

Step 3. The initial alignment graph was constructed using solely pairwise sequence alignment information. We now extend the graph to incorporate multiple alignment information using the triplet extension from T-Coffee (Notredame et al., 2000). The triplet extension is one example of a consistency method (Gotoh, 1990) and relies on the following observation—the two segment matches Mij=(Siu1,u2, Sjx1,x2) and Mjk=(Sjx1,x2, Sky1,y2) induce a putative transitive segment match Mik=(Siu1,u2, Sky1,y2) that is either consistent or inconsistent with a segment match from the precomputed pairwise alignment of Si and Sk. If it is consistent, greater confidence in this segment match is established. In terms of the alignment graph, we thus traverse all pairs of alignment edges {eij={vip, vjq}, ejk={vjq, vkr}} that share a common vertex. If the transitive edge eik={vip, vkr} is present we found a three-way clique and add to weik the minimum weight of the other two edges min(weij, wejk). If the transitive edge is not present, we create it with the minimum weight of the other two edges.

Step 4. In the last step, this consistency-enhanced alignment graph is used to progressively align the reads according to a guide tree (Fig. 6). This guide tree is constructed from the overlap alignment scores using a sparse distance matrix and the fast UPGMA algorithm (Sokal and Michener, 1958). A sparse distance matrix is used instead of an ordinary matrix because for each read only c other reads are expected to overlap where c is the assembly coverage. The guide tree ensures that the best quality overlaps are aligned first whereas the difficult and error-prone overlaps caused by reads with many sequencing errors come in late, when partial alignments along subtrees are already quite large and fixed. The progressive alignment itself is completely independent of the nature of the sequences. Given an input alignment graph, it builds a multiple alignment along the guide tree simply by aligning strings of vertices. In the pairwise case, we can directly apply the heaviest common subsequence algorithm (Jacobson and Vo, 1992) to compute the maximal trace. If we progress up in the guide tree, we have profiles of vertices. We can always project these profiles onto a string of nodes and estimate the edge weights between the profiles from the original edge weights by taking the average weight. At each internal vertex of the guide tree, we thus compute another heaviest common subsequence of these profiles. The procedure is summarized in Figure 6. On average, the final profile will have about c vertices at each position where c is again the coverage. Thus, the amount of required memory depends on the coverage and the source sequence length. It is, however, largely independent of the number of reads. This is a key distinction of our method to current multiple sequence alignment tools where the profiles grow linearly with the number of sequences. In a final post-processing step, we compute the consensus sequence and convert the final profile into a multi-read alignment. The consensus sequence can be computed using a simple majority vote in each column or using a Bayesian method (Churchill and Waterman, 1992). The final multi-read alignment can be visualized in Hawkeye (Schatz et al., 2007) or it can be written to a simple text file.

Fig. 6.
The progressive alignment proceeds along a guide tree shown on the right. On the left is the original alignment graph with edge-weights indicating the quality of a match. Next to each internal guide, tree node a profile of the already aligned subtree ...

3 RESULTS

In the introduction, we proposed the applicability of our tool in three main scenarios: (i) as the consensus tool in a de novo assembly; (ii) as a consensus tool in a reference-guided genome assembly for insert sequencing; and (iii) as a prerequisite for variation analyses.

3.1 Consensus generation during de novo assembly

On simulated data, the true source sequence is known in advance and consensus errors can be counted and compared among different tools. We tried to compare our tool to as many as possible present consensus tools. Unfortunately, most assemblers are monolithic and were not designed to run their consensus algorithm as a stand-alone process. Either because of this lack of modularity or because of a lack of documentation, we did not succeed for the Atlas (Havlak et al., 2004), PCAP (Huang et al., 2003) and Phusion assembler (Mullikin and Ning, 2003). However, we could run the consensus module of the Minimus assembler (Sommer et al., 2007) from the AMOS consortium (http://amos.sourceforge.net) and the consensus module of the Celera Assembler (Myers et al., 2000). Reads were simulated under various settings and different error rate assumptions. The settings and results are summarized in Table 1. For the old Sanger-style reads, the difference in the results is small and our tool outperforms the other tools only for high error rates. The AMOS consensus module performs very well on reads of length 200 even for a relatively high coverage. For these kind of reads, our own tool delivers similar results but is not as fast as the AMOS consensus module. This increased running time can be seen in all experiments and is caused by the pairwise banded overlap alignments and the graph-based progressive alignment. However, it clearly pays off in terms of consensus quality, especially in the case of very short reads of length 35.

Table 1.
Consensus generation during de novo assembly: results of a simulation study for consensus computation

As mentioned in the Section 1, the consensus program was also integrated in the Celera Assembler as an optional consensus module. To evaluate this integration, a short read assembly was carried out with data from the new NCBI Short Read Archive (SRA, www.ncbi.nlm.nih.gov/Traces/sra/). The 454 sequencing data of the P. gingivalis W83 strain (Accession: SRA001027) was obtained. It included two mated (Accession: SRR001351) and two non-mated read datasets (Accession: SRR001352) from a 454 FLX half-plate. The assembler was tested on all four datasets E8YURXS01, E8YURXS02 (mated reads), E9T0MN001 and E9T0MN002 (non-mated reads) separately. We also obtained the provisional reference sequence of P. gingivalis W83 from NCBI (Accession: NC_002950). In Table 2, we compared the assembly statistics of the Celera Assembler using the default consensus with the version of the Celera Assembler using the new consensus module. The results of both versions are very similar over all datasets. The number of contigs slightly differs because both tools are used twice during unitig consensus and scaffold consensus. Aligning the largest contig from each assembly with the provisional reference sequence revealed, however, minor differences. We counted these differences and report the results in Table 2. Unfortunately, the 454 sequencing data was generated from a different organism of the same strain than the original assembly, so it remains unclear if these differences are actual consensus errors, true polymorphisms or errors in the provisional reference sequence. For a few differences, we inspected the multi-read alignment manually. Some of the common differences were in well-assembled, non-repeat areas and hence, they are probably true polymorphisms. Most of the discrepancies between the two consensus methods occurred in low-coverage regions where two nucleotides occur equally often.

Table 2.
Porphyromonas gingivalis W83: results for all four sequencing datasets

3.2 Insert sequencing

In Figure 1, we illustrated the insert sequencing scenario. In contrast to de novo assembly, the layout positions in this case are rather imprecise, because they are derived from mate pair information. To simulate such a scenario, we assumed all mate pairs were sequenced from a fragment of a given length that was sampled from a Normal distribution N(μ,σ). In Table 3, we show the results for different parameters of the Normal distribution. Since the layout positions are now unreliable, we compute all pairwise overlaps in a given window and rely on the consistency-enhanced alignment graph to pull the reads into the right position. The results strongly support the assumption that this procedure is more robust than classical consensus approaches. This could be confirmed by aligning the consensus sequence of each tool with the source insert sequence using MUMmer (Kurtz et al., 2004) and the NUCmer program (Delcher et al., 2002). Two example alignments are shown in Figure 7. These results confirm the assumption that the consistency extension seems to be superior to the other approaches because it differentiates between random overlaps among two reads and overlaps confirmed by multiple reads. Table 3 also clearly indicates that the insert sequencing task becomes more difficult for short reads because they induce an increasing number of random overlaps.

Fig. 7.
Alignment of each consensus with the source insert sequence where coverage >2. Straight lines indicate matching segments and line endpoints are circled. Errors and gaps at the beginning and at the end of the source insert sequence are due to an ...
Table 3.
Insert sequencing: given a set of mapped reads and a set of unmapped mate pairs we can approximate the positions of the unmapped mate pairs from the library size and standard deviation parameters

3.3 Variation analyses

A mere high-quality consensus without an accurate multi-read alignment, where all the sequencing errors and DNA polymorphisms can be readily identified, is insufficient for a sound variation analysis. Several applications such as separating haplotypes, calling SNPs, or repeat resolution rely on the multi-read alignment itself. The difficulty for the algorithms is that besides sequencing errors, now source sequence variation further complicates the problem. For the Taeniopygia guttata (zebra finch) mitochondrion the NCBI Genome database provides four submitted haplotypes (Accessions: DQ453512 - DQ453515). Although there is a reference genome for this species (Accession: NC_007897), we took the four haplotypes to sample reads in order to test our algorithm in case of sequence variation. To quantify the amount of variation, we first aligned all four haplotypes. Because of high-sequence similarity, this could be easily done with MAFFT (Katoh et al., 2002). The final multiple sequence alignment of the four haplotypes revealed 104 SNP locations and one insertion. We then simulated reads of length 200 with 2% error rate from each single haplotype at 5-fold coverage. All reads were combined in a final testing set. On this set, all consensus tools computed a mixture of the four haplotypes as the consensus sequence. We then inspected the multi-read alignments at the SNP locations manually to identify potential misalignments. In Table 4, the multi-read alignments of all tools are shown for a short segment with two SNP. Both the Seq-Cons and AMOS-Cons alignments allow a simple column-based SNP identification and consensus calling, since all alleles of the SNP ended up in the same column. The CA-Cons alignment requires a more sophisticated approach to correctly call the consensus sequence: the haplotype confirmed by the largest number of ungapped reads is picked (Denisov et al., 2008). The algorithm used by CA-Cons may have problems if the read error rate is very high and every haplotype is confirmed by only one read.

Table 4.
Multi-read alignment: a clipped view of a multi-read alignment with two SNPs (bold font)

4 DISCUSSION

In contrast to the monolithic assemblers, our own approach favors a highly modular design. All components of the algorithm are part of the generic SeqAn library (Döring et al., 2008) for sequence analysis and individual components can be exchanged. Current work, for instance, focusses on the development of a new overlap computation method based upon seed and extend algorithms. The implemented dynamic programming-based solution has a quadratic runtime for each pair of overlapping reads. An index based all against all comparison is significantly faster in practice. Building the index takes a total of O(n) time where n is the total length of all reads. Then a q-gram filter such as SWIFT (Rasmussen et al., 2006) can be used to efficiently identify potential overlaps. The runtime of this filter depends on the size of q and this in turn depends on the sequencing error rate. The verification phase simply extends the identified seeds from the filtration phase and checks whether the computed overlap is above a user-defined quality and length threshold.

The proposed consensus algorithm might also be helpful to determine expressed sequence tags (EST) consensus sequences (Malde et al., 2005). For this kind of application, the tool should be integrated into a complete EST analysis pipeline that includes other important steps such as EST fragment clustering and splice variant detection. From our own experience of integrating our tool into the Celera Assembler, we believe that such an integration is rather simple given the modular design and simple interfaces of our tool. Additionally, the tool could also be used for realignment of a set of given contigs similar to the ReAligner program (Anson and Myers, 1997).

The quality of the multi-read alignment provides several opportunities for future research. First, the alignment lends itself for an accurate genetic variation analysis, including an improved SNP calling or the retrieval of multiple consensus sequences for polyploid organisms (Denisov et al., 2008). Second, the alignment might improve the performance of repeat resolution algorithms in DNA sequence assembly (Kececioglu and Ju, 2001). Third, as noticed in the Section 1, the approach is certainly fundamental for reference-guided assembly methods to infer newly inserted sequence segments with respect to a reference genome. Lastly, the approach could prove useful to close the gaps between a scaffold's contigs or to bridge small repeat regions.

ACKNOWLEDGEMENTS

We wish to thank the AMOS group at the CBCB at the University of Maryland for valuable input on the use of the AMOS library.

Funding: National Institute of General Medical Sciences (NIGMS) (grant R01-GM077117-02 to S.K. and G.D.)

Conflict of interest: none declared.

REFERENCES

  • Anson EL, Myers EW. Proceedings of the first annual international conference on computational molecular biology, RECOMB '97. New York, NY, USA: ACM Press; 1997. Realigner: a program for refining dna sequence multi-alignments; pp. 9–16.
  • Batzoglou S, et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 2002;12:177–189. [PubMed]
  • Churchill GA, Waterman MS. The accuracy of DNA sequences: estimating sequence quality. Genomics. 1992;14:89–98. [PubMed]
  • Delcher AL, et al. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 2002;30:2478–2483. [PMC free article] [PubMed]
  • Denisov G, et al. Consensus generation and variant detection by Celera Assembler. Bioinformatics. 2008;24:1035–1040. [PubMed]
  • Döring A, et al. SeqAn – an efficient, generic C++ library for sequence analysis. BMC Bioinformatics. 2008;9:11. [PMC free article] [PubMed]
  • Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. [PMC free article] [PubMed]
  • Feng D-F, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 1987;25:351–360. [PubMed]
  • Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162:705–708. [PubMed]
  • Gotoh O. Consistency of optimal sequence alignments. BMB: Bull. Math. Biol. 1990;52 [PubMed]
  • Havlak P, et al. The atlas genome assembly system. Genome Res. 2004;14:721–732. [PubMed]
  • Huang X, et al. PCAP: A whole-genome assembly program. Genome Res. 2003;13:2164–2170. [PubMed]
  • Jacobson G, Vo K.-P. Proceedings of the Third Annual Symposium on Combinatorial Pattern Matching, CPM '92. London, UK: Springer-Verlag; 1992. Heaviest increasing/common subsequence problems; pp. 52–66.
  • Katoh K, et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. [PMC free article] [PubMed]
  • Kececioglu JD. Proceedings of the Forth Annual Symposium on Combinatorial Pattern Matching, CPM '93. London, UK: Springer-Verlag; 1993. The maximum weight trace problem in multiple sequence alignment; pp. 106–119.
  • Kececioglu J, Ju J. Proceedings of the Fifth Annual International Conference on Computational Biology, RECOMB '01. New York, NY, USA: ACM Press; 2001. Separating repeats in DNA sequence assembly; pp. 176–183.
  • Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12. [PMC free article] [PubMed]
  • Malde K, et al. A graph based algorithm for generating EST consensus sequences. Bioinformatics. 2005;21:1371–1375. [PubMed]
  • Mullikin JC, Ning Z. The Phusion assembler. Genome Res. 2003;13:81–90. [PubMed]
  • Myers EW, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2196–2204. [PubMed]
  • Notredame C, et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000;302:205–217. [PubMed]
  • Pop M, et al. Comparative genome assembly. Brief. Bioinform. 2004;5:237–248. [PubMed]
  • Rasmussen K, et al. Efficient q-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 2006;13:296–308. [PubMed]
  • Rausch T, et al. Segment-based multiple sequence alignment. Bioinformatics. 2008;24:i187–i192. [PubMed]
  • Sankoff D, Kruskal JB. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Reading, MA: Addison-Wesley; 1983.
  • Schatz M, et al. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biol. 2007;8:R34. [PMC free article] [PubMed]
  • Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 1958;38:1409–1438.
  • Sommer D, et al. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007;8:64. [PMC free article] [PubMed]
  • Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. [PMC free article] [PubMed]
  • Wang L, Jiang T. On the complexity of multiple sequence alignment. J. Comput. Biol. 1994;1:337–348. [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press