DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.
We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions.
The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, thereby facilitating genome re-sequencing efforts based on this form of sequence data.
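To make the decoding problem concrete, the following is a minimal sketch of a two-base encoding, assuming the common SOLiD-style color code in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names `encode` and `decode` are hypothetical and are not taken from the paper. The sketch shows why decoding is deterministic and why a single measurement error corrupts every downstream base if it is decoded naively.

```python
BASES = "ACGT"  # assumed 2-bit codes: A=0, C=1, G=2, T=3


def encode(seq):
    """Return one color (0-3) per adjacent base pair, as the XOR of base codes."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and the color list."""
    codes = [BASES.index(first_base)]
    for color in colors:
        codes.append(codes[-1] ^ color)
    return "".join(BASES[c] for c in codes)


reference = "ACGTTGCA"
colors = encode(reference)
decoded = decode(reference[0], colors)

# Simulate a single color measurement error at index 2 and decode naively.
bad_colors = list(colors)
bad_colors[2] ^= 1
corrupted = decode(reference[0], bad_colors)
mismatches = sum(x != y for x, y in zip(reference, corrupted))

print(colors)      # colors of the error-free read
print(corrupted)   # every base after the error differs from the reference
print(mismatches)
```

Because one wrong color shifts all later bases, an aligner working on such data cannot simply decode first and align second; it has to weigh color errors against sequence edits, which is the balance the abstract describes.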
Data-encoding synthetic DNA, inserted into the genome of a living organism, is thought to be more robust than the current media. Because the living genome is duplicated and copied into new generations, one of the merits of using DNA material is long-term data storage within heritable media. A disadvantage of this approach is that encoded data can be unexpectedly broken by mutation, deletion, and insertion of DNA, which occurs naturally during evolution and prolongation, or laboratory experiments. For this reason, several information theory-based approaches have been developed as an error check of broken DNA data in order to achieve data durability. These approaches cannot efficiently recover badly damaged data-encoding DNA. We recently developed a DNA data-storage approach based on the multiple sequence alignment method to achieve a high level of data durability. In this paper, we overview this technology and discuss strategies for optimal application of this approach.
DNA data storage; Sequence alignment; Polymerase chain reaction (PCR); Error check; Error correction; Genetically modified organism (GMO)
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
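As an illustration of how per-base error probabilities can enter an alignment score, the sketch below converts a Phred quality value Q into an error probability (p = 10^(-Q/10)) and derives a log-odds column score. The mixture model, the `divergence` parameter, and the function names are assumptions made for this example; they are not the scoring scheme of the method described above.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def column_score(read_base, ref_base, q, divergence=0.01):
    """Log-odds score (bits) for aligning a read base to a reference base.

    Assumed model: the true base differs from the reference with probability
    `divergence` (spread evenly over three alternatives), and the sequencer
    miscalls the true base with the Phred-derived error probability (also
    spread evenly). The score is log2 of that likelihood over a uniform 1/4.
    """
    e = phred_to_error(q)
    d = divergence
    if read_base == ref_base:
        p = (1 - d) * (1 - e) + d * e / 3
    else:
        p = (1 - d) * e / 3 + d * (1 - e) / 3 + 2 * d * e / 9
    return math.log2(p / 0.25)


# A mismatch at a low-quality position is penalized less than one at a
# high-quality position, because a sequencer error explains it more easily.
print(column_score("A", "G", q=5))
print(column_score("A", "G", q=40))
print(column_score("A", "A", q=40))
```

Under this model a low-quality mismatch is likely a sequencer error and costs little, while a high-quality mismatch is treated as evidence of a real sequence difference or a wrong mapping location.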
High quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes.
The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments that allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW.
We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments such as methods for detecting conserved RNA secondary structure elements.
PacBio RS, a newly emerging third-generation DNA sequencing platform, is based on a real-time, single-molecule, nano-nitch sequencing technology that can generate very long reads (up to 20-kb) in contrast to the shorter reads produced by the first and second generation sequencing technologies. As a new platform, it is important to assess the sequencing error rate, as well as the quality control (QC) parameters associated with the PacBio sequence data. In this study, a mixture of 10 previously known, closely related DNA amplicons was sequenced using the PacBio RS sequencing platform. After aligning Circular Consensus Sequence (CCS) reads derived from the above sequencing experiment to the known reference sequences, we found that the median error rate was 2.5% without read QC, and improved to 1.3% with an SVM-based multi-parameter QC method. In addition, a de novo assembly was used as a downstream application to evaluate the effects of different QC approaches. This benchmark study indicates that even though CCS reads are post error-corrected, it is still necessary to perform appropriate QC on CCS reads in order to produce successful downstream bioinformatics analytical results.
PacBio; CCS read; quality control (QC); pass number; quality value (QV); SVM regression; assembly
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
A new approach for encoding DNA sequences as input for DNA sequence analysis is proposed using the error correction coding theory of communication engineering. The encoder was designed as a convolutional code model whose generator matrix is designed based on the degeneracy of codons, with a codon treated in the model as an informational unit. The utility of the proposed model was demonstrated through the analysis of twelve prokaryote and nine eukaryote DNA sequences having different GC contents. Distinct differences in code distances were observed near the initiation and termination sites in the open reading frame, which provided a well-regulated characterization of the DNA sequences. Clearly distinguished period-3 features appeared in the coding regions, and the characteristic average code distances of the analyzed sequences were approximately proportional to their GC contents, particularly in the selected prokaryotic organisms, presenting the potential utility as an added taxonomic characteristic for use in studying the relationships of living organisms.
convolutional code; degeneracy; codon; informational unit; code distance; characteristic average code distance; GC content; taxonomy
Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology; however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors due to the high number of nucleotides being sequenced. Distinguishing sequencing errors from true biological variants is a challenging task. For organisms without a reference genome, this task is even more difficult.
We have developed a method for the correction of sequencing errors in data from the Illumina Solexa sequencing platforms. It does not require a reference genome and is of relevance for microRNA studies, unsequenced genomes, variant detection in ultra-deep sequencing and even for RNA-Seq studies of organisms with sequenced genomes where RNA editing is being considered.
The derived error model is novel in that it allows different error probabilities for each position along the read, in conjunction with different error rates depending on the particular nucleotides involved in the substitution, and does not force these effects to behave in a multiplicative manner. The model provides error rates which capture the complex effects and interactions of the three main known causes of sequencing error associated with the Illumina platforms.
The Chloroplast Genome Database (ChloroplastDB) is an interactive, web-based database for fully sequenced plastid genomes, containing genomic, protein, DNA and RNA sequences, gene locations, RNA-editing sites, putative protein families and alignments. With recent technical advances, the rate of generating new organelle genomes has increased dramatically. However, the established ontology for chloroplast genes and gene features has not been uniformly applied to all chloroplast genomes available in the sequence databases. For example, annotations for some published genome sequences have not evolved with gene naming conventions. ChloroplastDB provides unified annotations, gene name search, BLAST and download functions for chloroplast encoded genes and genomic sequences. A user can retrieve all orthologous sequences with one search regardless of gene names in GenBank. This feature alone greatly facilitates comparative research on sequence evolution including changes in gene content, codon usage, gene structure and post-transcriptional modifications such as RNA editing. Orthologous protein sets are classified by TribeMCL and each set is assigned a standard gene name. Over the next few years, as the number of sequenced chloroplast genomes increases rapidly, the tools available in ChloroplastDB will allow researchers to easily identify and compile target data for comparative analysis of chloroplast genes and genomes.
In a functioning genetic system, the information-encoding molecule must form a regular self-complementary complex (e.g., the base paired double helix of DNA) and it must be able to encode information and pass it on to new generations. Here we study a benzo-widened DNA-like molecule (yDNA) as a candidate for an alternative genetic set, and we explicitly test these two structural and functional requirements. The solution structure of a 10-bp yDNA duplex is measured by 2D-NMR methods for a simple sequence composed of T-yA/yA-T pairs. The data confirm an antiparallel, right-handed, hydrogen-bonded helix resembling B-DNA but with wider diameter and enlarged base pair size. In addition, the abilities of two different polymerase enzymes (Klenow fragment of DNA pol I (Kf) and the repair enzyme Dpo4) to synthesize and extend the yDNA pairs T-yA, A-yT, and G-yC are measured by steady-state kinetics studies. Not surprisingly, insertion of complementary bases opposite yDNA bases is inefficient due to the larger base pair size. We find that correct pairing occurs in several cases by both enzymes, but that common and relatively efficient mispairing involving T-yT and T-yC pairs interferes with fully correct formation and extension of pairs by these polymerases. Interestingly, the data show that extension of the large pairs is considerably more efficient with the flexible repair enzyme (Dpo4) than with the more rigid Kf enzyme. The results shed light on the properties of yDNA as a candidate for an alternative genetic information-encoding molecule and as a tool for application in basic science and biomedicine.
Wide DNA; genetic set; base pair; polymerase; solution structure; NMR
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from . A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches for the matching of DNA sequences are based either on global alignment schemes or heuristic schemes that seek to approximate global alignment algorithms while providing higher computational efficiency. This report describes an approach using the mathematical operation of cross-correlation to compare sequences. It can be implemented using the fast Fourier transform for computational efficiency. The algorithm is summarized and sample applications are given. These include gene sequence alignment in long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, being able to find sequence similarities where other alignment algorithms may perform poorly.
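The idea of comparing sequences by cross-correlation can be sketched as follows. For each relative offset (lag), the code counts how many bases of a short sequence coincide with the longer one, which equals the sum over the four bases of the cross-correlations of their indicator vectors. This sketch uses direct summation rather than the fast Fourier transform mentioned above, so it is slower on long inputs; the function names and example sequences are hypothetical.

```python
def match_counts_by_lag(a, b):
    """Cross-correlate two sequences: for each lag k, count i with a[i + k] == b[i]."""
    counts = {}
    for lag in range(-(len(b) - 1), len(a)):
        total = 0
        for i, base in enumerate(b):
            j = i + lag
            if 0 <= j < len(a) and a[j] == base:
                total += 1
        counts[lag] = total
    return counts


def best_lag(a, b):
    """Return the lag with the highest match count (first one on ties)."""
    counts = match_counts_by_lag(a, b)
    return max(counts, key=counts.get)


genome = "TTGACCGTAGGCTAACGT"
probe = "GTAGGCTA"
lag = best_lag(genome, probe)
print(lag, match_counts_by_lag(genome, probe)[lag])
```

The peak of the correlation marks the offset at which the two sequences agree best. Point mutations lower the peak without moving it, which is consistent with the robustness to substitutions claimed above; insertions and deletions would instead split the signal across neighboring lags.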
Sequence alignment; algorithm; software; cross-correlation
The currently used serological subtyping scheme for the pathogen Neisseria meningitidis is not comprehensive, a proportion of isolates are reported as not subtypeable (NST), and few isolates are fully characterized with two subtypes for each strain. To establish the reasons for this and to assess the effectiveness of DNA-based subtyping schemes, dot blot hybridization and nucleotide sequence analyses were used to characterize the genes encoding antigenic variants of the meningococcal subtyping antigen, the PorA protein. A total of 233 strains, including 174 serologically NST and 59 partially or completely subtyped meningococcal strains, were surveyed. The NST isolates were chosen to be temporally and geographically representative of NST strains, isolated in England and Wales, and submitted to the Meningococcal Reference Unit in the period 1989 to 1991. The DNA-based analyses demonstrated that all of the strains examined possessed a porA gene. Some of these strains were serologically NST because of a lack of monoclonal antibodies against certain PorA epitopes; in other cases, strains expressed minor variants of known PorA epitopes that did not react with monoclonal antibodies in serological assays. Lack of expression remained a possible explanation for serological typing failure in some cases. These findings have important implications for epidemiological analysis and vaccine design and demonstrate the need for genetic characterization, rather than phenotypic characterization using monoclonal antibodies, for the identification of meningococcal strains.
The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the ‘signal-to-noise ratio’ in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level and it is for this purpose we have constructed the program RevTrans. RevTrans constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning the resulting peptide sequences; and (iii) building a multiple DNA alignment by ‘reverse translation’ of the aligned protein sequences. In the resulting DNA alignment, gaps occur in groups of three corresponding to entire codons, and analogous codon positions are therefore always lined up. These features are useful when constructing multiple DNA alignments for phylogenetic analysis. RevTrans also accepts user-provided protein alignments for greater control of the alignment process. The RevTrans web server is freely available at http://www.cbs.dtu.dk/services/RevTrans/.
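Step (iii) above, threading codons back onto an aligned protein row, can be sketched as below. This is an illustrative reconstruction under simple assumptions (in-frame coding DNA, `-` as the gap character, no check that the codons actually translate to the given residues); the function name is hypothetical and this is not the RevTrans source code.

```python
def reverse_translate_alignment(aligned_protein, dna):
    """Map an ungapped coding DNA sequence onto a gapped protein alignment row.

    Each residue consumes the next codon; each protein gap becomes '---'.
    """
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(ch != "-" for ch in aligned_protein)
    if residues != len(codons):
        raise ValueError("number of residues and codons differ")
    codon_iter = iter(codons)
    return "".join("---" if ch == "-" else next(codon_iter)
                   for ch in aligned_protein)


# Two protein rows as they might come out of a peptide aligner.
row1 = reverse_translate_alignment("MK-L", "ATGAAACTG")
row2 = reverse_translate_alignment("MKTL", "ATGAAGACCCTG")
print(row1)
print(row2)
```

The output rows have equal length and every gap spans exactly one codon, so analogous codon positions stay in register as described above.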
The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.
Archaea; Bacteria; BLAST; GBDP; genomics; MUMmer; phylogeny; species concept; taxonomy
Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome aims at counting the number of reads that align to the reference sequence, and then choosing the reference sequence which has the highest number of reads aligning to it. This article explores the use of minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the MDL based proposed scheme with the standard method coming to the conclusion that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. Therefore, the proposed MDL scheme includes within itself the standard method of “counting the number of reads that align to the reference sequence” and also moves forward towards looking at the model, the reference sequence, as well, in identifying the optimal reference sequence. The proposed MDL based scheme not only becomes the sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
Base-By-Base is a Java-based multiple sequence alignment editor. It is capable of working with protein and DNA molecules, but many of its unique features relate to the manipulation of the genomes of large DNA viruses such as poxviruses, herpesviruses, baculoviruses and asfarviruses (1-400 kb). The tool was built to serve as a platform for comparative genomics at the level of individual nucleotides.
In version 2, BBB-v2, of Base-By-Base we have added a series of new features aimed at providing the bench virologist with a better platform to view, annotate and analyze these complex genomes. Although a poxvirus genome, for example, may be less than 200 kb, it probably encodes close to 200 proteins using multiple classes of promoters with frequent overlapping of promoters and coding sequences and even some overlapping of genes. The new features allow users to 1) add primer annotations or other data sets in batch mode, 2) export differences between sequences to other genome browsers, 3) compare multiple genomes at a single nucleotide level of detail, 4) create new alignments from subsets/subsequences of a very large master alignment and 5) allow display of summaries of deep RNA sequencing data sets on a genome sequence.
BBB-v2 significantly improves the ability of virologists to work with genome sequences and provides a platform with which they can use a multiple sequence alignment as the basis for their own editable documents. Also, a .bbb document, with a variety of annotations in addition to the basic coding regions, can be shared among collaborators or made available to an entire research community. The program is available via Virology.ca using Java Web Start and is platform independent; the Java 1.5 virtual machine is required.
Mitochondrial encephalomyopathies are a heterogeneous group of clinical disorders generally caused by mutations in either mitochondrial DNA (mtDNA) or nuclear genes encoding oxidative phosphorylation (OXPHOS). We analyzed the mtDNA sequences from a group of 23 pediatric patients with clinical and morphological features of mitochondrial encephalopathies and tried to establish a relationship of identified variants with the disease.
Complete mitochondrial genomes were amplified by PCR and sequenced by automated DNA sequencing. Sequencing data was analyzed by SeqScape software and also confirmed by BLASTn program. Nucleotide sequences were compared with the revised Cambridge reference sequence (CRS) and sequences present in mitochondrial databases. The data obtained shows that a number of known and novel mtDNA variants were associated with the disease. Most of the non-synonymous variants were heteroplasmic (A4136G, A9194G and T11916A) suggesting their possibility of being pathogenic in nature. Some of the missense variants although homoplasmic were showing changes in highly conserved amino acids (T3394C, T3866C, and G9804A) and were previously identified with diseased conditions. Similarly, two other variants found in tRNA genes (G5783A and C8309T) could alter the secondary structure of Cys-tRNA and Lys-tRNA. Most of the variants occurred in single cases; however, a few occurred in more than one case (e.g. G5783A and A10149T).
Conclusions and Significance
The mtDNA variants identified in this study could be the possible cause of mitochondrial encephalomyopathies with childhood onset in the patient group. Our study further strengthens the pathogenic score of known variants previously reported as provisionally pathogenic in mitochondrial diseases. The novel variants found in the present study can be potential candidates for further investigations to establish the relationship between their incidence and role in expressing the disease phenotype. This study will be useful in genetic diagnosis and counseling of mitochondrial diseases in India as well as worldwide.
Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms and mutations of biological and medical interest. The application of this technology in generating the reference sequence of simple and complex genomes is also driving the development of new computer programs to automate base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in high throughput settings. In this report we describe a new computer program known as PolyPhred that automatically detects the presence of heterozygous single nucleotide substitutions by fluorescence-based sequencing of PCR products. Its operations are integrated with the use of the Phred, Phrap and Consed programs and together these tools generate a high throughput system for detecting DNA polymorphisms and mutations by large scale fluorescence-based resequencing. Analysis of sequences containing known DNA variants demonstrates that the accuracy of PolyPhred with single pass data is >99% when the sequences are generated with fluorescent dye-labeled primers and approximately 90% for those prepared with dye-labeled terminators.
Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects.
We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time.
The proposed methodology was implemented in a software tool called TAPyR--Tool for the Alignment of Pyrosequencing Reads--which is publicly available from http://www.tapyr.net.
The use of next-generation DNA sequencing technologies has greatly facilitated reference-guided variant detection in complex plant genomes. However, complications may arise when regions adjacent to a read of interest are used for marker assay development, or when reference sequences are incomplete, as short reads alone may not be long enough to ascertain their uniqueness. Here, the possibility of generating longer sequences in discrete regions of the large and complex genome of maize is demonstrated, using a modified version of a paired-end RAD library construction strategy. Reads are generated from DNA fragments first digested with a methylation-sensitive restriction endonuclease, sheared, enriched with biotin and a selective PCR amplification step, and then sequenced at both ends. Sequences are locally assembled into contigs by subgrouping pairs based on the identity of the read anchored by the restriction site. This strategy applied to two maize inbred lines (B14 and B73) generated 183,609 and 129,018 contigs, respectively, out of which at least 76% were >200 bps in length. A subset of putative single nucleotide polymorphisms from contigs aligning to the B73 reference genome with at least one mismatch was resequenced, and 90% of those in B14 were confirmed, indicating that this method is a potent approach for variant detection and marker development in species with complex genomes or lacking extensive reference sequences.
The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
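For readers unfamiliar with the final step, the following is a minimal, textbook-style sketch of Smith-Waterman local alignment with a linear gap penalty. It is not the BFAST implementation; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the example sequences are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


reference = "AAACGTTGCAAA"
read = "CGTGCA"  # the reference segment CGTTGCA with one T deleted
score, ref_aln, read_aln = smith_waterman(reference, read)
print(score)
print(ref_aln)
print(read_aln)
```

The gap character in the aligned read marks the small deletion relative to the reference. A production aligner would typically restrict this quadratic-time step to the candidate locations found by the index, as the abstract describes.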
We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, “DNA barcode” actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.