The majority of next-generation sequencing short-reads can be properly aligned by leading aligners at high speed. However, the alignment quality can still be further improved, since usually not all reads can be correctly aligned to large genomes, such as the human genome, even for simulated data. Moreover, even slight improvements in this area are important but challenging, and usually require significantly more computational endeavor. In this paper, we present CUSHAW3, an open-source parallelized, sensitive and accurate short-read aligner for both base-space and color-space sequences. In this aligner, we have investigated a hybrid seeding approach to improve alignment quality, which incorporates three different seed types, i.e. maximal exact match seeds, exact-match k-mer seeds and variable-length seeds, into the alignment pipeline. Furthermore, three techniques: weighted seed-pairing heuristic, paired-end alignment pair ranking and read mate rescuing have been conceived to facilitate accurate paired-end alignment. For base-space alignment, we have compared CUSHAW3 to Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM, by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in terms of single-end and paired-end alignment. Furthermore, our aligner has demonstrated better paired-end alignment performance than Novoalign for short-reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. The source code of CUSHAW3 and all simulated data are available at http://cushaw3.sourceforge.net.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
Next-generation sequencing technologies provide an unparallelled opportunity for the characterization and discovery of known and novel viruses. Because viruses are known to have the highest mutation rates when compared to eukaryotic and bacterial organisms, we assess the extent to which eleven well-known alignment algorithms (BLAST, BLAT, BWA, BWA-SW, BWA-MEM, BFAST, Bowtie2, Novoalign, GSNAP, SHRiMP2 and STAR) can be used for characterizing mutated and non-mutated viral sequences - including those that exhibit RNA splicing - in transcriptome samples. To evaluate aligners objectively we developed a realistic RNA-Seq simulation and evaluation framework (RiSER) and propose a new combined score to rank aligners for viral characterization in terms of their precision, sensitivity and alignment accuracy. We used RiSER to simulate both human and viral read sequences and suggest the best set of aligners for viral sequence characterization in human transcriptome samples. Our results show that significant and substantial differences exist between aligners and that a digital-subtraction-based viral identification framework can and should use different aligners for different parts of the process. We determine the extent to which mutated viral sequences can be effectively characterized and show that more sensitive aligners such as BLAST, BFAST, SHRiMP2, BWA-SW and GSNAP can accurately characterize substantially divergent viral sequences with up to 15% overall sequence mutation rate. We believe that the results presented here will be useful to researchers choosing aligners for viral sequence characterization using next-generation sequencing data.
Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.
Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
Motivation: The rapid development of next-generation sequencing technologies able to produce huge amounts of sequence data is leading to a wide range of new applications. This triggers the need for fast and accurate alignment software. Common techniques often restrict indels in the alignment to improve speed, whereas more flexible aligners are too slow for large-scale applications. Moreover, many current aligners are becoming inefficient as generated reads grow ever larger. Our goal with our new aligner GASSST (Global Alignment Short Sequence Search Tool) is thus 2-fold—achieving high performance with no restrictions on the number of indels with a design that is still effective on long reads.
Results: We propose a new efficient filtering step that discards most alignments coming from the seed phase before they are checked by the costly dynamic programming algorithm. We use a carefully designed series of filters of increasing complexity and efficiency to quickly eliminate most candidate alignments in a wide range of configurations. The main filter uses a precomputed table containing the alignment score of short four base words aligned against each other. This table is reused several times by a new algorithm designed to approximate the score of the full dynamic programming algorithm. We compare the performance of GASSST against BWA, BFAST, SSAHA2 and PASS. We found that GASSST achieves high sensitivity in a wide range of configurations and faster overall execution time than other state-of-the-art aligners.
Availability: GASSST is distributed under the CeCILL software license at http://www.irisa.fr/symbiose/projects/gassst/
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Single Nucleotide Polymorphism (SNP) genotyping analysis is very susceptible to SNPs chromosomal position errors. As it is known, SNPs mapping data are provided along the SNP arrays without any necessary information to assess in advance their accuracy. Moreover, these mapping data are related to a given build of a genome and need to be updated when a new build is available. As a consequence, researchers often plan to remap SNPs with the aim to obtain more up-to-date SNPs chromosomal positions. In this work, we present G-SNPM a GPU (Graphics Processing Unit) based tool to map SNPs on a genome.
G-SNPM is a tool that maps a short sequence representative of a SNP against a reference DNA sequence in order to find the physical position of the SNP in that sequence. In G-SNPM each SNP is mapped on its related chromosome by means of an automatic three-stage pipeline. In the first stage, G-SNPM uses the GPU-based short-read mapping tool SOAP3-dp to parallel align on a reference chromosome its related sequences representative of a SNP. In the second stage G-SNPM uses another short-read mapping tool to remap the sequences unaligned or ambiguously aligned by SOAP3-dp (in this stage SHRiMP2 is used, which exploits specialized vector computing hardware to speed-up the dynamic programming algorithm of Smith-Waterman). In the last stage, G-SNPM analyzes the alignments obtained by SOAP3-dp and SHRiMP2 to identify the absolute position of each SNP.
Results and conclusions
To assess G-SNPM, we used it to remap the SNPs of some commercial chips. Experimental results shown that G-SNPM has been able to remap without ambiguity almost all SNPs. Based on modern GPUs, G-SNPM provides fast mappings without worsening the accuracy of the results. G-SNPM can be used to deal with specialized Genome Wide Association Studies (GWAS), as well as in annotation tasks that require to update the SNP mapping probes.
Motivation: Many programs for aligning short sequencing reads to a reference genome have been developed in the last 2 years. Most of them are very efficient for short reads but inefficient or not applicable for reads >200 bp because the algorithms are heavily and specifically tuned for short queries with low sequencing error rate. However, some sequencing platforms already produce longer reads and others are expected to become available soon. For longer reads, hashing-based software such as BLAT and SSAHA2 remain the only choices. Nonetheless, these methods are substantially slower than short-read aligners in terms of aligned bases per unit time.
Results: We designed and implemented a new algorithm, Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW), to align long sequences up to 1 Mb against a large sequence database (e.g. the human genome) with a few gigabytes of memory. The algorithm is as accurate as SSAHA2, more accurate than BLAT, and is several to tens of times faster than both.
Motivation: A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not been performed previously.
Results: We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription–polymerase chain reaction (RT–PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
Availability: The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (http://cbil.upenn.edu/RUM).
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information:The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads, by filtering mismatched reads that remained in alignments after local realignment and error correction of mismatched reads. The error correction is executed based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data of rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where the whole genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are “inherently parallel”.
We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for pair-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed.
This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance bottleneck for the short read mapping problem can be changed from the alignment stage to the seed generation stage, which provides an additional requirement for the future development of short read aligners.
Alignments of homologous DNA sequences are crucial for comparative genomics and phylogenetic analysis. However, multiple alignment represents a computationally difficult problem. For protein-coding DNA sequences, it is more advantageous in terms of both speed and accuracy to align the amino-acid sequences specified by the DNA sequences rather than the DNA sequences themselves. Many implementations making use of this concept of "translated alignments" are incomplete in the sense that they require the user to manually translate the DNA sequences and to perform the amino-acid alignment. As such, they are not well suited to large-scale automated alignments of large and/or numerous DNA data sets.
transAlign is an open-source Perl script that aligns protein-coding DNA sequences via their amino-acid translations to take advantage of the superior multiple-alignment capabilities and speed of an amino-acid alignment. It operates by translating each DNA sequence into its corresponding amino-acid sequence, passing the entire matrix to ClustalW for alignment, and then back-translating the resulting amino-acid alignment to derive the aligned DNA sequences. In the translation step, transAlign determines the optimal orientation and reading frame for each DNA sequence according to the desired genetic code. It also checks for apparent frame shifts in the DNA sequences and can handle frame-shifted sequences in one of three ways (delete, align as amino acids regardless, or profile align as DNA). As a set of comparative benchmarks derived from six protein-coding genes for mammals shows, the strategy implemented in transAlign always improves the speed and usually the apparent accuracy of the alignment of protein-coding DNA sequences.
transAlign represents one of few full and cross-platform implementations of the concept of translated alignments. Both the advantages accruing from performing a translated alignment and the suite of user-definable options available in the program mean that transAlign is ideally suited for large-scale automated alignments of very large and/or very numerous protein-coding DNA data sets. However, the good performance offered by the program also translates to the alignment of any set of protein-coding sequences. transAlign, including the source code, is freely available at http://www.tierzucht.tum.de/Bininda-Emonds/ (under "Programs").
Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets.
We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows–Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node.
CLAST achieved very high speed (similar to the Burrows–Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing technologies.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0406-y) contains supplementary material, which is available to authorized users.
GPU; CUDA; Metagenomics; Microbial community; Sequence similarity search tool
Expressed sequences (e.g. ESTs) are a strong source of evidence to improve gene structures and predict reliable alternative splicing events. When a genome assembly is available, ESTs are suitable to generate gene-oriented clusters through the well-established EasyCluster software. Nowadays, EST-like sequences can be massively produced using Next Generation Sequencing (NGS) technologies. In order to handle genome-scale transcriptome data, we present here EasyCluster2, a reimplementation of EasyCluster able to speed up the creation of gene-oriented clusters and facilitate downstream analyses as the assembly of full-length transcripts and the detection of splicing isoforms.
EasyCluster2 has been developed to facilitate the genome-based clustering of EST-like sequences generated through the NGS 454 technology. Reads mapped onto the reference genome can be uploaded using the standard GFF3 file format. Alignment parsing is initially performed to produce a first collection of pseudo-clusters by grouping reads according to the overlap of their genomic coordinates on the same strand. EasyCluster2 then refines read grouping by including in each cluster only reads sharing at least one splice site and optionally performs a Smith-Waterman alignment in the region surrounding splice sites in order to correct for potential alignment errors. In addition, EasyCluster2 can include unspliced reads, which generally account for >50% of 454 datasets, and collapses overlapping clusters. Finally, EasyCluster2 can assemble full-length transcripts using a Directed-Acyclic-Graph-based strategy, simplifying the identification of alternative splicing isoforms, thanks also to the implementation of the widespread AStalavista methodology. Accuracy and performances have been tested on real as well as simulated datasets.
EasyCluster2 represents a unique tool to cluster and assemble transcriptome reads produced with 454 technology, as well as ESTs and full-length transcripts. The clustering procedure is enhanced with the employment of genome annotations and unspliced reads. Overall, EasyCluster2 is able to perform an effective detection of splicing isoforms, since it can refine exon-exon junctions and explore alternative splicing without known reference transcripts. Results in GFF3 format can be browsed in the UCSC Genome Browser. Therefore, EasyCluster2 is a powerful tool to generate reliable clusters for gene expression studies, facilitating the analysis also to researchers not skilled in bioinformatics.
With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error-type in Illumina reads are mismatches, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching in particular of short reads with diverse errors is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
The successful mapping of high-throughput sequencing (HTS) reads to reference genomes largely depends on the accuracy of both the sequencing technologies and reference genomes. Current mapping algorithms focus on mapping with mismatches but largely neglect insertions and deletions—regardless of whether they are caused by sequencing errors or genomic variation. Furthermore, trailing contaminations by primers and declining read qualities can be cumbersome for programs that allow a maximum number of mismatches. We have developed and implemented a new approach for short read mapping that, in a first step, computes exact matches of the read and the reference genome. The exact matches are then modified by a limited number of mismatches, insertions and deletions. From the set of exact and inexact matches, we select those with minimum score-based E-values. This gives a set of regions in the reference genome which is aligned to the read using Myers bitvector algorithm . Our method utilizes enhanced suffix arrays  to quickly find the exact and inexact matches. It maps more reads and achieves higher recall rates than previous methods. This consistently holds for reads produced by 454 as well as Illumina sequencing technologies.
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
Biological sequence alignment is one of the fundamental problems in comparative genomics, yet it remains unsolved. Over sixty sequence alignment programs are listed on Wikipedia, and many new programs are published every year. However, many popular programs suffer from pathologies such as aligning unrelated sequences and producing discordant alignments in protein (amino acid) and codon (nucleotide) space, casting doubt on the accuracy of the inferred alignments. Inaccurate alignments can introduce large and unknown systematic biases into downstream analyses such as phylogenetic tree reconstruction and substitution rate estimation. We describe a new program for multiple sequence alignment which can align protein, RNA and DNA sequence and improves on the accuracy of existing approaches on benchmarks of protein and RNA structural alignments and simulated mammalian and fly genomic alignments. Our approach, which seeks to find the alignment which is closest to the truth under our statistical model, leaves unrelated sequences largely unaligned and produces concordant alignments in protein and codon space. It is fast enough for difficult problems such as aligning orthologous genomic regions or aligning hundreds or thousands of proteins. It furthermore has a companion GUI for visualizing the estimated alignment reliability.
Motivation: Next-generation sequence analysis has become an important task both in laboratory and clinical settings. A key stage in the majority sequence analysis workflows, such as resequencing, is the alignment of genomic reads to a reference genome. The accurate alignment of reads with large indels is a computationally challenging task for researchers.
Results: We introduce SeqAlto as a new algorithm for read alignment. For reads longer than or equal to 100 bp, SeqAlto is up to 10 × faster than existing algorithms, while retaining high accuracy and the ability to align reads with large (up to 50 bp) indels. This improvement in efficiency is particularly important in the analysis of future sequencing data where the number of reads approaches many billions. Furthermore, SeqAlto uses less than 8 GB of memory to align against the human genome. SeqAlto is benchmarked against several existing tools with both real and simulated data.
Availability: Linux and Mac OS X binaries free for academic use are available at http://www.stanford.edu/group/wonglab/seqalto
Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
We concentrate on indexing DNA sequences via sparse suffix arrays (SSAs) and propose a new short read aligner named Ψ-RA (PSI-RA: parallel sparse index read aligner). The motivation in using SSAs is the ability to trade memory against time. It is possible to fine tune the space consumption of the index based on the available memory of the machine and the minimum length of the arriving pattern queries. Although SSAs have been studied before for exact matching of short reads, an elegant way of approximate matching capability was missing. We provide this by defining the rightmost mismatch criteria that prioritize the errors towards the end of the reads, where errors are more probable. Ψ-RA supports any number of mismatches in aligning reads. We give comparisons with some of the well-known short read aligners, and show that indexing a genome with SSA is a good alternative to the Burrows-Wheeler transform or seed-based solutions.
Ψ-RA is expected to serve as a valuable tool in the alignment of short reads generated by the next generation high-throughput sequencing technology. Ψ-RA is very fast in exact matching and also supports rightmost approximate matching. The SSA structure that Ψ-RA is built on naturally incorporates the modern multicore architecture and thus further speed-up can be gained. All the information, including the source code of Ψ-RA, can be downloaded at: http://www.busillis.com/o_kulekci/PSIRA.zip.
Recent tools for aligning short DNA reads have been designed to optimize the trade-off between correctness and speed. This paper introduces a method for assigning a set of short DNA reads to a reference genome, under Local Rank Distance (LRD). The rank-based aligner proposed in this work aims to improve correctness over speed. However, some indexing strategies to speed up the aligner are also investigated. The LRD aligner is improved in terms of speed by storing -mer positions in a hash table for each read. Another improvement, that produces an approximate LRD aligner, is to consider only the positions in the reference that are likely to represent a good positional match of the read. The proposed aligner is evaluated and compared to other state of the art alignment tools in several experiments. A set of experiments are conducted to determine the precision and the recall of the proposed aligner, in the presence of contaminated reads. In another set of experiments, the proposed aligner is used to find the order, the family, or the species of a new (or unknown) organism, given only a set of short Next-Generation Sequencing DNA reads. The empirical results show that the aligner proposed in this work is highly accurate from a biological point of view. Compared to the other evaluated tools, the LRD aligner has the important advantage of being very accurate even for a very low base coverage. Thus, the LRD aligner can be considered as a good alternative to standard alignment tools, especially when the accuracy of the aligner is of high importance. Source code and UNIX binaries of the aligner are freely available for future development and use at http://lrd.herokuapp.com/aligners. The software is implemented in C++ and Java, being supported on UNIX and MS Windows.
Genome resequencing with short reads generated from pyrosequencing generally relies on mapping the short reads against a single reference genome. However, mapping of reads from multiple reference genomes is not possible using a pairwise mapping algorithm. In order to align the reads w.r.t each other and the reference genomes, existing multiple sequence alignment(MSA) methods cannot be used because they do not take into account the position of these short reads with respect to the genome, and are highly inefficient for large number of sequences. In this paper, we develop a highly scalable parallel algorithm based on domain decomposition, referred to as P-Pyro-Align, to align such large number of reads from single or multiple reference genomes. The proposed alignment algorithm accurately aligns the erroneous reads, and has been implemented on a cluster of workstations using MPI library. Experimental results for different problem sizes are analyzed in terms of execution time, quality of the alignments, and the ability of the algorithm to handle reads from multiple haplotypes. We report high quality multiple alignment of up to 0.5 million reads. The algorithm is shown to be highly scalable and exhibits super-linear speedups with increasing number of processors.
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or “dibase”, reads generated by AB SOLiD sequencers.
DNA methylation changes are associated with a wide array of biological processes. Bisulfite conversion of DNA followed by high-throughput sequencing is increasingly being used to assess genome-wide methylation at single-base resolution. The relative slowness of most commonly used aligners for processing such data introduces an unnecessarily long delay between receipt of raw data and statistical analysis. While this process can be sped-up by using computer clusters, current tools are not designed with them in mind and end-users must create such implementations themselves.
Here, we present a novel BS-seq aligner, Bison, which exploits multiple nodes of a computer cluster to speed up this process and also has increased accuracy. Bison is accompanied by a variety of helper programs and scripts to ease, as much as possible, the process of quality control and preparing results for statistical analysis by a variety of popular R packages. Bison is also accompanied by bison_herd, a variant of Bison with the same output but that can scale to a semi-arbitrary number of nodes, with concomitant increased demands on the underlying message passing interface implementation.
Bison is a new bisulfite-converted short-read aligner providing end users easier scalability for performance gains, more accurate alignments, and a convenient pathway for quality controlling alignments and converting methylation calls into a form appropriate for statistical analysis. Bison and the more scalable bison_herd are natively able to utilize multiple nodes of a computer cluster simultaneously and serve to simplify to the process of creating analysis pipelines.
DNA Methylation; Alignment; Bisulfite sequencing; Computer cluster
Massively parallel sequencing readouts of epigenomic assays are enabling integrative genome-wide analyses of genomic and epigenomic variation. Pash 3.0 performs sequence comparison and read mapping and can be employed as a module within diverse configurable analysis pipelines, including ChIP-Seq and methylome mapping by whole-genome bisulfite sequencing.
Pash 3.0 generally matches the accuracy and speed of niche programs for fast mapping of short reads, and exceeds their performance on longer reads generated by a new generation of massively parallel sequencing technologies. By exploiting longer read lengths, Pash 3.0 maps reads onto the large fraction of genomic DNA that contains repetitive elements and polymorphic sites, including indel polymorphisms.
We demonstrate the versatility of Pash 3.0 by analyzing the interaction between CpG methylation, CpG SNPs, and imprinting based on publicly available whole-genome shotgun bisulfite sequencing data. Pash 3.0 makes use of gapped k-mer alignment, a non-seed based comparison method, which is implemented using multi-positional hash tables. This allows Pash 3.0 to run on diverse hardware platforms, including individual computers with standard RAM capacity, multi-core hardware architectures and large clusters.
New mutations leading to structural variation (SV) in genomes—in the form of mobile element insertions, large deletions, gene duplications, and other chromosomal rearrangements—can play a key role in microbial evolution. Yet, SV is considerably more difficult to predict from short-read genome resequencing data than single-nucleotide substitutions and indels (SN), so it is not yet routinely identified in studies that profile population-level genetic diversity over time in evolution experiments. We implemented an algorithm for detecting polymorphic SV as part of the breseq computational pipeline. This procedure examines split-read alignments, in which the two ends of a single sequencing read match disjoint locations in the reference genome, in order to detect structural variants and estimate their frequencies within a sample. We tested our algorithm using simulated Escherichia coli data and then applied it to 500- and 1000-generation population samples from the Lenski E. coli long-term evolution experiment (LTEE). Knowledge of genes that are targets of selection in the LTEE and mutations present in previously analyzed clonal isolates allowed us to evaluate the accuracy of our procedure. Overall, SV accounted for ~25% of the genetic diversity found in these samples. By profiling rare SV, we were able to identify many cases where alternative mutations in key genes transiently competed within a single population. We also found, unexpectedly, that mutations in two genes that rose to prominence at these early time points always went extinct in the long term. Because it is not limited by the base-calling error rate of the sequencing technology, our approach for identifying rare SV in whole-population samples may have a lower detection limit than similar predictions of SNs in these data sets. We anticipate that this functionality of breseq will be useful for providing a more complete picture of genome dynamics during evolution experiments with haploid microorganisms.
genome resequencing; experimental evolution; insertion sequence; genetic parallelism; evolutionary dead end
Sequence alignments form the basis for many comparative and population genomic studies. Alignment tools provide a range of accuracies dependent on the divergence between the sequences and the alignment methods. Despite widespread use, there is no standard method for assessing the accuracy of a dataset and alignment strategy after resequencing. We present a framework and tool for determining the overall accuracies of an input read dataset, alignment and SNP-calling method providing an isolate in that dataset has a corresponding, or closely related reference sequence available. In addition to this tool for comparing False Discovery Rates (FDR), we include a method for determining homozygous and heterozygous positions from an alignment using binomial probabilities for an expected error rate. We benchmark this method against other SNP callers using our FDR method with three fungal genomes, finding that it was able achieve a high level of accuracy. These tools are available at http://cfdr.sourceforge.net/.