Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
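The di-base "color-space" encoding mentioned above can be illustrated with a minimal sketch. It assumes the commonly described SOLiD scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function names are hypothetical and are not taken from drFAST.

```python
# Illustrative sketch of di-base (color-space) encoding; not drFAST code.
# Assumption: color = XOR of the 2-bit codes of adjacent bases.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_color_space(seq: str) -> str:
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_color_space(read: str) -> str:
    """Decode a color-space read (leading base + colors) back to letter space."""
    bases = [read[0]]
    for color in read[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


print(encode_color_space("ACGT"))   # A131
print(decode_color_space("A131"))   # ACGT
print(decode_color_space("A101"))   # ACCA: one wrong color alters every later base
```

Because a single miscalled color changes every downstream base after decoding, mappers for this platform generally compare reads and reference in color space rather than decoding the reads first.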
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational techniques that can process and analyze the enormous amount of sequence data quickly and accurately. Unfortunately, the current read mapping algorithms have difficulties in coping with the massive amounts of data generated by NGS.
We propose a new algorithm, FastHASH, which drastically improves the performance of the seed-and-extend type hash table based read mapping algorithms, while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend class read mapping algorithms. It introduces two main techniques, namely Adjacency Filtering, and Cheap K-mer Selection.
We implemented FastHASH and merged it into the codebase of the popular read mapping program, mrFAST. Depending on the edit distance cutoffs, we observed up to 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.
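The two techniques named above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the FastHASH implementation: it handles mismatches only (no indels), splits the read into non-overlapping k-mers, and all function names are hypothetical. "Cheap k-mer selection" is taken here to mean seeding with the k-mers that have the fewest reference locations, and "adjacency filtering" to mean checking that the read's other k-mers occur at the matching adjacent reference positions before any costly verification.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k, max_errors):
    """Return read start positions that survive the adjacency filter."""
    # Non-overlapping k-mers of the read, each with its offset in the read.
    kmers = [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, k)]
    # Cheap k-mer selection (assumed form): seed with the rarest k-mers.
    # With at most max_errors mismatches, one of any max_errors + 1
    # non-overlapping k-mers must match exactly (pigeonhole argument).
    seeds = sorted(kmers, key=lambda item: len(index.get(item[1], ())))
    seeds = seeds[:max_errors + 1]
    candidates = set()
    for seed_off, seed in seeds:
        for pos in index.get(seed, ()):
            start = pos - seed_off
            if start < 0:
                continue
            # Adjacency filtering (assumed form): count read k-mers that are
            # absent from the corresponding adjacent reference position.
            missing = sum(
                1 for off, kmer in kmers
                if start + off not in index.get(kmer, ())
            )
            if missing <= max_errors:
                candidates.add(start)
    return sorted(candidates)


reference = "TTGACCAGTAGGCATCGATTACGGAC"
index = build_index(reference, 4)
# Read taken from position 4 of the reference with one substitution.
print(candidate_locations("CCAGTCGGCATC", index, 4, max_errors=1))
```

Only the locations returned by `candidate_locations` would then be passed to the expensive alignment step, which is where a filter of this kind would save time.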
The most crucial step in data processing from high-throughput sequencing applications is the accurate and sensitive alignment of the sequencing reads to reference genomes or transcriptomes. The accurate detection of insertions and deletions (indels) and errors introduced by the sequencing platform or by misreading of modified nucleotides is essential for the quantitative processing of the RNA-based sequencing (RNA-Seq) datasets and for the identification of genetic variations and modification patterns. We developed a new, fast and accurate algorithm for nucleic acid sequence analysis, FANSe, with adjustable mismatch allowance settings and the ability to handle indels to accurately and quantitatively map millions of reads to small or large reference genomes. It is a seed-based algorithm which uses the whole read information for mapping; high sensitivity and low ambiguity are achieved by using short and non-overlapping reads. Furthermore, FANSe uses a hotspot score to prioritize the processing of highly possible matches and implements a modified Smith–Waterman refinement with a reduced scoring matrix to accelerate the calculation without compromising its sensitivity. The FANSe algorithm stably processes datasets from various sequencing platforms, masked or unmasked and small or large genomes. It shows a remarkable coverage of low-abundance mRNAs, which is important for quantitative processing of RNA-Seq datasets.
Motivation: With improved short-read assembly algorithms and the recent development of long-read sequencers, split mapping will soon be the preferred method for structural variant (SV) detection. Yet, current alignment tools are not well suited for this.
Results: We present YAHA, a fast and flexible hash-based aligner. YAHA is as fast and accurate as BWA-SW at finding the single best alignment per query and is dramatically faster and more sensitive than both SSAHA2 and MegaBLAST at finding all possible alignments. Unlike other aligners that report all, or one, alignment per query, or that use simple heuristics to select alignments, YAHA uses a directed acyclic graph to find the optimal set of alignments that cover a query using a biologically relevant breakpoint penalty. YAHA can also report multiple mappings per defined segment of the query. We show that YAHA detects more breakpoints in less time than BWA-SW across all SV classes, and especially excels at complex SVs comprising multiple breakpoints.
Availability: YAHA is currently supported on 64-bit Linux systems. Binaries and sample data are freely available for download from http://faculty.virginia.edu/irahall/YAHA.
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes 200-fold more cheaply than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known (“reference”) genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or “dibase”, reads generated by AB SOLiD sequencers.
With the advent of next-generation sequencers, the growing demands to map short DNA sequences to a genome have promoted the development of fast algorithms and tools. The tools commonly used today are based on either a hash table or the suffix array/Burrows–Wheeler transform. These algorithms are best suited to finding the genome position of exactly matching short reads. However, they have limited capacity to handle mismatches. To find n mismatches, they require O(2^n) times the computation time of exact matching. Therefore, acceleration techniques are required.
We propose a hash-based method for genome mapping that reduces the number of hash references for finding mismatches without increasing the size of the hash table. The method regards DNA subsequences as words over the Galois extension field GF(2^2), and each word is encoded to a code word of a perfect Hamming code. The perfect Hamming code defines equivalence classes of DNA subsequences. Each equivalence class includes the subsequences whose corresponding words over GF(2^2) are encoded to the same code word. The code word is used as a hash key to store these subsequences in a hash table. Specifically, the method reduces by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of a 21-base-long DNA subsequence.
The paper shows that a perfect Hamming code can reduce the number of hash references for hash-based genome mapping. As the computation time to calculate code words is far shorter than a hash reference, our method is effective in reducing the computation time to map short DNA sequences to a genome. The amount of data that DNA sequencers generate continues to increase, and more accurate genome mappings are required. Thus our method will be a key technology for developing faster genome mapping software.
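The idea of using a code word as a hash key can be illustrated with a small sketch. It is an assumption-laden toy, not the paper's implementation: it uses the shortest non-trivial perfect Hamming code over GF(4), of length 5 (lengths of such codes are (4^r - 1)/3, which gives 5 for r = 2 and 21 for r = 3), an assumed base-to-field mapping A=0, C=1, G=2, T=3, and hypothetical function names. Each 5-mer is mapped to the unique code word within Hamming distance 1, so every equivalence class contains one code word and its 15 single-mismatch neighbours.

```python
from collections import Counter
from itertools import product

# GF(4) elements are 0, 1, 2, 3; addition is XOR, multiplication is the table.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]
BASE_TO_GF4 = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping
# Parity-check columns: one representative per 1-D subspace of GF(4)^2.
H_COLUMNS = [(1, 0), (0, 1), (1, 1), (1, 2), (1, 3)]


def syndrome(word):
    """Two-symbol syndrome H * word over GF(4)."""
    s0 = s1 = 0
    for symbol, (h0, h1) in zip(word, H_COLUMNS):
        s0 ^= GF4_MUL[symbol][h0]
        s1 ^= GF4_MUL[symbol][h1]
    return s0, s1


def nearest_codeword(word):
    """Return the unique code word within Hamming distance 1 of word."""
    s = syndrome(word)
    if s == (0, 0):
        return tuple(word)
    for j, (h0, h1) in enumerate(H_COLUMNS):
        for a in (1, 2, 3):
            if (GF4_MUL[a][h0], GF4_MUL[a][h1]) == s:
                fixed = list(word)
                fixed[j] ^= a  # subtracting equals adding in characteristic 2
                return tuple(fixed)
    raise AssertionError("unreachable for a perfect code")


def hash_key(kmer):
    """Hash key of a 5-mer: the code word of its equivalence class."""
    return nearest_codeword([BASE_TO_GF4[b] for b in kmer])


classes = Counter(nearest_codeword(w) for w in product(range(4), repeat=5))
print(len(classes), set(classes.values()))  # 64 classes, each of size 16
print(hash_key("AAAAA") == hash_key("CAAAA"))  # True: one mismatch, same key
```

In this toy setting a single hash reference retrieves a k-mer together with all of its one-mismatch neighbours, which is the kind of saving in hash references that the abstract describes.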
Bisulfite sequencing is a powerful technique to study DNA cytosine methylation. Bisulfite treatment followed by PCR amplification specifically converts unmethylated cytosines to thymine. Coupled with next generation sequencing technology, it is able to detect the methylation status of every cytosine in the genome. However, mapping high-throughput bisulfite reads to the reference genome remains a great challenge due to the increased searching space, reduced complexity of bisulfite sequence, asymmetric cytosine to thymine alignments, and multiple CpG heterogeneous methylation.
We developed an efficient bisulfite reads mapping algorithm BSMAP to address the above issues. BSMAP combines genome hashing and bitwise masking to achieve fast and accurate bisulfite mapping. Compared with existing bisulfite mapping approaches, BSMAP is faster, more sensitive and more flexible.
BSMAP is the first general-purpose bisulfite mapping software. It is able to map high-throughput bisulfite reads at whole genome level with feasible memory and CPU usage. It is freely available under GPL v3 license at .
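The asymmetric cytosine-to-thymine alignment mentioned above can be made concrete with a short sketch. It is a plain per-base comparison for forward-strand reads only, with hypothetical function names; it does not reproduce BSMAP's hashing or bitwise masking. A T in the read is accepted against a C in the reference (an unmethylated cytosine converted by bisulfite treatment), but a C in the read against a T in the reference is still counted as a mismatch.

```python
def bisulfite_mismatches(read, ref_window):
    """Count mismatches, tolerating read T aligned to reference C."""
    mismatches = 0
    for read_base, ref_base in zip(read, ref_window):
        if read_base == ref_base:
            continue
        if read_base == "T" and ref_base == "C":
            continue  # consistent with bisulfite conversion of unmethylated C
        mismatches += 1
    return mismatches


def methylation_calls(read, ref_window):
    """For each reference C covered by the read, report whether it reads as methylated."""
    calls = []
    for offset, (read_base, ref_base) in enumerate(zip(read, ref_window)):
        if ref_base == "C" and read_base in "CT":
            calls.append((offset, read_base == "C"))
    return calls


print(bisulfite_mismatches("TTGT", "CTGC"))  # 0: both T-over-C pairs are allowed
print(bisulfite_mismatches("CTGA", "TTGA"))  # 1: C-over-T is not allowed
print(methylation_calls("TTGC", "CTGC"))     # [(0, False), (3, True)]
```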
Recent methods have been developed to perform high-throughput sequencing of DNA by Single Molecule Sequencing (SMS). While Next-Generation sequencing methods may produce reads up to several hundred bases long, SMS sequencing produces reads up to tens of kilobases long. Existing alignment methods are either too inefficient for high-throughput datasets, or not sensitive enough to align SMS reads, which have a higher error rate than Next-Generation sequencing.
We describe the method BLASR (Basic Local Alignment with Successive Refinement) for mapping Single Molecule Sequencing (SMS) reads that are thousands of bases long, with divergence between the read and genome dominated by insertion and deletion error. The method is benchmarked using both simulated reads and reads from a bacterial sequencing project. We also present a combinatorial model of sequencing error that motivates why our approach is effective.
The results indicate that it is possible to map SMS reads with high accuracy and speed. Furthermore, the inferences made on the mapability of SMS reads using our combinatorial model of sequencing error are in agreement with the mapping accuracy demonstrated on simulated reads.
Methyltransferases (MTases) of procaryotes affect general cellular processes such as mismatch repair, regulation of transcription, replication, and transposition, and in some cases may be essential for viability. As components of restriction-modification systems, they contribute to bacterial genetic diversity. The genome of Helicobacter pylori strain 26695 contains 25 open reading frames encoding putative DNA MTases. To assess which MTase genes are active, strain 26695 genomic DNA was tested for cleavage by 147 restriction endonucleases; 24 were found that did not cleave this DNA. The specificities of 11 expressed MTases and the genes encoding them were identified from this restriction data, combined with the known sensitivities of restriction endonucleases to specific DNA modification, homology searches, gene cloning and genomic mapping of the methylated bases m4C, m5C, and m6A.
Saccharomyces cerevisiae strains carrying vps18 mutations are defective in the sorting and transport of vacuolar enzymes. The precursor forms of these proteins are missorted and secreted from the mutant cells. Most vps18 mutants are temperature sensitive for growth and are defective in vacuole biogenesis; no structure resembling a normal vacuole is seen. A plasmid complementing the temperature-sensitive growth defect of strains carrying the vps18-4 allele was isolated from a centromere-based yeast genomic library. Integrative mapping experiments indicated that the 26-kb insert in this plasmid was derived from the VPS18 locus. A 4-kb minimal complementing fragment contains a single long open reading frame predicted to encode a 918-amino-acid hydrophilic protein. Comparison of the VPS18 sequence with the PEP3 sequence reported in the accompanying paper (R. A. Preston, H. F. Manolson, K. Becherer, E. Weidenhammer, D. Kirkpatrick, R. Wright, and E. W. Jones, Mol. Cell. Biol. 11:5801-5812, 1991) shows that the two genes are identical. Disruption of the VPS18/PEP3 gene (vps18 delta 1::TRP1) is not lethal but results in the same vacuolar protein sorting and growth defects exhibited by the original temperature-sensitive vps18 alleles. In addition, vps18 delta 1::TRP1 MAT alpha strains exhibit a defect in the Kex2p-dependent processing of the secreted pheromone alpha-factor. This finding suggests that vps18 mutations alter the function of a late Golgi compartment which contains Kex2p and in which vacuolar proteins are thought to be sorted from proteins destined for the cell surface. The Vps18p sequence contains a cysteine-rich, zinc finger-like motif at the COOH terminus. A mutant in which the first cysteine of this motif was changed to serine results in a temperature-conditional carboxypeptidase Y sorting defect shortly after a shift to nonpermissive conditions. 
We identified a similar cysteine-rich motif near the COOH terminus of another Vps protein, the Vps11/Pep5/End1 protein. Preston et al. (Mol. Cell. Biol. 11:5801-5812, 1991) present evidence that the Vps18/Pep3 protein colocalizes with the Vps11/Pep5 protein to the cytosolic face of the vacuolar membrane. Together with the similar phenotypes exhibited by both vps11 and vps18 mutants, this finding suggests that they may function at a common step during vacuolar protein sorting and that the integrity of their zinc finger motifs may be required for this function.
The gene encoding the major capsid protein of the baculovirus Autographa californica nuclear polyhedrosis virus (AcMNPV) was identified, sequenced, and transcriptionally mapped. The location of the gene was determined by immunological screening of an expression library of AcMNPV open reading frame-beta-galactosidase fusions with an antibody raised to virus structural proteins. The DNA sequence of the corresponding region, which mapped within 56.6 and 58.0 map units on the AcMNPV genome, revealed a 1,040-base-pair open reading frame capable of encoding a 39-kilodalton polypeptide. The identity of the polypeptide was determined by Western blot (immunoblot) analysis of purified empty capsids with an antibody raised to the capsid-beta-galactosidase fusion protein. The identity of the peptide encoded by the gene was confirmed by immunoprecipitation of an in vitro translation product with RNA selected by hybridization to DNA sequences from the coding region of the gene. Transcripts of the capsid gene were analyzed by Northern (RNA) blots and mapped by nuclease protection and primer extension analysis. The capsid gene is transcribed maximally at 12 and 24 h postinfection but not in the presence of cycloheximide, a protein synthesis inhibitor, or aphidicolin, a viral DNA synthesis inhibitor, and is therefore classified as a late gene. The gene is transcribed in a counterclockwise direction with respect to the circular map. There are three transcriptional start sites, all containing the AGTAAG consensus sequence found at the start site of all late AcMNPV genes.
Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects.
We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time.
The proposed methodology was implemented in a software tool called TAPyR--Tool for the Alignment of Pyrosequencing Reads--which is publicly available from http://www.tapyr.net.
Several bioinformatics methods have been proposed for the detection and characterization of genomic structural variation (SV) from ultra high-throughput genome resequencing data. Recent surveys show that comprehensive detection of SV events of different types between an individual resequenced genome and a reference sequence is best achieved through the combination of methods based on different principles (split mapping, reassembly, read depth, insert size, etc.). The improvement of individual predictors is thus an important objective. In this study, we propose a new method that combines deviations from expected library insert sizes and additional information from local patterns of read mapping and uses supervised learning to predict the position and nature of structural variants. We show that our approach provides greatly increased sensitivity with respect to other tools based on paired end read mapping at no cost in specificity, and it makes reliable predictions of very short insertions and deletions in repetitive and low-complexity genomic contexts that can confound tools based on split mapping of reads.
MOCAT is a highly configurable, modular pipeline for fast, standardized processing of single or paired-end sequencing data generated by the Illumina platform. The pipeline uses state-of-the-art programs to quality control, map, and assemble reads from metagenomic samples sequenced at a depth of several billion base pairs, and predict protein-coding genes on assembled metagenomes. Mapping against reference databases allows for read extraction or removal, as well as abundance calculations. Relevant statistics for each processing step can be summarized into multi-sheet Excel documents and queryable SQL databases. MOCAT runs on UNIX machines and integrates seamlessly with the SGE and PBS queuing systems, commonly used to process large datasets. The open source code and modular architecture allow users to modify or exchange the programs that are utilized in the various processing steps. Individual processing steps and parameters were benchmarked and tested on artificial, real, and simulated metagenomes resulting in an improvement of selected quality metrics. MOCAT can be freely downloaded at http://www.bork.embl.de/mocat/.
Summary: Sequencing reads generated by RNA-sequencing (RNA-seq) must first be mapped back to the genome through alignment before they can be further analyzed. Current fast and memory-saving short-read mappers could give us a quick view of the transcriptome. However, they are neither designed for reads that span across splice junctions nor for repetitive reads, which can be mapped to multiple locations in the genome (multi-reads). Here, we describe a new software package: ABMapper, which is specifically designed for exploring all putative locations of reads that are mapped to splice junctions or repetitive in nature.
Availability and Implementation: The software is freely available at: http://abmapper.sourceforge.net/. The software is written in C++ and PERL. It runs on all major platforms and operating systems including Windows, Mac OS X and LINUX.
Supplementary information: Supplementary data are available at Bioinformatics online.
A vaccinia virus (VV) gene required for DNA replication has been mapped to the left side of the 16-kilobase (kb) VV HindIII D DNA fragment by marker rescue of a DNA- temperature-sensitive mutant, ts17, using cloned fragments of the viral genome. The region of VV DNA containing the ts17 locus (3.6 kb) was sequenced. This nucleotide sequence contains one complete open reading frame (ORF) and two incomplete ORFs reading from left to right. Analysis of this region at early times revealed that transcription from the incomplete upstream ORF terminates coincidentally with the complete ORF encoding the ts17 gene product, which is directly downstream. The predicted proteins encoded by this region correlate well with polypeptides mapped by in vitro translation of hybrid-selected early mRNA. The nucleotide sequences of a 1.3-kb BglII fragment derived from ts17 and from two ts17 revertants were also determined, and the nature of the ts17 mutation was identified. S1 nuclease protection studies were carried out to determine the 5' and 3' ends of the transcripts and to examine the kinetics of expression of the ts17 gene during viral infection. The ts17 transcript is present at both early and late times postinfection, indicating that this gene is constitutively expressed. Surprisingly, the transcriptional start throughout infection occurs at the proposed late regulatory element TAA, which immediately precedes the putative initiation codon ATG. Although the biological activity of the ts17-encoded polypeptide was not identified, it was noted that in ts17-infected cells, expression of a nonlinked VV immediate-early gene (thymidine kinase) was deregulated at the nonpermissive temperature. This result may indicate that the ts17 gene product is functionally required at an early step of the VV replicative cycle.
Massively parallel sequencing readouts of epigenomic assays are enabling integrative genome-wide analyses of genomic and epigenomic variation. Pash 3.0 performs sequence comparison and read mapping and can be employed as a module within diverse configurable analysis pipelines, including ChIP-Seq and methylome mapping by whole-genome bisulfite sequencing.
Pash 3.0 generally matches the accuracy and speed of niche programs for fast mapping of short reads, and exceeds their performance on longer reads generated by a new generation of massively parallel sequencing technologies. By exploiting longer read lengths, Pash 3.0 maps reads onto the large fraction of genomic DNA that contains repetitive elements and polymorphic sites, including indel polymorphisms.
We demonstrate the versatility of Pash 3.0 by analyzing the interaction between CpG methylation, CpG SNPs, and imprinting based on publicly available whole-genome shotgun bisulfite sequencing data. Pash 3.0 makes use of gapped k-mer alignment, a non-seed based comparison method, which is implemented using multi-positional hash tables. This allows Pash 3.0 to run on diverse hardware platforms, including individual computers with standard RAM capacity, multi-core hardware architectures and large clusters.
A Tn5-based mutagenesis strategy was used to generate a collection of trichloroethylene (TCE)-sensitive (TCS) mutants in order to identify repair systems or protective mechanisms that shield Burkholderia cepacia G4 from the toxic effects associated with TCE oxidation. Single Tn5 insertion sites were mapped within open reading frames putatively encoding enzymes involved in DNA repair (UvrB, RuvB, RecA, and RecG) in 7 of the 11 TCS strains obtained (4 of the TCS strains had a single Tn5 insertion within a uvrB homolog). The data revealed that the uvrB-disrupted strains were exceptionally susceptible to killing by TCE oxidation, followed by the recA strain, while the ruvB and recG strains were just slightly more sensitive to TCE than the wild type. The uvrB and recA strains were also extremely sensitive to UV light and, to a lesser extent, to exposure to mitomycin C and H2O2. The data from this study establish that there is a link between DNA repair and the ability of B. cepacia G4 cells to survive following TCE transformation. A possible role for nucleotide excision repair and recombination repair activities in TCE-damaged cells is discussed.
cDNA encoding Ca2+-ATPase was cloned from a chicken skeletal muscle library. The cDNA (termed FCa) comprised 3,239 base pairs, including an open reading frame encoding 994 amino acids which showed the highest degree of homology with the adult rabbit fast-twitch Ca2+-ATPase isoform (C. J. Brandl, S. de Leon, D. R. Martin, and D. H. MacLennan, J. Biol. Chem. 262:3768-3774, 1987). Radiolabeled FCa hybridized to a 3.2-kilobase transcript in chicken skeletal muscle RNA but not to cardiac muscle RNA, which confirmed its identity as encoding the fast Ca2+-ATPase isoenzyme. FCa was transfected into the mouse myogenic line C2C12, from which a protein of 100 kilodaltons was immunopurified by using a monoclonal antibody specific for the avian fast Ca2+-ATPase. Immunofluorescence microscopy of a line (designated C2FCa2) stably expressing the avian Ca2+-ATPase localized the protein to the nuclear envelope and a population of cytoplasmic vesicles. A similar pattern was observed when C2FCa2 cells were stained with DiOC6(3), a cyanine dye that labels endoplasmic reticulum and mitochondria (M. Terasaki, J. Song, J. R. Wong, M. J. Weiss, and L. B. Chen, Cell 38:101-108, 1984). We conclude that the avian Ca2+-ATPase fast isoform is expressed and correctly targeted to the endoplasmic reticulum in mouse C2C12 cells.
Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.
Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
A 2.8-kb region of the Autographa californica nuclear polyhedrosis virus genome was sequenced and found to contain an open reading frame (p47) which was capable of rescuing a previously characterized temperature-sensitive mutant, ts317 (S. Partington, H. Yu, A. Lu, and E. B. Carstens, Virology 157:91-102, 1990). Transcriptional mapping demonstrated that an early 4.2-kb RNA encoded the p47 open reading frame and probably overlapped the 39K delayed-early gene. The p47 open reading frame was cloned behind the polyhedrin promoter in a baculovirus transfer plasmid, which was then used to prepare a recombinant baculovirus overexpressing the p47 polypeptide. The overexpressed polypeptide was used to prepare p47-specific monoclonal antibodies. These antibodies detected a polypeptide of 47 kDa in A. californica nuclear polyhedrosis virus-infected cells, demonstrating that p47 is expressed as an authentic viral product. The p47 gene product was localized to the nucleus of infected cells, supporting the hypothesis that it is involved in regulating viral transcription at late times postinfection.
RNA-seq has proven to be a powerful technique for transcriptome profiling based on next-generation sequencing (NGS) technologies. However, due to the short length of NGS reads, it is challenging to accurately map RNA-seq reads to splice junctions (SJs), which is a critically important step in the analysis of alternative splicing (AS) and isoform construction. In this article, we describe a new method, called TrueSight, which for the first time combines RNA-seq read mapping quality and coding potential of genomic sequences into a unified model. The model is further utilized in a machine-learning approach to precisely identify SJs. Both simulations and real data evaluations showed that TrueSight achieved higher sensitivity and specificity than other methods. We applied TrueSight to new high coverage honey bee RNA-seq data to discover novel splice forms. We found that 60.3% of honey bee multi-exon genes are alternatively spliced. By utilizing gene models improved by TrueSight, we characterized AS types in honey bee transcriptome. We believe that TrueSight will be highly useful to comprehensively study the biology of alternative splicing.
The DNA sequence of the ermC gene of plasmid pE194 is presented. This determinant is responsible for erythromycin-induced resistance to the macrolide-lincosamide-streptogramin B group of antibiotics and specifies a 29,000 dalton inducible protein. The locations of the ermC promoter, as well as that of a probable transcriptional terminator, are established both from the sequence and by transcription mapping. The sequence contains an open reading frame sufficient to encode the previously identified 29,000 dalton ermC protein. Between the promoter and the putative ATG start codon is a 141 base pair leader sequence, within which several regulatory (constitutive) mutations have been mapped and sequenced. The leader has a second open reading frame, sufficient to encode a 19 amino acid peptide. It is suggested that induction by erythromycin involves a shift between alternative ribosome-bound mRNA conformations, so that the ribosome binding sequence and the start codon for synthesis of the 29K protein are unmasked in the presence of inducer. Possible active and inactive folded configuration of the leader sequence are presented, as well as the effects on these configurations of regulatory mutations.
An antibody made against the herpes simplex virus 1 US5 gene predicted to encode glycoprotein J was found to react strongly with two proteins, one with an apparent Mr of 23,000 and mapping in the S component and one with a herpes simplex virus protein with an apparent Mr of 43,000. The antibody also reacted with herpes simplex virus type 2 proteins forming several bands with apparent Mrs ranging from 43,000 to 50,000. Mapping studies based on intertypic recombinants, analyses of deletion mutants, and ultimately, reaction of the antibody with a chimeric protein expressed by in-frame fusion of the glutathione S-transferase gene to an open reading frame antisense to the gene encoding glycoprotein B led to the definitive identification of the new open reading frame, designated UL27.5. Sequence analyses indicate the conservation of a short amino acid sequence common to US5 and UL27.5. The coding sequence of the herpes simplex virus UL27.5 open reading frame is strongly homologous to the sequence encoding the carboxyl terminus of the herpes simplex virus 2 UL27.5 sequence. However, both open reading frames could encode proteins predicted to be significantly larger than the mature UL27.5 proteins accumulating in the infected cells, indicating that these are either processed posttranslationally or synthesized from alternate, nonmethionine-initiating codons. The UL27.5 gene expression is blocked by phosphonoacetate, indicating that it is a γ2 gene. The product accumulated predominantly in the cytoplasm. UL27.5 is the third open reading frame found to map totally antisense to another gene and suggests that additional genes mapping antisense to known genes may exist.
Motivation: The enormous number of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals.
Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package.
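Backward search with the Burrows–Wheeler transform can be sketched in a few lines. The example below is a minimal illustration of exact matching only and is not BWA code: it builds the suffix array naively, stores full occurrence tables, and uses hypothetical function names. The handling of mismatches and gaps described in the abstract is not covered.

```python
def bwt_index(text):
    """Build suffix array, C table and occurrence table for text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, c_table, occ = bwt_index("GATTACATTA")
print(backward_search("TTA", sa, c_table, occ))  # [2, 7]
print(backward_search("GG", sa, c_table, occ))   # []
```

Each pattern character narrows the suffix array interval with two table lookups, so the search time depends on the read length rather than on the size of the reference.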