Related Articles

1.  Local alignment of two-base encoded DNA sequence 
BMC Bioinformatics  2009;10:175.
Background
DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.
Results
We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions.
Conclusion
The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data.
doi:10.1186/1471-2105-10-175
PMCID: PMC2709925  PMID: 19508732
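The abstract does not spell out the encoding, so the following is only a minimal sketch, assuming the common XOR-style two-base scheme used for "color-space" reads (each color is the XOR of the 2-bit codes of two adjacent bases, with A=0, C=1, G=2, T=3). The function names are hypothetical. It illustrates why decoding is deterministic, and why a single measurement error is costly if decoding is done before alignment.

```python
# Assumed 2-bit base codes; a color is the XOR of two adjacent base codes.
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq):
    """Return one color (0-3) for each pair of adjacent bases in seq."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    bases = [first_base]
    code = BASE_TO_CODE[first_base]
    for color in colors:
        code ^= color  # XOR with the color yields the next base code
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


colors = encode_colors("ATGGC")
clean = decode_colors("A", colors)

# Corrupt one color to mimic a single measurement error.
noisy = list(colors)
noisy[1] ^= 3
corrupted = decode_colors("A", noisy)
```

Under this assumed scheme, one miscalled color changes every base decoded after it, which is consistent with the abstract's motivation for decoding and aligning simultaneously rather than in two separate passes.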
2.  Stabilizing synthetic data in the DNA of living organisms 
Systems and Synthetic Biology  2008;2(1-2):19-25.
Data-encoding synthetic DNA, inserted into the genome of a living organism, is thought to be more robust than the current media. Because the living genome is duplicated and copied into new generations, one of the merits of using DNA material is long-term data storage within heritable media. A disadvantage of this approach is that encoded data can be unexpectedly broken by mutation, deletion, and insertion of DNA, which occurs naturally during evolution and prolongation, or laboratory experiments. For this reason, several information theory-based approaches have been developed as an error check of broken DNA data in order to achieve data durability. These approaches cannot efficiently recover badly damaged data-encoding DNA. We recently developed a DNA data-storage approach based on the multiple sequence alignment method to achieve a high level of data durability. In this paper, we overview this technology and discuss strategies for optimal application of this approach.
doi:10.1007/s11693-008-9020-5
PMCID: PMC2671590  PMID: 19083123
DNA data storage; Sequence alignment; Polymerase chain reaction (PCR); Error check; Error correction; Genetically modified organism (GMO)
3.  Incorporating sequence quality data into alignment improves DNA read mapping 
Nucleic Acids Research  2010;38(7):e100.
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the amount of correctly mapped reads from 49 to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
doi:10.1093/nar/gkq010
PMCID: PMC2853142  PMID: 20110255
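As a rough illustration of what "per-base error probabilities reported by sequencers" look like in practice, the sketch below converts Phred quality scores to error probabilities using the standard relation P = 10^(-Q/10). The Phred+33 ASCII offset and the function names are assumptions made for the example; this is not the scoring model of the paper itself.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a quality string, assuming the Phred+33 ASCII convention."""
    return [ord(ch) - 33 for ch in quality_string]


def expected_errors(quality_string):
    """Sum per-base error probabilities: the expected number of wrong bases."""
    return sum(phred_to_error_prob(q) for q in decode_phred33(quality_string))
```

An aligner that models sequencer error could, for instance, penalize a mismatch less at a position whose error probability is high, since the mismatch is then more likely to be a miscall than a real sequence difference.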
4.  Multiple sequence alignments of partially coding nucleic acid sequences 
BMC Bioinformatics  2005;6:160.
Background
High quality sequence alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit a much larger sequence heterogeneity compared to their encoded protein sequences due to the redundancy of the genetic code. It is desirable, therefore, to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only a part of the sequence of interest is translated. On the other hand, overlapping reading frames may encode multiple alternative proteins, possibly with intermittent non-coding parts. Examples are, in particular, RNA virus genomes.
Results
The standard scoring scheme for nucleic acid alignments can be extended to incorporate simultaneously information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments that allows arbitrary weighting for almost all scoring parameters. Resource requirements of codaln are comparable with those of standard tools such as ClustalW.
Conclusion
We demonstrate the applicability of codaln to various biologically relevant types of sequences (bacteriophage Levivirus and Vertebrate Hox clusters) and show that the combination of nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments such as methods for detecting conserved RNA secondary structure elements.
doi:10.1186/1471-2105-6-160
PMCID: PMC1182351  PMID: 15985156
5.  Error and Error Mitigation in Low-Coverage Genome Assemblies 
PLoS ONE  2011;6(2):e17034.
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ∼2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
doi:10.1371/journal.pone.0017034
PMCID: PMC3038916  PMID: 21340033
6.  ChloroplastDB: the Chloroplast Genome Database 
Nucleic Acids Research  2005;34(Database issue):D692-D696.
The Chloroplast Genome Database (ChloroplastDB) is an interactive, web-based database for fully sequenced plastid genomes, containing genomic, protein, DNA and RNA sequences, gene locations, RNA-editing sites, putative protein families and alignments. With recent technical advances, the rate of generating new organelle genomes has increased dramatically. However, the established ontology for chloroplast genes and gene features has not been uniformly applied to all chloroplast genomes available in the sequence databases. For example, annotations for some published genome sequences have not evolved with gene naming conventions. ChloroplastDB provides unified annotations, gene name search, BLAST and download functions for chloroplast encoded genes and genomic sequences. A user can retrieve all orthologous sequences with one search regardless of gene names in GenBank. This feature alone greatly facilitates comparative research on sequence evolution including changes in gene content, codon usage, gene structure and post-transcriptional modifications such as RNA editing. Orthologous protein sets are classified by TribeMCL and each set is assigned a standard gene name. Over the next few years, as the number of sequenced chloroplast genomes increases rapidly, the tools available in ChloroplastDB will allow researchers to easily identify and compile target data for comparative analysis of chloroplast genes and genomes.
doi:10.1093/nar/gkj055
PMCID: PMC1347418  PMID: 16381961
7.  Structure and Replication of yDNA: A Novel Genetic Set Widened by Benzo-homologation 
In a functioning genetic system, the information-encoding molecule must form a regular self-complementary complex (e.g., the base paired double helix of DNA) and it must be able to encode information and pass it on to new generations. Here we study a benzo-widened DNA-like molecule (yDNA) as a candidate for an alternative genetic set, and we explicitly test these two structural and functional requirements. The solution structure of a 10-bp yDNA duplex is measured by 2D-NMR methods for a simple sequence composed of T-yA/yA-T pairs. The data confirm an antiparallel, right-handed, hydrogen-bonded helix resembling B-DNA but with wider diameter and enlarged base pair size. In addition to this, the abilities of two different polymerase enzymes (Klenow fragment of DNA pol I (Kf) and the repair enzyme Dpo4) to synthesize and extend the yDNA pairs T-yA, A-yT, and G-yC are measured by steady-state kinetics studies. Not surprisingly, insertion of complementary bases opposite yDNA bases is inefficient due to the larger base pair size. We find that correct pairing occurs in several cases by both enzymes, but that common and relatively efficient mispairing involving T-yT and T-yC pairs interferes with fully correct formation and extension of pairs by these polymerases. Interestingly, the data show that extension of the large pairs is considerably more efficient with the flexible repair enzyme (Dpo4) than with the more rigid Kf enzyme. The results shed light on the properties of yDNA as a candidate for an alternative genetic information-encoding molecule and as a tool for application in basic science and biomedicine.
doi:10.1002/cbic.200900434
PMCID: PMC2982676  PMID: 19780073
Wide DNA; genetic set; base pair; polymerase; solution structure; NMR
8.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;27(14):1915-1921.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at http://drfast.sourceforge.net
Contact: calkan@u.washington.edu
doi:10.1093/bioinformatics/btr303
PMCID: PMC3129524  PMID: 21586516
9.  Antigenic diversity of meningococcal outer membrane protein PorA has implications for epidemiological analysis and vaccine design. 
The currently used serological subtyping scheme for the pathogen Neisseria meningitidis is not comprehensive, a proportion of isolates are reported as not subtypeable (NST), and few isolates are fully characterized with two subtypes for each strain. To establish the reasons for this and to assess the effectiveness of DNA-based subtyping schemes, dot blot hybridization and nucleotide sequence analyses were used to characterize the genes encoding antigenic variants of the meningococcal subtyping antigen, the PorA protein. A total of 233 strains, including 174 serologically NST and 59 partially or completely subtyped meningococcal strains, were surveyed. The NST isolates were chosen to be temporally and geographically representative of NST strains, isolated in England and Wales, and submitted to the Meningococcal Reference Unit in the period 1989 to 1991. The DNA-based analyses demonstrated that all of the strains examined possessed a porA gene. Some of these strains were serologically NST because of a lack of monoclonal antibodies against certain PorA epitopes; in other cases, strains expressed minor variants of known PorA epitopes that did not react with monoclonal antibodies in serological assays. Lack of expression remained a possible explanation for serological typing failure in some cases. These findings have important implications for epidemiological analysis and vaccine design and demonstrate the need for genetic characterization, rather than phenotypic characterization using monoclonal antibodies, for the identification of meningococcal strains.
PMCID: PMC170365  PMID: 8807211
10.  Aligning Sequences by Minimum Description Length 
This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from . A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
doi:10.1155/2007/72936
PMCID: PMC3171350  PMID: 18274649
11.  Sequence Alignment by Cross-Correlation 
Many recent advances in biology and medicine have resulted from DNA sequence alignment algorithms and technology. Traditional approaches for the matching of DNA sequences are based either on global alignment schemes or heuristic schemes that seek to approximate global alignment algorithms while providing higher computational efficiency. This report describes an approach using the mathematical operation of cross-correlation to compare sequences. It can be implemented using the fast Fourier transform for computational efficiency. The algorithm is summarized and sample applications are given. These include gene sequence alignment in long stretches of genomic DNA, finding sequence similarity in distantly related organisms, demonstrating sequence similarity in the presence of massive (approximately 90%) random point mutations, comparing sequences related by internal rearrangements (tandem repeats) within a gene, and investigating fusion proteins. Application to RNA and protein sequence alignment is also discussed. The method is efficient, sensitive, and robust, being able to find sequence similarities where other alignment algorithms may perform poorly.
PMCID: PMC2291754  PMID: 16522868
Sequence alignment; algorithm; software; cross-correlation
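The idea of comparing sequences by cross-correlation can be sketched as follows. Summing the cross-correlations of the per-letter indicator vectors of two sequences gives, for each relative offset, the number of positions at which the letters agree. This sketch computes that sum directly in O(nm) time for clarity; the abstract notes that the fast Fourier transform can be used instead for efficiency. The function names are hypothetical, and the report's actual algorithm may differ in its details.

```python
def match_counts_by_offset(query, target):
    """Count matching letters for every relative offset of query on target.

    Offset k places query[0] opposite target[k]; negative offsets let the
    query overhang the start of the target.
    """
    counts = {}
    for k in range(-(len(query) - 1), len(target)):
        total = 0
        for i, ch in enumerate(query):
            j = i + k
            if 0 <= j < len(target) and target[j] == ch:
                total += 1
        counts[k] = total
    return counts


def best_offset(query, target):
    """Return the offset with the highest match count (the correlation peak)."""
    counts = match_counts_by_offset(query, target)
    return max(counts, key=counts.get)
```

Because the peak depends only on the number of agreeing positions, a shared region can still stand out above the background when many positions have mutated, as long as enough of them still agree.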
12.  Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison 
Standards in Genomic Sciences  2010;2(1):117-134.
The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.
doi:10.4056/sigs.531120
PMCID: PMC3035253  PMID: 21304684
Archaea; Bacteria; BLAST; GBDP; genomics; MUMmer; phylogeny; species concept; taxonomy
13.  RevTrans: multiple alignment of coding DNA from aligned amino acid sequences 
Nucleic Acids Research  2003;31(13):3537-3539.
The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the ‘signal-to-noise ratio’ in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous mutations over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level and it is for this purpose we have constructed the program RevTrans. RevTrans constructs a multiple DNA alignment by: (i) translating the DNA; (ii) aligning the resulting peptide sequences; and (iii) building a multiple DNA alignment by ‘reverse translation’ of the aligned protein sequences. In the resulting DNA alignment, gaps occur in groups of three corresponding to entire codons, and analogous codon positions are therefore always lined up. These features are useful when constructing multiple DNA alignments for phylogenetic analysis. RevTrans also accepts user-provided protein alignments for greater control of the alignment process. The RevTrans web server is freely available at http://www.cbs.dtu.dk/services/RevTrans/.
PMCID: PMC169015  PMID: 12824361
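Step (iii) above, building the DNA alignment by "reverse translation" of an aligned protein sequence, can be sketched as below. This is a minimal illustration under stated assumptions, not the RevTrans implementation: the function name is hypothetical, the DNA is assumed to be exactly the coding sequence of the protein (three bases per residue, no stop codon), and the protein alignment from step (ii) is taken as given.

```python
def back_translate(aligned_protein, dna, gap="-"):
    """Thread an ungapped coding DNA sequence onto its aligned protein.

    Each residue consumes the next codon of dna; each protein gap becomes
    a three-character gap, so codon positions stay lined up.
    """
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(dna) != 3 * residues:
        raise ValueError("DNA length must be three times the residue count")
    out = []
    pos = 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)
```

For example, threading `ATGAAA` onto the aligned peptide `M-K` yields `ATG---AAA`, showing how gaps appear in groups of three corresponding to entire codons.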
14.  Base-By-Base version 2: single nucleotide-level analysis of whole viral genome alignments 
Background
Base-By-Base is a Java-based multiple sequence alignment editor. It is capable of working with protein and DNA molecules, but many of its unique features relate to the manipulation of the genomes of large DNA viruses such as poxviruses, herpesviruses, baculoviruses and asfarviruses (1-400 kb). The tool was built to serve as a platform for comparative genomics at the level of individual nucleotides.
Results
In version 2, BBB-v2, of Base-By-Base we have added a series of new features aimed at providing the bench virologist with a better platform to view, annotate and analyze these complex genomes. Although a poxvirus genome, for example, may be less than 200 kb, it probably encodes close to 200 proteins using multiple classes of promoters with frequent overlapping of promoters and coding sequences and even some overlapping of genes. The new features allow users to 1) add primer annotations or other data sets in batch mode, 2) export differences between sequences to other genome browsers, 3) compare multiple genomes at a single nucleotide level of detail, 4) create new alignments from subsets/subsequences of a very large master alignment and 5) allow display of summaries of deep RNA sequencing data sets on a genome sequence.
Conclusion
BBB-v2 significantly improves the ability of virologists to work with genome sequences and provides a platform with which they can use a multiple sequence alignment as the basis for their own editable documents. Also, a .bbb document, with a variety of annotations in addition to the basic coding regions, can be shared among collaborators or made available to an entire research community. The program is available via Virology.ca using Java Web Start and is platform independent; the Java 1.5 virtual machine is required.
doi:10.1186/2042-5783-1-2
PMCID: PMC3348662  PMID: 22587754
15.  Analysis of Mitochondrial DNA Sequences in Childhood Encephalomyopathies Reveals New Disease-Associated Variants 
PLoS ONE  2007;2(9):e942.
Background
Mitochondrial encephalomyopathies are a heterogeneous group of clinical disorders generally caused by mutations in either mitochondrial DNA (mtDNA) or nuclear genes encoding oxidative phosphorylation (OXPHOS). We analyzed the mtDNA sequences from a group of 23 pediatric patients with clinical and morphological features of mitochondrial encephalopathies and tried to establish a relationship of identified variants with the disease.
Methodology/Principal Findings
Complete mitochondrial genomes were amplified by PCR and sequenced by automated DNA sequencing. Sequencing data was analyzed by SeqScape software and also confirmed by BLASTn program. Nucleotide sequences were compared with the revised Cambridge reference sequence (CRS) and sequences present in mitochondrial databases. The data obtained shows that a number of known and novel mtDNA variants were associated with the disease. Most of the non-synonymous variants were heteroplasmic (A4136G, A9194G and T11916A) suggesting their possibility of being pathogenic in nature. Some of the missense variants although homoplasmic were showing changes in highly conserved amino acids (T3394C, T3866C, and G9804A) and were previously identified with diseased conditions. Similarly, two other variants found in tRNA genes (G5783A and C8309T) could alter the secondary structure of Cys-tRNA and Lys-tRNA. Most of the variants occurred in single cases; however, a few occurred in more than one case (e.g. G5783A and A10149T).
Conclusions and Significance
The mtDNA variants identified in this study could be the possible cause of mitochondrial encephalomyopathies with childhood onset in the patient group. Our study further strengthens the pathogenic score of known variants previously reported as provisionally pathogenic in mitochondrial diseases. The novel variants found in the present study can be potential candidates for further investigations to establish the relationship between their incidence and role in expressing the disease phenotype. This study will be useful in genetic diagnosis and counseling of mitochondrial diseases in India as well as worldwide.
doi:10.1371/journal.pone.0000942
PMCID: PMC1976591  PMID: 17895983
16.  BFAST: An Alignment Tool for Large Scale Genome Resequencing 
PLoS ONE  2009;4(11):e7767.
Background
The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, result in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25–100 base range, in the presence of errors and true biological variation.
Methodology
We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.
Conclusions
We compare BFAST to a selection of large-scale alignment tools - BLAT, MAQ, SHRiMP, and SOAP - in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
doi:10.1371/journal.pone.0007767
PMCID: PMC2770639  PMID: 19907642
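The final local alignment step mentioned above uses a Smith-Waterman method with gaps. The sketch below computes only the best local alignment score, using a linear gap penalty for brevity. The scoring values and the function name are illustrative assumptions and are not taken from BFAST.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets a local alignment restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these illustrative scores, aligning `ACGTACGT` against `ACGACGT` scores highest when a one-base gap is opened, which is how a gapped local alignment can expose a small deletion in a read relative to the reference.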
17.  Efficient alignment of pyrosequencing reads for re-sequencing applications 
BMC Bioinformatics  2011;12:163.
Background
Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects.
Results
We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art indexing techniques combined with a flexible seed-based approach, leading to a fast and accurate algorithm which needs very little user parameterization. An evaluation performed using real and simulated data shows that our proposed method outperforms a number of mainstream tools on the quantity and quality of successful alignments, as well as on the execution time.
Conclusions
The proposed methodology was implemented in a software tool called TAPyR--Tool for the Alignment of Pyrosequencing Reads--which is publicly available from http://www.tapyr.net.
doi:10.1186/1471-2105-12-163
PMCID: PMC3118166  PMID: 21672185
18.  ReCoil - an algorithm for compression of extremely large datasets of DNA data 
The growing volume of generated DNA sequencing data makes the problem of its long term storage increasingly important. In this work we present ReCoil - an I/O efficient external memory algorithm designed for compression of very large collections of short reads DNA data. Typically each position of DNA sequence is covered by multiple reads of a short read dataset and our algorithm makes use of resulting redundancy to achieve high compression rate.
While compression based on encoding mismatches between the dataset and a similar reference can yield high compression rate, good quality reference sequence may be unavailable. Instead, ReCoil's compression is based on encoding the differences between similar or overlapping reads. As such reads may appear at large distances from each other in the dataset and since random access memory is a limited resource, ReCoil is designed to work efficiently in external memory, leveraging high bandwidth of modern hard disk drives.
doi:10.1186/1748-7188-6-23
PMCID: PMC3219593  PMID: 21988957
19.  PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. 
Nucleic Acids Research  1997;25(14):2745-2751.
Fluorescence-based sequencing is playing an increasingly important role in efforts to identify DNA polymorphisms and mutations of biological and medical interest. The application of this technology in generating the reference sequence of simple and complex genomes is also driving the development of new computer programs to automate base calling (Phred), sequence assembly (Phrap) and sequence assembly editing (Consed) in high throughput settings. In this report we describe a new computer program known as PolyPhred that automatically detects the presence of heterozygous single nucleotide substitutions by fluorescence-based sequencing of PCR products. Its operations are integrated with the use of the Phred, Phrap and Consed programs and together these tools generate a high throughput system for detecting DNA polymorphisms and mutations by large scale fluorescence-based resequencing. Analysis of sequences containing known DNA variants demonstrates that the accuracy of PolyPhred with single pass data is >99% when the sequences are generated with fluorescent dye-labeled primers and approximately 90% for those prepared with dye-labeled terminators.
PMCID: PMC146817  PMID: 9207020
20.  Meta-analysis of small RNA-sequencing errors reveals ubiquitous post-transcriptional RNA modifications 
Nucleic Acids Research  2009;37(8):2461-2470.
Recent advances in DNA-sequencing technology have made it possible to obtain large datasets of small RNA sequences. Here we demonstrate that not all non-perfectly matched small RNA sequences are simple technological sequencing errors, but many hold valuable biological information. Analysis of three small RNA datasets originating from Oryza sativa and Arabidopsis thaliana small RNA-sequencing projects demonstrates that many single nucleotide substitution errors overlap when aligning homologous non-identical small RNA sequences. Investigating the sites and identities of substitution errors reveals that many potentially originate as a result of post-transcriptional modifications or RNA editing. Modifications include N1-methyl modified purine nucleotides in tRNA, potential deamination or base substitutions in micro RNAs, 3′ micro RNA uridine extensions and 5′ micro RNA deletions. Additionally, further analysis of large sequencing datasets reveals that the combined effects of 5′ deletions and 3′ uridine extensions can alter the specificity by which micro RNAs associate with different Argonaute proteins. Hence, we demonstrate that not all sequencing errors in small RNA datasets are technical artifacts, but that these actually often reveal valuable biological insights into the sites of post-transcriptional RNA modifications.
doi:10.1093/nar/gkp093
PMCID: PMC2677864  PMID: 19255090
21.  Local Assemblies of Paired-End Reduced Representation Libraries Sequenced with the Illumina Genome Analyzer in Maize 
The use of next-generation DNA sequencing technologies has greatly facilitated reference-guided variant detection in complex plant genomes. However, complications may arise when regions adjacent to a read of interest are used for marker assay development, or when reference sequences are incomplete, as short reads alone may not be long enough to ascertain their uniqueness. Here, the possibility of generating longer sequences in discrete regions of the large and complex genome of maize is demonstrated, using a modified version of a paired-end RAD library construction strategy. Reads are generated from DNA fragments first digested with a methylation-sensitive restriction endonuclease, sheared, enriched with biotin and a selective PCR amplification step, and then sequenced at both ends. Sequences are locally assembled into contigs by subgrouping pairs based on the identity of the read anchored by the restriction site. This strategy applied to two maize inbred lines (B14 and B73) generated 183,609 and 129,018 contigs, respectively, out of which at least 76% were >200 bps in length. A subset of putative single nucleotide polymorphisms from contigs aligning to the B73 reference genome with at least one mismatch was resequenced, and 90% of those in B14 were confirmed, indicating that this method is a potent approach for variant detection and marker development in species with complex genomes or lacking extensive reference sequences.
doi:10.1155/2012/360598
PMCID: PMC3474217  PMID: 23093955
22.  DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server 
PLoS ONE  2012;7(5):e35146.
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, “DNA barcode” actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interest, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.
doi:10.1371/journal.pone.0035146
PMCID: PMC3344831  PMID: 22574113
23.  Pairagon+N-SCAN_EST: a model-based gene annotation pipeline 
Genome Biology  2006;7(Suppl 1):S5.
Background
This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci. The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments. Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets.
Results
On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean.
Conclusion
With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.
doi:10.1186/gb-2006-7-s1-s5
PMCID: PMC1810554  PMID: 16925839
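The splice-site prior mentioned in the Pairagon abstract above (introns begin with GT, GC, or AT and end with AG or AC) can be illustrated with a small check. This is only a minimal sketch: Pairagon itself uses this knowledge inside a PairHMM probability model, whereas the hypothetical function `is_plausible_intron` below applies it as a hard yes/no filter. The abstract does not say which donor pairs with which acceptor, so the sketch assumes any combination is allowed.

```python
# Intron boundary dinucleotides as listed in the abstract.
# Assumption: any donor may pair with any acceptor (the abstract does not specify pairings).
DONOR_DINUCLEOTIDES = {"GT", "GC", "AT"}
ACCEPTOR_DINUCLEOTIDES = {"AG", "AC"}


def is_plausible_intron(intron: str) -> bool:
    """Return True if the intron starts and ends with an allowed dinucleotide.

    Hypothetical helper for illustration; not part of Pairagon.
    """
    seq = intron.upper()
    if len(seq) < 4:
        # Too short to carry distinct donor and acceptor dinucleotides.
        return False
    return seq[:2] in DONOR_DINUCLEOTIDES and seq[-2:] in ACCEPTOR_DINUCLEOTIDES
```

A probabilistic aligner would instead assign each boundary pair a likelihood, so that rare but real splice sites are penalized rather than rejected outright.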
24.  Hobbes: optimized gram-based methods for efficient read alignment 
Nucleic Acids Research  2011;40(6):e41.
Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer number of reads, and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads, supporting Hamming and edit distance. Hobbes implements two novel techniques, which yield substantial performance improvements: an optimized gram-selection procedure for reads, and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read mapping programs we have tested while maintaining high mapping quality. Hobbes is about five times faster than Bowtie and about 2–10 times faster than BWA, depending on read length and error rate, when asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, respectively. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu.
doi:10.1093/nar/gkr1246
PMCID: PMC3315303  PMID: 22199254
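The general idea behind gram-based read mapping under Hamming distance, as described in the Hobbes abstract above, can be sketched with a plain pigeonhole filter. This is not Hobbes's optimized gram-selection procedure or its cache-efficient filter; it is a textbook baseline under stated assumptions. All names (`build_gram_index`, `map_read`, `hamming`) are hypothetical. The sketch relies on the fact that if a read matches a reference window with at most k mismatches, then at least one of k+1 non-overlapping grams taken from the read must match exactly.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_gram_index(reference: str, q: int) -> dict:
    """Map every q-gram of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def map_read(read: str, reference: str, index: dict, q: int, max_mismatches: int):
    """Return sorted (start, distance) pairs with Hamming distance <= max_mismatches.

    Filtering step: take max_mismatches + 1 non-overlapping q-grams from the
    read; by the pigeonhole principle at least one of them is error-free in
    any valid mapping, so only windows hit by an exact gram match are verified.
    """
    n_grams = max_mismatches + 1
    if n_grams * q > len(read):
        raise ValueError("read too short for the requested q and mismatch budget")

    candidates = set()
    for g in range(n_grams):
        offset = g * q
        for pos in index.get(read[offset:offset + q], ()):
            start = pos - offset  # implied start of the whole read
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    # Verification step: compute the full Hamming distance for each candidate.
    hits = []
    for start in sorted(candidates):
        d = hamming(read, reference[start:start + len(read)])
        if d <= max_mismatches:
            hits.append((start, d))
    return hits
```

The index is built once per reference and reused for every read, so the cost per read is dominated by how many candidate windows survive the filter. Reducing that number is the kind of problem that gram selection and candidate pruning address.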
25.  U87MG Decoded: The Genomic Sequence of a Cytogenetically Aberrant Human Cancer Cell Line 
PLoS Genetics  2010;6(1):e1000832.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.
Author Summary
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
doi:10.1371/journal.pgen.1000832
PMCID: PMC2813426  PMID: 20126413
