Search tips
Search criteria

Results 1-14 (14)

Clipboard (0)

Select a Filter Below

Year of Publication
Document Types
1.  BAsE-Seq: a method for obtaining long viral haplotypes from short sequence reads 
Genome Biology  2014;15(11):517.
We present a method for obtaining long haplotypes, of over 3 kb in length, using a short-read sequencer, Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq). BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq on mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0517-9) contains supplementary material, which is available to authorized users.
PMCID: PMC4269956  PMID: 25406369
2.  Draft Genome Sequence of Polychlorinated Biphenyl-Dechlorinating Dehalococcoides mccartyi Strain SG1, Which Carries a Circular Putative Plasmid 
Genome Announcements  2014;2(5):e00901-14.
Dehalococcoides mccartyi strain SG1, isolated from digester sludge, dechlorinates polychlorinated biphenyls (PCBs) to lower congeners. Here we report the draft genome sequence of SG1, which carries a 22.65 kbp circular putative plasmid.
PMCID: PMC4183867  PMID: 25278523
3.  Species Identification and Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences 
PLoS ONE  2013;8(4):e60811.
The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to amplify more than 90% of sequences in the Greengenes database and with the ability to distinguish nearly twice as many species-level OTUs compared to existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the constituents of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.
PMCID: PMC3620293  PMID: 23579286
4.  LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets 
Nucleic Acids Research  2012;40(22):11189-11201.
The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at
PMCID: PMC3526318  PMID: 23066108
5.  A Randomized, Double-Blind Placebo Controlled Trial of Balapiravir, a Polymerase Inhibitor, in Adult Dengue Patients 
The Journal of Infectious Diseases  2012;207(9):1442-1450.
Background. Dengue is the most common arboviral infection of humans. There are currently no specific treatments for dengue. Balapiravir is a prodrug of a nucleoside analogue (called R1479) and an inhibitor of hepatitis C virus replication in vivo.
Methods. We conducted in vitro experiments to determine the potency of balapiravir against dengue viruses and then an exploratory, dose-escalating, randomized placebo-controlled trial in adult male patients with dengue with <48 hours of fever.
Results. The clinical and laboratory adverse event profile in patients receiving balapiravir at doses of 1500 mg (n = 10) or 3000 mg (n = 22) orally for 5 days was similar to that of patients receiving placebo (n = 32), indicating balapiravir was well tolerated. However, twice daily assessment of viremia and daily assessment of NS1 antigenemia indicated balapiravir did not measurably alter the kinetics of these virological markers, nor did it reduce the fever clearance time. The kinetics of plasma cytokine concentrations and the whole blood transcriptional profile were also not attenuated by balapiravir treatment.
Conclusions. Although this trial, the first of its kind in dengue, does not support balapiravir as a candidate drug, it does establish a framework for antiviral treatment trials in dengue and provides the field with a clinically evaluated benchmark molecule.
Clinical Trials Registration. NCT01096576.
PMCID: PMC3610419  PMID: 22807519
Dengue; therapeutics; clinical trial; anti-viral
6.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega 
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
PMCID: PMC3261699  PMID: 21988835
bioinformatics; hidden Markov models; multiple sequence alignment
7.  BlastR—fast and accurate database searches for non-coding RNAs 
Nucleic Acids Research  2011;39(16):6886-6895.
We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on the comparison of di-nucleotides using BlosumR, a new log-odd substitution matrix. In order to use BlosumR for comparison, we recoded RNA sequences into protein-like sequences. We then showed that BlosumR can be used along with the BlastP algorithm in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmarked this approach and show BlastR to be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single nucleotide log-odd substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software is an open source freeware available from
PMCID: PMC3167602  PMID: 21624887
8.  Sequence embedding for fast construction of guide trees for multiple sequence alignment 
The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from
PMCID: PMC2893182  PMID: 20470396
9.  High DNA melting temperature predicts transcription start site location in human and mouse 
Nucleic Acids Research  2009;37(22):7360-7367.
The accurate computational prediction of transcription start sites (TSS) in vertebrate genomes is a difficult problem. The physicochemical properties of DNA can be computed in various ways and a many combinations of DNA features have been tested in the past for use as predictors of transcription. We looked in detail at melting temperature, which measures the temperature, at which two strands of DNA separate, considering the cooperative nature of this process. We find that peaks in melting temperature correspond closely to experimentally determined transcription start sites in human and mouse chromosomes. Using melting temperature alone, and with simple thresholding, we can predict TSS with accuracy that is competitive with the most accurate state-of-the-art TSS prediction methods. Accuracy is measured using both experimentally and manually determined TSS. The method works especially well with CpG island containing promoters, but also works when CpG islands are absent. This result is clear evidence of the important role of the physical properties of DNA in the process of transcription. It also points to the importance for TSS prediction methods to include melting temperature as prior information.
PMCID: PMC2794178  PMID: 19820114
10.  R-Coffee: a web server for accurately aligning noncoding RNA sequences 
Nucleic Acids Research  2008;36(Web Server issue):W10-W13.
The R-Coffee web server produces highly accurate multiple alignments of noncoding RNA (ncRNA) sequences, taking into account predicted secondary structures. R-Coffee uses a novel algorithm recently incorporated in the T-Coffee package. R-Coffee works along the same lines as T-Coffee: it uses pairwise or multiple sequence alignment (MSA) methods to compute a primary library of input alignments. The program then computes an MSA highly consistent with both the alignments contained in the library and the secondary structures associated with the sequences. The secondary structures are predicted using RNAplfold. The server provides two modes. The slow/accurate mode is restricted to small datasets (less than 5 sequences less than 150 nucleotides) and combines R-Coffee with Consan, a very accurate pairwise RNA alignment method. For larger datasets a fast method can be used (RM-Coffee mode), that uses R-Coffee to combine the output of the three packages which combines the outputs from programs found to perform best on RNA (MUSCLE, MAFFT and ProbConsRNA). Our BRAliBase benchmarks indicate that the R-Coffee/Consan combination is one of the best ncRNA alignment methods for short sequences, while the RM-Coffee gives comparable results on longer sequences. The R-Coffee web server is available at
PMCID: PMC2447777  PMID: 18483080
11.  ConStruct: Improved construction of RNA consensus structures 
BMC Bioinformatics  2008;9:219.
Aligning homologous non-coding RNAs (ncRNAs) correctly in terms of sequence and structure is an unresolved problem, due to both mathematical complexity and imperfect scoring functions. High quality alignments, however, are a prerequisite for most consensus structure prediction approaches, homology searches, and tools for phylogeny inference. Automatically created ncRNA alignments often need manual corrections, yet this manual refinement is tedious and error-prone.
We present an extended version of CONSTRUCT, a semi-automatic, graphical tool suitable for creating RNA alignments correct in terms of both consensus sequence and consensus structure. To this purpose CONSTRUCT combines sequence alignment, thermodynamic data and various measures of covariation.
One important feature is that the user is guided during the alignment correction step by a consensus dotplot, which displays all thermodynamically optimal base pairs and the corresponding covariation. Once the initial alignment is corrected, optimal and suboptimal secondary structures as well as tertiary interaction can be predicted. We demonstrate CONSTRUCT's ability to guide the user in correcting an initial alignment, and show an example for optimal secondary consensus structure prediction on very hard to align SECIS elements. Moreover we use CONSTRUCT to predict tertiary interactions from sequences of the internal ribosome entry site of CrP-like viruses. In addition we show that alignments specifically designed for benchmarking can be easily be optimized using CONSTRUCT, although they share very little sequence identity.
CONSTRUCT's graphical interface allows for an easy alignment correction based on and guided by predicted and known structural constraints. It combines several algorithms for prediction of secondary consensus structure and even tertiary interactions. The CONSTRUCT package can be downloaded from the URL listed in the Availability and requirements section of this article.
PMCID: PMC2408607  PMID: 18442401
12.  R-Coffee: a method for multiple alignment of non-coding RNA 
Nucleic Acids Research  2008;36(9):e52.
R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method. In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the RNA pairwise structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (
PMCID: PMC2396437  PMID: 18420654
13.  An enhanced RNA alignment benchmark for sequence alignment programs 
The performance of alignment programs is traditionally tested on sets of protein sequences, of which a reference alignment is known. Conclusions drawn from such protein benchmarks do not necessarily hold for the RNA alignment problem, as was demonstrated in the first RNA alignment benchmark published so far. For example, the twilight zone – the similarity range where alignment quality drops drastically – starts at 60 % for RNAs in comparison to 20 % for proteins. In this study we enhance the previous benchmark.
The RNA sequence sets in the benchmark database are taken from an increased number of RNA families to avoid unintended impact by using only a few families. The size of sets varies from 2 to 15 sequences to assess the influence of the number of sequences on program performance. Alignment quality is scored by two measures: one takes into account only nucleotide matches, the other measures structural conservation. The performance order of parameters – like nucleotide substitution matrices and gap-costs – as well as of programs is rated by rank tests.
Most sequence alignment programs perform equally well on RNA sequence sets with high sequence identity, that is with an average pairwise sequence identity (APSI) above 75 %. Parameters for gap-open and gap-extension have a large influence on alignment quality lower than APSI ≤ 75 %; optimal parameter combinations are shown for several programs. The use of different 4 × 4 substitution matrices improved program performance only in some cases. The performance of iterative programs drastically increases with increasing sequence numbers and/or decreasing sequence identity, which makes them clearly superior to programs using a purely non-iterative, progressive approach. The best sequence alignment programs produce alignments of high quality down to APSI > 55 %; at lower APSI the use of sequence+structure alignment programs is recommended.
PMCID: PMC1635699  PMID: 17062125
14.  A benchmark of multiple sequence alignment programs upon structural RNAs 
Nucleic Acids Research  2005;33(8):2433-2439.
To date, few attempts have been made to benchmark the alignment algorithms upon nucleic acid sequences. Frequently, sophisticated PAM or BLOSUM like models are used to align proteins, yet equivalents are not considered for nucleic acids; instead, rather ad hoc models are generally favoured. Here, we systematically test the performance of existing alignment algorithms on structural RNAs. This work was aimed at achieving the following goals: (i) to determine conditions where it is appropriate to apply common sequence alignment methods to the structural RNA alignment problem. This indicates where and when researchers should consider augmenting the alignment process with auxiliary information, such as secondary structure and (ii) to determine which sequence alignment algorithms perform well under the broadest range of conditions. We find that sequence alignment alone, using the current algorithms, is generally inappropriate <50–60% sequence identity. Second, we note that the probabilistic method ProAlign and the aging Clustal algorithms generally outperform other sequence-based algorithms, under the broadest range of applications.
PMCID: PMC1087786  PMID: 15860779

Results 1-14 (14)