Results 1-14 (14)

1.  Automated Forward and Reverse Ratcheting of DNA in a Nanopore at Five Angstrom Precision
Nature biotechnology  2012;30(4):344-348.
Single-molecule techniques have been developed for commercial DNA sequencing [1,2]. One emerging strategy uses a nanopore to analyze DNA molecules as they are driven electrophoretically in single file order past a sensor [3-5]. However, uncontrolled DNA strand electrophoresis through nanopores is too fast for accurate base reads [6]. A proposed solution would employ processive enzymes to deliver DNA through the pore at a slower average rate [7]. Here, we describe forward and reverse ratcheting of DNA templates through the α-hemolysin (α-HL) nanopore controlled by wild-type phi29 DNA polymerase (phi29 DNAP). DNA strands were examined in single file order at one nucleotide spatial precision in real time. The registry error probability (either an insertion or deletion during one pass along a template strand) ranged from 10% to 24.5% absent optimization. This general strategy facilitates multiple reads of individual template strands and is transferable to other nanopore devices for implementation of DNA sequence analysis.
PMCID: PMC3408072  PMID: 22334048
2.  Complete genome sequence of Pyrobaculum oguniense 
Standards in Genomic Sciences  2012;6(3):336-345.
Pyrobaculum oguniense TE7 is an aerobic hyperthermophilic crenarchaeon isolated from a hot spring in Japan. Here we describe its main chromosome of 2,436,033 bp, with three large-scale inversions and an extra-chromosomal element of 16,887 bp. We have annotated 2,800 protein-coding genes and 145 RNA genes in this genome, including nine H/ACA-like small RNA, 83 predicted C/D box small RNA, and 47 transfer RNA genes. Comparative analyses with the closest known relative, the anaerobe Pyrobaculum arsenaticum from Italy, reveal unexpectedly high synteny and nucleotide identity between these two geographically distant species. Deep sequencing of a mixture of genomic DNA from multiple cells has illuminated some of the genome dynamics potentially shared with other species in this genus.
PMCID: PMC3558965  PMID: 23407329
Pyrobaculum oguniense; Pyrobaculum arsenaticum; Crenarchaea; inversion
3.  Identification of prokaryotic small proteins using a comparative genomic approach 
Bioinformatics  2011;27(13):1765-1771.
Motivation: Accurate prediction of genes encoding small proteins (on the order of 50 amino acids or less) remains an elusive open problem in bioinformatics. Some of the best methods for gene prediction use either sequence composition analysis or sequence similarity to a known protein coding sequence. These methods often fail for small proteins, however, either due to a lack of experimentally verified small protein coding genes or due to the limited statistical significance of statistics on small sequences.
Our approach is based upon the hypothesis that true small proteins will be under selective pressure for encoding the particular amino acid sequence, for ease of translation by the ribosome and for structural stability. This stability can be achieved either independently or as part of a larger protein complex. Given this assumption, it follows that small proteins should display conserved local protein structure properties much like larger proteins. Our method incorporates neural-net predictions for three local structure alphabets within a comparative genomic approach using a genomic alignment of 22 closely related bacteria genomes to generate predictions for whether or not a given open reading frame (ORF) encodes for a small protein.
Results: We have applied this method to the complete genome for Escherichia coli strain K12 and looked at how well our method performed on a set of 60 experimentally verified small proteins from this organism. Out of a total of 11 407 possible ORFs, we found that 6 of the top 10 and 27 of the top 100 predictions belonged to the set of 60 experimentally verified small proteins. We found 35 of all the true small proteins within the top 200 predictions. We compared our method to Glimmer, using a default Glimmer protocol and a modified small ORF Glimmer protocol with a lower minimum size cutoff. The default Glimmer protocol identified 16 of the true small proteins (all in the top 200 predictions), but failed to predict on 34 due to size cutoffs. The small ORF Glimmer protocol made predictions for all the experimentally verified small proteins but only contained 9 of the 60 true small proteins within the top 200 predictions.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3117347  PMID: 21551138
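The Results paragraph of item 3 reports performance as the number of experimentally verified small proteins found among the top-ranked predictions (for example, 6 of the top 10). A minimal sketch of that kind of count is below. The function name and the toy ORF identifiers are hypothetical and are not taken from the paper.

```python
def hits_in_top_k(ranked_orfs, verified, k):
    """Count how many of the first k ranked ORFs are in the verified set."""
    return sum(1 for orf in ranked_orfs[:k] if orf in verified)


# Toy data: hypothetical ORF identifiers, ordered best prediction first.
ranked = ["orf7", "orf2", "orf9", "orf4", "orf1", "orf5"]
verified = {"orf2", "orf4", "orf5"}

top3 = hits_in_top_k(ranked, verified, 3)
top6 = hits_in_top_k(ranked, verified, 6)
```

Reporting the count at several cutoffs, as the abstract does, shows how quickly the verified set is recovered as the list is read from the top.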
4.  Identification of a Chemoreceptor Zinc-Binding Domain Common to Cytoplasmic Bacterial Chemoreceptors
Journal of Bacteriology  2011;193(17):4338-4345.
We report the identification and characterization of a previously unidentified protein domain found in bacterial chemoreceptors and other bacterial signal transduction proteins. This domain contains a motif of three noncontiguous histidines and one cysteine, arranged as Hxx[WFYL]x21-28Cx[LFMVI]Gx[WFLVI]x18-27HxxxH (boldface type indicates residues that are nearly 100% conserved). This domain was first identified in the soluble Helicobacter pylori chemoreceptor TlpD. Using inductively coupled plasma mass spectrometry on heterologously and natively expressed TlpD, we determined that this domain binds zinc with a subfemtomolar dissociation constant. We thus named the domain CZB, for chemoreceptor zinc binding. Further analysis showed that many bacterial signaling proteins contain the CZB domain, most commonly proteins that participate in chemotaxis but also those that participate in c-di-GMP signaling and nitrate/nitrite sensing, among others. Proteins bearing the CZB domain are found in several bacterial phyla. The variety of signaling proteins using the CZB domain suggests that it plays a critical role in several signal transduction pathways.
PMCID: PMC3165512  PMID: 21725005
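The CZB motif quoted in item 4 can be read as a sequence pattern. The sketch below translates it into a Python regular expression, assuming that "x" means any residue and that "x21-28" and "x18-27" mean runs of 21 to 28 and 18 to 27 arbitrary residues. The function name and the synthetic test sequence are hypothetical, and the pattern ignores the conservation levels that the original boldface conveyed.

```python
import re

# Hxx[WFYL]x21-28Cx[LFMVI]Gx[WFLVI]x18-27HxxxH, with "x" read as any residue
# (an interpretation of the motif as printed, not the authors' own tooling).
CZB_MOTIF = re.compile(
    r"H.{2}[WFYL].{21,28}C.[LFMVI]G.[WFLVI].{18,27}H.{3}H"
)


def find_czb_candidates(protein_seq):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in CZB_MOTIF.finditer(protein_seq)]


# Synthetic sequence built to satisfy the pattern; it is not a real protein.
synthetic = "MK" + "HAAW" + "A" * 21 + "CALGAW" + "A" * 18 + "HAAAH" + "KK"
candidates = find_czb_candidates(synthetic)
```

A match from a pattern like this is only a candidate. Confirming a CZB domain would need the kind of alignment and binding evidence the abstract describes.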
5.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega 
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
PMCID: PMC3261699  PMID: 21988835
bioinformatics; hidden Markov models; multiple sequence alignment
6.  Applying Undertaker Cost Functions to Model Quality Assessment 
Proteins  2009;75(3):550-555.
Undertaker is a program designed to help predict protein structure using alignments to proteins of known structure and fragment assembly. The program generates conformations and uses cost functions to select the best structures from among the generated conformations. This paper describes the use of Undertaker's cost functions for model quality assessment (MQA). We achieve an accuracy that is similar to other methods, without using consensus-based techniques. Adding consensus-based features further improves our approach substantially. We report several correlation measures, including a new weighted version of Kendall's τ (τ3) and show MQA results superior to previously published results on all correlation measures when using only models with no missing atoms.
PMCID: PMC2992551  PMID: 19004017
7.  Improving physical realism, stereochemistry and side-chain accuracy in homology modeling: four approaches that performed well in CASP8 
Proteins  2009;77(Suppl 9):114-122.
A correct alignment is an essential requirement in homology modeling. Yet in order to bridge the structural gap between template and target, which may not only involve loop rearrangements, but also shifts of secondary structure elements and repacking of core residues, high-resolution refinement methods with full atomic details are needed. Here we describe four approaches that address this ‘last mile of the protein folding problem’ and have performed well during CASP8, yielding physically realistic models:
YASARA, which runs molecular dynamics simulations of models in explicit solvent, using a new partly knowledge-based all atom force field derived from Amber, whose parameters have been optimized to minimize the damage done to protein crystal structures.
The LEE-SERVER, which makes extensive use of conformational space annealing to create alignments, to help Modeller build physically realistic models while satisfying input restraints from templates and CHARMM stereochemistry, and to remodel the side-chains.
ROSETTA, whose high resolution refinement protocol combines a physically realistic all atom force field with Monte Carlo minimization to allow the large conformational space to be sampled quickly.
And finally UNDERTAKER, which creates a pool of candidate models from various templates and then optimizes them with an adaptive genetic algorithm, using a primarily empirical cost function that does not include bond angle, bond length, or other physics-like terms.
PMCID: PMC2922016  PMID: 19768677
high-resolution refinement; template-based modeling; Rosetta; Undertaker; Modeller; conformational space annealing; YASARA
8.  Model Quality Assessment using Distance Constraints from Alignments 
Proteins  2009;75(3):540-549.
Given a set of alternative models for a specific protein sequence, the model quality assessment (MQA) problem asks for an assignment of scores to each model in the set. A good MQA program assigns these scores such that they correlate well with the real quality of the models, ideally giving the best score to the model that is closest to the true structure.
In this paper, we present a new approach for addressing the MQA problem. It is based on distance constraints extracted from alignments to templates of known structure, and is implemented in the Undertaker [9] program for protein structure prediction. One novel feature is that we extract non-contact constraints as well as contact constraints.
We describe how the distance constraints are extracted and show how they can be used to address the MQA problem. We have compared our method on CASP7 targets and the results show that our method is at least comparable with the best MQA methods that were assessed at CASP7 [7].
We also propose a new evaluation measure, Kendall's τ, that is more interpretable than conventional measures used for evaluating MQA methods (Pearson's r and Spearman's ρ).
We show clear examples where Kendall's τ agrees much more with our intuition of a correct MQA and we therefore propose that Kendall's τ be used for future CASP MQA assessments.
PMCID: PMC2670941  PMID: 19003987
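Item 8 proposes Kendall's τ as an evaluation measure for MQA. As a rough illustration, the sketch below computes the standard tie-free form (often called tau-a): the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. This is a textbook definition and not necessarily the exact variant used in the paper. The function name and toy scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall's tau-a between two equal-length score lists (no tie correction)."""
    if len(predicted) != len(actual):
        raise ValueError("score lists must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two models to compare")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order the pair the same way.
        s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy example: predicted quality scores vs. true quality for three models.
tau = kendall_tau([1, 2, 3], [1, 3, 2])
```

The value has a direct reading in terms of pairs: for any two models, it reflects how often the predicted scores order them the same way as their true quality, which may be why the abstract calls the measure more interpretable.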
9.  Applying Undertaker to Quality Assessment 
Proteins  2009;77(Suppl 9):191-195.
Our group tested three quality assessment functions in CASP8: a function which used only distance constraints derived from alignments (SAM-T08-MQAO), a function which added other single-model terms to the distance constraints (SAM-T08-MQAU), and a function which used both single-model and consensus terms (SAM-T08-MQAC).
We analyzed the functions both for ranking models for a single target and for producing an accurate estimate of GDT_TS. Our functions were optimized for the ranking problem, so they are perhaps more appropriate for metaserver applications than for providing a trustworthiness estimate for single models.
On the CASP8 test, the functions with more terms performed better. The MQAC consensus method was substantially better than either single-model function, and the MQAU function was substantially better than the MQAO function that used only constraints from alignments.
PMCID: PMC2825389  PMID: 19639637
10.  Improving protein secondary structure prediction using a simple k-mer model 
Bioinformatics  2010;26(5):596-602.
Motivation: Some first order methods for protein sequence analysis inherently treat each position as independent. We develop a general framework for introducing longer range interactions. We then demonstrate the power of our approach by applying it to secondary structure prediction; under the independence assumption, sequences produced by existing methods can produce features that are not protein like, an extreme example being a helix of length 1. Our goal was to make the predictions from state of the art methods more realistic, without loss of performance by other measures.
Results: Our framework for longer range interactions is described as a k-mer order model. We succeeded in applying our model to the specific problem of secondary structure prediction, to be used as an additional layer on top of existing methods. We achieved our goal of making the predictions more realistic and protein like, and remarkably this also improved the overall performance. We improve the Segment OVerlap (SOV) score by 1.8%, but more importantly we radically improve the probability of the real sequence given a prediction from an average of 0.271 per residue to 0.385. Crucially, this improvement is obtained using no additional information.
PMCID: PMC2828123  PMID: 20130034
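Item 10 cites a helix of length 1 as an extreme example of an un-protein-like prediction. The sketch below is not the paper's k-mer model. It only flags that symptom by scanning a predicted secondary structure string for helix or strand runs shorter than a chosen minimum. The H/E/C alphabet, the `min_len` default, and the function name are all assumptions made for illustration.

```python
from itertools import groupby


def short_segments(ss_string, min_len=3, states="HE"):
    """Return (start, state, length) for runs of `states` shorter than min_len."""
    flagged = []
    pos = 0
    for state, run in groupby(ss_string):
        length = len(list(run))
        if state in states and length < min_len:
            flagged.append((pos, state, length))
        pos += length
    return flagged


# Toy prediction: a single-residue helix at index 2, then valid E and H runs.
flagged = short_segments("CCHCCEEEECHHHHC")
```

A check like this only detects implausible segments after the fact, whereas the abstract describes a model layered on top of existing predictors to make such output less likely in the first place.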
11.  Predict-2nd: a tool for generalized protein local structure prediction 
Bioinformatics  2008;24(21):2453-2459.
Motivation: Predictions of protein local structure, derived from sequence alignment information alone, provide visualization tools for biologists to evaluate the importance of amino acid residue positions of interest in the absence of X-ray crystal/NMR structures or homology models. They are also useful as inputs to sequence analysis and modeling tools, such as hidden Markov models (HMMs), which can be used to search for homology in databases of known protein structure. In addition, local structure predictions can be used as a component of cost functions in genetic algorithms that predict protein tertiary structure. We have developed a program (predict-2nd) that trains multilayer neural networks and have applied it to numerous local structure alphabets, tuning network parameters such as the number of layers, the number of units in each layer and the window sizes of each layer. We have had the most success with four-layer networks, with gradually increasing window sizes at each layer.
Results: Because the four-layer neural nets occasionally get trapped in poor local optima, our training protocol now uses many different random starts, with short training runs, followed by more training on the best performing networks from the short runs. One recent addition to the program is the option to add a guide sequence to the profile inputs, increasing the number of inputs per position by 20. We find that use of a guide sequence provides a small but consistent improvement in the predictions for several different local-structure alphabets.
Availability: Local structure prediction with the methods described here is available for use online at The source code and example networks for PREDICT-2ND are available at A required C++ library is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732275  PMID: 18757875
12.  SAM-T08, HMM-based protein structure prediction 
Nucleic Acids Research  2009;37(Web Server issue):W492-W497.
The SAM-T08 web server is a protein structure prediction server that provides several useful intermediate results in addition to the final predicted 3D structure: three multiple sequence alignments of putative homologs using different iterated search procedures, prediction of local structure features including various backbone and burial properties, calibrated E-values for the significance of template searches of PDB and residue–residue contact predictions. The server has been validated as part of the CASP8 assessment of structure prediction as having good performance across all classes of predictions. The SAM-T08 server is available at
PMCID: PMC2703928  PMID: 19483096
13.  Pokefind: a novel topological filter for use with protein structure prediction 
Bioinformatics  2009;25(12):i281-i288.
Motivation: Our focus has been on detecting topological properties that are rare in real proteins, but occur more frequently in models generated by protein structure prediction methods such as Rosetta. We previously created the Knotfind algorithm, successfully decreasing the frequency of knotted Rosetta models during CASP6. We observed an additional class of knot-like loops that appeared to be equally un-protein-like and yet do not contain a mathematical knot. These topological features are commonly referred to as slip-knots and are caused by the same mechanisms that result in knotted models. Slip-knots are undetectable by the original Knotfind algorithm. We have generalized our algorithm to detect them, and analyzed CASP6 models built using the Rosetta loop modeling method.
Results: After analyzing known protein structures in the PDB, we found that slip-knots do occur in certain proteins, but are rare and fall into a small number of specific classes. Our group used this new Pokefind algorithm to distinguish between these rare real slip-knots and the numerous classes of slip-knots that we discovered in Rosetta models and models submitted by the various CASP7 servers. The goal of this work is to improve future models created by protein structure prediction methods. Both algorithms are able to detect un-protein-like features that current metrics such as GDT are unable to identify, so these topological filters can also be used as additional assessment tools.
PMCID: PMC2687952  PMID: 19478000
14.  Identification and Characterization of RbmA, a Novel Protein Required for the Development of Rugose Colony Morphology and Biofilm Structure in Vibrio cholerae 
Journal of Bacteriology  2006;188(3):1049-1059.
Phase variation between smooth and rugose colony variants of Vibrio cholerae is predicted to be important for the pathogen's survival in its natural aquatic ecosystems. The rugose variant forms corrugated colonies, exhibits increased levels of resistance to osmotic, acid, and oxidative stresses, and has an enhanced capacity to form biofilms. Many of these phenotypes are mediated in part by increased production of an exopolysaccharide termed VPS. In this study, we compared total protein profiles of the smooth and rugose variants using two-dimensional gel electrophoresis and identified one protein that is present at a higher level in the rugose variant. A mutation in the gene encoding this protein, which does not have any known homologs in the protein databases, causes cells to form biofilms that are more fragile and sensitive to sodium dodecyl sulfate than wild-type biofilms. The results indicate that the gene, termed rbmA (rugosity and biofilm structure modulator A), is required for rugose colony formation and biofilm structure integrity in V. cholerae. Transcription of rbmA is positively regulated by the response regulator VpsR but not VpsT.
PMCID: PMC1347326  PMID: 16428409
