Results 1-8 (8)

1.  Incorporating post-translational modifications and unnatural amino acids into high-throughput modeling of protein structures 
Bioinformatics  2014;30(12):1681-1689.
Motivation: Accurately predicting protein side-chain conformations is an important subproblem of the broader protein structure prediction problem. Several methods exist for generating fairly accurate models for moderate-size proteins in seconds or less. However, a major limitation of these methods is their inability to model post-translational modifications (PTMs) and unnatural amino acids. In natural living systems, the chemical groups added following translation are often critical for the function of the protein. In engineered systems, unnatural amino acids are incorporated into proteins to explore structure–function relationships and create novel proteins.
Results: We present a new version of SIDEpro to predict the side chains of proteins containing non-standard amino acids, including 15 of the most frequently observed PTMs in the Protein Data Bank and all types of phosphorylation. SIDEpro uses energy functions that are parameterized by neural networks trained from available data. For PTMs, the χ1 and χ1+2 accuracies are comparable with those obtained for the precursor amino acid, and so are the RMSD values for the atoms shared with the precursor amino acid. In addition, SIDEpro can accommodate any PTM or unnatural amino acid, thus providing a flexible prediction system for high-throughput modeling of proteins beyond the standard amino acids.
Availability and implementation: SIDEpro programs and Web server, rotamer libraries and data are available through the SCRATCH suite of protein structure predictors at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058938  PMID: 24574112
2.  A unifying kinetic framework for modeling oxidoreductase-catalyzed reactions 
Bioinformatics  2013;29(10):1299-1307.
Motivation: Oxidoreductases are a fundamental class of enzymes responsible for the catalysis of oxidation–reduction reactions, crucial in most bioenergetic metabolic pathways. From their common root in the ancient prebiotic environment, oxidoreductases have evolved into diverse and elaborate protein structures with specific kinetic properties and mechanisms adapted to their individual functional roles and environmental conditions. While accurate kinetic modeling of oxidoreductases is thus important, current models suffer from limitations to the steady-state domain, lack empirical validation or are too specialized to a single system or set of conditions.
Results: To address these limitations, we introduce a novel unifying modeling framework for kinetic descriptions of oxidoreductases. The framework is based on a set of seven elementary reactions that (i) form the basis for 69 pairs of enzyme state transitions for encoding various specific microscopic intra-enzyme reaction networks (micro-models), and (ii) lead to various specific macroscopic steady-state kinetic equations (macro-models) via thermodynamic assumptions. Thus, a synergistic bridge between the micro and macro kinetics can be achieved, enabling us to extract unitary rate constants, simulate reaction variance and validate the micro-models using steady-state empirical data. To help facilitate the application of this framework, we make available RedoxMech: a Mathematica™ software package that automates the generation and customization of micro-models.
Availability: The Mathematica™ source code for RedoxMech, the documentation and the experimental datasets are all available from:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3732027  PMID: 23613486
3.  Deep architectures for protein contact map prediction 
Bioinformatics  2012;28(19):2449-2457.
Motivation: Residue–residue contact prediction is important for protein structure prediction and other applications. However, the accuracy of current contact predictors often barely exceeds 20% on long-range contacts, falling short of the level required for ab initio structure prediction.
Results: Here, we develop a novel machine learning approach for contact map prediction using three steps of increasing resolution. First, we use 2D recursive neural networks to predict coarse contacts and orientations between secondary structure elements. Second, we use an energy-based method to align secondary structure elements and predict contact probabilities between residues in contacting alpha-helices or strands. Third, we use a deep neural network architecture to organize and progressively refine the prediction of contacts, integrating information over both space and time. We train the architecture on a large set of non-redundant proteins and test it on a large set of non-homologous domains, as well as on the set of protein domains used for contact prediction in the two most recent CASP8 and CASP9 experiments. For long-range contacts, the accuracy of the new CMAPpro predictor is close to 30%, a significant increase over existing approaches.
Availability: CMAPpro is available as part of the SCRATCH suite at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3463120  PMID: 22847931
4.  High-throughput prediction of protein antigenicity using protein microarray data 
Bioinformatics  2010;26(23):2936-2943.
Motivation: Discovery of novel protective antigens is fundamental to the development of vaccines for existing and emerging pathogens. Most computational methods for predicting protein antigenicity rely directly on homology with previously characterized protective antigens; however, homology-based methods will fail to discover truly novel protective antigens. Thus, there is a significant need for homology-free methods capable of screening entire proteomes for the antigens most likely to generate a protective humoral immune response.
Results: Here we begin by curating two types of positive data: (i) antigens that elicit a strong antibody response in protected individuals but not in unprotected individuals, using human immunoglobulin reactivity data obtained from protein microarray analyses; and (ii) known protective antigens from the literature. The resulting datasets are used to train a sequence-based prediction model, ANTIGENpro, to predict the likelihood that a protein is a protective antigen. ANTIGENpro correctly classifies 82% of the known protective antigens when trained using only the protein microarray datasets. The accuracy on the combined dataset is estimated at 76% by cross-validation experiments. Finally, ANTIGENpro performs well when evaluated on an external pathogen proteome for which protein microarray data were obtained after the initial development of ANTIGENpro.
Availability: ANTIGENpro is integrated in the SCRATCH suite of predictors available at
PMCID: PMC2982151  PMID: 20934990
5.  A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval 
Bioinformatics  2010;26(10):1348-1356.
Motivation: The performance of classifiers is often assessed using Receiver Operating Characteristic (ROC) curves [or accumulation curves (AC), also called enrichment curves] and the corresponding areas under the curves (AUCs). However, in many fundamental problems ranging from information retrieval to drug discovery, only the very top of the ranked list of predictions is of any interest, and ROCs and AUCs are not very useful. New metrics, visualizations and optimization tools are needed to address this ‘early retrieval’ problem.
Results: To address the early retrieval problem, we develop the general concentrated ROC (CROC) framework. In this framework, any relevant portion of the ROC (or AC) curve is magnified smoothly by an appropriate continuous transformation of the coordinates with a corresponding magnification factor. Appropriate families of magnification functions confined to the unit square are derived and their properties are analyzed together with the resulting CROC curves. The area under the CROC curve (AUC[CROC]) can be used to assess early retrieval. The general framework is demonstrated on a drug discovery problem and used to discriminate more accurately the early retrieval performance of five different predictors. From this framework, we propose a novel metric and visualization—the CROC(exp), an exponential transform of the ROC curve—as an alternative to other methods. The CROC(exp) provides a principled, flexible and effective way for measuring and visualizing early retrieval performance with excellent statistical power. Corresponding methods for optimizing early retrieval are also described in the Appendix.
Availability: Datasets are publicly available. Python code and command-line utilities implementing CROC curves and metrics are available at
PMCID: PMC2865862  PMID: 20378557
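As an illustration of the idea in the abstract above, the following minimal sketch magnifies the early part of a ROC curve with an exponential transform of the false-positive rate and compares the two areas. The abstract does not state the transform, so the form f(x) = (1 - exp(-αx)) / (1 - exp(-α)), the value α = 7 and all function names here are assumptions made for illustration. This is not the authors' released Python code.

```python
import math

def croc_exp(x, alpha=7.0):
    """Exponential magnification of a false-positive rate x in [0, 1].
    Assumed form; alpha controls how strongly early retrieval is magnified."""
    return (1.0 - math.exp(-alpha * x)) / (1.0 - math.exp(-alpha))

def roc_points(scores, labels):
    """ROC points (fpr, tpr) from scores and 0/1 labels; assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_and_auc_croc(scores, labels, alpha=7.0):
    """Return (AUC[ROC], AUC[CROC]) for one ranked list."""
    roc = roc_points(scores, labels)
    croc = [(croc_exp(x, alpha), y) for x, y in roc]
    return area(roc), area(croc)

# Hypothetical toy data: two actives (1) and two inactives (0).
auc, auc_croc = auc_and_auc_croc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc, auc_croc)
```

Because the transform stretches the left end of the x-axis, an active that is ranked below even one inactive costs more area under the transformed curve than under the plain ROC curve, which is the behavior an early-retrieval metric is meant to have.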
6.  Data structures and compression algorithms for genomic sequence data 
Bioinformatics  2009;25(14):1731-1738.
Motivation: The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function and evolution, but also for the storage, navigation and privacy of genomic data. Here, we develop data structures and algorithms for the efficient storage of genomic and other sequence data that may also facilitate querying and protecting the data.
Results: The general idea is to encode only the differences between a genome sequence and a reference sequence, using absolute or relative coordinates for the location of the differences. These locations and the corresponding differential variants can be encoded into binary strings using various entropy coding methods, from fixed codes, such as Golomb and Elias codes, to variable codes, such as Huffman codes. We demonstrate the approach and various tradeoffs using highly variable human mitochondrial genome sequences as a testbed. With only a partial level of optimization, 3615 genome sequences occupying 56 MB in GenBank are compressed down to only 167 KB, achieving a 345-fold compression rate, using the revised Cambridge Reference Sequence as the reference sequence. Using the consensus sequence as the reference sequence, the data can be stored using only 133 KB, corresponding to a 433-fold level of compression, roughly a 23% improvement. Extensions to nuclear genomes and high-throughput sequencing data are discussed.
Availability: Data are publicly available from GenBank, the HapMap web site, and the MITOMAP database. Supplementary materials with additional results, statistics, and software implementations are available from
PMCID: PMC2705231  PMID: 19447783
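The reference-based encoding described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles substitutions only (no insertions or deletions), requires the two sequences to have equal length, stores each difference as a relative coordinate (the gap since the previous difference) in Elias gamma code, and stores the variant base in two bits. The function names and the two-bit base table are hypothetical.

```python
# Assumed two-bit code for the variant base (illustrative choice).
BASE_BITS = {"A": "00", "C": "01", "G": "10", "T": "11"}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}

def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def encode_diffs(reference, sequence):
    """Encode substitutions relative to the reference as (gap, base) pairs."""
    if len(reference) != len(sequence):
        raise ValueError("this sketch handles equal-length sequences only")
    bits = []
    previous = -1
    for pos, (ref_base, seq_base) in enumerate(zip(reference, sequence)):
        if ref_base != seq_base:
            bits.append(elias_gamma(pos - previous))  # relative coordinate, >= 1
            bits.append(BASE_BITS[seq_base])
            previous = pos
    return "".join(bits)

def decode_diffs(reference, bits):
    """Rebuild the sequence from the reference and the encoded differences."""
    sequence = list(reference)
    i = 0
    pos = -1
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # unary length prefix of the gamma code
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        pos += gap
        sequence[pos] = BITS_BASE[bits[i:i + 2]]
        i += 2
    return "".join(sequence)

reference = "ACGTACGTAC"
sequence = "ACGAACGTCC"
encoded = encode_diffs(reference, sequence)
print(encoded, len(encoded), "bits")
```

Relative coordinates keep the encoded integers small when differences cluster, which is what makes a code that favors small integers, such as Elias gamma, a reasonable choice in this sketch.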
7.  MotifMap: a human genome-wide map of candidate regulatory motif sites 
Bioinformatics  2008;25(2):167-174.
Motivation: Achieving a comprehensive map of all the regulatory elements encoded in the human genome is a fundamental challenge of biomedical research. So far, only a small fraction of the regulatory elements have been characterized, and there is great interest in applying computational techniques to systematically discover these elements. Such efforts, however, have been significantly hindered by the overwhelming size of non-coding DNA regions and the statistical variability and complex spatial organizations of mammalian regulatory elements.
Results: Here we combine information from multiple mammalian genomes to derive the first fairly comprehensive map of regulatory elements in the human genome. We develop a procedure for identifying regulatory sites, with high levels of conservation across different species, using a new scoring scheme, the Bayesian branch length score (BBLS). Using BBLS, we predict 1.5 million regulatory sites, corresponding to 380 known regulatory motifs, with an estimated false discovery rate (FDR) of <50%. We demonstrate that the method is particularly effective for 155 motifs, for which 121 056 sites can be mapped with an estimated FDR of <10%. Over 28K SNPs are located in regions overlapping the 1.5 million predicted motif sites, suggesting potential functional implications for these SNPs. We have deposited these elements in a database and created a user-friendly web server for the retrieval, analysis and visualization of these elements. The initial map provides a systematic view of gene regulation in the genome, which will be refined as additional motifs become available.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2732295  PMID: 19017655
8.  BLASTing small molecules—statistics and extreme statistics of chemical similarity scores 
Bioinformatics  2008;24(13):i357-i365.
Motivation: Small organic molecules, from nucleotides and amino acids to metabolites and drugs, play a fundamental role in chemistry, biology and medicine. As databases of small molecules continue to grow and become more open, it is important to develop the tools to search them efficiently. In order to develop a BLAST-like tool for small molecules, one must first understand the statistical behavior of molecular similarity scores.
Results: We develop a new detailed theory of molecular similarity scores that can be applied to a variety of molecular representations and similarity measures. For concreteness, we focus on the most widely used measure, the Tanimoto measure applied to chemical fingerprints. In both the case of empirical fingerprints and fingerprints generated by several stochastic models, we derive accurate approximations for both the distribution and extreme value distribution of similarity scores. These approximations are derived using a ratio of correlated Gaussians approach. The theory enables the calculation of significance scores, such as Z-scores and P-values, and the estimation of the top hits list size. Empirical results obtained using both the random models and real data from the ChemDB database are given to corroborate the theory and show how it can be applied to mine chemical space.
Availability: Data and related resources are available through
PMCID: PMC2718662  PMID: 18586735
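To make the quantities in the abstract above concrete, here is a minimal sketch of the Tanimoto measure on binary fingerprints, together with a Z-score for one similarity value. Fingerprints are represented as sets of "on" bit indices, which is an assumption of this sketch. The Z-score is estimated empirically from a small background set; this is a swapped-in simplification, not the ratio-of-correlated-Gaussians approximation derived in the paper. All names and data are hypothetical.

```python
import statistics

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: |A intersect B| / |A union B| for sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def empirical_z_score(score, query, background):
    """Z-score of a similarity score against the query's background scores.
    Uses the sample mean and population standard deviation of the background."""
    scores = [tanimoto(query, fp) for fp in background]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    if spread == 0.0:
        raise ValueError("background scores have zero variance")
    return (score - mean) / spread

# Hypothetical toy fingerprints (indices of bits set to 1).
query = {1, 2, 3}
background = [{1, 2}, {3, 4}, {1, 3}]
hit = {1, 2, 3}

score = tanimoto(query, hit)
print(score, empirical_z_score(score, query, background))
```

An analytical distribution of scores, as developed in the paper, removes the need to score a large background set for every query, which is the practical motivation for a BLAST-like significance estimate.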
