Results 1-25 (801155)

1.  ProDis-ContSHC: learning protein dissimilarity measures and hierarchical context coherently for protein-protein comparison in protein database retrieval 
BMC Bioinformatics  2012;13(Suppl 7):S2.
Background
The need to retrieve or classify protein molecules using structure- or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. Such pairwise measures suffer from the limitation of neglecting the distribution of the other proteins in the database and thus cannot satisfy the need for high retrieval accuracy. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning contextual dissimilarity/similarity measures can improve retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner and do not utilize the known class labels of the proteins in the database.
Results
In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is that, for a pair of proteins (i, j), if their contexts N(i) and N(j) are similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij with a factor learned from the contexts N(i) and N(j).
Moreover, we divide the context into hierarchical sub-contexts and obtain a contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select relevant (pairs of proteins with the same class label) and irrelevant (pairs with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchical Context Coherently in an iterative algorithm, ProDis-ContSHC.
We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information.
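As a rough illustration of the contextual regularization idea described above (not the authors' implementation), the following Python sketch shrinks a pairwise dissimilarity d(i, j) in proportion to the overlap of the two proteins' k-nearest-neighbor contexts; the neighborhood size k and blending weight alpha are hypothetical parameters.

import numpy as np

def contextual_dissimilarity(D, k=5, alpha=0.5):
    """Regularize a dissimilarity matrix D by neighborhood (context) overlap.

    Sketch only: if proteins i and j have similar k-nearest-neighbor
    contexts, their dissimilarity is reduced.
    """
    n = D.shape[0]
    # k nearest neighbors (excluding self) define each protein's context N(i).
    contexts = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]
    D_ctx = D.copy()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Jaccard overlap of the two contexts, in [0, 1].
            overlap = len(contexts[i] & contexts[j]) / len(contexts[i] | contexts[j])
            # Shrink the dissimilarity in proportion to the context agreement.
            D_ctx[i, j] = D[i, j] * (1.0 - alpha * overlap)
    return D_ctx

# Example: a random symmetric dissimilarity matrix for 10 proteins.
rng = np.random.default_rng(0)
D = rng.random((10, 10)); D = (D + D.T) / 2; np.fill_diagonal(D, 0.0)
print(contextual_dissimilarity(D).round(3))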
Conclusions
Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.
doi:10.1186/1471-2105-13-S7-S2
PMCID: PMC3348016  PMID: 22594999
2.  PRODIS: a proteomics data management system with support to experiment tracking 
BMC Genomics  2011;12(Suppl 4):S15.
Background
Proteomics is a research area that has greatly benefited from the development of new and improved analysis technologies, and large amounts of data have been generated by proteomic analyses as a consequence. Previously, the storage, management and analysis of these data were done manually, which is incompatible with the volume of data generated by modern proteomic analyses. Several attempts have been made to automate the tasks of data analysis and management. In this work we propose PRODIS (Proteomics Database Integrated System), a system for managing proteomic experimental data. The proposed system enables efficient management of the proteomic experimentation workflow, simplifies controlling experiments and their associated data, and establishes links between similar experiments through its experiment tracking function.
Results
PRODIS is fully web based, which simplifies data upload and gives the system the flexibility necessary for use in complex projects. Data from liquid chromatography, 2D-PAGE and mass spectrometry experiments can be stored in the system. Moreover, it is simple to use: researchers can insert experimental data directly as experiments are performed, without the need to configure the system or change their experimental routine. PRODIS has a number of important features, including password protection, in which each screen for data upload and retrieval is validated; users have different clearance levels, which allow the execution of tasks according to each user's clearance. The system allows the upload, parsing, storage and display of experiment results and images in the main formats used in proteomics laboratories: for chromatography, the chromatograms and the lists of peaks resulting from separation are stored; for 2D-PAGE, the gel images and the files resulting from their analysis are stored, containing information on spot positions as well as their intensity and volume values; for mass spectrometry, PRODIS provides a function for completing the mapping plate that allows the user to correlate positions in plates with the samples separated by 2D-PAGE. Furthermore, PRODIS allows the tracking of experiments from the first stage to the final identification step, enabling efficient management of the complete experimental process.
Conclusions
The construction of data management systems for importing and storing proteomics data is a relevant subject. PRODIS is a system complementary to other proteomics tools; it combines a powerful storage engine (a relational database) with a friendly access interface, aiming to assist proteomics research directly in data handling and storage.
doi:10.1186/1471-2164-12-S4-S15
PMCID: PMC3287584  PMID: 22369043
3.  BalestraWeb: efficient online evaluation of drug–target interactions 
Bioinformatics  2014;31(1):131-133.
Summary: BalestraWeb is an online server that allows users to instantly make predictions about the potential occurrence of interactions between any given drug–target pair, or to predict the most likely interaction partners of any drug or target listed in DrugBank. It also permits users to identify the most similar drugs or most similar targets based on their interaction patterns. The outputs help to develop hypotheses about drug repurposing as well as potential side effects.
Availability and implementation: BalestraWeb is accessible at http://balestra.csb.pitt.edu/. The tool is built using a probabilistic matrix factorization method and DrugBank v3, and the latent variable models are trained using the GraphLab collaborative filtering toolkit. The server is implemented using Python, Flask, NumPy and SciPy.
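BalestraWeb itself is used through the server above; purely as an illustration of how a matrix-factorization predictor scores drug–target pairs (not BalestraWeb's actual model or training procedure), a minimal sketch with a toy interaction matrix and hypothetical latent dimension k:

import numpy as np

# Toy interaction matrix: rows = drugs, columns = targets, 1 = known interaction.
R = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]], dtype=float)

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.1, seed=0):
    """Very small gradient-descent matrix factorization R ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((R.shape[0], k))
    V = 0.1 * rng.standard_normal((R.shape[1], k))
    for _ in range(steps):
        E = R - U @ V.T                     # residuals on all entries
        U += lr * (E @ V - reg * U)         # gradient step for drug factors
        V += lr * (E.T @ U - reg * V)       # gradient step for target factors
    return U, V

U, V = factorize(R)
scores = U @ V.T                            # predicted interaction propensities
print(scores.round(2))                      # high scores suggest likely pairs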
Contact: bahar@pitt.edu
doi:10.1093/bioinformatics/btu599
PMCID: PMC4271144  PMID: 25192741
4.  Optimization of minimum set of protein–DNA interactions: a quasi exact solution with minimum over-fitting 
Bioinformatics  2009;26(3):319-325.
Motivation: A major limitation in modeling protein interactions is the difficulty of assessing the over-fitting of the training set. Recently, an experimentally based approach that integrates crystallographic information of C2H2 zinc finger–DNA complexes with binding data from 11 mutants, 7 from EGR finger I, was used to define an improved interaction code (no optimization). Here, we present a novel mixed integer programming (MIP)-based method that transforms this type of data into an optimized code, demonstrating both the advantages of the mathematical formulation to minimize over- and under-fitting and the robustness of the underlying physical parameters mapped by the code.
Results: Based on the structural models of feasible interaction networks for 35 mutants of EGR–DNA complexes, the MIP method minimizes the cumulative binding energy over all complexes for a general set of fundamental protein–DNA interactions. To guard against over-fitting, we use the scalability of the method to probe against the elimination of related interactions. From an initial set of 12 parameters (six hydrogen bonds, five desolvation penalties and a water factor), we proceed to eliminate five of them with only a marginal reduction of the correlation coefficient to 0.9983. Further reduction of parameters negatively impacts the performance of the code (under-fitting). Besides accurately predicting the change in binding affinity of validation sets, the code identifies possible context-dependent effects in the definition of the interaction networks. Yet, the approach of constraining predictions to within a pre-selected set of interactions limits the impact of these potential errors to related low-affinity complexes.
Contact: ccamacho@pitt.edu; droleg@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp664
PMCID: PMC2815656  PMID: 19965883
5.  Small-molecule inhibitor starting points learned from protein–protein interaction inhibitor structure 
Bioinformatics  2011;28(6):784-791.
Motivation: Protein–protein interactions (PPIs) are a promising, but challenging target for pharmaceutical intervention. One approach for addressing these difficult targets is the rational design of small-molecule inhibitors that mimic the chemical and physical properties of small clusters of key residues at the protein–protein interface. The identification of appropriate clusters of interface residues provides starting points for inhibitor design and supports an overall assessment of the susceptibility of PPIs to small-molecule inhibition.
Results: We extract Small-Molecule Inhibitor Starting Points (SMISPs) from protein–ligand and protein–protein complexes in the Protein Data Bank (PDB). These SMISPs are used to train two distinct classifiers, a support vector machine and an easy-to-interpret exhaustive rule classifier. Both classifiers achieve better than 70% leave-one-complex-out cross-validation accuracy and correctly predict SMISPs of known PPI inhibitors not in the training set. A PDB-wide analysis suggests that nearly half of all PPIs may be susceptible to small-molecule inhibition.
Availability: http://pocketquery.csb.pitt.edu.
Contact: dkoes@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr717
PMCID: PMC3307105  PMID: 22210869
6.  Meta-analysis for pathway enrichment analysis when combining multiple genomic studies 
Bioinformatics  2010;26(10):1316-1323.
Motivation: Many pathway analysis (or gene set enrichment analysis) methods have been developed to identify enriched pathways under different biological states within a genomic study. As more and more microarray datasets accumulate, meta-analysis methods have also been developed to integrate information among multiple studies. Currently, most meta-analysis methods for combining genomic studies focus on biomarker detection and meta-analysis for pathway analysis has not been systematically pursued.
Results: We investigated two approaches of meta-analysis for pathway enrichment (MAPE) by combining statistical significance across studies at the gene level (MAPE_G) or at the pathway level (MAPE_P). Simulation results showed increased statistical power of meta-analysis approaches compared to a single study analysis and showed complementary advantages of MAPE_G and MAPE_P under different scenarios. We also developed an integrated method (MAPE_I) that incorporates advantages of both approaches. Comprehensive simulations and applications to real data on drug response of breast cancer cell lines and lung cancer tissues were evaluated to compare the performance of three MAPE variations. MAPE_P has the advantage of not requiring gene matching across studies. When MAPE_G and MAPE_P show complementary advantages, the hybrid version of MAPE_I is generally recommended.
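For intuition only, the sketch below shows one common way to combine gene-level evidence across studies (Fisher's method) followed by a hypergeometric pathway-enrichment test; this is a generic illustration of the gene-level combination step, not the MAPE implementation, and the gene names and thresholds are hypothetical.

import numpy as np
from scipy import stats

def fisher_combine(pvals):
    """Combine p-values for one gene across studies with Fisher's method."""
    stat = -2.0 * np.sum(np.log(pvals))
    return stats.chi2.sf(stat, df=2 * len(pvals))

def pathway_enrichment(combined_p, pathway_genes, all_genes, alpha=0.05):
    """Hypergeometric test: are significant genes over-represented in the pathway?"""
    sig = {g for g, p in zip(all_genes, combined_p) if p < alpha}
    k = len(sig & set(pathway_genes))                 # significant genes in pathway
    M, n, N = len(all_genes), len(pathway_genes), len(sig)
    return stats.hypergeom.sf(k - 1, M, n, N)

genes = [f"g{i}" for i in range(100)]
rng = np.random.default_rng(1)
p_study1, p_study2 = rng.random(100), rng.random(100)
combined = [fisher_combine([a, b]) for a, b in zip(p_study1, p_study2)]
print(pathway_enrichment(combined, genes[:20], genes))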
Availability: http://www.biostat.pitt.edu/bioinfo/
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq148
PMCID: PMC2865865  PMID: 20410053
7.  Prodi proposes food agency for the EU 
BMJ : British Medical Journal  1999;319(7216):1025.
PMCID: PMC1116838  PMID: 10521188
8.  Modeling DNA methylation dynamics with approaches from phylogenetics 
Bioinformatics  2014;30(17):i408-i414.
Motivation: Methylation of CpG dinucleotides is a prevalent epigenetic modification that is required for proper development in vertebrates. Genome-wide DNA methylation assays have become increasingly common, and this has enabled characterization of DNA methylation in distinct stages across differentiating cellular lineages. Changes in CpG methylation are essential to cellular differentiation; however, current methods for modeling methylation dynamics do not account for the dependency structure between precursor and dependent cell types.
Results: We developed a continuous-time Markov chain approach, based on the observation that changes in methylation state over tissue differentiation can be modeled similarly to DNA nucleotide changes over evolutionary time. This model explicitly takes precursor to descendant relationships into account and enables inference of CpG methylation dynamics. To illustrate our method, we analyzed a high-resolution methylation map of the differentiation of mouse stem cells into several blood cell types. Our model can successfully infer unobserved CpG methylation states from observations at the same sites in related cell types (90% correct), and this approach more accurately reconstructs missing data than imputation based on neighboring CpGs (84% correct). Additionally, the single CpG resolution of our methylation dynamics estimates enabled us to show that DNA sequence context of CpG sites is informative about methylation dynamics across tissue differentiation. Finally, we identified genomic regions with clusters of highly dynamic CpGs and present a likely functional example. Our work establishes a framework for inference and modeling that is well suited to DNA methylation data, and our success suggests that other methods for analyzing DNA nucleotide substitutions will also translate to the modeling of epigenetic phenomena.
Availability and implementation: Source code is available at www.kostkalab.net/software.
Contact: tony.capra@vanderbilt.edu or kostka@pitt.edu
doi:10.1093/bioinformatics/btu445
PMCID: PMC4147898  PMID: 25161227
9.  QAARM: quasi-anharmonic autoregressive model reveals molecular recognition pathways in ubiquitin 
Bioinformatics  2011;27(13):i52-i60.
Motivation: Molecular dynamics (MD) simulations have dramatically improved the atomistic understanding of protein motions, energetics and function. These growing datasets have necessitated a corresponding emphasis on trajectory analysis methods for characterizing simulation data, particularly since functional protein motions and transitions are often rare and/or intricate events. Observing that such events give rise to long-tailed spatial distributions, we recently developed a higher-order statistics based dimensionality reduction method, called quasi-anharmonic analysis (QAA), for identifying biophysically-relevant reaction coordinates and substates within MD simulations. Further characterization of conformation space should consider the temporal dynamics specific to each identified substate.
Results: Our model uses hierarchical clustering to learn energetically coherent substates and dynamic modes of motion from a 0.5 μs ubiquitin simulation. Autoregressive (AR) modeling within and between states enables a compact and generative description of the conformational landscape as it relates to functional transitions between binding poses. Since QAA lacks a predictive component, it is extended here within a general AR model that accounts for the trajectory's temporal dependencies and the specific, local dynamics accessible to the protein within identified energy wells. These metastable states and their transition rates are extracted within a QAA-derived subspace using hierarchical Markov clustering to provide parameter sets for the second-order AR model. We show that the learned model can be extrapolated to synthesize trajectories of arbitrary length.
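To make the second-order AR idea concrete, here is a generic sketch (not QAARM itself) that fits an AR(2) model to a single one-dimensional reaction coordinate by least squares and then extrapolates it; the noise level and toy series are hypothetical.

import numpy as np

def fit_ar2(series):
    """Least-squares fit of x_t = a1 * x_{t-1} + a2 * x_{t-2} + noise."""
    X = np.column_stack([series[1:-1], series[:-2]])   # lagged predictors
    y = series[2:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs                                       # (a1, a2)

def synthesize(coeffs, x0, x1, n_steps, noise_sd=0.05, seed=0):
    """Extrapolate a trajectory of arbitrary length from the learned AR(2) model."""
    rng = np.random.default_rng(seed)
    traj = [x0, x1]
    for _ in range(n_steps):
        traj.append(coeffs[0] * traj[-1] + coeffs[1] * traj[-2]
                    + rng.normal(0, noise_sd))
    return np.array(traj)

# Toy 1D "reaction coordinate": damped oscillation plus noise.
t = np.arange(500)
series = (np.exp(-t / 200) * np.cos(0.2 * t)
          + 0.05 * np.random.default_rng(1).normal(size=500))
a = fit_ar2(series)
print("AR(2) coefficients:", a.round(3))
print(synthesize(a, series[-2], series[-1], 5).round(3))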
Contact: ramanathana@ornl.gov; chakracs@pitt.edu
doi:10.1093/bioinformatics/btr248
PMCID: PMC3117343  PMID: 21685101
10.  Biomarker detection in the integration of multiple multi-class genomic studies 
Bioinformatics  2009;26(3):333-340.
Motivation: Systematic integration of information from multiple related microarray studies has become an important issue as the technology has become mature and prevalent over the past decade. The aggregated information provides more robust and accurate biomarker detection. So far, published meta-analysis methods for this purpose have mostly considered two-class comparisons. Methods for combining multi-class studies and considering expression pattern concordance are rarely explored.
Results: In this article, we develop three integration methods for biomarker detection in multiple multi-class microarray studies: ANOVA-maxP, min-MCC and OW-min-MCC. We first consider a natural extension of combining P-values from the traditional ANOVA model. Since P-values from ANOVA are not guaranteed to reflect concordant expression patterns across studies, we propose a multi-class correlation (MCC) measure to specifically seek biomarkers with concordant inter-class patterns across a pair of studies. For both the ANOVA and MCC approaches, we use extreme order statistics to identify biomarkers differentially expressed (DE) in all studies (i.e. ANOVA-maxP and min-MCC). The min-MCC method is further extended to identify biomarkers DE in only a subset of the studies by incorporating a recently developed optimally weighted (OW) technique (OW-min-MCC). All methods are evaluated by simulation studies and by three meta-analysis applications to multi-tissue mouse metabolism datasets, multi-condition mouse trauma datasets and multi-malignant-condition human prostate cancer datasets. The results show the complementary strengths of the three methods for different biological purposes.
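The maxP combination named above is simple to sketch: a gene's combined statistic is the maximum of its per-study ANOVA p-values, so the gene is called DE only if it is DE in every study. The toy data and effect sizes below are hypothetical, and this is not the authors' implementation.

import numpy as np
from scipy import stats

def anova_p(expr, labels):
    """One-way ANOVA p-value for one gene across multi-class labels."""
    groups = [expr[labels == c] for c in np.unique(labels)]
    return stats.f_oneway(*groups).pvalue

def max_p(pvals_per_study):
    """maxP statistic: small only if the p-value is small in every study."""
    return max(pvals_per_study)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)                  # three classes, 10 samples each
# Two studies; the same gene is shifted in class 2 in both.
study1 = rng.normal(0, 1, 30) + (labels == 2) * 1.5
study2 = rng.normal(0, 1, 30) + (labels == 2) * 1.2
print(max_p([anova_p(study1, labels), anova_p(study2, labels)]))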
Availability: http://www.biostat.pitt.edu/bioinfo/
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data is available at Bioinformatics online.
doi:10.1093/bioinformatics/btp669
PMCID: PMC2815659  PMID: 19965884
11.  An integer programming formulation to identify the sparse network architecture governing differentiation of embryonic stem cells 
Bioinformatics  2010;26(10):1332-1339.
Motivation: The primary purpose of modeling gene regulatory networks for developmental processes is to reveal the pathways governing cellular differentiation to specific phenotypes. Knowledge of the differentiation network will enable the generation of desired cell fates by careful alteration of the governing network through adequate manipulation of the cellular environment.
Results: We have developed a novel integer programming-based approach to reconstruct the underlying regulatory architecture of differentiating embryonic stem cells from discrete temporal gene expression data. The network reconstruction problem is formulated using inherent features of biological networks: (i) that of cascade architecture which enables treatment of the entire complex network as a set of interconnected modules and (ii) that of sparsity of interconnection between the transcription factors. The developed framework is applied to the system of embryonic stem cells differentiating towards pancreatic lineage. Experimentally determined expression profile dynamics of relevant transcription factors serve as the input to the network identification algorithm. The developed formulation accurately captures many of the known regulatory modes involved in pancreatic differentiation. The predictive capacity of the model is tested by simulating an in silico potential pathway of subsequent differentiation. The predicted pathway is experimentally verified by concurrent differentiation experiments. Experimental results agree well with model predictions, thereby illustrating the predictive accuracy of the proposed algorithm.
Contact: ipb1@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq139
PMCID: PMC2865861  PMID: 20363729
12.  Module-based prediction approach for robust inter-study predictions in microarray data 
Bioinformatics  2010;26(20):2586-2593.
Motivation: Traditional genomic prediction models based on individual genes suffer from low reproducibility across microarray studies due to the lack of robustness to expression measurement noise and gene missingness when they are matched across platforms. It is common that some of the genes in the prediction model established in a training study cannot be matched to another test study because a different platform is applied. The failure of inter-study predictions has severely hindered the clinical applications of microarray. To overcome the drawbacks of traditional gene-based prediction (GBP) models, we propose a module-based prediction (MBP) strategy via unsupervised gene clustering.
Results: K-means clustering is used to group genes sharing similar expression profiles into gene modules, and small modules are merged into their nearest neighbors. A conventional univariate or multivariate feature selection procedure is applied, and a representative gene from each selected module is identified to construct the final prediction model. As a result, the prediction model is portable to any test study as long as some of the genes in each module exist in the test study. We demonstrate that K-means cluster sizes generally follow a multinomial distribution and that the failure probability of inter-study prediction due to missing genes is diminished by merging small clusters into their nearest neighbors. Through simulation and applications to real datasets in inter-study prediction, we show that the proposed MBP provides slightly improved accuracy while being considerably more robust than traditional GBP.
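A minimal sketch of the module-building step described above: genes are clustered by expression profile with K-means and a representative gene is chosen per module (here, the gene closest to the module centroid). The merging of small modules, feature selection, and substitution of missing representatives in a test study are omitted; the module count and toy data are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def build_modules(train_expr, n_modules=20, seed=0):
    """Cluster genes (rows) into modules by expression profile."""
    km = KMeans(n_clusters=n_modules, n_init=10, random_state=seed)
    return km.fit_predict(train_expr)

def module_representatives(train_expr, modules):
    """Pick, per module, the gene closest to the module's mean profile."""
    reps = {}
    for m in np.unique(modules):
        idx = np.where(modules == m)[0]
        centroid = train_expr[idx].mean(axis=0)
        reps[m] = idx[np.argmin(np.linalg.norm(train_expr[idx] - centroid, axis=1))]
    return reps

rng = np.random.default_rng(0)
train_expr = rng.normal(size=(500, 40))            # 500 genes x 40 samples
modules = build_modules(train_expr)
reps = module_representatives(train_expr, modules)
print(sorted(reps.values())[:10])                  # representative gene indices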
Availability: http://www.biostat.pitt.edu/bioinfo/
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq472
PMCID: PMC2951088  PMID: 20719761
13.  An R package suite for microarray meta-analysis in quality control, differentially expressed gene analysis and pathway enrichment detection 
Bioinformatics  2012;28(19):2534-2536.
Summary: With the rapid advances and prevalence of high-throughput genomic technologies, integrating information from multiple relevant genomic studies has brought new challenges. Microarray meta-analysis has become a frequently used tool in biomedical research. Little effort, however, has been made to develop a systematic pipeline and user-friendly software. In this article, we present MetaOmics, a suite of three R packages, MetaQC, MetaDE and MetaPath, for quality control, differentially expressed gene identification and enriched pathway detection in microarray meta-analysis. MetaQC provides a quantitative and objective tool to assist study inclusion/exclusion decisions for meta-analysis. MetaDE and MetaPath were developed for candidate marker and pathway detection, and provide choices of marker detection, meta-analysis and pathway analysis methods. The system allows flexible input of experimental data, clinical outcome (case–control, multi-class, continuous or survival) and pathway databases. It allows missing values in experimental data and utilizes multi-core parallel computing for fast implementation. It generates informative summary output and visualization plots, operates on different operating systems and can be expanded to include new algorithms or combine different types of genomic data. This software suite provides a comprehensive tool to conveniently implement and compare various genomic meta-analysis pipelines.
Availability: http://www.biostat.pitt.edu/bioinfo/software.htm
Contact: ctseng@pitt.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/bts485
PMCID: PMC3463115  PMID: 22863766
14.  Biological impact of missing-value imputation on downstream analyses of gene expression profiles 
Bioinformatics  2010;27(1):78-86.
Motivation: Microarray experiments frequently produce multiple missing values (MVs) due to flaws such as dust, scratches, insufficient resolution or hybridization errors on the chips. Unfortunately, many downstream algorithms require a complete data matrix. The motivation of this work is to determine the impact of MV imputation on downstream analysis, and whether ranking of imputation methods by imputation accuracy correlates well with the biological impact of the imputation.
Methods: Using eight datasets for differential expression (DE) and classification analysis and eight datasets for gene clustering, we demonstrate the biological impact of missing-value imputation on statistical downstream analyses, including three commonly employed DE methods, four classifiers and three gene-clustering methods. Correlation between the rankings of imputation methods based on three root-mean squared error (RMSE) measures and the rankings based on the downstream analysis methods was used to investigate which RMSE measure was most consistent with the biological impact measures, and which downstream analysis methods were the most sensitive to the choice of imputation procedure.
Results: DE was the most sensitive to the choice of imputation procedure, while classification was the least sensitive and clustering was intermediate between the two. The logged RMSE (LRMSE) measure had the highest correlation with the imputation rankings based on the DE results, indicating that the LRMSE is the best representative surrogate among the three RMSE-based measures. Bayesian principal component analysis and least squares adaptive appeared to be the best performing methods in the empirical downstream evaluation.
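One plausible reading of the logged RMSE (LRMSE) measure referenced above is an RMSE computed on log2-transformed intensities, sketched below for a toy comparison of true versus imputed values; this is an illustration, not the authors' exact definition or code.

import numpy as np

def lrmse(true_vals, imputed_vals):
    """Logged RMSE between true and imputed expression values.

    Values are assumed positive (e.g. raw intensities); both sides are
    log2-transformed before computing the root mean squared error.
    """
    t, i = np.log2(true_vals), np.log2(imputed_vals)
    return float(np.sqrt(np.mean((t - i) ** 2)))

rng = np.random.default_rng(0)
true_vals = rng.uniform(100, 10000, size=200)
imputed_vals = true_vals * rng.normal(1.0, 0.05, size=200)   # ~5% imputation error
print(round(lrmse(true_vals, imputed_vals), 4))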
Contact: ctseng@pitt.edu; guy.brock@louisville.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq613
PMCID: PMC3008641  PMID: 21045072
15.  Efficiency Analysis of Competing Tests for Finding Differentially Expressed Genes in Lung Adenocarcinoma 
Cancer Informatics  2008;6:389-421.
In this study, we introduce and use Efficiency Analysis to compare differences in the apparent internal and external consistency of competing normalization methods and tests for identifying differentially expressed genes. Using publicly available data, two lung adenocarcinoma datasets were analyzed using caGEDA (http://bioinformatics2.pitt.edu/GE2/GEDA.html) to measure the degree of differential expression of genes existing between two populations. The datasets were randomly split into at least two subsets, each analyzed for differentially expressed genes between the two sample groups, and the gene lists compared for overlapping genes. Efficiency Analysis is an intuitive method that compares the differences in the percentage of overlap of genes from two or more data subsets, found by the same test over a range of testing methods. Tests that yield consistent gene lists across independently analyzed splits are preferred to those that yield less consistent inferences. For example, a method that exhibits 50% overlap in the 100 top genes from two studies should be preferred to a method that exhibits 5% overlap in the top 100 genes. The same procedure was performed using all available normalization and transformation methods that are available through caGEDA. The ‘best’ test was then further evaluated using internal cross-validation to estimate generalizable sample classification errors using a Naïve Bayes classification algorithm. A novel test, termed D1 (a derivative of the J5 test) was found to be the most consistent, and to exhibit the lowest overall classification error, and highest sensitivity and specificity. The D1 test relaxes the assumption that few genes are differentially expressed. Efficiency Analysis can be misleading if the tests exhibit a bias in any particular dimension (e.g. expression intensity); we therefore explored intensity-scaled and segmented J5 tests using data in which all genes are scaled to share the same intensity distribution range. Efficiency Analysis correctly predicted the ‘best’ test and normalization method using the Beer dataset and also performed well with the Bhattacharjee dataset based on both efficiency and classification accuracy criteria.
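The core comparison in Efficiency Analysis, the percentage overlap of top-ranked gene lists obtained from independent data splits, is easy to sketch; the test statistics below are synthetic stand-ins, not caGEDA output.

import numpy as np

def top_gene_overlap(scores_a, scores_b, k=100):
    """Percentage overlap of the top-k gene lists from two data splits."""
    top_a = set(np.argsort(scores_a)[::-1][:k])
    top_b = set(np.argsort(scores_b)[::-1][:k])
    return 100.0 * len(top_a & top_b) / k

rng = np.random.default_rng(0)
true_effect = np.concatenate([np.full(50, 3.0), np.zeros(950)])  # 50 truly DE genes
split1 = true_effect + rng.normal(0, 1, 1000)   # test statistics, split 1
split2 = true_effect + rng.normal(0, 1, 1000)   # test statistics, split 2
print(top_gene_overlap(split1, split2, k=100))  # higher = more consistent test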
PMCID: PMC2623303  PMID: 19259419
16.  Bayesian rule learning for biomedical data mining 
Bioinformatics  2010;26(5):668-675.
Motivation: Disease state prediction from biomarker profiling studies is an important problem because more accurate classification models will potentially lead to the discovery of better, more discriminative markers. Data mining methods are routinely applied to such analyses of biomedical datasets generated from high-throughput ‘omic’ technologies applied to clinical samples from tissues or bodily fluids. Past work has demonstrated that rule models can be successfully applied to this problem, since they can produce understandable models that facilitate review of discriminative biomarkers by biomedical scientists. While many rule-based methods produce rules that make predictions under uncertainty, they typically do not quantify the uncertainty in the validity of the rule itself. This article describes an approach that uses a Bayesian score to evaluate rule models.
Results: We have combined the expressiveness of rules with the mathematical rigor of Bayesian networks (BNs) to develop and evaluate a Bayesian rule learning (BRL) system. This system utilizes a novel variant of the K2 algorithm for building BNs from the training data to provide probabilistic scores for IF-antecedent-THEN-consequent rules using heuristic best-first search. We then apply rule-based inference to evaluate the learned models during 10-fold cross-validation performed two times. The BRL system is evaluated on 24 published ‘omic’ datasets, and on average it performs on par or better than other readily available rule learning methods. Moreover, BRL produces models that contain on average 70% fewer variables, which means that the biomarker panels for disease prediction contain fewer markers for further verification and validation by bench scientists.
Contact: vanathi@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq005
PMCID: PMC2852212  PMID: 20080512
17.  Identifying informative subsets of the Gene Ontology with information bottleneck methods 
Bioinformatics  2010;26(19):2445-2451.
Motivation: The Gene Ontology (GO) is a controlled vocabulary designed to represent the biological concepts pertaining to gene products. This study investigates the methods for identifying informative subsets of GO terms in an automatic and objective fashion. This task in turn requires addressing the following issues: how to represent the semantic context of GO terms, what metrics are suitable for measuring the semantic differences between terms, how to identify an informative subset that retains as much as possible of the original semantic information of GO.
Results: We represented the semantic context of a GO term using the word-usage-profile associated with the term, which enables one to measure the semantic differences between terms based on the differences in their semantic contexts. We further employed the information bottleneck methods to automatically identify subsets of GO terms that retain as much as possible of the semantic information in an annotation database. The automatically retrieved informative subsets align well with an expert-picked GO slim subset, cover important concepts and proteins, and enhance literature-based GO annotation.
Availability: http://carcweb.musc.edu/TextminingProjects/
Contact: xinghua@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btq449
PMCID: PMC2944202  PMID: 20702400
18.  Analysis of correlated mutations in HIV-1 protease using spectral clustering 
Bioinformatics  2008;24(10):1243-1250.
Motivation: The ability of human immunodeficiency virus-1 (HIV-1) protease to develop mutations that confer multi-drug resistance (MDR) has been a major obstacle in designing rational therapies against HIV. Resistance is usually imparted by a cooperative mechanism that can be elucidated by a covariance analysis of sequence data. Identification of such correlated substitutions of amino acids may be obscured by evolutionary noise.
Results: HIV-1 protease sequences from patients subjected to different specific treatments (set 1), and from untreated patients (set 2) were subjected to sequence covariance analysis by evaluating the mutual information (MI) between all residue pairs. Spectral clustering of the resulting covariance matrices disclosed two distinctive clusters of correlated residues: the first, observed in set 1 but absent in set 2, contained residues involved in MDR acquisition; and the second, included those residues differentiated in the various HIV-1 protease subtypes, shortly referred to as the phylogenetic cluster. The MDR cluster occupies sites close to the central symmetry axis of the enzyme, which overlap with the global hinge region identified from coarse-grained normal-mode analysis of the enzyme structure. The phylogenetic cluster, on the other hand, occupies solvent-exposed and highly mobile regions. This study demonstrates (i) the possibility of distinguishing between the correlated substitutions resulting from neutral mutations and those induced by MDR upon appropriate clustering analysis of sequence covariance data and (ii) a connection between global dynamics and functional substitution of amino acids.
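For illustration, the mutual information between two alignment columns (residue positions) can be computed as below; the resulting pairwise MI matrix is the kind of covariance matrix that spectral clustering is then applied to. The toy alignment is hypothetical and much smaller than real HIV-1 protease datasets.

import numpy as np
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (bits) between two alignment columns."""
    n = len(col_i)
    p_i = Counter(col_i); p_j = Counter(col_j); p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        pab = c / n
        mi += pab * np.log2(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Toy alignment: 6 sequences x 3 positions; positions 0 and 1 co-vary.
alignment = np.array([list(s) for s in ["ILV", "ILA", "VMV", "VMA", "ILV", "VMV"]])
M = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        M[i, j] = mutual_information(alignment[:, i], alignment[:, j])
print(M.round(2))   # MI matrix; spectral clustering of such a matrix follows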
Contact: bahar@ccbb.pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn110
PMCID: PMC2373918  PMID: 18375964
19.  RuleBender: a visual interface for rule-based modeling 
Bioinformatics  2011;27(12):1721-1722.
Summary: Rule-based modeling (RBM) is a powerful and increasingly popular approach to modeling intracellular biochemistry. Current interfaces for RBM are predominantly text-based and command-line driven. Better visual tools are needed to make RBM accessible to a broad range of users, to make specification of models less error prone and to improve workflows. We present RULEBENDER, an open-source visual interface that facilitates interactive debugging, simulation and analysis of RBMs.
Availability: RULEBENDER is freely available for Mac, Windows and Linux at http://rulebender.org.
Contact: faeder@pitt.edu; marai@cs.pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr197
PMCID: PMC3106190  PMID: 21493655
20.  Ratio adjustment and calibration scheme for gene-wise normalization to enhance microarray inter-study prediction 
Bioinformatics  2009;25(13):1655-1661.
Motivation: Reproducibility analyses of biologically relevant microarray studies have mostly focused on the overlap of detected biomarkers or the correlation of differential expression evidence across studies. For clinical utility, direct inter-study prediction (i.e. establishing a prediction model in one study and applying it to another) for disease diagnosis or prognosis is more important. Normalization plays a key role in such a task. Traditionally, sample-wise normalization has been the standard for inter-array and inter-study normalization. Gene-wise normalization has been implemented for intra-study or inter-study prediction in a few papers, while its rationale, strategy and effect remain unexplored.
Results: In this article, we investigate the effect of gene-wise normalization in microarray inter-study prediction. Gene-specific intensity discrepancies across studies are commonly found even after proper sample-wise normalization. We explore the rationale and necessity of gene-wise normalization. We also show that the ratio of sample sizes in normal versus diseased groups can greatly affect the performance of gene-wise normalization and an analytical method is developed to adjust for the imbalanced ratio effect. Both simulation results and applications to three lung cancer and two prostate cancer data sets, considering both binary classification and survival risk predictions, showed significant and robust improvement of the new adjustment. A calibration scheme is developed to apply the ratio-adjusted gene-wise normalization for prospective clinical trials. The number of calibration samples needed is estimated from existing studies and suggested for future applications. The result has important implication to the translational research of microarray as a practical disease diagnosis and prognosis prediction tool.
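As background for the gene-wise step, a plain gene-wise standardization (each gene centered and scaled within its own study) is sketched below. Note that the paper's main contribution, the adjustment for imbalanced normal/diseased group ratios and the calibration scheme, is deliberately omitted here; this is only a generic illustration.

import numpy as np

def gene_wise_standardize(train_expr, test_expr):
    """Standardize each gene using its own mean and SD within each study.

    This removes gene-specific intensity discrepancies between studies so
    that a model trained on one study can be applied to the other.
    """
    def standardize(X):
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True)
        return (X - mu) / np.where(sd == 0, 1.0, sd)
    return standardize(train_expr), standardize(test_expr)

rng = np.random.default_rng(0)
train = rng.normal(loc=8.0, scale=1.0, size=(100, 30))   # genes x samples, study A
test = rng.normal(loc=5.0, scale=2.0, size=(100, 20))    # same genes, other platform
train_n, test_n = gene_wise_standardize(train, test)
print(train_n.mean(axis=1)[:3].round(3), test_n.mean(axis=1)[:3].round(3))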
Contact: ctseng@pitt.edu
Availability: http://www.biostat.pitt.edu/bioinfo/
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp292
PMCID: PMC2732320  PMID: 19414534
21.  Principal component analysis of native ensembles of biomolecular structures (PCA_NEST): insights into functional dynamics 
Bioinformatics  2009;25(5):606-614.
Motivation: To efficiently analyze the ‘native ensemble of conformations’ accessible to proteins near their folded state and to extract essential information from observed distributions of conformations, reliable mathematical methods and computational tools are needed.
Results: Examination of 24 pairs of structures determined by both NMR and X-ray reveals that the differences in the dynamics of the same protein resolved by the two techniques can be tracked to the most robust low-frequency modes elucidated by principal component analysis (PCA) of the NMR models. The active sites of enzymes are found to be highly constrained in these PCA modes. Furthermore, the residues predicted to be highly immobile are shown to be evolutionarily conserved, lending support to a PCA-based identification of potential functional sites. An online tool, PCA_NEST, is designed to derive the principal modes of conformational changes from structural ensembles resolved by experiments or generated by computations.
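A bare-bones sketch of PCA over an ensemble of conformations is shown below (not PCA_NEST itself); it assumes the models have already been superposed on a common frame, a step that real usage requires and this toy example skips.

import numpy as np

def pca_modes(ensemble):
    """Principal modes of an ensemble of conformations.

    ensemble: array of shape (n_models, n_atoms, 3), e.g. NMR models.
    Returns eigenvalues (variances) and eigenvectors (modes) of the
    covariance of the flattened coordinates, largest first.
    """
    X = ensemble.reshape(ensemble.shape[0], -1)
    X = X - X.mean(axis=0)                      # remove the mean structure
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
ensemble = rng.normal(size=(20, 50, 3))         # 20 models, 50 atoms (toy data)
vals, vecs = pca_modes(ensemble)
print("fraction of variance in top 3 modes:",
      round(vals[:3].sum() / vals.sum(), 3))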
Availability: http://ignm.ccbb.pitt.edu/oPCA_Online.htm
Contact: lwy1@iam.u-tokyo.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp023
PMCID: PMC2647834  PMID: 19147661
22.  iGNM: a database of protein functional motions based on Gaussian Network Model 
Bioinformatics  2005;21(13):2978-2987.
Motivation
The knowledge of protein structure is not sufficient for understanding and controlling its function. Function is a dynamic property. Although protein structural information has been rapidly accumulating in databases, little effort has been invested to date toward systematically characterizing protein dynamics. The recent success of analytical methods based on elastic network models, and in particular the Gaussian Network Model (GNM), permits us to perform a high-throughput analysis of the collective dynamics of proteins.
Results
We computed the GNM dynamics for 20 058 structures from the Protein Data Bank, and generated information on the equilibrium dynamics at the level of individual residues. The results are stored on a web-based system called iGNM and configured so as to permit the users to visualize or download the results through a standard web browser using a simple search engine. Static and animated images for describing the conformational mobility of proteins over a broad range of normal modes are accessible, along with an online calculation engine available for newly deposited structures. A case study of the dynamics of 20 non-homologous hydrolases is presented to illustrate the utility of the iGNM database for identifying key residues that control the cooperative motions and revealing the connection between collective dynamics and catalytic activity.
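For readers unfamiliar with the GNM, the sketch below builds the Kirchhoff (connectivity) matrix from C-alpha coordinates and extracts the slow modes by eigendecomposition; the 7 Å-style cutoff and the toy helical chain are illustrative choices, not the iGNM pipeline.

import numpy as np

def gnm_modes(coords, cutoff=7.0):
    """Gaussian Network Model: slow modes from the Kirchhoff matrix.

    coords: (n_residues, 3) C-alpha coordinates. Residue pairs within the
    cutoff (same distance units as coords) are connected by identical springs.
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    kirchhoff = -(d < cutoff).astype(float)        # -1 for contacting pairs
    np.fill_diagonal(kirchhoff, 0.0)
    np.fill_diagonal(kirchhoff, -kirchhoff.sum(axis=1))  # diagonal = contact degree
    vals, vecs = np.linalg.eigh(kirchhoff)
    # The first eigenvalue is ~0 (rigid-body mode); the next ones are slow modes.
    return vals[1:], vecs[:, 1:]

t = np.arange(60)
coords = np.column_stack([5 * np.cos(t / 2.0), 5 * np.sin(t / 2.0), 0.5 * t])  # toy helix
vals, vecs = gnm_modes(coords)
print("slowest mode shape (first 5 residues):", vecs[:5, 0].round(3))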
Availability
http://ignm.ccbb.pitt.edu/
doi:10.1093/bioinformatics/bti469
PMCID: PMC1752228  PMID: 15860562
23.  Contact replacement for NMR resonance assignment 
Bioinformatics  2008;24(13):i205-i213.
Motivation: Complementing its traditional role in structural studies of proteins, nuclear magnetic resonance (NMR) spectroscopy is playing an increasingly important role in functional studies. NMR dynamics experiments characterize motions involved in target recognition, ligand binding, etc., while NMR chemical shift perturbation experiments identify and localize protein–protein and protein–ligand interactions. The key bottleneck in these studies is to determine the backbone resonance assignment, which allows spectral peaks to be mapped to specific atoms. This article develops a novel approach to address that bottleneck, exploiting an available X-ray structure or homology model to assign the entire backbone from a set of relatively fast and cheap NMR experiments.
Results: We formulate contact replacement for resonance assignment as the problem of computing correspondences between a contact graph representing the structure and an NMR graph representing the data; the NMR graph is a significantly corrupted, ambiguous version of the contact graph. We first show that by combining connectivity and amino acid type information, and exploiting the random structure of the noise, one can provably determine unique correspondences in polynomial time with high probability, even in the presence of significant noise (a constant number of noisy edges per vertex). We then detail an efficient randomized algorithm and show that, over a variety of experimental and synthetic datasets, it is robust to typical levels of structural variation (1–2 Å), noise (250–600%) and missing data (10–40%). Our algorithm achieves very good overall assignment accuracy, above 80% in α-helices, 70% in β-sheets and 60% in loop regions.
Availability: Our contact replacement algorithm is implemented in platform-independent Python code. The software can be freely obtained for academic use by request from the authors.
Contact: gopal@cs.purdue.edu; cbk@cs.dartmouth.edu
doi:10.1093/bioinformatics/btn167
PMCID: PMC2718645  PMID: 18586716
24.  Mycobacterium tuberculosis and Clostridium difficile interactomes: demonstration of rapid development of computational system for bacterial interactome prediction 
Background
Protein-protein interaction (PPI) networks (interactomes) of most organisms, except for some model organisms, are largely unknown. Experimental methods, including high-throughput techniques, are highly resource intensive. Therefore, computational discovery of PPIs can accelerate biological discovery by presenting the "most promising" pairs of proteins that are likely to interact. For many bacteria, the genome sequence, and thereby the genomic context of the proteome, is readily available; additionally, for some of these proteomes, localization and functional annotations are also available, but interactomes are not. We present here a method for the rapid development of a computational system to predict the interactomes of bacterial proteomes. While other studies have presented methods to transfer interologs across species, here we propose the transfer of computational models to benefit from cross-species annotations, thereby predicting many more novel interactions even in the absence of interologs. Mycobacterium tuberculosis (Mtb) and Clostridium difficile (CD) are used to demonstrate the work.
Results
We developed a random forest classifier over features derived from Gene Ontology annotations and genetic context scores provided by STRING database for predicting Mtb and CD interactions independently. The Mtb classifier gave a precision of 94% and a recall of 23% on a held out test set. The Mtb model was then run on all the 8 million protein pairs of the Mtb proteome, resulting in 708 new interactions (at 94% expected precision) or 1,595 new interactions at 80% expected precision. The CD classifier gave a precision of 90% and a recall of 16% on a held out test set. The CD model was run on all the 8 million protein pairs of the CD proteome, resulting in 143 new interactions (at 90% expected precision) or 580 new interactions (at 80% expected precision). We also compared the overlap of predictions of our method with STRING database interactions for CD and Mtb and also with interactions identified recently by a bacterial 2-hybrid system for Mtb. To demonstrate the utility of transfer of computational models, we made use of the developed Mtb model and used it to predict CD protein-pairs. The cross species model thus developed yielded a precision of 88% at a recall of 8%. To demonstrate transfer of features from other organisms in the absence of feature-based and interaction-based information, we transferred missing feature values from Mtb orthologs into the CD data. In transferring this data from orthologs (not interologs), we showed that a large number of interactions can be predicted.
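A generic illustration of the classifier setup described above is sketched below: a random forest trained on per-pair feature vectors, with the decision threshold raised to trade recall for precision. The features and labels are synthetic stand-ins; constructing real GO- and STRING-derived features, and the actual thresholds used in the paper, are not reproduced here.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Toy stand-in features: each row describes one protein pair (e.g. GO similarity
# and genomic-context scores as columns); label 1 marks an interacting pair.
rng = np.random.default_rng(0)
n_pairs, n_features = 2000, 8
X = rng.random((n_pairs, n_features))
y = (X[:, :3].mean(axis=1) + 0.2 * rng.random(n_pairs) > 0.65).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Raise the decision threshold to favor precision over recall,
# mirroring the "precision at a chosen operating point" style of reporting.
proba = clf.predict_proba(X_te)[:, 1]
pred = (proba > 0.8).astype(int)
print("precision:", round(precision_score(y_te, pred), 2),
      "recall:", round(recall_score(y_te, pred), 2))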
Conclusions
Rapid discovery of a (partial) bacterial interactome can be achieved by using the existing set of GO and STRING features associated with the organism. Cross-species model transfer can be used when there are not even sufficient known interactions to develop a computational prediction system for the target organism: the computational model of a well-studied organism can be employed to make the initial interactome prediction for the target organism. We have also demonstrated that annotations can be successfully transferred from orthologs in well-studied organisms, enabling accurate predictions for organisms with no annotations. These approaches can serve as building blocks for addressing the challenges of feature coverage and missing interactions, towards rapid interactome discovery for bacterial organisms.
Availability
The predictions for all Mtb and CD proteins are made available at: http://severus.dbmi.pitt.edu/TB and http://severus.dbmi.pitt.edu/CD respectively for browsing as well as for download.
doi:10.1186/2042-5783-2-4
PMCID: PMC3353838  PMID: 22587966
25.  BNFinder2: Faster Bayesian network learning and Bayesian classification 
Bioinformatics  2013;29(16):2068-2070.
Summary: Bayesian networks (BNs) are versatile probabilistic models applicable to many different biological phenomena. In biological applications the structure of the network is usually unknown and needs to be inferred from experimental data. BNFinder is a fast software implementation of an exact algorithm for finding the optimal structure of the network given a number of experimental observations. Its second version, presented in this article, represents a major improvement over the previous version. The improvements include (i) a parallelized learning algorithm leading to an order-of-magnitude speed-up in BN structure learning time; (ii) inclusion of an additional scoring function based on mutual information criteria; (iii) the possibility of choosing the resulting network's specificity based on statistical criteria and (iv) a new module for classification by BNs, including a cross-validation scheme and classifier quality measurement with receiver operating characteristic scores.
Availability and implementation: BNFinder2 is implemented in Python and freely available under the GNU General Public License at the project Web site https://launchpad.net/bnfinder, together with a user's manual, introductory tutorial and supplementary methods.
Contact: dojer@mimuw.edu.pl or bartek@mimuw.edu.pl
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt323
PMCID: PMC3722519  PMID: 23818512
