
Results 1-10 (10)

1.  Detection of Protein Catalytic Sites in the Biomedical Literature 
This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. We provide strong evidence for the need for text mining techniques that address residue-level protein function annotation through an analysis of two corpora in terms of their coverage of curated data sources. We also explore the viability of building a text-based classifier for identifying protein functional sites, identifying the low coverage of curated data sources and the potential ambiguity of information about protein functional sites as challenges that must be addressed. Nevertheless, we produce a simple classifier that achieves a reasonable ∼69% F-score on our full-text silver corpus in this first attempt to address the classification task. The work has application in computational prediction of the functional significance of protein sites as well as in curation workflows for databases that capture this information.
PMCID: PMC3664919  PMID: 23424147
text mining; information extraction; machine learning; catalytic site; biomedical literature; biomedical natural language processing; protein functional sites
2.  Literature mining of protein-residue associations with graph rules learned through distant supervision 
Journal of Biomedical Semantics  2012;3(Suppl 3):S2.
We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model.
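As an illustration of what a linguistic pattern for residue mentions might look like, the following is a minimal sketch, not the patterns used in the paper. It assumes mentions are written as a three-letter amino acid code followed by a sequence position (for example "Ser195", "His-57" or "Asp 102"). The names THREE_LETTER, RESIDUE_RE and find_residue_mentions are hypothetical. The step of associating each mention with a protein is not shown.

```python
import re

# Standard three-letter codes for the 20 common amino acids.
THREE_LETTER = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

# Assumed surface form: code, optional hyphen or space, then digits.
RESIDUE_RE = re.compile(r"\b(" + "|".join(THREE_LETTER) + r")[- ]?(\d+)\b")


def find_residue_mentions(text):
    """Return (amino_acid, position) pairs for each mention found in text."""
    return [(m.group(1), int(m.group(2))) for m in RESIDUE_RE.finditer(text)]


sentence = "The catalytic triad comprises Ser195, His-57 and Asp 102."
mentions = find_residue_mentions(sentence)
```

A pattern this simple misses one-letter forms such as "S195" and full names such as "serine 195", which a practical extractor would also need to handle.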
The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved an F-measure of 0.84 on an automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods.
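For readers unfamiliar with the metric, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced form (beta = 1, the harmonic mean of the two); the abstract does not state which weighting was used, and the function name f_measure is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives the balanced F1 score (an assumption here);
    beta > 1 weights recall more heavily than precision.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


balanced = f_measure(0.5, 1.0)  # hypothetical precision and recall values
```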
The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
PMCID: PMC3465209  PMID: 23046792
3.  Text Mining Improves Prediction of Protein Functional Sites 
PLoS ONE  2012;7(2):e32171.
We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
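The "roughly six times more likely" comparison can be read as a ratio of two rates: the fraction of text-mentioned residues that lie in a functional site, divided by the same fraction for all other residues. The sketch below illustrates that arithmetic under this reading. The counts are hypothetical and are not taken from the paper, and the function name relative_likelihood is an assumption.

```python
def relative_likelihood(mentioned_hits, mentioned_total, other_hits, other_total):
    """Ratio of functional-site rates: mentioned residues vs. other residues.

    Computed as (mentioned_hits / mentioned_total) / (other_hits / other_total),
    rearranged to a single division.
    """
    if mentioned_total == 0 or other_hits == 0:
        raise ValueError("rates are undefined for these counts")
    return (mentioned_hits * other_total) / (mentioned_total * other_hits)


# Hypothetical counts: 30 of 100 mentioned residues are in a functional site,
# against 50 of 1000 residues that are not mentioned in text.
ratio = relative_likelihood(30, 100, 50, 1000)
```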
PMCID: PMC3290545  PMID: 22393388
4.  Genome Majority Vote Improves Gene Predictions 
PLoS Computational Biology  2011;7(11):e1002284.
Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated Escherichia coli genes as a standard to quantify accuracy. Results showed that the GMV algorithm can correct hundreds of gene prediction errors in sets of five or ten genomes while introducing few errors. Using a conservative calculation, we project that GMV would resolve many inconsistencies and errors in publicly available microbial gene maps. Our simple and logical solution provides a notable advance toward accurate gene maps.
Author Summary
The genetic code tells us precisely how a DNA sequence will be translated into a protein. However, it is more difficult to identify where translation will start and stop in the entire length of an organism's genome sequence. Computer software can predict where the start sites are, and this is successful most of the time; however, errors do occur. We hypothesized that some errors might be corrected by comparing predictions for the genome sequences of closely related organisms. This correction scheme seems especially appropriate for bacterial genomes: not only is protein production in bacteria simpler than in higher organisms, but hundreds of bacterial DNA sequences are now available, and many of these are closely related. To test the hypothesis, we developed a method to detect whether a gene's start site is inconsistent with the majority of equivalent genes in a set of related bacterial genomes. The method then modifies the start if it can be made consistent with the majority of genomes. Our tests show this majority vote method improves the accuracy of gene start sites. Application of the method to existing bacterial genomes should eliminate many inconsistencies and correct a large number of errors.
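The majority vote idea described above can be sketched as follows. This is an illustration under stated assumptions, not the published GMV algorithm: it assumes the start of each ortholog has already been mapped to a comparable coordinate (for example, a column in a shared alignment), and it only flags genomes that disagree with a strict majority. Checking whether a flagged start can actually be moved to the consensus position is not shown. The names majority_start and inconsistent_genomes are hypothetical.

```python
from collections import Counter


def majority_start(starts):
    """Return the start position shared by a strict majority, or None.

    starts maps a genome name to the start coordinate of its ortholog.
    """
    if not starts:
        return None
    position, count = Counter(starts.values()).most_common(1)[0]
    return position if count > len(starts) / 2 else None


def inconsistent_genomes(starts):
    """List genomes whose start differs from the majority start."""
    consensus = majority_start(starts)
    if consensus is None:
        return []  # no strict majority, so nothing is flagged
    return sorted(g for g, s in starts.items() if s != consensus)


# Hypothetical ortholog set across five genomes.
example = {"A": 12, "B": 12, "C": 15, "D": 12, "E": 12}
flagged = inconsistent_genomes(example)
```

Requiring a strict majority is a conservative design choice: when the genomes are evenly split, no start is changed.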
PMCID: PMC3219611  PMID: 22131910
5.  Consistency of gene starts among Burkholderia genomes 
BMC Genomics  2011;12:125.
Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins.
Existing Genbank gene maps for Burkholderia genomes suggest that extensive divergence has occurred: 53% of ortholog sets based on Genbank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the Genbank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides.
Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps.
PMCID: PMC3049151  PMID: 21342528
6.  Model of Transcriptional Activation by MarA in Escherichia coli 
PLoS Computational Biology  2009;5(12):e1000614.
The AraC family transcription factor MarA activates ∼40 genes (the marA/soxS/rob regulon) of the Escherichia coli chromosome resulting in different levels of resistance to a wide array of antibiotics and to superoxides. Activation of marA/soxS/rob regulon promoters occurs in a well-defined order with respect to the level of MarA; however, the order of activation does not parallel the strength of MarA binding to promoter sequences. To understand this lack of correspondence, we developed a computational model of transcriptional activation in which a transcription factor either increases or decreases RNA polymerase binding, and either accelerates or retards post-binding events associated with transcription initiation. We used the model to analyze data characterizing MarA regulation of promoter activity. The model clearly explains the lack of correspondence between the order of activation and the MarA-DNA affinity and indicates that the order of activation can only be predicted using information about the strength of the full MarA-polymerase-DNA interaction. The analysis further suggests that MarA can activate without increasing polymerase binding and that activation can even involve a decrease in polymerase binding, which is opposite to the textbook model of activation by recruitment. These findings are consistent with published chromatin immunoprecipitation assays of interactions between polymerase and the E. coli chromosome. We find that activation involving decreased polymerase binding yields lower latency in gene regulation and therefore might confer a competitive advantage to cells. Our model yields insights into requirements for predicting the order of activation of a regulon and enables us to suggest that activation might involve a decrease in polymerase binding, which we expect to be an important theme of gene regulation in E. coli and beyond.
Author Summary
When environmental conditions change, cell survival can depend on sudden production of proteins that are normally in low demand. Protein production is controlled by transcription factors which bind to DNA near genes and either increase or decrease RNA production. Many puzzles remain concerning the ways transcription factors do this. Recently we collected data relating the intracellular level of a single transcription factor, MarA, to the increase in expression of several genes related to antibiotic and superoxide resistance in Escherichia coli. These data indicated that target genes are turned on in a well-defined order with respect to the level of MarA, enabling cells to mount a response that is commensurate to the level of threat detected in the environment. Here we develop a computational model to yield insight into how MarA turns on its target genes. The modeling suggests that MarA can increase the frequency with which a transcript is made while decreasing the overall presence of the transcription machinery at the start of a gene. This mechanism is opposite to the textbook model of transcriptional activation; nevertheless it enables cells to respond quickly to environmental challenges and is likely of general importance for gene regulation in E. coli and beyond.
PMCID: PMC2787020  PMID: 20019803
7.  Activation of the E. coli marA/soxS/rob regulon in response to transcriptional activator concentration 
Journal of molecular biology  2008;380(2):278-284.
The paralogous transcriptional activators, MarA, SoxS and Rob, activate a common set of promoters, the marA/soxS/rob regulon of Escherichia coli, by binding a cognate site (marbox) upstream of each promoter. The extent of activation varies from one promoter to another and is only poorly correlated with the in vitro affinity of the activator for the specific marbox. Here, we examine the dependence of promoter activation on the level of activator in vivo by manipulating the steady-state concentrations of MarA and SoxS in Lon protease mutants and measuring promoter activation using lacZ transcriptional fusions. We found that: (i) the MarA concentrations needed for half-maximal stimulation varied by at least 19-fold among the 10 promoters tested; (ii) most marboxes were not saturated when there were 24,000 molecules of MarA per cell; (iii) the correlation between the MarA concentration needed for half-maximal promoter activity in vivo and marbox binding affinity in vitro was poor; and (iv) the two activators differed in their promoter activation profiles. The marRAB and sodA promoters could both be saturated by MarA and SoxS in vivo. However, saturation by MarA resulted in greater marRAB and lesser sodA transcription than did saturation by SoxS, implying that the two activators interact with RNAP in different ways at the different promoters. Thus, the concentration and nature of the activator determine which regulon promoters are activated and the extent of their activation.
PMCID: PMC2614912  PMID: 18514222
gene regulation; AraC protein family; stress response
8.  Fast dynamics perturbation analysis for prediction of protein functional sites 
We present a fast version of the dynamics perturbation analysis (DPA) algorithm to predict functional sites in protein structures. The original DPA algorithm finds regions in proteins where interactions cause a large change in the protein conformational distribution, as measured using the relative entropy Dx. Such regions are associated with functional sites.
The Fast DPA algorithm, which accelerates DPA calculations, is motivated by an empirical observation that Dx in a normal-modes model is highly correlated with an entropic term that only depends on the eigenvalues of the normal modes. The eigenvalues are accurately estimated using first-order perturbation theory, resulting in an N-fold reduction in the overall computational requirements of the algorithm, where N is the number of residues in the protein. The performance of the original and Fast DPA algorithms was compared using protein structures from a standard small-molecule docking test set. For nominal implementations of each algorithm, top-ranked Fast DPA predictions overlapped the true binding site 94% of the time, compared to 87% of the time for original DPA. In addition, per-protein recall statistics (fraction of binding-site residues that are among predicted residues) were slightly better for Fast DPA. On the other hand, per-protein precision statistics (fraction of predicted residues that are among binding-site residues) were slightly better using original DPA. Overall, the performance of Fast DPA in predicting ligand-binding-site residues was comparable to that of the original DPA algorithm.
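The per-protein recall and precision statistics defined in the paragraph above can be computed directly from two sets of residue indices. The following is a minimal sketch of those two definitions. The function name precision_recall and the residue numbers are hypothetical, and returning 0.0 for an empty set is an assumption made here to avoid division by zero.

```python
def precision_recall(predicted, binding_site):
    """Per-protein precision and recall for a set of predicted residues.

    precision: fraction of predicted residues that are binding-site residues.
    recall: fraction of binding-site residues that are among predicted residues.
    """
    predicted, binding_site = set(predicted), set(binding_site)
    overlap = predicted & binding_site
    precision = len(overlap) / len(predicted) if predicted else 0.0
    recall = len(overlap) / len(binding_site) if binding_site else 0.0
    return precision, recall


# Hypothetical residue indices for one protein.
precision, recall = precision_recall({10, 11, 12, 40}, {11, 12, 13, 14, 15})
```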
Compared to the original DPA algorithm, the decreased run time with comparable performance makes Fast DPA well-suited for implementation on a web server and for high-throughput analysis.
PMCID: PMC2276503  PMID: 18234095
10.  Domain motions of Argonaute, the catalytic engine of RNA interference 
BMC Bioinformatics  2007;8:470.
The Argonaute protein is the core component of the RNA-induced silencing complex, playing the central role of cleaving the mRNA target. Visual inspection of static crystal structures has already enabled researchers to suggest conformational changes of Argonaute that might occur during RNA interference. We have taken the next step by performing an all-atom normal mode analysis of the Pyrococcus furiosus and Aquifex aeolicus Argonaute crystal structures, allowing us to quantitatively assess the feasibility of these conformational changes. To perform the analysis, we begin with the energy-minimized X-ray structures. Normal modes are then calculated using an all-atom molecular mechanics force field.
The analysis reveals low-frequency vibrations that facilitate the accommodation of RNA duplexes – an essential step in target recognition. The Pyrococcus furiosus and Aquifex aeolicus Argonaute proteins both exhibit low-frequency torsion and hinge motions; however, differences in the overall architecture of the proteins cause the detailed dynamics to be significantly different.
Overall, low-frequency vibrations of Argonaute are consistent with mechanisms within the current reaction cycle model for RNA interference.
PMCID: PMC2238725  PMID: 18053142
