
Results 1-5 (5)

1.  Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature 
Bioinformatics  2010;27(3):408-415.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and BCa, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031038  PMID: 21138947
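As a rough illustration of the kind of text mining described in the abstract above, the following minimal sketch finds single-letter point-mutation mentions (such as V600E) with a regular expression and scores a prediction set by precision. It is not EMU's or MutationFinder's actual pattern set; the names `POINT_MUTATION`, `find_point_mutations`, `matches_sequence` and `precision` are hypothetical, and the sequence check is only one simple assumption about what a sequence-based filter might test.

```python
import re

# The 20 standard amino acids in one-letter code.
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern: wild-type residue, position, mutant residue (e.g. V600E).
POINT_MUTATION = re.compile(rf"\b([{AA1}])([1-9][0-9]*)([{AA1}])\b")


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples for each mention in text."""
    return [
        (m.group(1), int(m.group(2)), m.group(3))
        for m in POINT_MUTATION.finditer(text)
    ]


def matches_sequence(mutation, protein_sequence):
    """Assumed filter: keep a mention only if the wild-type residue is
    actually found at the stated 1-based position of the protein sequence."""
    wild_type, position, _mutant = mutation
    if position > len(protein_sequence):
        return False
    return protein_sequence[position - 1] == wild_type


def precision(predicted, gold):
    """Fraction of predicted items that also appear in the gold standard."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(gold)) / len(predicted)


mentions = find_point_mutations(
    "The BRAF V600E and TP53 R175H variants were observed."
)
```

A purely lexical pattern like this also matches tokens that are not mutations (for example the cell-line name T47D), which is one reason a downstream filter can raise precision.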
2.  DMDM: domain mapping of disease mutations 
Bioinformatics  2010;26(19):2458-2459.
Summary: Domain mapping of disease mutations (DMDM) is a database in which each disease mutation can be displayed by its gene, protein or domain location. DMDM provides a unique domain-level view where all human coding mutations are mapped on the protein domain. To build DMDM, all human proteins were aligned to a database of conserved protein domains using a Hidden Markov Model-based sequence alignment tool (HMMer). The resulting protein-domain alignments were used to provide a domain location for all available human disease mutations and polymorphisms. The number of disease mutations and polymorphisms in each domain position are displayed alongside other relevant functional information (e.g. the binding and catalytic activity of the site and the conservation of that domain location). DMDM's protein domain view highlights molecular relationships among mutations from different diseases that might not be clearly observed with traditional gene-centric visualization tools.
Availability: Freely available at
PMCID: PMC2944201  PMID: 20685956
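The domain-level view described in the summary above can be pictured with a small sketch: protein positions are translated into domain positions through an alignment, and mutations are then counted per domain position. This is an illustration under simplifying assumptions, not DMDM's implementation; the alignment is represented as hypothetical ungapped blocks of (protein start, domain start, length) in 1-based coordinates, and all function names are invented for the example.

```python
from collections import Counter


def protein_to_domain_position(protein_pos, blocks):
    """Map a 1-based protein position to a 1-based domain position.

    blocks is a list of (protein_start, domain_start, length) tuples,
    each describing one ungapped aligned stretch. Returns None when the
    position falls outside every aligned block.
    """
    for protein_start, domain_start, length in blocks:
        if protein_start <= protein_pos < protein_start + length:
            return domain_start + (protein_pos - protein_start)
    return None


def count_by_domain_position(mutations_by_protein, blocks_by_protein):
    """Count mutations per domain position across several proteins."""
    counts = Counter()
    for protein, positions in mutations_by_protein.items():
        blocks = blocks_by_protein.get(protein, [])
        for pos in positions:
            domain_pos = protein_to_domain_position(pos, blocks)
            if domain_pos is not None:
                counts[domain_pos] += 1
    return counts


# Two hypothetical proteins aligned to the same domain model.
blocks_by_protein = {
    "P1": [(10, 1, 5), (20, 6, 3)],
    "P2": [(5, 1, 5)],
}
mutations_by_protein = {"P1": [12, 16], "P2": [7]}
domain_counts = count_by_domain_position(mutations_by_protein, blocks_by_protein)
```

Here mutations at different positions in two different proteins land on the same domain position, which is the kind of cross-gene relationship a gene-centric view would not show.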
3.  Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics 
Bioinformatics  2010;26(14):1708-1713.
Motivation: Since database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This article points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled receiver operating characteristic for n irrelevant records (ROCn) score, the area under the ROC curve (AUC) of a ‘pooled’ ROC curve, truncated at n irrelevant records. Unfortunately, the pooled ROCn score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROCn score can be very sensitive to retrieval results from as few as a single query.
Methods: To replace the pooled ROCn score, we propose the Threshold Average Precision (TAP-k), a measure closely related to the well-known average precision in information retrieval, but reflecting the usage of E-values in bioinformatics. Furthermore, in addition to conditions previously given in the literature, we introduce three new criteria that an ideal measure of retrieval efficacy should satisfy.
Results: PSI-BLAST, GLOBAL, HMMER and RPS-BLAST provided examples of using the TAP-k and pooled ROCn scores to evaluate sequence retrieval algorithms. In particular, compelling examples using real data highlight the drawbacks of the pooled ROCn score, showing that it can produce evaluations skewing far from intuitive expectations. In contrast, the TAP-k satisfies most of the criteria desired in an ideal measure of retrieval efficacy.
Availability and Implementation: The TAP-k web server and downloadable Perl script are freely available at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2894514  PMID: 20505002
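The abstract above describes TAP-k as closely related to the average precision used in information retrieval. The sketch below computes that classical average precision for a single ranked result list. It is not the TAP-k itself, which additionally involves an E-value threshold not specified in the abstract; the function name and the optional `total_relevant` parameter are assumptions made for the example.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Classical average precision for one ranked retrieval list.

    ranked_relevance is a sequence of booleans, best-ranked first, where
    True marks a relevant record. total_relevant is the number of relevant
    records in the database; if omitted, the number retrieved is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant record.
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


# Relevant records at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([True, False, True])
```

Passing `total_relevant` penalizes a list that misses relevant records entirely, which the default denominator does not.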
4.  Gain and loss of phosphorylation sites in human cancer 
Bioinformatics  2008;24(16):i241-i247.
Motivation: Coding-region mutations in human genes are responsible for a diverse spectrum of diseases and phenotypes. Among lesions that have been studied extensively, there are insights into several of the biochemical functions disrupted by disease-causing mutations. Currently, there are more than 60 000 coding-region mutations associated with inherited disease catalogued in the Human Gene Mutation Database (HGMD, August 2007) and more than 70 000 polymorphic amino acid substitutions recorded in dbSNP (dbSNP, build 127). Understanding the mechanism and contribution these variants make to a clinical phenotype is a formidable problem.
Results: In this study, we investigate the role of phosphorylation in somatic cancer mutations and inherited diseases. Somatic cancer mutation datasets were shown to have a significant enrichment for mutations that cause gain or loss of phosphorylation when compared to our control datasets (putatively neutral nsSNPs and random amino acid substitutions). Of the somatic cancer mutations, those in kinase genes represent the most enriched set of mutations that disrupt phosphorylation sites, suggesting phosphorylation target site mutation is an active cause of phosphorylation deregulation. Overall, this evidence suggests both gain and loss of a phosphorylation site in a target protein may be important features for predicting cancer-causing mutations and may represent a molecular cause of disease for a number of inherited and somatic mutations.
PMCID: PMC2732209  PMID: 18689832
5.  Predicting protein–protein interaction by searching evolutionary tree automorphism space 
Bioinformatics (Oxford, England)  2005;21(Suppl 1):i241-i250.
Uncovering the protein–protein interaction network is a fundamental step in the quest to understand the molecular machinery of a cell. This motivates the search for efficient computational methods for predicting such interactions. Among the available predictors are those that are based on the co-evolution hypothesis “evolutionary trees of protein families (that are known to interact) are expected to have similar topologies”. Many of these methods are limited by the fact that they can handle only a small number of protein sequences. Also, details on evolutionary tree topology are missing as they use similarity matrices in lieu of the trees.
We introduce MORPH, a new algorithm for predicting protein interaction partners between members of two protein families that are known to interact. Our approach can also be seen as a new method for searching the best superposition of the corresponding evolutionary trees based on tree automorphism group. We discuss relevant facts related to the predictability of protein–protein interaction based on their co-evolution. When compared with related computational approaches, our method reduces the search space by ~3 × 10^5-fold and at the same time increases the accuracy of predicting correct binding partners.
PMCID: PMC1618802  PMID: 15961463
