Search tips
Search criteria

Results 1-25 (388214)

Clipboard (0)

Related Articles

1.  CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences 
BMC Bioinformatics  2007;8:129.
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data.
CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniprotKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene predication program and the potential domains on annotated GSS were analyzed using the HMMER package against the Pfam database. The annotated GSS were also assigned with Gene Ontology annotation terms and integrated with 228 curated plant metabolic pathways from the Arabidopsis Information Resource (TAIR) knowledge base. The UniProtKB-Swiss-Prot ENZYME database was used to assign putative enzymatic function to each GSS. Each GSS was also analyzed with the Tandem Repeat Finder (TRF) program in order to identify potential SSRs for molecular marker discovery. The raw sequence data, processed annotation, and SSR results were stored in relational tables designed in key-value pair fashion using a PostgreSQL relational database management system. The biological knowledge derived from the sequence data and processed results are represented as views or materialized views in the relational database management system. All materialized views are indexed for quick data access and retrieval. Data processing and analysis pipelines were implemented using the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The CPU intensive data processing and analysis pipelines were run on a computer cluster of more than 30 dual-processor Apple XServes. A job management system called Vela was created as a robust way to submit large numbers of jobs to the Portable Batch System (PBS).
CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at .
PMCID: PMC1868039  PMID: 17445272
2.  Analysis and Interpretation of Multiple Proteomic Datasets Biologically Relevant Information Obtained in Less than 3 Hours 
A growing issue in proteomics is data interpretation, in particular the time-consuming steps of analyzing proteins and peptides identified by database search engines, extracting biologically meaningful information, and sharing results with co-workers and collaborators. This poster shows the application of a bioinformatics tool that was used to manage protein and peptide lists, put them into a biological context and enable the results to be shared directly. Within 3 hours, output generated by protein database search engines sourced from multiple proteomic data sets, was translated into biological information and shared with collaborators.The bioinformatics tool, ProteinCenter (Proxeon) uses biological annotations from multiple resources to produce a biologically-relevant overview in large-scale proteomics studies. ProteinCenter contains public sequence databases to form a comprehensive and consistent superset of 12 million protein sequences derived from over 50 million protein records from GenBank, Refseq, EMBL, UniProt, Swiss-Prot, Trembl, PIR, IPI, PDB, Ensembl etc., including more than 5 million outdated accession numbers. The ProteinCenter database is built using Sun Java technology and Microsoft mySQL database technology for optimal performance. Here we present the bioinformatic analysis and comparison of proteomics datasets derived from the PRIDE database, including an organ-specific proteome map for Arabidopsis thaliana and HUPO projects. Protein identifications were clustered using ProteinCenter algorithms based on indistinguishable proteins or sequence homology. The results of ProteinCenter data processing will be presented including statistical analysis of over- and under-represented features such as gene ontology categories, PFAM annotated proteins, signal peptide proteins, trans-membrane annotated proteins, enzymes, involvement in KEGG pathways and others.A bioinformatics tool that gives access to all major protein repositories enabled analysis and interpretation of multiple, large scale proteomics datasets, sourced directly from search engines, to be performed within 2-3 hours.
PMCID: PMC2918002
3.  Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation 
PLoS ONE  2011;6(4):e18910.
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.
PMCID: PMC3083393  PMID: 21556138
4.  MitoProteome: mitochondrial protein sequence database and annotation system 
Nucleic Acids Research  2004;32(Database issue):D463-D467.
MitoProteome is an object-relational mitochondrial protein sequence database and annotation system. The initial release contains 847 human mitochondrial protein sequences, derived from public sequence databases and mass spectrometric analysis of highly purified human heart mitochondria. Each sequence is manually annotated with primary function, subfunction and subcellular location, and extensively annotated in an automated process with data extracted from external databases, including gene information from LocusLink and Ensembl; disease information from OMIM; protein–protein interaction data from MINT and DIP; functional domain information from Pfam; protein fingerprints from PRINTS; protein family and family-specific signatures from InterPro; structure data from PDB; mutation data from PMD; BLAST homology data from NCBI NR; and proteins found to be related based on LocusLink and SWISS-PROT references and sequence and taxonomy data. By highly automating the processes of maintaining the MitoProteome Protein List and extracting relevant data from external databases, we are able to present a dynamic database, updated frequently to reflect changes in public resources. The MitoProteome database is publicly available at Users may browse and search MitoProteome, and access a complete compilation of data relevant to each protein of interest, cross-linked to external databases.
PMCID: PMC308782  PMID: 14681458
5.  GFam: a platform for automatic annotation of gene families 
Nucleic Acids Research  2012;40(19):e152.
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from
PMCID: PMC3479161  PMID: 22790981
6.  3D-GENOMICS: a database to compare structural and functional annotations of proteins between sequenced genomes 
Nucleic Acids Research  2004;32(Database issue):D245-D250.
The 3D-GENOMICS database ( provides structural annotations for proteins from sequenced genomes. In August 2003 the database included data for 93 proteomes. The annotations stored in the database include homologous sequences from various sequence databases, domains from SCOP and Pfam, patterns from Prosite and other predicted sequence features such as transmembrane regions and coiled coils. In addition to annotations at the sequence level, several precomputed cross- proteome comparative analyses are available based on SCOP domain superfamily composition. Annotations are available to the user via a web interface to the database. Multiple points of entry are available so that a user is able to: (i) directly access annotations for a single protein sequence via keywords or accession codes, (ii) examine a sequence of interest chosen from a summary of annotations for a particular proteome, or (iii) access precomputed frequency-based cross-proteome comparative analyses.
PMCID: PMC308798  PMID: 14681404
7.  The TIGRFAMs database of protein families 
Nucleic Acids Research  2003;31(1):371-373.
TIGRFAMs is a collection of manually curated protein families consisting of hidden Markov models (HMMs), multiple sequence alignments, commentary, Gene Ontology (GO) assignments, literature references and pointers to related TIGRFAMs, Pfam and InterPro models. These models are designed to support both automated and manually curated annotation of genomes. TIGRFAMs contains models of full-length proteins and shorter regions at the levels of superfamilies, subfamilies and equivalogs, where equivalogs are sets of homologous proteins conserved with respect to function since their last common ancestor. The scope of each model is set by raising or lowering cutoff scores and choosing members of the seed alignment to group proteins sharing specific function (equivalog) or more general properties. The overall goal is to provide information with maximum utility for the annotation process. TIGRFAMs is thus complementary to Pfam, whose models typically achieve broad coverage across distant homologs but end at the boundaries of conserved structural domains. The database currently contains over 1600 protein families. TIGRFAMs is available for searching or downloading at
PMCID: PMC165575  PMID: 12520025
8.  The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation 
BMC Bioinformatics  2008;9:52.
Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.
PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.
PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA.
We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%).
Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.
The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
PMCID: PMC2259298  PMID: 18221520
9.  HMMerThread: Detecting Remote, Functional Conserved Domains in Entire Genomes by Combining Relaxed Sequence-Database Searches with Fold Recognition 
PLoS ONE  2011;6(3):e17568.
Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at
PMCID: PMC3053371  PMID: 21423752
10.  The Protein Information Resource: an integrated public resource of functional annotation of proteins 
Nucleic Acids Research  2002;30(1):35-37.
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site ( features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (
PMCID: PMC99125  PMID: 11752247
11.  Re-annotation and re-analysis of the Campylobacter jejuni NCTC11168 genome sequence 
BMC Genomics  2007;8:162.
Campylobacter jejuni is the leading bacterial cause of human gastroenteritis in the developed world. To improve our understanding of this important human pathogen, the C. jejuni NCTC11168 genome was sequenced and published in 2000. The original annotation was a milestone in Campylobacter research, but is outdated. We now describe the complete re-annotation and re-analysis of the C. jejuni NCTC11168 genome using current database information, novel tools and annotation techniques not used during the original annotation.
Re-annotation was carried out using sequence database searches such as FASTA, along with programs such as TMHMM for additional support. The re-annotation also utilises sequence data from additional Campylobacter strains and species not available during the original annotation. Re-annotation was accompanied by a full literature search that was incorporated into the updated EMBL file [EMBL: AL111168]. The C. jejuni NCTC11168 re-annotation reduced the total number of coding sequences from 1654 to 1643, of which 90.0% have additional information regarding the identification of new motifs and/or relevant literature. Re-annotation has led to 18.2% of coding sequence product functions being revised.
Major updates were made to genes involved in the biosynthesis of important surface structures such as lipooligosaccharide, capsule and both O- and N-linked glycosylation. This re-annotation will be a key resource for Campylobacter research and will also provide a prototype for the re-annotation and re-interpretation of other bacterial genomes.
PMCID: PMC1899501  PMID: 17565669
12.  FunSecKB: the Fungal Secretome KnowledgeBase 
The Fungal Secretome KnowledgeBase (FunSecKB) provides a resource of secreted fungal proteins, i.e. secretomes, identified from all available fungal protein data in the NCBI RefSeq database. The secreted proteins were identified using a well evaluated computational protocol which includes SignalP, WolfPsort and Phobius for signal peptide or subcellular location prediction, TMHMM for identifying membrane proteins, and PS-Scan for identifying endoplasmic reticulum (ER) target proteins. The entries were mapped to the UniProt database and any annotations of subcellular locations that were either manually curated or computationally predicted were included in FunSecKB. Using a web-based user interface, the database is searchable, browsable and downloadable by using NCBI’s RefSeq accession or gi number, UniProt accession number, keyword or by species. A BLAST utility was integrated to allow users to query the database by sequence similarity. A user submission tool was implemented to support community annotation of subcellular locations of fungal proteins. With the complete fungal data from RefSeq and associated web-based tools, FunSecKB will be a valuable resource for exploring the potential applications of fungal secreted proteins.
Database URL:
PMCID: PMC3263735  PMID: 21300622
13.  Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature 
Bioinformatics  2010;27(3):408-415.
Motivation: A major goal of biomedical research in personalized medicine is to find relationships between mutations and their corresponding disease phenotypes. However, most of the disease-related mutational data are currently buried in the biomedical literature in textual form and lack the necessary structure to allow easy retrieval and visualization. We introduce a high-throughput computational method for the identification of relevant disease mutations in PubMed abstracts applied to prostate (PCa) and breast cancer (BCa) mutations.
Results: We developed the extractor of mutations (EMU) tool to identify mutations and their associated genes. We benchmarked EMU against MutationFinder—a tool to extract point mutations from text. Our results show that both methods achieve comparable performance on two manually curated datasets. We also benchmarked EMU's performance for extracting the complete mutational information and phenotype. Remarkably, we show that one of the steps in our approach, a filter based on sequence analysis, increases the precision for that task from 0.34 to 0.59 (PCa) and from 0.39 to 0.61 (BCa). We also show that this high-throughput approach can be extended to other diseases.
Discussion: Our method improves the current status of disease-mutation databases by significantly increasing the number of annotated mutations. We found 51 and 128 mutations manually verified to be related to PCa and Bca, respectively, that are not currently annotated for these cancer types in the OMIM or Swiss-Prot databases. EMU's retrieval performance represents a 2-fold improvement in the number of annotated mutations for PCa and BCa. We further show that our method can benefit from full-text analysis once there is an increase in Open Access availability of full-text articles.
Availability: Freely available at:
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3031038  PMID: 21138947
14.  The Online Protein Processing Resource (TOPPR): a database and analysis platform for protein processing events 
Nucleic Acids Research  2012;41(Database issue):D333-D337.
We here present The Online Protein Processing Resource (TOPPR;, an online database that contains thousands of published proteolytically processed sites in human and mouse proteins. These cleavage events were identified with COmbinded FRActional DIagonal Chromatography proteomics technologies, and the resulting database is provided with full data provenance. Indeed, TOPPR provides an interactive visual display of the actual fragmentation mass spectrum that led to each identification of a reported processed site, complete with fragment ion annotations and search engine scores. Apart from warehousing and disseminating these data in an intuitive manner, TOPPR also provides an online analysis platform, including methods to analyze protease specificity and substrate-centric analyses. Concretely, TOPPR supports three ways to retrieve data: (i) the retrieval of all substrates for one or more cellular stimuli or assays; (ii) a substrate search by UniProtKB/Swiss-Prot accession number, entry name or description; and (iii) a motif search that retrieves substrates matching a user-defined protease specificity profile. The analysis of the substrates is supported through the presence of a variety of annotations, including predicted secondary structure, known domains and experimentally obtained 3D structure where available. Across substrates, substrate orthologs and conserved sequence stretches can also be shown, with iceLogo visualization provided for the latter.
PMCID: PMC3531153  PMID: 23093603
15.  Predicting Antigenicity of Proteins in a Bacterial Proteome; a Protein Microarray and Naïve Bayes Classification Approach 
Chemistry & biodiversity  2012;9(5):977-990.
Discovery of novel antigens associated with infectious diseases is fundamental to the development of serodiagnostic tests and protein subunit vaccines against existing and emerging pathogens. Efforts to predict antigenicity have relied on a few computational algorithms predicting signal peptide sequences (SignalP), transmembrane domains, or subcellular localization (pSort). An empirical protein microarray approach was developed to scan the entire proteome of any infectious microorganism and empirically determine immunoglobulin reactivity against all the antigens from a microorganism in infected individuals. The current database from this activity contains quantitative antibody reactivity data against 35,000 proteins derived from 25 infectious microorganisms and more than 30 million data points derived from 15,000 patient sera. Interrogation of these data sets has revealed ten proteomic features that are associated with antigenicity, allowing an in silico protein sequence and functional annotation based approach to triage the least likely antigenic proteins from those that are more likely to be antigenic. The first iteration of this approach applied to Brucella melitensis predicted 37% of the bacterial proteome containing 91% of the antigens empirically identified by probing proteome microarrays. In this study, we describe a naïve Bayes classification approach that can be used to assign a relative score to the likelihood that an antigen will be immunoreactive and serodiagnostic in a bacterial proteome. This algorithm predicted 20% of the B. melitensis proteome including 91% of the serodiagnostic antigens, a nearly twofold improvement in specificity of the predictor. These results give us confidence that further development of this approach will lead to further improvements in the sensitivity and specificity of this in silico predictive algorithm.
PMCID: PMC3593142  PMID: 22589097
16.  3PFDB - A database of Best Representative PSSM Profiles (BRPs) of Protein Families generated using a novel data mining approach 
BioData Mining  2009;2:8.
Protein families could be related to each other at broad levels that group them as superfamilies. These relationships are harder to detect at the sequence level due to high evolutionary divergence. Sequence searches are strongly directed and influenced by the best representatives of families that are viewed as starting points. PSSMs are useful approximations and mathematical representations of protein alignments, with wide array of applications in bioinformatics approaches like remote homology detection, protein family analysis, detection of new members and evolutionary modelling. Computational intensive searches have been performed using the neural network based sensitive sequence search method called FASSM to identify the Best Representative PSSMs for families reported in Pfam database version 22.
We designed a novel data mining approach for the assessment of individual sequences from a protein family to identify a single Best Representative PSSM profile (BRP) per protein family. Using the approach, a database of protein family-specific best representative PSSM profiles called 3PFDB has been developed. PSSM profiles in 3PFDB are curated using performance of individual sequence as a reference in a rigorous scoring and coverage analysis approach using FASSM. We have assessed the suitability of 10, 85,588 sequences derived from seed or full alignments reported in Pfam database (Version 22). Coverage analysis using FASSM method is used as the filtering step to identify the best representative sequence, starting from full length or domain sequences to generate the final profile for a given family. 3PFDB is a collection of best representative PSSM profiles of 8,524 protein families from Pfam database.
Availability of an approach to identify BRPs and a curated database of best representative PSI-BLAST derived PSSMs for 91.4% of current Pfam family will be a useful resource for the community to perform detailed and specific analysis using family-specific, best-representative PSSM profiles. 3PFDB can be accessed using the URL:
PMCID: PMC2801675  PMID: 19961575
17.  Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation 
InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings are comprised of the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models.
Database URL: The complete InterPro2GO mappings are available at:
PMCID: PMC3270475  PMID: 22301074
18.  Superfamily Assignments for the Yeast Proteome through Integration of Structure Prediction with the Gene Ontology  
PLoS Biology  2007;5(4):e76.
Saccharomyces cerevisiae is one of the best-studied model organisms, yet the three-dimensional structure and molecular function of many yeast proteins remain unknown. Yeast proteins were parsed into 14,934 domains, and those lacking sequence similarity to proteins of known structure were folded using the Rosetta de novo structure prediction method on the World Community Grid. This structural data was integrated with process, component, and function annotations from the Saccharomyces Genome Database to assign yeast protein domains to SCOP superfamilies using a simple Bayesian approach. We have predicted the structure of 3,338 putative domains and assigned SCOP superfamily annotations to 581 of them. We have also assigned structural annotations to 7,094 predicted domains based on fold recognition and homology modeling methods. The domain predictions and structural information are available in an online database at
Author Summary
The three-dimensional structure of a protein can reveal much about that protein's evolutionary relationships and functions. Such information about all the proteins in an organism—the proteome—would offer a more global view of these relationships, but solving each structure individually would be a formidable task. In this study, we have parsed all Saccharomyces cerevisiae proteins into nearly 15,000 distinct domains and then used de novo structure prediction methods together with worldwide distributed computing to predict structures for all domains lacking sequence similarity to proteins of known structure. To overcome the uncertainties in de novo structure prediction, we combined these predictions with data on the biological process, function, and localization of the proteins from previous experimental studies to assign the domains to families of evolutionarily related proteins. Our genome-wide domain predictions and superfamily assignments provide the basis for the generation of experimentally testable hypotheses about the mechanism of action for a large number of yeast proteins.
Protein three-dimensional structures were predicted for allSaccharomyces cerevisiae domains that were found to have no sequence similarity to any proteins of known structure.
PMCID: PMC1828141  PMID: 17373854
19.  ESG: extended similarity group method for automated protein function prediction 
Bioinformatics  2009;25(14):1739-1745.
Motivation: Importance of accurate automatic protein function prediction is ever increasing in the face of a large number of newly sequenced genomes and proteomics data that are awaiting biological interpretation. Conventional methods have focused on high sequence similarity-based annotation transfer which relies on the concept of homology. However, many cases have been reported that simple transfer of function from top hits of a homology search causes erroneous annotation. New methods are required to handle the sequence similarity in a more robust way to combine together signals from strongly and weakly similar proteins for effectively predicting function for unknown proteins with high reliability.
Results: We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned with probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We will depict how the statistical framework of ESG improves the prediction accuracy by iteratively taking into account the neighborhood of query protein in the sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. It is found that the iterative search is effective in capturing multiple-domains in a query protein, enabling accurately predicting several functions which originate from different domains.
Availability: ESG web server is available for automated protein function prediction at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2705228  PMID: 19435743
20.  An analysis of the transcriptome of Teladorsagia circumcincta: its biological and biotechnological implications 
BMC Genomics  2012;13(Suppl 7):S10.
Teladorsagia circumcincta (order Strongylida) is an economically important parasitic nematode of small ruminants (including sheep and goats) in temperate climatic regions of the world. Improved insights into the molecular biology of this parasite could underpin alternative methods required to control this and related parasites, in order to circumvent major problems associated with anthelmintic resistance. The aims of the present study were to define the transcriptome of the adult stage of T. circumcincta and to infer the main pathways linked to molecules known to be expressed in this nematode. Since sheep develop acquired immunity against T. circumcincta, there is some potential for the development of a vaccine against this parasite. Hence, we infer excretory/secretory molecules for T. circumcincta as possible immunogens and vaccine candidates.
A total of 407,357 ESTs were assembled yielding 39,852 putative gene sequences. Conceptual translation predicted 24,013 proteins, which were then subjected to detailed annotation which included pathway mapping of predicted proteins (including 112 excreted/secreted [ES] and 226 transmembrane peptides), domain analysis and GO annotation was carried out using InterProScan along with BLAST2GO. Further analysis was carried out for secretory signal peptides using SignalP and non-classical sec pathway using SecretomeP tools.
For ES proteins, key pathways, including Fc epsilon RI, T cell receptor, and chemokine signalling as well as leukocyte transendothelial migration were inferred to be linked to immune responses, along with other pathways related to neurodegenerative diseases and infectious diseases, which warrant detailed future studies. KAAS could identify new and updated pathways like phagosome and protein processing in endoplasmic reticulum. Domain analysis for the assembled dataset revealed families of serine, cysteine and proteinase inhibitors which might represent targets for parasite intervention. InterProScan could identify GO terms pertaining to the extracellular region. Some of the important domain families identified included the SCP-like extracellular proteins which belong to the pathogenesis-related proteins (PRPs) superfamily along with C-type lectin, saposin-like proteins. The 'extracellular region' that corresponds to allergen V5/Tpx-1 related, considered important in parasite-host interactions, was also identified.
Six cysteine motif (SXC1) proteins, transthyretin proteins, C-type lectins, activation-associated secreted proteins (ASPs), which could represent potential candidates for developing novel anthelmintics or vaccines were few other important findings. Of these, SXC1, protein kinase domain-containing protein, trypsin family protein, trypsin-like protease family member (TRY-1), putative major allergen and putative lipid binding protein were identified which have not been reported in the published T. circumcincta proteomics analysis.
Detailed analysis of 6,058 raw EST sequences from dbEST revealed 315 putatively secreted proteins. Amongst them, C-type single domain activation associated secreted protein ASP3 precursor, activation-associated secreted proteins (ASP-like protein), cathepsin B-like cysteine protease, cathepsin L cysteine protease, cysteine protease, TransThyretin-Related and Venom-Allergen-like proteins were the key findings.
We have annotated a large dataset ESTs of T. circumcincta and undertaken detailed comparative bioinformatics analyses. The results provide a comprehensive insight into the molecular biology of this parasite and disease manifestation which provides potential focal point for future research. We identified a number of pathways responsible for immune response. This type of large-scale computational scanning could be coupled with proteomic and metabolomic studies of this parasite leading to novel therapeutic intervention and disease control strategies. We have also successfully affirmed the use of bioinformatics tools, for the study of ESTs, which could now serve as a benchmark for the development of new computational EST analysis pipelines.
PMCID: PMC3521389  PMID: 23282110
21.  PANDORA: analysis of protein and peptide sets through the hierarchical integration of annotations 
Nucleic Acids Research  2010;38(Web Server issue):W84-W89.
Derivation of biological meaning from large sets of proteins or genes is a frequent task in genomic and proteomic studies. Such sets often arise from experimental methods including large-scale gene expression experiments and mass spectrometry (MS) proteomics. Large sets of genes or proteins are also the outcome of computational methods such as BLAST search and homology-based classifications. We have developed the PANDORA web server, which functions as a platform for the advanced biological analysis of sets of genes, proteins, or proteolytic peptides. First, the input set is mapped to a set of corresponding proteins. Then, an analysis of the protein set produces a graph-based hierarchy which highlights intrinsic relations amongst biological subsets, in light of their different annotations from multiple annotation resources. PANDORA integrates a large collection of annotation sources (GO, UniProt Keywords, InterPro, Enzyme, SCOP, CATH, Gene-3D, NCBI taxonomy and more) that comprise ∼200 000 different annotation terms associated with ∼3.2 million sequences from UniProtKB. Statistical enrichment based on a binomial approximation of the hypergeometric distribution and corrected for multiple hypothesis tests is calculated using several background sets, including major gene-expression DNA-chip platforms. Users can also visualize either standard or user-defined binary and quantitative properties alongside the proteins. PANDORA 4.2 is available at
PMCID: PMC2896089  PMID: 20444873
22.  The Zebrafish Secretome 
Zebrafish  2008;5(2):131-138.
The secretome is a functionally rich proteome subset, including cellular membrane and extracellular proteins processed through the secretory pathway. In this study, Danio rerio and Homo sapiens RefSeq proteins were analyzed with SignalP, TargetP, Phobius, and pTarget algorithms. About 16.5% of the zebrafish proteome and 17.0% of the human proteome possessed predicted N-terminal signal sequences. Nearly half of these proteins were subsequently classified as soluble, as they lacked predicted transmembrane domains. The soluble proteins were further subclassified, predicting 1345 (3.8%) zebrafish and 1207 (3.2%) human proteins as extracellular. Comparison of the zebrafish and human soluble secretome proteins identified 372 as orthologs, on the basis of reciprocal BLAST best hits. The computational characterization of the zebrafish proteins found many more members of the secretome than annotated in the SwissProt database. Only 180 of the 2078 zebrafish SwissProt protein entries, and 995 of the 19,294 human SwissProt protein entries were annotated with secreted protein locales. A specific investigation of the fibroblast growth factor and matrix metalloproteinase (MMP) protein families confirmed the prediction data and generated annotation of three additional putative MMP zebrafish proteins. This study presents the first known published description of the zebrafish secretome since the completion of the zebrafish genome sequencing project.
PMCID: PMC2738611  PMID: 18554177
23.  The Zebrafish Secretome 
Zebrafish  2008;5(2):131-138.
The secretome is a functionally rich proteome subset, including cellular membrane and extracellular proteins processed through the secretory pathway. In this study, Danio rerio and Homo sapiens RefSeq proteins were analyzed with SignalP, TargetP, Phobius, and pTarget algorithms. About 16.5% of the zebrafish proteome and 17.0% of the human proteome possessed predicted N-terminal signal sequences. Nearly half of these proteins were subsequently classified as soluble, as they lacked predicted transmembrane domains. The soluble proteins were further subclassified, predicting 1345 (3.8%) zebrafish and 1207 (3.2%) human proteins as extracellular. Comparison of the zebrafish and human soluble secretome proteins identified 372 as orthologs, on the basis of reciprocal BLAST best hits. The computational characterization of the zebrafish proteins found many more members of the secretome than annotated in the SwissProt database. Only 180 of the 2078 zebrafish SwissProt protein entries, and 995 of the 19,294 human SwissProt protein entries were annotated with secreted protein locales. A specific investigation of the fibroblast growth factor and matrix metalloproteinase (MMP) protein families confirmed the prediction data and generated annotation of three additional putative MMP zebrafish proteins. This study presents the first known published description of the zebrafish secretome since the completion of the zebrafish genome sequencing project.
PMCID: PMC2738611  PMID: 18554177
24.  CDD: a Conserved Domain Database for the functional annotation of proteins 
Nucleic Acids Research  2010;39(Database issue):D225-D229.
NCBI’s Conserved Domain Database (CDD) is a resource for the annotation of protein sequences with the location of conserved domain footprints, and functional sites inferred from these footprints. CDD includes manually curated domain models that make use of protein 3D structure to refine domain models and provide insights into sequence/structure/function relationships. Manually curated models are organized hierarchically if they describe domain families that are clearly related by common descent. As CDD also imports domain family models from a variety of external sources, it is a partially redundant collection. To simplify protein annotation, redundant models and models describing homologous families are clustered into superfamilies. By default, domain footprints are annotated with the corresponding superfamily designation, on top of which specific annotation may indicate high-confidence assignment of family membership. Pre-computed domain annotation is available for proteins in the Entrez/Protein dataset, and a novel interface, Batch CD-Search, allows the computation and download of annotation for large sets of protein queries. CDD can be accessed via
PMCID: PMC3013737  PMID: 21109532
25.  Consequences of the discontinuation of the International Protein Index (IPI) database and its substitution by the UniProtKB “complete proteome” sets 
Proteomics  2011;11(22):4434-4438.
The International Protein Index (IPI) database has been one of the most widely used protein databases in MS proteomics approaches. Recently, the closure of IPI in September 2011 was announced. Its recommended replacement is the new UniProt Knowledgebase (UniProtKB) “complete proteome” sets, launched in May 2011. Here, we analyze the consequences of IPI's discontinuation for human and mouse data, and the effect of its substitution with UniProtKB on two levels: (i) data already produced and (ii) newly performed experiments. To estimate the effect on existing data, we investigated how well IPI identifiers map to UniProtKB accessions. We found that 21% of human and 10% of mouse identifiers do not map to UniProtKB and would thus be “lost.” To investigate the impact on new experiments, we compared the theoretical search space (i.e. the tryptic peptides) of both resources and found that it is decreased by 14.0% for human and 8.9% for mouse data through IPI's closure. An analysis on the experimental evidence for these “lost” peptides showed that the vast majority has not been identified in experiments available in the major proteomics repositories. It thus seems likely that the search space provided by UniProtKB is of higher quality than the one currently provided by IPI.
PMCID: PMC3556690  PMID: 21932440
Bioinformatics; Discontinuation; Gene annotation; International Protein Index; Protein databases; UniProt Knowledgebase

Results 1-25 (388214)