Search tips
Search criteria

Results 1-25 (749740)

Clipboard (0)

Related Articles

1.  InterPro: the integrative protein signature database 
Nucleic Acids Research  2008;37(Database issue):D211-D215.
The InterPro database ( integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (
PMCID: PMC2686546  PMID: 18940856
2.  The InterPro BioMart: federated query and web service access to the InterPro Resource 
The InterPro BioMart provides users with query-optimized access to predictions of family classification, protein domains and functional sites, based on a broad spectrum of integrated computational models (‘signatures’) that are generated by the InterPro member databases: Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. These predictions are provided for all protein sequences from both the UniProt Knowledge Base and the UniParc protein sequence archive. The InterPro BioMart is supplementary to the primary InterPro web interface (, providing a web service and the ability to build complex, custom queries that can efficiently return thousands of rows of data in a variety of formats. This article describes the information available from the InterPro BioMart and illustrates its utility with examples of how to build queries that return useful biological information.
Database URL:
PMCID: PMC3170169  PMID: 21785143
3.  The InterPro Database, 2003 brings increased coverage and new features 
Nucleic Acids Research  2003;31(1):315-318.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver ( and anonymous FTP (
PMCID: PMC165493  PMID: 12520011
4.  Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation 
InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings are comprised of the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models.
Database URL: The complete InterPro2GO mappings are available at:
PMCID: PMC3270475  PMID: 22301074
5.  SIFTS: Structure Integration with Function, Taxonomy and Sequences resource 
Nucleic Acids Research  2012;41(Database issue):D483-D489.
The Structure Integration with Function, Taxonomy and Sequences resource (SIFTS; is a close collaboration between the Protein Data Bank in Europe (PDBe) and UniProt. The two teams have developed a semi-automated process for maintaining up-to-date cross-reference information to UniProt entries, for all protein chains in the PDB entries present in the UniProt database. This process is carried out for every weekly PDB release and the information is stored in the SIFTS database. The SIFTS process includes cross-references to other biological resources such as Pfam, SCOP, CATH, GO, InterPro and the NCBI taxonomy database. The information is exported in XML format, one file for each PDB entry, and is made available by FTP. Many bioinformatics resources use SIFTS data to obtain cross-references between the PDB and other biological databases so as to provide their users with up-to-date information.
PMCID: PMC3531078  PMID: 23203869
6.  The InterPro database, an integrated documentation resource for protein families, domains and functional sites 
Nucleic Acids Research  2001;29(1):37-40.
Signature databases are vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. InterPro is an integrated documentation resource for protein families, domains and functional sites, which amalgamates the efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Each InterPro entry includes a functional description, annotation, literature references and links back to the relevant member database(s). Release 2.0 of InterPro (October 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification encoded by a total of 6804 different regular expressions, profiles, fingerprints and Hidden Markov Models. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1 000 000 hits from 462 500 proteins in SWISS-PROT and TrEMBL). The database is accessible for text- and sequence-based searches at Questions can be emailed to
PMCID: PMC29841  PMID: 11125043
7.  The Universal Protein Resource (UniProt): an expanding universe of protein information 
Nucleic Acids Research  2005;34(Database issue):D187-D191.
The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at or downloaded at .
PMCID: PMC1347523  PMID: 16381842
8.  SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny 
Nucleic Acids Research  2008;37(Database issue):D380-D386.
SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes, and large sequence collections such as UniProt. Protein domain assignments for over 900 genomes are included in the database, which can be accessed at Hidden Markov models based on Structural Classification of Proteins (SCOP) domain definitions at the superfamily level are used to provide structural annotation. We recently produced a new model library based on SCOP 1.73. Family level assignments are also available. From the web site users can submit sequences for SCOP domain classification; search for keywords such as superfamilies, families, organism names, models and sequence identifiers; find over- and underrepresented families or superfamilies within a genome relative to other genomes or groups of genomes; compare domain architectures across selections of genomes and finally build multiple sequence alignments between Protein Data Bank (PDB), genomic and custom sequences. Recent extensions to the database include InterPro abstracts and Gene Ontology terms for superfamiles, taxonomic visualization of the distribution of families across the tree of life, searches for functionally similar domain architectures and phylogenetic trees. The database, models and associated scripts are available for download from the ftp site.
PMCID: PMC2686452  PMID: 19036790
9.  Java GUI for InterProScan (JIPS): A tool to help process multiple InterProScans and perform ortholog analysis 
BMC Bioinformatics  2006;7:462.
Recent, rapid growth in the quantity of available genomic data has generated many protein sequences that are not yet biochemically classified. Thus, the prediction of biochemical function based on structural motifs is an important task in post-genomic analysis. The InterPro databases are a major resource for protein function information. For optimal results, these databases should be searched at regular intervals, since they are frequently updated.
We describe here a new program JIPS (Java GUI for InterProScan), a tool for tracking and viewing results obtained from repeated InterProScan searches. JIPS stores matches (in a local database) obtained from InterProScan searches performed with multiple versions of the InterPro database and highlights hits that have been added since the last search of the InterPro database. Results are displayed in an easy-to-use tabular format. JIPS also contains tools to assist with ortholog-based comparative studies of protein signatures.
JIPS is an efficient tool for performing repeated InterProScans on large batches of protein sequences, tracking and viewing search results, and mining the collected data.
PMCID: PMC1626093  PMID: 17054794
10.  CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins 
Nucleic Acids Research  2001;29(1):33-36.
The CluSTr (Clusters of SWISS-PROT and TrEMBL proteins) database offers an automatic classification of SWISS-PROT and TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. Analysis has been carried out for different levels of protein similarity, yielding a hierarchical organisation of clusters. The database provides links to InterPro, which integrates information on protein families, domains and functional sites from PROSITE, PRINTS, Pfam and ProDom. Links to the InterPro graphical interface allow users to see at a glance whether proteins from the cluster share particular functional sites. CluSTr also provides cross-references to HSSP and PDB. The database is available for querying and browsing at
PMCID: PMC29804  PMID: 11125042
11.  CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences 
BMC Bioinformatics  2007;8:129.
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data.
CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniprotKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene predication program and the potential domains on annotated GSS were analyzed using the HMMER package against the Pfam database. The annotated GSS were also assigned with Gene Ontology annotation terms and integrated with 228 curated plant metabolic pathways from the Arabidopsis Information Resource (TAIR) knowledge base. The UniProtKB-Swiss-Prot ENZYME database was used to assign putative enzymatic function to each GSS. Each GSS was also analyzed with the Tandem Repeat Finder (TRF) program in order to identify potential SSRs for molecular marker discovery. The raw sequence data, processed annotation, and SSR results were stored in relational tables designed in key-value pair fashion using a PostgreSQL relational database management system. The biological knowledge derived from the sequence data and processed results are represented as views or materialized views in the relational database management system. All materialized views are indexed for quick data access and retrieval. Data processing and analysis pipelines were implemented using the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The CPU intensive data processing and analysis pipelines were run on a computer cluster of more than 30 dual-processor Apple XServes. A job management system called Vela was created as a robust way to submit large numbers of jobs to the Portable Batch System (PBS).
CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at .
PMCID: PMC1868039  PMID: 17445272
12.  InterPro (The Integrated Resource of Protein Domains and Functional Sites) 
Yeast (Chichester, England)  2000;17(4):327-334.
The family and motif databases, PROSITE, PRINTS, Pfam and ProDom, have been integrated into a powerful resource for protein secondary annotation. As of June 2000, InterPro had processed 384 572 proteins in SWISS-PROT and TrEMBL. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments compliment each other for grouping protein families and delineating domains. The graphic displays of all matches above the scoring thresholds enables judgements to be made on the concordances or differences between the assignments. The website links can be used to analyse novel sequences and for queries across the proteomes of 32 organisms, including the partial human set, by domain and/or protein family. An analysis of selected HtrA/DegQ proteases demonstrates the utility of this website for detailed comparative genomics. Further information on the project can be found at the European Bioinformatics Institute at
PMCID: PMC2448387  PMID: 11119311
13.  PANDORA: analysis of protein and peptide sets through the hierarchical integration of annotations 
Nucleic Acids Research  2010;38(Web Server issue):W84-W89.
Derivation of biological meaning from large sets of proteins or genes is a frequent task in genomic and proteomic studies. Such sets often arise from experimental methods including large-scale gene expression experiments and mass spectrometry (MS) proteomics. Large sets of genes or proteins are also the outcome of computational methods such as BLAST search and homology-based classifications. We have developed the PANDORA web server, which functions as a platform for the advanced biological analysis of sets of genes, proteins, or proteolytic peptides. First, the input set is mapped to a set of corresponding proteins. Then, an analysis of the protein set produces a graph-based hierarchy which highlights intrinsic relations amongst biological subsets, in light of their different annotations from multiple annotation resources. PANDORA integrates a large collection of annotation sources (GO, UniProt Keywords, InterPro, Enzyme, SCOP, CATH, Gene-3D, NCBI taxonomy and more) that comprise ∼200 000 different annotation terms associated with ∼3.2 million sequences from UniProtKB. Statistical enrichment based on a binomial approximation of the hypergeometric distribution and corrected for multiple hypothesis tests is calculated using several background sets, including major gene-expression DNA-chip platforms. Users can also visualize either standard or user-defined binary and quantitative properties alongside the proteins. PANDORA 4.2 is available at
PMCID: PMC2896089  PMID: 20444873
14.  GreenPhylDB: a database for plant comparative genomics 
Nucleic Acids Research  2007;36(Database issue):D991-D998.
GreenPhylDB ( is a comprehensive platform designed to facilitate comparative functional genomics in Oryza sativa and Arabidopsis thaliana genomes. The main functions of GreenPhylDB are to assign O. sativa and A. thaliana sequences to gene families using a semi-automatic clustering procedure and to create ‘orthologous’ groups using a phylogenomic approach. To date, GreenPhylDB comprises the most complete list of plant gene families, which have been manually curated (6421 families). GreenPhylDB also contains all of the phylogenomic relationships computed for 4375 families. A total of 492 TAIR, 1903 InterPro and 981 KEGG families and subfamilies were manually curated using the clusters created with the TribeMCL software. GreenPhylDB integrates information from several other databases including UniProt, KEGG, InterPro, TAIR and TIGR. Several entry points can be used to display phylogenomic relationships for A. thaliana or O. sativa sequences, using TAIR, TIGR gene ID, family name, InterPro, gene alias, UniProt or protein/nucleic sequence. Finally, a powerful phylogenomics tool, GreenPhyl Ortholog Search Tool (GOST), was incorporated into GreenPhylDB to predict orthologous relationships between O. sativa/A. thaliana protein(s) and sequences from other plant species.
PMCID: PMC2238940  PMID: 17986457
15.  PCAS – a precomputed proteome annotation database resource 
BMC Genomics  2003;4:42.
Many model proteomes or "complete" sets of proteins of given organisms are now publicly available. Much effort has been invested in computational annotation of those "draft" proteomes. Motif or domain based algorithms play a pivotal role in functional classification of proteins. Employing most available computational algorithms, mainly motif or domain recognition algorithms, we set up to develop an online proteome annotation system with integrated proteome annotation data to complement existing resources.
We report here the development of PCAS (ProteinCentric Annotation System) as an online resource of pre-computed proteome annotation data. We applied most available motif or domain databases and their analysis methods, including hmmpfam search of HMMs in Pfam, SMART and TIGRFAM, RPS-PSIBLAST search of PSSMs in CDD, pfscan of PROSITE patterns and profiles, as well as PSI-BLAST search of SUPERFAMILY PSSMs. In addition, signal peptide and TM are predicted using SignalP and TMHMM respectively. We mapped SUPERFAMILY and COGs to InterPro, so the motif or domain databases are integrated through InterPro. PCAS displays table summaries of pre-computed data and a graphical presentation of motifs or domains relative to the protein. As of now, PCAS contains human IPI, mouse IPI, and rat IPI, A. thaliana, C. elegans, D. melanogaster, S. cerevisiae, and S. pombe proteome.
PCAS is available at
PCAS gives better annotation coverage for model proteomes by employing a wider collection of available algorithms. Besides presenting the most confident annotation data, PCAS also allows customized query so users can inspect statistically less significant boundary information as well. Therefore, besides providing general annotation information, PCAS could be used as a discovery platform. We plan to update PCAS twice a year. We will upgrade PCAS when new proteome annotation algorithms identified.
PMCID: PMC293463  PMID: 14594458
16.  The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection 
Nucleic Acids Research  2011;40(Database issue):D1-D8.
The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (
PMCID: PMC3245068  PMID: 22144685
17.  ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins 
Nucleic Acids Research  2006;34(Web Server issue):W362-W365.
ScanProsite——is a new and improved version of the web-based tool for detecting PROSITE signature matches in protein sequences. For a number of PROSITE profiles, the tool now makes use of ProRules—context-dependent annotation templates—to detect functional and structural intra-domain residues. The detection of those features enhances the power of function prediction based on profiles. Both user-defined sequences and sequences from the UniProt Knowledgebase can be matched against custom patterns, or against PROSITE signatures. To improve response times, matches of sequences from UniProtKB against PROSITE signatures are now retrieved from a pre-computed match database. Several output modes are available including simple text views and a rich mode providing an interactive match and feature viewer with a graphical representation of results.
PMCID: PMC1538847  PMID: 16845026
18.  EnzML: multi-label prediction of enzyme classes using InterPro signatures 
BMC Bioinformatics  2012;13:61.
Manual annotation of enzymatic functions cannot keep up with automatic genome sequencing. In this work we explore the capacity of InterPro sequence signatures to automatically predict enzymatic function.
We present EnzML, a multi-label classification method that can efficiently account also for proteins with multiple enzymatic functions: 50,000 in UniProt. EnzML was evaluated using a standard set of 300,747 proteins for which the manually curated Swiss-Prot and KEGG databases have agreeing Enzyme Commission (EC) annotations. EnzML achieved more than 98% subset accuracy (exact match of all correct Enzyme Commission classes of a protein) for the entire dataset and between 87 and 97% subset accuracy in reannotating eight entire proteomes: human, mouse, rat, mouse-ear cress, fruit fly, the S. pombe yeast, the E. coli bacterium and the M. jannaschii archaebacterium. To understand the role played by the dataset size, we compared the cross-evaluation results of smaller datasets, either constructed at random or from specific taxonomic domains such as archaea, bacteria, fungi, invertebrates, plants and vertebrates. The results were confirmed even when the redundancy in the dataset was reduced using UniRef100, UniRef90 or UniRef50 clusters.
InterPro signatures are a compact and powerful attribute space for the prediction of enzymatic function. This representation makes multi-label machine learning feasible in reasonable time (30 minutes to train on 300,747 instances with 10,852 attributes and 2,201 class values) using the Mulan Binary Relevance Nearest Neighbours algorithm implementation (BR-kNN).
PMCID: PMC3483700  PMID: 22533924
19.  The First Draft of the Interaction Network of Endostatin, an Inhibitor of Angiogenesis 
Endostatin is a C-terminal proteolytic fragment of collagen XVIII that is localized in vascular basement membrane zones in various organs, and inhibits angiogenesis and tumor growth. We have used protein and glycosaminoglycan arrays probed by surface plasmon resonance (SPR) to identify partners of endostatin, and to give further insights into its molecular mechanism of action. Glycosaminoglycans, matricellular proteins, collagens, the Abeta amyloid peptide, and transglutaminase-2 were found to bind endostatin. Endostatin was also shown to interact with intact pathogens (parasites invading the extracellular matrix) injected in buffer flow over SPR arrays, and could thus participate in host-pathogen interactions. The interaction network of endostatin was built using those experimental data and data available in the extracellular interaction database MatrixDB ( The network was visualized with the software environment Cytoscape, and was annotated using UniProtKB, Gene Ontology and InterPro data. The predominant function associated with the endostatin network was cell adhesion. The most represented domains in the network were EGF (Epidermal Growth Factor) and EGF-like domains. 46% of endostatin partners bind calcium. Kinetics and affinity constants calculated by SPR experiments were integrated into the network to prioritize interactions according to their rate of formation and their stability. Data on binding sites were also integrated to discriminate simultaneous from mutually exclusive interactions. The integrated network was used as a framework to build a mathematical model of endostatin mechanism of action. We focus in the network established by endostatin at the cell surface, where it is able to bind to several receptors, to understand how endostatin selects a receptor and to determine if its binding to a receptor modify its interactions with other cell-surface associated molecules. The building of a dynamic interaction network will be helpful to understand how information/signaling is conveyed through the network and to predict the consequences of perturbations.
PMCID: PMC2918166
20.  Domain fusion analysis by applying relational algebra to protein sequence and domain databases 
BMC Bioinformatics  2003;4:16.
Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages on these efforts will become increasingly powerful.
This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypothesis. Results can be viewed at .
As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
PMCID: PMC156618  PMID: 12734020
21.  IPRStats: visualization of the functional potential of an InterProScan run 
BMC Bioinformatics  2010;11(Suppl 12):S13.
InterPro is a collection of protein signatures for the classification and automated annotation of proteins. Interproscan is a software tool that scans protein sequences against Interpro member databases using a variety of profile-based, hidden markov model and positional specific score matrix methods. It not only combines a set of analysis tools, but also performs data look-up from various sources, as well as some redundancy removal. Interproscan is robust and scalable, able to perform on any machine from a netbook to a large cluster. However, when performing whole-genome or metagenome analysis, there is a need for a fast statistical visualization of the results to have good initial grasp on the functional potential of the sequences in the analyzed data set. This is especially important when analyzing and comparing metagenomic or metaproteomic data-sets.
IPRStats is a tool for the visualization of Interproscan results. Interproscan results are parsed from the Interproscan XML or EBIXML file into an SQLite or MySQL database. The results for each signature database scan are read and displayed as pie-charts or bar charts as summary statistics. A table is also provided, where each entry is a signature (e.g. a Pfam entry) accompanied by one or more Gene Ontology terms, if Interproscan was run using the Gene Ontology option.
We present an platform-independent, open source licensed tool that is useful for Interproscan users who wish to view the summary of their results in a rapid and concise fashion.
PMCID: PMC3040527  PMID: 21210980
22.  PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways 
Nucleic Acids Research  2006;35(Database issue):D247-D252.
PANTHER is a freely available, comprehensive software system for relating protein sequence evolution to the evolution of specific protein functions and biological roles. Since 2005, there have been three main improvements to PANTHER. First, the sequences used to create evolutionary trees are carefully selected to provide coverage of phylogenetic as well as functional information. Second, PANTHER is now a member of the InterPro Consortium, and the PANTHER hidden markov Models (HMMs) are distributed as part of InterProScan. Third, we have dramatically expanded the number of pathways associated with subfamilies in PANTHER. Pathways provide a detailed, structured representation of protein function in the context of biological reaction networks. PANTHER pathways were generated using the emerging Systems Biology Markup Language (SBML) standard using pathway network editing software called CellDesigner. The pathway collection currently contains ∼1500 reactions in 130 pathways, curated by expert biologists with authorship attribution. The curation environment is designed to be easy to use, and the number of pathways is growing steadily. Because the reaction participants are linked to subfamilies and corresponding HMMs, reactions can be inferred across numerous different organisms. The HMMs can be downloaded by FTP, and tools for analyzing data in the context of pathways and function ontologies are available at .
PMCID: PMC1716723  PMID: 17130144
23.  Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation 
PLoS ONE  2011;6(4):e18910.
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.
PMCID: PMC3083393  PMID: 21556138
24.  The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology 
Nucleic Acids Research  2004;32(Database issue):D262-D266.
The Gene Ontology Annotation (GOA) database ( aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60 000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. Researchers wishing to query or contribute to the GOA project are encouraged to email:
PMCID: PMC308756  PMID: 14681408
25.  The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases 
BMC Bioinformatics  2007;8:401.
Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs.
We have created the Protein Identifier Cross-Reference (PICR) service, a web application that provides interactive and programmatic (SOAP and REST) access to a mapping algorithm that uses the UniProt Archive (UniParc) as a data warehouse to offer protein cross-references based on 100% sequence identity to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings using the interactive interface. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV) or Microsoft Excel (XLS) files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface.
We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface. The PICR interface, documentation and code examples are available at .
PMCID: PMC2151082  PMID: 17945017

Results 1-25 (749740)