PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-25 (528010)

Clipboard (0)
None

Related Articles

1.  Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes 
Nucleic Acids Research  2001;29(1):44-48.
The SWISS-PROT group at EBI has developed the Proteome Analysis Database utilising existing resources and providing comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes (http://www.ebi.ac.uk/proteome/). The two main projects used, InterPro and CluSTr, give a new perspective on families, domains and sites and cover 31–67% (InterPro statistics) of the proteins from each of the complete genomes. CluSTr covers the three complete eukaryotic genomes and the incomplete human genome data. The Proteome Analysis Database is accompanied by a program that has been designed to carry out InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.
PMCID: PMC29822  PMID: 11125045
2.  InterPro: the integrative protein signature database 
Nucleic Acids Research  2008;37(Database issue):D211-D215.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
doi:10.1093/nar/gkn785
PMCID: PMC2686546  PMID: 18940856
3.  The InterPro Database, 2003 brings increased coverage and new features 
Nucleic Acids Research  2003;31(1):315-318.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
PMCID: PMC165493  PMID: 12520011
4.  The InterPro BioMart: federated query and web service access to the InterPro Resource 
The InterPro BioMart provides users with query-optimized access to predictions of family classification, protein domains and functional sites, based on a broad spectrum of integrated computational models (‘signatures’) that are generated by the InterPro member databases: Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. These predictions are provided for all protein sequences from both the UniProt Knowledge Base and the UniParc protein sequence archive. The InterPro BioMart is supplementary to the primary InterPro web interface (http://www.ebi.ac.uk/interpro), providing a web service and the ability to build complex, custom queries that can efficiently return thousands of rows of data in a variety of formats. This article describes the information available from the InterPro BioMart and illustrates its utility with examples of how to build queries that return useful biological information.
Database URL: http://www.ebi.ac.uk/interpro/biomart/martview.
doi:10.1093/database/bar033
PMCID: PMC3170169  PMID: 21785143
5.  New developments in the InterPro database 
Nucleic Acids Research  2007;35(Database issue):D224-D228.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (), and for download by anonymous FTP (). The InterProScan search tool is now also available via a web service at .
doi:10.1093/nar/gkl841
PMCID: PMC1899100  PMID: 17202162
6.  InterPro in 2011: new developments in the family and domain prediction database 
Nucleic Acids Research  2011;40(D1):D306-D312.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
doi:10.1093/nar/gkr948
PMCID: PMC3245097  PMID: 22096229
7.  The InterPro database, an integrated documentation resource for protein families, domains and functional sites 
Nucleic Acids Research  2001;29(1):37-40.
Signature databases are vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. InterPro is an integrated documentation resource for protein families, domains and functional sites, which amalgamates the efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. Each InterPro entry includes a functional description, annotation, literature references and links back to the relevant member database(s). Release 2.0 of InterPro (October 2000) contains over 3000 entries, representing families, domains, repeats and sites of post-translational modification encoded by a total of 6804 different regular expressions, profiles, fingerprints and Hidden Markov Models. Each InterPro entry lists all the matches against SWISS-PROT and TrEMBL (more than 1 000 000 hits from 462 500 proteins in SWISS-PROT and TrEMBL). The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. Questions can be emailed to interhelp@ebi.ac.uk.
PMCID: PMC29841  PMID: 11125043
8.  The Proteome Analysis database: a tool for the in silico analysis of whole proteomes 
Nucleic Acids Research  2003;31(1):414-417.
The Proteome Analysis database (http://www.ebi.ac.uk/proteome/) has been developed by the Sequence Database Group at EBI utilizing existing resources and providing comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archeae and eukaryotes. Three main projects are used, InterPro, CluSTr and GO Slim, to give an overview on families, domains, sites, and functions of the proteins from each of the complete genomes. Complete proteome analysis is available for a total of 89 proteome sets. A specifically designed application enables InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.
PMCID: PMC165552  PMID: 12520037
9.  Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation 
InterPro amalgamates predictive protein signatures from a number of well-known partner databases into a single resource. To aid with interpretation of results, InterPro entries are manually annotated with terms from the Gene Ontology (GO). The InterPro2GO mappings are comprised of the cross-references between these two resources and are the largest source of GO annotation predictions for proteins. Here, we describe the protocol by which InterPro curators integrate GO terms into the InterPro database. We discuss the unique challenges involved in integrating specific GO terms with entries that may describe a diverse set of proteins, and we illustrate, with examples, how InterPro hierarchies reflect GO terms of increasing specificity. We describe a revised protocol for GO mapping that enables us to assign GO terms to domains based on the function of the individual domain, rather than the function of the families in which the domain is found. We also discuss how taxonomic constraints are dealt with and those cases where we are unable to add any appropriate GO terms. Expert manual annotation of InterPro entries with GO terms enables users to infer function, process or subcellular information for uncharacterized sequences based on sequence matches to predictive models.
Database URL: http://www.ebi.ac.uk/interpro. The complete InterPro2GO mappings are available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/external2go/interpro2go
doi:10.1093/database/bar068
PMCID: PMC3270475  PMID: 22301074
10.  Java GUI for InterProScan (JIPS): A tool to help process multiple InterProScans and perform ortholog analysis 
BMC Bioinformatics  2006;7:462.
Background
Recent, rapid growth in the quantity of available genomic data has generated many protein sequences that are not yet biochemically classified. Thus, the prediction of biochemical function based on structural motifs is an important task in post-genomic analysis. The InterPro databases are a major resource for protein function information. For optimal results, these databases should be searched at regular intervals, since they are frequently updated.
Results
We describe here a new program JIPS (Java GUI for InterProScan), a tool for tracking and viewing results obtained from repeated InterProScan searches. JIPS stores matches (in a local database) obtained from InterProScan searches performed with multiple versions of the InterPro database and highlights hits that have been added since the last search of the InterPro database. Results are displayed in an easy-to-use tabular format. JIPS also contains tools to assist with ortholog-based comparative studies of protein signatures.
Conclusion
JIPS is an efficient tool for performing repeated InterProScans on large batches of protein sequences, tracking and viewing search results, and mining the collected data.
doi:10.1186/1471-2105-7-462
PMCID: PMC1626093  PMID: 17054794
11.  OikoBase: a genomics and developmental transcriptomics resource for the urochordate Oikopleura dioica 
Nucleic Acids Research  2012;41(D1):D845-D853.
We report the development of OikoBase (http://oikoarrays.biology.uiowa.edu/Oiko/), a tiling array-based genome browser resource for Oikopleura dioica, a metazoan belonging to the urochordates, the closest extant group to vertebrates. OikoBase facilitates retrieval and mining of a variety of useful genomics information. First, it includes a genome browser which interrogates 1260 genomic sequence scaffolds and features gene, transcript and CDS annotation tracks. Second, we annotated gene models with gene ontology (GO) terms and InterPro domains which are directly accessible in the browser with links to their entries in the GO (http://www.geneontology.org/) and InterPro (http://www.ebi.ac.uk/interpro/) databases, and we provide transcript and peptide links for sequence downloads. Third, we introduce the transcriptomics of a comprehensive set of developmental stages of O. dioica at high resolution and provide downloadable gene expression data for all developmental stages. Fourth, we incorporate a BLAST tool to identify homologs of genes and proteins. Finally, we include a tutorial that describes how to use OikoBase as well as a link to detailed methods, explaining the data generation and analysis pipeline. OikoBase will provide a valuable resource for research in chordate development, genome evolution and plasticity and the molecular ecology of this important marine planktonic organism.
doi:10.1093/nar/gks1159
PMCID: PMC3531137  PMID: 23185044
12.  CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins 
Nucleic Acids Research  2001;29(1):33-36.
The CluSTr (Clusters of SWISS-PROT and TrEMBL proteins) database offers an automatic classification of SWISS-PROT and TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. Analysis has been carried out for different levels of protein similarity, yielding a hierarchical organisation of clusters. The database provides links to InterPro, which integrates information on protein families, domains and functional sites from PROSITE, PRINTS, Pfam and ProDom. Links to the InterPro graphical interface allow users to see at a glance whether proteins from the cluster share particular functional sites. CluSTr also provides cross-references to HSSP and PDB. The database is available for querying and browsing at http://www.ebi.ac.uk/clustr.
PMCID: PMC29804  PMID: 11125042
13.  SIMAP—structuring the network of protein similarities 
Nucleic Acids Research  2007;36(Database issue):D289-D292.
Protein sequences are the most important source of evolutionary and functional information for new proteins. In order to facilitate the computationally intensive tasks of sequence analysis, the Similarity Matrix of Proteins (SIMAP) database aims to provide a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. As of September 2007, SIMAP covers ∼17 million proteins and more than 6 million non-redundant sequences and provides a complete annotation based on InterPro 16. Novel features of SIMAP include a new, portlet-based web portal providing multiple, structured views on retrieved proteins and integration of protein clusters and a unique search method for similar domain architectures. Access to SIMAP is freely provided for academic use through the web portal for individuals at http://mips.gsf.de/simap/and through Web Services for programmatic access at http://mips.gsf.de/webservices/services/SimapService2.0?wsdl.
doi:10.1093/nar/gkm963
PMCID: PMC2238827  PMID: 18037617
14.  Conservation of Intrinsic Disorder in Protein Domains and Families: I. A Database of Conserved Predicted Disordered Regions† 
Journal of proteome research  2006;5(4):879-887.
Many protein regions have been shown to be intrinsically disordered, lacking unique structure under physiological conditions. These intrinsically disordered regions are not only very common in proteomes, they are also crucial to the function of many proteins, especially those involved in signaling, recognition, and regulation. The goal of this work was to identify the prevalence, characteristics, and functions of conserved disordered regions within protein domains and families.
A database was created to store the amino acid sequences of nearly one million proteins and their domain matches from the InterPro database, a resource integrating eight different protein family and domain databases. Disorder prediction was performed on these protein sequences. Regions of sequence corresponding to domains were aligned using a multiple sequence alignment tool. From this initial information, regions of conserved predicted disorder were found within the domains. The methodology for this search consisted of finding regions of consecutive positions in the multiple sequence alignments in which a 90% or more of the sequences were predicted to be disordered. This procedure was constrained to find such regions of conserved disorder prediction that were at least 20 amino acids in length.
The results of this work were 3,653 regions of conserved disorder prediction, found within 2,898 distinct InterPro entries. Most regions of conserved predicted disorder detected were short, with less than 10% of those found exceeding 30 residues in length.
doi:10.1021/pr060048x
PMCID: PMC2543136  PMID: 16602695
intrinsic disorder; protein structure-function; disorder prediction; PONDR
15.  SNAPPI-DB: a database and API of Structures, iNterfaces and Alignments for Protein–Protein Interactions 
Nucleic Acids Research  2007;35(Database issue):D580-D589.
SNAPPI-DB, a high performance database of Structures, iNterfaces and Alignments of Protein–Protein Interactions, and its associated Java Application Programming Interface (API) is described. SNAPPI-DB contains structural data, down to the level of atom co-ordinates, for each structure in the Protein Data Bank (PDB) together with associated data including SCOP, CATH, Pfam, SWISSPROT, InterPro, GO terms, Protein Quaternary Structures (PQS) and secondary structure information. Domain–domain interactions are stored for multiple domain definitions and are classified by their Superfamily/Family pair and interaction interface. Each set of classified domain–domain interactions has an associated multiple structure alignment for each partner. The API facilitates data access via PDB entries, domains and domain–domain interactions. Rapid development, fast database access and the ability to perform advanced queries without the requirement for complex SQL statements are provided via an object oriented database and the Java Data Objects (JDO) API. SNAPPI-DB contains many features which are not available in other databases of structural protein–protein interactions. It has been applied in three studies on the properties of protein–protein interactions and is currently being employed to train a protein–protein interaction predictor and a functional residue predictor. The database, API and manual are available for download at: .
doi:10.1093/nar/gkl836
PMCID: PMC1899103  PMID: 17202171
16.  The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection 
Nucleic Acids Research  2011;40(D1):D1-D8.
The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
doi:10.1093/nar/gkr1196
PMCID: PMC3245068  PMID: 22144685
17.  The Genome Database for Rosaceae (GDR): year 10 update 
Nucleic Acids Research  2013;42(D1):D1237-D1244.
The Genome Database for Rosaceae (GDR, http:/www.rosaceae.org), the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and Pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making.
doi:10.1093/nar/gkt1012
PMCID: PMC3965003  PMID: 24225320
18.  Domain fusion analysis by applying relational algebra to protein sequence and domain databases 
BMC Bioinformatics  2003;4:16.
Background
Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages on these efforts will become increasingly powerful.
Results
This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypothesis. Results can be viewed at .
Conclusion
As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
doi:10.1186/1471-2105-4-16
PMCID: PMC156618  PMID: 12734020
19.  Genomic Functional Investigation through Statistical Analysis of Protein Families and Domains 
Protein families and domains represent a very relevant resource useful to understand protein functions and interactions among their codifying genes. To perform evaluations of gene annotations sparsely available in numerous different databanks accessible via Internet, we previously developed GFINDer, a Web server that performs statistical analysis of functional and phenotypic annotations of gene lists. To exploit protein information present in Pfam and InterPro databanks, in GFINDer we integrated two new modules that allow annotating and statistically analyzing user-classified nucleotide sequence with controlled information on related protein families, domains and functional sites.
PMCID: PMC1839345  PMID: 17238639
20.  The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation 
BMC Bioinformatics  2008;9:52.
Background
Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.
Results
PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.
PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA.
We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%).
Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.
Conclusion
The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
doi:10.1186/1471-2105-9-52
PMCID: PMC2259298  PMID: 18221520
21.  InterPro (The Integrated Resource of Protein Domains and Functional Sites) 
Yeast (Chichester, England)  2000;17(4):327-334.
The family and motif databases, PROSITE, PRINTS, Pfam and ProDom, have been integrated into a powerful resource for protein secondary annotation. As of June 2000, InterPro had processed 384 572 proteins in SWISS-PROT and TrEMBL. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments compliment each other for grouping protein families and delineating domains. The graphic displays of all matches above the scoring thresholds enables judgements to be made on the concordances or differences between the assignments. The website links can be used to analyse novel sequences and for queries across the proteomes of 32 organisms, including the partial human set, by domain and/or protein family. An analysis of selected HtrA/DegQ proteases demonstrates the utility of this website for detailed comparative genomics. Further information on the project can be found at the European Bioinformatics Institute at http://www.ebi.ac.uk/interpro/.
doi:10.1002/1097-0061(200012)17:4<327::AID-YEA45>3.0.CO;2-K
PMCID: PMC2448387  PMID: 11119311
22.  Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters 
Nucleic Acids Research  2003;31(1):388-389.
The CluSTr database (http://www.ebi.ac.uk/clustr/) offers an automatic classification of SWISS-PROT+TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pair-wise sequence comparisons between proteins using the Smith–Waterman algorithm. The analysis, carried out on different levels of protein similarity, yields a hierarchical organization of clusters. Information about domain content of the clustered proteins is provided via the InterPro resource. The introduced InterPro ‘condensed graphical view’ simplifies the visual analysis of represented domain architectures. Integrated applications allow users to visualize and edit multiple alignments and build sequence divergence trees. Links to the relevant structural data in Protein Data Bank (PDB) and Homology derived Secondary Structure of Proteins (HSSP) are also provided.
PMCID: PMC165482  PMID: 12520029
23.  Predicting pathway membership via domain signatures 
Bioinformatics  2008;24(19):2137-2142.
Motivation: Functional characterization of genes is of great importance for the understanding of complex cellular processes. Valuable information for this purpose can be obtained from pathway databases, like KEGG. However, only a small fraction of genes is annotated with pathway information up to now. In contrast, information on contained protein domains can be obtained for a significantly higher number of genes, e.g. from the InterPro database.
Results: We present a classification model, which for a specific gene of interest can predict the mapping to a KEGG pathway, based on its domain signature. The classifier makes explicit use of the hierarchical organization of pathways in the KEGG database. Furthermore, we take into account that a specific gene can be mapped to different pathways at the same time. The classification method produces a scoring of all possible mapping positions of the gene in the KEGG hierarchy. Evaluations of our model, which is a combination of a SVM and ranking perceptron approach, show a high prediction performance. Moreover, for signaling pathways we reveal that it is even possible to forecast accurately the membership to individual pathway components.
Availability: The R package gene2pathway is a supplement to this article.
Contact: h.froehlich@dkfz-heidelberg.de
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btn403
PMCID: PMC2553439  PMID: 18676972
24.  MitoProteome: mitochondrial protein sequence database and annotation system 
Nucleic Acids Research  2004;32(Database issue):D463-D467.
MitoProteome is an object-relational mitochondrial protein sequence database and annotation system. The initial release contains 847 human mitochondrial protein sequences, derived from public sequence databases and mass spectrometric analysis of highly purified human heart mitochondria. Each sequence is manually annotated with primary function, subfunction and subcellular location, and extensively annotated in an automated process with data extracted from external databases, including gene information from LocusLink and Ensembl; disease information from OMIM; protein–protein interaction data from MINT and DIP; functional domain information from Pfam; protein fingerprints from PRINTS; protein family and family-specific signatures from InterPro; structure data from PDB; mutation data from PMD; BLAST homology data from NCBI NR; and proteins found to be related based on LocusLink and SWISS-PROT references and sequence and taxonomy data. By highly automating the processes of maintaining the MitoProteome Protein List and extracting relevant data from external databases, we are able to present a dynamic database, updated frequently to reflect changes in public resources. The MitoProteome database is publicly available at http://www.mitoproteome.org/. Users may browse and search MitoProteome, and access a complete compilation of data relevant to each protein of interest, cross-linked to external databases.
doi:10.1093/nar/gkh048
PMCID: PMC308782  PMID: 14681458
25.  LEAPdb: a database for the late embryogenesis abundant proteins 
BMC Genomics  2010;11:221.
Background
Late Embryogenesis Abundant Proteins database (LEAPdb) contains resource regarding LEAP from plants and other organisms. Although LEAP are grouped into several families, there is no general consensus on their definition and on their classification. They are associated with abiotic stress tolerance, but their actual function at the molecular level is still enigmatic. The scarcity of 3-D structures for LEAP remains a handicap for their structure-function relationships analysis. Finally, the growing body of published data about LEAP represents a great amount of information that needs to be compiled, organized and classified.
Results
LEAPdb gathers data about 8 LEAP sub-families defined by the PFAM, the Conserved Domain and the InterPro databases. Among its functionalities, LEAPdb provides a browse interface for retrieving information on the whole database. A search interface using various criteria such as sophisticated text expression, amino acids motifs and other useful parameters allows the retrieving of refined subset of entries. LEAPdb also offers sequence similarity search. Information is displayed in re-ordering tables facilitating the analysis of data. LEAP sequences can be downloaded in three formats. Finally, the user can submit his sequence(s). LEAPdb has been conceived as a user-friendly web-based database with multiple functions to search and describe the different LEAP families. It will likely be helpful for computational analyses of their structure - function relationships.
Conclusions
LEAPdb contains 769 non-redundant and curated entries, from 196 organisms. All LEAP sequences are full-length. LEAPdb is publicly available at http://forge.info.univ-angers.fr/~gh/Leadb/index.php.
doi:10.1186/1471-2164-11-221
PMCID: PMC2858754  PMID: 20359361

Results 1-25 (528010)