The Universal Protein Resource (UniProt) provides the scientific community with a single, centralized, authoritative resource for protein sequences and functional information. Formed by uniting the Swiss-Prot, TrEMBL and PIR protein database activities, the UniProt consortium produces three layers of protein sequence databases: the UniProt Archive (UniParc), the UniProt Knowledgebase (UniProt) and the UniProt Reference (UniRef) databases. The UniProt Knowledgebase is a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase with extensive cross-references. This centrepiece consists of two sections: UniProt/Swiss-Prot, with fully, manually curated entries; and UniProt/TrEMBL, enriched with automated classification and annotation. During 2004, tens of thousands of Knowledgebase records were manually annotated or updated; we introduced a new comment line topic, TOXIC DOSE, to store information on the acute toxicity of a toxin; the UniProt keyword list was augmented with additional keywords; and we improved the documentation of the keywords and began to continuously overhaul and standardize the annotation of post-translational modifications. Furthermore, we introduced a new documentation file of the strains and their synonyms. Many new database cross-references were introduced and we started to make use of Digital Object Identifiers. In collaboration with the Macromolecular Structure Database group at EBI, we also achieved an improved integration with structural databases through residue-level mapping of sequences from Protein Data Bank entries onto the corresponding UniProt entries. For convenient sequence searches we provide the UniRef non-redundant sequence databases. The comprehensive UniParc database stores the complete body of publicly available protein sequence data. The UniProt databases can be accessed online (http://www.uniprot.org) or downloaded in several formats (ftp://ftp.uniprot.org/pub). New releases are published every two weeks.
Previously, Swiss-Prot + TrEMBL (1) and PIR-PSD (2) coexisted as protein databases with differing sequence coverage and annotation priorities. In 2002, the Swiss-Prot + TrEMBL groups at the SIB (Swiss Institute of Bioinformatics) and EBI (European Bioinformatics Institute) and the PIR (Protein Information Resource) group at Georgetown University Medical Center and National Biomedical Research Foundation joined forces as the UniProt consortium (3).
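The UniProt consortium maintains three database layers: the UniProt Archive (UniParc), the UniProt Knowledgebase and the UniProt Reference (UniRef) databases.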
Although most protein sequence data are derived from the translation of DDBJ/EMBL/GenBank (4) sequences, primary protein sequence data are also submitted directly to UniProt or appear in patent applications or in entries from the Protein Data Bank (PDB) (5). The UniParc (6) is designed to capture all available protein sequence data—not just from the aforementioned databases, but also from sources such as Ensembl (7), the International Protein Index (IPI) (8), RefSeq (9), FlyBase (10) and WormBase (11). This combination of sources makes UniParc the most comprehensive publicly accessible, non-redundant protein sequence database available.
UniParc represents each protein sequence once and only once, assigning it a unique UniParc identifier. The UniParc release 2.6 from September 2004 contained 4375775 unique sequences from 11978094 original source records. UniParc cross-references the accession numbers of the source databases, using flags to indicate the status of the entry in the original source database, with ‘active’ indicating that the entry is still present in the source database and ‘obsolete’ indicating that the entry no longer exists in the source database. A UniParc sequence version is incremented each time the underlying sequence changes, making it possible to observe sequence changes in all source databases. A sample UniParc report can be found at http://www.uniprot.org/entry/UPI0000000C37. UniParc records carry no annotation, but this information can be found in the UniProt Knowledgebase or other underlying databases.
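The storage principle described above can be illustrated with a minimal sketch. It is not the UniParc implementation: the class and method names are hypothetical, identifiers are simply numbered in order of first appearance, and sequence versioning is omitted. It only shows identical sequences collapsing into one entry that keeps status-flagged cross-references to each source record.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveEntry:
    upi: str        # UniParc-style identifier (numbering scheme assumed here)
    sequence: str
    # (source database, accession) -> "active" or "obsolete"
    xrefs: dict = field(default_factory=dict)


class SequenceArchive:
    """Toy archive holding exactly one entry per distinct sequence string."""

    def __init__(self):
        self._by_sequence = {}
        self._counter = 0

    def add(self, database, accession, sequence):
        """Register a source record; reuse the entry if the sequence is known."""
        entry = self._by_sequence.get(sequence)
        if entry is None:
            self._counter += 1
            entry = ArchiveEntry(f"UPI{self._counter:010X}", sequence)
            self._by_sequence[sequence] = entry
        entry.xrefs[(database, accession)] = "active"
        return entry

    def retire(self, database, accession):
        """Flag a source record as obsolete without deleting the sequence."""
        for entry in self._by_sequence.values():
            if (database, accession) in entry.xrefs:
                entry.xrefs[(database, accession)] = "obsolete"

    def __len__(self):
        return len(self._by_sequence)
```

In this sketch, two source records with the same sequence share one identifier, and retiring one of them changes only its flag, so the sequence and its history remain retrievable.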
The UniProt Knowledgebase merges Swiss-Prot, TrEMBL and PIR-PSD to provide a central database of protein sequences with annotations and functional information. All suitable PIR-PSD sequences missing from Swiss-Prot + TrEMBL were incorporated into UniProt and bi-directional cross-references were created to allow the easy tracking of PIR-PSD entries. The transfer into UniProt of references and experimentally verified data present in PIR but missing from Swiss-Prot + TrEMBL is ongoing.
The UniProt Knowledgebase has two parts: a section of fully, manually annotated records resulting from literature information extraction and curator-evaluated computational analysis, and a section with computationally analysed records awaiting full manual annotation. The two sections are referred to as ‘UniProt/Swiss-Prot’ (158337 records in UniProt release 2.6 from September 2004) and ‘UniProt/TrEMBL’ (1400776 records in UniProt release 2.6 from September 2004), respectively. An example UniProt report can be found at http://www.uniprot.org/entry/P57727.
In the following paragraphs, we will explain the main principles of the UniProt Knowledgebase and enhancements introduced recently.
In addition to capturing the core data mandatory to each UniProt entry (consisting principally of the amino acid sequence, the protein name or description, taxonomic data and citation information), we attach other annotation information both manually and automatically.
Manual annotation is performed by biologists and is based on literature curation and sequence analysis. The annotation principles were described in detail previously (3,12). During 2004, tens of thousands of records were manually annotated or updated. We have also introduced a new comment (CC) line topic: TOXIC DOSE. This topic is used to store information on the poisoning potential (acute toxicity) of a toxin. Generally, this topic holds information on the LD50 and PD50. LD stands for ‘Lethal Dose’; the LD50 is the amount of a toxin, given all at once, that causes the death of 50% (one-half) of a group of test animals. PD stands for ‘Paralytic Dose’; the PD50 is the amount of a toxin that causes the paralysis of 50% of a group of test animals.
Much progress was made during 2004 in our attempt to provide automatic large-scale functional characterization and annotation, which is generated with limited human interaction.
We use InterPro (13) to recognize domains and to classify all the protein sequences in UniProt into families and superfamilies. InterPro is an integrated resource of protein families, domains and sites that amalgamates the efforts of the member databases: Pfam (14), PROSITE (15), PRINTS (16), ProDom (17), SMART (18), PIRSF (19), Superfamily (20) and TIGRFAMs (21). Approximately 80% of all UniProt Knowledgebase records are classified according to their InterPro domains and families.
For automatic annotation, systems for standardized transfer of annotation from well-characterized proteins in the UniProt/Swiss-Prot to non-annotated UniProt/TrEMBL entries have been implemented. RuleBase (22) uses a semi-automatic approach, while the Spearmint approach is completely automated and is based on decision trees (23). InterPro is then used to assign UniProt entries into groups. The annotation shared by the functionally characterized UniProt/Swiss-Prot proteins of a group is then extracted and assigned to the non-annotated UniProt/TrEMBL entries of this group. These systems have been used to improve the annotation in 32% (RuleBase) and 55% (Spearmint) of UniProt/TrEMBL entries.
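The group-based transfer step described above can be sketched as follows. This is only an illustration of the idea of extracting the annotation shared by the curated members of a group and assigning it to the non-annotated members. It does not reproduce RuleBase or Spearmint, which rely on curated rules and decision trees respectively; the function names, group identifiers and annotation strings are all hypothetical.

```python
def shared_annotation(curated_sets):
    """Return the annotation items common to every curated member of a group."""
    sets = [set(items) for items in curated_sets]
    return set.intersection(*sets) if sets else set()


def propagate(groups, curated_annotation, unannotated):
    """Transfer shared annotation within each group.

    groups: group id -> list of entry ids assigned to that group
    curated_annotation: curated entry id -> iterable of annotation items
    unannotated: set of entry ids that should receive annotation
    Returns a dict mapping each annotated target entry to the items it received.
    """
    result = {}
    for members in groups.values():
        curated = [curated_annotation[m] for m in members
                   if m in curated_annotation]
        common = shared_annotation(curated)
        if not common:
            continue  # nothing is shared, so nothing is transferred
        for member in members:
            if member in unannotated:
                result.setdefault(member, set()).update(common)
    return result
```

Taking the intersection rather than the union is a deliberately conservative choice for this sketch: an item is transferred only if every characterized member of the group carries it.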
However, a part of the automatically added data will be erroneous, as are parts of the information coming from other sources. Therefore, we introduced a post-processing system called Xanthippe, which is based on a simple exclusion mechanism and a decision tree approach using the C4.5 data-mining algorithm. Xanthippe detects and flags a large part of the annotation errors and considerably increases the reliability of both automatically generated data and pre-existing annotation inherited from the underlying nucleotide sequence source data (24).
The PIRSF classification serves as the basis for a rule-based approach to automatically provide standardized and rich functional annotation for position-specific sequence features, protein names, Enzyme Commission (EC) name and number, keywords and Gene Ontology (GO) terms (25). Position-specific site rules are developed for annotating active site residues, binding site residues, modified residues or other functionally important amino acid residues. To exploit known structure information, site rules are defined starting with PIRSF families that contain at least one known three-dimensional (3D) structure with experimentally verified site information. The rules are defined using appropriate syntax and controlled vocabulary for site description and evidence attribution. As shown in Table 1, each rule consists of the rule ID, template sequence (a representative sequence with known 3D structure), rule condition, feature for propagation (denoting site feature to be propagated) and reference. The rules are family-specific and there may be more than one site rule per family. Site rule curation involves manually editing a multiple sequence alignment of representative family members (including the template PDB entry), visualizing site residues in the 3D structure, and building hidden Markov models for the conserved regions containing the functional site residues (referred to as ‘site HMMs’). The HMM thus built allows one to map functionally important residues from the template structure to other members of the PIRSF family that do not have a solved structure.
For site feature propagation, the entire rule condition is examined by PIRSF membership checking, site HMM matching and site residue matching. To avoid false positives, site features are only propagated automatically if all site residues match perfectly in the conserved region by aligning both the template and query sequences to the profile HMM using HmmAlign. Potential functional sites missing one or more residues or containing conservative substitutions are only annotated after expert review with evidence attribution. For accurate site propagation, it is sometimes necessary to match more residues in the rule condition than those to be propagated. For example, a total of eight catalytic and binding residues in sulfite reductase need to be matched in order to correctly propagate the sirohaem-ion binding Cys residue (PIRSR000259-3, Table 1).
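The all-or-nothing residue check can be illustrated with a short sketch. It assumes that the template and query have already been aligned to the same profile (for example with HmmAlign) and are supplied as two gapped strings of equal length. The function names and the representation of a rule condition as a mapping from template position to required residue are assumptions made for illustration, not the PIRSF rule syntax.

```python
def column_of(aligned, position):
    """Return the 0-based alignment column of the position-th residue (1-based)."""
    seen = 0
    for col, char in enumerate(aligned):
        if char != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError("position lies beyond the end of the sequence")


def position_at(aligned, col):
    """Return the 1-based residue number found in a column, or None for a gap."""
    if aligned[col] == "-":
        return None
    return sum(1 for char in aligned[:col + 1] if char != "-")


def propagate_sites(template_aln, query_aln, condition, to_propagate):
    """Map template site positions onto the query if the whole condition holds.

    condition: template position -> required residue (every one must match)
    to_propagate: template positions whose feature is transferred on success
    Returns {template position: query position}, or None if any residue in
    the condition is absent or substituted in the query.
    """
    mapped = {}
    for pos, residue in condition.items():
        col = column_of(template_aln, pos)
        if template_aln[col] != residue or query_aln[col] != residue:
            return None  # imperfect match: leave the case for expert review
        mapped[pos] = position_at(query_aln, col)
    return {pos: mapped[pos] for pos in to_propagate}
```

Note that the condition may contain more positions than are propagated, mirroring the sulfite reductase example in which several residues must match before a single binding residue is annotated.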
The highly reliable automatic annotation has already been incorporated into the UniProt/TrEMBL flat files, while additional automatic annotation is available from the extended UniProt view at http://www.ebi.uniprot.org/.
The HAMAP project, or ‘High-quality Automated and Manual Annotation of microbial Proteomes’, aims to integrate manual and automatic annotation methods in order to enhance the speed of the curation process while preserving the quality of the database annotation (26). Automatic annotation is only applied to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans). Many checks are enforced in order to prevent the propagation of wrong annotation and to spot problematic cases, which are channelled to manual curation. The results of this annotation are integrated in UniProt/Swiss-Prot.
Whenever available, we make use of the official nomenclature defined by international committees while still providing the published synonyms. For various other UniProt items we use controlled vocabularies, e.g. for tissues, plasmids and keywords, which are listed in UniProt documents. The UniProt keyword list was augmented by additional keywords. We improved the documentation of the keywords by adding, to the list of keywords, the definition of their usage in the UniProt knowledgebase and additional information such as synonyms or relevant GO terms. The UniProt curators also contribute to the work of the GOA project (27) by assigning GO terms from each of the GOs, i.e. the function of a protein, what processes it is involved in and where in the cell it is located. A major effort was started to continuously overhaul and standardize the annotation of post-translational modifications (PTMs). Furthermore, we introduced a new documentation file of the strains and their synonyms together with the mnemonic species identification code representing the biological source of the protein in the knowledgebase. These and other documents can be found at http://www.uniprot.org/support/documents.shtml.
UniProt provides cross-references to external data collections such as the underlying DNA sequence entries in the DDBJ/EMBL/GenBank nucleotide sequence databases, two dimensional (2D) PAGE and 3D protein structure databases, various protein domain and family characterization databases, PTM databases, species-specific data collections, variant databases and disease databases. Many new cross-references were included over the last year. Accordingly, UniProt acts as a central hub for biomolecular information with now more than four million cross-references to more than 60 databases. A document listing all databases cross-referenced in UniProt (http://www.uniprot.org/support/docs/dbxref.shtml) is available and contains, for each database, a short description and the server URL.
In 2004, in collaboration with the Macromolecular Structure Database (MSD) group at EBI, UniProt achieved an improved integration with structural databases through residue-level mapping of sequences from PDB entries onto the corresponding UniProt entries (28). This work led to an overhaul of the format of the UniProt cross-references to PDB to reflect the mappings. The UniProt–PDB mappings are available at ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/.
We also started to make use of Digital Object Identifiers (DOIs). The DOI system is used for identifying and exchanging intellectual property in the digital environment. We introduced the new optional identifier ‘DOI’ in the RX line to store the DOI of a cited document.
Many sequence databases contain, for a given protein sequence, separate entries that correspond to different literature reports. In the UniProt Knowledgebase we try as much as possible to merge all these data in order to minimize the redundancy of the database. Differences between sequencing reports due to splice variants, polymorphisms, disease-causing mutations, experimental sequence modifications or simply sequencing errors are indicated in the feature table of the corresponding UniProt entry.
The UniProt Knowledgebase is therefore non-redundant by design, with the goal of representing all known information regarding a particular protein. The definition of non-redundancy here differs from that employed in UniParc. In UniParc, all sequences that are 100% identical over their entire length are merged into a single entry, regardless of species. The UniProt Knowledgebase instead aims to describe in a single record all protein products derived from a certain gene (or genes, if the translation from different genes in a genome leads to indistinguishable proteins) from a certain species. In addition to giving the whole record an accession number, it aims to assign an isoform identifier, which serves as an accession number for the isoform, to each protein form derived by alternative splicing, proteolytic cleavage and post-translational modification. The underlying reason for giving each of these isoforms a unique identifier is that each of them may have a different function or biological role, or may exist only during specific developmental stages or under certain environmental conditions, even when all these isoforms are derived from a single gene. So far, isoform identifiers have been introduced only for splice isoforms. Splice isoforms may differ considerably from one another, with potentially <50% sequence similarity between isoforms. The freely available tool VARSPLIC (29) enables the recreation of all annotated splice variants from the feature table of a UniProt entry, or for the complete database. A FASTA-formatted file containing all splice variants annotated in UniProt can be downloaded for use with similarity search programs.
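The reconstruction of a splice variant from feature-table information can be sketched in a few lines. This is not the VARSPLIC program itself. It assumes that each variant feature has already been parsed into a tuple of start, end and replacement sequence, with 1-based inclusive coordinates on the displayed sequence and an empty replacement for a missing region, and that the features of one isoform do not overlap. The function name is hypothetical.

```python
def build_isoform(canonical, features):
    """Apply splice-variant features to the canonical sequence.

    features: iterable of (start, end, replacement) tuples; coordinates are
    1-based and inclusive, and replacement is "" when the region is missing.
    """
    sequence = canonical
    # Apply from the C-terminal end backwards so that earlier coordinates
    # are not shifted by insertions or deletions made further downstream.
    for start, end, replacement in sorted(features, reverse=True):
        sequence = sequence[:start - 1] + replacement + sequence[end:]
    return sequence
```

For example, replacing residues 3 to 5 and deleting residues 8 to 10 of a ten-residue sequence yields a six-residue isoform, whichever order the two features are listed in.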
The UniProt consortium emphasizes the use of an evidence attribution mechanism for protein annotation that will include, for all data, the data source, the types of evidence and methods for annotation. This is essential as the UniProt Knowledgebase will contain data automatically imported from the underlying nucleotide sequence databases, data imported from other databases, data from specific programs, the results of automatic annotation systems and, most importantly, expert manual curation. The implementation of evidence tags will allow the user to distinguish between these data sources and to easily identify particular classes of data of interest such as experimentally proven protein annotation. Evidence tags for the annotation present in UniProt/TrEMBL records are already available in the UniProt XML distribution.
The most efficient and user-friendly way to browse the UniProt databases is via the UniProt website (http://www.uniprot.org), which serves as a portal to all aspects of the UniProt project, and contains detailed documentation about the background and scope of UniProt. It provides database query and data-mining mechanisms, user support and communication, file download capabilities, and links to consortium resources. The UniProt Help Desk (help@uniprot.org) provides access to UniProt curators and database maintainers.
The standard way of linking to UniProt, displaying the UniProt ‘basic’ view as HTML, is: http://www.uniprot.org/entry/entryname or accession number.
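As a small illustration, such a link can be built programmatically from an entry name or accession number. The helper name is hypothetical, and the URL pattern is simply the one given above.

```python
from urllib.parse import quote

ENTRY_BASE_URL = "http://www.uniprot.org/entry/"


def entry_url(identifier):
    """Return the 'basic' view URL for an entry name or accession number."""
    # Percent-encode the identifier so stray characters cannot alter the path.
    return ENTRY_BASE_URL + quote(identifier.strip(), safe="")
```

Calling `entry_url("P57727")` gives the address of the example report cited earlier.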
UniProt, UniParc and UniRef entries, with supporting documentation, can be retrieved in various formats (Swiss-Prot/TrEMBL flat file, FASTA, XML) via anonymous FTP from ftp://ftp.uniprot.org/pub/. New UniProt, UniParc and UniRef releases are produced every two weeks.
UniProt accepts submissions of new sequences, entry updates and corrections, and annotated bibliographic information for protein entries. Directions for submission are available at http://www.uniprot.org/support/submissions.shtml.
Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. With the rapid accumulation of genome sequences for many organisms, attention is turning to the identification and functions of proteins encoded by these genomes. With the increasing volume and variety of protein sequences and functional information, UniProt serves as a central resource of protein sequence and function, providing a cornerstone for scientists active in modern biological research. The resource provides rich, consistent and non-redundant protein information by combining reliable automated annotation approaches with literature-based expert manual curation.
UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Minor support for the EBI's involvement in UniProt comes from the two European Union contracts BioBabel (QLRT-2000-00981) and TEMBLOR (QLRI-2001-00015) and from the NIH grant 1R01HGO2273-01. UniProt/Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science. PIR activities are also supported by the National Science Foundation (NSF) grants DBI-0138188 and ITR-0205470.