The UniProt knowledgebase is the centrepiece of the consortium activities. We have merged Swiss-Prot, TrEMBL and PIR-PSD to form the UniProt knowledgebase in order to provide a central database of protein sequences with annotations and functional information. All suitable PIR-PSD sequences missing from Swiss-Prot + TrEMBL were incorporated into UniProt. Bidirectional cross-references between Swiss-Prot + TrEMBL and PIR-PSD were created to allow the easy tracking of the PIR-PSD entries. The transfer into UniProt of references and experimentally verified data present in PIR but missing from Swiss-Prot + TrEMBL is ongoing.
In the following paragraphs we will explain the main principles of the UniProt knowledgebase.
We will curate UniProt knowledgebase entries to an even higher level of detail than that already achieved in Swiss-Prot + TrEMBL and PIR-PSD. In addition to capturing the core data mandatory to each UniProt entry (consisting principally of the amino acid sequence, the protein name or description, taxonomic data and citation information), we strive to attach as much annotation information as possible to the protein. This is achieved in two ways: manually and automatically.
Manual annotation by curators based on literature and sequence analysis
Sequences for which novel functional, structural, and/or biochemical data have been published are assigned high manual annotation priority. In UniProt, annotation consists of the description of the following items:
function(s) of the protein;
enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulation mechanisms);
biologically relevant domains and sites;
post-translational modification(s) (PTMs);
molecular weight determined by mass spectrometry;
subcellular location(s) of the protein;
tissue-specific expression of the protein;
developmentally specific expression of the protein;
mature protein products;
similarities to other proteins;
use of the protein in a biotechnological process;
diseases associated with deficiencies or abnormalities of the protein;
use of the protein as a pharmaceutical drug;
sequence conflicts, etc.
This annotation is found in the comment lines (CC), feature table (FT) and keyword lines (KW). Comments are classified according to topics to allow easy retrieval of specific categories of data from the database.
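As an illustration of how these line types can be separated, the following minimal sketch groups the lines of a flat-file entry by their two-letter line code and splits the comment lines by topic. It assumes the conventional layout in which the line code occupies the first two columns, the content starts at column six and each comment block opens with `-!- TOPIC:`. The entry text and the function names (`group_by_line_code`, `comment_topics`, `keywords`) are hypothetical and serve only as an example.

```python
from collections import defaultdict

# Hypothetical, abbreviated entry; the content is illustrative only.
ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   120 AA.
CC   -!- FUNCTION: Illustrative function statement.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
FT   DOMAIN       10     60       Illustrative domain.
KW   Hydrolase; Transmembrane.
//
"""


def group_by_line_code(entry_text):
    """Collect the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # entry terminator
            break
        code, content = line[:2], line[5:]
        grouped[code].append(content)
    return dict(grouped)


def comment_topics(cc_lines):
    """Split CC content into a {topic: text} mapping (assumed '-!- TOPIC: text' layout)."""
    topics = {}
    current = None
    for content in cc_lines:
        if content.startswith("-!- "):
            topic, _, text = content[4:].partition(": ")
            current = topic
            topics[current] = text
        elif current is not None:
            # Continuation line of the current comment block.
            topics[current] += " " + content.strip()
    return topics


def keywords(kw_lines):
    """Return the individual keywords from semicolon-separated KW content."""
    joined = " ".join(kw_lines).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]
```

Retrieving one category of data then reduces to a dictionary lookup, for example `comment_topics(group_by_line_code(ENTRY)["CC"])["FUNCTION"]`.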
To acquire the most up-to-date and wide-ranging knowledge regarding a protein, information is obtained not only from publications reporting new sequence data, but also from review articles to facilitate the periodic revision of protein families or groups of proteins. Furthermore, we have enlisted external experts to send us comments and updates concerning specific groups of proteins.
In order to provide the high level of annotation described above, all UniProt curators read a large amount of scientific literature related to each protein. This enables them to contribute to the work of the Gene Ontology (GO) Consortium (9) by assigning GO terms during the annotation process as they extract information related to each of the GO ontologies, i.e. the function of a protein, what processes it is involved in and where in the cell it is located.
Automatic classification and annotation
With the rapid growth of sequence databases, there is an increasing need for reliable functional characterization and annotation of newly predicted proteins. To cope with such large data volumes, faster and more effective means of protein sequence characterization and annotation are required. One promising approach is automatic large-scale functional characterization and annotation, which is generated with limited human interaction.
We use InterPro (10) to recognize domains and to classify all protein sequences in UniProt into families and superfamilies. InterPro is an integrated resource of protein families, domains and sites that amalgamates the efforts of the member databases: Pfam (11), PROSITE (12), PRINTS (13), ProDom (14), SMART (15), PIRSF (16), Superfamily (17) and TIGRFAMs (18). The comprehensive InterPro classification is a prerequisite for improving the quality and quantity of our annotation using highly structured, classification-driven, rule-based, automated procedures.
Automatic functional annotation of the TrEMBL section of UniProt.
For automatic annotation, a novel system of standardized transfer of annotation from well-characterized proteins in the Swiss-Prot section of UniProt to non-annotated TrEMBL entries has been developed (19). Using this system, the Swiss-Prot section is used as the source to generate the annotation rules, which are then stored and managed in RuleBase. InterPro is then used to assign TrEMBL entries to groups. The annotation shared by the functionally characterized Swiss-Prot proteins of the group is then extracted and assigned to the unannotated TrEMBL entry. This system has been used to improve the annotation of 25% of TrEMBL entries. A new data mining approach to automatic annotation is also being developed to complement this system, which will increase coverage by automatic annotation over the next year and will bring the standard of annotation in the TrEMBL section of UniProt closer to that of the Swiss-Prot section.
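The transfer step described above can be sketched as follows. This is a simplified illustration and not the RuleBase implementation: it assumes that each entry is represented as a dictionary holding its InterPro group assignments and a set of annotation items, and it treats the annotation common to all Swiss-Prot members of a group as the rule for that group. The function names, the group identifiers and the annotation strings are hypothetical.

```python
from collections import defaultdict


def build_rules(swissprot_entries):
    """Map each group to the annotation shared by all of its Swiss-Prot members."""
    by_group = defaultdict(list)
    for entry in swissprot_entries:
        for group in entry["interpro"]:
            by_group[group].append(set(entry["annotation"]))
    # The rule for a group is the intersection of its members' annotation.
    return {group: set.intersection(*sets) for group, sets in by_group.items()}


def annotate(trembl_entry, rules):
    """Propose annotation for an unannotated entry from the groups it belongs to."""
    proposed = set()
    for group in trembl_entry["interpro"]:
        proposed |= rules.get(group, set())
    return proposed


# Hypothetical example data.
swissprot = [
    {"interpro": ["GROUP_A"], "annotation": {"KW:Hydrolase", "CC:Cytoplasmic"}},
    {"interpro": ["GROUP_A"], "annotation": {"KW:Hydrolase", "CC:Secreted"}},
]
rules = build_rules(swissprot)
proposed = annotate({"interpro": ["GROUP_A"]}, rules)
```

In this example only `KW:Hydrolase` is transferred, because the two characterized members disagree on subcellular location. A production system would need additional conditions, such as taxonomic restrictions, before a rule is applied.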
Also to be incorporated into the RuleBase annotation pipeline are the PIR classification-driven rule-based procedures, which will provide standardized and rich UniProt annotation for position-specific features, protein names and keywords. New feature rules are being defined systematically for fully curated PIRSF families that contain at least one known 3D structure with experimentally verified functional/active/binding site information. The PIRSF classification, based on the evolutionary relationships of whole proteins, has also been used to detect and correct numerous genome annotation errors that have resulted from identifications based only on local domain similarities and subsequently propagated based on transitivity (20).
High-quality Automated and Manual Annotation of microbial Proteomes (HAMAP)
A combined approach of automated and manual annotation for prokaryotic genomes in Swiss-Prot has resulted in the development of the HAMAP project (21). The HAMAP project, or ‘High-quality Automated and Manual Annotation of microbial Proteomes’, aims to integrate manual and automatic annotation methods in order to enhance the speed of the curation process while preserving the quality of the database annotation. Automatic annotation is only applied to entries that belong to manually defined orthologous families and to entries with no identifiable similarities (ORFans).
Annotation of ORFans. Various prediction tools are applied to proteins that show no similarity to known protein families. Possible transmembrane regions, signal sequence, coiled coils, ATP/GTP binding sites, LPXTG motifs and some defined repeats are automatically annotated using rules of consistency and dependency, and without any further manual verification.
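As a simple illustration of one such prediction, the sketch below scans a sequence for the LPXTG consensus, where X stands for any residue, and reports the 1-based positions of each match. It is not the tool used in HAMAP: it checks the consensus only and applies none of the consistency and dependency rules mentioned above. The function name `find_lpxtg` and the test sequence are hypothetical.

```python
import re

# L-P-x-T-G consensus, with x matching any single residue.
LPXTG = re.compile(r"LP.TG")


def find_lpxtg(sequence):
    """Return (start, end) 1-based inclusive positions of LPXTG consensus matches."""
    return [(m.start() + 1, m.end()) for m in LPXTG.finditer(sequence.upper())]


hits = find_lpxtg("MAAALPETGKK")
```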
Annotation of members of well-characterized (sub)families. Proteins belonging to well-characterized protein (sub)families can be annotated automatically using a rule system that describes the extent and nature of annotations that can be assigned by similarity to a prototype manually annotated entry. Such a rule system also includes a carefully edited multiple alignment of the (sub)family, which is used both to propagate feature annotation from a model entry and to generate profiles used to identify new members of the family. Species-specific rules and rules specific to the biochemical pathways are used to develop a system able to spot inconsistencies at the level of the entire proteome.
Standardized nomenclature and controlled vocabularies
Consistent nomenclature is indispensable for communication and literature search. UniProt aims to standardize the nomenclature for a given protein and its isoforms across related organisms. For various other UniProt items we use controlled vocabulary, e.g. for tissues, plasmids and keywords, which are listed in UniProt documents. The unified UniProt keyword list is based on Swiss-Prot keywords augmented by the addition of selected PIR keywords that represent new concepts or new parent/child nodes of existing Swiss-Prot keywords. Whenever available, we make use of the official nomenclature defined by international committees while still providing the published synonyms. Collaborations and regular data exchange with other databases and organizations allow the implementation of community-specific nomenclatures.
Integration with other databases
UniProt provides cross-references to external data collections such as the underlying DNA sequence entries in the DDBJ/EMBL/GenBank nucleotide sequence databases, 2D PAGE and 3D protein structure databases, various protein domain and family characterization databases, PTM databases, species-specific data collections, variant databases and disease databases. As a result of this, UniProt acts as a central hub for biomolecular information archived in more than 50 cross-referenced databases. A document listing all databases cross-referenced in UniProt (http://www.uniprot.org/support/docs/dbxref.shtml) is available and contains, for each database, a short description and the server URL. This interconnectivity is achieved almost exclusively via Database cross-Reference (DR) lines. In addition, links from subsequences or particular sites to databases specializing in certain types of PTMs or mutations are provided. Unique and stable feature identifiers (FTId) allow reference to a position-specific annotation item in the feature table. Currently these are systematically attributed to FT VARIANT lines of human sequence entries, to alternative splicing events (VARSPLIC) and to certain glycosylation sites (CARBOHYD), but will ultimately be assigned to all types of FT lines.
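To show how DR lines can be used to follow cross-references, the minimal sketch below collects them into a mapping from database name to identifier lists. It assumes the conventional layout `DR   DATABASE; identifier; further fields.` with semicolon-separated fields and a terminating full stop. The accession numbers in the example are placeholders and not real records, and the function name `parse_dr_lines` is hypothetical.

```python
# Hypothetical DR lines; the identifiers are placeholders, not real records.
ENTRY = """\
DR   EMBL; X00000; CAA00000.1; -.
DR   InterPro; IPR000000; Example_domain.
DR   PROSITE; PS00000; EXAMPLE_SITE; 1.
"""


def parse_dr_lines(entry_text):
    """Return {database: [[field, ...], ...]} for every DR line in the entry."""
    xrefs = {}
    for line in entry_text.splitlines():
        if not line.startswith("DR   "):
            continue
        # Drop the line code and the terminating full stop, then split the fields.
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        database = fields[0]
        identifiers = [f for f in fields[1:] if f]
        xrefs.setdefault(database, []).append(identifiers)
    return xrefs


xrefs = parse_dr_lines(ENTRY)
```

Each database name maps to a list because an entry may carry several cross-references to the same resource.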
Many sequence databases contain, for a given protein sequence, separate entries that correspond to different literature reports. In UniProt we try as much as possible to merge all these data in order to minimize the redundancy of the database. Differences between sequencing reports due to splice variants, polymorphisms, disease-causing mutations, experimental sequence modifications or simply sequencing errors are indicated in the feature table of the corresponding UniProt entry.
Splice isoforms may differ considerably from one another, with potentially <50% sequence similarity between isoforms. The tool VARSPLIC (22), which is freely available, enables the recreation of all annotated splice variants from the FT of a UniProt entry, or for the complete database. A FASTA-formatted file containing all splice variants annotated in UniProt can be downloaded for use with similarity search programs.
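The principle of recreating a splice variant from feature table annotation can be sketched as follows. This is not the VARSPLIC program itself: it assumes that each splicing event has already been extracted from the feature table as a `(start, end, replacement)` tuple with 1-based inclusive coordinates on the displayed sequence, with an empty replacement standing for a missing segment, and that the events of one isoform do not overlap. The function names and the example sequence are hypothetical.

```python
def build_isoform(canonical, events):
    """Apply non-overlapping (start, end, replacement) events to a sequence.

    Coordinates are 1-based and inclusive; an empty replacement deletes the segment.
    """
    seq = canonical
    # Apply events from the C-terminal end so earlier coordinates stay valid.
    for start, end, replacement in sorted(events, reverse=True):
        if not (1 <= start <= end <= len(canonical)):
            raise ValueError(f"event {start}-{end} lies outside the sequence")
        seq = seq[:start - 1] + replacement + seq[end:]
    return seq


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical 16-residue sequence with one substitution and one deletion.
isoform = build_isoform("MKTAYIAKQRQISFVK", [(3, 5, "GG"), (10, 12, "")])
```

Applying the events in descending order of position avoids having to recompute coordinates after each change in length.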
The UniProt consortium emphasizes the use of an evidence attribution mechanism for protein annotation that will include, for all data, the data source, the types of evidence and methods for annotation. This is essential as the UniProt knowledgebase will contain data automatically imported from the underlying nucleotide sequence databases, data imported from other databases, data from specific programs, the results of automatic annotation systems and, most important of all, expert manual curation. The implementation of evidence tags will allow the user to distinguish between all these data sources and to easily identify particular classes of data of interest, such as experimentally proven protein annotation.
To further improve the quality of protein annotation by increasing the amount of experimentally verified data with source attribution, UniProt has developed a bibliography submission system and is conducting retrospective attribution of literature data. The submission page allows submission and categorization of literature citations for experimental annotations, and displays comprehensive bibliographic data collected from many curated databases for each UniProt entry. A systematic manual attribution of experimental features is being carried out with computer-assisted mapping to existing protein bibliographic information. So far, a few thousand experimental features have been associated with publications and cross-referenced to the corresponding PMIDs for direct incorporation into the UniProt knowledgebase.