UniProt comprises three database components, each of which addresses a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB) provides protein sequences with extensive annotation and cross-references. The UniProt Archive (UniParc) is the main sequence storehouse. The UniProt Reference Clusters (UniRef) condense sequence information and annotation to facilitate both sequence similarity searches and analyses of the results. The table below summarizes the three UniProt databases and their sizes in the current release.
Names and sizes of the UniProt databases
The centerpiece UniProt database is the UniProtKB—a richly annotated protein sequence database with extensive cross-references. Much of the annotation data are buried within the ever-increasing volume of scientific publications or spread among individual databases stored at different locations with differing formats. The UniProtKB provides an integrated and uniform presentation of these disparate data, including annotations such as protein name and function, taxonomy, enzyme-specific information (catalytic activity, cofactors, metabolic pathway, regulatory mechanisms), domains and sites, post-translational modifications, subcellular locations, tissue-specific or developmentally specific expression, interactions, splice isoforms, polymorphisms, diseases and sequence conflicts. Literature citations provide evidence for experimental data. Entries connect to various external data collections such as the underlying DNA sequence entries, protein structure databases, protein domain and family databases, and species- and function-specific data collections. As a result, UniProtKB acts as a central hub connecting biomolecular information archived in ~100 cross-referenced databases.
The UniProtKB contains two sections. UniProtKB/Swiss-Prot contains records with full manual annotation or computer-assisted, manually-verified annotation performed by biologists and based on published literature and sequence analysis. UniProtKB/TrEMBL contains records with computationally generated annotation and large-scale functional characterization. The computer-assisted annotation may employ automatically generated rules as in Spearmint (1), or manually curated rules based on protein families, including HAMAP family rules (2), PIRSF classification-based name rules and site rules (3) and Rulebase rules (4).
UniProt Reference Clusters
UniRef comprises three separate datasets that compress sequence space at different resolutions by merging sequences and sub-sequences that are 100% (UniRef100), ≥90% (UniRef90) or ≥50% (UniRef50) identical, regardless of source organism. Reducing sequence redundancy speeds up sequence similarity searches and makes their results more informative.
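The effect of merging at different identity thresholds can be illustrated with a small sketch. It uses greedy representative-based clustering, a common heuristic that is assumed here for illustration only; the text does not say which algorithm UniRef uses. Identity is approximated by the longest common subsequence divided by the length of the shorter sequence, a crude stand-in for alignment-based identity that also treats a sub-sequence as 100% identical to its parent. The function names and the toy sequences are hypothetical.

```python
from __future__ import annotations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Crude identity estimate: LCS length over the shorter sequence length."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


def cluster(sequences: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering: longest sequences first, each one joins the first
    representative it matches at >= threshold, otherwise it founds a new cluster."""
    clusters: dict[str, list[str]] = {}
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        for rep in clusters:
            if identity(sequences[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Toy, made-up sequences (not real database entries).
toy = {
    "s1": "MKTAYIAKQRQISFVKSHFS",
    "s2": "MKTAYIAKQRQISFVKSHFS",  # identical to s1
    "s3": "AYIAKQRQISFVK",         # sub-sequence of s1
    "s4": "MKTAYIAKQRQISFVKSHFA",  # one residue differs from s1
    "s5": "MKTAYGGGGG",            # shares only a short prefix with s1
}

for level in (1.0, 0.9, 0.5):
    print(level, len(cluster(toy, level)))  # 3, 2 and 1 clusters respectively
```

In this toy set the five sequences collapse into three, two and one cluster as the threshold is lowered, which mirrors the idea of viewing the same sequence space at coarser resolutions.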
To maximize the chances of biological discovery, homology searches are performed using up-to-date collections of sequences. However, with the accelerated growth of the number of sequences, similarity searching has become increasingly computationally intensive and prohibitive for resource providers and their users. Furthermore, there is an uneven distribution of sequences in sequence space (5). An overabundance of very closely related sequences (e.g. >90% identity) slows down database searches, and long lists of similar or identical alignments can obscure novel matches in the output. A more even sampling of sequences will shorten and clean output listings without repetition of redundant hits. The compression of UniRef100 into UniRef90 and UniRef50 yielded size reductions of ~40 and 65%, respectively.
UniProt Archive
Protein sequences are publicly available from several sources that largely, but not completely, overlap in coverage. UniParc houses all new and revised protein sequences from these sources so that comprehensive coverage is available at a single site. Simply pooling sequences from disparate sources would make the archive redundant, since the same sequence may be found in many sources (UniProt, GenPept, RefSeq, etc.). To avoid this, each unique sequence is assigned a unique identifier and is stored only once. Each UniParc entry stores the identifier, the sequence, a cyclic redundancy check number (CRC64), the source database(s) with accession and version numbers, and a time stamp. In addition, each source database accession number is tagged with its status in that database, indicating whether the sequence still exists or has been deleted at that source. The archive thus provides a history of protein sequences.
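The storage scheme described above can be sketched as a small in-memory archive. This is a minimal illustration under stated assumptions, not the UniParc implementation: the class and field names, the `ARC` identifier format and the accessions are hypothetical, and the checksum is a generic bitwise CRC-64 using the ISO 3309 polynomial as a stand-in, which has not been checked against the CRC64 values that UniParc stores.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone

# Reflected form of the ISO 3309 CRC-64 polynomial (assumed stand-in checksum).
_POLY = 0xD800000000000000


def crc64(data: bytes) -> int:
    """Bitwise reflected CRC-64 with zero initial value and no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
    return crc


@dataclass
class SourceRef:
    database: str
    accession: str
    version: int
    loaded_at: datetime
    active: bool = True  # False once the sequence is deleted at the source


@dataclass
class ArchiveEntry:
    identifier: str
    sequence: str
    checksum: int
    sources: list[SourceRef] = field(default_factory=list)


class SequenceArchive:
    """Stores each unique sequence once and records every source that reported it."""

    def __init__(self) -> None:
        self._by_sequence: dict[str, ArchiveEntry] = {}

    def __len__(self) -> int:
        return len(self._by_sequence)

    def load(self, database: str, accession: str, version: int, sequence: str) -> ArchiveEntry:
        entry = self._by_sequence.get(sequence)
        if entry is None:
            # Hypothetical identifier scheme: a running number per unique sequence.
            identifier = f"ARC{len(self._by_sequence) + 1:010d}"
            entry = ArchiveEntry(identifier, sequence, crc64(sequence.encode("ascii")))
            self._by_sequence[sequence] = entry
        entry.sources.append(
            SourceRef(database, accession, version, datetime.now(timezone.utc))
        )
        return entry

    def mark_deleted(self, database: str, accession: str) -> None:
        """Flag a source accession as deleted while keeping it for the history."""
        for entry in self._by_sequence.values():
            for ref in entry.sources:
                if ref.database == database and ref.accession == accession:
                    ref.active = False


archive = SequenceArchive()
first = archive.load("UniProt", "ACC_A", 1, "MKTAYIAK")   # made-up accessions
second = archive.load("RefSeq", "ACC_B", 2, "MKTAYIAK")   # same sequence, other source
archive.mark_deleted("RefSeq", "ACC_B")
print(first.identifier, len(archive), [(s.database, s.active) for s in first.sources])
```

Keying the store on the sequence itself is what guarantees one entry per unique sequence; the checksum is kept only as stored metadata, since two different sequences could in principle share a CRC value.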