To provide a high-quality dataset of mammalian protein complexes, all entries are created manually. Only protein complexes that have been isolated and characterized by reliable experimental evidence are included in CORUM. To be considered for CORUM, a protein complex has to be isolated as one molecule and must not be a construct derived from several experiments. Artificial constructs of subcomplexes are also not taken into account. Since information from high-throughput experiments contains a significant fraction of false-positive results, this type of data is excluded. References to relevant articles were mainly found in general review articles, in cross-references to related protein complexes within the analysed literature and in comments on referenced articles in UniProt (12). Table 1 shows that the largest fraction of the articles used was published in The Journal of Biological Chemistry (23.6%), followed by PNAS (7.1%), Molecular and Cellular Biology (6.1%) and Cell (5.7%). The vast majority of the articles used are from high-impact journals, which indicates that the characterization of protein complexes is regarded as important information.
| Table 1. Analysis of the absolute number and fraction of articles from the respective scientific journals that were used for the annotation of mammalian protein complexes in CORUM |
The PSI-MI standard was introduced to define community standards for data representation in proteomics and thereby facilitate data comparison, exchange and verification (13). The CORUM dataset is annotated according to the currently valid PSI-MI 2.5 standard. One rule of PSI-MI annotation is that a molecular interaction described by several publications is recorded separately for each publication. The advantage of this approach is that annotators present the information exactly as described by the authors and do not need to amalgamate the results of different groups if the experiments show conflicting results. Another advantage is that if one protein complex has been isolated and characterized by different groups, the reproducibility confirms the composition of the protein complex. The drawback is that this approach results in a certain degree of redundancy.
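The redundancy described above can be handled downstream by grouping entries that share the same subunit composition. The following is a minimal sketch under stated assumptions, not the procedure used by CORUM: the record layout (`name`, `subunits`, `pubmed_id`), the accessions and the PubMed identifiers are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: one entry per publication, kept separate as in
# PSI-MI-style annotation. Accessions and PubMed IDs are placeholders.
entries = [
    {"name": "Complex A", "subunits": ["ACC0001", "ACC0002"], "pubmed_id": 111},
    {"name": "Complex A (second isolation)", "subunits": ["ACC0002", "ACC0001"], "pubmed_id": 222},
    {"name": "Complex B", "subunits": ["ACC0003", "ACC0004", "ACC0005"], "pubmed_id": 333},
]

def group_by_composition(records):
    """Group entries whose subunit sets are identical, ignoring subunit order."""
    groups = defaultdict(list)
    for record in records:
        key = frozenset(record["subunits"])
        groups[key].append(record)
    return groups

groups = group_by_composition(entries)
for composition, members in groups.items():
    support = sorted(member["pubmed_id"] for member in members)
    print(sorted(composition), "reported by", len(support), "publication(s):", support)
```

Grouping by an unordered set keeps every original entry intact, so conflicting reports remain visible, while compositions confirmed by more than one publication are easy to count.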
Many well-characterized protein complexes are associated with scientific names such as ribosome, proteasome or spliceosome in the literature. These names are also provided in CORUM, together with synonyms if they are frequently used in the literature. An example is the eukaryotic chaperonin CCT (chaperonin containing TCP-1), which is also well known as TRiC (TCP-1 ring complex). If no name is available for a protein complex, we define one, usually composed of the gene names of its subunits, e.g. ‘BRCA1-RAD51 complex’ or ‘Ubiquitin E3 ligase (containing FBXW7, CUL1, SKP1A and RBX1)’.
Another annotated feature is the organism from which the protein complex originates. The focus of much research on human biology is reflected by the high proportion of human protein complexes in CORUM. Most analysed protein complexes in CORUM originate from human (65%), followed by mouse (14%) and rat (14%).
The subunits of protein complexes are annotated according to the respective SwissProt entries. In CORUM, only the primary accessions are stored as identifiers. Associated information such as gene names and protein names is retrieved via the BioRS sequence retrieval system, which provides up-to-date information from the primary data sources without the need for synchronization.
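The design choice of storing only primary accessions and resolving names at retrieval time can be illustrated with a small sketch. Everything here is an assumption for illustration: the `ComplexEntry` type, the `PRIMARY_SOURCE` dictionary and the `lookup` function are hypothetical stand-ins for a query to the primary data source, not the BioRS interface, and the accessions and gene names are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class ComplexEntry:
    """A complex that stores only the primary accessions of its subunits."""
    complex_id: int
    name: str
    subunit_accessions: Tuple[str, ...]

# Hypothetical stand-in for the primary data source (placeholder content).
PRIMARY_SOURCE: Dict[str, Dict[str, str]] = {
    "ACC0001": {"gene": "GENE1", "protein": "Hypothetical protein 1"},
    "ACC0002": {"gene": "GENE2", "protein": "Hypothetical protein 2"},
}

def lookup(accession: str) -> Dict[str, str]:
    """Resolve an accession at retrieval time (stub for a remote query)."""
    return PRIMARY_SOURCE[accession]

def gene_names(entry: ComplexEntry,
               resolver: Callable[[str], Dict[str, str]] = lookup) -> List[str]:
    """Return the current gene names of all subunits of an entry."""
    return [resolver(acc)["gene"] for acc in entry.subunit_accessions]

entry = ComplexEntry(1, "Example complex", ("ACC0001", "ACC0002"))
before = gene_names(entry)

# A change in the primary source is visible immediately, because the
# entry itself never stored a copy of the gene name.
PRIMARY_SOURCE["ACC0001"]["gene"] = "GENE1A"
after = gene_names(entry)
print(before, after)
```

Because the entry holds identifiers rather than copies of the associated data, no synchronization step is needed when the primary source is updated.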
Besides the identification of the different subunits, another important piece of information is the number of individual proteins required to assemble the complex. In most cases, the molecular characterization of the protein complex composition is limited to the identification of the subunits. Where the stoichiometry of the subunits has been analysed, the information is given in the ‘Number of subunits’ field (see e.g. complex 960).
We use the Functional Catalogue (FunCat) annotation scheme to characterize protein and protein complex function (14). FunCat has been used for the manual annotation of model organisms such as Saccharomyces cerevisiae, Arabidopsis thaliana and mouse (15) and has also been used frequently for the analysis of protein networks and high-throughput experiments (14). Applying FunCat organizes the data in a systematic, computer-readable format. The hierarchical structure of FunCat allows browsing for protein complexes with particular cellular functions or localizations. This reveals subsets that would otherwise require specialised databases such as the PIN database for nuclear protein complexes (16). Examples of such subsets are presented on the CORUM home page. In addition, FunCat annotation allows fast access to some statistics of the data. For example, the CORUM dataset contains far more protein complexes from the nucleus (67% of all complexes with annotated localization) than from the cytoplasm (9%). This might be explained by the complexity of the information processes within the nucleus. However, the data do not necessarily reflect the situation in living cells but might rather reflect the topics that have been investigated by individual research projects.
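Browsing a hierarchical catalogue of this kind amounts to selecting every entry whose category lies at or below a chosen node. The sketch below assumes that categories are written as dot-separated numeric codes in which a child code extends the code of its parent; the complex names and the specific codes are illustrative placeholders and are not taken from CORUM.

```python
# Hypothetical annotations: complex name -> list of FunCat-style codes.
# Names and codes are placeholders chosen only to show the hierarchy.
annotations = {
    "Complex A": ["70.10", "11.02.03"],
    "Complex B": ["70.03"],
    "Complex C": ["70.10.03", "10.01"],
}

def in_category(code, category):
    """True if `code` equals `category` or lies below it in the hierarchy."""
    return code == category or code.startswith(category + ".")

def browse(annotated, category):
    """Return the names of all complexes annotated at or below `category`."""
    return sorted(
        name
        for name, codes in annotated.items()
        if any(in_category(code, category) for code in codes)
    )

subset = browse(annotations, "70.10")
top_level = browse(annotations, "70")
print(subset, top_level)
```

Matching on `category + "."` rather than on the bare prefix avoids false hits between unrelated codes that merely share leading digits, for example `7` and `70`.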
The evidence for assigning a functional category is given in a separate field. Five evidence types indicate the rationale for assigning a functional category. They differ in quality, ranging from experimental evidence (exp) to predicted functions (pred). For all evidence types except predicted annotation, the underlying PubMed references are provided. Additional information, such as disease relevance or more detail about the cellular function of a protein complex, is given in the comment field.
One mandatory item for PSI-MI-compliant annotation is the experimental method that led to the identification of the protein interaction. For this kind of data the PSI consortium provides a list of methods (http://www.psidev.info/). If several methods were used to isolate a protein complex, all methods are listed. The PubMed reference of the article that describes the isolation of the protein complex is given in the PubMed field.
Concerning the inventory of protein complexes of an organism, the complexosome, important questions are (i) how many different subunits protein complexes are composed of, (ii) what fraction of protein-coding genes cells devote to building the complexosome and (iii) how many protein complexes a cell contains.
(i) In September 2007, the CORUM database contained about 1750 protein complexes. On average, each complex consists of 4.7 different subunits. This is well in line with data from yeast (see above). The largest protein complex of the dataset, the spliceosome, consists of 143 different proteins. (ii) If all genes of the protein complexes in CORUM are mapped to, for example, the human genome, the dataset covers ~2400 genes. Based on the current estimate that the human genome contains 20 488 protein-coding genes (17), the entries from CORUM represent 12% of a mammalian genome. Owing to the lack of data, it is not possible to give a reliable approximation of the total number of protein-coding genes involved in complex formation in mammalian organisms. However, if in mammals, as in yeast, more than half of the protein-coding genes are used for the formation of protein complexes, this would offer cells an enormous repertoire of building blocks for the development of complexes with novel functionalities. The modular architecture of protein complexes is exemplified by large protein complex families such as SNARE complexes (96 members), integrin complexes (69 members) and ubiquitin E3 ligases (54 members) that were annotated in CORUM. These protein complex families originated from the association of protein family members that emerged in evolution as the result of gene duplication and specification events (2). (iii) To date, including data from CORUM, BIND (18) and HPRD (19), the number of non-redundant protein complexes in mammals is well above 2500. Since only part of the protein complexes described in the literature have been annotated so far, and novel experimentally characterized protein complexes continue to appear, this number will certainly increase in the future.
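The coverage estimate in (ii) can be reproduced with a short calculation. The two input figures are taken from the text above; the variable names are illustrative, and the last value only expresses the "more than half" assumption borrowed from yeast as a gene count, not a measured result.

```python
# Figures quoted in the text.
corum_genes = 2400              # approximate number of human genes covered by CORUM
protein_coding_genes = 20488    # estimated protein-coding genes in the human genome

# Fraction of the genome represented by CORUM entries.
coverage = corum_genes / protein_coding_genes
print(f"CORUM coverage: {coverage:.1%}")

# Under the yeast-like assumption that more than half of all protein-coding
# genes take part in complex formation, this is the corresponding lower bound.
half_of_genome = protein_coding_genes // 2
print(f"Half of the protein-coding genes: {half_of_genome}")
```

The ratio comes out just under 12%, consistent with the rounded value quoted in the text.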