In this study, we analyzed the influence of the searched database on the long-term storage of proteomics data. We found distinct differences in the stability of protein identifiers between the investigated protein databases: UniProtKB, the only database that contains a high proportion of manually curated records (UniProtKB/Swiss-Prot), proved to be significantly more stable than IPI and NCBI gi numbers. As mentioned above, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL were treated as a single searched database in this study (UniProtKB).
Given that IPI was by far the most commonly used database in PRIDE submissions, the instability of IPI identifiers seems especially problematic. For example, 10% to 20% (depending on the mapping algorithm) of the reported protein identifications had already been deleted after only two years. A concerning and, at the same time, surprising point is that several of the investigated experiments were published in 2010 and thus already contained a considerable amount of outdated or invalid data at the time of publication: data that was published and immediately perished (2). One such example is a study by Gammulla et al. in Indian rice (22) (PRIDE accessions 10726–10740) that used the NCBI nr database from August 2008. This study contained 18.8% deleted identifiers when it was published in August 2010 (see above). The effect of changing protein identifiers on published data increased considerably over time. For instance, another study (28), published in May 2009, already contained 33.0% deleted protein identifiers at the time of investigation (November 2010, PRIDE accessions 3706–3714). A representative example of a much older project is the HUPO PPP project: at the time of the present study, its data contained only 44.7% active identifiers (see above). These results are not caused by any misconduct of the respective authors but reflect the instability of certain protein databases for specific species.
The primary focus of this study was the quantification of changing protein identifiers from different databases over time. The results presented in , , and do not fully reflect the true impact of changing protein databases on the long-term storage of proteomics data. To approximate this effect correctly, the consequences of changing protein sequences for the actual peptide identifications need to be taken into consideration. Based on our investigation of the fraction of peptides that no longer fit the current protein sequences, we expect that the fraction of deleted identifiers is actually twice as high. Comparing the distribution of peptide scores between fitting and nonfitting peptides, we found that the scores of the still-fitting peptides were statistically significantly higher than the scores of the nonfitting ones (except for Mascot). It thus seems probable that the fraction of nonfitting peptides contains more false-positive identifications than the fraction of fitting peptides. However, without carefully considering how the protein inference (29) was done in each particular case, it is impossible to reach further conclusions.
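The check for peptides that no longer fit the current protein sequences can be illustrated with a minimal sketch. It assumes that each peptide identification is available as a (protein identifier, peptide sequence, score) tuple and that the current sequences have been loaded into a dictionary; the function names and the data layout are hypothetical and are not taken from the pipeline used in this study. The sketch also assumes that a peptide whose protein has been deleted counts as nonfitting, and it uses exact substring matching, so it ignores ambiguities such as isoleucine versus leucine.

```python
def split_by_fit(peptides, current_sequences):
    """Partition peptide identifications into fitting and nonfitting ones.

    peptides: iterable of (protein_id, peptide_sequence, score) tuples.
    current_sequences: dict mapping protein_id to its current sequence;
        identifiers that were deleted are simply absent (assumption).
    """
    fitting, nonfitting = [], []
    for protein_id, peptide, score in peptides:
        sequence = current_sequences.get(protein_id)
        # A peptide "fits" if it is still an exact substring of the
        # current sequence of the protein it was assigned to.
        if sequence is not None and peptide in sequence:
            fitting.append((protein_id, peptide, score))
        else:
            nonfitting.append((protein_id, peptide, score))
    return fitting, nonfitting


def nonfitting_fraction(peptides, current_sequences):
    """Fraction of peptide identifications that no longer fit."""
    fitting, nonfitting = split_by_fit(peptides, current_sequences)
    total = len(fitting) + len(nonfitting)
    return len(nonfitting) / total if total else 0.0
```

The two returned lists keep the original scores, so the score distributions of fitting and nonfitting peptides can be compared afterwards with a statistical test of choice.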
Deleted identifiers are the worst but not the only problem caused by changes in protein databases. Cases where protein identifiers from UniProtKB are demerged into several new identifiers may also alter the original significance of the data. UniProtKB/Swiss-Prot has historically “merged” 100% identical protein sequences from different genes in the same species into one single record. However, UniProt recently started to demerge entries containing multiple individual genes coding for 100% identical protein sequences into individual UniProtKB/Swiss-Prot entries containing a single gene (see UniProt release 2010_09 notes, http://www.uniprot.org/news/2010/08/10/release). This development might cause significant problems when comparing old and more recent data. For example, protein P05209, identified in several PRIDE experiments, was demerged and currently maps to 13 different identifiers. For human, mouse, and rat there are even two different mappings for each species. Another problematic example is protein P59641, identified in an experiment performed on human (PRIDE accession number 1645). Currently, P59641 maps to four UniProtKB/Swiss-Prot entries, none of which is human. The time when UniProtKB identifiers were demerged into entries for every species can clearly be seen in . The majority of these cases could be resolved based on the investigated species and thus have only a limited negative effect on the stored data.
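The species-based resolution of demerged identifiers can be sketched as follows. This is only an illustration under stated assumptions: the demerge table is assumed to be a dictionary from an old accession to a list of (new accession, species) pairs, and the function name, the status labels, and the placeholder accessions in the usage example are hypothetical.

```python
def resolve_demerged(old_id, demerged_map, species):
    """Pick the successor entries of a demerged accession by species.

    demerged_map: dict mapping an old accession to a list of
        (new_accession, species) pairs (assumed layout).
    Returns a (status, matches) tuple, where status is one of
    "resolved", "ambiguous", or "unresolved".
    """
    matches = [
        new_id
        for new_id, entry_species in demerged_map.get(old_id, [])
        if entry_species == species
    ]
    if len(matches) == 1:
        return "resolved", matches
    if matches:
        # Several successor entries exist for the same species.
        return "ambiguous", matches
    # No successor entry belongs to the investigated species.
    return "unresolved", []
```

For instance, with a hypothetical table `{"OLD1": [("NEW_A", "human"), ("NEW_B", "mouse")]}`, calling `resolve_demerged("OLD1", table, "human")` yields a single match. The "ambiguous" status corresponds to the situation described for P05209, where two mappings exist per species, and "unresolved" to the situation described for P59641, where none of the successor entries is human.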
After studying the stability of the protein identifiers stored in PRIDE, we compared these findings with the total rate of change of the underlying protein databases. For this analysis, we used at least two releases of UniProtKB, IPI, and Ensembl per year since 2005 (human and mouse only). The overall rate of change was comparable to the one found when only the identifiers reported in PRIDE experiments for UniProtKB and IPI were taken into account (see , , and supplemental Fig. S3). The difference in identifier stability between IPI and UniProtKB, as well as between mouse and human identifiers, was also reflected in the analysis of the complete database builds. A possible reason for the higher instability of IPI is the varying quality (based on the improvements of genome annotation) of the records in the source databases that are used to create the “clustered” IPI protein entries: UniProt, Ensembl, RefSeq, TAIR, H-inv (30), and Vega (31).
Surprisingly, human identifiers were significantly less stable than mouse identifiers in UniProtKB even though human is considered the “more stable” species. This might be caused by a stronger curation effort put into human than into mouse data. Nevertheless, the constant level of about 6% deleted human identifiers compared with about 2% deleted mouse identifiers until the middle of 2010 is striking.
The instability of protein identifiers is not only a problem for published data but can also cause unforeseen problems in long-term projects such as clinical studies. In these cases, samples are generally collected over several years but often need to be processed immediately. If the raw data from these experiments are not reprocessed before the final overall data interpretation, invalid results may be obtained. In such cases, the variation in results caused by changing protein databases will be considerably higher than the numbers reported here, as we could only assess the loss of data. When MS data are reprocessed, changes in protein databases not only cause a potential loss of data but also result in new findings. If this effect is not taken into consideration, long-term studies might produce invalid results.
The instability of certain protein databases reported in this study does not only influence the storage of “pure” proteomics data. Other biological resources, such as Reactome (32), that process and curate proteomics data from publications need to find ways to handle these changes in protein sequence databases.
The two protein identifier mapping algorithms produced significantly different results. Although the PICR mapping algorithm seems more stringent, it sometimes reported twice as many deleted IPI identifiers as the logical mapping algorithm when mapping protein identifiers retrieved from PRIDE (see ). This difference was not observed when mapping whole database releases (see and supplemental Fig. S7). These results clearly suggest that it is imperative to choose the protein identifier mapping algorithm carefully for each specific application and to test its effect thoroughly. The logical mapping approach seems better suited for applications that focus on extracting biological knowledge from proteomics data. Especially in manually curated databases like UniProtKB/Swiss-Prot, protein identifier changes are curated according to their biological meaning. Thus, the logical mapping approach seems most likely to maintain the biological significance. A striking example of the differences between the two mapping algorithms can be found in the results from less characterized species. The PICR service, for instance, reported 50% of the UniProtKB identifiers from the submission of chicken data in January 2005 as deleted (see above), compared with little more than 10% reported as deleted by the logical mappings. A similar example is the submission of data on zebrafish in August 2007 (see above), where the PICR service reported virtually all of the identifiers as deleted, compared with 20% based on the logical mappings. Nevertheless, PICR's approach seems to be more stringent and thus better suited for data repositories like PRIDE.
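To make the comparison of two mapping approaches concrete, the following minimal sketch computes the fraction of deleted identifiers for a submission under each mapping and lists the identifiers on which the two mappings disagree. It assumes that the result of each mapping has already been collected into a dictionary from the submitted identifier to a list of currently active identifiers, where an empty list or a missing key means "deleted". The function names and this data layout are hypothetical; the sketch does not call the PICR service or reproduce the logical mapping procedure.

```python
def deleted_fraction(identifiers, mapping):
    """Fraction of submitted identifiers without an active successor.

    mapping: dict from submitted identifier to a list of currently
        active identifiers; an empty list or missing key is treated
        as "deleted" (assumption).
    """
    ids = list(identifiers)
    if not ids:
        return 0.0
    deleted = sum(1 for identifier in ids if not mapping.get(identifier))
    return deleted / len(ids)


def disagreements(identifiers, mapping_a, mapping_b):
    """Identifiers that one mapping reports as deleted and the other as active."""
    return sorted(
        identifier
        for identifier in set(identifiers)
        if bool(mapping_a.get(identifier)) != bool(mapping_b.get(identifier))
    )
```

Inspecting the list of disagreements for a small test submission is one simple way to test the effect of a mapping algorithm before applying it to a whole repository.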
In this study, we showed that changing protein identifiers are a risk for the long-term storage of proteomics data as well as for the evaluation of long-term proteomics studies. The protein databases differ significantly in identifier stability. Based on the findings presented here, UniProtKB seems to be the best database for applications that rely on the long-term storage of proteomics data. Nevertheless, there are several applications where UniProtKB cannot be used, for example when the investigated species are not present in UniProtKB. It is therefore imperative to take the effect of changing protein identifiers into consideration when performing proteomics experiments and evaluating proteomics data. The results from the two protein identifier mapping algorithms used in this study differed considerably. These differences have to be taken into consideration when choosing a protein database and mapping algorithm for a specific task to prevent the misinterpretation of proteomics data.