Five different genomes were downloaded as specified: H.sapiens (ENSEMBL NCBI36), M.musculus (ENSEMBL NCBIM36), S.cerevisiae (ENSEMBL SGD1), C.elegans (ENSEMBL CEL150) and A.thaliana (TAIR6).
For each protein, the corresponding SwissProt entry in release 50 was identified, when one exists, by searching for exactly matching sequences. The fraction of genomic sequences deposited in the SwissProt database ranges from 13% for both A.thaliana and C.elegans to 79% for S.cerevisiae.
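The exact-match lookup described above can be illustrated with a minimal sketch. The function name and the dictionary-based inputs are assumptions made for illustration only; the text does not describe how the search was implemented.

```python
def match_exact(genome_proteins, swissprot_entries):
    """Map each genome protein to the SwissProt entries with an identical sequence.

    Both arguments are dicts of identifier -> amino-acid sequence
    (hypothetical input format). Proteins without an identical
    SwissProt sequence are omitted from the result.
    """
    # Index SwissProt by sequence so each lookup is a single dict access.
    by_sequence = {}
    for accession, sequence in swissprot_entries.items():
        by_sequence.setdefault(sequence.upper(), []).append(accession)

    matches = {}
    for protein_id, sequence in genome_proteins.items():
        hits = by_sequence.get(sequence.upper())
        if hits:
            matches[protein_id] = hits
    return matches
```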
Number of proteins with an experimental or a similarity-based annotation of the subcellular localization
For these proteins the experimental annotation was extracted by parsing the ‘Subcellular localization’ section of the COMMENT field of the SwissProt file. Entries annotated as ‘probable’, ‘possible’ or ‘by similarity’ were not considered. The annotations directly or implicitly referring to one of the following 17 classes were taken into account: Nucleus, Cytoplasm, Mitochondrion, Plastid, Golgi, Endoplasmic reticulum, Lysosome, Endosome, Vesicles, Peroxisome, Vacuole, Cell wall, Secretory pathway, Extracellular, Cytoskeleton, Membrane and Transmembrane (the Supplementary Data list the keywords considered for assigning the localization). Only 22% of all the SwissProt entries for the five considered species record the experimental subcellular localization. The rate of experimental annotation ranges from 46% of the S.cerevisiae proteome to <10% for A.thaliana and C.elegans.
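A minimal sketch of this filtering step is given below. It assumes that the comment text has already been extracted from the SwissProt entry and that each sentence is an independent statement. The keyword map is a small hypothetical sample; the actual keyword list is given in the Supplementary Data and is not reproduced here.

```python
import re

# Hypothetical keyword -> class sample; NOT the keyword list actually used.
KEYWORDS = {
    "nucleus": "Nucleus",
    "nuclear": "Nucleus",
    "cytoplasm": "Cytoplasm",
    "mitochondri": "Mitochondrion",
    "secreted": "Extracellular",
}

# Qualifiers that mark a statement as non-experimental (from the text above).
UNCERTAIN = re.compile(r"\b(probable|possible|by similarity)\b", re.IGNORECASE)


def experimental_classes(comment):
    """Return the set of classes stated without an uncertainty qualifier."""
    classes = set()
    # Assumption: one statement per sentence of the comment.
    for statement in comment.split("."):
        if UNCERTAIN.search(statement):
            continue  # skip 'probable', 'possible' and 'by similarity' statements
        lowered = statement.lower()
        for keyword, cls in KEYWORDS.items():
            if keyword in lowered:
                classes.add(cls)
    return classes
```

A protein that shuttles between two compartments yields two classes, which is why the same entry can be counted more than once.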
The ‘Experimental annotation’ column of the table below lists the number of proteins experimentally annotated in each of the 17 localization classes considered. Note that the same sequence can be annotated in SwissProt with two (or, rarely, more) different localizations; this happens, for example, for proteins that shuttle between the nucleus and the cytoplasm. In these cases the same entry is counted two (or more) times. The number of proteins in the different localizations spans two orders of magnitude.
Number of sequences in the 17 different subcellular localizations as derived with experimental and similarity-based annotations
The best way to annotate the remaining proteins is to search for experimentally annotated sequences sharing high identity (6). Since the three eukaryotic kingdoms (Metazoa, Viridiplantae and Fungi) differ in the number and types of possible localizations, three kingdom-specific datasets of annotated proteins were extracted from SwissProt. These datasets contain 26 192 sequences for Metazoa, 6370 sequences for Viridiplantae and 4023 sequences for Fungi. All the sequences of the five considered genomes were searched for similar sequences in the appropriate dataset using BLAST (18). When matches were found with an E-value below the threshold (roughly corresponding to an identity level >30%), the annotation of the best-scoring match was transferred to the query sequence. When multiple matches are found with the same best-scoring E-value, all of them are reported in the database. This procedure assigns a localization to 55% of the proteins in the database. This rate ranges from 33% for A.thaliana up to 68% for H.sapiens.
The ‘similarity-based annotation’ column of the table above contains the number of proteins annotated in each localization with the procedure described above (including the experimentally annotated sequences). In this case too, sequences that end up with multiple annotations are counted several times.
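The best-hit transfer with ties can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the database: the hits are assumed to be already parsed into tuples, and `max_evalue` is a caller-supplied cutoff, since no specific value is assumed here.

```python
def transfer_annotations(hits, annotations, max_evalue):
    """Transfer the annotation of the best-scoring BLAST match to each query.

    hits: iterable of (query_id, subject_id, evalue) tuples (hypothetical format).
    annotations: dict subject_id -> set of localization classes.
    max_evalue: cutoff chosen by the caller (assumption; not taken from the text).
    Returns dict query_id -> list of (subject_id, classes), keeping every
    subject tied at the best E-value.
    """
    best = {}  # query_id -> (best E-value seen so far, [subject_ids])
    for query, subject, evalue in hits:
        if evalue > max_evalue or subject not in annotations:
            continue
        if query not in best or evalue < best[query][0]:
            best[query] = (evalue, [subject])
        elif evalue == best[query][0]:
            best[query][1].append(subject)  # tie: report all best matches
    return {
        query: [(subject, annotations[subject]) for subject in subjects]
        for query, (_, subjects) in best.items()
    }
```

Queries with no match under the cutoff are absent from the result; these are the sequences left to the predictive pipeline.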
A large portion of the sequences, ranging from 32% in H.sapiens up to 67% in A.thaliana, has no similar counterpart with an annotated localization. In these cases the subcellular localization can be predicted with specifically suited methods.
To generate our annotation system we developed a pipeline comprising previously described methods, all based on machine learning tools, that have been shown to outperform most of the available predictors for the same task when rigorous cross-validation procedures are adopted (8). The pipeline is shown in Figure 1. First, membrane proteins are discriminated with Spep (19) and ENSEMBLE (20): the former is a neural network-based method for predicting the presence of a signal peptide, while the latter is a method based on neural networks and hidden Markov models for predicting the topology of all-alpha transmembrane proteins. When a signal peptide is predicted, it is cleaved from the sequence before predicting the presence and the location of the transmembrane helices. If no transmembrane helix is found, the uncleaved sequence is analyzed using BaCelLo (8), a recently developed tool for predicting the subcellular location of eukaryotic proteins. BaCelLo is based on a decision tree of support vector machines and discriminates four localizations in Metazoa and Fungi (cytoplasm, nucleus, extracellular and mitochondrion) and five localizations in Viridiplantae (the same four plus chloroplast).
Figure 1 Flow chart of the predicting pipeline adopted in eSLDB. SVM, support vector machine. BaCelLo, Spep and ENSEMBLE are predictive methods described previously (8,17,18).
At the end of the pipeline up to five localizations can be discriminated in Metazoa and Fungi and up to six in Viridiplantae. Although 17 types of localization are possible (see above), the number of discriminated localizations is reduced because of the lack of an adequate number of non-redundant examples for training. A novelty of BaCelLo is that it is the first method to take into consideration that the actual proportion of proteins targeted towards each compartment remains unknown, by adopting an equiprobability hypothesis and a balancing procedure (8).
The structure of the predictive system allows annotating the subcellular localization in a hierarchical way. First, all membrane proteins are separated from soluble ones; the latter are then divided into intracellular and secreted. Intracellular proteins are separated into nucleocytoplasmic and organellar; the former are then separated into cytoplasmic and nuclear while the latter, in the case of Viridiplantae, are further divided into mitochondrial and chloroplastic.
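The hierarchy of decisions can be illustrated with the sketch below. The `predictors` dictionary holds hypothetical boolean callables standing in for the actual methods (Spep, ENSEMBLE and the BaCelLo support vector machines), whose interfaces are not described here. Signal peptide cleavage is omitted for brevity.

```python
def predict_path(sequence, predictors, kingdom):
    """Walk the decision hierarchy and return the labels from root to leaf.

    predictors: dict of hypothetical callables (sequence -> bool) keyed by
    'membrane', 'secreted', 'organellar', 'nuclear' and 'chloroplast'.
    kingdom: 'Metazoa', 'Fungi' or 'Viridiplantae'.
    """
    # Level 1: membrane versus soluble proteins.
    if predictors["membrane"](sequence):
        return ["membrane"]
    path = ["soluble"]

    # Level 2: secreted versus intracellular.
    if predictors["secreted"](sequence):
        return path + ["secreted"]
    path.append("intracellular")

    # Level 3: organellar versus nucleocytoplasmic.
    if predictors["organellar"](sequence):
        path.append("organellar")
        # Only Viridiplantae are further split into two organellar classes.
        if kingdom == "Viridiplantae" and predictors["chloroplast"](sequence):
            return path + ["chloroplast"]
        return path + ["mitochondrion"]

    # Level 4: nucleus versus cytoplasm.
    path.append("nucleocytoplasmic")
    leaf = "nucleus" if predictors["nuclear"](sequence) else "cytoplasm"
    return path + [leaf]
```

Returning the whole path rather than only the leaf keeps the intermediate decisions available alongside the final class.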
The topology of the decision tree and the balancing procedure were adopted for maximizing the prediction performances as evaluated on testing sets independent of the training sets. The best scoring binary decisions are at the top of the tree, the worst-scoring at the bottom. This procedure minimizes the propagation of the errors through the hierarchy of the tree. The predictions are stored in the database along with the hierarchy of the decisions in the pipeline.
All the proteins of a genome, including those for which experimental and/or homology-based annotation is available, are annotated by means of the predictive method. The number of proteins predicted in each class is listed in the table below.
Number of sequences in the six predicted subcellular localizations
The table below contains the evaluation of the coverage and the accuracy of the prediction for the proteins of H.sapiens, as compared with both the experimental and the similarity-derived annotations. We considered 6444 unique proteins experimentally annotated and 25 134 unique sequences for which a similarity-based annotation is available. The table also lists the distribution of these proteins among the different classes. The coverage is computed as the fraction of correctly predicted sequences in each class over the number of proteins belonging to the class. The accuracy is the fraction of correctly predicted proteins over the total number of proteins predicted in the class. The agreement between the annotations and the prediction is good, especially when predictions are compared with the experimental annotations and the higher levels of the hierarchical prediction are considered.
Performance of the prediction pipeline as compared with the experimental and the similarity-based annotations
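The two per-class measures defined above can be computed as in the following sketch. The function name and the dictionary inputs are illustrative assumptions, and each protein is assumed to carry a single annotated class and a single predicted class.

```python
def coverage_and_accuracy(annotated, predicted, cls):
    """Compute coverage and accuracy for one class.

    annotated, predicted: dicts protein_id -> class label (hypothetical format).
    Only proteins present in `annotated` are evaluated.
    """
    in_class = [p for p, c in annotated.items() if c == cls]
    predicted_in_class = [p for p in annotated if predicted.get(p) == cls]
    correct = [p for p in in_class if predicted.get(p) == cls]

    # Coverage: correct predictions over proteins belonging to the class.
    coverage = len(correct) / len(in_class) if in_class else float("nan")
    # Accuracy: correct predictions over proteins predicted in the class.
    accuracy = (
        len(correct) / len(predicted_in_class)
        if predicted_in_class
        else float("nan")
    )
    return coverage, accuracy
```

With these definitions, coverage corresponds to what is often called recall or sensitivity, and accuracy to precision.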