Identification of motif-containing protein sequences with known subcellular localizations
We started with a dataset of 2344 profile and pattern sequence motifs. Each motif is linked to a set of true positive (TP) sequences (motif-containing proteins known to belong to the motif family), and a set of false positive (FP) sequences (proteins possessing the motif sequence but not its function). In order to analyze the subcellular localization data linked to each motif, we first assigned one or more subcellular localizations to each protein in the database (Figure ) as detailed in the Methods section. The most frequently assigned compartments were the cytosol, followed by the cell membrane, secreted proteins (extracellular), and the nucleus, probably because they are the largest cellular compartments. When motifs with multiple subcellular localizations were considered, the frequency of the ER and Golgi apparatus (GA) increased significantly, consistent with their role as transition compartments for a large number of cytoplasmic, membrane-bound and secreted proteins.
Figure 1 Subcellular localization frequency of Swiss-Prot database proteins. The number of sequences assigned to each localization is shown together with its relative percentage in brackets. The “Multiple” category represents the number of sequences ...
Next, we assigned subcellular localizations to both TP and FP protein sets for each motif. Using this method, one or more localizations were assigned to 60-61% of sequences (Table ), although the TP and FP coverage for each motif was heterogeneous. Of the 2344 motifs analyzed, 299 had no subcellular localization data assigned to their sequences, with higher coverage in pattern versus matrix motifs (Figure ). We found a low number of motifs (28) whose TP sequences had no subcellular localization data, of which virtually all were pattern motifs. In contrast, a high number of motifs (1685) completely lacked subcellular localization data for FP sequences. Finally, only 7 motifs had the same single subcellular localization for both TP and FP sequences, all of which were pattern motifs.
Number of motifs in PROSITE and number of true and false positive proteins for each type of motif (Matrix/Pattern)
Figure 2 Relative percentage of motifs with subcellular localization of sequences in TP and FP sets. The number of motifs in each category, according to the results, is shown. The motifs are separated by matrices and patterns, and the total number is also shown ...
Assignment of subcellular localizations to sequence motifs
We wanted to independently assign subcellular localizations to TP and FP sequence sets for each motif. To do this we compared the relative frequency of subcellular localizations in each set of TP and FP sequences against the frequency in the whole database. A subcellular localization was assigned to the motif only when its frequency in the set was higher than expected (that is, higher than its frequency in the database). In this way we assigned one or multiple compartments (from 1 to 6) to 96% of the motifs with enough TP sequences for analysis. A single subcellular localization was assigned to 69% of motifs, while 18% were assigned two different localizations (Table ). The results for patterns and matrices were very similar.
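The assignment rule described above can be sketched as follows. This is only a minimal illustration, not the pipeline used in the study: the function name `assign_localizations`, the input layout, and the example frequencies are all hypothetical.

```python
from collections import Counter

def assign_localizations(set_locs, db_freq):
    """Return the localizations over-represented in a sequence set.

    set_locs: list of localization labels, one per annotation in the
              TP (or FP) set of a motif (hypothetical input layout).
    db_freq:  dict mapping localization -> relative frequency in the
              whole database.
    """
    counts = Counter(set_locs)
    total = sum(counts.values())
    if total == 0:
        return []  # no localization data for this set
    assigned = []
    for loc, n in counts.items():
        # Assign only if the observed frequency exceeds the expected one.
        if n / total > db_freq.get(loc, 0.0):
            assigned.append(loc)
    return sorted(assigned)

# Hypothetical database frequencies and TP annotations.
db_freq = {"nucleus": 0.20, "cytosol": 0.40,
           "cell membrane": 0.25, "secreted": 0.15}
tp_locs = ["nucleus"] * 8 + ["cytosol"] * 2
print(assign_localizations(tp_locs, db_freq))
```

With these made-up numbers, the nucleus (80% in the set versus 20% in the database) is assigned, whereas the cytosol (20% versus 40%) is not.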
Frequency and percentage of subcellular localizations assigned to motifs
Next we tested whether the subcellular localizations of TP and FP sequences were significantly different (heterogeneous) from each other. To this end, the probability value for each motif was calculated using Fisher’s exact test for 2×c contingency tables. Overall, 78% of the available motifs had a significant p-value (253 out of 325), indicating a high degree of heterogeneity between TP and FP compartments (Figure ). Moreover, this heterogeneity was strongly related to pattern motifs, with 82% having a significant p-value versus 52% for matrix motifs (Figure ).
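For illustration, an exact test on a 2×c table of TP and FP counts per compartment can be sketched in pure Python. This assumes the usual two-sided definition (the Freeman-Halton extension), in which the p-value sums the probabilities of all tables with the observed margins that are no more likely than the observed table. The function name and the example counts are hypothetical, and the source does not state which software was actually used.

```python
from itertools import product
from math import comb

def fisher_exact_2xc(tp_counts, fp_counts):
    """Two-sided exact p-value for a 2 x c table (TP row, FP row).

    With all margins fixed, the TP row follows a multivariate
    hypergeometric distribution, so each table is determined by
    its first row. Enumeration is exponential in c and is only
    practical for small tables.
    """
    cols = [t + f for t, f in zip(tp_counts, fp_counts)]
    n_tp = sum(tp_counts)
    denom = comb(sum(cols), n_tp)

    def weight(row):
        # Integer numerator of the table probability.
        w = 1
        for c, a in zip(cols, row):
            w *= comb(c, a)
        return w

    observed = weight(tp_counts)
    total = 0
    for row in product(*(range(c + 1) for c in cols)):
        if sum(row) == n_tp:
            w = weight(row)
            if w <= observed:  # as extreme as, or more extreme than, observed
                total += w
    return total / denom

# Hypothetical counts for three compartments (not taken from the paper).
p_value = fisher_exact_2xc([8, 1, 1], [1, 4, 5])
print(p_value)
```

Comparing integer numerators rather than floating-point probabilities avoids tolerance issues when deciding which tables count as at least as extreme as the observed one.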
Once the calculations had been performed, a table summarizing our analysis was produced for each motif (example motif tables are shown in Figure ). Each table independently lists the number of sequences assigned to each subcellular class for TP and FP sets, and highlights the most significant compartments. Tables for all the motifs can be found in Additional file 1 and Additional file 2. The p-value obtained from Fisher’s exact test is also shown.
Figure 3 Example tables with results for each motif. For each motif: Accession, Description, Type (matrix or pattern), and consensus sequence is shown. Tables show the number of proteins annotated for each localization, and separated by TP and FP sequences. When ...
Distribution of motif sequences between related subcellular compartments
Given the high degree of interdependency between cellular structures and processes, we expected to find functionally-linked TP proteins in related compartments. About 19% of motifs have TP sequences distributed between two different subcellular classes (see Table ). We tested these compartment pairs, and found that they were frequently linked (Figure A). The most frequent pairs were evolutionarily-related compartments such as mitochondrion and chloroplast, or compartments that share protein and molecular transit such as cytosol and nucleus or cell membrane and extracellular.
Figure 4 Heat map with the number of motifs assigned to pairs of localizations. Numbers represent the number of motifs assigned to two different localizations from higher (red) to lower (yellow) frequency. Numbers in black cells represent the number of motifs ...
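The counts behind such a pair heat map could be tallied as in the sketch below. The function name, the motif identifiers, and their assignments are hypothetical and serve only to show the counting step.

```python
from collections import Counter

def count_localization_pairs(motif_locs):
    """Count how many motifs are assigned to each pair of localizations.

    motif_locs: dict mapping motif ID -> list of assigned localizations.
    Only motifs with exactly two distinct localizations are counted.
    """
    pairs = Counter()
    for locs in motif_locs.values():
        unique = sorted(set(locs))
        if len(unique) == 2:
            # Sorting makes (A, B) and (B, A) the same key.
            pairs[tuple(unique)] += 1
    return pairs

# Hypothetical motif assignments.
motif_locs = {
    "M1": ["cytosol", "nucleus"],
    "M2": ["nucleus", "cytosol"],
    "M3": ["mitochondrion", "chloroplast"],
    "M4": ["nucleus"],
}
pair_counts = count_localization_pairs(motif_locs)
for pair, n in pair_counts.most_common():
    print(pair, n)
```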
In some cases, multiple compartments were assigned to individual proteins. Thus, it is possible that our assignment of multiple subcellular localizations for individual motifs may be influenced by motif-containing proteins localized to multiple compartments. To test this possibility, we repeated our assignment of protein sequences to motifs but excluded sequences present in more than one compartment. The compartment pairs obtained in this way gave similar results to the previous analysis (Figure B), albeit with a lower number of pairs due to the reduced number of protein sequences used. In the second analysis, the ER appeared together with membrane and the cytosol, in addition to the nucleus. In fact, the ER, together with the GA, appeared linked to other compartments at a higher frequency than alone (1SL-2SL in Figure ).
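The exclusion step in this control analysis amounts to a simple filter over the protein annotations. The sketch below assumes a hypothetical mapping from protein identifiers to their annotated compartments; it is not the authors' code.

```python
def single_compartment_only(protein_locs):
    """Keep only proteins annotated with exactly one compartment.

    protein_locs: dict mapping protein ID -> list of localizations.
    """
    return {
        pid: locs
        for pid, locs in protein_locs.items()
        if len(set(locs)) == 1
    }

# Hypothetical protein annotations.
protein_locs = {
    "P1": ["nucleus"],
    "P2": ["cytosol", "nucleus"],   # excluded: two compartments
    "P3": ["cell membrane"],
    "P4": [],                       # excluded: no annotation
}
filtered = single_compartment_only(protein_locs)
print(sorted(filtered))
```

The motif assignment and pair counting would then be repeated on the filtered set, which is why fewer pairs are expected in the second analysis.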
Next, we extended this analysis by looking at the relationship between the subcellular localizations of motifs assigned to more than two compartments. Compartment heat maps were generated for motifs with 3, 4, or 5 different TP localizations. The ER clustered with most other regions (Figure ), consistent with its complex relationships with multiple cellular compartments.
Figure 5 Heat map with the number of motifs assigned to more than two localizations. Each row shows a set of localizations (cells marked in black) when these are jointly assigned to the motifs: (A) 3 localizations, (B) 4 localizations, (C) 5 localizations. Numbers ...
Non-random distribution of FP protein localization may indicate sequence convergence
We have shown that TP and FP proteins have a strong tendency to differ in subcellular localization. This is expected given that true protein family members will generally be located in similar cellular regions to carry out their common functions. Conversely, if FPs are completely unrelated to the motif family and result from random sequence similarities, then we would not expect a strong bias in their subcellular distribution. However, we found several examples of motifs where FP sequences were concentrated in particular compartments. For example, the “Homeobox domain signature” motif (PROSITE:PS00027) was found in 1290 nuclear proteins (Figure A) where this pattern allows DNA binding through a helix-turn-helix type structure (PROSITE:PDOC00027). However, this motif was also found within 6 transmembrane proteins (false positives: 5 in the cell membrane and 1 in the mitochondrion membrane) with different known functions (Figure A). In these proteins, the homeobox motif overlaps a transmembrane region of 20 amino acids, according to the annotations in the Swiss-Prot database. This suggests that the motif has a different function in membrane-associated proteins. Another example is the “MCM family signature” (PROSITE:PS00847) for minichromosome maintenance proteins involved in the initiation of ATP-dependent DNA replication. This pattern is a particular version of the B motif found in ATP-binding proteins, and is also found in 4 false positives from bacteria located in the cell inner membrane: 2 xanthine phosphoribosyltransferases and 2 glycerol-3-phosphate import ATP-binding proteins (Figure B). Again, given that their localization is unrelated to that of the nuclear true positives, it is likely that the motif in these 4 proteins arose independently during evolution.
Figure 6 Keywords found in FP sequences. Swiss-Prot keywords with frequencies higher than one from groups of FP sequences: (A) Homeobox domain signature (6 FP sequences assigned to cell membrane and mitochondrion), where Transmembrane helix, Transmembrane and ...
Motif sequences can occasionally be present in cell compartments other than those their associated function would indicate. In some cases this might suggest a common evolutionary origin. The “Endoplasmic reticulum targeting sequence” motif (PROSITE:PS00014) is a short C-terminal sequence (frequently the four-amino-acid sequence KDEL in vertebrates, or the consensus [HAD]DEL in yeasts) often found within proteins that accumulate in the lumen of the ER. We found this motif strongly linked to the ER, as expected, although some TP sequences also localize to other compartments. However, we also found FP sequences linked to the vacuole, where three proteins have this motif at their C-terminus (Figure C). We could hypothesize that the motif might still be involved in vesicle transport even though the proteins have not been reported as accumulating in the ER, or that they may play a modified but related function in the vacuole. Therefore, the C-terminal motifs in the FP sequences are likely to share a common evolutionary origin with the motif in TP sequences.
Interestingly, another 43 FP proteins with the same ER targeting motif (consensus [SQHA][QDEN]EL) at their C-terminus are localized to the nucleus and mainly involved in nucleosome biology and DNA repair (Figure B). The role of KDEL-like motifs in vesicle transport and ER retrieval has only been reported for cytoplasmic proteins and there is no evidence to link the function of these proteins to the nucleus. Thus, in contrast to the vacuolar proteins, it is unlikely that the motifs present in the nuclear FP sequences are evolutionarily related to the TP sequences.
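A C-terminal match of this kind can be checked with an anchored regular expression. The sketch below uses the [SQHA][QDEN]EL consensus quoted above as the default pattern. The function name and the example sequences are hypothetical, and the full PROSITE pattern for PS00014 may differ from this consensus.

```python
import re

def has_cterm_motif(sequence, consensus="[SQHA][QDEN]EL"):
    """Return True if the sequence ends with the given consensus.

    The '$' anchor restricts the match to the C-terminus, so an
    internal occurrence of the same residues does not count.
    """
    return re.search(consensus + "$", sequence) is not None

# Hypothetical sequences (not real Swiss-Prot entries).
print(has_cterm_motif("MAAAGTSDEL"))    # ends with SDEL
print(has_cterm_motif("SDELMAAAGT"))    # SDEL is internal only
print(has_cterm_motif("MAAAGTKDEL", consensus="KDEL"))
```

Matching on sequence alone says nothing about localization or function, which is why the FP hits discussed above still have to be interpreted against their annotated compartments.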
In conclusion, the methodology presented in this work provides a rapid way of identifying motif-containing sequences associated with different cellular compartments that gives valuable information regarding the probable function of a motif and its evolutionary origin.