In eukaryotes, the majority of proteins are encoded in the nuclear genome and translated on ribosomes in the cytosol. Proteins are then transported to different subcellular locations, such as the nucleus, mitochondria, chloroplasts, peroxisomes, etc., where they perform their particular roles in various biological processes. Knowledge of subcellular location is an important asset in the annotation of newly discovered proteins, as it bears clues about a protein's function. Further, knowing the location of proteins and their molecular function allows us to infer where in the cell the corresponding biological process takes place, what the physiological role of this process may be, and how the various processes are spatially integrated. Finally, information on the makeup of proteomes from bacteria-derived organelles (mitochondria and chloroplasts) helps to elucidate the migration of protein-coding genes from the endosymbiont to the host.
A variety of experimental approaches are available today for identifying the subcellular localization of proteins, for example, co-expression of fluorescent proteins [1], immunofluorescence labeling [3], gene knockout/knockdown [4], and proteomics techniques such as liquid chromatography-tandem mass spectrometry (LC-MS/MS) [5]. However, for most species, large-scale experimental identification of protein subcellular localization remains too expensive or unfeasible. This has set the stage for bioinformatics approaches to predict localization in silico.
Can the localization of a protein be confidently inferred by finding a homolog of known location with BLAST [7]? A previous study indicated that localization can be predicted with up to 90% accuracy when BLAST identity is 50% or more, but that it falls short for more distant sequences (e.g., only 50% accuracy for 20% local identity, Additional file 1). Further, this approach ignores established biological knowledge that homologous proteins are not necessarily located in the same cellular compartment. For example, homologous beta oxidation enzymes are targeted to mitochondria in humans and to peroxisomes in yeast [9]. Most importantly, the BLAST approach fails for divergent and novel proteins because they have no significant matches in databases (see Additional file 1). For all these reasons, the bioinformatics community turned to better-suited approaches for protein localization prediction.
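The homology-transfer idea discussed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the cited studies: the lookup table `KNOWN_LOCATION`, the function name `transfer_location`, and the hit tuples are hypothetical, and the 50% cut-off simply mirrors the identity level mentioned in the text.

```python
# Hypothetical lookup table: database protein ID -> annotated compartment.
KNOWN_LOCATION = {
    "P1": "mitochondrion",
    "P2": "chloroplast",
    "P3": "peroxisome",
}


def transfer_location(hits, min_identity=50.0):
    """Return the location of the best-scoring usable hit, or None.

    hits: list of (subject_id, percent_identity, bit_score) tuples,
    e.g. parsed beforehand from a tabular BLAST report (assumed format).
    """
    # Keep only hits that are similar enough and have a known location.
    usable = [
        h for h in hits
        if h[1] >= min_identity and h[0] in KNOWN_LOCATION
    ]
    if not usable:
        # Divergent or novel protein: no prediction is possible.
        return None
    best = max(usable, key=lambda h: h[2])  # highest bit score wins
    return KNOWN_LOCATION[best[0]]
```

A query with no hit above the threshold returns `None`, which is the failure mode described for divergent and novel proteins.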
Today, more than 20 dedicated tools are available for in silico protein localization prediction based on annotation or solely the sequence of proteins (Additional file 2). Annotation information includes textual description taken from the SWISSPROT database, the Gene Ontology database, or PubMed literature [10]. Also used for localization prediction is co-occurrence of functional motifs or structural domains in proteins [13]. Sequence-based tools recognize specific targeting signals that guide proteins to different cellular compartments [15]. Alternatively, proteins are classified according to single amino acid frequency [20], dipeptide and gapped amino acid pair composition [22], or physicochemical properties of amino acids [26]. More recently published predictors combine different protein features [27], or integrate annotation with sequence-based prediction [32]. Finally, meta-predictors combine predictions from several heterogeneous tools [34].
Two recent studies evaluated the performance of available localization predictors using datasets that contain only sequences neither included in, nor similar to, those in the training sets of these predictors [37]. One identified BaCelLo [39], LOCtree [29], Protein Prowler [18], TargetP [16], and Wolf-PSORT [40] as the best-performing tools, and the other ranked BaCelLo, YLoc [38], MultiLoc2 [32], and KnowPred [41] as best (for sequence features and computational methods used, see Additional file 2). In general, these tools perform worse on data from plants than on data from non-photosynthetic organisms such as animals and fungi, because plant cells contain both mitochondria and chloroplasts. Both organelles descend from endosymbiotic bacteria and have their own machineries for protein import, DNA replication, and gene expression. This makes it difficult for the tools to distinguish the proteins of the two organelles.
Localization prediction tools use full-length protein sequences that are usually inferred from genome sequence. Yet, for many eukaryotic groups of interest, only EST (Expressed Sequence Tag) data are available, and it is unlikely that their genomes will be sequenced soon [42]. (For relevant public databases see dbEST of NCBI [43], The Gene Index Project (TGI) database [44], and the Taxonomically Broad EST DataBase (TBestDB) [45].) When attempting to use available localization prediction tools on protein sequences conceptually translated from ESTs, we realized that prediction accuracy is generally very low. We tested the performance of seven state-of-the-art tools with proteins inferred from plant ESTs, and the overall accuracies were below 50% (Table ). This is not surprising, because these tools have been designed for full-length proteins and not for ESTs, which often represent only partial coding regions with an average length of ~200 residues. Further, EST-inferred proteins (referred to as EST-peptides from here on) may have an amino acid composition that differs from that of the corresponding full-length proteins. More importantly, EST-peptides often lack the N-terminal region of the corresponding proteins, which usually contains the targeting signal.
Performance of available tools and TESTLoc on plant EST-peptides
Finally, BLAST, which we showed above to be unsuited for localization prediction of full-length proteins, is equally unsuited for EST-peptide data. Even at sequence identity levels above 90%, the class-averaged accuracy for plant ESTs was below 75% (Additional file 1). In practice, the accuracy would be even lower, as EST projects often discover novel proteins that lack matches in databases. For example, in a large-scale protist EST project, more than 60% of ESTs had no informative matches [45]. For the ESTs from such projects, the overall accuracy of localization prediction by BLAST would thus be less than 30%.
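The 30% figure can be reproduced with a short calculation. The sketch below assumes that ESTs without an informative match receive no correct prediction, and it treats the class-averaged accuracy as if it applied uniformly to all matched ESTs; both are simplifications, and the function name is hypothetical.

```python
def overall_accuracy(fraction_with_match, accuracy_on_matched):
    """Upper bound on overall accuracy when unmatched ESTs are never
    predicted correctly (simplifying assumption of this sketch)."""
    return fraction_with_match * accuracy_on_matched


# More than 60% of ESTs without a match leaves at most 40% with one;
# accuracy on those was below 75%.
bound = overall_accuracy(fraction_with_match=0.40, accuracy_on_matched=0.75)
print(f"{bound:.2f}")  # prints 0.30
```

Because both inputs are themselves upper bounds, the product of 0.40 and 0.75 is an upper bound as well, which is why the expected overall accuracy is stated as less than 30%.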
We set out to develop a method that is tailored for predicting subcellular localization based on ESTs. As a test case we used plant data, which, as mentioned above, are more challenging than those from non-photosynthetic taxa. The methodology we developed can be readily applied to ESTs from any taxonomic group, and the models we constructed can be easily retrained with sequences from a particular taxon of interest.