An important challenge in systems biology is to build detailed models of cell organization that provide accurate predictions of cell behaviors. Since many (if not all) of the proteins expressed by a given cell are likely to require proper subcellular localization in order to make their contributions to those behaviors, systems for assigning locations to proteins are critically needed. The problem is further complicated by the fact that protein locations may vary between cell types or within the same cell type under different conditions. For example, changes in protein subcellular location are associated with oncogenesis and differentiation [1
]. This implies that assignment of proteins to a fixed location or set of locations will not be sufficient. Given that many proteins are found in only a specific region of an organelle and that these refined localizations are conserved across species, it is also not likely that assignments at the level of whole organelles will be sufficient either.
Extensive work has been done on proteome-scale determination of location at various levels of resolution both by fractionation [3
] and by microscopy [4
]. Since collection of information for tens of thousands of proteins for perhaps hundreds of cell or tissue types under many conditions may be infeasible, computational prediction of location is an important alternative. Furthermore, knowledge of the sequence elements within proteins that determine location will be useful for modeling how changes in location are regulated.
A number of methods have been proposed for using sequence information to predict localization. These include WoLF PSort [8
], TargetP [9
], LOCtree [10
], PSLT2 [11
], TBpred [12
], and iPSORT [13
]. While useful, some of these methods (e.g. LOCtree and WoLF PSort) are based on general sequence characteristics (GC content etc.) and thus it is hard to interpret the sequence features that lead to accurate classification in terms of localization mechanism. Some (e.g. TargetP and WoLF PSort) are based on known motifs, making it hard to correctly classify proteins that lack the known motifs. PSLT2 considers all motifs in the InterPro database [14
], but does not try to search for novel targeting motifs. Beside known motifs TargetP also uses the motif finder MEME [15
] to characterize manually curated special cases of the mitochondrial targeting signal [9
]. This procedure is necessary because no well defined sequence motif has been previously found for the mitochondrial targeting signal, but it cannot discover novel targeting motifs. TBpred uses MEME to identify one motif overrepresented in each of the four subcellular locations of mycobacterial proteins [12
], which is arguably not enough to explain all targeting pathways. Such a motif could also appear in other locations and therefore may not be associated with localization. TBpred made no attempt to examine or interpret the biological meaning of the identified motif. iPSORT can discover two type of features, amino acid properties and pattern matching. However amino acid properties (e.g. hydrophobic or hydrophilic) may be a result of the biochemical characteristics of the compartments, and do not provide as much information on the protein sorting mechanism as motifs do. iPSORT can discover patterns, but the patterns are not as expressive as common motif representation like regular expression.
To address these problems we developed a new method for searching among proteins that are known to localize to a certain compartment for motifs that may govern their localization patterns. We represent protein sequence motifs by profile hidden Markov models (HMMs), which models local alignments using match, insert, and delete states [16
]. Profile HMMs have been successfully utilized to model protein families and domains, and they are used to represent domains in the Pfam database [17
]. Unlike position weight matrices (PWMs) (for example, those used by MEME [15
]), profile HMMs allow for variable length insertions and deletions that are common in protein motifs, for example the nucleoplasmin nuclear location sequence [18
] and the sequence targeting proteases to the food vacuole in P. falciparum
]. Unlike regular expressions, which have also been used to represent such motifs, profile HMMs can assign different frequencies to each amino acid and are thus more expressive.
Traditional motif finding algorithms start by assembling a subset of sequences (for example, all proteins in the same compartment) and then searching for motifs in those sequences. These methods typically utilize generative models that attempt to model the process by which the motifs were generated based on simplifying assumptions. Generative motif finding methods and models for proteins include MEME [15
] and NestedMICA [20
] using PWMs, and HMMER [16
] using profile HMMs, among others. While useful, these methods do not use important information about the negative set (sequences that are assigned to other compartments) when constructing the models. Such information may be useful for building refined models of the differences between similar compartments.
Relatively little work has focused on a different approach: discriminative learning of probabilistic motif models (e.g. profile HMMs). Discriminative methods search for motifs that are present in one class (positive set) but absent in other classes (negative set). Such methods have been applied to search for DNA binding motifs represented by PWM, including gradient ascent optimizing conditional likelihood [21
], heuristic hill-climbing [22
], and enumerative search on a discrete space of PWM [23
]. Recently PWM is applied to protein motifs as well as DNA motifs in the DEME algorithm using a combination of substring search and conjugate gradient [24
]. However PWM does not allow insertion and deletion, making it less than optimal for protein sequence analysis. Here we develop and apply a discriminative motif finding algorithm which utilizes HMMs that are constructed to optimize a discriminative criteria, the conditional likelihood of the sequences given the motifs. We used maximal mutual information estimate (MMIE), a technique that was initially applied to speech recognition, to train these HMMs discriminatively. Our models select motifs that are unique to the different compartments. In addition to their use for classification they may also provide information about the function of the proteins in each compartment or the mechanisms involved in targeting these proteins to their cellular locations.
Since subcellular localization is determined by a hierarchical protein sorting process, several methods have been developed that utilize a hierarchical structure to improve prediction of protein localization [10
]. We apply such structures to motif discovery, rather than only prediction, by searching for discriminative motifs at every split (internal nodes) on the hierarchical compartment structure in . This allows us to take full advantage of biological knowledge regarding the organization of compartments within a cell.
Hierarchical structure of compartments based on cellular sorting.
For subcellular compartment classification, our discriminative HMM method that does not utilize any prior motif information improves upon methods that use a list of known motifs. We also show that incorporating the protein sorting hierarchy results in better prediction on average. Our method was able to recover known motifs and to suggest new motifs for various compartments. These new motifs are more conserved than average amino acids in agreement with their predicted role in protein localization. Using our predicted motifs we were also able to reassign a number of proteins to new compartments, correcting what we believe are errors in current annotation databases.