To perform their function(s), proteins usually need to be localized to the specific compartment(s) in which they operate. Subcellular localization of proteins is typically achieved by sorting pathways involving carrier proteins. Disruption of these pathways leading to inaccurate localization plays an important role in several diseases, including cancer (Cohen et al., 2008; Kau et al., 2004; Gladden and Diehl, 2005), Alzheimer's disease (De Strooper et al., 1997), hyperoxaluria (Purdue et al., 1990), and cystic fibrosis (Skach, 2000). Thus, an important problem in systems biology is to determine how proteins are localized to their target compartments, the carriers and motifs that govern this localization, and the pathways that are being used.
Recent advances in fluorescence microscopy coupled with automated image-based analysis methods provide rich information about the compartments to which proteins are localized in yeast (Huh et al., 2003; Chen et al., 2007) and human (Osuna et al., 2007; Barbe et al., 2008; Newberg et al., 2009). Several computational methods have been developed to predict subcellular localization by integrating sequence data with other types of high-throughput data (Chou and Shen, 2008; Horton et al., 2007; Emanuelsson et al., 2007; Nair and Rost, 2005; Scott et al., 2005; Rashid et al., 2007; Bannai et al., 2002). These methods either treat the problem as a one-versus-all classification problem (Chou and Shen, 2008; Emanuelsson et al., 2007; Horton et al., 2007) or utilize a tree that corresponds to the current knowledge regarding intermediate compartments, for example, LOCtree (Nair and Rost, 2005), BaCelLo (Pierleoni et al., 2006), and discriminative HMMs (Lin et al., 2011). The tree-based methods were shown to be superior to the one-versus-all methods; however, these methods do not attempt to learn the sorting pathways, relying instead on current (partial) knowledge of protein sorting mechanisms.
A number of methods have learned decision trees for predicting subcellular localization. These include PSLT2 (Scott et al., 2005), which refines the location into sub-compartments using a decision tree learned from data, and YimLOC (Shen and Burger, 2007), which learns a decision tree for the mitochondrion compartment only using features that include predictions from SherLoc (Shatkay et al., 2007), an abstract-based localization classifier. While the decision trees generated by these methods are often quite accurate, they are not intended to reflect sorting pathways, and they utilize features that, while useful for classification, are not related to the biochemical process of protein sorting.
In contrast to the global localization prediction methods, several experimental studies have focused on assigning a specific sorting pathway to a small number of proteins. For example, proteins containing a signal peptide are exported through the secretory pathway (Lodish et al., 2003), while some proteins without a classical N-terminal signal peptide are found to be exported via the non-classical secretory pathway (Rubartelli and Sitia, 1997). A number of computational methods were developed to use this information to predict, for a given pathway, whether a protein goes through that pathway or not based on its sequence; examples include SignalP (Bendtsen et al., 2004b) and SecretomeP (Bendtsen et al., 2004a). However, these methods rely on the pathway as an input and cannot be used to infer new pathways.
Many methods have been developed for reconstructing pathways of other types, for example, signaling pathways (Ruths et al., 2008; Bebek and Yang, 2007; Scott et al., 2006) and metabolic pathways (Dale et al., 2010; Fischer and Sauer, 2005; Covert et al., 2004). These pathways are used to describe information flow: one protein senses the environment and, by activating a signaling or regulatory pathway, passes that information along so that the cell can mount a response. We focus on a completely different meaning of pathway: the physical movement of a specific protein. When referring to sorting pathways, we mean that a single protein is being carried from one location to another. Unlike information flow pathways, which involve different molecules along the way, physical sorting pathways always involve the same protein interacting with a set of different proteins. This makes it much more complicated to infer the order in which these interactions occur (since it is always the same protein). In addition, the outcome of an information flow pathway is often a change in gene expression, which can be readily measured using microarrays. In contrast, the outcome of a sorting pathway is the localization of a single protein (or a few proteins) to a compartment. Again, this requires different methods for inference. We are not aware of any prior article discussing computational methods for large-scale inference of pathways describing the physical movement of a protein.
While the above experimental methods provide some information on sorting pathways, no method exists for inferring global sorting pathways from current localization information. In this article, we show that, by integrating sequence, motif, and protein interaction data, we can develop global models for the process by which proteins are localized to subcellular compartments. We use a hidden Markov model (HMM) to represent sorting pathways. Carrier proteins and motifs are used to define internal states in this model, and the compartments serve as the final (goal) states. Using this model, we identified several sorting pathways, the carrier proteins that govern them, and the proteins that are being sorted according to these pathways. Simulation data indicate that the learned models are accurate (leading to 81% prediction accuracy with a noise level of 5%; see below). Using data from yeast, we show that our model leads to accurate classification of protein compartments while at the same time enabling us to recover many known pathways and the proteins that govern these pathways. The model also provides several new predictions representing new putative sorting pathways.
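To make the state layout described above concrete, the following is a minimal sketch of a path model with motif and carrier states as internal states and compartments as final states. It is not the model learned in this article: the state names, the transition probabilities, the single emission parameter P_PRESENT, and the helper functions emission, compartment_scores, and predict are all hypothetical and chosen only for illustration. A protein is represented as the set of motif and carrier features it carries, and the score of a compartment is the summed probability of all paths from the start state to that compartment (on this small acyclic graph, enumerating paths is equivalent to a forward pass).

```python
# Toy sorting-pathway model. All names and numbers below are hypothetical
# and illustrative; they are not taken from the article's learned model.

# Transition probabilities between states (start -> internal -> compartment).
TRANSITIONS = {
    "start": {"signal_peptide": 0.5, "nls_motif": 0.5},
    "signal_peptide": {"er_carrier": 1.0},
    "er_carrier": {"ER": 0.6, "Golgi": 0.4},
    "nls_motif": {"importin": 1.0},
    "importin": {"nucleus": 1.0},
}

# Final (goal) states.
COMPARTMENTS = {"ER", "Golgi", "nucleus"}

# Assumed probability that a protein passing through an internal state
# actually has that state's motif or carrier interaction.
P_PRESENT = 0.9


def emission(state, features):
    """Probability of the protein's observed feature at a given state."""
    if state == "start" or state in COMPARTMENTS:
        return 1.0  # start and final states emit nothing
    return P_PRESENT if state in features else 1.0 - P_PRESENT


def compartment_scores(features):
    """Sum path probabilities from start to each compartment, then normalize."""
    features = frozenset(features)
    scores = {c: 0.0 for c in COMPARTMENTS}

    def walk(state, prob):
        prob *= emission(state, features)
        if state in COMPARTMENTS:
            scores[state] += prob
            return
        for nxt, p in TRANSITIONS[state].items():
            walk(nxt, prob * p)

    walk("start", 1.0)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def predict(features):
    """Return the compartment with the highest normalized score."""
    scores = compartment_scores(features)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(predict({"nls_motif", "importin"}))
    print(predict({"signal_peptide", "er_carrier"}))
```

In this sketch the structure and parameters are fixed by hand. The task described in this article is harder, because both the structure of the pathways and the parameters must be learned from localization data.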
FIG. 3. (A) Testing error of simulated dataset generated from a structure with 25 states with varying levels of noise (false positive and false negative in features). The training sample size was fixed at 1400. (B) Testing error versus different training sample ...