The nucleolus was initially characterised over four decades ago and shown to be the site of ribosome subunit production [1]. It is now known to play a role in other cellular activities, including the assembly of diverse ribonucleoprotein particles (RNPs), the regulation of cell cycle progression and proliferation, and the response to numerous forms of cellular stress [2]. All of the proteins that are strongly enriched in the nucleolus, including marker proteins such as fibrillarin, can nonetheless cycle continually in and out of the nucleolus, as shown by photobleaching experiments [7]. In addition, many of the processes that occur, at least in part, in the nucleolus require the relocation of proteins to this nuclear sub-compartment. Many proteins are able to relocate conditionally between the nucleolus and either the nucleoplasm or other nuclear sub-compartments [3]. In addition to these 'part-time' nucleolar proteins, which remain in the nucleus, many proteins are known to travel between the cytoplasm (including cytoplasmic organelles) and the nucleolus. These include ribosomal and non-ribosomal proteins that travel to the nucleolus for assembly into ribosome subunits and other RNPs, respectively, as well as many growth factors and cell cycle regulators [2]. The nucleolus thus accommodates a large amount of traffic and its composition is very dynamic, which may be facilitated by its lack of a surrounding membrane [6].
Recent large-scale proteomics experiments have detected thousands of distinct proteins that stably co-purify with nucleoli isolated from human cells [9]. Although the first datasets defining the nucleolar proteome did not indicate what proportion of each protein resides in the nucleolus relative to other cellular compartments, this information has now been obtained in a high-throughput manner using a combination of cellular fractionation and SILAC protocols [12]. These data indicate that although thousands of distinct proteins are detected in the nucleolus, their degree of association with the nucleolus is variable. Some proteins are predominantly nucleolar while others, although detected in small numbers in the nucleolus and annotated as such in large databases, are present in much larger numbers in other cellular compartments. These proteomics data give a snapshot of the content of the nucleoli of a population of one cell type under specific conditions. In comparison to the first nucleolar proteome datasets [9], they provide a much clearer picture of the dynamic protein content of the nucleolus and its relationship with other cellular compartments. This methodology also offers the possibility of distinguishing nucleolar-enriched proteins from proteins that cycle between the nucleolus and other cellular locations or that localise to the nucleolus only under certain conditions. However, because only one cell type and a small number of conditions have been examined so far, and because the methodology does not yet offer full proteome coverage, the dynamic nucleolar proteome has still not been fully defined. Here, we investigate how a computational method can help fill this gap.
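As a rough illustration of how a quantitative nucleolar-to-elsewhere abundance ratio could be turned into a degree of association, the following minimal Python sketch bins a protein by its log2 ratio. The function name `association_degree`, the three labels and the two-fold cutoff are assumptions made for this example only; they are not taken from the proteomics studies cited above.

```python
import math


def association_degree(nucleolar, elsewhere, fold=2.0):
    """Bin a protein by its nucleolar / non-nucleolar abundance ratio.

    `nucleolar` and `elsewhere` are relative abundances (any positive
    units). `fold` is a hypothetical cutoff chosen for illustration.
    """
    if nucleolar <= 0 or elsewhere <= 0:
        raise ValueError("abundances must be positive")
    log_ratio = math.log2(nucleolar / elsewhere)
    cutoff = math.log2(fold)
    if log_ratio >= cutoff:
        return "predominantly nucleolar"
    if log_ratio <= -cutoff:
        return "predominantly elsewhere"
    # Ratios between the two cutoffs are treated as shared localisation.
    return "shared"
```

Working on the log2 scale makes the cutoff symmetric, so an eight-fold enrichment and an eight-fold depletion are treated as equally strong evidence in opposite directions.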
The prediction of eukaryotic protein subcellular localisation has been extensively investigated over the past decade using various machine learning methods and many diverse protein characteristics (reviewed in [13]). However, while many such predictors exist, most do not consider the nucleolus as a separate localisation: very few whole-cell predictors include the nucleolus in the list of cellular compartments to which they predict localisation [14]. Several nuclear-centric mammalian protein localisation predictors have been created to predict membership of one of at least four nuclear sub-compartments, including the nucleolus [18]. However, proteins annotated as being in more than one subnuclear compartment are often not considered, which substantially decreases the predictors' actual coverage of the nuclear proteome. Because the individual nuclear subcompartments are not membrane-enclosed, a significant proportion of nuclear proteins is expected to diffuse between these subcompartments and to be detected and annotated as present in several of them. These nuclear-centric predictors are therefore unlikely to model the localisation patterns of nuclear proteins realistically.
The prediction of nucleolar protein localisation has been investigated mainly in the context of a binary classification problem where proteins are predicted to be either associated with the nucleolus, or not. Such studies include a predicted nucleolar complex dataset, which is based on the clustering of protein-protein interactions, involving human proteins either detected experimentally in the nucleolus, or predicted to be nucleolar using a neural network [23]. More recent studies include a naïve Bayesian classifier trained to predict yeast nucleolar proteins and ribosomal components [24], a sequence-based support-vector machine predictor that differentiates between nucleolar associated and non-nucleolar associated nuclear mammalian proteins [25], as well as a kernel canonical correlation analysis predictor based on genomic sequence and protein-protein interaction data that also differentiates between nucleolar associated and non-nucleolar associated nuclear mammalian proteins [26].
Recent efforts to predict nucleolar association acknowledge the fluidity of the nucleolus and its close relationship with other cellular regions, but do not model different degrees of protein association with the nucleolus. To build on these efforts, we investigate here the possibility of classifying the degree of nucleolar association of human proteins by integrating various genomic and protein features in a Bayesian framework. More precisely, we predict whether proteins are highly nucleolar-enriched, highly non-nucleolar, nucleolar-nucleoplasmic or nucleolar-cytoplasmic (see Figure 1). The last two groups include proteins that localise to other cellular regions and either cycle to the nucleolus or relocate to it under specific conditions. To perform this classification, we consider several protein features, including the frequency of specific amino acids in the protein sequence; the predicted presence of signal peptides, mitochondrial targeting peptides and nucleolar localisation sequences; expression data; Gene Ontology (GO) annotations; and the subcellular localisation annotations of protein interactors.
Figure 1. Protein nucleolar association classes considered. PNAC classifies human proteins into four distinct classes according to their degree of nucleolar association. The nucleolar-enriched protein group (red) consists of proteins that are predominantly nucleolar ...
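To make the idea of combining several protein features into a probabilistic four-class prediction more concrete, here is a minimal naive Bayes sketch in Python. It is an illustration only and is not the Bayesian framework used in this work: the binary feature names (`predicted_nols`, `signal_peptide`, `nucleolar_interactor`), the toy training examples and the Laplace smoothing constant are all assumptions invented for the example.

```python
import math
from collections import Counter, defaultdict

CLASSES = ["nucleolar-enriched", "nucleolar-nucleoplasmic",
           "nucleolar-cytoplasmic", "non-nucleolar"]


def train(examples, alpha=1.0):
    """Count class and feature frequencies.

    `examples` is a list of (features, label) pairs, where `features`
    maps a hypothetical binary feature name to True/False.
    """
    class_counts = Counter(label for _, label in examples)
    true_counts = defaultdict(Counter)  # true_counts[label][feature]
    names = set()
    for features, label in examples:
        for name, value in features.items():
            names.add(name)
            if value:
                true_counts[label][name] += 1
    return {"class_counts": class_counts, "true_counts": true_counts,
            "features": sorted(names), "alpha": alpha, "n": len(examples)}


def posterior(model, features):
    """Return P(class | features) under the naive independence assumption."""
    alpha = model["alpha"]
    log_scores = {}
    for label in CLASSES:
        n_c = model["class_counts"][label]
        # Laplace-smoothed class prior.
        log_p = math.log((n_c + alpha) / (model["n"] + alpha * len(CLASSES)))
        for name in model["features"]:
            p_true = (model["true_counts"][label][name] + alpha) / (n_c + 2 * alpha)
            present = features.get(name, False)
            log_p += math.log(p_true if present else 1.0 - p_true)
        log_scores[label] = log_p
    # Normalise in log space to avoid underflow.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / total for label, s in log_scores.items()}


# Toy, made-up training data for illustration only.
toy = [
    ({"predicted_nols": True, "signal_peptide": False, "nucleolar_interactor": True}, "nucleolar-enriched"),
    ({"predicted_nols": True, "signal_peptide": False, "nucleolar_interactor": True}, "nucleolar-enriched"),
    ({"predicted_nols": False, "signal_peptide": False, "nucleolar_interactor": True}, "nucleolar-nucleoplasmic"),
    ({"predicted_nols": True, "signal_peptide": False, "nucleolar_interactor": False}, "nucleolar-cytoplasmic"),
    ({"predicted_nols": False, "signal_peptide": True, "nucleolar_interactor": False}, "non-nucleolar"),
    ({"predicted_nols": False, "signal_peptide": True, "nucleolar_interactor": False}, "non-nucleolar"),
]
model = train(toy)
probs = posterior(model, {"predicted_nols": True, "signal_peptide": False,
                          "nucleolar_interactor": True})
best = max(probs, key=probs.get)
```

Returning a full posterior rather than a single label keeps the output interpretable: a protein with comparable probabilities for two classes can be flagged as ambiguous instead of being forced into one group.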