Biological regulatory mechanisms and cellular responses are predominantly mediated by proteins and multi-protein complexes. The structures and properties of these proteins are crucial for their function and can vary greatly. For example, protein expression levels in mammalian cells vary over a large dynamic range of 10^6 or more (1), and subcellular localization patterns, post-translational modifications, rates of synthesis and degradation, and interactions with partner proteins are also variable properties (2). Furthermore, these properties not only vary between proteins; they are also dynamic and can vary for the same protein at different times or in different subcellular locations, depending on parameters such as cell cycle progression, growth rate, and signaling events.
In higher eukaryotes, many genes encode two or more separate protein isoforms (3). Even minor structural differences between isoforms can alter their biological properties and result in distinct pools of related proteins whose subcellular location, function, and interactions vary (5). Furthermore, even apart from isoforms, single polypeptides can partition into two or more distinct functional pools within the cell that have different roles. For example, a single isoform of protein phosphatase I can interact with numerous different partners to create different phosphatase enzymes that target different substrates (7). Proteomes are thus inherently complex and their properties are in constant flux. This presents a major challenge for proteomic studies, as we aspire not only to identify which proteins are expressed in a cell or organelle, but also to characterize their properties and quantify how these change in response to factors such as different perturbations and cell cycle stages (8).
Alternative splicing of pre-mRNA transcripts is commonplace and can generate multiple mRNAs from the same gene, and hence multiple different proteins (9). Shoemaker et al. have reported that over 73% of all human genes are alternatively spliced (11). Such isoforms can vary in length, share common exons, include variable exons, and even have very different amino acid sequences, because splicing events can alter the translational reading frame of the differentially spliced mRNAs. It is estimated that 15% of all point mutations causing human genetic disease result in an mRNA splicing defect (12). Isoforms can also arise from differential post-translational processing and modification (6) of a polypeptide encoded by a single mRNA. In other cases, gene duplication results in expression of closely related protein paralogs that share extensive sequence identity and may thus be hard to distinguish by MS, depending on the number and structures of the peptides that carry the sequence differences between these paralogs.
The structural and functional diversity of the expressed proteome in multicellular eukaryotes is thus generated by a combination of alternative splicing and other processes, such as the use of alternative transcription start sites, alternative polyadenylation, RNA editing, and SNPs, as well as complex patterns of post-translational modification and cleavage events (14). Although mass spectrometry has revolutionized the efficient and sensitive detection and quantitation of cell proteins, interpreting the results is complicated by the fact that proteins are typically extracted from cells and digested into peptides before MS analysis. This affects the interpretation of the resulting data because the same peptide sequence can be present in multiple different proteins or protein isoforms (3). As noted above, the same peptides may even belong in vivo to functionally distinct pools of the same protein. Such shared peptides can therefore lead to ambiguities, both in determining the identities of proteins and in reliably measuring their functional properties. It will therefore be helpful to develop methods that can distinguish between different isoforms and functionally distinct protein pools when interpreting MS data.
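The shared-peptide ambiguity described above can be illustrated with a minimal sketch. It assumes a simplified in silico trypsin rule (cleavage after K or R, except before P, with no missed cleavages), and the two isoform sequences and the names `tryptic_digest` and `classify_peptides` are hypothetical, invented purely for illustration; this is not the procedure used in the study.

```python
from collections import defaultdict


def tryptic_digest(sequence, min_length=5):
    """Simplified trypsin rule: cleave after K or R unless the next residue is P.

    No missed cleavages are modeled; peptides shorter than min_length are dropped.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return [p for p in peptides if len(p) >= min_length]


def classify_peptides(proteins):
    """Map each peptide to the proteins containing it, then split shared from unique."""
    owners = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            owners[peptide].add(name)
    shared = {p: names for p, names in owners.items() if len(names) > 1}
    unique = {p: next(iter(names)) for p, names in owners.items() if len(names) == 1}
    return shared, unique


# Hypothetical isoforms that differ only in one internal (variable exon) segment.
isoforms = {
    "isoform_a": "MASTEELKGLDNVRSPLTWQAEKHHGTLLEK",
    "isoform_b": "MASTEELKGLDNVRQQFYCDSEKHHGTLLEK",
}

shared, unique = classify_peptides(isoforms)
print("shared:", sorted(shared))
print("unique:", unique)
```

In this toy case three of the four peptides from each isoform are shared, so only a single peptide per isoform can attribute a measurement to one form rather than the other.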
Current approaches used to identify protein isoforms include deep sequencing, tiling arrays, analysis of protein processing through identification of new N-terminal peptides, SNP detection, and alignment of identified peptides to the genome combined with targeted analysis of predicted peptides and Expressed Sequence Tags (15). A recent study by Alm et al. reported the identification of isoforms via alignment of mass spectra of spots on two-dimensional gels by use of extracted peak lists and hierarchical clustering (16). Methods for de novo sequencing and identification of post-translational modifications have also been developed, which operate independently of sequence databases (17). Combining transcriptome data with MS-based proteomics in specific forms of cancer cells has enabled identification of novel protein isoforms and splicing variants (18). Bioinformatics approaches have made use of Expressed Sequence Tag, RNA, and genomic sequence data to match new splice forms with peptides revealed in MS spectra. Nonetheless, these strategies are limited by the availability, and often the incompleteness and fragmentation, of relevant gene expression data (19).
The functional annotation of genome expression will be improved if it is possible to take into account and measure the expression levels, structures, properties, and biological roles of separate protein isoforms and protein pools. This information on isoforms and protein pools is not available in most large-scale proteomic analyses (20). It will aid the biological interpretation of proteomics experiments to decide whether all identified and quantified peptides that are mapped to a specific gene are encoded in a single polypeptide or in two or more isoforms, and whether the peptides belong to a polypeptide that behaves within the cell as one or more functional pools with respect to properties such as subcellular localization and/or turnover rate. For example, when studying subcellular localization, the averaged value for all peptides mapped to a specific gene may indicate that the protein is present in both the cytoplasm and the nucleus, when in fact the peptides belong to two isoforms, one predominantly cytoplasmic and the other predominantly nuclear. This is likely to be of general importance for annotating the genome because a recent comparative study of subcellular protein localization in three human cell lines found that ~40% of the 4000 genes analyzed encoded proteins localizing to multiple subcellular compartments (21).
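The averaging problem in the localization example can be made concrete with a short sketch. The peptide values below are invented log2 nuclear/cytoplasmic ratios for a hypothetical gene, and `split_at_largest_gap` with its `min_gap` threshold is a deliberately crude illustrative heuristic, assumed here only to show the effect; it is not the method evaluated in this work.

```python
from statistics import mean


def split_at_largest_gap(values, min_gap=1.0):
    """Split sorted values into two groups at the widest gap, if it reaches min_gap.

    Returns a list holding one group (no clear separation) or two groups.
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        return [ordered]
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] < min_gap:
        return [ordered]
    return [ordered[:widest + 1], ordered[widest + 1:]]


# Hypothetical log2(nuclear / cytoplasmic) ratios for peptides mapped to one gene.
peptide_log2_ratios = [2.1, 1.9, 2.0, -2.0, -1.9, -2.1]

gene_level_average = mean(peptide_log2_ratios)
groups = split_at_largest_gap(peptide_log2_ratios)

print("gene-level mean:", gene_level_average)
for group in groups:
    print("group mean:", round(mean(group), 2), "from", len(group), "peptides")
```

The gene-level mean is close to zero, which would suggest an even distribution between compartments, whereas the two peptide groups sit near -2 and +2, consistent with one predominantly cytoplasmic and one predominantly nuclear form.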
Mass spectrometry-based proteomics has become the technology of choice for the direct identification and characterization of proteins (22). In combination with quantitative approaches, such as SILAC (stable isotope labeling with amino acids in cell culture), mass spectrometry can not only identify proteins and post-translational modifications, but also measure how relative protein levels change in cells under different conditions (23). This provides a flexible assay format for proteomic studies that evaluate differences between two or more cell states, each defined by metabolic labeling of proteins with amino acids that carry different combinations of isotopes. Subsequent isolation of proteins and enzymatic cleavage results in mixtures of isotopically labeled peptides in which the relative levels of each isotopic form can be resolved and quantified by mass spectrometry. The peptide isotope ratios are then mapped back to the genome sequences encoding the cognate proteins and used to infer whether the levels or properties of these proteins have changed. The SILAC strategy has been used for quantitative studies of cell and organelle proteomes, for comparative studies of protein modifications and interactions (22), and to identify proteins isolated from mitotic chromosomes (25). It has also been used in combination with cell fractionation to generate “isotope-encoded” subcellular compartments, allowing subcellular protein localization to be evaluated on a system-wide level (26). By examining incorporation rates of isotope-labeled amino acids into proteins, pulse-labeling SILAC has been employed to measure protein turnover in cells and organelles (28).
We have recently reported a global analysis of protein properties in human cells using a combined pulse-labeling, spatial proteomics, and data analysis strategy to characterize the expression, localization, synthesis, degradation, and turnover rates of endogenously expressed, untagged human proteins in different subcellular compartments (33). Mass spectrometry combined with pulsed incorporation of stable isotopes of arginine and lysine was used to perform quantitative analyses of the rates of synthesis, degradation, and turnover of HeLa cell proteins. Cells were pulsed for 0.5, 4, 7, 11, 27, and 48 h before being fractionated into cytoplasmic, nucleoplasmic, and nucleolar fractions. Proteins from each subcellular fraction and time point were further fractionated by 1-D SDS-PAGE, and each of 16 gel slices was digested with trypsin. The resulting peptides were analyzed by liquid chromatography (LC)-tandem MS (MS/MS); ratios between light, medium, and heavy isotopic forms of each peptide were quantified using MaxQuant, and the data were managed and analyzed using PepTracker. A total of 80,098 peptides from an estimated 8041 HeLa proteins were quantified, and their spatial distribution between the cytoplasm, nucleus, and nucleolus was determined as described in the related paper (33). Using information from ion intensities and rates of change in isotope ratios, protein abundance levels and protein synthesis, degradation, and turnover rates were calculated for the whole cell and for the respective cytoplasmic, nuclear, and nucleolar compartments.
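To indicate how a rate can in principle be derived from pulse-labeling time points, the following sketch assumes simple first-order kinetics in which the labeled fraction f of a peptide follows f(t) = 1 - exp(-k t). This two-state model, the function names, and the synthetic data are all assumptions made for illustration; the actual calculations, which use light, medium, and heavy isotopic forms, are described in the related paper (33) and are not reproduced here. Only the pulse durations are taken from the text.

```python
import math

# Pulse durations in hours, as listed in the text.
PULSE_TIMES_H = [0.5, 4, 7, 11, 27, 48]


def fit_rate_constant(times, labeled_fractions):
    """Estimate k for the assumed model f(t) = 1 - exp(-k * t).

    Linearizes to -ln(1 - f) = k * t and takes the least-squares slope
    through the origin. Requires every fraction to be below 1.
    """
    y = [-math.log(1.0 - f) for f in labeled_fractions]
    return sum(t * yi for t, yi in zip(times, y)) / sum(t * t for t in times)


def half_life(rate_constant):
    """Half-life in hours for a first-order rate constant given per hour."""
    return math.log(2) / rate_constant


# Synthetic labeled fractions generated from a known, hypothetical rate constant.
true_k = 0.05
synthetic_fractions = [1 - math.exp(-true_k * t) for t in PULSE_TIMES_H]

k = fit_rate_constant(PULSE_TIMES_H, synthetic_fractions)
print("estimated k per hour:", round(k, 4))
print("estimated half-life in hours:", round(half_life(k), 2))
```

Fitting the same model separately to peptides from each subcellular fraction would give compartment-specific estimates, which is the kind of comparison that allows pools with different turnover behavior to be told apart.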
Here we analyze this same HeLa proteomics data set (33) using systematic approaches for the detection of protein isoforms and protein pools with differential biological properties. We evaluate methods that can identify human protein isoforms whose turnover and/or subcellular localization properties vary, and we analyze phosphorylated peptides that are correlated with altered rates of protein turnover in the separate cytoplasmic, nuclear, and nucleolar compartments. The methods described here use empirically measured protein properties to identify functionally distinct pools of proteins and protein isoforms.