In this study, we determined what is, to date, the most extensively measured human cell proteome. We identified >10 000 proteins expressed in the commonly used human tissue culture cell line U2OS and demonstrate that protein discovery has reached saturation under the experimental conditions used, i.e., further measurements of the same type would not be expected to identify additional proteins. We furthermore describe a large-scale estimate of protein abundances in a human cell. We and others have previously shown that the dynamic range of protein concentrations spans more than three orders of magnitude in the bacterium L. interrogans
(Malmstrom et al, 2009
) and five orders of magnitude in yeast (Ghaemmaghami et al, 2003
; de Godoy et al, 2008
; Picotti et al, 2009
). In the present study, we demonstrate that the protein copy numbers of a human cell span at least seven orders of magnitude. This range is similar to that determined in mouse cells (Schwanhausser et al, 2011
). This finding is furthermore in good agreement with the volume of the relevant cell types, namely ~0.2 μm³ in L. interrogans (Beck et al, 2009), ~30 μm³ in S. cerevisiae, and ~4000 μm³ in U2OS (assuming a spherical shape and diameters of 4 and 20 μm for yeast and U2OS, respectively).
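The yeast and U2OS volumes quoted above follow from the sphere formula V = (4/3)πr³ applied to the stated diameters. The short sketch below reproduces that arithmetic; the function name is illustrative and not part of the study, and the L. interrogans volume is taken from the cited reference rather than computed here.

```python
import math


def sphere_volume_um3(diameter_um: float) -> float:
    """Volume of a sphere in cubic micrometres, given its diameter in micrometres."""
    radius = diameter_um / 2.0
    return 4.0 / 3.0 * math.pi * radius ** 3


# Diameters stated in the text: 4 um for yeast, 20 um for U2OS.
yeast_volume = sphere_volume_um3(4.0)   # about 33.5, quoted as ~30
u2os_volume = sphere_volume_um3(20.0)   # about 4189, quoted as ~4000

print(f"S. cerevisiae: {yeast_volume:.1f} um^3")
print(f"U2OS: {u2os_volume:.0f} um^3")
print(f"U2OS / yeast volume ratio: {u2os_volume / yeast_volume:.0f}")
```

Because volume scales with the cube of the diameter, a fivefold larger diameter gives a roughly 125-fold larger volume, i.e. about two orders of magnitude.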
Interestingly, the bacterium L. interrogans expresses a relatively small number of proteins at very high copy number, e.g., proteins of the translation and protein folding systems, metabolic enzymes, and components of the cell wall. Those proteins make up the majority of the total protein mass (Malmstrom et al, 2009
) and a considerable fraction of the cytoplasmic volume (Beck et al, 2009
), while proteins functioning in signaling, protein transport, or regulatory pathways, e.g., transcription factors, comprise a minority of the quantitative proteome. To investigate whether the same holds true for eukaryotes, we systematically compared the four available data sets mentioned above (Figure 3). We arbitrarily grouped all functional categories into three major classes: (i) cellular core functions comprising carbohydrate, nucleobase, nucleoside, nucleotide, nucleic acid metabolic processes, lipid and other metabolic processes as well as transcription, translation, DNA replication, transport, and other core functions; (ii) regulatory functions, namely cytoskeleton organization, cell adhesion, cell division, phosphorylation, protein metabolic processes, signaling, developmental process, cell communication, and other regulatory functions; and (iii) others. The bacterium L. interrogans
devotes most of its protein mass (~75%) to core functions and <25% to regulatory functions. In contrast, less than half of the analyzed protein mass of U2OS fulfills core functions, and 51% carries out regulatory functions. In particular, the total fraction of protein devoted to cytoskeleton organization, protein metabolic processes, and signaling is greatly expanded in U2OS cells, while other processes, with the exception of central metabolic processes, are greatly reduced. A very similar picture emerged for mouse cells. Yeast, at first glance, does not seem to follow this trend. However, it devotes only one third of its total protein mass to metabolism, while the corresponding number is >50% in L. interrogans. As a single-celled eukaryote, yeast expends a significant fraction of its protein mass (~30%) on translation and protein sorting. Taken together, this analysis indicates that the fraction of total protein mass devoted to regulatory functions is greatly expanded in higher eukaryotes.
Figure 3. Comparative analysis of protein abundance. Pie charts representing the annotated quantitative proteome of human U2OS cells, mouse NIH3T3 cells, S. cerevisiae and L. interrogans, taking protein copy numbers per functional category into account.
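The class-level comparison described above amounts to summing an abundance measure per functional class and normalizing by the total. The following is a minimal sketch of that bookkeeping, assuming each protein carries one copy-number estimate and one class label; the function name and the toy values are invented for illustration and are not data from this study. A mass-based variant would additionally weight each copy number by the protein's molecular weight.

```python
from collections import defaultdict


def class_fractions(proteins):
    """Return the fraction of total copy number that falls into each class.

    `proteins` is an iterable of (copies_per_cell, functional_class) pairs.
    """
    totals = defaultdict(float)
    for copies, functional_class in proteins:
        totals[functional_class] += copies
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no protein copies to summarize")
    return {cls: total / grand_total for cls, total in totals.items()}


# Hypothetical toy input, NOT measured values.
toy = [
    (400_000, "core"),
    (100_000, "core"),
    (300_000, "regulatory"),
    (200_000, "others"),
]

for cls, fraction in class_fractions(toy).items():
    print(f"{cls}: {fraction:.0%}")
```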
In multicellular species, domain families fulfilling regulatory functions have been more frequently subject to gene expansion than domains fulfilling core functions (Vogel and Chothia, 2006
; Ori et al, 2011
). We therefore investigated, using the quantitative data generated in this study, how this effect is linked to protein abundance. We and others have shown that protein abundance is linked to function, namely that high-abundant proteins are often responsible for core functions, such as energy metabolism and translation, while regulatory functions such as protein phosphorylation and transcriptional regulation are often carried out by low-abundant proteins (Supplementary Table S3
; Schwanhausser et al, 2011
). There are several lines of evidence suggesting that protein abundance is also linked to evolvability. It has been previously shown that highly expressed proteins evolve more slowly than proteins expressed at lower levels, i.e., they display a reduced protein divergence on the sequence level (Pal et al, 2001
; Subramanian and Kumar, 2004
), while low-abundant proteins display decreased sequence conservation across organisms (Schrimpf et al, 2009
). It was further shown that protein families displaying lower abundance variability across species less often underwent gene duplication and that abundance variability scales inversely with protein expression (Weiss et al, 2010
). These findings indirectly suggest a link between protein abundance and gene duplicability. Our data support this hypothesis. We show a negative correlation between the frequency of domain families in the human genome and their median copy number per cell (Supplementary Figure S4A
; Supplementary Table S5
). We also show that proteins with a higher number of paralogs tend to be expressed at lower copy number (Supplementary Figure S4B). These findings underline the view that duplications of genes encoding highly expressed proteins are subject to purifying selection, likely because of energy constraints (Lane and Martin, 2010
) or higher risk of protein aggregation and toxicity (Drummond et al, 2005
). Interestingly, a recent study comparing the relative expression levels of gene products in three human cell lines at the proteome and transcriptome levels showed that proteins involved in regulatory functions vary in their expression levels more often than proteins involved in core functions (Lundberg et al, 2010
). One might thus speculate that the large fraction of the human proteome expressed at low copy number and involved in regulatory functions was the main source of biological innovation during evolution. This hypothesis is supported by the following lines of evidence: (i) domain families occurring in low-abundant proteins are significantly more strongly correlated with the increase in organismal complexity than those present in highly expressed proteins (P = 7.8 × 10⁻⁹, one-sided Wilcoxon rank sum test; Supplementary Figure S4C
; Supplementary Table S5
). (ii) The abundance of proteins involved in core functions is more strongly conserved across species than that of proteins involved in regulatory functions (Schrimpf et al, 2009
). (iii) The fraction of the proteome devoted to regulatory functions expanded significantly during the course of evolution (Figure 3).
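A negative relationship such as the one reported above between domain family frequency and median copy number can be quantified with a rank correlation. The text does not state which statistic was used, so the sketch below assumes Spearman's rank correlation purely for illustration; the function names and the toy values are hypothetical and are not data from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy input, NOT measured values:
# occurrences of each domain family in the genome ...
family_frequency = [120, 45, 30, 8, 3]
# ... and the median copies per cell of proteins carrying that domain.
median_copies = [900, 4_000, 15_000, 80_000, 500_000]

rho = spearman(family_frequency, median_copies)
print(f"Spearman rho = {rho:.2f}")  # strongly negative for this monotone toy input
```

A rank-based statistic is a natural choice here because copy numbers span many orders of magnitude, so a few very abundant proteins would otherwise dominate a correlation computed on raw values.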
Regulatory, often low-abundant proteins are key players in mediating the integration of external stimuli with the cell's internal state and they control fundamental biological processes such as cell proliferation, migration, and cell differentiation. It was recently shown for mouse cells that low-abundant proteins and mRNAs are less stable than high-abundant ones (Schwanhausser et al, 2011
). Therefore, expression at low copy numbers might provide an efficient way of dynamic regulation by translation and rapid turnover. Conversely, cellular core functions might be more efficiently regulated by means other than degradation.
Current limitations of protein abundance indices determined from MS data are the availability of PTPs accounting for the multitude of isoforms within protein families and a bias toward proteins that produce fewer well-ionizing peptides. In particular, GO analysis reveals an underrepresentation of transmembrane proteins in the identified proteome (Supplementary Table S4
). Such an effect has been observed before (Schrimpf et al, 2009
) and is likely a result of the reduced accessibility of membrane proteins for MS analysis, although we used an MS-compatible detergent during sample preparation. This finding is further underlined by the fact that a significant fraction of high-abundant mRNAs not discovered at the protein level encode membrane proteins. Otherwise, the distribution of functional categories at the genome and proteome levels is quite similar, suggesting high proteome coverage and that the assumption of even extractability holds true for the majority of proteins but not for membrane proteins. We demonstrate the feasibility of establishing protein abundance scales in very complex proteomes with a precision that is likely sufficient to allow the analysis of biological systems by means of computational modeling. The method used in this study is in principle applicable to most cell types and might be useful for studying a multitude of cellular states and organisms in the future.