|Home | About | Journals | Submit | Contact Us | Français|
This is an open-access article distributed under the terms of the Creative Commons Attribution Licence, which permits distribution and reproduction in any medium, provided the original author and source are credited. Creation of derivative works is permitted but the resulting work may be distributed only under the same or similar licence to this one. This licence does not permit commercial exploitation without specific permission.
Defining the protein profiles of tissues and organs is critical to understanding the unique characteristics of the various cell types in the human body. In this study, we report on an anatomically comprehensive analysis of 4842 protein profiles in 48 human tissues and 45 human cell lines. A detailed analysis of over 2 million manually annotated, high-resolution, immunohistochemistry-based images showed a high fraction (>65%) of expressed proteins in most cells and tissues, with very few proteins (<2%) detected in any single cell type. Similarly, confocal microscopy in three human cell lines detected expression of more than 70% of the analyzed proteins. Despite this ubiquitous expression, hierarchical clustering analysis, based on global protein expression patterns, shows that the analyzed cells can be still subdivided into groups according to the current concepts of histology and cellular differentiation. This study suggests that tissue specificity is achieved by precise regulation of protein levels in space and time, and that different tissues in the body acquire their unique characteristics by controlling not which proteins are expressed but how much of each is produced.
Recently, a detailed study of 1% of the human genome showed that chromosomes are pervasively transcribed and that the majority of all bases are included in primary transcripts (Birney et al, 2007). This has been recently confirmed by extensive parallel sequencing of transcripts, which has shown that a large fraction (70%) of the predicted 20 400 (Clamp et al, 2007) protein-encoded genes can be detected in a single human cell line (Sultan et al, 2008). In addition, a vast number of alternative splicing events have been identified, adding to the complexity and ubiquitous expression of the human transcriptome (Tress et al, 2007; Wang et al, 2008). This plasticity at the RNA level is even further accentuated by the presence of an immense numbers of inhibitory RNAs (Yelin et al, 2003; Katayama et al, 2005) and the recent discovery that tens of thousands of binding sites are present across the genome, as shown by genome-wide profiles of the DNA binding of mammalian transcription factors (Robertson et al, 2007). The question arises whether this ubiquitous RNA expression is also translated to the protein level and how this relates to central biological questions regarding the link between protein expression profiles and cellular phenotypes, and the divergence of protein levels in differentiated cells from normal and disease tissues.
We have previously described the high-throughput generation of antibodies and subsequent creation of a Human Protein Atlas (http://www.proteinatlas.org) based on tissue microarrays (TMAs), immunohistochemistry, and immunofluorescence (Berglund et al, 2008). In this study, we describe, for the first time, a systematic analysis of global protein expression patterns generated from this public resource. We have examined the spatial distribution and the relative abundance of proteins in the different cell populations of various tissues in all major human tissues and organs, including the brain, liver, kidney, lymphoid tissues, heart, lung, skin, GI tract, pancreas, endocrine tissues, and the reproductive organs. Thus, it has been possible to assess the functional associations between phenotypically different cells and to study the relationship between protein expression profiles and developmental origin.
An unsupervised cluster analysis (Eisen et al, 1998) was carried out based on the protein levels in 65 normal cell types. The variability introduced by the individual experimental staining protocol, including the choice of antibody dilution and antigen retrieval methods, was addressed by the use of TMAs (Kononen et al, 1998), thus allowing parallel determination of the relative levels of a particular protein target, within its dynamic range, across hundreds of biosamples (Warford et al, 2004; Taylor and Levenson, 2006). Annotations of more than 2 million images were performed by certified pathologists, and the relative expression level of a particular protein was translated into a four-color code ranging from strong (red), moderate (orange), weak (yellow), and no (white) protein expression (Kampf et al, 2004; Bjorling et al, 2008). It is important to point out that this color code represents the relative expression levels of a particular protein across tissues and organs, but the absolute levels of each protein have not been determined and could vary by many orders of magnitude. Although the level at which it is appropriate to divide cell types into categories is arbitrary, the resulting heat map (Figure 1) shows that the cells cluster into groups that could be expected on the basis of traditional embryology, histology, and anatomy, with most of the cells divided into six major groups: (i) cells of the central nervous system (CNS); (ii) hematopoietic cells; (iii) mesenchymal cells; (iv) cells with squamous differentiation; (v) endocrine cells; and (vi) glandular and transitional epithelial cells. Further subdivision is also evident, as exemplified by (i) separate subgroups containing neuronal and glial cells in the CNS cluster; (ii) subdivision of cells from the male and female genital tracts; and (iii) a distinct subcluster of glandular cells from the GI tract. The liver hepatocytes, together with striated and heart muscle cells (myocytes), have the most divergent protein profiles.
Sensitivity analyses were also carried out using proteins encoded from single human chromosomes to obtain a random stratification of a substantially smaller subset of the proteome. Similar dendrograms were obtained for the chromosome specificity (Supplementary Figures S1–S3), as well as random groups of 200 antibodies (Supplementary Figures S4–S6), suggesting that the phenotype of the cells is generated by a large fraction of human proteins, as a random sampling of only ~1% of the protein-encoded genes (200 proteins) are sufficient to group the cells in a nearly identical pattern compared with the whole data set. The dendrogram shows that cells with similar cellular functions exhibit similar protein profiles, as exemplified within the hematopoietic cell cluster, in which germinal center cells and peri-follicular lymphoid cells have a more closely related expression profile than the more distant hematopoietic cells in the bone marrow. Similarly, the myocytes in cardiac and striated muscle have similar expression profiles, and these are distinctly different from smooth muscle cells and other stroma cells in the mesenchymal cell cluster.
The similarity in protein profiles often coincides with the putative developmental origin (endoderm, ectoderm, or mesoderm) of cell types, as shown by different color codes for branches of the dendrogram (Figure 1). This can be exemplified by glandular cells in the GI tract, which are derived from the endoderm, cells in the CNS originating from the neuro-ectoderm, and hematopoietic/mesenchymal cells derived from the mesoderm. For certain cell types, morphological differentiation supersedes developmental origin, as exemplified in the cluster of cells with squamous differentiation, in which cells from all three germ layers are represented: surface epithelia of the esophagus (endoderm), epidermal cells from the skin (ectoderm), and surface epithelia from intra-vaginal elements of the cervix (mesoderm). These patterns show that global protein profiles in differentiated normal cells reflect the pluripotent origin of the corresponding stem cells in different germ layers, but that functional convergence also exists resulting in similar expression profiles that are independent of developmental origin.
Expression analysis can be used to estimate the relative level of different protein expressions in each cell type. An analysis of the fraction of cell types containing each protein is shown in Figure 2A, with the cells classified into groups exhibiting strong (red), medium (orange), or weak (yellow) expression. The analysis indicates that a large proportion of the proteome is expressed across many of the 65 cell types: 20% (949) of the proteins are found at detectable levels in 60 cell types, whereas only 3% (150) are detected in less than six cell types. A similar analysis was conducted to study the fraction of analyzed proteins (4842) that are detected in each cell type (Figure 2B). This showed that a large fraction of protein-encoding genes are present in any given cell type, with an average of 68% (range 40–84%) of all proteins expressed. The supportive cells are the most specialized, for example, glial cells in the CNS and stroma cells in the endometrium and ovary. In contrast, the study showed that several glandular cells have as many as 80% of the analyzed proteins present at detectable levels, and this raises the question how much the result is influenced by background staining due to nonspecific binding or cross-reactivity to homologous proteins. To explore this issue, a sensitivity analysis was carried out, using various subfractions of the antibodies (see Supplementary Table S1) with the selection based on paired antibodies with high confidence or documented supportive western blots. The analysis showed essentially the same results as when all antibodies were used (Supplementary Figures S7, S8 and Table S2), with an average of 61–71% of the proteins detected across the 65 cell types.
A study was performed to explore the difference in protein expression in three cells with distinctly different phenotypes, namely hepatocytes from the liver, neurons from the cerebral cortex of the brain, and lymphoid cells from the germinal center of the lymph nodes. A network analysis (Shannon et al, 2003) showed (Figure 2C) that as much as 90% of the antibodies (n=5138) detect proteins that are expressed in at least one of the three cells and few proteins are detected exclusively in one of the cells, as exemplified by the brain (9%). However, the cells still display a highly differentiated global expression pattern as shown by the fact that only 6% of the proteins are expressed at the same level in all three cell types. We extended this study to explore the protein profiles in three more closely related cell types: glandular cells in the colon, epidermal cells from the skin, and urothelial cells from the bladder. In this case, a smaller fraction (59%) of the antibodies (n=3376) detect proteins in all three cells and a larger fraction (17%) of the proteins were scored with the same expression level in all three cells (Figure 2D). However, considering the fact that these cells share a common epithelial phenotype, it is interesting that a large fraction (74%) of the proteins still have differential expression across the three cell lines. As expected, cells with more similar functions show less differences in global protein expression patterns compared with those with widely different functions. Despite the high fraction of overall expression of proteins, the given examples show that only 6% of expressed proteins are expressed at the same level in widely different cell types compared with 17% in more closely related cell types. The network analysis provides a further insight into the dynamics of cellular phenotypes and functions as a consequence of differences in protein signatures and allows for a novel angle to determine which cell types are most similar (and most dissimilar) irrespective morphological traits and functions.
To identify tissue-specific proteins, we analyzed how many proteins could be detected exclusively in a single cell and this query resulted in a list of 74 proteins (Supplementary Table S4). The list included several previously well-known cell-type-specific proteins, such as insulin, glucagon, IAPP (Langerhans islets), troponins (muscle), PSA, ACPP (prostate), and several CD markers (hematopoietic cells). The analysis also identified a subset of cell-type-specific proteins for which there is no or little information, including proteins exclusively expressed in the testis, skeletal muscle, placenta, seminal vesicle, prostate, and stomach (Figure 2E). Several of these proteins showed a remarkable specificity, with expression in only a subset of the entire annotated cell population, for example, expression in parietal cells of the stomach mucosa and fiber-type-specific expression myocytes. Expanding the query to include similar cell types at different locations, for example, the nine annotated cell populations in the brain, also showed a surprisingly low number (n=30) of proteins exclusively expressed in the brain (data not shown). These results are somewhat surprising considering the numerous reports describing genes expressed in a tissue-specific manner (Saito-Hisaminato et al, 2002), in particular examples of genes exclusively expressed in the brain, such as the KIAA genes (Ishikawa et al, 1997) In summary, our analysis shows a surprisingly low fraction (<2%) of proteins expressed in a single or only few distinct types of cells. The few cell- or tissue-specific proteins that were found are, of course, interesting starting points for further studies and this is facilitated by the fact that all the annotation results and the underlying original images are available as a public resource from the Human Protein Atlas portal.
As all the immunohistochemical images from the TMAs were manually annotated by pathologists involving subjective scoring, we decided to carry out the same analysis on 45 human cell lines in which an automated image analysis algorithm have been used (Stromberg et al, 2007; Lundberg et al, 2008). The data from 5349 antibodies corresponding to 4349 genes were analyzed, involving more than 450 000 additional images, and the results are shown in Figure 3. A pattern of protein expression similar to that for tissues and organs was recorded for the in vitro cultured cells, with most proteins expressed in the majority of the 45 cell lines (Figure 3A) and nearly 80% of the proteins expressed across all the analyzed human cell lines (Figure 3B and Supplementary Table S3). A sensitivity analysis using antibodies with supportive western blots (Figure 3C) or paired antibodies with highly correlated expression patterns (Figure 3D and E, and Supplementary Figure S9) produced similar results. An example of the automated image analysis algorithm can be seen in Figure 3F.
The immunohistochemical analysis, based on an enzyme amplification method, is semiquantitative and we therefore decided to extend the study using immunofluorescence analysis with confocal microscopy. An analysis of three selected human cell lines (Barbe et al, 2008) of epithelial (A-431), glial (U251-MG), and mesenchymal (U-2 OS) origin was carried out for 2,064 proteins (see Figure 4C). More than 70% of the proteins were detected in each of the three cell lines (Figure 4A), even when proteins without a defined subcellular localization (i.e., with weak, granular, and cytoplasmic staining) were excluded. Only 14% of the proteins could not be detected in any of the three cell lines. A plot was generated to show the number of proteins expressed in one, two, or all cell lines (Figure 4B), and the results show that the majority (72%) of the antibodies (n=1384) detect proteins in all three cell lines and that only 11% of the proteins are detected in just one out of the three cell lines. As to the tissue analysis (Figure 2C and D), the majority (66%) of the proteins were expressed at a different level in at least one of the three cell lines.
Our findings suggest that few proteins are expressed in a cell-type-specific manner, and that the phenotype and function of a cell is determined by localization and fluctuations in concentration of a large portion of the proteome, as opposed to a binary ‘on/off' expression pattern. Although antibody-based assays are sensitive, it is possible that even more sensitive assays, such as the use of sandwich-based analysis (Larsson et al, 2004), could allow the detection of lower levels of proteins, thus showing an even more ubiquitous expression. It is important to point out that the regulation of proteins is also mediated through protein modification, in which certain fractions of the proteins are activated by chemical modification (Olsen et al, 2006) or proteolysis (Yen et al, 2008). The separate functions of a particular cell are, thus, a consequence of local concentrations and modifications of proteins that are carefully regulated to ensure proper functionality in each cell type. For example, the concentrations of a majority of our proteins in a human kidney cell provide an interaction network appropriate for kidney functions (filtration), whereas the protein interaction network in a specific neuron in the brain is targeted toward neurological functions. These data therefore suggest that the phenotype of a particular cell is a consequence of the local concentration of a large portion of the human proteome. This underlines the importance of systems biology approaches (Kislinger et al, 2006) based on quantitative measurements of protein levels (Cox and Mann, 2008) and network predictions of protein interactions (Shlomi et al, 2008) to study mammalian biology. In this context, it would be interesting to add quantitative expression data from complementary technology platforms, such as mass spectrometry (Mann and Kelleher, 2008), to further explore the protein space in individual cells. The analysis could also be integrated with numerous RNA-based expression studies (Kilpinen et al, 2008) to gain further in-depth understanding of the relationship between transcript and protein profiles. To facilitate such bioinformatics comparisons, the expression data used in this analysis are available for downloads at the public Human Protein Atlas portal (http://www.proteinatlas.org/download.php). In conclusion, this study suggests that tissue specificity is achieved by precise regulation of protein levels in space and time, and the results emphasis the need for quantitative systems biology approaches to understand the molecular mechanisms of human biology and diseases.
To determine the level of protein expression of each protein in this study, antibodies were used to immunohistochemically stain human tissues assembled in TMA blocks (Kononen et al, 1998). Tissue cores with 1 mm diameter, sampled from 144 individuals, corresponding to 48 different normal human tissues types, were included in the study. In addition, a microarray containing human cell lines was assembled (CMA) (Kampf et al, 2004). In addition, the protein levels were estimated for three human cell lines (A-431, U-2OS, and U-251 MG) using immunofluorescence-based confocal microscopy (Barbe et al, 2008). Immunohistochemically stained sections from TMA/CMA blocks were scanned in high-resolution scanners and separated to individual spot images representing each core. For TMAs, all images were evaluated by certified pathologists in a web-based annotation system to collect parameters regarding distribution, the extent and level of protein expression (P Oksvold and E Björling, unpublished results). Parameters from the annotation included staining intensity, fraction of stained cells in a defined cell population, and subcellular localization of staining. The annotation was performed for selected cell types for each tissue, as most tissue types include several defined cell phenotypes, e.g., neurons and glial cells in brain tissue and glomeruli and tubules in the kidney (Bjorling et al, 2008). CMAs were evaluated using automated image analysis (Stromberg et al, 2007) where five output parameters were combined to calculate a score for the protein expression level. For immunofluorescence images, subcellular localizations were annotated and the relative expression levels were classified as strong, moderate, weak, or negative based on the employed laser power and detector gain settings.
In total, 5934 antibodies against 4842 proteins with 298 annotations were assembled from TMA measurements and 5349 antibodies against 4349 proteins with 45 annotations from CMAs. The annotation parameters for intensity and quantity (fraction of positively stained cells) were combined into a four-grade scale represented by the colors white (negative), yellow (weak), orange (moderate), and red (strong) level of protein expression. All data are presented in this format on the protein atlas (http://www.proteinatlas.org). For statistical analysis regarding protein expression, the color codes representing the staining levels were converted to numerical values using a red to 4, orange to 3, yellow to 2 and white to 1 transformation. In cases where the protein expression value could not be derived, because of low image quality, a not available (NA) value was introduced. These data were ordered into a matrix with m (number of antibodies) × n (number of tissues (n=65) or cell lines (n=45)) dimensions. The number of tissues is a combined tissue and cell type parameter, where the number of tissues and cell types give rise to × the number of tissues cell type parameters. In all, two matrices were constructed; one matrix contains protein expression data from human normal tissues and a second matrix contains protein expression data from human cell lines. The number of antibodies in the two matrices and all subsets used in the different figures is presented in Supplementary Table 1.
For the normal tissue data set, two correlation matrices based on Spearman's ρ were calculated for two dimensions (m × m and n × n, respectively) (Spearman, 1987). The correlation matrices were converted to a distance metric using a 1−correlation value transformation. These data were clustered using unsupervised top-down hierarchical clustering (Eisen et al, 1998; Golub et al, 1999), where at each stage the distances between clusters are recomputed by the Lance–Williams dissimilarity update formula according to average linkage. The algorithm consistently sorted the tighter cluster in each division to the left in the resulting dendrogram representing the hierarchical cluster output. The antibodies with no defined correlation due to constant expression across all tissues or cell lines were removed in the clustering procedure.
To estimate protein expression values for each protein across all tissues and cell lines, the different intensity categories (weak, moderate, and strong) were added as separate units into a marginal distribution, which constitutes of 4 × the number antibodies values. The marginal distribution can be seen as a proxy for the total expression level for the respective protein across the 65 tissues and cell types used in this study. A similar procedure was conducted for the protein expression for each tissue and cell line across all antibodies, resulting in a marginal distribution of 4 × the number of tissues or cell lines values.
In order to investigate the quality and the conclusions made from data, different subsets were constructed and used in the same analysis approaches as the full data set. For the hierarchical clustering, three additional data sets were constructed, where the subsets were selected based on chromosomal appurtenance using 215, 203, and 206 antibodies representing proteins from chromosome 9, 10, and 22 respectively. The similarity between the dendrograms generated with the three subsets and the dendrogram using all antibodies were investigated using cophenetic correlation coefficients (Sokal and Rohlf, 1962). To estimate the reliability of the marginal distributions, different subsets of antibodies were chosen (Supplementary Table S1). One of the subsets was chosen based on western blot data, where the expected size of the protein matches the correct band from a western blot gel image. Two additional subsets were built based on correlation analysis of paired antibodies, where the two antibodies in a pair are generated to different parts of the same protein. For each antibody pair, a correlation coefficient was calculated using Spearman's ρ. A cutoff value of 0.5 was applied to construct subsets used for the tissue and cell type data set and for the human cell lines. A cutoff of 0.8 was implemented to construct an additional data set for the human cell lines. To estimate the similarities of the sub sets and the full data set, a χ2 test statistic was used on the marginal distributions.
In order to visualize the protein expression overlap between different tissues or cell lines, Cytoscape (Shannon et al, 2003), a software package for analyzing biomolecular interaction networks, was used. The resulting images contain schemes that indicate the overlap of different proteins across different sets of cell types. The first combination was hepatocytes from the liver, neurons from the cerebral cortex of the brain, and lymphoid cells from the germinal center of the lymph nodes, where all three cell types have distinctly different phenotypes. The second combination consisted of three more closely related cell types, namely glandular cells in the colon, epidermal cells from the skin, and urothelial cells from the bladder. The third group of cells consisted of the three cell lines used for immunofluorescence analysis, namely A-431, U-2OS, and U-251 MG. Antibodies with missing protein expression data for any of these cell types were removed from the analysis. This resulted in three different sets of 5710, 5700, and 2250 antibodies, respectively. In each of the resulting networks, a node represents an antibody with a specific protein expression profile, and the edges connect the node to the cell(s) where certain protein is expressed, generating nodes with a degree (number of edges) of 1, 2, or 3. Thus, only antibodies corresponding to proteins expressed in the analyzed cell types are parts of the network. The total number of nodes in the three networks was 5138, 5186, and 1921. Each node was colored according to the variability of staining intensity between the analyzed cell types. Red nodes indicate different staining categories for the three cell types, yellow nodes indicate two cell types in the same staining category, and green nodes indicate that all three cell types belong to the same staining category and are therefore only found for nodes with a degree of three.
Supplementary figures S1-9, Supplementary table S1 and Supplementary information regarding tables S2-S4
Supplementary Table S2
Supplementary Table S3
Supplementary Table S4
This work was supported by grants from the Knut and Alice Wallenberg Foundation, the Swedish Research Council (VR), and the EU 7th Framework Program Prospects. We are grateful to Claes Wilhelmsson, Kenneth Nilsson, and Per-Åke Nygren for useful comments and advice.
The authors declare that they have no conflict of interest.