The Human Proteome Project has been proposed to create a knowledge-based resource based on a systematic mapping of all human proteins, chromosome by chromosome, in a gene-centric manner. With this background, we here describe the systematic analysis of chromosome 21 using an antibody-based approach for protein profiling using both confocal microscopy and immunohistochemistry, complemented with transcript profiling using next generation sequencing data. We also describe a new approach for protein isoform analysis using a combination of antibody-based probing and isoelectric focusing. The analysis has identified several genes on chromosome 21 with no previous evidence at the protein level, and the isoform analysis indicates that a large fraction of human proteins have multiple isoforms. A chromosome-wide matrix is presented with status for all chromosome 21 genes regarding subcellular localization, tissue distribution, and molecular characterization of the corresponding proteins. The path to generate a chromosome-specific resource, including integrated data from complementary assay platforms, such as mass spectrometry and gene tagging analysis, is discussed.
The Human Proteome Project has been proposed (1) to systematically map the human proteins in a chromosome-specific manner using mass spectrometry-based methods combined with antibody-based characterization. One of the major challenges to such a project is the dynamics of the human proteome, including temporal and spatial parameters, transient and stable interactions, and the vast number of isoforms and protein variants (2). There have also been proposals for alternative strategies, such as a more disease-driven proteome project with the objective to explore various human diseases using mass spectrometry-based methods (3). These two approaches have now been combined into the Human Proteome Project launched by the Human Proteome Organization (HUPO) (4). The questioning of a gene-centric approach as the most suitable strategy for a systematic exploration of human proteins calls for pilot projects to demonstrate feasibility and to facilitate the definition of suitable milestones and deliverables for a complete genome-wide proteome project.
Here, we describe a pilot study to investigate the genes encoded on human chromosome 21 using antibody-based profiling with the aim of characterizing the proteome components, including protein isoforms, subcellular localization, and distribution profiles in cells, tissues, and organs. Chromosome 21 is the smallest human autosome with regard to both size and gene number, and the presence of three copies of the chromosome (trisomy 21) is the underlying cause of Down syndrome. A first attempt to generate antibodies to the gene products from chromosome 21 was published as early as 2003 (5), as a prelude to the Human Protein Atlas effort, which aims to generate publicly available subcellular localization data and expression data for most major human tissues and organs (6, 7). Recently, version 7 of the Human Protein Atlas portal was launched (8) with expression data for more than 50% (n = 10,170) of the human protein-coding genes.
We report on a first attempt at a chromosome-wide analysis using antibody-based methods, including tissue profiles that cover 131 of the 240 protein-coding genes defined by the Ensembl database, and extend the analysis by molecular characterization of the proteins, including an isoform analysis of selected proteins. In addition, we have included RNA data to provide evidence for the existence of the protein-coding genes at the transcriptional level. The results demonstrate the power of an integrated, gene-centric approach to characterize the protein-coding genes.
A panel comprising two cell lines (RT-4 and U-251 MG), two human tissues (liver and tonsil), and HSA/IgG depleted human plasma was selected for protein characterization using Western blot analysis. 15 μg of total protein lysate and 25 μg of depleted plasma were subjected to a precast 10–20% Criterion™ SDS-PAGE gradient gel (Bio-Rad Laboratories, CA) under reducing conditions followed by transfer to a PVDF membrane using Criterion™ gel blotting sandwiches (Bio-Rad Laboratories, CA) according to the manufacturer's recommendations. PVDF membranes were presoaked in methanol and blocked (5% dry milk, 0.5% Tween 20, 1× TBS (150 mM NaCl, 10 mM Tris-HCl)) for 45 min at room temperature followed by 1 h of incubation with primary antibody, diluted 1:250 in blocking buffer. After four 5-min washes in TBST (0.1 M Tris-HCl, 0.5 M NaCl, 0.05% Tween 20), the membranes were incubated for 1 h with a horseradish peroxidase-conjugated polyclonal swine anti-rabbit antibody (Dako, Glostrup, Denmark) diluted 1:3000 in blocking buffer. A final round of four 5-min TBST washes was performed before chemiluminescence detection, using a CCD camera (Bio-Rad Laboratories, CA) and Immobilon Western chemiluminescent horseradish peroxidase substrate (Millipore Corporation, Billerica, MA).
Fourteen genes on chromosome 21 were transfected to HEK 293 cells, and proteins were extracted. The resulting protein lysates were purchased from OriGene Technologies (Rockville, MD). Protein concentration was measured by a Bio-Rad protein assay kit. Five micrograms of protein were diluted with 320 μl of rehydration buffer containing 6 M urea, 2 M thiourea, 3% CHAPS,1 1% Triton X-100, 13 mM DTT, 1% Pharmalyte pH 3–10 (GE Healthcare, Japan). The samples were loaded on an 18-cm IPG DryStrip gel (pH 3–10; GE Healthcare, Japan) by sample rehydration overnight. Subsequently, the strips were subjected to isoelectric focusing on a Multiphor II (GE Healthcare, Japan) at 20 °C under the following conditions: 500 V (gradient over 0.5 h), 3500 V (gradient, 1.5 h), and 3500 V (hold, 6.5 h), resulting in 16250 Vh. Subsequently, the strips were stored at −80 °C. The gel-separated protein samples were blotted onto PVDF membrane by passive diffusion for 3 h with conventional transfer buffer containing 50 mM Tris, pH 7.4, 200 mM NaCl, 0.05% Tween 20. The membranes were blocked with blocking buffer and 5% skimmed milk in TBS-T buffer at room temperature for 1 h and washed four times in TBS-T for 5 min. The primary antibodies were diluted with the blocking buffer (1:250) and incubated with the membranes at 4 °C overnight. The membranes were washed four times in TBS-T for 5 min. Subsequently, the primary antibodies were reacted with horseradish peroxidase-conjugated polyclonal sheep anti-rabbit IgG antibody in the blocking buffer (1:3000 dilution). After incubating for 1 h, the membranes were washed four times in TBS-T buffer for 5 min. The bound peroxidase-conjugated anti-rabbit antibody was detected using the ECL-plus kit (GE Healthcare, Japan) and LAS-3000 (Fuji-Film, Tokyo, Japan). The observed isoelectric point was calculated by measuring the electrophoretic migration in a linear pH gradient.
The theoretical isoelectric point was obtained using the online Compute pI/Mw tool at the ExPASy website.
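The calculation of the observed isoelectric point from band migration can be illustrated with a minimal sketch. It assumes simple linear interpolation along the 18-cm, pH 3–10 strip described above, with the band position measured from the acidic end; the function name `observed_pi` and the choice of reference end are illustrative assumptions, not details given in the text.

```python
def observed_pi(distance_cm, strip_length_cm=18.0, ph_start=3.0, ph_end=10.0):
    """Estimate the pI of a band from its position on a linear IPG strip.

    distance_cm is measured from the acidic (pH 3) end of the strip;
    this reference point is an assumption made for illustration.
    """
    if not 0.0 <= distance_cm <= strip_length_cm:
        raise ValueError("band position must lie on the strip")
    # Linear gradient: pH rises in proportion to the distance travelled.
    fraction = distance_cm / strip_length_cm
    return ph_start + fraction * (ph_end - ph_start)


if __name__ == "__main__":
    # A band found halfway along the strip sits at the midpoint of the gradient.
    print(observed_pi(9.0))
```

Under these assumptions, a band observed at several positions for the same protein would yield several pI estimates, which is the pattern later interpreted as multiple isoforms.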
Immunofluorescence microscopy was systematically used to determine the protein subcellular location in three human cell lines: the osteosarcoma U-2 OS, the epithelial carcinoma A-431, and the malignant glioma U-251 MG. The cells were fixed, permeabilized, and immunostained as previously described (9, 10).
Tissue microarrays containing triplicate 1-mm cores of 46 different types of normal tissue and duplicate 1-mm cores of 216 different cancer tissues representing the 20 most common forms of human cancer were generated as previously described (11). Tissue microarray sections were immunostained as previously described (12). Briefly, the slides were deparaffinized in xylene, hydrated in graded alcohols, and blocked for endogenous peroxidase in 0.3% hydrogen peroxide diluted in 95% ethanol. For antigen retrieval, a Decloaking Chamber™ (Biocare Medical, Walnut Creek, CA) was used. The slides were immersed and boiled in citrate buffer, pH 6 (Lab Vision, Fremont, CA) for 4 min at 125 °C and then allowed to cool to 90 °C. Automated immunohistochemistry was performed essentially as previously described (13), in brief, using an Autostainer 480 instrument (Lab Vision, Fremont, CA). Primary antibodies and a dextran polymer visualization system (UltraVision LP horseradish peroxidase polymer; Lab Vision, Fremont, CA) were incubated for 30 min each at room temperature, and the slides were developed for 10 min using diaminobenzidine (Lab Vision, Fremont, CA) as chromogen. All of the incubations were followed by a rinse in wash buffer (Lab Vision, Fremont, CA). The slides were counterstained in Mayer's hematoxylin (Histolab, Sweden) and coverslipped using Pertex® (Histolab, Sweden) as mounting medium. Incubation with PBS instead of primary antibody served as negative control. The Aperio Scan Scope CS slide scanner (Aperio Technologies, Vista, CA) system was used to capture digital whole slide images with a 20× objective. The slides were de-arrayed to obtain individual cores. The outcome of immunohistochemistry stainings in the screening phase, which included various normal and cancer tissues, was manually evaluated and scored by certified pathologists using a web-based annotation system as previously described (14).
In brief, the manual score of immunohistochemistry-based protein expression was determined as the fraction of positive cells defined in different tissues: 0 = 0–1%, 1 = 2–25%, 2 = 26–75%, and 3 = >75% and intensity of immunoreactivity: 0 = negative, 1 = weak, 2 = moderate, and 3 = strong staining. All of the tissues used as donor blocks were acquired from the archives at the Department of Pathology of Uppsala University Hospital in agreement with approval from the Research Ethics Committee at Uppsala University (Uppsala, Sweden) (Ups 02-577).
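The two-part manual score described above can be written down as a small lookup, shown here as an illustrative sketch only. The function and variable names are hypothetical, and the handling of non-integer percentages (each bin includes its upper bound) is an assumption, since the text defines the bins on whole percentages.

```python
# Intensity categories and their scores, as listed in the text.
INTENSITY_SCORES = {"negative": 0, "weak": 1, "moderate": 2, "strong": 3}


def fraction_score(percent_positive):
    """Map the percentage of positive cells to the 0-3 fraction score."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percent_positive <= 1:    # 0 = 0-1%
        return 0
    if percent_positive <= 25:   # 1 = 2-25%
        return 1
    if percent_positive <= 75:   # 2 = 26-75%
        return 2
    return 3                     # 3 = >75%


def annotate_core(percent_positive, intensity):
    """Return (fraction score, intensity score) for one tissue core."""
    return fraction_score(percent_positive), INTENSITY_SCORES[intensity]


if __name__ == "__main__":
    print(annotate_core(50, "moderate"))
```

Keeping the two scores as a pair, rather than combining them into a single number, mirrors the way the text reports fraction and intensity separately.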
The RNA-seq method using the SOLiD3 platform has been previously described (15). For this study, the RPKM (reads per kilobase of exon model per million mapped reads) value was calculated by dividing the number of reads mapping to the protein-coding part of each gene by the length of the protein-coding part of the gene and by the total number of reads from the library, to compensate for slightly different read depths for different samples. The total set of RPKM values from all genes and all three cell lines was ordered and divided into three classes: low (the bottom third of the set), medium (the middle third of the set), and high (the top third of the set). These three classes are used to determine the abundance level for each gene in the cell line(s) where it was detected and to classify each gene into the categories “supportive” (medium to high levels), “uncertain” (low levels), and “not supportive” (genes not detected in any of the cell lines).
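The RPKM calculation and the three-class scheme above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, only nonzero RPKM values are assumed to enter the ranked set, and ties are broken by sort order.

```python
def rpkm(reads, coding_length_bp, total_mapped_reads):
    """Reads per kilobase of coding sequence per million mapped reads."""
    return reads * 1e9 / (coding_length_bp * total_mapped_reads)


def abundance_classes(rpkm_values):
    """Label each detected (gene, cell_line) value as low, medium or high.

    rpkm_values maps (gene, cell_line) to an RPKM value. Values of zero are
    treated as "not detected" and left out (an assumption for illustration).
    """
    detected = sorted((v, k) for k, v in rpkm_values.items() if v > 0)
    n = len(detected)
    labels = ("low", "medium", "high")
    # Rank-based thirds of the pooled set across all genes and cell lines.
    return {k: labels[rank * 3 // n] for rank, (_, k) in enumerate(detected)}


def transcript_support(gene, cell_lines, classes):
    """Summarize one gene as supportive, uncertain or not supportive."""
    levels = {classes[(gene, c)] for c in cell_lines if (gene, c) in classes}
    if not levels:
        return "not supportive"   # not detected in any cell line
    if levels & {"medium", "high"}:
        return "supportive"       # medium or high in at least one cell line
    return "uncertain"            # only low levels observed


if __name__ == "__main__":
    values = {("A", "x"): 1.0, ("A", "y"): 0.0,
              ("B", "x"): 5.0, ("B", "y"): 10.0,
              ("C", "x"): 0.0, ("C", "y"): 0.0}
    classes = abundance_classes(values)
    for gene in "ABC":
        print(gene, transcript_support(gene, ["x", "y"], classes))
```

Because the thirds are defined on the pooled set of values, a gene's class in one cell line depends on the expression of all other genes in all cell lines, which is worth keeping in mind when comparing across data sets.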
In Fig. 1, the 240 putative protein-coding genes on chromosome 21 as defined by the Ensembl effort (16) (release 59) are outlined with a color code to show the current knowledge base according to UniProt (17–19). These putative genes are ranging from very well known genes, such as the amyloid β precursor protein responsible for Alzheimer disease, to many genes of unknown function or even questionable existence. Chromosome 21 has 48 keratin-associated genes (brown in Fig. 1) encoding small, homologous proteins that are involved in the formation of the cross-linked network of the keratin-intermediate filament proteins that support hair fibers (20, 21).
Excluding the keratin-associated proteins, there are 192 putative protein-coding genes, and according to UniProt (17–19), there is evidence at the protein level for 69% (n = 133) of these genes, whereas another 31 genes have been found only at the transcriptional level. For another nine genes, there is no evidence at either the transcript or the protein level (class 4 and 5 genes), and for 19 genes there are no reviewed data in the UniProt portal. The large fraction of protein-coding genes lacking evidence at the protein level demonstrates the need for systematic strategies to characterize the putative proteins and presents chromosome 21 as an appropriate target for a gene-centric approach. In supplemental Table 1, a list of all 240 genes, including the keratin-associated genes, is presented with data predicted from the genome sequence, including molecular weight, signal peptides, transmembrane regions, and number of splice variants. Another 41 proteins defined by UniProt are not included in the Ensembl list of genes for this chromosome and are therefore excluded from this study (see supplemental Table 2). These genes might be included in extended studies of the chromosome 21 genes in the future.
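The gene counts quoted above can be cross-checked with simple arithmetic. The short sketch below uses only the numbers stated in the text; the dictionary keys are informal labels chosen for illustration.

```python
# Counts taken from the text (Ensembl release 59, chromosome 21).
ALL_GENES = 240
KERATIN_ASSOCIATED = 48

uniprot_evidence = {
    "protein level": 133,
    "transcript level only": 31,
    "no transcript or protein evidence": 9,
    "no reviewed data": 19,
}

non_keratin = ALL_GENES - KERATIN_ASSOCIATED
total = sum(uniprot_evidence.values())

# Share of the non-keratin-associated genes in each evidence category.
percentages = {k: round(100 * v / total) for k, v in uniprot_evidence.items()}

if __name__ == "__main__":
    print(non_keratin, total)
    for category, pct in percentages.items():
        print(f"{category}: {pct}%")
```

The four categories add up to the 192 non-keratin-associated genes, so every gene in that set falls into exactly one evidence class.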
As part of the Human Protein Atlas project, we have generated antibodies in a systematic effort, and this has been complemented with antibodies from more than 60 commercial providers. A summary of the overall status for the chromosome 21 gene products can be seen in Fig. 2a. Antibody-based protein profiling data are provided for 68% of the protein-coding genes, and for one-third of these proteins, knowledge-based annotated protein expression level data (8), based on at least two separate antibodies, are available. Antibodies approved by a multiplex microarray assay (7) exist for an additional 20% of the genes, and recombinant antigens were verified by mass spectrometry for another 5%. Thus, at present more than 90% of the putative genes on chromosome 21 have either antibodies or mass spectrometry-approved antigens.
The protein products of 167 genes were characterized by Western blot analysis (22) of protein lysates from selected human cell lines, tissues, and a pooled mixture of plasma. A summary of the results is presented in supplemental Fig. 1. Of the analyzed proteins, 50% displayed a band corresponding to the predicted size in one or more of the samples. According to UniProt, there is no previous evidence at the protein level for many of these genes, and in Fig. 2b, five examples of such genes (C21orf57, CHODL, CLDN8, IGSF5, and ABCG1) are shown. The results from the Western blot analysis show that a protein of the expected size is detected with the antibody in at least one of the analyzed cells or tissues for each of these genes, thus providing evidence at the protein level based on the appearance of a single band of the predicted size in the Western blot assay.
For 21% of the proteins, the results from the Western blot analysis were considered not conclusive, in most cases because of detection of either several bands or a single band of other than expected size. This might to some extent reflect the presence of yet uncharacterized splice variants or alternative isoforms resulting from protein modifications. An example of this is the human protein KCNJ15, which is predicted to be a multi-pass membrane protein belonging to the inward rectifier-type potassium channel family (23). The molecular mass according to the antibody-based Western blot analysis is ~75 kDa (Fig. 2b), whereas the theoretical size predicted from the genome sequence is 43 kDa. This is not unexpected, because integral membrane proteins frequently are N-glycosylated (24), yielding glycoproteins of higher molecular mass, and the results therefore suggest that human KCNJ15 is indeed glycosylated. It is reassuring that the specificity of the antibody is supported by confocal microscopy analysis showing a subcellular localization in the plasma membrane (supplemental Fig. 2).
The antibodies were subsequently used to determine the subcellular localization of each gene product as part of the Human Protein Atlas effort. Antibodies corresponding to 97 human genes were analyzed on the subcellular level using high resolution confocal microscopy of immunostained cell lines. 41% of the analyzed proteins were detected in a single subcellular compartment and 59% in multiple compartments (data not shown). Fig. 3a shows examples from the subcellular localization analysis with images I–III displaying proteins expressed in cytoplasmic/membranous locations, including the centrosomal protein PCNT, which is localized to centrosomes (image I), the receptor protein CXAR specifically localized to cell junctions (image II), and the mitochondrial protein ATP5J localized to the mitochondria (image III). Images IV–VI demonstrate three different types of nuclear distribution patterns with the transcription factor BACH1 localized to the nucleus, the ribosomal protein RRP1 localized to nucleoli, and the SON DNA-binding protein localized to specific patches of the nucleus, with a speckled pattern typical for DNA-binding proteins.
The validated antibodies have also been used to generate protein profiles corresponding to 131 of the 192 non-keratin-associated protein-coding genes defined by the Ensembl database, using immunohistochemistry-based protein detection in tissue microarrays as part of the Human Protein Atlas effort. In this way, the protein profiles in 46 normal tissues and organs were determined, including liver, kidney, pancreas, gastrointestinal tract, lung, and various regions in the brain. For 22 of these gene products, no previous evidence at the protein level exists according to UniProt, and therefore this antibody-based effort contributes to the functional annotation of the corresponding proteins. For a subset of these proteins, annotated protein expression patterns were obtained using two or more paired antibodies to the same target (8). One such example is the RSPH1 gene with an interesting tissue-specific and highly selective expression pattern localized to cilia in ciliated epithelium, exemplified in respiratory epithelia (Fig. 3b, image I) and maturing spermatids in testis. The expression pattern is in agreement with earlier reports suggesting a role in ciliary function for other members of radial spoke head genes (25). Our data also suggest that the RSPH1 protein is expressed in a subset of ciliated glandular cells in the endometrium (Fig. 3b, image II). The putative protein LCA5L was also found to be expressed in a highly specific manner, with protein expression restricted to trophoblasts of the placenta, in both early, immature placental tissue (Fig. 3b, image III) and fully matured, end stage placenta (image IV). The putative protein C21orf128 was found to be expressed in a selective manner with the highest expression levels in a subset of hematopoietic cells (Fig. 3b, image V) and liver hepatocytes (image VI).
An example of a more ubiquitously expressed protein was the putative protein ABCG1, expressed abundantly in epithelial cell types, as exemplified by widespread cytoplasmic expression in glandular cells lining crypts in colon mucosa (Fig. 3b, image VII) and maturing germinal cells in seminiferous ducts of testis (image VIII). In addition to these examples of previously unknown proteins, there are several known proteins for which we report an in-depth analysis of protein expression across a multitude of human cells, tissues, and organs, such as the protein OLIG2, showing expression in a subset of glial cells in normal cerebral cortex and malignant gliomas of oligodendrocytic subtype. A summary view with additional examples of protein profiles in normal tissues and cancer tissues for the above five examples is shown in supplemental Fig. 3.
An important complement to the antibody-based profiling described above is to perform transcript profiling using RNA-seq. We have recently shown (15), using a comparison of mass spectrometry-based stable isotope labeling with amino acids in cell culture (SILAC) analysis (26, 27) and quantitative transcript profiling using RNA-seq, that there is a strong correlation between changes of RNA and protein levels when differences in levels between human cell lines are analyzed. This reinforces the need to characterize expression at both the transcript and protein levels, both to validate the respective results and to pinpoint genes with low correlation between protein and RNA changes. The RNA-seq data (15) were reanalyzed for all the putative genes on chromosome 21, and ~56% of the genes showed strong evidence at the transcriptional level (Fig. 4a) as judged by high or medium level transcripts in at least one of the three human cell lines analyzed. A comparison of the expression in the three analyzed cell lines shows that more than 50% of the genes (n = 111) show a “housekeeping” expression pattern with similar or slightly changed transcript levels in all three cell lines, whereas 7% (n = 13) are cell type-specific (supplemental Fig. 4). These 13 genes are interesting starting points to understand the biology of phenotypes corresponding to cells of brain, epithelial, and mesenchymal origin, respectively.
The RNA-seq analysis can also be used to validate putative genes with no evidence at the protein level. A bar plot can be made showing the read coverage for each nucleotide in the region of the chromosome for the putative protein-coding gene, including exons and introns. If the gene transcript is spliced to form a functional mRNA, the read count should be larger for the exons than for the introns. In Fig. 4 (b–d), three examples of chromosome 21 genes with no previous evidence at either the protein or the transcript level are shown, with the predicted exons and introns together with the read coverage shown across the whole chromosomal region. For all three genes, the bar plots suggest efficient splicing of the exons with a low number of reads in the intron regions. These results strongly support that the three putative genes are indeed coding for proteins and reinforce that efforts should be made to generate antibodies to allow for characterization of the corresponding proteins.
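The comparison of exon and intron read coverage described above can be expressed as a simple ratio of mean per-nucleotide coverage. The sketch below is an illustration only: it assumes coverage is available as a list of read counts per nucleotide and that exons and introns are given as half-open (start, end) index pairs into that list; the function names are hypothetical and no threshold for "efficient splicing" is implied by the text.

```python
def mean_coverage(coverage, intervals):
    """Mean reads per nucleotide over half-open (start, end) intervals."""
    total_reads = 0
    total_length = 0
    for start, end in intervals:
        total_reads += sum(coverage[start:end])
        total_length += end - start
    return total_reads / total_length if total_length else 0.0


def exon_intron_ratio(coverage, exons, introns):
    """Ratio of mean exon coverage to mean intron coverage.

    Returns infinity when exons are covered but introns have no reads.
    """
    exon_mean = mean_coverage(coverage, exons)
    intron_mean = mean_coverage(coverage, introns)
    if intron_mean == 0:
        return float("inf") if exon_mean > 0 else 0.0
    return exon_mean / intron_mean


if __name__ == "__main__":
    # Toy locus: two well-covered exons flanking a sparsely covered intron.
    coverage = [10] * 5 + [1] * 10 + [12] * 5
    print(exon_intron_ratio(coverage, [(0, 5), (15, 20)], [(5, 15)]))
```

A ratio well above 1 corresponds to the pattern described for the three example genes, where reads concentrate in the predicted exons.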
In Fig. 5, a detailed matrix is shown with the status of the experimental characterization of the proteins encoded by chromosome 21 genes, here excluding the keratin-associated proteins, with further details presented in supplemental Table 3. The first column shows the status of the annotation performed by UniProt for the 192 putative genes with the color code according to Fig. 1. The second column shows the status of the generation of antibodies with a green box representing genes with at least one antibody approved by the Human Protein Atlas program and available to the public. Most of the remaining genes have a yellow color code, indicating that at least one antigen corresponding to a unique region of the corresponding protein target has been expressed, purified, and verified by mass spectrometry as part of the Human Protein Atlas program. At present, attempts to generate antigens have failed for only four putative genes: AP001346.1, AF015262.1, DSCR8, and C21orf33. Two of these lack transcript evidence according to UniProt, and corresponding transcripts have not been detected in any of the three diverse cell lines assayed using RNA-seq, which calls for more in-depth studies to confirm the annotation of these putative genes as protein coding.
The next three columns in the status matrix show the results of the antibody-based molecular, subcellular, and tissue profiling, respectively. The molecular profiling was done using Western blot analysis, the subcellular profiling was done using immunofluorescence-based confocal microscopy, and the tissue profiling was done using immunohistochemistry on tissue microarrays (see above for details). The color codes show the status of these applications for each gene, with the results displayed as supportive (green), uncertain (yellow), not supportive (red), or not done (black). The final column shows the results of transcript profiling in three functionally different cell lines using next generation sequencing (RNA-seq), and the color code indicates how well the results support actual transcription of the gene, with green as supportive (high or medium expression in at least one cell line), yellow as uncertain (no more than low expression in any cell line), and red as not supportive (not detected in any cell line).
An important part of the molecular characterization is to determine the status of post-translational modifications, such as phosphorylation, acetylation, methylation, glycosylation, and proteolytic truncations. Because these modifications cause changes in the isoelectric point of the target protein, it is possible to explore the protein modification landscape of each gene product by isoelectric focusing followed by Western blotting analysis (28, 29). This allows a rapid and systematic analysis of the modification landscape for each protein, and here we decided to perform the analysis using recombinant proteins expressed from full-length clones of the respective putative genes.
In Fig. 6, 14 examples of chromosome 21 genes are shown with the theoretical pI values for each putative protein, including splice variants, given by red arrowheads. We examined the proteins over a pH gradient from 3 to 10, because a majority of the proteins have isoelectric points in this range. For all proteins analyzed, the observed isoelectric points based on the electrophoretic migration were somewhat different from the theoretical pI calculated from genome sequence data. Of the 14 examined proteins, six resulted in a single band, whereas eight were detected as multiple bands. Although the analysis is based on few samples only, the results suggest that at least half of the analyzed genes encode proteins that are post-translationally modified. These proteins are interesting starting points for systematic analysis to explore the molecular basis of these modifications. In summary, this pilot study shows that an antibody-based isoelectric analysis can be used for rapid and convenient identification of potential targets for further isoform analysis to explore the degree of modification of each gene product.
Here, we report on a gene-centric approach aimed at experimentally annotating all protein-coding genes of human chromosome 21 using antibody-based profiling. The genome sequence analysis by the Ensembl group has in release 59 identified 192 non-keratin-associated putative genes coding for proteins on this chromosome. These genes have been characterized at the protein level by antibody-based profiling, and the status is reported in a matrix. The overall aim is to fill this matrix with information on all levels to generate experimental evidence for molecular characterization, isoforms, subcellular localization, tissue profiles, and cell and tissue specificity and to contribute to the functional annotation of the proteome by identifying incorrectly annotated genes that do not code for proteins.
The study presented here has contributed to several insights of both general and specific interest. Five genes with no previous evidence on the protein level have been identified by molecular characterization (Fig. 2b), and the level of protein modifications has been studied using a new approach for isoelectric focusing based Western blot analysis. Although this analysis was performed only on a small number of genes, the results indicate that a large fraction of the analyzed proteins have multiple isoforms or post-translational modifications. In addition, the tissue profiling using immunohistochemistry has revealed several proteins with highly selective expression patterns.
The protein analysis has been complemented with transcript profiling using next generation sequencing. The results from this analysis provide a useful tool to yield evidence for protein-coding genes as demonstrated by the ratio of reads across introns and exons for a number of chromosome 21 putative genes with no previous evidence on the protein level. The power of the RNA-seq method for transcript analysis can also be further extended to define and characterize the alternative splice variants from each gene locus and to determine the quantitative levels of RNA expression in different cells, tissues, and organs.
At present, we report annotated protein expression using two or more (paired) antibodies for 22% of the genes on chromosome 21. An important priority for the future is to add additional antibodies to allow the results from one antibody to be validated by the other. It will also be important to extend the analysis with renewable antibodies, such as monoclonal antibodies or recombinant affinity binders to complement the polyclonal antibodies generated within the Human Protein Atlas program. In this context, it is reassuring that several programs have been initiated recently to develop new methods for systematic generation of renewable binders to human proteins (30, 31). Another important objective is to extend the validation of the molecular and subcellular localization to include analysis of cell lines in which the gene has been knocked down using siRNA technology. The combination of gene knockdowns and antibody-based profiling is a powerful approach for generating profiling data with high reliability.
In conclusion, we describe a human proteome project to perform a systematic characterization of all the protein-coding genes on human chromosome 21 using antibody-based protein profiling. Through collaboration with research groups utilizing several complementary technologies, this effort can be integrated with similar efforts as part of a Human Proteome Project to characterize the proteins in normal cells, tissues and organs to generate a proteome-wide knowledge-based resource. The objective is to ultimately create an experimentally validated resource covering all proteins encoded by the human genome.
We acknowledge the entire staff of the Human Protein Atlas project.
* This work was supported by grants from the Knut and Alice Wallenberg Foundation and the EU 7th framework program PROSPECTS. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
This article contains supplemental Tables 1–3 and Figs. 1–4.
1 The abbreviation used is: