We describe a strategy to identify tissue-specific biomarkers using publicly available gene and protein databases. Since serological biomarkers are protein-based, using only protein expression databases for the initial identification of candidate biomarkers seems more relevant. While the HPA has characterized more than 50% of human protein-encoding genes (11,200 unique proteins to date), it has not completely characterized the proteome [51]. Therefore, proteins that have not been characterized by the HPA but fulfill our desired criteria would be missed by searching only the HPA. There are also important limitations in using gene expression databases since there is considerable variation between mRNA and protein expression [69] and gene expression does not account for post-translational modification events [71]. Therefore, mining both gene and protein expression databases minimizes the limitations of each platform. To the best of our knowledge, no studies for the initial identification of candidate cancer biomarkers have been conducted using both gene and protein databases.
Initially, the databases were searched for proteins highly specific to or strongly expressed in one tissue. The search criteria were tailored to accommodate the design of the databases, which did not allow for simultaneous searching with both criteria. Identifying proteins that were highly specific to and strongly expressed in one tissue was considered in a later step. In the verification of the expression profiles (see Methods), only 34% (48 of 143) of the proteins were found to meet both criteria. The number of databases mined in the initial identification can be varied at the discretion of the investigator. Additional databases will result in the same number of, or more, proteins being identified in two or more databases.
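The selection of proteins identified in two or more databases can be sketched as a simple counting step. This is a minimal illustration, not the procedure used in the study: the function name `identified_in_multiple`, the database hit lists and the placeholder gene names are all hypothetical.

```python
from collections import Counter


def identified_in_multiple(hits_by_database, min_databases=2):
    """Return the proteins reported by at least `min_databases` databases."""
    counts = Counter()
    for proteins in hits_by_database.values():
        # set() ensures a protein listed twice in one database counts once.
        counts.update(set(proteins))
    return {protein for protein, n in counts.items() if n >= min_databases}


# Hypothetical hit lists for illustration only; not results from the study.
example_hits = {
    "BioGPS": ["GENE_A", "GENE_B", "GENE_C"],
    "TiGER": ["GENE_B", "GENE_D"],
    "UniGene": ["GENE_B", "GENE_C"],
}
shared = identified_in_multiple(example_hits)

# Adding a further (hypothetical) database can only keep or enlarge the set.
with_extra = identified_in_multiple({**example_hits, "HPA": ["GENE_D"]})
print(sorted(shared), sorted(with_extra))
```

Because each additional database can only add to a protein's count, the set of proteins found in two or more databases never shrinks when more databases are mined.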
In the gene expression databases, the criteria used were set for maximum stringency for protein identification, to identify a manageable number of candidates. A more exhaustive search can be conducted using lower stringency criteria. The stringency could be varied in the correlation analysis using the BioGPS database plugin and the C-It database. The correlation cutoff of 0.9 used in identifying similarly expressed genes in the BioGPS database plugin could be reduced to as low as 0.75. The SymAtlas z-score cutoff of |z| ≥ 1.96 could be reduced to |z| ≥ 1.15, corresponding to a 75% confidence level of enrichment. The literature information parameters used in the C-It database, namely fewer than five publications in PubMed and fewer than three publications with the MeSH term of the selected tissue, could be reduced in stringency to allow identification of well-studied proteins. Since C-It does not look at the content of publications in PubMed, it filters out proteins that have been studied even if they have not been studied in relation to cancer.
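The relationship between a z-score cutoff and its confidence level can be checked with a short calculation. The sketch below assumes the z-score is interpreted against a standard normal distribution with a two-sided cutoff; how SymAtlas itself derives its scores is not described here, and the function name `two_sided_confidence` is illustrative.

```python
import math


def two_sided_confidence(z):
    """Two-sided confidence level for a standard normal cutoff |z|."""
    # P(-z <= Z <= z) for Z ~ N(0, 1) equals erf(z / sqrt(2)).
    return math.erf(abs(z) / math.sqrt(2.0))


for cutoff in (1.96, 1.15):
    print(f"|z| >= {cutoff}: {two_sided_confidence(cutoff):.1%} confidence")
```

Under that assumption, a cutoff of 1.96 corresponds to roughly 95% confidence and a cutoff of 1.15 to roughly 75%.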
Although proteins that have been well studied but not as cancer biomarkers represent potential candidates, the emphasis in this study was on identifying novel candidates which have been, overall, minimally studied. A gene's mRNA level and protein expression can have significant variability. Therefore, if lower stringency criteria were used when identifying proteins from gene expression databases, a greater number of proteins would have been identified in at least two of the databases, potentially leading to a greater number of candidate protein biomarkers identified after application of the remaining filtering criteria.
The HPA was searched for proteins strongly expressed in one normal tissue with annotated IHC expression. Annotated IHC expression was selected because it uses paired antibodies to validate the staining pattern, providing the most reliable estimation of protein expression. Approximately 2,020 of the 10,100 proteins in version 7.0 of the HPA have annotated protein expression [51]. Makawita et al. included the criterion of annotated protein expression when searching for proteins with 'strong' pancreatic exocrine cell staining for prioritization of pancreatic cancer biomarkers. A more exhaustive search could be conducted by searching the HPA without requiring annotated IHC expression.
Secreted or shed proteins have the highest chance of entering the circulation and being detected in the serum. Many groups, including ours [23], use Gene Ontology [72] protein cellular localization annotations of 'extracellular space' and 'plasma membrane' to identify a protein as secreted or shed. Gene Ontology cellular annotations do not completely describe all proteins and are not always consistent as to whether a protein is secreted or shed. An in-house secretome algorithm (GS Karagiannis et al., unpublished work) designates a protein as secreted or shed if it is predicted either to be secreted, based on the presence of a signal peptide or on non-classical secretion, or to be a membranous protein, based on amino-acid sequences corresponding to transmembrane helices. It more robustly defines proteins as secreted or shed and was therefore used in this study.
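The decision rule described for the secretome algorithm can be written as a simple logical combination. The in-house algorithm itself is unpublished, so the sketch below only encodes the rule as stated in the text; the class `SecretionPredictions`, its field names and the function `is_secreted_or_shed` are hypothetical, and the upstream predictors that would supply these values are not specified.

```python
from dataclasses import dataclass


@dataclass
class SecretionPredictions:
    """Hypothetical per-protein predictions from upstream sequence tools."""
    has_signal_peptide: bool
    non_classical_secretion: bool
    transmembrane_helices: int


def is_secreted_or_shed(p):
    """Apply the stated rule: secreted (classical or non-classical) or membranous."""
    predicted_secreted = p.has_signal_peptide or p.non_classical_secretion
    predicted_membranous = p.transmembrane_helices > 0
    return predicted_secreted or predicted_membranous


# Illustrative inputs only; these are not predictions for real proteins.
print(is_secreted_or_shed(SecretionPredictions(True, False, 0)))
print(is_secreted_or_shed(SecretionPredictions(False, False, 0)))
```

A protein is excluded only when all three lines of evidence are negative, which is what makes the rule broader than relying on a single localization annotation.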
Evaluating which of the databases had initially identified the 48 tissue-specific proteins that passed the filtering criteria showed that the gene expression databases had identified more of the proteins than the protein expression database. The HPA had initially identified only 9 of the 48 tissue-specific proteins. The low initial identification of tissue-specific proteins was due to the stringent search criteria requiring annotated IHC expression. For example, 20 of the 48 tissue-specific proteins had protein expression data available in the HPA, of which the 11 proteins that were not initially identified by HPA did not have annotated IHC expression. The expression profiles of those proteins would have passed the 'Verification of in silico expression profiles' filtering criteria and, therefore, would have resulted in a greater initial identification of tissue-specific proteins by the HPA.
The HPA has characterized 11,200 unique proteins, which is more than 50% of the human protein-encoding genes [51]. Of the 48 tissue-specific proteins that met the selection criteria, only nine were initially identified from mining the HPA. Twenty of the tissue-specific proteins have been characterized by the HPA. This demonstrates the importance of combining gene and protein databases to identify candidate cancer serum biomarkers. If only the HPA had been searched for tissue-specific proteins, even with lowered stringency, the 28 proteins that met the filtering criteria and represent candidate biomarkers would not have been identified.
The TiGER, UniGene and C-It databases are based on ESTs and collectively identified 46 of the 48 proteins. Of those, only 41% (19 of the 46) were identified in two or more of those databases. The BioGPS and VeryGene databases are based on microarray data and collectively identified 46 of the 48 proteins. Of those, 56% (26 of the 46) were identified uniquely by BioGPS and VeryGene. Clearly, even though databases are based on similar sources of data, individual databases still identified unique proteins. This demonstrates the validity of our initial approach of using databases that differently mine the same data source. The TiGER, BioGPS and VeryGene databases collectively identified all 48 of the tissue-specific proteins. From those three databases, 88% (42 of the 48) were identified in two or more databases, demonstrating the validity of selecting proteins identified in more than one database.
The accuracy of the databases' initial protein identification is related to how explicitly the database could be searched for the filtering criteria of proteins highly specific to and strongly expressed in one tissue. The BioGPS database had the highest accuracy at 26%, as it was searched for proteins similarly expressed as a protein of known tissue specificity and strong expression. The UniGene database, with an accuracy of 20%, could only be searched for proteins with tissue-restricted expression, without the ability to search for proteins also with strong expression in the tissue. The VeryGene database, accuracy of 9%, was searched for tissue-selective proteins and the TiGER database, with 6% accuracy, was searched for proteins preferentially expressed in a tissue. Their lower accuracies reflect that they could not be explicitly searched for proteins highly specific to only one tissue. The C-It database, with an accuracy of 4%, searched for tissue-enriched proteins and the HPA, accuracy of 0.4%, searched for proteins with strong tissue staining. These very low accuracies reflect that the search looked for proteins with strong expression in a tissue, but could not be searched for proteins highly specific to only one tissue.
The low identification of tissue-specific proteins by the C-It database is not unexpected. Given that the literature search parameters initially used filtered out any proteins that had five or more publications in PubMed, regardless of whether those publications were related to cancer, C-It only identified proteins enriched in a selected tissue which have been minimally, if at all, studied. Of the nine proteins C-It initially identified from the tissue-specific list, eight of the proteins had not been previously studied as serum candidate cancer biomarkers. Syncollin (SYCN) has only very recently been shown to be elevated in the serum of pancreatic cancer patients [33]. The eight remaining proteins that C-It identified represent especially interesting candidate biomarkers because they represent proteins that fulfill the filtering criteria but have not been well studied.
A PubMed search revealed that 15 of the 48 tissue-specific proteins identified had been previously studied as serum markers of cancer or benign disease, providing credence to our approach. The most widely used biomarkers currently suffer from a lack of sensitivity and specificity due to the fact they are not tissue-specific. CEA is a widely used colon and lung cancer biomarker. It was identified by the BioGPS and TiGER databases and the HPA as highly specific to or strongly expressed in the colon, but not by any of the databases for the lung. CEA was eliminated upon evaluating the protein expression profile in silico, because it is not tissue specific. High levels of CEA protein expression were seen in the normal tissues of the digestive tract, such as the esophagus, small intestine, appendix, colon and rectum, as well as in bone marrow, and medium levels were seen in the tonsil, nasopharynx, lung and vagina. PSA is an established, clinically relevant biomarker for prostate cancer with demonstrated tissue specificity. PSA was identified in our strategy as a prostate-specific protein, after passing all the filtering criteria. This lends credence to our approach because we re-identified known clinical biomarkers, and the strategy retained the one that is tissue-specific (PSA) while filtering out the one that is not (CEA).
From the list of candidate proteins that have not been studied as serum cancer or benign disease biomarkers, 18 of the 26 proteins were identified in proteomic datasets. The proteomic datasets primarily contain the CM proteomes of various cancer cell lines, and other relevant fluids, enriched for the secretome. For proteins that have not been characterized by the HPA, it is possible the transcripts are not translated, in which case they would represent unviable candidates. If the transcripts are translated and the protein enters circulation, it must do so at a level detectable by current proteomic techniques. Proteins that have been characterized by the HPA may not necessarily enter the circulation. The identification of a protein in the proteomic datasets verifies the presence of the protein in the secretome of cancer at a detectable level; therefore, the protein represents a viable candidate. Because cancer is a highly heterogeneous disease, the integration of multiple cancer cell lines and relevant biological fluids likely provides a more complete, if not necessarily complete, picture of the cancer proteome.
Relaxin 1 is a candidate protein that was not identified in any of the proteomes but its expression was confirmed by semi-quantitative RT-PCR in prostate carcinomas [73]. Therefore, a protein not being identified in any of the proteomic datasets does not necessarily imply that it is not expressed in cancer.
Acid phosphatase is a previously studied prostate cancer serum biomarker [74]. When compared to proteomic datasets (data not shown), it was identified in the seminal plasma proteome [25], the CM of many prostate cancer cell lines [28] (P Saraon et al., unpublished work) and, interestingly, the CM of colon cancer cell lines Colo205 [52] and LS180 (GS Karagiannis et al., unpublished work), the CM of breast cancer cell lines HCC-1143 (MP Pavlou et al., unpublished work) and MCF-7 [52], the CM of oral cancer cell line OEC-M1 [52] and the CM of ovarian cancer cell line HTB161 (N Musrap et al., unpublished work). Graddis et al. observed very low levels of acid phosphatase mRNA expression in both normal and cancerous breast and colon tissue, in normal ovary and salivary gland tissue and comparatively high levels in normal and malignant prostate tissue. We, therefore, reasoned that identification of a tissue-specific protein in a proteome of a different tissue does not necessarily correlate with strong expression in that proteome.
Identification of a tissue-specific protein in only proteomes corresponding to that tissue, coupled with in silico evidence of strong and specific protein expression in that tissue, indicates an especially promising candidate cancer biomarker. SYCN has been shown to be increased in the serum of pancreatic cancer patients [33]. SYCN was identified in the pancreatic juice proteome [33] and in normal pancreatic tissue (H Kosanam et al., unpublished work) and by BioGPS, C-It, TiGER, UniGene and VeryGene databases as strongly expressed in only the pancreas. Folate hydrolase 1, also known as prostate-specific membrane antigen, and KLK2 have been studied as prostate cancer serum biomarkers [67]. Folate hydrolase 1 and KLK2 were both identified in the CM of various prostate cancer cell lines [28] (P Saraon et al., unpublished work) and the seminal plasma proteome [25] and by BioGPS and TiGER databases as strongly expressed in only the prostate. Of the tissue-specific proteins which have not been previously studied as serum cancer or benign disease biomarkers, colon-specific protein GPA33, pancreas-specific proteins chymotrypsinogen B1 and B2, chymotrypsin C, CUB and zona pellucida-like domains 1, KLK1, PNLIP-related protein 1 and 2, regenerating islet-derived 1 beta and 3 gamma and prostate-specific protein NPY represent such candidates. Investigation of these candidates should be prioritized for further verification and validation studies.
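The prioritization by proteome overlap described above can be sketched as a small classification step. This is an illustrative simplification rather than the procedure used in the study: the function `proteome_support` and its category labels are hypothetical, and mapping each proteome (for example, seminal plasma to prostate or pancreatic juice to pancreas) to a single tissue of origin is an assumption made here for the example.

```python
def proteome_support(candidate_tissue, proteome_tissues):
    """Classify a candidate by the tissues of the proteomes it was found in."""
    tissues = set(proteome_tissues)
    if not tissues:
        # Absence from the datasets does not rule out expression in cancer.
        return "not detected"
    if tissues == {candidate_tissue}:
        # Found only in proteomes of the matching tissue: highest priority.
        return "matching tissue only"
    if candidate_tissue in tissues:
        return "matching and other tissues"
    return "other tissues only"


# Tissue assignments below paraphrase the examples in the text.
examples = {
    "SYCN": ("pancreas", ["pancreas", "pancreas"]),
    "acid phosphatase": ("prostate", ["prostate", "colon", "breast", "oral", "ovary"]),
    "relaxin 1": ("prostate", []),
}
for name, (tissue, found_in) in examples.items():
    print(name, "->", proteome_support(tissue, found_in))
```

Candidates in the first category would then still need the in silico evidence of strong and specific expression before being treated as especially promising.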
The proposed strategy seeks to identify candidate tissue-specific biomarkers for further experimental studies. Using colon, lung, pancreatic and prostate cancer as case examples, we identified a total of 26 tissue-specific candidate biomarkers. In the future, we intend to validate the candidates; if validation is successful, we can validate the use of this strategy for in silico cancer biomarker discovery. Using this strategy, investigators can rapidly screen for candidate tissue-specific serum biomarkers and prioritize candidates for further study based on overlap with proteomic datasets. This strategy can be used to identify candidate biomarkers for any tissue, contingent on the data availability in the mined databases, and incorporate various proteomic datasets at the discretion of the investigator.