Our first goal was to test the hypothesis that blood- and urine-detectable protein biomarker candidates could be identified by using tissue-based gene expression microarray data. Using previously described methods [12], we acquired gene expression data sets representing 41 diseases, as well as control tissue samples for each, from GEO [4], the largest international repository for gene expression microarray data, with over 400,000 samples at the time of this writing.
We applied our IRDDP methodology to each disease. First, we calculated a set of differentially expressed genes for each disease using the RankProd meta-analysis package at a percentage of false prediction (pfp) ≤5% [30]. For diseases with multiple microarray data sets, we included genes that were differentially expressed in at least one of the data sets. We then filtered the gene sets through a list of 3,638 proteins with known detectable abundance in serum, plasma, or urine. The list was created from public sources [31], [32], [33], [34] and has been described previously [13]. This effort yielded a set of candidate protein biomarkers for each disease (Dataset S1).
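The filtering step described above can be sketched in a few lines. This is only an illustrative sketch, not the pipeline used in the study: the function name `candidate_biomarkers`, the toy gene symbols, and the assumption that each gene symbol maps one-to-one onto a protein in the biofluid list are all hypothetical.

```python
from typing import Iterable, Set


def candidate_biomarkers(
    de_genes_per_dataset: Iterable[Set[str]],
    biofluid_proteins: Set[str],
) -> Set[str]:
    """Union the differentially expressed genes across data sets, then keep
    only those found on the biofluid-detectable protein list."""
    differentially_expressed: Set[str] = set()
    for genes in de_genes_per_dataset:
        # "Differentially expressed in at least one data set" is a set union.
        differentially_expressed |= genes
    # Assumption: gene symbols and protein identifiers share one namespace.
    return differentially_expressed & biofluid_proteins


# Toy, made-up inputs purely for illustration.
datasets = [{"CD44", "GENE_A"}, {"CXCL9", "GENE_B"}]
biofluid = {"CD44", "CXCL9", "PECAM1"}
candidates = candidate_biomarkers(datasets, biofluid)
print(sorted(candidates))
```

In practice an identifier-mapping step between probe sets, gene symbols, and protein accessions would likely be needed before the intersection.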
For each disease, we then compared our candidate biomarkers with known diagnostic protein biomarkers in the GVK BIO Online Biomarker Database (GOBIOM). GOBIOM is an independent, manually curated knowledge base drawn from global clinical trials, annual meetings, and journal articles [35]. As of this writing, GOBIOM contains 6,098 known biomarkers for 368 therapeutic indications with 23,166 unique references. For 22/41 diseases, known diagnostic protein biomarkers were enriched in our predicted protein sets (p<0.05, Fisher's exact). In 9/11 diseases for which at least three data sets were available, known diagnostic protein biomarkers were even more significantly enriched in our predicted protein sets. The -log(p-value) in diseases with three or more data sets (n = 11) was significantly higher than that in diseases with fewer than three data sets (n = 30; p = 0.004, Fisher's exact). Of the remaining 19 diseases, 11 were represented by only a single gene expression data set. Therefore, we concluded that the more gene expression data sets are available for a disease, the more likely known biofluid protein biomarkers are to be significantly differentially expressed in at least one of those data sets, suggesting that the likelihood of finding new biomarkers increases with more available data sets. While this finding is not surprising, we were able to conclude that joining as few as three experiments could statistically significantly improve the rediscovery of clinically validated protein biomarkers across 41 diseases.
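To make the enrichment test concrete, the sketch below computes a one-sided Fisher's exact p-value from the hypergeometric tail. The text does not state whether the test was one- or two-sided, nor which protein universe served as background, so both choices here are assumptions; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap: int, predicted: int, known: int, universe: int) -> float:
    """One-sided Fisher's exact p-value: the probability of observing at least
    `overlap` known biomarkers among `predicted` proteins drawn without
    replacement from `universe` proteins, `known` of which are known biomarkers."""
    max_overlap = min(predicted, known)
    total = comb(universe, predicted)
    # Sum the hypergeometric probabilities for overlaps >= the observed one.
    tail = sum(
        comb(known, k) * comb(universe - known, predicted - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts for one disease; the background of 3,638 assumes the
# biofluid-detectable protein list is used as the universe.
p_value = fisher_enrichment_p(overlap=8, predicted=200, known=40, universe=3638)
print(p_value)
```

A disease would count as enriched at the threshold used above when this p-value falls below 0.05.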
| Table 1. Known diagnostic protein biomarkers were significantly enriched in the sets of differentially expressed RNA for 22 out of 41 diseases. |
We then applied IRDDP to the specific problem of finding serum biomarkers for the diagnosis of transplant acute rejection (AR). We integrated three biopsy-based gene expression microarray studies from pediatric renal, adult renal [36], and adult cardiac [29] transplantation, identified genes commonly upregulated in AR compared to stable graft function, and then measured the abundance of proteins encoded by these genes in serum to identify cross-organ AR protein biomarkers. The first of the three studies was performed in pediatric renal transplantation. It compared gene expression profiles in biopsy samples from 18 AR patients and 18 patients with stable graft function (STA) in the absence of AR and any other substantive pathology (Table S1). Using Significance Analysis of Microarrays [37], we found 2,805 genes with increased expression in AR biopsies (q-value ≤0.05; fold change ≥2).
We combined the results of this study with data from two other transplant studies that we retrieved from GEO. One study compared biopsy samples from 13 AR patients with 19 STA samples after adult kidney transplant (GEO dataset GDS724 [36]). The study yielded 2,316 upregulated AR genes with q-values ≤0.05. The second study compared 12 AR biopsy samples with 13 non-rejection samples after cardiac transplant (GEO series GSE4470 [29]). It yielded 283 upregulated AR genes with q-values ≤0.05. By intersecting the three data sets, we identified a gene expression signature containing 45 genes in common, irrespective of the specific studies or transplanted organs (Table S2). These genes are hereafter referred to as the “common-AR” set of genes.
To evaluate the significance of finding 45 genes in common, we shuffled the gene labels across the three data sets and repeated the entire analysis 100,000 times. Under random permutation, the number of intersecting genes was normally distributed around n = 9 (Fig. S1), suggesting a false discovery rate of 20% (9 of 45). This result also suggested that the probability of finding 17 or more commonly dysregulated AR genes by chance was less than 1%, and that the probability of finding 24 or more of them by chance was less than 1×10⁻⁵.
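The permutation analysis can be approximated as follows. This is a simplified sketch under stated assumptions: it treats all three studies as sharing one gene universe, so that shuffling gene labels is equivalent to drawing random gene sets of the observed sizes. The universe size of 20,000 is a hypothetical placeholder, and the function names are illustrative only.

```python
import random
from typing import List, Sequence


def null_intersection_sizes(
    universe_size: int,
    set_sizes: Sequence[int],
    n_permutations: int,
    seed: int = 0,
) -> List[int]:
    """Draw random gene sets of the given sizes from a shared universe and
    record how many genes all of them have in common, once per permutation."""
    rng = random.Random(seed)
    universe = range(universe_size)
    sizes = []
    for _ in range(n_permutations):
        common = set(rng.sample(universe, set_sizes[0]))
        for size in set_sizes[1:]:
            common &= set(rng.sample(universe, size))
        sizes.append(len(common))
    return sizes


def empirical_p(null_sizes: Sequence[int], observed: int) -> float:
    """Fraction of permutations with an intersection at least as large as the
    observed one, with an add-one correction so p is never exactly zero."""
    hits = sum(1 for s in null_sizes if s >= observed)
    return (hits + 1) / (len(null_sizes) + 1)


# Set sizes are taken from the text; universe size and permutation count are
# placeholders (the study used 100,000 permutations).
null_sizes = null_intersection_sizes(20000, [2805, 2316, 283], n_permutations=200)
p_45 = empirical_p(null_sizes, observed=45)

# False discovery rate as the expected chance overlap over the observed overlap,
# using the values reported above.
fdr = 9 / 45
```

The last line reproduces the arithmetic behind the reported rate: an expected 9 chance overlaps among 45 observed genes corresponds to 20%.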
We next retrieved mRNA expression data for each common-AR gene across 74 tissue and cell types from SymAtlas [38], and identified the cell type with the highest expression. Surprisingly, our common-AR genes were most enriched in CD14+ monocytes (p = 0.003, Fisher's exact). Seven of the 45 common-AR genes had their highest expression levels in CD14+ monocytes: CD44, IL10RA, S100A4, IGSF6, CTSS, CASP4, and SCAND2. Our results suggest an important role for activated pro-inflammatory monocytes in transplant rejection. This finding is consistent with recent reports that monocyte/macrophage activation might induce inflammation, leading to impairment of graft function in renal transplant patients [39].
We then analyzed the functions of the 45 common-AR genes using Ingenuity Pathway Analysis. As expected, 28 of the 45 common-AR genes were involved in the inflammatory response (p = 3.37×10⁻¹⁷, Fisher's exact; p<3.56×10⁻³ after Benjamini-Hochberg multi-test correction). Furthermore, 23 common-AR genes were involved in cell-mediated immune responses (p = 3.34×10⁻¹⁵; p<2.97×10⁻³ after Benjamini-Hochberg correction). Finally, 23 common-AR genes were involved in a single pathway associated with inflammatory responses, antimicrobial responses, and cellular movement regulated by STAT-1 (Fig. S2).
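For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, a minimal generic sketch follows. It is not a reproduction of the Ingenuity Pathway Analysis computation, whose internals are not described here; the function name and the toy p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for three hypothetical functional categories.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(adjusted)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values non-decreasing in the raw p-value.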
ELISA kits were available for ten of the 45 candidate proteins, including six proteins known to be detectable in biofluids and four that were not. We measured all ten proteins in a pilot study of serum samples collected within 24 hours after biopsy from an independent set of 19 patients with biopsy-proven AR and 20 patients without AR or any other substantive pathology (STA). The patients were from a pediatric and young adult renal transplant study. No patients were positive for BK virus infection, and no patient samples in the ELISA study were matched with samples used in the microarray study. The AR/STA samples were matched for recipient and donor gender, age, type of immunosuppression, time post-transplant, race, and type of end stage renal disease (Table S3).
Three of the ten proteins were statistically significantly upregulated in the AR serum samples compared to the STA samples after renal transplantation. They were PECAM1 (also known as CD31 antigen, or platelet/endothelial cell adhesion molecule), CXCL9 (MIG, chemokine ligand 9), and CD44 (hyaluronic acid receptor). The Mann-Whitney U test yielded p-values of 1×10⁻³, 1×10⁻⁴, and 5×10⁻³, respectively. Receiver operating characteristic (ROC) curves showed the ability of each individual protein to distinguish AR from STA. The areas under the ROC curves (AUC) were 0.811, 0.864, and 0.761 for PECAM1, CXCL9, and CD44, respectively. At optimal performance, PECAM1 distinguished AR from STA with 89% sensitivity and 75% specificity; CXCL9: 78% sensitivity and 80% specificity; CD44: 80% sensitivity and 75% specificity.
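The relationship between the Mann-Whitney U statistic, the AUC, and an operating point on the ROC curve can be illustrated with the sketch below. The criterion used to define "optimal performance" is not stated above, so maximizing Youden's J (sensitivity + specificity − 1) is an assumption; the function names and the toy concentrations are hypothetical.

```python
from typing import Sequence, Tuple


def auc_from_pairs(ar: Sequence[float], sta: Sequence[float]) -> float:
    """AUC as U / (n_AR * n_STA): the fraction of (AR, STA) pairs in which
    the AR value is higher, with ties counted as one half."""
    wins = 0.0
    for a in ar:
        for s in sta:
            if a > s:
                wins += 1.0
            elif a == s:
                wins += 0.5
    return wins / (len(ar) * len(sta))


def best_cutoff(ar: Sequence[float], sta: Sequence[float]) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.
    A sample is called AR when its value is >= cutoff."""
    best = (float("nan"), 0.0, 0.0)
    best_j = -1.0
    for cutoff in sorted(set(ar) | set(sta)):
        sensitivity = sum(a >= cutoff for a in ar) / len(ar)
        specificity = sum(s < cutoff for s in sta) / len(sta)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j = j
            best = (cutoff, sensitivity, specificity)
    return best


# Toy, made-up serum concentrations (arbitrary units).
ar_values = [5.0, 6.0, 7.0, 3.0]
sta_values = [1.0, 2.0, 4.0, 3.5]
print(auc_from_pairs(ar_values, sta_values))  # 0.875
print(best_cutoff(ar_values, sta_values))     # (5.0, 0.75, 1.0)
```

Because the AUC is a rescaled U statistic, a significant Mann-Whitney result and an AUC well above 0.5 are two views of the same separation between groups.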
We then measured the concentration of these proteins in a second pilot study on plasma samples of cardiac allograft recipients to identify cross-organ AR biomarkers. We compared samples from 32 AR patients and 31 STA patients. The samples were matched for demographic characteristics (Table S4). None of them was infected with CMV. Interestingly, all three markers were upregulated in AR compared to STA. The Mann-Whitney U test yielded p-values of 3×10⁻³ (PECAM1), 0.019 (CXCL9), and 4×10⁻³ (CD44). The areas under the ROC curves for distinguishing AR from STA were 0.716, 0.672, and 0.711 for PECAM1, CXCL9, and CD44, respectively.
We evaluated the performance of a combined panel of PECAM1 and CXCL9 using a three-fold cross-validation. We randomly selected two thirds of the samples, trained a multinomial logistic regression model, and calculated the predictive performance on the remaining one third of samples. After repeating the process 1,000 times, the average ROC curves showed an improvement on cardiac AR diagnosis and no additional improvement on renal AR diagnosis (Fig. S3), suggesting that a large clinical trial combining PECAM1 and CXCL9 with other previously found protein biomarkers would be needed to evaluate the predictive diagnosis of AR. Adding CD44 did not improve the regression models.
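The repeated two-thirds/one-third evaluation can be sketched as below. Several simplifications are assumptions rather than the method used in the study: with only two classes, multinomial logistic regression reduces to ordinary binary logistic regression, which is fitted here by plain gradient descent; the sketch averages held-out AUC values rather than whole ROC curves; and all names and the toy data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # hypothetical (PECAM1, CXCL9) values, ideally standardized


def predict(w: Sequence[float], row: Row) -> float:
    """Logistic probability of AR for one sample."""
    z = w[0] + w[1] * row[0] + w[2] * row[1]
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x: Sequence[Row], y: Sequence[int], steps: int = 500, lr: float = 0.1) -> List[float]:
    """Fit [bias, w_pecam1, w_cxcl9] by batch gradient descent on the log-loss."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for row, label in zip(x, y):
            err = predict(w, row) - label
            grad[0] += err
            grad[1] += err * row[0]
            grad[2] += err * row[1]
        w = [wi - lr * g / len(x) for wi, g in zip(w, grad)]
    return w


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_holdout_auc(x: Sequence[Row], y: Sequence[int], repeats: int = 20, seed: int = 0) -> float:
    """Average held-out AUC over random 2/3 train, 1/3 test splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(idx)
        cut = (2 * len(idx)) // 3
        train, test = idx[:cut], idx[cut:]
        if len({y[i] for i in train}) < 2 or len({y[i] for i in test}) < 2:
            continue  # skip splits in which one class is missing
        w = fit_logistic([x[i] for i in train], [y[i] for i in train])
        pos = [predict(w, x[i]) for i in test if y[i] == 1]
        neg = [predict(w, x[i]) for i in test if y[i] == 0]
        aucs.append(auc(pos, neg))
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Toy, made-up, cleanly separated data (1 = AR, 0 = STA).
toy_x = [(2.0 + 0.1 * i, 2.0) for i in range(6)] + [(0.1 * i, 0.0) for i in range(6)]
toy_y = [1] * 6 + [0] * 6
mean_auc = repeated_holdout_auc(toy_x, toy_y)
print(mean_auc)
```

Comparing this averaged value for the two-marker model against single-marker models fitted in the same way is one way to judge whether the panel adds information.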
We performed an immunohistochemistry study on our best-performing marker, PECAM1. The goal of the study was to compare its protein expression in AR and STA samples from renal, hepatic and cardiac allograft biopsies. In STA kidney tissue, PECAM1 staining was mainly observed in the endothelial cells of glomeruli, in peritubular capillaries, and in large blood vessels. In contrast, examination of staining patterns in AR biopsies revealed dense infiltrates of PECAM1, as well as positive lymphocytes and mononuclear cells in the interstitium. Similarly, dense endothelial PECAM1 staining was observed in the hepatic and cardiac transplant AR tissues, along with staining in infiltrating mononuclear cells. We observed only minor endothelial staining in hepatic and cardiac STA tissues. These immunohistochemistry results showed significantly increased PECAM1 protein expression in the AR tissues compared to STA tissues across transplanted organs.
Furthermore, our studies showed that PECAM1 protein was also significantly upregulated in the serum samples from AR patients compared with samples from patients with BK virus infection (n = 10, p = 0.001, Mann-Whitney U test) and chronic allograft injury (n = 10, p = 6×10⁻⁵, Mann-Whitney U test) after renal transplantation (Fig. S4). Analysis across hundreds of diseases using our GeneChaser tool [40] showed that the mRNA expression of PECAM1 is significantly upregulated in various cancers, but not in other potential confounding conditions, such as infection and hypertension (http://tinyurl.com/yhq9h3k). These results suggest that PECAM1 is a serum marker specific for allograft acute rejection, irrespective of the transplanted organ.
Finally, as mentioned above, 23 of our 45 common-AR genes were involved in a single pro-inflammatory pathway regulated by STAT-1 (Fig. S2). Among the ten proteins we tested by ELISA, five were within this pathway and five were outside of it. All five proteins outside the pathway failed validation, while three of the five proteins inside it were validated as AR markers. The 60% success rate (3 of 5) within this single pathway suggests that it is likely to represent a common functional pathway in AR across transplanted organs. Other novel AR protein markers are likely to be found among the remaining 18 common-AR genes/proteins inside this pathway that have not yet been tested by ELISA (Fig. S2). These proteins include CD2, Cathepsin S, and SH2D2A.