Metastatic Cancer Study
We now discuss the application of the proposed methodology to a study looking at metastatic cancer. Based on the availability of expression data for metastatic samples and clinical information regarding the distinction of primary and metastatic tumors, we selected three studies from publicly available data sources [10
]. These three studies were selected based on two criteria: 1) both localized and metastatic samples are profiled, and 2) a reasonable number of common genes appear across datasets. It should be noted that generally only a small number of metastatic samples are profiled, which was the case in all three datasets. Throughout the article, the terms primary and localized will be used interchangeably.
The goal of this meta-analysis is to identify the set of genes that best distinguishes metastatic tumors from primary tumors in human cancer tissue samples across distinct organ sites. The method mentioned in the previous section is applied to the three training sets to transform the data to POE using both the EM and MCMC algorithms, and an optimal signature based on leave-one-out cross-validation logistic regression framework is obtained. The method will be compared to a few alternative meta-analytic approach ([5
]) in terms of the selected gene signatures and the clustering of primary and metastatic tumors based on them. Although the validation of methodology is challenging, we used our gene signature to predict metastasis-free survival time in the breast cancer study proposed by van't Veer et al
] as a possible validation. The hypothesis presumed here is that the profile for distinguishing metastatic from nonmetastatic tumors can be used to predict aggressive cancer prognosis.
Chen et al
] mainly focus on characterizing the global gene expression patterns that distinguishes hepatocellular carcinoma (HCC) from non-HCC samples using cDNA microarrays. Our sample size numbers (see Table ) are different from theirs because we have excluded non-tumor samples as well as repeat samples on the same patient. Removing these samples leaves us with 69 unique primary tumors and 9 liver tumors which have metastasized.
Description of data used in meta-analysis
Garber et al
] describe the diversity of gene expression patterns in squamous cell carcinomas (SCC), large cell lung carcinomas (LCLC), small cell lung carcinomas (SCLC), and adenocarcinoma (AC) using cDNA microarrays. These four subtypes of lung cancer are often detected in epithelial cells that line different sections of airways in the lung, and their treatment options differ by these types due to the pathological distinction among them. We first selected all 6 unique metastatic tumors and removed their paired samples profiled at primary stage. Identifying and removing duplicate samples was performed the same way as for the Chen et al
. data. The subset of patients included in our meta-analysis were 27 primary adenocarcinoma samples and 6 samples with lymph node metastases.
Finally, the Latulippe et al
] study identifies genes that differentiates primary and metastatic cancers in the prostate. Using Affymetrix oligonucleotide array U95 human gene arrays, they reported gene expression profiles of nearly 25,000 genes/ESTs. All samples were included in our meta-analysis. The details for the three studies are summarized in Table .
An important aspect of this collection of data is that the organ sites are different. We are postulating a hypothesis that there is a common profile separating localized tumors from metastatic tumors across the three sites. Similar evidence for this type of hypothesis has been suggested before [15
]. The microarray platform differs by studies, so we mapped clone/probeset IDs to Unigene cluster IDs (UGIDs) of its most recent build through SOURCE [16
]. UGIDs are constantly updated. Because our initial mapping was done in the year 2004, we translated these UGIDs to the June 2006 build (No. 191) in the NCBI database. The genes we report here and their annotation in the remainder of the paper is consistent with all annotations associated with the most up-to-date Unigene clusters. When multiple clones are mapped to the same UGID, we averaged the expression over the clones within each sample. Such a mapping produced 1633 common UGIDs.
Before combining 140 samples from different sources into a single dataset, we transformed the raw data to POE from each study by normalizing the distribution of expression values in metastatic samples to that of localized samples. Note that the localized or primary tumors represent the baseline group, since our goal is to select gene signature that distinguishes metastatic tumors from localized tumors, for which many conflicting hypotheses have been postulated. The output of POE from each study was then combined to form a single expression dataset with 1633 genes and 140 samples.
In the following, the POE data transformations by the EM and MCMC algorithms will be analyzed in parallel for the sake of comparison. All primary tumors are color-coded in red and metastatic tumors in green. In terms of computational speed, estimation of POE based on the EM algorithm takes less than a minute for 1633 genes per dataset, while that using MCMC takes about 50 minutes for 2000 iterations and 4 periodic skips in the sampler. As the numbers of genes and samples grow, this difference will be substantial. For example, it usually takes 4 hours to fit POE for a dataset with 10,000 genes using full Bayesian modelling as opposed to 3 minutes for the maximum likelihood approach using the EM algorithm. The reason for the computational difference is that the EM algorithm is fit to one gene at a time, while the MCMC algorithm involves fitting to expression measurements for all genes simultaneously.
Figures and show the POE transformation for two genes using both the EM and MCMC algorithms. In both plots, the top panel shows the expression levels on the raw scale, followed by those on the POE scale from the EM and MCMC algorithms, respectively. The gene in Figure is TGFB1 (UGID Hs.155218), which controls proliferation and differentiation in many cell types. The gene in Figure is F2 (UGID Hs.410092), coagulation factor II, whose mutation leads to various forms of thrombosis and which is often expressed in liver tissues.
Figure 1 Expression of TGFB1. Transforming growth factor beta 1 (TGFB1) gene expression on raw (upper), POE EM (middle) and POE MCMC (lower) scales. This gene is uniformly underexpressed in metastatic samples. Open circles indicate primary tumor samples, and stars (more ...)
Figure 2 Expression of F2. F2 coagulation factor II (F2) gene expression on raw (upper), POE EM (middle) and POE MCMC (lower) scales. This gene is underexpressed primarily in metastatic samples of the Chen liver study. Open circles indicate primary tumor samples, (more ...)
Although both genes are in the signature obtained by our methods, they clearly represent different types of genes. Based on Figure , F2 is under-expressed in the metastatic liver samples of Chen et al., weakly under expressed in the lung samples of Garber et al., and not differentially expressed in the Latulippe et al. data. It was found significant only in the liver study among the three studies we considered here. On the other hand, TGFB1 is a gene whose expression is uniformly under expressed in metastatic samples across all three studies.
This observation on the two types of expression pattern on POE scale suggests that our signature will contain both types of genes. As will be shown later, a conventional meta-analytic approach that combines strength of differential expression across studies on the raw scale tends to select genes that behave similarly to TGFB1, whereas our method picks up both types of genes. Unless genes with expression patterns similar to F2 dominate the entire signature, the gene set from our method tends not to be influenced by a single study.
As we proposed POE transformations using two different implementations, we will refer to the signatures from the data transformed by the EM and MCMC algorithms as the POE EM signature and the POE MCMC signature, respectively.
To obtain a gene signature that distinguishes metastatic samples from localized samples, we calculated risk indices for all samples. What we call a risk index is described in the Methods section. A logistic regression is fitted for each gene with one sample held out at a time. The response variable is metastasis status (1 = metastatic, 0 = localized). For all genes we iterated the same procedure holding each sample out while recording coefficients β and p-values. Following the risk index approach for classification expalined in Methods section, we calculated risk indices for all 140 subjects at various sizes of the gene signature. The optimal signature size p was then determined based on classification performance.
For classification purposes, we predicted the subjects with positive risk index to be metastatic and those with negative risk index to be localized cancer. Using Figure , we took the optimal size to be 80 for the POE EM signature as the error rates in metastatic and primary tumor samples collectively reach a minimum and do not decrease further as more genes are added beyond 80. A similar criterion was applied to obtain a 70-gene POE MCMC signature. A plot of the risk indices and the optimal cutpoint is given in Figure . The POE EM and POE MCMC signatures share 52 common UGIDs.
Figure 3 Misclassification Error. Misclassification error rates in metastatic (starred line) and primary tumors (open circled line). The upper panel is the error rates from the data transformed by the EM algorithm, and the lower panel is that from the data transformed (more ...)
Risk index. Derived risk indices from the data transformed by the EM and MCMC algorithms. Primary and metastatic tumors are represented by open circles and stars respectively. The y-axis is risk index.
Comparison and Validation
We performed other analyses for the sake of comparison. First, we compared the classification performance of the signatures found using meta-analyses with that in which the classifiers were constructed on one dataset only and tested on the other two datasets. The performance is summarized in Table . While such individual study-specific signatures tended to perform well on the training dataset, their performance did not generalize well to other datasets. The consistently poor performance of all signatures on the Garber dataset, including its own signature, suggests that this dataset might have poorer reliability than the others within the common subset of 1633 genes used in this analysis.
We also compared our methods with two meta-analysis techniques developed in [5
] and [13
]. The former performs Bayesian inference on the classical Hedges-Olkin pooled effect sizes for each gene from multiple studies, and the latter uses Bayesian hierarchical model to pool datasets across studies through group-specific mean and variance parameterization and selects gene signature based on their Bayesian estimate of FDR.
First, since the method of [5
] pools the differential expression statistics from a collection of raw-scale data, there is no analogue of a risk index-based classification method available using their signatures. Instead, we first obtained a signature of size 80 based on univariate gene selection. Here the choice of size 80 in all signatures was chosen to provide a fair comparison of class prediction power with POE signatures. This corresponds to controlling the FDR at 0.02 in the method by [5
]. We call this the effect size (ES) signature. We also fitted the hierarchical model from [13
] using WinBUGS software [17
]. We used the prior specification reflecting vague prior information as in the original paper. The fitted model was obtained from a simulation of 12,000 iterations with the initial 2,000 iterations used for burn-in. The estimated probabilities of differential expression were surprisingly low, with the highest probability 0.003. This implies that the 80 gene signature has FDR 99%. For the sake of comparison, we also took the 80 gene to assess its class prediction ability. We call this Conlon signature. Since both POE and the latter method report the probability of differential expression of individual genes, we examined the concordance between the two sets of probabilities. Figure shows the probability in Conlon et al
. plotted against that in POE EM. The ES signature shared 15 UGIDs in common with the POE EM signature and 18 genes with the POE MCMC signature only, which suggests that the two signatures will have different characteristics.
POE and Conlon et al. comparison. Concordance of the gene specific probability of differential expression between the POE (EM) and the method in .
Meanwhile, the Conlon signature had an overlap of two genes with the ES signature and one gene with the POE EM and MCMC signatures. The poor overlap of Conlon signature with others is consistent with the high Bayesian FDR estimated above.
To assess the classification performance, we performed hierarchical clustering of tissue samples from the individual studies using the ES signature. Figures through show the heatmaps of the ES signature in individual studies with clustering tree. These were drawn separately because the raw scale data cannot be directly combined as in POE. Figures , are the heatmaps of the POE EM and MCMC signatures. To highlight the sample labels in each plot, a yellow/blue color strip was added to the top of the dendrograms through Figures , , , , , which should be viewed along with the breakdown of the clustering tree. For all plots, we used average linkage clustering with the distance metric defined using the Euclidean metric. This was also done for the Conlon signature [see Additional Files 1
]. We found that the clustering performance of this signature was similar to that in the ES signature as well, with most of the errors committed in Garber lung study. The overall classification performance across all signatures is provided in Table . Based on the classification table, we see that the proposed methods (EM and MCMC) greatly outperform the Conlon signature, while they also are superior to the ES method, although this difference is smaller.
Figure 6 Chen et al. data. Hierarchical clustering of tumors in Chen et al. data using the effect size signature. The expression here is on the raw scale. The color strip in blue and yellow below the heatmap indicates primary and metastatic tumors. Blue indicates (more ...)
Latulippe et al. data. Hierarchical clustering of tumors in Latulippe et al. using the ES signature. The expression here is on the raw scale. The color strip in blue and yellow below the heatmap indicate primary and metastatic tumors, respectively.
POE EM for all three datasets. Hierarchical clustering of tumors of all three studies using the POE EM signature. The expression is on the POE scale.
POE MCMC for all three datasets. Hierarchical clustering of tumors of all three studies using the POE MCMC signature. The expression is on the POE scale.
Garber et al. data. Hierarchical clustering of tumors in Garber et al. using the ES signature. The expression here is on the raw scale. The color strip in blue and yellow below the heatmap indicates primary and metastatic tumors, respectively
We note that clustering with all signatures give fairly accurate results in all three studies. In the ES signature, only a few metastatic samples are grouped together with two other primary tumors for the Chen liver study (Figure ). Two metastatic samples are situated under the same node with primary tumors in Garber lung study (Figure ), Finally, one primary and another three metastatic samples are in the opposite clusters in Latulippe prostate study (Figure ). Overall, the clustering can differentiate metastatic tumors from primary tumors, although some metastatic tumors were grouped with primary tumors. The Conlon signature had no classification error in Latulippe prostate study, but essentially there was no tight clustering in Garber lung study at all, although 4 out of 6 metastatic samples were clustered together in a local tree.
The POE EM and MCMC signatures give comparably good clusterings of the two types of tumors across all studies. In Figures and , all metastatic tumors except for two samples from the Garber lung study are grouped together, and some primary tumors from the Chen liver study are separated from other primary tumors. Furthermore, the lengths of the edges to the leaf nodes in the dendrogram are shorter than that in the ES signature, which suggests that the clustering of primary tumors is tighter than that using the ES signature. This is a consequence of normalizing the expression level of metastatic tumors to the distribution of primary tumors by utilizing phenotypic information in the estimation of POE. The heatmaps visually demonstrate the difference between the ES signatures and the POE signatures. We next used NIH DAVID [18
] to determine if there were functional groups enriched for in our gene expression signatures. In terms of gene annotation, the POE EM and MCMC signatures share many common functional categories because they have many UGIDs in common such as response to stress, immune response, endopeptidase and enzyme inhibitor activity, cell organization and biogenesis, and regulation of cell cycle. The class of functions common to the POE and ES signatures is cell cycle processes. GO terms such as antigen processing, endogenous antigen via MHC class I, DNA repair, many metabolism and transport activities appear in the ES signatures only. Also, a literature search has suggested the association of POE signature genes and their corresponding GO terms with tumor invasion and metastasis in various cancer types. For example, ALDH1A1 (stress response) and MAPK3 (cell proliferation) are targets of the HGF/MET signaling pathway which has been associated with tumor metastasis and poor prognosis in human hepatocellular carcinomas [19
]. In another example, overexpression of PFN2 (Regulation of actin cytoskeleton) and UBS (stress response) has been associated with lymph node metastasis of gastric cancer [20
] and colon cancer [21
] respectively. These observations indicate that the POE signatures lead to relevant findings toward understanding the potential mechanism of differentiation of metastatic tumors from primary tumors.
Finally, an additional validation of the method was attempted to see if the resulting gene expression signature can discriminate lethal from nonlethal cancers in an early detected population of cancers. Note that the signature selection was primarily oriented toward the distinction of metastatic tumors from primary tumors. Thus validation here is based on the conjecture that many metastatic tumors are highly likely to initiate lethal condition. We addressed this issue by using the data from the van't Veer et al
] study. Their study profiled 98 primary breast cancer samples in Hu25K inkjet arrays. Among these patients, 34 patients developed distant metastases within 5 years, 44 patients continued to be disease-free after a period of at least 5 years. Other 20 patients either had BRCA1 germline mutations or were BRCA2 carriers; we excluded these samples from the analysis.
The study was based on a large inkjet microarray profiling over 25,000 probes. About two-thirds of 1633 genes used in the three cancer studies appear in the Van't Veer et al. data. Based on the classifier trained from the three cancer datasets described above, we mapped the genes from the signatures to those in the van't Veer et al. data. We generated risk indices for subjects in van't Veer et al. Specifically, we first transformed the van't Veer et al. data to the POE scale using both the EM and MCMC algorithms without using the phenotypic information to prevent overfitting. Note that we did not consider the effect size and Conlon signatures here. Then we calculate the log odds ratio for each patient using the coefficients trained from training data and the newly generated POE data. Note that the estimated regression coefficients for the risk score came from the training set. As expected, the derived risk indices using the data from the EM and MCMC algorithms are highly correlated (Pearson correlation 0.83).
A proportional hazards model [22
] relating metastasis-free survival to the risk index, adjusting for covariates, was fit to the data. Tables and shows the results. In both analyses using data from the EM and MCMC algorithms concur in that the derived risk indices are strong predictor of metastasis-free survival times. This association remains strong even after adjusting for estrogen receptor status and age. Since we are interested in risk prediction, we calculated the C-index [23
] to see if the gene expression signature adds discriminatory information relative to estrogen status and age. For the model with just age and estrogen status, the C-index is 0.714. For the EM-based POE signature, the C-index with all three variables (Multivariate model in Table ) is 0.722. For the MCMC-based POE signature, the C-index with all three variables (Multivariate model in Table ) is 0.748.
Results of EM POE-based survival analysis
Results of MCMC POE-based survival analysis