4.1. Practical considerations
In data analysis, normalization of gene expressions needs to be conducted for each data set separately. For Affymetrix measurements, a floor and a ceiling may be added, and then the measurements are log2 transformed. We fill in missing expressions with means across samples. We then standardize each gene expression to have zero mean and unit variance. The proposed approach does not require the direct comparability of measurements from different studies. Additional transformations, which are necessary for intensity approaches, are not needed.
For simplicity of notation, we have assumed matched gene sets across multiple studies. When different sets of genes are measured in different studies, we use the following rescaling approach. Assume that gene 1 is measured only in the first K
( < M
) studies. We set β1K + 1
= 0 and J
) = [(β11
. The proposed approach and computational algorithm are then applicable with minor modifications.
We simulate data for 4 independent studies, each with 50 or 100 subjects. We simulate 50 or 250 gene clusters, with 20 genes in each cluster. Gene expressions have marginally normal distributions. Genes in different clusters have independent expressions. For genes within the same clusters, their expressions have the following correlation structures: (i) auto-regressive correlation, where expressions of genes j and k have correlation coefficient ρ|j − k|; (ii) banded correlation, where expressions of genes j and k have correlation coefficient max(0,1 − |j − k|×ρ); and (iii) compound symmetry, where expressions of genes j and k have correlation coefficient ρ when j≠k. Under each correlation scenario, we consider 2 different ρ values. Within each of the first 3 clusters, there are 10 genes associated with the responses. There are a total of 30 important genes, and the rest are noises. For important genes, we generate their regression coefficients from Unif[0,2]. Thus, some genes have large effects, and others have small to moderate effects. Six important genes are only measured in 2 studies. In addition, 10% noisy genes are only measured in 2 studies. We generate the binary response variables from the logistic regression models and Bernoulli distributions. The simulation settings closely mimic pharmacogenomic studies, where genes have the pathway structures. Genes within the same pathways tend to have correlated expressions, whereas genes within different pathways tend to have weakly correlated or independent expressions. Among a large number pathways, only a few are associated with the responses. Within those important pathways, there are some important genes and others are noises.
To better gauge performance of the proposed approach, we also consider the following alternative approaches: (a) meta-analysis. We analyze each data set separately using the Lasso or the bridge approaches. Genes that are identified in at least one study are identified in meta-analysis. An alternative is to consider genes identified in all studies. However, we have examined all simulation settings and found that there are very few such genes; (b) an intensity approach. Since all 4 data sets are generated under similar settings, we adopt the intensity approach in Shabalin and others (2008)
, make transformations of covariates, combine 4 data sets, and analyze as if they were from a single study. Both the Lasso and the bridge are used for analysis; (c) integrative analysis using the GLasso. For all approaches, we select the tuning parameters using 3-fold cross validation.
We evaluate the number of genes identified and the number of true positives. In addition, in -omics studies, we may also be interested in the prediction performance of identified markers. For each set of 4 data sets (training), we simulate 4 additional data sets (testing). We generate estimates using the training data, make prediction for subjects in the testing data and compute prediction errors.
Simulation suggests that the proposed approach is computationally affordable, with analysis of one replicate taking less than 2 min on a desktop PC. Summary statistics based on 500 replicates are shown in . We can see that the proposed approach is capable of identifying a majority of the true positives with a reasonable number of false positives. It outperforms alternatives under almost all scenarios, with a smaller number of genes identified, more true positives, and smaller error rates. We have examined a few other simulation scenarios and reached similar conclusions.
Simulation study: summary statistics based on 500 replicates
4.3. Pancreatic cancer studies
Pancreatic ductal adenocarcinoma (PDAC) is a major cause of malignancy-related death. Apart from surgery, there is still no effective therapy and even resected patients usually die within 1 year postoperatively. We collect and analyze 4 data sets (), which have been described in Iacobuzio-Donahue and others (2003)
, Logsdon and others (2003)
, Crnogorac-Jurcevic and others (2003)
, and Friess and others (2003)
. Two studies used cDNA arrays and the other 2 used oligonucleotide arrays. The 2 sample groups analyzed are the PDAC and normal pancreatic tissues. We remove genes with more than 30% missingness. Although it is possible to use all genes, missingness may make the analysis less reliable. 1204 genes are included in analysis. For each data set, we conduct preprocessing as described in Section 4.1.
We employ the proposed approach. Five genes are identified (). We can see that the estimates are nonzero in all 4 studies. For a specific gene, the magnitudes of estimates are not equal across studies. These observations are in line with discussion in Section 2. With real data, it is not possible to evaluate the gene identification accuracy. We evaluate the prediction performance using the LOO approach. We first remove one subject. With the reduced data, we carry out V-fold cross validation and penalized estimation. We then use the estimate to make prediction for the one removed subject. With the logistic model, the predicted probability can be computed. We use 0.5 as the cut off and predict the class label. We repeat this procedure over all subjects. Prediction error can then be computed. With the LOO approach, 2 subjects cannot be correctly classified, leading to an error rate of 0.036.
Pancreatic cancer studies: nonzero estimates
We also analyze using the following alternative approaches. (a) Meta-analysis. We analyze each data set using the Lasso approach. In data sets D1–D4, 9, 9, 10, and 5 genes are identified. A total of 29 genes are identified in at least one data set, with no gene identified in all 4 data sets. For each data set, we use the LOO approach to evaluate the prediction performance. A total of 8 subjects (2, 1, 2, and 3 in D1–D4, respectively) cannot be properly classified. We also analyze each data set using the bridge approach. In data sets D1–D4, 5, 9, 6, and 4 genes are identified, respectively. A total of 20 genes are identified in at least one data set, with no gene identified in all 4 data sets. With the LOO evaluation, a total of 4 subjects (1 in each data set) cannot be properly classified; (b) Intensity approach. We transform gene expressions, combine the 4 data sets, and analyze using the Lasso. Eight genes are identified. We use the LOO approach for prediction evaluation and 3 subjects are not properly classified. We also analyze the combined data using the bridge approach. Seven genes are identified, and 3 subjects are not properly classified in the LOO evaluation; (c) Integrative analysis using the GLasso. Eleven genes are identified. With the LOO evaluation, 4 subjects cannot be correctly classified.
4.4. Liver cancer studies
Gene profiling studies have been conducted on hepatocellular carcinoma, which is among the leading causes of cancer death in the world. Four microarray studies are described in Choi and others (2004)
and in . Data sets D1–D4 were generated in 3 different hospitals in South Korea. Although the studies were conducted in a controlled setting, the researchers “failed to directly merge the data even after normalization of each data set” (Choi and others, 2004
). 9984 genes were measured in all 4 studies. 3122 genes have less than 30% missingness and are analyzed. We also remove 8 subjects that have more than 30% gene expression measurements missing. The effective sample size is 125 (). For each data set, we conduct preprocessing as described in Section 4.1.
With the proposed approach, 23 genes are identified (). We observe similar behaviors as with the pancreatic cancer studies. We evaluate the prediction performance using the LOO approach. A total of 18 subjects cannot be properly classified, leading to an error rate of 0.144.
Liver cancer studies: nonzero estimates
We also analyze the data using the following alternative approaches. (a) Meta-analysis. When the Lasso is used to analyze each data set, 19, 25, 25, and 10 genes are identified, respectively. A total of 76 genes are identified in at least one data set, with no gene identified in all 4 data sets. The LOO evaluation misclassifies 4, 9, 4, and 5 subjects in D1–D4, respectively, leading to an overall error rate of 0.176. When the bridge is used to analyze each data set, 12, 16, 8, and 7 genes are identified, respectively. A total of 38 genes are identified in at least one data set, with no gene identified in all 4 data sets. The LOO evaluation misclassifies 4, 10, 5, and 9 subjects in D1–D4, respectively, leading to an overall error rate of 0.224; (b) Intensity approach. When the Lasso is used to analyze the combined data set, 34 genes are identified, and 35 subjects are not properly classified in the LOO evaluation. When the bridge is used to analyze the combined data set, 20 genes are identified and 40 subjects are not properly classified in the LOO evaluation; (c) Integrative analysis using the GLasso. 28 genes are identified. With the LOO evaluation, 21 subjects cannot be properly classified, leading to an error rate of 0.168.