In cancer research, high-throughput profiling studies have been conducted extensively to search for genomic markers whose mutations or defects may increase susceptibility to cancer. Cancer genomic data have the “large d, small n” characteristic. For example, a typical microarray gene profiling study measures the expressions of 10⁴ genes, and a genome-wide association study measures 10⁶ SNPs, on 10¹ to 10³ subjects. For simplicity of discussion, we focus on cancer gene profiling studies using microarrays and note that the proposed approach is also applicable to other omics (e.g. epigenetics or proteomics) studies and other diseases.
Results from analysis of individual cancer genomic datasets often suffer from a lack of reproducibility [1-3]. The unsatisfactory reproducibility is also evident in our numerical study. Although there are multiple possible causes, the most important one is “small n” and hence the lack of power of individual studies. An ideal solution is to conduct large-scale prospective studies, but these are extremely time-consuming and expensive. Knudsen [4] shows that for many cancer traits and clinical outcomes, there are multiple studies sharing comparable designs. A cost-effective way to improve reproducibility is therefore to pool data from multiple existing studies and thereby increase statistical power.
In cancer genomic data analysis, “large d” leads to high computational cost, particularly when it is necessary to analyze all profiled genes simultaneously. With most existing analysis approaches, computational cost increases significantly, linearly or even exponentially, as the number of genes increases. Under some scenarios, the computational cost may even put a ceiling on the number of genes that can be analyzed. For example, the R package WGCNA, which conducts weighted gene co-expression network analysis, can analyze at most 4000 genes [5]. An effective solution to the computational problem caused by “large d” is to conduct prescreening, which has relatively low computational cost, prior to the complex analysis, which has high computational cost. Prescreening can be classified as unsupervised or supervised. Unsupervised prescreening does not use information on the response variable. For example, gene expressions with severe missingness or small variances are removed from downstream analysis. In contrast, supervised prescreening uses the response variable and can be more informative. For the analysis of single datasets, supervised prescreening studies include [6], which conducts a numerical study of prescreening based on eight different statistics. Huang and others [7] proposed prescreening using a bridge penalization approach. Fan and Lv [8] developed the SIS (Sure Independence Screening) approach. Chen and Chen [9] proposed a tournament prescreening approach. Tibshirani [10] used a Lasso penalization approach for prescreening under the Cox model.
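As an illustration of the unsupervised prescreening described above, the following minimal Python sketch drops genes with severe missingness or small variance without looking at the response variable. The function name `unsupervised_prescreen`, the two thresholds, and the toy data are hypothetical choices made for this example; the article does not specify them.

```python
import math
from statistics import pvariance


def unsupervised_prescreen(expr, max_missing_frac=0.25, min_variance=0.01):
    """Return indices of genes that pass missingness and variance filters.

    expr: list of per-gene lists, one value per subject; NaN marks a
    missing measurement. Both thresholds are illustrative assumptions,
    not values taken from the article.
    """
    kept = []
    for j, values in enumerate(expr):
        observed = [v for v in values if not math.isnan(v)]
        missing_frac = 1 - len(observed) / len(values)
        if missing_frac > max_missing_frac:
            continue  # too many missing measurements
        if len(observed) < 2 or pvariance(observed) < min_variance:
            continue  # expression is nearly constant across subjects
        kept.append(j)
    return kept


nan = float("nan")
toy_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # varies across subjects: kept
    [2.0, 2.0, 2.0, 2.0, 2.0],  # zero variance: removed
    [1.0, nan, nan, 4.0, nan],  # 60% missing: removed
    [0.5, 1.5, nan, 2.5, 3.5],  # 20% missing, nonzero variance: kept
]
kept_genes = unsupervised_prescreen(toy_expr)
print(kept_genes)
```

Because the response variable never enters the filter, the same set of genes is retained regardless of the outcome being studied, which is why supervised prescreening can be more informative.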
In this article, we investigate prescreening with multiple cancer genomic datasets sharing comparable designs. With multiple datasets, one possibility is to first conduct prescreening with each dataset separately and then conduct meta-analysis to combine results from the multiple datasets. However, because of small sample sizes, prescreening results with individual datasets can be unsatisfactory, and meta-analysis cannot generate superior results from inferior inputs. Another possibility is to adopt intensity approaches, which transform gene expressions, combine multiple datasets, and conduct prescreening as if the data were from a single study. Intensity approaches demand the full comparability of transformed gene expressions from different studies (platforms), which is still questionable. In addition, they need to be conducted on a case-by-case basis.
The goal of this study is to develop a practically useful prescreening approach for integrative analysis of multiple cancer genomic datasets. The proposed approach shares a similar spirit with existing prescreening approaches [6-8, 11]. Instead of fitting one model with d genes, d marginal models are fitted with only one gene in each model. A ranking statistic measuring marginal significance is computed in each marginal model, and only genes with statistics larger than a cutoff are retained for downstream analysis. On the other hand, this study also advances significantly beyond existing studies. The data setup is more complicated than that in existing studies because of the presence of multiple datasets and, more importantly, the heterogeneity among them. Available prescreening approaches have been designed for analyzing single datasets and cannot accommodate the heterogeneity across multiple datasets. The proposed approach is an integrative analysis approach: it pools and analyzes raw data from multiple studies and can be more effective than meta-analysis approaches. Unlike intensity approaches, the proposed approach does not need to be conducted on a case-by-case basis and does not require the full comparability of gene expression measurements from different studies. Hence it can be more broadly applicable.
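The marginal screening idea described above (one model per gene, a ranking statistic, and a cutoff) can be sketched for a single dataset as follows. This is only an illustration under stated assumptions: it uses the absolute Pearson correlation with a continuous response as the ranking statistic, which is one common choice for linear marginal models, and it does not implement the multi-dataset integrative approach proposed in this article. The names `pearson` and `marginal_prescreen`, the cutoff, and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def marginal_prescreen(expr, response, cutoff):
    """Fit d one-gene marginal models and keep genes whose statistic exceeds cutoff.

    expr: list of per-gene expression lists, one value per subject.
    The ranking statistic (absolute marginal correlation) is an assumption
    made for this sketch, not the statistic used in the article.
    """
    stats = [abs(pearson(gene, response)) for gene in expr]
    kept = [j for j, s in enumerate(stats) if s > cutoff]
    return kept, stats


toy_response = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_expr = [
    [2.0, 4.0, 6.0, 8.0, 10.0],   # increases with the response
    [5.0, 4.0, 3.0, 2.0, 1.0],    # decreases with the response
    [1.0, -1.0, 0.0, -1.0, 1.0],  # no linear association
]
kept, stats = marginal_prescreen(toy_expr, toy_response, cutoff=0.5)
print(kept, stats)
```

Each marginal statistic costs O(n) to compute, so screening all d genes costs O(nd), which is what makes prescreening cheap relative to a joint model over all genes.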