Microarrays are capable of profiling human tissues on a genome-wide scale and have been used extensively in cancer studies, where the expression levels of thousands of genes are measured along with clinical outcomes. A major goal of such studies is to identify a subset of cancer-associated genes that can be used as biomarkers for cancer diagnosis and prognosis and as targets for therapy. Early studies have shown that gene signatures identified from the analysis of individual cancer microarray experiments often have low reproducibility. There are several reasons for this. A major one is that the sample size of a single microarray experiment, which is usually in the hundreds, is much smaller than the number of genes, which is usually in the tens of thousands.
Within the field of clinical investigation, meta analysis has emerged as the gold standard for the comparison and combined analysis of clinical studies. It is generally accepted that only meta analysis can circumvent the problems inherent to studies with low statistical power due to small sample sizes [1]. With meta analysis, researchers usually do not intend to analyze any new datasets. Rather, meta analysis provides an effective way of pooling and analyzing multiple existing datasets and generating results that are more reliable than those from the analysis of each individual dataset.
Meta analysis of cancer microarray data is made possible by the many experiments conducted independently to measure the same set of genes and the same cancer clinical outcomes. As shown in [2], the meta analysis of cancer microarray data has achieved considerable success by identifying relatively reproducible, biologically meaningful gene signatures. We refer to [6] for further discussion of the merits of meta analysis in genomic studies.
Meta analysis of cancer microarray data is challenging for three reasons. (1) Microarray experiments usually measure a small number of samples and a large number of genes, with only a subset of those genes associated with cancer clinical outcomes, so gene selection is needed along with estimation. (2) The meta analysis of cancer microarray data and the identification of cancer-associated genes often require the use of original expression measurements. For this reason, the type of analysis conducted in this article has also been referred to as "integrative analysis". Such analysis differs significantly from conventional meta analysis, which is based on summary statistics (such as p-values) from each individual experiment. (3) Different platforms may be used in different experiments. Arrays that hybridize one sample at a time (e.g., synthesized oligonucleotide arrays) measure gene expression based directly on the signal intensity of each probe set. In contrast, spotted cDNA arrays hybridized with fluorescent-labeled targets typically measure the ratio of the signal from a test sample to the signal of a co-hybridized reference sample. It has been shown that data from Affymetrix GeneChip oligonucleotide microarrays correlate poorly with data from custom-printed cDNA microarrays [7]. We note here that comparability across platforms can be achieved by transforming the expression measurements. However, as noted in previous studies (such as [8]), such transformations need to be conducted on a case-by-case basis.
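To make the contrast with conventional, summary-statistic-based meta analysis concrete, the following is a minimal sketch of Fisher's method for combining per-study p-values for a single gene. It is an illustration only, not the approach proposed in this article: the function name and the example p-values are hypothetical, and the studies are assumed to be independent.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the joint null hypothesis, where k is
    the number of studies. (Hypothetical helper, for illustration only.)
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # equals X / 2
    # For even degrees of freedom 2k, the chi-square survival function has
    # a closed form: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Made-up p-values for one gene in three independent studies.
combined = fisher_combined_pvalue([0.04, 0.10, 0.20])
print(f"combined p-value: {combined:.4f}")
```

Such a combination uses only one summary number per study and per gene. Integrative analysis, by contrast, works with the original expression measurements from every experiment.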
Several approaches have been proposed to analyze the marginal effects of genes using data from multiple microarray experiments. Examples include Fisher's approach (with application to breast cancer [9]); an intensity approach that transforms and directly integrates gene expressions [5]; a penalization approach [3]; a random-effects-model-based approach [10]; a robust gene ranking approach [11]; and a Bayesian approach [12].
Because cancer development and progression are caused by the effects of multiple genes, several studies that can account for the joint effects of genes have also been conducted. A majority voting (with impact factors) approach is proposed in [13]. Gene shaving approaches based on random forest and Fisher's linear discriminant are applied in [14]. A computationally intensive Bayesian approach is proposed in [15]. We note that the focus of those studies has been predictive model building, not gene selection.
On the other hand, there is a rich literature on gene selection in the analysis of a single cancer microarray dataset. Examples include the parameterized classifier design approach in [16]; the penalization approaches in [17]; the Threshold Gradient Directed Regularization (TGDR) approach [19]; and the support vector machine approach [22]. We refer to [23] for further discussion of gene selection approaches with individual microarray datasets. We note, however, that those approaches have been designed to analyze a single dataset and cannot be used to analyze multiple, heterogeneous datasets.
The literature review suggests three points. (1) Genes identified from the analysis of a single cancer microarray dataset may suffer from low reproducibility because of the small sample size. Meta analysis pools multiple datasets, increases statistical power, and provides an effective way of improving reproducibility. (2) Existing meta analysis approaches focus on either the investigation of the marginal effects of genes or the construction of predictive models with multiple genes. (3) Approaches exist that can select genes with joint effects on cancer in the analysis of a single dataset, but these approaches cannot be used to analyze multiple, heterogeneous datasets. Thus, there is a critical need for approaches that can select genes with joint effects on cancer in the meta analysis of multiple microarray datasets.
In this article, we propose the Meta Threshold Gradient Descent Regularization (MTGDR) approach for gene selection in cancer microarray meta analysis. The MTGDR takes advantage of recent developments in regularized gene selection with a single microarray dataset. Compared to such single-dataset gene selection methods, the MTGDR has the desired flexibility of accommodating multiple experiments with different setups. In comparison with the available meta analysis methods, the MTGDR can effectively select a subset of genes with joint effects on cancer.