Microarrays have been extensively used to profile tissues on a genome-wide scale. Genes identified from microarray studies can be used as cancer markers for diagnosis, prognosis prediction, and treatment selection. As an example, microarray gene signatures have been used in breast cancer and lymphoma clinical practice [1]. In this article, we focus on microarray studies where gene expressions are measured along with certain cancer clinical outcomes. The goal of such studies is to identify genes with important impacts on the clinical outcomes of interest, which may include risk of developing cancer, cancer status, cancer survival, and response to treatment [2].
Analysis of cancer microarray data is challenging, first because of the high dimensionality of gene expressions. In addition, unlike simple Mendelian diseases, development and progression of cancer are affected by the joint effects of multiple genetic defects. This in turn demands modeling the joint effects of a large number of genes in a single statistical model and makes analysis of one gene at a time (i.e., marginal gene effects) suboptimal. Moreover, out of a large number of genes surveyed, only a subset are cancer-associated. To discriminate those cancer-associated genes from noise, various filter, wrapper, and embedded statistical methods have been developed [3].
In most existing studies, attention has focused on analysis of a single dataset and identification of genes associated with a single cancer clinical outcome. Consider a hypothetical study where we are interested in identifying genes associated with development of breast cancer. Assume that there are five genes of interest: genes A-E. The goal of most existing studies corresponds to the first column of Table , which is to distinguish cancer-associated genes A and B from noise genes C, D, and E. In this article, we refer to such a gene selection study as "one-dimensional". That is, selection is carried out only on the genes.
All cancer cells share two essential characteristics: uncontrolled growth and local tissue invasion or metastasis. In addition, there is strong evidence that certain cancers share common susceptibility genes. Examples include the BRCA1 and BRCA2 tumor suppressor genes, whose mutations are associated with the inherited forms of both breast and ovarian cancers [4]. Over-expression of the HER-2 oncogene has been reported in 10-40% of primary breast and ovarian tumors and is strongly associated with a poor clinical prognosis [5]. Gene WWOX is a tumor suppressor gene mutated in both breast and prostate cancers [6]. Gene ADH is associated with development of lung cancer and head/neck cancer [7]. The wound response signature, which is a breast cancer prognostic gene signature, also has predictive power for prognosis of lung cancer and prostate cancer [9]. Simultaneously examining multiple cancers and searching for their common genomic basis will enable us to identify more essential features of cancer and lead to a better understanding of the subtle connections among different types of cancers [10].
When studying a single type of cancer, genes can be categorized simply as either cancer-associated or not, and selection only needs to be conducted along the gene dimension. When studying multiple cancers, the categorization becomes more complicated. Consider the hypothetical study presented in Table . Suppose that, in addition to breast cancer, we are also interested in ovarian and lung cancers. Among the five genes, gene A is associated with all three types of cancers. Genes B and C are each associated with two types of cancers. Gene D is associated with only one type of cancer, and gene E is not associated with any of the three cancers. Examination of Table suggests that development of breast and ovarian cancers may share a common genomic mechanism, which likely involves the protein encoded by gene B. However, such a mechanism may have no effect on development of lung cancer. When multiple genes and multiple cancers are considered, selection needs to be carried out along two dimensions. (a) The gene dimension: for each type of cancer, genes associated with its development need to be identified. For example, for ovarian cancer, this dimension of selection amounts to differentiating genes A-C from genes D and E. (b) The cancer dimension: for each gene, we are interested in identifying the cancers it is associated with. For example, for gene B, this dimension of selection amounts to differentiating breast and ovarian cancers from lung cancer. Of note, although there are studies investigating multiple genes and multiple cancers, none of them formally treats this as a two-dimensional selection problem.
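The two selection dimensions can be made concrete with a small sketch. The association pattern below is not taken from the table itself; it is reconstructed from the verbal description of the hypothetical five-gene, three-cancer study (genes A and B are the breast cancer genes, genes A-C are the ovarian cancer genes, and the stated counts then place genes C and D with lung cancer). The names `ASSOCIATION`, `genes_for_cancer`, and `cancers_for_gene` are illustrative assumptions and are not part of any published method.

```python
# Hypothetical gene-by-cancer association indicators (1 = associated),
# reconstructed from the five-gene example described in the text.
CANCERS = ["breast", "ovarian", "lung"]
ASSOCIATION = {
    "A": {"breast": 1, "ovarian": 1, "lung": 1},
    "B": {"breast": 1, "ovarian": 1, "lung": 0},
    "C": {"breast": 0, "ovarian": 1, "lung": 1},
    "D": {"breast": 0, "ovarian": 0, "lung": 1},
    "E": {"breast": 0, "ovarian": 0, "lung": 0},
}


def genes_for_cancer(cancer):
    """Gene dimension: list the genes associated with one given cancer."""
    return [gene for gene, row in ASSOCIATION.items() if row[cancer] == 1]


def cancers_for_gene(gene):
    """Cancer dimension: list the cancers associated with one given gene."""
    return [cancer for cancer in CANCERS if ASSOCIATION[gene][cancer] == 1]


print(genes_for_cancer("ovarian"))  # ['A', 'B', 'C']
print(cancers_for_gene("B"))        # ['breast', 'ovarian']
```

In a real analysis the indicator matrix is unknown and must be estimated from data; the sketch only shows what a two-dimensional selection result looks like once it is available.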
Studies conducted to identify genes associated with multiple cancers include [11], where 218 tumor samples spanning 14 common tumor types and 90 normal tissue samples were collected and analyzed to identify a gene signature that is differentially expressed in metastatic tumors of diverse origins relative to primary cancers. A "support vector machine + recursive feature elimination" approach was proposed. Such an approach is limited to categorical clinical outcomes. We note that the data structures and scientific questions of interest in [11] and their counterparts in this article are significantly different. More specifically, [11] has one multiclass classification problem, whereas we have multiple binary classification problems. Rhodes et al. [10] examined 21 cancer microarray datasets spanning 12 distinct cancer types and identified a set of 67 genes that are universally activated in most cancer types relative to normal tissues. The approach proposed in [10] can only study the marginal effects of genes, whereas cancer development is associated with the joint effects of multiple genetic defects. Segal et al. [12] pooled 1975 human DNA microarrays spanning 22 tumor types and characterized gene expression profiles in tumors as a combination of activated and deactivated modules. An approach similar to Fisher's meta analysis approach was proposed, which can study the marginal effects of genes only. Chan and Mousavi [13] proposed a stochastic Bayesian approach to identify susceptibility genes shared by development of breast and ovarian cancers. The SHEBA approach demands selection of closely related cancers. Considering our limited knowledge of the mechanisms underlying cancer development, potential applications of this approach can be limited. Yang et al. [14] analyzed 4 cancer prognosis studies involving breast cancer, leukemia, and mesothelioma and identified 42 genes that show consistent up- or down-regulation in patients with poor disease outcomes. An extension of the approach in [10] was considered, which can only study the marginal effects of genes. Xu et al. [15] collected 26 cancer datasets across 21 major human cancer types and identified a common cancer signature consisting of 46 genes. The proposed TSPG approach is limited to categorical clinical outcomes and is difficult to extend. Choi et al. [16] analyzed 10 gene expression datasets from cancers of 13 different tissues and constructed two distinct coexpression networks: a tumor network and a normal network. This study focuses on analyzing the pair-wise interactions between genes. Lê Cao et al. [17] analyzed the NCI60 datasets, where the transcriptome of 60 cancer cell lines was investigated. The sparse partial least squares (sPLS) method was used, which cannot be easily extended to other data setups or models.
Existing methods for analyzing multiple cancer microarray datasets may have one or more of the following drawbacks. First, attention has been focused on analyzing one gene at a time (i.e., the marginal effects of genes). Examples include [10] and others. Since development and progression of cancer are caused by the joint effects of multiple genes, analyzing individual genes separately does not make full use of the information in the data. In this study, we include all genes in a single statistical model and account for their joint effects. Second, the focus has been on identification of genes associated with all cancers being investigated. Such a strategy demands preselection of cancers having a significantly overlapping genomic basis. For example, in [13], only breast cancer and ovarian cancer - which are known to share a common genomic basis - are investigated. This strategy may have significant limitations given the great heterogeneity among different cancers and our limited knowledge of cancer genomics. In this study, we relax this constraint and allow the data to reveal which cancers a particular gene may be associated with. Third, multiple datasets are usually analyzed separately. Then, summary statistics (for example, p-values) from analysis of each individual dataset are combined using meta analysis methods to search for overlaps of findings. Such an approach can be inefficient: microarray studies have small sample sizes, so analyzing each individual dataset separately may have insufficient power and may lead to high false positive and false negative rates. Fourth, inefficient feature selection methods are employed. For example, in [15], the number of cancer-associated genes needs to be predetermined, and the heuristic exhaustive search approach in [13] can accommodate only a small number of genes.
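To illustrate the third drawback, the following minimal sketch shows how per-dataset p-values for a single gene might be pooled in a meta analysis. Fisher's combined probability test is used here only as one familiar example of such a method; it is not claimed to be the exact procedure of any study cited above. The function name `fisher_combined_pvalue` and the input p-values are hypothetical and chosen purely for illustration.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the upper tail has the closed form
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    tail = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail


# Hypothetical p-values for the same gene from three separate datasets.
per_dataset = [0.04, 0.10, 0.30]
print(fisher_combined_pvalue(per_dataset))
```

Note that only the summary statistics enter the calculation: the raw expression measurements are never modeled together, and each gene is handled on its own, which is the limitation discussed above.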
In this article, we propose a new statistical approach - Mc.TGD (Multi-cancer Threshold Gradient Directed) - for investigation of associations between multiple genes and multiple cancers. The Mc.TGD is an integrative analysis approach in which raw data from multiple studies are pooled and analyzed. It differs significantly from meta analysis methods, which analyze each dataset separately and pool summary statistics. Unlike existing approaches, the Mc.TGD can model the joint effects of multiple genes, makes no assumptions about the genomic basis of cancers, uses effective gene selection techniques, and is broadly applicable. In this article, we analyze studies investigating the risk of developing cancer, which have binary outcomes. The Mc.TGD can also be used to analyze cancer microarray studies with survival, quantitative, and categorical outcomes.