|Home | About | Journals | Submit | Contact Us | Français|
Systematic characterization of cancer genomes has revealed a staggering number of diverse aberrations that differ among individuals, such that the functional importance and physiological impact of most tumor genetic alterations remains poorly defined. We developed a computational framework that integrates chromosomal copy number and gene expression data for detecting aberrations that promote cancer progression. We demonstrate the utility of this framework using a melanoma dataset. Our analysis correctly identified known drivers of melanoma and predicted multiple novel tumor dependencies. Two dependencies, TBC1D16 and RAB27A, confirmed empirically, suggest that abnormal regulation of protein trafficking contributes to proliferation in melanoma. Together, these results demonstrate the ability of integrative Bayesian approaches to identify novel candidate drivers with biological, and possibly therapeutic, importance in cancer.
Large-scale initiatives to map chromosomal aberrations, mutations and gene expression have revealed a highly complex assortment of genetic and transcriptional changes within individual tumors. For example, copy number aberrations (CNAs) occur frequently in cancer due to genomic instability. Genomic data has been collected for thousands of tumors at high resolution using array comparative genomic hybridization (aCGH) (Pinkel et al., 1998), high density single nucleotide polymorphism (SNP) microarrays (Beroukhim et al., 2010; Lin et al., 2008) and massively parallel sequencing (Pleasance et al., 2010). Although multiple new genes have been implicated in cancer through sequencing and CNA analysis (Garraway et al., 2005), these studies have also revealed enormous diversity in genomic aberrations in tumors among individuals. Each tumor is unique and typically harbors a large number of genetic lesions, of which only a few drive proliferation and metastasis. Thus, identifying driving mutations (genetic changes that promote cancer progression) and distinguishing them from passengers (those with no selective advantage) has emerged as a major challenge in the genomic characterization of cancer.
The most widely used approaches are based on the frequency an aberration occurs: if a mutation provides a fitness advantage in a given tumor type, its persistence will be favored and it is likely to be found in multiple tumors. For example, GISTIC identifies regions of the genome that are aberrant more often than would be expected by chance, and has been used to analyze a number of cancers (Beroukhim et al., 2009; Beroukhim et al., 2007; Lin et al., 2008). However, there are limitations to analytical approaches based on CNA data alone: CNA regions are typically large and contain many genes, most of which are passengers that are indistinguishable in copy number from the drivers. CNA data has statistical power to detect only the most frequently recurring drivers above the large number of unrelated chromosomal aberrations that are typical in cancer. Finally, these approaches rarely elucidate the functional importance or physiological impact of the genetic alteration on the tumor. These limitations highlight the need for new approaches that can integrate additional data to identify drivers of cancer. Gene expression is readily available for many tumors, but how best to combine it with information on CNA is not obvious.
We postulate that driving mutations coincide with a “genomic footprint” in the form of a gene expression signature (see Box 1). We developed an algorithm that integrates chromosomal copy number and gene expression data to find these signatures and identify likely driver genes located in regions that are amplified or deleted in tumors. Each potential driver gene is altered in some, but not all tumors and, when altered, is considered likely to play a contributing role in tumorgenesis. Unique to our approach, each driver is associated with a gene module, which is assumed to be altered by the driver. We sometimes gain insight into the likely role of a candidate driver, based on the annotation of the genes in the associated module. We demonstrate the utility of our method using a dataset (Lin et al., 2008) that includes paired measurements of gene expression and copy number from 62 melanoma samples. Our analysis correctly identified known drivers of melanoma and connected them to many of their targets and biological functions. In addition, it predicted novel melanoma tumor dependencies, two of which, TBC1D16 and RAB27A, were confirmed experimentally. Both of these genes are involved in the regulation of vesicular trafficking, which highlights this process as important for proliferation in melanoma.
We define a “driving mutation” to be a genetic alteration that provides the tumor cell with a growth advantage during carcinogenesis or tumor progression (Stratton et al., 2009). We reasoned that driving mutations might leave a genomic ‘footprint’ that can assist in distinguishing between driver and passenger mutations based on the following assumptions:
Driving mutations are frequently associated with the abnormal regulation of processes such as proliferation, differentiation, motility and invasion. Given that many cancer phenotypes are reflected in coordinated differences in the expression of multiple genes (a module) (Golub et al., 1999; Segal et al., 2004), a driving mutation might be associated with a characteristic gene expression signature or other phenotypic output representing a group of genes whose expression is modulated by the driver. Additionally, CNAs do not typically alter the coding sequence of the driver and so are expected to influence cellular phenotype via changes in the driver’s expression. In consequence, changes in expression of the driver are important and so approaches that measure association between the expression of a candidate driver (as opposed to its copy number) and that of the genes in the corresponding module are likely to promote the identification of drivers.
Gene expression is particularly useful for identifying candidate drivers within large amplified or deleted regions of a chromosome: whereas genes located in a region of genomic copy gain/loss are indistinguishable in copy number, expression permits the ranking of genes based on how well they correspond with the phenotype (Figure 1D). CNA data aids in determining the direction of influence, which cannot be derived based on correlation in gene expression alone. This permits an unbiased approach for identifying candidate drivers from any functional family, beyond transcription factors or signaling proteins.
We developed a computational algorithm, COpy Number and EXpression In Cancer (CONEXIC), that integrates matched copy number (amplifications and deletions) and gene expression data from tumor samples to identify driving mutations and the processes they influence. CONEXIC is inspired by Module Networks (Segal et al., 2003), but has been augmented by a number of critical modifications that make it suitable for identifying drivers (see Supplementary Methods). CONEXIC uses a score-guided search to identify the combination of modulators that best explains the behavior of a gene expression module across tumor samples and searches for those with the highest score within the amplified or deleted region (Supplementary methods, Figure S1).
The resulting output is a ranked list of high scoring modulators that both correlate with differences in gene expression modules across samples and are located in amplified or deleted regions in a significant number of these samples. The fact that the modulators are amplified or deleted indicates that they are likely to control the expression of the genes in the corresponding modules (see Figure 3). Since the modulators are amplified or deleted in a significant number of tumors, it is reasonable to assume that expression of the modulator (altered by copy number) contributes a fitness advantage to the tumor. Therefore, the modulators likely include genes whose alteration provides a fitness advantage to the tumor.
We applied the CONEXIC algorithm to paired gene expression and CNA data from 62 cultured (long and short term) melanomas (Lin et al., 2008). A list of candidate drivers was generated using copy number data available for 101 melanoma samples by applying a modified version (Sanchez-Garcia et al., 2010) of GISTIC (Beroukhim et al., 2007) (see Table S1). Next, we integrated copy number and gene expression data (available for 62 tumors) to identify the most likely drivers (Supplementary methods). Statistical power is gained by integrating all data, and by combining statistical tests on thousands of genes to support the selected modulators. This resulted in the identification of 64 modulators that explain the behavior of 7869 genes. We consider the top 30 scoring modulators, presented in Figure 2, as likely drivers (see Table S2 for the complete list).
The top 30 modulators (likely drivers) include 10 known oncogenes and tumor suppressors (Figure 2). In many cases, CONEXIC chose the cancer related gene out of a large aberrant region containing many genes. For example, DIXDC1, a gene known to be involved in the induction of colon cancer (Wang et al., 2009b), was selected among 17 genes in an aberrant region (Figure S3). CCNB2, a cell cycle regulator, was selected from a large amplified region containing 33 genes. The modulators span diverse functional classes including: signal transducers (TRAF3), transcription factors (KLF6), translation factors (EIF5) and genes involved in vesicular trafficking (RAB27A).
Performing a comprehensive literature search for all genes is tedious and time consuming, so we developed an automated procedure, LitVAn - Literature Vector Analysis, that searches for over-represented terms in papers associated with genes in a gene set. LitVAn uses a manually curated database (NCBI Gene) to connect genes with terms from the complete text of more than 70,000 published scientific articles (Supplementary Methods). LitVAN found a number of over represented terms (Figure S2E) among the top 30 modulators, including ‘PI3K’ and ‘MAPK’, which are known to be activated in melanoma, ’cyclin’, representing proliferation which is common in all cancers and ‘RAB’. Rabs regulate vesicular trafficking, a process not previously implicated in melanoma (Chin et al., 2006).
Beyond generating a list of likely drivers (modulators), the CONEXIC output includes groups of genes that are associated with each modulator (modules). We tested how reproducible the modulators and their associated modules are using gene expression data from two other melanoma cohorts with 45 (Hoek et al., 2006) and 63 (Johansson et al., 2007) samples (see Supplementary methods, Figure S2). We found that 51/64 (80%) of the selected modulators are conserved across datasets in a statistically significant manner. Modules (statistically associated genes) are likely enriched with genes whose expression is biologically affected by the modulator (Figure 3). In consequence, the processes and pathways represented by genes in a module can help us to gain insight into how an aberration in the modulator might alter the cellular physiology and contribute to the malignant phenotype.
Annotation of data-derived sets of genes is typically carried out based on gene set enrichment using Gene Ontology (GO) annotation. Although this approach is useful, there are modules for which GO annotation does not capture the known biology. For example, the 'TNF module' is enriched with the GO terms 'developmental process' and 'cell differentiation' (q-value=0.0014 and 0.004 respectively). We used LitVAn to carry out a systematic literature search and found 11/20 genes in the module related to the TNF pathway, inflammation or both (Figure 3C, Table S3), although only 2 of these genes were annotated for these processes in GO. TRAF3, the modulator chosen by CONEXIC, is known to regulate the NF-kappa-B pathway (Vallabhapurapu et al., 2008), a major downstream target of TNF. Although TRAF3 has not been previously implicated in melanoma, the importance of the NF-kappa-B pathway in melanoma is well supported (Yukiko and Ann, 2006).
CONEXIC identified microphthalmia-associated transcription factor (MITF) as the highest scoring modulator. MITF is a master regulator of melanocyte development, function and survival (Levy et al., 2006; Steingrimsson et al., 2004) and the overexpression of MITF is known to have an adverse effect on patient survival (Garraway et al., 2005).
To test the association between modulator and module, we obtained an experimentally derived list of MITF targets (Hoek et al., 2008b) and asked whether the modules identified by CONEXIC associate MITF with its known targets. The MITF associated modules contained 45/80 previously identified targets (p-value < 1.5e−45) supporting a match between the transcription factor (TF) and its known targets. However, a few targets (TBC1D16, ZFP106 and RAB27A) are both associated with MITF and are themselves modulators of additional modules. CONEXIC limits each gene to a single module, so association with an MITF target would preclude association with MITF. If we permit indirect association to MITF through the modules of these additional modulators, CONEXIC correctly identifies 76 of the 80 targets identified by Hoek et. al. (p-value < 1.5e−78). Similar target sets are not available for any other modulator, precluding a more rigorous evaluation of our other predictions.
Expression of MITF correlates with the expression of its targets better than MITF copy number, though both correlations are statistically significant (p-value of 0.0001 versus 0.04, Figures 4A and 4B). This relationship is unidirectional: MITF is significantly overexpressed when its DNA is amplified (p-value 0.0004), but over-expressed MITF does not always correspond with MITF amplification. We find that MITF is less correlated with its copy number (rank 294th) than most other genes in aberrant regions (see Table S1C) and more than half of the tumors that over-express MITF do not have a CNA that spans the MITF gene. Comparison of MITF target expression between samples with and without MITF amplification did not show an effect of DNA amplification on expression of the targets (Supplementary methods).
We used LitVAN to identify the biological processes and pathways represented in each module associated with MITF. The module containing the genes most significantly upregulated by MITF (Figure 4B, Figure S4A) is significantly enriched for the terms 'melanosome' and 'pigment granule' (q-value= 4.86e−6 for each). It includes targets involved in proliferation such as CDK2, consistent with the observation that MITF can promote proliferation via lineage specific regulation of CDK2 (Du et al., 2004). The module containing genes most strongly inhibited by MITF (Figure 4B, Figure S4B) has a metastatic signature strongly associated with invasion, angiogenesis, the extracellular matrix and NF-kappa-B signaling. These modules and their annotation suggest that MITF serves as a developmental switch between two types of melanoma, where high MITF expression promotes proliferation and low MITF expression promotes invasion. Thus our automated, computationally derived findings dissect a complex response and accurately recapitulate the known literature, including the experimental characterization of MITF (Hoek et al., 2008a).
LitVAN annotated additional modulators with their known role (e.g., CCNB2 with cell cycle and mitosis, data not shown). The detailed match between the CONEXIC output and empirically derived knowledge of the role of known modulators in melanoma provides confidence in CONEXIC’s predictions for modulators that are not well characterized.
The second highest scoring modulator identified by CONEXIC is TBC1D16, a Rab GTPase-activating protein of unknown biological function. Rabs are small monomeric GTPases, involved in membrane transport and trafficking. TBC1D16 is well conserved and although its targets are not known, a close paralog, TBC1D15, regulates RAB7A (also selected as a modulator, Figure 2) (Itoh et al., 2006). We used a module associated with TBC1D16 to infer its potential role in melanoma (Figure 5A), and discovered that diverse biological processes are represented by genes in the module and that more than half are annotated for processes such as melanogenesis, vesicular trafficking and survival/proliferation (Table S4A). This suggests that TBC1D16 plays a role in cell survival and proliferation.
TBC1D16 is an uncharacterized gene located in an amplified region that contains 23 other genes, including CBX4, which is known to play a role in cancer (Satijn et al., 1997). Expression of TBC1D16 is not highly correlated with TBC1D16 copy number, compared to other genes in the region (ranked 7th out of 24) or to all candidate drivers (252th out of 428). Nevertheless, TBC1D16 is the top scoring gene in the region and the 2nd highest scoring modulator, so it was selected for experimental verification
The module exhibits a dose-response relationship between TBC1D16 expression and the expression of genes in the module such that higher expression of TBC1D16 is correlated with higher expression of genes in the module (correlation coefficient 0.76). We carried out western blotting and RT-PCR on some of the short term cultures (STCs) used to generate the Lin dataset and asked whether the TBC1D16 transcript correlates with protein levels. The results confirmed that the expression of TBC1D16 corresponds well with the amount of the 45kD form of TBC1D16 (data not shown). These results suggest that knockdown of TBC1D16 expression in tumors that have high levels of TBC1D16 will lead to a reduction in proliferation.
To test whether TBC1D16 is required for proliferation of melanoma cultures we carried out a knockdown experiment. We selected two STCs with high levels of TBC1D16, WM1960 (16-fold greater expression than WM1346, DNA not amplified) and WM1976 (34-fold greater expression, amplified DNA) and control STCs, WM262 and WM1346 that express TBC1D16 at a lower level. We used two shRNAs to knock down TBC1D16 expression in each of the four STCs and measured growth over 8 days (Supplementary methods). RT-PCR was used to confirm that the reduction in the amount of the TBC1D16 transcript was similar for all of the STCs (Figure S6). Knockdown of TBC1D16 expression reduced cell growth in WM1960 and WM1976 to 16% and 40%, respectively, relative to controls infected with GFP shRNA in the same STCs (Figures 5B, C and D). This result is specific for cultures with high levels of TBC1D16, as the controls, WM262 and WM1346, grow at similar rates to cultures infected with shGFP (75%–90%). As predicted, growth inhibition at day 8 is proportional to the amount of the TBC1D16 transcript and is independent of TBC1D16 copy number (Figures 5C and D). Taken together, these results support CONEXIC’s prediction that TBC1D16 is required for proliferation in melanomas that over express the gene.
The TBC1D16 module contains a second modulator, RAB27A, also known to be involved in vesicular trafficking (Figure 5A). RAB27A functions, with RAB7A, to control melanosome transport and secretion. RAB7A localises to early melanosomes, while RAB27A is found in mature melanosomes (Jordens et al., 2006). CONEXIC selected both RAB27A and RAB7A as modulators.
RAB27A is in an amplified region that did not pass the standard GISTIC q-value threshold for significance and expression of the gene is not highly correlated with RAB27A copy number, compared to other candidate drivers (323th out of 428). Nevertheless, CONEXIC identified it as the top-scoring modulator out of the 33 genes in this region, and ranked it 8th out of 64 modulators and it was therefore selected for empirical assessment.
To test the prediction that RAB27A is important for proliferation in tumors with high levels of RAB27A, we tested the effect of shRNA knockdown of the RAB27A transcript on proliferation. We chose two STCs in which the gene is highly expressed WM1385 (28-fold greater expression compared with A375, DNA amplified) and WM1960 (38-fold greater expression, DNA not amplified) and two controls that express RAB27A at a lower level (A375 and WM1930). Western blots show that expression of RAB27A correlates with expression of the cognate gene in these cultures (data not shown).
Knockdown of RAB27A expression using shRNA was similar for all cultures (Figure S6), but only reduced cell growth significantly in the STCs that overexpress RAB27A (18% or 35% in WM1385 or WM1960 relative to the same cultures infected with GFP shRNA). RAB27A shRNA had less impact (growth rates of 65–80%) in the control STCs that have low RAB27A (Figures 6A and B). Growth inhibition at 6 days is correlated with the amount of the RAB27A transcript and is independent of RAB27A copy number (Figures 6B and C). Taken together, these results support CONEXIC’s prediction that RAB27A is a tumor dependency in melanomas that overexpress RAB27A.
To test whether RAB27A affects the expression of genes in associated modules, as predicted by CONEXIC, we carried out microarray profiling after knockdown of RAB27A in the test STCs (WM1385, WM1960). We compared the expression profile after RAB27A knockdown to a control profile generated by infecting the same STC with GFP shRNA. We used Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005) to test whether each of the 3 modules associated with RAB27A are enriched with genes that are differentially expressed (DEG) after knockdown (see Supplementary Methods). We found that all 3 RAB27A associated modules are significantly enriched for genes affected by RAB27A (p-values < 10−5 for all 3 modules, see Figure 7C), and that these modules responded in the direction predicted by CONEXIC.
These results support our computational prediction that the expression of RAB27A affects the expression of the genes in the associated modules. We note that RAB27A functions as vesicular trafficking protein, suggesting that it influences gene expression through an unknown, and likely indirect, mechanism. We used LitVAN to identify the biological processes and pathways represented among the DEGs. Cell cycle related terms are significant among the down-regulated genes, which might be expected given the reduced growth after RAB27A knockdown. In addition, we found that genes annotated for the Erk pathway are up-regulated (including MYC, FOSL1 and DUSP6). We used GSEA to measure enrichment of an experimentally derived set of genes that respond to MEK inhibition in melanoma (Pratilas et al., 2009). The resulting p-value < 4.7e−5 suggests that ERK signaling is altered after RAB27A knockdown in these STCs.
We carried out microarray profiling after knockdown of TBC1D16 to evaluate whether expression of TBC1D16 affects the expression of genes in the 4 modules associated with it. We used two shRNAs to knock down TBC1D16 in the test STCs (WM1960, WM1976) and compared the gene expression to controls infected with GFP shRNA (in the same STCs). GSEA analysis established that all 4 modules are significantly enriched for genes affected by differences in TBC1D16 expression (p-values < 10−5, 0.0002, 0.008 and 0.009 respectively, see Figure 7). Two modules responded to TBC1D16 knockdown in the direction predicted by CONEXIC. In addition, GSEA analysis ranked genes in the TBC1D16 module (Module25) highest out of 177 (based on the GSEA p-value), demonstrating that the genes in this module are the most highly differentially expressed genes in the data set.
The function of TBC1D16 is unknown, but it is predicted to be involved in vesicular trafficking. In our knockdown analysis LitVAN annotated the up-regulated genes with terms related to vesicular trafficking. These include RAB3C, RAB7A, CHMP1B, RAB18, SNX16, COPB1 and CAV1 (see Table S6A). However, it is not clear how TBC1D16 affects gene expression or how changes in expression impact vesicular trafficking.
We have demonstrated that combining tumor gene expression and copy number data into a single framework increases our ability to identify likely drivers in cancer and the processes affected by them. Gene expression allows us to distinguish between multiple genes in an amplified or deleted region (many of which are indistinguishable based on copy number) and to identify those that are likely to be drivers. The combination of data types allows us to identify regions that would be overlooked using methods based on DNA copy number alone.
The novelty of our method and the key to its success is our modeling paradigm: the expression of a driver should correspond with the expression of genes in an associated module. Examination of MITF and its targets supports our assumptions. Expression of MITF best correlates with the expression of its targets, but MITF overexpression does not always correspond with MITF amplification. A change in DNA copy number is only one of many ways that gene expression can be altered. For example, MITF expression can be upregulated via signaling from the Ras/Raf (oncogenic BRAF occurs frequently in melanoma) (Wellbrock et al., 2008) and Frizzled/Wnt pathways (Chin et al., 2006).
Most methods for identifying drivers within aberrant regions focus on genes whose expression is well correlated with the copy number of the cognate DNA (Lin et al., 2008; Turner et al., 2010). The expression of many of the predicted drivers we identify is poorly correlated with their copy number, relative to other genes in the region and to all other candidate drivers MITF (294th), TBC1D16 (252th) and RAB27A (323th) (see Table S1C). We believe the discrepancies between CNA and expression arise because there are multiple ways to up or down-regulate a gene. For example, TBC1D16 and RAB27A were both identified as transcriptional targets of MITF (Chiaverini et al., 2008; Hoek et al., 2008b), and are therefore up-regulated when MITF is over-expressed. Moreover, we postulate that many drivers are less correlated with their copy number than passengers due to selective pressure; if there is a fitness advantage to up or down regulate expression, the tumor will find a mechanism to do so.
We tested two drivers predicted by CONEXIC with knockdown experiments, and showed that tumors that express either TBC1D16 or RAB27A at high levels are dependent on the corresponding gene for growth. Our results demonstrate that these dependencies are determined by expression of the gene (in both cases), rather than DNA amplification status, further supporting the assumptions underlying our approach. Thus, we not only identify tumor dependencies, but also the tumors in which these genes are crucial for proliferation. Identifying dependencies that are critical for tumor survival is needed for drug targeted therapies. For example, FLT3 inhibitors in AML, which have had successful phase II trials (Fischer et al., 2010). Our approach is unbiased with respect to protein function and does not incorporate prior knowledge, thus enabling the identification of dependencies in genes involved with vesicular trafficking. TBC1D16 and RAB27A validate the ability of our approach to correctly identify tumor dependencies and the genes that they affect.
A key feature of our approach is that CONEXIC goes beyond identifying drivers. By associating candidate drivers with gene modules and annotating them using information from the literature, CONEXIC provides insight into the physiological roles of drivers and associated genes. We used LitVAn to find biological processes and pathways overrepresented in each module and to associate drivers with functions, accurately identifying targets of MITF and annotating the functions of known drivers (MITF, CCBN2 and TRAF3).
The results of microarray profiling following knockdown further support the association between modulator and module and confirm our ability to identify genes affected by TBC1D16 and RAB27A. We successfully connected genes involved in vesicular trafficking to their effects on gene expression, likely through a cascade of indirect influences. In addition to profiling the STCs that highly express each of these genes (test STCs), we also profiled two lower expressing STCs (control STCs), in which the effect of knockdown is less detrimental to growth. For TBC1D16, there is substantial overlap in the DEGs in the test STCs (p-value < 6.6e−22), but not in the DEGs between control and test STCs (p-value > 0.76). This reflects the complexity of the transformed state and demonstrates that genetic context has a fundamental impact on the effect of a perturbation.
Of the top 30 drivers selected by CONEXIC, three genes (TBC1D16, RAB27A and RAB7A) are known to be involved in vesicular trafficking (Itoh et al., 2006; Jordens et al., 2006). All of these genes are amplified (DNA) and highly expressed (RNA) in multiple melanomas. There is increasing evidence that genes controlling trafficking play a role in melanoma. Germline variation in GOLgi PHosphoprotein 3 (GOLPH3), a gene involved in vesicular trafficking, is associated with multiple cancers (Scott et al., 2009). Our data identifies two novel dependencies that are encoded in somatic CNAs, demonstrates the dependency of melanoma on TBC1D16 and RAB27A expression for proliferation and highlights the potential role of vesicular trafficking in this malignancy.
The role of vesicular trafficking in melanoma has yet to be characterized. Vesicular trafficking regulates many receptor tyrosine kinases (RTKs) both spatially and temporally and thus determines both the duration and intensity of signaling (Ying et al., 2010). For example, RAB7A is involved in the regulation of ERK signaling (Taub et al., 2007), and ERK is known to play an important role in melanoma (Chin et al., 2006). Tight control of ERK expression could potentially be important in melanocytes because of its influence on MITF: ERK is required for the activation of MITF, but high levels of ERK lead to MITF degradation (Wellbrock et al., 2008). It is possible that recurrent aberrations in vesicular trafficking genes might involve control of ERK signaling intensity. This is further supported by the up-regulation of an ERK signature (Pratilas et al., 2009) following RAB27A knockdown in our data (p-value < 4.7e−5).
CONEXIC differs from other methods in a number of ways. First, it uses the gene expression of a candidate driver, rather than its copy number as a proxy to report on the status of the gene, e.g., two tumors that overexpress a driver are treated equivalently, even if there is amplification in the DNA of only one of them. Second, it associates a candidate driver with a module of genes whose expression corresponds with that of the predicted driver, which was critical for identification of TBC1D16 as a modulator. Third, combining copy number and gene expression provides greater sensitivity for identifying significantly aberrant regions that would not be selected based on DNA alone, this was critical for the identification of RAB27A.
Methods based on copy number data are limited to detecting large regions containing multiple genes, such that the driver cannot be readily identified among them. Recent efforts have focused on integrating additional sources of information into the analysis. Some methods use prior information, such as the role of a gene in other cancers (Beroukhim et al., 2010). Others, like CONEXIC, integrate gene expression data (Adler et al., 2006), but the results of these methods fall short of CONEXIC's. We systematically compared CONEXIC to other methods using the same data and found that they did not identify MITF, or any other known driver in melanoma (see Supplementary Methods).
Statistical dependencies in gene expression have been used to connect a regulator to its target (Friedman et al., 2000; Lee et al., 2006; Segal et al., 2003) and for uncovering important regulators in cancer (Adler et al., 2006; Carro et al., 2010; Wang et al., 2009a). These approaches typically only detect transcription factors and signaling molecules and do not connect the altered regulatory networks to upstream genetic aberrations.
Incorporating information on amplification or deletion status allows us to consider any functional class of genes and thus permits detection of vesicular trafficking genes that would not be identified using other methods. It also allows us to relate the malignant phenotype to genetic aberrations from which it is likely to have originated.
We tuned our method towards reducing the selection of modulators that are not drivers. To gain this specificity, we do not detect all genes and pathways that drive tumors. First, some drivers in amplified and deleted regions do not pass the stringent statistical tests employed in our method. Second, CONEXIC only identifies candidate drivers that are encoded in amplified or deleted regions. In consequence, it would not detect drivers of melanoma such as BRAF and NRAS that are typically associated with point mutations. Third, CONEXIC detects drivers based on the assumptions delineated in Box 1; while these hold for many drivers, it is likely that they are not appropriate for all drivers.
To meet the challenge of finding all driving alterations in cancer, a number of complementary approaches are needed. Experimental approaches such as screening using pooled short hairpin RNAs (shRNAs) (Bric et al., 2009; Zender et al., 2008) are likely to detect a set of drivers different from those detected by CONEXIC. These screens are dependent on the genetic background and are limited to drivers that influence processes that can be readily measured, such as proliferation, whereas CONEXIC scans all the genetic data together and can potentially identify drivers of any function across different genetic backgrounds. In the future we envision that CONEXIC will be used to guide in vivo screening initiatives and to assist in the choice of regions, functional assays and genetic backgrounds probed.
The challenge of finding candidate drivers is considerable: tumors are heterogeneous, the data are noisy and highly correlated and there are a large number of possible combinations of drivers and genes in modules. Our approach is successful because it couples simple modeling assumptions with powerful computational search techniques and rigorous statistical evaluation of the results at each step.
Both the principles underlying CONEXIC and the software can be applied to any tumor cohort containing matched data for copy number aberrations and gene expression. The principle of associating any type of mutation (e.g., epigenetic alterations, coding sequence) with gene expression signatures or other phenotypic outputs that differ among samples will be of increasing importance as sequence and epigenetic data accumulates. Not only does this help to distinguish between driving and passenger mutations, but the genes in the associated module can also provide insight into the role of the driver. This approach can be used to identify the genetic aberrations responsible for tumorigenesis and to find those that relate to any other measurable phenotype, such as the resistance of tumors to drugs. We anticipate our approach will make an important contribution towards a basic mechanistic understanding of cancer and in revealing associations of clinical significance. Cancer is a heterogeneous disease in which we are only just beginning to appreciate the importance of genetic background and the myriad ways in which the cellular machinery can be redirected towards the transformed state. Methods that begin to dissect this complexity move us another step closer to a world where personalized therapies are routine.
A detailed description of the statistical methods and computational algorithms used can be found in the Supplementary Materials. The CONEXIC and LitVAN algorithms were developed for this research, the software is available at: www.c2b2.columbia.edu/danapeerlab/html/software.html
Cells were grown using standard culture conditions and knockdown was carried out by infection with lentivirus using RNAi sequences designed by the RNAi Consortium. shRNA lentivirus were prepared according to TRC protocols (http://www.broadinstitute.org/rnai/trc), with minor modifications. Cell proliferation assays, RT-PCR, microarrays and immunoblotting were carried out using standard techniques. Primer sequences and detailed methods can be found in Supplementary Experimental Procedures.
All primary data are available at the Gene Expression Omnibus (GSE23884).
The authors will like to thank Nir Hacohen, Antonio Iavarone, Daphne Koller, Liz Miller, Itsik Pe’er, Suzanne Pfeffer, Neal Rosen and Olga Troyanskaya for valuable comments. This research was supported by the National Institutes of Health Roadmap Initiative, NIH Director's New Innovator Award Program through grant number 1-DP2-OD002414-01 and National Centers for Biomedical Computing Grant 1U54CA121852-01A1. DP holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund and Packard Fellowship for Science and Engineering.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.