Cancer is a complex and highly heterogeneous disease mediated by myriad distinct cellular pathways, depending on the tissue of origin, the specific set of chromosomal aberrations and mutations, and environmental conditions. In leukemia, for instance, several documented oncogenic lesions work cooperatively to drive the cell to tumorigenesis (Mullighan et al, 2007). As a result, cancer phenotypes can exhibit a great range of genetic variability. With analytical methods still in their relative infancy, it is not surprising that we are only in the very preliminary stages of assembling a complete repertoire of germ-line and somatic oncogenic lesions for each cancer phenotype.
Such knowledge, albeit still partial, has already proven useful as a guide for therapeutic intervention (Downward, 2006) and is expected to become a key driver in the development of new personalized diagnostic and therapeutic strategies. Therefore, the computational inference of oncogenic events, as well as their specific impact on pathway dysregulation, has become the subject of intense focus in molecular biology.
High-throughput technologies are now producing vast amounts of biological data representing the availability of specific molecular species in a cellular population. These include, among many others, gene expression and genotypic profiles (Schena et al, 1995), DNA-binding profiles from chromatin immunoprecipitation (Ren et al, 2000), genomic sequences, and protein abundance from mass spectrometry (Perez and Nolan, 2002). These data have been used extensively to characterize the differences between cancer cells and their normal counterparts. Gene expression profiling, in particular, has been successful in classifying tumors and predicting patient prognosis based on specific molecular signatures. Such signatures have been applied to several phenotypes, including leukemia (Golub et al, 1999) and breast cancer (van 't Veer et al, 2002). In a similar context, expression profiling has also been used to characterize the molecular signatures arising from specific pharmacological interventions in the cell (Lamb et al, 2006).
Recently, using these data, a number of computational methods have been proposed for the identification of oncogenes, tumor-suppressor genes, and even entire pathways that are dysregulated in cancer. A highly recurrent gene fusion event, for instance, was identified in prostate cancer from expression profiles using an ‘outlier’ analysis approach (Tomlins et al, 2005). Additionally, genome-wide SNP profiling and array-based comparative genomic hybridization were applied to the identification of germ-line and somatic lesions in several cancers, including leukemia (Mullighan et al, 2007) and breast cancer (Yao et al, 2006). Integrative approaches were also proposed: copy-number and expression profile data, for instance, were successfully used in the identification of specific chromosomal amplifications in breast cancer (Adler et al, 2006). Other context-dependent methods have been proposed, such as those that use reference signatures of specific activated pathways to characterize tumors and establish drug sensitivity (Bild et al, 2006).
These methods, while partially successful, still focus primarily on characteristics of individual genes or gene products. They therefore cannot reveal how a protein's behavior has changed, or which specific mechanisms led to the pathologic transition.
In this paper, we introduce the interactome dysregulation enrichment analysis (IDEA) algorithm, which uses a genome-wide molecular interaction map as a systematic framework for the identification of genes playing a role in oncogenesis. Furthermore, we show that the same approach is also effective in identifying both targets and effectors of specific biochemical perturbations, a problem also known as the ‘drug mechanism-of-action’ (MOA). Interestingly, although the two problems are closely related, no computational algorithms are currently available to address the MOA problem in a human cellular context, even though solutions have been proposed in bacteria (Gardner et al, 2003) and yeast (di Bernardo et al, 2005). We suggest that studying dysregulation patterns at a cellular network level, rather than in a ‘gene-centric’ manner, can provide a highly efficient method for addressing both problems. Furthermore, the use of cellular networks provides a much-needed molecular interaction context to further characterize any gene predictions emerging from the analysis.
The use of an interaction network for gene–disease association is not novel per se. A few recent studies have leveraged the growing repertoire of interaction data for this purpose. In one example (Lage et al, 2007), protein–protein interaction networks were combined with Online Mendelian Inheritance in Man (OMIM) (Hamosh et al, 2000) annotation data to identify complexes implicated in disease progression. In another study specific to prostate cancer (Ergun et al, 2007), a regulatory network was inferred from microarray data and used as a filter to infer genetic mediators of disease progression. The approach was successful in identifying the androgen-receptor-signaling pathway, whose role in prostate cancer is already well documented. Both methods, however, like others in this category, still adopt a gene-centric approach, using the underlying network essentially as a filter to identify clusters of significant genes. Furthermore, these methods modeled only individual interaction layers, such as the transcriptional layer or the protein complex layer. Finally, no explicit biochemical validation was provided to support their prediction accuracy.
In this paper, we use an existing genome-wide cellular network, the B-cell interactome (BCI), originally assembled by our laboratory (Lefebvre et al, 2007) and further enhanced by including post-translational modulation events (C Lefebvre et al, in preparation). The BCI is a mixed-interaction network, representing several key molecular interaction types in a human B cell, including transcriptional, signaling, and complex formation. The proposed analysis works in two steps. We first use a large compendium of microarray expression profiles from normal, tumor-related, and experimentally manipulated B cells to identify BCI interactions showing either a gain of correlation (GoC) or a loss of correlation (LoC) pattern in the phenotype of interest. That is, each such interaction is either lost (LoC) or gained (GoC) in the specific phenotype compared with the background, based on an information-theoretic test. We then rank genes according to the statistical significance of the LoC/GoC enrichment among the interactions in which they directly participate (see Box 1 for method overview).
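As an illustration only, the first step can be sketched for a single interaction as follows. The sketch assumes that the information-theoretic test compares the mutual information (MI) of the two expression profiles over all samples with the MI obtained after removing the phenotype samples, and that significance is judged against random removals of equally many samples. The rank-binned MI estimator, the function names (`rank_bins`, `mutual_information`, `delta_mi`, `classify_edge`), and all parameter values are hypothetical choices made for this example and are not taken from the IDEA implementation.

```python
import math
import random
from collections import Counter


def rank_bins(values, n_bins):
    """Map each value to an equal-frequency bin index based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Plug-in mutual information (in nats) between two rank-binned profiles."""
    n = len(x)
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def delta_mi(x, y, removed):
    """MI over all samples minus MI after dropping the `removed` sample indices."""
    keep = [i for i in range(len(x)) if i not in removed]
    rest_x = [x[i] for i in keep]
    rest_y = [y[i] for i in keep]
    return mutual_information(x, y) - mutual_information(rest_x, rest_y)


def classify_edge(x, y, phenotype_idx, n_perm=200, alpha=0.05, seed=0):
    """Return 'LoC', 'GoC', or None for one interaction (hypothetical helper)."""
    phenotype = set(phenotype_idx)
    observed = delta_mi(x, y, phenotype)
    rng = random.Random(seed)
    # Null model (an assumption): drop random sample sets of the same size.
    null = [delta_mi(x, y, set(rng.sample(range(len(x)), len(phenotype))))
            for _ in range(n_perm)]
    if observed < 0:
        # Removing the phenotype samples raises MI: correlation is lost in P.
        extreme, label = sum(d <= observed for d in null), "LoC"
    else:
        # Removing the phenotype samples lowers MI: correlation is gained in P.
        extreme, label = sum(d >= observed for d in null), "GoC"
    p_value = (extreme + 1) / (n_perm + 1)  # one-sided empirical p-value
    return label if p_value < alpha else None
```

Here `x` and `y` are the expression profiles of the two interacting genes across the whole compendium, and `phenotype_idx` lists the sample indices belonging to the phenotype of interest. Applying `classify_edge` to every edge of the network would yield the set of LoC/GoC interactions that the second step takes as input.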
Interactome Dysregulation Enrichment Analysis (IDEA)
An overview of the proposed network-based analysis to characterize oncogenic mechanisms and pharmacological interventions. (A) In step 1, a comprehensive network of interactions is generated for B cells using a Bayesian evidence integration approach, including predictions of post-translational modifications. In this diagram, transcription factors are shown in red, non-transcription factors in gray, and modulators in blue. Directed arrows indicate protein–DNA (P-D) interactions, and undirected edges indicate protein–protein (P-P) interactions or modulation events. Sources of evidence, or clues, include curated databases, literature mining, orthologous interactions from model organisms, and reverse engineering algorithms. (B) In step 2, each interaction is analyzed to determine whether it shows aberrant behavior in a specific phenotype (P); that is, interactions that show correlation in all samples except P (TF1 and T1), or interactions that are not correlated in any samples except P (TF1 and T2). These dysregulated interactions are classified as LoC or GoC, respectively, for every edge in the BCI. (C) In step 3, these dysregulated interactions are pooled together and a statistical enrichment is calculated, which identifies genes having an unusually high number of these interactions in their neighborhood, either through direct or modulated links.
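The enrichment calculation in step 3 can be illustrated with a minimal sketch. It assumes that, for each gene, the number of dysregulated interactions among its direct interactions is compared with the network-wide count using a one-sided hypergeometric tail probability. The choice of test, the function names (`hypergeom_tail`, `rank_genes`), and the edge representation are assumptions made for illustration; modulated links are not handled here.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n edges are drawn from N edges, K of which are dysregulated."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def rank_genes(edges, dysregulated):
    """Rank genes by enrichment of dysregulated edges among their direct edges.

    edges        : list of (gene_a, gene_b) tuples (hypothetical representation)
    dysregulated : set of edges from `edges` flagged as LoC or GoC
    Returns a list of (gene, p_value) pairs, most significant first.
    """
    edges = list(edges)
    n_edges = len(edges)
    n_dys = sum(e in dysregulated for e in edges)
    degree, hits = {}, {}
    for e in edges:
        for gene in e:
            degree[gene] = degree.get(gene, 0) + 1
            hits[gene] = hits.get(gene, 0) + (e in dysregulated)
    scores = {g: hypergeom_tail(hits[g], n_dys, degree[g], n_edges) for g in degree}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

A gene whose direct interactions are mostly dysregulated, in a network where dysregulated edges are otherwise rare, receives a small tail probability and rises to the top of the ranking. In practice the resulting p-values would also need correction for multiple testing, which this sketch omits.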
This study introduces four key innovations: (1) by adopting a genome-wide, mixed-interaction network, instead of the individual interaction layers of previous studies, we cover a far greater range of processes within the cell; (2) rather than analyzing the differential properties of individual genes (e.g., expression profile or genotypic data), we identify molecular interactions that are significantly dysregulated in a particular phenotype of interest. We hypothesize that genes implicated in cancer initiation and progression (as well as those targeted by specific biochemical perturbations) will show dysregulated interactions with their molecular partners. Biologically, this is quite plausible, since biochemical perturbations as well as a wide variety of oncogenic events (gene fusion or translocation, post-translational protein modification, structural mutation) will manifest through gains or losses of regulatory, signaling, and protein–complex interaction capability; (3) we validate the approach on three distinct tumor models (follicular lymphoma (FL), Burkitt's lymphoma (BL), and mantle cell lymphoma (MCL)), whose oncogenic lesions are both known and completely different. In each case, we show that the known gene is ranked among the 20 most significant genes by the analysis; (4) finally, we biochemically validate the approach by perturbing B-cell lines (using the CD40 ligand/antibody) and by showing that the method is successful in identifying the perturbation targets (CD40 pathway genes).
A key advantage of such a network-centric approach is that it can identify relatively small, yet tightly connected areas of the network (modules) that are dysregulated, providing a window into the mechanistic and possibly synergistic processes underlying oncogenesis and biochemical perturbation.