Genome-wide messenger RNA profiling provides a snapshot of the global state of the cell under different experimental conditions such as diseased versus normal cellular states. However, because measurements are in the form of quantitative changes in messenger RNA levels, such experimental data does not provide direct understanding of the regulatory molecular mechanisms responsible for the observed changes. Identifying potential cell signaling regulatory mechanisms responsible for changes in gene expression under different experimental conditions or in different tissues has been the focus of many computational systems biology studies. The most popular approaches include promoter analysis, gene ontology, or pathway enrichment analysis, as well as reverse engineering of networks from messenger RNA expression data. Here we present a rational approach for identifying and ranking protein kinases that are likely responsible for observed changes in gene expression. By combining promoter analysis; data from various chromatin immunoprecipitation studies such as chromatin immunoprecipitation sequencing, chromatin immunoprecipitation coupled with paired-end ditag, and chromatin immunoprecipitation-on-chip; protein-protein interactions; and kinase-protein phosphorylation reactions collected from the literature, we can identify and rank candidate protein kinases for knock-down, or other types of functional validations, based on genome-wide changes in gene expression. We describe how protein kinase candidate identification and ranking can be made robust by cross-validation with phosphoproteomics data as well as through a literature-based text-mining approach. In conclusion, data integration can produce robust candidate rankings for understanding cell regulation through identification of the protein kinases responsible for gene expression changes, thereby accelerating drug target discovery and helping to unravel drug mechanisms of action.
As progress in the biomedical sciences leads to breakthroughs in biotechnology, biotechnology in turn fuels further progress in the biomedical sciences. However, as molecular data collection in mammalian cell biology research becomes more extensive and complex, advances in computation are now critical for continued progress.1 Computational systems biology is playing an increasingly important role in many aspects of biomedical research.2 Having emerged as a result of the sequencing of the yeast, mouse, and human genomes, computational systems biology and bioinformatics (CSB) aims to develop robust theoretical models that explain how molecular components give rise to cellular, tissue, and organism phenotypes at the genome-wide scale.3 Biotechnological advances in instruments capable of measuring molecular components within cells at a genome-wide level, together with the infusion of ideas from physics and mathematics for data analysis and advances from computer science for data sharing, storage, search, and visualization, are expected to lead to a surge in biomedical breakthroughs in translational research in the near future.
Although genome-wide proteomic approaches are improving rapidly, the most widely available and cost-effective genome-wide expression data, now and throughout the past decade, has been collected at the RNA level. Typical studies examine cells under different experimental conditions such as control versus treated, or disease versus normal intracellular states. Since quantitative changes in messenger RNA (mRNA) levels do not directly explain how regulatory molecular mechanisms are altered to induce changes in gene expression, and in turn lead to changes in cellular phenotype, identifying such regulatory mechanisms has been the focus of many CSB studies. Such understanding would enable us, among other things, to better control cell behavior with small molecules and, in turn, to translate that ability into therapeutics development. The most popular approaches for interpreting changes in gene expression include promoter analysis,4,5 gene ontology,6 and pathway enrichment analyses,7 as well as reverse engineering of networks from mRNA expression data.8 The ultimate goal of many of these studies is to identify and rank potential target genes/proteins whose knock-down would reverse, and thereby explain, the observed changes. Such protein targets may ultimately become drug targets. In this article, we describe how protein kinases that are likely responsible for observed genome-wide changes in gene expression at the mRNA level can be identified and ranked by a rational approach. In the first phase, one combines promoter analysis of microarray results with data from various chromatin immunoprecipitation (ChIP-X) studies, such as chromatin immunoprecipitation sequencing (ChIP-Seq), chromatin immunoprecipitation coupled with paired-end ditag (ChIP-PET), and chromatin immunoprecipitation-on-chip (ChIP-chip).
Then, using protein-protein interactions and kinase-protein phosphorylation reactions collected from the public domain, one can potentially identify and rank candidate protein kinases for knock-down and/or other types of functional validations.
We discuss how predictions made by this approach can be cross-validated using other types of high-throughput experimental data, such as phosphoproteomics and protein/DNA arrays, as well as using lists of protein kinases extracted automatically from the literature.
The overall workflow is shown schematically in Figure 1 and discussed below.
The first step of our CSB approach is to identify the genes (mRNAs) that are differentially expressed between 2 conditions. This is a standard procedure that can be carried out using statistical tests such as analysis of variance or an adjusted t test, and/or unsupervised clustering approaches such as hierarchical clustering or principal component analysis.
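As a minimal sketch of this first step, the Python snippet below computes Welch's t statistic per gene and splits genes into increased and decreased sets. The gene names, expression values, and the cutoff of 4.0 are illustrative assumptions, not values from this article; a real analysis would derive P values (for example with scipy.stats.ttest_ind) and adjust them for multiple testing rather than threshold the raw statistic.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(control, treated):
    """Welch's t statistic (unequal variances); positive when treated > control.

    Assumes at least 2 replicates per group and nonzero variance overall.
    """
    standard_error = sqrt(variance(control) / len(control)
                          + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / standard_error


def split_by_direction(expression, t_cutoff=4.0):
    """Return (increased, decreased) gene lists.

    `expression` maps gene -> (control values, treated values) on a log2 scale.
    The cutoff is a hypothetical threshold chosen for this toy example only.
    """
    increased, decreased = [], []
    for gene, (control, treated) in expression.items():
        t = welch_t(control, treated)
        if t >= t_cutoff:
            increased.append(gene)
        elif t <= -t_cutoff:
            decreased.append(gene)
    return sorted(increased), sorted(decreased)


# Hypothetical log2 expression values: 3 control and 3 treated replicates.
toy_expression = {
    "GENE_A": ([5.0, 5.1, 4.9], [7.0, 7.2, 6.9]),
    "GENE_B": ([8.0, 8.1, 7.9], [6.0, 6.1, 5.8]),
    "GENE_C": ([6.0, 6.3, 5.8], [6.1, 5.9, 6.2]),
}
up_genes, down_genes = split_by_direction(toy_expression)
print("increased:", up_genes)
print("decreased:", down_genes)
```

Keeping the increased and decreased genes in separate lists matters for the later steps, because enrichment can be computed for each direction on its own.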
To link changes in gene expression to the molecular mechanisms responsible for the observed changes, we can first apply promoter analysis using binding site matrices obtained from databases such as TRANSFAC4 or JASPAR.5 This method computationally scans the DNA sequence in the proximity of genes’ coding regions, seeking enrichment of binding sites that match annotated transcription factor binding logo-motifs. Such an approach can identify and rank a list of transcription factor candidates responsible for the observed changes. Binding-site enrichment can be computed across all genes that changed significantly in mRNA expression, or separately for the genes that increased and the genes that decreased in expression compared with the control. Alternatively, we can generate a list of the most likely transcriptional regulators by cross-referencing the genes that increased or decreased in expression with previously published ChIP-X studies. Such studies report the binding of specific transcription factors in proximity to gene coding regions. By compiling the results from many ChIP-X studies, we can obtain a global picture of the transcriptional activity of many transcription factors. Although such data is collected in many cell types and across different mammalian organisms under different conditions, it has the advantage of reflecting the chromatin state of the cell and as such is expected to reduce false positives, a critical limitation of the binding logo-motif promoter scanning approach. Both the promoter analyses and the ChIP-X enrichment analyses produce ranked lists of transcription factors that most likely regulate genes that significantly increased or decreased in mRNA expression.
Such lists can be compared for overlap to assess consistency.
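The enrichment ranking described above can be sketched as follows. This example assumes a one-sided hypergeometric test, which is one common choice for gene-set overlap and is not necessarily the statistic used by the tools named in this article. The transcription factor names, target sets, and background size are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are sampled from `population` genes,
    of which `successes` are targets of the transcription factor."""
    upper = min(successes, draws)
    favorable = sum(comb(successes, i) * comb(population - successes, draws - i)
                    for i in range(k, upper + 1))
    return favorable / comb(population, draws)


def rank_transcription_factors(gene_list, tf_targets, background_size):
    """Rank factors by enrichment of their targets in `gene_list`.

    `tf_targets` maps factor -> set of target genes (for example, compiled
    from ChIP-X studies). All genes are assumed to lie in the background.
    Returns (factor, overlap, p_value) tuples, smallest p_value first.
    """
    genes = set(gene_list)
    rows = []
    for factor, targets in tf_targets.items():
        overlap = len(genes & targets)
        p_value = hypergeom_upper_tail(overlap, background_size,
                                       len(targets), len(genes))
        rows.append((factor, overlap, p_value))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical inputs: 4 differentially expressed genes, 20-gene background.
changed_genes = ["g1", "g2", "g3", "g4"]
toy_targets = {
    "TF_X": {"g1", "g2", "g3", "g9"},
    "TF_Y": {"g4", "g10", "g11", "g12", "g13"},
}
ranked = rank_transcription_factors(changed_genes, toy_targets, background_size=20)
for factor, overlap, p_value in ranked:
    print(factor, overlap, round(p_value, 4))
```

The same function can be run once with targets derived from promoter scanning and once with targets derived from ChIP-X data, after which the two ranked lists can be compared for overlap.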
Most analyses stop at this stage; however, the next step in our approach is to “connect” the transcription factors detected by the ChIP-X enrichment and/or by the promoter scanning approach using known, experimentally reported protein-protein interactions. Several tools have been developed for using prior knowledge about protein-protein interaction networks to build subnetworks that connect lists of “seed nodes” given as input.9–12 We have developed Genes2Networks10 and successfully used it for finding pathways responsible for neurite outgrowth13 and predicting a novel disease gene, SHOC2, that was found to contain a mutation that can cause a Noonan-like syndrome.14 Once we have built a subnetwork that connects the transcription factors, we can convert the nodes within this subnetwork to a list of proteins. Such a list can then be fed into Kinase Enrichment Analysis (KEA), a Web tool we developed for identifying protein kinases that are enriched with substrates in a given input list of proteins.15 Alternatively, sequence-based motif discovery tools such as Motif-X16 or GPS,17 or similar sequence-based approaches,17,18 can be used to identify the protein kinases whose substrate sequences are enriched in the subnetwork. Either route points to the protein kinases that most likely regulate the subnetwork anchored with the transcription factors detected to regulate the observed changes in gene expression. The final output of the workflow is a ranked list of protein kinases as targets for experimental validation through knock-down, overexpression, or dominant-negative perturbations.
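To illustrate the idea of connecting seed transcription factors and then scoring kinases, here is a minimal sketch. It is not the Genes2Networks or KEA algorithm: it simply joins every pair of seeds by one shortest path found with breadth-first search, and ranks kinases by the raw number of their substrates in the resulting subnetwork. All protein names, interactions, and kinase-substrate sets are hypothetical, and a real analysis would replace the raw count with an enrichment statistic.

```python
from collections import deque
from itertools import combinations


def build_graph(interactions):
    """Undirected adjacency map from (protein, protein) interaction pairs."""
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def connect_seeds(graph, seeds):
    """Union of the seeds and the intermediates on pairwise shortest paths."""
    nodes = set(seeds)
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(graph, a, b)
        if path:
            nodes.update(path)
    return nodes


def rank_kinases(proteins, kinase_substrates):
    """Sort kinases by how many of their known substrates are in `proteins`."""
    counts = {kinase: len(proteins & substrates)
              for kinase, substrates in kinase_substrates.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical interaction network and kinase-substrate annotations.
toy_graph = build_graph([("TF1", "P1"), ("P1", "TF2"), ("TF2", "P2"),
                         ("P2", "TF3"), ("TF1", "P3"), ("P3", "P4")])
subnetwork = connect_seeds(toy_graph, {"TF1", "TF2", "TF3"})
kinase_ranking = rank_kinases(subnetwork, {
    "KIN_A": {"TF1", "P1", "P2"},
    "KIN_B": {"TF3", "P4"},
    "KIN_C": {"P3", "P4"},
})
print(sorted(subnetwork))
print(kinase_ranking)
```

In this toy network the intermediates P1 and P2 are pulled into the subnetwork because they lie between seeds, whereas P3 and P4 are left out, so a kinase that only targets P3 and P4 receives no support.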
Some of the steps in the process can be skipped. We can, for example, look directly for kinases whose mRNA levels changed significantly. Alternatively, we can look for kinases in the protein-protein interaction subnetworks created from the transcription factors detected computationally, avoiding the KEA or Motif-X analyses. We also have the option to use only transcription factor binding motif promoter scanning and not ChIP-X, or to use only KEA and not Motif-X. How can we evaluate which approach works best? To answer this question we can bring in additional experimental evidence for cross-validation. For example, if we have data describing changes in protein/DNA interactions,13 or proteomics data reporting quantitative changes in protein levels (eg, profiling the entire nuclear proteome19), or, more directly, stable isotope labeling with amino acids in cell culture (SILAC) phosphoproteomics data describing quantitative changes in phosphoprotein levels,20 we can evaluate and validate the different steps of the analysis. Protein/DNA interaction arrays such as those developed by Panomics/Affymetrix (www.panomics.com/index.php?id=product18) can validate the transcription factors detected by the promoter analysis or the ChIP-X enrichment analysis. SILAC phosphoproteomics followed by a KEA or Motif-X analysis can validate the final output of the protein kinase rankings. In addition, we can check whether the candidate protein kinases detected through the analysis were previously implicated in the biological process we aim to better understand. Ultimately, we can employ an RNA interference (RNAi) screen targeting all kinases to see whether the screen results agree with our predicted rankings. The different validation approaches are summarized in the schematic provided in Figure 1 under the title Cross-Validation.
All the computational analyses presented here can be carried out by combining the programs Genes2Networks,10 Lists2Networks,21 KEA,15 and ChEA (unpublished) developed by the Ma’ayan laboratory.
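One simple way to quantify the cross-validation described above is to compare the top of two kinase rankings, for example one derived from gene expression and one derived from phosphoproteomics. The sketch below uses the Jaccard index of the top-k sets; this metric, the value of k, and the kinase names are assumptions made for illustration and are not prescribed by the workflow.

```python
def top_k_agreement(ranking_a, ranking_b, k):
    """Compare the top-k entries of two ranked lists (best candidate first).

    Returns the shared kinases in sorted order and the Jaccard index of the
    two top-k sets. Assumes k >= 1 and that both rankings are non-empty.
    """
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = top_a & top_b
    return sorted(shared), len(shared) / len(top_a | top_b)


# Hypothetical rankings produced by two independent routes.
expression_route = ["KIN_A", "KIN_B", "KIN_C", "KIN_D"]
phosphoproteomics_route = ["KIN_B", "KIN_E", "KIN_A", "KIN_C"]

shared_kinases, jaccard = top_k_agreement(expression_route,
                                          phosphoproteomics_route, k=3)
print(shared_kinases, jaccard)
```

Kinases that appear near the top of both rankings are the more robust candidates for knock-down, and repeating the comparison across different routes and cutoffs gives a rough indication of which parameter choices yield consistent predictions.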
Protein kinases are attractive candidate drug targets because they exert global effects: by phosphorylating many targets, they can produce broad changes in cell behavior. Small-molecule inhibitors that target protein kinases can be readily screened for hits through in vitro phosphorylation assays. Protein kinases contain regulatory structural domains as well as catalytic domains that are amenable to small-molecule targeting. Here we have described several alternatives for inferring and ranking protein kinases, both computationally and experimentally, and for cross-validation of predictions initially made based on gene expression changes. However, the approach presented here still needs to be validated experimentally. Our approach has many “free parameters” as well as “process routes” for making the ranked kinase predictions. Hence, protein kinase rankings can result in many different output tables depending on the computational and/or experimental routes taken and the parameters chosen for cutoffs and algorithmic setup in each step. Although cross-validation could be useful for guiding the choice of route and parameters, experimental validation, through functional studies as well as through short hairpin RNA (shRNA) or RNAi screens, is probably the best way to evaluate the approach. CSB is now at a stage where data integration across regulatory layers can be used to infer and unravel regulation at the molecular level of mammalian cells in a degree of detail that was not possible before. Computational approaches for data integration and creative algorithms that can extract new knowledge from the increasingly complex, multidimensional, diverse datasets are of critical importance in this milieu.
The research presented is supported by NIH grants R01-DK088541 to JCH and AM, R01-DK078897 to JCH, and P50-GM071558-01A27398 to AM.
Potential conflict of interest: Nothing to report.