Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.
Biomarkers are biological entities or characteristics that can be used to indicate the states of healthy or diseased cells, tissues, or individuals. Nowadays, biomarkers are mostly molecular markers, such as genes, proteins, metabolites, glycans, and other molecules, that can be used for disease diagnosis, prognosis, prediction of therapeutic responses, as well as therapeutic development (1–3). Over the past decade, high-throughput technologies, such as genomic microarrays and proteomic and metabolomic mass spectrometry, have been used to generate large amounts of data from single experiments, allowing global comparison of the changes in molecular profiles that underlie particular cellular phenotypes. As a result, omics-based approaches, coupled with computational and bioinformatics methods, provide unprecedented opportunities to speed up biomarker discovery and are now widely used to facilitate diagnostic and therapeutic developments for many diseases, particularly cancers (4–10). Potential biomarkers have been identified at various molecular levels, including genetic, mRNA, protein/peptide, as well as epigenetic (11), miRNA (12), glycans (13), and metabolites (4). For example, using DIGE-based proteomics, potential biomarkers (e.g., PPA2 and Ezrin) were identified as useful for the diagnosis of metastatic prostate cancer (14), and a proteolytic fragment of alpha1-antitrypsin (BF5) was identified as a potential diagnostic and prognostic marker for inflammatory breast cancer as well as a target for potential therapeutic intervention (15, 16). Epigenetic markers have also been reported; for example, PITX2 DNA methylation measured in paraffin-embedded tissue was reported as a robust assay for outcome prediction in early breast cancer patients treated with adjuvant tamoxifen therapy (11). In addition, microRNAs, such as miR-500, were identified as potential diagnostic markers for hepatic cell carcinoma (17).
Increasingly, pathway and network-based analyses are applied to Omics data to gain more insight into the underlying biological function and processes, such as cell signaling and metabolic pathways and gene regulatory networks (18, 19). For example, 12 core signaling pathways were shown to be altered in human pancreatic cancers through genomic analyses (18). Network modeling linked breast cancer susceptibility to the centrosome dysfunction (20), and led to the identification of a proliferation/differentiation switch in cellular networks of multicellular organisms (21). These approaches have led to a new trend in identifying biomarkers in recent years, namely, pathway and network-based biomarker discovery, which identify panels of, instead of single, biomarkers for practical use in diagnostic and therapeutic developments (22–24). Protein networks have been shown to provide a powerful source of information for disease classification and to help in predicting disease causing genes (25, 26). Network approaches have also been used for improving the prediction of cancer outcome (27, 28), providing novel hypotheses for pathways involved in tumor progression (28), and exploring cancer-associated genes (29).
In this chapter, we focus on the methodology for the identification of molecular targets through functional Omics data analysis particularly of biological pathways, which can provide more mechanistic insights into the underlying phenotypes and may facilitate therapeutics development. We adopt a strategy that emphasizes the use of curated knowledge resources, and describe a workflow and procedures, coupled with expert-guided analysis and interpretation, for the selection of potential molecular targets.
Despite the rapid advancement of high-throughput technologies and bioinformatics tools, the functional analysis and interpretation of Omics data remain challenging because of the high variation, low reproducibility, and noise of the data. Although many algorithms and tools have been developed to address these challenges, much of the difficulty is inherent in the long workflows of Omics experiments, which run from sample preparation and raw data acquisition to data processing and analysis. Many statistical and machine learning methods have been developed for better partitioning or clustering of genes (30–33); however, understanding the biological meaning of the resulting groups of genes/proteins and interpreting their function are critical downstream steps in the Omics workflow and are necessary for the design of therapeutic strategies. This downstream functional analysis relies heavily on existing knowledge annotated for genes or proteins and frequently requires expert-guided analysis for appropriate interpretation.
Annotations of genes and proteins integrated from multiple bioinformatics databases are the basis for functional analysis and interpretation of Omics data (34). Numerous gene and protein databases, varying in size and scope, have been developed to provide functional annotations for genes and gene products, as archived for the past decade, e.g., in the “Molecular Biology Database Collection” in the journal Nucleic Acids Research (35). The number of databases and database entries is rapidly growing, e.g., in 2009 the journal archived a total of 1,170 databases, nearly 100 more than in 2008. These databases are divided into 14 general categories, including databases of DNA, RNA and protein sequences, structure, genomics, proteomics, metabolic, and signaling pathways. Databases most relevant to Omics data analyses include: (1) Gene and protein databases, such as UniProt (36) for protein-based annotations, Entrez Gene (37) and model organism databases (e.g., Mouse Genome Database) (38) for gene-based annotations; (2) GO annotations, such as GOA for annotation of gene products with Gene Ontology (GO) terms (39); (3) Biological pathways, such as KEGG (40) and Pathway Interaction Database (PID) (41) for annotations of proteins involved in metabolic and signaling pathways; Pathway Commons has been developed as a single point of access for diverse pathway databases; and (4) Protein–protein interaction (PPI) databases, such as IntAct (42) and MINT (43) for annotations of proteins involved in physical interaction of proteins.
Mapping different Omics data types (e.g., gene, mRNA, peptide/protein, metabolite) to the common biological entities (e.g., proteins) is an essential step for deriving comprehensive annotations for functional Omics data analysis (34). Omics data mapping is accomplished most commonly by ID (database entry identifier) mapping that allows different but related biological entities to be mapped to the IDs of common entities (e.g., proteins). One of the most common issues in protein mapping is that the relation between different types of biological entity could be one-to-one (e.g., one gene ID to one protein ID) or one-to-many (e.g., one gene ID to two or more protein IDs), and this is not only caused by the difference between genes and proteins (e.g., one gene encodes several protein isoforms), but can also result from database redundancy (see Note 1). UniProt Knowledgebase (UniProtKB) is the main section of UniProt with comprehensive and high-quality protein sequence annotations (44), and iProClass is an integrated database for all UniProt protein sequences with value-added annotations integrated from over 100 other databases (45). The UniProt and iProClass databases thus serve as the underlying infrastructure for protein ID mapping (different IDs mapped to UniProtKB protein IDs) and data integration for experimental Omics data. ID mapping based on the two databases allows ~32 commonly used, heterogeneous IDs to be converted from each other and the ID mapping services are available online both at the Protein Information Resource (PIR) (http://pir.georgetown.edu) and UniProt (http://www.uniprot.org). ID mapping data files are also available at PIR for download to perform data mapping offline. Other ID mapping tools include DAVID gene ID conversion tool (http://david.abcc.ncifcrf.gov/conversion.jsp) (46) and Protein Identifier Cross-Reference Service (PICR, http://www.ebi.ac.uk/Tools/picr) (47).
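The distinction between one-to-one and one-to-many relations can be made explicit when an ID mapping table is applied. The following is a minimal Python sketch, assuming the mapping table has already been obtained as (source ID, UniProtKB accession) pairs, for instance from a downloaded mapping file. The function names and every identifier in the toy table are hypothetical placeholders; they are not real database entries and do not represent the interface of the PIR or UniProt mapping services.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Collect (source_id, uniprot_ac) pairs into source_id -> sorted accessions."""
    table = defaultdict(set)
    for source_id, uniprot_ac in pairs:
        table[source_id].add(uniprot_ac)
    return {source_id: sorted(acs) for source_id, acs in table.items()}


def classify_mappings(input_ids, table):
    """Split input IDs into one-to-one, one-to-many, and unmapped groups."""
    result = {"one_to_one": {}, "one_to_many": {}, "unmapped": []}
    for source_id in input_ids:
        targets = table.get(source_id, [])
        if not targets:
            result["unmapped"].append(source_id)
        elif len(targets) == 1:
            result["one_to_one"][source_id] = targets[0]
        else:
            result["one_to_many"][source_id] = targets
    return result


# Toy mapping table with made-up identifiers (assumption, illustration only).
PAIRS = [
    ("GENE_1", "PROT_A"),
    ("GENE_2", "PROT_B"),
    ("GENE_2", "PROT_C"),  # e.g., two isoforms or redundant database entries
]
TABLE = build_mapping(PAIRS)
REPORT = classify_mappings(["GENE_1", "GENE_2", "GENE_3"], TABLE)
```

In this sketch the one-to-many cases are reported rather than resolved automatically, since deciding whether they reflect genuine isoforms or database redundancy generally requires inspection.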
Various bioinformatics tools are available for functional profiling of Omics data based on annotations of genes and proteins, such as PIR batch retrieval and functional categorization tool (http://pir.georgetown.edu/pirwww/search/batch.shtml), iProXpress (http://pir.georgetown.edu/iproxpress) (34), DAVID (48), and BABELOMICS (http://babelomics.bioinfo.cipf.es) (49). Annotations used for profiling by these tools include GO terms, pathways, keywords, sequence features, and families, among which GO terms and pathways are the most commonly used: GO has become a common annotation standard, and pathways provide more insightful biological meaning for the data. Moreover, many concepts in other annotations, such as keywords, have been covered by GO terms. While most of these tools allow profiling of single gene/protein list or two for comparison, iProXpress provides comparative profiling of multiple data sets (or data groups) for cross-data sets comparison, a very useful feature that accommodates many real-world data analysis issues.
For pathway analysis, mapping experimental data to metabolic and signaling pathways is key to the functional interpretation of Omics data. Curated canonical pathway maps are available in many pathway databases; however, few public Omics analysis tools integrate the maps into their systems so that experimental data can be superimposed onto them. Several commercial pathway analysis systems are available, such as Ingenuity IPA (http://www.ingenuity.com) and GeneGO MetaCore (http://www.genego.com). Although these tools differ in features, such as visualization of canonical pathways and presentation of experimental data mapped onto the pathways, they have one feature in common: in addition to the publicly available data in pathway databases, such as KEGG and PID, they integrate additional pathway and functional association data manually curated from the literature.
Despite the extensive use of annotations from current knowledgebases for functional analysis of Omics data, annotations of genes and proteins lag far behind the rapid growth of the literature because of the ever-expanding sequence data and the laborious nature of manual curation. In nearly all Omics experiments, varying numbers of identified genes or proteins lack sufficient annotations in databases to be functionally analyzed, and in such cases the literature becomes the critical source for deriving functional information. Although literature data have been used alone or combined with other Omics data to generate gene/protein association networks (50–52), currently no literature mining tools have been integrated into any pipelined Omics system in such a way that computationally extracted data are used directly as annotations for functional data analysis. Nonetheless, literature text mining is an important component of the data analysis workflow and has been used to assist pathway analysis, as in ResNet of Pathway Studio (53) (http://www.ariadnegenomics.com/products/databases/ariadne-resnet). A variety of text mining tools are available to assist in mining relevant gene or protein data from the literature, and these, coupled with manual searches of PubMed, are often necessary for functional Omics data analyses (see Note 2).
The pathway and network-based Omics data analysis approach aims to delineate molecular maps that underlie the changes in biological samples under investigation, and to aid in discovery of molecular targets and biomarkers for diagnostic and therapeutic developments. Below we describe practical procedures applied to analyses of Omics data related to cell signaling and metabolic pathways, as well as organelle biogenesis.
We focus on downstream analytical steps of the Omics workflow leading to functional interpretation of Omics data. The workflow begins with a list of gene or protein identifiers or peptide sequences as results from upstream data processing and analysis, e.g., gene clusters or differentially expressed genes or proteins and follows steps 1–6 depicted in Fig. 1: The genes or proteins in the list are then mapped to UniProtKB protein identifiers (step 1). Next, functional annotations are derived for the list of genes or proteins (step 2) based on integrated data from multiple bioinformatics databases (step 4), including text mining of literature for information that has not yet been annotated in databases (step 5). Steps 4 and 5 make maximal use of public knowledge resources. Functional analyses are often conducted using several approaches (step 3) based on different types of knowledge annotated in bioinformatics databases, i.e., GO profiling, molecular networks, and biological pathways. Among them, GO profiling, while revealing limited biological insights into Omics data, usually covers most of the genes/proteins under analysis (see Note 3). By contrast, while giving more biological insights, pathway analysis is limited by low coverage of proteins annotated in known canonical pathways (see Note 4). In between the GO profiling and pathway mapping is molecular network analysis of interactions or functional associations between genes or proteins. Finally, molecular targets are inferred from the functional analysis (step 6).
Omics experiments are often carried out under various experimental conditions, from which differential patterns of gene or protein expressions are to be analyzed and potential molecular targets are sought. To assist the subsequent bioinformatics analysis, genes or proteins associated with different experimental conditions are divided into appropriate data groups and assigned with proper notations (Table 1). Although there is no fixed scheme for assignment, the notations usually clearly distinguish the key conditions under which each experiment is carried out and/or data are collected for given studies. There are additional considerations in Omics data grouping in the case of proteomic data (see Note 5).
Because the UniProt and iProClass databases are the data warehouse of the iProXpress system and serve as the underlying infrastructure for Omics data mapping and integration, the list of genes or proteins from Omics data is mapped to UniProtKB protein entries, a step referred to as protein mapping, to obtain functional annotations. Protein mapping is primarily based on gene/protein identifiers. For gene expression microarray data, commonly used gene identifiers include Entrez Gene ID, NCBI gi number, and Refseq ID. For mass spectrometry (MS) proteomic data, depending on the database selected for protein identification by the search engine (e.g., MASCOT), the commonly used identifiers include UniProtKB, IPI, NCBI nr, Refseq, etc. Gene and protein IDs are mapped to UniProtKB entries using the comprehensive ID mapping tools available at PIR or UniProt, which convert commonly used gene and protein IDs (such as NCBI's gi number and Entrez Gene ID) to UniProtKB IDs and vice versa. After protein mapping, all gene or protein IDs from one or more data sets or experimental groups are integrated into a master list of UniProtKB identifiers (ACs or IDs), each associated with the corresponding experimental groups and notes (Table 1). This master list of proteins is the basis for the subsequent functional annotation and analysis using the iProXpress system.
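The integration of mapped IDs into a master list can be illustrated with a short Python sketch. It assumes that each experimental group is given as a list of source identifiers and that an ID mapping to UniProtKB accessions is already available as a dictionary. The group notations, identifiers, and the function name are hypothetical and are not taken from Table 1 or from the iProXpress system.

```python
def build_master_list(groups, id_map):
    """Return ({uniprot_ac: sorted group labels}, sorted unmapped source IDs)."""
    master = {}
    unmapped = set()
    for label, source_ids in groups.items():
        for source_id in source_ids:
            accessions = id_map.get(source_id, [])
            if not accessions:
                # Keep track of IDs that need sequence- or name-based mapping.
                unmapped.add(source_id)
            for ac in accessions:
                master.setdefault(ac, set()).add(label)
    return {ac: sorted(labels) for ac, labels in master.items()}, sorted(unmapped)


# Hypothetical group notations and identifiers (assumption, illustration only).
GROUPS = {
    "microarray_treated": ["GENE_1", "GENE_2"],
    "proteomics_treated": ["IPI_X", "GENE_9"],
}
ID_MAP = {"GENE_1": ["PROT_A"], "GENE_2": ["PROT_B"], "IPI_X": ["PROT_A"]}
MASTER, UNMAPPED = build_master_list(GROUPS, ID_MAP)
```

Here `PROT_A` ends up associated with both groups because a gene ID and a protein ID from different data sets map to the same accession, which is the situation the master list is meant to capture.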
Frequently, UniProtKB entry matches are not found for a fraction of the input gene or protein identifiers. This results from updates of database identifiers or deletion of entries, which occur in most databases, and is especially common when analyzing legacy data in which mixed database identifiers are often used. In such cases, the mapping can be based on sequence comparison or, if the sequence is not available, on name mapping. For genes, the sequence identity and taxonomy information may be used to map the gi numbers to UniProtKB IDs in addition to the mapping bridged by EMBL/GenBank protein accessions (34). For MS proteomic data, peptide sequences are matched against all sequences in UniProtKB (see Note 6).
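As a rough illustration of peptide-based mapping, the Python sketch below matches peptides against protein sequences by exact substring search. This is an assumption made for illustration: the text does not specify the matching method, and a production workflow would more likely use an indexed or alignment-based search. The sequences, accessions, and function name are made up.

```python
def match_peptides(peptides, proteins):
    """Map each peptide to the accessions whose sequence contains it exactly."""
    hits = {}
    for peptide in peptides:
        query = peptide.strip().upper()
        hits[peptide] = sorted(
            ac
            for ac, sequence in proteins.items()
            if query and query in sequence.upper()
        )
    return hits


# Made-up sequences keyed by placeholder accessions (assumption).
PROTEINS = {"PROT_A": "MKTAYIAKQRQISFVK", "PROT_B": "MSTAYIAKLLE"}
HITS = match_peptides(["TAYIAK", "QRQIS", "WWWW"], PROTEINS)
```

A peptide that matches more than one entry, such as `TAYIAK` here, is another source of one-to-many relations and is kept as a list rather than assigned to a single protein.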
When gene microarray and MS proteomic experiments are conducted on the same biological samples under identical or similar conditions, the two Omics data sets are compared after the data are merged through protein mapping. Direct comparison of expression at both the mRNA and protein levels can provide stronger evidence for the underlying changes. For example, the 2D-gel/MS proteomics study identified 412 and 771 proteins that potentially changed in response to radiation treatment in ATM (Ataxia Telangiectasia Mutated) mutated (ATM−) and wild type (ATM+) cells, respectively, while the gene microarray study identified 103 and 131 significantly changed genes in the two cells, respectively (54). Among those genes/proteins, only 13 were identified by both approaches, including RRM2, the catalytic subunit of ribonucleoside-diphosphate reductase (RR), a rate-limiting enzyme required for the synthesis of dNDP and thus for DNA synthesis in humans (55). However, care should be taken in mapping data from genes to proteins because of the one-to-many relations and redundancy existing in the UniProt database (see Note 1).
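Once both data sets are expressed as UniProtKB accessions, the comparison reduces to set operations. The following Python sketch assumes each data set is simply a collection of mapped accessions; the accessions and the function name are hypothetical, and the sketch does not reproduce the actual lists from the cited study.

```python
def compare_platforms(mrna_hits, protein_hits):
    """Partition mapped accessions by the Omics data type that detected them."""
    mrna, protein = set(mrna_hits), set(protein_hits)
    return {
        "both": sorted(mrna & protein),
        "mrna_only": sorted(mrna - protein),
        "protein_only": sorted(protein - mrna),
    }


# Placeholder accessions standing in for mapped microarray and proteomic results.
OVERLAP = compare_platforms(
    ["PROT_A", "PROT_B", "PROT_C"],
    ["PROT_B", "PROT_D"],
)
```

Entries in the `both` group are the candidates supported at the mRNA and protein levels; because of one-to-many mappings, it may be worth checking these by hand before drawing conclusions.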
As discussed above, the experimental groups in which the genes or proteins are identified, as well as additional experimental information are annotated for all proteins with proper notations. The annotated data groups are used for direct comparative analysis between selected groups of interest, such as cell types, treatment types and time course, as well as Omics data types. The metadata annotation can also be used for limiting functional profiling to proteins in selected groups using the iProXpress interface (see below).
After protein mapping, rich annotations are described for given Omics data sets in the so-called protein information matrix (Table 2) that captures salient features of proteins, such as functions, pathways, and protein–protein interactions, derived from comprehensive protein annotations integrated into the UniProt and iProClass databases. The matrix allows for browsing and search of rich protein information through the iProXpress Web interface, which facilitates detailed examination of the Omics data (Fig. 2). Among protein annotations, GO terms, including molecular function, biological process and cellular component, and pathways, such as KEGG, are most commonly used for functional profiling.
Gene Ontology profiling is primarily based on GO slim, a cut-down version of GO terms at high levels of the GO hierarchy (http://www.geneontology.org/GO.slims). GO slims are usually derived from terms at the second and third levels of the GO hierarchy, though they vary from source to source in the selection of additional terms from deeper levels. GO profiling provides a general view of the biology underlying the Omics data and can suggest significant functional categories of genes or proteins that can be further investigated. For example, 26 genes found to be upregulated in ionizing radiation-treated ATM+ cells were identified from gene expression microarray data and profiled using GO biological process (Fig. 3). The profile shows high representation of proteins in GO categories, such as “cell communication,” “response to stimulus,” and “cell proliferation,” in which several proteins are known to be involved in radiation-induced responses, e.g., BRCA1, p53, HDAC1, and STAT3. Because GO slims are high-level terms, genes/proteins profiled under given GO categories often overlap to varying degrees, e.g., the above-mentioned proteins are common to three or more of the top five GO categories (Fig. 3). However, some terms, such as “regulation of biological process” or “biological regulation,” are too broad to reveal meaningful biological information (see Note 7).
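The counting behind such a GO slim profile, and the overlap between categories, can be sketched in a few lines of Python. The sketch assumes GO slim categories have already been assigned to each protein; the protein labels and their category assignments below are invented for illustration and are not the data behind Fig. 3.

```python
from collections import Counter


def go_slim_profile(annotations):
    """Count proteins per GO slim category, most represented first."""
    counts = Counter()
    for categories in annotations.values():
        counts.update(set(categories))  # count each protein once per category
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def shared_across(annotations, min_categories=3):
    """List proteins annotated with at least `min_categories` categories."""
    return sorted(
        protein
        for protein, categories in annotations.items()
        if len(set(categories)) >= min_categories
    )


# Invented category assignments for placeholder proteins (assumption).
ANNOTATIONS = {
    "P1": ["cell communication", "response to stimulus", "cell proliferation"],
    "P2": ["response to stimulus", "cell proliferation"],
    "P3": ["response to stimulus"],
    "P4": [],  # no GO slim annotation
}
PROFILE = go_slim_profile(ANNOTATIONS)
MULTI = shared_across(ANNOTATIONS)
```

Because a protein is counted once in every category it belongs to, the category counts can sum to more than the number of proteins, which is the overlap described above.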
Due to the overall low coverage of pathway annotations for a given organism, relatively large numbers of proteins are usually missed in pathway profiling for any Omics data set. Nonetheless, pathway profiling can provide significant insights into the underlying biology, particularly when used for cross-data set comparative profiling. For example, in our previous study comparing nine organelle proteomes, including mitochondria, endoplasmic reticulum (ER), and seven other lysosome-related organelles (56), the pathway profiles based on KEGG pathways show that “oxidative phosphorylation pathway” is prevalent in mitochondria while “N-glycan biosynthesis pathway” is prevalent in the ER (Fig. 4), which is consistent with the well-established functions of the two organelles. Pathway profiling also led to the identification of “purine metabolism pathway,” which showed notable differences between radiation-treated and untreated ATM− and ATM+ cells (Fig. 5a) (54).
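A cross-data set pathway profile, together with the annotation coverage that limits it, can be computed as in the Python sketch below. It assumes each data set is a list of protein accessions and that pathway annotations are available as a dictionary; the data set names, proteins, pathways, and function name are placeholders, and the fractions are simple proportions rather than a statistical enrichment test.

```python
def pathway_profile(datasets, pathway_annotations):
    """Return per-data-set pathway fractions and pathway annotation coverage."""
    profile, coverage = {}, {}
    for name, proteins in datasets.items():
        unique = set(proteins)
        counts = {}
        annotated = 0
        for protein in unique:
            pathways = set(pathway_annotations.get(protein, []))
            if pathways:
                annotated += 1
            for pathway in pathways:
                counts[pathway] = counts.get(pathway, 0) + 1
        total = len(unique)
        # Fraction of the data set's proteins annotated to each pathway.
        profile[name] = {p: c / total for p, c in counts.items()} if total else {}
        # Fraction of proteins with at least one pathway annotation.
        coverage[name] = annotated / total if total else 0.0
    return profile, coverage


# Placeholder data sets and pathway annotations (assumption).
DATASETS = {
    "organelle_1": ["P1", "P2", "P3", "P4"],
    "organelle_2": ["P3", "P5"],
}
PATHWAYS = {
    "P1": ["pathway_X"],
    "P2": ["pathway_X"],
    "P3": ["pathway_Y"],
    "P5": ["pathway_Y"],
}
PROFILE, COVERAGE = pathway_profile(DATASETS, PATHWAYS)
```

Reporting coverage alongside the profile makes it visible how many proteins in each data set fall outside any annotated pathway and therefore do not contribute to the comparison.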
One key step in functional Omics data analysis is pathway mapping, a process that maps genes/proteins detected by Omics experiments to corresponding proteins annotated in canonical pathways. Various software tools are available for pathway mapping, including iProXpress, DAVID, and commercial tools, such as IPA (http://www.ingenuity.com) and MetaCore (http://www.genego.com). Visualization of the mapped pathways greatly facilitates the comparative analysis and understanding of the underlying differences across experimental groups, thus being critical for identifying potential molecular targets. Visualization of mapped pathways is provided as part of several software systems, e.g., mapped proteins in canonical pathways are highlighted by a distinct color (for one experimental condition as in IPA) or labeled with experimental conditions under which they were detected (as in MetaCore). Recently, KEGG developed a standalone tool, KegArray, for mapping gene expression profiles to pathways and genomes (57).
Different pathway tools should be used to maximize the identification of potential pathway-based targets because pathways annotated in different databases vary in their contents and boundaries (see Note 8). We used iProXpress, KEGG, IPA, and MetaCore pathway tools for mapping and/or visualization of metabolic and signaling pathways in several proteomic and functional genomic studies, including those on organelle biogenesis (58), radiation-induced DNA damage repair (54), and estrogen-induced apoptosis in breast cancer cells (59). Pathway mapping could lead to the identification of specific steps in which the proteins participate and the roles they may play.
For genes or proteins of interest that were derived from the Omics data based on differential expression and/or functional profiling, but do not have annotated pathway information, literature mining is used to uncover their potential associations with or pathways for the underlying phenotypes. Various text mining tools are available to assist literature mining (see Note 2).
We use the functional analysis of Omics data generated from radiation-treated ATM− and ATM+ cells (54) as an example to illustrate the Omics workflow described above. ATM, a serine–threonine protein kinase, plays critical roles in stress-induced responses, such as DNA damage repair and cell cycle regulation. Using human fibroblast cell lines expressing mutated ATM gene (AT5BIVA cell, ATM−) or wild type ATM (ATCL8 cell, ATM+), the study aims to better understand ATM-mediated pathways in response to ionizing radiation, which could facilitate identification of molecular targets for therapeutic interventions, such as increasing radiation or drug sensitivities of cancers. The two cell lines are subjected to global expression profiling using gene microarray and 2D-gel and MS proteomics. Below are the steps used for the analysis.
In summary, through functional profiling and pathway mapping, this example shows that purine metabolism is significantly represented and differentially changed in the ATM− and ATM+ cells in response to radiation. The increased expression of RRM2 at both mRNA and protein levels, and of p53, BRCA1, HDAC1, and Chk1 at the mRNA level, in ATM+ but not in ATM− cells strongly suggests that RRM2 is a downstream target of the ATM-mediated radiation response pathways and is required for radiation-induced DNA repair. This is supported by a recent report that upregulation of RRM2 transcription in response to DNA damage in humans involves the ATR/ATM-Chk1-E2F1 pathway (60). RRM2 is also known to play roles in cell proliferation, tumorigenicity, metastasis, and drug resistance (61). Increased expression of RRM2 has been linked to increased drug resistance, and decreased expression is linked to the reversal of drug resistance in cancer cells (61, 62). RRM2 is therefore a potential therapeutic target for cancers, e.g., targeting RRM2 to sensitize cancer cells to drug effects by enhancing camptothecin (CPT)-induced DNA damage in breast cancer cells (60).
Omics-based molecular target and biomarker identifications remain challenging, and many limitations exist, e.g., see review in ref. 63.
The work has been supported in part by Federal funds from the National Cancer Institute (NCI), National Institutes of Health (NIH), under Contract No. HHSN261200800001E (Z.Z.H.), by NCI grant P01CA074175 (A.D.), by NIH grant U01-HG02712 (C.W.), and by the Department of Defense Breast Cancer Research Program W81XWH-06-10590 Center of Excellence Grant (A.W., A.T.R.). The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.