Over the past decade, the development of genomic technology has revolutionized modern biological research and drug discovery. Functional genomic analyses enable biologists to analyze genetic events on a global scale, and they have been widely used in gene discovery, biomarker determination, disease classification, and drug target identification. In this article, we provide an overview of the current and emerging tools involved in genomic studies, including expression arrays, microRNA arrays, array CGH, ChIP-on-chip, methylation arrays, mutation analysis, genome-wide association studies, proteomic analysis, integrated functional genomic analysis and related bioinformatic and biostatistical analyses. Using human liver cancer as an example, we illustrate how these genomic approaches can be applied in cancer research.
Genomic analyses include a variety of tools that address the global changes of specific biological parameters. Genomic analyses that examine DNA, RNA, or protein levels provide powerful tools to characterize gene function and regulation and to facilitate disease classification, biomarker identification, risk factor stratification and drug discovery (Fig. 1).
Genome-wide expression studies empowered by microarray analysis enable the systematic analysis of complex biological systems. Expression microarrays were developed in the 1990s and represent the beginning of the genomic era. Microarray expression profiling is based on the highly specific hybridization of a single-stranded nucleic acid fragment to its complementary strand. The target cDNA is first labeled with fluorescent dyes and then hybridized to the array surface. After stringent washes, the fluorescent signals of the retained labeled target are captured and quantified, followed by data analysis. Two platforms are commonly in use: cDNA microarrays and oligonucleotide microarrays. The cDNA microarrays contain a collection of probes generated by PCR amplification from cDNA libraries, expressed sequence tag clones or long cloned genomic fragments. Oligonucleotide microarrays commonly contain short oligonucleotides (25–30 nt) or long oligonucleotides (50–80 nt). In both platforms, the short DNA segments are printed onto solid supports, usually microscope slides, by direct contact (mechanical robotic spotting) or non-contact (ink-jet) methods. The GeneChip oligonucleotide microarrays manufactured by Affymetrix employ a different technology, photolithography, to synthesize the short oligonucleotides in situ on the chip. Each platform has its own advantages and limitations, but all have been shown to capture gene expression signatures reliably on a genomic scale.
Global expression profiles enable a better understanding of the molecular signatures of human diseases, including liver cancers [2,3] (Fig. 2). For instance, we and others have reported genome-wide expression profiles of liver cancer and their clinico-pathological implications [4–7]. We have observed specific gene expression patterns that distinguish tumor from non-tumor tissue, that are associated with p53 status, and that delineate clonality in multi-nodular tumors. Other reports describe gene expression profiles associated with tumor metastasis and patient outcome [5,7]. Differentially expressed genes have demonstrated the potential to serve as prognostic biomarkers and therapeutic targets.
A growing number of studies have shown that small non-coding RNAs regulate gene expression in animals and plants [11,12]. MicroRNAs (miRNAs) are an abundant class of small non-protein-coding RNAs, 19 to 25 nucleotides in length, that have been implicated in many human diseases, including in tumor initiation and progression. miRNAs function as negative gene regulators, each of which can control hundreds of gene targets, and may act as either tumor suppressors or oncogenes. Differential expression of miRNAs has revealed diagnostic, prognostic and therapeutic implications. There are several approaches for genome-scale miRNA expression profiling [15,16]. One of the most commonly used platforms is the miRNA oligo array. The overall experimental design of these miRNA arrays is very similar to that of regular expression arrays. Other miRNA profiling methods include multiplexed q-RT-PCR assays and bead-based methods [15,16].
Chromosomal imbalances, including deletions and amplifications, are common in human tumors. Comparative genomic hybridization (CGH) has been widely used for the global analysis of DNA copy number since it was first reported in the early 1990s. The resolution of conventional CGH, which relies on metaphase chromosomes, is limited to approximately 10 megabases, a span that could contain hundreds of genes. Microarray-based CGH, which combines microarray technology with the CGH approach, has since been developed [18,19]. Defined DNA fragments (BACs, cDNAs or oligonucleotides) replace the metaphase chromosomes, resulting in higher resolution. Microarray-based CGH allows precise mapping of regions of genetic aberration [20,21], including in human liver cancers.
DNA methylation is one of the most important covalent modifications of genomic DNA in eukaryotic cells. Human tumor samples frequently show abnormal patterns of DNA methylation, which may contribute to aberrant patterns of gene expression and regulation. Traditionally, methylation could only be determined on a gene-by-gene basis, using methods that include bisulfite conversion followed by PCR or sequencing, methylation-sensitive restriction enzymes or affinity purification. Recent technology development has enabled the analysis of DNA methylation on a genome-wide scale [23,24]. Genome-wide methylation analyses can be divided into two categories: array-based and non-array-based. Several companies provide commercially available arrays for methylation analysis. The arrays are designed either to analyze bisulfite-converted DNA (for example, bead arrays from Illumina) or to use restriction enzyme-based methylation analysis (for example, oligo arrays from NimbleGen or Agilent). Array hybridization and analysis are similar to those described for expression or CGH arrays. Non-array-based experimental designs include Restriction Landmark Genomic Scanning (RLGS), methylation-specific digital karyotyping (MSDK), and high-throughput sequencing after bisulfite conversion [23,24].
“ChIP-chip” combines genome-scale chromatin immunoprecipitation, using antibodies specific to a regulatory factor (in most cases a transcription factor), with microarrays spotted with intergenic sequences to identify the factor's bound targets [25,26]. Protein-bound DNA fragments retrieved by the antibody are hybridized to the microarrays to identify the retrieved sequences. ChIP-chip analysis is now widely used as a reliable, high-throughput tool to identify the targets of critical transcriptional regulators.
DNA sequence variation can influence disease risk and response to drug therapy by altering gene expression, RNA processing, or the amino acid sequence of proteins. Genome-wide association studies use DNA microarrays to investigate the effect of millions of common DNA sequence variants in the human genome, the most common type of which is the single nucleotide polymorphism (SNP). The genome-wide association approach is statistically powerful and has led to the discovery of many new genetic variants that underlie variation in human traits (www.genome.gov/26525384), including a number of cancers. A number of these disease variants influence the risk of multiple diseases, including shared risk variants for several common cancers. Thus, genome-wide studies may have important implications for drug development by helping to identify novel therapeutic targets and genetic biomarkers for drug discovery.
Genome-wide association studies have become possible due to several recent technological advances. Improvements in DNA microarray technology have rapidly reduced the cost of genotyping SNPs, allowing up to one million SNPs to be tested on a single microarray. At the same time, the HapMap Project validated nearly four million SNPs in multiple diverse populations and determined the extent of linkage disequilibrium (LD) between SNPs. LD refers to the non-random association of alleles at different SNPs, typically those that are closest together. The presence of LD allows SNPs on genotyping platforms to serve as proxies for other nearby SNPs. As a result, current DNA microarrays can capture, directly or by proxy, most common SNPs in the HapMap. In this way, LD reduces both the genotyping costs of genome-wide association studies and the multiple testing burden (see the Biostatistical Analysis section below).
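As a rough illustration of how LD between two SNPs can be quantified, the following Python sketch computes two widely used pairwise measures, the coefficient D and the squared correlation r², from phased haplotypes. The 0/1 allele coding, the function name `ld_r_squared` and the toy haplotypes are assumptions made for this example; they are not taken from the HapMap Project.

```python
def ld_r_squared(haplotypes):
    """Return (D, r_squared) for two biallelic SNPs.

    `haplotypes` is a list of 2-tuples, one per phased chromosome, with
    alleles coded 0 or 1 at SNP A and SNP B (an assumed coding).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n                # freq. of allele 1 at SNP A
    p_b = sum(h[1] for h in haplotypes) / n                # freq. of allele 1 at SNP B
    p_ab = sum(1 for h in haplotypes if h == (1, 1)) / n   # freq. of the 1-1 haplotype
    d = p_ab - p_a * p_b                                   # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d, d * d / denom


# Hypothetical toy data: two SNPs that always travel together,
# and two SNPs whose alleles are combined at random.
perfect_ld = [(1, 1)] * 5 + [(0, 0)] * 5
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]

print("perfect LD:", ld_r_squared(perfect_ld))
print("no LD:", ld_r_squared(no_ld))
```

An r² close to 1 indicates that one SNP is a good proxy for the other, so genotyping only one of the pair loses little information.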
It is important to note that genome-wide association studies are better suited to detecting disease associations with common variants, typically defined as those with a minor allele frequency greater than 5%, than with rare variants. Since strongly deleterious alleles are likely to face selection pressure, variants with large effects tend to be rare, whereas common variants tend to have more modest effects on gene function. Most variants associated with disease in genome-wide association studies are common and have modest or small effects. Because of this, individual variants will not serve as strong predictors of disease risk. They may, however, explain a large amount of the risk of disease in the population, a quantity known as the population attributable fraction. For this reason, interventions developed to counteract these risk variants could substantially reduce the incidence of disease in a population.
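The contrast between individual risk and population impact can be made concrete with Levin's formula for the population attributable fraction, PAF = p(RR − 1) / (1 + p(RR − 1)), where p is the proportion of the population carrying the risk variant and RR is the relative risk in carriers. The choice of this formula and the numbers below are illustrative assumptions, not values from the studies cited above.

```python
def population_attributable_fraction(carrier_frequency, relative_risk):
    """Levin's formula: fraction of cases attributable to a risk factor.

    carrier_frequency: proportion of the population exposed (0 to 1).
    relative_risk: disease risk in carriers divided by risk in non-carriers.
    """
    if not 0.0 <= carrier_frequency <= 1.0:
        raise ValueError("carrier_frequency must lie between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative_risk must be positive")
    excess = carrier_frequency * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical variants: a common variant with a modest effect and a
# rare variant with a large effect.
common_modest = population_attributable_fraction(0.30, 1.2)
rare_strong = population_attributable_fraction(0.001, 5.0)

print(f"common, modest effect: PAF = {common_modest:.4f}")
print(f"rare, strong effect:   PAF = {rare_strong:.4f}")
```

Under these assumed numbers the common variant accounts for a larger share of cases in the population, even though it is a much weaker predictor of risk for any one carrier.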
Human cancer is widely considered to be induced by somatic alterations within the cancer genome, leading to mutations of oncogenes or tumor suppressors. Sequencing of tumor cell genomic DNA, also known as “deep sequencing”, has been applied to identify “driver mutations”, an effort that will clearly have an important impact on our understanding of carcinogenesis. The development of next-generation sequencing technologies has enabled genome-wide mutation analysis of cancer cells [36–38]. In most cases, these deep sequencing analyses involve amplifying the individual exons of most known genes, followed by large-scale sequencing to examine the amplicons for possible mutations [36,38]. For instance, recent studies have discovered somatic mutations that affect key signaling pathways in acute myeloid leukemia and lung cancers [39,40].
Expression profiling (mRNA-based, for expression level changes) and genomic profiling (DNA-based, for copy number or sequence variation) cannot provide a complete picture of the heterogeneity of complex diseased tissue. Changes at the mRNA or DNA level do not always correspond to changes at the protein level, nor do they reflect post-translational modifications, e.g. phosphorylation, which are critical in regulating protein activity. Indeed, a number of targeted therapeutic agents are designed to inhibit the activity of a protein, e.g. a tyrosine kinase. Therefore, protein profiling is essential in providing the protein molecular signatures for bioassay and therapeutic development. Two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) is commonly employed to study protein profiles [41,42]. Proteins are separated according to their size (molecular mass) and charge (isoelectric point), and their abundance is then determined. The difficulty of identifying the protein spots remains, however, the major obstacle to clinical application. Since the late 1980s, matrix-assisted laser desorption/ionization (MALDI) mass spectrometry (MS) has been advanced to allow rapid measurement of the molecular weights of different proteins with a time-of-flight (TOF) mass analyzer. The mass resolution and accuracy of MALDI-TOF MS are, however, too limited to identify peptides with high confidence. Alternative approaches have been used to provide a more reliable determination of peptide sequence, including collision-induced dissociation (CID) with tandem MS, electron capture dissociation (ECD), infrared multiphoton dissociation (IRMPD), and electron transfer dissociation (ETD). Furthermore, protein arrays, also known as antibody arrays, are an emerging technology that provides parallel analysis of multiple proteins.
In addition, protein arrays can be applied to profile specific protein post-translational modifications, such as phosphorylation or neddylation, and to measure enzyme activities and protein cell-surface expression. Proteomic approaches have been widely used for biomarker discovery in human tumors. Protein profiling of blood samples has been a focus in recent years because it allows repeated measurements (especially important in monitoring treatment response) without the need to obtain tumor tissue. Blood samples, prepared as serum or plasma fractions, have been used for biomarker discovery [46–48].
Most genomic experiments produce large amounts of data. For example, a gene expression microarray study of 22,000 genes × 100 samples will generate 2.2 million data points. In addition, data from genomic experiments are often noisy and not normally distributed, and the expression matrix usually contains missing values. Robust biostatistical analyses are required to obtain biologically relevant interpretations of the genomic data [49,50].
Specific statistical tools need to be applied to specific genomic studies. Investigators therefore have to choose the biostatistical software best suited to their experiments and to the questions they are trying to address. In general, statistical analyses of genomic data can be divided into two major categories: supervised and unsupervised methods [49,50]. Supervised approaches try to identify the genetic events that fit a predetermined pattern. For example, supervised analysis is used to identify genes that are differentially expressed between groups of samples, as well as to find genes that can be used to accurately predict the characteristics of groups. In contrast, unsupervised approaches characterize genomic data without prior input or knowledge of a predetermined pattern. Unsupervised analysis is used to identify internal structure in the genomic data set. The most commonly used unsupervised analysis tools are hierarchical clustering and principal component analysis (PCA).
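The difference between the two categories can be sketched on a toy expression matrix. In the Python example below, the supervised step uses known tumor/normal labels to score each gene with Welch's t statistic, while the unsupervised step ignores the labels and groups samples by average-linkage hierarchical clustering. The sample names, gene names, expression values and the choice of these two particular methods are assumptions made for illustration only.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups (each needs non-zero variance)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def hierarchical_clusters(profiles, k):
    """Average-linkage agglomerative clustering of samples into k clusters.

    `profiles` maps a sample name to its list of expression values.
    """
    clusters = [[name] for name in profiles]

    def dist(c1, c2):
        # Mean Euclidean distance over all cross-cluster sample pairs.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(math.dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > k:
        # Find and merge the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical log-scale expression values: rows are samples, columns are genes.
gene_names = ["gene_1", "gene_2", "gene_3"]
expression = {
    "T1": [8.0, 2.0, 5.0], "T2": [7.5, 2.5, 5.1], "T3": [8.2, 1.8, 4.9],
    "N1": [3.0, 6.0, 5.0], "N2": [2.8, 6.3, 5.2], "N3": [3.1, 5.9, 4.8],
}
tumor, normal = ["T1", "T2", "T3"], ["N1", "N2", "N3"]

# Supervised: the labels are used to score each gene.
t_stats = {
    gene: welch_t([expression[s][g] for s in tumor], [expression[s][g] for s in normal])
    for g, gene in enumerate(gene_names)
}
# Unsupervised: the labels are ignored and structure is inferred from the data.
clusters = hierarchical_clusters(expression, 2)

print(t_stats)
print(clusters)
```

In practice such analyses are run with dedicated statistical packages on thousands of genes; the sketch only shows where the class labels enter the analysis and where they do not.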
Because genomic studies examine thousands or millions of data points, stringent significance criteria are applied to the association results. One method is to apply a Bonferroni correction, dividing the significance threshold by the number of tests being conducted. For example, when correcting for a million tested common SNPs in a genome-wide association study, an association would need a p-value below 5 × 10⁻⁸ (0.05 divided by 1,000,000) to be considered “genome-wide significant.” It is for this reason that genome-wide association studies typically involve thousands of subjects to achieve sufficient statistical power. Other multiple testing adjustments that are less conservative than a Bonferroni correction, such as permutation-derived p-values and false discovery rates, are often employed to maintain statistical power and to clarify the strength of a reported finding in light of the genomic scale of the experiment [51,52].
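The arithmetic above, and the sense in which a false discovery rate adjustment is less conservative, can be sketched in a few lines of Python. The example implements the Bonferroni threshold and the Benjamini-Hochberg step-up procedure, one common way of controlling the false discovery rate; the function names and the list of p-values are hypothetical and chosen only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its stepped threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Threshold for one million tested SNPs at an overall alpha of 0.05.
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bonferroni_hits = [
    i for i, p in enumerate(p_values)
    if p <= bonferroni_threshold(0.05, len(p_values))
]
bh_hits = benjamini_hochberg(p_values, fdr=0.05)

print("genome-wide threshold:", gwas_threshold)
print("Bonferroni hits:", bonferroni_hits)
print("Benjamini-Hochberg hits:", bh_hits)
```

On this toy list the Benjamini-Hochberg procedure retains one more test than the Bonferroni correction, at the cost of tolerating a controlled proportion of false positives among the reported findings.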
Despite the advantages of biostatistical analysis, there is no standard, one-size-fits-all statistical approach, nor even a single way to choose the p-value threshold that balances type I and type II errors. Each biologist has to approach this question based on his or her own biological questions in each specific setting.
Genomic studies generally generate large amounts of data. Even after statistical analysis, one may identify a large number of de-regulated genes, for example, genes that are methylated or mutated in tumor samples. Bioinformatics analysis tools have been developed to help scientists extract meaningful information and interpret genomic data in a functional manner.
One of the most commonly used methods for annotating gene function is Gene Ontology (GO, http://www.geneontology.org/) [53,54]. GO classifies gene function according to three organizing principles: molecular function, biological process and cellular component. When certain GO terms are statistically enriched in a cluster of genes, this may suggest a possible functional significance of the cluster.
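One common way to test whether a GO term is statistically enriched in a cluster is a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of annotated genes in a cluster drawn at random from the genome. The Python sketch below assumes this test; the function name and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_annotated, n_cluster, n_overlap):
    """One-sided hypergeometric p-value for GO term enrichment.

    n_genome:    genes on the array (the background)
    n_annotated: background genes carrying the GO term
    n_cluster:   genes in the cluster of interest
    n_overlap:   cluster genes carrying the GO term
    Returns P(X >= n_overlap) for a randomly drawn cluster of the same size.
    """
    total = comb(n_genome, n_cluster)
    upper = min(n_annotated, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_cluster - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 200 of 20,000 genes carry a term, and 10 of the
# 100 genes in a cluster carry it (1 would be expected by chance).
p = enrichment_p_value(20_000, 200, 100, 10)
print(f"enrichment p-value: {p:.3g}")
```

Because many GO terms are usually tested for each cluster, the resulting p-values should themselves be adjusted for multiple testing.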
Another commonly used bioinformatic analysis tool is Gene Module Analysis [55,56]. Just as GO term analysis builds on pre-existing knowledge for the interpretation of microarray data, gene module analysis interrogates the global gene expression profile with respect to known sets of genes. In brief, gene module analysis asks whether the genes whose expression levels change in an experiment are similar to those observed to change in another setting. The gene modules may be defined by function (e.g. GO terms or other annotations), the presence of specific cis- or trans-regulatory motifs for transcription factor or miRNA binding, or known responsiveness to specific signaling pathways or drugs. Gene Set Enrichment Analysis (GSEA) is the most popular publicly available modular analysis method (http://www.broad.mit.edu/gsea/). Ingenuity Pathway Analysis software (http://www.ingenuity.com/) is a popular commercially available modular analysis tool.
Integrated functional genomics is an approach that combines results from multiple genomic analyses, or from genomic analysis and functional experiments, to identify important genetic signals underlying biological processes. Integrated functional genomics is critical for cancer research. Genome-wide studies are likely to produce large numbers of genes with expression alterations, mutations, abnormal methylation or DNA copy number variations in cancer cells. Only a small number of these genetic alterations, however, have functional roles during tumorigenesis. The majority of the genes are likely to be passenger genes that are either the products of genomic instability or secondary changes during carcinogenesis, and make no direct contribution to malignant transformation. By combining multiple genomic analyses, one can significantly narrow down the list of genes that may contain functionally significant oncogenes or tumor suppressor genes. For example, DNA copy number gains and losses contribute to cancer development by increasing and decreasing the expression of oncogenes and tumor suppressor genes, respectively. In our recent studies, we combined expression array and array CGH studies of human hepatocellular carcinoma (HCC) samples and were able to identify a relatively small set of genes as candidate oncogenes or tumor suppressors whose expression levels are associated with DNA copy number changes. In another example, ChIP-chip experiments can be integrated with gene expression analysis to delineate how specific transcription factors regulate global gene expression. In a recent study, Acevedo et al. performed ChIP-chip experiments to assay binding of RNA polymerase II, H3me3K27 and H3me3K9, together with DNA methylation, in 25,000 promoter regions in normal liver and liver tumor samples.
The experiments successfully identified changes in active and silenced regions of the genome in liver tumor cells and, in so doing, identified novel molecular mechanisms that mediate tumor-specific changes in gene expression in the liver. In addition, by combining genomic analysis with functional screening, such as siRNA-mediated gene silencing, we can rapidly identify potential driver genetic events. For example, in a recent study, Zender et al. identified small regions of recurrent deletion in human liver cancer by genomic analyses. Using microRNA-based short-hairpin RNA libraries targeting genes within these deleted regions, the group conducted in vivo RNAi screening to identify genes that, when silenced, cooperate with Myc to promote liver cancer development. The study successfully identified and validated 13 tumor suppressor genes for liver cancer.
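One simple way to combine expression arrays with array CGH, in the spirit of the integration described above, is to keep only those genes whose expression tracks their DNA copy number across samples. The Python sketch below does this with a per-gene Pearson correlation and a fixed cutoff. It is an illustrative assumption rather than the method used in the studies cited here, and the gene names, values and cutoff are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def concordant_genes(copy_number, expression, min_r=0.8):
    """Genes whose expression correlates with copy number across samples."""
    return sorted(
        gene for gene in copy_number
        if gene in expression
        and pearson(copy_number[gene], expression[gene]) >= min_r
    )


# Hypothetical log2 copy-number ratios and expression values for the same
# four tumor samples, listed in the same order.
copy_number = {
    "GENE_A": [0.0, 0.5, 1.0, 1.5],
    "GENE_B": [0.0, 0.5, 1.0, 1.5],
}
expression = {
    "GENE_A": [1.0, 2.1, 2.9, 4.0],   # rises with copy number
    "GENE_B": [3.0, 1.0, 3.1, 0.9],   # unrelated to copy number
}

candidates = concordant_genes(copy_number, expression)
print(candidates)
```

Genes that pass such a filter are candidates only; functional experiments such as the RNAi screens described above are still needed to separate drivers from passengers.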
In summary, as these tools for genomic analysis become widely used in biomedical research over the next decade, one can foresee the emergence of large amounts of data addressing diverse biological questions. Functional genomic analyses will likely have multiple implications for drug discovery and development. For example, integrated functional genomic studies will likely identify driver mutations or genes on which tumor cells depend for growth and metastasis. These genes can be used as targets for drug development, leading to drugs directed at specific genetic events that are likely to be more effective and less toxic for cancer treatment. Genomic analyses will also identify genetic signatures, such as gene expression profiles or specific mutation status, that can be used to predict drug responsiveness. Such biomarkers should increase the power and efficiency of clinical trials by selecting the appropriate patient populations and may lead to successful clinical drug development. Altogether, the application of genomic analysis should make drug discovery and development more efficient.
This work is supported by NIH K01CA096774 and R21CA131625 to X.C., as well as RGC and NSFC (HKU 7560/06M and N_HKU 709/07) to S.T.C.