The most common form of genetic variation, single nucleotide polymorphisms or SNPs, can affect the way an individual responds to the environment and modify disease risk. Although most of the millions of SNPs have little or no effect on gene regulation and protein activity, there are many circumstances where base changes can have deleterious effects. Non-synonymous SNPs that result in amino acid changes in proteins have been studied because of their obvious impact on protein activity. It is well known that SNPs within regulatory regions of the genome can result in dysregulation of gene transcription. However, the impact of SNPs located in putative regulatory regions, or rSNPs, is harder to predict for two primary reasons. First, the mechanistic roles of non-coding genomic sequence remain poorly defined. Second, experimental validation of the functional consequences of rSNPs is often slow and laborious. In this review, we summarize traditional and novel methodologies for candidate rSNP selection, in particular in silico techniques. Additionally, we discuss molecular biological techniques that assess the impact of rSNPs on binding of regulatory machinery, as well as functional consequences on transcription. Standard techniques such as EMSA and luciferase reporter constructs are still widely used to assess effects of rSNPs on binding and gene transcription; however, these protocols are often bottlenecks in the discovery process. Therefore, we highlight novel and developing high-throughput protocols that promise to shorten the process of rSNP validation. Given the large amount of genomic information generated from a multitude of re-sequencing and genome-wide SNP array efforts, future focus should be on developing validation techniques that will allow greater understanding of the impact these polymorphisms have on human health and disease.
An individual’s physiologic response to environmental stimuli can be modulated by genetic variation. Millions of genetic alterations (termed polymorphisms) occur at a frequency of >1% in the human population, and while most have little impact on human health, there is abundant recent evidence that certain variants cause a myriad of phenotypic effects and, given a specific context, can strongly impact disease susceptibility. Common polymorphisms include tandem repeated segments (minisatellite, 0.1–20 kb; microsatellite, 2–100 nucleotides), large (copy number variants) and small segmental deletions/insertions/duplications, and single nucleotide polymorphisms (SNPs). SNPs are exceedingly common polymorphisms, accounting for approximately 90% of all known sequence variation, and are estimated to occur every 100–300 base pairs. The most recent build of NCBI’s SNP database (dbSNP 127) lists over 11 million identified SNPs in the human population, with over 5 million validated by multiple investigators. By calculation, given evenly spaced SNP incidence throughout the human genome, there are approximately 165,000 SNPs within the 20,000–25,000 estimated genes whose coding regions cover approximately 1.5% of the human genome [3,4].
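The estimate of roughly 165,000 coding-region SNPs can be reproduced with a short calculation. The sketch below is only illustrative: the genome size of about 3.3 Gb and the spacing of one SNP per 300 bp are assumptions chosen to match the figures quoted above, not values stated explicitly in the text, and the function name is hypothetical.

```python
# Back-of-the-envelope estimate of SNPs falling within coding sequence.
# ASSUMPTIONS: ~3.3 Gb genome and one SNP every 300 bp (the sparse end
# of the 100-300 bp range); both are illustrative inputs.
GENOME_BP = 3.3e9          # assumed haploid genome size in base pairs
CODING_FRACTION = 0.015    # coding regions cover ~1.5% of the genome
BP_PER_SNP = 300           # assumed even SNP spacing


def coding_snp_estimate(genome_bp=GENOME_BP,
                        coding_fraction=CODING_FRACTION,
                        bp_per_snp=BP_PER_SNP):
    """Return the expected number of SNPs in coding sequence."""
    coding_bp = genome_bp * coding_fraction   # total coding base pairs
    return coding_bp / bp_per_snp             # evenly spaced SNPs


print(round(coding_snp_estimate()))
```

Using the dense end of the range (one SNP per 100 bp) would triple the estimate, so the result is sensitive to the assumed spacing.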
SNPs that affect gene expression occur in all regions of the genome. SNPs located within the coding region of genes have been extensively studied, including those that cause amino acid codon alterations (non-synonymous variants), which can lead to protein misfolding, polarity shift, improper phosphorylation and other functional consequences. Less predictable are variants located within non-coding regions of the genome. While mostly regarded as non-functional, this type of alteration can impact gene regulatory sequences such as promoters, enhancers, and silencers. Termed regulatory SNPs (rSNPs), these variations have become more prevalent in recent literature [6–9]. Transcription factor (TF) binding sites are attractive regions to search for functional rSNPs. A SNP in a TF binding site can have multiple consequences. In most cases, a SNP does not change the interaction between the TF and its binding site, nor does it alter expression, since a TF will usually recognize a considerable number of binding sites. In some cases, a SNP may increase or decrease the binding, leading to allele-specific gene expression. In rare cases, a SNP may eliminate the natural binding site or generate a novel binding site. Consequently, the gene can no longer be regulated by the original TF. Thus, functional rSNPs in TF binding sites may predictably lead to differences in gene expression and phenotypes, and ultimately affect susceptibility to environmental exposure (Fig. 1). Indeed, there are numerous examples of rSNPs associated with disease susceptibility, including hypercholesterolemia, hyperbilirubinemia [11,12], myocardial infarction, acute lung injury, and asthma.
Identification and experimental verification of functional rSNPs are key limiting steps in an efficient functional polymorphism discovery process. Here, we review traditional and novel prioritization methods to generate candidate genes that have polymorphic regulatory regions using sequence analysis, expression and genotype information. Further, we will describe standard and developing molecular biology techniques that quantify the impact of polymorphisms on DNA binding activity and gene regulation (see Fig. 2 for discovery pipeline). Functional analysis using wet lab techniques is a slow and laborious process, yet it remains critical to validation of a phenotypic effect and risk characterization. Additionally, we will highlight developing high-throughput methodologies which promise to revolutionize the field of rSNP discovery and functional genomics.
A candidate gene or a gene region becomes of interest in a SNP discovery project through several mechanisms. A knowledge base, such as expertise in DNA repair, may have pointed to a group of genes involved in a key cellular process or likely to be involved in a disease, and these genes may be sequenced (or resequenced) in a population in order to identify variants. Epidemiological studies draw attention to a gene by detecting the association between disease phenotypes and chromosomal regions, individual genes, haplotypes, or individual SNPs, depending on the methodology and resolution of the markers. A review of these genetic and epidemiological methods is beyond the scope of the present work. However, SNPs or markers used in such studies are often not the causative variants; rather, they are simply markers that serve to delimit or discover genomic regions that may contain the functional variants.
In recent years, epidemiological studies have used SNPs identified through comparisons of genomic sequences obtained from contig or shotgun libraries in the Human Genome Project, or through candidate gene-directed resequencing. The vast majority of known SNPs were identified in comparisons of genome sequences from only a few humans (10–20). Resequencing efforts such as the NHLBI Program for Genome Applications and the NIEHS Environmental Genome Project have used larger groups of samples, 24 to 400 or more, with varying degrees of ethnic or racial diversity, but sampling strategies have not been employed to generate a representative population sample. Due to cost considerations, the poorly defined regulatory regions of genes have not always been sequenced. The result is that the catalog of variants in the existing databases is incomplete and uneven, and most variants lack accurate frequency estimates. A significant number of the SNPs in dbSNP are either sequencing errors or occur at extremely low frequencies. Some genomic regions are very poorly characterized, with only a few SNPs known for only some ethnic groups, while other regions are well characterized among worldwide groups (e.g. Human Diversity project).
The International HapMap project has improved the situation dramatically by evaluating a large set of SNPs (~4 million distributed across the genome) in samples of African, Asian and European descent (described in more detail later) . This has allowed the identification of haplotypes and haplotype tagging SNPs (tagSNPs) that are common among human populations and those that are specific to some ethnicities . Despite the remarkable coverage achieved by the HapMap, many genome regions still have poor resolution because the SNPs picked for evaluation in the project were found to be monomorphic or very rare. In addition, populations used to define haplotypes may not be representative of other populations of interest , resulting in failed associations (both false positives and false negatives) in disease studies .
The use of 300,000 to 1 million tagSNPs in whole genome association studies has enabled tests of association between common variation and disease; however, SNPs identified as risk factors in these scans are often in noncoding regions that frequently have no apparent relationship to a gene’s function. Thus there is still a need for sequencing in larger, more diverse populations in order to discover the linked causal variants. The ultimate approach is to test all genetic variation in all individuals of a study cohort. This possibility is near reality, as extensive efforts to lower the cost of sequencing, i.e. the “$1000 genome” challenge, are currently underway [22–24]. In the near term, as attention turns to noncoding regions, there is a need to develop approaches to identify functional regulatory elements and evaluate how sequence variation alters function.
Successful bioinformatics identification of functional rSNPs requires identification of putative regulatory sequences, or motifs, and the co-location of SNPs in these sequences. Prediction of the impact of a SNP on TF binding would also be useful. Within a given gene network, such as the p53 or NRF2 pathways, members of the pathway typically have cis-regulatory sequences within 10 kb (sometimes more) of the transcription start site. Computational methods for the identification of cis-regulatory sequences have been successfully applied to simple organisms such as yeast and worm. While some methods have been plagued by high false positive rates in mammals, primarily because of the very large quantity of intergenic sequence present, many recent bioinformatics algorithms have improved prediction. These include examining evolutionarily conserved regulatory sequences in upstream sequences of orthologous genes across species [7,27–29] and identifying statistically over-represented motifs in the upstream regions of genes that are co-regulated in microarray expression profiles [26,30].
With the availability of fully-assembled human and other mammalian genomes along with large-scale genotyping and gene expression databases, our laboratory developed computational tools to systematically evaluate rSNPs in transcription factor binding sites (TFBSs) in the human genome and to predict their impact on the expression of target genes. This approach integrates distinct computational methods and databases to facilitate position weight matrix (PWM) construction, TFBS prediction, phylogenetic footprinting, microarray expression profile analysis and SNP genotype-gene expression association. This system has been used to identify polymorphic antioxidant response elements .
Briefly, we built a PWM model based on a set of TFBSs that were discovered by investigators using classical experimental methods to examine DNA-protein interaction, including electrophoretic mobility shift assay, DNase I footprinting, and gene promoter reporter constructs. The PWM is a probabilistic model that assesses the nucleotide preference at each position of a TFBS. Assuming that the highest scoring sequences represent the strongest binding sequences, this model provides an approximation of relative DNA-protein interaction strength and has been used to identify binding sites of p53 and estrogen receptor. Using the PWM model, potential TFBSs were identified for each input SNP sequence from dbSNP in a “sliding window” manner, a quantitative score was assigned, and the impact of an rSNP on a TFBS was estimated by calculating the difference in PWM scores for the allele pair. Allelic pairs resulting in a larger score difference are predicted to be more likely to influence the binding of transcription factors. The selected rSNPs were then mapped to nearby genes using chromosomal positions and transcription start site information from NCBI dbSNP and Genome databases. To increase the chance of finding truly functional or bona fide binding sites and to identify potential loss-of-function SNPs, a comparative genomics approach called “phylogenetic footprinting” [34,35] was used. This analysis identifies evolutionary sequence conservation across human, mouse, and rat genomes; presumably, high levels of similarity imply functionality. Phylogenetics is commonly used and accepted for the identification of conserved regulatory elements. However, Horvath et al. demonstrated that many human p53 response elements, particularly those regulating apoptosis and DNA repair genes, while showing strong homology among primate species (chimp, gorilla, rhesus), show very low conservation relative to rodents.
Thus phylogenetic methods must be applied with knowledge of the pathway and used with caution. While recent sequencing of rhesus macaque (Macaca mulatta) enables reconstruction of the ancestral state of the primate genome before the divergence of chimpanzees and humans, the extremely high homology among primate genome alignments limits informative comparisons to regions of recent divergence and positive selection [27,38,39]. The draft quality of nonhuman primate genome assemblies has also challenged the ability of current methods to detect insertions, deletions, and copy-number variations between humans and other primates, although this situation is improving rapidly.
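The PWM scoring step described above (sliding-window scoring of each allele, followed by the allele-pair score difference) can be sketched as follows. This is a minimal illustration, not the pipeline used by the authors: the log-odds formulation, the pseudocount, the uniform background of 0.25, the forward-strand-only scan, the toy binding sites, and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM from equal-length, aligned binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_score(pwm, seq, snp_index):
    """Best score among sliding windows that overlap the SNP position.

    Assumes the sequence is at least as long as the PWM.
    """
    width = len(pwm)
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    return max(score_window(pwm, seq[s:s + width])
               for s in range(first, last + 1))


def allele_score_difference(pwm, flank5, allele_a, allele_b, flank3):
    """PWM score of allele A minus allele B in the same flanking context."""
    idx = len(flank5)
    score_a = best_score(pwm, flank5 + allele_a + flank3, idx)
    score_b = best_score(pwm, flank5 + allele_b + flank3, idx)
    return score_a - score_b


# Hypothetical toy data: four aligned sites and one SNP context.
toy_sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
pwm = build_pwm(toy_sites)
delta = allele_score_difference(pwm, "GGCATGAC", "T", "G", "CAGGCC")
print(f"PWM score difference (T minus G): {delta:.2f}")
```

A larger absolute difference flags an allele pair as more likely to alter binding. A fuller implementation would also scan the reverse strand and might use a background model derived from genomic base composition.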
Identification and annotation of all the DNA regulatory sequences in the human genome is a fundamental challenge in genomics and computational biology. Here we have briefly outlined approaches that examine TF binding motifs and phylogenetics in order to identify variation in sequences most likely to exhibit a phenotypic effect; however, this outline does not represent the breadth of bioinformatics methods that have been used to identify DNA regulatory sequences.
An obstacle in the discovery process (e.g. bioinformatic approaches) is that the function of many candidate genes is unknown. An approach for refining gene candidates prior to assessing SNP effects is to determine if the gene in question affects a model phenotypic endpoint of interest. At the cellular level, many molecular biological techniques that produce loss of function, such as dominant negative mutants, antisense oligonucleotides, and RNA interference, can be utilized to determine a gene’s effect on a phenotypic endpoint. Additionally, the use of transgenic (gene overexpression or knockout) mice or siRNA-induced loss of function allows for in vivo assessment of a gene’s effect on phenotype. Specific reduction or elimination of a gene of interest and subsequent measurement of a quantifiable cellular phenotype can justify further genetic-based studies, but these processes are relatively low-throughput and not amenable to prioritizing large candidate gene lists. However, high-throughput RNA interference screens have been developed for mammalian cell cultures and, despite current high costs and various technical challenges, these large-scale screens hold future promise as effective tools for gene candidate selection.
This review focuses on approaches for experimentally assessing the relationship between genetic variation and phenotype. rSNP candidates, whether identified through linkage, candidate gene, and/or bioinformatic analysis, can be verified through a number of methods. Some methods test molecular properties such as binding affinity of a protein to TFBSs and the effect of a SNP in that region, while others directly examine the influence of a SNP on gene transcription and, ultimately, phenotypic response. Classically, many of these protocols are slow to process and difficult to optimize, making large scale evaluation of candidate rSNPs impractical. Therefore, many recent techniques have addressed the need to assess polymorphic effects in an accelerated, but quantifiable manner. In this section, we review both current and developing techniques which determine rSNP functionality.
Electrophoretic gel-mobility shift assay, or EMSA, allows in vitro detection and visualization of protein-DNA binding . First used to characterize the interaction of E. coli lac repressor to DNA restriction enzyme fragments , EMSA has become a flexible, simple, and relatively sensitive protocol for assessing DNA-protein interaction. Briefly, the assay utilizes a double-stranded oligonucleotide containing a known or putative binding sequence, typically 20–30 nts, which is labeled with either a radioactive or fluorescent marker and added to crude cellular nuclear extract or purified recombinant protein, thereby allowing DNA-protein complexes to form. The DNA probe/protein mixture is loaded onto a polyacrylamide gel and electrophoresed. If proteins are bound to the labeled sequence, the DNA-protein complex will move more slowly through the gel matrix, creating a “shift” relative to unbound oligonucleotide. Typically, a larger complex (i.e., DNA bound protein) results in retarded band migration. Dissociation of the protein from the oligonucleotide is discouraged by the low ionic strength of the gel, as well as the gel matrix itself .
Competitive EMSA distinguishes the binding affinity of a particular protein for a target binding sequence, e.g. sequence variation caused by a SNP or site-directed mutation, by the use of competing non-labeled sequence in the binding reaction. Different types of unlabeled competing oligonucleotide are added to the binding reaction and attenuation of the signal (shifted band intensity) of the labeled probe is quantified. For example, Pitarque et al. demonstrated the effect of a SNP located in the 5’ upstream region of the cytochrome P450 2A6 (CYP2A6) gene utilizing competitive EMSA. Furthermore, competitive unlabeled oligonucleotides can define which binding factors create a shift of interest by challenging with a sequence matching the consensus of a binding factor. Additionally, antibodies which recognize epitopes of a bound protein create a less mobile DNA-protein complex, resulting in a “supershift” on the gel (or in some cases, the antibody binds the DNA-binding region of the protein, thereby attenuating intensity of the shifted band), indicating specificity of DNA-protein binding. As an example, Masotti et al. used this approach to determine that the transcription factor YY1 targeted a promoter sequence of the ribosome biogenesis gene TCOF1. Some Treacher-Collins syndrome (TCS) patients were found to harbor a SNP in this region, thereby altering YY1 binding potential and subsequently reducing TCOF1 gene expression, a phenotype suggested to increase TCS susceptibility.
While EMSA is highly sensitive, its results are often qualitative, as band intensity is difficult to quantify due to issues such as background and band smearing. When assessing SNP effects, in particular those that exhibit minimal binding differences, higher-resolution quantification is desired. Additionally, gel electrophoresis-based technologies that assess in vitro DNA-binding potential are time-consuming and often difficult to optimize (also see review).
Increasingly, there is a demand for higher throughput assays to verify TF binding sites (and the effects of SNPs). One such approach utilizes fluorescence (Förster) resonance energy transfer (FRET). FRET describes the non-radiative transfer of energy from an excited donor fluorophore to an acceptor fluorophore when the distance between the two is less than ~100 Å, resulting in a change in fluorescence (or quenching). FRET has outstanding sensitivity of detection, which can be realized at the level of single biomolecules. A number of recent efforts have utilized FRET to quantify binding of protein to DNA [49–51]. In one scheme, two duplex DNA “half” protein-binding sites are labeled separately, one with a donor and one with an acceptor fluorophore. The FRET effect is stabilized by the binding of protein to the full binding site. This procedure, known as the molecular beacon binding assay, allows real-time optical reporting of the binding event. Another approach exploits the use of dual fluorophore labeled duplex DNA with protein binding detected by exonuclease III protection. A linear duplex DNA is designed with two binding sites occupying each end of the dual-labeled probe. Protein binding protects the DNA duplex from exonuclease III digestion, thereby maintaining FRET. In the absence of protein binding, the duplex is digested and the FRET signal is extinguished. Both assays demonstrate a quantitative dynamic range with respect to protein concentration, and feasibly could be used to test for SNP effects on TF binding. Additionally, these assays are rapid and can be multiplexed using different fluorophores, greatly enhancing their utility as medium-throughput methods. Currently, designing molecular beacon assays can be technically challenging and the dual-labeled probes are relatively expensive to manufacture.
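The steep distance dependence that makes FRET useful as a binding readout follows from the standard Förster relation, E = 1 / (1 + (r/R0)^6), where r is the donor-acceptor distance and R0 is the Förster radius at which transfer efficiency is 50%. The sketch below simply evaluates this relation; the default R0 of 50 Å is an illustrative assumption (R0 depends on the specific fluorophore pair), and the function name is hypothetical.

```python
def fret_efficiency(distance_angstrom, forster_radius_angstrom=50.0):
    """Förster transfer efficiency E = 1 / (1 + (r / R0)**6).

    The default R0 of 50 angstroms is an assumed, illustrative value.
    """
    ratio = distance_angstrom / forster_radius_angstrom
    return 1.0 / (1.0 + ratio ** 6)


# Efficiency falls off sharply once the fluorophores separate.
for r in (25, 50, 75, 100):
    print(f"r = {r:>3} A  ->  E = {fret_efficiency(r):.3f}")
```

Because efficiency drops with the sixth power of distance, assembling or digesting the labeled duplex produces a large, easily measured change in signal.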
Surface plasmon resonance (SPR) imaging arrays are a somewhat recent innovation that can determine DNA-protein binding kinetics without the use of radioactive or fluorescent nucleic acid base modifications. SPR describes the natural phenomenon of photon energy transferring to electrons of a noble metal film (high refraction) present at an air or liquid (low refraction) interface. When the metal film is coupled to a prism, incident light focused at a certain angle will be absorbed by the metal, thereby attenuating reflected light. However, in the presence of molecules occupying a chemically or biologically modified metal surface, the angle of non-reflectance will change, which can subsequently be measured in real time by a CCD camera, a process known as SPR imaging. This technique has been utilized to measure many kinds of biomolecular interactions, including antibody-epitope recognition (i.e. protein-protein interaction), DNA:DNA or DNA:RNA oligomerization [54–57] and DNA-protein interactions [58–61]. DNA microarray technology has been combined with SPR imaging analysis to create a medium to high-throughput methodology for quantifying DNA-protein interactions, in which a recently modified DNA array generation process has enhanced detection of interactions with double-stranded sequences of similar makeup (i.e., SNPs in binding regions). Indeed, this technique demonstrated quantifiable differences in binding association and dissociation of competing MafG homodimer or MafG:Nrf2 heterodimer transcription factors to systematic mutations of a Maf recognition element (MARE) sequence using single or double base substitutions in its flanking and core binding regions. This method allowed prediction of the binding affinities of native and mutated sequence for the MafG homodimer and MafG:Nrf2 heterodimer, which have been shown to repress and enhance downstream antioxidant gene expression, respectively.
SPR imaging holds great promise for investigations of protein binding to polymorphic regulatory elements, but at present it is still a challenging technology. Coupling DNA oligonucleotide arrays to SPR imaging technology expands the possibilities, but as yet only one or two recombinant proteins have successfully been quantitatively assayed in combination. The use of experimentally derived cell or nuclear extracts has not been accomplished, although this might be achieved with antibody probing analogous to an EMSA supershift technique . It remains to be seen if cumulative induction-repression dynamics of a regulatory sequence in a treatment/exposure model can be assessed with SPR imaging.
A format in common use is the enzyme-linked immunosorbent assay (ELISA)-like procedure in which immobilized DNA binding regions are incubated with recombinant protein or nuclear extract in a multi-well format, and then probed with an antibody specific for the protein of interest [66,67]. An advantage of this protocol is that any protein of putative DNA-binding potential can be assayed, assuming the antibody is available. Additionally, the assay format has been demonstrated to be more sensitive to binding potential than EMSA, ascribed to the fact that protein-DNA binding equilibrium can be more readily achieved in this format given fixed amounts of protein and oligonucleotide. Assessing differences in binding affinities for polymorphic sequences might be accomplished using this format in the presence of competitive oligonucleotides specific to each allele, similar to competitive EMSA. Unfortunately, much like EMSA, only binding of one protein to one DNA sequence can be assayed per reaction. However, fluidic ELISA-based assays capable of identifying unique bound sequences in a multiplexed oligonucleotide binding reaction have been recently developed. Gorenstein et al. [68,69] created a polystyrene bead library consisting of unique bead collections synthesized with a single species of nuclease-resistant synthetic phosphodiester-modified oligonucleotides known as thioaptamers. A mixture of differently labeled thioaptamer-coated beads was incubated with NF-κB p50 protein, and then the bound protein was detected with a fluorescently labeled anti-p50 antibody. Labeled thioaptamer beads were then detected and sorted using flow cytometry, and the p50-bound thioaptamer sequences were recovered and sequenced for identification. Similarly, Yaoi et al. multiplexed 24 different double-stranded TF binding sequences, each labeled with biotin, in a reaction with cellular nuclear extract.
After purification, denatured double-stranded protein-bound biotinylated probes were hybridized to unique fluorescently labeled beads conjugated with anti-sense sequences specific for each of the 24 TF probes (i.e., one bead, one TF binding sequence). Following phycoerythrin-conjugated streptavidin treatment, each oligonucleotide probe provided a quantitative readout of TF binding to its consensus sequence. Our laboratory has pursued the possibility that an oligo-conjugated fluorescent microsphere-based technique could be utilized to assess differential binding affinities due to sequence polymorphism. We have developed a sensitive, high-throughput microsphere-based assay based on Luminex FlexMap technology to measure both p53 binding to p53-response elements and estrogen receptor α (ERα) binding to ERα response elements. Up to 82 response elements have been tested in a multiplex hybridization. Microspheres are sorted and identified, and bound proteins are detected with phycoerythrin-labeled antibodies (unpublished data, technique outlined in Fig. 3). The approach shows high resolution and utility in both experimental mutagenesis studies of TF binding sites and the functional evaluation of SNPs in TF binding sites.
A recently developed very high-throughput methodology that can evaluate hundreds of thousands of binding sites in a single experiment utilized the incorporation of immobilized oligos into a microarray format. While similar to the previously described ELISA-based formats, microarray binding technology provides both a quantitative and highly scalable platform to assess in vitro protein-DNA binding [71–74]. The first use of this approach on a genome-wide scale was performed by Mukherjee and colleagues using purified GST-tagged yeast transcription factors . This technology, termed ‘Protein Binding Microarrays’ or PBMs, has since been scaled up to represent every possible binding site 8–10 bp in length [76,77]. PBMs will likely significantly enhance our understanding of what individual sequence variants do to alter binding potential in an in vitro setting, allowing for greater predictive capability of a SNP effect in a TFBS.
A concern with in vitro binding assays is that DNA is presented in an unnatural state, i.e., the sequences are not in a chromatin context, and specific cis- or trans-acting co-factors or enhancer elements are absent, thus creating an artificial protein-DNA binding situation. Chromatin immunoprecipitation (ChIP) is a versatile binding assay that provides insight into gene regulation in an endogenous state. In this technique, DNA binding proteins such as transcription factors or proteins involved in DNA binding complexes are crosslinked to the DNA by formaldehyde, and the chromatin is subsequently sonicated in order to fragment the DNA before immunoprecipitation by an antibody specific for the protein of interest [78,79]. Formaldehyde is a powerful and reversible crosslinking agent that essentially acts to “freeze in time” protein-DNA, protein-RNA and protein-protein interactions.
Although the crosslinking, sonication, and immunoprecipitation procedures for ChIP have been well-documented in numerous recent studies, there are many techniques for detecting and analyzing immunoprecipitated DNA. Historically, slot blot, Southern blot and cloning and sequencing were used. More recently, immunoprecipitated DNA is PCR amplified and viewed on a gel or with quantitative real-time monitoring [81,82]. Additionally, immunoprecipitated DNA can be measured genome-wide by hybridizing to a genomic tiling microarray, procedurally known as ChIP-on-chip or ChIP-chip . Discovery methods such as ChIP-chip are powerful because they allow genomic scale identification of both known and novel DNA binding sites. Encouragingly, a recent independent and blind test of different ChIP-chip platforms using the same DNA samples with “spiked-in” targets found that ChIP-chip was very reproducible, where variation tended to result from lab, protocol, and computational algorithm differences . However, ChIP-chip detection limit is somewhat sensitive to protein expression levels and may be influenced by protein-protein interactions such as competitive DNA binding of other proteins . In addition, the sequence complexity in the human genome and the relative sequence complexity of the binding motif can affect resolution. However, this approach is likely to become routine as whole genome tiling arrays become more cost-effective. In order to overcome some of these limitations, the DIP-chip (DNA immunoprecipitation with microarray detection) method was developed using yeast Leu3p as a model in which purified protein and naked genomic DNA are mixed in vitro to form protein-DNA complexes. These complexes are then isolated, run on whole genomic microarrays to identify protein bound DNA fragments, and sequenced to identify putative binding sites .
Methods to detect protein-DNA interactions using high-throughput DNA sequencing techniques have also been developed using ChIP technology. For example, ChIP-STAGE (Sequence Tag Analysis of Genomic Enrichment) was developed to map regions of protein-DNA interactions in both yeast TATA-box binding protein and human transcription factor E2F4 targets . A similar method, ChIP-PET, combines ChIP with paired-end ditag (PET) sequencing for identification of TF binding sites . Using the sequence specific p53 tumor suppressor binding site as a model, Wei et al.  cloned immunoprecipitated DNA into a DNA library from which tags of both the 5’ and 3’ ends of each binding region were created and joined together to create PETs. PETs were then cloned into a final ChIP-PET library, sequenced, and then genomically mapped to determine boundaries of cloned fragments. While nonspecific fragments are distributed randomly, overlapping fragments representing true binding sites will form PET clusters .
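The idea that true binding sites appear as clusters of overlapping mapped fragments, while nonspecific fragments scatter randomly, can be illustrated with a simple interval-grouping routine. This is a simplified sketch and not the published ChIP-PET analysis: it chains fragments that overlap transitively rather than requiring a shared overlapping region, and the function name, coordinates, and minimum cluster size are hypothetical.

```python
def pet_clusters(fragments, min_size=2):
    """Group overlapping (chrom, start, end) fragments into clusters.

    Fragments are chained if each overlaps the running cluster span.
    Only clusters with at least `min_size` fragments are returned.
    """
    clusters = []
    current = []
    current_end = None
    for chrom, start, end in sorted(fragments):
        same_chrom = bool(current) and chrom == current[0][0]
        if same_chrom and start <= current_end:
            # Fragment overlaps the current cluster: extend it.
            current.append((chrom, start, end))
            current_end = max(current_end, end)
        else:
            # Close the previous cluster and start a new one.
            if len(current) >= min_size:
                clusters.append(current)
            current = [(chrom, start, end)]
            current_end = end
    if len(current) >= min_size:
        clusters.append(current)
    return clusters


# Hypothetical mapped fragments: three overlap on chr1, two are isolated.
example = [
    ("chr1", 100, 600), ("chr1", 400, 900), ("chr1", 850, 1200),
    ("chr1", 5000, 5500), ("chr2", 450, 800),
]
for cluster in pet_clusters(example):
    print(len(cluster), "fragments spanning",
          cluster[0][1], "-", max(f[2] for f in cluster))
```

Raising `min_size` trades sensitivity for specificity, since larger clusters are less likely to arise from randomly distributed background fragments.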
ChIP methods have also been developed to analyze differential protein-DNA binding among allelic variants of a gene. HaploChIP (haplotype-specific chromatin immunoprecipitation), developed by Knight et al., combined traditional ChIP using Pol II antibody with mass spectrometry to identify differential protein-DNA binding in vivo associated with allelic variants of the imprinted gene SNRPN. Our laboratory has modified this technique to address differential allelic binding. In this method, protein-bound DNA is immunoprecipitated and polymorphic candidate binding regions are then assayed using a probe-based allelic-discrimination genotyping assay to assess differential binding potential. Liu et al. used this technique to demonstrate differential Pol II binding potential in a promoter region of the glutathione-S-transferase gene, GSTM3, caused by a base pair substitution. In a more high-throughput manner, Maynard and colleagues annealed ChIP DNA material to Illumina HumanHap300 Genotyping BeadChip arrays in order to assess allelic binding differences of Pol II across the genome. This method, ChIP-SNP, identified 466 SNPs that altered binding of Pol II, including many in known imprinted genes. While this method evaluates several thousand polymorphic elements in a single experiment, an important limitation is that the evaluated SNPs in a test sample must be heterozygous; homozygous locations are uninformative. Evaluation of all polymorphic TFBSs of interest would therefore require numerous samples that represent heterozygosity at all putative binding locations, thereby limiting the cost effectiveness of such a study.
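The heterozygosity requirement can be made concrete with a rough calculation of how often a SNP is informative. The sketch below assumes Hardy-Weinberg proportions and unrelated samples, which are assumptions introduced here for illustration and not part of the studies described; the function names are hypothetical.

```python
def heterozygote_fraction(minor_allele_freq):
    """Expected heterozygote frequency, 2pq, under Hardy-Weinberg (assumed)."""
    p = minor_allele_freq
    return 2.0 * p * (1.0 - p)


def prob_informative(minor_allele_freq, n_samples):
    """Chance that at least one of n unrelated samples is heterozygous."""
    h = heterozygote_fraction(minor_allele_freq)
    return 1.0 - (1.0 - h) ** n_samples


# Rare SNPs need many more samples before a heterozygote is likely.
for maf in (0.05, 0.20, 0.50):
    print(f"MAF {maf:.2f}: "
          f"P(informative, n=10) = {prob_informative(maf, 10):.2f}")
```

Under these assumptions, low-frequency SNPs are rarely heterozygous in a small panel, which is why covering all candidate sites requires many samples.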
While TF-DNA binding is typically necessary for TF transactivation and gene regulation, assessing the effect of SNPs themselves on transcription and phenotype is essential.
A putative regulatory sequence can be incorporated into an expression vector construct containing a reporter gene lacking endogenous promoter activity. The plasmid is subjected to regulatory cellular machinery when transfected into a cell, and reporter gene expression is measured (typically, a luciferase gene is placed downstream of the promoter region of interest). For polymorphisms, separate constructs containing the allelic sequences are transfected in parallel and expression is normalized to a reference. For a thorough discussion of the use of gene reporter constructs in the context of assessing gene regulatory regions, see reviews [8,92,93]. Various factors affect interpretation of data generated from reporter construct assays, including variation in transfection efficiency, normalization to co-transfected constitutively expressed constructs, and exclusion/inclusion of enhancer regions in the test reporter construct that cooperate in expression of the endogenous gene. While transient expression of reporter gene constructs uses the host cell’s transcriptional machinery, high copy number and the lack of any chromatin features and structure make it an imperfect model system. Stable transfection of reporter constructs (i.e., permanent incorporation of the foreign plasmid DNA into the genome) potentially corrects problems of an abnormal chromatin context, but clonal variation can hinder precise quantitative assessment of gene regulation and expression [8,95]. While imperfect, transfection of plasmid constructs in a specific cell system, which may overexpress (or have induced) the TF of interest, gives the researcher a specific and quantitative tool for assessing SNP effects on TF gene transactivation.
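The normalization and allelic comparison can be sketched as below. The example assumes that each replicate well yields a test reporter signal and a signal from a co-transfected, constitutively expressed control construct, and that the ratio of the two corrects for transfection efficiency. The function names and all readings are hypothetical.

```python
from statistics import mean


def normalized_activity(reporter, control):
    """Divide each test reporter reading by its co-transfected control."""
    return [r / c for r, c in zip(reporter, control)]


def allele_fold_difference(allele_a, allele_b):
    """Mean normalized activity of allele A relative to allele B.

    Each argument is a (reporter_readings, control_readings) pair of
    equal-length lists, one entry per replicate well.
    """
    mean_a = mean(normalized_activity(*allele_a))
    mean_b = mean(normalized_activity(*allele_b))
    return mean_a / mean_b


# Hypothetical luminescence readings from three replicate wells per allele.
allele_a = ([1200.0, 2000.0, 550.0], [100.0, 200.0, 50.0])
allele_b = ([500.0, 1200.0, 275.0], [100.0, 200.0, 50.0])
fold = allele_fold_difference(allele_a, allele_b)
```

Note that the raw reporter readings vary several-fold between wells, whereas the normalized ratios are consistent within each allele; this is the variation in transfection efficiency that the control construct is intended to absorb.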
If SNPs are present in the coding region of a gene, then the effect of an rSNP on transcription of that gene can often be estimated by measuring allelic imbalance in transcript levels. Allelic ratios of a coding-region SNP in a heterozygous individual, where the coding SNP is in linkage disequilibrium (LD) with the rSNP of putative functional relevance, provide direct evidence for differential transcription of alleles caused by rSNPs. Detection of allele-specific gene expression depends on quantification of mRNA from each allele and assumes that normally each chromosome is equally expressed in a 1:1 ratio [96,97]. Relative allelic expression has been studied using methods such as amplification refractory mutation system (ARMS), PCR followed by RFLP analysis, SNP genotyping arrays, mRNA sequencing and mass spectrometry-based genotyping. PCR followed by primer extension is also common, with products detected using capillary electrophoresis [96,99] or mass spectrometry.
Alternatively, probe-based genotyping assays can provide a quantitative way of assessing allelic imbalance. For example, a coding SNP was found to be in LD with a functional rSNP in the promoter of the GSTM3 gene. Using genomic DNA of known genotypes as a standard reference, allelic expression of the coding SNP in GSTM3 transcript was measured using a quantitative probe-based genotyping assay. Using this approach, the transcript linked to the strong-binding promoter allele was enriched over the transcript linked to the weak allele in three cell lines heterozygous for the functional rSNP (Fig. 4). We are currently applying this method to evaluate the effects of candidate rSNPs in antioxidant response elements. Because measured allele ratios can be affected by the quantity of input mRNA and the number of amplification cycles, it is important to validate allelic imbalance methods using reference mixtures of DNA at different concentrations and with known allele ratios. Allelic imbalance evaluation using probe-based assays is amenable to high-throughput assessment of a large set of samples. Indeed, other high-throughput methodologies, including pyrosequencing techniques [102,103], are being utilized to accurately and reliably quantify allelic imbalance.
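Validation against reference mixtures can be sketched as a simple calibration. The example assumes that the measured allele fraction responds linearly to the true allele fraction over the tested range, which would need to be confirmed for any given assay; the function names and all values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def calibrated_fraction(measured, slope, intercept):
    """Invert the standard curve to estimate the true allele fraction."""
    return (measured - intercept) / slope


# Hypothetical reference mixtures of genomic DNA with known fractions of
# allele A, and the allele A fraction reported by the assay for each.
known = [0.20, 0.35, 0.50, 0.65, 0.80]
measured = [0.26, 0.395, 0.53, 0.665, 0.80]
slope, intercept = fit_line(known, measured)

# Hypothetical cDNA measurement from a heterozygous cell line.
fraction_a = calibrated_fraction(0.62, slope, intercept)
allelic_ratio = fraction_a / (1.0 - fraction_a)
```

In this toy example, the assay reports 0.53 for a true 50:50 mixture, so an uncalibrated cDNA reading would overstate the contribution of allele A; correcting with the standard curve removes that assay bias before the allelic ratio is interpreted.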
The expression phenotype for many genes is highly variable in humans. Several studies indicate that a considerable portion of this variation is heritable and is associated with SNPs in both coding and regulatory regions of the genome. The development of the International HapMap project has provided a convenient resource to examine this association. This project so far has genotyped more than 3 million SNPs in four distinct populations (CEU: 90 Utah residents with ancestry from northern and western Europe; CHB: 45 Han Chinese in Beijing; JPT: 45 Japanese in Tokyo; YRI: 90 Yoruba in Ibadan, Nigeria). Several research groups have generated gene expression profiles in lymphoblastoid cell lines of HapMap individuals and analyzed the correlation of SNP genotype to gene expression phenotype. For example, genome-wide linkage analysis has been performed for expression phenotypes in CEU individuals to map the genetic determinants of variation in human gene expression. In addition, cis-acting rSNPs have been discovered using a regional association approach.
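A correlation of SNP genotype with expression phenotype can be sketched as a regression of expression on minor-allele count. This is a minimal illustration under an additive model, which is an assumption and not necessarily the model used in the cited studies; the function name and data are hypothetical, and a real analysis would also require significance testing and correction for multiple comparisons.

```python
def additive_association(dosages, expression):
    """Regress expression on minor-allele count (0, 1, or 2).

    Returns the least-squares slope (expression change per allele copy)
    and the Pearson correlation coefficient.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, r


# Hypothetical data: six individuals, two per genotype class, with
# log-scale expression values for one transcript.
dosages = [0, 0, 1, 1, 2, 2]
expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
slope, r = additive_association(dosages, expression)
```

Here each additional copy of the minor allele is associated with an increase of about one expression unit, with a correlation close to 1 in this deliberately clean toy data set.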
On a larger scale, genome-wide association has been performed to seek statistically significant association of SNPs with expression variation in lymphoblastoid cell lines of all 210 unrelated HapMap individuals. Our laboratory has also utilized the HapMap resource to test the association between genotypes of bioinformatically identified antioxidant response element SNPs and gene expression phenotypes in 60 unrelated CEU individuals. There are, however, numerical limitations with these approaches utilizing the HapMap data. First, HapMap genotyping and expression data are currently only available for 3.1 million SNPs and 270 individuals. Second, many SNPs (or haplotypes) have very low allele frequencies. Third, and most importantly, the microarray expression profiles used in our association analysis were from untreated human lymphoblastoid cell lines. Ideally, expression information from additional tissues and stress exposures could be evaluated in order to find association between exposure-inducible gene expression and SNPs that affect transcriptional activation. Because polymorphic variants that are known to affect expression are often found at population frequencies of less than 5% [107–109], issues of sample size become crucial in a survey like this one. Ideally, expression profiles from a larger sample (n > 400) of stress-exposed human cells could be used in this analysis, combined with a higher resolution of HapMap genotyping data, making this bioinformatics-driven approach particularly effective.
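The sample size concern can be made concrete with a short calculation of expected genotype counts. The sketch assumes Hardy-Weinberg proportions, which is a simplification; the function name is hypothetical, and the sample sizes of 60 and 400 and the 5% allele frequency are taken from the text above.

```python
def expected_genotype_counts(n, maf):
    """Expected genotype counts in n individuals under Hardy-Weinberg.

    `maf` is the minor allele frequency (between 0 and 0.5).
    """
    major = 1.0 - maf
    return {
        "major/major": n * major * major,
        "heterozygous": n * 2.0 * major * maf,
        "minor/minor": n * maf * maf,
    }


# A SNP with a 5% minor allele frequency in the two sample sizes
# discussed in the text.
counts_60 = expected_genotype_counts(60, 0.05)
counts_400 = expected_genotype_counts(400, 0.05)
```

Under these assumptions, a panel of 60 individuals is expected to contain fewer than six heterozygotes and essentially no minor-allele homozygotes (about 0.15), whereas a sample of 400 is expected to contain about 38 heterozygotes and one homozygote, which illustrates why larger samples are needed to detect expression effects of low-frequency variants.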
A key challenge in identifying functional polymorphisms in gene regulatory regions is the sheer amount of genomic data and the undefined nature of these regions. Linkage analysis has had success in identifying chromosomal regions relevant to disease susceptibility, and genome-wide SNP association studies will identify many additional important candidate genes. However, it remains a challenge to resolve the mechanism by which specific polymorphisms affect gene expression. Candidate gene approaches can be effective; however, novel genes are often overlooked using this method. By using bioinformatics-based methods that access genomic information and expression databases, one can retain the power of a candidate gene approach yet still uncover novel gene regions. Additionally, resources such as dbSNP and the International HapMap database can provide detailed polymorphism data on gene candidates whose impact can be predicted further using bioinformatics tools such as PWM scoring.
It is still necessary, however, to verify the functional impact of novel gene polymorphisms using basic molecular biological techniques. Methods such as EMSA, ChIP, and luciferase reporter constructs can be used to test the effect of a SNP on regulatory element function. These processes tend to be laborious and are poorly suited to screening large numbers of DNA elements and SNPs. It is, therefore, imperative to develop high-throughput methods to assess regulatory regions in the genome. Indeed, methodologies covered in this review, such as SPR analysis, oligo-conjugated microsphere binding assays, PBMs, and allelic imbalance methods, have contributed greatly to this endeavor. Unfortunately, obstacles exist with these technologies (such as reproducibility, quantification, limit of detection, and affordability), and continued efforts to develop these and new technologies are needed. Methodology advancement to allow effective analysis of genomic data will be necessary to understand the genetic basis of disease susceptibility and to effectively predict and prevent disease in individuals.