Human genetic variation can have a pronounced influence on susceptibility to exposure-induced disease. The most common form (about 90%) of human genetic variation is the single nucleotide polymorphism (SNP). To date, over 10 million of these variations have been deposited in the National Center for Biotechnology Information (NCBI) dbSNP database, which corresponds on average to one SNP per 300 base pairs of the human genome. The vast majority of SNPs likely have no measurable impact on human health, but a small fraction alter gene function or expression and affect phenotypes. Identifying such functional polymorphisms is of great importance, as it will advance our understanding of phenotypic variation and enable the identification of high-risk individuals.
Many investigations of functional polymorphisms have focused primarily on SNPs in coding regions of genes, on the presumption that such polymorphisms can influence phenotype by altering the encoded protein. However, SNPs in non-coding regulatory regions can also play an important biological role. In particular, regulatory elements that control the levels and timing of transcription are attractive regions to examine for functional SNPs. SNPs in transcription factor binding sites (TFBSs) may affect the binding of transcription factors, lead to differences in gene expression and phenotypes, and therefore affect susceptibility to environmental exposure. The impact of variation on susceptibility to environmental exposure has been well established through the study of several important metabolism and DNA repair genes (reviewed in (1)). Other examples include a SNP that causes human α-thalassemias by creating a new binding site for the erythroid transcription factor GATA-1 upstream of the α-globin gene (2) and a SNP (-43C->T) in the proximal Sp1 site of the human low density lipoprotein receptor promoter that results in heterozygous familial hypercholesterolemia (3). The UGT1A1 gene has a TATA box polymorphism that reduces expression of UGT1A1, leading to Gilbert's syndrome (a common form of hyperbilirubinemia) (4, 5), and that has also been associated with higher levels of mutagens in the urine (6). The steroid metabolism gene CYP17 has a GC box polymorphism in its proximal promoter that has been associated with higher levels of circulating estradiol (7) and with differences in bone mineral density (8).
It is a formidable challenge to identify SNPs in TFBSs among millions of uncharacterized SNPs and to evaluate their potential impact on human health. Traditionally, TFBSs have been discovered using time-consuming, low-throughput experimental methods that explore DNA-protein interactions. Computational methods for the identification of cis-regulatory sequences have been sought to direct laboratory work, and have been successfully applied to simple organisms such as yeast and worm. In mammals, however, these methods have been plagued by a high false-positive rate, because intergenic sequences in higher eukaryotes are very long and contain a large excess of non-regulatory sequence (9). More recently, new bioinformatics algorithms have begun to improve predictive specificity. One popular approach looks for well-conserved regulatory sequences by comparing the upstream sequences of orthologous genes across species (10, 11). Another approach identifies statistically over-represented motifs in the upstream regions of genes that are co-regulated in microarray expression profiles (12).
With the availability of a fully assembled human genome sequence and of large-scale genotyping and gene expression profiling technologies, we aimed to develop computational tools to systematically detect functional SNPs in TFBSs in the human genome and to predict their impact on the expression of target genes. We have implemented an integrated system combining TFBS recognition, comparative genome analysis, gene expression profiling, and genotype-expression phenotype association.
This paper demonstrates our systematic approach to identifying functional polymorphisms that regulate the human antioxidant response pathway. The antioxidant response element (ARE) is a cis-acting enhancer sequence found in the promoter region of many genes encoding antioxidant and Phase II detoxification enzymes/proteins. In response to oxidative stress, the transcription factor NRF2 (nuclear factor erythroid-derived 2-like 2) translocates to the nucleus and dimerizes with other basic leucine zipper (bZIP) proteins such as small Maf proteins (MafG) to form a transactivation complex that binds to AREs. Other regulatory proteins such as NRF1 (13), NRF3 (14) and BACH1 bind to AREs and under some conditions compete for binding with NRF2 (15). NRF2 mediates a transcriptional network of responsive genes that modulate in vivo mechanisms against oxidative damage and reactive electrophiles.
Numerous studies have investigated NRF2 binding to DNA elements, and a consensus sequence for binding was initially proposed as 5′-RGTGACnnnGC-3′ (where n = A, C, G, or T, R = A or G) after mutagenesis studies of the rat Gsta2 and Nqo1 gene enhancers (16). After analyzing promoters of the mouse Gsta2, Nqo1, Gstp1, and Ftl genes, Wasserman & Fahl (17) suggested that the functional ARE is better represented by an extended consensus sequence 5′-TMAnnRTGAYnnnGCRwwww-3′ (where W = A or T; M = A or C, ‘core’ consensus underlined). Furthermore, Erickson et al. (18) suggested that the ARE consensus should be revised to 5′-RTKAYnnnGCR-3′ (where K = G or T, Y = C or T) as a result of finding a functional ARE in the human GCLM promoter region. Interestingly, a detailed mutagenesis study of the mouse Nqo1 ARE by Nioi et al. (19) found that the G at position 14 (Wasserman numbering) was not essential for function of the enhancer, but that the nucleotides marked ‘n’ at positions 4 and 12 were essential for function in mouse. This work suggested that a universally applicable ARE consensus sequence might not be possible (19). Despite recent progress in identifying ARE-regulated genes and understanding the functional mechanisms of transcriptional regulation, relatively little is known about sequence polymorphisms in human ARE genes that might affect gene expression levels and resulting susceptibility phenotypes.
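The IUPAC-coded consensus sequences quoted above can be read as simple sequence patterns. The Python sketch below is illustrative only and is not the method used in this work: it expands a consensus such as 5′-RTKAYnnnGCR-3′ into a regular expression and scans both strands of a sequence. The helper names (`IUPAC`, `reverse_complement`, `find_sites`) and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes that appear in the consensus sequences above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "W": "[AT]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_sites(consensus, seq):
    """Return sorted (start, strand, site) tuples for every consensus match.

    'start' is the 0-based position on the forward strand of the leftmost
    base covered by the site; 'site' is read 5'->3' on the matching strand.
    """
    seq = seq.upper()
    body = "".join(IUPAC[c] for c in consensus.upper())
    # A lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    width = len(consensus)

    hits = [(m.start(), "+", m.group(1)) for m in pattern.finditer(seq)]
    for m in pattern.finditer(reverse_complement(seq)):
        # Map the reverse-strand coordinate back to the forward strand.
        hits.append((len(seq) - m.start() - width, "-", m.group(1)))
    return sorted(hits)


# Hypothetical toy sequence containing one site that fits RTKAYnnnGCR.
toy = "TTATGACTCAGCATT"
hits = find_sites("RTKAYnnnGCR", toy)
```

A strict consensus match of this kind is all-or-nothing, which is one reason a single mismatch at a non-essential position can hide a functional element, consistent with the observation that no single consensus may be universally applicable.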
To identify potential polymorphic AREs in the human genome, we constructed a position weight matrix (PWM) statistical model based on a set of functional ARE sequences culled from published experimental studies (Table S1). Using this PWM model and our computational tools, we identified a set of ARE SNPs that may regulate in vivo NRF2-mediated responses.
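As a rough illustration of how a PWM can be built from aligned binding sites and used to compare the two alleles of a SNP, consider the following Python sketch. It is not the model described in this paper: the log-odds formulation, the uniform background of 0.25, the pseudocount of 0.5, the function names, and the four toy sites (which are invented and are not taken from Table S1) are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [s[i].upper() for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the log-odds of each base in a window as wide as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window.upper()))


def best_window(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    n = len(pwm)
    return max((score(pwm, seq[i:i + n]), i) for i in range(len(seq) - n + 1))


def allele_effect(pwm, seq, pos, ref, alt):
    """Compare the best PWM score over windows covering pos for ref vs alt.

    Returns (ref_score, alt_score, alt_score - ref_score).
    """
    seq = seq.upper()
    if seq[pos] != ref:
        raise ValueError("reference allele does not match the sequence")
    n = len(pwm)
    starts = range(max(0, pos - n + 1), min(pos, len(seq) - n) + 1)
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_score = max(score(pwm, seq[i:i + n]) for i in starts)
    alt_score = max(score(pwm, alt_seq[i:i + n]) for i in starts)
    return ref_score, alt_score, alt_score - ref_score


# Invented toy sites for illustration only (not the Table S1 sequences).
toy_sites = ["ATGACTCAGCA", "GTGACAAAGCA", "ATGACTTAGCG", "GTGACGCTGCA"]
pwm = build_pwm(toy_sites)

# Hypothetical SNP: G -> C at position 4 of a toy flanking sequence.
ref_score, alt_score, delta = allele_effect(pwm, "CCATGACTCAGCACC", 4, "G", "C")
```

In this toy case the alternate allele replaces a base seen in every training site, so `delta` is negative. Unlike a strict consensus match, the score degrades gradually, which makes it possible to rank SNPs by how strongly they are predicted to weaken or strengthen a site.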