|Home | About | Journals | Submit | Contact Us | Français|
Single nucleotide polymorphisms (SNPs) in transcription factor binding sites (TFBSs) may affect the binding of transcription factors, lead to differences in gene expression and phenotypes, and therefore affect susceptibility to environmental exposure. We developed an integrated computational system for discovering functional SNPs in TFBSs in the human genome and predicting their impact on the expression of target genes. In this system we: (1) construct a position weight matrix (PWM) from a collection of experimentally discovered TFBSs; (2) predict TFBSs in SNP sequences using the PWM and map SNPs to the upstream regions of genes; (3) examine the evolutionary conservation of putative TFBSs by phylogenetic footprinting; (4) prioritize candidate SNPs based on microarray expression profiles from tissues in which the transcription factor of interest is either deleted or over-expressed; and (5) finally, analyze association of SNP genotypes with gene expression phenotypes. The application of our system has been tested to identify functional polymorphisms in the antioxidant response element (ARE), a cis-acting enhancer sequence found in the promoter region of many genes that encode antioxidant and Phase II detoxification enzymes/proteins. In response to oxidative stress, the transcription factor NRF2 (nuclear factor erythroid-derived 2-like 2) binds to AREs, mediating transcriptional activation of its responsive genes and modulating in vivo defense mechanisms against oxidative damage. Using our novel computational tools, we have identified a set of polymorphic AREs with functional evidence, showing the utility of our system to direct further experimental validation of genomic sequence variations that could be useful for identifying high-risk individuals.
Human genetic variation can have a pronounced influence on susceptibility to exposure induced disease. The most common form (about 90 %) of human genetic variation is the single nucleotide polymorphism (SNP). To date, over 10 million of these variations have been deposited in the National Center for Biotechnology Information (NCBI) dbSNP database. That is on average one SNP per 300 base-pairs existing in the human genome. The vast majority of SNPs likely have no measurable impact on human health, but a small fraction of SNPs alter gene function or expression and affect phenotypes. Identifying such functional polymorphisms is of great importance, as it will advance our understanding of phenotypic variations and enable the identification of high-risk individuals.
Many investigations of functional polymorphisms have focused primarily on SNPs in coding regions of genes, on the presumption that such polymorphisms can influence phenotype by altering the encoded protein. However, SNPs in non-coding regulatory regions can also play an important biological role. In particular, regulatory elements that control the levels and timing of transcription are attractive regions to examine for functional SNPs. SNPs in transcription factor binding sites (TFBSs) may affect the binding of transcription factors, lead to differences in gene expression and phenotypes, and therefore affect susceptibility to environmental exposure. The impact of variation on susceptibility to environmental exposure has been well established through the study of several important metabolism and DNA repair genes (reviewed in (1)). Other examples include a SNP that causes human α-thalassemias by creating a new binding site for erythroid transcription factor GATA-1 in the upstream of α-globin gene (2) and a SNP (-43C->T) in the proximal Sp1 site of the human low density lipoprotein receptor promoter results in heterozygous familial hypercholesterolemia (3). The UGT1A1 gene has a TATA box polymorphism that reduces expression of UGT1A1, leading to Gilbert's syndrome (a common form of hyperbilirubinemia) (4, 5), and has also been associated with higher levels of mutagens in the urine (6). The steroid metabolism gene CYP17 has a GC box polymorphism in its proximal promoter that has been associated with higher levels of circulating estradiol (7) and with differences in bone mineral density (8).
It is a formidable challenge to identify SNPs in TFBSs among millions of uncharacterized SNPs and to evaluate their potential impact on human health. Traditionally, TFBSs have been discovered using time-consuming and low-throughput experimental methods that explore DNA-protein interaction. Computational methods for the identification of cis-regulatory sequences have been sought to direct laboratory work, and have been successfully applied to simple organisms such as yeast and worm. But these methods have been plagued by high false positive rate in mammals because intergenic sequences in higher eukaryotes are very long and contain a large excess of non-regulatory sequences (9). Recently, new bioinformatics algorithms are gaining success in improving the predictive specificity. One popular algorithm examines well conserved regulatory sequences through the comparison of upstream sequences of orthologous genes across species (10, 11). Another approach identifies statistically over-represented motifs in the upstream regions of genes that are co-regulated in microarray expression profiles (12).
With the availability of a fully-assembled human genome sequence, large-scale genotyping and gene expression profiling technologies, we aimed to develop computational tools to systematically detect functional SNPs in TFBSs in the human genome and to predict their impact on the expression of target genes. We have implemented an integrated system combining TFBS recognition, comparative genome analysis, gene expression profiling, and genotype-expression phenotype association.
This paper demonstrates our systematic approach to identifying functional polymorphisms that regulate the human antioxidant response pathway. The antioxidant response element (ARE) is a cis-acting enhancer sequence found in the promoter region of many genes encoding antioxidant and Phase II detoxification enzymes/proteins. In response to oxidative stress, the transcription factor NRF2 (nuclear factor erythroid-derived 2-like 2) translocates to the nucleus and dimerizes with other basic leucine zipper (bZIP) proteins such as small Maf proteins (MafG) to form a transactivation complex that binds to AREs. Other regulatory proteins such as NRF1(13), NRF3(14) and BACH1 bind to AREs and under some conditions compete for binding with NRF2 (15). NRF2 mediates a transcriptional network of responsive genes that modulate in vivo mechanisms against oxidative damage and reactive electrophiles.
Numerous studies have investigated NRF2 binding to DNA elements and a consensus sequence for binding was initially proposed as 5′-RGTGACnnnGC-3′ (where n = A, C, G, or T, R = A or G) after mutagenesis studies of the rat Gsta2 and Nqo1 gene enhancers (16). After analyzing promoters of mouse Gsta2, Nqo1, Gstp1, and Ftl genes, Wasserman & Fahl (17) suggested that the functional ARE is better represented by an extended consensus sequence 5′-TMAnnRTGAYnnnGCRwwww-3′ (where W = A or T; M = A or C, ‘core’ consensus underlined). Furthermore, Erickson et al (18) suggested that ARE consensus should be revised to 5′-RTKAYnnnGCR-3′ (where K = G or T, Y = C or T) as a result of finding a functional ARE in the human GCLM promoter region. Interestingly, a detailed mutagenesis study of the mouse Nqo1 ARE by Nioi et al (19) found that the G at position 14 (Wasserman numbering) was not essential for function of the enhancer, but the nucleotides marked ‘n’ at positions 4 and 12 were essential for function in mouse. This work suggested that a universally applicable ARE consensus sequence might not be possible (19). Despite recent progress in identifying ARE-regulated genes and understanding functional mechanisms of transcription regulation, relatively little is known about sequence polymorphisms in human ARE genes that might affect gene expression levels and resulting susceptibility phenotypes.
In order to identify potential polymorphic AREs in the human genome, we constructed a position weight matrix (PWM) statistical model, based on a set of functional ARE sequences culled from published experimental studies (Table S1). Using this PWM model and our computational tools, we identified a set of ARE SNPs that may regulate in vivo NRF2-mediated responses.
We developed a modular system for identifying polymorphic response elements. It integrates distinct computational methods and databases to facilitate: (1) position weight matrix (PWM) construction, (2) TFBS prediction, (3) phylogenetic footprinting, (4) microarray expression profile analysis, and (5) SNP genotype-gene expression association. Figure 1 summarizes the steps in each of these processes. It is implemented using PERL, MySQL, and R programming languages. The programs are available upon request.
We identified 57 functional ARE sequences from the literature that were discovered by experimental methods investigating DNA-protein interaction, including electrophoretic mobility shift assay (EMSA), DNase I footprinting, and promoter-reporter gene constructs (see supplementary Table S1 for sequences and references). Quantitative comparative functional data was available for only a few of these sequences so it was necessary to create a statistical model to quantitatively describe the ARE motif to search for novel AREs, and predicting the binding affinity of putative AREs to NRF2. These reported functional ARE sequences were aligned and then a position frequency matrix (Figure 2A) and position weight matrix (PWM) (Figure 2B) for the 21-nt extended ARE motif were created. Using the PWM model, a quantitative score was assigned to a sequence by summing the values that correspond to the observed nucleotide at each position, enabling us to identify potential binding sites by computation. The relative occurrence of the nucleotide at the position in the ARE was further visualized as a “sequence logo” (Figure 2C, generated using the WebLogo program (20)); which displays the contribution of individual positions and individual nucleotides to the overall specificity of the ARE motif. We observed that nucleotides at positions 7, 9, and 14 were fully conserved across all AREs and all 3 species (human, mouse and rat), while positions 8 and 15 show near 100% conservation. Comparing the sequence logos for the known ARE sequences from each species (human, mouse and rat, supplementary Figure S1), we found these PWM-based data were consistent with previously proposed consensus sequences (16). Our model suggests that nucleotide positions from 7 to 16 provided the majority of information beyond the ‘core’ consensus sequence, but the ‘extended’ sequences (17) outside positions 7-16 may contain species-specific information.
Based on our ARE PWM, we set a threshold for searching new AREs to the lowest PWM score (PWM = 6.4) observed among the 57 functional AREs (Table S1), and inferred a degenerate consensus sequence “nnnnnnTnAnnnnGnnnnnn” that was used as a pre-filter to accelerate computation for detecting AREs in millions of dbSNP sequences. Since AREs enhance transcription in both forward and reversed complementary orientations, we performed eight “sliding window” searches for each input SNP, including searches of both alleles plus flanking sequences, on positive and negative strands, and in forward and reversed complementary orientations.
Among 11,369,307 uniquely mapped human SNPs in NCBI dbSNP (build 126), we located 280,867 SNPs that have at least one allele that matched the pre-filter and scored above the threshold. Then we mapped SNPs to nearby genes using chromosomal positions and transcription start site (TSS) information from NCBI dbSNP and Genomes databases. In this screen for SNPs in putative AREs, we limited our search to the upstream 5-kilobase (kb) regions of genes (in other searches we have used larger windows). We found that 4,967 SNPs mapped to AREs in the upstream 5-kb regions of 3,733 putative ARE-regulated genes.
For large and representative collections of TF binding sites, the PWM scores are proportional to binding energies (21). We estimated the impact of a SNP on a transcription factor binding site by calculating the difference in PWM scores (ΔPWM) for the allele-pair of the candidate SNPs. We reasoned that a SNP pair with a larger ΔPWM, reflecting change in a conserved position, should have greater impact on DNA-protein interaction and function. After sorting 4,967 putative ARE SNPs on the positions in PWM, we found that ΔPWM scores were highest when SNPs occurred at positions 7, 9, 14, and 15 (Figure 3, solid line); positions with least variability. In total, there were 742 (15%) ARE SNPs that led to the maximal ΔPWM (5.0). It is interesting that the counts of ARE SNPs at the two fully conserved positions 7 and 14 were much more than the counts in other positions (Figure 3, dashed line).
Since TFBSs are short and degenerate sequences, TFBS prediction algorithms usually generate numerous hits, many of which may be false-positive binding sites. To increase the chance of finding “true” positive binding sites among our many putative ARE genes, and to identify potential loss of function SNPs, we used a comparative genomics approach called “phylogenetic footprinting”(22, 23) to identify AREs with evolutionary sequence conservation across human, mouse, and rat genomes. Functional regulatory sequences are expected to display high levels of similarity across species. We implemented this method in two steps. First, we detected putative AREs in the upstream 5-kb regions of human, mouse and rat orthologous genes using our TFBS search algorithm. Our genome-wide computational ARE search indicated that 35.6% human genes, 38.4% mouse genes, and 21% rat genes have one or more putative AREs within 5-kb of their transcription start site. After intersecting these three putative ARE gene sets on NCBI Homologene ID (11), we found 2388 orthologous genes have putative AREs in upstream 5-kb regions (Figure 4). Then, we selected the conserved AREs by examining multiple sequence alignments of the upstream 5-kb regions of human, mouse, and rat orthologous genes. In the example shown in Figure 5, there are four putative AREs predicted by our ARE PWM model in the upstream region of the human thioredoxin (TXN) gene. However, multiple sequence alignment indicated that only one human ARE (ARE1) had a perfect alignment with sequences from other species. In addition, the PWM scores for putative ARE sequences in this alignment were above threshold for all three species, suggesting the functionality of this conserved ARE. Indeed, this human ARE has been previously reported as biologically functional (24). As a proof of principle, we examined the conservation of all known functional human ARE sequences that were used to build our model. Among these 25 functional human AREs, we found that 17 (68%) human ARE sequences had perfect local alignment with mouse and rat sequences (defined as ‘conserved’). Furthermore, in 15 of 17 of these alignments from human, mouse and rat, all sequences had PWM scores above the minimum threshold (defined as ‘highly conserved’) (Table 1). This finding suggests that phylogenetic footprinting is an efficient method to enrich functional AREs from putative binding sites predicted solely by motif analysis. After searching in the aligned upstream 5-kb regions of 2388 orthologous genes, we found that 412 orthologous genes (including 15 known ARE genes) have 479 ‘highly conserved’ and 297 ‘conserved’ AREs, which are listed in the supplementary Table S2.
We previously determined and compared gene expression profiles of genes differentially regulated at baseline and in response to hyperoxia (95% oxygen), using mice with targeted disruption of Nrf2 (Nrf2−/−) and wild-type controls (Nrf2+/+) (25). Sixty-five genes were expressed constitutively higher (> 2 fold) in Nrf2+/+ mice than in Nrf2−/− mice, while 175 genes were significantly augmented (> 2 fold) by oxidative stress (hyperoxia) in Nrf2+/+ mice but not Nrf2−/− mice (Table 2 row 1). From this list of Nrf2-regulated mouse genes we identified 193 homologous human genes. In addition, genes identified in two similarly designed microarray studies using Nrf2-knockout mice treated with Nrf2-activating chemicals sulforaphane or tert-butylhydroquinone (Table 2, rows 2 and 3), and a microarray analysis of human neuroblastoma cells treated with tert-butylhydroquinone (Table 2 row 4) were evaluated. We found 103 additional genes for inclusion in our list of potential ARE target genes, supplementary Table S3 (total 296 genes). The majority of the up-regulated genes were associated with various metabolic reactions involved in oxidative stress and detoxication of electrophiles and free radicals. We then examined ARE SNPs in this group of potential ARE target genes.
Using the position weight matrix analysis and chromosomal position mapping we obtained a list of putative ARE SNPs which was further filtered by matching with genes containing evolutionarily conserved AREs and demonstrating expression level regulation by NRF2. This produced a small set of 52 putative ARE SNPs mapped to conserved AREs of 38 genes, shown in Table 3. Among these selected top candidates, 8 genes are considered as known ARE genes (with a large body of experimental evidence in the literature). Of note, these known ARE genes (such as HMOX1 and SOD1) have multiple AREs and ARE SNPs in their upstream regions, suggesting enhanced capability to recruit the transcription factor NRF2.
The functional categories of our top candidate genes were analyzed. The Gene Ontology terms of these top candidates were enriched using a web-based program GoMiner (26), we found that: at false discovery rate (FDR) 0.1 level, there were 10 GO terms enriched, including ‘response to stress’ (p=0.0000042), ‘response to xenobiotic stimulus’ (p=0.00020), ‘response to chemical stimulus’ (p=0.00027), ‘response to stimulus’ (p=0.000437), ‘response to abiotic stimulus’ (p=0.000687), ‘response to oxidative stress’ (p=0.00089), ‘xenobiotic metabolism’ (p=0.00016), ‘response to reactive oxygen species’ (p=0.0012), ‘oxygen and reactive oxygen species metabolism’ (p=0.0020), and ‘generation of precursor metabolites and energy’ (p=0.0032). Among them, the last 4 GO terms were at the lowest level in the gene ontology hierarchy, describing detailed gene functions. There were 55% (21/38) of our candidate genes belonging to these enriched functional categories. We also found a GO term, ‘regulation of transcription DNA-dependent’, which was enriched (p=0.028) but the FDR was above the threshold (0.1). There were 13% (5/38) of our candidate genes belonging to this functional category. The GO terms enriched within our candidate gene list were also enriched in the list of experimentally discovered ARE genes (supplementary table 1), suggesting the involvement of our candidate genes in human antioxidant response pathway.
Direct comparisons of bioinformatically identified TFBS SNPs with expression values determined in genotyped population samples is an efficient way to assess SNP function. Recently, cis-acting regulatory SNPs have been discovered using a related, regional association approach (27, 28). Using uniquely mapped, genotyped SNPs from the International HapMap Project (29) and microarray expression profiles from cells obtained from the 60 unrelated CEPH Utah (CEU) individuals, we tested the association between genotypes of putative ARE SNPs and gene expression phenotype. We determined that 13 of the 52 putative ARE SNPs identified by the conservation and expression profiles were genotyped in the HapMap project. Unfortunately, only three of them had minor allele frequencies (MAF) > 0.05 in the HapMap CEU cell lines, allowing inclusion in the association analysis. For these 3 ARE SNPs, we accessed expression levels against genotypes by linear regression. We found two SNPs were significantly associated with gene's basal level expression at a p < 0.05. The putative ARE SNP rs7130110, located in the -4732 upstream of human hemoglobin epsilon 1 (HBE1) gene, exhibited an association (p =0.046) with HBE1 expression level in 60 individuals (Figure 5). The presence of the “C” allele correlated with higher basal levels of HBE1 expression; where individuals genotyped as “CC”, “GC”, and “GG” had mean log2 expression values of 5.3, 4.8, and 4.1, respectively. The genotypes of the putative ARE SNP rs13040272 at -4435 of matrix metallopeptidase 9 (MMP9) gene exhibited an association (p =0.045) with basal expression levels, where individuals genotyped as “TT”, “CT”, and “CC” had mean log2 expression values of 7.1, 6.8, and 6.2, respectively.
The identification of sequence polymorphisms that regulate gene expression is important for understanding human variation in response to exposures (or environmental stimuli). SNPs in transcription factor binding sites may affect the binding of transcription factors, lead to differences in gene expression and phenotypes, and therefore affect response and susceptibility to environmentally induced disease (2-8, 30, 31). Computational methods for predicting the effects of genetic variation are increasingly important because of the rapidly growing catalog of SNPs. Our system for identifying functional SNPs in TFBSs and predicting their impact is intended as a component in a functional SNP discovery pipeline with the objective of identifying SNPs to be used in population studies that are likely to impact exposure susceptibility phenotype. We utilized the primary NCBI databases including dbSNP, Gene, Homologene, and Genome databases and accessed the UCSC comparative alignments for phylogenetic footprinting.
An important feature of our system is the method of searching dbSNP using a position weight matrix (PWM) based on a curated list of experimentally discovered AREs and computationally predicting the impact of a SNP on binding. The PWM is a probabilistic model to capture the nucleotide preference at each position of a TFBS. Assuming that the highest scoring sequences represent the strongest binding sequences, this model provides an approximation of relative DNA-protein interaction strength (32) and has been used to identify binding sites of p53 (33) and estrogen receptor (34). PWM score of a putative binding site has also been used to quantitatively assess to what degree a query DNA sequence fits the TFBS motif (30). If functional binding data such as that of Yamamoto (35, 36) or Nioi et al (19) could be incorporated into building an improved PWM model then scores might more accurately predict binding and function. We tested our PWM predictions against the binding data of Yamamoto et al (see Supplementary Table S4) and observed that their strong binding sequences produced high PWM scores. Further work is needed to support this relationship.
Given this caveat, we suggest that the difference in PWM scores between SNP alleles (or ΔPWM) could be a convenient way to predict the impact of a SNP in a TFBS. Indeed, a SNP (rs1799722) located in the upstream region of the BDKRD2 gene and identified in our search was demonstrated to impact binding in mobility shift assays (30). It is of great interest that Yamamoto et al (35) using mutagenesis of ARE-like sequences and surface plasmon resonance binding assays showed that single-nucleotide sequence changes could alter the binding specificity from a MafG:Nrf2 heterodimer to preference for a MafG:MafG homodimer. SNPs that alter binding specificity in this way could have a profound effect on response phenotype. Development of an high-throughput assay to quantitatively measure relative NRF2 binding with a large number of putative AREs is in progress and should be valuable for validating the ΔPWM model.
We have used phylogenetic footprinting to identify polymorphic ARE sequences that occur in evolutionarily conserved regions of the genome. While detailed knowledge about the relationship of sequence conservation to function is lacking for specific TFBSs, numerous investigations suggest that conserved binding sites are more likely to be functional (10, 11, 22, 23). Thus our approach should identify SNPs that result in loss of function. SNPs in poorly conserved AREs that might be “human specific”, or that are “gain of function” will have to be evaluated in other ways. We have recently undertaken a detailed analysis of sequence conservation for ARE sites and compared it with that of NF-KB, and p53 binding sites. We observed that ARE sequences are considerably more conserved than p53 binding sites and nearly as conserved as NF-KB binding sites (M. Horvath, M.A Resnick, D.A Bell, submitted). However, some highly inducible xenobiotic metabolism genes in mouse (e.g. Gsta2) with multiple, identical high-scoring AREs show no homology with the corresponding human gene, suggesting these genes may be under different selection pressure. As cross-species comparative studies of ARE-mediated inducibility are completed it will of great interest to see which are conserved and which show positive selection in specific species.
Our interest is in identifying SNPs in genes that are regulated by NRF2 and oxidative stress. To this end we have used microarray expression profiles in NRF2 model systems to confirm our selection of putative ARE SNPs in oxidant-responsive genes. This process resulted in identification of 52 SNPs in putative AREs (Table 3). When a subset of these SNPs were examined for genotype-expression association in human HapMap cell lines, we observed allelic expression for two—although only a small fraction of our candidate SNPs could be examined. This is not surprising because there are major numerical limitations with the HapMap data. First, HapMap genotyping data and expression were only available for 60 unrelated parents; secondly, useful frequency data were only available for about 25% of our top candidate SNPs; thirdly, and most importantly, microarray profiles used in the conservation-expression screening step were generated from oxidant-treated cells or animals, whereas the expression profiles used in the association analysis were from untreated human lymphoblast cell lines. Ideally, expression information from addition tissues and oxidant exposures could be evaluated. Because polymorphic variants that are known to affect expression are often found at population frequencies of less than 5% (2, 3, 37), issues of sample size become crucial in a survey like this one. We suggest that if expression profiles from a larger sample (n > 400) of oxidant-exposed human cells could be used in this analysis, combined with a higher resolution of HapMap genotyping data, this approach could be particularly effective.
For the purpose of identifying potential functional SNPs in TFBS, we have demonstrated how selection of SNPs can be informed by comparative and functional genomics and genotype/phenotype comparisons can be accomplished using genotyping data from the International HapMap Project and publicly available microarray gene expression datasets. Further evaluation of phenotypic impact of these polymorphisms will advance our understanding of the interaction between genetic predisposition and response environmental exposure.
The sequences of transcription factor NRF2 binding sites (also known as antioxidant response elements, AREs) were obtained from the NCBI literature database, PUBMED. These AREs were discovered by one or more experimental methods, such as deletion/mutational analysis, nuclease protection assay, and electrophoretic mobility shift assay, and their enhancer activity was confirmed by luciferase assay (supplementary Table S1).
The SNP sequences were downloaded from NCBI dbSNP build 126. For faster computation, we extracted 20 bases upstream and 20 bases downstream of each SNP. The upstream 5-kb sequences for human, mouse, and rat genes were extracted using the assembled chromosomal sequences and gene mapping data downloaded from the NCBI Genomes database at ftp.ncbi.nih.gov/genomes. The genome assemblies used were: human build 36.1, mouse build 36.1, and rat build 4.1. The NCBI HomoloGene database build 52 was used to find homology groups based on protein sequence similarity. The multiple alignments of 16 vertebrate genomes with human were downloaded from the University of California at Santa Cruz Genome Bioinformatics Site http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way/.
Given a set of experimentally defined AREs, a statistical model is essential for quantitatively representing the DNA-protein interaction and for predicting novel binding sites in genomic sequence. We implemented a widely used statistical model, position weight matrix (PWM), whose entries reflect the observed nucleotide frequencies at each position for a given set of binding sites. Briefly, all known binding sites were aligned and the total number of observations of each nucleotide for each position was counted and put into a position frequency matrix; then the position frequency matrix was converted to a position weight matrix, using formulae described by Wasserman and Sandelin (38), that converts normalized frequency values to logarithm values. Once a position weight matrix model has been compiled, the PWM score for any putative ARE is calculated by summing the individual matrix values that correspond to the observed nucleotide at each position in that site. We used PWM to search novel AREs in genome, and estimated the impact of a SNP on a TFBS by calculating the difference in PWM scores for the allele-pair. Allelic pairs resulting in a bigger score difference are predicted to more likely influence the binding of TFs.
The potential AREs in a genomic sequence were detected by sliding a window that has a length of the PWM along the input sequence. The lowest PWM score of the experimentally discovered functional AREs was used as the PWM score threshold (cutoff) to assess the putative functionality of an unknown sequence. Also, a special degenerate sequence (we called search pattern), composed of invariant nucleotides and ‘N’s for any other nucleotides, was used for fast searching genomic sequences. A putative TFBS was found only if the following two criteria were true: (1) the sub-sequence in the window matches the search pattern, and (2) the PWM score of the sub-sequence is above the PWM score threshold.
To improve selectivity of detection of functional SNPs, we analyzed AREs in human, mouse, and rat genomes using phylogenetic footprinting. First, we detected putative AREs in the upstream 5-kb regions of human, mouse and rat orthologous genes by our TFBS search algorithm. Orthologous genes were defined as in the NCBI HomoloGene database, which is a collection of homologs among the annotated genes of completely sequenced eukaryotic genomes. Using an approach similar to Sun et al (11) we selected genes with AREs detected in each species. Next we evaluated conservation at the sequence level by examining multiple sequence alignments of the upstream 5-kb regions for these orthologous genes (downloaded from University of California Santa Cruz Genome Bioinformatics site). Conserved AREs were defined as: (1) there must be sequences from human, mouse, and rat in the alignment, and (2) there is no gap in any aligned sequences. Conservation was further evaluated by PWM score. A binding site was considered to be “highly conserved” if all sequences in the alignment have PWM scores that were above the threshold; otherwise, the binding site was “conserved” if either mouse or rat sequence had a PWM lower than the threshold.
Using mice with targeted disruption of Nrf2 (Nrf2−/−) and wild-type controls (Nrf2+/+), we determined expression profiles of genes differentially regulated at baseline and in response to hyperoxia. The microarray study design and results has been reported in a separate publication (25). Briefly, total RNA was isolated from the left lung of each mouse, and cRNA was synthesized, and biotinylated. Each fragmented cRNA sample was hybridized to Affymetrix Murine Genome U74Av2 oligonucleotide array, and the signal intensity was collected. The expression data and other three published microarray studies listed in Table 2 were used to infer a set of NRF2-regulated genes in human by searching the HomoloGene database, and this gene set was used for prioritizing our candidate ARE SNPs.
The SNP genotyping data were obtained from the International HapMap Project website (http://www.hapmap.org). The data release 21 (July 20, 2006) contains 3.3 million SNPs genotyped in samples from 4 human populations. We used the genotyping data for the unrelated parents in 30 HapMap CEU trios for genotype-expression phenotype association analysis. Microarray expression profiles for 30 HapMap CEU trio samples were obtained by Dr. V.G. Cheung's lab. RNA was extracted from lymphoblastoid cells from each individual and hybridized onto Affymetrix Human Genome Focus arrays. Expression intensity was scaled to 500 using Affymetrix Microarray Suite 5.0 program and log2 transformed (28). For the association analysis, log2-transformed expression level of all the unrelated individuals in the HapMap CEU panel as a dependent variable was linear regressed on SNP genotype (coded 0, 1, and 2). Given a population, a SNP is considered as associated with a gene's expression level if expression values are linear regressed on SNP genotypes (i.e. the differential allelic expression is observed) with p < 0.05.
Using high-throughput GoMiner web interface, developed by National Cancer Institute (26), we analyzed Gene Ontology (GO) terms to organize our candidate genes into functional categories based on biological process, molecular function and subcellular localization. The GoMiner employs two-sided Fisher's exact to compare the list of interest genes and all genes in human genome to find the GO terms that are over-represented. The input files for GoMiner include the list of our interest genes and the list of all genes in human genome. All GO evidence codes were selected. The false discovery rate (FDR) threshold was set to 0.1, and the number of randomizations was set to 5000.
This work was funded by the intramural research program of the National Institute of Environmental Health Sciences, National Institutes of Health and grants to V. Cheung from the National Institute of Health. We acknowledge helpful comments from anonymous reviewers.
CONFLICTS OF INTEREST STATEMENT: The authors declare no conflicts of interest.