|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Neuronal activity regulates gene expression to control learning and memory, homeostasis of neuronal function, and pathological disease states such as epilepsy. A great deal of experimental evidence supports the involvement of two particular transcription factors in shaping the genomic response to neuronal activity and mediating plasticity: CREB and zif268 (egr-1, krox24, NGFI-A). The gene targets of these two transcription factors are of considerable interest, since they may help develop hypotheses about how neural activity is coupled to changes in neural function.
We have developed a computational approach for identifying binding sites for these transcription factors within the promoter regions of annotated genes in the mouse, rat, and human genomes. By combining a robust search algorithm to identify discrete binding sites, a comparison of targets across species, and an analysis of binding site locations within promoter regions, we have defined a group of candidate genes that are strong CREB- or zif268 targets and are thus regulated by neural activity. Our analysis revealed that CREB and zif268 share a disproportionate number of targets in common and that these common targets are dominated by transcription factors.
These observations may enable a more detailed understanding of the regulatory networks that are induced by neural activity and contribute to the plasticity transcriptome. The target genes identified in this study will be a valuable resource for investigators who hope to define the functions of specific genes that underlie activity-dependent changes in neuronal properties.
Transcription of new genes is initiated in the nervous system by both synaptic activity and action potential firing [1-3], and activity-dependent changes in gene expression are critical in epileptogenesis, brain injury, and learning and memory. For example, reducing CREB-dependent gene expression eliminate memory acquisition , and knock-out of the gene encoding zif268 (egr-1, krox24, NGFI-A) impairs memory-like processes such as LTP . Alterations in gene expression after seizure may lead to abnormal neural function and the development of epilepsy (reviewed by [6,7]). Furthermore, a number of inherited forms of mental retardation can be traced to defects in activity-dependent gene expression, such as Rubenstein-Taybi syndrome, related to a mutation in the CREB-binding protein, and Rett syndrome, tied to a defect in a DNA-binding protein that regulates the correct timing of expression of many downstream genes . Although some genes that are rapidly upregulated by neuronal activity have been identified, the direct and indirect targets of activity-dependent transcription factors remain of substantial interest in developing and constraining models for how neuronal function is altered by experience.
Various prior approaches have attempted to identify gene sets that underlie activity-dependent changes in neural function. cDNA microarrays have been frequently employed to characterize the plasticity transcriptome by identifying candidate genes that are upregulated after activity [9-15], typically through pharmacologically-induced seizure [16-21] or patterned sensory stimulation [22,23]. Alternate approaches include the use of chromatin-immunoprecipitation (ChIP) against specific transcription factors and determination of immunoprecipitated DNA binding sites by hybridization to microarrays or PCR analysis [24-28]. Both approaches, though informative, have important limitations. For example, current cDNA microarray analyses have typically identified genes that are show several-fold regulation (>1.5 up- or down-regulation), although in principle microarray studies can identify genes showing small changes in transcript levels given adequate numbers of redundant measurements. Differences in the timing and conditions of sample collection for microarray analysis can lead to heterogeneous results between investigators. In addition, microarray analysis typically does not reveal information about the transcriptional pathways by which genes are regulated in an activity-dependent manner. Although ChIP has been a useful approach for identifying candidate transcription factor binding sites, interpretation of this data can be complicated by tissue-specific occupancy of binding sites and the inherent heterogeneity of cell types in brain tissue . For example, ChIP cannot identify occupied sites if they occur in only a small subset of neurons. An alternate approach to identifying potential activity-regulated genes has been to isolate specific candidates and examine changes in the abundance of their protein or mRNA after manipulations of activity (see for example, [30,31]). However, this approach is obviously limited because it must proceed in a highly directed, case-by-case manner, and does not accommodate the discovery of novel or unlikely gene candidates.
Identification of activity-regulated genes could be improved by genomic screens that are unbiased by cell type, target gene preselection, or average expression level. A growing body of knowledge regarding transcriptional regulation of gene expression and consensus sequences for transcription factor binding, combined with annotated genome sequences, makes it possible to carry out a directed search for binding sites of specific activity-regulated transcription factors at the scale of an entire genome. Methods for locating transcription factor binding sites often rely upon relatively simple comparisons of a single sequences or consensus binding sites with individual promoter regions [32-34]. Although such searches can be productive, the use of a single consensus site is problematic because every nucleotide must be analyzed independently from the rest of the binding site and degeneracy can easily be over- or underestimated. This results in subjective target identification whose reliability is difficult to judge.
Here we develop and employ a rigorous computational method for scanning gene promoters for binding sites of two known activity-dependent transcription factors: CREB and zif268. These transcription factors were selected based upon a large body of data that links their activity to adaptive responses to neuronal activity, especially activity occurring during learning . For example, transduction of extracellular signals leads to CREB phosphorylation and assembly of the transcriptional apparatus, events that are necessary for consolidation of long-term memory . Overexpression of a constitutively active form of CREB enhances LTP in mice , and genetic manipulations that reduce CREB transcriptional activation result in impaired learning and memory [4,38]. Zif268 is an immediate-early gene (IEG), whose expression can be upregulated following increased neural activity and activation of intracellular signaling cascades  and is required for some forms of learning [5,40]. Because activity-dependent changes in gene transcription are linked to memory consolidation and also occur as a response to pathological conditions such as seizure [41-43], identification of the downstream targets of these transcription factors remains of considerable interest.
We performed an in silico analysis for full CREB and zif268 binding sites of promoter regions from all available annotated mouse genes. Likely binding sites were identified by applying position-specific scoring matrices (PSSMs) derived from previously characterized binding sites for CREB1 and zif268 confirmed by in vitro assays [44,45]. Individual binding site examples from specific genes were not used to develop the scoring matrix to provide consistency to the method used to develop the training algorithm without biasing it to a few anecdotal examples from individual genes. Because this analysis was specific for longer consensus binding sequences (i.e. not half-sites), it is more stringent and has a lower false positive rate than previous searches .
CREB sites have been demonstrated near both coding and non-coding regions of the genome and there is a growing body of data that supports an important regulatory role for CREB, and possibly other activity-dependent transcription factors, in non-coding regions (specifically for regulatory, noncoding mRNAs; ). Nonetheless, we restricted our search to promoter regions of annotated genes because locations of non-coding RNAs are currently not adequately annotated and genome-wide scans necessitate more stringent statistical tests than limited scans of promoters, and would result in a substantial increase in false negative rates.
As a further test of the validity of these transcription factor binding sites, we performed this search in parallel on all human as well as rat genes with annotated transcription start sites. In general, because the rat genome is less well-annotated that either mouse or human, most analysis was carried out on mouse and human candidate genes. Dual hits from both the mouse and human genomes were considered more likely to have an activity-dependent component to their regulation. Our results identify 516 candidate genes with conserved CREB or zif268 binding sites in both mouse and human homologous genes that may be regulated by activity, six of which were predicted to have more than one type of transcription factor binding site. These results provide an important resource in understanding the regulatory networks that control activity-dependent programs of gene expression.
We used a comparative genomics approach to identify genes likely to be regulated by neural activity. Position-specific scoring matrices were employed to create a search algorithm for DNA binding sites for CREB and zif268 in the promoters of annotated genes in order to define candidate genes from the plasticity transcriptome to guide and constrain future analysis. Gene lists were refined and evaluated by comparison of target frequencies across species and the presence of specific transcription factor binding sites in conserved homologous genes between human and mouse genomes. These gene lists allowed us to identify a significant subset of target genes that may be regulated in common by CREB and zif268. Finally, special attention was paid to the presence and relative frequency of genes with specific neural relevance amongst the candidate dataset, with an eye toward future experimental focus on the activity-dependant regulation of these genes.
Previous analyses of transcription factor binding sites have suffered from low degeneracy based upon comparison of a target sequence to a single overrepresented binding site or high degeneracy and a high false positive rate based upon inclusion of related IEG subfamily members with similar but distinct sequence specificities. When related transcription factors (such as zif268, egr-2, and egr-3 or c-fos/c-jun and FosB/JunD heterodimers) with overlapping binding site consensus sequences are pooled, binding site consensus is significantly relaxed and false positive rates are increased. Our method sought to reduce the rate of false positives by using a smaller number of experimentally verified, high-quality transcription factor binding sites using frequency matrices that have been experimentally developed for zif268 [45,48] and CREB [44,48]. A schematic of the consensus sequences used is shown in Fig. Fig.11.
An analogous search was attempted with AP-1, whose components, fos and jun, are also IEGs. The resulting predicted targets were not enriched for promoter regions and showed little conservation across species, facts we attribute to excessive degeneracy of the known AP-1 binding sequences used in training the computational methods. While we believe this negative result helps to reinforce the significance of the positive CREB and zif268 results, we have omitted further discussion of AP-1 in this report.
In order to identify conserved gene candidates subject to activity-dependent regulation, the entire database of mouse, rat, and human promoter sequences was searched for CREB and zif268 binding sites. In total, 18,071 mouse genes, 5,943 rat genes, and 19,794 human genes were examined (summarized in Table Table1;1; full gene lists can be found in Additional Files 1, 2, 3, 4). This represents all the annotated promoters from these species (approximately 50% of the estimated distinct mouse coding regions and two-thirds of the total estimated distinct human genes; [49-51]). CREB binding sites were predicted in 6% of mouse promoters, 7% of human promoters, and 11% of rat promoters (Fig. (Fig.2a).2a). Zif268 binding sites were predicted in 8% of mouse promoters, 6% of human promoters and 4.6% of rat promoters (Fig. (Fig.2b).2b). The precise sequence and location of binding site for each species is annotated in Additional Files 5, 6, 7. Because the mouse and human gene sets were more complete and representative of the total number of coding regions within those species than found in the rat dataset, we will primarily refer to the human and mouse searches from this point onward.
As we were interested in using comparative genomics to improve hit quality, we also examined those genes with homologues in the mouse and human datasets, a total of 13,365 genes (homologene dataset; ). In general, the frequencies of binding sites predicted for genes in the homologene dataset for either species in isolation were similar to those found for non-homologous genes in the same genomes (Table (Table1),1), suggesting that the homologous genes are a representative set of genes with respect to binding site predictions. We then examined the frequencies of binding site predictions conserved across both members of a gene pair. The number of hits from this search was lower than observed for individual genomes: in the set of mouse-human homologues, 2.66% have a conserved CREB site and 1.24% have a conserved zif268 site (Fig. (Fig.2c2c).
To develop a rigorous estimate of the quality of identified targets, we applied two methods for calculating their positive predictive value, defined as the probability a predicted hit is correct (see Methods for a complete description). This measure provides a conservative estimate of the percentage of identified transcription factor binding sites that we expect to be functional binding sites. For this purpose, we examined binding site location and cross-species conservation.
Fig. Fig.33 shows an analysis of position specificity of binding site predictions within the promoter regions. A pronounced peak in the relative frequency of binding site locations was observed for both CREB (Fig. (Fig.3a;3a; see also ) and zif268 (Fig. (Fig.3b).3b). Both binding sites were more than five times more likely to occur in the 50 bp closest to the annotated start site compared to more distant or intergenic sequence. This frequency increase was observed in analysis of the mouse and human gene datasets, as well as in the conserved mouse and conserved human gene datasets (Fig. (Fig.3c3c and and3d).3d). Furthermore, the low frequency of binding sites in regions most distal to the annotated start site (i.e. 800 bp upstream) was comparable to the frequency in intergenic regions (see Methods), suggesting that our selection of a 1,000 bp region surrounding transcription start was sufficient to capture most of the meaningful binding sites that might regulate gene expression. Using intergenic (i.e. 50,000 nt upstream of the annotated start site) hit frequencies to estimate false positive rate, we derived a positive predictive value of 0.64 for CREB hits in human and mouse promoters. Zif268 had a slightly higher positive predictive value of 0.81 in mouse and 0.74 in human (Table (Table1).1). The full set of predicted binding site locations is provided in Additional Files 1, 2, 3, which present the gene, location relative to transcription start, and sequence of each putative binding site for the mouse, human, and rat genomes, respectively.
Comparative genomics was used to provide a second method of estimating the positive predictive value. In the set of mouse-human homologues, 2.7% have a conserved CREB site and 1.2% have a conserved zif268 site. We proposed that the set of homologue pairs with conserved binding sites would provide a conservative set of true positive hits. The positive predictive value of the hits could then be estimated from the excess of conserved hits beyond what would be predicted from the hit rates in the individual genomes on the assumption of independence between genomes (see Methods). In contrast to the results using location specificity, the conservation approach yielded higher estimates of positive predictive value for CREB (0.83) than for zif268 (0.56). This difference is most likely to due changes in zif268 targets or promoter GC content between species. We can further validate these results by measuring conservation in binding site positions for the conserved hits. Comparing the positions of the conserved predicted sites between mouse and human yields correlation coefficients of 0.18 for CREB (p-value < 0.001) and 0.21 for zif268 (p-value < 0.01). There is considerable noise in location specificity, which might be explained by insertions or deletions in the promoters since the separation of human and mouse lineages or by errors in annotated transcription start sites in either species. The positions nonetheless show significantly greater correspondence than can be explained by chance.
Prior analyses of activity-dependent gene expression have not addressed whether the transcription factors selected for analysis regulate separate or partially overlapping target genes. We reasoned that genes containing binding sites for both transcription factors might be more likely to have an essential role in mediating activity-dependent changes in gene expression. To examine this, we looked at the relative frequencies of genes with both CREB and zif268 binding sites in the mouse, human, and rat gene datasets compared to the frequencies of genes with either site in isolation. Target genes were generally non-overlapping, suggesting that each transcription factor may regulate a distinct subset of genes, a possibility that might allow a cell greater combinatorial control over stimulus-specific transcriptional programs.
Intriguingly, however, the amount of overlap between zif268 and CREB targets was greater than would be expected by chance, suggesting that at least a subset of target genes may be coregulated by the two transcription factors analyzed. Statistical analysis showed that the predicted binding sites shared more targets in common than would be expected by chance for both mouse and human genomes, with the results weakly significant for mouse (p < 0.05), but not significant for human (p = 0.18).
Only six genes with conserved binding sites for both transcription factors were found. A significant fraction of these common targets were transcription factors. FosB (Fig. (Fig.4a),4a), Jund1 (Fig. (Fig.4b),4b), and Maff (Fig. (Fig.4c)4c) are all members of the AP-1 family of transcription factors. The Skil (Fig. (Fig.4d)4d) transcription factor is a member of the SKI/SNO/DAC family which are known to associate with AP-1 under some conditions . The observation that a specific group of transcription factors can be regulated by both CREB and zif268 implicate these genes in transcriptional networks of activity-regulated gene expression.
We sought to characterize the functional properties of the derived gene set through an unbiased computational search for functional gene classes significantly over- or under-represented in our hit set. We conducted this analysis by applying the GOstat web resource  to the set of CREB and zif268 conserved and species-specific targets. Because there were relatively few conserved targets in the homologene datset, GOstat analysis found few hits, such as RNA processing and localization, only for CREB target genes.
Because of the increased statistical power of using more data (versus higher quality data), we chose to further this analysis using the species-specific gene target lists. We chose to present the mouse target list because the comparative data available has primarily been carried out in rodents. As with the conserved gene list, genes with CREB consensus sites showed significant overrepresentation for targets involved in RNA processing, but were underrepresented in electrophysiologically important transmembrane receptor targets (Additional File 8). However, the few receptor and channel targets identified in this analysis may have critical functional importance. For example, seizure-dependent changes in expression of three targets, Kcnk1, Kcnmb4 and Hcn2, have been found , and these two genes represent targets for neuroprotective or anticonvulsant agents. Targets of zif268 are also underrepresented for transmembrane receptors, but are overrepresented for transcription factors as well as genes with neural-specific functions such as neuron development and axonogenesis. A full list of these targets can be found in Additional File 9.
Many of the gene targets identified in this analysis have been identified in previous studies. Published experimental data for the number of putative CREB and zif268 targets suggests that there may be hundreds of specific genes that carry these binding sites (see for example, [56,57], but much of this data is indirect. We thus surveyed the literature for examples of CREB and zif268 target regulation, using direct evidence for CREB or zif268 binding or consensus sequence identification. A comparison of some prominent candidates from this study with a literature-derived set of known CREB and zif268 targets is shown in Table Table2.2. Strong experimental support for many genes that were identified in the present analysis was found, such as somatostatin , tyrosine hydroxylase [59,60], and synapsin II . Because target regulation has been shown to depend critically on brain area, developmental stage, and even strain differences between mice and negative results are thus uniformative, we did not choose to select a random subset of genes to experimentally pursue as part of this purely computational analysis.
Although there are few well-documented differences between the transcriptional regulation of homologous genes in mouse and humans, our analysis was successful at identifying one of the few well-characterized mutations in a CREB binding site between a mouse and human gene that influence transcription, the glycoprotein hormone alpha subunit . This gene carries an identified CREB site in human, and a single gene mutation in the CREB binding site in mouse abolishes placental expression of this gene. Accordingly, this gene was identified as a human but not a mouse CREB target gene.
Several experimental studies have sought to characterize CREB or zif268 targets by searching for genes upregulated after transcription factor activation. We compare the genes isolated in a subset of these studies to the targets identified here. Microarray results of regulated genes after 1–5 weeks of hippocampal overexpression of a constitutively active form of CREB  were compared to the results from our search, using a program that takes into account relative changes in transcript levels . We found that predicted CREB targets from our search were overrepresented in that microarray dataset, but by a statistically insignificant amount. An analysis of CREB targets from a similar computational search showed slightly less enrichment in this microarray dataset, but this enrichment was also not significant. The lack of significance in overlap is likely due to the extended duration of VP-16 CREB overexpression, which complicates an interpretation of this result, as well as the fact that we searched specifically for direct targets of CREB while a microarray study would be expected to capture both direct and downstream targets.
Comparison of our work to the results of another experimental study that used chromatin immunoprecipitation followed by PCR amplification of CREB-associated DNA  revealed minimal overlap of genes (Table (Table3).3). This lack of overlap may in part be due to tissue specific CREB-site occupancy  and the small number of target genes compared. It may also derive from the fact that differences in annotation made exhaustive comparison of the two gene sets difficult; rather our analysis was based on a non-exhaustive set of results reported in the body of the paper by Impey et al. . It is notable that even the study by Impey et al. failed to identify some well-established CREB target genes, such as somatostatin, indicating that neither our computational nor their ChIP-based gene lists are likely to be exhaustive.
We also compared overlap between zif268 targets identified in our computational analysis and a recent zif268 overexpression study . We observed an extremely low rate of overlap between these two studies, where only 6.7% (9/135) regulated genes were represented in either the mouse or human zif268 target lists generated in our study (Table (Table3).3). Only one gene, the small glutamine-rich tetratricopeptide, showed a conserved zif268 site in both mouse and human as well as regulated mRNA levels after zif268 overexpression. The relatively low level of target overlap in this and other studies may be due to the duration of transcription factor overexpression (days) and ensuing cascades of altered gene expression (i.e., not all targets are directly regulated by the overexpressed transcription factor). Our decision to favor a high-specificity gene list at the cost of lower sensitivity was likely also a factor. While the exact criterion used by James et al. study  to define target sites was unclear, likely differences in the target specificity/sensitivity trade-off would be expected to yield poor correspondence between the data sets.
A similar comparative genomics analysis was carried out by Conkright et al., using a hidden Markov search algorithm trained on ten well-characterized CREB binding sites . Of 78 conserved CREB targets identified by Conkright et al. and 356 identified by our study, 25 were common to both studies (Table (Table2).2). Although this is a strong overlap relative to what one would expect by chance, it is nonetheless curious that it was not higher. We attribute this fact to likely assignment errors in both sets as well as likely differences in annotation. We believe our study is likely to have yielded a higher quality set of predicted binding sites based on the fact that we have access to more recent genome annotations, search in a more tightly focused region (1.2 kb versus 10 kb), search relative to transcription rather than translation start, and use a prediction algorithm that would screen out some possible spurious predictions likely with a hidden Markov model approach. This methodological argument that our gene set is closer to the ground truth is supported by the fact that while the two studies predict comparable numbers of CREB sites in mouse and human individually (1050 mouse and 1389 human for the present study and 1349 mouse and 1663 human for Conkright et al.) our predicted sites were validated by cross-species conservation at a rate several-fold higher (356 validated versus 78). The similarities of our gene set to that of Conkright et al. thus provide good validation that both approaches find meaningful gene sets, but the deviations do not challenge the accuracy of our set.
We have applied computational analyses to identify candidate genes regulated by neural activity based upon the presence of CREB and zif268 binding sites within their promoters. This work combined sequence-based motif finding methods with an analysis of homology, binding site co-occurrence, and binding site location to estimate and improve prediction accuracy. Because the consensus sites used for analysis were not derived from a possibly unrepresentative subset of specific genes but rather from experimentally determined binding motifs [44,45], we believe that the gene lists presented here are uniquely unbiased. The generated candidate gene lists provide potential targets for future experimental validation and may also be useful for interpretation of microarray data and inference of gene regulatory networks. This work has also revealed a pronounced location-specificity of high-quality CREB and zif268 binding sites, an observation that may be a diagnostic criterion for the detection of binding sites near poorly-annotated non-coding regions as well.
The principal goal of this work was the generation of a computational resource identifying likely targets of activity-dependent regulation to help guide future experimental study. Unlike previous experimental studies, which identified both direct and indirect targets using microarray analysis of regulated genes following overexpression of activated CREB or zif268 [37,56], this study specifically identified high-quality, direct transcriptional targets of CREB and zif268. Based on our comparative genomic analysis, we believe that our list of predicted targets based on conserved binding site predictions has a very low false positive rate. Although experimental support for some targets was not observed, this is hardly surprising given that site occupancy has been shown to vary according to tissue type [56,65-67] and overexpression of zif268 has been associated with repression of genes carrying zif268 sites within their promoter regions . Omission of real candidate genes may have occurred due to 1) the high stringency of the search we carried out, 2) poor or incomplete annotation of transcription start sites or multiple transcription start sites for a single gene, and 3) real differences in the regulation of mouse and human genes. It is also important to note that actual CREB binding may be influenced by chromatin superstructure as well as the presence of additional regulatory factors, variables that were not examined in the biochemical studies that provided the sequence matrices for CREB and zif268 that we used here. However, several lines of evidence confirm that this approach identifies a gene set that includes expected CREB and zif268 targets, shows activity-dependent expression changes and is significantly enriched for several specific functional classes of proteins consistent with a role in activity-dependent regulation.
The work also suggests several avenues for improvement of computational approaches to identifying targets of transcriptional regulation. One of the most striking findings from this study was the pronounced location specificity to CREB and zif268 targets within annotated gene promoters. The frequency of CREB and zif268 sites was greatest within the 50 bp closest to the annotated start site, and dropped to baseline (i.e. intergenic) frequencies after 600 bp. This finding suggests that the functionality of these sites is greatest at proximal locations within the promoter, a conclusion that is further supported by the high conservation of location specificity between species. Our results strongly support the use of homology methods for improving specificity of binding site prediction , as well as the use of binding site co-occurrence for the same purpose [69,70]. Our inability to replicate these successes with AP-1, though, highlights the necessity of a strong set of experimental data in these computational approaches.
Finally, this study has also led to some intriguing findings about the specific likely targets of CREB and zif268. Although these transcription factors are not exclusively expressed in the CNS – indeed, they are present in many other cell types at different developmental stages and regulate the transcription of gene targets that are not specifically neural – the essential role that they play in neural plasticity was an important consideration in motivating this study and the targets identified will be of strong interest to neurobiologists.
Significantly, the study revealed a significant convergence of targets containing both CREB and zif268 binding sites. Among these putative CREB and zif268 co-regulated genes, the set conserved across mouse and human included several known transcription factors and transcription regulatory elements, most associated with AP-1 regulation. CREB and zif268 may represent the top level of a regulatory network implicated in neural function, with expression of the functional proteins primarily controlled by an intervening network of other regulatory factors. Indeed, it is clear from other experimental data as well as our own that additional waves of transcriptional regulation and distinct programs of gene expression will follow the activation of this initial cascade . Furthermore, neural activity may also induce various forms of non-transcriptional regulation, including alternate splicing, accelerated degradation, or altered intracellular trafficking that the present analysis did not address. Thus, the target set presented here is likely only a beginning towards characterizing the full complement of genes induced by neural activity. The problem of defining the plasticity transcriptome is thus likely to remain an exciting challenge for computational and experimental researchers for the foreseeable future.
Computational identification of putative targets of CREB and zif268 regulation has identified a set of likely direct targets of activity-dependent regulation that avoids biases inherent in current experimental methods for characterizing such sets. In addition to providing a candidate gene set for future analysis, the study has revealed a pronounced location specificity and bias for co-occurrence particularly in promoters of other transcription factors, which will be useful for improving detection algorithms and more completely characterizing the regulatory networks underlying activity-dependent gene expression.
We compiled a database of gene promoter regions using sequences from mouse build mm6 , rat build rn3 , and human build hg17 [49,51] of the UCSC Genome Bioinformatics Resource . Transcription start sites for these promoters organized by mRNA accession number were found in the table "knownGene.txt" for each build. Where promoter regions were reported within 50 bp of each other, only the one earlier on the chromosome was used, as the copies were presumed to be duplicates of the same promoter region (derived from otherwise identical mRNAs of different lengths). Incomplete promoters with missing sequence data were also removed from the analysis. Annotated promoters included both TATA-box containing and TATA-less genes. The full promoter list was annotated with gene name, symbol, and accession number using the NCBI gene resources [52,74]. In total, 18,071 mouse promoters, 19,794 human promoters, and 5,943 rat promoters were analyzed (Tables (Tables1,1, Additional files 1, 2, 3).
When searching for candidate genes, we defined a putative promoter to be the genetic sequence from -1,000 bp to +200 bp of each transcription start for human, mouse, and rat genes. A set of intergenic sequences was also compiled for human and mouse to construct a "random" control dataset of 1,200 bp sequences using the regions from -51,200 bp to -50,000 bp relative to each transcription start site, where the transcription factors CREB and zif268 are not likely to have regulatory function. Due to a decrease in sequence quality further away from transcription start, distal sequence regions were available for only 77% of total genes, leaving 13,475 mouse intergenic regions and 15,178 human intergenic regions far analysis (Table (Table1).1). In order to confirm location-specificity trends inferred from the 1,200 bp regions, an additional search was run for each gene on an extended promoter region (-6,000 bp to +200 bp). We saw no significant difference in the region between -1,000 bp and -6,000 bp compared to the -51,200 bp to -50,000 bp region, suggesting that the initial search had identified the majority of sites with likely function.
The Homologene database provided us with human and mouse homologous pairs based on gene accession number , yielding 13,365 homologous gene pairs (Table (Table1,1, Additional File 4). A binding site prediction was defined as conserved if the same binding site type was predicted in the promoters of both homologous genes, without regard for position in the promoter.
The goal of this search was to identify a transcription factor binding site compared to its background. This is the case when the probability that it is a binding site is greater than the probability that the sequence would be observed by chance.
P(Model) > P(Background)
log(P(Model)) > log(P(Background))
log(P(Model)) - log(P(Background)) > 0
Position specific scoring matrices (PSSMs) or position weight matrices (PWMs), are a well-established method of motif finding [75,76]. We used a variant of them to find the log probability of a sequence being a part of the model. These methods are similar to the transcription factor binding site search available through the database of transcription start sites . Binding site frequency matrices for CREB and zif268 were obtained from the Transfac [48,78] public database (see Fig. Fig.1).1). These matrices give the frequency of each nucleotide in each position of the binding site. Scoring matrices for the present study were created from the Transfac frequency matrices with the following equation:
where S is the scoring matrix, A is the frequency matrix, n is the nucleotide, p is the position within each binding site. The pseudocount, b, is set at the relatively small value of 0.25 to allow limited tolerance of base-pairs which have never been observed in a given position for a binding site. When comparing this scoring matrix to a sequence of the same size, adding the scores for the nucleotide that is at that same position in the sequence gives you the log of probability that the sequence matches the model.
The probability that a sequence is not a binding site is based on background dinucleotide frequencies. For each individual species, we went through all promoters and calculated the probability of each dinucleotide transition. For instance, . We also calculated the probability of observing each nucleotide individually. The probability of observing any sequence can be calculated from those probabilities by multiplying the probability of the first nucleotide by the probability of each nucleotide transition. The log probability can be found by adding the log of each probability.
The goal of this study was to create a comprehensive list of possible transcription factor targets. The log of the sequence length is often subtracted to correct for the number of possible sites being searched. While subtracting by the full log of the sequence length would provide a more rigorous control, we deliberately chose to increase the sensitivity of the method at the expense of specificity. This increases the number of targets found while decreasing the average quality of the target genes, allowing us to better take advantage of homology to eliminate false positives. Instead of subtracting the log of the sequence length, we subtracted a smaller "correction value". The larger the correction value, the more stringent the search is. To analyze the effects of decreasing specificity on the quality of the target gene set, we use a measure called the positive predicted value. Intuitively, this measure is the probability that a predicted site is a true positive. The positive predictive value is defined as .
In terms of our data, the positive predictive value is
The expected number of sites is the number of sites expected to be conserved if there was no association between a binding site existing in mouse and its human homologue. A series of possible correction values are plotted against the positive predicted value (Additional File 10). Because it is the point at which CREB and zif268 positive predictive values plateau, we chose to use a correction value of 300. The final positive predictive value based on comparative genomics is found in Table Table11 under "Homologues." A binding site is considered a hit if the final calculated score is above zero. The equation used to determine the final score is given below:
Score = log(P(Model)) - log(P(Background)) - log(correction_value)
The positive predictive values for the individual species is calculated by comparing the promoter region to the intergenic region (see Table Table1).1). It is still calculated as , but the observed sites is now the percentage of promoter targets with a binding site and the expected sites is the percentage of intergenic regions with that binding site.
Associations between co-occurring binding sites were analyzed by applying a 2-tailed Fisher's exact test , using a web-based calculator , to the 2 by 2 contingency table of counts of occurrence of either, both, or neither site.
Analysis of the function of the activity-dependent transcription factor targets was done using GOstat , an online tool for finding overrepresented ontologies in a set of genes . The list of targets for each transcription factor binding site and species was searched against the entire list of promoters for overrepresentation of different gene ontology classes.
The VP16-CREB expression data  was obtained from the Gene Expression Omnibus (Barrett), record GSE3965, on the NCBI website. The SOFT data files were converted to a single matrix using the GEOquery as part of the Bioconductor  package for R. We separated out the expression data into two groups: a control group where CREB is expressed at normal levels (On dox, rev, wt) and an experimental group where VP16-CREB has been active from one to five weeks (1w, 2w, 5w) . Only experiments where the entire hippocampus is dissected are used, not micro-dissected CA1 regions of the hippocampus, where region-specific gene expression could bias the results. The data was analyzed using the GSEA software package for Windows . The change in gene expression levels between the control and the super active CREB were determined by a signal-to-noise metric. We created genesets out of our conserved CREB targets and the conserved CREB targets identified by Conkright et al. . Symbols present on these lists but with no corresponding microarray probe are ignored. The expression levels corresponding to the symbols in the genesets are queried for enrichment in the control versus the experimental microarray datasets using the methods described in Subramanian et al. A list of genes identified as the leading edge subset , which are genes that contribute to the enrichment of the CREB targets in the microarray data, are listed as the overlap between our dataset and VP16-CREB dataset .
ChIP, chromatin immunoprecipitation; IEG, intermediate early gene; LTP, long term potentiation; PCR, polymerase chain reaction; PSSM, position-specific scoring matrix; PWM, position weight matrix
ARP implemented all computational tools developed for this study and performed all computational and statistical analyses. ALB conceived the project and advised ARP on experimental design, data sets, and analysis. RS participated in the design of the study and advised ARP on computational and statistical issues. All authors read and approved the final manuscript.
Table of all identified CREB and zif268 target genes in mouse. The file includes all mouse genes that were searched indicating the number of identified CREB or zif268 target sites in each gene. Also noted is the score for each site, the mRNA, the gene ID, and a gene description.
Table of all identified CREB and zif268 target genes in human. The file includes all human genes that were searched indicating the number of identified CREB or zif268 target sites in each gene. Also noted is the score for each site, the mRNA, the gene ID, and a gene description.
Table of all identified CREB and zif268 target genes in rat. The file includes all rat genes that were searched indicating the number of identified CREB or zif268 target sites in each gene. Also noted is the score for each site, the mRNA, the gene ID, and a gene description.
Table of all identified CREB and zif268 target genes in the mouse-human homologene dataset. The file includes all mouse-human homologous genes that were searched, indicating the number of identified CREB or zif268 target sites in each gene. Also noted is the gene ID and a gene description.
Position and sequence of CREB and zif268 binding sites in mouse promoter regions. Mouse genes that had one or more CREB and zif268 binding sites are listed in table format, with the precise nucleotide sequence corresponding to the binding site and the relative position of this sequence within the promoter region indicated. Genes are identified by gene ID, symbol, and gene descriptor.
Position and sequence of CREB and zif268 binding sites in human promoter regions. Human genes that had one or more CREB and zif268 binding sites are listed in table format, with the precise nucleotide sequence corresponding to the binding site and the relative position of this sequence within the promoter region indicated. Genes are identified by gene ID, symbol, and gene descriptor.
Position and sequence of CREB and zif268 binding sites in rat promoter regions. Rat genes that had one or more CREB and zif268 binding sites are listed in table format, with the precise nucleotide sequence corresponding to the binding site and the relative position of this sequence within the promoter region indicated. Genes are identified by gene ID, symbol, and gene descriptor.
Gene functions associated with CREB and zif268 targets. A list of the broad gene ontology classifications where CREB or zif268 targets showed an over- or underrepresentation, according to the GOstat ontology classifier.
CREB or zif268 target genes in overrepresented gene ontology categories. The specific genes that were placed in over- or underrepresented gene onotologies using the GOstat resource are listed according to ontology classification.
Comparative genomics as a metric for transcription factor targets quality. Transcription factor binding site searches were done with varying correction scores that correspond to the specificity of the search. The log of the specificity is subtracted from every subsequence scored by the program to correct for sequence length (see Methods). A higher specificity means a smaller number of higher quality binding sites are used. The predicted fraction of true positives or positive predictive value is defined as (true positives)/(true positives + false positives). This measure is estimated as (observed sites - expected sites)/(observed sites). The observed sites are the targets verified by comparative genomics while the expected sites are the number of binding sites one would find by chance if comparing independent human/mouse datasets.
This work was supported by Merck Computational Biology and Chemistry Summer Program Fellowship (ARP), the Carnegie Mellon Undergraduate Research Initiative (ARP), the HHMI Summer Undergraduate Research Program (ARP), the Alfred P. Sloan Foundation (ALB), and U.S. National Science Foundation Award DBI-0346981 (RS). We thank Savina Imrhan, Ashley Hurt, and Joseph Mitchie for assistance with database searches, Sayan Mukherjee for help with analyzing microarray data, and Christopher Burge for helpful comments on this work.