|Home | About | Journals | Submit | Contact Us | Français|
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org
Targeted transcript profiling studies can identify sets of co-expressed genes; however, identification of the underlying functional mechanism(s) is a significant challenge. Established methods for the analysis of gene annotations, particularly those based on the Gene Ontology, can identify functional linkages between genes. Similar methods for the identification of over-represented transcription factor binding sites (TFBSs) have been successful in yeast, but extension to human genomics has largely proved ineffective. Creation of a system for the efficient identification of common regulatory mechanisms in a subset of co-expressed human genes promises to break a roadblock in functional genomics research. We have developed an integrated system that searches for evidence of co-regulation by one or more transcription factors (TFs). oPOSSUM combines a pre-computed database of conserved TFBSs in human and mouse promoters with statistical methods for identification of sites over-represented in a set of co-expressed genes. The algorithm successfully identified mediating TFs in control sets of tissue-specific genes and in sets of co-expressed genes from three transcript profiling studies. Simulation studies indicate that oPOSSUM produces few false positives using empirically defined thresholds and can tolerate up to 50% noise in a set of co-expressed genes.
DNA microarrays profile patterns of gene expression changes on a genome-wide scale, elucidating sets of genes coordinately expressed under specific conditions. Recent improvements in bioinformatics methods for the analysis of sequences regulating transcription have made it possible to elucidate potential factors involved in key regulatory networks underlying a transcriptional response. The enumeration of such networks, by identifying genes with similar patterns of expression and shared cis-regulatory motifs, is crucial to advancing our understanding of biological pathways and processes.
Transcriptional regulation of gene expression is a tightly controlled process that involves the synchronized binding of trans-acting transcription factors (TFs) to numerous binding sites in the regions surrounding a gene's transcription start site (TSS), as well as to enhancer regions that mediate gene activation from distal locations. The binding specificities of TFs to their cognate DNA binding motifs are typically modeled using position specific scoring matrices (PSSMs) (1), which are constructed from alignments of binding site sequences that have been characterized experimentally or identified in high-throughput protein–DNA binding assays (2,3). These PSSMs are catalogued in databases such as TRANSFAC (4) and JASPAR (5). The use of PSSMs to detect individual transcription factor binding sites (TFBSs) is well-established (6). However, application of these models typically yields a large number of false positive predictions due to the short, degenerate nature of TFBS motifs. For example, a 6 bp long motif has ~1 in 4000 chance of occurring at random; while tolerance of ambiguity at just one highly variable position can raise the prediction rate to 1 in 1000.
Dramatic improvements in the specificity of TFBS prediction are attained by limiting the search space to regions of conserved, non-coding DNA using a comparative genomics approach known as phylogenetic footprinting (7–10). Based on the assumption that functional DNA sequences are subject to greater selective pressure, and therefore, are conserved across moderately diverged organisms, comparison of sequences from orthologous genes can highlight functional, non-coding DNA, providing clues to where regulatory sequences may be located. Phylogenetic footprinting eliminates, on average, 80% of sequence (11), and estimates have placed the proportion of TFBSs occurring within conserved regions when comparing human and mouse sequences at ~70% (11,12). Thus, while the use of phylogenetic footprinting limits our ability to detect binding sites that have evolved in a species-specific manner, the drastic reduction in noise increases specificity, and far outweighs the decrease in sensitivity.
Even with the improved performance conferred by phylogenetic footprinting, most predicted TFBSs are non-functional. By incorporating gene expression data into the analysis procedures, we should improve capacity to discriminate functional binding sites from potential false positive matches. This paper describes a new method, oPOSSUM, which identifies statistically over-represented, conserved TFBSs in the promoters of co-expressed genes. Based on the assumption that some subset of the co-expressed genes is co-regulated by one or more common TFs, we reason that the observed number of binding sites for those TFs should be greater than would be expected by chance. oPOSSUM integrates a pre-computed database of predicted, conserved TFBSs, derived from phylogenetic footprinting and TFBS detection algorithms, with statistical methods for calculating over-representation (Figure 1).
The oPOSSUM system was validated using curated regulatory region collections for genes expressed in a tissue-specific manner, and published targets of the nuclear factor NF-κB. oPOSSUM was then applied to two published transcript profiling data sets, as well as to a new analysis of expression addressing NF-κB inhibition. The results demonstrate that oPOSSUM is able to identify the TFs expected to mediate changes in gene expression through the detection of over-represented TFBSs. Simulations using random sampling gave low false positive rates and revealed tolerance for some noise in the gene sets.
The Ensembl software system (13) provides a flexible bioinformatics framework to retrieve sequences and annotations for genes from multiple organisms. Sets of orthologous human and mouse genes are available via EnsMART, a computationally convenient interface to genome annotations. To avoid aligning paralogs (genes which have diverged due to gene duplication in a common ancestor), human genes mapping to more than one mouse gene (and vice versa) are filtered to obtain a set of one-to-one orthologous pairs. For each human–mouse orthologous pair, repeat masked sequences are retrieved encompassing the region 5000 bp upstream of the annotated TSS to either the 3′ end of the gene or, in the case of long genes, 50000 bp downstream of the TSS. For genes with multiple annotated TSSs, the 5′-most TSS is selected.
Orthologous sequences are aligned using ORCA (D.J. Arenillas and W.W. Wasserman, manuscript under preparation.), a pairwise global progressive alignment algorithm similar to LAGAN (14). ORCA first identifies short segments of high similarity between orthologous genes by performing a local BLASTN alignment using the Bl2Seq algorithm (Version 2.2.5) (15), and then aligns the regions between such segments through the more time-consuming Needleman–Wunsch algorithm (NW) (16) to obtain an overall global alignment of the two sequences. The process is recursive; regions that are too long to align using NW are re-aligned with BLASTN using less stringent parameters. The process comes to a halt when either the regions are short enough to perform NW successfully (the product of the input sequence lengths does not exceed 100 Mb), or the minimum BLASTN word size of seven has been reached. The first iteration of BLASTN was performed using the following parameters: penalty for a nucleotide mismatch = −7; expectation value = 0.10; word size = 15; default values were used for the remaining parameters. For each subsequent run of BLASTN, the nucleotide mismatch score was incremented by two and the word size was decremented by four. NW global alignments used a match score = 3; mismatch score = −1; gap open penalty = 20; and gap extension penalty = 0.
Three dynamically selected and progressively more stringent conservation thresholds are applied. Specifically, each alignment is scanned using a 100 bp sliding window, the percent sequence identity within each window is calculated, and the top 10, 20 and 30% of all windows (excluding those overlapping a coding region) are retained. Minimum identity thresholds of 70, 65 and 60% are required for the high, medium and low conservation levels, respectively. The use of dynamically computed thresholds versus fixed sequence identity cutoffs is motivated by the variable rates of evolution for each gene in each genome.
The conserved non-coding regions of the promoters are searched for matches to all TFBS profiles in the JASPAR database with information content >8 bits, using the TFBS suite of Perl modules for regulatory sequence analysis (17). Excluding low information content profiles (a measure of the specificity of predictions) eliminates spurious hits. A predicted binding site for a given TF model is reported if the site occurs in both the human and mouse sequences above a threshold PSSM score of 75%, and at equivalent positions in the alignment. Overlapping sites for the same TF are filtered such that only the highest scoring is kept. The location, score, orientation and local sequence conservation level of each TFBS match in the human and mouse genes are stored in the oPOSSUM database.
Two statistical measures were calculated to determine which, if any, TFBS were over-represented in the set of promoters for co-expressed genes. These represent two distinct models for counting the occurrences of binding sites.
The Z-score uses a simple binomial distribution model to compare the rate of occurrence of a TFBS in the set of co-expressed genes to the expected rate estimated from the pre-computed background set.
For a given TFBS, let the random variable X denote the number of predicted binding site nucleotides in the conserved non-coding regions of the co-expressed gene set. Let B be the number of predicted binding site nucleotides in the conserved non-coding regions of the background set. Using a binomial model with n events, where n is the total number of nucleotides examined (i.e. the total number of nucleotides in the conserved non-coding regions) from the co-expressed genes, and N is the total number of nucleotides examined from the background gene set, then the expected value of X is μ = BC, where C = n/N (i.e. the ratio of sample sizes). Then, taking P = B/N as the probability of success, the standard deviation is given by .
Now, let x be the observed number of binding site nucleotides in the conserved non-coding regions of the co-expressed genes. By applying the Central Limit Theorem and using the normal approximation to the binomial distribution with a continuity correction, the Z-score is calculated as . Thus the Z-score indicates a significant difference in the rate of occurrence of sites, and is particularly good for detecting increased prevalence of common sites.
In contrast to the Z-score, the one-tailed Fisher exact probability compares the proportion of co-expressed genes containing a particular TFBS to the proportion of the background set that contains the site to determine the probability of a non-random association between the co-expressed gene set and the TFBS of interest. It is calculated using the hyper-geometric probability distribution that describes sampling without replacement from a finite population consisting of two types of elements (18). Therefore, the number of times a TFBS occurs in the promoter of an individual gene is disregarded, and instead, the TFBS is considered as either present or absent. A significant value for the Fisher exact probability indicates that there are a significant proportion of genes that contain the site, and is particularly good for rare TFBSs. Fisher exact probabilities were calculated using the R Statistics package (http://www.r-project.org).
A list of genes differentially expressed during interruption of the NF-κB pathway by a specific NF-κB inhibitor was obtained from an unpublished microarray experiment. Supplementary Table 1 contains sufficient information to reproduce or challenge the in silico promoter analysis described in this text, and the design of the experiment is briefly described here. Human umbilical vein endothelial cells (HUVEC) in the treatment condition were pre-treated with 10 μM of NF-κB inhibitor, followed by stimulation of the NF-κB signaling pathway with 0.1 ng/ml IL-1B 1 h later (t = 0 h). A second sample of the same culture was treated with 0.1 ng/ml IL-1B only at t = 0. A third sample received only vehicle treatment (0.33% dimethyl sulfoxide) at t = 0. From each condition, total RNA was isolated at 6 h using the RNeasy midi kit (Qiagen, USA). The entire paradigm was repeated three times on separate batches of HUVEC, generating nine samples. Equal amounts of RNA were pooled from the three IL-1B treated samples, as the control channel. Each sample, i.e. the three inhibitor treated, the three vehicle treated, and the pool of IL-1B treatment alone, was split in two and labeled with either the Cy3 or Cy5 fluorescent dye (Agilent, USA). Using a two-color microarray system, the labeled cDNA from treatment and control conditions was hybridized to an oligonucleotide microarray representing 23000 human genes (Agilent, USA) as follows: (i) three replicates of individual vehicle treated samples versus pool of IL-1B samples, (ii) three replicates of pool of IL-1B samples versus individual IL-1B + NF-κB inhibitor treated samples, (iii) same as (i) with fluors reversed and (iv) same as (ii) with fluors reversed. After quantification of the raw data, normalization and combination of the technical, fluor-reversed replicates using the Rosetta Resolver® (version 3.0) gene-expression-data-analysis system (19), an error-weighted ANOVA analysis was performed across replicates in the two groups. Biological replicates were then combined using the Rosetta Resolver® (version 3.0) error model.
For our analysis we focused on a list of 508 sequences showing significantly decreased levels of expression in inhibitor-treated cells, defined by an ANOVA P-value ≤0.01,an error-model P-value ≤0.01 and a fold-change ≥1.3. The 508 sequences were mapped to 326 unique Ensembl gene IDs by identifying gene models from Ensembl V19 (build 34a) which overlapped with the probe sequences. The downregulated genes were submitted for analysis by oPOSSUM.
To estimate the false positive rate, we tested oPOSSUM on randomly generated subsets of genes from the oPOSSUM database to determine how frequently TFBSs are identified as over-represented by chance, and to assess the validity of the selected Z-score and Fisher P-value cutoffs of 10 and 0.01, respectively. We created 100 independent sets, each containing 15 genes. These were submitted to oPOSSUM, and the number of TFBSs significantly over-represented during each trial was counted. The number of trials that generated significant TFBSs over 100 independent trials gives us a measure of the false positive rate. We repeated this process for gene lists of 50, 100 and 200 randomly selected genes to see what, if any, effect the number of genes in the list has on the false positive rate.
Next we investigated the amount of noise oPOSSUM can tolerate by adding increasing numbers of randomly selected genes from the oPOSSUM database to our reference gene sets. For the muscle- and liver-specific gene sets, we added 5, 10, 15, 20, 25, 30, 40, 50, 75 and 100 randomly selected genes, and submitted them to oPOSSUM. Additional increments of 150 and 300 genes were tested for the larger set of NF-κB target genes. This process was repeated 100 times for each noise level. The average Z-scores and Fisher P-values for the Mef-2, HNF-1 and NF-κB TFBS profiles over 100 independent trials for each noise level were recorded.
For all of the analyses presented in this study, we examined the promoter region encompassing 5000 bp upstream and 5000 bp downstream of the TSS, used the highest conservation level to extract conserved, non-coding regions (top 10% of conserved regions with a minimum of 70% sequence identity), and required a PSSM score >85% for predicted binding sites, using only the vertebrate-specific PSSMs in JASPAR.
The oPOSSUM database was constructed from an initial set of 14083 orthologs from human and mouse, obtained by selecting only ‘one-to-one’ human–mouse orthologs from Ensembl (20). Of these, 4921 (34.9%) of the ortholog sequence pairs failed to produce reasonable alignments of the promoter regions, due largely to an inability to reconcile TSS positions as a result of alternative promoter usage by orthologs, and to a lesser degree, as a consequence of low nucleotide sequence similarity between assigned orthologous gene pairs, genes within genes, and TSSs located within exons of upstream genes on the opposite strand. Attempts to align a subset of the failed promoter pairs using the LAGAN algorithm produced similar results (not shown). An additional 456 (3.2%) ortholog pairs successfully aligned but did not contain conserved, non-coding regions (minimum of 100 bp with >60% identity) in the target region spanning from 5000 bp upstream of the TSS to 5000 bp downstream of the TSS. Of the remaining 8706 genes with conserved promoters, 8698 contained matches to one or more TFBS profiles (PSSM cutoff of 75%), producing 2.4 × 106, 3.3 × 106 and 4.1 × 106 conserved predicted binding sites at the high, medium and low conservation levels, respectively (See Materials and Methods).
The muscle and liver regulatory region collections catalogue experimentally verified TFBSs that confer muscle- and liver-specific gene expression, respectively (21,22). We searched the literature for additional experimentally verified sites in human and mouse, adding eight liver-specific and five muscle-specific promoters to these collections (available at http://www.cisreg.ca/tjkwon/). In addition to these tissue-specific genes, we compiled a list of 61 known targets of the nuclear factor NF-κB (23). We used these reference sets to assess oPOSSUM's ability to discriminate functionally relevant TFBSs and to empirically determine appropriate thresholds for our scoring measures.
oPOSSUM calculates two statistical measures for binding site over-representation, one at the gene level (Fisher exact test) and the other based on the ratio of TFBSs to nucleotides (Z-score). Figure 2 shows the correlation between the scores for each reference set. Clearly, scores for the majority of TFBSs cluster at the bottom right corner of the graph for all reference sets, with Z-scores ranging from −10 to 10 and Fisher P-values ranging from 0.02 to 1. For each reference set, we also ranked the top 10 binding sites, ordered by Z-score, along with associated Fisher P-values (Table 1). In each case, the TFs were further investigated for experimentally verified evidence in the given tissue or system.
Studies of skeletal muscle expression have revealed five primary classes of TFs that contribute to skeletal muscle-specific expression: Myf (MyoD), Mef-2, SRF, TEF-1 and Sp-1 (24). Submission of the 25 genes of human, mouse or rat origin in the muscle regulatory collection resulted in 14 pairs of orthologs being analyzed. oPOSSUM ranked SRF, TEF-1, Mef-2 and Myf as the top four most significant profiles (Table 1). In fact, all of these TFs had Fisher P-values <0.01 and with the exception of Myf, had Z-scores >10, considerably higher than for all other TFs (Figure 2). Sp-1 was ranked tenth but without sufficiently convincing scores to discriminate it from the remainder of the TFBSs (Figure 2); this is not surprising given that it is a ubiquitous activator of numerous genes in the human genome (25).
Based on a collection of genes expressed either exclusively in liver hepatocytes or in a small number of tissues including liver hepatocytes, previous studies have found that hepatocyte-specific gene expression can be governed by the combined action of four primary TFs: HNF-1, HNF-3, HNF-4 and c/EBP (26). (There are additional regulatory programs that are controlled independently of these factors in hepatocytes.) Using this established list of 22 genes, we were able to analyze 11 orthologous gene pairs. Predicted HNF-1 sites were the most significantly over-represented TFBSs in the promoters of genes from the liver collection using both the Z-score and Fisher measures (Table 1). In fact, with a Z-score of 32.5, which is almost three times greater than the next most significant TFBS profile from JASPAR, and a Fisher P-value of 1.5 × 10−4, HNF-1 clearly segregates from the remaining TFBS profiles in this reference set (Figure 2). c/EBP ranked third, but was not sufficiently over-represented to exceed the significance cutoffs of 10 and 0.01 for the Z-score and Fisher measures, respectively.
The NF-κB/Rel family of TFs, which includes RELA (p65), NF-κB1 (p50, p105), NF-κB2 (p52, p100), c-REL and RELB, plays a central role in regulating the immune response (27). oPOSSUM was applied to a set of 61 known NF-κB-regulated genes (23), which include a large number of cytokines and immunoreceptors, and to a lesser extent, antigen presentation proteins, cell adhesion molecules, acute phase proteins, stress response genes and TFs. Of the 61 human genes submitted to oPOSSUM, 33 were mapped to mouse orthologs and subsequently analyzed. The NF-κB, c-REL, p65 and p50 binding sites, which are all members of the NF-κB-family of TFs, ranked as the top four most over-represented TFBSs, using either the Z-score or Fisher P-values (Table 1). Figure 2 shows that they were indeed the only TFBSs with significant scores discriminating them from other sites, with Z-scores as high as 35.6 and Fisher P-values as low as 1.2 × 10−9.
Based on the results obtained from the three reference gene sets, we decided empirically to use a Z-score cutoff of 10 and Fisher P-value cutoff of 0.01 to identify TFBSs for each of our test sets.
The reference collections used above are curated sets of genes. In contrast, high-throughput transcript profiling studies typically produce clusters of hundreds of co-expressed genes, of which only a small subset is likely to be co-regulated by a given factor. We assessed oPOSSUM's performance on three sets of genes derived from transcript profiling experiments, and report the results in Table 2. For each set of co-expressed genes, we list the top ten over-represented TFBSs, as determined by the Z-score, as well as any additional TFBSs with significant Fisher P-values (P < 0.01).
The c-Myc TF, which dimerizes with the Max protein, is a key regulator of cell proliferation, differentiation and apoptosis (28,29). Using serial analysis of gene expression (SAGE), Menssen and Hermeking (29) identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC. The induction of 53 genes was confirmed using microarray analysis and RT–PCR. We analyzed the 53 genes with oPOSSUM and found that the binding sites of Myc–Max heterodimers are indeed the most significantly over-represented (Table 2); Myc–Max sites were identified in seven of the genes. Matches to the binding profile for homogeneous Max dimers, c-Myc's interacting partner, were also highly over-represented (present in nine genes, giving a high Z-score of 21.9). The binding profile for a related protein, n-Myc, ranked amongst the top ten most over-represented profiles.
In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. (30) used microarrays to compare the gene expression profile of 208F fibroblasts transformed by c-Fos against the profiles for the parental 208F rat fibroblast cell line. We mapped the list of 252 induced genes to 150 human orthologs, which were submitted to oPOSSUM. As expected, the c-Fos TFBS was ranked as the most over-represented TFBS in the promoters of the induced genes, with a Z-score of 11.0 and a Fisher P-value of 2.9 × 10−2 (Table 2). c-Fos sites were identified in 40 of the co-expressed genes.
In HUVEC cells, interleukin 1B treatment precipitates an inflammatory response observable as an induction of mRNA expression. This response can be modulated by the inhibition of the NF-κB signaling pathway (31). We assessed oPOSSUM's performance on 326 genes that showed decreased levels of expression in interleukin-1B-stimulated HUVEC cells treated with an NF-κB inhibitor as compared to IL-1B-stimulated HUVEC cells. Binding sites for the NF-κB/Rel family of TFs were the most over-represented (present in ~50 genes) in the inhibitor-modulated genes (Table 2). Other over-represented TFBSs included the immune-related genes Irf-1, Irf-2 and SPI-B.
Based on the reference gene sets and expression data, oPOSSUM successfully identifies TFBSs that play a functional role in the regulation of sets of co-expressed genes. In the majority of cases, a Z-score >10 and a Fisher P-value <0.01 effectively discriminated the known sites within each set of reference genes. To assess how many of the over-represented TFBSs may be expected by chance and ascertain if the qualitatively observed thresholds are appropriate, we tested oPOSSUM on randomly generated subsets of genes from the oPOSSUM database.
In Figure 3, we show the percentage of trials that produced TFBS predictions for random sets of genes, providing a measure of the false positive rate. For a set of 15 genes, using the Z-score alone, 23% of the trials produced one false positive prediction, 19% produced two false positives, and so forth, for an overall false positive rate of 66%. Using only the Fisher exact test for a set of 15 genes, we obtain an overall false positive rate of 28%. Thus, when used in isolation, each of the scoring measures result in surprisingly high false positive rates (average of 63% for the Z-score and 31% for the Fisher test), which are dramatically reduced by combining the scores. By applying both the Z-score and the Fisher P-value cutoffs to the randomly selected sets, we observed an average false positive rate of 15%. The specificity when using the combination of scores (Z and F) appears consistent across gene sets of different sizes. Thus, with sets as large as 100–200 genes, which is typical of clustered expression data, ~86% of the time no spurious results are observed.
Next we performed simulations to investigate the amount of noise oPOSSUM can tolerate. To do this, we added from 5 to 300 randomly selected genes to the reference gene sets, and applied oPOSSUM to determine what proportion of the sets could be noise before losing our ability to elucidate the TFBSs mediating tissue-specific and pathway-specific expression. We considered the Mef-2, HNF-1 and NF-κB binding site profiles to be representative of each set, and plotted their average Z-scores and Fisher P-values over 100 trials against the proportion of noise in the set. The muscle, liver and NF-κB data sets can tolerate up to 60% of the gene list being noise using the Z-score (Figure 4A) and up to 50% using the Fisher P-value (Figure 4B). There is significant variation in the degree of noise tolerance amongst the three sets of genes: the NF-κB set is able to tolerate up to 80% of the set being noise versus only 50% for the muscle set. Figure 4 shows that the Z-score decreases quadratically and the Fisher P-value increases logarithmically with increasing noise for all three sets of genes.
The approach described for the detection of over-represented conserved TFBSs in sets of co-expressed genes has been implemented as a flexible, user-friendly website available from www.cisreg.ca. The implementation allows for analysis in default and custom modes. In the default mode, conserved human and mouse TFBS counts have been pre-calculated and stored using combinations of pre-defined values for the following three parameters: (i) the amount of sequence relative to the TSS to be included in the analysis, (ii) the level of interspecies conservation required and (iii) the PSSM score required for a hit to be reported (Table 3). Users simply select a pre-defined set of parameters, select a set of TFBS to be included in the analysis, and submit a list of gene identifiers (Ensembl, GenBank, RefSeq or LocusLink are presently supported) for analysis. oPOSSUM retrieves the TFBS hits matching the specified criteria for each gene in the list, calculates a Fisher exact probability and Z-score for the classes of TFBSs found in the set of genes, and returns ranked lists of TFBSs for each statistical test (Figure 5A). This operation is fast (<30 s for each of the reference sets) due to the pre-calculation of background frequencies. Pop-up windows for each TFBS display the genes in which the site has been located, as well as the site's co-ordinates and score (Figure 5B). Furthermore, the TFBSs are linked to the JASPAR database for easy access to information regarding the binding site profiles.
In the custom mode, users are not restricted to the pre-defined parameter values for the PSSM score and promoter region, and are given the option to supply user-defined background sets. Users might be motivated to introduce their own background sets if there is prior biological evidence linking sequence composition to expression in the tissue or condition studied. The customization option provides users with more control, and results in more variable processing speeds depending on the size of the background set and the parameters selected.
The oPOSSUM API, based on a set of object-oriented Perl modules, provides an interface to the oPOSSUM database and defines data objects for facilitating statistical (Fisher and Z-score) analysis. A set of modules at the top level of the API tree model each of the data objects in the oPOSSUM database. Briefly, the current version of the API includes modules for connecting and retrieving gene indices, orthologous gene pairs, conserved region information, TFBS matches, and other types of data from the oPOSSUM database, running the Z-score and Fisher analyses, and storing the input and output from these analysis modules. The API with accompanying documentation is available through the oPOSSUM website.
Regulatory analysis of the promoters of co-expressed genes can give rise to hypotheses about the factors, TFBSs, and putative pathways involved in generating the observed expression patterns. Our integrated approach to regulatory analysis incorporates public data sources, cross-species conservation and complementary statistical methods to identify over-represented motifs. We validate the method using updated reference sets of muscle-specific and liver-specific regulatory regions, and a new set of NF-κB-regulated genes. We demonstrate the utility of this technique for analyzing experimental data with three independent gene expression studies. We show the robustness of this method through computationally assessed rates of false-positives and noise tolerance. The procedure has been implemented as a user-friendly, flexible website called oPOSSUM and as a Perl API. In short, we illustrate herein that oPOSSUM is a novel, validated, useful, robust, user-friendly means for analysts to explore potential regulatory mechanisms in their expression experiments.
In the case of the muscle regulatory collection, oPOSSUM ranked four of the five documented TFBSs as the four most over-represented sites. In fact, the only three profiles to surpass the specified Z-score and Fisher P-value cutoffs were those for the muscle-specific TFs SRF, TEF-1 and Mef-2. Similar results were obtained for the extended liver regulatory collection, which contains genes with experimentally verified HNF1/3/4 and c/EBP binding sites. oPOSSUM analysis resulted in two over-represented TFBSs, including the top-ranked HNF-1, followed by forkhead related activator 2 (FREAC-2), a member of the forkhead box family of eukaryotic DNA binding proteins, which includes FREAC-2, FREAC-4, HNF-3β and HNF-4. c/EBP, though not considered significantly over-represented based on our empirical cutoffs, ranked third. The JASPAR database does not currently contain a binding profile for HNF-4, and so this TF could not be included in the analysis. The liver set illustrates how the absence of high-quality PSSM profiles to model all TFs in the human genome represents a key limitation to this method for the entire field.
Application of oPOSSUM to a set of known targets of NF-κB resulted in all four NF-κB-related profiles ranking at the top of the list of over-represented TFBS profiles, with markedly significant scores. Within the ranked list, though not exceeding the thresholds, we observed other immune response-related TFBS profiles. For example, the interferon regulatory factors Irf-1 and Irf-2 (ranked numbers 8 and 6, respectively) are known regulators of the host defense in response to viral infection or cytokine stimulation; they regulate interferon (IFN) and IFN-inducible genes, and also form interactions with SPI-1 and SPI-B (ranked number 5) to induce the activity of various cytokines (32–34). It is worth noting that the default threshold values are simply suggestions and less conservative cutoffs may yield valuable insights as well.
While the system behaved well for the validation collections, we desired to assess the utility for the analysis of larger, more heterogeneous experimental data. In each of the two published gene expression data sets, derived from the ectopic expression of c-Myc and c-Fos respectively, oPOSSUM clearly and appropriately ranked the corresponding TFBSs as being the most significantly over-represented. Further evidence of oPOSSUM's utility in analyzing gene expression data was presented by applying oPOSSUM to a set of genes that showed decreased expression in an experiment examining the effect of a known inhibitor of the NF-κB signaling pathway. This set of genes is distinct from the NF-κB reference set in that the experiment examines an interleukin-induced immune–response in a cellular system, and potentially contains a large number of mRNAs that are independent of NF-κB signaling. Still, oPOSSUM identified the NF-κB binding sites (NF-κB, c-REL, p50 and p65) as being significantly over-represented, as well as the same TFs involved in the immune response that were identified in the NF-κB test set (Irf1, Irf2 and SPI-B). Taken together, the three experimental analyses illustrate the power of promoter sequence analysis to identify the TFs governing gene expression changes observed in heterogeneous microarray and SAGE data.
A common problem for promoter analysis is circularity. Binding sites that have been experimentally verified in genes are used to construct binding site profiles, which in turn, are used to search for binding sites in sets containing the original genes. In this study, the Mef-2, SRF, c-REL, p50, p65, Myc–Max and c-Fos binding site profiles were constructed based on SELEX experiments, in which in vitro binding experiments are used to isolate suitable binding sites for a particular TF from random oligonucleotides (35). Thus, we can be sure that at least for the SELEX-based profiles, we have avoided any circularity.
At present, a key limitation to the oPOSSUM analysis is the scarcity of annotated binding site profiles. JASPAR, the underlying database supporting oPOSSUM, contains 111 high-quality binding site profiles representing 25 structural classes. When analyzing expression data sets, it is worth keeping in mind that although the TF mediating the observed response may not be present in the sparse JASPAR database, it is possible that a TF that recognizes a similar motif may be identified as being over-represented. For example, looking at the results for the c-Myc experiment in Table 2, it is evident that the over-represented TFBSs we observe are predominately bound by TFs containing the basic helix–loop–helix (bHLH) domain, and in particular, by TFs within the bHLH-ZIP (basic helix–loop–helix/leucine zipper) structural class. In fact, four of the five TFs in JASPAR that belong to the bHLH-ZIP class rank amongst the top ten profiles. This is also true for the NF-κB-related data sets where we see a clear over-representation of TFs belonging to the Rel class of TFs (Tables 1 and and2).2). It is important to consider whether a match may be indicative of a member of a structural class, rather than the specific profiled TF. A major challenge, however, is that zinc-finger proteins make up the largest class of TF proteins, comprising ~47% of the estimated 1445 TFs identified in mammalian genomes (36). Cys2–His2 zinc-fingers are the most versatile of the DNA-recognition domains, and variations in amino acid sequence enable them to bind to a diverse range of DNA sequences. In addition, zinc-finger proteins in mammalian genomes use multiple, tandem fingers to interact with arrays of subsites, providing a degree of modularity and exceptional adaptability (37). JASPAR currently contains only 17 zinc-finger binding profiles. We will continue our ongoing efforts to expand the JASPAR collection and incorporate new information as it becomes available.
Analysis of the false positive rates using random sampling revealed that, when used in isolation, the Z-score and Fisher tests result in high false positive rates that can be reduced by combining the two scoring measures. While it's true that applying a multiple testing correction could possibly improve performance for the Fisher measure, the Z-scores we obtain are extremely large, such that a correction for the 100 or so TFBS profiles being tested has negligible impact on the Z-score results. Instead, we have opted to empirically derive threshold cutoffs based on our reference data. Furthermore, our experience with oPOSSUM suggests that it is the ranks of the binding site profiles rather than the specific values of the scores that are indicative of functionally relevant TFBSs. For these reasons, and in light of the binding similarities within factor families, we have abstained from making a Bonferroni correction to adjust for multiple testing. In the future, we may introduce an option for users to make this adjustment that is based on an improved statistical model.
oPOSSUM is the first integrated, web-based tool for analyzing sets of co-expressed genes that incorporates cross-species comparisons, PSSM-based promoter motif detection, and statistical methods for the identification of over-represented TFBSs with a pre-computed database. Other resources are available for detecting and visualizing binding sites within the conserved regions of human genes [Consite (11), rVISTA (38), dbTSS (39), CONREAL (40), CORG (41)], as well as for identifying statistically over-represented motifs in the promoters of related sequences [Clover (42), OTFBS (43), PRIMA (44)]. Other comparable tools that integrate all of these approaches include the Toucan workbench for regulatory sequence analysis (45), CONFAC (46), and CRÈME (47). Unlike Toucan, oPOSSUM employs a pre-computed database of conserved TFBSs, eliminating the need for long processing times involved in retrieving sequences, performing alignments, and detecting motifs via PSSMs. Furthermore, the use of two complementary statistical tests to determine over-represented TFBSs is unique to oPOSSUM, and attempts to address the inherent problems involved in analyzing conserved regions of promoters for TFBSs, which include variation in conservation properties from one orthologous gene pair to another and multiple occurrences of a particular TFBS in the promoter of a single gene.
The oPOSSUM system is under continued development. As new information accumulates, we intend to expand the orthology mapping, increase the number of TFBS profiles supported in JASPAR, include the option for users to specify alternative promoters (TSSs), and improve the over-representation analysis. We believe that this approach to regulatory analysis will be helpful to researchers hoping to elucidate transcriptional pathways from gene expression data.
Supplementary Material is available at NAR Online.
Thank you to Yves Boie, Ernest Asante-Appiah, Jimmy Fourtounis and Chris Roberts (Merck) for supplying the NF-κB microarray data. This work was supported by the CIHR/MSF Strategic Training Program in Bioinformatics, the National Science and Engineering Research Council of Canada (NSERC) and Merck-Frosst, as well as a grant from the Canadian Institutes of Health Research (WWW). Funding to pay the Open Access publication charges for this article was provided by Merck-Frosst research funding to the CMMT (WWW).
Conflict of interest statement. None declared.