The oPOSSUM database was constructed from an initial set of 14 083 orthologs from human and mouse, obtained by selecting only ‘one-to-one’ human–mouse orthologs from Ensembl (20). Of these, 4921 (34.9%) of the ortholog sequence pairs failed to produce reasonable alignments of the promoter regions, due largely to an inability to reconcile TSS positions as a result of alternative promoter usage by orthologs, and to a lesser degree, as a consequence of low nucleotide sequence similarity between assigned orthologous gene pairs, genes within genes, and TSSs located within exons of upstream genes on the opposite strand. Attempts to align a subset of the failed promoter pairs using the LAGAN algorithm produced similar results (not shown). An additional 456 (3.2%) ortholog pairs successfully aligned but did not contain conserved, non-coding regions (minimum of 100 bp with >60% identity) in the target region spanning from 5000 bp upstream of the TSS to 5000 bp downstream of the TSS. Of the remaining 8706 genes with conserved promoters, 8698 contained matches to one or more TFBS profiles (PSSM cutoff of 75%), producing 2.4 × 10⁶, 3.3 × 10⁶ and 4.1 × 10⁶ conserved predicted binding sites at the high, medium and low conservation levels, respectively (see Materials and Methods).
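The conservation criterion quoted above (a minimum of 100 bp with >60% identity) can be illustrated with a minimal Python sketch. It assumes a fixed-width sliding window over a gapped pairwise alignment, in which gap columns never count as matches; the windowing procedure actually used to build oPOSSUM is described in Materials and Methods and may differ. The function name and its parameters are illustrative only.

```python
def has_conserved_window(seq_a, seq_b, window=100, min_identity=0.60):
    """Return True if any window of `window` alignment columns exceeds
    `min_identity` (fraction of identical, non-gap columns).

    seq_a and seq_b are the two rows of a pairwise alignment and must
    have equal length; '-' marks a gap. Sliding-window scheme is an
    assumption made for illustration.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_a)
    if n < window:
        return False

    # 1 for an identical non-gap column, 0 otherwise.
    matches = [
        int(a == b and a != "-")
        for a, b in zip(seq_a.upper(), seq_b.upper())
    ]

    # Maintain a running count of matches inside the current window.
    running = sum(matches[:window])
    if running / window > min_identity:
        return True
    for i in range(window, n):
        running += matches[i] - matches[i - window]
        if running / window > min_identity:
            return True
    return False
```

The comparison is strict, so a window at exactly 60% identity does not qualify, matching the ">60%" wording of the text.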
Validation using reference gene sets
The muscle and liver regulatory region collections catalogue experimentally verified TFBSs that confer muscle- and liver-specific gene expression, respectively (21). We searched the literature for additional experimentally verified sites in human and mouse, adding eight liver-specific and five muscle-specific promoters to these collections (available at http://www.cisreg.ca/tjkwon/). In addition to these tissue-specific genes, we compiled a list of 61 known targets of the nuclear factor NF-κB (23). We used these reference sets to assess oPOSSUM's ability to discriminate functionally relevant TFBSs and to empirically determine appropriate thresholds for our scoring measures.
oPOSSUM calculates two statistical measures of binding site over-representation: one at the gene level (Fisher exact test) and one based on the ratio of TFBSs to nucleotides (Z-score). Figure 2 shows the correlation between the two scores for each reference set. For all reference sets, the scores for the majority of TFBSs cluster at the bottom right corner of the graph, with Z-scores ranging from −10 to 10 and Fisher P-values ranging from 0.02 to 1. For each reference set, we also ranked the top 10 binding sites by Z-score and report them with their associated Fisher P-values. In each case, the TFs were further investigated for experimentally verified evidence in the given tissue or system.
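The two measures can be illustrated with a minimal Python sketch. It assumes that the Fisher test is evaluated as a one-tailed hypergeometric probability in which the submitted genes are treated as a sample drawn from the background gene set, and that the Z-score uses a normal approximation to the binomial, with the background supplying the expected rate of TFBS nucleotides per nucleotide searched. The exact formulas are those given in Materials and Methods and may differ in detail (for example, by a continuity correction); all function and parameter names here are illustrative and are not part of oPOSSUM.

```python
from math import comb, sqrt


def fisher_upper_tail(hits_target, n_target, hits_background, n_background):
    """One-tailed P(X >= hits_target) under the hypergeometric null.

    Assumes the target genes are a subset of the background genes.
    hits_* are numbers of genes containing at least one site.
    """
    total = comb(n_background, n_target)
    upper = min(n_target, hits_background)
    tail = sum(
        comb(hits_background, k)
        * comb(n_background - hits_background, n_target - k)
        for k in range(hits_target, upper + 1)
    )
    return tail / total


def site_z_score(site_nt_target, total_nt_target,
                 site_nt_background, total_nt_background):
    """Z-score for the rate of TFBS nucleotides in the target set.

    Normal approximation to the binomial (an assumption); requires a
    background rate strictly between 0 and 1.
    """
    rate = site_nt_background / total_nt_background
    expected = rate * total_nt_target
    sd = sqrt(total_nt_target * rate * (1.0 - rate))
    return (site_nt_target - expected) / sd
```

Under these assumptions, a profile whose sites are found in more of the submitted genes than the background predicts receives a small Fisher P-value, whereas a profile whose sites cover more nucleotides than expected receives a large positive Z-score; the two measures therefore capture related but distinct signals.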
Figure 2 Relationship between the Fisher P-values and Z-scores for the muscle, liver and NF-κB reference sets. Based on the distribution of scores for the reference sets, a Z-score cutoff of 10 and a Fisher P-value cutoff of 0.01 were empirically selected ...
Statistically over-represented TFBSs in reference gene sets
Muscle-specific regulatory region collection
Studies of skeletal muscle expression have revealed five primary classes of TFs that contribute to skeletal muscle-specific expression: Myf (MyoD), Mef-2, SRF, TEF-1 and Sp-1 (24). Submission of the 25 genes of human, mouse or rat origin in the muscle regulatory collection resulted in 14 pairs of orthologs being analyzed. oPOSSUM ranked SRF, TEF-1, Mef-2 and Myf as the top four most significant profiles. All of these TFs had Fisher P-values <0.01 and, with the exception of Myf, Z-scores >10, considerably higher than those of all other TFs. Sp-1 was ranked tenth, but its scores were not sufficient to discriminate it from the remainder of the TFBSs; this is not surprising given that it is a ubiquitous activator of numerous genes in the human genome (25).
Liver-specific regulatory region collection
Based on a collection of genes expressed either exclusively in liver hepatocytes or in a small number of tissues including liver hepatocytes, previous studies have found that hepatocyte-specific gene expression can be governed by the combined action of four primary TFs: HNF-1, HNF-3, HNF-4 and c/EBP (26). (There are additional regulatory programs that are controlled independently of these factors in hepatocytes.) Using this established list of 22 genes, we were able to analyze 11 orthologous gene pairs. Predicted HNF-1 sites were the most significantly over-represented TFBSs in the promoters of genes from the liver collection using both the Z-score and Fisher measures. With a Z-score of 32.5, almost three times greater than that of the next most significant TFBS profile from JASPAR, and a Fisher P-value of 1.5 × 10⁻⁴, HNF-1 clearly segregates from the remaining TFBS profiles in this reference set. c/EBP ranked third, but was not sufficiently over-represented to exceed the significance cutoffs of 10 and 0.01 for the Z-score and Fisher measures, respectively.
Known NF-κB target genes
The NF-κB/Rel family of TFs, which includes RELA (p65), NF-κB1 (p50, p105), NF-κB2 (p52, p100), c-REL and RELB, plays a central role in regulating the immune response (27). oPOSSUM was applied to a set of 61 known NF-κB-regulated genes (23), which include a large number of cytokines and immunoreceptors, and to a lesser extent, antigen presentation proteins, cell adhesion molecules, acute phase proteins, stress response genes and TFs. Of the 61 human genes submitted to oPOSSUM, 33 were mapped to mouse orthologs and subsequently analyzed. The NF-κB, c-REL, p65 and p50 binding site profiles, all of which represent members of the NF-κB family of TFs, ranked as the top four most over-represented TFBSs using either the Z-score or the Fisher P-value. Figure 2 shows that they were indeed the only TFBSs with significant scores discriminating them from other sites, with Z-scores as high as 35.6 and Fisher P-values as low as 1.2 × 10⁻⁹.
Based on the results obtained from the three reference gene sets, we decided empirically to use a Z-score cutoff of 10 and Fisher P-value cutoff of 0.01 to identify TFBSs for each of our test sets.
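The combined decision rule can be written down in a few lines. The following Python sketch applies the two cutoffs stated above; the function name is illustrative, and the strict inequalities follow the "Z-score >10 and Fisher P-value <0.01" wording used in this section. The example values are those reported in the text for HNF-1 in the liver set and for c-Fos in the c-Fos microarray set.

```python
Z_CUTOFF = 10.0   # empirically selected Z-score cutoff
P_CUTOFF = 0.01   # empirically selected Fisher P-value cutoff


def is_significant(z_score, fisher_p, z_cutoff=Z_CUTOFF, p_cutoff=P_CUTOFF):
    """Return True only if a TFBS profile passes both cutoffs."""
    return z_score > z_cutoff and fisher_p < p_cutoff


# Values quoted in the text.
hnf1_liver = is_significant(32.5, 1.5e-4)   # passes both cutoffs
cfos_array = is_significant(11.0, 2.9e-2)   # passes the Z-score cutoff only
```

Requiring both conditions is deliberately conservative: a profile such as c-Fos, which exceeds the Z-score cutoff but not the Fisher cutoff, is top-ranked by Z-score yet is not called by the combined rule.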
Application to transcript profiling data
The reference collections used above are curated sets of genes. In contrast, high-throughput transcript profiling studies typically produce clusters of hundreds of co-expressed genes, of which only a small subset is likely to be co-regulated by a given factor. We assessed oPOSSUM's performance on three sets of genes derived from transcript profiling experiments, and summarize the results in the table that follows. For each set of co-expressed genes, we list the top ten over-represented TFBSs, as determined by the Z-score, as well as any additional TFBSs with significant Fisher P-values (P < 0.01).
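The selection rule used for reporting (the top ten profiles by Z-score, plus any further profiles with a Fisher P-value below 0.01) can be sketched as follows. The tuple layout and the function name are assumptions made for illustration and do not correspond to oPOSSUM's own data structures.

```python
def report_rows(results, top_n=10, p_cutoff=0.01):
    """Select the profiles to list for one gene set.

    results: iterable of (profile_name, z_score, fisher_p) tuples.
    Returns the top_n profiles by descending Z-score, followed by any
    lower-ranked profiles whose Fisher P-value is below p_cutoff.
    """
    by_z = sorted(results, key=lambda row: row[1], reverse=True)
    top = by_z[:top_n]
    extras = [row for row in by_z[top_n:] if row[2] < p_cutoff]
    return top + extras
```

A profile outside the top ten by Z-score is therefore still reported when the gene-level Fisher test alone flags it.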
Statistically over-represented TF binding sites in gene expression data sets
c-Myc SAGE experiment
The c-Myc TF, which dimerizes with the Max protein, is a key regulator of cell proliferation, differentiation and apoptosis (28). Using serial analysis of gene expression (SAGE), Menssen and Hermeking (29) identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC. The induction of 53 genes was confirmed using microarray analysis and RT–PCR. We analyzed the 53 genes with oPOSSUM and found that the binding sites of Myc–Max heterodimers are indeed the most significantly over-represented; Myc–Max sites were identified in seven of the genes. Matches to the binding profile for homodimers of Max, c-Myc's interacting partner, were also highly over-represented (present in nine genes, giving a high Z-score of 21.9). The binding profile for a related protein, n-Myc, ranked amongst the top ten most over-represented profiles.
c-Fos microarray experiment
In a study examining the role of transcriptional repression in oncogenesis, Ordway et al. used microarrays to compare the gene expression profile of 208F fibroblasts transformed by c-Fos against the profile of the parental 208F rat fibroblast cell line. We mapped the list of 252 induced genes to 150 human orthologs, which were submitted to oPOSSUM. As expected, the c-Fos TFBS was ranked as the most over-represented TFBS in the promoters of the induced genes, with a Z-score of 11.0 and a Fisher P-value of 2.9 × 10⁻². c-Fos sites were identified in 40 of the co-expressed genes.
NF-κB microarray experiment
In HUVEC cells, interleukin-1B (IL-1B) treatment precipitates an inflammatory response observable as an induction of mRNA expression. This response can be modulated by the inhibition of the NF-κB signaling pathway (31). We assessed oPOSSUM's performance on 326 genes that showed decreased levels of expression in IL-1B-stimulated HUVEC cells treated with an NF-κB inhibitor as compared to untreated IL-1B-stimulated HUVEC cells. Binding sites for the NF-κB/Rel family of TFs were the most over-represented (present in ~50 genes) in the inhibitor-modulated genes. Other over-represented TFBSs included those of the immune-related TFs Irf-1, Irf-2 and SPI-B.
Based on the reference gene sets and expression data, oPOSSUM successfully identifies TFBSs that play a functional role in the regulation of sets of co-expressed genes. In the majority of cases, a Z-score >10 and a Fisher P-value <0.01 effectively discriminated the known sites within each set of reference genes. To assess how many of the over-represented TFBSs may be expected by chance and ascertain if the qualitatively observed thresholds are appropriate, we tested oPOSSUM on randomly generated subsets of genes from the oPOSSUM database.
In Figure 3, we show the percentage of trials that produced TFBS predictions for random sets of genes, providing a measure of the false positive rate. For a set of 15 genes, using the Z-score alone, 23% of the trials produced one false positive prediction, 19% produced two false positives, and so forth, for an overall false positive rate of 66%. Using only the Fisher exact test for a set of 15 genes, we obtain an overall false positive rate of 28%. Thus, when used in isolation, each of the scoring measures results in a surprisingly high false positive rate (average of 63% for the Z-score and 31% for the Fisher test), which is dramatically reduced by combining the scores. By applying both the Z-score and the Fisher P-value cutoffs to the randomly selected sets, we observed an average false positive rate of 15%. The specificity when using the combination of scores (Z and F) appears consistent across gene sets of different sizes. Thus, with sets as large as 100–200 genes, which is typical of clustered expression data, ~86% of the time no spurious results are observed.
Figure 3 Percentage of trials that produced false positive (FP) predictions. Sets containing 15, 50, 100 and 200 randomly selected genes were generated and submitted to oPOSSUM (100 trials each). Each segment of the bar represents the percentage of trials where ...
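The bookkeeping behind this experiment can be sketched in Python. Here `analyze` is a hypothetical stand-in for a full oPOSSUM analysis: it is assumed to take a list of gene identifiers and return the profiles that pass the chosen cutoff(s). Because the input genes are random, every returned profile is counted as a false positive. The function names and the seeding scheme are illustrative assumptions.

```python
import random
from collections import Counter


def run_random_trials(gene_pool, set_size, n_trials, analyze, seed=0):
    """Count false positive predictions per trial.

    gene_pool: list of gene identifiers to sample from.
    analyze:   caller-supplied callable (hypothetical) returning the
               profiles called significant for a gene list.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        genes = rng.sample(gene_pool, set_size)
        counts.append(len(analyze(genes)))
    return counts


def summarize(counts):
    """Return ({n_false_positives: % of trials}, overall % with >= 1)."""
    n = len(counts)
    by_count = {k: 100.0 * v / n for k, v in sorted(Counter(counts).items())}
    overall = 100.0 * sum(1 for c in counts if c > 0) / n
    return by_count, overall
```

In this formulation the overall false positive rate is the percentage of trials with at least one spurious prediction, which is the sum of the bar segments for one, two or more false positives.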
Next we performed simulations to investigate the amount of noise oPOSSUM can tolerate. To do this, we added from 5 to 300 randomly selected genes to the reference gene sets, and applied oPOSSUM to determine what proportion of the sets could be noise before losing our ability to elucidate the TFBSs mediating tissue-specific and pathway-specific expression. We considered the Mef-2, HNF-1 and NF-κB binding site profiles to be representative of each set, and plotted their average Z-scores and Fisher P-values over 100 trials against the proportion of noise in the set. The muscle, liver and NF-κB data sets can tolerate up to 60% of the gene list being noise using the Z-score (Figure 4A) and up to 50% using the Fisher P-value (Figure 4B). There is considerable variation in the degree of noise tolerance amongst the three sets of genes: the NF-κB set is able to tolerate up to 80% of the set being noise versus only 50% for the muscle set. Figure 4 shows that the Z-score decreases quadratically and the Fisher P-value increases logarithmically with increasing noise for all three sets of genes.
Figure 4 Noise tolerance. Increasing numbers of randomly selected genes were added to the muscle, liver and NF-κB reference sets to assess the effect of noise on (A) the Z-score and (B) Fisher exact probability statistical measures. The amount of noise ...
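The dilution effect can be illustrated with a simplified, deterministic Python sketch. Rather than averaging over random trials, it assumes that each added gene contributes hits at the background rate and recomputes a one-tailed hypergeometric P-value for the enlarged set. The hit counts in the example are invented for illustration (only the background size of 8706 genes is taken from the text), and the function names are hypothetical.

```python
from math import comb


def upper_tail_p(hits, n_genes, bg_hits, bg_genes):
    """One-tailed hypergeometric P(X >= hits); the submitted set is
    treated as a sample of the background (an assumption)."""
    upper = min(n_genes, bg_hits)
    tail = sum(
        comb(bg_hits, k) * comb(bg_genes - bg_hits, n_genes - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(bg_genes, n_genes)


def diluted_p_values(ref_hits, ref_size, bg_hits, bg_genes, added_counts):
    """Return (noise_fraction, P) pairs as random genes are added.

    Noise genes are assumed to carry the site at the background rate.
    """
    rate = bg_hits / bg_genes
    rows = []
    for added in added_counts:
        hits = ref_hits + round(rate * added)   # expected hits from noise
        size = ref_size + added
        rows.append((added / size, upper_tail_p(hits, size, bg_hits, bg_genes)))
    return rows


# Illustrative numbers: 10 of 14 reference genes carry the site,
# against a hypothetical 870 of 8706 background genes.
rows = diluted_p_values(10, 14, 870, 8706, [0, 5, 50, 300])
```

With these illustrative inputs the P-value rises from well below 0.01 for the undiluted set to above 0.01 once 300 random genes have been added, mirroring the loss of signal described above.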
The approach described for the detection of over-represented conserved TFBSs in sets of co-expressed genes has been implemented as a flexible, user-friendly website available from www.cisreg.ca. The implementation allows for analysis in default and custom modes. In the default mode, conserved human and mouse TFBS counts have been pre-calculated and stored using combinations of pre-defined values for the following three parameters: (i) the amount of sequence relative to the TSS to be included in the analysis, (ii) the level of interspecies conservation required and (iii) the PSSM score required for a hit to be reported. Users simply select a pre-defined set of parameters, select a set of TFBSs to be included in the analysis, and submit a list of gene identifiers (Ensembl, GenBank, RefSeq and LocusLink identifiers are presently supported) for analysis. oPOSSUM retrieves the TFBS hits matching the specified criteria for each gene in the list, calculates a Fisher exact probability and a Z-score for the classes of TFBSs found in the set of genes, and returns ranked lists of TFBSs for each statistical test (Figure 5). This operation is fast (<30 s for each of the reference sets) owing to the pre-calculation of background frequencies. Pop-up windows for each TFBS display the genes in which the site has been located, as well as the site's co-ordinates and score. Furthermore, the TFBSs are linked to the JASPAR database for easy access to information regarding the binding site profiles.
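The retrieval step of the default mode can be pictured with a small Python sketch. It assumes that the pre-calculated hits are available as a mapping from gene identifier to a list of (profile, score, start, end) records; this layout, the function name and the gene identifiers in the example are hypothetical and do not reflect the actual oPOSSUM database schema.

```python
from collections import defaultdict


def summarize_hits(gene_ids, precomputed_hits, min_score):
    """Collect per-profile tallies for a submitted gene list.

    precomputed_hits: dict mapping gene id -> list of
                      (profile, score, start, end) tuples (assumed layout).
    Returns (genes containing each profile, total hits per profile,
    submitted genes absent from the pre-calculated data).
    """
    genes_with_site = defaultdict(set)
    hit_counts = defaultdict(int)
    missing = []
    for gene in gene_ids:
        if gene not in precomputed_hits:
            missing.append(gene)        # reported as excluded from analysis
            continue
        for profile, score, start, end in precomputed_hits[gene]:
            if score >= min_score:
                genes_with_site[profile].add(gene)
                hit_counts[profile] += 1
    return dict(genes_with_site), dict(hit_counts), missing


# Toy data with placeholder identifiers.
toy_hits = {
    "geneA": [("HNF-1", 0.82, 120, 133), ("HNF-1", 0.78, 400, 413)],
    "geneB": [("HNF-1", 0.70, 55, 68), ("SRF", 0.90, 10, 21)],
}
genes, counts, missing = summarize_hits(
    ["geneA", "geneB", "geneC"], toy_hits, min_score=0.75
)
```

The per-profile gene sets are the kind of input a gene-level test needs, whereas the hit tallies (together with site widths) would feed a nucleotide-level measure; because both are simple look-ups over stored records, the analysis stays fast.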
Predefined values for phylogenetic footprinting and TFBS detection available in oPOSSUM's default mode
Figure 5 The oPOSSUM result report for the identification of over-represented TFBSs in sets of co-expressed genes. (A) Results report showing the selected parameters, genes included and excluded in the analysis, and summary tables containing the Fisher exact probability ...
In the custom mode, users are not restricted to the pre-defined parameter values for the PSSM score and promoter region, and are given the option to supply user-defined background sets. Users might be motivated to introduce their own background sets if there is prior biological evidence linking sequence composition to expression in the tissue or condition studied. The customization option provides users with more control, and results in more variable processing speeds depending on the size of the background set and the parameters selected.
The oPOSSUM application programming interface (API)
The oPOSSUM API, based on a set of object-oriented Perl modules, provides an interface to the oPOSSUM database and defines data objects for facilitating statistical (Fisher and Z-score) analysis. A set of modules at the top level of the API tree models each of the data objects in the oPOSSUM database. Briefly, the current version of the API includes modules for connecting to the oPOSSUM database and retrieving gene indices, orthologous gene pairs, conserved region information, TFBS matches and other types of data, for running the Z-score and Fisher analyses, and for storing the input and output of these analysis modules. The API with accompanying documentation is available through the oPOSSUM website.