The properties of cells within an organism are defined by a complex interplay between proteins, RNA, and the genome, which can be conceptualized as the gene regulatory network. Two important components of the gene regulatory network are the DNA-binding trans-acting transcription factors (TFs) and their corresponding transcription factor binding sites (TFBSs) in the DNA. Sets of proximal TFBSs that are sufficient to cooperatively mediate TF-regulated patterns of expression constitute cis-regulatory modules (CRMs). CRMs are the scaffold for combinatorial TF interactions, enabling a limited number of sequence-specific DNA-binding TFs to participate in an exponentially large number of combinations, each potentially conferring specific patterns of gene activity (Arnone and Davidson 1997).
In studying gene regulation within a cell or tissue, researchers are commonly confronted with the need to analyze sets of genes sharing a characteristic, such as co-expression, as they seek to infer properties of the gene regulatory network. A significant insight into the regulatory network structure is obtained when the mediating TFs for the observed expression patterns are identified. A key strategy in genome biology for identifying such TFs is to determine the sequence motifs that are over-represented in the cis-regulatory regions relative to some control. The successful predecessors to oPOSSUM-3, oPOSSUM (Ho Sui et al. 2005) and oPOSSUM-2 (Ho Sui et al. 2007), were developed to identify statistically over-represented, predicted TFBSs in co-regulated gene sets. Two complementary scoring methods measured the over-representation: (1) Z-scores, based on a normal approximation to the binomial distribution, which measure the change in the relative number of TFBS motifs in the foreground gene set compared with the background set, and (2) Fisher scores, based on a one-tailed Fisher exact probability, which assess the number of genes with the TFBS motifs in the foreground set vs. the background set. Using the JASPAR database as the source of DNA binding profiles (Portales-Casamar et al. 2010), the original oPOSSUM was designed to identify over-represented TFBSs, an approach later referred to as Single Site Analysis (SSA). The original system also incorporated a conservation filter using phylogenetic footprinting based on pairwise alignments of orthologous sequences from human and mouse. In oPOSSUM-2, an additional analysis method called Combination Site Analysis (CSA) was introduced to identify over-represented proximal pairs of TFBSs. Separate oPOSSUM-2 implementations were released for two additional model organisms (C. elegans and S. cerevisiae). The nematode oPOSSUM-2 database was based on alignments between C. elegans and C. briggsae. The oPOSSUM-2 yeast system did not incorporate conservation filters, as the compact nature of the yeast genome results in a dramatically reduced search space and less noise compared with larger genomes. The oPOSSUM software is a highly cited tool for TFBS motif over-representation analysis (as assessed by Google Scholar citation counts), perhaps due to the ease of use and power of the approach. On average, excluding automated internet search software, 340 unique users work with oPOSSUM-2 each month.
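The two scoring methods described above can be illustrated with a minimal sketch. It is not the oPOSSUM implementation: the function names, the way the counts are passed in, and the choice of "candidate positions" as the binomial trial count are all assumptions made for illustration, and the foreground and background are treated as disjoint groups.

```python
from math import comb, sqrt


def z_score(fg_hits, fg_positions, bg_hits, bg_positions):
    """Normal approximation to the binomial (illustrative form).

    The background hit rate p is taken as the per-position probability of a
    motif hit; the foreground count is compared with its expectation under p.
    Assumes 0 < p < 1 so that the standard deviation is non-zero.
    """
    p = bg_hits / bg_positions
    expected = fg_positions * p
    sd = sqrt(fg_positions * p * (1 - p))
    return (fg_hits - expected) / sd


def fisher_one_tailed(fg_with, fg_total, bg_with, bg_total):
    """One-tailed Fisher exact probability for over-representation.

    Returns the probability of seeing at least fg_with foreground genes with
    the motif, given the row and column totals of the 2x2 table
    (hypergeometric upper tail).
    """
    total = fg_total + bg_total
    with_site = fg_with + bg_with
    denom = comb(total, fg_total)
    upper = min(fg_total, with_site)
    tail = sum(
        comb(with_site, k) * comb(total - with_site, fg_total - k)
        for k in range(fg_with, upper + 1)
    )
    return tail / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    print(z_score(fg_hits=30, fg_positions=1000, bg_hits=200, bg_positions=10000))
    print(fisher_one_tailed(fg_with=12, fg_total=40, bg_with=50, bg_total=960))
```

The Z-score works on motif occurrences, so it is sensitive to genes carrying many sites, whereas the Fisher probability works on the number of genes with at least one site. This difference is why the two scores are described as complementary.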
Since the release of the original oPOSSUM system, a plethora of TFBS over-representation analysis tools have been introduced. TOUCAN2, a workbench system for regulatory sequence analysis implemented by Aerts et al. (2005), contains features for identifying over-represented TFBSs in proximal promoters of co-regulated genes. Defrance and Touzet (2006) developed the TFM-Explorer, which assesses conservation of spatial arrangements of regulatory elements. Promoter Analysis Pipeline by Chang et al. (2007) includes TFBS identification in gene sets as a component of the workbench, using non-redundant profiles from public databases. Piechota et al. (2010) developed the cREMaG database, which attempts to correct for the confounding influence of variable information content of TFBS profiles, distinguishes between constitutive and inducible transcriptional forms of genes, and reports the presence of CpG islands. Many of the methods provide web-based user interfaces, some of which are maintained. oPOSSUM-2 was found to perform well in an independent assessment of motif over-representation analysis tools (Meng et al. 2010).
Since the implementation of these approaches, technological changes have greatly affected regulatory sequence studies. First, comprehensive multi-species sequence comparison measures are conveniently available in the form of phastCons and phyloP scores from the UCSC genome databases (Hubisz et al. 2011). Phylogenetic footprinting, used by many TFBS enrichment programs, depends heavily on the choice of comparison organism when performed with pairwise sequence alignments, and can sharply limit the number of genes that can be analyzed. Compared with pairwise alignments, multi-species analysis improves the quality of sequence alignments (Kumar and Filipski 2007) and greatly increases the number of genes available for analysis. The proliferation of large-scale regulatory sequence profiling methods such as ChIP-Seq has demonstrated that TF-DNA interactions frequently occur outside of conserved regions (Schmidt et al. 2010). Genomic regions bound by TFs in ChIP experiments are a snapshot of a single cell type and set of conditions, and not all regions are necessarily functional cis-regulatory sequences. In the absence of, or as a complement to, experimental data, conservation is a useful filter for enabling computational motif enrichment analysis. Second, there has been a major update to the JASPAR database, an open-source, non-redundant, curated repository of TFBS profiles (Portales-Casamar et al. 2010). The update provides a significant increase in non-vertebrate profiles, permitting the extension of regulatory analysis software to many non-vertebrate species, such as insects. Third, widespread application of ChIP-Seq profiling has resulted in an explosion of the number of potential regulatory sequences to be analyzed (Johnson et al. 2007; Malhotra et al. 2010; Schmidt et al. 2010). These experiments produce sets of TF-bound and control sequences, in which the foreground target (TF-bound) sets purportedly contain regulatory signatures of interest, whereas the background (control) sets lack those features. Such data are optimal for TFBS over-representation analysis, creating strong demand for a new generation of software that allows analysis from both a sequence-based perspective and a gene-based perspective.
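As a rough illustration of how a conservation filter of the kind mentioned above might operate, the sketch below keeps only those predicted sites whose mean per-base conservation score meets a threshold. It is a simplified example under stated assumptions, not the procedure used by any oPOSSUM version: the function name, the half-open coordinate convention, and the 0.4 threshold are hypothetical, and the scores stand in for phastCons-style values between 0 and 1.

```python
def filter_conserved_sites(sites, scores, min_mean=0.4):
    """Keep predicted sites whose mean conservation score is >= min_mean.

    sites    : list of (start, end) pairs, 0-based and half-open, indexing
               into `scores` (hypothetical convention for this sketch)
    scores   : per-base conservation scores for the searched region
    min_mean : illustrative threshold, not a recommended value
    """
    kept = []
    for start, end in sites:
        window = scores[start:end]
        # Skip empty windows, e.g. sites that fall outside the scored region.
        if window and sum(window) / len(window) >= min_mean:
            kept.append((start, end))
    return kept


if __name__ == "__main__":
    # Toy region: ten well-conserved bases followed by ten poorly conserved ones.
    region_scores = [0.9] * 10 + [0.1] * 10
    predicted = [(2, 8), (12, 18), (8, 12)]
    print(filter_conserved_sites(predicted, region_scores))
```

A filter of this kind trades sensitivity for specificity: raising the threshold removes more spurious predictions but also discards genuine sites in poorly conserved sequence, which is the limitation the ChIP-Seq observations above point to.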
Here we describe oPOSSUM-3, a system that capitalizes upon the aforementioned research developments. The new system features a panel of upgraded and novel approaches to regulatory sequence analysis, including Single Site Analysis (SSA) and anchored Combination Site Analysis (aCSA). A novel extension of the system addresses the challenge imposed by homologous TFs with highly similar (or identical) binding specificity. Such profile similarity will have an increasing impact on motif enrichment analysis as the number of TF profiles with similar binding specificity grows. The TFBS Cluster Analysis (TCA) and anchored Combination TFBS Cluster Analysis (aCTCA) present results focused on TFBS sequence patterns rather than individual profile names. This new approach to regulatory sequence motif over-representation analysis has been assessed against reference sets of co-regulated genes and large-scale ChIP-Seq sequence collections. Assessments against reference cases exemplify the utility of oPOSSUM-3 for the identification of mediating TFs. The new system should maintain the oPOSSUM service as a popular resource for motif over-representation analysis.