The identification of motifs and motif modules is one of the most critical steps toward understanding gene regulation. A motif, often represented by a position weight matrix, is the common pattern of short DNA segments bound by a transcription factor (TF). These DNA segments are called transcription factor binding sites (TFBSs). In higher eukaryotes, including humans and mice, it is often the interplay of multiple TFBSs from different motifs, rather than from a single motif, that determines the temporal and spatial expression patterns of genes [1]. We define a motif module as a group of several motifs whose TFBSs co-occur in many short DNA sequences, each one kilobase (kb) long. We also define cis-regulatory modules (CRMs) as the 1 kb long sequences that contain TFBSs of all the motifs of a motif module. Because the binding of TFs to their TFBSs plays a pivotal role in controlling gene expression, and the dysfunction of these binding sites often results in disease [3], it is important to identify motifs and motif modules.
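To make the definitions above concrete, the following minimal Python sketch checks which 1 kb windows of a sequence contain at least one TFBS of every motif in a candidate motif module. It illustrates the definitions only and is not the method described in this paper. The function name `find_crm_windows`, the input layout (a dictionary mapping a motif name to TFBS start positions), and the simplification of treating each TFBS as a single position are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

WINDOW = 1000  # 1 kb, the CRM length used in the definitions above


def find_crm_windows(
    sites: Dict[str, List[int]],
    module: List[str],
    window: int = WINDOW,
) -> List[Tuple[int, int]]:
    """Return half-open windows (start, start + window) that begin at a TFBS
    of a module motif and contain at least one TFBS of every motif in `module`.

    Assumption: each TFBS is reduced to its start position.
    """
    # Candidate window starts: every distinct TFBS position of a module motif.
    starts = sorted({pos for motif in module for pos in sites.get(motif, [])})
    hits = []
    for start in starts:
        end = start + window
        # The window qualifies only if every motif has a site inside it.
        if all(
            any(start <= pos < end for pos in sites.get(motif, []))
            for motif in module
        ):
            hits.append((start, end))
    return hits


# Hypothetical TFBS positions for three motifs on one sequence.
example_sites = {"A": [100, 5000], "B": [600, 9000], "C": [900]}
print(find_crm_windows(example_sites, ["A", "B", "C"]))  # [(100, 1100)]
```

In this toy input only the window starting at position 100 contains sites of all three motifs, so it is the only candidate CRM; a motif module in the sense above would additionally require such windows to recur in many sequences.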
Many methods are available for the identification of motifs and motif modules. Conventionally, TFBSs and motifs were identified on a gene-by-gene basis by experiments such as DNase footprinting [5] and gel mobility shift assays [6]. It is through these experiments that we understand many basic principles of motifs. However, such experiments cannot address the challenging problem of identifying motifs and motif modules in the neighborhood of thousands of genes. Thus, many computational methods have been developed [2], which are often based on three properties of motifs: overrepresentation, conservation, and clustering. Overrepresentation means that TFBSs of a motif occur in the non-coding regions of a significant number of genes. Conservation means that TFBSs of a motif are often conserved across species. Clustering means that multiple TFBSs of different motifs often co-occur in short DNA regions such as CRMs. Based on these properties, previously developed computational methods have shown some success in identifying motifs and motif modules, both in groups of putatively co-regulated genes and on a genome-wide scale. At the same time, new experimental technologies such as chromatin immunoprecipitation followed by microarray experiments (ChIP-chip) [21] and high-throughput sequencing of immunoprecipitated fragments (ChIP-seq) [22] can provide thousands of short regions that potentially contain TFBSs, in which computational methods can then identify motifs and motif modules.
Although there are many methods for motif and motif module identification, methods that can handle all of the non-coding regions of the human genome are still greatly needed. Many computational methods were designed to identify novel motifs and motif modules only in the promoter regions of a group of co-regulated genes. These methods are successful in simple organisms such as yeast. They are less successful, however, in higher eukaryotes such as humans, because TFBSs in higher eukaryotes can lie several hundred thousand base pairs (bps) upstream of genes, downstream of genes, or in their introns. To identify TFBSs in the long non-coding regions of higher eukaryotes, several methods based on multiple genome alignments were developed [24]. However, current multiple genome alignments may not align TFBSs well with their orthologous counterparts. For instance, the multiple genome alignments produced by several popular methods differ significantly from one another [26]. Although ChIP-chip and ChIP-seq experiments can narrow down potential TF target regions, they are still costly and are limited by the availability of high-quality antibodies. Thus, novel methods are needed that can systematically identify motif modules across the entire non-coding regions around human genes.
Here we describe a computational method that identifies CRMs and motif modules in the human genome. Unlike computational methods designed for promoter regions, our method works on the entire non-coding sequences around human genes. The non-coding sequences of a human gene comprise the upstream non-coding sequence up to the nearest codon of the 5' adjacent gene, the downstream non-coding sequence up to the nearest codon of the 3' adjacent gene, and the intron sequences of the gene itself. Unlike all previous methods, our method measures sequence conservation based on discontiguous sequence similarity [27], which greatly expands the range of the conserved sequences. Our method also differs from methods based on multiple genome alignments in that we use local alignments, which enables us to identify conserved TFBSs and CRMs that may be "misaligned" in the multiple alignments [26].
By applying the method to all human genes with mouse or rat orthologs in the Mouse Genome Informatics (MGI) database, we have identified 3,161,839 motif modules, 90.8% of which are already supported by various sources of functional evidence. Compared with 14 ChIP-seq experiments, our method predicted, on average, 69.6% of ChIP-seq peaks with TFBSs of multiple TFs. Our findings also show that the TFBSs of motifs in many motif modules have preferred distances and orders. All predicted motif modules are available at http://www.cs.ucf.edu/~xiaoman/module1109. We are developing a database that will enable easier access to these predictions.
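The definition of a gene's non-coding sequences given above (upstream sequence, downstream sequence, and introns) can be sketched as interval arithmetic. The Python snippet below is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `noncoding_regions` is hypothetical, the gene is assumed to lie on the forward strand, coordinates are half-open, and the "nearest codon" of each adjacent gene is approximated by the boundary of that gene's nearest coding exon.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on the forward strand


def noncoding_regions(
    coding_exons: List[Interval],
    upstream_gene_end: int,
    downstream_gene_start: int,
) -> Dict[str, object]:
    """Split the non-coding sequence around one gene into three parts.

    Assumptions: `coding_exons` are non-overlapping coding exons of the gene;
    `upstream_gene_end` is the end of the last coding exon of the 5' adjacent
    gene; `downstream_gene_start` is the start of the first coding exon of
    the 3' adjacent gene.
    """
    exons = sorted(coding_exons)
    # Introns are the gaps between consecutive exons of the gene itself.
    introns = [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "upstream": (upstream_gene_end, exons[0][0]),
        "downstream": (exons[-1][1], downstream_gene_start),
        "introns": introns,
    }


# Hypothetical coordinates for a three-exon gene and its two neighbors.
regions = noncoding_regions(
    coding_exons=[(1000, 1200), (1500, 1800), (2500, 2600)],
    upstream_gene_end=400,
    downstream_gene_start=9000,
)
print(regions)
```

For a gene on the reverse strand the upstream and downstream labels would swap; the sketch leaves strand handling out to keep the interval logic visible.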