The increasing number of sequenced genomes of multicellular eukaryotes, including human, together with high-throughput data such as whole-genome microarray expression profiles, allows for systematic characterization of the cis-regulatory elements that control gene expression. In metazoans, gene expression is regulated at multiple levels, including the transcriptional and post-transcriptional levels. Transcriptional regulation involves binding of transcription factors (TFs) to short cis-regulatory elements, or transcription factor binding sites (TFBSs), that are generally 5–15 base pairs (bp) long. TFs bind to specific TFBSs on DNA, which leads to activation or repression of gene transcription [1] into mRNA. mRNA stability and translation efficiency may be further regulated at the post-transcriptional level. One of the best-studied forms of post-transcriptional regulation involves binding of microRNAs (miRNAs) to cis-regulatory target sites residing in the 3' untranslated regions (UTRs) of mRNAs, which results in translational repression [2]. These are the best-studied mechanisms of gene regulation, but other forms of regulation can also be deciphered at the DNA level, such as targets of RNA binding proteins (RBPs) [4
Existing experimental methods to identify cis-regulatory elements, or motifs, are time-consuming and cannot easily be scaled up to analyze a large number of genes [1]. Some other techniques for large-scale analysis of sequences, such as chromatin immunoprecipitation-on-chip (ChIP-chip), require prior knowledge of the trans-acting factor (see [8] for a review of experimental techniques). In contrast, certain types of computational algorithms can discover de novo motifs in the noncoding regions (NCRs) on a genome-wide scale without prior knowledge of the trans-acting factor and without the labor costs of experimental techniques. NCRs are defined as any DNA sequence residing outside the translational start and end sites of all known genes of a genome. In general, motifs are more difficult to detect in metazoans because their genomes, including the NCRs, are much larger than that of yeast (~200 times larger in human than in yeast; ~95% of the human genome is noncoding). Furthermore, regulatory elements can reside far upstream or downstream of the genes they regulate, in their introns [1], or in their UTRs [3]. The larger sequence search space increases the background noise and makes true regulatory elements harder to detect. Existing computational algorithms identify TFBSs (see [8] and [9] for reviews) and miRNA targets (see [10] for a review). To identify TFBSs, most algorithms start with a set of co-regulated, functionally related genes, which may be obtained using microarray data, and search for enriched motifs [1]. Many of these algorithms were developed and tested in yeast, where the intergenic regions are much shorter and the motifs less degenerate than in metazoans. Many miRNA target prediction programs in metazoans begin with known miRNAs (computationally and/or experimentally identified) and use specific base-pairing rules between the miRNA and its targets to predict putative binding sites [3
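As a rough illustration of what a base-pairing rule can look like, the sketch below scans a 3' UTR for perfect matches to the reverse complement of miRNA nucleotides 2-8 (a "seed match"). This is an assumption chosen for illustration: the text above does not specify which rules any given program uses, and real predictors typically add further criteria. The function names and the toy UTR sequence are hypothetical.

```python
# Illustrative sketch only: a simple perfect seed-match rule, not the rule
# set of any specific published miRNA target prediction program.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_seed_sites(mirna, utr):
    """Return 0-based start positions in `utr` that perfectly pair with
    miRNA nucleotides 2-8 (assumed seed definition)."""
    # Accept DNA or RNA input by converting T to U.
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    seed = mirna[1:8]                    # nucleotides 2-8, counted from the 5' end
    site = reverse_complement_rna(seed)  # sequence expected in the mRNA
    width = len(site)
    return [i for i in range(len(utr) - width + 1) if utr[i:i + width] == site]


let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # mature let-7a sequence
toy_utr = "AAACUACCUCAAA"         # hypothetical UTR fragment for the example
hits = find_seed_sites(let7a, toy_utr)
```

Because the approach starts from a known miRNA, it cannot find targets of miRNAs that have not yet been identified, which is the limitation that de novo motif discovery addresses.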
Algorithms for detecting motifs in metazoans tend to leverage evolutionary conservation to reduce the background noise [1]. Algorithms that rely solely on alignments as input may be at a disadvantage, because studies have shown that the conservation of cis-regulatory sites for the same trans-acting factor tends to be low across different species [3]. Multiple studies have shown that TFBSs often do not fall within conserved regions [11]. Odom et al. demonstrated that approximately two-thirds of the binding sites of orthologous genes between human and mouse for the same transcription factor occurred in sequences that did not align. Because few miRNA binding sites have been verified, the turnover of target sites across species has not been studied as extensively. However, recent studies indicate that approximately 50% of miRNA targets are conserved between human and mouse, and that 30–50% of nonconserved sites might be functional in human when the miRNA and mRNA are expressed in the same tissue [3]. Alignment algorithms also have their own technical problems, such as misalignments due to large insertions and/or deletions [3], which may cause conserved sites to be classified as nonalignable. Therefore, computational algorithms that incorporate both sequence conservation and species-specific information for detecting cis-regulatory motifs are advantageous.
We previously introduced the CompMoby (Comparative MobyDick) algorithm for the study of transcriptional regulation in metazoans, which was successfully applied to embryonic stem (ES) cells [15]. We have further developed the CompMoby software by adding a user-friendly web interface that streamlines the analysis pipeline by formatting the user input and filtering the output with suggested default thresholds. The website also allows the user to download the software needed to extract aligned NCRs that are not easily obtainable from a public database (discussed in the implementation section), and provides sample input files for both 5' upstream and 3' downstream sequence analysis as well as documentation that explains the results in the output files. The algorithm has also been extended to accommodate analysis of 3' UTRs for post-transcriptional regulatory studies. The utility of this software for 5' upstream and 3' downstream sequence analysis is demonstrated in the results and discussion section.
The CompMoby software integrates species-specific and evolutionary conservation information as input into the MobyDick algorithm [16] and formats the output files to systematically identify over-represented putative TFBSs in upstream sequences, or putative miRNA and RBP targets in 3' UTR sequences. Our tool requires no prior knowledge of the trans-acting factor and allows de novo motif discovery, which may lead to identification of a new TF, miRNA, or RBP. Currently, we restrict analysis of miRNA and RBP target sites to the 3' UTR sequences since studies have shown that most known metazoan miRNA targets reside in the 3' UTR [2
CompMoby has the advantage of being comprehensive and flexible because the software does not rely only on alignments, and alignable cis-regulatory motifs do not have to be 100% conserved. CompMoby can also capture degenerate cis-regulatory sites in its clustering step, where similar motifs that are not necessarily exact matches are grouped together. This flexibility is useful because cis-regulatory sites for both TFs and miRNAs are often degenerate in animals [1]. Our algorithm also allows the user to identify multiple putative cis-regulatory motifs of different lengths in one run, capturing both strong and weak motifs without having to iteratively mask the strongest motif in the input sequences [9] or rerun the algorithm for varying motif widths. The ability to capture multiple motifs of varying width and enrichment is useful because there may be multiple TFBSs per upstream region of a gene [1] or multiple miRNAs per mRNA [2], which are often involved in combinatorial regulation of gene expression in metazoans. Most existing algorithms use either species-specific information or multiple alignments in any one run and output the results separately. In comparison, the CompMoby software automatically combines all the MobyDick dictionaries derived from aligned and non-aligned sequence sets into clusters of motifs and calculates a p-value of over-representation for each cluster.
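To make the idea of a p-value of over-representation concrete, the sketch below computes a one-sided hypergeometric tail probability for a single motif cluster. This is an assumption for illustration only: the text above does not state which statistical test CompMoby uses, and the function name, parameter names, and example counts are all hypothetical.

```python
from math import comb


def enrichment_pvalue(hits_in_set, set_size, hits_in_background, background_size):
    """P(X >= hits_in_set) for X ~ Hypergeometric.

    Assumed model (not taken from the source): `set_size` genes are drawn
    without replacement from `background_size` genes, of which
    `hits_in_background` contain at least one motif from the cluster.
    """
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    # Sum the probability of every outcome at least as extreme as observed.
    tail = sum(
        comb(hits_in_background, i)
        * comb(background_size - hits_in_background, set_size - i)
        for i in range(hits_in_set, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 5 of 20 co-regulated genes carry the cluster,
# versus 50 of 1000 genes in the background.
p = enrichment_pvalue(hits_in_set=5, set_size=20,
                      hits_in_background=50, background_size=1000)
```

A small p-value here would indicate that the cluster occurs in the gene set more often than expected from its background frequency; when many clusters are tested in one run, some correction for multiple testing would normally be needed.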
The CompMoby software provides the user with a tool to identify cis-regulatory elements de novo at the transcriptional as well as the post-transcriptional level in metazoans. The nature of the algorithm allows the biologist to systematically identify, on a genome-wide scale, the exact positions of putative cis-regulatory elements that are conserved and/or species-specific for experimental follow-up, which may provide insight into the complex regulatory networks of metazoans.