An important problem in genome annotation is the identification and characterization of functional elements. These elements include transcription factor binding sites (TFBS), which are short, degenerate sequences that appear frequently in the genome. The interactions between transcription factors (TFs) and their respective binding sites are critical for regulating gene expression. To characterize binding sequences for a TF, computational methods search for sequence patterns or "motifs" that appear repeatedly in genomic regions of interest (for a recent review, see [1]).
For many motif-finding methods, it is necessary to input upstream sequences from a set of genes (e.g., genes identified as co-expressed in a microarray gene expression analysis), with the assumption that the sequences share a common motif (e.g., [2]). However, the upstream sequences of some genes in this set may not contain an occurrence of the motif, and genes whose upstream sequences do contain the motif may be missing from the co-expressed set. To address these weaknesses, correlation-based motif-finding methods [4] have been developed that do not rely on a pre-determined set of genes, whether defined by co-expression (e.g., [2]) or by over-representation of motifs as in [5]. Using all genes from a single experiment, oligos in a specified length range are enumerated in the upstream sequences and tested for significant correlation with expression values or genome-wide location measurements for a particular TF. The correlation-based motif-finding approach was introduced in the "Regulatory Element Detection Using Correlation with Expression" (REDUCE) software [4] using a linear regression framework and has since been adapted in several ways, including the use of motif scores instead of oligo counts [6], probabilistic representations of motifs [7], binary indicators for word occurrences [8] and flexible non-linear regression functions [9].
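The correlation idea described above can be illustrated with a minimal sketch. It is not the REDUCE implementation: it replaces the linear regression framework with a plain Pearson correlation between per-gene oligo counts and a per-gene measurement, and the function names (`count_oligo`, `pearson`, `rank_oligos`) and the toy data are hypothetical.

```python
from itertools import product
from math import sqrt


def count_oligo(seq, oligo):
    """Count (possibly overlapping) occurrences of an exact oligo in seq."""
    k = len(oligo)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == oligo)


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_oligos(upstream, measurement, k):
    """Rank all 4**k oligos by |correlation| of their counts with the measurement.

    upstream:    dict gene -> upstream sequence (all genes, no pre-selected set)
    measurement: dict gene -> expression or binding value (assumed input format)
    """
    genes = sorted(upstream)
    y = [measurement[g] for g in genes]
    scores = {}
    for letters in product("ACGT", repeat=k):
        oligo = "".join(letters)
        x = [count_oligo(upstream[g], oligo) for g in genes]
        scores[oligo] = pearson(x, y)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data (invented for illustration only).
upstream = {
    "g1": "TTGATTCGAAT",
    "g2": "CCGATTCGACC",
    "g3": "ACACACACACA",
    "g4": "TTTTGGGGCCC",
}
measurement = {"g1": 2.0, "g2": 1.8, "g3": -0.1, "g4": 0.1}
ranking = rank_oligos(upstream, measurement, k=3)
```

A real analysis would also assess statistical significance and correct for the number of oligos tested, which this sketch omits.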
An alternative motif-finding strategy, which relies on the availability of complete genomes from related species, searches for putative TFBS in evolutionarily conserved sequences. It has been shown that for closely related species, where reasonable alignment of the orthologous promoter sequences can be achieved, the binding sites for many TFs are evolutionarily conserved. Different computational methods have been developed that vary in the number and diversity of species investigated, in search strategy, i.e., genome-wide (e.g., [11]) versus gene sets (e.g., [13]), in whether they use known transcription factor motifs (e.g., [14]) or predict motifs de novo, in how they integrate inter-species conservation with intra-species conservation (e.g., [16]), in whether alignment of the motif occurrences across species is required (e.g., [17]) and in whether global alignments of orthologous sequences are necessary [18].
In summary, there are numerous motif-finding methods that fall into several different classes, including the correlation-based and sequence-conservation-based methods reviewed above. Because each of these two strategies has been successful on its own, we describe in this work a new method for predicting motifs that combines them.
Due to the variability in TF-DNA interactions, TFBS are characterized by motifs containing degenerate positions. For example, the second position in the consensus TFBS for the yeast transcription factor OPI1 (GRTTCGA) can be A or G, which is denoted by the IUPAC symbol R. At a functional TFBS, the possible substitutions at a position may be observed in aligned sequences from multiple species. For example, an OPI1 functional site may be fully conserved across species (as GATTCGA or GGTTCGA) or exhibit A or G at the second position for different species.
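As a small illustration of degenerate matching, the sketch below tests sequences against the OPI1 consensus GRTTCGA using the standard IUPAC nucleotide codes. The helper names `matches` and `find_sites` are hypothetical and the scanned sequence is invented.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, site):
    """True if every base of site is allowed by the IUPAC symbol at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, site)
    )


def find_sites(motif, seq):
    """Return 0-based start positions where the degenerate motif matches seq."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if matches(motif, seq[i:i + k])]


print(matches("GRTTCGA", "GATTCGA"))  # True: R allows A
print(matches("GRTTCGA", "GGTTCGA"))  # True: R allows G
print(matches("GRTTCGA", "GCTTCGA"))  # False: R does not allow C
print(find_sites("GRTTCGA", "TTGATTCGAAGGTTCGA"))  # [2, 10]
```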
To search for degenerate motifs, we have developed an adaptation of the correlation-based algorithm REDUCE [4] called conservation-REDUCE (c-REDUCE). In c-REDUCE, a multiple species alignment is generated and then translated into a consensus pattern using degenerate nucleotide symbols that capture the variation at each position across species. All oligos, including those with degenerate symbols, are then evaluated for significant correlation. By using multiple species data, we can identify motifs that may be missed by REDUCE, which only examines sequences from a single species and requires exactly the same oligo in different sequences.
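The translation of an alignment into a degenerate consensus can be sketched as follows. This is an illustration under stated assumptions, not the c-REDUCE code: gap characters are simply ignored, an all-gap column is written as N, and the names `column_symbol`, `consensus` and `consensus_oligos` are hypothetical. How c-REDUCE itself treats gaps is not specified in the text above.

```python
# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of observed bases -> IUPAC symbol.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUPAC.items()}


def column_symbol(column):
    """IUPAC symbol covering the bases seen in one alignment column.

    Assumption: gaps and other non-ACGT characters are ignored, and an
    all-gap column is reported as N.
    """
    bases = frozenset(b for b in column.upper() if b in "ACGT")
    return SYMBOL_FOR[bases] if bases else "N"


def consensus(aligned):
    """Degenerate consensus of equal-length aligned sequences (one per species)."""
    return "".join(column_symbol("".join(col)) for col in zip(*aligned))


def consensus_oligos(cons, k):
    """All length-k windows of the consensus, including degenerate ones."""
    return [cons[i:i + k] for i in range(len(cons) - k + 1)]


# Toy alignment of one site in three species (invented for illustration).
aligned = ["TGATTCGAC", "TGGTTCGAC", "TGATTCGAT"]
cons = consensus(aligned)
print(cons)                       # TGRTTCGAY
print(consensus_oligos(cons, 7))  # ['TGRTTCG', 'GRTTCGA', 'RTTCGAY']
```

Only the oligos that actually occur in the consensus, such as GRTTCGA here, would then be counted per gene and tested for correlation.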
An alternative method for identifying degenerate motifs is fast-REDUCE (f-REDUCE) [19], which was developed for single species data and identifies degenerate motifs through an enumerative approach. However, enumeration of degenerate motifs can become very costly as the length of the motif and the number of degenerate positions increase. In contrast, c-REDUCE reduces the search space of degenerate motifs by taking into account the variability at a position inferred from evolutionary information.
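To give a rough sense of how quickly exhaustive enumeration grows, the sketch below counts candidate motifs of length k with at most d degenerate positions. It assumes the full 15-symbol IUPAC alphabet (4 exact bases plus 11 degenerate symbols); the alphabet actually enumerated by f-REDUCE is not stated here, so the numbers are illustrative only, and the function name `n_motifs` is hypothetical.

```python
from math import comb


def n_motifs(k, d):
    """Number of length-k motifs with at most d degenerate positions.

    Assumption: each exact position has 4 choices (A, C, G, T) and each
    degenerate position has 11 choices (the remaining IUPAC symbols).
    """
    return sum(comb(k, j) * 4 ** (k - j) * 11 ** j for j in range(d + 1))


for k in (6, 7, 8):
    # Exact oligos, at most one degenerate position, fully unrestricted.
    print(k, n_motifs(k, 0), n_motifs(k, 1), n_motifs(k, k))
```

With d = k the count equals 15**k by the binomial theorem, whereas a consensus-based search only needs to consider the windows that occur in the aligned upstream sequences.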
In summary, c-REDUCE benefits from the use of conservation in two ways. First, it predicts degenerate motifs, but reduces the search space by focusing only on naturally occurring degeneracies that appear across multiple species. Second, by examining sequences from multiple species, it discounts a chance match of a motif in a single species if the match has a highly degenerate consensus sequence in the multiple species alignment. The degeneracy of the consensus, reflecting random mutations in other species, makes a functional TFBS at that position less likely. We evaluate our method for predicting TFBS motifs on ChIP-chip (chromatin immunoprecipitation on microarray) data in yeast and gene expression data in Drosophila. We find that the conservation and correlation-based approaches perform better in combination than they do individually.