Transcription factors are proteins that regulate an organism's genetic program by binding to specific sites in the genome and modifying the expression of nearby genes. Mapping these sites is an important step in understanding transcriptional regulation, and can be significantly facilitated by integrating multiple data sources such as sequence, gene annotations, and phylogenetic conservation [1]. A previously published study [3] reported an initial regulatory map for Saccharomyces cerevisiae by analyzing genome-wide chromatin immunoprecipitation (ChIP) data for 203 proteins. Harbison and co-workers used motif discovery programs in an effort to detect statistically over-represented sequence patterns (motifs) in the bound regions that were likely to correspond to the binding specificity of the immunoprecipitated proteins. Applying six different algorithms, they identified thousands of motifs. After extensive filtering and statistical testing, they reported high-confidence results for sixty-five proteins. They used these high-confidence motifs to identify sites that were in regions bound by the corresponding protein and that were conserved across at least three yeast species. We wished to expand and refine the yeast regulatory map by using a more sophisticated incorporation of phylogenetic conservation information.
Recently, many authors have reported algorithms for motif discovery that use evolutionary conservation. Kellis et al. presented a computational method involving the genome-wide discovery of a catalogue of conserved motifs, which they annotated by searching for overrepresented functional categories among the genes with each motif [4]. Several programs use an expectation maximization (EM)-based search over a probability model of DNA sequence to find conserved motifs. EMnEM [5] and PhyME [6] both incorporate probabilistic evolutionary models into EM-based motif searches. Several other approaches to using conservation information in motif discovery use Gibbs sampling to sample a probability space and search for motifs. CompareProspector is a Gibbs sampling algorithm that uses a pre-computed score to measure the conservation level across windows in sequence alignments, and then biases the motif search to regions that are highly conserved [7]. PhyloGibbs is another conservation-based Gibbs sampling algorithm that leverages conservation by assuming the motif must be present in all species in a conserved region [8]. Recently, another Gibbs sampler was developed to incorporate phylogenetic data by employing two substitution matrices for motif instances and background sites, with the background model estimated from orthologous sequence alignments and the motif model assuming half the branch length of the background model [9]. All these algorithms have been demonstrated, in certain contexts, to outperform similar methods that do not take advantage of conservation information.
Tompa and co-workers [10], who recently assessed a number of motif discovery programs, demonstrated that these algorithms perform much better on synthetic data than on real data. Their results highlight the importance of evaluating algorithms using experimental datasets such as those of Harbison et al. Using motif discovery programs to identify the specificity of proteins from experimental data is particularly challenging because there are many statistically significant motifs in such data, and no guarantee that a motif that corresponds to a factor's specificity will be highly ranked, or even discovered. Harbison et al., who used six separate motif discovery programs, observed that each motif discovery program found the correct motif for at least one protein that was not found by the other methods. However, no single program demonstrated a clear superiority (D. Benjamin Gordon, personal communication). Their analysis provides a useful benchmark for evaluating motif discovery approaches on experimental data.
In this study, we report two improved algorithms for conservation-based motif discovery, Converge and PhyloCon, and we use these methods to reanalyze the data of Harbison et al. Using statistical tests identical to the ones used by Harbison et al., we find that Converge and PhyloCon each identify more correct motifs than were found using the combined results of the six programs employed in the earlier study. The motifs discovered by Converge and PhyloCon are often complementary. Combining these motifs, we were able to significantly expand the map of yeast regulatory sites without the need to alter any of the thresholds for statistical significance. The new map reveals a more elaborate and complex view of the yeast genetic regulatory network than was observed previously. The updated map can be viewed and downloaded from the authors' website [11