In the analysis described here, a number of motifs were found independently in the upstream regions of several different organisms. As we have done previously (11), we could search for conserved motifs by pooling together upstream sequence from orthologous genes in groups of closely related organisms. We found that this improves our ability to discover conserved motifs (11). However, as the groups that we have constructed in this analysis are relatively large, we were able to find a great number of motifs within sequence from individual organisms.
A large number of motifs were found in our AlignACE calculations, and we must select those that are both biologically relevant and statistically significant. Of the measures we used to assess our motifs, we found the specificity score to be the most useful for selecting biologically relevant motifs (11). The specificity score is a measure of how specific a motif is to the sequence in which it was found. To eliminate motifs that are not statistically significant, we have applied a MAP score cutoff. The MAP score is a measure of the over-representation of the motif within the sequence input to the Gibbs sampling algorithm (20). The MAP score and specificity score cutoffs that we have chosen here are very stringent (11). We have also eliminated those motifs with AT content >80% (11). Many specific and very AT-rich motifs were found in our analysis, yet all known E.coli motifs have AT content <80% (11).
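The three filters described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not the code used in this work: the `Motif` record, the function names and the numeric MAP and specificity cutoffs passed in the usage example are hypothetical placeholders, and we assume that a lower specificity score indicates a more specific motif. Only the 80% AT-content limit is taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Motif:
    name: str
    sites: List[str]      # aligned sites that make up the motif
    map_score: float      # assumed: higher means more over-represented
    specificity: float    # assumed: lower means more specific


def at_content(sites: List[str]) -> float:
    """Fraction of A/T bases over all aligned sites of a motif."""
    bases = "".join(sites).upper()
    if not bases:
        return 0.0
    return sum(b in "AT" for b in bases) / len(bases)


def passes_filters(motif: Motif, min_map: float, max_specificity: float,
                   max_at: float = 0.80) -> bool:
    """Keep a motif only if it clears all three cutoffs.

    min_map and max_specificity are placeholders supplied by the caller;
    max_at = 0.80 mirrors the AT-content limit stated in the text.
    """
    return (motif.map_score >= min_map
            and motif.specificity <= max_specificity
            and at_content(motif.sites) <= max_at)


# Usage with made-up motifs and made-up cutoffs.
motifs = [
    Motif("balanced", ["ACGT", "ACGA"], 20.0, 1e-12),
    Motif("at_rich", ["AATT", "ATAT"], 30.0, 1e-15),   # 100% AT, rejected
    Motif("weak", ["GCGC"], 2.0, 1e-12),               # MAP score too low
]
kept = [m.name for m in motifs
        if passes_filters(m, min_map=10.0, max_specificity=1e-10)]
```

Applying the AT-content test last keeps the cheap score comparisons first, although for lists of this size the order has no practical effect.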
An additional approach for assessing the significance of the motifs found in our analysis is to align randomly selected groups of upstream regions with the AlignACE algorithm and to measure the frequency of finding significant motifs. Hughes et al. have performed such an analysis in the S.cerevisiae genome. They found that, although a small number of specific motifs with high MAP scores are found in the upstream regions of randomly selected ORFs, many more such motifs are found in the upstream regions of groups of related genes. Using the number of high-scoring motifs found in the random runs as an estimate of the background noise, they calculated false positive rates for various cutoffs in MAP score and group specificity score (19). Their highest cutoffs (analogous to the cutoffs that we have used here) had low rates of false positives (<20%). Performing such an analysis on prokaryotic sequences with the cutoffs that we have used here would allow us to calculate the rate of false positives among the motifs that we have obtained.
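The idea of using random runs as a background estimate can be written down as a simple ratio. The sketch below is our own simplification and is not necessarily the exact calculation of Hughes et al.: it assumes that the number of motifs passing the cutoffs per random group estimates the noise expected per real group, and the function name and arguments are hypothetical.

```python
def estimated_false_positive_rate(n_random_hits: int, n_random_groups: int,
                                  n_real_hits: int, n_real_groups: int) -> float:
    """Estimate the fraction of motifs from real groups that are noise.

    Hits are motifs passing the chosen MAP and specificity cutoffs.
    The per-group hit rate in random runs is taken as the background.
    """
    if n_random_groups <= 0 or n_real_groups <= 0:
        raise ValueError("group counts must be positive")
    if n_real_hits == 0:
        return float("nan")  # no motifs passed, so the rate is undefined
    background = n_random_hits / n_random_groups
    observed = n_real_hits / n_real_groups
    # A background above the observed rate is capped at 100% false positives.
    return min(1.0, background / observed)


# Made-up counts: 5 hits in 100 random groups, 50 hits in 100 real groups.
rate = estimated_false_positive_rate(5, 100, 50, 100)
```

Normalizing by the number of groups lets the random and real runs differ in size without biasing the estimate.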
The presence of a significant regulatory motif upstream of the genes comprising a predicted regulon lends additional support to the hypothesis that the regulon is biologically significant. In addition, if only a subset of the genes within a predicted regulon contains a significant regulatory motif in their upstream regions, this information can be used to revise the contents of the predicted regulon. We have used our motif analysis together with our regulon prediction methods to generate a final set of predicted regulons.
Of the three regulon prediction methods that we have compared in this paper, the most powerful is the method based on conserved operons. The method based on protein fusions is essentially a special case of the method based on conserved operons (the two genes involved are fused into a single polypeptide in some organism, rather than being located close to one another in the genome and transcribed into the same mRNA). Even in S.cerevisiae, the method based on conserved operons yields the largest number of predictions and the highest rate of true positives with the lowest rate of false positives. Loosening the operon definition to include divergently transcribed genes increases the recovery of true positives by 10–20%. It is not expected that this percentage could be any higher because only 10–20% of the genes contained in the same known E.coli regulons and KEGG metabolic pathways are divergently transcribed. Using our groups based on conserved operons, we were able to find many more significant upstream regulatory motifs than with the groups from the WIT database (3). This is because our groups were constructed using close homologs rather than strict orthologs; therefore, our groups include a total of four times more genes than the WIT groupings.
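To make the distinction between same-strand and divergently transcribed neighbors concrete, the following sketch classifies a pair of genes by position and strand. It is an illustration only: the `Gene` record, the `pair_orientation` function and the `max_gap` distance of 300 bp are hypothetical and are not the operon definition used in this work.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"


def pair_orientation(a: Gene, b: Gene, max_gap: int = 300) -> str:
    """Classify two genes as same_strand, divergent, convergent or unlinked.

    max_gap is a placeholder limit on the intergenic distance in bp.
    """
    left, right = sorted((a, b), key=lambda g: g.start)
    if right.start - left.end > max_gap:
        return "unlinked"
    if left.strand == right.strand:
        return "same_strand"   # candidate members of one operon
    if left.strand == "-" and right.strand == "+":
        return "divergent"     # transcribed away from a shared upstream region
    return "convergent"        # transcribed toward each other
```

Under the loosened definition described above, pairs labeled `divergent` would be accepted alongside `same_strand` pairs, while `convergent` pairs would still be excluded.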
The least powerful of the three methods is the method based on conserved phylogenetic profiles. The idea behind this method is that the genes comprising an entire pathway are either lost or passed on evolutionarily as a unit. However, this is often not the case. A homolog of an enzyme in a pathway may be present in an organism that no longer contains the pathway if the enzyme has become adapted for another cellular purpose. Conversely, a homolog of an enzyme in a pathway may be absent from an organism that does contain the pathway if a non-homologous enzyme has evolved to supply the missing function. The frequency of such events severely limits the usefulness of this method for regulon prediction.
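The effect of such events on a phylogenetic profile can be seen in a toy example. The sketch below represents each profile as a presence/absence vector and compares profiles by Hamming distance; the genome labels, gene names and the choice of Hamming distance are assumptions made for illustration and do not reproduce the comparison used in this work.

```python
from typing import Iterable, Sequence, Tuple


def profile(present_in: Iterable[str], genomes: Sequence[str]) -> Tuple[int, ...]:
    """Presence (1) or absence (0) of a gene's homolog in each genome."""
    present = set(present_in)
    return tuple(1 if g in present else 0 for g in genomes)


def hamming(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of genomes in which two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical genomes; the pathway is present in g1, g2 and g3 only.
genomes = ["g1", "g2", "g3", "g4", "g5"]
enzyme_a = profile(["g1", "g2", "g3"], genomes)
enzyme_b = profile(["g1", "g2", "g3"], genomes)
# enzyme_c was displaced by a non-homologous enzyme in g2 and was
# retained for another purpose in g4, which lacks the pathway.
enzyme_c = profile(["g1", "g3", "g4"], genomes)

same_pathway_distance = hamming(enzyme_a, enzyme_b)
displaced_distance = hamming(enzyme_a, enzyme_c)
```

Here `enzyme_a` and `enzyme_b` have identical profiles, whereas `enzyme_c` differs in two of five genomes even though all three belong to the same pathway, so a method requiring matching profiles would miss it.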
Summing the three methods together yields the most useful groups, as there are independent predictions from all three methods. An additional improvement would be to optimize a weighted sum of the three methods. Predicted regulons obtained by these three methods can be used to find both known and new upstream regulatory motifs using local alignment programs such as AlignACE. We believe that groups of genes that are predicted to be coregulated by comparative genomics, and that also share a significant upstream regulatory motif, are likely to be coregulated. Experimental testing of some of these predicted regulons and their predicted cis-regulatory motifs is needed. All of the methods described here for predicting regulons and regulatory motifs will rapidly become more powerful as the number of completely sequenced genomes increases.
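The weighted sum suggested above could take a form such as the following. This is a minimal sketch under stated assumptions: the method labels, the evidence values and the weights are all hypothetical, and no optimization of the weights has been carried out here.

```python
from typing import Dict


def combined_score(evidence: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-method evidence for one pair of genes.

    evidence maps a method name to its score for the pair (for example
    1.0 if the method links the two genes and 0.0 otherwise); a method
    missing from evidence contributes nothing.
    """
    return sum(w * evidence.get(method, 0.0) for method, w in weights.items())


# Placeholder weights; in practice they would be fitted against known regulons.
weights = {"operon": 0.6, "fusion": 0.3, "profile": 0.1}
pair_evidence = {"operon": 1.0, "fusion": 0.0, "profile": 1.0}
score = combined_score(pair_evidence, weights)
```

Setting all weights equal recovers the unweighted sum of the three methods.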