Because binding of sequence specific transcription factors to their recognition sites in non-coding DNA is an important step in the control of gene expression, the development of computational methods to identify transcription factor binding motifs in non-coding DNA has received much attention in computational biology [1
]. The low information content of transcription factor binding motifs implies difficulty for computational analyses. For example, given a known binding motif, identification of bona fide examples is always plagued by false positives - the so-called Futility Theorem [1
An even more challenging computational problem is the de novo
identification of transcription factor binding motifs (so-called motif-finding), for which there are many available tools (for tutorials on different methods see [2
] and references therein). Despite the substantial algorithm development effort in this area, most recent comprehensive benchmark studies [4
] revealed that the performance of DNA motif-finders leaves room for improvement in realistic scenarios, where known transcription factor binding sites have been planted in test sequence sets.
One explanation for these observations could be that the low information content of DNA binding sites places limits on this problem as well - an extension of the Futility Theorem [1
] to the de novo
motif-finding problem. This has led to development of a large number of motif finding algorithms that attempt to include additional data in the motif-finding problem to improve the signal to noise ratio. For example, including quantitative high-throughput gene expression or binding measurements [7
], phylogenetic information [11
], transcription factor structural class [15
], nucleosome-positioning information [17
], local sequence composition and GC content [18
], improved background models [19
], or different motif-finding models [21
] have all been shown to improve motif-finding results in practice.
Here we argue that ‘false positive motifs’, i.e., patterns similar to typical biological motifs, may be likely to arise due to the statistical nature of large sequence data sets. In other words, when the dataset is large enough, motifs with strength similar to real transcription factor binding motifs begin to occur by chance. Consistent with this idea, it is frequently observed that DNA motif-finders identify seemingly strong candidate motifs, even when randomly chosen sequences are provided as the input. This issue has been previously recognized [22
] in the so-called “twilight zone search”- a motif-finding scenario where the probability of observing random motifs with higher score than real motifs is non-negligible. It was shown that the detection of false-positives, particularly in the twilight zone, is inevitable [22
The prevalence of such false positive motifs in DNA motif-finding has led to substantial research to assess the statistical significance of motifs. It is important to distinguish three distinct types of research in this area. The first aims to calculate of p-values for matches to a given motif (e.g., [23
]) and is not directly relevant to the motif-finding problem considered here. The second aims to calculate the p-value of a motif itself, which is an ungapped multiple alignment [25
]), while the third concerns the statistical significance of a motif identified through a ‘motif-finding’ procedure (e.g., [22
The second and third types of research are closely related, and were both treated in the seminal work of Hertz & Stormo [27
], which used large deviations theory to approximate the motif distribution, and provided algorithms to approximate the p-value of the ungapped alignment. Recent work has led to highly efficient algorithms based on Fast Fourier transforms (FFT) to compute these p-values [26
] and given a motif of interest (or ungapped multiple alignment) it is now possible to obtain a p-value.
Hertz & Stormo [27
] also proposed a method to assign significance to motifs identified in a motif-finding procedure by assuming that the motif finder can explore the entire space of possible motifs and select the most significant one. The p-value for the identified motif is then ‘corrected’ for the number of possible motifs considered and converted to an E-value that is defined as the expected number of random motifs that would have information context at least as high as the given motif [27
]. Therefore, the false-positive rate is closely related to the motif p-value and can be reduced if an accurate p-value is available. However, the E-value suggested by Hertz & Stormo does not always provide a useful measure of significance particularly in the cases where in it desired to detect weak motifs, i.e. when there is a reasonable chance of finding motifs of similar strength in random sequences [28
In practice, significance of motifs identified through motif-finding is often obtained through simulations (e.g. [21
]) where the motif finder is run on random sequences either drawn from or generated based on the dataset, and significance is assessed using the strength of motifs identified in these random datasets (‘false positives’) as the null distribution. While it is assumed that this distribution can be approximated by Gumbel distribution [29
], it been shown empirically that it fits very well to 3-parameter Gamma distributions [30
] and when significance is assessed using this null-distribution, the false positive rate can be significantly reduced [18
]. While simulation-based methods are very useful to assess the significance of a motif-finding result, they do not provide insight as to how the false-positive rate changes as a function of the motif-finding problem parameters and therefore cannot be used to design experiments to avoid false-positives.
Here, we obtain a remarkably simple analytical relationship between the size of the sequence search space and the strength of the false-positive motifs (we provide a definition for the strength of a motif below). In particular, we use Sanov’s theorem [31
] to derive a bound on the p-value of motifs with a given strength. We then use this to relate the sample size at which less than one false-positive is expected and the strength of the false-positives (when they do occur) to the parameters of the motif-finding problem.
Since we have considered the underlying statistics of the one-occurrence-per-sequence motif-finding problem, our results should apply to any motif-finding method that attempts to solve this problem. We confirmed this with softwares that implement different optimization approaches, MEME [32
] which uses Expectation-Maximization, GIMSAN [34
] and the original Gibbs Sampler [36
] (the latter two being MCMC methods). Interestingly, when we compared the false positives produced by a fourth software, Weeder [39
], which uses combinatorial algorithm based on suffix trees to build the motifs and does not restrict the problem to one-occurrence-per-sequence, we also found similar statistics for the false positives, suggesting that our theoretical analysis may be robust to motif-finding assumptions. Because of the simplicity of the analytic relationship between dataset size and false positive strength, we present simple rules of thumb that we believe will be useful in study design, as well as aid in interpreting the results of de novo