A transcription factor typically interacts with DNA sequences that reflect a common pattern, or motif, characteristic of the factor. Such a motif can be represented by a consensus sequence or, less crudely, by a *W* × 4 matrix *q*, where *W* is the motif’s size in base pairs, and each matrix element *q*(*k*, *X*) is the probability of observing nucleotide *X* (A, C, G or T) at position *k* in the motif. It is then possible to scan this matrix along a DNA sequence, assigning a similarity score to each *W*-long subsequence using a standard log likelihood ratio formula (1). Typically, any subsequence with a similarity score above some threshold is counted as a ‘match’. Unfortunately, these matrices do not contain sufficient information to locate functional *in vivo* binding sites accurately; at thresholds low enough to recover genuine binding sites, spurious matches occur at a high rate (2). It seems that transcription factors must be guided to their *in vivo* binding sites by contextual factors such as chromatin structure and interactions with other transcription factors, in addition to their innate DNA binding preferences. It is widely accepted that knowledge of transcription factor binding motifs is not in itself adequate to elucidate transcriptional control mechanisms. In addition to directly investigating contextual factors, another powerful approach to elucidating regulatory mechanisms is to gather DNA sequences that share a common regulatory property, and search for motifs shared by these sequences.
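As an illustrative sketch of this matrix-scanning scheme (the matrix layout, the uniform background frequencies and all names here are our assumptions, not part of any published implementation), each *W*-long window can be scored by summing log likelihood ratios:

```python
import math

def scan_matrix(seq, q, background=None):
    """Slide a W x 4 probability matrix q along seq, returning the log
    likelihood ratio score of every W-long subsequence.

    q: list of dicts mapping 'A'/'C'/'G'/'T' to the probability of that
       nucleotide at each motif position (one dict per position).
    background: assumed background nucleotide frequencies (uniform here).
    """
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    W = len(q)
    return [sum(math.log(q[k][x] / background[x])
                for k, x in enumerate(seq[i:i + W]))
            for i in range(len(seq) - W + 1)]

# Toy 3 bp motif strongly favouring 'TAT' (invented probabilities):
q = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
     {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
scores = scan_matrix("GGTATGG", q)
best = max(range(len(scores)), key=scores.__getitem__)  # top-scoring window index
```

In practice a pseudocount is usually added to the matrix entries so that zero probabilities do not make the logarithm undefined.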

Two general ways of finding shared motifs can be envisaged. The first is to apply *ab initio* motif discovery algorithms which search for recurring patterns of any kind. The second is to compile a library of all previously characterized motifs and assess whether any of these motifs are statistically over-represented in the sequences. Even though we expect to observe many spurious matches for each motif, it is plausible that if a motif is functionally present in many of the sequences, then the number of matches will be greater than would be expected by chance. The greater generality of *ab initio* methods is a double-edged sword: they can find completely novel motifs not in any precompiled library, but the motifs must be stronger in order to be statistically significant and detectable, as compared with library-based methods. In addition, *ab initio* methods tell us nothing about which factor might bind to a predicted motif, whereas precompiled libraries generally include annotations of which motifs are bound by which factors, or families of factors. Much research effort has been devoted to *ab initio* motif discovery algorithms [see Frith *et al*. (3) for references], but until recently library-based methods have been neglected, despite the promising aspects of this approach.

Several techniques for testing whether a motif is over-represented in a target set of DNA sequences have recently been published (4–9), and it is instructive to draw connections among these methods, as most of them ultimately reduce to the statistics of contingency tables. All of these methods scan the motif matrix across the target sequences and a set of control sequences, recording matches with similarity score greater than some threshold. Liu *et al*. (4) proposed counting the number of target and control sequences with and without a match, and deemed the motif over-represented if matching sequences were at least twice as frequent in the target set as in the control set. While this 2-fold excess criterion is intuitive, a more rigorous test using the hypergeometric distribution is available (8,10). More explicitly, the data can be cast as a 2 × 2 contingency table (Fig. ), where *A* is the number of target sequences with a match, *B* is the number of control sequences with a match, *C* is the number of targets without a match and *D* is the number of controls without a match. A chi-square test or Fisher’s exact test (the hypergeometric distribution) can be used to test the null hypothesis that the sequences with motif matches are evenly distributed among the target and control sets. Elkon *et al*. (7) use a more intricate procedure, counting the number of sequences with zero matches, one match, two matches or three or more matches in the target and control sets. These data can be cast as a 4 × 2 contingency table and tested using a multivariate hypergeometric distribution.
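As a sketch of the 2 × 2 test just described (the function name and example counts are invented for illustration, not taken from any of the cited methods), the one-sided Fisher's exact test can be computed directly from the hypergeometric distribution using only the standard library:

```python
from math import comb

def fisher_exact_greater(A, B, C, D):
    """One-sided Fisher's exact test for the 2 x 2 table [[A, B], [C, D]]:
    the probability, under the hypergeometric null of even distribution,
    that at least A of the matching sequences fall in the target set."""
    n_target = A + C          # total target sequences
    n_match = A + B           # total sequences with a match
    n_total = A + B + C + D
    return sum(comb(n_match, a) * comb(n_total - n_match, n_target - a)
               for a in range(A, min(n_target, n_match) + 1)) / comb(n_total, n_target)

# Hypothetical counts: 8 of 20 targets match versus 2 of 20 controls.
p = fisher_exact_greater(8, 2, 12, 18)  # small p suggests over-representation
```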

The methods described above can only be applied sensibly if all the target and control sequences have the same length, which is not always easy to arrange. In addition, they may lose statistical power by not counting all matches in each sequence. Several publications have suggested counting all matches in the target and control sequences, and two different binomial formulas have been proposed to test for over-representation (5,6,8,9). In fact, these data can also be cast as a 2 × 2 contingency table (Fig. ), where *A* is the number of matches in target sequences, *B* is the number of matches in control sequences, *C* is the number of *W*-long segments in target sequences that do not match and *D* is the number of *W*-long segments in control sequences that do not match. To test the null hypothesis that matches are evenly distributed among the target and control sets, we can imagine randomly drawing *A* + *B* matching segments from a pool of *A* + *B* + *C* + *D* segments of target and control sequences. Equivalently, we can imagine drawing *A* + *C* target segments from a pool of *A* + *B* + *C* + *D* matching and non-matching segments. These two viewpoints lead to the same hypergeometric formula, but to two different binomial approximations of it, which are precisely those described by Sharan *et al*. (8) versus Aerts *et al*. (5), Zheng *et al*. (6) and Haverty *et al*. (9). These methods assume that the occurrence of a match at each *W*-long segment is independent, which is not quite true because the segments overlap one another, and correlations are also introduced by the presence of repetitive elements in DNA. For these reasons, Zheng *et al*. (6) needed to treat palindromic motifs specially, and some of their results were greatly influenced by the presence of repeats.
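The two binomial approximations can be sketched as follows (a toy illustration with invented counts; the cited implementations differ in detail). The first viewpoint distributes the *A* + *B* matches over the two sets; the second asks how many of the *A* + *C* target segments match:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binomial_viewpoints(A, B, C, D):
    """Two binomial approximations to the same hypergeometric test on the
    2 x 2 table of matching / non-matching W-long segments.
    Viewpoint 1: each of the A + B matches lands in the target set with
    probability (A + C) / (A + B + C + D).
    Viewpoint 2: each of the A + C target segments is a match with
    probability (A + B) / (A + B + C + D)."""
    total = A + B + C + D
    p1 = binom_sf(A, A + B, (A + C) / total)
    p2 = binom_sf(A, A + C, (A + B) / total)
    return p1, p2

# Invented counts: 30 of 1000 target segments match versus 10 of 1000 controls.
p1, p2 = binomial_viewpoints(30, 10, 970, 990)
```

The two p-values approximate the same hypergeometric quantity but generally differ, which is why the distinction between the cited methods matters.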

All the previous methods discard potentially useful information by collapsing matrix scores at each location to a binary quantity: above or below the threshold. They also reveal uncertainty regarding whether to count one match per sequence, a few matches per sequence or all matches in each sequence. Regulatory regions of higher eukaryotes often contain multiple binding sites for the same transcription factor, with weaker ‘shadow’ copies of the motif also being observed (11). So consideration of multiple matches per sequence seems likely to help in discovering functional motifs by statistical over-representation. The reason for this site multiplicity is unclear: it might indicate cooperative binding by several factor molecules, it could constitute a mechanism for lateral diffusion of the factor along the DNA and/or the shadow sites might be fossils from the process of binding site turnover (12). Here we report a novel method of combining multiple matches per sequence, which is motivated by a simple thermodynamic model. The matrix score ideally reflects the factor’s binding energy at each location; therefore the score’s exponential should be proportional to the factor’s equilibrium occupancy of that site (1). We suppose that multiple sites simply serve to increase the total occupancy for the sequence, which we estimate by summing the exponentiated matrix score of each site. Finally, we assess whether the estimated total occupancies of the target sequences are greater than would be expected by chance. Thus our method incorporates information from the matrix scores, and combines information from all possible sites per sequence in a biophysically motivated way.
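The occupancy-based combination can be sketched as follows (a minimal illustration assuming the log likelihood ratio scores of each site have already been computed; the scores shown are invented):

```python
import math

def estimated_occupancy(scores):
    """Estimate a sequence's total factor occupancy by summing the
    exponentiated matrix score of every W-long site: each score ideally
    reflects binding energy, so its exponential is proportional to the
    equilibrium occupancy of that site."""
    return sum(math.exp(s) for s in scores)

# A strong site plus weak 'shadow' sites versus a lone moderate site:
target = estimated_occupancy([4.0, 1.5, 1.2])   # invented scores
control = estimated_occupancy([2.5])
# A subsequent step would test whether target occupancies exceed chance expectation.
```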