Transcription factors (TFs) are proteins that regulate transcription, the process by which messenger RNA is synthesised from a DNA template. TFs facilitate or inhibit recruitment of the RNA polymerase by binding to DNA, usually near the gene that they regulate. Their binding sites are short nucleotide patterns or “motifs”. Detection of such motifs in DNA sequence is therefore of great practical importance in the study of gene regulation. These motifs are not exact strings: while most binding sites for a given factor resemble a “consensus string” (for example, ACGCGT, the most common binding sequence for the MBP1 protein in budding yeast), mismatches and variations often occur.

An early study of the variability and statistical properties of binding sites was by Berg and von Hippel

[1]. The most popular representation of binding sites is the position weight matrix (PWM)

[2],

[3], which has a convenient visual depiction, the sequence logo

[4]. For a motif of length

, a PWM is a

matrix,

, where

is A, C, G or T, and

is an integer ranging over the length

of the binding sequence.

is the probability of seeing nucleotide

at position

; the sum over

, for each

, is unity. Typically, a PWM is estimated by aligning a large number of known binding sites, and calculating the relative frequencies of each nucleotide at each position. A “pseudocount” is generally added to the raw nucleotide counts, to allow for the limited size of the data. Thus, given

aligned sequences, where the number of nucleotides of type

at column

is

(with

for all

), the weight matrix is given by

where

. We choose

, which corresponds to a “uniform prior” or complete lack of prior bias (formally, a pseudocount is equivalent to assuming a Dirichlet prior: see

Materials and Methods for further discussion). A sequence logo

[4] is a visual representation where the four possible nucleotides are stacked at each position

, one atop the other, with their relative heights proportional to their weights in the

′th PWM column, and the total height proportional to the “information content” of the PWM column, defined as

.

A PWM assumes independence among different “columns” (values of

). As an extreme example, it cannot describe a case where two successive positions contain the nucleotides AA or TT equally often but not AT or TA: a weight matrix will contain 0.5 for each of A and T at each position, and will imply that all four of AA, AT, TA and TT are equally probable. For the most part, such strong correlations are not observed among different nucleotides in binding sites, but it is known

[5]–

[7] that different sites are not completely independent. Nevertheless, Benos

*et al.* [8], argued that the independent approximation is a good one in practice.

A related question is whether the binding

*energy* can be written as a sum of single-nucleotide binding energies. Djordjevic

*et al.* [9] argued that even with the additivity assumption for the binding energy (which they make), the binding probability should be modelled by a Fermi-Dirac function and not a Boltzmann function, while only the latter (which is the rare-binding limit of the former) can justify the PWM model. However, van Nimwegen

*et al.* [10] (supporting text) use a simple maximum-entropy argument to show that the additivity assumption on energy does imply the PWM model for binding sites, if one also makes the reasonable assumption that binding sites have a significantly higher expected binding energy than random sites. Therefore, non-independence of nucleotide distributions in different positions probably implies non-additivity of the binding energy.

Several attempts have been made to go beyond PWMs. A biophysical model was presented by Djordjevic

*et al.* [9], while several authors have considered purely statistical/bioinformatic approaches that take account of correlations (or other forms of binding-site heterogeneity not describable by PWMs) in various ways

[11]–

[14]. Recently, Sharon

*et al.* [14] described a “feature-based” model that enhances the PWM picture with representations of other sequence features, including interdependencies in binding site positions). However, none of these approaches has achieved significant popularity, perhaps because they lack the conceptual simplicity of the PWM.

If the independence assumption is adequate, are nearest-neighbour dinucleotides sufficient? Theoretically, the question is made complicated by the effect of sequence on DNA conformation and bendability, which means that the DNA-protein contact interactions (which, one would expect, are reasonably local) are not the only factor at play. O'Flanagan

*et al.* [15] observe contributions primarily from nearest-neighbour dinucleotides. However, Faiger

*et al.* [16] report that some TATA boxes (binding sites for the TBP) have context-dependent conformations that require one to go beyond nearest-neighbour non-additivity. Sharon

*et al.* [14] consider “features” that are much more complicated than nearest-neighbour dinucleotides. Below (see

Results), we examine binding sites in yeast for several transcription factors, and conclude that dinucleotide correlations are significant in several cases, and occur with gaps of all lengths in a binding region, not just with nearest-neighbours.

In fact, it has been known for many years that DNA, particularly non-coding DNA, exhibits long-range power-law correlations

[17], for reasons that remain unclear. Therefore, such correlations would not be surprising in binding sites.

A notable case where PWMs appear to be severely inadequate is the binding affinity of nucleosomes. Segal

*et al.* [18] used dinucleotide matrices to model nucleosome-binding DNA sequences, but their approach differs significantly from what is described below: notably, they confine themselves to nearest-neighbour dinucleotides. I do not address nucleosomes here, but hope to do so at a future date.

Here I describe a straightforward extension of the PWM method, which reduces to the PWM representation for independent positions. Analogous to a position weight matrix

, which gives the probability of observing each nucleotide

at each position

, let us define

, a

*dinucleotide weight matrix* (DWM) that gives the probability of observing each pair of nucleotides

and

at each pair of positions

and

in a binding site. All pairs of positions are considered: recognising that correlations occur at all scales, we are not restricted to nearest-neighbours (as in

[18]), and don't explicitly search for correlated pairs or features (as in

[14]).

Defining such an object is easy: but the use of

is not as straightforward as using

in predicting binding sites, because dinucleotide probabilities for different pairs of positions are not independent. With PWMs, one is interested in the likelihood

of observing the sequence

given a weight matrix model

; or the log-likelihood ratio

of observing the sequence given

, to observing it given a background model

. These numbers are readily calculated given the PWM and a simple background model: for example, if each nucleotide in the background model is represented by its actual genomic frequency (the model that is actually used throughout this work),

where

is the nucleotide at position

in the sequence, and

is the background probability of

. Meanwhile,

, that is, the product of the weight matrix value for each nucleotide at each position in the sequence. Often, instead of a PWM, a log-odds matrix is used whose entries, when summed, directly yield the log-likelihood ratio (the matrices from yeast ChIP data

[19],

[20], that we use below, are in this format).

No such factorisation is possible for

, the probability of observing a sequence given a dinucleotide model. However, I introduce here a conceptually straightforward approximation. This is a Bayesian estimate of the posterior probability of each nucleotide at each position

, given the neighbouring sequence (ie, all nucleotides within the putative binding region at all positions

. The product of these posterior probabilities, over all nucleotides, is treated as the likelihood of the sequence; and the log-odds is calculated as usual. The formula reduces, as it should, to the PWM value for any position

if nucleotides at other positions are independent of the nucleotide at

. The formula is derived in

Materials and Methods.

There are three complications with this approach, which may account for why such unrestricted DWMs have not been previously used: but the first two are answered here, and I argue that the third is an acceptable price to pay for the increased power.

First, there is the question of how to calculate with joint probabilities, or conditional probabilities, that are not independent. This is answered above; the method should in fact be more widely applicable, and this will be explored in the future.

Second, reliable estimation of

requires availability of many more sequences than estimation of

, because there are only 4 nucleotides but 16 dinucleotides. But this is increasingly less of a problem, since dozens of known binding sites now exist for several factors across different species. In fact, based on the benchmark results below, I argue that this approach would be particularly useful in analysing binding data from high-throughput experiments (ChIP-chip or ChIP-seq): these yield thousands of putative binding sites, of which hundreds may be sufficiently high-confidence for this purpose. Details on how to estimate the DWM are in

Materials and Methods.

Third, a DWM is a much larger object than a PWM: for a binding sequence of length

, a PWM is

-dimensional, while a DWM is

dimensional. The storage required is quadratic in

. This is exacerbated by one of the key observations below: flanking sequence of several nucleotides improves predictions and appears to play a role in determining binding sites, even when only a “core motif” is prominent in a sequence logo. Therefore, though a PWM for eukaryotic factors is typically between 6 and 15 bp long, the DWM here average 30bp in length (the ideal length of the flank is probably factor-specific, and has not been investigated in detail here). A DWM is also harder to visualise: a “sequence logo” cannot capture correlations. While one can consider a representation of “conditional” sequence logos resulting from fixing particular nucleotides, the result would be unwieldy and not very informative. I argue that PWMs and DWMs can live together (just as “consensus” sequence strings continue to be widely used despite the invention of sequence logos). PWMs have their utility as a concise and easy representation of binding motifs, while DWMs offer much better precision in prediction.