Transcription factors regulate gene expression by binding to cis-regulatory elements in DNA. These binding sites typically exhibit characteristic sequence patterns, called ‘motifs’. Experimental discovery of binding sites, however, remains non-trivial. So, although not all motifs are known and motif models are immature, motif matching is widely used to predict binding sites in DNA sequences.
The position weight matrix (PWM) is a common way of representing a motif (1). A PWM has a numeric score for each of the four nucleotide types (rows) at each position in the motif (columns). It can be scanned across a DNA sequence to predict binding sites, as follows. If the PWM has width w, each w-long window of DNA (w-mer) receives a score obtained by summing the scores of the nucleotides at each position. If the w-mer score exceeds some threshold, it is predicted as a binding site.
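The scanning procedure described above can be sketched in a few lines of Python. This is only an illustration: the 3-column matrix `PWM`, the function name `scan`, the example sequence and the threshold of 3.0 are all made up, and the sketch scans a single strand and assumes the sequence contains only A, C, G and T.

```python
from typing import Dict, List, Tuple

# Hypothetical PWM of width 3: one score per nucleotide (row) per position (column).
PWM: Dict[str, List[float]] = {
    "A": [1.2, -0.5, -1.0],
    "C": [-0.8, 1.0, -1.0],
    "G": [-0.8, -0.5, 1.5],
    "T": [-0.3, -0.5, -1.0],
}


def scan(sequence: str, pwm: Dict[str, List[float]],
         threshold: float) -> List[Tuple[int, str, float]]:
    """Return (start, w-mer, score) for every window scoring above the threshold."""
    w = len(pwm["A"])  # motif width
    hits = []
    for start in range(len(sequence) - w + 1):
        wmer = sequence[start:start + w]
        # Sum the score of the observed nucleotide at each motif position.
        score = sum(pwm[base][pos] for pos, base in enumerate(wmer))
        if score > threshold:
            hits.append((start, wmer, score))
    return hits


hits = scan("TTACGAA", PWM, threshold=3.0)
for start, wmer, score in hits:
    print(f"predicted site {wmer} at offset {start} (score {score:.2f})")
```

With these made-up numbers, only the window `ACG` at offset 2 exceeds the threshold. In practice, the choice of threshold matters as much as the matrix itself.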
A PWM is usually created from a position frequency matrix (PFM) via a position probability matrix (PPM). The PFM is constructed from a sample of binding sites, by counting the number of nucleotides of each type (row) at each position (column). The PPM is just a probabilistically normalized form of the PFM, so that each column sums to 1. Finally, the PWM is obtained by dividing each PPM entry by the background probability of the corresponding nucleotide and taking the logarithm. Thus, PWM scores are log-likelihood ratios. Motif matrices are often visualized as sequence logos (e.g. Figure 1A) (2).
Figure 1. Pseudocount evaluations for JASPAR matrix ‘MA0001’ with a sample size of 30. (A) Sequence logo for MA0001. (B) Normalized values of the matrix-based comparison methods: ED = Euclidean Distance; COS = Cosine distance; TVD = Total Variation ...
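The PFM, PPM and PWM construction described above can be illustrated with a minimal Python sketch. The four example sites, the uniform background of 0.25 per nucleotide and the use of base-2 logarithms are assumptions made for illustration, as are the function names.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def pfm_from_sites(sites: List[str]) -> Dict[str, List[int]]:
    """Count each nucleotide (row) at each position (column)."""
    w = len(sites[0])
    pfm = {b: [0] * w for b in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            pfm[base][pos] += 1
    return pfm


def ppm_from_pfm(pfm: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Normalize each column of counts so that it sums to 1."""
    w = len(pfm["A"])
    totals = [sum(pfm[b][j] for b in BASES) for j in range(w)]
    return {b: [pfm[b][j] / totals[j] for j in range(w)] for b in BASES}


def pwm_from_ppm(ppm: Dict[str, List[float]],
                 background: Dict[str, float]) -> Dict[str, List[float]]:
    """Log-likelihood ratio of motif probability to background probability."""
    def llr(p: float, q: float) -> float:
        # A zero probability maps to negative infinity.
        return math.log2(p / q) if p > 0 else float("-inf")
    return {b: [llr(p, background[b]) for p in ppm[b]] for b in BASES}


sites = ["ACG", "ACG", "ATG", "CCG"]       # made-up binding sites
background = {b: 0.25 for b in BASES}      # assumed uniform background

pfm = pfm_from_sites(sites)
ppm = ppm_from_pfm(pfm)
pwm = pwm_from_ppm(ppm, background)
for b in BASES:
    print(b, pfm[b], [round(x, 2) for x in pwm[b]])
```

Because no pseudocounts are added here, any nucleotide that is absent from a column of the sample receives a PWM score of negative infinity.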
Starting from a small sample of binding sites, we will often get zeros in the PFM, leading to zeros in the PPM and negative infinities in the PWM: this is frequently regarded as undesirable (3). The reasoning is that nucleotides that can occur in the motif will often be absent from a small sample by chance, and assigning them a probability of zero is too harsh. It is important to distinguish this motivation from the merely technical issue that it can be troublesome to deal with negative infinity using computers. In any case, it is common practice to add so-called pseudocounts to the PFM, in order to avoid zero probabilities. Pseudocounts can also be motivated by a Bayesian approach: if we assume a Dirichlet prior distribution for nucleotide frequencies, then the posterior mean estimator is equivalent to adding pseudocounts to the observed counts (3).
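The Bayesian motivation can be written out explicitly for a single motif column. The notation is introduced here for illustration and is not taken from the original text: n_b is the observed count of nucleotide b, N is the sum of the four counts, and α_b are the Dirichlet parameters. The observed counts are assumed to follow a multinomial distribution given the nucleotide probabilities.

```latex
\begin{align}
  (p_A, p_C, p_G, p_T) &\sim \mathrm{Dirichlet}(\alpha_A, \alpha_C, \alpha_G, \alpha_T) \\
  (p_A, p_C, p_G, p_T) \mid n &\sim \mathrm{Dirichlet}(\alpha_A + n_A,\ \alpha_C + n_C,\ \alpha_G + n_G,\ \alpha_T + n_T) \\
  \hat{p}_b = \mathrm{E}[\,p_b \mid n\,] &= \frac{n_b + \alpha_b}{N + \sum_{b'} \alpha_{b'}},
  \qquad N = \sum_{b'} n_{b'}
\end{align}
```

The posterior mean therefore has exactly the form of a count-based estimate in which α_b has been added to each observed count. In the symmetric case α_b = c/4, the estimate becomes (n_b + c/4)/(N + c).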
In spite of the ubiquity of pseudocounts, there is no standard way of choosing them. Many different pseudocount values are in use, including 0.01 (4), 1 (5), 1.5 (6), 2 (7), 4 (8) and the square root of the number of binding sites in the sample (9). (In this article, when we speak of a pseudocount of, e.g. 1.5, we mean that 1.5/4 is added to each of the four counts in a column.) Frith et al. chose pseudocounts by fitting a Dirichlet prior to motifs in the TRANSFAC database (6). However, the actual effect of changing pseudocount values has not been studied systematically: detailed evaluations have been made only for protein motifs (10–12). Therefore, more practical analysis is required for DNA motifs.
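To make the convention concrete, the following sketch applies several of the pseudocount values listed above to a small, made-up PFM. The function name `ppm_with_pseudocount` is hypothetical, and treating the square-root choice under the same "divide by four" convention is an assumption of this sketch.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def ppm_with_pseudocount(pfm: Dict[str, List[float]],
                         pseudocount: float) -> Dict[str, List[float]]:
    """Add pseudocount/4 to each count, then normalize each column to sum to 1."""
    w = len(pfm["A"])
    ppm: Dict[str, List[float]] = {b: [] for b in BASES}
    for j in range(w):
        n = sum(pfm[b][j] for b in BASES)  # number of sites in this column
        for b in BASES:
            ppm[b].append((pfm[b][j] + pseudocount / 4) / (n + pseudocount))
    return ppm


# Made-up PFM built from 4 sites; A never occurs at the second position.
pfm = {"A": [3, 0, 0], "C": [1, 3, 0], "G": [0, 0, 4], "T": [0, 1, 0]}
n_sites = 4

choices = {"0.01": 0.01, "1": 1.0, "1.5": 1.5, "2": 2.0, "4": 4.0,
           "sqrt(N)": math.sqrt(n_sites)}
for label, c in choices.items():
    ppm = ppm_with_pseudocount(pfm, c)
    print(f"pseudocount {label:>7}: P(A at position 2) = {ppm['A'][1]:.4f}")
```

The larger the pseudocount relative to the number of sites, the further each column is pulled towards the uniform distribution, so the probability assigned to an unobserved nucleotide rises accordingly.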
Here, we evaluate pseudocount choices using known transcription factor binding sites. The main approach in this study is comparison between an original PPM, which we regard as representing the real sequence specificity of each transcription factor, and a sampled PPM with pseudocounts. The sampled motif matrix is created by stochastic generation from the original PPM. The results show how different pseudocount choices affect the similarity of the original and sampled PPMs. Our results provide guidance on choosing pseudocounts for motif matching.
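The general idea of this comparison can be sketched as follows. This is not the evaluation procedure of the study itself. The "original" PPM is invented, the sample size of 30 is borrowed from the Figure 1 caption, and the Euclidean distance over all matrix cells is only one plausible reading of the ED measure named there. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List

BASES = "ACGT"


def sample_pfm(ppm: Dict[str, List[float]], n_sites: int,
               rng: random.Random) -> Dict[str, List[int]]:
    """Draw n_sites nucleotides per column from the PPM and count them."""
    w = len(ppm["A"])
    pfm = {b: [0] * w for b in BASES}
    for j in range(w):
        weights = [ppm[b][j] for b in BASES]
        for base in rng.choices(BASES, weights=weights, k=n_sites):
            pfm[base][j] += 1
    return pfm


def estimate_ppm(pfm: Dict[str, List[int]],
                 pseudocount: float) -> Dict[str, List[float]]:
    """Add pseudocount/4 to every count and normalize each column."""
    w = len(pfm["A"])
    totals = [sum(pfm[b][j] for b in BASES) for j in range(w)]
    return {b: [(pfm[b][j] + pseudocount / 4) / (totals[j] + pseudocount)
                for j in range(w)] for b in BASES}


def euclidean_distance(p: Dict[str, List[float]],
                       q: Dict[str, List[float]]) -> float:
    """Euclidean distance over all cells of two equally sized matrices."""
    w = len(p["A"])
    return math.sqrt(sum((p[b][j] - q[b][j]) ** 2
                         for b in BASES for j in range(w)))


# Invented "original" PPM; each column sums to 1.
original = {"A": [0.7, 0.0, 0.1], "C": [0.1, 0.9, 0.1],
            "G": [0.1, 0.0, 0.7], "T": [0.1, 0.1, 0.1]}

rng = random.Random(0)          # fixed seed so that runs are repeatable
n_sites, n_trials = 30, 200
samples = [sample_pfm(original, n_sites, rng) for _ in range(n_trials)]

mean_distance = {}
for c in (0.01, 1.0, 4.0):
    distances = [euclidean_distance(original, estimate_ppm(s, c)) for s in samples]
    mean_distance[c] = sum(distances) / n_trials
    print(f"pseudocount {c}: mean Euclidean distance = {mean_distance[c]:.4f}")
```

Reusing the same sampled PFMs for every pseudocount value keeps the comparison paired, so differences in mean distance reflect the pseudocount rather than sampling noise.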