Mononucleotide weight matrices
We evaluated the possibility of using the microarray binding data to calculate an accurate binding site weight matrix by comparing the observed Kaapp for each of the 64 triplets on the microarray with the Kaapp values calculated from mononucleotide Kaapp values. The observed Kaapp values were determined from microarray fluorescence intensity data (see Materials and Methods). In order to do this comparison, the mononucleotide Kaapp at each of the three positions of the central triplet was calculated by summing all the individual Kaapp values for sequences containing that nucleotide. For example, for comparing the observed versus calculated Kaapp for the triplet ACG, the Kaapp of A at position 1 was determined from the sum of the Kaapp values of all 16 ANN triplets. Similarly, the Kaapp of C at position 2 was determined from the sum of the Kaapp values of all 16 NCN triplets and the Kaapp of G at position 3 was determined from the sum of the Kaapp values of all 16 NNG triplets. We then determined whether the Kaapp values calculated by multiplying the mononucleotide Kaapp values for the three positions of the triplets (1NN × N2N × NN3) were essentially the same as the observed Kaapp values.
Using a non-pooled t-test adjusted for multiple hypothesis testing using the modified Bonferroni method to evaluate the statistical significance of the 64 observed versus calculated Kaapp values, we determined that there is a statistically significant interdependence between the three mononucleotide positions of the triplet binding site for finger 2. The correlation coefficients between the individual observed Kaapp values and the Kaapp values calculated according to a mononucleotide weight matrix derived from the same protein binding microarray data are shown in Table . The correlation coefficients between the ranks determined from the individual observed Kaapp values and the ranks determined from the Kaapp values calculated according to a mononucleotide weight matrix derived from the same protein binding microarray data are shown in Table . Although the overall correlation coefficients between the sets of observed Kaapp values and the sets of calculated Kaapp values were quite high (between 0.69 and 1.0), the rank correlation coefficients were only between 0.5 and 0.8.
| Table 1.Correlation coefficients between observed Kaapp values and those calculated by a microarray data-based weight matrix 1NN × N2N × NN3 assuming complete positional independence (1*2*3), between observed Kaapp values and those (more ...) |
| Table 2.Correlation coefficients between ranks calculated from the observed Kaapp values and those calculated by a microarray data-based weight matrix 1NN × N2N × NN3 (1*2*3), between ranks calculated from the observed Kaapp values (more ...) |
These analyses indicate that the mononucleotides of transcription factor binding sites do not exert independent effects on binding. There are a number of possible hypotheses for this observation. For example, a substitution of one particular base pair in a transcription factor DNA binding site might alter the packing of a DNA binding domain into the major and/or minor grooves of the DNA. Moreover, a single base pair substitution might also affect the local DNA structure and thus the geometry of various functional groups of the DNA binding site available for interaction with a transcription factor DNA binding domain.
Dinucleotide weight matrices
We evaluated two additional variations for calculating weight matrices using the Kaapp data. These additional variations (12N × NN3 and 1NN × N23) grouped together two of the three nucleotides of the triplet binding site and treated only the remaining nucleotide as independent. For example, for comparing the observed versus calculated Kaapp for the triplet ACG using 12N × NN3, the Kaapp of the dinucleotide AC at positions 1 and 2 was determined from the sum of the Kaapp values of all four ACN triplets and the Kaapp of G at position 3 was determined from the sum of the Kaapp values of all 16 NNG triplets. Likewise, for comparing the observed versus calculated Kaapp for the triplet ACG using 1NN × N23, the Kaapp of A at position 1 was determined from the sum of the Kaapp values of all 16 ANN triplets and the Kaapp of the dinucleotide CG at positions 2 and 3 was determined from the sum of the Kaapp values of all four NCG triplets.
As for the mononucleotide weight matrices, using a non-pooled t-test adjusted for multiple hypothesis testing to evaluate the statistical significance of the 64 observed versus calculated Kaapp values, we determined that there is a statistically significant interdependence between the positions of the triplet binding site for finger 2. The correlation coefficients between the individual observed Kaapp values and the Kaapp values calculated using 12N × NN3 and 1NN × N23 data derived from the same protein binding microarray data are shown in Table . The correlation coefficients between the ranks determined from the individual observed Kaapp values and the ranks determined from the Kaapp values calculated according to a mononucleotide weight matrix derived from the same protein binding microarray data are shown in Table . For four out of five Zif268 variants, one of these dinucleotide frequency calculations gave higher rank correlation coefficients than considering mononucleotide frequencies alone. Only for wild-type Zif268 did considering dinucleotide frequencies yield essentially the same correlation coefficient as did mononucleotide frequencies. However, for the four variants for which considering dinucleotide frequencies gave better approximations of Kaapp values, which dinucleotide yielded higher correlation coefficients varied. For LRHN, RGPD and REDV consideration of the first dinucleotide (12N) frequencies yielded higher rank correlation coefficients, while for KASN consideration of the second dinucleotide (N23) frequencies yielded a higher rank correlation coefficient.
This suggests that for these three Zif variants the mononucleotides of the first dinucleotide as well as the second dinucleotide exert cooperative effects on binding. However, it appears that the extent of cooperativity of various dinucleotides of a binding site is not a constant rule, even for variants of a single transcription factor that differ in just a portion of the DNA binding domain (for the Zif268 variants, only finger 2 of a three finger Cys2His2 zinc finger DNA binding domain was mutated). Consideration of dinucleotide frequencies improved the rank correlation coefficients for most of the Zif268 variants as compared to consideration of mononucleotide frequencies alone and provided fairly good approximations of the rank ordering of the binding sites. Nevertheless, the rank correlation coefficients after consideration of dinucleotide frequencies were still less than 0.8 for four out of five of the Zif268 variants. This indicates that there is some higher order level of nucleotide interdependence in DNA binding sites and further stresses the importance of complete reference tables of all possible binding sites for comparing protein binding preferences for various DNA sequences.
Positive and negative controls
In order to verify that our analyses were capable of detecting mononucleotide, dinucleotide or trinucleotide interdependence, we created four types of positive and negative controls that exhibited these different types of nucleotide interdependence. Mean correlation coefficients resulting from the six types of analyses on the five types of controls, as averaged over 1000 independent sets of each control, are displayed in Web table 1 at http://arep.med.harvard.edu/Bulyk/NAR2001supplementary/. As expected, the different types of analyses resulted in high correlation coefficients when used to analyze the controls designed with the corresponding types of nucleotide interdependence and lower correlation coefficients when used to analyze controls designed with other types of interdependence.
Predictive value of the hidden Markov models
Very rarely have experimental Kaapp data been available for a large number of possible DNA binding sites for a given DNA binding protein. Therefore, we decided to restrict the training set of binding site sequences and test the predictive ability of the zero order and first order HMMs. Our first test used 60% of the sites to construct the model and analyzed the remaining 40%; the second test used 80% of the sites to construct the model and analyzed the remaining 20%. The remaining sites (i.e. either 40% or 20% of the sites, respectively) were strung together to form one contiguous sequence, with 50 random nucleotides inserted between each site. For example, a sequence containing the sites ACG, GGG and TAC would be of the form ACG(N50)GGG(N50)TAC. For each of the five variants, only those sites with observed Kaapp values above background level were used (for wild-type Zif268 and REDV, the top 15; for LRHN, the top 13; for RGPD, the top 17; for KASN, all 64). Eliminating background sites avoided the situation that would occur if the 20% of sites being tested all had background Kaapp values and thus were all identical, making a correlation impossible. The correlation coefficients between the observed Kaapp values and posterior probabilities calculated from zero order and first order HMMs are shown in Table . These analyses indicate that binding site data for even a majority of binding sites with Kaapp values above background can provide a rough approximation of the Kaapp values, but even with data for 80% of the above background sites, the rank correlation coefficients are still somewhat low (no higher than 0.62 for these five Zif268 variants).
| Table 3.Correlation coefficients between observed Kaapp values and posterior probabilities calculated from zero order (complete position independence, based upon mononucleotide weight matrix data) and first order (strict immediate neighbor dependence, based upon (more ...) |
Extrapolation of full-length binding sites from subsets of binding sites
We considered whether 1NN and NN3 microarray data could be combined to accurately calculate the individual 123 microarray data. If this could be done, then only 32 sequences would be required to determine the preferences for 64 different triplet sites: 42 + 42. More significantly, only 144 sequences would be required to determine the preference for 1 048 576 different 10 bp long sites. In general, the number of sequences required would scale linearly with the binding site length according to 16(n – 1), where n is the length of the site in base pairs.
Combining 1NN and NN3 microarray data to infer the individual 123 Kaapp values can be performed using the joint probability function Pr(123) = Pr(12│3) Pr(3). Pr(12│3) can be determined from NN3 microarray data and Pr(3) can be determined from 1NN microarray data. Therefore, we are using one-quarter of the dataset to calculate mononucleotide frequencies for NN3 and a partially overlapping set of that data, also comprising one-quarter of the total dataset, to calculate dinucleotide frequencies for 12N.
In order to test this approximation, we derived 1NN and NN3 data from the available NNN microarray data. In order to calculate Pr(12│3), we divided each of the individual observed Kaapp values of type 123 by the sum of the Kaapp values of type NN3. For example, in order to calculate Pr(AC│G) for wild-type Zif268, the observed Kaapp values for ACG were divided by the sum of the Kaapp values of type NNG for each of the replicates on the microarray. In order to calculate Pr(3), the probabilities of each of the 16 sequences of type 1NN were calculated by dividing the individual observed Kaapp values by the sum of the Kaapp values of all 16 sequences of type 1NN. Pr(3) was then calculated as the sum of the Kaapp values of the four 1N3 sequences. Sixteen triplets for each of the five Zif268 variants were used as test cases for this approximation. For example, Pr(G) for wild-type Zif268 was calculated by first dividing the individual observed Kaapp values of each of the 16 sequences of type TNN by the sum of these Kaapp values and then calculating the sum of the Kaapp values of the four TNG sequences.
These 16 triplets were chosen so as to use 1NN and NN3 data that would most accurately reflect the triplet with the highest average Kaapp for the given variant. For example, TNN and NNT data were derived from the LRHN microarray binding data, so that the probability of T at the third position would be derived from the set of TNN sequences and the probability of TA at the first and second positions of the triplet would be derived from NNT data. We chose to use this methodology, since these triplets will probably include the most data-rich sequences in terms of specific binding by the variant.
The extrapolated
Kaapp values for the 16 triplets calculated in this manner for each of the five Zif268 variants correlated extremely well with the observed
Kaapp values. The rank correlation coefficients between observed
Kaapp values and
Kaapp values for all five variants were all over 0.97 (see Table ). REDV had the lowest rank correlation coefficient of the five variants; it differed only in the ordering of the three binding sites ranked 6, 7 and 8 in the observed versus extrapolated rankings. Thus, if for a particular transcription factor a binding site has already been identified that can be used as a starting sequence from which to vary each dinucleotide, then the optimal full-length binding site for that transcription factor can be extrapolated from its binding preferences for only a small subset of all possible full-length binding sites. In practice, one would perform microarray binding experiments starting at one set of dinucleotides and then, given the results of the extrapolation of those binding experiments, vary the adjacent dinucleotides accordingly. Analysis of Zif268 binding experiments using microarrays spotted with the 16 GCN NAG GCG sequences, in addition to the data we have already gathered using GCG NNN GCG microarrays, would allow one to verify that
12NN and
NN34 Kaapp values can be combined to calculate the
Kaapp values of
1234. If so, this would mean that one could determine the binding site preferences of transcription factors by a series of protein binding microarrays, stepping out 1 nt at a time to determine the optimal full-length binding site. Finally, the
Kdapp of the extrapolated 10 bp site for Zif268 could be compared with the
Kdapp of the 10 bp site used in the co-crystal structure (
24) to determine whether the extrapolated site has a higher binding affinity.
| Table 4.Correlation coefficients between ranks calculated from the observed Kaapp values and Kaapp values calculated by extrapolating from partial microarray data 1NN and NN3 |