Protein binding microarray (PBM) technology provides a rapid, high-throughput means of characterizing the in vitro DNA binding specificities of transcription factors (TFs). Using high-density, custom-designed microarrays containing all 10-mer sequence variants, one can obtain comprehensive binding site measurements for any TF, regardless of its structural class or species of origin. Here, we present a protocol for the examination and analysis of TF binding specificities at high resolution using such ‘all 10-mer’ universal PBMs. This procedure involves double-stranding a commercially synthesized DNA oligonucleotide array, binding a TF directly to the double-stranded DNA microarray, and labeling the protein-bound microarray with a fluorophore-conjugated antibody. We describe how to computationally extract the relative binding preferences of the examined TF for all possible contiguous and gapped 8-mers over the full range of affinities, from highest affinity sites to nonspecific sites. Multiple proteins can be tested in parallel in separate chambers on a single microarray, enabling the processing of a dozen or more TFs in a single day.
Cells respond to environmental stimuli, progress through the cell cycle, and adapt to changes in growth conditions by altering the expression of particular genes across the genome. In multicellular organisms, spatial and temporal changes in gene expression throughout development enable the formation of organs and tissues consisting of morphologically and functionally diverse cell types. Gene expression levels are dynamically regulated by transcription factors (TFs) through sequence-specific interactions with genomic DNA. As master regulators of numerous cellular processes, TFs constitute a substantial presence in the gene complement of every organism, accounting for approximately five to ten percent of genes in eukaryotes1–5. These proteins may function as either activators or repressors and may bind alone or in combination near the genes whose expression they control. The binding sites for eukaryotic TFs are themselves typically short (6 to 10 base pairs) and often exhibit considerable degeneracy. In order to globally map TFs to their target genes and understand the regulatory interactions that govern cellular identity and behavior, precise knowledge of the full range of the DNA binding specificities of TFs is necessary. Despite their central importance, however, comprehensive binding site measurements have been obtained for only a small number of TFs. Existing binding data are typically sparse, with only a handful of sites having been experimentally determined for any TF, and they frequently exhibit ascertainment bias according to affinity or simply which binding sites happened to have been identified first. Consequently, predictions of regulatory elements across the genome based on these limited binding data are prone to false positives and false negatives. Further, the binding specificities of the majority of eukaryotic TFs are currently completely unknown.
We have developed protein binding microarray (PBM) technology as a rapid, high-throughput means of characterizing the sequence specificities of DNA-protein interactions in vitro6–9. In contrast to earlier in vitro technologies for examining DNA-protein interactions (see below), which have been time-consuming and not highly scalable, PBMs enable the simultaneous measurement of the relative affinities of a TF for tens of thousands of individual DNA sequences in less than a day. In a typical PBM experiment, a purified, epitope-tagged TF is allowed to bind directly to a double-stranded DNA microarray, and the protein-bound array is labeled with a fluorophore-conjugated antibody specific to the epitope, providing a quantitative readout of the relative amounts of protein bound to each of the probe sequences on the array10. Intrinsic sequence preferences for the TF can be extracted according to the enrichment of these sequences among the brightest probes on the array.
The microarrays themselves can be fabricated in various ways. Microarrays spotted with a limited number of short, double-stranded DNA oligonucleotides were previously used to monitor the relative preference of wildtype and various mutant constructs of the mouse TF Egr1 (Zif268) for 64 variant binding sites7. We first extended the technique to the genome scale by spotting long PCR products representing all intergenic regions of the Saccharomyces cerevisiae genome in order to map the binding sites for a number of structurally diverse TFs from yeast8. For TFs of other organisms, however, yeast intergenic arrays limit the analysis to only those sequences represented in the S. cerevisiae genome, and the resulting data are biased by the frequencies with which those sequences occur on the arrays. Moreover, a given intergenic region can contain multiple binding sites for a given TF, complicating the accurate resolution of the fractional occupancies of separate sites within the lengthy DNA fragments.
Here we describe experimental and data analysis protocols for a universal PBM platform that utilizes synthetic (non-genomic) sequence in order to achieve both the desired versatility and binding site resolution for use in a new generation of PBM assays. We have specially designed our universal PBMs to contain all possible 10-bp sequences in a space- and cost-efficient manner9, 11. As such, they can be used to comprehensively characterize the full range of binding specificity of any TF from any structural family in any species, as long as the TF is capable of binding to sites that have ~12 or fewer informative nucleotide positions. (At this time it is uncertain whether our ‘all 10-mer’ PBM assays can derive the binding specificities of TFs that bind significantly longer DNA binding site motifs.) Custom-designed microarrays are synthesized by Agilent Technologies in an array of single-stranded 60-mer probes, and they are subsequently double-stranded biochemically in a solid phase primer extension reaction prior to protein binding and antibody labeling (Fig. 1). Probe signal intensities from a protein-bound microarray can be deconvoluted to produce a measure of the relative affinity of the TF for all k-mers (i.e., ‘words’, or DNA sequences of length k). Currently available array formats from Agilent enable the physical separation of a single slide into multiple chambers for separate PBM experiments. Consequently, binding data can be rapidly generated for large numbers of TFs, with each individual data set depicting an extremely rich landscape of sequence preferences encompassing both high and low affinity sites.
By providing comprehensive measurements for all possible binding site variants, universal PBMs offer the potential for improved computational methods of TF binding specificity representation and binding site discovery. Traditionally, TF binding specificities have been represented as either IUPAC consensus DNA sequences or mononucleotide position weight matrices (PWMs)12. Both forms are typically based on a limited number of known binding sites, from which the preferences of the TF for all other sequences are approximated. Further, standard mononucleotide PWMs are based on the assumption that all positions within the motif exert additive, independent effects on binding affinity. It has been shown that this is not the case for certain TFs, where the nucleotide preference at one position depends on which particular nucleotide occupies another position13–15. With universal PBMs, however, the binding specificity of a TF is more accurately captured in a look-up table that conveys its relative preference for every individual ‘word’. Nucleotide interdependence information is retained, and both high and low affinity classes of sites are identifiable. Nevertheless, we present here one approach for compactly representing PBM binding data in a PWM that utilizes the unbiased sequence coverage on the array to identify the relative contribution of each nucleotide at each position to the binding specificity.
In addition to providing a biochemical representation of TF-DNA interactions in vitro, PBMs can provide biological insights into the in vivo functions and regulatory roles of TFs. Gene regulation involves the dynamic association and dissociation of TFs and their binding sites in vivo. Consequently, in order to map and fully understand the regulatory interactions that underlie the global patterns of gene expression in an organism, one would need to know which binding sites throughout the genome are utilized in every cellular state and environmental perturbation. Methods to directly measure genome-wide TF occupancy in vivo have proven very useful (see below), but they are often hindered by experimental limitations, and examining every TF under all possible cell types and/or conditions is not feasible, particularly given potentially infinite ‘condition space’. Alternatively, universal PBMs enable the rapid identification of all possible binding site sequence variants in a single experiment. These binding data can be subsequently integrated with global gene expression profiles in order to infer the condition-specific targets and functions of TFs16. The in vitro binding specificities derived from universal PBM experiments show good agreement with preferred in vivo sites, when known17. Given the speed and ease with which these experiments can be performed, in vitro binding data can readily be generated for large numbers of TFs. This is noteworthy considering that TFs number approximately 300 in S. cerevisiae1, 750 in D. melanogaster3, and almost 2,000 in human5. Further, the combinatorial nature of gene regulation in higher eukaryotes necessitates the creation of a large catalog of TF binding sites in order to locate potential regulatory sequences and understand the regulatory relationships that exist.
Several other methods exist for determining the in vitro DNA binding specificities of TFs. Electrophoretic mobility shift assay (EMSA)18, 19, DNase I footprinting20, southwestern blotting21, and surface plasmon resonance22 are predominantly low-throughput approaches for examining a small number of distinct DNA sequences and exhibit different levels of precision. In vitro selection23 has been used to identify larger sets of binding sequences. This process involves an initial in vitro selection from a randomized pool of DNA oligonucleotides, followed by several additional cycles of amplification, selection, and ultimately sequencing. Like universal PBMs, this approach can provide an unbiased collection of permissible DNA binding site sequences; however, in most applications, only the highest-affinity sequences are retained for sequencing. These approaches are not currently suitable for the acquisition of comprehensive binding data for all sequence variants.
PBM technology has been adapted by other groups on a small scale in order to determine the in vitro binding preferences of particular TFs or TF families24, 25. On a larger scale, Ansari and colleagues synthesized a microarray composed of self-annealing hairpin probes covering all 8-mers (one 8-mer per probe), to which they bound small molecules as well as a TF in a PBM-like assay26. These experiments provide information similar to that obtained from universal PBMs, although the greater sequence coverage afforded by our compact combinatorial design permits the recovery of the DNA binding preferences of TFs with longer and/or gapped motifs. Other microarray-based approaches have been developed to determine the biochemical affinity of a TF for its many target sequences. DNA microarrays coupled with surface plasmon resonance have been used to simultaneously monitor the kinetics of binding of the yeast TF Gal4 to 120 double-stranded DNA molecules27. Maerkl and Quake recently designed a microfluidic device which enabled them to measure the equilibrium dissociation constants of 4 TFs for 256 different DNA sequences28. Both of these methods require prior knowledge of a TF’s binding specificity and the design of separate sets of probe sequences to examine different TFs or TF families due to the limited throughput of each technology. Furthermore, we have observed that universal PBM fluorescence signal intensities are generally proportional to relative affinity; however, the precise relationship between signal intensity and absolute affinity is still under investigation.
Methods to monitor the in vivo occupancy of TF binding sites across the genome produce data complementary to those from PBMs. ChIP-chip29–31, or chromatin immunoprecipitation coupled with microarray hybridization, provides a direct measure of in vivo DNA interactions in a given cell type at a given time point and has been successfully used to examine TF binding in numerous organisms for a variety of conditions and tissues32. A separate microarray-based technique, DamID, utilizes a fusion protein between a TF and DNA adenine methyltransferase (Dam) and relies on detection of genomic DNA after digestion with a methylation-sensitive restriction enzyme33. ChIP-Seq34 and ChIP-PET35 both employ high-throughput sequencing as a readout of chromatin immunoprecipitated DNA, which can facilitate the mapping of bound regions to a larger fraction of the genome and at higher resolution than contemporary microarray hybridization36. These high-throughput in vivo approaches, though valuable, do possess certain technical limitations, such as the availability of ChIP-grade antibody and the accessibility of the epitope upon binding to DNA (for ChIP), as well as potentially limiting tissue sources. The interactions identified by these methods may not always correspond to direct protein-DNA contacts but could instead result from indirect association mediated by several intermediate proteins or complexes. Resolution is also limited due to difficulties in reducing the size of DNA fragments (ChIP-chip) or to the spread of methylation (DamID). Finally, these in vivo experiments must be conducted under conditions in which the TF of interest is expressed, nuclear, and actively bound to its target sites. Such conditions are not always known a priori, and TFs typically respond to many conditions and stimuli, such that it is impractical to examine every possible cellular state in order to fully map all functional interactions. 
The in vitro nature of PBMs eliminates many of the technical limitations of in vivo approaches, and PBM experiments for multiple proteins can be completed rapidly in less than a day. Furthermore, we have found the binding specificities derived from universal PBM experiments to be very consistent with known in vivo binding sites for well studied TFs17. While PBMs themselves do not directly identify genomic loci bound by a TF in vivo in a particular cellular condition, PBMs can be used to capture all possible binding sites in a single experiment. These data can then be integrated with genomic sequence, global gene expression profiles, and other data types to infer functional binding site usage in various conditions.
Given the abundance of TFs in the gene complement of every organism, universal PBMs can be used directly for the characterization of the binding specificities of thousands of individual TFs. As of this writing, universal PBMs have been used to interrogate the sequence preferences of TFs from prokaryotic and eukaryotic species, including V. harveyi37, P. falciparum38, S. cerevisiae9, C. elegans9, D. melanogaster (unpublished results, M.L. Bulyk and A.M. Michelson Labs), mouse9, 17, 39, and human9. Moreover, in addition to characterizing each individual protein’s DNA binding specificities, PBMs can be adapted to study heterodimers’ DNA binding specificities (F. De Masi, M.L. Bulyk, unpublished results) and the influence of ligands and protein cofactors on DNA binding40. Alterations in the overall affinity or even intrinsic sequence preferences of a TF could be monitored in the presence and absence of ligand, in combination with multiple dimerization partners, and in multi-protein complexes.
By providing comprehensive measurements for all possible k-mer sequence variants, universal PBMs offer the opportunity to examine the full landscape of TF binding at high resolution. Accordingly, families of TFs can be examined with PBMs to identify subtle differences in the binding profiles of homologous or structurally similar proteins17. One can search for subtle differences among the moderate and low affinity k-mer binding sites for related TFs that otherwise share the same high affinity sites17. Additionally, by examining the binding specificities of a large number of family members, one can begin to assemble a set of recognition rules for a particular TF structural family, in which the preferred binding sites of individual TFs can be predicted based on the amino acid identity at discriminatory residues within the protein17, 41. Synthetic constructs can also be designed with the goal of engineering novel binding specificity onto an existing scaffold and developing artificial TFs42, 43.
PBMs are limited by the amount of sequence that can be represented on a microarray. Space and technological limitations of early PBMs required the use of separate sets of probe sequences tailored to individual TFs or structural families with previously known sequence preferences7, 24, 25. Universal PBMs have largely circumvented this problem by utilizing a maximally compact and cost-efficient design9; however, for TFs with very long motifs due to an extensive network of protein-DNA contacts, it may be difficult to capture the full range of specificity. This is most problematic for prokaryotic TFs, which tend to dimerize and may bind to DNA sequences 20 bp or longer. We have made an effort to regularly sample long k-mers and gapped k-mers in our microarray design, which can help to reconstruct long motifs9. Furthermore, the development of higher density microarrays will enable the coverage of an even greater portion of sequence space. Even so, the construction of a microarray that captures all 12-mers, for example, requires 16-fold more sequence than an array that captures all 10-mers.
Additionally, as discussed above, the in vitro nature of universal PBMs somewhat complicates their use in predicting functional TF binding sites in vivo. Though we have observed good agreement between PBM-derived binding specificities and in vivo binding sites, it is impossible to fully replicate the in vivo nuclear environment on a microarray. Our standardized protocol uses physiological salt conditions (PBS, pH 7.4) as well as a rank-based statistical analysis framework that is quite robust to the TF concentration used in PBMs; however, different TFs may require different biochemical conditions for optimal binding. In addition, certain TFs may require particular post-translational modifications or protein interaction partners for increased affinity and specificity in DNA binding. The success of a PBM experiment also requires proper expression and folding of the TF under consideration, which is of particular concern when the TF is expressed in a heterologous or in vitro system. Consequently, it is difficult to interpret a negative PBM result yielding limited fluorescence intensity. It is also possible that the sequence preferences of an individual TF can be significantly altered by physical interactions with protein co-factors44, 45 (F. De Masi, M.L. Bulyk, unpublished results).
The design of a microarray containing all possible 10-bp sequences in a maximally compact manner has been described previously9, 11 and is beyond the scope of this paper. Briefly, we have utilized a de Bruijn sequence of order 10, in which every 10-mer sequence variant is represented exactly once in an overlapping manner. The de Bruijn sequence is partitioned into shorter sequences 36 nucleotides long that are joined to a common 24-nt primer sequence to become the 60-nt probes on the microarray. Each 36-mer contains 27 overlapping 10-mers. Our particular design ensures that all possible contiguous 8-mers and gapped 8-mers up to 12 total positions occur on at least 16 different probes (32 probes when reverse complements are considered) as shown in Figure 2. Thus, we are able to reliably estimate the relative preference of a TF for 22.3 million gapped and contiguous 8-mers (4^8 sequence variants of 341 patterns up to 8-of-12) based on a large ensemble of probe intensity measurements. The comprehensive coverage of gapped k-mers facilitates the recovery of motifs spanning more than 10 informative positions. Other microarray design strategies are possible; for instance, one may prefer to utilize an array with tiled genomic sequence endogenous to a particular species. The experimental protocols presented here are suitable for PBM experiments performed on any custom-designed Agilent microarray, as long as the appropriate primer sequence for double-stranding is included. We favor our strategy that utilizes de Bruijn sequences because it guarantees uniform and compact coverage of all sequence variants, enabling the examination of any TF from any species in an unbiased fashion.
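The partitioning idea described above can be illustrated with a minimal sketch. It uses a textbook recursive construction of a de Bruijn sequence, which is an assumption made for illustration and not necessarily the construction used for the published arrays; the function names `de_bruijn` and `partition_into_probes` are hypothetical. Consecutive windows overlap by k − 1 bases so that each k-mer of the cyclic sequence lands on exactly one probe. The additional design constraints (gapped 8-mer coverage, the common primer) are not reproduced here.

```python
def de_bruijn(alphabet, n):
    """Return a cyclic de Bruijn sequence of order n over the alphabet.

    Textbook recursive construction (concatenation of Lyndon words);
    chosen for brevity, not taken from the array design papers.
    """
    k = len(alphabet)
    a = [0] * (k * n)
    out = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def partition_into_probes(cyclic_seq, k, probe_len):
    """Cut a cyclic sequence into windows of probe_len that overlap by k - 1.

    Each window holds probe_len - k + 1 distinct k-mers; the final
    window may be shorter than probe_len.
    """
    linear = cyclic_seq + cyclic_seq[:k - 1]   # unroll the cycle
    step = probe_len - k + 1                   # new k-mers per probe
    return [linear[i:i + probe_len] for i in range(0, len(cyclic_seq), step)]


# Small demonstration: order 3, 8-nt windows (6 new 3-mers per window).
seq = de_bruijn("ACGT", 3)
probes = partition_into_probes(seq, 3, 8)
kmers = [p[i:i + 3] for p in probes for i in range(len(p) - 2)]
print(len(seq), len(probes), len(set(kmers)))
```

With k = 10 and 36-nt windows the same step size works out to 36 − 10 + 1 = 27, matching the 27 overlapping 10-mers per probe stated in the text.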
The flexibility of a design based on de Bruijn sequences is also favorable, as higher order de Bruijn sequences can easily be adapted for the future construction of higher density PBMs covering an even greater portion of sequence space, as microarray fabrication technology improves and feature density increases.
The protocol described here specifically refers to PBM experiments performed on arrays synthesized by Agilent Technologies. However, we know of no reason why these experiments would not be successful on other microarray platforms, and we expect such deviations would require only relatively minor modifications to the protocol. We have previously created our own smaller-scale, homemade universal PBMs by spotting 8,192 double-stranded oligonucleotide probes that together cover all possible 9-mers (M.L. Bulyk Lab and T.R. Hughes Lab, unpublished results). Other microarray manufacturers, such as NimbleGen, can accommodate custom designs as well. While the surface chemistries of various microarray slides differ, we have employed the PBM protocol described here on multiple slide types without difficulty.
Agilent offers several formats that enable different degrees of multiplexing. Currently we typically use the “4x44K” format, in which four identical subgrids of approximately 44,000 probes each can be physically separated into four chambers by a specially manufactured coverslip so that four proteins can be simultaneously examined on a single slide. Each chamber contains the entire complement of all possible 10-mers. Other currently available formats contain eight chambers (“8x15K”) or one chamber (“1x244K”) per slide, enabling complete coverage of all 9-mers and all 11-mers, respectively, in each chamber. These numbers are expected to improve as the allowable probe density increases. It should be noted that NimbleGen microarrays can currently accommodate all 12-mers on a single slide. The choice of microarray format depends partly on the number of proteins to be assayed, expectations of the proteins’ DNA binding site lengths, and cost considerations. For instance, eight-chambered universal PBMs containing all 9-mers potentially offer a more economical choice when multiple proteins are to be examined that are expected to have relatively short motifs.
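As a rough consistency check on these formats, the following sketch estimates the minimum number of probes needed to cover all k-mers once. It assumes a 36-nt variable region per probe for every format; the text states this only for the ‘all 10-mer’ design, and the nominal chamber capacities of 15,000 and 244,000 are read off the format names. The helper `min_probes` is hypothetical.

```python
VARIABLE_LENGTH = 36  # nt of variable sequence per probe (stated for the all-10-mer design)


def min_probes(k, variable_length=VARIABLE_LENGTH):
    """Minimum probes needed if every k-mer appears exactly once."""
    kmers_per_probe = variable_length - k + 1
    total_kmers = 4 ** k
    return -(-total_kmers // kmers_per_probe)  # integer ceiling division


# (k, nominal probes per chamber): capacities are approximations, see lead-in.
for k, capacity in [(9, 15_000), (10, 44_000), (11, 244_000)]:
    print(f"all {k}-mers: at least {min_probes(k):,} probes; "
          f"nominal chamber capacity ~{capacity:,}")
```

Under these assumptions each format leaves some room beyond the bare minimum, which a design could use for control probes or replicates.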
DNA-binding proteins can be cloned and expressed by several strategies. We often clone just the DNA-binding domain of a TF, embedded in a modest amount of flanking sequence (often ~15 amino acids N- and C-terminal to the DNA-binding domain). Working with smaller polypeptides increases the ease of cloning and protein production as a practical matter; additionally, full-length proteins may possess additional domains that inhibit DNA binding in the absence of interacting protein co-factors46. For the TFs for which we have performed a direct comparison, DNA-binding domains and full-length proteins have yielded indistinguishable results on PBMs, or else the full-length protein has failed to bind while the domain alone exhibits sequence-specific binding. In contrast, for TFs expected to dimerize (such as helix-loop-helix and leucine zipper proteins), it is necessary to also include known or predicted dimerization domains. Full-length proteins may also be preferable in cases where regions outside of the TF’s DNA-binding domain are expected to confer additional sequence specificity, or if one attempts to assemble heterodimers or protein complexes in vitro on PBMs (F. De Masi, M.L. Bulyk, unpublished results). For ease of maintenance, sequence verification, and transfer into expression vectors for alternate tagging strategies, we typically create a master (donor, or Entry) clone compatible with the GATEWAY®47 or MAGIC17, 48 system. We then express each polypeptide as a fusion with glutathione S-transferase (GST) at the N-terminus. The GST tag can be used for both protein purification and fluorescent labeling of PBMs. Other epitope tags can be used instead, as long as they are compatible with labeling strategies (see below).
Much of our experience is based on expressing fusion proteins in inducible E. coli overexpression cultures, followed by purification using glutathione columns or glutathione-coated beads. This has worked quite well for us; however, other expression systems such as mammalian cell culture could be used, especially if there is an indication that particular post-translational modifications may be required. We have also observed that purification from cellular lysate is not always necessary, as only protein that is tagged with GST will produce signal on a PBM that has been stained with fluorophore-conjugated anti-GST antibody8. Furthermore, proteins can be expressed by coupled in vitro transcription and translation (IVT) reactions using E. coli lysate. Clones expressed in E. coli and by IVT yield proteins exhibiting identical binding specificities on PBMs in our hands17. IVT has the potential to dramatically increase the throughput of protein production for large-scale projects as these reactions can be conducted in parallel in 96-well plates, take less time than growing overexpression cultures, and do not require subsequent protein purification prior to use of the proteins in PBMs. The PBM protocol described here presumes that the desired epitope-tagged protein has already been produced and that its concentration has been accurately estimated by western blot or another method. PBM experiments are advantageous compared to traditional methods, such as EMSA, in that they require very small quantities of protein, typically just a few hundred nanograms per experiment. Proteins may be stored in a standard buffer (we typically use PBS pH 7.4 and Tris-HCl pH 7.0) or as unpurified cellular lysate. We recommend preparing separate aliquots of protein stocks and adding glycerol (final concentration 30%) for long-term storage at −80°C. 
For proteins containing zinc finger domains, zinc acetate should be added to all protein expression, purification, and storage buffers, as indicated in the protocol.
In order to use Agilent single-stranded oligonucleotide arrays in PBM experiments, they must first be double-stranded by a solid phase primer extension reaction. The protocol presented here has been optimized with respect to several parameters, including primer sequence and melting temperature, type of DNA polymerase, fluorescent label conjugated to the nucleotides, concentration of reagents, duration, and temperature. This process involved many experiments in which the incorporation of spiked-in fluorescently labeled nucleotides was monitored for a set of specially designed control probe sequences. However, it is possible that the primer extension procedure may be further improved. For example, it is possible that a shorter primer may be utilized, which would free up additional probe sequence for the inclusion of additional putative binding sites.
These primer extension reactions are quite sensitive to temperature and must be set up rapidly to minimize mis-annealing of primer and improper double-stranding. Consequently, it is important to monitor the fidelity of each primer extension reaction before using a microarray in a protein-binding experiment. This is accomplished by the addition of small quantities of Cy3-conjugated dUTP to the reaction. The Cy3 signal indicates the amount of double-stranded DNA present at each spot and is used as a normalization factor in the final analysis of the PBM (Fig. 3). This signal reflects the number of adenines in the template strand as well as the sequence context of each adenine; of note, the effect of sequence context varies for different fluorescent tags and polymerases. Therefore, after scanning a primer-extended microarray, we fit the observed signal intensities by a linear regression with 64 parameters, corresponding to every possible trinucleotide preceding each adenine in the template sequence, in order to ensure that the DNA is properly double-stranded. (The observed and expected Cy3 intensities should exhibit a correlation of R^2 > 0.7, as shown in Figure 4.) We have observed that runs of 5 or more consecutive guanines are deleterious for primer extension reactions. As a result, we have replaced each probe sequence containing such runs of guanines with its reverse complement.
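The 64 regression parameters can be pictured as one count per trinucleotide context. The sketch below builds such a feature vector for a single template sequence. It assumes that "preceding" means the three bases immediately before each adenine in the sequence as written, and it skips adenines with fewer than three bases before them; both are assumptions for illustration, and the function name `adenine_context_counts` is hypothetical. Fitting the coefficients would then be an ordinary least-squares problem over all probes, which is not shown.

```python
from collections import Counter
from itertools import product

# All 64 trinucleotides in a fixed order: one regression parameter each.
TRINUCS = ["".join(p) for p in product("ACGT", repeat=3)]


def adenine_context_counts(template):
    """Count, for each trinucleotide, how many adenines it immediately precedes.

    Assumption: context is the three bases before the A in the string as
    written; adenines within the first three positions are ignored.
    """
    counts = Counter()
    for i, base in enumerate(template):
        if base == "A" and i >= 3:
            counts[template[i - 3:i]] += 1
    return [counts[t] for t in TRINUCS]


features = adenine_context_counts("CGTACGTAAA")
print({t: c for t, c in zip(TRINUCS, features) if c})
```

Stacking one such vector per probe gives the design matrix whose predicted intensities would be compared against the observed Cy3 signal.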
We have attempted to devise a single protocol that is best suited to the largest number of TFs in a first pass experiment. Our protocol utilizes relatively standard binding conditions (e.g., pH 7.4, 1x PBS buffer, 100 nM protein). After performing numerous PBM experiments, we believe these conditions to be suitable for most TFs. Furthermore, we specifically utilize rank-based statistics to analyze PBM data, under the assumption that the ranking of probes by intensity should be invariant to changes in pH or protein concentration even though their relative differences in signal intensity may vary. Nevertheless, the DNA binding of some TFs may be particularly sensitive to salt concentrations or cofactors; when prior information on preferable alternative conditions is available, those conditions should be used instead. For example, zinc should be included in all reactions and wash buffers involving zinc finger TFs. If a PBM experiment produces faint or background-level signal, it may help to increase the protein concentration, decrease the wash time and stringency, and/or alter the binding conditions.
The protocol described here requires that TFs possess a GST tag so that they can be labeled by an Alexa488-conjugated anti-GST antibody (Sigma). Other tagging and labeling methods can theoretically be employed. We have successfully utilized the maltose binding protein (MBP) tag and the FLAG tag with corresponding fluorescently labeled antibodies in pilot experiments. However, the availability of a commercial fluorophore-conjugated polyclonal anti-GST antibody that results in very bright signal intensity makes GST our tag of choice. Figure 3 shows a close-up portion of a single microarray, scanned with two lasers to detect DNA concentration, represented by Cy3-labeled dUTP, and protein abundance, represented by Alexa 488-labeled anti-GST antibody. Usage of multiple tags and fluorophores may enable a dual-labeling strategy for comparing the binding specificities of homodimers and heterodimers (or for multiplexing independent TFs) on one microarray, as long as their spectra do not overlap with the fluorescent nucleotides or with each other. Alternatively, TFs could potentially be tagged directly with green fluorescent protein (GFP) or another fluorescent molecule in order to eliminate the labeling reaction entirely.
The spot diameter for microarrays manufactured by Agilent is currently ~50 microns, thus requiring a microarray scanner that is capable of 5-micron resolution scans for accurate image quantification. Higher-density microarrays with smaller feature sizes are anticipated, necessitating even higher resolution scans. Detection of Alexa 488 (488 nm excitation/522 nm emission) requires an argon laser, separate from the Cy3 (543 nm ex/570 nm em) and Cy5 (633 nm ex/670 nm em) lasers that are part of most standard microarray scanners (including Agilent’s own scanner). For our scans, we use a ScanArray 5000 (GSI Lumonics) scanner with an external 488 nm argon laser.
We frequently perform PBM experiments in duplicate for each TF. Rather than repeat an experiment on a microarray of the same design, though, we utilize a second microarray with an independent design constructed using a separate de Bruijn sequence of order 10. Our second microarray also contains all possible (non-palindromic) 8-mers spanning up to 12 total positions on 32 probes each. By combining data from separate microarrays of different designs, we effectively double the number of independent measurements made for every 8-mer, thereby increasing the accuracy. Nevertheless, replicate experiments may not always be necessary. There is substantial redundancy built into our combinatorial microarray design, minimizing the importance of any single probe measurement. For TFs expected to possess short motifs (i.e., 7 or fewer informative nucleotide positions), the sequence coverage provided by a single ‘all 10-mer’ microarray should be sufficient to capture its full binding specificity. If the goal of an experiment is to compare the binding profiles of two very similar TFs, this can also be accomplished by performing single experiments on the same microarray design17.
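As a rough illustration of the de Bruijn idea behind these array designs, the following Python sketch generates a cyclic sequence over A, C, G, and T in which every k-mer occurs exactly once. It uses the standard Lyndon-word construction, which is an assumption made for illustration and is not necessarily the construction used for the designs described here; the function name `de_bruijn` is hypothetical.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic de Bruijn sequence of order k over `alphabet`.

    Standard Lyndon-word construction (illustrative only; not necessarily
    the construction used for the published array designs).
    """
    n = len(alphabet)
    a = [0] * (n * k)   # working buffer of symbol indices
    seq = []            # accumulated symbol indices of the final sequence

    def db(t: int, p: int) -> None:
        if t > k:
            # Emit the current Lyndon word when its length divides k.
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


# Small demonstration with k = 3 (an order-10 sequence has 4**10 positions).
demo = de_bruijn("ACGT", 3)
```

In a cyclic sequence of order 10, each contiguous 8-mer occurs 4² = 16 times; counting the reverse complement strand as well gives 32 occurrences for a non-palindromic 8-mer, consistent with the probe counts quoted in this protocol.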
The greatest advantage of universal PBMs, compared to other existing methods for characterizing TF binding specificities, is that binding to all ‘words’ up to a given length k is simultaneously assayed. Consequently, these experiments provide a comprehensive look-up table conveying a precise measure of the preference of a particular TF for every sequence variant (Fig. 5a). There are several methods for scoring individual k-mers based on the distribution of signal intensities observed on the microarray. For instance, k-mers can be scored according to the median signal intensity of the set of probes containing each k-mer, which can be further transformed into a Z-score. These measures are useful because they convey information regarding relative differences in DNA occupancy and affinity. However, we have developed a separate rank-based, non-parametric enrichment score (E-score)9 that we believe is preferable for a larger number of applications. Because the E-score is rank-based, it is robust to differences in protein concentration and other binding conditions in the PBM assay. By putting all experiments on the same scale, it enables TFs to be directly compared and data from replicate PBM experiments on different array designs to be easily combined. Finally, the E-score is robust to differences in sample size (i.e., the number of spots harboring a match to a given k-mer), thus providing a uniform standard for comparing palindromes and non-palindromes and also k-mers of different lengths.
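The two kinds of k-mer score mentioned above can be sketched as follows. This is a simplified illustration, not the published E-score: it computes the median intensity of the probes containing a k-mer (or its reverse complement) and a plain rank-based statistic, the area under the ROC curve minus 0.5, which ranges from −0.5 to 0.5. The function names and the `(sequence, intensity)` input format are assumptions; the E-score as defined in ref. 9 may differ in detail.

```python
from statistics import median

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def score_kmer(probes, kmer):
    """Return (median_intensity, enrichment) for one k-mer.

    probes: list of (sequence, intensity) pairs (hypothetical input format).
    enrichment: AUC - 0.5, a simplified stand-in for the E-score.
    """
    rc = revcomp(kmer)
    foreground, background = [], []
    for seq, intensity in probes:
        target = foreground if (kmer in seq or rc in seq) else background
        target.append(intensity)
    if not foreground or not background:
        raise ValueError("k-mer must match some, but not all, probes")
    # Fraction of (foreground, background) pairs in which the foreground
    # probe is brighter; ties count as half a win.
    wins = sum((f > b) + 0.5 * (f == b) for f in foreground for b in background)
    auc = wins / (len(foreground) * len(background))
    return median(foreground), auc - 0.5
```

Because the enrichment value depends only on the ordering of probe intensities, rescaling all intensities by a constant factor leaves it unchanged, whereas the median changes with the scale.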
Such comprehensive ‘word-by-word’ measurements are valuable because they carry information about nucleotide interdependence as well as both high and low affinity classes of binding sites, information that is not easily captured in a conventional PWM representation. An exhaustive look-up table can also be used in performing genome-wide scans for potential TF binding sites. Yet such a list is cumbersome and provides little intuitive feel for the complete binding specificity of the TF. For this reason, and because most existing software for scanning genomes for TF binding sites utilizes PWMs as input49, we developed the Seed-and-Wobble algorithm9 for PWM construction (Fig. 5b). This approach specifically takes advantage of the unbiased coverage of all k-mers on the array to identify the relative contribution of each base at each position to the binding specificity, and it has proven to be effective at recapitulating the known binding preferences of well-characterized TFs9, 17. By making use of the gapped k-mers present in our combinatorial design, Seed-and-Wobble also facilitates the recovery of both gapped motifs and long motifs with more than 10 informative positions. Additional algorithms, such as RankMotif++50, Prego51, and MatrixREDUCE52, are similar to Seed-and-Wobble in that they use all binding data rather than assigning an arbitrary cutoff, and they can be applied directly to the normalized data from universal PBM experiments as well.
HPLC-purified primer (unmodified) for double-stranding of DNA oligonucleotide array
5′-CAGCACGGACAACGGAACACAGAC-3′ (Integrated DNA Technologies)
High-purity solution dNTPs (GE Healthcare, cat. no. 27203502)
Cy3-conjugated dUTP (GE Healthcare, cat. no. PA53022)
Thermo Sequenase™ Cycle Sequencing kit (USB, cat. no. 78500)
Tween 20 (Sigma, cat. no. P1379)
Triton X-100 (Sigma, cat. no. T9284)
Nonfat dried milk, bovine (Sigma, cat. no. M7409)
Zinc acetate dihydrate, Zn(C2H3O2)2-2H2O (Sigma, cat. no. Z4540)
DNA, single-stranded from salmon testes (Sigma, cat. no. D7656)
Bovine serum albumin (New England Biolabs, cat. no. B9001S)
Anti-glutathione S-transferase, rabbit IgG fraction, Alexa Fluor 488 conjugate (Invitrogen, cat. no. A11131)
Protease, from Streptomyces griseus (5.8 U mg−1; Sigma, cat. no. P6911)
Sodium dodecyl sulfate (Sigma, cat. no. L4390)
EDTA disodium (Sigma, cat. no. E5134)
Sodium chloride, NaCl (Fisher, cat. no. S271-10)
Potassium chloride, KCl (MP Biomedicals, cat. no. 191427)
Sodium phosphate dibasic, Na2HPO4 (Sigma, cat. no. S7907)
Potassium phosphate monobasic, KH2PO4 (Sigma, cat. no. P0662)
Tris base, C4H11NO3 (Fisher, cat. no. BP152–500)
Magnesium chloride, MgCl2 (Sigma, cat. no. M8266)
Custom 4x44K microarray, AMADID #015681 and/or #016060 (Agilent, cat. no. G2514F)
SureHyb chamber (Agilent, cat. no. G2534A)
SureHyb gasket cover slides, 1 array/slide (Agilent, cat. no. G2534-60003)
SureHyb gasket cover slides, 4 array/slide (Agilent, cat. no. G2534-60011)
Vacuum desiccator (Fisher, cat. no. 086425)
Hybridization oven (Fisher, cat. no. 1324710)
Staining dishes (2) and cover (Wheaton Scientific, cat. no. 900303)
Glass staining dish slide rack (Wheaton Scientific, cat. no. 900304)
Magnetic stir plate and stir bars
Benchtop centrifuge with microplate rotor (Fisher, cat. no. 0537548)
Micro slide boxes (VWR, cat. no. 48444-004)
ScanArray® 5000 microarray scanner equipped with argon ion laser (488 nm excitation) and 522 nm emission filter (Perkin Elmer)
GenePix® Pro 6.0 microarray analysis software (Molecular Devices)
Coplin staining jars (VWR, cat. no. 47751792)
Orbital platform shaker
Syringes with BD Luer-Lok Tip (VWR, cat. no. 309603)
0.45 micron syringe filters (VWR, cat. no. 28196114)
Lifter Slip® cover slips for microarray slides (Fisher, cat. no. 22035809)
Dust Off XL canned air (VWR, cat. no. 21899080)
Incubated shaker (New Brunswick Scientific, cat. no. M1352-0004)
Nalgene disposable sterilization filtration units, 0.2 μm filter (Fisher, cat. no. 097401A)
GST-tagged protein: Protein can be expressed in vivo in E. coli cultures, by coupled in vitro transcription and translation, or using other expression systems as described above under “Protein Production Options and Requirements”. Samples may be purified using glutathione beads or columns and eluted in Tris-HCl (pH 7.0) or PBS (pH 7.4), or else cellular lysates containing overexpressed GST-tagged protein may be used directly. Add glycerol to a final concentration of 30%. If the protein contains zinc finger domains, add zinc acetate to a final concentration of 50 μM. Protein stocks should preferably contain at least 500 nM GST-tagged protein; estimate the protein concentration by Western blot, and concentrate if necessary. Prepare separate aliquots prior to freezing for long-term storage at −80°C.
10x Thermo Sequenase reaction buffer: Combine 26 ml 1 M Tris HCl, pH 9.5 and 60 ml sterile water. Dissolve 6.18 g MgCl2, and bring final volume to 100 ml using sterile water. Filter sterilize using a 0.2 μm Nalgene filter. Store at room temperature (20°C to 25°C) for up to 1 year.
10 mM dNTPs: Combine 25 μl each of dATP, dCTP, dGTP, and dTTP (all stock solutions at 100 mM) and 900 μl sterile water. Vortex to mix. The final mixture contains 10 mM total dNTPs (2.5 mM of each dNTP). Store at −20°C.
1x PBS: Add 28 g NaCl, 0.7 g KCl, 5.04 g Na2HPO4, and 0.84 g KH2PO4 to 3 L sterile water. Stir for ~30 minutes on a magnetic stir plate. Add sterile water to bring the final volume to 3.5 L. Adjust the pH to 7.4, and autoclave to sterilize. (Alternately, 1x PBS can be prepared by diluting a stock solution of 10x PBS in sterile water.) Store at room temperature.
4x PBS: Mix 3.2 g NaCl, 0.08 g KCl, 0.58 g Na2HPO4, and 0.096 g KH2PO4 with 100 ml sterile water, adjust the pH to 7.4, and filter sterilize using a 0.2 μm Nalgene filter. Store at room temperature.
10% (vol/vol) Triton X-100: Combine 15 ml Triton X-100 and 135 ml sterile water. Filter sterilize using a 0.2 μm Nalgene filter, and store at room temperature.
20% (vol/vol) Tween 20: Combine 30 ml Tween 20 and 120 ml sterile water. Filter sterilize using a 0.2 μm Nalgene filter, and store at room temperature.
2% (wt/vol) milk blocking solution: Dissolve 0.1 g nonfat dried milk in 5 ml 1x PBS. Allow at least 1 hour for milk to enter solution, rotating gently (25 r.p.m.) on an orbital shaker. This can be set up overnight to save time. Filter solution using a syringe and 0.45 μm filter. Filtered milk can be stored for up to 1 week at 4°C as long as no precipitate forms.
4% (wt/vol) milk blocking solution: Prepare as above (for 2% milk blocking solution), except dissolve 0.1 g nonfat dried milk in 2.5 ml 1x PBS.
500x zinc acetate (25 mM): Dissolve 0.55 g zinc acetate dihydrate (Zn(C2H3O2)2-2H2O) in 100 ml sterile water. Filter sterilize using a 0.2 μm Nalgene filter and split into 1.5 ml aliquots. Store aliquots at −20°C.
100x zinc acetate (5 mM): Combine 200 μl 500x zinc acetate and 800 μl sterile water. Store at −20°C.
PBM wash solution #1: Mix 210 ml PBS and 210 μl 10% Triton X-100. If proteins with zinc fingers are being examined, add 420 μl 500x zinc acetate. Make fresh on the day of the experiment.
PBM wash solution #2: Mix 70 ml PBS and 350 μl 20% Tween 20. If proteins with zinc fingers are being examined, add 140 μl 500x zinc acetate. Make fresh on the day of the experiment.
PBM wash solution #3: Mix 468 ml PBS and 12 ml 20% Tween 20. If proteins with zinc fingers are being examined, add 960 μl 500x zinc acetate. Make fresh on the day of the experiment.
PBM wash solution #4: Mix 560 ml PBS and 1.4 ml 20% Tween 20. If proteins with zinc fingers are being examined, add 1120 μl 500x zinc acetate. Make fresh on the day of the experiment.
PBM stripping solution: Combine 68.6 ml sterile water and 1.4 ml 500 mM EDTA in a beaker and mix on a magnetic stir plate. Add 7.0 g sodium dodecyl sulfate and dissolve. Finally, add 0.05 g Protease from Streptomyces griseus and dissolve. Continue stirring for 10 minutes.
CRITICAL: Protease should be stored as a solid powder at −20°C. This stripping solution must be made fresh immediately before use.
Hydration chamber: Lift out the tip rack of an empty pipette tip box, fill the bottom of the pipette tip box with about half an inch of sterile water, and replace the tip rack. Wipe the inside of the lid and the tip rack with a Kimwipe moistened with 70% ethanol.
|Reagent||Volume (μl) per microarray||Final concentration in mixture|
|Thermo Sequenase Reaction Buffer (10x)||90||1x|
|Primer (100 μM)||10.5||1.17 μM|
|dNTPs (10 mM total)||14.7||163 μM|
|Cy3-dUTP (1 mM)||1.47||1.63 μM|
|Thermo Sequenase Polymerase (4 U μl−1)||8||0.036 U μl−1|
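The final concentrations in the table above follow from simple dilution arithmetic (stock concentration × volume added ÷ total volume). The short sketch below reproduces them, assuming a total reaction volume of 900 μl per microarray as stated in the troubleshooting notes; the helper name `final_conc` is hypothetical.

```python
def final_conc(stock_conc: float, volume_ul: float, total_ul: float = 900.0) -> float:
    """Concentration after dilution, in the same units as stock_conc."""
    return stock_conc * volume_ul / total_ul


# Stock concentrations and volumes are taken from the table above.
primer_uM = final_conc(100.0, 10.5)        # 100 uM primer stock
dntp_uM = final_conc(10_000.0, 14.7)       # 10 mM total dNTPs, expressed in uM
cy3_dutp_uM = final_conc(1_000.0, 1.47)    # 1 mM Cy3-dUTP, expressed in uM
polymerase_U_per_ul = final_conc(4.0, 8.0) # 4 U/ul polymerase stock

print(f"primer {primer_uM:.2f} uM, dNTPs {dntp_uM:.0f} uM, "
      f"Cy3-dUTP {cy3_dutp_uM:.2f} uM, polymerase {polymerase_U_per_ul:.3f} U/ul")
```

The same helper can be used to rescale the recipe if a different total volume is needed.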
|Reagent||Volume (μl) per chamber||Final concentration in mixture|
|Sterile water||varies (to 175 total)|
|Zinc acetate (100x)*||1.75||1x|
|4% milk blocking solution||87.5||2% (w/v) milk|
|BSA (10 mg/ml)||3.5||0.2 mg/ml|
|Salmon testes DNA (53 μg/ml)||1.0||0.3 μg/ml|
|GST-tagged protein||varies||100 nM|
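Because the protein and water volumes in the table above vary with the stock concentration, it can help to compute them explicitly. The sketch below assumes that the rows listed above make up the complete 175-μl mixture and that the protein stock concentration is known in nM; the function name `binding_mix` and its parameters are hypothetical.

```python
def binding_mix(stock_nM: float, zinc: bool = False,
                total_ul: float = 175.0, target_nM: float = 100.0) -> dict:
    """Volumes (ul) per chamber for the protein binding mixture.

    Assumes the table rows are the full recipe (an assumption of this sketch).
    """
    protein_ul = target_nM * total_ul / stock_nM
    volumes = {
        "4% milk blocking solution": 87.5,
        "BSA (10 mg/ml)": 3.5,
        "salmon testes DNA": 1.0,
    }
    if zinc:
        # Only included for zinc finger proteins.
        volumes["zinc acetate (100x)"] = 1.75
    water_ul = total_ul - protein_ul - sum(volumes.values())
    if water_ul < 0:
        raise ValueError("protein stock is too dilute for this total volume")
    volumes["GST-tagged protein"] = protein_ul
    volumes["sterile water"] = water_ul
    return volumes
```

For example, a 500 nM stock requires 35 μl of protein per chamber to reach 100 nM, and the function raises an error when the stock is too dilute to fit within 175 μl, in which case the protein should be concentrated first.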
|Reagent||Volume (μl) per microarray|
|2% milk blocking solution||778.4 (or 780)*|
|Alexa 488-conjugated anti-GST (Invitrogen, A11131)||20|
|Zinc acetate (500x)*||1.6 (or 0)*|
Protein binding microarray experiments are very rapid. Double-stranding and protein binding reactions can be performed either on the same day or on different days. 2–3 PBM slides can be processed in parallel for both stages. When performing a series of PBM experiments, much of the data normalization and sequence analysis for the first set of PBMs can be completed during the long incubation steps of the next set of experiments.
Steps 1–13, double-stranding Agilent microarrays: 3 h
Steps 14–38, protein binding and antibody staining of protein-bound arrays: 5 h
Steps 39–45, protease digestion: overnight incubation, followed by 1 h of washes and scanning
Steps 46–53, image analysis and data normalization: 1–3 h
Steps 54–63, sequence analysis: 1–2 h, using the software we provide at the Bulyk Lab website
A 900-μl primer extension reaction mixture should completely fill the volume of the SureHyb gasket cover slide. We routinely re-use these cover slides 20 or more times. However, if significant leakage of liquid occurs or if a seal does not properly form between the cover slide and microarray, it may be necessary to replace the cover slide.
It is important to execute this step rapidly to avoid a significant drop in temperature. If the reagents are not maintained at close to 85°C, improper double-stranding may occur due to primer mis-annealing and/or formation of secondary structures in the template strand. This will be reflected in the quality of the fit (R²) between the observed and expected Cy3 probe intensities in Step 50.
Due to the hydrophobic surface properties of Agilent slides, the microarray(s) should be mostly dry after removal from 1x PBS. If there are any droplets remaining, these can leave tracks behind during the centrifugation in Step 12. Excess liquid can be removed by dabbing the edges and back of the microarray with a Kimwipe. If the printed area of the microarray is still noticeably wet, rinse the microarray again in 1x PBS and remove it slowly over the course of approximately 10 seconds, tilted slightly face-down.
If the signal is uneven, the washes may need to be performed more vigorously. If there are speckles and dust particles visible in the scan, make sure that all containers and vessels used to store and prepare the wash solutions are cleaned thoroughly. Wash solutions can also be filtered prior to use.
The overall fluorescence intensity should be very bright if this protocol is followed as written. If for some reason the spots are barely visible at the highest laser power settings, possible improvements include using more Thermo Sequenase polymerase, more Cy3-labeled dUTP, and/or less unlabeled dNTP. (However, if the ratio of labeled dUTP to unlabeled dTTP exceeds ~5%, the Cy3 conjugate may significantly interfere with TF-DNA binding.) Take precautions to store all fluorescent materials in the dark to avoid photobleaching. It is also advisable to double-check that the proper laser and filter settings are being used by the microarray scanner.
If the staining dish is not kept covered (or if it is not thoroughly rinsed before use), dust or other particles may enter the wash solution. This can lead to speckles interfering with particular probe measurements during the scanning and image analysis.
Spillover between adjacent chambers may occur if the microarray is not dry after the wash in 1x PBS in Step 23. (Excess liquid can be removed by dabbing the edges and back of the microarray with a Kimwipe after Step 23.) A 175-μl protein binding mixture should just barely fill the volume of the gasket cover slide without leakage; however, the volume of the binding mixture can be reduced even further if spillover becomes a problem. The steel hybridization apparatus should be assembled and tightened quickly in order for the protein mixture to spread out throughout each chamber in the cover slide and for a seal to form. (If this occurs too slowly, the signal within a chamber may not be perfectly uniform.) It is important to check for bubbles after assembling the hybridization chamber. If bubbles are not moved to the side, the affected probes will have to be flagged and removed from the analysis.
As in Step 24, drying the microarray prevents spillover between adjacent chambers. If the microarray and cover slip are not assembled quickly enough, the center of each subgrid may appear brighter than the margins due to the uneven spread of fluorescently labeled antibody throughout the chamber. As before, if bubbles are not moved to the side, the probes on the corresponding area of the slide will exhibit little to no signal intensity.
As in Step 11, the hydrophobic surface properties of Agilent slides should leave the microarray mostly dry after removal from 1x PBS. If there are any droplets remaining, these can leave tracks behind during centrifugation. Excess liquid can be removed by dabbing the edges and back of the microarray with a Kimwipe.
A successful PBM experiment will exhibit a broad range of signal intensities, with the brightest probes being visible at moderate laser power settings (50–75% laser power). If all probes are faint at even the highest laser power settings, this likely reflects a problem with the PBM experiment and may present further problems in the subsequent motif discovery steps. The experiment may have failed due to misfolded protein, improper binding buffer conditions, or the absence of required protein co-factors or post-translational modifications. These problems can only be addressed by altering the conditions for protein expression and/or protein binding. However, it is possible that the protein does bind DNA sequence-specifically but with low affinity or with a fast dissociation rate. In this case, the signal can be increased by repeating the PBM experiment with a higher protein concentration, a higher antibody concentration, and shorter wash times.
If problems continue, we suggest attempting a new PBM experiment with the S. cerevisiae TF Cbf1. We have found this protein to be easily expressed in and purified from E. coli and robust in our protocols for protein binding experiments. The resulting scan should exhibit a broad range in probe signal intensities, with a modest number of extremely bright probes. Sequence-verified full-length S. cerevisiae CBF1 cloned into the Gateway® Entry vector pDONR201 is available (Cbf1 pDONR201, CloneID ScCD00009385) via the PlasmID repository at http://plasmid.med.harvard.edu.
Some proteins may exhibit a high degree of non-specific binding to single-stranded DNA. In such cases, the Agilent control probes, which are not double-stranded by primer extension, may be among the brightest spots on the microarray. Therefore, it is important to always filter out these spots prior to sequence analysis.
The observed and expected Cy3 signal intensities should always exhibit a reasonably high correlation (R² > 0.7). If instead R² ≈ 0, check to make sure that the GAL file contains the correct information for the microarray design that was used and that it was correctly aligned to the grid of spots in GenePix Pro. Probes that are problematic during primer extension will exhibit Cy3 signal intensities much lower than expected. (We had originally observed this for template strands containing long runs of guanine. Consequently, all probe sequences with five or more consecutive guanines have since been replaced in our Agilent array designs by their reverse complements.)
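A quick way to apply this check is to compute the squared Pearson correlation between the observed and expected Cy3 intensities and compare it with the 0.7 guideline. The sketch below is a minimal, dependency-free version; how the expected intensities are obtained is defined in Step 50 and is simply taken as an input here, and the function name `r_squared` is hypothetical.

```python
def r_squared(observed, expected):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    if n != len(expected) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_o = sum(observed) / n
    mean_e = sum(expected) / n
    sxy = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, expected))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((e - mean_e) ** 2 for e in expected)
    if sxx == 0 or syy == 0:
        raise ValueError("intensities must not be constant")
    return (sxy * sxy) / (sxx * syy)


def cy3_fit_ok(observed, expected, threshold=0.7):
    """True if the fit meets the R^2 guideline quoted in the text."""
    return r_squared(observed, expected) > threshold
```

A value near zero suggests a mismatch between the GAL file and the array design, or a misaligned grid, rather than a subtle biochemical problem.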
Occasionally, the method for PWM construction outlined in Steps 56–61 may fail for TFs with exceptionally long motifs. This is particularly problematic for prokaryotic TFs, which frequently dimerize and bind to DNA sequences as long as 20 bp. The failure arises because the most significant gapped 8-mer may occur in an unfavorable sequence context in the majority of its ~32 occurrences. In such cases, it may be possible to recover a specific PWM using a conventional motif finder by taking the sequences from the top N brightest spots as input8. This is not an optimal approach as it requires setting an arbitrary threshold above which all sequences are treated equally; however, it can occasionally lead to the successful recovery of the appropriate motif when the method outlined here fails. For example, MultiFinder integrates several previously developed motif discovery algorithms and can be used for this purpose55.
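Selecting the top N brightest spots for a conventional motif finder can be done with a few lines of code. The sketch below writes the selected probe sequences in FASTA format, which most motif finders accept; the `(probe_id, sequence, intensity)` input format and the function name are assumptions, and the choice of N remains arbitrary, as noted above.

```python
def top_probes_fasta(probes, n):
    """Return the n brightest probes as a FASTA-formatted string.

    probes: iterable of (probe_id, sequence, intensity) tuples
    (a hypothetical format; adapt to the actual normalized data file).
    """
    ranked = sorted(probes, key=lambda p: p[2], reverse=True)[:n]
    return "".join(
        f">{probe_id} intensity={intensity:g}\n{sequence}\n"
        for probe_id, sequence, intensity in ranked
    )
```

Control probes and flagged spots should be removed before ranking so that they do not end up among the selected sequences.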
Figure 3b shows a portion of a scan from a representative PBM experiment. All probes are usually visible above background fluorescence levels (i.e., between spots, where there is no DNA), but there is often a broad range in probe signal intensities. The majority of probes are typically relatively faint with similar signal intensities, corresponding to non-specific binding of protein. The remaining probes show evidence of specific binding, often with a small fraction of them exhibiting very high intensities. These probes contain the highest affinity binding sites. PBMs exhibiting such a broad distribution of signal intensities nearly always produce high-quality binding data and very high k-mer E-scores (i.e., E ≥ 0.45). Furthermore, it is sometimes the case that PBM data with seemingly uniform distributions of probe intensities will produce significant E-scores and PWMs with high information content as well. Since our scoring method is based on rank-order statistics, it is the relative ordering of probes and not the magnitude of their signal intensity differences that determines the degree of enrichment of a particular k-mer or motif. Consequently, it is always necessary to conduct a full analysis of each experiment before concluding that there was no sequence-specific binding. Occasionally a PBM experiment will fail to produce a significant motif, either because the Alexa 488 signal intensity (i.e., that attributable to protein binding) is too faint or because all probes appear to exhibit the same degree of (non-specific) binding. As described above, it is difficult to interpret a negative result since it could be due to misfolded protein, improper binding buffer conditions, or the absence of required protein co-factors or post-translational modifications. For many of these cases, it may be necessary to repeat the experiment under different conditions to achieve the desired results. 
Nevertheless, in large-scale screens that we have conducted, we have observed a success rate between 40 and 50% for proteins produced in E. coli or by coupled in vitro transcription and translation and tested in a single pass at 100 nM in the standard binding conditions described here.
The success of a PBM experiment can be estimated qualitatively by the overall distribution of Alexa 488 signal intensities observed in the scan. However, the quality of the binding data can only truly be judged by examining the k-mer E-scores derived from the preceding analysis. One indicator of a successful experiment is the occurrence of many k-mers with high E-scores. Our criterion for concluding that a protein exhibits specific binding is the observation of at least one 8-mer with an E-score > 0.45; however, most high quality experiments produce a maximum E-score > 0.49. In a survey of 168 mouse homeodomain TFs, we found, on average per TF, 146 contiguous 8-mers with E > 0.45 and 15 with E > 0.49 (ref 17). A second indicator of a successful experiment is that most of the top-scoring k-mers resemble each other and are easily aligned. The motifs of sequence-specific TFs typically tolerate degeneracies at some nucleotide positions of their binding sites. Consequently, the presence of high-scoring 8-mers that contain single mismatches or offsets with respect to each other bolsters the confidence that these 8-mers represent true TF binding sites, especially considering that each 8-mer score is based on measurements from an independent set of 32 probes.
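The criteria in this paragraph can be applied mechanically once the 8-mer E-scores are available. The sketch below summarizes an experiment using the thresholds quoted above (0.45 and 0.49); the dictionary input format and the function name are assumptions, and the second indicator (mutual similarity of the top-scoring k-mers) still requires inspection of the sequences themselves.

```python
def summarize_escores(escores, specific_cutoff=0.45, high_quality_cutoff=0.49):
    """Summarize 8-mer E-scores against the thresholds quoted in the text.

    escores: dict mapping 8-mer -> E-score (hypothetical input format).
    """
    values = list(escores.values())
    if not values:
        raise ValueError("no E-scores supplied")
    top = max(values)
    return {
        "max_escore": top,
        "n_above_specific": sum(v > specific_cutoff for v in values),
        "n_above_high_quality": sum(v > high_quality_cutoff for v in values),
        # Criterion from the text: at least one 8-mer with E > 0.45.
        "specific_binding": top > specific_cutoff,
    }
```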
It is often informative to compute the statistical significance of a particular E-score in a PBM experiment. We have calculated the distribution of 8-mer E-scores from negative control experiments performed using free GST (rather than GST-tagged TF) and used these to estimate the false discovery rates at various E-score thresholds (data not shown). Depending on the TF and the number of 8-mers surpassing each threshold, a false discovery rate of 0.01 typically corresponds to E-scores of approximately 0.32 to 0.36. Calculating significance in this manner enables us to determine the total number of likely true positive binding site sequences for a given TF.
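One common way to turn a negative control distribution into a false discovery rate is sketched below. It is an assumption made for illustration and not necessarily the calculation used for the estimates quoted above: the fraction of control E-scores at or above a threshold is scaled to the number of 8-mers scored for the TF and divided by the number of TF 8-mers passing the same threshold. The function name is hypothetical.

```python
def estimate_fdr(tf_scores, control_scores, threshold):
    """Empirical FDR estimate at an E-score threshold.

    tf_scores: E-scores from the TF experiment.
    control_scores: E-scores from a negative control (e.g., free GST).
    """
    if not control_scores:
        raise ValueError("control scores are required")
    called = sum(s >= threshold for s in tf_scores)
    if called == 0:
        return 0.0  # nothing is called at this threshold
    null_rate = sum(s >= threshold for s in control_scores) / len(control_scores)
    expected_false = null_rate * len(tf_scores)
    return min(1.0, expected_false / called)
```

Scanning a range of thresholds with this function gives the lowest E-score at which the estimated rate falls below a chosen level such as 0.01.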
For PBM experiments performed with the same protein on the same ‘all 10-mer’ microarray design, we observe highly consistent 8-mer E-scores. As shown in Figure 8, the correlation among 8-mer E-scores is also high for experiments performed on different microarray designs9. Furthermore, the combined data (from averaging across separate arrays) are often more accurate because they are based on twice as many independent measurements9. (For TFs with short motifs (i.e., 7 or fewer informative nucleotide positions), the benefits of replicate experiments with multiple microarray designs are reduced because a single experiment is typically sufficient.) This increase in accuracy can be understood by considering the sources of variability in probe signal intensity. The same k-mer may lead to somewhat different signal intensities on different spots owing to its orientation and position on the probe relative to the slide surface9. Additionally, two probes with the same k-mer may exhibit different signal intensities due to different flanking sequences, both proximal (which may influence binding affinity to the k-mer) and distal (which may contain additional binding sites of various affinities). For these reasons, our k-mer scoring method relies on multiple measurements from a large ensemble of spots (at least 32 spots for each non-palindromic 8-mer, and at least 16 spots for each palindromic 8-mer). Nevertheless, in a given array design, a particular k-mer may frequently occur close to (or far from) the slide surface or may happen to fall on the same probe as a strong binding site more times than expected by chance. By doubling the number of independent measurements, we further minimize these sources of variation. This has the greatest impact on k-mers with E-scores near 0. The artificially high correlation across the entire range of E-scores in Figure 8a can be explained by systematic effects that are fixed within a single array design. Figure 8b shows that E-scores < 0.2 are in the realm of noise but that higher E-scores are very consistent across separate array designs.
Occasionally, the correlation in the E-score scatter plot for a pair of PBM experiments may not be as strong as in Figure 8. For example, one experiment may produce significantly fewer E-scores above any given threshold. This is indicative of a noisy data set and can usually be detected in the scanned image itself. In such cases, it is preferable to rely on data from a single array rather than average a high-quality data set with a noisy data set.
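Combining replicate experiments by averaging, as described above, can be sketched as follows. The example assumes each experiment is represented as a dictionary mapping 8-mers to E-scores, with k-mers already written in a shared orientation; the function name is hypothetical. Only k-mers present in both data sets are averaged.

```python
def average_escores(array_a, array_b):
    """Average E-scores for k-mers measured on both array designs.

    array_a, array_b: dicts mapping k-mer -> E-score (hypothetical format;
    keys are assumed to use the same strand orientation in both dicts).
    """
    shared = array_a.keys() & array_b.keys()
    return {kmer: (array_a[kmer] + array_b[kmer]) / 2.0 for kmer in shared}
```

If one of the two data sets is clearly noisy, the averaging step should simply be skipped and the higher-quality data set used on its own.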
The k-mer binding profiles and PWMs derived from universal PBM experiments are typically very consistent with TF binding data obtained by other in vitro approaches. Databases such as TRANSFAC56 and JASPAR57 contain hundreds of matrices constructed from existing binding data. (TRANSFAC tends to be more inclusive, while JASPAR is manually curated and limited to a smaller number of TFs with high-confidence data.) Our PBM data nearly always agree with the corresponding entries in these databases at a coarse level, especially JASPAR. Slight discrepancies are not surprising, especially given that the database entries often exhibit ascertainment bias reflecting which particular sequences were chosen to be examined by the investigators. Furthermore, single PWMs in TRANSFAC frequently are derived from binding sequence data compiled from multiple experimental methods. In contrast, universal PBMs provide a uniform, unbiased platform for identifying comprehensive TF binding profiles. Large discrepancies between PBMs and existing data may also occasionally be observed, but this is also not surprising given that data in TRANSFAC and JASPAR for identical proteins are not always in agreement with each other17. This illustrates that motifs in databases and the literature cannot all be taken as a gold standard. Furthermore, even when PBM data do agree with existing binding data, the PBM data provide a richness and level of detail absent from these databases, which typically only contain a handful of sequences.
Comparisons can also be made to in vivo binding data generated by alternate methods such as ChIP-chip8. There are many reasons why in vitro PBM data might not agree with established in vivo binding sites, several of which are discussed in the Introduction. TFs may require specific co-factors or post-translational modifications for optimal DNA binding. Furthermore, ligand-binding, heterodimeric protein interactions, and associations with other proteins in vivo can modulate the binding specificity of a TF through structural changes40. Nevertheless, we have observed data from our own PBM experiments to be very consistent with sites known to be bound in vivo8, 17.
The analysis method described here produces two distinct representations of the DNA binding specificity of a TF: an exhaustive table of the relative preferences for all k-mers, and a mononucleotide position weight matrix (PWM) (Fig. 5). Each representation carries its own set of advantages, and each is suitable for a variety of applications.
The ability to generate a comprehensive list of the relative preferences of a TF for all possible k-mers is one of the most important features of universal PBMs. This offers the opportunity to examine the full landscape of TF binding, including moderate and low affinity sequences. Additionally, it provides a high-resolution picture of protein-DNA interactions by conveying information about nucleotide interdependencies. Independent measurements of DNA binding affinity constants are consistent with k-mer median signal intensities and E-scores derived from PBMs, including for TFs and k-mers exhibiting nucleotide interdependence9. Complete k-mer binding profiles also enable the detailed comparison of the binding specificities of structurally similar TFs that otherwise share the same overall motif. For example, Figure 9 shows a comparison of the 8-mer E-scores for two related mouse TFs, Lhx2 and Lhx4. Though these TFs exhibit identical motifs and bind the same highest affinity 8-mers, they differ significantly in their preferred lower-affinity binding sites17.
Nevertheless, PWMs have proven to be a reliable, useful method for binding site representation. In their compactness, they present a much more intuitive picture of a TF’s binding specificity than a lengthy list of individual k-mer scores. For TFs that make extensive contacts with DNA, the PWMs derived from universal PBMs are particularly useful because they can be substantially longer than 8 base pairs, owing to the incorporation of information from many gapped k-mer patterns. (By considering different gapped patterns as candidate seeds, the resulting PWM will be anchored on the 8 most informative positions within the motif.) Finally, most existing software for searching for genomic occurrences of TF binding sites is designed to take PWMs as input12. Such analyses enable the prediction of direct regulatory targets of individual TFs in relatively compact eukaryotic genomes, such as yeast. In higher eukaryotes, where TFs often bind at a much greater distance from their target genes, more complicated prediction strategies are necessary58, 59.
We expect that the use of k-mer binding data, rather than PWMs, for searching genomic sequence will enable more accurate prediction of TF binding sites across the genome. Traditionally, PWMs have been used when only limited experimental binding data existed for a particular TF, allowing the preferences of the TF for all other sequences to be approximated. Now, universal PBMs allow the generation of comprehensive binding data for all k-mers. This constitutes a significant paradigm shift in the study of gene regulation. Consequently, new methodologies will be needed to score candidate regulatory regions of genomes according to TFs’ relative preferences over all possible k-mers. New databases to store these extensive k-mer-specific data will be necessary; the recently developed UniPROBE database hosts both k-mer-specific data and PWMs for published universal PBM data60. We expect universal PBMs to provide valuable data sets for understanding the regulatory processes that govern gene expression in all species.
We thank Anthony Philippakis for helpful discussion, Andrew Gehrke for technical assistance, and Manuel Llinas and Steven Gisselbrecht for helpful comments and critical reading of the manuscript. M.F.B. and M.L.B. were funded by NIH/NHGRI grant # R01 HG003985.
COMPETING INTERESTS STATEMENTS
The authors declare competing financial interests (see the HTML version of this article for details).