|Home | About | Journals | Submit | Contact Us | Français|
Transcription factor proteins control the temporal and spatial expression of genes by binding specific regulatory elements, or motifs, in DNA. Mapping a transcription factor to its motif is an important step towards defining the structure of transcriptional regulatory networks and understanding their dynamics. The information to map a transcription factor to its DNA binding specificity is in principle contained in the protein sequence. Nevertheless, methods that map directly from protein sequence to target DNA sequence have been lacking, and generation of regulatory maps has required experimental data. Here we describe a purely computational method for predicting transcription factor binding. The method calculates the free energy of binding between a transcription factor and possible target DNA sequences using thermodynamic integration. Approximations of additivity (each DNA basepair contributes independently to the binding energy) and linear response (the DNA-protein and DNA-solvent couplings are linear in an effective reaction coordinate representing the basepair character at a specific position) make the computations feasible and can be verified by more detailed simulations. Results obtained for MAT-α2, a yeast homeodomain transcription factor, are in good agreement with known results. This method promises to provide a general, computationally feasible route from a genome sequence to a gene regulatory network.
Given a genome sequence, it is increasingly easy to identify gene and protein components that compose biological networks and pathways. Predicting the network edges — the physical interactions between proteins and the specific binding between transcription factors and DNA regulatory elements —is a crucial step towards predicting and designing the properties of living systems. Transcription factor proteins regulate genes by binding to specific DNA sequences, termed regulatory elements, in the upstream promoter regions of target genes (Fig. 1). Multiple transcription factors can regulate a single gene, with complicated interactions yielding combinatorial control of expression. Predicting the binding preferences of individual transcription factors is a first step towards understanding the full details of transcriptional regulation.
The importance of transcriptional regulation is reflected by the large fraction of genes that perform this function in multicellular organisms. Over 10% of human genes encode transcription factors that recognize specific DNA sequences. Most transcription factors belong to one of a handful of major families that are represented in other eukaryotic lineages, including yeast, nematode, and insect (Fig. 2). Some transcription factor families appear in only a subset of lineages, for example the AP2 family which occurs in plants but not in animals or fungi, and the nuclear hormone receptor family which arose after the yeast –animal split. Other families, such as homeodomain proteins and leucine zipper transcription factors, are present in all major eukaryotic kingdoms.
While several DNA-binding motifs can recognize specific base sequences [1,2], the majority of characterized transcription factors contain an α-helix that makes base-specific contacts with the DNA major groove. Homeodomain, MYB, leucine zipper domain (bZIP), helix-loop-helix (HLH), winged-helix, zinc finger, nuclear hormone receptor, and helix-turn-helix (HTH, specific to prokaryotes) families all have α-helix motifs. The large classic zinc finger family encodes three sequential α-helices held by a scaffold that makes contact with three DNA major groove regions. Homeodomain transcription factors are three-helix bundles in which one helix binds to the DNA major groove and the other two are solvent-exposed. The specificity of a homeodomain protein arises primarily from contacts of a single α-helix with about four to six contiguous basepairs. The other two helices can make additional contacts with the DNA backbone but do not confer specificity. Leucine zippers are encoded by two separate α-helices that form homodimers or heterodimers when bound to DNA. Each chain recognizes a half site, and the half sites can compose a single continuous recognition region. The nuclear hormone receptor family contains two α-helices on a single scaffold. The two helices recognize DNA sequences with variable spacing. These differences in how the α-helices are presented to DNA has made it difficult to infer general recognition or complementarity rules between specific protein residues and cognate DNA basepairs.
Recognition of DNA by β-sheets is less common, but is observed in families such as AP2 in plant and ApiAP2 in malaria . Some malarial proteins have multiple ApiAP2 domains, suggesting a β-sheet analog to the α-helix nuclear hormone receptor motif. TATA-box binding proteins (TBP) also use β-sheets in DNA binding. However, TBP binds DNA in the minor groove and causes a large bend in the DNA structure, which is thought to facilitate the opening of the DNA duplex for transcription.
Intense interest in understanding transcriptional regulation has spurred development of experimental methods to characterize transcription factor binding sites. Screening by SELEX is possible but requires purified protein and costly rounds of enrichment. SELEX has been eclipsed by two recent methods that are beginning to yield high-throughput data: chromatin immunoprecipitation followed by microarray readout (ChIP/chip) [4–6] and protein binding to double-stranded DNA microarrays .
In the ChIP/chip approach, a cell or organism is engineered to express a tagged version of a transcription factor. If the tagged protein is expressed and binds DNA, it can be isolated to enrich the bound DNA. The DNA is then hybridized to identify the genomic region where the transcription factor binds. ChIP/chip methods require access to a cellular state that produces a transcription factor of interest. This can be difficult if the cellular state is unknown or difficult to access, such as a precise developmental state. Certain organisms, for example pathogens, are impractical to culture. The rapid evolution of regulatory networks makes it difficult to infer binding specificity cross-species even for orthologous proteins. Finally, it may not be clear whether the tagged protein, or some other component of a transcription factor complex, provides the observed specificity.
Binding of protein to a dsDNA array requires only purified protein, which can be generated by a protein expression system. These experiments can be expensive, however, due to the need to generate an organism-specific hybridization chip. This is because the dsDNA chips do not contain an array of motifs, but rather an array of the intergenic regions for a particular genome. When promoter regions are difficult to delineate, as in human, these arrays are even more challenging to fabricate and expensive to use.
The ChIP/chip and dsDNA methods provide only indirect evidence of binding preference by identifying DNA regions that are apparently enriched for transcription factor binding. Transcription factors have general affinity for the DNA backbone, and the signal from the specifically-bound DNA regions is accompanied by a background from non-specifically bound DNA. Identifying the specifically-bound genomic regions requires methods analogous to analysis of gene expression data to identify differentially expressed transcripts.
Once the enriched regions are identified, bioinformatics methods are required to infer the actual short motifs bound by the transcription factor. These methods have their origin in motif-finding methods, such as meme and bioprospector, that were developed to identify regulatory elements in co-expressed genes [8–13]. The statistical search does not always converge to the correct binding site. A recent study failed to identify known specificities for 138 of 203 yeast proteins tested .
As an alternative to experimental characterization, recognition codes are hypothetical specific pairing rules between amino acid side-chains and the DNA basepairs they specifically recognize. The theoretical possibility for recognition codes arises from different hydrogen-bonding opportunities in the major and minor DNA groove depending on the basepair. Unfortunately, it has been impossible to identify general rules that span different transcription factor families. The best performance has been obtained for the classic zinc finger family . Improved recognition codes parameterized using expectation maximization have now been used to predict binding for Drosophila zinc fingers . There has been little or no progress with other transcription factor families, other than observation of core binding regions for certain sub-families. Even weak knowledge is useful because binding motifs identified from computational methods can combine with experimental data to provide more precise predictions of transcriptional regulation. This has already been achieved with existing binding motif databases combined with ChIP/chip data .
The lack of success of recognition codes has prompted efforts towards structure-based predictions of protein-DNA interactions. Published studies suggest that molecular force-fields are sufficiently accurate to model interactions between DNA, protein, and solvent molecules [17–20]. Water molecules are crucial for many transcription factor families where water-mediated contacts are responsible for specificity recognition [21,22]. Nonetheless, studies aimed at predicting binding sites have generally used implicit solvent and fixed DNA backbones for expediency. These failings present a general problem because water-bridged hydrogen bonds and distortion of the DNA backbone are common features of transcription factor binding.
Baker and coworkers adapted methods used for protein-protein interaction prediction to study zinc finger binding preferences  based on data from experimental screens of variants of a murine zinc finger, Zif268 . A brute force search through possible DNA binding regions used fixed backbones, a rotamer side-chain library, and implicit solvent. The authors concluded that this method was questionable for predicting binding specificity, but suitable for the easier goal of designing a protein to recognize a pre-specified DNA sequence. Including hydrated rotamers in the protein sidechain library has been suggested  but requires additional parameterization and has yet to be tested. One of their recent work  extended the study to other transcription factors with limited success. Again their work does not include the effects of explicit water molecules, and their binding energies excluded any entropic contributions.
Lavery and coworkers studied nucleic acids  and their interactions with proteins [28–30] with a multi-copy approach that superimposes all four possible basepairs at each DNA position, again with implicit solvent. After generating structures with the fictitious basepairs, binding sites are ranked by a scoring function. This method yielded accurate predictions for 18 proteins tested, but primarily for complexes that did not contain water-mediated protein-DNA interactions. Furthermore, this analysis method does not properly account for the fictitious nature of the multi-copy base, which induces protein conformations that might not occur naturally.
A simplifying assumption in analysis of protein-DNA interactions is that the energetic contributions of basepairs are additive. This assumption yields the standard position-weight-matrix representation of a binding site. Recent work continues to point to the accuracy of the additive approximation , particularly for sequences similar to the most favored sequence . The main cause for non-additive contributions may be DNA backbone deformations , pointing again to the important of incorporating DNA flexibility in simulations.
The methods introduced here are motivated by Lavery’s multi-copy model for DNA. As described in the methods (Sec. 2), a multi-copy basepair is used to represent a transition state between two physical basepairs. Statistics collected during an equilibrium simulations of a DNA-protein complex with a multi-copy basepair and the same DNA sequence solvated by water are used to calculate a difference in binding free energy for two physical basepairs. Repeating this calculation for each position along a DNA sequence yields predictions for the binding specificity of the transcription factor. Results of applying these methods are presented for MAT-α2, a yeast homeodomain transcription factor (Sec. 3). The predicted binding motif is in good agreement with the motif described in the original literature. Finally, future plans and extensions are discussed (Sec. 4).
The specificity of a transcription factor (TF) may be represented as
where X and Y are two DNA sequences, and β−1 is the thermal energy kBT. Sequences X and Y are assumed to be equally abundant in a genome for simplicity; this assumption could easily be lifted by including an additional factor of their relative abundance. The binding free energy for a DNA sequence is the difference in free energies of the solvated TF–DNA complex and the solvated isolated components,
The solvation energy for the TF–DNA complex is G(TF − X), and the solvation energies of the isolated components are G(TF) and G(X). An analogous expression gives ΔG(Y). Combining Eqs.1 and 2 indicates that the G(TF) terms cancel, and
Under an additive approximation, the energetic contributions are additive over DNA positions,
where the index i runs over the W basepairs in the TF-DNA interface. Under this approximation, the probability that position i of bound sequence X, denoted xi, is basepair α is
where δ is an arbitrary reference state. Since the identity of a basepair is defined by the nucleotide on one of the strands, the basepairs AT, CG, GC, and TA are abbreviated as the nucleotides A, C, G, and T. Although there are C(4, 2) = 6 distinct pairwise combinations possible, the energy spacing between basepairs is fully specified by only 3 values, for example the ΔΔG values using the most favored basepair as the nominal zero of energy. Therefore only 3 pairwise comparisons are required to establish the position weight matrix at any position.
The additive approximation permits a direct comparison with the position-weight-matrix (PWM) of a binding site. This representation is obtained by aligning known binding sites for a transcription factor and tabulating n(xi = α), the number of times that basepair α occurs at position i. The PWM is then defined as
PWMs are conveniently visualized as logos [32,33]. The information at position i is summarized by a stack of letters signifying each nucleotide. The height of the stack is the information content (IC) at the position,
and the height of each letter is proportional to Pr(xi = α) with the letters sorted from top to bottom in decreasing order of probability.
Free energy differences ΔG(X ← Y) for DNA sequences X and Y in either the TF complex or in water can be calculated using thermodynamic integration ,
where t parameterizes a path λ(t) that switches from state Y to state X. Our calculations use a Hamiltonian H[λ] where the switching involves a superposition of basepairs at a single position,
with T0 and U0 representing the kinetic and potential energy of the entire system except for the position that is being switched, and Tα and Uα representing the kinetic and potential energy of the switching position. The components λα represent the fractional representation of basepair α. For example, λA = 1 with λC = λG = λT = 0 represents a physical state with basepair AT, and λA = λT = 0.5, λC = λG = 0 represents a 50-50 mixture of AT and TA with no CG or GC character. The kinetic energy in Eq. 9 need not be weighted by λ because classical configurational properties should be independent of these terms. For a linear switch between two states α and γ, with λα(t) = t and λγ(t) = 1 − t, the free energy change is
In principle, calculation of ΔG for this change requires a series of equilibrium calculations at multiple values of t, also known as multi-configurational thermodynamic integration. These calculations become increasingly difficult as t approaches 0 and 1, requiring much longer simulations in order to achieve comparable standard deviations as at other t values. Suppose, however, that the interaction between the switching position and the rest of the system can be modeled as an effective linear coupling to a bath. In this case, the Hamiltonian for state α is
where y, which may be a vector, represents bath modes, yα is the expected value of y in state α, Eα is the energy at this minimum value, k is the coupling between bath modes, and T is the kinetic energy of the bath. If the coupling k is independent of α, then the free energy difference between states is
The quantity that enters into thermodynamic integration is
At the midpoint value λα = λγ = 0.5, Hα − Hγ is exactly equal to Eα − Eγ. Our strategy, then, is to use the midpoint calculation as an estimate for the full integral. This estimator has reduced variance with a tradeoff of possibly introducing bias. Provided that the bias is smaller than the energy differences to be measured, however, this approximation will be successful in predicting specificity. The bias will be small for an effective linear system. The bias may still be small for non-linearities if the non-lineararities cancel systematically. Cancellations might be anticipated when the ΔG’s of the TF-DNA complex and of DNA in water are subtracted. From preliminary calculations, we found that this is indeed the case for a model TF studied here (results not shown), confirming that the midpoint simplification is a good approximation.
Coordinates for a complex of MAT-α2 with a DNA 10-mer were obtained from the crystal structure (PDB entry: 1APL)  with the DNA sequence of ATT-TACACGC. This sequence differs from the consensus sequence in Transfac , ACATG, although Transfac indicates a weak preference for ACATG over ACACG. The tleap module of Amber was used to add hydrogen atoms to the macromolecule. The charge states of the titratable residues are as follows: ASP −1, GLU −1, LYS +1, ARG +1, HIS +1. Since the single histidine is exposed at the protein surface, the nitrogens at both δ and positions are protonated. The complex was solvated in a truncated octahedron with 4632 water molecules and 9 Na+ ions to maintain a neutral system. Periodic boundary condition and Ewald summation were applied and the system was equilibrated for 1 ns using Amber. The 399 force field of Amber was used. The resulting macromolecular geometry as well as the positions of the ions were imported into Charmm . Water molecules were re-introduced in a cubic box. This system was then equilibrated for 700 ps at constant pressure (1 atm) and constant temperature (300 K) using the c27 force field of Charmm. Spherial cutoff of 14 Å was used to evaluate the non-bonded interactions including the electrostatics and van der Waals forces and potentials. This final system contains 8320 water molecules and 9 Na+ ions (26366 total atoms). The same protocol was applied to the DNA duplex from the crystal structure. The final solvated DNA system contains 18 Na+ ions and 7117 water molecules in a cubic box (22003 atoms in total).
Calculations of ΔΔG values were organized as a single-elimination tournament at each position along the DNA 10-mer. The λ switching functions between pairs of basepairs are implemented using the Blocks procedure in Charmm. In an effort to minimize systematic effects due to favorable solvation of GC over AT, the first round of the tournament ran AT vs. TA and GC vs. CG. The two basepairs with lower energies were then compared in the second round free energy calculation. Our rationale for a single-elimination tournament was to permit the most favored basepair to serve as the reference, rather than a higher-energy state.
The tournaments were conducted by replacing a basepair with a multi-copy superposition. Since we use the midpoint approximation (λ = 0.5), two base-pairs are superimposed according to the tournament. The new DNA with the multi-copy basepair was energy minimized, heated to 350 K and equilibrated at 350 K for 30 ps, cooled to 300 K and equilibrated at 300 K for 120 ps, and then run for 100 ps production in the NVT ensemble. We checked that energy differences showed no drift over the 100 ps production run. This procedure was used for both the TF-DNA complex and the free DNA in water.
Frames from the production run were collected every 0.5 ps yielding F = 200 frames per run. The energy difference ΔHf = Hα − Hγ was calculated for each frame. The mean and its standard error were calculated as
The factor (1 + c)/(1 − c) corrects the standard error for correlation c between neighboring frames. Standard error propagation was used to obtain the standard error of ΔΔG, i.e.
After the first round of the TF-DNA complex and DNA-water simulations were completed at each position, the winning pair of basepairs were determined and competed against each other using the same protocol as in the first round. The ΔΔG values were converted to position-weight-matrices using Eq. 5 for visualization and comparison.
Homeodomain proteins are important transcriptional regulators for developmental processes. This family was first identified in fruit fly as responsible for the proper development of body plan and segmentation. Homeodomain proteins regulate region-specific expression in flowers. Feedback loops involving homeodomain genes and proteins are an integral part of the cellular memory that maintains the expression patterns through development.
Specific binding to DNA sequences of homeodomains is achieved by contacts between an α-helix and the DNA major groove. Several structures of homeodomain-DNA complexes have been published by Wolberger and coworkers, including homeodomain mating type protein complexes in yeast [38–41,35] corresponding to PDB identifiers 1APL, 1YRN, 1AKH, 1LE8, and 1K61.
The MAT-α2 complex was simulated starting from the 1APL coordinates as described above, using a single-elimination tournament to calculate the three energy differences that fully specify the position weight matrix entries at each position (Sec. 2). Simulation results and error bars are provided for the ΔG calculations for the TF-DNA complex and the DNA in water, together with the final ΔΔG for each comparison (Fig. 3).
The logo extracted from the computer simulation compares favorably with logos for the same protein from Transfac and from the primary literature , as shown in Fig. 4. The Transfac logo was based on a consensus DNA motif extracted from targets of a heterodimer between MAT-α2 and MAT-a1. The literature logo is from work by Wolberger and coworkers on targets of a heterotetramer with MAT-α2 and MCM1 targeting a different set of genes and yielding a somewhat different logo.
Both the previous logos include sequence specificity at basepairs that lack specific contacts with MAT-α2. Presumably, this specificity is due to contacts with the other transcription factor complexed with MAT-α2. The basepairs that contact MAT-α2 are numbered 3 through 8 inclusive. We restrict attention to predictions at these positions where the two experimental logos make strong predictions defined as 1 or more bits of information.
The Transfac logo has only 3 high information positions, 5 through 7. These are all predicted correctly by simulation. The probability for this to occur by chance is (1/4)3, or a p-value of 0.016. The Wolberger logo has high information content for all 6 positions in contact. The simulation correctly predicts 5 of these. The p-value for correctly predicting 5 of 6 positions is approximately C(6, 5)(1/4)5(3/4)1 + (1/4)6, or 0.005. The results at the final position are in questionable agreement, A or T in simulation and T or C in the Wolberger logo, both with relatively low information contents in the logos. This position was found to contain a water-mediated contact between a serine side chain and the DNA , which might contribute to the lower information content, as water bridges allow more promiscuous recognition between protein side chains and DNA bases.
We have presented and tested a simulation method for calculating the DNA-binding specificity of a transcription factor. The method has no requirement for experimental data, other than the ability to construct an initial homology model for a DNA-protein complex. As most transcription factor families are represented by one or more bound complexes in PDB (Fig. 2), this is not a practical limitation.
This method has the benefit of permitting full main chain motions of both the DNA and the protein. DNA deformation contributions are known to be important contributors to binding specificity and are included in this method. Explicit solvent molecules permit water-bridged hydrogen bonding between transcription factor sidechains and DNA bases that are also known to contribute to specificity.
The simulation design as a single-elimination tournament, together with the additive approximation, make the computational implementation trivially parallelizable by running each position on a different node of a high-performance cluster.
Ongoing work is investigating the sensitivity of predictions to the background DNA sequence used to start the calculation and the quality of the initial homology model. The calculations described here used high-quality structural data for the transcription factor under study. We are now repeating the calculations using homology models built from related family members. Superposition of homeodomain proteins MAT-α2, Ubx, and Pbx1 present in PDB suggests that homology modeling will introduce little error due to the strong conservation of protein fold (Fig. 5). Furthermore, the critical DNA-binding region is an α-helix that is highly amenable to homology modeling, as opposed to a loop that would be difficult to predict.
Other retrospective analysis of trajectories is being conducted to investigate sampling efficiency. The 100 ps production runs must be sufficiently long to sample relevant motions of protein, DNA, and solvent. Motions that are slower than this timescale may lead to inefficient sampling and simulation error. An important timescale that might be slow is the residence time of water molecules in the DNA-protein interface.
Knowledge gained through these simulations will provide new information for guiding dynamical models of gene regulation. One of the most difficult aspects of modeling is defining the structure of a network. Network structure is defined by the set of interactions, ideally causal interactions, between network components. This topology often must be inferred from observed data by Bayesian network approaches. As the number of possible network topologies is combinatorially large, inference is computationally and practically challenging. The set of gene regulatory relationships defined by predicted protein-DNA interactions will be useful for constraining or biasing which network edges are considered in a topology search. This will permit the development of more detailed, quantitative models of gene regulation.
LAL acknowledges funding from the Department of Energy (DE-FG0204ER25626). JSB acknowledges funding from the NSF, NIH/NCRR U54RR020839, and the Whitaker foundation. We acknowledge a grant of computer time from the Pittsburgh Supercomputer Center, MCB060010P.