In eukaryotic gene regulation, highly specific patterns of gene expression are established by cis
-regulatory elements (CREs)—modular domains of several hundred base pairs containing multiple short DNA sequences (~5–15
bp) that serve as binding sites for transcription factors (TFs) (1
) Given a genome, a major challenge is to identify CREs in the sequence and to identify the TFs that bind to them (2
) A common situation is that one or more TFs are believed to regulate the expression of a set of genes, and one wants to identify the CREs responsible. Using a description of a TF's binding specificity such as a consensus sequence or position weight matrix (PWM) (3
), the genomic regions of interest (e.g. gene upstream and intronic regions) can be searched for binding sites that would indicate a potential CRE. However, since TF-binding sites are commonly short and sequence degenerate, many sites will often be identified while only a few will actually be part of a CRE. This is particularly evident in the large non-coding regions present in higher eukaryotic genomes (4
). To confront this problem, additional criteria have been used to refine the initial list of identified sites such as the PWM score (3
), clustering of the sites(5–8
); models based on previously identified CREs(9–11
); or phylogenetic conservation of the sites (12–14
). Critical to the success of all these approaches is an accurate description of the TF's binding specificity.
Databases such as TRANSFAC (15
) and JASPAR (16
) provide data on experimentally determined DNA-binding specificities and provide PWMs for many eukaryotic TFs. Unfortunately, for most TFs there is currently little or no data as to the DNA sequences they bind to. The most recent release of TRANSFAC database (version 7.0–public), the larger of the two databases, has approximately 400
PWMs for all eukaryotic TFs. In comparison, the human genome has approximately 1850 predicted TFs (17
); therefore many remain without well-characterized binding specificities. High-throughput experimental approaches, such as SELEX (18
), quantitative multiple fluorescence relative affinity (QuMFRA) assay (19
), genome-wide location analysis (ChIP-chip) (20
), DNA immunoprecipitation with microarray detection (DIP-chip) (21
) and protein-binding microarrays (PBM) (22
) are increasingly being used to efficiently determine the binding-specificity of individual TFs; however, obtaining high-quality data for many TFs still proves challenging (23
). Additional approaches for determining binding specificities that combine computational methods and structural information with sequence data (24
) and experimental data (SELEX, phage display) (25
) have also been proposed. Recently, various computational methods have been reported for using structures of protein–DNA complexes to predict binding specificities directly (26–34
), providing a potential complementary strategy to high-throughput determination of TF PWMs.
All purely structure-based methods require either an experimentally determined structure or a computationally derived model of a protein–DNA complex. Given a structure of a TF bound to one or more DNA sequences, a scoring function can be used to evaluate relative affinities. One type of scoring function uses knowledge-based potentials that have been developed through the statistical analysis of many protein–DNA structures. Such potentials have been used to estimate pair-wise amino-acid–nucleotide interaction energies (26–30
), amino-acid–polynucleotide interaction energies (30
) and DNA-deformation energies (36–38
). The potentials are based on structural parameters (e.g. amino-acid–nucleotide distances; twist, roll and tilt parameters of base-pair steps) chosen so as to be independent of the residue identities. Therefore, a single template complex can be used to predict affinities for many different protein and DNA sequences by simply changing the identities of the amino acid or the base and re-evaluating the potential. In this case, one generally only needs knowledge of the coordinates of the protein and DNA backbones since, in most current applications, the detailed atomic nature of amino-acid–base interactions are ignored. However, it is possible that specificity will depend in part on side-chain orientation. Indeed, alternate side-chain conformations that affect DNA-binding specificity have been observed in response to both the mutation of neighboring side-chains and the binding of different DNA sequences (39
), and likely contribute to the inter-dependent effects on binding affinities that have been observed for amino acid and base residues (19
Another approach to scoring is to build an all-atom model for a set of complexes of interest and to evaluate relative affinities with molecular-mechanics energy functions (31
); physically based, atomic-level energy functions (33
); or statistically derived atom–atom potentials (42
). In most recent studies, the protein and DNA sugar–phosphate backbone atoms are held fixed in the original, experimentally derived structure, and alternate DNA sequences are modeled by constructing DNA bases that are co-planar with the original template bases. For each DNA sequence, amino-acid side-chains are then modeled onto the fixed protein backbone by choosing conformations from a rotamer library that minimize the energy function (32
). An alternative method used by Paillard et al
) is to begin with the side-chains in the original template conformations and to perform an energy minimization in the presence of a multi-copy representation of the DNA bases (44
), where all DNA bases are present at each position simultaneously.
Despite the increasing number of protein–DNA complexes in the Protein Data Bank (PDB), TF sequences still greatly outnumber TF structures. Therefore, for the majority of TFs an experimentally derived structural template containing the TF will not be available. Since many structurally similar proteins dock on the DNA in a similar fashion (46–48
), one solution is to use a protein–DNA complex of a structurally homologous protein as a template for a TF for which no DNA-bound complex is available. Recent papers by Morozov et al
) and Contreras-Moreira and Collado-Vides (35
) have used all-atom modeling and knowledge-based approaches, respectively, to predict PWMs using structural homologs as templates. The homology modeling of protein–DNA complexes provides a broadly applicable method for PWM prediction as there are currently enough protein–DNA complexes in the PDB to represent the majority of TF structural families. Integral to a homology modeling approach is the assumption that a protein–DNA complex can serve as the template structure for a structurally homologous protein. This requires that the docking geometry—the spatial orientation of the protein backbone to the DNA molecule—of structural homologs is sufficiently similar that the difference in docking orientation does not affect the outcome of the predictions.
In this article, we test this assumption by using different protein–DNA complexes of structurally homologous C2
zinc fingers (ZF) as templates in atomic-level modeling predictions of TF-binding specificity. Extensive structure–based predictions of ZF binding specificity are carried out and the sensitivity of prediction accuracy on differences between the model and actual docking geometries is analyzed. Relative docking geometries are quantified in terms of an interface alignment score (IAS) that we recently described (48
). The algorithm that calculates IAS values aligns amino acids from the binding interface of two protein–DNA complexes based on the spatial relationships between the amino-acid backbone and DNA base atoms and provides a quantitative measure of the similarity between two protein–DNA interfaces. The IAS has been shown to provide a robust and sensitive measure for comparing the docking geometry of protein–DNA complexes.
ZF proteins constitute a large family of eukaryotic TFs that have been studied extensively both experimentally (49–51
) and computationally (24–26
) as a model system for understanding protein–DNA-binding specificity. The proteins in the family are often composed of multiple, structurally homologous, concatenated ZF domains of approximately 30 amino acids that bind successively along the DNA major groove. From the twenty-one available ZF protein structures listed in , ninety-three individual ZF domain
templates could be defined (Materials and Methods), representing the largest set of structurally homologous protein–DNA complexes currently available from the PDB. Fourteen of the twenty-one structures listed in the table are based on the well-studied Zif268 protein and include: two wild-type structures at different resolutions; one structure with two wild-type Zif268 proteins bound in tandem; nine structures with mutations in the ZF domain 1 (ZF1) in complex with various off-consensus DNA sites of the form GCGTGGNNN; and two modified Zif268 structures bound as head-to-head dimers. Three of the twenty-one structures are designed ZF proteins engineered to bind novel DNA sequences, and the remaining four structures are of various wild-type ZF proteins containing between two and six ZF domains bound to different DNA sequences.
Pair-wise structural superimpositions of all the ninety-three ZF domains yielded an average backbone heavy-atom RMSD (root mean square deviation) of 0.9
Å (variance 0.12). However, despite the overall structural similarity, significant differences in docking geometries are observed. These differences have an important effect on the ability to predict nucleotide sequences that preferentially bind to a ZF domain. We find a strong relationship between the IAS and prediction accuracy, and define a range of IAS values for which accurate structure-based predictions of PWMs is to be expected. Our results thus provide insight on how template choice can affect atomic-level modeling approaches to binding-specificity predictions and indicate ways in which prediction algorithms can be improved in future work.