In this study, we explored the use of comparative genomics as a tool for transcription factor engineering. Based on correlations derived from a previous study of nitrogen oxide metabolism in bacteria (
20), we experimentally tested eight different mutations for their ability to change DNA-binding specificity using CRP as the template. In all cases, the mutations were made to a triad of amino acids (Arg180/Glu181/Arg185) that are known to directly contact DNA bases within the major groove. These three amino acids alone were predicted to be sufficient for DNA-binding specificity within the CRP/FNR family. For each set of mutations made to CRP, we also made a corresponding set of mutations to the CRP operator site within the
lacZ promoter. Of the eight CRP variants, four were able to bind their cognate operator sequence and activate transcription of the
lacZ promoter. Though the results are, in general, less dramatic than those for the wild-type CRP/Owt pair, they provide excellent targets for subsequent refinement by directed evolution and other more traditional methods. Along these lines, we were able to improve activation by the CRP5/Om5 pair by screening for operator sequences with select positions randomized. While this screen was limited, it nonetheless demonstrates that further refinement is possible.
Utilizing genomic data to inform protein engineering is not a new idea. For example, the active site in an enzyme can often be determined within a multiple sequence alignment by identifying conserved residues (
33) or specificity-determining residues conserved within functional subfamilies (
34,
35). In the case of transcription factors, however, a multiple sequence alignment alone often will not suffice, as the DNA-binding sequence must also be considered in the analysis. In particular, identifying the specificity-determining residues is often not sufficient for design purposes. Rather, we seek to identify the specific amino acids that bind different DNA sequences. The new idea in this work is therefore to use mutual information between transcription factors and their target DNA-binding sequences to inform protein engineering. In many regards, the computational approach used to generate the predictions tested here is similar to those used to study interacting proteins (
36–39). These approaches work under the assumption that any mutation to a specificity-determining residue on one binding partner must be matched by a compensating mutation to a specificity-determining residue on the other binding partner. By studying the co-variation of residues among binding partners in a given family of proteins, one can identify the specificity-determining residues and then apply the information to inform protein engineering. Of notable significance is the recent work by Skerker and colleagues (
38), where they were able to utilize these data to change the specificity of the EnvZ histidine kinase for its target response regulator. In conjunction with the analogous work presented here, these results demonstrate how purely genomic-based approaches can inform the re-engineering of protein interactions.
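As a rough illustration of the co-variation idea described above, the following minimal sketch computes the mutual information between one alignment column of protein residues and one column of operator bases. It is not the pipeline used to generate the predictions tested here; the function name `mutual_information` and the toy columns are hypothetical and chosen only to show the calculation.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two paired symbol columns.

    xs: residue at one protein position, one entry per regulator.
    ys: base at one operator position for the same regulators.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy columns (not data from this study).
bases = ["G", "G", "A", "A"]
covarying_residues = ["E", "E", "V", "V"]    # changes together with the base
independent_residues = ["E", "V", "E", "V"]  # unrelated to the base

print(mutual_information(covarying_residues, bases))
print(mutual_information(independent_residues, bases))
```

In this toy case the co-varying column carries one bit of information about the base, whereas the independent column carries none. Real alignments would additionally require corrections for small sample size and phylogenetic relatedness, which this sketch omits.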
One limitation of genomic-based approaches for transcription factor engineering is the difficulty of identifying the target DNA-binding sequences and of discriminating among the potentially large number of DNA sequences that these proteins can bind. In the case of the work by Skerker and colleagues, the advantage of their system is that histidine kinase-response regulator pairs can often be inferred directly through their proximity in the genome, as they both typically reside in the same operon (
40). Furthermore, most histidine kinases interact exclusively with a single response regulator. In the case of transcription factors, identifying the target DNA-binding sequence is often impossible unless other data are available. The results used in this work were obtained from a comparative study of nitrogen oxide metabolism that integrated multiple lines of evidence from both experimental and computational analyses (
20). For an arbitrary family of transcription factors, such results may not always be forthcoming. Furthermore, there is always a degree of uncertainty, often unquantifiable, associated with the identification of target-binding sites. Finally, with regards to specificity, many transcription factors are known to regulate multiple target genes. For example, CRP is estimated to regulate approximately 200 promoters in
E. coli and other related organisms (
41). This promiscuity adds an extra degree of complexity, as the protein/operator site pairs often cannot directly be assigned and instead consensus sequences must be estimated. Despite these challenges, our results demonstrate the utility of these approaches for bacterial transcription factor engineering.
Our experiments also uncovered some surprising results, highlighting our limited understanding of even simple protein–DNA interactions. When CRP7 was paired with the Om7 reporter, expression was induced at low levels of CRP expression and repressed at high levels. This reporter was also toxic at high levels of CRP7 expression. Moreover, the reporter was not active in a
crp− background and showed weak, dose-dependent behavior with wild-type CRP, suggesting a complex, concentration-dependent interaction between this regulator–operator pair. In addition, significant cross-talk was observed in the case of wild-type CRP, which activated reporters with Om4 and Om7 in addition to the wild-type reporter, whereas the mutant regulators displayed far less promiscuity. From an evolutionary perspective, this cross-talk is not entirely unexpected; most of the regulator–operator pairs were taken from different species, so there may be no explicit evolutionary pressure to avoid cross-talk. Because large regulons such as the one dictated by CRP include an enormous diversity of promoters, the duplication and specialization of regulators could be a general mechanism in the evolution of regulatory pathways (
42), especially given the observation that birth and evolutionary turnover of regulatory sites may occur at a very fast rate even under relatively weak selection (
43,
44).
An additional puzzle concerns the role of bases within the operator sequence that do not make direct contact with amino-acid side chains. Previous work has established that only positions 5, 7 and 8 make direct contact with amino-acid side chains (
22). In addition to these bases, our results demonstrate that the so-called non-specific bases also affect binding and specificity. For example, Om1 and Om3 are identical at positions 5, 7 and 8, yet their responses to CRP1 differ. In the case of Om1, CRP1 binds this site so strongly that it is able to activate transcription both in the presence and absence of atc inducer. However, in the case of Om3, CRP1 is able to activate transcription only in the presence of atc (i.e. at high levels of expression). Similarly, Om2 and Om5 are identical at positions 5, 7 and 8, yet wild-type CRP is only able to activate promoters with Om5, albeit weakly. Finally, in the case of Om5, we were able to improve the ability of CRP5 to activate these promoters by modifying positions 9, 10 and 11. Collectively, these results show that these ‘non-specific’ bases likely do make specific interactions, though the mode may be quite complex. Moreover, as we focused only on the residues that make specific contacts and saw generally weaker activation than with wild type, future endeavors will likely need to consider optimizing non-specific binding, as these interactions may be needed to facilitate and/or compensate for changes to the core contacts.
We note that CRP2 was previously identified by Ebright and colleagues in a genetic selection for CRP mutants that were able to bind the
lacZ promoter with an adenine or thymine at position 7 within the CRP operator site, a condition that Om2 satisfies (
16). In addition to the valine substitution, they also found lysine and leucine substitutions. Subsequent analysis demonstrated that the Val181 (and Leu181) variants were unable to distinguish between different bases at position 7, resulting in roughly a 10-fold decrease in binding affinity relative to wild type (
32). As discussed, we found that CRP2 was unable to activate transcription of the
lacZ promoter containing the Om2 operator site, a result consistent with their
in vitro binding analysis.
In the context of transcription factor engineering, we have shown that comparative genomics can be used to computationally isolate mutations that alter DNA-binding specificity. Previously, in the case of bacteria, these designs have resulted from randomized screens. With regard to applications, engineered transcription factors with novel DNA-binding specificity can greatly facilitate the design of synthetic gene circuits, as they expand the number of components available to build these circuits. One challenge in constructing these gene circuits is that the designs are limited by the number of available components that do not interfere with host physiology. Engineering such orthogonal components has been a central focus in the nascent field of synthetic biology (
45–47). In addition, these engineered transcription factors provide additional tools for fine-tuning gene expression within cells, a key task in uncovering new regulation and potentially in designing new therapeutic approaches.
We note that one limitation of the approach explored in this work is that it does not provide information regarding the strength of the protein–DNA interaction. The analysis is based simply on correlations derived from sequence analysis and provides no information regarding binding energies. With regard to gene regulation, the function of the final product is ultimately tied to that of the template protein. In our case, where the template is CRP, the natural product is a transcriptional activator. However, CRP is also a transcriptional repressor for a number of promoters [cf. (
48,
49)], so it can potentially be used to engineer repressors with novel specificity. In addition, the approach tested in this work can be applied to other families of transcription factors, including repressors.
To what extent is there a ‘code’ for transcription factor specificity? Our previous study suggested that a three amino-acid code may be sufficient for inferring specificity in the CRP/FNR family of regulators. The results reported here highlight the importance of a combined computational and experimental approach: the amino acids are sufficient for inferring the specificity of known regulators but insufficient for designing novel regulators. Thus, the residues at these positions may constrain binding to a small number of possible operator sites, even if they contribute only a fraction of the total energy of binding. For example, in our previous study, we found that the identities of two residues (positions 180 and 181) were sufficient to predict binding specificity (i.e. each unique combination of residues at these positions mapped to a unique binding site). However, a careful inspection of the CRP–DNA co-crystal structure indicates contacts between position 185 and the major groove, motivating us to include position 185 in our redesign experiments.
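The sufficiency criterion stated above (each unique combination of residues at the chosen positions maps to a single binding site) can be checked mechanically. The sketch below is illustrative only: the helper names and the toy regulator table are hypothetical, the site labels are placeholders rather than real operator sequences, and only the wild-type Arg180/Glu181/Arg185 triad is taken from the text.

```python
from collections import defaultdict


def sites_by_residues(regulators, positions):
    """Group binding-site labels by the residues found at `positions`."""
    grouped = defaultdict(set)
    for residues, site in regulators:
        key = tuple(residues[p] for p in positions)
        grouped[key].add(site)
    return dict(grouped)


def predicts_unique_site(regulators, positions):
    """True if every residue combination maps to exactly one site."""
    grouped = sites_by_residues(regulators, positions)
    return all(len(sites) == 1 for sites in grouped.values())


# Hypothetical toy table: (residues keyed by position, placeholder site label).
regulators = [
    ({180: "R", 181: "E", 185: "R"}, "site_A"),  # wild-type CRP triad
    ({180: "R", 181: "E", 185: "K"}, "site_A"),
    ({180: "R", 181: "V", 185: "R"}, "site_B"),
    ({180: "S", 181: "V", 185: "R"}, "site_C"),
    ({180: "S", 181: "V", 185: "Q"}, "site_C"),
]

print(predicts_unique_site(regulators, (180, 181)))  # both positions together
print(predicts_unique_site(regulators, (180,)))      # position 180 alone
```

In this toy table the pair of positions is sufficient while position 180 alone is not, because the same residue at 180 is associated with two different sites. Note that a check of this kind only describes known regulators; as discussed above, it does not by itself guarantee that a novel combination will bind as designed.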
In a broader evolutionary context, our results show that it is possible to create orthogonal regulatory pathways after a surprisingly small number of mutational steps. Novel regulatory pathways are thought to evolve through gene duplication events (
50), horizontal gene transfer (
51), changes in the specificity of regulators (
52) and site turnover (
53,
54) yielding rewiring of regulatory pathways (
55). In order for these new pathways to form, transcription factors must mutate so that they no longer regulate their old target genes but instead target new ones. Sometimes we may observe early stages of this process; some examples of recently duplicated
E. coli regulators partially sharing their binding sites are UxuR/ExuR (
56,
57), GalS/GalR (
58,
59) and NarL/NarP (
60–63). In the case of regulators from the CRP/FNR family (and likely other helix-turn-helix transcription factors), specificity is predominantly determined by a core set of residues. The limited number of residues implies that these regulatory networks are quite plastic, as only a few mutations either to the protein or operator sites are necessary to change specificity and introduce new regulation or rewire existing networks. One open question is why there is only a small set of possible motifs observed within this family. Is this a structural constraint or are entirely new specificities possible within this family?
In conclusion, we have shown experimentally that comparative genomics can be used to inform transcription factor engineering. To date, bacterial transcription factor engineering has exclusively utilized directed evolution/random mutagenesis or domain swapping (
12). Few computational approaches exist, particularly in the case of bacterial transcription factors. Based on the computational analysis of co-variation between the specificity-determining residues within the DNA-recognition helix and the cognate, consensus binding sequence, we were able to engineer novel DNA-binding specificity in CRP. In fact, four out of eight designs worked as predicted with no subsequent refinement. The application of these computational approaches to engineering novel specificity in other proteins will likely also rely on directed evolution for subsequent refinement. One might then ask why computational approaches should be employed at all. As our results showed, many of the designs that actually worked involved multiple mutations to both the CRP protein and the operator site. Searching such a large space of possible sequences is extremely labor intensive. The computational approach described in this work can focus mutagenesis on a core set of targets, greatly reducing the number of mutants that need to be screened and also expanding the range of likely targets.
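To give a sense of scale for the search space mentioned above, the following back-of-the-envelope sketch counts the variants that would arise from fully randomizing the three contact residues (180, 181 and 185) together with operator positions 5, 7 and 8. It assumes independent, complete randomization at the protein level within a single half-site and ignores operator symmetry and codon degeneracy; the function name `library_size` is hypothetical.

```python
AMINO_ACIDS = 20  # standard amino acids per randomized residue
BASES = 4         # A, C, G, T per randomized operator position


def library_size(n_residues, n_bases):
    """Number of distinct protein/operator combinations under full randomization."""
    return AMINO_ACIDS ** n_residues * BASES ** n_bases


protein_variants = library_size(3, 0)   # residues 180, 181, 185
operator_variants = library_size(0, 3)  # operator positions 5, 7, 8
joint_variants = library_size(3, 3)     # every protein paired with every operator

print(protein_variants, operator_variants, joint_variants)
```

Under these assumptions the joint space contains 512,000 protein/operator combinations, compared with the eight designs tested in this work.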