2.1. Computational alanine-scanning
Alanine-scanning mutagenesis is a simple and widely used technique in the determination of the functional role of specific amino acid side chains in proteins (
40). Alanine is the residue of choice because it is most abundant and is present both in buried and exposed positions as well as in all secondary structural elements. Because it eliminates the side chain beyond the β-carbon atom, alanine substitution is helpful to study the role of specific side chains in imparting substrate specificity. In addition, alanine is chiral and, unlike glycine, sufficiently structurally rigid not to fundamentally alter the backbone configuration of the protein. However, alanine-scanning mutagenesis rapidly becomes cumbersome with the increasing length of the peptide under investigation.
Computational alanine scanning, introduced by Massova and Kollman, is an elegant alternative for rapidly screening parts of a protein to deduce the role of specific side-chains in protein-peptide interactions (
41). In this approach, independent molecular dynamics (MD) simulations from separate starting structures, one each of the native and the alanine mutant structure, are performed. The structure of the alanine mutant is modeled by truncating the coordinates of the desired residue at the C
β atom in the crystal structure. Hydrogen atoms of the methyl group are added depending on the force field used for simulations. MD simulations of the starting complexes (wildtype and mutant), along with the individual components (protein and peptide), are then performed, and snapshots of the structure from simulation trajectories are saved at equally-spaced intervals. The binding free energies (Δ
Gbinding) for each complex can then be estimated using the following expression:
where the individual components are obtained from their corresponding simulations. The difference in free energies (ΔΔG) represents the energetic contribution of the residue to the protein-peptide interface. The flow of events related to this algorithm is summarized in . This MD protocol, where simulations are performed for both native and mutant proteins, is theoretically a representative of any structural rearrangements at the interface that may occur upon mutation, and hence represents the biological reality of an alanine mutation. This algorithm can be directly applied to a chaperone-substrate system in order to identify residues on the substrate peptide that determine the binding specificity of the substrate to the chaperone. Since the wild type and mutant complexes and their individual components are simulated independently of one another, this method results in a high standard deviation in the estimated energy values, which will likely increase with increase in the size of the protein-peptide complex. Massova and Kollman proposed a variant of the algorithm outlined above, in which independent simulations are not performed for the wild type and mutant complexes (
41). Instead, the side chain of the desired residue is truncated up to the C
β atom in every frame of the simulation trajectory (). This ‘post-process’ approach is guaranteed to decrease the standard deviation in estimation of binding free energies at the cost of inaccurate estimation in cases where local backbone rearrangements upon substrate recognition are expected. A systematic analysis of both approaches in comparison with experimental results shows that the postprocess approach, although not suitable for all cases, offers superior accuracy and faster prediction than the traditional MD approach (
42).
Incorporation of implicit solvation in place of explicit water in MD simulations will further reduce the time taken for such computational alanine scans. Recently, Moreira and colleagues introduced an improved methodology with reduced computational cost, involving molecular dynamics simulations performed in a continuum medium using the Generalized Born (GB) model. This approach has been used on multiple systems to predict ΔΔG
binding within 1 kcal/mol of experimentally determined values (
43).
2.2. Monte Carlo driven approach for consensus motif determination
Phage display library screening has been shown to be a powerful tool for characterizing protein-peptide interactions. Using this methodology, large peptide libraries can be rapidly screened to isolate candidate peptides that are recognized by the protein of interest. One of the most successful applications of phage display has been the isolation of monoclonal antibodies using large phage antibody libraries (
44). This approach has been applied to identify candidate peptides that are recognized by the yeast chaperone Ydj1, a representative member of the type-1 Dna-J like proteins (
45). The results from the screen not only produced a candidate peptide library, but also provided valuable insight into the potential consensus sequence required for substrate-recognition by Ydj1. One drawback of this approach is the time taken for the characterization of peptides that are potential binding partners to a given chaperone. The same library may not be useful for screening with multiple chaperones of interest. Recently, we have employed computational methods to identify a consensus peptide-binding motif for Ydj1, and experimentally verified this motif (
31). The trend observed in our computational studies agrees with the phage display screening conducted by Li and Sha, establishing the potential of predictive computational algorithms in determining the binding properties of proteins in general, and their specific application to understanding substrate recognition by chaperones.
The algorithm for determining whether a given peptide is a potential substrate for a chaperone is based on a Monte Carlo driven iterative search for the best combination of side chain orientations that energetically favor the configuration of the substrate in the chaperone-substrate complex. Such an algorithm can also be used to design novel peptides targeted for recognition by a specific chaperone. In this iterative approach, each position on the substrate is computationally mutated to all possible amino acids and the energetically most favorable residue is chosen for each position. This approach allows for screening the entire sequence space in order to identify potential substrate peptides. At each step, the stability of the chaperone-substrate complex is estimated after a chosen mutation. Estimation of the effect of mutation on protein stability is important, considering that mutagenesis is a central tool in molecular biology. The change in stability of the complex upon mutation (ΔΔG) can then be computed using
Here, a mutant is defined as a peptide in which one of the positions is mutated to a different amino acid, while the chaperone remains intact. The change in free energy (ΔΔG) can be compared across different substrates to arrive at the best-fit peptide for the chaperone. This process is considerably faster than using the computational alanine scanning approach described above, because free energy estimation protocols using molecular dynamics simulation require several hours of computing time, and thus are inefficient when estimating
ΔΔG from multiple mutations (
37). In addition, many heuristic force fields currently available are limited by the dataset on which the parameters are initially trained (
46–
49). To avoid these limitations, we have developed a methodology that uses a physical force field (hence applicability is not limited by a training data set), Medusa, with atomic modeling as well as fast side chain packing and backbone relaxation algorithms (
35,
50).
In Medusa, the protein is modeled using the united atom model, which includes all heavy atoms as well as polar hydrogen atoms. The free energy of the chaperone-substrate complex is then expressed as a weighted sum of van der Waals forces, solvation, hydrogen bonding, and backbone-dependent statistical energies (
35), as shown below.
where
E is the total energy of the system;
Evdw_attr, Evdw_rep are the attractive and repulsive portions of the van der Waals interaction, respectively;
Esolv is the solvation energy; and
Ebb_hb, Esc_hb, and
Ebb_sc_hb are the hydrogen bonding energies among backbone atoms, side chain atoms, and between backbone and side chain atoms, respectively.
E, |aa and
E,,aa|rot correspond to the internal energy of an amino acid and its rotamer state given the backbone dihedrals, ϕ and ψ.
Eref is the reference energy for the unfolded state. We use the weights (
W) to estimate the contribution of each term to the total energy. The weighting parameters were independently trained to recapitulate the native amino acid sequences for 34 proteins using high-resolution X-ray structures (
35).
2.3. Protein design with fixed backbone
In order to design the optimal side chain orientations for the residues in the substrate, we fix the protein’s backbone and use a Monte Carlo-based simulated annealing approach to search for low-energy structural configurations of the substrate. Using the Metropolis criterion, we accept or reject a trial mutation - either an amino acid substitution or a side chain rotation - by computing the energy difference between the original and altered states. Using Monte Carlo, we accept a change in the state of the system if the change results in decrease of the total energy of the system. Additionally, we incorporate the Metropolis criterion in making decisions during the Monte Carlo simulation. The advantage of the Metropolis criterion is that it allows us to accept changes to the system despite an increase in energy with a given probability. In doing so, the system can be rescued from being trapped in a local energy minimum. This methodology improves the sampling efficiency in the overall energy landscape. During the last step of simulated annealing, we perform a quenching simulation, in which conjugate-gradient minimization is used to find the lowest energy state in the sub-rotameric conformation of each trial rotamer. Due to the stochastic nature of the design algorithm, one needs to perform multiple simulations in order to attain statistical significance of the outcomes. However, since the simulations are independent of one another, they can be performed in parallel, thereby decreasing the time taken for free energy estimation. Using this approach, we can rapidly identify the best-fit amino acid side chains for each position on a given substrate for the chaperone. Given the promiscuity of substrate recognition by chaperones, it is critical to evaluate all possible substitutions at all positions on the substrate before choosing the best-fit substrate for the chaperone. The steps involved in determining the sequence space that best represents the substrate sequences are outlined as a flow diagram in .
2.4. Backbone flexibility improves protein stability estimation
The drawback of fixing the protein backbone during redesign is that the methodology does not accurately estimate the change in free energy upon small side chain to large side chain substitutions and vice versa, where movement of the backbone is necessary to reach the energetic minimum. In order to be able to allow small to large as well as large to small side chain substitutions, we modeled backbone flexibility in our protein redesign algorithm. Proteins often adapt their structure to changes in the sequence by slightly reorienting the backbone. We apply the same in our design approach to achieve a realistic estimation of ΔΔG upon mutation. Backbone relaxation also helps to relieve any nonphysical atomic interactions in the protein that might bias the van der Waals energy, and hence the total energy, of the chaperone-substrate complex.
2.5. Eris – An automated tool for estimating stability changes upon mutation
We have developed a web-based tool, Eris (
http://eris.dokhlab.org), for automated estimation of the change in protein stability upon mutation. We benchmarked Eris on 595 mutants with experimentally measured ΔΔG values. We observe that there is significant correlation (0.75; P = 2 × 10
−108) between the Eris-estimated and experimentally determined ΔΔG values, an overall performance that is comparable with that of other methods (
46–
48). However, unlike other methods, Eris accurately predicts the effect of small-to-large side chain mutations by effectively relaxing the backbone structures and resolving the clashes introduced by larger side chains. In direct comparison with other methods available on web servers, we computed stability changes upon small-to-large residue mutations, and we found that Eris outperforms other methods. Additionally, Eris provides a protein structure pre-relaxation option, which remarkably improves the prediction accuracy when a high-resolution protein structure is not available, such as when only a homology model of the protein exists. Because of its unbiased force field, side-chain packing, and backbone relaxation algorithms, Eris is applicable to a broad spectrum of mutations evaluated during protein engineering and design. Additionally, an integral step of Eris is backbone relaxation, when severe atom clashes or backbone strains are detected during calculation (
50,
51).
2.6. Ensemble-based profile prediction
In the Monte Carlo based approach outlined above, we fix the backbone of the chaperone and the substrate throughout the modeling exercise. However, we have also shown that incorporating backbone flexibility into our design principles improves estimation of the binding energy of a protein-peptide complex. A more recently reported methodology takes into account backbone flexibility of both the protein and the peptide, under the assumption that local conformational changes in the protein and/or peptide might improve the prediction of binding affinity between the two entities (
52). In this methodology, an ensemble of structurally distinct states is generated, reasoning that proteins are flexible to varying extents. The ensemble is used to generate a tolerance profile that represents a set of amino acids that are energetically tolerated at the protein-protein interface. The steps followed in this algorithm are described in detail below.
(i) Ensemble generation Small changes in side chain rotamer orientation and/or hydrogen bonding partners lead to significant side chain motion perpendicular to the chain direction. As a result, the corresponding residue and its adjacent peptides twist slightly around the backbone. This phenomenon is termed “backrub” motion in proteins (
53). Backrub is observed in sub-Angstrom resolution crystal structures, where the alternate conformations of the protein backbone exhibit a highly localized plasticity of small amplitude coupled to a larger, two-state conformational change of the side chain (
53). Smith and Kortemme take advantage of this minor perturbation in the protein backbone to computationally generate an ensemble of structurally distinct states using Monte Carlo simulations. In these simulations, side chain rotamer moves are performed, retaining the lowest-scoring structures obtained during the simulation (
52). Smith and Kortemme used the Rosetta protein-modeling suite with backrub and side chain rotamer moves to generate the structural ensemble of states (
54).
(ii) Profile prediction A tolerance profile for a given set of positions in the protein-protein interface can be obtained using the genetic algorithm based approach proposed by Humphris and Kortemme (
55). After every generation of sequence propagation using the genetic algorithm, Monte Carlo-simulated annealing will be used to minimize the energy function over the entire protein complex. During this process, optimal side chain rotameric orientations are chosen for the selected sequence from the Dunbrack rotamer library (
56). During the repacking, only the amino acids that are within 4 Å of any other amino acid are allowed to change their rotamer orientations, while the others are not considered for redesign. The efficiency of binding (the binding score) between the protein and peptide is then estimated based on the inter-chain score across the interface. The difference between the overall complex score and the binding score will be used as an estimate for folding, assuming that the score for which the sequence remains unchanged is constant. With the binding score of the native complex as a benchmark, sequences that offer a binding score within 1% of that of the native complex are considered a part of the tolerance profile. This procedure is repeated for all the near-native structural states generated using the backrub algorithm, and the profiles thus obtained are included in the tolerance profile.
The outcomes of this approach are similar to the Monte Carlo based consensus motif identification algorithm outlined above. The set of sequences generated following this protocol can be used to rationally design the interface between the chaperone and substrate, such that a target peptide can be engineered to be a substrate for a specific chaperone.
2.7 Case study: Substrate recognition by the yeast chaperone Ydj1p
Ydj1p (Yeast DnaJ 1) is the yeast homolog of
E. Coli DnaJ and a representative member of the 40 kDa heat shock proteins (Hsp40s). These proteins are essential for normal cell growth and the survival of yeast from heat stress, and are involved in protein translocation across the membrane as well as protein folding and degradation (
57). Ydj1p has been shown to influence the assembly-state of endogenous yeast prions, and it influences the aggregation of fragments of huntingtin in yeast models (
58,
59). The structure of this protein with a co-crystallized 7-residue peptide substrate (GWLYEIS) in its C-terminal peptide-binding domain (PDB: 1NLT) clearly indicates that the substrate binds to the chaperone via beta-strand extension on the surface of the protein (
30). Our goal in this study was to identify peptides that potentially bind Ydj1 with affinities comparable to the co-crystallized peptide. We applied the fixed backbone redesign protocol described above in order to identify a consensus sequence from the computationally designed peptide library, which had energetically favorable substrates for Ydj1. We systematically mutated each position on the peptide to all possible amino acids, and estimated the change in stability upon mutation. This process is analogous to the generation of a random peptide library, often performed in phage display screening. We then compared the binding affinity of each designed peptide to that of the peptide co-crystallized with Ydj1 (referred to as wild type peptide). We isolated a set of ~2500 peptides featuring a binding affinity of >75% of that of the wild type peptide. We computed the propensity of each amino acid to appear in a given position on the peptide obtained from the computationally generated peptide library, and arrived at a consensus sequence (GX[LMQ]{P}X{P}{CIMPVW}, where [XY] denotes either X or Y and {XY} denotes neither X nor Y) that acts as a sufficient condition for recognition by Ydj1 (
31). We experimentally verified that the computationally generated consensus sequence is in good agreement with the trend observed by phage display screening (
31,
45). We therefore demonstrated that our peptide redesign methodology could be generally applied to computationally mimic phage display screening for the identification of potential binding motifs recognized by chaperones.
We conducted a yeast proteome-wide screen for peptides satisfying the identified consensus motif. Interestingly, the hits obtained from our proteome screen include proteins that are either known substrates of Ydj1 (e.g. prions) or other chaperones with which Ydj1 has been shown to interact (e.g. Hsp70, Hsp90), along with other uncharacterized proteins. These results indicate that our methodology can be applied to identify potential new candidate substrates for a given chaperone. We acknowledge that these observations require further experimental verification to validate our claim.