|Home | About | Journals | Submit | Contact Us | Français|
The Φ-value analysis approach provides information about transition-state structures along the folding pathway of a protein by measuring the effects of an amino acid mutation on folding kinetics. Here we compared the theoretically calculated Φ values of 27 proteins with their experimentally observed Φ values; the theoretical values were calculated using a simple statistical-mechanical model of protein folding. The theoretically calculated Φ values reflected the corresponding experimentally observed Φ values with reasonable accuracy for many of the proteins, but not for all. The correlation between the theoretically calculated and experimentally observed Φ values strongly depends on whether the protein-folding mechanism assumed in the model holds true in real proteins. In other words, the correlation coefficient can be expected to illuminate the folding mechanisms of proteins, providing the answer to the question of which model more accurately describes protein folding: the framework model or the nucleation-condensation model. In addition, we tried to characterize protein folding with respect to various properties of each protein apart from the size and fold class, such as the free-energy profile, contact-order profile, and sensitivity to the parameters used in the Φ-value calculation. The results showed that any one of these properties alone was not enough to explain protein folding, although each one played a significant role in it. We have confirmed the importance of characterizing protein folding from various perspectives. Our findings have also highlighted that protein folding is highly variable and unique across different proteins, and this should be considered while pursuing a unified theory of protein folding.
Understanding the problem of protein folding poses a great challenge, particularly from a biophysics perspective. The protein-folding problem involves three major issues: (1) how the three-dimensional (3D) structure of a protein is determined from its amino acid sequence, (2) how a protein folds from a random coil to a native structure, and (3) how the 3D structure is characterized from static and dynamic perspectives. The second problem, which concerns the folding process (or folding pathway) of a protein, is the focus of this paper.
Regarding the folding process, there are currently two major hypotheses: the three-step mechanism or framework model [1–4] and the nucleation-condensation model [5,6]. In the framework model, secondary-structure elements fold first, followed by the coalescence of preformed secondary-structure elements to yield the native structure. The occurrence of step-by-step folding from short- to medium- and then to long-range interactions is a key assumption. In contrast to this, the nucleation-condensation model assumes that the secondary and tertiary structures are formed in parallel, or even in reverse order. That is, the long-range interactions or tertiary structures can be formed before or at the same time as the secondary structures. Interpretations of these two hypotheses seem to differ among researchers. For convenience and clarity, in this paper we refer to these two hypotheses but with the following specific preconditions: the framework model does not allow longer-range interactions to form before shorter-range interactions, but the nucleation-condensation model does.
The above key principles have contributed significantly to the conceptual basis of protein folding. However, there is currently a critical limitation: the difficulty of quantitatively connecting theoretical predictions to experimental results . Researchers have not been able to use an existing theory to consistently interpret their experimental results. Progress has been made towards eliminating this limitation by using improved simulations based on molecular dynamics and analytical calculations that use simple statistical-mechanical models. Since computer simulations of protein folding are time-consuming and can be performed only for one specific protein at a time, simple statistical-mechanical models of protein folding—which permit the direct analysis and fitting of experimental data—have been desired.
An Ising-like statistical-mechanical model of protein folding was first introduced by Wako & Saitô in 1978 [8,9]. This model was developed further by Gō & Abe , and Abe & Gō  demonstrated that it could accurately reproduce the results of computer simulations of two-dimensional lattice-protein folding. However, it was some time before a similar model could be applied to real proteins. This was first accomplished by Muñoz & Eaton in 1999 , who demonstrated that theoretically calculated folding rates could correlate well with experimentally observed ones.
In this paper, the Φ values for every residue of a specific protein calculated using a simple statistical-mechanical model are discussed and compared with the corresponding experimentally observed Φ values. Despite the simplicity of the statistical-mechanical model, the experimental Φ values were reasonably reproduced for many proteins, but not for others. Whereas the successful cases supported the assumptions of the model, the failures revealed the limitations of those assumptions. We will discuss protein-folding mechanisms while examining specific examples where the model succeeded and where it failed.
Φ-value analysis was introduced by Fersht and his colleagues [13,14] to provide information regarding the transition-state structures along the folding pathway of a protein by measuring the effects of amino acid mutations on folding kinetics. The Φ value is calculated as ΔΔG‡-D/ΔΔGN-D for a two-state-folding protein, where ΔG‡-D and ΔGN-D are the free energy (FE) differences between the unfolded and transition states and between the unfolded and folded states, respectively, and ΔΔ denotes the changes in these FE differences brought about by a point mutation at a specific amino acid residue. A Φ value of 1 indicates that all of the mutated residue interactions are formed in the transition state whereas a Φ value of 0 means that the residue is not involved in stabilizing the transition state. Intermediate Φ values indicate that the interactions are partially formed or that there are two populations with mostly unfolded and mostly folded states. However, because the relationship between the actual Φ value and the extent of protein structural formation is not necessarily linear, the interpretation of intermediate Φ values is still controversial .
Φ values have been investigated experimentally for many two-state-folding proteins (see Table 1). Φ values obtained in such experiments have provided valuable information for refining protein-folding models such as the nucleation-condensation model and the framework model .
Φ values have also been studied from a theoretical perspective. Since folding simulations can provide information about the folding pathway of the protein, the transition-state structures in these simulations are associated with the experimentally observed Φ values [17–19]. More directly, Φ values have been calculated based on the FE profiles obtained from statistical-mechanical models of protein folding [12,20–27]. Although various models have been proposed, their concepts are essentially similar and assume that each residue is in one of two states: native (folded) or random-coil (unfolded). It is also assumed that one or more contiguous segments in the polypeptide chain consists of residues in the native state, and that only individual amino acids in a pair within a native-structure polypeptide segment and in contact in the native structure interact with each other (this is the so-called Gō model). Zero interactions are assumed to occur in segments consisting of residues still in the random-coil state. In these studies, the calculated Φ values generally showed a reasonable agreement with the experimental Φ values.
In this paper, we report the results of Φ-value calculations made using a previously developed, alternative, simple statistical-mechanical model of protein folding [28–30]. Although this model was developed earlier than those mentioned above, various studies have recently adopted this model or others that are similar but extended, to study the protein-folding problem [25,27,31–38]. We also determined the Φ values for lattice proteins using this alternative model with particular focus on the dependence of lattice proteins on amino acid sequence and folding topology [28,29]. In line with these studies, we have calculated the Φ values of real proteins in the present study. For these calculations, we selected 27 proteins (each with 54 to 128 amino acid residues in length) with known experimentally derived Φ values.
In the next section, we briefly describe a statistical-mechanical model of protein folding that we developed. The calculated Φ values are shown and discussed in Results and Discussion along with the FE profiles, contact-order (CO) profiles, and sensitivities to the parameters used in the Φ-value calculation. In this study, we investigated protein folding from various perspectives rather than assessing the specific statistical-mechanical model used here, which we have already done in a previous paper .
The simple statistical-mechanical model of protein folding and unfolding used in this study was constructed with the intention of introducing the following picture of protein folding [8–11,28–30]. A protein folds in a stepwise manner along the polypeptide chain; this assumption conforms to the framework model. In this scheme, the first stage of folding includes short-range interactions that work dominantly to form small, native-like, or secondary structures such as the α-helix, β-strand, and hydrogen-bonded turn. Next, these structures grow gradually through medium-range interactions. Finally, these substructures coalesce into the native structure via long-range interactions.
It should be noted that our model is not exactly the same as the framework model. Whereas the framework model usually assumes a highly defined folding pathway in which specific local structures (e.g., secondary structures, super-secondary structures, domains) are formed in a hierarchical way, our model is more probabilistic, i.e., any local structures can form in any order according to their statistical weights, as described below. When we say that our model is based on the framework model, we refer to the important feature common to both the framework model and our own: longer-range interactions are not allowed to form without forming short-range interactions first.
The statistical-mechanical model of the folding and unfolding of a protein consisting of n residues in the above-mentioned picture is formulated as follows [28–30] (see Supplementary Text S1 for details).
The partition function for this statistical-mechanical model is given by the following recurrent relationship:
where β=1/kBT, kB is the Boltzmann constant, T is the absolute temperature, and Z1,j is the auxiliary partition function of a hypothetical protein molecule consisting of amino acid residues 1 to j. By definition, the partition function of the entire protein molecule is Z1,n (Z(T)).
The conformational energy of a local structure consisting of amino acid residues m to j is given as
where U(ξk, ξl) is the interaction energy between amino acid residues ξk and ξl. The contact matrix Γk,l=1 if amino acid residues k and l are in contact with each other in the native conformation, and Γk,l=0 otherwise. Because the interactions between neighboring residues and residues separated by one residue are not considered, Γk,l=0 for |k–l|≤2. U(ξk, ξl) depends on amino acid types ξk and ξl. Γk,l, i.e., whether the residue pair is in contact or not depends on the native structure. Consequently, E(m, j) depends on the amino acid sequence and the native structure. In other words, the amino acid sequence and the native structure are considered through E(m, j) in this model.
Function f(m, j) in Eq. (1) corresponds to the number of possible conformations of a segment between the amino acid residues m and j in the random-coil state; thus, kB ln f(m, j) is the chain entropy of the segment in the random-coil state, and −kB ln f(m, j) is the entropy loss of the segment when it forms the local structure.
Following our previous paper , we assigned a single value <0 to U(ξk, ξl), independent of the amino acid type, ξk or ξl (homogeneous contact-energy approximation). We set =−0.10 for all calculations in this study. For f(m, j), because we did not have any established values representing a real protein, we tentatively used the same formulation values that had been used for the lattice protein in the previous paper, i.e., f(m, j)=1.4084×(4.750)j–m–2 or ln f(m, j)=(j–m–2) B+ 0.3425, where B=ln 4.750=1.5581. In addition, we examined another B value, i.e., B=1.3 for comparison in this study.
A contact matrix Γk,l is defined as follows (as in the previous paper ): if a distance of at least one atom pair in two residues is shorter than a given cutoff distance Dc, this residue pair is considered to be in contact. We performed Φ-value calculations with four Dc values, 4.0, 5.0, 5.5, and 6.0 Å, in our previous study; and in that study, we found that the results were sensitive to the Dc value for some proteins but not for others. In the current study, however, two cases, Dc=4.2 and Dc=5.5 Å, were examined to study Dc dependence in the Φ-value calculations.
The energy (enthalpy) of the system, Eh, is expressed by the integer h in units of −0.01, i.e., Eh=h0 and 0=−0.01, for computational convenience. Eventually, the partition function Z is given as a polynomial with two variables, t and u, as a function of the temperature T [29,30,40].
and t is a dummy parameter introduced to count the number of amino acid residues in the native state η, and is set to unity in the last result. η runs from 0 to n–3. The coefficient Ω(η, h) for given values of η and h can be calculated using the recurrent equation, Eq. (1). W(η, T)tη is the sum of the statistical weights over all states with the given number of amino acid residues in the native state η and at a temperature T.
We define the FE for a given η from Eq. (4):
Eq. (6) is used to calculate the FE profiles of the proteins with η as the folding reaction coordinate. The enthalpy and entropy at a given temperature T can also be calculated (see Supplementary Text S1), but they are not discussed in this paper.
The kinetics of the folding and unfolding processes of proteins (such as their folding and unfolding rates) were formulated as a motion along a one-dimensional FE profile, with the number of amino acid residues in the native state as a reaction coordinate, as established by Muñoz & Eaton  and extensively examined by Henry & Eaton . Because we had the FE profile F(η, T) of Eq. (6), we applied the method of Muñoz & Eaton. According to this method, using a simple approach that involved solving a system of differential equations describing reversible hopping between adjacent discrete values of reaction coordinates (η and η+1 in this study), the characteristic relaxation rate can be given as follows:
Here, an equilibrium value of η (at temperature T),
can be calculated using F(η, T), where peq(η) is the probability that a conformation has η amino acid residues in the native state:
<η>eq and peq(η) are functions of T, but T is omitted here for clarity.
The relaxation rate k is estimated as the mean rate of relaxation of the average number of native residues to its equilibrium value from a starting point of two initial conditions: first with the entire population in the completely unfolded state, i.e., η=0, and second with the entire population in the native state, i.e., η=n–3. In this paper, the former relaxation rate k above 1/Tm and the latter k below 1/Tm are regarded as the folding and unfolding rates, kf and ku, respectively. A transition temperature, Tm, is defined as ln kf(Tm)= ln ku(Tm).
We considered a single amino acid substitution for each residue of a given protein. For a single amino acid substitution at the kth residue (for example, if amino acid residue ξk in the wild-type protein is replaced by amino acid wk), we simply assumed that U(ξk, ξl) is transformed to U(wk, ξl) in Eq. (2). Thus, the Φ values of the kth residue were calculated for the two cases: one where U(wk, ξl)=+0.01 and the other where U(wk, ξl)=−0.01, hereafter referred to as the “plus perturbation” and the “minus perturbation,” respectively (recall that U(ξk, ξl)= for each residue pair in the wild-type protein). The responses to the substitutions were examined based on changes in the logarithmic folding and unfolding rates, ln kf and ln ku, respectively, determined in this study. The corresponding rate changes, Δln kf and Δln ku, were used to calculate the Φ value defined by Eq. (10):
where and , because and . The superscripts “wild” and “mut” denote the wild-type and mutated proteins, respectively.
From the two Φ values calculated by the plus and minus perturbations, a Φ value for a specific residue is defined as follows. (i) A Φ value greater than 1.3 or smaller than −0.3 is considered “irregular” and is then regarded as having no defined Φ value. (ii) If 1.0<Φ<1.3, then Φ is set to 1.0, and if −0.3<Φ<0.0, then Φ is set to 0.0, because the Φ value is essentially defined as 0≤Φ≤1. Furthermore, (i) and (ii) are applied separately to the Φ values obtained in the plus and minus perturbations. (iii) The two Φ values are averaged. If only one Φ value is available, then it is defined as a Φ value of this residue. If neither of them is available, then the Φ value of this residue is classified as “not defined.”
To investigate the calculated Φ values, we generated a CO profile—a concept introduced in the previous paper . CO has been used to characterize the complexity of the folding topology of a protein in relation to its folding kinetics [42–46]. In line with this, the CO profile is defined as the cumulative number of native contacts plotted against k, where ρi is the number of native contacts between two residues whose mutual distance along the polypeptide chain is i. The following relationship holds :
The second term on the right-hand side, ∑kρk, corresponds to the area of the upper left region of the k vs. ck curve and is equal to the CO of a given protein (see Fig. 4 below). The CO profile is more informative than the CO, as described below.
Using the simple statistical-mechanical model, Φ values were calculated for 27 proteins for which Φ-value analysis had already been performed experimentally. The Protein Data Bank (PDB) entry codes, sizes, and fold classes of the proteins that we investigated are listed in Table 1. This protein set includes some proteins that belong to the same superfamily, such as 1SRM, 1SHG, and 1FYN, all of which are in the SH3 domain-containing superfamily. However, since the available data were limited, we did not omit data from homologous proteins; some would consider the data from such proteins to be redundant, but from a practical standpoint, their omission would make performing a statistical analysis far more difficult. Furthermore, comparisons between homologous proteins were expected to actually provide more useful information about the nature of protein folding. Accordingly, we used all the available data, disregarding perceptions of protein redundancy, in the following analyses. It is necessary, however, to remember this point in the statistical discussion below.
For each protein studied, the PDB entry code given in Table 1 is used as the protein name hereinafter, for convenience.
In this study, we were interested in characterizing protein folding from various perspectives. As is well known, protein size and fold class are primary characteristics of a protein that determine its folding dynamics. We can also characterize a protein by its Φ-value profile, FE profile, and CO profile. In addition, a Φ-value’s sensitivity to the parameters used in the Φ-value calculation can be examined. The Φ-value calculation was performed for four sets of parameters, hereinafter referred to as D42_13, D42_16, D55_13, and D55_16, each of which denote parameter values (Dc, B) = (4.2, 1.3), (4.2, 1.5581), (5.5, 1.3), and (5.5, 1.5581), respectively (see Materials and Methods). The following data pairs were used as described here: (D42_13, D42_16) and (D55_13, D55_16) were used to investigate the sensitivity to the chain entropy parameter B, and (D42_13, D55_13) and (D42_16, D55_16) were used to investigate the sensitivity to the cutoff distance Dc.
The results are summarized in Figure 1. We will discuss them below.
Φ-values are one of the few types of experimental data that can both A) give us information on the protein folding process, and B) be directly compared with theoretically calculated properties. In Figure 2, the Φ-value profiles calculated for every residue using the abovementioned statistical-mechanical model are shown with the corresponding experimentally observed Φ values for four example proteins: 1TEN, 1SHG, 3CI2, and 1AYE. Experimentally observed Φ values were available for some, but not all, of the residues. The Φ-value profiles for the other proteins are given in Supplementary Figure S1.
The theoretically calculated Φ-value profiles were assessed by their correlation coefficient (CC) with the experimentally observed Φ values. The results are shown in Table 2. The results from Garbuzynskiy et al.  are also shown, when available, for comparison. For Table 2, if the Φ-value experiments were carried out in more than one condition, we show only the case with the best CC. All the results are given in Supplementary Table S1.
In our previous study , we examined CC values to assess our statistical-mechanical model of protein folding. In this study, however, we shift our perspective to the characterization of protein folding using CC values, because they may imply whether or not a specific protein folds in the scheme assumed in our statistical-mechanical model. We consider here that differences in CC values reflect variation in active protein folding mechanisms. In Figure 1, proteins are classified into three groups according to the best CC value for each protein: CC≥0.6, 0.3≤CC<0.6, and CC<0.3. We assess the groups as follows: the “CC≥0.6” group is a case where our model successfully reproduced the experimental Φ values, the “CC<0.3” group is a case where our model failed, and the “0.3≤CC<0.6” group is an intermediate case—where the experimental Φ values of some regions in a specific protein were successfully reproduced by our model, but others were not. In other words, the “CC≥0.6” group and the “CC<0.3” group are rough indicators of the two major protein-folding schemes: the framework model and the nucleation-condensation model, respectively.
As mentioned in the Introduction, the concepts of the framework model and the nucleation-condensation model somehow remain ambiguous. In this paper, we confine our attention to their difference relating to the order in which short-, medium-, and long-range interactions are formed; that is, whereas longer-range interactions are assumed never to be formed before intervening shorter-range interactions in the framework model, long-range interactions can form before shorter-range interactions in the nucleation-condensation model.
Figure 2A for protein 1TEN shows that, although the Φ-value profile shows irregular trends, the experimentally observed Φ values were accurately reproduced. As described below, 1TEN is highly sensitive to plus/minus perturbations (it is also sensitive to other parameters used in the Φ-value calculation). The minus perturbation generated irregular Φ values for many residues. Rather than using the average of the two Φ values generated by the plus and minus perturbations, the Φ values generated by the plus perturbation were assigned to the Φ values of such residues. It is interesting that despite this potential confounding issue, 1TEN was found to belong to the high CC group.
Figure 2B for protein 1SHG is another example of CC≥0.6. Among the studied proteins, 1SHG showed high sensitivity to the parameters Dc and B. Whereas the CC was very high for Dc=5.5 Å, it was very low for Dc=4.2 Å. As for B, the CC for D42_13 was less than that for D42_16, whereas no difference was found between D55_13 and D55_16. The Φ-value profiles for the four parameter-sets differed significantly. However, because the number of experimentally observed Φ values was small relative to the number of residues, the statistical significance of the CC values was low. For example, although the calculated Φ values of the residues in the long N-terminal loop differed considerably depending on the parameter sets, such differences could not be assessed because of the lack of corresponding experimental data.
(2) 0.3≤CC<0.6. This category contains ten proteins. Only one example is given in Figure 2C for 3CI2. Whereas the Φ values of the residues in the N- and C-terminal regions were accurately reproduced, those in the middle of the polypeptide chain showed significant difference from the experimental Φ values. The statistical-mechanical model of a linear chain polymer has a tendency, in general, to have lower Φ values at both of the terminals because of a boundary effect, and to have higher Φ values in the middle of the chain because of strong cooperativity (owing to more interactions being present in the middle than at the terminals). Similar situations were observed in other proteins belonging to this category, such as 2ACY, 1BF4, and 1TIU (see Supplementary Fig. S1).
Figure 2C for protein 3CI2 also demonstrates a case where there was a variety of experimentally observed Φ values for a specific residue.
(3) CC<0.3. This category is composed of the remaining seven proteins. These cases, which did not successfully reproduce experimentally determined Φ values, served to highlight the differences between the framework model and the nucleation-condensation model. Whereas our model is based on the framework model, the experiments suggested that the nucleation-condensation model is active for this protein.
Only one example is shown in Figure 2D for protein 1AYE. This protein has the secondary-structure sequence β1-α1-β2-β3-α2-β4. The experimental results suggested that while the folding nucleus is made by the packing of β1 and β3-α2 to form a β-sheet, β1-β3, in the transition state, the remaining parts, α1-β2 and β4, are completely unfolded. This means that the two separated regions, β1 and β3-α2, interact first before the formation of α1-β2 between them. Such a situation is beyond the scope of the assumptions of our model.
In other examples of CC<0.3 (Fig. 1; see also Supplementary Fig. S1), protein 2ABD has the secondary-structure sequence α1-α2-α3-α4. The experimental results suggested that the interaction between the N- and C-terminal α-helices, α1 and α4, initiates the folding process before the formation of the central part, α2-α3. Contradicting the experimental results, the calculated Φ values indicated that α2-α3 was formed but α1 and α4 were unstructured in the transition state. As a result, the CC for 2ABD was largely negative. It should be noted, however, that the experimental Φ values around residues 20–40 varied widely, and the authors of the study that generated those experimental values  pointed out that some of the data for these residues were not well suited for Φ-value analysis because the mutations probably significantly changed the folding pathway of the wild-type protein.
The 1BTB protein, which is another example from the “CC<0.3” group, has the secondary-structure sequence β1-α1-α2-β2-α3-α4-β3. The experimental results suggested that the β-sheet composed of β1 and β3 in the N- and C-terminals, respectively, is formed in the early stage of folding. By contrast, the Φ-value calculations indicated that whereas the C-terminal is structured in the early stage of folding, the N-terminal region is not.
In the case of protein 2PTL, which has the secondary-structure sequence β1-β2-α1-β3-β4, there may be another reason for observing CC<0.3. Whereas the N-terminal β-sheet of β1 and β2 was well predicted by the Φ-value calculations, the unstructured α-helix in the middle of the polypeptide chain could not be reproduced very well. As discussed in our previous paper  and also as pointed out above, our model has a strong tendency to describe the early formation of an α-helix in the middle part of the polypeptide chain, owing to the dominance of short-range interactions in an α-helix and the avoidance of boundary effects.
We should be careful, however, when using CC values for the above discussions, because CC value may not necessarily be a reliable characteristic for assessing calculated Φ values. Whereas these calculations can generate a Φ value for any residue, the available experimental Φ values were limited. It is not clear how to assess irregular Φ values or residues for which Φ values cannot be obtained experimentally. For example, calculating the CC value after removing two Φ values of the C-terminal region from the 19 experimentally observed Φ values for protein 1TTF (see Supplementary Fig. S1) would result in changing the CC value from −0.09 to 0.28 in the D42_16 case. Moreover, Φ values for a specific protein frequently differ significantly, depending on the experimental conditions (see Supplementary Table S1).
Since Φ-value profiles are calculated based on FE profiles, it is expected that FE profile characterization should be associated with protein-folding characterization. In Figure 3, the FE profiles of three proteins, 3CI2, 1RIS, and 1URN are shown for four parameter sets, D42_13, D42_16, D55_13, and D55_16 (FE profiles for the other proteins are given in Supplementary Fig. S2). Since the profiles shifted upward or downward depending on Dc, i.e., the number of interactions, we focused on the shapes of the profiles. It was found that the profile shapes were essentially similar if their chain entropy parameter B values were equal. A few exceptions were observed, such as 1CSP and 1N88. Conversely, a change in B value causes a change in shape of the FE profile. The major differences between the shapes of the FE profiles appeared in the region from the transition state to the folded state, i.e., from the central peak to the valley on the left side of the FE profile. As far as the unfolded region is concerned, the profile shape is conserved, whereas the distance from the local minimum to the peak at the transition state can change.
It was also found that, while the FE profiles of some proteins have a single peak at the transition state, others have more than one peak (usually double peaks) or a plateau. The latter case means that a transition state cannot be assigned to a single folding reaction coordinate, but should be assigned to a range of coordinates. The secondary peak in a double-peaked profile is sometimes ambiguous, making it difficult to confidently judge its significance. We classified FE profiles into three types: single-peaked, double-peaked, and intermediate [denoted by (S), ((D)), and (D), respectively, in Fig. 1]. This classification could not be defined in a rigorous manner, but was carried out by visual inspection of the profiles. The FE profiles of 3CI2, 1RIS, and 1URN shown in Figure 3 were classified as single-peaked, intermediate, and double-peaked, respectively.
1CSP and 1N88 were exceptional cases, as mentioned above. Interestingly, while the FE profile of 1CSP with the parameter set D42_16 was single-peaked, the FE profiles calculated with the other parameter sets were double-peaked. This difference is reflected in the CC values: whereas the CC value was the lowest for D42_16, that for D55_16 was the highest. Accordingly, the FE profile of 1CSP was classified as double-peaked. For the FE profiles of 1N88, the shapes of the FE profiles for D42_16 and D55_16 differed considerably. While the former profile had a plateau, the latter was double-peaked. The CC values for the former profile were slightly better than those for the latter one.
According to Figure 1, whereas in the “0.3≤CC<0.6” and “CC≥0.6” groups, more proteins had a single-peaked FE profile than a double-peaked FE profile (10 vs. 8), this was reversed in the “CC<0.3” group (2 vs. 4). However, this difference was not statistically significant. Consequently, whether the FE profile was single-peaked or double-peaked was not a crucial factor for the CC values. The Φ values were essentially defined based on the single-peaked FE profiles. However, since Φ values can be calculated in the fashion described above, regardless of the double-peaked FE profile at the transition state, it may be necessary to reconsider Φ-value calculation for double-peaked FE profiles. The theory of multidimensional representation of an FE profile of protein folding that was proposed by Itoh & Sasai  would be highly relevant to any further considerations of this issue.
At the outset of this study, we expected to obtain useful information about protein folding by determining the FE profiles of the evaluated proteins. In our previous paper , we demonstrated a close relationship between a protein’s FE profile, its folding rate, and its fold class; however, in this study we realized that it is necessary to consider the fine structural changes of the FE profile to be able to predict changes in Φ values. For example, the FE profiles of 1ENH and 1IDY (Supplementary Fig. S2) seemed to be almost identical. Consequently, their calculated Φ-value profiles were grossly similar, but differed significantly in their fine details. FE profile changes that were induced by imposing perturbations were responsible for the observed changes in Φ values. However, it was hard to find such changes in the FE profiles, because such changes were not only small, but were also distributed over all regions of the profile. Similar situations existed for proteins in the SH3 domain-containing superfamily, namely 1SRM, 1SHG, and 1FYN (see Supplementary Figs. S1 and S2).
CO is one property of proteins that is well-known to be correlated with folding kinetics. The CO profile, which was discussed above, is an extension of the CO; the cumulative number of native contacts ck, is plotted against residue-residue distances along a chain, k. This profile provides information about the respective contribution ratios of the different interaction types (short-, medium-, and long-range) to the CO. Figure 4 shows the CO profiles of 27 proteins, in which ck and k were normalized between zero and one to adjust their protein sizes. It was found that the CO profiles could be classified into three types, referred to as Types A, B, and C (Fig. 4A, B, and C, respectively). In this classification, the reference curve (a dashed line in Fig. 4), ck=− (k–1)2+1 (with normalized ck and k), was used. This reference curve can be utilized if the number of native contacts of a residue-residue distance along the chain i decreases in inverse proportion to i, i.e., ρi≈n–i, where n is the number of residues of a given protein.
If a whole CO-profile curve was located above the reference curve, the protein was classified as Type A. If a CO-profile curve for small k was located above the reference curve, but for large k, it dropped below the reference curve, the protein was classified as either Type B or Type C. If the CO-profile curve cut across the reference curve around middle k and was close to the reference curve throughout, the protein was classified as Type B; if it cut across the reference curve around large k, the protein was classified as Type C. Short-range interactions were dominant in Type A proteins. The CO-profile curves of Type B proteins contained, as the reference curve, the short-, medium-, and long-range interactions that decreased in inverse proportion to residue-residue distance. Type C proteins were characterized by very long-range interactions such as interactions between N- and C-terminal regions. Since the upper left area of the curve corresponds to CO as pointed out by Eq. (11), the CO values of Type A proteins were the smallest and those of Type C were the largest.
Figure 1 shows that the CC values of Type A proteins were greater than those of Type C proteins (while the ratio of Type A to Type C was 9:5 in the “CC≥0.6” and “0.3≤CC<0.6” groups, it was 1:4 in the “CC<0.3” group). The CC values of the Type B proteins were of intermediate magnitude. Since the Type A proteins were rich in short-range interactions, it would be natural to expect that such proteins would be more likely to adopt protein-folding mechanisms consistent with the framework model. Another remarkable finding from this study was that Type A proteins were sensitive to the parameters used in the Φ-value calculation. The most plausible reason for this is that smaller proteins, which are more sensitive to these parameters, were more likely to be classified into Type A than into other groups. This finding indicates that CO-profile type, as a method of categorizing proteins, is probably not directly related to parameter sensitivity— although this cannot be completely ruled out.
The Type C proteins in the “CC<0.3” group were interesting because the long-range interactions, such as those between the N- and C-terminal regions, were formed at an early stage of the folding process without being preceded by the formation of secondary structures in the middle of the chain; that is, their protein-folding proceeded in accordance with the nucleation-condensation model. Unfortunately, the model studied in this paper cannot take into account the nucleation-condensation model. This is not because we deny that the nucleation-condensation mechanism can be active in protein folding, but simply because it is very difficult to construct a solvable statistical-mechanical model that can simulate this mechanism. However, we are interested in examining how proteins can fold in accordance with the nucleation-condensation model in future work.
CO has previously been discussed in terms of the folding rate of a protein [42–46]. In our previous paper , we demonstrated that the folding rates calculated by the present model were also well correlated with the experimentally determined folding rates of 72 proteins (with CC=0.81). Because Φ values were defined based on folding and unfolding rates, and such rates are related to the CO, the Φ-value profile should be related to the CO or the CO profile; and we did indeed find some relationships between them. However, it was also apparent that these are not the only factors that determine Φ values and protein folding.
A single set of parameter values to be used in our model has not yet been established. In fact, Table 2 shows that the best CC values for individual proteins (see underlined figures) were obtained by using different parameter sets, i.e., D42_13, D42_16, D55_13, or D55_16. In other words, Φ-value calculations are sensitive to the parameters chosen. The cutoff distance Dc that defines a residue-residue contact is one of the most important parameters, because it defines the 3D structure of a specific protein in the model. Parameter B, chain entropy, is also important, because protein-folding transitions occur in the balance between enthalpy gain that occurs with residue-residue interactions and entropy loss—particularly that associated with chain entropy. In addition, we do not have a sufficient number of experimentally determined Φ values. Consequently, it is hard to determine the best parameter set for our model at present. We consider sensitivity of proteins to such parameters as being one of the intrinsic characteristics of protein folding.
Parameter sensitivity was assessed using the root-mean-square difference (RMSD) between two Φ-value profiles that were calculated with different parameter sets. For analysis of plus/minus perturbations, Φ-value profiles for plus and minus perturbations were compared. For analysis of the distance cutoff Dc, Φ-value profiles of [D42_13 and D55_13] and [D42_16 and D55_16] were compared. For analysis of chain entropy parameter B, Φ-value profiles of [D42_13 and D42_16] and [D55_13 and D55_16] were compared. The results are shown in Table 3. The relatively larger RMSD values are indicated by an underline.
Half of the proteins were sensitive to Dc. In particular, small proteins (smaller than 63 residues) such as 1ENH were sensitive to Dc. A natural expectation would be that changes in the residue-residue contact matrix (Γij) caused by a change in Dc would have a significant effect on smaller proteins (see Eq. (2)), and that relatively larger proteins, two examples being “all-β” proteins (such as 1CSP, 1TIU, and 1TEN) and “α+β” proteins (such as 1URN, 1N88, 1FKB, and 2VIL), would be sensitive to Dc. However, we found that many larger proteins and “α+β” proteins were classified into the “not sensitive” group.
Most proteins that were sensitive to chain entropy parameter B were also sensitive to Dc. Smaller proteins such as 1ENH, 1IDY, 1SRM, 1SHG, and 1FYN were easily influenced by changes in the parameters. It is understandable that “all-β” proteins such as 1SRM, 1TEN, 1SHG, 1FYN, and 1CSP would be sensitive to B, because chain entropy plays an important role in the formation of β-sheets. In the formation of a β-sheet, a larger loss of chain entropy must be overcome than in the formation of an α-helix.
It is possible to mutate any protein. Mutations may change a protein’s 3D structure to some extent, and consequently may change the residue-residue contact matrix. A set of proteins with similar 3D structures can be regarded as an example of the outcome of such mutations. As discussed above, 1ENH and 1IDY have similar 3D structures, and their overall Φ-value profiles resembled each other. However, they differed significantly in some specific regions. In this study, we show that their calculated Φ-value profiles reproduce these experimentally observed differences, at least to some extent. We also show that the same is true for the SH3 domain-containing superfamily proteins, 1SRM, 1SHG, and 1FYN.
Beyond this, it is interesting that Φ-value profiles can provide information that is helpful for identifying residues that are sensitive to the model parameters. For example, Φ-value profiles allowed us to determine that residues in the N-terminal regions of 1SRM, 1SHG, and 1FYN were sensitive to model parameters (Supplementary Fig. S1), as were almost all residues in 1SHG. In the Φ-value experiments, the Φ values obtained for some specific residues sometimes differed significantly, depending on substituted amino acid types or on experimental conditions (see for example Fig. 2C). These results imply that mutations of certain residues in each protein have significant effects on protein folding. The sensitivity of specific residues to model parameters can reveal such characteristics of a protein.
In this paper, Φ values were calculated by imposing perturbations of protein structure via single amino acid substitutions. These perturbations changed the predicted interactions of a specific residue with the other residues of the protein. The two perturbation types “plus” and “minus” were defined according to whether the interactions were weakened or strengthened by them, respectively. We hypothesized that when the degree of perturbation is small, the difference in the effect of the plus and minus perturbations should be small; and during the current study, this presumption held true for most proteins. However, it was found that the two types of perturbation produced significantly different results for some proteins: i.e., 1URN, 1TEN, 1SHG, 2VIL, and 1FKB (Fig. 1 and Table 2). Note that these are “all-β” or “α+β” proteins. In these proteins, some residues showed large differences between Φ values calculated in the presence of plus perturbations vs. minus perturbations.
In some cases, either of the two perturbations gave an irregular Φ value, i.e., Φ much greater than one or much smaller than zero. Interestingly, when an irregular Φ value was obtained, it was almost always in the presence of a minus perturbation. In particular, the RMSDs between Φ-value profiles were larger in 1TEN and 1FKB than in other proteins (Fig. 5). Irregular Φ values were obtained for many more residues in these two proteins than in the other proteins. Curiously, the CC values for these two proteins differed considerably. Whereas the CC value for 1TEN was very high, that for 1FKB was very low (see Table 2).
In the experimental situation, the substitution of an amino acid residue with a larger side chain is one of the strategies employed to introduce stronger interactions than exist in the wild-type protein. However, if such a residue disrupts the proper folding of the protein owing to its side chain size, the Φ value may be undeterminable or irregular. In the theoretical model, however, the plus/minus perturbations only change the ensemble of conformations at every step along the folding reaction coordinate. Our results suggest that irregular Φ values can be obtained without disrupting proper folding.
Although we examined the FE profiles of 1TEN and 1FKB, we could not find remarkable differences between their plus/minus perturbations. In general, an irregular Φ value is obtained if Δln kf −Δln ku, which is the denominator of Eq. (10), is too small. One possible explanation is that subtle changes occur throughout the FE profiles, and occasionally a change in the folding rate Δln kf is very close to a change in the unfolding rate Δln ku. However, it should be noted that 1TEN and 1FKB are “all-β” proteins and have complex folding topologies of β-strands. A protein’s folding topology is regarded as complex if the folding process cannot easily be represented as a step-by-step formation from short-, to medium-, and then to long-range interactions. Although the CO is introduced to explain such complexity, more details regarding folding topology may be necessary to explain the remarkable differences observed between the plus/minus perturbations for these proteins. Interestingly, in the FE profile of 1FKB, the difference between the local minimum in the unfolding region and the local maximum in the transition region was considerably larger than that for the other proteins (see Supplementary Fig. S2). Obviously, this reflects the complex topology of folding that exists for 1FKB, which affects its irregular behavior in the Φ-value calculation.
The Φ-value analysis experiment involving circular permutants of the ribosomal protein S6  (PDB ID: 1RIS) provided an interesting example to discuss Φ-value calculation, particularly regarding the important role of the connectivity of the polypeptide chain. A circular permutant is created by dividing a given protein into two fragments, i.e., N- and C-terminal fragments, by an incision; then the N-terminal fragment is connected after the C-terminal fragment. In this analysis, circular permutants were assumed to have the same 3D structure as the wild-type protein, despite the differences in their connectivities.
1RIS has the secondary-structure sequence β1-α1-β2-β3-α2-β4. Five circular permutants, P13–14, P33–34, P54–55, P68–69, and P81–82, were constructed, where the superscripts indicate the position of the incision. All possible incisions between the secondary-structure elements of 1RIS were examined .
Φ values for the circular permutants were calculated by means of the same inter-residue contact information that was used for the wild-type 1RIS, but with differences in the respective connectivity, e.g., 14–97 and 1–13 for P13–14. Figure 6 shows the Φ-value profiles for the wild-type 1RIS and its circular permutant, P13–14. Φ-value profiles for the other circular permutants are given in Supplementary Figure S3. CC values for the wild-type proteins and circular permutants are shown in Table 4.
The two Φ-value profiles shown in Figure 6 are useful for illustrating the significance of chain connectivity . The wild-type 1RIS has the secondary-structure sequence β1-α1-β2-β3-α2-β4; it forms β-sheets with long-range interactions between β1 and β3, and between β1 and β4. When the secondary-structure sequence in the circular permutant P13–14 changes to α1-β2-β3-α2-β4-β1, the interactions between β1 and β4 change to short- or medium-range ones, and the C-terminal region forms a compact domain, β3-α2-β4-β1, that contains β1 and β3. While the circular permutants P33–34 and P54–55 behave similarly to P13–14, P68–69 and P81–82 do not. Accordingly, the Φ-value profiles of P13–14, P33–34, and P54–55 were different from those of P68–69, P81–82, and the wild-type protein, as shown in Figure 6 and Supplementary Figure S3. Owing to the reduction of the long-range interactions in the first group of 1RIS circular permutants, their CC values became markedly better than those of the second group of 1RIS circular permutants. In other words, whereas the former circular permutants underwent folding according to the framework model, the latter ones underwent folding according to the nucleation-condensation model.
Recently, Inanami et al. proposed a novel method to calculate the folding pathway of a multidomain protein in an extended form of the model discussed here . Their target protein was dihydrofolate reductase (DHFR), which has two domains: one comprising a continuous middle part of the polypeptide chain (ABD domain), and the other comprising discontinuous N- and C-terminal parts (DLD domain). According to the limitations of our model, the DLD domain cannot fold without ABD folding. However, Inanami et al. enhanced the model by introducing the “virtual loop-closure mechanism” to partition function in an exactly calculable form, which allowed them to successfully reproduce DHFR folding behavior. Just as the circular permutant strategy can resolve topological complexities of protein folding, the methodology of Inanami et al. seems to open up a new method of reproducing protein folding by the virtual-closure mechanism, especially when the protein folding occurs according to the nucleation-condensation model.
When planning this project, we expected to derive a unified view of protein folding by studying the folding process from various perspectives. We anticipated that the further characterization of proteins would provide a coherent explanation of the information presented in the table in Figure 1. All of the aspects examined in this paper should play a significant role in future studies of protein folding, and they are associated with one another. However, such connections are complex and exhibit highly nonlinear characteristics. It is known that the relationship between the Φ value and the extent of structural formation is not necessarily linear, and thus the interpretation of intermediate Φ values is still controversial. In addition, the number of available data from Φ-value analyses was too small for reliable statistical analyses at the time of this study. Consequently, it is hard to understand the meanings of the Φ values clearly at present. In this study, however, we have confirmed the importance of characterizing protein folding from various perspectives. Our findings have also highlighted that protein folding is highly variable and individual among different proteins. This should be considered when pursuing a unified theory of protein folding.
The Φ values were calculated for 27 proteins with a simple statistical-mechanical model of protein folding. We considered that the correlation between the calculated and experimentally observed Φ values should illuminate the protein-folding mechanism active for each protein, such as the framework model or the nucleation-condensation model. In addition, we investigated other properties associated with protein folding such as the free-energy profile, contact-order profile, and sensitivity to the parameters used in the Φ-value calculation of each protein. Through such analyses, we tried to characterize protein folding from various perspectives to derive a unified theory of protein folding.
This paper is respectfully dedicated to the late Professor Nobuhiko Saitô, who was a Ph.D. supervisor of H. W., and with whom the original statistical-mechanical model of protein folding, “island model”, was developed. This work was supported by JSPS Grant-in-Aid for Scientific Research (C) (Grant No. 16K00407).
Conflicts of Interest
The authors have declared that no competing interests exist.
H. W. and H. A. contributed equally to this work. Both authors discussed the results and implications, and commented on the manuscript at all stages.