|Home | About | Journals | Submit | Contact Us | Français|
Over the past three decades the protein folding field has undergone monumental changes. Originally a purely academic question, how a protein folds has now become vital in understanding diseases and our abilities to rationally manipulate cellular life by engineering protein folding pathways. We review and contrast past and recent developments in the protein folding field. Specifically, we discuss the progress in our understanding of protein folding thermodynamics and kinetics, the properties of evasive intermediates, and unfolded states. We also discuss how some abnormalities in protein folding lead to protein aggregation and human diseases.
Protein folding refers to the process by which a protein assumes its characteristic structure, known as the native state. The most fundamental question of how an amino acid sequence specifies both a native structure and the pathway to attain that state has defined the protein folding field. Over more than four decades the protein folding field has evolved (Fig. 1), as have the questions pertaining to it. This evolution can be divided into two predominant phases. During the first phase, research was focused on understanding the mechanisms of protein folding and uncovering the fundamental principles that govern the folding transition. While the first phase provided general answers to the protein folding question, new and no less ambitious questions arose: what are the mechanisms of protein folding in a context, such as under the influence of other biological molecules in the cellular environment? This next set of questions defined the second phase in protein folding field evolution.
The first phase is akin to a romantic stage of research, where the final goal of studies may not be directly applicable to a broader understanding, or exploitable in a relevant science. The final goal is to determine the basic principles that relate protein sequence and structure. The second phase is a more pragmatic stage of research, where the applications drive research in the field and the rational manipulation of derived knowledge allows engineering of tools for advancement of a relevant science. For example, understanding the functional intermediates that accompany the transition of a protein en route to its native state may allow rational manipulation of protein structure via protein design. This example not only relates protein sequence, structure and function, but also demonstrates the engineering aspect of the modern protein folding field.
Next, we survey the questions of the modern protein folding field. We attempt to describe a number of directions where understanding protein folding offers insights into more complex questions in molecular and cellular biology as well as medicine. We also describe new approaches and tools to address complexities associated with these new areas of research. We review studies of protein stability, folding kinetics, intermediate and unfolded states, and protein self-association and aggregation.
The protein folding field has witnessed significant changes and progress since the original work of Anfinsen showing that proteins can fold spontaneously [1, 2]. Early in vitro studies showed that the folding process typically occurs on a milliseconds-to-seconds time scale, much faster than the rate estimated assuming that folding proceeds by a random search of all possible conformations. Based upon this observation, Levinthal then proposed that a random conformation search does not occur in folding and that proteins fold by specific ‘folding pathways’ . On these pathways, the protein molecule passes through well-defined partially-structured intermediate states. Based on this view, numerous experiments and simulations were conducted to test the existence of transient folding intermediates [4, 5]. It was expected that the determination of the structures and population of folding intermediates could help elucidate protein folding mechanisms. Earlier experimental studies on protein folding kinetics monitored the structural changes through relaxation of the protein’s spectroscopic properties after exposing the protein to folding or unfolding conditions. The data obtained from such experiments exhibit single- or multiple-exponential time-decay: a single-exponential decay is interpreted as a signature of two-state kinetics between the native state and the denatured state, whereas models involving more than two states are required to explain multiple-exponential decay data. These experiments generally probe only the average behavior of proteins, and they are not able to provide information about the folding/unfolding process in atomic details.
The discovery of a class of simple, single-domain proteins which fold via two-state kinetics without any detectable intermediates in the early 1990s [6, 7], the development of experimental techniques with improved spatial/temporal resolution[8-13], and the application of computer simulations using simplified lattice and off-lattice models[14, 15] greatly enhanced our understanding of various aspects of the protein folding problem. Based on the nucleation theory [16-18], one of the early proposed mechanisms for protein folding, the nucleation-condensation model was formulated[19-21]. In this scenario, a small number of residues (folding nucleus) need to form their native contacts in order for the folding reaction to proceed fast into the native state. The cooperativity of the protein folding process is analogous to that exhibited in first-order phase transitions, which proceed via a nucleation and growth mechanism . Because of these similarities, terminology used in studies of phase transitions, such as energy landscapes and nucleation, was introduced into the discussion of protein folding. The concepts of the nucleation and the free energy landscape have promoted much of the recent progress in understanding the process of protein folding. Proteins are generally thought to have evolved to exhibit globally funneled energy landscapes [23-25] which allow proteins to fold to their native states through a stochastic process in which the free energy decreases spontaneously. The unfolded state, transition state, native state and possible intermediates correspond to local minima or saddle points in the free-energy landscape.
Advances in experimental techniques such as protein engineering, nuclear magnetic resonance (NMR), mass spectrometry, hydrogen exchange, fluorescence resonance energy transfer (FRET), atomic force microscopy (AFM), have made it possible to obtain detailed information about the different conformations occurring in the folding process[26, 27]. At the same time, computational methods have been developed to better interpret experimental data by using simulations to obtain structural information about the states which are populated during the folding process. In Table 1, we list several advances in experimental and computational methodologies used for investigating the folding of model proteins.
All-atom protein models with explicit or implicit solvents were developed to study the folding thermodynamics and the unfolding dynamics of specific proteins. Technological advances in computation allowed folding simulations of small proteins and peptides at atomic detail [28-30]. However, due to the complexity and vast dimensionality of protein conformational space, all-atom MD simulations have severe limitations on the time and length scales that can be studied. Novel simulation protocols have been proposed to improve conformational sampling efficiency, including biased sampling of the free-energy surface and non-equilibrium unfolding simulations . In addition, world-wide parallel computing (e.g. emoH@gnidloF ) and generalized ensemble sampling techniques that involve parallel simulations of molecular systems coupled with a Monte Carlo (MC) protocol [32, 33] have been successfully applied to protein folding [25, 34-36].
Multi-scale modeling approaches have also been used to combine efficient conformational sampling of coarse-grained models and accuracy of all-atom models to study protein folding pathways. In this approach, iterative simulations and inter-conversion between high and low-resolution protein models are performed. Feig et al. developed a multi-scale modeling tool set, MMTSB , which integrates a simplified protein model with the MC simulation engine, MONSSTER , and the all-atom MD packages AMBER  or CHARMM. Using a combination of CHARMM and discrete molecular dynamics (DMD) [41-46], Ding et al. reconstructed the transition state ensemble of the src-SH3 protein domain through multi-scale simulations . The protein folding studies can also be facilitated by sampling protein conformations near the native state. Several native-state sampling algorithms [48, 49] have been successfully utilized to study plasticity , cooperative interactions , and allostery  in proteins. Considering native-state ensemble naturally takes into account protein flexibility, which is shown to be crucial in structure based drug designs.
During the last five years, several tools for performing web-based analyses of protein folding dynamics have been developed. The Fold-Rate server (http://psfs.cbrc.jp/fold-rate/)  predicts rates of protein folding using the amino-acid sequence. The Parasol folding server (http://parasol.tamu.edu/groups/amatogroup/foldingserver)  predicts protein folding pathways using “probabilistic roadmaps”-based motion planning techniques. The iFold server (http://ifold.dokhlab.org)  allows discrete molecular dynamics (DMD) simulations of protein dynamics using simplified two-bead per residue protein models. These tools facilitate the second phase of protein folding research, whereby targeted simulations may be performed for probing the dynamics of protein folding and unfolding under controlled conditions.
DMD approaches [43-46] with simplified structural models of proteins have been extensively used for investigating general principles of protein folding and unfolding [56-60]. Dokholyan et al. have highlighted the differences between molecular dynamics and DMD approaches. As opposed to the traditional MD approach of iteratively solving Newtonian equations of motion for evolving protein folding trajectory, DMD simulations solve ballistic equations of motion with square-well approximation to inter-particle interaction potentials. DMD algorithm gains efficiency over traditional MD simulations in multiple ways. First, due to ballistic modeling of particle dynamics, a larger time step can be used in DMD simulations on average, which corresponds to the time interval between fastest ballistic interactions; secondly a faster inter-particle collision detection and velocity updating algorithm is used, since only the coordinates of colliding atoms need to be updated at each collision. Additionally, faster simulation speeds are attainable with the DMD approach through simplification of protein models. Overall, an increase in simulation speed of 5–10 orders of magnitude is attainable using DMD . Jang et al.  used DMD and simplified protein models with Gō interactions  to probe protein folding kinetics. Protein folding kinetics studies using DMD simulations are reviewed in . Recently, DMD simulations we used in uncovering the structural mechanisms of protein aggregation [64-66]. Among the fundamental challenges in studying protein folding using computer simulations are the time-scales and length-scales that can be investigated. DMD simulations have been shown to be useful for investigating long-timescale folding dynamics of complex biological systems such as poly-alanine aggregation [66, 67] and the nucleosome core particle .
In addition to the extensive in silico and in vitro studies of protein folding, significant progress has been made in understanding protein folding in vivo. There are two major differences between protein folding in vivo and in vitro. First, protein folding in vivo is usually assisted by molecular machinery, such as chaperones (in an ATP-dependent manner), and often involves small molecule cofactors. Molecular chaperones such as the heat shock protein Hsp70 and chaperonin proteins facilitate protein folding, in part, by isolating the proteins from bulk cytosol[69, 70]. Hartl and Horwich pioneered the research of chaperone-mediated protein folding, highlighting the differences between in vivo and in vitro folding mechanisms. The mechanism of chaperonin GroEL mediated folding, including in vivo folding intermediates, has been extensively studied by Horwich and Gierasch [72, 73]. Work by Landry et al.  showed that chaperon binding promotes α-helix formation in partially folded polypeptide chains. Horowitz et al. have investigated the role of chaperonin Cpn60-mediated hydrophobic exposure in protein folding[75, 76]. Nearly one third of all proteins in living cells are coordinated to small molecule cofactors. The pioneering work of Wittung-Stafshede and coworkers on the role of cofactors in in vivo protein folding[77, 78] demonstrated that bound metals stabilize the native fold, suggesting cofactor binding to unfolded polypeptides dramatically accelerates folding timescales.
A second notable difference between in vivo and in vitro protein folding is the fact that the concentrations of macromolecular solutes in cells can reach hundreds of grams per liter in cells, but most in vitro studies are performed in buffered solution with <1% of the cellular macromolecule concentration. The crowding environment in vivo can have a significant impact on protein stability and the native structure by changing the energy landscape of protein folding[80, 81]. Dedmon et al. showed that FlgM, a 97-residue protein from Salmonella typhimurium is unstructured in dilute solution, but in E.coli cells its C-terminal half is structured. McPhie et al. found that a molten globular state of apomyoglobin at low pH is stabilized by high concentration of the inert polymer, dextran, compared to the unfolded state. Moreover, it was found that aggregate formation from human apolipoprotein C-II is significantly accelerated by the addition of dextran, suggesting a direct effect of molecular crowding on protein aggregation.
Over the past three decades, novel experimental techniques and simulations have yielded many significant insights in protein folding research. Important advances have been made, especially toward the understanding of folding and unfolding mechanisms, the structure of folding transition states, folding kinetics, the nature of folding pathways, and the structure of unfolded proteins and protein folding in vivo. Theoretical approaches to study protein folding have largely complemented experiments by providing experimentally testable hypotheses. In recent years, the rational manipulation of folding pathways and the association between protein folding and disease have marked a more applied phase of protein folding research.
The thermodynamic stability of a protein is measured by the free energy difference between the folded state and the unfolded state (ΔG = Gunfold - Gfold). It determines the fraction of folded proteins, thereby having a profound effect on protein function. Natural proteins are only marginally stable. The energetic contributions from the favorable folding forces such as hydrophobic packing, hydrogen bonding and electrostatic interactions are nearly offset by the entropic penalization of folding. As a result, the measured the ΔG values of most proteins fall in the range of 3-15 kcal/mol . Due to this subtle balance between various physical interactions, a single mutation may shift the balance and significantly affect the stability of the whole protein. The accurate estimation of protein stability changes induced by mutations, measured as ΔΔG = ΔGWT − ΔGMut, still remains a significant challenge for computational biologists.
Experimentally, ΔG values can be obtained from denaturing experiments [6, 87-90] where the protein unfolds by increasing temperature or by adding denaturing agents such as urea and guanidinium HCl (GdHCl). Theoretically, given the interactions, free energy can be obtained from statistical mechanics using the partition function, Z, as G = -RT lnZ. Analytical calculations of partition functions require integration over all degrees of freedom in the protein’s conformational space, which is impossible in practice, except for simple models. Advances in computational biology have made possible the direct calculation of ΔΔG by MD or Monte Carlo simulations[28, 91-94], a comprehensive review of which can be found in Ref. . Using MD simulations, a ΔΔG value has been calculated for T157V mutant of T4 lysozyme which is in close agreement with experimental measurements .
The computational cost of direct ΔΔG estimation is still too high, however, preventing it from being applied to a large number of mutations for protein engineering. More heuristic approaches have been adapted which try to describe the free energy using empirical or effective functions taking advantage of the vast amount of known protein structures and stability measurements. Such simplifications significantly decrease the computational overhead and allow ΔΔG calculations of large numbers of mutants that can be compared with experimental results. In recent years, various methods have been proposed for large-scale ΔΔG predictions with reasonable prediction accuracies (which are commonly assessed by the linear correlation coefficient between the predicted and measured ΔΔG values). A comparison between these methods is listed in Table 2.
One approach utilizes statistical potentials that are developed using information from known protein structures, where the frequency distribution of amino acids conformations (such as pairwise distances and torsion angles) is used to extract effective potentials for free energy evaluations. Gilis and Rooman first applied database-derived backbone dihedral potentials to study the change of thermodynamic stability upon point mutations[97-99]. They found that torsion-angle potentials predict ΔΔG accurately for mutations of solvent-exposed residues and that distance-dependent statistical potentials are more accurate for predicting the ΔΔG of buried residues. They obtained correlation coefficients of 0.55 to 0.87 for a dataset of 238 mutations. Zhou et al.  developed a knowledge-based potential using the distance-scaled finite ideal-gas reference state (DFIRE) approach and calculated ΔΔG for 895 mutants, which have a correlation of 0.67 with experimental measurements. Similarly, statistical potentials utilizing side-chain rotamer libraries, direction- and distance-dependent distributions, and four-body interactions were also adapted for ΔΔG predictions and significant agreement with experimental measurements was achieved.
Another approach for large-scale ΔΔG predictions uses empirical functions to describe free energy changes induced by mutations and trains the parameters to recapitulate the experimental results. Guerois et al.  developed the FOLD-X energy function to study the stabilities of 1088 mutants. They used a comprehensive set of parameters to describe the van der Waals, solvation, hydrogen-bonding, electrostatic and entropic contribution to the protein stability, and obtained a correlation of 0.64 for the blind test set after training their parameters on 339 mutants. Khatun et al.  utilized contact potentials to predict ΔΔG of three sets of 303, 658 and 1356 mutants and their prediction correlations varied between 0.45 to 0.78. Bordner and Abagyan  used a combination of physical energy terms, statistical energy terms and a structural descriptor with weight factors scaled to experimental data for ΔΔG predictions, and found a correlation of 0.59 on 908 test mutants. Saraboji et al. classified the available thermal denaturing data on mutations according to substitution types, secondary structures and the area of solvent accessibilities, and used the average value from each category for the prediction and obtained a correlation of 0.64 .
Taking advantage of the vast amount of experimental ΔΔG data now available, machine-learning techniques have been introduced for ΔΔG estimation. Capriotti et al. [108, 109] trained a support vector machine using temperature, pH, mutations, nearby residues and relative solvent accessible area as input vectors. The support vector machine, when applied to a test set, gives a prediction correlation of 0.71. Cheng et al.  improved the support vector machine model to directly include the sequence information and obtained higher correlation prediction accuracy.
There are two significant drawbacks with training-based studies. First, improvement of the prediction accuracy relies on the available experimental stability data for parameter trainings. It is questionable whether parameters obtained from these trainings are transferable to other studies  since the experimentally-available mutation data may be biased (e.g., towards substitutions of large residues for small ones). Second, some mutations introduce strains in the protein backbone. To properly estimate the ΔΔG values, it is necessary to estimate the structural rearrangement that a protein undergoes to release the strain. To our knowledge, protein dynamics and flexibility have not been explicitly modeled in previous methods. Ignoring protein flexibility prohibits the application of current prediction methods to a wide range of mutations [100, 104].
To address both these caveats, Yin et al.  developed a novel method, Eris (http://eris.dokhlab.org), for accurate and rapid evaluation of the ΔΔG values using the recently-developed Medusa redesign suite. Eris features an all-atom force field developed from x-ray crystal structures, a fast side-chain packing algorithm, and a backbone relaxation method. The ΔΔG values of 595 mutants from five proteins were calculated and compared with the experimental data from the Protherm database and other sources [104, 105, 113, 114]. Significant correlations of ≈0.75 were found between predicted and experimental ΔΔG values. Eris identifies and efficiently relaxes strains in the backbone, especially when clashes and backbone strains are introduced by a small-to-large amino acid substitution. Interestingly, when high-resolution structures are not available, Eris allows refinement of the backbone structure, which yields better prediction accuracy. Compared with other ΔΔG prediction methods, the Eris method is a unique approach that combines physical energies with efficient atomic modeling, resulting in fast and unbiased ΔΔG predictions.
Despite remarkable progress in the last several decades, protein stability estimation methods are still imperfect. Several obstacles must be cleared in order to achieve a more reliable method for stability estimation. Due to the complexity of sampling multi-dimensional space, the entropic free energy of a protein is difficult to evaluate and is thereby often ignored or only roughly counted in current stability estimation methods. Furthermore, since protein stability is determined by the free energy difference between the folded and unfolded states, it is crucial to model the unfolded state and its effect on protein stability (cf. section on unfolded protein states). Most importantly, the prediction of large conformational changes upon mutations remains a major challenge in protein-stability estimation.
To understand protein folding, important details must be taken into consideration as the protein proceeds from unfolded to native state. How fast does a protein fold? Are there multiple pathways accessed en route to its folded state? What are the structural characteristics that determine the path and rate of protein folding? In addition, as we realized in recent years, there is a broad class of human diseases that arises from failure of some proteins to adopt and remain in their native states, partly due to abnormal folding kinetics . For example, a major cystic fibrosis-related deletion mutation of a single amino acid in the cystic fibrosis transmembrane conductance regulator (CFTR) affects its folding kinetics, but has minimal effect on its stability and structure . Also, recent evidence  suggests that altered dynamics of superoxide dismutase (SOD1) mutants, which are non-destabilizing or even stabilizing, possibly cause the aggregation of mutants in familial amyloid sclerosis (FALS). Elucidation of protein folding kinetics has never been more important, particularly in the context of finding molecular mechanisms of protein misfolding diseases.
Probing the structural properties of intermediate states and determining folding pathways have been major experimental challenges. Nonetheless, there have been significant advances in experimental techniques. Experimental methods with different temporal resolutions allow a more detailed dissection of the folding process (Table 3). Other methods allow for investigation of protein structure (e.g. NMR, ultraviolet/visual light CD, site-directed mutagenesis, Φ-value analysis, isotope labeling) and global properties (e.g., mass spectroscopy, quasi elastic light scattering, ultracentrifugation) .
As experimental data on folding rates of various proteins accumulated over the years, people sought the determinants of folding rates. Plaxco and co-workers  showed that there is a high correlation between the folding rate and the structural properties of proteins, as defined by contact order CO, , where L is the sequence length, N is the total number of inter-residue atomic contacts within a cutoff distance, ΔLij is the sequence separation of contacting residues i and j. Interestingly, assessment of other protein folding rate determinants such as local and long-range contacts were found to perform equally well as the contact order. Also, contact order is a geometric property which does not take into account the distribution and strength of the interactions on the rate . Fersht and coworkers found that specific interactions in the folding nucleus are equally important determinants of folding rates. For example, mutations in the folding nucleus of CI2 did not change its contact order, but result in a three order magnitude increase in the folding rate [90, 121, 122]. Sequence-based prediction of folding rates has also been proposed and was found to be of comparable performance to that of contact order [123, 124]. Thus, there is still a debate as to whether structure-based or sequence-based prediction is a more reliable predictor of folding rates [123, 124].
Extensive studies have likewise been made on the prediction of folding rates using molecular dynamics simulations. The primary bottleneck in this approach was sampling the time scales where the folding transitions are observable. Thus, early studies pioneered by Caflisch et al. [125-127] employed continuum solvent with low viscosity to observe multiple folding transitions. However, there is a nonlinear relationship between the folding time and viscosity , hence, the precise effect of very low viscosity on the protein folding kinetics of various systems remains unclear. To circumvent this problem, Pande et al. used “coupled ensemble dynamics” to simulate the folding of a β-hairpin from protein G using continuum solvent model  and united atom force-field  with water-like viscosity. In this and other subsequent simulations of other β-hairpins, the calculated folding rate was in close agreement with experimental measurements. A remarkable simulation in this class is a 1 μs folding simulation on the villin headpiece by Duan and Kollman . Folding rate predictions of this type have been limited to small two-state proteins. As protein size increases, it is difficult to computationally study folding kinetics. Rate predictions have been likewise performed using molecular dynamics simulations using explicit water models such as the TIP3P  and SPC  to gain additional insight into folding kinetics. Examples of MD simulations using explicit solvent which yielded experimentally consistent rates were performed by Pande et al., who observed helix-coil transitions  and protein folding . However, a potential drawback in the use of water models is that they are parameterized to a single temperature (~298 K), and thus may bias the dynamics in non-native temperature simulations. Overall, theoretical rate determinations and their subsequent comparisons with experiments provide a test of our understanding of protein folding kinetics.
Structural investigation of the transition state ensemble (TSE) is extremely challenging experimentally, since the TSE is an unstable state whose experimental detection is very difficult. Computationally, transition state conformations may be identified using unfolding simulations (cf. section on unfolded protein states), projection into one or two reaction coordinates, validation of putative transition states through calculation of the probability to fold (Pfold), and path sampling. The intuitive appeal of low-dimensional energy landscapes in explaining simple chemical reactions inspired people to develop a similar formulation for protein folding. However, unlike simple molecules, the high degrees of freedom of a protein make the analysis formidable. Thus, several groups proposed dimensional reduction by projecting the multi-dimensional energy landscape into few relevant coordinates. The proposed reaction coordinates could be the volume of the molecule , the fraction of amino acids in their native conformation , the number of contacts between amino acids [136, 137], and the fraction of native contacts in a conformation [136, 137]. Some others directly tackled the transition state by developing rigorous path sampling techniques . They constructed a large ensemble of transition paths, and through statistical analysis, they determined conformations whose Pfold=0.5. Although this method is computationally expensive, it is advantageous since there is no presupposed reaction coordinate.
Characteristics of the transition state ensemble have mainly been investigated by Φ-analysis, which involves measuring the folding kinetics and equilibrium thermodynamics of mutants containing amino acid substitutions throughout a protein. This method provides means to identify interactions mediated by specific amino acid side chains that stabilize the folding transition state. Φ-analysis has been applied to a large number of proteins (such as BPTI, myoglobin, protein A, ubiquitin, SH3 domain, and the WW domain) and, recently, even to amyloidogenesis[141, 142]. However, despite the prevalence of the methodology, there is debate regarding the validity and conventional interpretation of Φ-analysis, especially when the ΔΔG between wild type and mutant is less than 1.7 kcal/mol [2, 42, 143, 144]. However, Fersht and coworkers argued that reliable ϕ-values can be derived from mutations in suitable proteins with 0.6 < ΔΔG < 1.7 kcal/mol. Plaxco and coworkers disproved the assumed independence of the changes in free energy of transition and folded states when calculating error estimates in Φ-values . They proposed a new method of error estimation that accounts for the interdependence of changes in free energy of transition and folded states.
Using simplified protein models and rapid sampling DMD, Dokholyan et al. directly observed and characterized the transition state ensemble of Src homology 3 (SH3) (Fig. 2) [47, 59, 146, 147]. To probe the contribution of each amino acid residue to the transition state ensemble, they calculated the Φ-values, and found high correlation between simulation and experimental Φ-values. Moreover, they also predicted that the two most kinetically important residues in folding are L24 and G64. Both L24 and G64 are experimentally-verified to be important kinetically .
Experiments suggest that proteins may be kinetically trapped en route to the native state[149-151], but how do proteins avoid kinetic traps? By using a Go-model scaled to include sequence-specific interactions, Khare et al.  found that the residues which contribute most to Cu, Zn SOD1 stability also function as “gatekeepers” that avoid kinetic traps and protein misfolding. “Gatekeeper” residues were also identified in later computational studies of the ribosomal protein S6. Stoycheva et al.  and Matysiak et al. show that the mutations of gatekeeper residues can alter the folding landscape of S6 and shift the balance between its folding and aggregation. All these computational studies are fully consistent with experimental observations and the existence of gatekeeper residues suggests a selective pressure on avoiding misfolding in natural proteins.
The close interplay of computational and experimental efforts has advanced our knowledge of protein folding kinetics, including predicting the protein folding rate, identifying the kinetically-important residues, and characterizing the multiple pathways. For example, recent studies have demonstrated an agreement between theoretical and experimental folding free energy landscapes [154-160]. The characterization of molecular interactions responsible for different pathways opens the possibility to manipulate folding pathways. Current strategies of manipulating the folding pathway includes addition of denaturants, point mutations, and circular permutations . Specifically, Kuhlman and Baker rationally engineered multiple mutations in protein L to alter its folding pathways . Lowe and Itzhaki likewise recently redesigned the folding pathways of the repeat protein myotrophin . Hence, one strategy to tackle kinetics-related folding abnormalities is to rationally engineer the folding pathway after the full characterization of a protein’s folding kinetics.
Proteins sample ensembles of heterogeneous conformations in solution. This emerging view of the one-to-many correspondence between protein primary sequence and its possible three-dimensional conformations also challenges the traditional paradigm that protein function is dictated by the native state. In recent years, it has been found that even for the conventionally observed small two-state proteins (~100 amino acids or less) there exist partially unfolded intermediates on the folding pathways. These intermediates are generally undetectable in kinetic folding experiments [8, 116, 164-168] and, therefore, are called “hidden intermediates”. In addition, mounting evidence has indicated that the intermediate states formed during protein folding and unfolding may have significant roles in protein functions (Fig. 3). The folding and unfolding intermediates impact the physiological functions of proteins by exposing cryptic post-translational modifications or ligand binding sites. The intermediates are usually weakly-populated (thermodynamic intermediates) or short-lived (kinetic intermediates), and their characterization presents a significant challenge with current experimental methods. Recent synergies between computational and experimental studies have greatly facilitated the unprecedented structural characterization of rare intermediates and suggested a functional role for these evasive conformations.
Among the traditional experimental methods, hydrogen exchange (HX) is a unique tool that allows the detection and characterization of not only kinetic, but also weakly populated thermodynamic intermediates. In a typical equilibrium HX experiment, the rate at which an individual main-chain amide hydrogen exchanges with solvent deuterons is measured by NMR or mass spectrometry . The exchange rates of different amide hydrogens are sensitive to local and global structural changes of proteins, and thus contain useful structural information about different protein conformations. From the amide hydrogen exchange rates measured by HX it is possible to detect and characterize weakly-populated intermediates which are inaccessible by bulk methods. Over the years, HX methods have provided considerable insight into the coarse features of intermediate state conformations for a wide variety of proteins. However, HX is limited in its ability to describe the detailed structures of intermediates, which severely restricts its applications. In contrast, computational methods can reveal structural information at much higher resolutions that cannot be accessed by experiments. Also, unlike bulk experimental methods like HX, computational methods are able to study protein motions at the single-molecule level, which enables the detection of heterogeneous conformations in an ensemble of molecules.
In recent studies, Gsponer et al. and Dixon et al. have developed new computational approaches to incorporate HX protection factors from NMR experiments as constraints into MD simulations to detect and/or characterize the conformations of intermediate states. Using their new approach, Gsponer et al. were able to define the thermodynamic folding intermediates of the bacterial immunity protein Ig7. A structural comparison between this thermodynamic intermediate and kinetic intermediate determined from other methods indicates that the kinetic and thermodynamic intermediates of Ig7 are similar. In a parallel study, Dixon et al. computationally predicted and characterized a folding intermediate of the focal adhesion targeting (FAT) domain of focal adhesion kinase (FAK), which plays critical roles in cell proliferation and migration. The detected intermediate was hypothesized to expose a cryptic phosphorylation site in the FAT domain and regulate the localization of FAK. The existence of this intermediate and the predicted structural features were later experimentally confirmed . Besides HX methods, the site-specific structural information obtained from other NMR-based techniques such as relaxation dispersion spectroscopy , has also been incorporated into MD simulations to study weakly-populated folding intermediates .
A major limitation of traditional experimental methods such as HX is that they can only measure the average behavior of an ensemble of molecules and often cannot distinguish individual folding and unfolding routes or intermediates in the ensemble. By contrast, single-molecule methods enable the observation of the folding and unfolding of individual molecules and play increasingly important roles in studying intermediates. Two single-molecule techniques: fluorescence resonance energy transfer  and force spectroscopy including optical tweezers and AFM, have been utilized to probe intermediate states of proteins [178-180]. Direct comparisons of results from single-molecule stretching experiments by AFM and computational simulations shed light on the mechanical unfolding of proteins and how intermediates contribute to protein function.
Using AFM, Marszalek et al.  uncovered a force-induced unfolding intermediate of Ig domain I27 from titin, a modular protein which is responsible for muscle elasticity. This experimentally-discovered intermediate was also predicted by steered-molecular dynamics (SMD) simulations . Based on the results of SMD, the authors further predicted that the hydrogen-bonding network between strand A’ and G in I27 mainly determines its mechanical stability. This prediction was verified in a mutagenesis study where Li et al.  showed that the point mutations on the residues participating in this hydrogen-bonding network dramatically altered the mechanical stability of I27. In a similar study of another protein, domain 10 of type III fibronectin module, which plays a pivotal role in mechanical coupling between cell surface and extracellular matrix (ECM), Gao et al.  predicted force-induced unfolding intermediates using SMD simulations. The discovered intermediates were considered to expose binding sites which are necessary for the assembly of ECM fibronectin fibrils. One of the computationally-predicted unfolding intermediates is in excellent agreement with the one observed in a more recent AFM stretching experiment .
The finding that folding and unfolding intermediates can have significant contributions to protein function is challenging the conventional understanding of protein structure and function that is centered on the native state. It is expected that a close synergy between computational and experimental approaches will continue to play essential roles in characterizing these evasive protein states.
Unveiling the structural and dynamic properties of denatured proteins is crucial for understanding the protein folding [183, 184] and misfolding  problems. For example, the computational determination of a protein’s thermodynamic stability requires an accurate approximation of the denatured state as the reference state. NMR hydrogen exchange experiments also rely on models of protein unfolded states to tabulate the intrinsic hydrogen exchange rate . Furthermore, understanding the structural properties of unfolded proteins may shed light on the early events of protein folding and protein aggregation .
It has long been postulated that the denatured state of proteins is composed of an ensemble of featureless random coil-like conformations. According to Flory’s random coil theory, the size of a random coil polymer, characterized by the radius of gyration Rg, follows a power law dependence on the length of the polymer chain, n, Rg = R0nν. Here, R0 is the scaling constant, which is a function of persistence length, and ν represents the power law scaling exponent. Flory predicted the exponent to be 0.6 and later a more accurate renormalization calculation obtained ν=0.588 . Tanford et al. first confirmed this random coil scaling behavior for denatured proteins . Using intrinsic viscosity measurements for 12 proteins denatured by 5-6 M GuHCl, the authors obtained a scaling exponent ν=0.67±0.09. Wilkins et al.  showed that the hydrodynamic radii of sets of 8 highly denatured, disulfide-free proteins follow a power law scaling with ν=0.58±0.11. Recently, Kohn et al.  reassessed the scaling behavior of denatured proteins using small angle x-ray scattering for 17 proteins of lengths varying from 8 to 549 residues. They found the scaling exponent to be ν=0.598±0.029. All these experimental results confirm the random coil scaling of denatured proteins. In a random coil model of the denatured state, proteins are believed to lack persistent structure both locally and globally. The distribution of end-to-end distances or radii of gyration can be fit by a Gaussian distribution. A recent computational study  confirmed this behavior by generating an ensemble of protein conformations whereby only steric interactions between amino acids were considered for four different proteins. A scaling exponent of 0.58±0.02 was obtained.
The random coil scaling behavior [188, 194] was originally derived for homopolymers. However, proteins are heteropolymers for which specific interactions between amino acids play an important role and determine a unique native structure. Hence, while the scaling of the sizes of denatured proteins follows the random coil scaling as shown in experiments, it does not necessarily exclude the possibility that the denatured proteins can have residual native-like structures. Under denaturing conditions, the protein can still exhibit native conformational bias and retain a certain amount of residual native structures. Mounting experimental evidence [168, 195-198] supports residual native-like structural elements in the denatured state for a variety of proteins. Using residual dipolar coupling from NMR measurements, Shortle and Ackerman  showed that native-like topology persists under strong denaturing conditions as high as 8 M Urea for a truncated staphylococcal nuclease. By applying quasi-elastic neutron-scattering on α-lactalbumin, Bu et al.  demonstrated residual helical structure and tertiary-like interactions even in the absence of disulfide bonds and under highly denaturing conditions. Similarly, by using triple-resonance NMR, native-like topology has also been observed in protein L .
Several theoretical and computational studies [199-202] have addressed the role of specific interactions in conformational biasing toward the native state in the denatured states. Using a simple force field with only steric and hydrogen bond interactions, Pappu et al.  demonstrated that denatured protein states have a strong preference for the native structure. It was suggested [199-202] that the conformational bias of native structures in the denatured state is a possible explanation of the Levinthal’s paradox .
To reconcile the seemingly controversial properties of denatured proteins — the random coil scaling of their sizes and the presence of residual native structures — several computational works have recently been reported [203-207]. Fitzkee and Rose  reproduced the random coil scaling exponent using a denatured protein model with fixed secondary structure elements for a set of proteins. Tran et al. [204, 207] constructed the protein denatured state ensemble at atomic resolution in the excluded volume limit. Jha et al.  built the denatured state using a statistical coil library. Both the Pappu and Sosnick groups found that the putative denatured state ensemble features transient local structures such as turns, strand, and helices. In the mean time, the dimension of the denatured state follows the experimentally-observed random-coil scaling exponent. Ding et al.  developed a computational method to model denatured proteins using a structure-based potential . This interaction model is commonly used in thermodynamic and kinetics studies of protein folding [58, 208-210] to model amino acid interactions. This study  suggested that denatured proteins follow the random coil scaling sizes and retain residual secondary structures akin to those observed in native protein states. Hence, these computational works provide a conceptual reconciliation between two seemingly mutually-exclusive views of protein unfolded states.
What is the physical origin of the random coil scaling of protein size along with the seemingly contradictory persistence of local structures in denatured proteins? Previous computational studies [203-205, 207] suggest that the residual structures in the denatured state are limited to short-range elements, which only extend approximately seven to ten residues. The correlated fluctuation of residual structures diminishes quickly along the sequence and long-range contact formation is purely governed by the random diffusion of peptide chains. Hence, a coarse-graining process, which groups locally-interacting amino acids along the polypeptide chain into renormalized structural units (Fig. 4) reduces a denatured protein to a renormalized homopolymer. The renormalization process will result in an effective homopolymer which forms contacts due to chain diffusion. Protein sizes follow the renormalized power law scaling as proposed by Flory : Rg=R0(N/L)ν=(R0L-ν)Nν. Thus, we expect that the scaling exponent of denatured proteins ν is the same as for homopolymers, whose structural units are locally interacting amino acids.
The observation of residual native secondary structures in thermally-denatured protein states is consistent with a “guided-folding” scenario , where the rate-limiting process is the packing of the preformed secondary structures into the correct fold. In contrast, a random coil model of the denatured state without residual native-like structures implies that a protein has to overcome an excessive entropic barrier to form both the secondary and tertiary structures upon folding. The existence of persistent native-like secondary structures in the denatured state may also be responsible for the recent success of protein structure prediction using small secondary structure segments derived from the protein data bank . The existence of residual native-like structures in the denatured state also provides a novel way to manipulate a protein’s stability by stabilizing or destabilizing the residual structures .
While all the information needed for proteins to fold is encoded in their amino acid sequence , there are many more elements that play a part in vivo. In a crowded cellular environment, surrounded by interacting proteins, nascent polypeptides face a formidable challenge in finding the correct interactions that result in a folded and functional protein. Many become “trapped” in meta-stable intermediate structures which are usually recognized by proteasomal machinery and degraded or refolded by chaperones. Alternatively, they can associate with similar misfolded proteins to form aggregates.
Extant protein sequences are the result of a long history of evolutionary refinement establishing a set of interactions defining the native state. However, the same inherent recognition that occurs between sequences within a protein is the basis for a type of self-association termed 3-dimensional domain swapping (, extensively catalogued in 2002 by Liu and Eisenberg ). Domain swapping is an important phenomenon, taking part in both normal and disease-related processes, and is intimately tied to protein folding. Domain swapping may be viewed as a natural mechanism for dealing with instability due to evolutionary changes in the amino acid sequence . For example, a mutation that rigidifies a loop connecting two parts of a protein induces strain, which can be relieved without the loss of function by “swapping” the portion of the protein on one side of the loop with the corresponding part of a similar protein. Dimerization by this mechanism has advantages including a high local concentration of enzymatic activity, since two functional proteins are joined together.
Many diseases are now associated with protein aggregation and particularly with a form of ordered aggregate called the amyloid fibrils, which, regardless of the native sequence and structure of the precursor proteins, share distinct structural characteristics. From a protein folding standpoint, the inherent properties of the polypeptide chain that allow proteins with little or no sequence or structural similarity to misfold and assemble into similar high-order structures are of vital interest. Studies of aggregate structure reveal defined characteristics such as extensive hydrogen bond networks perpendicular to the fiber axis, called a cross-β conformation, and an α to β transition known to occur during the oligomerization of amyloid-forming proteins with significant helical content. Evidence for domain swapping as an early step in the aggregation process has been reported for several proteins [218-220]. In several aggregation-associated diseases including familial Amyotrophic Lateral Sclerosis and the transthyretin amyloidoses, the dissociation of a protein from its multimeric native state is known to be the rate limiting step for aggregation, suggesting a method for preventing aggregation by stabilizing the native interfaces of these assemblies [221-223].
Amyloid fibrils have alternately been deemed responsible for the pathologies of their various associated diseases and, more recently, credited with delaying or counteracting the observed pathologies by acting as a sink for highly cytotoxic soluble oligomers . The viewpoint that soluble oligomers act as cytotoxic species has garnered widespread attention since at least 1999 when it was noted that the abundance of soluble Aβ 1-42 oligomers is inversely correlated with neuronal degeneration in Alzheimer’s disease whereas amyloid levels do not correlate [225, 226]. In 2003, Glabe and co-workers discovered that soluble oligomeric species from several disease-related proteins shared a common structural epitope to which an antibody was developed . Later studies showed that soluble oligomers are able to disrupt the polarity of cellular membranes [144, 228, 229], one possible basis for disease-associated toxicity.
Various cellular protective mechanisms have evolved to ensure the proper folding of proteins. Molecular chaperones, for example, recognize misfolded proteins and provide an environment conducive to the formation of the appropriate native contacts . Vast proteosomal machinery clears proteinaceous debris from the cell using ubiquitin ligases to tag misfolded proteins for degradation and removal [231, 232]. One hypothesis formulated to explain the prevalence of protein aggregation in neurodegenerative diseases is that the ubiquitin proteasome loses efficiency over time causing a buildup of protein aggregates and debris in post-mitotic cells such as neurons . Also, parkin, an E3 ubiquitin ligase was found to be mutated in at least half of autosomal recessive juvenile parkinsonism patients, suggesting that a deficit in the clearance of its target protein leads to an early onset of symptoms .
Computational studies of protein aggregation have traditionally been inadequate due to the massive complexity of the system in both time and length scale. Several approaches have been used to overcome this complexity (Fig. 5). Traditional all-atom molecular dynamics simulations have been carried out to model the aggregation of disease-related peptides such as Aβ and polyglutamine [235-237]. Other techniques seek to identify which parts within larger proteins are responsible for their aggregation behavior, resulting in the identification of sequence stretches in various proteins that are “amyloidogenic,” or “hot spots” for aggregation [238-240]. Underlying this work is the idea that evolution acts to prevent aggregation by burying aggregation-prone protein sequences or otherwise prohibiting their apposition in protein structures and during folding1. To study the nature of subunit assembly and extension during aggregation, which are out of reach for all-atom molecular dynamics in both the size scale and time scale, simplified protein models are being utilized . By accessing aggregation events that are out of the reach of experimentalists, computational studies of aggregation are an essential compliment to the experimental findings regarding aggregate structure and formation mechanisms.
Protein aggregation is now widely viewed as a fundamental property of the polypeptide chain, meaning that all of the considerations discussed in the earlier sections of this review must, to some degree, apply to the study of aggregating proteins. Therefore, the study of protein self-association and aggregation really is the study of protein folding in the context of external influences like protein concentration, localization, or evolution. As this field develops and knowledge of protein aggregation as a general phenomenon accumulates, we stand to gain not only vital tools for treating specific diseases, but also insight into the behavior of all proteins with respect to their environment.
Perhaps the most significant four measures of success in the natural sciences are our abilities (i) to observe natural phenomena, (ii) to explain natural phenomena, (iii) to predict the effect of relevant variables on a specific phenomenon, and (iv) to rationally manipulate these phenomena. The protein folding field has seen significant breakthroughs according to all of these measures. Research in the protein folding field uncovered folding pathways and states that accompany the transition from unfolded to folded proteins, revealed the origin of the cooperative folding transition, allowed prediction of folding rates and changes in thermodynamic stability upon mutation, and permitted rational alteration of protein structure and folding pathways.
Given the recognition of many human maladies as “diseases of protein folding” over the past two decades, the wealth of new knowledge about the folding process is driving the study of protein folding back into its native environment, i.e. inside living organisms. The effect of the cellular environment on protein folding, e.g., how proteins fold in vivo and especially the behavior of “intrinsically-disordered” proteins is a highly active area of inquiry [82, 242, 243],. More applied research is ongoing in the design of animal models of diseases associated with protein folding such as cystic fibrosis, ALS, Alzheimer’s disease, and the prion diseases [244-246], and the design of small molecules for use in clinical trials for treating diseases like the transthyretin amyloidoses . The future of the protein folding field lies in its direct application to such medical problems, and for a growing number of protein systems, the future is now.
We thank Oana Lungu for reading the manuscript. This work was supported by the American Heart Association grant No. 0665361U, the North Carolina Biotechnology Center Grant No. 2006-MRG-1107, Cystic Fibrosis Foundation grant DOKHOL07I0, and the National Institutes of Health grants RO1GM080742-01 and R01CA084480-07.
1It is interesting to note here that in the case of Pmel17, which aggregates to form a “functional amyloid” involved in melanin biosynthesis, a protein seems to have evolved to aggregate at an incredible rate, perhaps to minimize the population time in a soluble oligomer form .
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimersthat apply to the journal pertain.