Since divergence ∼50 Ma ago from their terrestrial ancestors, cetaceans underwent a series of adaptations such as a ∼10–20 fold increase in myoglobin (Mb) concentration in skeletal muscle, critical for increasing oxygen storage capacity and prolonging dive time. Whereas the O2-binding affinity of Mbs is not significantly different among mammals (with typical oxygenation constants of ∼0.8–1.2 µM⁻¹), folding stabilities of cetacean Mbs are ∼2–4 kcal/mol higher than for terrestrial Mbs. Using ancestral sequence reconstruction, maximum likelihood and Bayesian tests to describe the evolution of cetacean Mbs, and experimentally calibrated computation of stability effects of mutations, we observe accelerated evolution in cetaceans and identify seven positively selected sites in Mb. Overall, these sites contribute to Mb stabilization with a conditional probability of 0.8. We observe a correlation between Mb folding stability and protein abundance, suggesting that a selection pressure for stability acts proportionally to higher expression. We also identify a major divergence event leading to the common ancestor of whales, during which major stabilization occurred. Most of the positively selected sites that occur later act against other destabilizing mutations to maintain stability across the clade, except for the shallow divers, where late stability relaxation occurs, probably due to the shorter aerobic dive limits of these species. The three main positively selected sites 66, 5, and 35 undergo changes that favor hydrophobic folding, structural integrity, and intra-helical hydrogen bonds.
In this work, we identify positive selection in cetacean myoglobins and an early, significant divergence event. While O2-binding is nearly unchanged, positive selection acts to introduce and later maintain stability. Stability correlates with abundance across the species, supporting that selection for increased stability concurred with the known 10–20 fold increase in myoglobin abundance of cetaceans relative to terrestrial mammals, which itself resulted from speciation towards longer dive lengths of the animals. We suggest that this selection acted to keep constant the otherwise increasing number of unfolded Mb. Altogether, this work for the first time links protein phenotype (stability and abundance) in a specific, real protein to organism-level evolution and fitness of mammals.
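The suggestion that stabilization keeps the number of unfolded Mb copies constant as abundance rises can be illustrated with a simple two-state folding model. The sketch below is not taken from the study: the two-state assumption, the temperature of 310 K, the example stability of 8 kcal/mol, and all function names are illustrative choices.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)


def unfolded_fraction(dg_kcal, temp_k=310.0):
    """Two-state unfolded fraction; dg_kcal > 0 means the folded state is favored."""
    return 1.0 / (1.0 + math.exp(dg_kcal / (R_KCAL * temp_k)))


def unfolded_copies(abundance, dg_kcal, temp_k=310.0):
    """Expected number of unfolded copies for a given total abundance."""
    return abundance * unfolded_fraction(dg_kcal, temp_k)


def compensating_ddg(fold_increase, temp_k=310.0):
    """Extra stability that offsets a fold increase in abundance.

    Approximation valid when the unfolded fraction is much smaller than 1,
    so that the unfolded fraction scales as exp(-dG / RT).
    """
    return R_KCAL * temp_k * math.log(fold_increase)


if __name__ == "__main__":
    base_dg = 8.0  # hypothetical stability of a terrestrial Mb, kcal/mol
    for fold in (10.0, 20.0):
        ddg = compensating_ddg(fold)
        before = unfolded_copies(1.0, base_dg)
        after = unfolded_copies(fold, base_dg + ddg)
        print(f"{fold:.0f}-fold abundance: ddG = {ddg:.2f} kcal/mol, "
              f"unfolded ratio after/before = {after / before:.4f}")
```

Under these assumptions, a 10-fold increase in abundance is offset by roughly RT ln 10, about 1.4 kcal/mol at 310 K, which is the same order of magnitude as the stability difference reported above.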
The consistent observation across all kingdoms of life that highly abundant proteins evolve slowly demonstrates that cellular abundance is a key determinant of protein evolutionary rate. However, other empirical findings, such as the broad distribution of evolutionary rates, suggest that additional variables determine the rate of protein evolution. Here, we report that under the global selection against the cytotoxic effects of misfolded proteins, folding stability (ΔG), simultaneous with abundance, is a causal variable of evolutionary rate. Using both theoretical analysis and multiscale simulations, we demonstrate that the anticorrelation between the pre-mutation ΔG and the arising mutational effect (ΔΔG), purely biophysical in origin, is a necessary requirement for abundance–evolutionary rate covariation. Additionally, we predict and demonstrate in bacteria that the strength of abundance–evolutionary rate correlation depends on the divergence time separating reference genomes. Altogether, these results highlight the intrinsic role of protein biophysics in the emerging universal patterns of molecular evolution.
Significant progress has been made in recent years in a variety of seemingly unrelated fields such as sequencing, protein structure prediction, and high-throughput transcriptomics and metabolomics. At the same time, new microscopic models were developed that made it possible to analyze the evolution of genes and genomes from first principles. The results from these efforts enable, for the first time, a comprehensive insight into the evolution of complex systems and organisms on all scales, from sequences to organisms and populations. Every newly sequenced genome uncovers new genes, families, and folds. Where do these new genes come from? How do gene duplication and subsequent divergence of sequence and structure affect the fitness of the organism? What role does regulation play in the evolution of proteins and folds? The emerging synergism between data and modeling provides the first robust answers to these questions.
Despite progress in ancestral protein sequence reconstruction, much remains to be unraveled about the nature of the putative last common ancestral proteome that served as the prototype of all extant lifeforms. Here, we present data that indicate a steady decline (oil escape) in proteome hydrophobicity over species evolvedness (node number) evident in 272 diverse proteomes, which indicates a highly hydrophobic (oily) last common ancestor (LCA). This trend, obtained from simple considerations (free from sequence reconstruction methods), was corroborated by regression studies within homologous and orthologous protein clusters as well as phylogenetic estimates of the ancestral oil content. While indicating an inherent irreversibility in molecular evolution, oil escape also serves as a rare and universal reaction-coordinate for evolution (reinforcing Darwin's principle of Common Descent), and may prove important in matters such as (i) explaining the emergence of intrinsically disordered proteins, (ii) developing composition- and speciation-based “global” molecular clocks, and (iii) improving the statistical methods for ancestral sequence reconstruction.
Although of importance to both evolution and protein design, the manner in which the first proteome came to be, and the actual features of the earliest ancestral proteomes are both unknown. Through the analysis of diverse proteomes, we provide glimpses into the composition of the last common ancestor (LUCA) of all lifeforms, which indicate that the earliest/last common ancestor had a proteome that was highly hydrophobic/oily. Notably, the evidence presented (a) indicates that proteomes of all species ranging from bacteria to mammals appear to adhere to the same universal constraint (“oil escape”) set into motion by the last common ancestor more than 3.5 billion years ago, (b) indicates the presence of a previously untapped global (composition-level) molecular clock, and (c) strengthens the non-equilibrium/directional view of amino acid substitutions that challenges central dogmas regarding reversibility in molecular evolution.
We report a set of atomistic folding/unfolding simulations for the hairpin ribozyme using a Monte Carlo algorithm. The hairpin ribozyme folds in solution and catalyzes self-cleavage or ligation via a specific two-domain structure. The minimal active ribozyme has been studied extensively, showing stabilization of the active structure by cations and dynamic motion of the active structure. Here we introduce a simple model of tertiary structure formation that leads to a phase diagram for the RNA as a function of temperature and tertiary structure strength. We then employ this model to capture many folding/unfolding events and to examine the transition state ensemble (TSE) of the RNA during folding to its active “docked” conformation. The TSE is compact but with few tertiary interactions formed, in agreement with single-molecule dynamics experiments. To compare with experimental kinetic parameters, we introduce a novel method to benchmark Monte Carlo kinetic parameters against docking/undocking rates collected over many single-molecule trajectories. We find that topology alone, as encoded in a biased potential which discriminates between secondary and tertiary interactions, is sufficient to predict the thermodynamic behavior and kinetic folding pathway of the hairpin ribozyme. This method should be useful in predicting folding transition states for many natural or man-made RNA tertiary structures.
Recent work has shown that the incorporation of an all-hydrocarbon “staple” into peptides can greatly increase their α-helix propensity, leading to an improvement in pharmaceutical properties such as proteolytic stability, receptor affinity and cell-permeability. Stapled peptides thus show promise as a new class of drugs capable of accessing intractable targets such as those that engage in intracellular protein-protein interactions. The extent of α-helix stabilization provided by stapling has proven to be substantially context dependent, requiring cumbersome screening to identify the optimal site for staple incorporation. In certain cases, a staple encompassing one turn of the helix (attached at residues i and i+4) furnishes greater helix stabilization than one encompassing two turns (i,i+7 staple), which runs counter to expectation based on polymer theory. These findings highlight the need for a more thorough understanding of the forces that underlie helix stabilization by hydrocarbon staples. Here we report all-atom Monte Carlo folding simulations comparing unmodified peptides derived from RNAse A and BID BH3 with various i,i+4 and i,i+7 stapled versions thereof. The results of these simulations were found to be in quantitative agreement with experimentally determined helix propensities. We also discovered that staples can stabilize quasi-stable decoy conformations, and that the removal of these states plays a major role in determining the helix stability of stapled peptides. Finally, we critically investigate why our method works, exposing the underlying physical forces that stabilize stapled peptides.
Stapled Peptides; Monte-Carlo simulations; Drug Discovery; Folding Traps; Entropic Stabilization
An increasing number of proteins are being discovered with a remarkable and somewhat surprising feature, a knot in their native structures. How the polypeptide chain is able to “knot” itself during the folding process to form these highly intricate protein topologies is not known. Here we perform a computational study on the 160-amino acid homodimeric protein YibK which, like other proteins in the SpoU family of MTases, contains a deep trefoil knot in its C-terminal region. In this study, we use a coarse-grained Cα-chain representation and Langevin dynamics to study folding kinetics. We find that specific, attractive nonnative interactions are critical for knot formation. In the absence of these interactions, i.e. with energetics driven entirely by native interactions, knot formation is exceedingly unlikely. Further, we find, in concert with recent experimental data on YibK, two parallel folding pathways which we attribute to an early and a late formation of the trefoil knot, respectively. For both pathways, knot formation occurs before dimerization. A bioinformatics analysis of the SpoU family of proteins reveals further that the critical nonnative interactions may originate from evolutionarily conserved hydrophobic segments around the knotted region.
Reproduction is inherently risky, in part because genomic replication can introduce new mutations that are usually deleterious toward fitness. This risk is especially severe for organisms whose genomes replicate “semi-conservatively,” e.g. viruses and bacteria, where no master copy of the genome is preserved. Lethal mutagenesis refers to extinction of populations due to an unbearably high mutation rate (U), and is important both theoretically and clinically, where drugs can extinguish pathogens by increasing their mutation rate. Previous theoretical models of lethal mutagenesis assume infinite population size (N). However, in addition to high U, small N can accelerate extinction by strengthening genetic drift and relaxing selection. Here, we examine how the time until extinction depends jointly on N and U. We first analytically compute the mean time until extinction (τ) in a simplistic model where all mutations are either lethal or neutral. The solution motivates the definition of two distinct regimes: a survival phase and an extinction phase, which differ dramatically in both how τ scales with N and in the coefficient of variation in time until extinction. Next, we perform stochastic population-genetics simulations on a realistic fitness landscape that both (i) features an epistatic distribution of fitness effects that agrees with experimental data on viruses and (ii) is based on the biophysics of protein folding. More specifically, we assume that mutations inflict fitness penalties proportional to the extent that they unfold proteins. We find that decreasing N can cause phase transition-like behavior from survival to extinction, which motivates the concept of “lethal isolation.” Furthermore, we find that lethal mutagenesis and lethal isolation interact synergistically, which may have clinical implications for treating infections. Broadly, we conclude that stably folded proteins are only possible in ecological settings that support sufficiently large populations.
Most spontaneous mutations hurt organismal fitness, e.g. by destabilizing proteins. In many species, the normal mutation rate is strikingly high: on the order of one per genome per replication. In the face of these mutations, how can proteins maintain their native structure, and how can populations of organisms avoid extinction? Are there physics-based limits on how large the mutation rate of any species can be before the onslaught of mutations outpaces natural selection and melts-down proteins? Here, we address these questions with a computational model that combines protein folding thermodynamics with individual-based population genetics simulations. We calculate a theoretical “speed limit” equal to a few mutations per genome per replication—near the mutation rate of RNA viruses. Additionally, we find that the speed limit can be much lower in small populations where “random genetic drift” is strong. Thus, we conclude that stably folded proteins are only possible in ecological settings that support sufficiently large populations. These findings may have clinical implications for treating viral infections with drugs that elevate the viral mutation rate.
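The joint dependence of extinction time on population size and mutation rate can be explored with a toy simulation in the spirit of the lethal-or-neutral model. The sketch below is an assumption-laden illustration, not the model from the study: it assumes discrete generations, a Poisson number of offspring per parent (the hypothetical parameter `offspring`), a Poisson number of lethal mutations per offspring with mean `lethal_rate`, and a hard cap `pop_cap` on population size. Neutral mutations have no effect here and are therefore not tracked.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate with mean lam (Knuth's method, suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def time_to_extinction(pop_cap, lethal_rate, offspring=2.0, max_gen=1000, seed=0):
    """Generations until the population dies out, or None if it survives max_gen.

    An offspring is viable only if it carries zero lethal mutations, which
    happens with probability exp(-lethal_rate). The number of viable offspring
    per parent is therefore Poisson with mean offspring * exp(-lethal_rate).
    """
    rng = random.Random(seed)
    viable_mean = offspring * math.exp(-lethal_rate)
    size = pop_cap
    for generation in range(1, max_gen + 1):
        born = sum(poisson(viable_mean, rng) for _ in range(size))
        size = min(born, pop_cap)  # carrying capacity truncates growth
        if size == 0:
            return generation
    return None


if __name__ == "__main__":
    for pop_cap in (5, 20, 100):
        for lethal_rate in (0.1, 0.5, 1.0):
            tau = time_to_extinction(pop_cap, lethal_rate)
            print(f"N={pop_cap:4d}  U={lethal_rate:.1f}  extinction at generation: {tau}")
```

In this toy setting, a population is expected to decline deterministically once `offspring * exp(-lethal_rate)` falls below 1, while for values above 1 extinction can still occur through chance fluctuations, and does so more readily when `pop_cap` is small.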
The specificity-determining residue database (SDR database) presents residue positions where mutations are predicted to have changed protein function in large protein families. Because the database pre-calculates predictions on existing protein sequence alignments, users can quickly find the predictions by selecting the appropriate protein family or searching by protein sequence. Predictions can be used to guide mutagenesis or to gain a better understanding of specificity changes in a protein family. The database is available on the web at http://paradox.harvard.edu/sdr.
In this work, we apply a detailed all-atom model with a transferable knowledge-based potential to study the folding kinetics of Formin-Binding protein, FBP28, which is a canonical three-stranded β-sheet WW domain. Replica exchange Monte Carlo (REMC) simulations starting from random coils find a native-like lowest-energy structure (Cα RMSD of 2.68 Å). We also study the folding kinetics of the FBP28 WW domain by performing a large number of ab initio Monte Carlo folding simulations. Using these trajectories, we examine the order of formation of the two β-hairpins, the folding mechanism of each individual β-hairpin, and the transition state ensemble (TSE) of the FBP28 WW domain, and compare our results with experimental data and previous computational studies. To obtain detailed structural information on the folding dynamics viewed as an ensemble process, we perform a clustering analysis procedure based on graph theory. Further, a rigorous Pfold analysis is used to obtain representative samples of the TSEs, showing good quantitative agreement between experimental and simulated Φ values. Our analysis shows that the turn structure between the first and second β strands is a partially stable structural motif that forms before entering the TSE in the FBP28 WW domain, and that there exist two major pathways for the folding of the FBP28 WW domain, which differ in the order and mechanism of hairpin formation.
transition state ensemble; protein folding; β-strand; β-hairpin; β-sheet; Φ-value analysis; Pfold analysis
To understand the interplay of residual structures and conformational fluctuations in the interaction of intrinsically disordered proteins (IDPs), we first combined implicit solvent and replica exchange sampling to calculate atomistic disordered ensembles of the nuclear co-activator binding domain (NCBD) of transcription coactivator CBP and the activation domain of the p160 steroid receptor coactivator ACTR. The calculated ensembles are in quantitative agreement with NMR-derived residue helicity and recapitulate the experimental observation that, while free ACTR largely lacks residual secondary structures, free NCBD is a molten globule with a helical content similar to that in the folded complex. Detailed conformational analysis reveals that free NCBD has an inherent ability to substantially sample all the helix configurations that have been previously observed either unbound or in complexes. Intriguingly, further high-temperature unbinding and unfolding simulations in implicit and explicit solvents emphasize the importance of conformational fluctuations in synergistic folding of NCBD with ACTR. A balance between preformed elements and conformational fluctuations appears necessary to allow NCBD to interact with different targets and fold into alternative conformations. Together with previous topology-based modeling and existing experimental data, the current simulations strongly support an “extended conformational selection” synergistic folding mechanism that involves a key intermediate state stabilized by interaction between the C-terminal helices of NCBD and ACTR. In addition, the atomistic simulations reveal the role of long-range as well as short-range electrostatic interactions in cooperating with readily fluctuating residual structures, which might enhance the encounter rate and promote efficient folding upon encounter for facile binding and folding interactions of IDPs. 
Thus, the current study not only provides a consistent mechanistic understanding of the NCBD/ACTR interaction, but also helps establish a multi-scale molecular modeling framework for understanding the structure, interaction, and regulation of IDPs in general.
Intrinsically disordered proteins (IDPs) are now widely recognized to play fundamental roles in biology and to be frequently associated with human diseases. Although the potential advantages of intrinsic disorder in cellular signaling and regulation have been widely discussed, the physical basis for these proposed phenomena remains sketchy at best. An integration of multi-scale molecular modeling and experimental characterization is necessary to uncover the molecular principles that govern the structure, interaction, and regulation of IDPs. In this work, we characterize the conformational properties of two IDPs involved in transcription regulation at the atomistic level and further examine the roles of these properties in their coupled binding and folding interactions. Our simulations suggest interplay among residual structures, conformational fluctuations, and electrostatic interactions that allows efficient synergistic folding of these two IDPs. In particular, we propose that electrostatic interactions might play an important role in facilitating rapid folding and binding recognition of IDPs, by enhancing the encounter rate and promoting efficient folding upon encounter.
Experimental studies on enzyme evolution show that only a small fraction of all possible mutation trajectories are accessible to evolution. However, these experiments deal with individual enzymes and explore a tiny part of the fitness landscape. We report an exhaustive analysis of fitness landscapes constructed with an off-lattice model of protein folding where fitness is equated with robustness to misfolding. This model mimics the essential features of the interactions between amino acids, is consistent with the key paradigms of protein folding and reproduces the universal distribution of evolutionary rates among orthologous proteins. We introduce mean path divergence as a quantitative measure of the degree to which the starting and ending points determine the path of evolution in fitness landscapes. Global measures of landscape roughness are good predictors of path divergence in all studied landscapes: the mean path divergence is greater in smooth landscapes than in rough ones. The model-derived and experimental landscapes are significantly smoother than random landscapes and resemble additive landscapes perturbed with moderate amounts of noise; thus, these landscapes are substantially robust to mutation. The model landscapes show a deficit of suboptimal peaks even compared with noisy additive landscapes with similar overall roughness. We suggest that smoothness and the substantial deficit of peaks in the fitness landscapes of protein evolution are fundamental consequences of the physics of protein folding.
Is evolution deterministic, hence predictable, or stochastic, that is unpredictable? What would happen if one could “replay the tape of evolution”: will the outcomes of evolution be completely different or is evolution so constrained that history will be repeated? Arguably, these questions are among the most intriguing and most difficult in evolutionary biology. In other words, the predictability of evolution depends on the fraction of the trajectories on fitness landscapes that are accessible for evolutionary exploration. Because direct experimental investigation of fitness landscapes is technically challenging, the available studies only explore a minuscule portion of the landscape for individual enzymes. We therefore sought to investigate the topography of fitness landscapes within the framework of a previously developed model of protein folding and evolution where fitness is equated with robustness to misfolding. We show that model-derived and experimental landscapes are significantly smoother than random landscapes and resemble moderately perturbed additive landscapes; thus, these landscapes are substantially robust to mutation. The model landscapes show a deficit of suboptimal peaks even compared with noisy additive landscapes with similar overall roughness. Thus, the smoothness and substantial deficit of peaks in fitness landscapes of protein evolution could be fundamental consequences of the physics of protein folding.
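The notion of accessible mutational trajectories can be made concrete by counting, on a small landscape, the paths along which fitness increases at every step. The sketch below uses two hypothetical four-locus landscapes invented for illustration (a purely additive one and one with a single sign-epistatic interaction); it does not reproduce the folding model or the mean path divergence measure used in the study.

```python
from itertools import permutations

# Hypothetical per-locus fitness contributions for the additive toy landscape.
WEIGHTS = (1.0, 0.5, 0.25, 0.125)


def additive(genotype):
    """Fitness is the sum of independent contributions of mutated loci."""
    return sum(w * bit for w, bit in zip(WEIGHTS, genotype))


def with_sign_epistasis(genotype):
    """Additive landscape, except locus 0 is deleterious unless locus 1 is mutated."""
    fitness = additive(genotype)
    if genotype[0] == 1 and genotype[1] == 0:
        fitness -= 2.0
    return fitness


def accessible_paths(fitness, n_loci):
    """Count orderings of single mutations from all-0 to all-1 along which
    fitness strictly increases at every step."""
    count = 0
    for order in permutations(range(n_loci)):
        genotype = [0] * n_loci
        current = fitness(tuple(genotype))
        accessible = True
        for locus in order:
            genotype[locus] = 1
            following = fitness(tuple(genotype))
            if following <= current:
                accessible = False
                break
            current = following
        if accessible:
            count += 1
    return count


if __name__ == "__main__":
    n = len(WEIGHTS)
    print("additive:", accessible_paths(additive, n), "accessible paths")
    print("sign epistasis:", accessible_paths(with_sign_epistasis, n), "accessible paths")
```

On the additive landscape every one of the 4! = 24 orderings is accessible, whereas the single sign-epistatic interaction removes the 12 orderings in which locus 0 mutates before locus 1. This is the sense in which smoother landscapes leave more trajectories open.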
The population dynamics theory of B cells in a typical germinal center could play an important role in revealing how affinity maturation is achieved. However, the existing models encountered some conflicts with experiments. To resolve these conflicts, we present a coarse-grained model to calculate the B cell population development in affinity maturation, which allows a comprehensive analysis of its parameter space to look for optimal values of mutation rate, selection strength, and initial antibody-antigen binding level that maximize the affinity improvement. With these optimized parameters, the model is compatible with the experimental observations such as the ∼100-fold affinity improvements, the number of mutations, the hypermutation rate, and the “all or none” phenomenon. Moreover, we study the reasons behind the optimal parameters. The optimal mutation rate, in agreement with the hypermutation rate in vivo, results from a tradeoff between accumulating enough beneficial mutations and avoiding too many deleterious or lethal mutations. The optimal selection strength evolves as a balance between the need for affinity improvement and the requirement to pass the population bottleneck. These findings point to the conclusion that germinal centers have been optimized by evolution to generate strong affinity antibodies effectively and rapidly. In addition, we study the enhancement of affinity improvement due to B cell migration between germinal centers. These results could enhance our understanding of the functions of germinal centers.
The antibodies in our immune system can efficiently improve their ability to recognize new antigens. This is achieved through proliferation, mutation, and selection of the B cells that carry the antibodies, but it has been difficult to develop a quantitative description of this adaptation process that is consistent with the various experimental observations. Based on experimental knowledge, we present a theoretical model that tracks the numbers of B cells with different antigen-recognizing abilities over time, and we look for the design that improves antigen recognition most efficiently. We find that the best possible design is consistent with the experimental observations, pointing to the conclusion that the immune system has been optimized by evolution. We then study the trade-offs leading to the optimization of the design. The results will not only improve our understanding of the functions of the immune system, but also reveal the design principles behind the details. In addition, the study enhances our understanding of population dynamics in evolution.
Mutators are clones whose mutation rate is about two to three orders of magnitude higher than the rate of wild-type clones and their roles in adaptive evolution of asexual populations have been controversial. Here we address this problem by using an ab initio microscopic model of living cells, which combines population genetics with a physically realistic presentation of protein stability and protein-protein interactions. The genome of model organisms encodes replication controlling genes (RCGs) and genes modeling the mismatch repair (MMR) complexes. The genotype-phenotype relationship posits that the replication rate of an organism is proportional to protein copy numbers of RCGs in their functional form and there is a production cost penalty for protein overexpression. The mutation rate depends linearly on the concentration of homodimers of MMR proteins. By simulating multiple runs of evolution of populations under various environmental stresses—stationary phase, starvation or temperature-jump—we find that adaptation most often occurs through transient fixation of a mutator phenotype, regardless of the nature of stress. By contrast, the fixation mechanism does depend on the nature of stress. In temperature jump stress, mutators take over the population due to loss of stability of MMR complexes. In contrast, in starvation and stationary phase stresses, a small number of mutators are supplied to the population via epigenetic stochastic noise in production of MMR proteins (a pleiotropic effect), and their net supply is higher due to reduced genetic drift in slowly growing populations under stressful environments. Subsequently, mutators in stationary phase or starvation hitchhike to fixation with a beneficial mutation in the RCGs, (second order selection) and finally a mutation stabilizing the MMR complex arrives, returning the population to a non-mutator phenotype. 
Our results provide microscopic insights into the rise and fall of mutators in adapting finite asexual populations.
The dramatic rise of mutators has been found to accompany adaptation of bacteria in response to many kinds of stress. Two views on the evolutionary origin of this phenomenon emerged: the pleiotropic hypothesis positing that it is a byproduct of environmental stress or other specific stress response mechanisms and the second order selection which states that mutators hitchhike to fixation with unrelated beneficial alleles. Conventional population genetics models could not fully resolve this controversy because they are based on certain assumptions about fitness landscape. Here we address this problem using a microscopic multiscale model, which couples physically realistic molecular descriptions of proteins and their interactions with population genetics of carrier organisms without assuming any a priori mutational effect on fitness landscape. We found that both pleiotropy and second order selection play a crucial role at different stages of adaptation: the supply of mutators is provided through destabilization of error correction complexes or, alternatively, fluctuations of production levels of prototypic mismatch repair proteins (pleiotropic effects), while the rise and fixation of mutators occurs when there is a sufficient supply of beneficial mutations in replication-controlling genes. This general mechanism assures a robust and reliable adaptation of organisms to unforeseen challenges. This study highlights physical principles underlying biological mechanisms of stress response and adaptation.
Prion proteins are known to misfold into a range of different aggregated forms, showing different phenotypic and pathological states. Understanding strain specificities is an important problem in the field of prion disease. Little is known about which PrPSc structural properties and molecular mechanisms determine prion replication, disease progression and strain phenotype. The aim of this work is to investigate, through a mathematical model, how the structural stability of different aggregated forms can influence the kinetics of prion replication. The model-based results suggest that prion strains with different conformational stability undergoing in vivo replication are characterizable in primis by means of different rates of breakage. A further role seems to be played by the aggregation rate (i.e. the rate at which a prion fibril grows). The kinetic variability introduced in the model by these two parameters allows us to reproduce the different characteristic features of the various strains (e.g., fibrils' mean length) and is coherent with all experimental observations concerning strain-specific behavior.
Prion diseases are caused by the accumulation of a cellular prion protein with an altered conformation, which acts as a catalyst for the further recruitment and the modification of the normal form of the protein. Protein polymerization appears to have a central role in the progression of the disease, an aspect shared with several other neurodegenerative diseases. The aim of this work is to investigate at the kinetic level the “prion strain phenomenon”, i.e., the ability of prion proteins to misfold into a range of different aggregated forms exhibiting different replication and propagation properties. The dynamics of prion replication is investigated with the help of a mathematical model. We relate a measurement accessible in vitro (prion structural stability) to a mathematical description of the fibrils' kinetics in vivo. The analysis of the model suggests that the replication kinetics of the different prion strains is characterizable by means of two parameters, representing the rates of breakage and aggregation. This result is coherent with various experimental findings concerning strain-specific behavior, such as, for example, the observation of the fibril mean length of the various strains.
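To illustrate how breakage and aggregation rates can shape fibril length, the sketch below integrates a standard nucleated polymerization model closed at the level of its first moments (monomer concentration x, number of fibrils y, and total monomer mass in fibrils z). Whether this matches the model used in the study is an assumption, and every parameter value is an arbitrary illustrative choice: `beta` is the aggregation (elongation) rate, `b` the breakage rate per bond, `a` the fibril degradation rate, `n` the minimal stable fibril size, `lam` the monomer production rate, and `d` the monomer degradation rate.

```python
def simulate_mean_length(beta=1.0, b=0.05, a=0.1, n=2, lam=10.0, d=0.1,
                         t_end=300.0, dt=0.005):
    """Forward-Euler integration of the moment equations; returns z / y at t_end.

    dx/dt = lam - d*x - beta*x*y + b*n*(n-1)*y
    dy/dt = b*z - a*y - (2*n - 1)*b*y
    dz/dt = beta*x*y - a*z - b*n*(n-1)*y
    """
    x = lam / d      # start at the fibril-free monomer level
    y = 1e-3         # small seed of fibrils
    z = 1e-2         # seed mass (initial mean length 10)
    for _ in range(int(t_end / dt)):
        returned = b * n * (n - 1) * y   # fragments below size n fall back to monomers
        dx = lam - d * x - beta * x * y + returned
        dy = b * z - a * y - (2 * n - 1) * b * y
        dz = beta * x * y - a * z - returned
        x += dt * dx
        y += dt * dy
        z += dt * dz
    return z / y


def steady_state_mean_length(b, a, n):
    """Mean fibril length obtained by setting dy/dt = 0 in the equations above."""
    return a / b + 2 * n - 1


if __name__ == "__main__":
    for breakage in (0.025, 0.05, 0.1):
        simulated = simulate_mean_length(b=breakage)
        predicted = steady_state_mean_length(breakage, a=0.1, n=2)
        print(f"b={breakage:.3f}  simulated={simulated:.3f}  predicted={predicted:.3f}")
```

Within this particular model, the steady-state mean length depends on the breakage rate but not on the aggregation rate, while the aggregation rate sets how quickly fibril mass accumulates. This gives one concrete picture of how two kinetic parameters could separate strain-specific features.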
Experimental protein-protein interaction (PPI) networks are increasingly being exploited in diverse ways for biological discovery. Accordingly, it is vital to discern their underlying natures by identifying and classifying the various types of deterministic (specific) and probabilistic (nonspecific) interactions detected. To this end, we have analyzed PPI networks determined using a range of high-throughput experimental techniques with the aim of systematically quantifying any biases that arise from the varying cellular abundances of the proteins. We confirm that PPI networks determined using affinity purification methods for yeast and Escherichia coli incorporate a correlation between protein degree, or number of interactions, and cellular abundance. The observed correlations are small but statistically significant and occur in both unprocessed (raw) and processed (high-confidence) data sets. In contrast, the yeast two-hybrid system yields networks that contain no such relationship. While this difference has previously been noted on the basis of mRNA abundance, our more extensive analysis based on protein abundance confirms a systematic difference between PPI networks determined from the two technologies. We additionally demonstrate that the centrality-lethality rule, which implies that higher-degree proteins are more likely to be essential, may be misleading, as protein abundance measurements identify essential proteins to be more prevalent than nonessential proteins. In fact, we generally find that when there is a degree/abundance correlation, the degree distributions of nonessential and essential proteins are also disparate. Conversely, when there is no degree/abundance correlation, the degree distributions of nonessential and essential proteins are not different. However, we show that essentiality manifests itself as a biological property in all of the yeast PPI networks investigated here via enrichments of interactions between essential proteins.
These findings provide valuable insights into the underlying natures of the various high-throughput technologies utilized to detect PPIs and should lead to more effective strategies for the inference and analysis of high-quality PPI data sets.
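As an illustration of the degree/abundance comparison described above, the following minimal sketch computes a rank correlation between the number of interaction partners of each protein and its cellular abundance. The abstract does not say which correlation statistic was used, so the choice of a Spearman rank correlation is an assumption, and the function names (`degrees`, `average_ranks`, `degree_abundance_spearman`) are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def degrees(edges):
    """Count distinct interaction partners per protein (self-interactions ignored)."""
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


def degree_abundance_spearman(edges, abundance):
    """Spearman correlation between degree and abundance.

    Only proteins present in both the network and the abundance table are used
    (an assumption of this sketch, not a rule taken from the abstract).
    """
    deg = degrees(edges)
    shared = sorted(p for p in deg if p in abundance)
    d = [deg[p] for p in shared]
    a = [abundance[p] for p in shared]
    return pearson(average_ranks(d), average_ranks(a))
```

Here `edges` is a list of protein-pair tuples and `abundance` maps protein identifiers to any abundance measure, for example copies per cell. Testing the significance of the coefficient is left out of this sketch.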
Crowded intracellular environments present a challenge for proteins to form functional specific complexes while reducing non-functional interactions with promiscuous non-functional partners. Here we show how the need to minimize the waste of resources on non-functional interactions limits the proteome diversity and the average concentration of co-expressed and co-localized proteins. Using the results of high-throughput yeast two-hybrid experiments, we estimate the characteristic strength of non-functional protein–protein interactions. By combining these data with the strengths of specific interactions, we assess the fraction of time proteins spend tied up in non-functional interactions as a function of their overall concentration. This allows us to sketch the phase diagram for baker's yeast cells using the experimentally measured concentrations and subcellular localization of their proteins. The positions of yeast compartments on the phase diagram are consistent with our hypothesis that the yeast proteome has evolved to operate close to the upper limit of its size while keeping individual protein concentrations sufficiently low to reduce non-functional interactions. These findings have implications for the conceptual understanding of intracellular compartmentalization, multicellularity and differentiation.
non-functional interaction; protein–protein interaction; proteome size; yeast cytoplasm
In this work we develop a microscopic physical model of early evolution where phenotype—organism life expectancy—is directly related to genotype—the stability of its proteins in their native conformations—which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the “Big Bang” scenario whereby exponential population growth ensues as soon as favorable sequence–structure combinations (precursors of stable proteins) are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species—subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.
Here, we address the question of how Darwinian evolution of organisms determines molecular evolution of their proteins and genomes. We developed a microscopic ab initio model of early biological evolution where the fitness (essentially lifetime) of an organism is explicitly related to the evolving sequences of its proteins. The main assumption of the model is that the death rate of an organism is determined by the stability of the least stable of their proteins. A lattice model is used to calculate stability of all proteins in a genome from their amino acid sequence. The simulation of the model starts from 100 identical organisms, each carrying the same random gene, and proceeds via random mutations, gene duplication, organism births via replication, and organism deaths. We find that exponential population growth is possible only after the discovery of a very small number of specific advantageous protein structures. The number of genes in the evolving organisms depends on the mutation rate, demonstrating the intricate relationship between the genome sizes and protein stability requirements. Further, the model explains the observed power-law distributions of protein family and superfamily sizes, as well as the scale-free character of protein structural similarity graphs. Together, these results and their analysis suggest a plausible comprehensive scenario of emergence of the protein universe in early biological evolution.
The aim of this work is to elucidate how physical principles of protein design are reflected in natural sequences that evolved in response to the thermal conditions of the environment. Using an exactly solvable lattice model, we design sequences with selected thermal properties. Compositional analysis of designed model sequences and natural proteomes reveals a specific trend in amino acid compositions in response to the requirement of stability at elevated environmental temperature: the increase of fractions of hydrophobic and charged amino acid residues at the expense of polar ones. We show that this “from both ends of the hydrophobicity scale” trend is due to positive (to stabilize the native state) and negative (to destabilize misfolded states) components of protein design. Negative design strengthens specific repulsive non-native interactions that appear in misfolded structures. A pressure to preserve specific repulsive interactions in non-native conformations may result in correlated mutations between amino acids that are far apart in the native state but may be in contact in misfolded conformations. Such correlated mutations are indeed found in TIM barrel and other proteins.
What mechanisms does Nature use in her quest for thermophilic proteins? It is known that stability of a protein is mainly determined by the energy gap, or the difference in energy, between native state and a set of incorrectly folded (misfolded) conformations. Here we show that Nature makes thermophilic proteins by widening this gap from both ends. The energy of the native state of a protein is decreased by selecting strongly attractive amino acids at positions that are in contact in the native state (positive design). Simultaneously, energies of the misfolded conformations are increased by selection of strongly repulsive amino acids at positions that are distant in native structure; however, these amino acids will interact repulsively in the misfolded conformations (negative design). These fundamental principles of protein design are manifested in the “from both ends of the hydrophobicity scale” trend observed in thermophilic adaptation, whereby proteomes of thermophilic proteins are enriched in extreme amino acids—hydrophobic and charged—at the expense of polar ones. Hydrophobic amino acids contribute mostly to the positive design, while charged amino acids that repel each other in non-native conformations of proteins contribute to negative design. Our results provide guidance in rational design of proteins with selected thermal properties.
Protein–DNA interactions are vital for many processes in living cells, especially transcriptional regulation and DNA modification. To further our understanding of these important processes on the microscopic level, it is necessary that theoretical models describe the macromolecular interaction energetics accurately. While several methods have been proposed, there has not been a careful comparison of how well the different methods are able to predict biologically important quantities such as the correct DNA binding sequence, total binding free energy and free energy changes caused by DNA mutation. In addition to carrying out the comparison, we present two important theoretical models developed initially in protein folding that have not yet been tried on protein–DNA interactions. In the process, we find that the results of these knowledge-based potentials show a strong dependence on the interaction distance and the derivation method. Finally, we present a knowledge-based potential that gives comparable or superior results to the best of the other methods, including the molecular mechanics force field AMBER99.
There have been considerable attempts in the past to relate a phenotypic trait, the habitat temperature of organisms, to their genotypes, most importantly the compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from −10 °C to 110 °C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = −0.10). On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of protein structure and stability and evolutionary mechanisms of thermophilic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.
Prokaryotes living at extreme environmental temperatures exhibit pronounced signatures in the amino acid composition of their proteins and the nucleotide compositions of their genomes, reflective of adaptation to their thermal environments. However, despite significant efforts, a definitive answer as to what the genomic and proteomic compositional determinants of optimal growth temperature (OGT) of prokaryotic organisms are has remained elusive. Here we performed a comprehensive analysis of amino acid and nucleotide compositional signatures of thermophilic adaptation by exhaustively evaluating all combinations of amino acids and nucleotides as possible determinants of OGT for all prokaryotic organisms with fully sequenced genomes. We discovered that the total concentration of seven amino acids in proteomes (IVYWREL) serves as a universal proteomic predictor of OGT in prokaryotes. Resolving the long-standing controversy, we determined that the variation in nucleotide composition (increase of purine load, or A + G content with temperature) is largely a consequence of thermal adaptation of proteins. However, the frequency with which A and G nucleotides appear as nearest neighbors in genome sequences is strongly and independently correlated with OGT as a result of codon bias in corresponding genomes. Together these results provide a complete picture of proteomic and genomic determinants of thermophilic adaptation.
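The proteomic predictor described above is the total fraction of Ile, Val, Tyr, Trp, Arg, Glu and Leu residues in a proteome. A minimal sketch of that quantity follows, assuming the proteome is supplied as an iterable of one-letter amino acid sequences. The function name `ivywrel_fraction` is hypothetical, and the sketch does not reproduce the exhaustive enumeration of amino acid sets or the fit against OGT.

```python
# One-letter codes of the seven amino acids named in the abstract.
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteins):
    """Return the fraction of all residues in a proteome that are I, V, Y, W, R, E or L."""
    total = 0
    hits = 0
    for seq in proteins:
        seq = seq.upper()
        total += len(seq)
        hits += sum(1 for aa in seq if aa in IVYWREL)
    if total == 0:
        raise ValueError("proteome contains no residues")
    return hits / total
```

Computing this value for each organism and correlating it with the measured OGT across organisms would be the next step. The abstract reports a correlation coefficient as high as 0.93 for the 204 proteomes studied.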
The determination of factors that influence protein conformational changes is very important for the identification of potentially amyloidogenic and disordered regions in polypeptide chains. In our work we introduce a new parameter, mean packing density, to detect both amyloidogenic and disordered regions in a protein sequence. It has been shown that regions with strong expected packing density are responsible for amyloid formation. Our predictions are consistent with known disease-related amyloidogenic regions for eight of 12 amyloid-forming proteins and peptides in which the positions of amyloidogenic regions have been revealed experimentally. Our findings support the concept that the mechanism of amyloid fibril formation is similar for different peptides and proteins. Moreover, we have demonstrated that regions with weak expected packing density are responsible for the appearance of disordered regions. Our method has been tested on datasets of globular proteins and long disordered protein segments, and it shows improved performance over other widely used methods. Thus, we demonstrate that the expected packing density is a useful value with which one can predict both intrinsically disordered and amyloidogenic regions of a protein based on sequence alone. Our results are important for understanding the structural characteristics of protein folding and misfolding.
Protein folding is one of the most challenging issues in biophysical science. During the past few years it has been shown that some diseases are connected with protein misfolding and the formation of insoluble aggregates called amyloid plaques. These processes may be associated with several diseases such as Alzheimer disease, Parkinson disease, Creutzfeldt-Jakob disease, and even certain forms of cancer. It has been shown that proteins with intrinsically disordered regions are involved in protein–protein or protein–nucleic acid interactions. The main objective of this paper is to report insights into the molecular mechanisms of amyloid aggregation. This has been done using the parameter of the observed number of contacts for each amino acid residue in the globular state, hereafter called expected packing density. By analysis of sequences alone, the authors have demonstrated that regions that possess strong expected packing density can be responsible for amyloidogenic properties of a protein, while regions with weak expected packing density correspond to disordered regions. A new concept is proposed that could aid in understanding protein folding, misfolding, and amyloidosis. The results help to explain that the amyloidogenic propensity of proteins is connected to amino acid sequences that are able to form a large number of contacts.
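The idea of scanning a sequence for regions of strong or weak expected packing density can be sketched as a sliding-window average over a per-residue scale. The abstract gives neither the scale values, the window length, nor the thresholds, so all three are parameters the caller must supply, and the function names `windowed_profile` and `regions_above` are hypothetical.

```python
def windowed_profile(sequence, scale, window=5):
    """Average a per-residue scale over a centred window (truncated at the ends).

    `scale` maps each residue letter to its expected packing density; the real
    values are not given in the abstract and must come from the caller.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    values = [scale[aa] for aa in sequence]
    half = window // 2
    profile = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):min(len(values), i + half + 1)]
        profile.append(sum(chunk) / len(chunk))
    return profile


def regions_above(profile, threshold, min_length=1):
    """Return half-open (start, end) index ranges where profile >= threshold."""
    regions = []
    start = None
    for i, value in enumerate(profile):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_length:
        regions.append((start, len(profile)))
    return regions
```

Regions of weak expected packing density, which the text associates with disorder, could be found the same way by negating the profile and the threshold before calling `regions_above`.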
The number of protein structures from structural genomics centers in the Protein Data Bank (PDB) is increasing dramatically. Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to successfully infer function using only structural similarity.
Here we present the PDB-UF database, a web-accessible collection of predictions of enzymatic properties based on structure-function relationships. The assignments were conducted for three-dimensional protein structures of unknown function that come from structural genomics initiatives. We show that four hypothetical proteins (with PDB accession codes 1VH0, 1NS5, 1O6D, and 1TO0), for which standard BLAST tools such as PSI-BLAST or RPS-BLAST failed to assign any function, are probably methyltransferase enzymes.
We suggest that the structure-based prediction of an EC number should use different similarity score cutoffs for different protein folds. Moreover, performing the annotation using two different algorithms can reduce the rate of false positive assignments. We believe that the presented web-based repository will help to decrease the number of protein structures that have functions marked as "unknown" in the PDB file.
Evolutionary traces of thermophilic adaptation are manifest, on the whole-genome level, in compositional biases toward certain types of amino acids. However, it is sometimes difficult to discern their causes without a clear understanding of underlying physical mechanisms of thermal stabilization of proteins. For example, it is well-known that hyperthermophiles feature a greater proportion of charged residues, but, surprisingly, the excess of positively charged residues is almost entirely due to lysines but not arginines in the majority of hyperthermophilic genomes. All-atom simulations show that lysines have a much greater number of accessible rotamers than arginines of similar degree of burial in folded states of proteins. This finding suggests that lysines would preferentially entropically stabilize the native state. Indeed, we show in computational experiments that arginine-to-lysine amino acid substitutions result in noticeable stabilization of proteins. We then hypothesize that if evolution uses this physical mechanism as a complement to electrostatic stabilization in its strategies of thermophilic adaptation, then hyperthermostable organisms would have much greater content of lysines in their proteomes than comparably sized and similarly charged arginines. Consistent with that, high-throughput comparative analysis of complete proteomes shows extremely strong bias toward arginine-to-lysine replacement in hyperthermophilic organisms and overall much greater content of lysines than arginines in hyperthermophiles. This finding cannot be explained by genomic GC compositional biases or by the universal trend of amino acid gain and loss in protein evolution. We discovered here a novel entropic mechanism of protein thermostability due to residual dynamics of rotamer isomerization in native state and demonstrated its immediate proteomic implications. 
Our study provides an example of how analysis of a fundamental physical mechanism of thermostability helps to resolve a puzzle in comparative genomics as to why amino acid compositions of hyperthermophilic proteomes are significantly biased toward lysines but not similarly charged arginines.
Comparative genomics sends us profound signals that are not easy to understand. For example, it is well known that proteins from hyperthermophiles are enriched with charged residues, but it has been a mystery why enrichment in positively charged amino acids is almost entirely due to lysines at the expense of very similar arginines. Here, the authors show that lysines (in contrast to arginines) exhibit significant residual dynamics in folded states of proteins, making the entropic cost to fold lysine-rich proteins less unfavorable compared with arginine-rich ones. Therefore, replacements of arginines by lysines provide additional thermal stabilization of proteins via entropic mechanism, making them positively charged residues of choice for evolutionary optimization of hyperthermostable proteins. Apparently, natural selection uses diverse physical mechanisms of thermal stability to achieve adaptation. This study provides an example of how better understanding of protein physics can help in solving genomic mysteries.
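The proteome-level observation above, a much greater content of lysines than arginines in hyperthermophiles, can be checked with a simple composition count. The sketch below assumes a proteome given as an iterable of one-letter amino acid sequences, and the function names are hypothetical. It measures overall content only and does not capture the arginine-to-lysine replacement bias, which would require aligned orthologous proteins.

```python
from collections import Counter


def lys_arg_content(proteins):
    """Return (lysine fraction, arginine fraction) over all residues in a proteome."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("proteome contains no residues")
    return counts["K"] / total, counts["R"] / total


def lys_share_of_kr(proteins):
    """Return K / (K + R): the share of lysine among lysine and arginine residues."""
    lys, arg = lys_arg_content(proteins)
    if lys + arg == 0:
        raise ValueError("proteome contains no lysine or arginine")
    return lys / (lys + arg)
```

A share well above 0.5 would indicate a lysine-biased proteome. Comparing this value between hyperthermophilic and mesophilic proteomes would be one rough way to look at the trend the abstract describes.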
Certain amino acid residues in a protein, when mutated, change the protein's function. We present an improved method of finding these specificity-determining positions that uses all the protein sequence data available for a family of homologous proteins. We study in detail two families of eukaryotic transcription factors, basic leucine zippers and nuclear receptors, because of the large amount of sequences and experimental data available. These protein families also have a clear definition of functional specificity: DNA-binding specificity. We compare our results to three other methods, including the evolutionary trace algorithm and a method that depends on orthology relationships. All of the predictions are compared to the available mutational and crystallographic data. We find that our method provides superior predictions of the known specificity-determining residues and also predicts residue positions within these families that deserve further study for their roles in functional specificity.