Understanding the relative contributions of various evolutionary processes—purifying selection, neutral drift, and adaptation—is fundamental to evolutionary biology. A common metric to distinguish these processes is the ratio of nonsynonymous to synonymous substitutions (i.e., dN/dS) interpreted from the neutral theory as a null model. However, from biophysical considerations, mutations have non-negligible effects on the biophysical properties of proteins such as folding stability. In this work, we investigated how stability affects the rate of protein evolution in phylogenetic trees by using simulations that combine explicit protein sequences with associated stability changes. We first simulated myoglobin evolution in phylogenetic trees with a biophysically realistic approach that accounts for 3D structural information and estimates of changes in stability upon mutation. We then compared evolutionary rates inferred directly from simulation to those estimated using maximum-likelihood (ML) methods. We found that the dN/dS estimated by ML methods (ωML) is highly predictive of the per gene dN/dS inferred from the simulated phylogenetic trees. This agreement is strong in the regime of high stability where protein evolution is neutral. At low folding stabilities and under mutation-selection balance, we observe deviations from neutrality (per gene dN/dS > 1 and dN/dS < 1). We showed that although per gene dN/dS is robust to these deviations, ML tests for positive selection detect statistically significant per site dN/dS > 1. Altogether, we show how protein biophysics affects the dN/dS estimations and its subsequent interpretation. These results are important for improving the current approaches for detecting positive selection.
dN/dS; molecular evolution; protein evolution; folding stability; positive selection; maximum likelihood
A major component of ex vivo amyloid plaques of patients with dialysis-related amyloidosis (DRA) is a cleaved variant of β2-microglobulin (ΔN6) lacking the first six N-terminal residues. Here we perform a computational study on ΔN6, which provides clues to understand the amyloidogenicity of the full-length β2-microglobulin. Contrary to the wild-type form, ΔN6 is able to efficiently nucleate fibrillogenesis in vitro at physiological pH. This behavior is enhanced by a mild acidification of the medium such as that occurring in the synovial fluid of DRA patients. Results reported in this work, based on molecular simulations, indicate that deletion of the N-terminal hexapeptide triggers the formation of an intermediate state for folding and aggregation with an unstructured strand A and a native-like core. Strand A plays a pivotal role in aggregation by acting as a sticky hook in dimer assembly. This study further predicts that the detachment of strand A from the core is maximized at pH 6.2 resulting into higher aggregation efficiency. The structural mapping of the dimerization interface suggests that Tyr10, His13, Phe30 and His84 are hot-spot residues in ΔN6 amyloidogenesis.
Dialysis-related amyloidosis (DRA) is a conformational disease that affects individuals undergoing long-term haemodialysis. In DRA the progressive accumulation of protein human β2-microglobulin (Hβ2m) in the osteoarticular system, followed by its assembly into amyloid fibrils, eventually leads to tissue erosion and destruction. Disclosing the aggregation mechanism of Hβ2m under physiologically relevant conditions represents a major challenge due to the inability of the protein to efficiently nucleate fibrillogenesis in vitro at physiological pH. On the other hand, ΔN6, a truncated variant of Hβ2m, which is also a major component of ex vivo amyloid deposits extracted from DRA patients, is able to efficiently form amyloid fibrils de novo in physiological conditions. This amyloidogenic behavior is dramatically enhanced in a slightly more acidic pH (6.2) compatible with the mild acidification that occurs in the synovial fluid of DRA patients. In this work, an innovative three-stage methodological approach, relying on an array of molecular simulations, spanning different levels of resolution is used to investigate the initial stage of the de novo aggregation mechanism of ΔN6 in a physiologically relevant pH range. We identify an intermediate state for folding and aggregation, whose potential to dimerize is enhanced at pH 6.2. Our results provide rationalizations for previous experimental observations and new insights into the molecular basis of DRA.
The emergence of a novel A(H1N1) strain in 2009 was the first influenza pandemic of the genomic age, and unprecedented surveillance of the virus provides the opportunity to better understand the evolution of influenza. We examined changes in the nucleotide coding regions and the amino acid sequences of the hemagglutinin (HA), neuraminidase (NA), and nucleoprotein (NP) segments of the A(H1N1)pdm09 strain using publicly available data. We calculated the nucleotide and amino acid hamming distance from the vaccine strain A/California/07/2009 for each sequence. We also estimated Pepitope–a measure of antigenic diversity based on changes in the epitope regions–for each isolate. Finally, we compared our results to A(H3N2) strains collected over the same period. Our analysis found that the mean hamming distance for the HA protein of the A(H1N1)pdm09 strain increased from 3.6 (standard deviation [SD]: 1.3) in 2009 to 11.7 (SD: 1.0) in 2013, while the mean hamming distance in the coding region increased from 7.4 (SD: 2.2) in 2009 to 28.3 (SD: 2.1) in 2013. These trends are broadly similar to the rate of mutation in H3N2 over the same time period. However, in contrast to H3N2 strains, the rate of mutation accumulation has slowed in recent years. Our results are notable because, over the course of the study, mutation rates in H3N2 similar to that seen with A(H1N1)pdm09 led to the emergence of two antigenic drift variants. However, while there has been an H1N1 epidemic in North America this season, evidence to date indicates the vaccine is still effective, suggesting the epidemic is not due to the emergence of an antigenic drift variant. Our results suggest that more research is needed to understand how viral mutations are related to vaccine effectiveness so that future vaccine choices and development can be more predictive.
What are the molecular properties of proteins that fall on the radar of protein quality control (PQC)? Here we mutate the E. coli’s gene encoding dihydrofolate reductase (DHFR), and replace it with bacterial orthologous genes to determine how components of PQC modulate fitness effects of these genetic changes. We find that chaperonins GroEL/ES and protease Lon compete for binding to molten globule intermediate of DHFR, resulting in a peculiar symmetry in their action: Over-expression of GroEL/ES and deletion of Lon both restore growth of deleterious DHFR mutants and most of the slow-growing orthologous DHFR strains. Kinetic steady-state modeling predicts and experimentation verifies that mutations affect fitness by shifting the flux balance in cellular milieu between protein production, folding and degradation orchestrated by PQC through the interaction with folding intermediates.
Recent work has shown that the incorporation of an all-hydrocarbon “staple” into peptides can greatly increase their α-helix propensity, leading to an improvement in pharmaceutical properties such as proteolytic stability, receptor affinity and cell-permeability. Stapled peptides thus show promise as a new class of drugs capable of accessing intractable targets such as those that engage in intracellular protein-protein interactions. The extent of α-helix stabilization provided by stapling has proven to be substantially context dependent, requiring cumbersome screening to identify the optimal site for staple incorporation. In certain cases, a staple encompassing one turn of the helix (attached at residues i and i+4) furnishes greater helix stabilization than one encompassing two turns (i,i+7 staple), which runs counter to expectation based on polymer theory. These findings highlight the need for a more thorough understanding of the forces that underlie helix stabilization by hydrocarbon staples. Here we report all-atom Monte Carlo folding simulations comparing unmodified peptides derived from RNAse A and BID BH3 with various i,i+4 and i,i+7 stapled versions thereof. The results of these simulations were found to be in quantitative agreement with experimentally determined helix propensities. We also discovered that staples can stabilize quasi-stable decoy conformations, and that the removal of these states plays a major role in determining the helix stability of stapled peptides. Finally, we critically investigate why our method works, exposing the underlying physical forces that stabilize stapled peptides.
Stapled Peptides; Monte-Carlo simulations; Drug Discovery; Folding Traps; Entropic Stabilization
Although molecular chaperones are essential components of protein homeostatic machinery, their mechanism of action and impact on adaptation and evolutionary dynamics remain controversial. Here we developed a physics-based ab initio multi-scale model of a living cell for population dynamics simulations to elucidate the effect of chaperones on adaptive evolution. The 6-loci genomes of model cells encode model proteins, whose folding and interactions in cellular milieu can be evaluated exactly from their genome sequences. A genotype-phenotype relationship that is based on a simple yet non-trivially postulated protein-protein interaction (PPI) network determines the cell division rate. Model proteins can exist in native and molten globule states and participate in functional and all possible promiscuous non-functional PPIs. We find that an active chaperone mechanism, whereby chaperones directly catalyze protein folding, has a significant impact on the cellular fitness and the rate of evolutionary dynamics, while passive chaperones, which just maintain misfolded proteins in soluble complexes have a negligible effect on the fitness. We find that by partially releasing the constraint on protein stability, active chaperones promote a deeper exploration of sequence space to strengthen functional PPIs, and diminish the non-functional PPIs. A key experimentally testable prediction emerging from our analysis is that down-regulation of chaperones that catalyze protein folding significantly slows down the adaptation dynamics.
Molecular chaperones or heat-shock proteins are essential components of protein homeostatic machinery in all three domains of life, whose role is not only to prevent protein aggregation but also catalyze the protein folding process by decreasing the energetic barrier for folding. Importantly, chaperones have often been implicated as phenotypic capacitors since they buffer the deleterious effects of mutations, promote genetic diversity, and thus speed up adaptive evolution. Here we explore computationally the consequences of chaperone activity in cytoplasm via long-time evolutionary dynamics simulations. We use a 6-loci multi scale model of cell populations, where the fitness of each cell is determined from its genome, based on statistical mechanical principles of protein folding and protein-protein interactions. We find that by catalyzing protein folding chaperones buffer the deleterious effect of mutations on folding stability and thus open up a sequence space for efficient and simultaneous optimization of multiple molecular traits determining the cellular fitness. As a result, chaperones dramatically accelerate adaptation dynamics.
The patterns of polymorphisms in genomes are imprints of the evolutionary forces at play in nature. In particular, polymorphisms have been extensively used to infer the fitness effects of mutations and their dynamics of fixation. However, the role and contribution of molecular biophysics to these observations remain unclear. Here, we couple robust findings from protein biophysics, enzymatic flux theory, the selection against the cytotoxic effects of protein misfolding, and explicit population dynamics simulations in the polyclonal regime. First, we recapitulate results on the dynamics of clonal interference and on the shape of the DFE, thus providing them with a molecular and mechanistic foundation. Second, we predict that if evolution is indeed under the dynamic equilibrium of mutation–selection balance, the fraction of stabilizing and destabilizing mutations is almost equal among single-nucleotide polymorphisms segregating at high allele frequencies. This prediction is proven true for polymorphisms in the human coding region. Overall, our results show how selection for protein folding stability predominantly shapes the patterns of polymorphisms in coding regions.
SNPs; polymorphism; protein folding stability; DFE; clonal interference
An increasing number of proteins are being discovered with a remarkable and somewhat surprising feature, a knot in their native structures. How the polypeptide chain is able to “knot” itself during the folding process to form these highly intricate protein topologies is not known. Here we perform a computational study on the 160-amino acid homodimeric protein YibK which, like other proteins in the SpoU family of MTases, contains a deep trefoil knot in its C-terminal region. In this study, we use a coarse-grained Cα-chain representation and Langevin dynamics to study folding kinetics. We find that specific, attractive nonnative interactions are critical for knot formation. In the absence of these interactions, i.e. in an energetics driven entirely by native interactions, knot formation is exceedingly unlikely. Further, we find, in concert with recent experimental data on YibK, two parallel folding pathways which we attribute to an early and a late formation of the trefoil knot, respectively. For both pathways, knot formation occurs before dimerization. A bioinformatics analysis of the SpoU family of proteins reveals further that the critical nonnative interactions may originate from evolutionary conserved hydrophobic segments around the knotted region.
We use molecular dynamics simulations of a full atomistic Gō model to explore the impact of selected DE-loop mutations (D59P and W60C) on the folding space of protein human β2-microglobulin (Hβ2m), the causing agent of dialysis-related amyloidosis, a conformational disorder characterized by the deposition of insoluble amyloid fibrils in the osteoarticular system. Our simulations replicate the effect of mutations on the thermal stability that is observed in experiments in vitro. Furthermore, they predict the population of a partially folded state, with 60% of native internal free energy, which is akin to a molten globule. In the intermediate state, the solvent accessible surface area increases up to 40 times relative to the native state in 38% of the hydrophobic core residues, indicating that the identified species has aggregation potential. The intermediate state preserves the disulfide bond established between residue Cys25 and residue Cys80, which helps maintain the integrity of the core region, and is characterized by having two unstructured termini. The movements of the termini dominate the essential modes of the intermediate state, and exhibit the largest displacements in the D59P mutant, which is the most aggregation prone variant. PROPKA predictions of pKa suggest that the population of the intermediate state may be enhanced at acidic pH explaining the larger amyloidogenic potential observed in vitro at low pH for the WT protein and mutant forms.
intermediate states; molten globule; folding pathways; discrete molecular dynamics; principal component analysis; dialysis-related amyloidosis
The specificity-determining residue database (SDR database) presents residue positions where mutations are predicted to have changed protein function in large protein families. Because the database pre-calculates predictions on existing protein sequence alignments, users can quickly find the predictions by selecting the appropriate protein family or searching by protein sequence. Predictions can be used to guide mutagenesis or to gain a better understanding of specificity changes in a protein family. The database is available on the web at http://paradox.harvard.edu/sdr.
Since divergence ∼50 Ma ago from their terrestrial ancestors, cetaceans underwent a series of adaptations such as a ∼10–20 fold increase in myoglobin (Mb) concentration in skeletal muscle, critical for increasing oxygen storage capacity and prolonging dive time. Whereas the O2-binding affinity of Mbs is not significantly different among mammals (with typical oxygenation constants of ∼0.8–1.2 µM−1), folding stabilities of cetacean Mbs are ∼2–4 kcal/mol higher than for terrestrial Mbs. Using ancestral sequence reconstruction, maximum likelihood and Bayesian tests to describe the evolution of cetacean Mbs, and experimentally calibrated computation of stability effects of mutations, we observe accelerated evolution in cetaceans and identify seven positively selected sites in Mb. Overall, these sites contribute to Mb stabilization with a conditional probability of 0.8. We observe a correlation between Mb folding stability and protein abundance, suggesting that a selection pressure for stability acts proportionally to higher expression. We also identify a major divergence event leading to the common ancestor of whales, during which major stabilization occurred. Most of the positively selected sites that occur later act against other destabilizing mutations to maintain stability across the clade, except for the shallow divers, where late stability relaxation occurs, probably due to the shorter aerobic dive limits of these species. The three main positively selected sites 66, 5, and 35 undergo changes that favor hydrophobic folding, structural integrity, and intra-helical hydrogen bonds.
In this work, we identify positive selection in cetacean myoglobins and an early, significant divergence event. While O2-binding is nearly unchanged, positive selection acts to introduce and later maintain stability. Stability correlates with abundance across the species, supporting that selection for increased stability concurred with the known 10–20 fold increase in myoglobin abundance of cetaceans relative to terrestrial mammals, which itself resulted from speciation towards longer dive lengths of the animals. We suggest that this selection acted to keep constant the otherwise increasing number of unfolded Mb. Altogether, this work for the first time links protein phenotype (stability and abundance) in a specific, real protein to organism-level evolution and fitness of mammals.
The consistent observation across all kingdoms of life that highly abundant proteins evolve slowly demonstrates that cellular abundance is a key determinant of protein evolutionary rate. However, other empirical findings, such as the broad distribution of evolutionary rates, suggest that additional variables determine the rate of protein evolution. Here, we report that under the global selection against the cytotoxic effects of misfolded proteins, folding stability (ΔG), simultaneous with abundance, is a causal variable of evolutionary rate. Using both theoretical analysis and multiscale simulations, we demonstrate that the anticorrelation between the pre-mutation ΔG and the arising mutational effect (ΔΔG), purely biophysical in origin, is a necessary requirement for abundance–evolutionary rate covariation. Additionally, we predict and demonstrate in bacteria that the strength of abundance–evolutionary rate correlation depends on the divergence time separating reference genomes. Altogether, these results highlight the intrinsic role of protein biophysics in the emerging universal patterns of molecular evolution.
Significant progress has been made in recent years in a variety of seemingly unrelated fields such as sequencing, protein structure prediction, and high-throughput transcriptomics and metabolomics. At the same time new microscopic models were developed that made it possible to analyze evolution of genes and genomes from first principles. The results from these efforts enable, for the first time, a comprehensive insight into the evolution of complex systems and organisms on all scales – from sequences to organisms and populations. Every newly sequenced genome uncovers new genes, families, and folds. Where do these new genes come from? How does gene duplication and subsequent divergence of sequence and structure affect the fitness of the organism? What role does regulation play in the evolution of proteins and folds? Emerging synergism between data and modeling provide first robust answers to these questions.
Despite progresses in ancestral protein sequence reconstruction, much needs to be unraveled about the nature of the putative last common ancestral proteome that served as the prototype of all extant lifeforms. Here, we present data that indicate a steady decline (oil escape) in proteome hydrophobicity over species evolvedness (node number) evident in 272 diverse proteomes, which indicates a highly hydrophobic (oily) last common ancestor (LCA). This trend, obtained from simple considerations (free from sequence reconstruction methods), was corroborated by regression studies within homologous and orthologous protein clusters as well as phylogenetic estimates of the ancestral oil content. While indicating an inherent irreversibility in molecular evolution, oil escape also serves as a rare and universal reaction-coordinate for evolution (reinforcing Darwin's principle of Common Descent), and may prove important in matters such as (i) explaining the emergence of intrinsically disordered proteins, (ii) developing composition- and speciation-based “global” molecular clocks, and (iii) improving the statistical methods for ancestral sequence reconstruction.
Although of importance to both evolution and protein design, the manner in which the first proteome came to be, and the actual features of the earliest ancestral proteomes are both unknown. Through the analysis of diverse proteomes, we provide glimpses into the composition of the last common ancestor (LUCA) of all lifeforms, which indicate that the earliest/last common ancestor had a proteome that was highly hydrophobic/oily. Notably, the evidence presented (a) indicates that proteomes of all species ranging from bacteria to mammals appear to adhere to the same universal constraint (“oil escape”) set into motion by the last common ancestor more than 3.5 billion years ago, (b) indicates the presence of a previously untapped global (composition-level) molecular clock, and (c) strengthens the non-equilibrium/directional view of amino acid substitutions that challenges central dogmas regarding reversibility in molecular evolution.
We report a set of atomistic folding/unfolding simulations for the hairpin ribozyme using a monte carlo algorithm. The hairpin ribozyme folds in solution and catalyzes self-cleavage or ligation via a specific two-domain structure. The minimal active ribozyme has been studied extensively, showing stabilization of the active structure by cations and dynamic motion of the active structure. Here we introduce a simple model of tertiary structure formation that leads to a phase diagram for the RNA as a function of temperature and tertiary structure strength. We then employ this model to capture many folding/unfolding events and to examine the transition state ensemble (TSE) of the RNA during folding to its active “docked” conformation. The TSE is compact but with few tertiary interactions formed, in agreement with single-molecule dynamics experiments. To compare with experimental kinetic parameters we introduce a novel method to benchmark monte carlo kinetic parameters to docking/undocking rates collected over many single molecular trajectories. We find that topology alone, as encoded in a biased potential which discriminates between secondary and tertiary interactions, is sufficient to predict the thermodynamic behavior and kinetic folding pathway of the hairpin ribozyme. This method should be useful in predicting folding transition states for many natural or man-made RNA tertiary structures.
Reproduction is inherently risky, in part because genomic replication can introduce new mutations that are usually deleterious toward fitness. This risk is especially severe for organisms whose genomes replicate “semi-conservatively,” e.g. viruses and bacteria, where no master copy of the genome is preserved. Lethal mutagenesis refers to extinction of populations due to an unbearably high mutation rate (U), and is important both theoretically and clinically, where drugs can extinguish pathogens by increasing their mutation rate. Previous theoretical models of lethal mutagenesis assume infinite population size (N). However, in addition to high U, small N can accelerate extinction by strengthening genetic drift and relaxing selection. Here, we examine how the time until extinction depends jointly on N and U. We first analytically compute the mean time until extinction (τ) in a simplistic model where all mutations are either lethal or neutral. The solution motivates the definition of two distinct regimes: a survival phase and an extinction phase, which differ dramatically in both how τ scales with N and in the coefficient of variation in time until extinction. Next, we perform stochastic population-genetics simulations on a realistic fitness landscape that both (i) features an epistatic distribution of fitness effects that agrees with experimental data on viruses and (ii) is based on the biophysics of protein folding. More specifically, we assume that mutations inflict fitness penalties proportional to the extent that they unfold proteins. We find that decreasing N can cause phase transition-like behavior from survival to extinction, which motivates the concept of “lethal isolation.” Furthermore, we find that lethal mutagenesis and lethal isolation interact synergistically, which may have clinical implications for treating infections. Broadly, we conclude that stably folded proteins are only possible in ecological settings that support sufficiently large populations.
Most spontaneous mutations hurt organismal fitness, e.g. by destabilizing proteins. In many species, the normal mutation rate is strikingly high: on the order of one per genome per replication. In the face of these mutations, how can proteins maintain their native structure, and how can populations of organisms avoid extinction? Are there physics-based limits on how large the mutation rate of any species can be before the onslaught of mutations outpaces natural selection and melts-down proteins? Here, we address these questions with a computational model that combines protein folding thermodynamics with individual-based population genetics simulations. We calculate a theoretical “speed limit” equal to a few mutations per genome per replication—near the mutation rate of RNA viruses. Additionally, we find that the speed limit can be much lower in small populations where “random genetic drift” is strong. Thus, we conclude that stably folded proteins are only possible in ecological settings that support sufficiently large populations. These findings may have clinical implications for treating viral infections with drugs that elevate the viral mutation rate.
In this work, we apply a detailed all-atom model with a transferable knowledge-based potential to study the folding kinetics of Formin-Binding protein, FBP28, which is a canonical three-stranded β-sheet WW domain. Replica exchange Monte Carlo (REMC) simulations starting from random coils find native-like (C α RMSD of 2.68Å) lowest energy structure. We also study the folding kinetics of FBP28 WW domain by performing a large number of ab initio Monte Carlo folding simulations. Using these trajectories, we examine the order of formation of two β –hairpins, the folding mechanism of each individual β– hairpin, and transition state ensemble (TSE) of FBP28 WW domain and compare our results with experimental data and previous computational studies. To obtain detailed structural information on the folding dynamics viewed as an ensemble process, we perform a clustering analysis procedure based on graph theory. Further, a rigorous Pfold analysis is used to obtain representative samples of the TSEs showing good quantitative agreement between experimental and simulated Φ values. Our analysis shows that the turn structure between first and second β strands is a partially stable structural motif that gets formed before entering the TSE in FBP28 WW domain and there exist two major pathways for the folding of FBP28 WW domain, which differ in the order and mechanism of hairpin formation.
transition state ensemble; protein folding; β-strand; β-hairpin; β-sheet; Φ-value analysis; Pfold analysis
The population dynamics theory of B cells in a typical germinal center could play an important role in revealing how affinity maturation is achieved. However, the existing models encountered some conflicts with experiments. To resolve these conflicts, we present a coarse-grained model to calculate the B cell population development in affinity maturation, which allows a comprehensive analysis of its parameter space to look for optimal values of mutation rate, selection strength, and initial antibody-antigen binding level that maximize the affinity improvement. With these optimized parameters, the model is compatible with the experimental observations such as the ∼100-fold affinity improvements, the number of mutations, the hypermutation rate, and the “all or none” phenomenon. Moreover, we study the reasons behind the optimal parameters. The optimal mutation rate, in agreement with the hypermutation rate in vivo, results from a tradeoff between accumulating enough beneficial mutations and avoiding too many deleterious or lethal mutations. The optimal selection strength evolves as a balance between the need for affinity improvement and the requirement to pass the population bottleneck. These findings point to the conclusion that germinal centers have been optimized by evolution to generate strong affinity antibodies effectively and rapidly. In addition, we study the enhancement of affinity improvement due to B cell migration between germinal centers. These results could enhance our understanding of the functions of germinal centers.
The antibodies in our immune system could efficiently improve their abilities in recognizing new antigens. This is done with the help of proliferation, mutation and selection of B cells which carry antibodies, but we have difficulties in developing a quantitative description of this adaptation process which is consistent with the various aspects of experimental observations. Based on the knowledge from experiments, here we present a theoretical model to calculate the numbers of B cells with different antigen recognizing abilities all the time, and look for the best possible design that improves the antigen recognizing ability most efficiently. We find that the best possible design is consistent with the experimental observations, pointing to the conclusion that the immune system has been optimized in evolution. We then study the trade-offs leading to the optimization of the design. The results will not only improve our understanding of the functions in immune system, but also reveal the design principles behind the details. In addition, the study enhances our understanding of the population dynamics in evolution.
Mutators are clones whose mutation rate is about two to three orders of magnitude higher than the rate of wild-type clones and their roles in adaptive evolution of asexual populations have been controversial. Here we address this problem by using an ab initio microscopic model of living cells, which combines population genetics with a physically realistic presentation of protein stability and protein-protein interactions. The genome of model organisms encodes replication controlling genes (RCGs) and genes modeling the mismatch repair (MMR) complexes. The genotype-phenotype relationship posits that the replication rate of an organism is proportional to protein copy numbers of RCGs in their functional form and there is a production cost penalty for protein overexpression. The mutation rate depends linearly on the concentration of homodimers of MMR proteins. By simulating multiple runs of evolution of populations under various environmental stresses—stationary phase, starvation or temperature-jump—we find that adaptation most often occurs through transient fixation of a mutator phenotype, regardless of the nature of stress. By contrast, the fixation mechanism does depend on the nature of stress. In temperature jump stress, mutators take over the population due to loss of stability of MMR complexes. In contrast, in starvation and stationary phase stresses, a small number of mutators are supplied to the population via epigenetic stochastic noise in production of MMR proteins (a pleiotropic effect), and their net supply is higher due to reduced genetic drift in slowly growing populations under stressful environments. Subsequently, mutators in stationary phase or starvation hitchhike to fixation with a beneficial mutation in the RCGs, (second order selection) and finally a mutation stabilizing the MMR complex arrives, returning the population to a non-mutator phenotype. Our results provide microscopic insights into the rise and fall of mutators in adapting finite asexual populations.
The dramatic rise of mutators has been found to accompany adaptation of bacteria in response to many kinds of stress. Two views on the evolutionary origin of this phenomenon emerged: the pleiotropic hypothesis positing that it is a byproduct of environmental stress or other specific stress response mechanisms and the second order selection which states that mutators hitchhike to fixation with unrelated beneficial alleles. Conventional population genetics models could not fully resolve this controversy because they are based on certain assumptions about fitness landscape. Here we address this problem using a microscopic multiscale model, which couples physically realistic molecular descriptions of proteins and their interactions with population genetics of carrier organisms without assuming any a priori mutational effect on fitness landscape. We found that both pleiotropy and second order selection play a crucial role at different stages of adaptation: the supply of mutators is provided through destabilization of error correction complexes or, alternatively, fluctuations of production levels of prototypic mismatch repair proteins (pleiotropic effects), while the rise and fixation of mutators occurs when there is a sufficient supply of beneficial mutations in replication-controlling genes. This general mechanism assures a robust and reliable adaptation of organisms to unforeseen challenges. This study highlights physical principles underlying biological mechanisms of stress response and adaptation.
Crowded intracellular environments present a challenge for proteins to form functional specific complexes while reducing non-functional interactions with promiscuous non-functional partners. Here we show how the need to minimize the waste of resources to non-functional interactions limits the proteome diversity and the average concentration of co-expressed and co-localized proteins. Using the results of high-throughput Yeast 2-Hybrid experiments, we estimate the characteristic strength of non-functional protein–protein interactions. By combining these data with the strengths of specific interactions, we assess the fraction of time proteins spend tied up in non-functional interactions as a function of their overall concentration. This allows us to sketch the phase diagram for baker's yeast cells using the experimentally measured concentrations and subcellular localization of their proteins. The positions of yeast compartments on the phase diagram are consistent with our hypothesis that the yeast proteome has evolved to operate closely to the upper limit of its size, whereas keeping individual protein concentrations sufficiently low to reduce non-functional interactions. These findings have implication for conceptual understanding of intracellular compartmentalization, multicellularity and differentiation.
non-functional interaction; protein–protein interaction; proteome size; yeast cytoplasm
In this work we develop a microscopic physical model of early evolution where phenotype—organism life expectancy—is directly related to genotype—the stability of its proteins in their native conformations—which can be determined exactly in the model. Simulating the model on a computer, we consistently observe the “Big Bang” scenario whereby exponential population growth ensues as soon as favorable sequence–structure combinations (precursors of stable proteins) are discovered. Upon that, random diversity of the structural space abruptly collapses into a small set of preferred proteins. We observe that protein folds remain stable and abundant in the population at timescales much greater than mutation or organism lifetime, and the distribution of the lifetimes of dominant folds in a population approximately follows a power law. The separation of evolutionary timescales between discovery of new folds and generation of new sequences gives rise to emergence of protein families and superfamilies whose sizes are power-law distributed, closely matching the same distributions for real proteins. On the population level we observe emergence of species—subpopulations that carry similar genomes. Further, we present a simple theory that relates stability of evolving proteins to the sizes of emerging genomes. Together, these results provide a microscopic first-principles picture of how first-gene families developed in the course of early evolution.
Here, we address the question of how Darwinian evolution of organisms determines molecular evolution of their proteins and genomes. We developed a microscopic ab initio model of early biological evolution where the fitness (essentially lifetime) of an organism is explicitly related to the evolving sequences of its proteins. The main assumption of the model is that the death rate of an organism is determined by the stability of the least stable of their proteins. A lattice model is used to calculate stability of all proteins in a genome from their amino acid sequence. The simulation of the model starts from 100 identical organisms, each carrying the same random gene, and proceeds via random mutations, gene duplication, organism births via replication, and organism deaths. We find that exponential population growth is possible only after the discovery of a very small number of specific advantageous protein structures. The number of genes in the evolving organisms depends on the mutation rate, demonstrating the intricate relationship between the genome sizes and protein stability requirements. Further, the model explains the observed power-law distributions of protein family and superfamily sizes, as well as the scale-free character of protein structural similarity graphs. Together, these results and their analysis suggest a plausible comprehensive scenario of emergence of the protein universe in early biological evolution.
The aim of this work is to elucidate how physical principles of protein design are reflected in natural sequences that evolved in response to the thermal conditions of the environment. Using an exactly solvable lattice model, we design sequences with selected thermal properties. Compositional analysis of designed model sequences and natural proteomes reveals a specific trend in amino acid compositions in response to the requirement of stability at elevated environmental temperature: the increase of fractions of hydrophobic and charged amino acid residues at the expense of polar ones. We show that this “from both ends of the hydrophobicity scale” trend is due to positive (to stabilize the native state) and negative (to destabilize misfolded states) components of protein design. Negative design strengthens specific repulsive non-native interactions that appear in misfolded structures. A pressure to preserve specific repulsive interactions in non-native conformations may result in correlated mutations between amino acids that are far apart in the native state but may be in contact in misfolded conformations. Such correlated mutations are indeed found in TIM barrel and other proteins.
What mechanisms does Nature use in her quest for thermophilic proteins? It is known that stability of a protein is mainly determined by the energy gap, or the difference in energy, between native state and a set of incorrectly folded (misfolded) conformations. Here we show that Nature makes thermophilic proteins by widening this gap from both ends. The energy of the native state of a protein is decreased by selecting strongly attractive amino acids at positions that are in contact in the native state (positive design). Simultaneously, energies of the misfolded conformations are increased by selection of strongly repulsive amino acids at positions that are distant in native structure; however, these amino acids will interact repulsively in the misfolded conformations (negative design). These fundamental principles of protein design are manifested in the “from both ends of the hydrophobicity scale” trend observed in thermophilic adaptation, whereby proteomes of thermophilic proteins are enriched in extreme amino acids—hydrophobic and charged—at the expense of polar ones. Hydrophobic amino acids contribute mostly to the positive design, while charged amino acids that repel each other in non-native conformations of proteins contribute to negative design. Our results provide guidance in rational design of proteins with selected thermal properties.
Protein–DNA interactions are vital for many processes in living cells, especially transcriptional regulation and DNA modification. To further our understanding of these important processes on the microscopic level, it is necessary that theoretical models describe the macromolecular interaction energetics accurately. While several methods have been proposed, there has not been a careful comparison of how well the different methods are able to predict biologically important quantities such as the correct DNA binding sequence, total binding free energy and free energy changes caused by DNA mutation. In addition to carrying out the comparison, we present two important theoretical models developed initially in protein folding that have not yet been tried on protein–DNA interactions. In the process, we find that the results of these knowledge-based potentials show a strong dependence on the interaction distance and the derivation method. Finally, we present a knowledge-based potential that gives comparable or superior results to the best of the other methods, including the molecular mechanics force field AMBER99.
There have been considerable attempts in the past to relate phenotypic trait—habitat temperature of organisms—to their genotypes, most importantly compositions of their genomes and proteomes. However, despite accumulation of anecdotal evidence, an exact and conclusive relationship between the former and the latter has been elusive. We present an exhaustive study of the relationship between amino acid composition of proteomes, nucleotide composition of DNA, and optimal growth temperature (OGT) of prokaryotes. Based on 204 complete proteomes of archaea and bacteria spanning the temperature range from −10 °C to 110 °C, we performed an exhaustive enumeration of all possible sets of amino acids and found a set of amino acids whose total fraction in a proteome is correlated, to a remarkable extent, with the OGT. The universal set is Ile, Val, Tyr, Trp, Arg, Glu, Leu (IVYWREL), and the correlation coefficient is as high as 0.93. We also found that the G + C content in 204 complete genomes does not exhibit a significant correlation with OGT (R = −0.10). On the other hand, the fraction of A + G in coding DNA is correlated with temperature, to a considerable extent, due to codon patterns of IVYWREL amino acids. Further, we found strong and independent correlation between OGT and the frequency with which pairs of A and G nucleotides appear as nearest neighbors in genome sequences. This adaptation is achieved via codon bias. These findings present a direct link between principles of proteins structure and stability and evolutionary mechanisms of thermophylic adaptation. On the nucleotide level, the analysis provides an example of how nature utilizes codon bias for evolutionary adaptation to extreme conditions. Together these results provide a complete picture of how compositions of proteomes and genomes in prokaryotes adjust to the extreme conditions of the environment.
Prokaryotes living at extreme environmental temperatures exhibit pronounced signatures in the amino acid composition of their proteins and the nucleotide compositions of their genomes, reflective of adaptation to their thermal environments. However, despite significant efforts, the definitive answer of what are the genomic and proteomic compositional determinants of optimal growth temperature (OGT) of prokaryotic organisms remained elusive. Here we performed a comprehensive analysis of amino acid and nucleotide compositional signatures of thermophylic adaptation by exhaustively evaluating all combinations of amino acids and nucleotides as possible determinants of OGT for all prokaryotic organisms with fully sequenced genomes. We discovered that total concentration of seven amino acids in proteomes—IVYWREL—serves as a universal proteomic predictor of OGT in prokaryotes. Resolving the old-standing controversy, we determined that the variation in nucleotide composition (increase of purine load, or A + G content with temperature) is largely a consequence of thermal adaptation of proteins. However, the frequency with which A and G nucleotides appear as nearest neighbors in genome sequences is strongly and independently correlated with OGT as a result of codon bias in corresponding genomes. Together these results provide a complete picture of proteomic and genomic determinants of thermophilic adaptation.
The number of protein structures from structural genomics centers dramatically increases in the Protein Data Bank (PDB). Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to successfully infer function using only structural similarity.
Here we present the PDB-UF database, a web-accessible collection of predictions of enzymatic properties using structure-function relationship. The assignments were conducted for three-dimensional protein structures of unknown function that come from structural genomics initiatives. We show that 4 hypothetical proteins (with PDB accession codes: 1VH0, 1NS5, 1O6D, and 1TO0), for which standard BLAST tools such as PSI-BLAST or RPS-BLAST failed to assign any function, are probably methyltransferase enzymes.
We suggest that the structure-based prediction of an EC number should be conducted having the different similarity score cutoff for different protein folds. Moreover, performing the annotation using two different algorithms can reduce the rate of false positive assignments. We believe, that the presented web-based repository will help to decrease the number of protein structures that have functions marked as "unknown" in the PDB file.