Related Articles

1.  Impact of translational error-induced and error-free misfolding on the rate of protein evolution 
Theoretical calculations suggest that, in addition to translational error-induced protein misfolding, a non-negligible fraction of misfolded proteins are error free. We propose that the anticorrelation between the expression level of a protein and its rate of sequence evolution be explained by an overarching protein-misfolding-avoidance hypothesis that includes selection against both error-induced and error-free protein misfolding, and verify this model by a molecular-level evolutionary simulation. We provide strong empirical evidence for the protein-misfolding-avoidance hypothesis, including a positive correlation between protein expression level and stability, enrichment of misfolding-minimizing codons and amino acids in highly expressed genes, and stronger evolutionary conservation of residues in which nonsynonymous changes are more likely to increase protein misfolding.
The rate of protein sequence evolution has long been of central interest to molecular evolutionists. Different proteins of the same species evolve at vastly different rates, which is commonly explained by a variation in functional constraint among different proteins (Kimura and Ohta, 1974). However, it is unclear how to quantify the functional constraint of a protein from the knowledge of its function. In the past decade, various types of genomic data from model organisms have been examined to look for the determinants of the rate of protein sequence evolution. The most unexpected discovery was a very strong anticorrelation between the expression level and evolutionary rate of a protein (E–R anticorrelation) (Pal et al, 2001). The prevailing explanation of the E–R anticorrelation is the translational robustness hypothesis (Drummond et al, 2005). This hypothesis posits that mistranslation induces protein misfolding, which is toxic to cells (Figure 1). Consequently, highly expressed proteins are under stronger pressures to be translationally robust and thus are more constrained in sequence evolution. However, the impact of the other source of misfolded proteins, translational error-free proteins (Figure 1), has not been evaluated. By theoretical calculation, computer simulation, and empirical data analysis, we examined the role of selection against both error-induced and error-free protein misfolding in creating the E–R correlation.
Our theoretical calculations suggested that a non-negligible fraction of misfolded proteins are error free. We estimated that when a protein is not very stable, on average ∼20% of misfolded molecules are error free. However, when a protein is very stable, this fraction reduces to ∼5%, which is probably a result of natural selection against protein misfolding.
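The share of misfolded molecules that are error free can be illustrated with a simple two-class mixture calculation. The sketch below is only an illustration of that arithmetic, not the calculation used in the study: the function name `error_free_share` and all three input values are hypothetical and were chosen only to make the numbers easy to follow.

```python
def error_free_share(p_error, misfold_error_free, misfold_with_error):
    """Fraction of misfolded molecules that carry no translation error.

    p_error            -- probability that a molecule contains at least one
                          translation error (hypothetical input)
    misfold_error_free -- misfolding probability of an error-free molecule
    misfold_with_error -- misfolding probability of an error-containing molecule
    """
    # Misfolded molecules that were translated without error
    free = (1.0 - p_error) * misfold_error_free
    # Misfolded molecules that contain at least one translation error
    induced = p_error * misfold_with_error
    return free / (free + induced)


# Hypothetical inputs, not taken from the study
share = error_free_share(p_error=0.2,
                         misfold_error_free=0.01,
                         misfold_with_error=0.16)
print(f"error-free share of misfolded molecules: {share:.2f}")
```

Under these assumed inputs, lowering `misfold_error_free` (a more stable native protein) shrinks the error-free share, which is the direction of the trend described above.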
We conducted a molecular-level evolutionary simulation (Figure 2A) using three different schemes: error-induced misfolding only, error-free misfolding only, and both types of misfolding. As expected, results from the first simulation are similar to those from a previous study that considers only error-induced misfolding (Drummond and Wilke, 2008). Interestingly, the second and third simulations can also generate the same patterns, including a positive correlation between the protein expression level and the unfolding energy (ΔG) of the error-free protein (Figure 2B), a negative correlation between the expression level and the fraction of protein molecules that misfold after being mistranslated (Figure 2C), a negative correlation between ΔG and the evolutionary rate (Figure 2D), and a negative correlation between the expression level and the evolutionary rate (i.e., the E–R anticorrelation) (Figure 2E). Furthermore, we found that selection against protein misfolding is more effective in reducing error-free misfolding than error-induced misfolding.
Based on these results, we propose that an overarching protein-misfolding-avoidance hypothesis that includes both sources of misfolding is superior to the prevailing translational robustness hypothesis, which considers only error-induced misfolding. We tested three key predictions of the protein-misfolding-avoidance hypothesis using yeast data. First, we showed that, consistent with our prediction, a positive correlation exists between the protein expression level and stability, which is measured by the unfolding energy or melting temperature. In addition, protein expression level is negatively correlated with protein aggregation propensity. Second, we found that codons minimizing protein misfolding are used more frequently in highly expressed proteins than in lowly expressed ones. Third, we showed that, within the same protein, amino acid residues in which random nonsynonymous mutations are more likely to increase protein misfolding are evolutionarily more conserved.
Together, these results provide unambiguous evidence that avoidance of both error-induced and error-free protein misfolding is a major source of the E–R anticorrelation and that protein stability and mistranslation have important roles in protein evolution.
What determines the rate of protein evolution is a fundamental question in biology. Recent genomic studies revealed a surprisingly strong anticorrelation between the expression level of a protein and its rate of sequence evolution. This observation is currently explained by the translational robustness hypothesis in which the toxicity of translational error-induced protein misfolding selects for higher translational robustness of more abundant proteins, which constrains sequence evolution. However, the impact of error-free protein misfolding has not been evaluated. We estimate that a non-negligible fraction of misfolded proteins are error free and demonstrate by a molecular-level evolutionary simulation that selection against protein misfolding results in a greater reduction of error-free misfolding than error-induced misfolding. Thus, an overarching protein-misfolding-avoidance hypothesis that includes both sources of misfolding is superior to the translational robustness hypothesis. We show that misfolding-minimizing amino acids are preferentially used in highly abundant yeast proteins and that these residues are evolutionarily more conserved than other residues of the same proteins. These findings provide unambiguous support to the role of protein-misfolding-avoidance in determining the rate of protein sequence evolution.
PMCID: PMC2990641  PMID: 20959819
evolutionary rate; expression level; mistranslation; protein misfolding
2.  Tissue-Specific Evolution of Protein Coding Genes in Human and Mouse 
PLoS ONE  2015;10(6):e0131673.
Protein-coding genes evolve at different rates, and the influence of different parameters, from gene size to expression level, has been extensively studied. While in yeast gene expression level is the major causal factor of gene evolutionary rate, the situation is more complex in animals. Here we investigate these relations further, especially taking into account gene expression in different organs as well as indirect correlations between parameters. We used RNA-seq data from two large datasets, covering 22 mouse tissues and 27 human tissues. Over all tissues, evolutionary rate only correlates weakly with levels and breadth of expression. The strongest explanatory factors of purifying selection are GC content, expression in many developmental stages, and expression in brain tissues. While the main component of evolutionary rate is purifying selection, we also find tissue-specific patterns for sites under neutral evolution and for positive selection. We observe fast evolution of genes expressed in testis, but also in other tissues, notably liver, which are explained by weak purifying selection rather than by positive selection.
PMCID: PMC4488272  PMID: 26121354
3.  An interdomain sector mediating allostery in Hsp70 molecular chaperones 
The Hsp70 family of molecular chaperones provides a well defined and experimentally powerful model system for understanding allosteric coupling between different protein domains. New extensions to the statistical coupling analysis (SCA) method permit identification of a group of co-evolving amino-acid positions—a sector—in the Hsp70 that is associated with allosteric function. Literature-based and new experimental studies support the notion that the protein sector identified through SCA underlies the allosteric mechanism of Hsp70. This work extends the concept of protein sectors by showing that two non-homologous protein domains can share a single sector when the underlying biological function is defined by the coupled activity of the two domains.
Allostery is a biologically critical property by which distantly positioned functional surfaces on proteins functionally interact. This property remains difficult to elucidate at a mechanistic level (Smock and Gierasch, 2009) because long-range coupling within proteins arises from the cooperative action of groups of amino acids. As a case study, consider the Hsp70 molecular chaperones, a large and diverse family of two-domain allosteric proteins required for cellular viability in nearly every organism (Figure 1) (Mayer and Bukau, 2005). In the ADP-bound state, the two domains act independently, the C-terminal substrate-binding domain displays a stable configuration in which the so-called ‘lid' region is docked against the β-sandwich subdomain, and substrates bind with relatively high affinity (Figure 1A) (Moro et al, 2003; Swain et al, 2007; Bertelsen et al, 2009). Exchange of ADP for ATP in the N-terminal nucleotide-binding domain causes significant local and propagated conformational change, formation of an interface with the substrate-binding domain, opening of the lid subdomain, and a decrease in the binding affinity for substrates (Figure 1B) (Rist et al, 2006; Swain et al, 2007). Upon ATP hydrolysis by the nucleotide-binding domain, Hsp70 is returned to the ADP-bound configuration suitable for another round of substrate binding and release. This process of cyclical substrate binding and release underlies all biological functions of Hsp70 proteins.
What is the structural basis for the long-range functional coupling within Hsp70? When allostery is a conserved property of a protein family, one approach to this problem is to analyze the correlated evolution of amino acids in the family—the expected statistical signature of cooperative action of protein residues (Lockless and Ranganathan, 1999; Kass and Horovitz, 2002; Suel et al, 2003). Previous work using an implementation of this concept (the statistical coupling analysis or SCA) showed that proteins contain sparse networks of co-evolving amino acids termed ‘sectors' that link protein active sites with distinct functional surfaces through the protein core (Halabi et al, 2009). This architecture is consistent with known allosteric mechanisms in protein domains (Suel et al, 2003; Halabi et al, 2009).
However, the principle of co-evolution of protein residues need not be limited to the study of individual protein domains. Indeed, conserved allosteric coupling between two (or more) non-homologous domains implies the existence of shared sectors that span functional sites on different domains. Here, we test this concept by extending the SCA method to consider the allosteric mechanism acting between the two domains of the Hsp70 proteins. Hsp70-like proteins include not only the allosteric Hsp70s, but also the Hsp110s—homologs that contain both domains and are regarded as structural models for Hsp70s, but that do not exhibit allosteric coupling. In this study, we take advantage of the functional divergence between the Hsp70s and Hsp110s to reveal patterns of co-evolution between amino acids that are specifically associated with the allosteric mechanism.
To identify the allosteric sector in Hsp70, we used SCA to compute a weighted correlation matrix, C̃, that describes the co-evolution of every pair of amino-acid positions in a sequence alignment of 926 members of the Hsp70/110 family. We then applied a mathematical method known as singular value decomposition to simultaneously evaluate the pattern of divergence between sequences and the pattern of co-evolution between amino-acid positions. The basic idea is that if the pattern of sequence divergence is able to classify members of a protein family into distinct functional subgroups, then we can rigorously identify the group of co-evolving residues that correspond to the underlying mechanism. Figure 2A shows the principal axis of sequence variation in the Hsp70/110 family, showing a clear separation of the allosteric (Hsp70) and non-allosteric (Hsp110) members of this family. The corresponding axis of co-evolution between amino-acid positions reveals a subset of Hsp70/110 positions (∼20%, 115 residues out of 605 total) that underlie the divergence of Hsp70 and Hsp110 proteins (Figure 2B). These positions derive roughly equally from the nucleotide-binding domain (in blue, 56 positions) and the substrate-binding domain (in green, 59 positions) and are more conserved within the Hsp70 sub-family. These results define a protein sector that is predicted to underlie the allosteric mechanism of Hsp70.
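The idea of reading co-evolving positions off the leading axis of a positional correlation matrix can be sketched on a toy alignment. This is a heavily simplified stand-in, not the SCA method itself: it uses an unweighted covariance of consensus/non-consensus indicators rather than the weighted matrix C̃, it finds only the top eigenvector of that symmetric matrix by power iteration, and it ignores the sequence-divergence axis entirely. The alignment, the function names, and the 0.5 loading threshold are all hypothetical.

```python
from collections import Counter


def binarize(alignment):
    """1.0 where a sequence carries the consensus residue at a position, else 0.0."""
    n_pos = len(alignment[0])
    consensus = [Counter(seq[i] for seq in alignment).most_common(1)[0][0]
                 for i in range(n_pos)]
    return [[1.0 if seq[i] == consensus[i] else 0.0 for i in range(n_pos)]
            for seq in alignment]


def covariance(rows):
    """Unweighted covariance between positions (columns) across sequences (rows)."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[i] for r in rows) / n for i in range(m)]
    return [[sum(r[i] * r[j] for r in rows) / n - mean[i] * mean[j]
             for j in range(m)] for i in range(m)]


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:            # no variation at any position
            return [0.0] * len(matrix)
        v = [x / norm for x in w]
    return v


def sector_positions(alignment, threshold=0.5):
    """Positions whose loading on the leading eigenvector reaches the threshold."""
    v = top_eigenvector(covariance(binarize(alignment)))
    return [i for i, x in enumerate(v) if abs(x) >= threshold]


# Hypothetical alignment: columns 0 and 2 vary together, column 1 varies
# independently, column 3 is invariant.
toy_alignment = ["AKLG", "AKLG", "ARLG", "AKLG", "AKLG",
                 "DKEG", "DREG", "DKEG"]
print(sector_positions(toy_alignment))
```

In this toy case the two columns that change together dominate the leading eigenvector, while the independently varying and invariant columns receive small or zero loadings.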
What is the structural arrangement of the putative allosteric sector within the Hsp70 protein? Consistent with a function in allosteric coupling, the 115 sector residues form a physically contiguous network of atoms, linking the ATP-binding site on the nucleotide-binding domain to the substrate recognition site on the substrate-binding domain through the interdomain interface (Figure 2C). The physical connectivity is remarkable given that only ∼20% of overall Hsp70 residues are involved (Figure 2B). Thus, functionally coupled but non-homologous protein domains can share a single sector of co-evolving residues that connects their respective functional sites.
We compared the Hsp70 sector mapping with the large body of biochemical studies that have been carried out in this family. We find strong experimental support for the involvement of sector positions in the Hsp70 allosteric mechanism in several regions: (1) within the ATP-binding site, (2) at the interface linking the two domains, and (3) within the β-sandwich core of the substrate-binding domain. The sector analysis also makes predictions about the involvement of some previously untested residues; we show that mutations at two such sites in fact reduce the allosteric coupling within Hsp70 in vitro and fail to complement a DnaK knockout strain of E. coli in a stress-response assay. Taken together, we conclude that sector positions are associated with the allosteric mechanism of Hsp70.
This work also adds a new finding with regard to the concept of protein sectors. Previous work showed that multiple quasi-independent sectors, each of which contributes a different aspect of function, are possible within a single protein domain (Halabi et al, 2009). This work shows that a single sector can also span two different protein domains when biological function (here, nucleotide-dependent substrate binding) arises from their coupled action. This result emphasizes the point that sectors are units of functional selection and are not obviously related to traditional hierarchies of structural organization in proteins. An interesting possibility is that evolution of allostery between proteins might evolve through the joining of protein sectors, a conjecture that can be tested in future work.
Allosteric coupling between protein domains is fundamental to many cellular processes. For example, Hsp70 molecular chaperones use ATP binding by their actin-like N-terminal ATPase domain to control substrate interactions in their C-terminal substrate-binding domain, a reaction that is critical for protein folding in cells. Here, we generalize the statistical coupling analysis to simultaneously evaluate co-evolution between protein residues and functional divergence between sequences in protein sub-families. Applying this method in the Hsp70/110 protein family, we identify a sparse but structurally contiguous group of co-evolving residues called a ‘sector', which is an attribute of the allosteric Hsp70 sub-family that links the functional sites of the two domains across a specific interdomain interface. Mutagenesis of Escherichia coli DnaK supports the conclusion that this interdomain sector underlies the allosteric coupling in this protein family. The identification of the Hsp70 sector provides a basis for further experiments to understand the mechanism of allostery and introduces the idea that cooperativity between interacting proteins or protein domains can be mediated by shared sectors.
PMCID: PMC2964120  PMID: 20865007
allostery; chaperone; co-evolution; SCA; sector
4.  Selective Constraints in Experimentally Defined Primate Regulatory Regions 
PLoS Genetics  2008;4(8):e1000157.
Changes in gene regulation may be important in evolution. However, the evolutionary properties of regulatory mutations are currently poorly understood. This is partly the result of an incomplete annotation of functional regulatory DNA in many species. For example, transcription factor binding sites (TFBSs), a major component of eukaryotic regulatory architecture, are typically short, degenerate, and therefore difficult to differentiate from randomly occurring, nonfunctional sequences. Furthermore, although sites such as TFBSs can be computationally predicted using evolutionary conservation as a criterion, estimates of the true level of selective constraint (defined as the fraction of strongly deleterious mutations occurring at a locus) in regulatory regions will, by definition, be upwardly biased in datasets that are a priori evolutionarily conserved. Here we investigate the fitness effects of regulatory mutations using two complementary datasets of human TFBSs that are likely to be relatively free of ascertainment bias with respect to evolutionary conservation but, importantly, are supported by experimental data. The first is a collection of almost >2,100 human TFBSs drawn from the literature in the TRANSFAC database, and the second is derived from several recent high-throughput chromatin immunoprecipitation coupled with genomic microarray (ChIP-chip) analyses. We also define a set of putative cis-regulatory modules (pCRMs) by spatially clustering multiple TFBSs that regulate the same gene. We find that a relatively high proportion (∼37%) of mutations at TFBSs are strongly deleterious, similar to that at a 2-fold degenerate protein-coding site. However, constraint is significantly reduced in human and chimpanzee pCRMs and ChIP-chip sequences, relative to macaques. We estimate that the fraction of regulatory mutations that have been driven to fixation by positive selection in humans is not significantly different from zero.
We also find that the level of selective constraint in our TFBSs, pCRMs, and ChIP-chip sequences is negatively correlated with the expression breadth of the regulated gene, whereas the opposite relationship holds at that gene's nonsynonymous and synonymous sites. Finally, we find that the rate of protein evolution in a transcription factor appears to be positively correlated with the breadth of expression of the gene it regulates. Our study suggests that strongly deleterious regulatory mutations are considerably more likely (1.6-fold) to occur in tissue-specific than in housekeeping genes, implying that there is a fitness cost to increasing “complexity” of gene expression.
Author Summary
Changes in gene expression have been suggested to play a major role in mammalian evolution. In eukaryotes, gene expression is primarily controlled by sites, such as transcription factor binding sites (TFBSs), located in the noncoding region of the genome. The majority of these TFBSs remain unannotated, however, because they are typically short, degenerate, and laborious to identify experimentally. As a result, the effects of mutations in TFBSs on organism fitness remain poorly understood. We collected a dataset of TFBSs derived from the experimental biology literature and recent high-throughput studies to estimate the proportions of new mutations in TFBSs that have strongly deleterious and strongly beneficial effects upon organism fitness. We find that a relatively high proportion of new mutations in TFBSs are strongly deleterious, although it appears that relatively few are adaptive. We also demonstrate that the fraction of strongly deleterious regulatory mutations is correlated with the breadth of expression of the regulated gene. Thus, ubiquitously expressed genes are likely to experience fewer deleterious regulatory mutations than those expressed in a small number of tissues.
PMCID: PMC2490716  PMID: 18704158
5.  The scenario on the origin of translation in the RNA world: in principle of replication parsimony 
Biology Direct  2010;5:65.
It is now believed that in the origin of life, proteins should have been "invented" in an RNA world. However, due to the complexity of a possible RNA-based proto-translation system, this evolving process seems quite complicated and the associated scenario remains very blurry. Considering that RNA can bind amino acids with specificity, it has been reasonably supposed that initial peptides might have been synthesized on "RNA templates" containing multiple amino acid binding sites. This "Direct RNA Template (DRT)" mechanism is attractive because it should be the simplest mechanism for RNA to synthesize peptides, and is thus very likely to have been adopted initially in the RNA world. How this mechanism could then develop into a proto-translation system is an interesting problem.
Presentation of the hypothesis
Here an explanation for this problem is offered, based on the principle of "replication parsimony": genetic information tends to be utilized in a parsimonious way under selection pressure, due to its replication cost (e.g., in the RNA world, nucleotides and ribozymes for RNA replication). Because a DRT would be quite long even for a short peptide, its replication cost would be great. Thus the diversity and the length of functional peptides synthesized by the DRT mechanism would be seriously limited. Adaptors (proto-tRNAs) would arise to allow a DRT's complementary strand (called "C-DRT" here) to direct the synthesis of the same peptide synthesized by the DRT itself. Because the C-DRT is a necessary part of the DRT's replication, fewer rounds of the DRT's replication would be needed to synthesize a given number of copies of the functional peptide, thus saving the replication cost. Acting through adaptors, C-DRTs could transform into much shorter templates (called "proto-mRNAs" here) and take over the role of DRTs, thus significantly saving the replication cost. A proto-rRNA corresponding to the small subunit rRNA would then emerge to aid the binding of proto-tRNAs and proto-mRNAs, allowing the reduction of base pairs between them (ultimately resulting in the triplet anticodon/codon pair), thus further saving the replication cost. In this context, the replication cost saved would allow the appearance of more and longer functional peptides and, finally, proteins. The hypothesis could be called "DRT-RP" ("RP" for "replication parsimony").
Testing the hypothesis
The scenario described here is open for experimental work at some key scenes, including the compact DRT mechanism, the development of adaptors from aa-aptamers, the synthesis of peptides by proto-tRNAs and proto-mRNAs without the participation of proto-rRNAs, etc. Interestingly, a recent computer simulation study has demonstrated the plausibility of one of the evolving processes driven by replication parsimony in the scenario.
Implication of the hypothesis
An RNA-based proto-translation system could arise gradually from the DRT mechanism according to the principle of "replication parsimony", that is, to save the replication cost of RNA templates for functional peptides. A surprising side deduction along the logic of the hypothesis is that complex, biosynthetic amino acids might have entered the genetic code earlier than simple, prebiotic amino acids, which runs contrary to common expectation. Overall, the present discussion clarifies the blurry scenario concerning the origin of translation with a major clue, which shows vividly how life could "manage" to exploit potential chemical resources in nature, eventually in an efficient way over evolution.
This article was reviewed by Eugene V. Koonin, Juergen Brosius, and Arcady Mushegian.
PMCID: PMC3002371  PMID: 21110883
6.  Evolutionary Rate and Duplicability in the Arabidopsis thaliana Protein–Protein Interaction Network 
Genome Biology and Evolution  2012;4(12):1263-1274.
Genes show a bewildering variation in their patterns of molecular evolution, as a result of the action of different levels and types of selective forces. The factors underlying this variation are, however, still poorly understood. In the last decade, the position of proteins in the protein–protein interaction network has been put forward as a determinant factor of the evolutionary rate and duplicability of their encoding genes. This conclusion, however, has been based on the analysis of the limited number of microbes and animals for which interactome-level data are available (essentially, Escherichia coli, yeast, worm, fly, and humans). Here, we study, for the first time, the relationship between the position of proteins in the high-density interactome of a plant (Arabidopsis thaliana) and the patterns of molecular evolution of their encoding genes. We found that genes whose encoded products act at the center of the network are more evolutionarily constrained than those acting at the network periphery. This trend remains significant when potential confounding factors (gene expression level and breadth, duplicability, function, and length of the encoded products) are controlled for. Even though the correlation between centrality measures and rates of evolution is generally weak, for some functional categories, it is comparable in strength to (or even stronger than) the correlation between evolutionary rates and expression levels or breadths. In addition, genes encoding interacting proteins in the network evolve at relatively similar rates. Finally, Arabidopsis proteins encoded by duplicated genes are more highly connected than those encoded by singleton genes. This observation is in agreement with the patterns observed in humans, but in contrast with those observed in E. coli, yeast, worm, and fly (whose duplicated genes tend to act at the periphery of the network), implying that the relationship between duplicability and centrality inverted at least twice during eukaryote evolution. Taken together, these results indicate that the structure of the A. thaliana network constrains the evolution of its components at multiple levels.
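The relationship between network position and evolutionary rate described above can be sketched as a rank correlation between a centrality measure and a per-gene rate. The example below is a minimal illustration that assumes degree (number of interaction partners) as the centrality measure and Spearman's rank correlation as the statistic; the abstract does not state which measures or tests were used. The protein names, edges, and rate values are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Number of distinct interaction partners per protein (self-loops ignored)."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical interaction network and evolutionary rates (e.g. dN/dS)
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
rates = {"P1": 0.05, "P2": 0.10, "P3": 0.12, "P4": 0.15, "P5": 0.30}

degree = degree_centrality(edges)
proteins = sorted(rates)
rho = spearman([degree[p] for p in proteins], [rates[p] for p in proteins])
print(f"Spearman rho between degree and rate: {rho:.3f}")
```

A negative `rho` in such an analysis would correspond to more central proteins evolving more slowly. Controlling for confounders such as expression level, as the study does, would require partial correlation or regression, which this sketch leaves out.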
PMCID: PMC3542556  PMID: 23160177
network evolution; Arabidopsis interactome; natural selection; rates of evolution; gene duplication; network centrality
7.  Extensive parallelism in protein evolution 
Biology Direct  2007;2:20.
Independently evolving lineages mostly accumulate different changes, which leads to their gradual divergence. However, parallel accumulation of identical changes is also common, especially in traits with only a small number of possible states.
We characterize parallelism in evolution of coding sequences in three four-species sets of genomes of mammals, Drosophila, and yeasts. Each such set contains two independent evolutionary paths, which we call paths I and II. An amino acid replacement which occurred along path I also occurs along path II with the probability 50–80% of that expected under selective neutrality. Thus, the per site rate of parallel evolution of proteins is several times higher than their average rate of evolution, but still lower than the rate of evolution of neutral sequences. This deficit may be caused by changes in the fitness landscape, leading to a replacement being possible along path I but not along path II. However, constant, weak selection assumed by the nearly neutral model of evolution appears to be a more likely explanation. Then, the average coefficient of selection associated with an amino acid replacement, in the units of the effective population size, must exceed ~0.4, and the fraction of effectively neutral replacements must be below ~30%. At a majority of evolvable amino acid sites, only a relatively small number of different amino acids is permitted.
High, but below-neutral, rates of parallel amino acid replacements suggest that a majority of amino acid replacements that occur in evolution are subject to weak, but non-trivial, selection, as predicted by Ohta's nearly-neutral theory.
This article was reviewed by John McDonald (nominated by Laura Landweber), Sarah Teichmann and Subhajyoti De, and Chris Adami.
PMCID: PMC2020468  PMID: 17705846
8.  Widespread Positive Selection in Synonymous Sites of Mammalian Genes 
Molecular biology and evolution  2007;24(8):1821-1831.
Evolution of protein sequences is largely governed by purifying selection, with a small fraction of proteins evolving under positive selection. The evolution at synonymous positions in protein-coding genes is not nearly as well understood, with the extent and types of selection remaining, largely, unclear. A statistical test to identify purifying and positive selection at synonymous sites in protein-coding genes was developed. The method compares the rate of evolution at synonymous sites (Ks) to that in intron sequences of the same gene after sampling the aligned intron sequences to mimic the statistical properties of coding sequences. We detected purifying selection at synonymous sites in ∼28% of the 1,562 analyzed orthologous genes from mouse and rat, and positive selection in ∼12% of the genes. Thus, the fraction of genes with readily detectable positive selection at synonymous sites is much greater than the fraction of genes with comparable positive selection at nonsynonymous sites, i.e., at the level of the protein sequence. Unlike other genes, the genes with positive selection at synonymous sites showed no correlation between Ks and the rate of evolution in nonsynonymous sites (Ka), indicating that evolution of synonymous sites under positive selection is decoupled from protein evolution. The genes with purifying selection at synonymous sites showed significant anticorrelation between Ks and expression level and breadth, indicating that highly expressed genes evolve slowly. The genes with positive selection at synonymous sites showed the opposite trend, i.e., highly expressed genes had, on average, higher Ks. For the genes with positive selection at synonymous sites, a significantly lower mRNA stability is predicted compared to the genes with negative selection. 
Thus, mRNA destabilization could be an important factor driving positive selection in nonsynonymous sites, probably, through regulation of expression at the level of mRNA degradation and, possibly, also translation rate. So, unexpectedly, we found that positive selection at synonymous sites of mammalian genes is substantially more common than positive selection at the level of protein sequences. Positive selection at synonymous sites might act through mRNA destabilization affecting mRNA levels and translation.
PMCID: PMC2632937  PMID: 17522087
synonymous sites; nonsynonymous sites; positive selection; purifying selection; introns
9.  A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank 
Since thermodynamic stability is a global property of proteins that has to be conserved during evolution, the selective pressure at a given site of a protein sequence depends on the amino acids present at other sites. However, models of molecular evolution that aim at reconstructing the evolutionary history of macromolecules become computationally intractable if such correlations between sites are explicitly taken into account.
We introduce an evolutionary model with sites evolving independently under a global constraint on the conservation of structural stability. This model consists of a selection process, which depends on two hydrophobicity parameters that can be computed from protein sequences without any fit, and a mutation process for which we consider various models. It reproduces quantitatively the results of Structurally Constrained Neutral (SCN) simulations of protein evolution in which the stability of the native state is explicitly computed and conserved. We then compare the predicted site-specific amino acid distributions with those sampled from the Protein Data Bank (PDB). The parameters of the mutation model, whose number varies between zero and five, are fitted from the data. The mean correlation coefficient between predicted and observed site-specific amino acid distributions is larger than ⟨r⟩ = 0.70 for a mutation model with no free parameters and no genetic code. In contrast, considering only the mutation process with no selection yields a mean correlation coefficient of ⟨r⟩ = 0.56 with three fitted parameters. The mutation model that best fits the data takes into account increased mutation rate at CpG dinucleotides, yielding ⟨r⟩ = 0.90 with five parameters.
The effective selection process that we propose reproduces well amino acid distributions as observed in the protein sequences in the PDB. Its simplicity makes it very promising for likelihood calculations in phylogenetic studies. Interestingly, in this approach the mutation process influences the effective selection process, i.e. selection and mutation must be entangled in order to obtain effectively independent sites. This interdependence between mutation and selection reflects the deep influence that mutation has on the evolutionary process: The bias in the mutation influences the thermodynamic properties of the evolving proteins, in agreement with comparative studies of bacterial proteomes, and it also influences the rate of accepted mutations.
PMCID: PMC1570368  PMID: 16737532
10.  Hierarchical Targeting of Subtype C Human Immunodeficiency Virus Type 1 Proteins by CD8+ T Cells: Correlation with Viral Load 
Journal of Virology  2004;78(7):3233-3243.
An understanding of the relationship between the breadth and magnitude of T-cell epitope responses and viral loads is important for the design of effective vaccines. For this study, we screened a cohort of 46 subtype C human immunodeficiency virus type 1 (HIV-1)-infected individuals for T-cell responses against a panel of peptides corresponding to the complete subtype C genome. We used a gamma interferon ELISPOT assay to explore the hypothesis that patterns of T-cell responses across the expressed HIV-1 genome correlate with viral control. The estimated median time from seroconversion to response for the cohort was 13 months, and the order of cumulative T-cell responses against HIV proteins was as follows: Nef > Gag > Pol > Env > Vif > Rev > Vpr > Tat > Vpu. Nef was the most intensely targeted protein, with 97.5% of the epitopes being clustered within 119 amino acids, constituting almost one-third of the responses across the expressed genome. The second most targeted region was p24, comprising 17% of the responses. There was no correlation between viral load and the breadth of responses, but there was a weak positive correlation (r = 0.297; P = 0.034) between viral load and the total magnitude of responses, implying that the magnitude of T-cell recognition did not contribute to viral control. When hierarchical patterns of recognition were correlated with the viral load, preferential targeting of Gag was significantly (r = 0.445; P = 0.0025) associated with viral control. These data suggest that preferential targeting of Gag epitopes, rather than the breadth or magnitude of the response across the genome, may be an important marker of immune efficacy. These data have significance for the design of vaccines and for interpretation of vaccine-induced responses.
PMCID: PMC371059  PMID: 15016844
11.  On the origin of the translation system and the genetic code in the RNA world by means of natural selection, exaptation, and subfunctionalization 
Biology Direct  2007;2:14.
The origin of the translation system is, arguably, the central and the hardest problem in the study of the origin of life, and one of the hardest in all evolutionary biology. The problem has a clear catch-22 aspect: high translation fidelity hardly can be achieved without a complex, highly evolved set of RNAs and proteins but an elaborate protein machinery could not evolve without an accurate translation system. The origin of the genetic code and whether it evolved on the basis of a stereochemical correspondence between amino acids and their cognate codons (or anticodons), through selectional optimization of the code vocabulary, as a "frozen accident" or via a combination of all these routes is another wide open problem despite extensive theoretical and experimental studies. Here we combine the results of comparative genomics of translation system components, data on interaction of amino acids with their cognate codons and anticodons, and data on catalytic activities of ribozymes to develop conceptual models for the origins of the translation system and the genetic code.
Our main guide in constructing the models is the Darwinian Continuity Principle whereby a scenario for the evolution of a complex system must consist of plausible elementary steps, each conferring a distinct advantage on the evolving ensemble of genetic elements. Evolution of the translation system is envisaged to occur in a compartmentalized ensemble of replicating, co-selected RNA segments, i.e., in an RNA World containing ribozymes with versatile activities. Since evolution has no foresight, the translation system could not evolve in the RNA World as the result of selection for protein synthesis and must have been a by-product of evolution driven by selection for another function, i.e., the translation system evolved via the exaptation route. It is proposed that the evolutionary process that eventually led to the emergence of translation started with the selection for ribozymes binding abiogenic amino acids that stimulated ribozyme-catalyzed reactions. The proposed scenario for the evolution of translation consists of the following steps: binding of amino acids to a ribozyme resulting in an enhancement of its catalytic activity; evolution of the amino-acid-stimulated ribozyme into a peptide ligase (predecessor of the large ribosomal subunit) yielding, initially, a unique peptide activating the original ribozyme and, possibly, other ribozymes in the ensemble; evolution of self-charging proto-tRNAs that were selected, initially, for accumulation of amino acids, and subsequently, for delivery of amino acids to the peptide ligase; joining of the peptide ligase with a distinct RNA molecule (predecessor of the small ribosomal subunit) carrying a built-in template for more efficient, complementary binding of charged proto-tRNAs; evolution of the ability of the peptide ligase to assemble peptides using exogenous RNAs as template for complementary binding of charged proto-tRNAs, yielding peptides with the potential to activate different ribozymes; evolution of the translocation function of the protoribosome leading to the production of increasingly longer peptides (the first proteins), i.e., the origin of translation. The specifics of the recognition of amino acids by proto-tRNAs and the origin of the genetic code depend on whether or not there is a physical affinity between amino acids and their cognate codons or anticodons, a problem that remains unresolved.
We describe a stepwise model for the origin of the translation system in the ancient RNA world such that each step confers a distinct advantage onto an ensemble of co-evolving genetic elements. Under this scenario, the primary cause for the emergence of translation was the ability of amino acids and peptides to stimulate reactions catalyzed by ribozymes. Thus, the translation system might have evolved as the result of selection for ribozymes capable of, initially, efficient amino acid binding, and subsequently, synthesis of increasingly versatile peptides. Several aspects of this scenario are amenable to experimental testing.
This article was reviewed by Rob Knight, Doron Lancet, Alexander Mankin (nominated by Arcady Mushegian), and Arcady Mushegian.
PMCID: PMC1894784  PMID: 17540026
12.  Faster-X Evolution of Gene Expression in Drosophila 
PLoS Genetics  2012;8(10):e1003013.
DNA sequences on X chromosomes often have a faster rate of evolution when compared to similar loci on the autosomes, and well-articulated models provide reasons why the X-linked mode of inheritance may be responsible for the faster evolution of X-linked genes. We analyzed microarray and RNA–seq data collected from females and males of six Drosophila species and found that the expression levels of X-linked genes also diverge faster than autosomal gene expression, similar to the “faster-X” effect often observed in DNA sequence evolution. Faster-X evolution of gene expression was recently described in mammals, but it was limited to the evolutionary lineages shortly following the creation of the therian X chromosome. In contrast, we detect a faster-X effect along both deep lineages and those on the tips of the Drosophila phylogeny. In Drosophila males, the dosage compensation complex (DCC) binds the X chromosome, creating a unique chromatin environment that promotes the hyper-expression of X-linked genes. We find that DCC binding, chromatin environment, and breadth of expression are all predictive of the rate of gene expression evolution. In addition, estimates of the intraspecific genetic polymorphism underlying gene expression variation suggest that X-linked expression levels are not under relaxed selective constraints. We therefore hypothesize that the faster-X evolution of gene expression is the result of the adaptive fixation of beneficial mutations at X-linked loci that change expression level in cis. This adaptive faster-X evolution of gene expression is limited to genes that are narrowly expressed in a single tissue, suggesting that relaxed pleiotropic constraints permit a faster response to selection. Finally, we present a conceptual framework to explain faster-X expression evolution, and we use this framework to examine differences in the faster-X effect between Drosophila and mammals.
Author Summary
As species diverge over evolutionary time, they accumulate differences in the sequences of their genes and how those genes are expressed. We show that gene expression changes accumulate faster for genes on the X chromosome than for genes on the other chromosomes (autosomes) in Drosophila (the “faster-X” effect). The X chromosome is only found in a single copy in males, whereas the autosomes are found in two copies in both sexes. To compensate for the reduced dosage of X-linked genes in males, a molecular complex binds the Drosophila X chromosome to upregulate gene expression in males. We demonstrate that genes that escape this dosage compensation process have faster evolving expression levels. X-linked genes are inherited in a unique manner, and we hypothesize that this permits a faster rate of adaptive evolution, thereby driving the faster-X evolution of gene expression. We compare these observations with the recently described faster-X evolution of gene expression in mammals, and we explain how differences in dosage compensation, mutation rate, and population size could affect the extent of the faster-X effect.
PMCID: PMC3469423  PMID: 23071459
13.  Purifying Selection on Splice-Related Motifs, Not Expression Level nor RNA Folding, Explains Nearly All Constraint on Human lincRNAs 
Molecular Biology and Evolution  2014;31(12):3164-3183.
There are two strong and equally important predictors of rates of human protein evolution: The amount the gene is expressed and the proportion of exonic sequence devoted to control splicing, mediated largely by selection on exonic splice enhancer (ESE) motifs. Is the same true for noncoding RNAs, known to be under very weak purifying selection? Prior evidence suggests that selection at splice sites in long intergenic noncoding RNAs (lincRNAs) is important. We now report multiple lines of evidence indicating that the great majority of purifying selection operating on lincRNAs in humans is splice related. Splice-related parameters explain much of the between-gene variation in evolutionary rate in humans. Expression rate is not a relevant predictor, although expression breadth is weakly so. In contrast to protein-coding RNAs, we observe no relationship between evolutionary rate and lincRNA stability. As in protein-coding genes, ESEs are especially abundant near splice junctions and evolve slower than non-ESE sequence equidistant from boundaries. Nearly all constraint in lincRNAs is at exon ends (N.B. the same is not witnessed in Drosophila). Although we cannot definitely answer the question as to why splice-related selection is so important, we find no evidence that splicing might enable the nonsense-mediated decay pathway to capture transcripts incorrectly processed by ribosomes. We find evidence consistent with the notion that splicing modifies the underlying chromatin through recruitment of splice-coupled chromatin modifiers, such as CHD1, which in turn might modulate neighbor gene activity. We conclude that most selection on human lincRNAs is splice mediated and suggest that the possibility of splice–chromatin coupling is worthy of further scrutiny.
PMCID: PMC4245815  PMID: 25158797
ncRNA; rate of evolution; splicing
14.  Human Immunodeficiency Virus-Specific Gamma Interferon Enzyme-Linked Immunospot Assay Responses Targeting Specific Regions of the Proteome during Primary Subtype C Infection Are Poor Predictors of the Course of Viremia and Set Point 
Journal of Virology  2008;83(1):470-478.
It is unknown whether patterns of human immunodeficiency virus (HIV)-specific T-cell responses during acute infection may influence the viral set point and the course of disease. We wished to establish whether the magnitude and breadth of HIV type 1 (HIV-1)-specific T-cell responses at 3 months postinfection were correlated with the viral-load set point at 12 months and hypothesized that the magnitude and breadth of HIV-specific T-cell responses during primary infection would predict the set point. Gamma interferon (IFN-γ) enzyme-linked immunospot (ELISPOT) assay responses across the complete proteome were measured in 47 subtype C HIV-1-infected participants at a median of 12 weeks postinfection. When corrected for amino acid length and individuals responding to each region, the order of recognition was as follows: Nef > Gag > Pol > Rev > Vpr > Env > Vpu > Vif > Tat. Nef responses were significantly (P < 0.05) dominant, targeted six epitopic regions, and were unrelated to the course of viremia. There was no significant difference in the magnitude and breadth of responses for each protein region with disease progression, although there was a trend of increased breadth (mean, four to seven pools) in rapid progressors. Correlation of the magnitude and breadth of IFN-γ responses with the viral set point at 12 months revealed almost zero association for each protein region. Taken together, these data demonstrate that the magnitude and breadth of IFN-γ ELISPOT assay responses at 3 months postinfection are unrelated to the course of disease in the first year of infection and are not associated with, and have low predictive power for, the viral set point at 12 months.
PMCID: PMC2612312  PMID: 18945774
15.  Evolutionary Systems Biology of Amino Acid Biosynthetic Cost in Yeast 
PLoS ONE  2010;5(8):e11935.
Every protein has a biosynthetic cost to the cell based on the synthesis of its constituent amino acids. In order to optimise growth and reproduction, natural selection is expected, where possible, to favour the use of proteins whose constituents are cheaper to produce, as reduced biosynthetic cost may confer a fitness advantage to the organism. Quantifying the cost of amino acid biosynthesis presents challenges, since energetic requirements may change across different cellular and environmental conditions. We developed a systems biology approach to estimate the cost of amino acid synthesis based on genome-scale metabolic models and investigated the effects of the cost of amino acid synthesis on Saccharomyces cerevisiae gene expression and protein evolution. First, we used our two new and six previously reported measures of amino acid cost in conjunction with codon usage bias, tRNA gene number and atomic composition to identify which of these factors best predict transcript and protein levels. Second, we compared amino acid cost with rates of amino acid substitution across four species in the genus Saccharomyces. Regardless of which cost measure is used, amino acid biosynthetic cost is weakly associated with transcript and protein levels. In contrast, we find that biosynthetic cost and amino acid substitution rates show a negative correlation, but for only a subset of cost measures. In the economy of the yeast cell, we find that the cost of amino acid synthesis plays a limited role in shaping transcript and protein expression levels compared to that of translational optimisation. Biosynthetic cost does, however, appear to affect rates of amino acid evolution in Saccharomyces, suggesting that expensive amino acids may only be used when they have specific structural or functional roles in protein sequences. 
However, as there appears to be no single currency to compute the cost of amino acid synthesis across all cellular and environmental conditions, we conclude that a systems approach is necessary to unravel the full effects of amino acid biosynthetic cost in complex biological systems.
PMCID: PMC2923148  PMID: 20808905
16.  Genomic Determinants of Protein Evolution and Polymorphism in Arabidopsis 
Genome Biology and Evolution  2011;3:1210-1219.
Recent results from Drosophila suggest that positive selection has a substantial impact on genomic patterns of polymorphism and divergence. However, species with smaller population sizes and/or stronger population structure may not be expected to exhibit Drosophila-like patterns of sequence variation. We test this prediction and identify determinants of levels of polymorphism and rates of protein evolution using genomic data from Arabidopsis thaliana and the recently sequenced Arabidopsis lyrata genome. We find that, in contrast to Drosophila, there is no negative relationship between nonsynonymous divergence and silent polymorphism at any spatial scale examined. Instead, synonymous divergence is a major predictor of silent polymorphism, which suggests variation in mutation rate as the main determinant of silent variation. Variation in rates of protein divergence is mainly correlated with gene expression level and breadth, consistent with results for a broad range of taxa, and map-based estimates of recombination rate are only weakly correlated with nonsynonymous divergence. Variation in mutation rates and the strength of purifying selection seem to be major drivers of patterns of polymorphism and divergence in Arabidopsis. Nevertheless, a model allowing for varying negative and positive selection by functional gene category explains the data better than a homogeneous model, implying the action of positive selection on a subset of genes. Genes involved in disease resistance and abiotic stress display high proportions of adaptive substitution. Our results are important for a general understanding of the determinants of rates of protein evolution and the impact of selection on patterns of polymorphism and divergence.
PMCID: PMC3296466  PMID: 21926095
dN/dS; neutral theory; purifying selection; translational selection; recurrent hitchhiking
17.  Complex patterns of divergence among green-sensitive (RH2a) African cichlid opsins revealed by Clade model analyses 
Gene duplications play an important role in the evolution of functional protein diversity. Some models of duplicate gene evolution predict complex forms of paralog divergence; orthologous proteins may diverge as well, further complicating patterns of divergence among and within gene families. Consequently, studying the link between protein sequence evolution and duplication requires the use of flexible substitution models that can accommodate multiple shifts in selection across a phylogeny. Here, we employed a variety of codon substitution models, primarily Clade models, to explore how selective constraint evolved following the duplication of a green-sensitive (RH2a) visual pigment protein (opsin) in African cichlids. Past studies have linked opsin divergence to ecological and sexual divergence within the African cichlid adaptive radiation. Furthermore, biochemical and regulatory differences between the RH2aα and RH2aβ paralogs have been documented. It thus seems likely that selection varies in complex ways throughout this gene family.
Clade model analysis of African cichlid RH2a opsins revealed a large increase in the nonsynonymous-to-synonymous substitution rate ratio (ω) following the duplication, as well as an even larger increase, one consistent with positive selection, for Lake Tanganyikan cichlid RH2aβ opsins. Analysis using the popular Branch-site models, by contrast, revealed no such alteration of constraint. Several amino acid sites known to influence spectral and non-spectral aspects of opsin biochemistry were found to be evolving divergently, suggesting that orthologous RH2a opsins may vary in terms of spectral sensitivity and response kinetics. Divergence appears to be occurring despite intronic gene conversion among the tandemly-arranged duplicates.
Our findings indicate that variation in selective constraint is associated with both gene duplication and divergence among orthologs in African cichlid RH2a opsins. At least some of this variation may reflect an adaptive response to differences in light environment. Interestingly, these patterns only became apparent through the use of Clade models, not through the use of the more widely employed Branch-site models; we suggest that this difference stems from the increased flexibility associated with Clade models. Our results thus bear both on studies of cichlid visual system evolution and on studies of gene family evolution in general.
PMCID: PMC3514295  PMID: 23078361
Codon substitution model; Visual pigment evolution; Nonsynonymous-to-synonymous substitution rate ratio; dN/dS; Clade model; Maximum likelihood; Gene family evolution
18.  Impact of GC content on gene expression pattern in chicken 
GC content varies greatly between different genomic regions in many eukaryotes. To determine whether this organization, termed isochore organization, influences gene expression patterns, the relationship between GC content and gene expression has been investigated in human and mouse. However, this question remains a matter of debate. Among avian species, the chicken (Gallus gallus) is the best-studied representative with a complete genome sequence. The distinctive features and organization of its sequence make it a good model to explore important issues in genome structure and evolution.
Only nuclear genes with complete information on the protein-coding sequence and no evidence of multiple splicing forms were included in this study. Chicken protein-coding sequences, complete mRNA sequences (or full-length cDNA sequences), and 5′ untranslated region (5′ UTR) sequences were downloaded from Ensembl, and chicken expression data were taken from a previous study. Three indices, i.e., expression level, expression breadth, and maximum expression level, were used to measure the expression pattern of a given gene. CpG islands were identified using hgTables of the UCSC Genome Browser. Correlation analysis between variables was performed with SAS Proprietary Software Release 8.1.
In chicken, the GC content of the 5′ UTR is significantly and positively correlated with expression level, expression breadth, and maximum expression level, whereas the GC content of coding sequences, of introns, and at the third codon position is negatively correlated with expression level and expression breadth and not correlated with maximum expression level. These significant trends are independent of recombination rate, chromosome size, and gene density. Furthermore, multiple linear regression analysis indicated that GC content in genes could explain approximately 10% of the variation in gene expression.
GC content is significantly associated with gene expression pattern and could be one of the important regulation factors in the chicken genome.
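The two sequence measures used in this abstract, overall GC content and GC content at the third codon position, can be computed as in the following Python sketch. The function names are illustrative assumptions, and the sketch assumes an in-frame coding sequence that starts at the first base of the first codon; it is not the pipeline used in the study.

```python
def gc_content(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T) in seq."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return float("nan")  # no usable bases
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_content(cds):
    """GC content at third codon positions of an in-frame coding sequence."""
    # Slice out every third base, starting from the third base of codon 1.
    return gc_content(cds[2::3])
```

For the toy sequence `ATGGCCTAA` (codons ATG, GCC, TAA), the third positions are G, C, and A, so `gc3_content` returns 2/3.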
PMCID: PMC3641017  PMID: 23557030
19.  Envelope Variants Circulating as Initial Neutralization Breadth Developed in Two HIV-Infected Subjects Stimulate Multiclade Neutralizing Antibodies in Rabbits 
Journal of Virology  2014;88(22):12949-12967.
Identifying characteristics of the human immunodeficiency virus type 1 (HIV-1) envelope that are effective in generating broad, protective antibodies remains a hurdle to HIV vaccine design. Emerging evidence of the development of broad and potent neutralizing antibodies in HIV-infected subjects suggests that founder and subsequent progeny viruses may express unique antigenic motifs that contribute to this developmental pathway. We hypothesize that over the course of natural infection, B cells are programmed to develop broad antibodies by exposure to select populations of emerging envelope quasispecies variants. To test this hypothesis, we identified two unrelated subjects whose antibodies demonstrated increasing neutralization breadth against a panel of HIV-1 isolates over time. Full-length functional env genes were cloned longitudinally from these subjects from months after infection through 2.6 to 5.8 years of infection. Motifs associated with the development of breadth in published, cross-sectional studies were found in both subjects. We compared the immunogenicity of envelope vaccines derived from time points obtained during and after broadening of neutralization activity within these subjects. Rabbits were coimmunized four times with selected multiple gp160 DNAs and gp140-trimeric envelope proteins. The affinity of the polyclonal response increased as a function of boosting. The most rapid and persistent neutralization of multiclade tier 1 viruses was elicited by envelopes that were circulating in plasma at time points prior to the development of 50% neutralization breadth in both human subjects. The breadth elicited in rabbits was not improved by exposure to later envelope variants. These data have implications for vaccine development in describing a target time point to identify optimal envelope immunogens.
IMPORTANCE Vaccine protection against viral infections correlates with the presence of neutralizing antibodies; thus, vaccine components capable of generating potent neutralization are likely to be critical constituents in an effective HIV vaccine. However, vaccines tested thus far have elicited only weak antibody responses and very modest, waning protection. We hypothesized that B cells develop broad antibodies by exposure to the evolving viral envelope population and tested this concept using multiple envelopes from two subjects who developed neutralization breadth within a few years of infection. We compared different combinations of envelopes from each subject to identify the most effective immunogens and regimens. In each subject, use of HIV envelopes circulating during the early development and maturation of breadth generated more-potent antibodies that were modestly cross neutralizing. These data suggest a new approach to identifying envelope immunogens that may be more effective in generating protective antibodies in humans.
PMCID: PMC4249069  PMID: 25210191
20.  Selection for the compactness of highly expressed genes in Gallus gallus 
Biology Direct  2010;5:35.
Coding sequence (CDS) length, gene size, and intron length vary within a genome and among genomes. Previous studies in diverse organisms, including human, D. melanogaster, C. elegans, S. cerevisiae, and Arabidopsis thaliana, indicated that expression level is negatively related to gene size, CDS length, and intron length. Different models, such as the selection-for-economy model, the genomic design model, and mutational bias hypotheses, have been proposed to explain this observation. The debate over which model best explains the observation has not been settled. The chicken (Gallus gallus) is an important model organism that bridges the evolutionary gap between mammals and other vertebrates. Because the chicken, like D. melanogaster, has a larger effective population size, selection on the chicken genome is expected to be more effective in increasing protein synthesis efficiency. Therefore, in this study the chicken was used as a model organism to elucidate the interaction between gene features and expression pattern under selection pressure.
Based on different technologies, we gathered expression data for nuclear protein-coding, single-splicing genes from the Gallus gallus genome and compared them with gene parameters. We found that gene size, CDS length, first intron length, average intron length, and total intron length are significantly negatively correlated with expression level and expression breadth. Tissue specificity is positively correlated with first intron length but negatively correlated with average intron length, and not correlated with CDS length or protein domain number. Comparison analyses showed that ubiquitously expressed genes and narrowly expressed genes with similar expression levels do not differ in compactness. Our data provided evidence that the genomic design model cannot, at least in part, explain our observations. We grouped all somatic-tissue-specific genes (n = 1105) and compared the first intron length and the average intron length between highly expressed genes (top 5% expressed genes) and weakly expressed genes (bottom 5% expressed genes). We found that the first intron length and the average intron length in highly expressed genes do not differ from those in weakly expressed genes. We also made a comparison between ubiquitously expressed genes and narrowly expressed somatic genes with similar expression levels. Our data demonstrated that ubiquitously expressed genes are less compact than narrowly expressed genes with similar expression levels. These observations cannot be explained by mutational bias hypotheses either. We also found that the significant trend between gene compactness and expression level was not affected by local mutational biases. We argue that the selection-for-economy model is the most likely one to explain the relationship between gene expression and gene characteristics in the chicken genome.
Natural selection appears to favor the compactness of highly expressed genes in the chicken genome. This observation can be explained by the selection-for-economy model.
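The negative correlations reported in this abstract, for example between intron length and expression level, are the kind of relationship a rank correlation can quantify. The abstract does not state which correlation statistic was used, so the Spearman coefficient below is an assumption, and the function names are illustrative. This minimal Python sketch uses only the standard library.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Given per-gene lists of intron lengths and expression levels, a value of `spearman(intron_lengths, expression_levels)` near -1 would indicate that more highly expressed genes tend to have shorter introns.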
This article was reviewed by Dr. Gavin Huttley, Dr. Liran Carmel (nominated by Dr. Eugene V. Koonin) and Dr. Araxi Urrutia (nominated by Dr. Laurence D. Hurst).
PMCID: PMC2883972  PMID: 20465857
21.  Asymmetric and non-uniform evolution of recently duplicated human genes 
Biology Direct  2010;5:54.
Gene duplications are a source of new genes and protein functions. The innovative role of duplication events makes families of paralogous genes an interesting target for studies in evolutionary biology. Here we study global trends in the evolution of human genes that resulted from recent duplications.
The pressure of negative selection is weaker during a short time immediately after a duplication event. Roughly one fifth of genes in paralogous gene families are evolving asymmetrically: one of the proteins encoded by two closest paralogs accumulates amino acid substitutions significantly faster than its partner. This asymmetry cannot be explained by differences in gene expression levels. In asymmetric gene pairs the number of deleterious mutations is increased in one copy, while decreased in the other copy as compared to genes constituting non-asymmetrically evolving pairs. The asymmetry in the rate of synonymous substitutions is much weaker and not significant.
The increase of negative selection pressure over time after a duplication event seems to be a major trend in the evolution of human paralogous gene families. The observed asymmetry in the evolution of paralogous genes shows that in many cases one of two gene copies remains practically unchanged, while the other accumulates functional mutations. This supports the hypothesis that slowly evolving gene copies preserve their original functions, while fast evolving copies obtain new specificities or functions.
This article was reviewed by Dr. Igor Rogozin (nominated by Dr. Arcady Mushegian), Dr. Fyodor Kondrashov, and Dr. Sergei Maslov.
PMCID: PMC2942815  PMID: 20825637
22.  Why Is the Correlation between Gene Importance and Gene Evolutionary Rate So Weak? 
PLoS Genetics  2009;5(1):e1000329.
One of the few commonly believed principles of molecular evolution is that functionally more important genes (or DNA sequences) evolve more slowly than less important ones. This principle is widely used by molecular biologists in daily practice. However, recent genomic analysis of a diverse array of organisms found only weak, negative correlations between the evolutionary rate of a gene and its functional importance, typically measured under a single benign lab condition. A frequently suggested cause of the above finding is that gene importance determined in the lab differs from that in an organism's natural environment. Here, we test this hypothesis in yeast using gene importance values experimentally determined in 418 lab conditions or computationally predicted for 10,000 nutritional conditions. In no single condition or combination of conditions did we find a much stronger negative correlation, which is explainable by our subsequent finding that always-essential (enzyme) genes do not evolve significantly more slowly than sometimes-essential or always-nonessential ones. Furthermore, we verified that functional density, approximated by the fraction of amino acid sites within protein domains, is uncorrelated with gene importance. Thus, neither the lab-nature mismatch nor a potentially biased among-gene distribution of functional density explains the observed weakness of the correlation between gene importance and evolutionary rate. We conclude that the weakness is factual, rather than artifactual. In addition to being weakened by population genetic reasons, the correlation is likely to have been further weakened by the presence of multiple nontrivial rate determinants that are independent from gene importance. 
These findings notwithstanding, we show that the principle of slower evolution of more important genes does have some predictive power when genes with vastly different evolutionary rates are compared, explaining why the principle can be practically useful despite the weakness of the correlation.
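The kind of correlation discussed in this abstract can be illustrated with a short sketch. It computes a rank correlation between per-gene importance and evolutionary rate. The choice of Spearman's coefficient, the function names, and the toy numbers in the usage note are assumptions made for illustration; the abstract does not state which statistic the authors used.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx = sum(rx) / len(rx)
    my = sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

For example, `spearman(importance, rate)` on two hypothetical per-gene lists returns a value between -1 and 1; a weak negative correlation of the kind reported would be a value only slightly below zero.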
Author Summary
The fact that functionally more important genes or DNA sequences evolve more slowly than less important ones is commonly believed and frequently used by molecular biologists. However, previous genome-wide studies of a diverse array of organisms found only weak, negative correlations between the importance of a gene and its evolutionary rate. We show, here, that the weakness of the correlation is not because gene importance measured in lab conditions deviates from that in an organism's natural environments. Neither is it due to a potentially biased among-gene distribution of functional density. We suggest that the weakness of the correlation is factual, rather than artifactual. These findings notwithstanding, we show that the principle of slower evolution of more important genes does have some predictive power when genes with vastly different evolutionary rates are compared, explaining why the principle can be practically useful for tasks such as identifying functional non-coding sequences despite the weakness of the correlation.
PMCID: PMC2605560  PMID: 19132081
23.  The Impact of the Nucleosome Code on Protein-Coding Sequence Evolution in Yeast 
PLoS Genetics  2008;4(11):e1000250.
Coding sequence evolution was once thought to be the result of selection on optimal protein function alone. Selection can, however, also act at the RNA level, for example, to facilitate rapid translation or ensure correct splicing. Here, we ask whether the way DNA works also imposes constraints on coding sequence evolution. We identify nucleosome positioning as a likely candidate to set up such a DNA-level selective regime and use high-resolution microarray data in yeast to compare the evolution of coding sequence bound to or free from nucleosomes. Controlling for gene expression and intra-gene location, we find a nucleosome-free “linker” sequence to evolve on average 5–6% slower at synonymous sites. A reduced rate of evolution in linker is especially evident at the 5′ end of genes, where the effect extends to non-synonymous substitution rates. This is consistent with regular nucleosome architecture in this region being important in the context of gene expression control. As predicted, codons likely to generate a sequence unfavourable to nucleosome formation are enriched in linker sequence. Amino acid content is likewise skewed as a function of nucleosome occupancy. We conclude that selection operating on DNA to maintain correct positioning of nucleosomes impacts codon choice, amino acid choice, and synonymous and non-synonymous rates of evolution in coding sequence. The results support the exclusion model for nucleosome positioning and provide an alternative interpretation for runs of rare codons. As the intimate association of histones and DNA is a universal characteristic of genic sequence in eukaryotes, selection on coding sequence composition imposed by nucleosome positioning should be phylogenetically widespread.
Author Summary
Why do some parts of genes evolve slower than others? How can we account for the amino acid make-up of different parts of a protein? Answers to these questions are usually framed by reference to what the protein does and how it does it. This framework is, however, naïve. We now know that selection can act also on mRNA, for example, to ensure introns are removed properly. Here, we provide the first evidence that the way DNA works also affects gene and protein evolution. In living cells, most DNA wraps around histone protein structures to form nucleosomes, the basic building blocks of chromatin. Protein-coding sequence is no exception. Looking at genes in baker's yeast, we find that sequence between nucleosomes, linker sequence, is slow evolving. Both mutations that change the gene but not the protein and those that change gene and protein are affected. We argue that selection for correct nucleosome positioning, rather than differences in mutational processes, can explain this observation. Linker also exhibits distinct patterns of codon and amino acid usage, which reflect that DNA of linker needs to be rigid to prevent nucleosome formation. These results show that the way DNA works impacts on how genes evolve.
PMCID: PMC2570795  PMID: 18989456
24.  Evolutionary fates within a microbial population highlight an essential role for protein folding during natural selection 
Physicochemical properties of molecules can be linked directly to evolutionary fates of a population in a quantitative and predictive manner. Reversible- and irreversible-folding pathways must be accounted for to accurately determine in vitro kinetic parameters (KM and kcat) at temperatures or conditions in which a significant fraction of free enzyme is unfolded. In vivo population dynamics can be reproduced using in vitro physicochemical measurements within a model that imposes an activity threshold above which there is no added fitness benefit.
In nature, evolution occurs through the continuous adaptation of a population to its environment. The success or failure of organisms during adaptation is based on changes in molecular structure that give rise to changes in fitness that dictate evolutionary fates within a population. Although the conceptual link between genotype, phenotype, and fitness is clear, the ability to relate these complex adaptive landscapes in a quantitative manner remains difficult (Kacser and Burns, 1981; Dykhuizen et al, 1987; Weinreich et al, 2006). Dean and Thornton (2007) coined the term ‘functional synthesis' to capture the synergy between evolutionary and molecular biology to address important questions such as the evolution of complexity. The ‘functional synthesis,' in its most fully realized form, is an integrated systems biology approach to evolutionary dynamics that links physicochemical properties of molecules to evolutionary fates in a quantitative and predictive manner.
Functional synthesis flourishes in an experimental framework that allows investigators to directly link population dynamics (fitness) to changes in molecular function that result from alterations at the nucleotide level. The ‘weak link' approach was developed to tightly couple adaptive changes within the genome to changes in fitness and provide a population-based approach that can be used to examine alterations in function and fitness at the level of atomic structure and function (Counago and Shamoo, 2005; Counago et al, 2006). A homologous recombination strategy was used to replace the chromosomal copy of the essential adenylate kinase gene (adk) of the thermophilic bacterium Geobacillus stearothermophilus with that of the mesophile Bacillus subtilis. Recombinant G. stearothermophilus cells that expressed only B. subtilis adenylate kinase (AKBSUB) were unable to grow at temperatures higher than 55°C because of heat inactivation of the mesophilic enzyme and consequent disruption of adenylate homeostasis (Counago and Shamoo, 2005). Continuously growing populations of bacteria were then subjected to selection at increasing temperatures (from 55 to 70°C) that favor changes in the one gene not adapted for thermostability, adk. During the course of selection, the population was sampled and intermediates of adaptation were observed as mutations to adk. The first mutant to reach fixation was a single mutation AKBSUB Q199R (the glutamine at position 199 replaced with arginine). AKBSUB Q199R was eventually replaced at 62–63°C by five double mutants that arose nearly simultaneously within the population and share AKBSUB Q199R as their progenitor (Figure 4C). Changes to AK activity and thermal stability that resulted from mutation had direct consequences for cellular fitness and, therefore, met our goal for an experimental system that allows us to develop and test models for quantitative molecular evolution. 
These enzyme activities and stabilities were examined to determine how the mutant populations traversed the adaptive landscape to increased fitness (Counago et al, 2006).
We found that reversible- and irreversible-folding pathways as well as a ‘physiological threshold' above which fitness changes are minimal are necessary to reproduce the in vivo evolutionary fates of the population. Protein-folding parameters must be accounted for to accurately determine in vitro kinetic parameters (KM and kcat) at temperatures in which a significant fraction of free enzyme is unfolded (Scheme I and Equation 1).
Scheme I
Thermostability was assayed using differential scanning calorimetry (DSC) (Figure 4A), and the fraction of unfolded protein (YU) was then extended to accurately predict the extent of stabilization (the shift in Tm) in the presence of ligand. The kinetic parameters determined at specific temperatures were then used to construct a temperature-dependent formulation of Equation (1) to model in vitro activity at any given ATP concentration and any temperature (Figure 4B).
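The idea of correcting enzyme activity for the unfolded fraction can be sketched as follows. This is not the paper's Equation (1), which is not reproduced in this listing. It assumes a simple reversible two-state unfolding model with no heat-capacity term, ignores the irreversible pathway and ligand stabilization, and treats kcat and KM as constants. All function and parameter names are hypothetical.

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def fraction_unfolded(temp_k, tm_k, dh_m):
    """Two-state fraction unfolded Y_U at temperature temp_k (kelvin).

    Assumes dG(T) = dH_m * (1 - T / Tm), i.e. no heat-capacity term,
    where dh_m is the unfolding enthalpy at Tm in J/mol.
    """
    dg = dh_m * (1.0 - temp_k / tm_k)   # unfolding free energy, J/mol
    k_u = math.exp(-dg / (R * temp_k))  # unfolding equilibrium constant
    return k_u / (1.0 + k_u)

def activity(temp_k, atp, kcat, km, e_total, tm_k, dh_m):
    """Michaelis-Menten rate in which only folded enzyme is active."""
    folded = e_total * (1.0 - fraction_unfolded(temp_k, tm_k, dh_m))
    return kcat * folded * atp / (km + atp)
```

At the melting temperature half of the enzyme is unfolded under this model, so the predicted rate is half of what an uncorrected Michaelis-Menten fit would give for the same total enzyme concentration.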
Here, we have modeled fitness as a function of in vitro enzyme activity, which is a product of both activity and stability, and the application of a threshold that provides an upper limit on fitness. We hypothesize that an activity threshold exists above which no added fitness benefit is attained (the ‘physiological threshold'). However, as activity falls below this threshold, AK becomes rate limiting and fitness is negatively affected. The experimentally observed rise and fall of mutant alleles is shown in Figure 4C, whereas those predicted from our in vitro model are shown as Figure 4D. This model can successfully reproduce frequencies of mutants in a polymorphic population, including the transient success of three minor mutants and order of disappearance from the population, given only in vitro data and allowing for the activity threshold to be fit to the observed outcomes (Figure 4D). An appealing aspect of our fitness function is that it permits an evaluation of specific and quantitative aspects of protein stability and activity relative to evolutionary fates.
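A minimal sketch of a threshold fitness function of this kind is given below. The abstract says only that fitness does not improve above the threshold and is reduced below it; the linear dependence on activity below the threshold, the one-step selection update, and all names are assumptions made for illustration rather than the authors' model.

```python
def fitness(activity, threshold):
    """Relative fitness: proportional to activity below the threshold,
    capped at 1.0 above it (linear form is an assumption)."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return min(activity, threshold) / threshold

def next_generation(freqs, activities, threshold):
    """One round of selection: reweight allele frequencies by fitness.

    freqs and activities are dicts keyed by (hypothetical) allele name.
    """
    weights = {a: freqs[a] * fitness(activities[a], threshold) for a in freqs}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all alleles have zero fitness")
    return {a: w / total for a, w in weights.items()}
```

With a threshold of 100, two alleles with activities 150 and 200 have equal fitness and keep their relative frequencies, whereas an allele with activity 50 loses ground to either of them in each round.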
In vivo, diversity within a population is generated by a variety of mechanisms that range from single-nucleotide changes to genome-wide rearrangements and horizontal gene transfer. However changes are generated within an organism, it is the physicochemical characteristics of the resulting macromolecules and their resultant changes in the fitness of the organism that are the ‘grist for the mill' of natural selection. Recent work has shown that adaptability can be facilitated by the accumulation of near-neutral or even modestly destabilizing mutations that provide more possibilities for success. Chaperones have an important function in buffering biological systems against these destabilizing mutations as well as mistakes in translation that lead to polymorphic populations and have been shown to increase rates of adaptation (Rutherford, 2003; Drummond and Wilke, 2008; Tokuriki and Tawfik, 2009a). Thus, adaptation through protein evolution is circumscribed by protein stability. As most mutational events will be destabilizing (Tokuriki and Tawfik, 2009b), higher mutation rates can lead to decreases in fitness, eventually leading to extinction (Zeldovich et al, 2007; Chen and Shakhnovich, 2009). Although our system links the physicochemical properties of adaptive changes that increase stability, the principles apply equally to those changes that might decrease stability of the ensemble either through mutation or translational errors (Drummond and Wilke, 2008). Thus, regardless of how protein diversity is generated, evolutionary dynamics will likely be strongly coupled to stability and function.
Systems biology can offer a great deal of insight into evolution by quantitatively linking complex properties such as protein structure, folding, and function to the fitness of an organism. Although the link between diseases such as Alzheimer's and misfolding is well appreciated, directly showing the importance of protein folding to success in evolution has been more difficult. We show here that predicting success during adaptation can depend critically on enzyme kinetic and folding models. We used a ‘weak link' method to favor mutations to an essential, but maladapted, adenylate kinase gene within a microbial population that resulted in the identification of five mutants that arose nearly simultaneously and competed for success. Physicochemical characterization of these mutants showed that, although steady-state enzyme activity is important, success within the population is critically dependent on resistance to denaturation and aggregation. A fitness function based on in vitro measurements of enzyme activity, reversible and irreversible unfolding, and the physiological context reproduces in vivo evolutionary fates in the population linking organismal adaptation to its physical basis.
PMCID: PMC2925523  PMID: 20631681
adenylate kinase; enzyme kinetics; experimental evolution; fitness functions; protein folding
25.  Random Single Amino Acid Deletion Sampling Unveils Structural Tolerance and the Benefits of Helical Registry Shift on GFP Folding and Structure 
Structure (London, England: 1993)  2014;22(6):889-898.
Altering a protein’s backbone through amino acid deletion is a common evolutionary mutational mechanism, but is generally ignored during protein engineering primarily because its effect on the folding-structure-function relationship is difficult to predict. Using directed evolution, enhanced green fluorescent protein (EGFP) was observed to tolerate residue deletion across the breadth of the protein, particularly within short and long loops, helical elements, and at the termini of strands. A variant with G4 removed from a helix (EGFPG4Δ) conferred significantly higher cellular fluorescence. Folding analysis revealed that EGFPG4Δ retained more structure upon unfolding and refolded with almost 100% efficiency but at the expense of thermodynamic stability. The EGFPG4Δ structure revealed that G4 deletion caused a beneficial helical registry shift resulting in a new polar interaction network, which potentially stabilizes a cis proline peptide bond and links secondary structure elements. Thus, deletion mutations and registry shifts can enhance proteins through structural rearrangements not possible by substitution mutations alone.
• Using directed evolution, the impact of amino acid deletion on EGFP is explored
• Loops, helices, and strand termini are especially tolerant to amino acid deletion
• A deletion mutant that enhances cellular production and fluorescence is identified
• Structure reveals that a helical registry shift creates a new polar network
Using directed evolution, Arpino et al. examine the impact of amino acid deletion on EGFP and find that loops, helices, and strand termini are especially tolerant to amino acid deletion. Structural work provides a molecular explanation for this observation.
PMCID: PMC4058518  PMID: 24856363
