Mammalian genomes contain numerous genes for long noncoding RNAs (lncRNAs). The functions of the lncRNAs remain largely unknown, but their evolution appears to be constrained by purifying selection, albeit relatively weakly. To gain insights into the mode of evolution and the functional range of the lncRNAs, they can be compared with the much better characterized protein-coding genes. The evolutionary rate of the protein-coding genes shows a universal negative correlation with expression: highly expressed genes are on average more conserved during evolution than genes with lower expression levels. This correlation was conceptualized in the misfolding-driven protein evolution hypothesis, according to which misfolding is the principal cost incurred by protein expression. We sought to determine whether long intergenic ncRNAs (lincRNAs) follow the same evolutionary trend and indeed detected a moderate but statistically significant negative correlation between the evolutionary rate and expression level of human and mouse lincRNA genes. The magnitude of the correlation for the lincRNAs is similar to that for equal-sized sets of protein-coding genes with similar levels of sequence conservation. Additionally, the expression level of the lincRNAs is significantly and positively correlated with the predicted extent of lincRNA molecule folding (base-pairing); however, the contributions of evolutionary rates and folding to the expression level are independent. Thus, the anticorrelation between evolutionary rate and expression level appears to be a general feature of gene evolution that might be caused by similar deleterious effects of protein and RNA misfolding and/or other factors, for example, the number of interacting partners of the gene product.
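The kind of rate-versus-expression correlation described above can be illustrated with a rank correlation. The abstract does not state which statistic was used, so the choice of Spearman's coefficient here is an assumption, and the function names and toy values are hypothetical. This is a minimal sketch rather than the authors' procedure.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: per-gene evolutionary rate and expression level.
rates = [0.9, 0.7, 0.5, 0.2]
expression = [1.0, 3.0, 8.0, 20.0]
rho = spearman(rates, expression)  # negative when faster-evolving genes are less expressed
```

A rank-based statistic is a natural choice for this kind of data because expression levels typically span several orders of magnitude, but a significance test would be needed in addition to the coefficient itself.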
long noncoding RNA; ncRNA; RNA expression; genomic alignments; introns; RNA folding
Research in quantitative evolutionary genomics and systems biology led to the discovery of several universal regularities connecting genomic and molecular phenomic variables. These universals include the log-normal distribution of the evolutionary rates of orthologous genes; the power law–like distributions of paralogous family size and node degree in various biological networks; the negative correlation between a gene's sequence evolution rate and expression level; and differential scaling of functional classes of genes with genome size. The universals of genome evolution can be accounted for by simple mathematical models similar to those used in statistical physics, such as the birth-death-innovation model. These models do not explicitly incorporate selection; therefore, the observed universal regularities do not appear to be shaped by selection but rather are emergent properties of gene ensembles. Although a complete physical theory of evolutionary biology is inconceivable, the universals of genome evolution might qualify as “laws of evolutionary genomics” in the same sense “law” is understood in modern physics.
Pseudogenes, the nonfunctional homologues of functional genes, are emerging as important resources for the study of human protein evolution. Processed pseudogenes, which arise by reverse transcription and reinsertion, can provide a molecular record of the dynamics and evolution of genomes. Studies of the progenitors of human processed pseudogenes have shown that these genes are highly expressed and evolutionarily conserved. They are also reported to be short and GC-poor, indicating high efficiency of retrotransposition. In this article we focus on their high expressivity, explore the factors that contribute to it, and consider their relevance in the context of protein sequence evolution.
Here, we relate the high expressivity of the genes that give rise to processed pseudogenes (retropseudogenes) to their high connectivity in the protein-protein interaction network, a tendency towards alternative splicing, a lower rate of mRNA decay and a slower evolutionary rate. The unusual combination of elevated intrinsic disorder with high expressivity in the proteins encoded by processed pseudogene ancestors is explained by a predominance of hub-protein encoding genes, a high proportion of genes containing repeat sequences, elevated protein stability and the functional constraint of performing transcription regulatory roles. Linear regression analysis identifies mRNA decay rate and protein intrinsic disorder as the influential factors controlling the expressivity of these retropseudogene ancestors, with the latter having the most significant regulatory power.
Our findings imply that the abundance of disordered regions, which increases network connectivity and thereby involvement in important cellular tasks, and stability at the transcript level are the prevailing forces behind the high expressivity of the human genes that give rise to processed pseudogenes.
Expressivity; Protein intrinsic disorder; Connectivity; Alternative splicing; Protein stability; mRNA decay rate; Evolutionary rate
Motivation: Correlated events of gains and losses enable inference of co-evolution relations. The reconstruction of the co-evolutionary interactions network in prokaryotic species may elucidate functional associations among genes.
Results: We developed a novel probabilistic methodology for the detection of co-evolutionary interactions between pairs of genes. Using this method we inferred the co-evolutionary network among 4593 Clusters of Orthologous Genes (COGs). The number of co-evolutionary interactions substantially differed among COGs. Over 40% were found to co-evolve with at least one partner. We partitioned the network of co-evolutionary relations into clusters and uncovered multiple modular assemblies of genes with clearly defined functions. Finally, we measured the extent to which co-evolutionary relations coincide with other cellular relations such as genomic proximity, gene fusion propensity, co-expression, protein–protein interactions and metabolic connections. Our results show that co-evolutionary relations only partially overlap with these other types of networks. Our results suggest that the inferred co-evolutionary network in prokaryotes is highly informative towards revealing functional relations among genes, often showing signals that cannot be extracted from other network types.
Availability and implementation: Available under GPL license as open source.
Supplementary data are available at Bioinformatics online.
Proteins show a broad range of evolutionary rates. Understanding the factors that are responsible for the characteristic rate of evolution of a given protein arguably is one of the major goals of evolutionary biology. A long-standing general assumption used to be that the evolution rate is, primarily, determined by the specific functional constraints that affect the given protein. These constraints were traditionally thought to depend both on the specific features of the protein's structure and its biological role. The advent of systems biology brought about new types of data, such as expression level and protein-protein interactions, and unexpectedly, a variety of correlations between protein evolution rate and these variables have been observed. The strongest connections by far were repeatedly seen between protein sequence evolution rate and the expression level of the respective gene. It has been hypothesized that this link is due to the selection for the robustness of the protein structure to mistranslation-induced misfolding that is particularly important for highly expressed proteins and is the dominant determinant of the sequence evolution rate.
This work is an attempt to assess the relative contributions of protein domain structure and function, on the one hand, and expression level on the other hand, to the rate of sequence evolution. To this end, we performed a genome-wide analysis of the effect of the fusion of a pair of domains in multidomain proteins on the difference in the domain-specific evolutionary rates. The mistranslation-induced misfolding hypothesis would predict that, within multidomain proteins, fused domains, on average, should evolve at substantially closer rates than the same domains in different proteins because, within a multidomain protein, all domains are translated at the same rate. We performed a comprehensive comparison of the evolutionary rates of mammalian and plant protein domains that are either joined in multidomain proteins or contained in distinct proteins. Substantial homogenization of evolutionary rates in multidomain proteins was, indeed, observed in both animals and plants, although highly significant differences between domain-specific rates remained. The contributions of the translation rate, as determined by the effect of the fusion of a pair of domains within a multidomain protein, and intrinsic, domain-specific structural-functional constraints appear to be comparable in magnitude.
Fusion of domains in a multidomain protein results in substantial homogenization of the domain-specific evolutionary rates but significant differences between domain-specific evolution rates remain. Thus, the rate of translation and intrinsic structural-functional constraints both exert sizable and comparable effects on sequence evolution.
This article was reviewed by Sergei Maslov, Dennis Vitkup, Claus Wilke (nominated by Orly Alter), and Allan Drummond (nominated by Joel Bader). For the full reviews, please go to the Reviewers' Reports section.
A long-standing assumption in evolutionary biology is that the evolution rate of protein-coding genes depends, largely, on specific constraints that affect the function of the given protein. However, recent research in evolutionary systems biology revealed unexpected, significant correlations between evolution rate and characteristics of genes or proteins that are not directly related to specific protein functions, such as expression level and protein–protein interactions. The strongest connections were consistently detected between protein sequence evolution rate and the expression level of the respective gene. A recent genome-wide proteomic study revealed an extremely strong correlation between the abundances of orthologous proteins in distantly related animals, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster. We used the extensive protein abundance data from this study along with short-term evolutionary rates (ERs) of orthologous genes in nematodes and flies to estimate the relative contributions of structural–functional constraints and the translation rate to the evolution rate of protein-coding genes. Together the intrinsic constraints and translation rate account for approximately 50% of the variance of the ERs. The contribution of constraints is estimated to be 3- to 5-fold greater than the contribution of translation rate.
protein evolution; structural–functional constraints; misfolding; protein abundance
Duplications of genes encoding highly connected and essential proteins are selected against in several species but not in human, where duplicated genes encode highly connected proteins. To understand when and how gene duplicability changed in evolution, we compare gene and network properties in four species (Escherichia coli, yeast, fly, and human) that are representative of the increase in evolutionary complexity, defined as progressive growth in the number of genes, cells, and cell types. We find that the origin and conservation of a gene significantly correlate with the properties of the encoded protein in the protein-protein interaction network. All four species preserve a core of singleton and central hubs that originated early in evolution, are highly conserved, and accomplish basic biological functions. Another group of hubs appeared in metazoans and duplicated in vertebrates, mostly through vertebrate-specific whole genome duplication. Such recent and duplicated hubs are frequently targets of microRNAs and show tissue-selective expression, suggesting that these are alternative mechanisms to control their dosage. Our study shows how networks were modified during evolution and contributes to explaining the occurrence of somatic genetic diseases, such as cancer, in terms of network perturbations.
Gene copy number is often tightly controlled because it directly affects the gene dosage. In several species, including yeast, worm, and fly, genes that have a single copy (singleton genes) encode proteins with several connections in the protein interaction network (hubs) as well as essential proteins. Surprisingly, in mouse and human, essential proteins and hubs are encoded by genes with more than one copy in the genome (duplicated genes). Here we show that these two distinct groups of hubs were acquired at different times during the evolution of the protein interaction network and contribute in different ways to the life of the cell. Singleton hubs are ancestral genes that are conserved from prokaryotes to vertebrates and accomplish basic functions related to cell survival. Duplicated hubs were acquired mostly within metazoans and duplicated through vertebrate-specific whole genome duplication. These genes are involved in processes that are crucial for the organization of multicellularity. Although duplicated, recent hubs are also subject to gene dosage control through microRNAs and tissue-selective expression. Clarifying how the protein interaction network evolves enables us to understand the adaptation to the progressive increase in complexity and to better characterize the genes involved in diseases such as cancer.
Many single-gene knockouts result in increased phenotypic (e.g., morphological) variability among the mutant's offspring. This has been interpreted as an intrinsic ability of genes to buffer genetic and environmental variation. A phenotypic capacitor is a gene that appears to mask phenotypic variation: when knocked out, the offspring shows more variability than the wild type. Theory predicts that this phenotypic potential should be correlated with a gene's knockout fitness and its number of negative genetic interactions. Based on experimentally measured phenotypic capacity, it was suggested that knockout fitness was unimportant, but that phenotypic capacitors tend to be hubs in genetic and physical interaction networks.
We re-analyse the available experimental data in a combined model, which includes knockout fitness and network parameters as well as expression level and protein length as predictors of phenotypic potential. Contrary to previous conclusions, we find that the strongest predictor is in fact haploid knockout fitness (responsible for 9% of the variation in phenotypic potential), with an additional contribution from the genetic interaction network (5%); once these two factors are taken into account, protein-protein interactions do not make any additional contribution to the variation in phenotypic potential.
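The idea of a first predictor explaining some share of the variation and a second predictor adding a further share can be sketched as an incremental R² calculation. The abstract does not specify the exact model, so the two-predictor linear model below is an assumption, and all names and numbers are hypothetical. It uses the standard closed form for R² with two predictors, expressed through pairwise Pearson correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def r2_two_predictors(y, x1, x2):
    """R^2 of the least-squares fit y ~ x1 + x2 (with intercept)."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    if abs(r_12) >= 1.0:
        raise ValueError("predictors are perfectly collinear")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def incremental_r2(y, x1, x2):
    """Return (variance explained by x1 alone, extra variance added by x2)."""
    first = pearson(y, x1) ** 2
    return first, r2_two_predictors(y, x1, x2) - first


# Hypothetical toy data: x1 stands in for knockout fitness, x2 for the
# number of genetic interactions, y for phenotypic potential.
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2 * a + b for a, b in zip(x1, x2)]
alone, added = incremental_r2(y, x1, x2)
```

Note that the attribution depends on the order in which predictors are entered whenever they are correlated with each other; in the toy data they are uncorrelated, so the split is unambiguous.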
We conclude that phenotypic potential is not a mysterious “emergent” property of cellular networks. Instead, it is very simply determined by the overall fitness reduction of the organism (which in its compromised state can no longer compensate for multiple factors that contribute to phenotypic variation), and by the number (and presumably nature) of genetic interactions of the knocked-out gene. In this light, Hsp90, the prototypical phenotypic capacitor, may not be representative: typical phenotypic capacitors are not direct “buffers” of variation, but are simply genes encoding central cellular functions.
Pathogens have represented an important selective force during the adaptation of modern human populations to changing social and other environmental conditions. The evolution of the immune system has therefore been influenced by these pressures. Genomic scans have revealed that the immune system is one of the functions enriched with genes under adaptive selection.
Here, we describe how the innate immune system has responded to these challenges, through the analysis of resequencing data for 132 innate immunity genes in two human populations. Results are interpreted in the context of the functional and interaction networks defined by these genes. Nucleotide diversity is lower in the adaptors and modulators functional classes, and is negatively correlated with the centrality of the proteins within the interaction network. We also produced a list of candidate genes under positive or balancing selection in each population detected by neutrality tests and showed that some functional classes are preferential targets for selection.
We found evidence that the role of each gene in the network conditions its capacity to evolve, or evolvability: genes at the core of the network are more constrained, while adaptation mostly occurred at particular positions at the network edges. Interestingly, the functional classes containing most of the genes with signatures of balancing selection are involved in autoinflammatory and autoimmune diseases, suggesting a counterbalance between the beneficial and deleterious effects of the immune response.
Prokaryotic genomes are considered to be ‘wall-to-wall’ genomes, which consist largely of genes for proteins and structural RNAs, with only a small fraction of the genomic DNA allotted to intergenic regions, which are thought to typically contain regulatory signals. The majority of bacterial and archaeal genomes contain 6–14% non-coding DNA. Significant positive correlations were detected between the fraction of non-coding DNA and inter- and intra-operonic distances, suggesting that different classes of non-coding DNA evolve congruently. In contrast, no correlation was found between any of these characteristics of non-coding sequences and the number of genes or genome size. Thus, the non-coding regions and the gene sets in prokaryotes seem to evolve in different regimes. The evolution of non-coding regions appears to be determined primarily by the selective pressure to minimize the amount of non-functional DNA, while maintaining essential regulatory signals, because of which the content of non-coding DNA in different genomes is relatively uniform and intra- and inter-operonic non-coding regions evolve congruently. In contrast, the gene set is optimized for the particular environmental niche of the given microbe, which results in the lack of correlation between the gene number and the characteristics of non-coding regions.
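As an illustration of the quantity discussed above, the fraction of non-coding DNA can be computed from gene coordinates by merging overlapping gene intervals and subtracting the covered length from the genome length. This is a minimal sketch under stated assumptions: coordinates are half-open `(start, end)` pairs on a linear sequence, genes that wrap around the origin of a circular chromosome are not handled, and the function name and toy coordinates are hypothetical.

```python
def noncoding_fraction(genome_length, genes):
    """Fraction of the genome not covered by any gene interval.

    genes: iterable of (start, end) half-open intervals, possibly overlapping.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged interval before opening a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent gene: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 1 - covered / genome_length


# Hypothetical toy genome of 100 bp with two overlapping genes and one separate gene.
fraction = noncoding_fraction(100, [(0, 40), (30, 60), (70, 90)])
```

Merging before summing matters because overlapping genes would otherwise be counted twice, which would underestimate the non-coding fraction.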
An investigation of metabolic networks in E. coli and S. cerevisiae reveals that asymmetric protein interactions affect gene expression, the relative effect of gene-knockouts and genome evolution.
The relationships between proteins are often asymmetric: one protein (A) depends for its function on another protein (B), but the second protein does not depend on the first. In metabolic networks there are multiple pathways that converge into one central pathway. The enzymes in the converging pathways depend on the enzymes in the central pathway, but the enzymes in the latter do not depend on any specific enzyme in the converging pathways. Asymmetric relations are analogous to the “if->then” logical relation where A implies B, but B does not imply A (A->B).
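The "if->then" relation described above can be checked directly on presence/absence data. The sketch below assumes, hypothetically, that each enzyme is represented by a list of booleans recording whether it is active in each condition (or present in each genome); this representation and the function names are illustrative and are not taken from the study.

```python
def implies(a, b):
    """True if b is active in every condition in which a is active (a -> b)."""
    return all(b_active for a_active, b_active in zip(a, b) if a_active)


def classify_pair(a, b):
    """Classify the dependency between two activity profiles."""
    a_to_b = implies(a, b)
    b_to_a = implies(b, a)
    if a_to_b and b_to_a:
        return "symmetric"
    if a_to_b:
        return "A->B"  # A depends on B; B can occur without A
    if b_to_a:
        return "B->A"
    return "none"


# Hypothetical toy profiles over three conditions.
enzyme_a = [True, False, False]  # converging-pathway enzyme
enzyme_b = [True, True, False]   # central-pathway enzyme
relation = classify_pair(enzyme_a, enzyme_b)
```

In an A->B pair, the conditions in which only one of the two is active always show B alone, which is the pattern the asymmetry argument predicts for the independent gene.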
We show that the majority of relationships between enzymes in metabolic flux models of metabolism in Escherichia coli and Saccharomyces cerevisiae are asymmetric. We show furthermore that these asymmetric relationships are reflected in the expression of the genes encoding those enzymes, the effect of gene knockouts and the evolution of genomes. From the asymmetric relative dependency, one would expect that the gene that is relatively independent (B) can occur without the other dependent gene (A), but not the reverse. Indeed, when only one gene of an A->B pair is expressed, is essential, is present in a genome after an evolutionary gain or loss, it tends to be the independent gene (B). This bias is strongest for genes encoding proteins whose asymmetric relationship is evolutionarily conserved.
The asymmetric relations between proteins that arise from the system properties of metabolic networks affect gene expression, the relative effect of gene knockouts and genome evolution in a predictable manner.
Genome-wide studies in Saccharomyces cerevisiae concluded that the dominant determinant of protein evolutionary rates is expression level, where highly-expressed proteins evolve most slowly. To determine how this constraint affects the evolution of protein interactions, we directly measure evolutionary rates of protein interface, surface and core residues by structurally mapping domain interactions to yeast genomes. We find that mRNA level and protein abundance, though correlated, report on pressures affecting regions of proteins differently. Pressures proportional to mRNA level slow evolutionary rates of all structural regions and reduce the variability in rate differences between interfaces and other surfaces. In contrast, the evolutionary rate variation within a domain is less dependent on protein abundance. Distinct pressures may be associated primarily with the cost (mRNA level) and functional benefit (protein abundance) of protein production. Interfaces of proteins with low mRNA levels may have higher evolutionary flexibility, and could constitute the raw material for new functions.
With the completion of the whole genome sequence for many organisms, investigations into genomic structure have revealed that gene distribution is variable, and that genes with similar function or expression are located within clusters. This clustering suggests that there are evolutionary constraints that determine genome architecture. However, as most of the evidence for constraints on genome evolution comes from studies on yeast, it is unclear how much of this prior work can be extrapolated to mammalian genomes. Therefore, in this work we wished to examine the constraints on regions of the mammalian genome containing conserved gene clusters.
We first identified regions of the mouse genome with microsynteny conservation by comparing gene arrangement in the mouse genome to the human, rat, and dog genomes. We then asked if any particular gene types were found preferentially in conserved regions. We found a significant correlation between conserved microsynteny and the density of mouse orthologs of human disease genes, suggesting that disease genes are clustered in genomic regions of increased microsynteny conservation.
The correlation between microsynteny conservation and disease gene locations indicates that regions of the mouse genome with microsynteny conservation may contain undiscovered human disease genes. This study not only demonstrates that gene function constrains mammalian genome organization, but also identifies regions of the mouse genome that can be experimentally examined to produce mouse models of human disease.
Prediction models that use gene expression levels are now being proposed for personalized treatment of cancer, but building accurate models that are easy to interpret remains a challenge. In this paper, we describe an integrative clinical-genomic approach that combines both genomic pathway and clinical information. First, we summarize information from genes in each pathway using Supervised Principal Components (SPCA) to obtain pathway-based genomic predictors. Next, we build a prediction model based on clinical variables and pathway-based genomic predictors using Random Survival Forests (RSF). Our rationale for this two-stage procedure is that the underlying disease process may be influenced by environmental exposure (measured by clinical variables) and perturbations in different pathways (measured by pathway-based genomic variables), as well as their interactions. Using two cancer microarray datasets, we show that the pathway-based clinical-genomic model outperforms gene-based clinical-genomic models, with improved prediction accuracy and interpretability.
microarrays; gene expression; pathway analysis; survival prediction; random survival forests
The killer cell immunoglobulin-like receptors (KIR) interact with major histocompatibility complex (MHC) class I ligands to regulate the functions of natural killer cells and T cells. Like human leukocyte antigens class I, human KIR are highly variable and correlated with infection, autoimmunity, pregnancy syndromes, and transplantation outcome. Limiting the scope of KIR analysis is the low resolution, sensitivity, and speed of the established methods of KIR typing. In this study, we describe a first-generation single nucleotide polymorphism (SNP)-based method for typing the 17 human KIR genes and pseudogenes that uses analysis by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry. It is a high-throughput method that requires minute amounts of genomic DNA for discrimination of KIR genes with some allelic resolution. A study of 233 individuals shows that the results obtained by the SNP-based KIR/MALDI-TOF method are consistent with those obtained with the established sequence-specific oligonucleotide probe or sequence-specific polymerase chain reaction methods. The added sensitivity of the KIR/MALDI-TOF method allowed putative novel alleles of the KIR2DL1, KIR3DL1, KIR2DS5, and KIR2DL5 genes to be identified. Sequencing the KIR2DL5 variant proved it was a newly discovered allele, one that appears associated with Hispanic and Native American populations. This KIR/MALDI-TOF method of KIR typing should facilitate population and disease-association studies that improve knowledge of the immunological functions of KIR–MHC class I interactions.
KIR; HLA; MALDI-TOF; Genotyping; SNP
Fluctuations in protein abundance among single cells are primarily due to the inherent stochasticity of the transcription and translation processes; such stochasticity can often confer phenotypic heterogeneity among isogenic cells. It has been proposed that expression noise can be triggered as an adaptation to environmental stresses and genetic perturbations, and as a mechanism to facilitate gene expression evolution. Thus, elucidating the relationship between expression noise, measured at the single-cell level, and expression variation, measured on populations of cells, can improve our understanding of the variability and evolvability of gene expression. Here, we showed that noise levels are significantly correlated with conditional expression variations. We further demonstrated that expression variations are highly predictive of noise level, especially in TATA-box containing genes. Our results suggest that expression variability can serve as a proxy for noise level and that these two properties share the same underlying mechanism, e.g. chromatin regulation. Our work paves the way for the study of stochastic noise in other single-cell organisms.
Natural killer (NK) cells are circulating lymphocytes that function in innate immunity and placental reproduction. Regulating both development and function of NK cells is an array of variable and conserved receptors that interact with major histocompatibility complex (MHC) class I molecules. Families of lectin-like and immunoglobulin-like receptors are determined by genes in the natural killer (NKC) and leukocyte receptor (LRC) complexes, respectively. As a consequence of the strong, varying pressures on the immune and reproductive systems, NK cell receptors and their MHC class I ligands evolve rapidly, are highly diverse, and exhibit dramatic species-specific differences. The variable, polymorphic family of killer cell immunoglobulin-like receptors (KIR) that regulate human NK cell development and function evolved recently, from a single-copy gene during the evolution of simian primates. Our studies of KIR and MHC class I genes in representative species show how these two unlinked but functionally intertwined genetic complexes have co-evolved. In humans, combinations of KIR and HLA class I factors are associated with infectious diseases, including HIV/AIDS, autoimmunity, reproductive success and the outcome of therapeutic transplantation. The extraordinary, and unanticipated, divergence of human NK cell receptors and MHC class I ligands from their mouse counterparts can in part explain the difficulties experienced in finding informative mouse models for human diseases. Non-human primate models have far greater potential, but to realize their promise will first require more complete definition of the genetics and function of KIR and MHC variation in non-human primate species, at a level comparable to that achieved for the human species.
Non-human primates; NK cells; KIR; MHC; innate immunity
Phylostratigraphy is a method used to correlate the evolutionary origin of founder genes (that is, functional founder protein domains) of gene families with particular macroevolutionary transitions. It is based on a model of genome evolution that suggests that the origin of complex phenotypic innovations will be accompanied by the emergence of such founder genes, the descendants of which can still be traced in extant organisms. The origin of multicellularity can be considered to be a macroevolutionary transition, for which new gene functions would have been required. Cancer should be tightly connected to multicellular life since it can be viewed as a malfunction of interaction between cells in a multicellular organism. A phylostratigraphic tracking of the origin of cancer genes should, therefore, also provide insights into the origin of multicellularity.
We find two strong peaks of the emergence of cancer related protein domains, one at the time of the origin of the first cell and the other around the time of the evolution of the multicellular metazoan organisms. These peaks correlate with two major classes of cancer genes, the 'caretakers', which are involved in general functions that support genome stability and the 'gatekeepers', which are involved in cellular signalling and growth processes. Interestingly, this phylogenetic succession mirrors the ontogenetic succession of tumour progression, where mutations in caretakers are thought to precede mutations in gatekeepers.
A link between multicellularity and formation of cancer has often been predicted. However, this has not so far been explicitly tested. Although we find that a significant number of protein domains involved in cancer predate the origin of multicellularity, the second peak of cancer protein domain emergence is, indeed, connected to a phylogenetic level where multicellular animals have emerged. The fact that we can find a strong and consistent signal for this second peak in the phylostratigraphic map implies that a complex multi-level selection process has driven the transition to multicellularity.
Understanding the adaptive changes that alter the function of proteins during evolution is an important question for biology and medicine. The increasing number of completely sequenced genomes from closely related organisms, as well as individuals within species, facilitates systematic detection of recent selection events by means of comparative genomics.
We have used genome-wide strain-specific single nucleotide polymorphism data from 64 strains of budding yeast (Saccharomyces cerevisiae or Saccharomyces paradoxus) to determine whether adaptive positive selection is correlated with protein regions showing propensity for different classes of structure conformation. Data from phylogenetic and population genetic analysis of 3,746 gene alignments consistently shows a significantly higher degree of positive Darwinian selection in intrinsically disordered regions of proteins compared to regions of alpha helix, beta sheet or tertiary structure. Evidence of positive selection is significantly enriched in classes of proteins whose functions and molecular mechanisms can be coupled to adaptive processes and these classes tend to have a higher average content of intrinsically unstructured protein regions.
We suggest that intrinsically disordered protein regions may be important for the production and maintenance of genetic variation with adaptive potential and that they may thus be of central significance for the evolvability of the organism or cell in which they occur.
The rate of conservation of a gene in evolution is believed to be correlated with its biological importance. Recent studies have devised various conservation measures for genes and have shown that they are correlated with several biological characteristics of functional importance. Specifically, the state-of-the-art propensity for gene loss (PGL) measure was shown to be strongly correlated with a gene's essentiality and its number of protein–protein interactions (PPIs). The observed correlation between conservation and functional importance varies, however, between conservation measures, underscoring the need for accurate and general measures of the rate of gene conservation. Here we develop a novel maximum-likelihood approach to computing the rate at which a gene is lost in evolution, motivated by the same principles as those underlying PGL. However, in contrast to PGL, which considers only the most parsimonious ancestral states of the internal nodes of the phylogenetic tree relating the species, our approach weighs all possible ancestral states in a probabilistic manner and includes the branch length information as part of the probabilistic model. In application to data from 16 eukaryotic genomes, our approach shows higher correlations with experimental data than PGL, including data on gene lethality, level of connectivity in a PPI network and coherence within functionally related genes.
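The idea of weighing all ancestral states along a tree with branch lengths can be illustrated with a small sketch. It is not the authors' model: it assumes a loss-only process (a present gene is lost at a constant rate, and is never regained), assumes the gene was present at the root, and finds the rate by a simple grid search. The tree layout (dictionaries with hypothetical `name`, `length` and `children` keys) and all function names are illustrative assumptions.

```python
import math

def partial_likelihood(node, presence, rate):
    """Return (L0, L1): likelihood of the observed tips below `node`
    given that the gene is absent (0) or present (1) at `node`."""
    children = node.get("children", [])
    if not children:  # leaf: likelihood is 1 for the observed state, 0 otherwise
        return (1.0, 0.0) if presence[node["name"]] == 0 else (0.0, 1.0)
    l0, l1 = 1.0, 1.0
    for child in children:
        c0, c1 = partial_likelihood(child, presence, rate)
        keep = math.exp(-rate * child["length"])  # P(present -> present) on this branch
        l0 *= c0                                   # absent stays absent (no gain assumed)
        l1 *= keep * c1 + (1.0 - keep) * c0        # sum over both child states
    return l0, l1

def log_likelihood(tree, presence, rate):
    """Log-likelihood of a presence/absence pattern, root assumed present."""
    _, l1 = partial_likelihood(tree, presence, rate)
    return math.log(l1) if l1 > 0 else float("-inf")

def ml_loss_rate(tree, presence, lo=1e-6, hi=10.0, steps=2000):
    """Grid-search estimate of the loss rate that maximises the likelihood."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(grid, key=lambda r: log_likelihood(tree, presence, r))

# Toy two-species tree with unit branch lengths (made-up data).
toy_tree = {"children": [{"name": "A", "length": 1.0},
                         {"name": "B", "length": 1.0}]}
rate_mixed = ml_loss_rate(toy_tree, {"A": 1, "B": 0})
rate_kept = ml_loss_rate(toy_tree, {"A": 1, "B": 1})
```

For the toy tree the likelihood of "present in A, lost in B" is exp(-r)(1 - exp(-r)), so the estimate should fall near ln 2, while a gene retained in both species is assigned the smallest rate on the grid. A parsimony-based measure would instead count inferred loss events without using the branch lengths.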
Whole genome studies have highlighted duplicated genes as important substrates for adaptive evolution. We have investigated adaptive evolution in this class of genes in the human parasite Trypanosoma brucei, as indicated by the ratio of non-synonymous (amino-acid changing) to synonymous (amino acid retaining) nucleotide substitution rates.
We have identified duplicated genes that are most rapidly evolving in this important human parasite. This is the first attempt to investigate adaptive evolution in this species at the codon level. We identify 109 genes within 23 clusters of paralogous gene expansions to be subject to positive selection.
Genes identified include surface antigens in both the mammalian and insect host life cycle stages, suggesting that competitive interaction is not solely with the adaptive immune system of the mammalian host. Surface transporters related to drug resistance and genes related to developmental progression are also detected. We discuss how adaptive evolution of these genes may highlight lineage-specific processes essential for parasite survival. We also discuss the implications of adaptive evolution of these targets for parasite biology and control.
Protein kinase (PK) genes comprise the third largest superfamily, which occupies ∼2% of the human genome. They encode regulatory enzymes that control a vast variety of cellular processes through phosphorylation of their protein substrates. Expression of PK genes is subject to complex transcriptional regulation, which is not fully understood.
Our comparative analysis demonstrates that the genomic organization of regulatory PK genes differs from that of other protein coding genes. PK genes occupy larger genomic loci, have longer introns and spacer regions, and encode larger proteins. The primary transcript length of PK genes, similar to other protein coding genes, inversely correlates with gene expression level and expression breadth, which is likely due to the necessity to reduce metabolic costs of transcription for abundant messages. On average, PK genes evolve slower than other protein coding genes. Breadth of PK expression negatively correlates with the rate of non-synonymous substitutions in protein coding regions. This rate is lower for highly expressed and ubiquitous PKs than for PKs with low expression, and it correlates with divergence in untranslated regions. Conversely, the rate of silent mutations is uniform across PK groups, indicating that differing rates of non-synonymous substitutions reflect variations in selective pressure. Brain and testis employ a considerable number of tissue-specific PKs, indicating high complexity of the phosphorylation-dependent regulatory networks in these organs. There are considerable differences in genomic organization between PKs up-regulated in the testis and those up-regulated in the brain. PK genes up-regulated in the highly proliferative testicular tissue are fast evolving and small, with short introns and transcribed regions. In contrast, genes up-regulated in the minimally proliferative nervous tissue carry long introns and extended transcribed regions, and evolve slowly.
PK genomic architecture, the size of gene functional domains and evolutionary rates correlate with the pattern of gene expression. The structure and evolutionary divergence of tissue-specific PK genes are related to the proliferative activity of the tissue where these genes are predominantly expressed. Our data provide evidence that physiological requirements for transcription intensity, ubiquitous expression, and tissue-specific regulation shape gene structure and affect rates of evolution.
Genome-wide expression data of gene microarrays can be used to infer gene networks. At a cellular level, a gene network provides a picture of the modules in which genes are densely connected, and of the hub genes, which are highly connected with other genes. A gene network is useful to identify the genes involved in the same pathway, in a protein complex or that are co-regulated. In this study, we used different methods to find gene networks in the ciliate Tetrahymena thermophila, and describe some important properties of this network, such as modules and hubs.
Using 67 single-channel microarrays, we constructed the Tetrahymena gene network (TGN) using three methods: the Pearson correlation coefficient (PCC), the Spearman correlation coefficient (SCC) and the context likelihood of relatedness (CLR) algorithm. The accuracy and coverage of the three networks were evaluated using four conserved protein complexes in yeast. The CLR network with a Z-score threshold of 3.49 was determined to be the most robust. The TGN was partitioned, and 55 modules were found. In addition, analysis of the arbitrarily determined 1200 hubs showed that these hubs could be sorted into six groups according to their expression profiles. We also investigated human disease orthologs in Tetrahymena that are missing in yeast and provide evidence indicating that some of these are involved in the same process in Tetrahymena as in human.
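The way a CLR-style score turns pairwise similarities into a thresholded network can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: CLR is normally computed from mutual information, whereas this sketch substitutes the absolute Pearson correlation as the similarity (an assumption made to keep the example dependency-free), and the function names and toy profiles are hypothetical. Only the default threshold of 3.49 is taken from the text.

```python
import math
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def clr_scores(profiles):
    """CLR-style score per gene pair: each similarity is converted to a
    z-score against the background of each gene, negatives are clipped
    to zero, and the two z-scores are combined as sqrt(z_i^2 + z_j^2)."""
    genes = sorted(profiles)
    # Assumption: |Pearson r| stands in for mutual information.
    sim = {g: {h: abs(pearson(profiles[g], profiles[h]))
               for h in genes if h != g} for g in genes}
    background = {g: (mean(sim[g].values()), pstdev(sim[g].values()))
                  for g in genes}

    def z(g, h):
        mu, sd = background[g]
        return max(0.0, (sim[g][h] - mu) / sd) if sd > 0 else 0.0

    return {(g, h): math.sqrt(z(g, h) ** 2 + z(h, g) ** 2)
            for i, g in enumerate(genes) for h in genes[i + 1:]}

def clr_network(profiles, threshold=3.49):
    """Keep the gene pairs whose combined score reaches the threshold."""
    return {pair for pair, s in clr_scores(profiles).items() if s >= threshold}

# Toy expression profiles (made-up values, five arrays per gene).
toy_profiles = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 1, 4, 2, 3],
    "g4": [1, 3, 2, 5, 4],
}
toy_scores = clr_scores(toy_profiles)
```

With only four genes no pair can reach a score of 3.49, so `clr_network(toy_profiles)` is empty; a threshold of that size is only meaningful for genome-scale backgrounds. The point of the background correction is that a pair is scored relative to how strongly each gene correlates with everything else, which penalises genes that are loosely correlated with many partners.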
This study constructs a Tetrahymena gene network, provides new insights into the properties of this biological network, and presents an important resource for studying Tetrahymena genes at the pathway level.
Yeast transcription factors that are more central in the transcription network tend to evolve more quickly.
Transcription factors play a fundamental role in regulating physiological responses and developmental processes. Here we examine the evolution of the yeast transcription factors in the context of the structure of the gene regulatory network.
In contrast to previous results for the protein-protein interaction and metabolic networks, we find that the position of a gene within the transcription network affects the rate of protein evolution such that more central transcription factors tend to evolve faster. Centrality is also positively correlated with expression variability, suggesting that the higher rate of divergence among central transcription factors may be due to their role in controlling information flow and may be the result of adaptation to changing environmental conditions. Alternatively, more central transcription factors could be more buffered against environmental perturbations and, therefore, less subject to strong purifying selection. Importantly, the relationship between centrality and evolutionary rates is independent of expression level, expression variability and gene essentiality.
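One common way to check that a relationship such as centrality versus evolutionary rate is independent of a third variable such as expression level is a partial rank correlation. The sketch below is an assumption about how such a check could be set up, not the analysis reported here; the function names and the toy values are hypothetical.

```python
import math

def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def partial_spearman(x, y, z):
    """Rank correlation of x and y after removing the linear effect
    of the ranks of a control variable z."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return (rxy - rxz * ryz) / denom if denom > 0 else float("nan")

# Toy values (made up): centrality, evolutionary rate, expression level.
centrality = [1, 2, 3, 4, 5]
rate = [2, 1, 4, 3, 5]
expression = [1, 3, 2, 5, 4]
controlled = partial_spearman(centrality, rate, expression)
```

If the partial correlation stays close to the raw correlation, the control variable explains little of the association; if it shrinks towards zero, the association is largely accounted for by the control variable.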
Our analysis of the transcription network highlights the role of network structure in determining protein evolutionary rate. Further, the effect of network centrality on nucleotide divergence differs among the metabolic, protein-protein and transcriptional networks, suggesting that the effect of gene position is dependent on the function of the specific network under study. A better understanding of how these three cellular networks interact with one another may be needed to fully examine the impact of network structure on the function and evolution of biological systems.
The identification of sequence innovations in the genomes of mammals facilitates understanding of human gene function and sheds light on the molecular mechanisms that underlie these changes. Although gene duplication plays a major role in genome evolution, studies of concerted evolution events among gene family members have been limited in scope and restricted to protein-coding regions, where high sequence similarity is easily detectable.
We describe a mammalian-specific expansion of more than 20 rapidly-evolving genes on human chromosome Xq22.1. Many of these are highly divergent in their protein-coding regions yet contain a conserved sequence motif in their 5' UTRs which appears to have been maintained by multiple events of concerted evolution. These events have led to the generation of chimaeric genes, each with a 5' UTR and a protein-coding region that possess independent evolutionary histories. We suggest that concerted evolution has occurred via gene conversion independently in different mammalian lineages, and these events have resulted in elevated G+C levels in the encompassing genomic regions. These concerted evolution events occurred within and between genes from three separate protein families ('brain-expressed X-linked' [BEX], WWbp5-like X-linked [WEX] and G-protein-coupled receptor-associated sorting protein [GASP]), which often are expressed in mammalian brains and associated with receptor mediated signalling and apoptosis.
Despite high protein-coding divergence among mammalian-specific genes, we identified a DNA motif common to these genes' 5' UTR exons. The motif has undergone concerted evolution events independently of its neighbouring protein-coding regions, leading to formation of evolutionary chimaeric genes. These findings have implications for the identification of non protein-coding regulatory elements and their lineage-specific evolution in mammals.