The present study reports metagenomic shotgun sequencing of microbial communities of two ancient permafrost horizons of the Russian Arctic. Results demonstrate a significant difference in microbial community structure of the analyzed samples in general and microorganisms of the methane cycle in particular.
Recombination between double-stranded DNA molecules is a key genetic process which occurs in a wide variety of organisms. Usually, crossing-over (CO) occurs during meiosis between genotypes with 98.0–99.9% sequence identity, because within-population nucleotide diversity only rarely exceeds 2%. However, some species are hypervariable and it is unclear how CO can occur between genotypes with less than 90% sequence identity. Here, we study CO in Schizophyllum commune, a hypervariable cosmopolitan basidiomycete mushroom, a frequently encountered decayer of woody substrates. We crossed two haploid individuals, from the United States and from Russia, and obtained genome sequences for their 17 offspring. The average genetic distance between the parents was 14%, making it possible to study CO at very high resolution. We found reduced levels of linkage disequilibrium between loci flanking the CO sites indicating that they are mostly confined to hotspots of recombination. Furthermore, CO events preferentially occurred in regions under stronger negative selection, in particular within exons that showed reduced levels of nucleotide diversity. Apparently, in hypervariable species CO must avoid regions of higher divergence between the recombining genomes due to limitations imposed by the mismatch repair system, with regions under strong negative selection providing the opportunity for recombination. These patterns are opposite to those observed in a number of less variable species indicating that population genomics of hypervariable species may reveal novel biological phenomena.
S. commune; recombination; negative selection; hyperpolymorphic
Rapid divergence of gene copies after duplication is thought to determine the fate of the copies and evolution of novel protein functions. However, data on how long the gene copies continue to experience an elevated rate of evolution remain scarce. Standard theory of gene duplications based on some level of genetic redundancy of gene copies predicts that the period of accelerated evolution must end relatively quickly. Using a maximum-likelihood approach we estimate preduplication, initial postduplication, and recent postduplication rates of evolution that occurred in the mammalian lineage. We find that both gene copies experience an acceleration of similar magnitude in their rate of evolution. The copy located in the original genomic position typically returns to the preduplication rates of evolution in a short period of time. The burst of faster evolution of the copy that is located in a new genomic position typically lasts longer. Furthermore, the fast-evolving copies on average continue to evolve faster than the preduplication rates far longer than predicted by standard theory of gene duplications. We hypothesize that the prolonged elevated rates of evolution are determined by functional properties that were acquired during, or soon after, the gene duplication event.
gene duplication; evolution; selection
• We identify four different approaches that are used to study compensatory evolution.
• We classify different types of amino acid interactions found in protein structures.
• A consensus on the importance of structural interactions in evolution is lacking.
• The role of specific/general and local/allosteric compensations is required.
• The study of compensatory evolution advances both structural and evolutionary biology.
The study of molecular evolution is important because it reveals how protein functions emerge and evolve. Recently, several types of studies indicated that substitutions in molecular evolution occur in a compensatory manner, whereby the occurrence of a substitution depends on the amino acid residues at other sites. However, a molecular or structural basis behind the compensation often remains obscure. Here, we review studies on the interface of structural biology and molecular evolution that revealed novel aspects of compensatory evolution. In many cases structural studies benefit from evolutionary data while structural data often add a functional dimension to the study of molecular evolution.
Genetic variation at the melanocortin-1 receptor (MC1R) gene is correlated with melanin color variation in many birds. Feral pigeons (Columba livia) show two major melanin-based colorations: a red coloration due to pheomelanic pigment and a black coloration due to eumelanic pigment. Furthermore, within each color type, feral pigeons display continuous variation in the amount of melanin pigment present in the feathers, with individuals varying from pure white to a full dark melanic color. Coloration is highly heritable and it has been suggested that it is under natural or sexual selection, or both. Our objective was to investigate whether MC1R allelic variants are associated with plumage color in feral pigeons.
We sequenced 888 bp of the coding sequence of MC1R among pigeons varying both in the type, eumelanin or pheomelanin, and the amount of melanin in their feathers. We detected 10 non-synonymous substitutions and 2 synonymous substitutions, but none of them were associated with a plumage type. It remains possible that non-synonymous substitutions that influence coloration are present in the short MC1R fragment that we did not sequence, but this seems unlikely because we analyzed the entire functionally important region of the gene.
Our results show that color differences among feral pigeons are probably not attributable to amino acid variation at the MC1R locus. Therefore, variation in regulatory regions of MC1R or variation in other genes may be responsible for the color polymorphism of feral pigeons.
Columba livia; Feral pigeon; MC1R; Pigmentation; Color; Birds
Whether or not evolutionary change is inherently irreversible remains a controversial
topic. Some examples of evolutionary irreversibility are known; however, this question has
not been comprehensively addressed at the molecular level. Here, we use data from 221
human genes with known pathogenic mutations to estimate the rate of irreversibility in
protein evolution. For these genes, we reconstruct ancestral amino acid sequences along
the mammalian phylogeny and identify ancestral amino acid states that match known
pathogenic mutations. Such cases represent inherent evolutionary irreversibility because,
at the present moment, reversals to these ancestral amino acid states are impossible for
the human lineage. We estimate that approximately 10% of all amino acid
substitutions along the mammalian phylogeny are irreversible, such that a return to the
ancestral amino acid state would lead to a pathogenic phenotype. For a subset of 51 genes
with high rates of irreversibility, as much as 40% of all amino acid evolution was
estimated to be irreversible. Because pathogenic phenotypes do not resemble ancestral
phenotypes, the molecular nature of the high rate of irreversibility in proteins is best
explained by evolution with a high prevalence of compensatory, epistatic interactions
between amino acid sites. Under such mode of protein evolution, once an amino acid
substitution is fixed, the probability of its reversal declines as the protein sequence
accumulates changes that affect the phenotypic manifestation of the ancestral state. The
prevalence of epistasis in evolution indicates that the observed high rate of
irreversibility in protein evolution is an inherent property of protein structure and
irreversibility; epistasis; genetic diseases; protein evolution; ancestral state reconstruction
A subject of extensive study in evolutionary theory has been the issue of how neutral, redundant copies can be maintained in the genome for long periods of time. Concurrently, examples of adaptive gene duplications to various environmental conditions in different species have been described. At this point, it is too early to tell whether or not a substantial fraction of gene copies have initially achieved fixation by positive selection for increased dosage. Nevertheless, enough examples have accumulated in the literature that such a possibility should be considered. Here, I review the recent examples of adaptive gene duplications and make an attempt to draw generalizations on what types of genes may be particularly prone to be selected for under certain environmental conditions. The identification of copy-number variation in ecological field studies of species adapting to stressful or novel environmental conditions may improve our understanding of gene duplications as a mechanism of adaptation and its relevance to the long-term persistence of gene duplications.
gene duplication; environmental adaptation; genomics
The evolution of stop codons and their genomic frequencies have not been rigorously studied, except in the context of coding for non-canonical amino acids. Here we study the rate of evolution and frequency distribution of stop codons in bacterial genomes.
We show that in bacteria stop codons evolve slower than synonymous sites, suggesting the action of weak negative selection. However, the frequency of stop codons relative to genomic nucleotide content indicated that this selection regime is not straightforward. The frequency of TAA and TGA stop codons is GC-content dependent, with TAA decreasing and TGA increasing with GC-content, while TAG frequency is independent of GC-content. Applying a formal, analytical model to these data we found that the relationship between stop codon frequencies and nucleotide content cannot be explained by mutational biases or selection on nucleotide content. However, with weak nucleotide content-dependent selection on TAG, -0.5 < Nes < 1.5, the model fits all of the data and recapitulates the relationship between TAG and nucleotide content. For biologically plausible rates of mutations we show that, in bacteria, the TAG stop codon is universally associated with lower fitness, with TAA being optimal for G-content < 16% while for G-content > 16% TGA has a higher fitness than TAG.
Our data indicate that the TAG codon is universally suboptimal in the bacterial lineage, such that TAA is likely to be the preferred stop codon for low GC content while TGA is the preferred stop codon for high GC content. The optimization of stop codon usage may therefore be useful in genome engineering or gene expression optimization applications.
This article was reviewed by Michail Gelfand, Arcady Mushegian and Shamil Sunyaev. For the full reviews, please go to the Reviewers’ Comments section.
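The kind of tabulation behind the stop codon frequencies described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`gc_content`, `stop_codon_frequencies`) and the toy sequences are hypothetical, and the sketch assumes each input is an in-frame coding sequence that ends with its stop codon.

```python
from collections import Counter

STOP_CODONS = ("TAA", "TAG", "TGA")

def gc_content(seq):
    """Fraction of G and C among the A/C/G/T characters of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def stop_codon_frequencies(coding_sequences):
    """Relative frequency of each stop codon among the terminal codons.

    Assumption: every sequence is an in-frame CDS ending with its stop codon.
    Sequences whose last three bases are not a standard stop are skipped.
    """
    counts = Counter()
    for cds in coding_sequences:
        last = cds.upper()[-3:]
        if last in STOP_CODONS:
            counts[last] += 1
    total = sum(counts.values())
    return {codon: (counts[codon] / total if total else 0.0)
            for codon in STOP_CODONS}

# Toy data, invented for illustration only.
toy_cds = ["ATGAAATAA", "ATGGGCTGA", "ATGCCCTGA", "ATGTTTTAG"]
freqs = stop_codon_frequencies(toy_cds)
genome_gc = gc_content("".join(toy_cds))
```

Applied per genome, pairs of `genome_gc` and `freqs` would give the raw points for a frequency-versus-GC-content comparison; the selection model mentioned in the abstract is not reproduced here.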
The most common form of protein-coding gene overlap in eukaryotes is a simple nested structure, whereby one gene is embedded in an intron of another. Analysis of nested protein-coding genes in vertebrates, fruit flies and nematodes revealed substantially higher rates of evolutionary gains than losses. The accumulation of nested gene structures could not be attributed to any obvious functional relationships between the genes involved and represents an increase of the organizational complexity of animal genomes via a neutral process.
The rate of spontaneous mutation in natural populations is a fundamental parameter for many evolutionary phenomena. Because the rate of mutation is generally low, most of what is currently known about mutation has been obtained through indirect, complex and imprecise methodological approaches. However, in the past few years genome-wide sequencing of closely related individuals has made it possible to estimate the rates of mutation directly at the level of the DNA, avoiding most of the problems associated with using indirect methods. Here, we review the methods used in the past with an emphasis on next generation sequencing, which may soon make the accurate measurement of spontaneous mutation rates a matter of routine.
mutation; sequencing; estimating mutation rate; mutation accumulation
Recent discovery of the Large-billed Reed Warbler (Acrocephalus orinus) in museums and in the wild significantly expanded our knowledge of its morphological traits and genetic variability, and revealed new data on geographical distribution of the breeding grounds, migration routes and wintering locations of this species. It is now certain that A. orinus is breeding in Central Asia; however, the precise area of distribution remains unclear. The difficulty in the further study of this species lies in the small number of known specimens, with only 13 currently available in museums, and in the relative uncertainty of the breeding area and habitat of this species. Following morphological and genetic analyses from Svensson et al., we describe 14 new A. orinus specimens from collections of Zoological Museums of the former USSR from the territory of Central Asian states. All of these specimens were erroneously labeled as Blyth's Reed Warbler (A. dumetorum), which is thought to be a breeding species in these areas. The 14 new A. orinus specimens were collected during breeding season while most of the 85 A. dumetorum specimens from the same area were collected during the migration period. Our data indicate that the Central Asian territory previously attributed as breeding grounds of A. dumetorum is likely to constitute the breeding territory of A. orinus. This rare case of a re-description of the breeding territory of a lost species emphasizes the importance of maintenance of museum collections around the world. If the present data on the breeding grounds of A. orinus are confirmed with field observations and collections, the literature on the biology of A. dumetorum from the southern part of its range may have to be reconsidered.
Surveying deleterious variation in human populations is crucial for our understanding, diagnosis and potential treatment of human genetic pathologies. A number of recent genome-wide analyses focused on the prevalence of segregating deleterious alleles in the nuclear genome. However, such studies have not been conducted for the mitochondrial genome.
We present a systematic survey of polymorphisms in the human mitochondrial genome, including those predicted to be deleterious and those that correspond to known pathogenic mutations. Analyzing 4458 completely sequenced mitochondrial genomes we characterize the genetic diversity of different types of single nucleotide polymorphisms (SNPs) in African (L haplotypes) and non-African (M and N haplotypes) populations. We find that the overall level of polymorphism is higher in the mitochondrial compared to the nuclear genome, although the mitochondrial genome appears to be under stronger selection as indicated by proportionally fewer nonsynonymous than synonymous substitutions. The African mitochondrial genomes show higher heterozygosity, a greater number of polymorphic sites and higher frequencies of synonymous, benign and damaging polymorphisms than non-African genomes. However, African genomes carry significantly fewer SNPs that have been previously characterized as pathogenic compared to non-African genomes.
Finding SNPs classified as pathogenic to be the only category of polymorphisms that are more abundant in non-African genomes is best explained by a systematic ascertainment bias that favours the discovery of pathogenic polymorphisms segregating in non-African populations. This further suggests that, contrary to the common disease-common variant hypothesis, pathogenic mutations are largely population-specific and different SNPs may be associated with the same disease in different populations. Therefore, to obtain a comprehensive picture of the deleterious variability in the human population, as well as to improve the diagnostics of individuals carrying African mitochondrial haplotypes, it is necessary to survey different populations independently.
This article was reviewed by Dr Mikhail Gelfand, Dr Vasily Ramensky (nominated by Dr Eugene Koonin) and Dr David Rand (nominated by Dr Laurence Hurst).
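Two of the diversity summaries mentioned above, the number of polymorphic sites and the average pairwise difference per site, can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis used in the study: the function names and the toy alignment are hypothetical, and the code assumes at least two gap-free aligned sequences of equal length.

```python
from itertools import combinations

def segregating_sites(alignment):
    """Number of alignment columns with more than one distinct character."""
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)

def nucleotide_diversity(alignment):
    """Mean number of pairwise differences per site.

    Assumption: at least two sequences, equal length, no gaps.
    """
    length = len(alignment[0])
    pairs = list(combinations(alignment, 2))
    differences = sum(
        sum(x != y for x, y in zip(first, second))
        for first, second in pairs
    )
    return differences / (len(pairs) * length)

# Toy alignment, invented for illustration only.
toy_alignment = ["ACGT", "ACGA", "ACCA"]
n_polymorphic = segregating_sites(toy_alignment)
pi = nucleotide_diversity(toy_alignment)
```

Computing these statistics separately for two groups of genomes (for example, by haplogroup) and for separate SNP categories would give the type of comparison described in the abstract.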
Although some data link archaeal and eukaryotic translation, the overall mechanism of protein synthesis in archaea remains largely obscure. Both archaeal (aRF1) and eukaryotic (eRF1) single release factors recognize all three stop codons. The archaeal family Methanosarcinaceae contains two aRF1 homologs, and also uses the UAG stop to encode the 22nd amino acid, pyrrolysine. Here we provide an analysis of the last stage of archaeal translation in pyrrolysine-utilizing species. We demonstrated that only one of two Methanosarcina barkeri aRF1 homologs possesses activity and recognizes all three stop codons. The second aRF1 homolog may have another unknown function. The mechanism of pyrrolysine incorporation in the Methanosarcinaceae is discussed.
Translation termination; Archaeon; Pyrrolysine-utilizing archaea; aRF1; Polypeptide release factor
Divergence of two independently evolving sequences that originated from a common ancestor can be described by two parameters, the asymptotic level of divergence E and the rate r at which this level of divergence is approached. Constant negative selection impedes allele replacements and, therefore, is routinely assumed to decelerate sequence divergence. However, its impact on E and on r has not been formally investigated.
Strong selection that favors only one allele can make E arbitrarily small and r arbitrarily large. In contrast, in the case of 4 possible alleles and equal mutation rates, the lowest value of r, attained when two alleles confer equal fitnesses and the other two are strongly deleterious, is only two times lower than its value under selective neutrality.
Constant selection can strongly constrain the level of sequence divergence, but cannot reduce substantially the rate at which this level is approached. In particular, under any constant selection the divergence of sequences that accumulated one substitution per neutral site since their origin from the common ancestor must already constitute at least one half of the asymptotic divergence at sites under such selection.
This article was reviewed by Drs. Nicolas Galtier, Sergei Maslov, and Nick Grishin.
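The factor-of-two statement above can be checked with a small Python sketch under simplifying assumptions that are ours, not necessarily the paper's: all permitted alleles are equally fit and interconvert at the same rate `a` per ordered pair, and strongly deleterious alleles never fix, so they are simply left out of the model. In that symmetric case the expected divergence between two lineages that each evolved for time t is E(1 - exp(-rt)) with E = 1 - 1/n and r = 2na, where n is the number of permitted alleles. The function name `divergence` is hypothetical.

```python
import math

def divergence(n_alleles, a, t):
    """Expected divergence between two lineages, each evolving for time t.

    Assumption: n_alleles equally fit alleles, each changing to any other
    at rate a. Returns (divergence at time t, asymptote E, rate r).
    """
    E = 1.0 - 1.0 / n_alleles      # asymptotic level of divergence
    r = 2.0 * n_alleles * a        # rate of approach (two lineages)
    return E * (1.0 - math.exp(-r * t)), E, r

a = 1.0  # arbitrary rate unit

# Selective neutrality: all four alleles permitted.
_, E_neutral, r_neutral = divergence(4, a, 0.1)

# Two equally fit alleles; the two strongly deleterious ones are excluded.
_, E_selected, r_selected = divergence(2, a, 0.1)

rate_ratio = r_neutral / r_selected
```

Here `E_selected` is lower than `E_neutral` (0.5 versus 0.75), while `r_selected` is exactly half of `r_neutral`, matching the contrast drawn in the text between the effect of selection on the level of divergence and on the rate of approach.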
Mutation rate varies greatly between nucleotide sites of the human genome and depends both on the global genomic location and the local sequence context of a site. In particular, CpG context elevates the mutation rate by an order of magnitude. Mutations also vary widely in their effect on the molecular function, phenotype, and fitness. Independence between the probability of occurrence of a new mutation and its effect has been a fundamental premise in genetics. However, highly mutable contexts may be preserved by negative selection at important sites but destroyed by mutation at sites under no selection. Thus, there may be a positive correlation between the rate of mutations at a nucleotide site and the magnitude of their effect on fitness. We studied the impact of CpG context on the rate of human–chimpanzee divergence and on intrahuman nucleotide diversity at non-synonymous coding sites. We compared nucleotides that occupy identical positions within codons of identical amino acids and only differ by being within versus outside CpG context. Nucleotides within CpG context are under a stronger negative selection, as revealed by their lower, proportionally to the mutation rate, rate of evolution and nucleotide diversity. In particular, the probability of fixation of a non-synonymous transition at a CpG site is two times lower than at a non-CpG site. Thus, sites with different mutation rates are not necessarily selectively equivalent. This suggests that the mutation rate may complement sequence conservation as a characteristic predictive of functional importance of nucleotide sites.
Mutations occur in some sites in the genome more frequently than in others. Similarly, mutations in some sites have greater consequences than in others. The effect of mutations might not be independent of the frequency with which mutations occur. Indeed, sites where mutations happen frequently will be preserved if the effects of these mutations are severe or will otherwise be allowed to mutate if there are no consequences for the organism. We compared both human–chimpanzee differences and sequence variation among humans in protein coding genes. We found that highly mutable nucleotide sites, such as the dinucleotide CpG, are on average more important and more frequently preserved by natural selection. Using this information, together with other features such as sequence conservation, opens a new perspective to predict the effect of human mutations, including their potential involvement in diseases.
The arginine vasopressin V1a receptor (V1aR) modulates social cognition and behavior in a wide variety of species. Variation in a repetitive microsatellite element in the 5' flanking region of the V1aR gene (AVPR1A) in rodents has been associated with variation in brain V1aR expression and in social behavior. In humans, the 5' flanking region of AVPR1A contains a tandem duplication of two ~350 bp, microsatellite-containing elements located approximately 3.5 kb upstream of the transcription start site. The first block, referred to as DupA, contains a polymorphic (GT)25 microsatellite; the second block, DupB, has a complex (CT)4-(TT)-(CT)8-(GT)24 polymorphic motif, known as RS3. Polymorphisms in RS3 have been associated with variation in sociobehavioral traits in humans, including autism spectrum disorders. Thus, evolution of these regions may have contributed to variation in social behavior in primates. We examined the structure of these regions in six ape, six monkey, and one prosimian species.
Both tandem repeat blocks are present upstream of the AVPR1A coding region in five of the ape species we investigated, while monkeys have only one copy of this region. As in humans, the microsatellites within DupA and DupB are polymorphic in many primate species. Furthermore, both single (lacking DupB) and duplicated alleles (containing both DupA and DupB) are present in chimpanzee (Pan troglodytes) populations with allele frequencies of 0.795 and 0.205 for the single and duplicated alleles, respectively, based on the analysis of 47 wild-caught individuals. Finally, a phylogenetic reconstruction suggests two alternate evolutionary histories for this locus.
There is no obvious relationship between the presence of the RS3 duplication and social organization in primates. However, polymorphisms identified in some species may be useful in future genetic association studies. In particular, the presence of both single and duplicated alleles in chimpanzees provides a unique opportunity to assess the functional role of this duplication in contributing to variation in social behavior in primates. While our initial studies show no signs of directional selection on this locus in chimps, pharmacological and genetic association studies support a potential role for this region in influencing V1aR expression and social behavior.
Mitochondrial tRNAs have been studied by structural biologists interested in their secondary structure characteristics, by evolutionary biologists researching patterns of compensatory and structural evolution, and in medical studies directed towards understanding the basis of human disease. However, an up-to-date, manually curated database of mitochondrially encoded tRNAs from higher animals is currently not available.
We obtained the complete mitochondrial sequence for 277 tetrapod species from GenBank and re-annotated all of the tRNAs based on a multiple alignment of each tRNA gene and secondary structure prediction made independently for each tRNA. The mitochondrial (mt) tRNA sequences and the secondary structure based multiple alignments are freely available as Supplemental Information online.
We compiled a manually curated database of mitochondrially encoded tRNAs from tetrapods with completely sequenced genomes. In the course of our work, we reannotated more than 10% of all tetrapod mt-tRNAs and subsequently predicted the secondary structures of 6060 mitochondrial tRNAs. This carefully constructed database can be utilized to enhance our knowledge in several different fields including the evolution of mt-tRNA secondary structure and prediction of pathogenic mt-tRNA mutations. In addition, researchers reporting novel mitochondrial genome sequences should check their tRNA gene annotations against our database to ensure a higher level of fidelity of their annotation.
Independently evolving lineages mostly accumulate different changes, which leads to their gradual divergence. However, parallel accumulation of identical changes is also common, especially in traits with only a small number of possible states.
We characterize parallelism in evolution of coding sequences in three four-species sets of genomes of mammals, Drosophila, and yeasts. Each such set contains two independent evolutionary paths, which we call paths I and II. An amino acid replacement which occurred along path I also occurs along path II with the probability 50–80% of that expected under selective neutrality. Thus, the per site rate of parallel evolution of proteins is several times higher than their average rate of evolution, but still lower than the rate of evolution of neutral sequences. This deficit may be caused by changes in the fitness landscape, leading to a replacement being possible along path I but not along path II. However, constant, weak selection assumed by the nearly neutral model of evolution appears to be a more likely explanation. Then, the average coefficient of selection associated with an amino acid replacement, in the units of the effective population size, must exceed ~0.4, and the fraction of effectively neutral replacements must be below ~30%. At a majority of evolvable amino acid sites, only a relatively small number of different amino acids is permitted.
High, but below-neutral, rates of parallel amino acid replacements suggest that a majority of amino acid replacements that occur in evolution are subject to weak, but non-trivial, selection, as predicted by Ohta's nearly-neutral theory.
This article was reviewed by John McDonald (nominated by Laura Landweber), Sarah Teichmann and Subhajyoti De, and Chris Adami.
The glyoxylate cycle is thought to be present in bacteria, protists, plants, fungi, and nematodes, but not in other Metazoa. However, activity of the glyoxylate cycle enzymes, malate synthase (MS) and isocitrate lyase (ICL), in animal tissues has been reported. In order to clarify the status of the MS and ICL genes in animals and get an insight into their evolution, we undertook a comparative-genomic study.
Using sequence similarity searches, we identified MS genes in arthropods, echinoderms, and vertebrates, including platypus and opossum, but not in the numerous sequenced genomes of placental mammals. The regions of the placental mammals' genomes expected to code for malate synthase, as determined by comparison of the gene orders in vertebrate genomes, show clear similarity to the opossum MS sequence but contain stop codons, indicating that the MS gene became a pseudogene in placental mammals. By contrast, the ICL gene is undetectable in animals other than the nematodes that possess a bifunctional, fused ICL-MS gene. Examination of phylogenetic trees of MS and ICL suggests multiple horizontal gene transfer events that probably went in both directions between several bacterial and eukaryotic lineages. The strongest evidence was obtained for the acquisition of the bifunctional ICL-MS gene from an as yet unknown bacterial source with the corresponding operonic organization by the common ancestor of the nematodes.
The distribution of the MS and ICL genes in animals suggests that either they encode alternative enzymes of the glyoxylate cycle that are not orthologous to the known MS and ICL or the animal MS acquired a new function that remains to be characterized. Regardless of the ultimate solution to this conundrum, the genes for the glyoxylate cycle enzymes present a remarkable variety of evolutionary events including unusual horizontal gene transfer from bacteria to animals.
This article was reviewed by Arcady Mushegian (Stowers Institute for Medical Research), Andrey Osterman (Burnham Institute for Medical Research) and Chris Ponting (Oxford University).
Carcinogenesis typically involves multiple somatic mutations in caretaker (DNA repair) and gatekeeper (tumor suppressors and oncogenes) genes. Analysis of mutation spectra of the tumor suppressor that is most commonly mutated in human cancers, p53, unexpectedly suggested that somatic evolution of the p53 gene during tumorigenesis is dominated by positive selection for gain of function. This conclusion is supported by accumulating experimental evidence of evolution of new functions of p53 in tumors. These findings prompted a genome-wide analysis of possible positive selection during tumor evolution.
A comprehensive analysis of probable somatic mutations in the sequences of Expressed Sequence Tags (ESTs) from malignant tumors and normal tissues was performed in order to assess the prevalence of positive selection in cancer evolution. For each EST, the numbers of synonymous and non-synonymous substitutions were calculated. In order to identify genes with a signature of positive selection in cancers, these numbers were compared to: i) expected numbers and ii) the numbers for the respective genes in the ESTs from normal tissues.
We identified 112 genes with a signature of positive selection in cancers, i.e., a significantly elevated ratio of non-synonymous to synonymous substitutions, in tumors as compared to 37 such genes in an approximately equal-sized EST collection from normal tissues. A substantial fraction of the tumor-specific positive-selection candidates have experimentally demonstrated or strongly predicted links to cancer.
The results of EST analysis should be interpreted with extreme caution given the noise introduced by sequencing errors and undetected polymorphisms. Furthermore, an inherent limitation of EST analysis is that multiple mutations amenable to statistical analysis can be detected only in relatively highly expressed genes. Nevertheless, the present results suggest that positive selection might affect a substantial number of genes during tumorigenic somatic evolution.
Phylogenetic relationships between the extinct woolly mammoth (Mammuthus primigenius), and the Asian (Elephas maximus) and African savanna (Loxodonta africana) elephants remain unresolved. Here, we report the sequence of the complete mitochondrial genome (16,842 base pairs) of a woolly mammoth extracted from permafrost-preserved remains from the Pleistocene epoch, the oldest mitochondrial genome sequence determined to date. We demonstrate that well-preserved mitochondrial genome fragments, as long as ~1,600–1,700 base pairs, can be retrieved from pre-Holocene remains of an extinct species. Phylogenetic reconstruction of the Elephantinae clade suggests that M. primigenius and E. maximus are sister species that diverged soon after their common ancestor split from the L. africana lineage. Low nucleotide diversity found between independently determined mitochondrial genomic sequences of woolly mammoths separated geographically and in time suggests that north-eastern Siberia was occupied by a relatively homogeneous population of M. primigenius throughout the late Pleistocene.
The complete mitochondrial genome of a 33,000-year-old Mammoth confirms its close relationship with the Asian elephant and suggests that genetic diversity among mammoths in Siberia during the late Pleistocene was low
Only a fraction of eukaryotic genes affect the phenotype drastically. We compared 18 parameters in 1273 human morbid genes, known to cause diseases, and in the remaining 16 580 unambiguous human genes. Morbid genes evolve more slowly, have wider phylogenetic distributions, are more similar to essential genes of Drosophila melanogaster, code for longer proteins containing more alanine and glycine and less histidine, lysine and methionine, possess larger numbers of longer introns with more accurate splicing signals and have higher and broader expressions. These differences make it possible to classify as non-morbid 34% of human genes with unknown morbidity, when only 5% of known morbid genes are incorrectly classified as non-morbid. This classification can help to identify disease-causing genes among multiple candidates.
Gene duplications have a major role in the evolution of new biological functions. Theoretical studies often assume that a duplication per se is selectively neutral and that, following a duplication, one of the gene copies is freed from purifying (stabilizing) selection, which creates the potential for evolution of a new function.
In search of systematic evidence of accelerated evolution after duplication, we used data from 26 bacterial, six archaeal, and seven eukaryotic genomes to compare the mode and strength of selection acting on recently duplicated genes (paralogs) and on similarly diverged, unduplicated orthologous genes in different species. We find that the ratio of nonsynonymous to synonymous substitutions (Kn/Ks) in most paralogous pairs is <<1 and that paralogs typically evolve at similar rates, without significant asymmetry, indicating that both paralogs produced by a duplication are subject to purifying selection. This selection is, however, substantially weaker than the purifying selection affecting unduplicated orthologs that have diverged to the same extent as the analyzed paralogs. Most of the recently duplicated genes appear to be involved in various forms of environmental response; in particular, many of them encode membrane and secreted proteins.
The results of this analysis indicate that recently duplicated paralogs evolve faster than orthologs with the same level of divergence and similar functions, but apparently do not experience a phase of neutral evolution. We hypothesize that gene duplications that persist in an evolving lineage are beneficial from the time of their origin, due primarily to a protein dosage effect in response to variable environmental conditions; duplications are likely to give rise to new functions at a later phase of their evolution once a higher level of divergence is reached.
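The comparison described above can be illustrated with a small sketch. It assumes that Kn and Ks have already been estimated for each gene pair by some external method, which is not shown here. The function names and the toy numbers are hypothetical and do not come from the study.

```python
from statistics import median


def kn_ks_ratio(kn, ks):
    """Ratio of nonsynonymous to synonymous substitutions per site."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return kn / ks


def median_ratio(pairs):
    """Median Kn/Ks over an iterable of (Kn, Ks) estimates."""
    return median(kn_ks_ratio(kn, ks) for kn, ks in pairs)


# Toy (Kn, Ks) estimates, invented for illustration only.
# Both sets share the same Ks, i.e. a similar level of divergence.
paralog_pairs = [(0.04, 0.10), (0.05, 0.10), (0.03, 0.10)]
ortholog_pairs = [(0.01, 0.10), (0.02, 0.10), (0.015, 0.10)]

print("median Kn/Ks, paralogs: ", median_ratio(paralog_pairs))
print("median Kn/Ks, orthologs:", median_ratio(ortholog_pairs))
```

In this toy setup both medians are well below 1, consistent with purifying selection on both classes, while the higher value for paralogs corresponds to weaker constraint. A real analysis would also need a statistical test and a matched set of orthologs.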
Detection of changes in a protein's evolutionary rate may reveal cases of change in that protein's function. We developed and implemented a simple relative rates test in an attempt to assess the rate constancy of protein evolution and to detect cases of functional diversification between orthologous proteins. The test was performed on clusters of orthologous protein sequences from complete bacterial genomes (Chlamydia trachomatis, C. muridarum and Chlamydophila pneumoniae), complete archaeal genomes (Pyrococcus horikoshii, P. abyssi and P. furiosus) and partially sequenced mammalian genomes (human, mouse and rat).
Amino-acid sequence evolution rates are significantly correlated on different branches of phylogenetic trees representing the great majority of analyzed orthologous protein sets from all three domains of life. However, approximately 1% of the proteins from each group of species deviate from this pattern and instead show variation that is consistent with an acceleration of the rate of amino-acid substitution, which may be due to functional diversification. Most of the putative functionally diversified proteins from all three species groups are predicted to function at the periphery of the cells and mediate their interaction with the environment.
Relative rates of protein evolution are remarkably constant for the three species groups analyzed here. Deviations from this rate constancy are probably due to changes in selective constraints associated with diversification between orthologs. Functional diversification between orthologs is thought to be a relatively rare event. However, the resolution afforded by the test designed specifically for genomic-scale datasets allowed us to identify numerous cases of possible functional diversification between orthologous proteins.
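One generic way to look for branch-specific rate changes in three-species ortholog sets is sketched below. It is not necessarily the test implemented in the study. It assumes that pairwise amino-acid distances are already available for each protein, derives the three branch lengths of the unrooted tree with the standard three-point formula, and flags proteins in which one branch takes an unusual share of the total tree length. The function names, the tolerance and the toy distances are hypothetical.

```python
from statistics import median


def branch_lengths(d_ab, d_ac, d_bc):
    """Three-point formula: branch lengths leading to A, B and C
    in the unrooted three-taxon tree, from pairwise distances."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


def branch_share(d_ab, d_ac, d_bc):
    """Share of the total tree length carried by the branch leading to A."""
    a, b, c = branch_lengths(d_ab, d_ac, d_bc)
    total = a + b + c
    if total <= 0:
        raise ValueError("total tree length must be positive")
    return a / total


def flag_deviating(distances, tol=0.1):
    """Return names of proteins whose A-branch share differs from the
    median share across all proteins by more than `tol` (arbitrary cutoff)."""
    shares = {name: branch_share(*d) for name, d in distances.items()}
    typical = median(shares.values())
    return sorted(n for n, s in shares.items() if abs(s - typical) > tol)


# Toy pairwise distances (d_AB, d_AC, d_BC), invented for illustration.
toy = {
    "p1": (0.3, 0.5, 0.6),
    "p2": (0.6, 1.0, 1.2),   # twice as fast overall, same proportions
    "p3": (0.9, 0.5, 0.6),   # long branch leading to A
}
print(flag_deviating(toy))
```

Working with shares rather than raw branch lengths means that a protein that evolves uniformly faster on all branches (p2 here) is not flagged, whereas one that accelerates on a single branch (p3) is.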