Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
G protein-coupled receptor (GPCR) kinases (GRKs) are best known for their role in homologous desensitization of GPCRs. GRKs phosphorylate activated receptors and promote high affinity binding of arrestins, which precludes G protein coupling. GRKs have a multidomain structure, with the kinase domain inserted into a loop of a regulator of G protein signaling homology domain. Unlike many other kinases, GRKs do not need to be phosphorylated in their activation loop to achieve an activated state. Instead, they are directly activated by docking with active GPCRs. In this manner they are able to selectively phosphorylate Ser/Thr residues on only the activated form of the receptor, unlike related kinases such as protein kinase A. GRKs also phosphorylate a variety of non-GPCR substrates and regulate several signaling pathways via direct interactions with other proteins in a phosphorylation-independent manner. Multiple GRK subtypes are present in virtually every animal cell, with the highest expression levels found in neurons, with their extensive and complex signal regulation. Insufficient or excessive GRK activity was implicated in a variety of human disorders, ranging from heart failure to depression to Parkinson’s disease. As key regulators of GPCR-dependent and -independent signaling pathways, GRKs are emerging drug targets and promising molecular tools for therapy. Targeted modulation of expression and/or of activity of several GRK isoforms for therapeutic purposes was recently validated in cardiac disorders and Parkinson’s disease.
G protein-coupled receptors; G protein-coupled receptor kinases; signaling; regulation; phosphorylation; G proteins; regulator of G protein signaling
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
G protein-coupled receptor (GPCR) kinases (GRKs) play a key role in homologous desensitization of GPCRs. GRKs phosphorylate activated receptors, promoting high affinity binding of arrestins, which precludes G protein coupling. Direct binding to active GPCRs activates GRKs, so that they selectively phosphorylate only the activated form of the receptor regardless of the accessibility of the substrate peptides within it and their Ser/Thr-containing sequence. Mammalian GRKs were classified into three main lineages, but earlier GRK evolution has not been studied. Here we show that GRKs emerged at the early stages of eukaryotic evolution via an insertion of a kinase similar to ribosomal protein S6 kinase into a loop in the RGS domain. GRKs in Metazoa fall into two clades, one including GRK2 and GRK3, and the other consisting of all remaining GRKs, split into the GRK1-GRK7 and GRK4-GRK5-GRK6 lineages in vertebrates. One representative of each of the two ancient clades is found as early as the placozoan Trichoplax adhaerens. Several protists, two oomycetes and unicellular brown algae have one GRK-like protein, suggesting that the insertion of a kinase domain into the RGS domain preceded the origin of Metazoa. The two GRK families acquired distinct structural units in the N- and C-termini responsible for membrane recruitment and receptor association. Thus, GRKs apparently emerged before animals and rapidly expanded in true Metazoa, most likely due to the need for rapid signalling adjustments in fast-moving animals.
Metagenomic analysis of viruses suggests novel patterns of evolution, changes the existing ideas of the composition of the virus world and reveals novel groups of viruses and virus-like agents. The gene composition of the marine DNA virome is dramatically different from that of known bacteriophages. The virome is dominated by rare genes, many of which might be contained within virus-like entities such as gene transfer agents. Analysis of marine metagenomes thought to consist mostly of bacterial genes revealed a variety of sequences homologous to conserved genes of eukaryotic nucleocytoplasmic large DNA viruses, resulting in the discovery of diverse members of previously undersampled groups and suggesting the existence of new classes of virus-like agents. Unexpectedly, metagenomics of marine RNA viruses showed that representatives of only one superfamily of eukaryotic viruses, the picorna-like viruses, dominate the RNA virome.
We have identified conserved orthologs in completely sequenced genomes of double-strand DNA phages and arranged them into evolutionary families (phage orthologous groups [POGs]). Using this resource to analyze the collection of known phage genomes, we find that most orthologs are unique in their genomes (having no diverged duplicates [paralogs]), and while many proteins contain multiple domains, the evolutionary recombination of these domains does not appear to be a major factor in evolution of these orthologous families. The number of POGs has been rapidly increasing over the past decade, the percentage of genes in phage genomes that have orthologs in other phages has also been increasing, and the percentage of unknown “ORFans” is decreasing as more proteins find homologs and establish a family. Other properties of phage genomes have remained relatively stable over time, most notably the high fraction of genes that are never or only rarely observed in their cellular hosts. This suggests that despite the renowned ability of phages to transduce cellular genes, these cellular “hitchhiker” genes do not dominate the phage genomic landscape, and a large fraction of the genes in phage genomes maintain an evolutionary trajectory that is distinct from that of the host genes.
The large (about 2200 amino acids) L polymerase protein of nonsegmented negative-strand RNA viruses (order Mononegavirales) has six conserved sequence regions (“domains”) postulated to constitute the specific enzymatic activities involved in viral mRNA synthesis, 5′-end capping, cap methylation, 3′ polyadenylation, and genomic RNA replication. Previous studies with vesicular stomatitis virus identified amino acid residues within the L protein domain VI required for mRNA cap methylation. In our recent study we analyzed four amino acid residues within domain VI of the Sendai virus L protein and our data indicated that there could be differences in L protein sequence requirements for cap methylation in two different families of Mononegavirales - rhabdoviruses and paramyxoviruses. In this study, we conducted a more comprehensive mutational analysis by targeting the entire SeV L protein domain VI, creating twenty-four L mutants, and testing these mutations for their effects on viral mRNA synthesis, cap methylation, viral genome replication and virus growth kinetics. Our analysis identified several residues required for successful cap methylation and virus replication and clearly showed the importance of the K-D-K-E tetrad and glycine-rich motif in the SeV cap methylation. This study is the first extensive sequence analysis of the L protein domain VI in the family Paramyxoviridae, and it confirms structural and functional similarity of this domain across different families of the order Mononegavirales.
Sendai virus; paramyxovirus; mRNA cap methylation; methyltransferase; L polymerase protein
Gene expression divergence is a phenotypic trait that reflects the evolution of gene regulation and characterizes dissimilarity between species and between cells and tissues within the same species. Several distance measures, such as Euclidean and correlation-based distances, have been proposed for measuring expression divergence.
We show that different distance measures identify different trends in gene expression patterns. When comparing orthologous genes in eight rat and human tissues, the Euclidean distance identified genes uniformly expressed in all tissues near the expression background as genes with the most conserved expression pattern. In contrast, correlation-based distance and generalized-average distance identified genes with concerted changes among homologous tissues as those most conserved. On the other hand, correlation-based distance, Euclidean distance and generalized-average distance highlight quite well the relatively high similarity of gene expression patterns in homologous tissues between species, compared to non-homologous tissues within species.
Different trends exist in the high-dimensional numeric data, and to highlight a particular trend an appropriate distance measure needs to be chosen. The choice of the distance measure for measuring expression divergence can be dictated by the expression patterns that are of interest in a particular study.
This article was reviewed by Mikhail Gelfand, Eugene Koonin and Subhajyoti De (nominated by Sarah Teichmann).
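The contrast between the Euclidean and correlation-based distances described above can be illustrated with a minimal sketch. The four-tissue profiles below are invented toy values, not data from the study, and treating a flat profile as uncorrelated is an assumption made only to avoid division by zero. The generalized-average distance is not reproduced here.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation; 0 means perfectly concerted changes."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # assumption: a perfectly flat profile counts as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)

# Toy profiles (hypothetical): one ortholog pair near background in all tissues,
# one pair with concerted changes across tissues but different magnitudes.
flat_human, flat_rat = [1.0, 1.1, 0.9, 1.0], [1.1, 0.9, 1.0, 1.0]
concerted_human, concerted_rat = [1.0, 5.0, 9.0, 2.0], [2.0, 10.0, 18.0, 4.0]

flat_euclid = euclidean_distance(flat_human, flat_rat)
concerted_euclid = euclidean_distance(concerted_human, concerted_rat)
flat_corr = correlation_distance(flat_human, flat_rat)
concerted_corr = correlation_distance(concerted_human, concerted_rat)
```

In this toy case the Euclidean distance ranks the near-background pair as more conserved, whereas the correlation-based distance ranks the concerted pair as more conserved, which mirrors the trend reported in the results.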
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g^3) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g^6). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g^3 log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
Supplementary information: Supplementary materials are available at Bioinformatics online.
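As an illustration of the SymBet concept discussed above, the following minimal sketch builds intergenomic symmetrical best matches from pairwise similarity scores and lists the triangles they form. It is not the EdgeSearch algorithm and not the distributed C++ code; the function names, the input format (a symmetric score per gene pair) and the toy genes are all assumptions made for this example.

```python
def symmetric_best_matches(genome_of, scores):
    """Return the set of SymBets as frozensets {gene_a, gene_b}.

    genome_of: gene name -> genome name.
    scores: (gene_a, gene_b) -> similarity score, assumed symmetric.
    """
    best = {}  # (query gene, target genome) -> (score, best hit)
    for (a, b), score in scores.items():
        for query, hit in ((a, b), (b, a)):
            if genome_of[query] == genome_of[hit]:
                continue  # only intergenomic matches are considered
            key = (query, genome_of[hit])
            if key not in best or score > best[key][0]:
                best[key] = (score, hit)
    symbets = set()
    for (query, _), (_, hit) in best.items():
        back = best.get((hit, genome_of[query]))
        if back is not None and back[1] == query:
            symbets.add(frozenset((query, hit)))
    return symbets

def symbet_triangles(symbets):
    """List triangles of SymBets by intersecting neighbour sets of each edge."""
    neighbours = {}
    for edge in symbets:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for edge in symbets:
        a, b = tuple(edge)
        for c in neighbours[a] & neighbours[b]:
            found.add(frozenset((a, b, c)))
    return found

# Hypothetical toy input: three genomes, four genes.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
scores = {("a1", "b1"): 100, ("a1", "c1"): 90, ("b1", "c1"): 95, ("a2", "b1"): 40}
symbets = symmetric_best_matches(genome_of, scores)
triangles = symbet_triangles(symbets)
```

Here a2 is not part of any SymBet because the best match of b1 in genome A is a1, so only one triangle is formed.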
Ethanolamine can be used as a source of carbon and nitrogen by phylogenetically diverse bacteria. Ethanolamine-ammonia lyase, the enzyme that breaks ethanolamine into acetaldehyde and ammonia, is encoded by the gene tandem eutBC. Despite extensive studies of ethanolamine utilization in Salmonella enterica serovar Typhimurium, much remains to be learned about EutBC structure and catalytic mechanism, about the evolutionary origin of ethanolamine utilization, and about regulatory links between the metabolism of ethanolamine itself and the ethanolamine-ammonia lyase cofactor adenosylcobalamin. We used computational analysis of sequences, structures, genome contexts, and phylogenies of ethanolamine-ammonia lyases to address these questions and to evaluate recent data-mining studies that have suggested an association between bacterial food poisoning and the diol utilization pathways. We found that EutBC evolution included recruitment of a TIM barrel and a Rossmann fold domain and their fusion to N-terminal α-helical domains to give EutB and EutC, respectively. This fusion was followed by recruitment and occasional loss of auxiliary ethanolamine utilization genes in Firmicutes and by several horizontal transfers, most notably from the firmicute stem to the Enterobacteriaceae and from Alphaproteobacteria to Actinobacteria. We identified a conserved DNA motif that likely represents the EutR-binding site and is shared by the ethanolamine and cobalamin operons in several enterobacterial species, suggesting a mechanism for coupling the biosyntheses of apoenzyme and cofactor in these species. Finally, we found that the food poisoning phenotype is associated with the structural components of metabolosome more strongly than with ethanolamine utilization genes or with paralogous propanediol utilization genes per se.
The genomes of two closely related lytic Thermus thermophilus siphoviruses with exceptionally long (~800 nm) tails, bacteriophages P23-45 and P74-26, were completely sequenced. The P23-45 genome consists of 84,201 bp with 117 putative ORFs (Open Reading Frames), and the P74-26 genome has 83,319 bp and 116 putative ORFs. The two genomes are 92% identical with 113 ORFs shared. Only 25% of phage gene product functions can be predicted from similarities to proteins and protein domains with known functions. The structural genes of P23-45, most of which have no similarity to sequences from public databases, were identified by mass-spectrometric analysis of virions. An unusual feature of the P23-45 and P74-26 genomes is the presence, in their largest intergenic regions, of long polypurine-polypyrimidine (R-Y) sequences with mirror repeat symmetry. Such sequences, abundant in eukaryotic genomes but rare in prokaryotes, are known to form stable triple helices that block replication and transcription and induce genetic instability. Comparative analysis of the two phage genomes shows that the area around the triplex-forming elements is enriched in mutational variations. In vitro, phage R-Y sequences form triplexes and block DNA synthesis by Taq DNA polymerase in an orientation-dependent manner, suggesting that they may play a regulatory role during P23-45 and P74-26 development.
Thermus thermophilus; thermophages; virion proteomics; bioinformatics; triplex-forming sequence
A phyletic vector, also known as a phyletic (or phylogenetic) pattern, is a binary representation of the presences and absences of orthologous genes in different genomes. Joint occurrence of two or more genes in many genomes results in closely similar binary vectors representing these genes, and this similarity between gene vectors may be used as a measure of functional association between genes. Better understanding of quantitative properties of gene co-occurrences is needed for systematic studies of gene function and evolution. We used the probabilistic iterative algorithm Psi-square to find groups of similar phyletic vectors. An extended Psi-square algorithm, in which pseudocounts are implemented, shows better sensitivity in identifying proteins with known functional links than our earlier hierarchical clustering approach. At the same time, the specificity of inferring functional associations between genes in prokaryotic genomes is strongly dependent on the pathway: phyletic vectors of the genes involved in energy metabolism and in de novo biosynthesis of the essential precursors tend to be lumped together, whereas cellular modules involved in secretion, motility, assembly of cell surfaces, biosynthesis of some coenzymes, and utilization of secondary carbon sources tend to be identified with much greater specificity. It appears that the network of gene coinheritance in prokaryotes contains a giant connected component that encompasses most biosynthetic subsystems, along with a series of more independent modules involved in cell interaction with the environment.
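To make the notion of a phyletic vector concrete, the sketch below encodes gene presence and absence across six genomes as binary tuples and compares them with the Jaccard coefficient. The Jaccard coefficient is a simple stand-in chosen for illustration, not the probabilistic scoring used by Psi-square, and the gene labels and vectors are invented.

```python
def jaccard_similarity(u, v):
    """Share of genomes containing both genes among genomes containing either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

# Hypothetical phyletic vectors: one position per genome, 1 = ortholog present.
gene_a = (1, 1, 0, 1, 0, 0)
gene_b = (1, 1, 0, 1, 0, 1)
gene_c = (0, 0, 1, 0, 1, 1)

similarity_ab = jaccard_similarity(gene_a, gene_b)
similarity_ac = jaccard_similarity(gene_a, gene_c)
```

Genes that co-occur in most genomes, such as gene_a and gene_b here, receive a high similarity and would be candidates for a functional association; gene_a and gene_c never co-occur and score zero.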
A novel phage infecting Escherichia coli was isolated during a large-scale screen for phages that may be used for therapy of mastitis in cattle. The 77,554 bp genome of the phage, named phiEco32, was sequenced and annotated, and its virions were characterized by electron microscopy and proteomics. Two phiEco32-encoded proteins that interact with host RNA polymerase were identified. One of them is an ECF-family σ-factor that may be responsible for transcription of some viral genes. Another RNA polymerase-binding protein is a novel transcription inhibitor whose mechanism of action remains to be defined.
bacteriophage; Podoviridae; E. coli; genome; MudPIT; RNA polymerase-binding proteins
While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied to the same microarray time series dataset four distinct mathematical methods to identify significant patterns in gene expression profiles. These methods are called: Phase consistency, Address reduction, Cyclohedron test and Stable persistence, and are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, are not dependent on the assumption of periodicity of the pattern of interest. Remarkably, these methods identified blindly the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos. 
Our results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.
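For readers unfamiliar with the Lomb-Scargle periodogram mentioned above, here is a minimal pure-Python sketch of its classical normalized form evaluated at a few trial periods. It is not the analysis pipeline used on the microarray data; the sampling times, the noise-free sine profile and the list of candidate periods are assumptions made for the example.

```python
import math

def lomb_scargle_power(times, values, period):
    """Classical normalized Lomb-Scargle power at one trial period."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    omega = 2.0 * math.pi / period
    # The time offset tau makes the sine and cosine terms orthogonal.
    tau = math.atan2(sum(math.sin(2 * omega * t) for t in times),
                     sum(math.cos(2 * omega * t) for t in times)) / (2 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    y_cos = sum((v - mean) * c for v, c in zip(values, cos_terms))
    y_sin = sum((v - mean) * s for v, s in zip(values, sin_terms))
    return (y_cos ** 2 / sum(c * c for c in cos_terms)
            + y_sin ** 2 / sum(s * s for s in sin_terms)) / (2.0 * variance)

# Hypothetical unevenly sampled profile oscillating with a period of 2 time units.
times = [0.0, 0.3, 0.9, 1.4, 2.2, 2.5, 3.1, 3.8, 4.4, 5.0, 5.7, 6.1]
values = [math.sin(2.0 * math.pi * t / 2.0) for t in times]
candidate_periods = [1.0, 1.5, 2.0, 3.0, 4.0]
best_period = max(candidate_periods,
                  key=lambda p: lomb_scargle_power(times, values, p))
```

Unlike a plain Fourier transform, this form does not require evenly spaced time points, but it still assumes that the pattern of interest is periodic, which is the limitation the four alternative methods aim to relax.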
Members of the green fluorescent protein (GFP) family share sequence similarity and the 11-stranded β-barrel fold. Fluorescence or bright coloration, observed in many members of this family, is enabled by the intrinsic properties of the polypeptide chain itself, without the requirement for cofactors. The amino acid sequence of fluorescent proteins can be altered by genetic engineering to produce variants with different spectral properties, suitable for direct visualization of molecular and cellular processes. Naturally occurring GFP-like proteins include fluorescent proteins from cnidarians of the Hydrozoa and Anthozoa classes, and from copepods of the Pontellidae family, as well as non-fluorescent proteins from Anthozoa. Recently, an mRNA encoding a fluorescent GFP-like protein, AmphiGFP, related to GFP from Pontellidae, has been isolated from the lancelet Branchiostoma floridae, a cephalochordate (Deheyn et al., Biol Bull, 2007 213:95).
We report that the nearly-completely sequenced genome of Branchiostoma floridae encodes at least 12 GFP-like proteins. The evidence for expression of six of these genes can be found in the EST databases. Phylogenetic analysis suggests that a gene encoding a GFP-like protein was present in the common ancestor of Cnidaria and Bilateria. We synthesized and expressed two of the lancelet GFP-like proteins in mammalian cells and in bacteria. One protein, which we called LanFP1, exhibits bright green fluorescence in both systems. The other protein, LanFP2, is identical to AmphiGFP in amino acid sequence and is moderately fluorescent. Live imaging of the adult animals revealed bright green fluorescence at the anterior end and in the basal region of the oral cirri, as well as weaker green signals throughout the body of the animal. In addition, red fluorescence was observed in oral cirri, extending to the tips.
GFP-like proteins may have been present in primitive Metazoa. Their evolutionary history includes losses in several metazoan lineages and expansion in cephalochordates that resulted in the largest repertoire of GFP-like proteins known thus far in a single organism. The lancelet expresses several of its GFP-like proteins, which appear to have distinct spectral properties and perhaps diverse functions.
This article was reviewed by Shamil Sunyaev, Mikhail Matz (nominated by I. King Jordan) and L. Aravind.
We determined the sequence of the 152,372-bp genome of ϕYS40, a lytic tailed bacteriophage of Thermus thermophilus. The genome contains 170 putative open reading frames and three tRNA genes. Functions for 25% of ϕYS40 gene products were predicted on the basis of similarity to proteins of known function from diverse phages and bacteria. ϕYS40 encodes a cluster of proteins involved in nucleotide salvage, such as flavin-dependent thymidylate synthase, thymidylate kinase, ribonucleotide reductase, and deoxycytidylate deaminase, and in DNA replication, such as DNA primase, helicase, type A DNA polymerase, and predicted terminal protein involved in initiation of DNA synthesis. The structural genes of ϕYS40, most of which have no similarity to sequences in public databases, were identified by mass-spectrometric analysis of purified virions. Various ϕYS40 proteins have different phylogenetic neighbors, including Myovirus, Podovirus, and Siphovirus gene products, bacterial genes, and in one case, a dUTPase from a eukaryotic virus. ϕYS40 has apparently arisen through multiple acts of recombination between different phage genomes as well as through acquisition of bacterial genes.
Thermus thermophilus; bacteriophage; genome; virion; proteomics; bioinformatics; DNA polymerase
Reconstruction of the evolutionary history of bacteriophages is a difficult problem because of fast sequence drift and the lack of omnipresent genes in phage genomes. Moreover, losses and recombinational exchanges of genes are so pervasive in phages that the plausibility of phylogenetic inference in the phage kingdom has been questioned.
We compiled the profiles of presence and absence of 803 orthologous genes in 158 completely sequenced phages with double-stranded DNA genomes and used these gene content vectors to infer the evolutionary history of phages. There were 18 well-supported clades, mostly corresponding to accepted genera, but in some cases appearing to define new taxonomic groups. Conflicts between this phylogeny and trees constructed from sequence alignments of phage proteins were exploited to infer 294 specific acts of intergenome gene transfer.
A notoriously reticulate evolutionary history of fast-evolving phages can be reconstructed in considerable detail by quantitative comparative genomics.
Open peer review
This article was reviewed by Eugene Koonin, Nicholas Galtier and Martijn Huynen.
Identification of the structural domains of proteins is important for our understanding of the organizational principles and mechanisms of protein folding, and for insights into protein function and evolution. Algorithmic methods of dissecting proteins of known structure into domains developed so far are based on an examination of multiple geometrical, physical and topological features. Successful as many of these approaches are, they employ many heuristics, and it is not clear whether they illuminate any deep underlying principles of protein domain organization. Other well-performing domain dissection methods rely on comparative sequence analysis. These methods are applicable to sequences with known and unknown structure alike, and their success highlights a fundamental principle of protein modularity, but this does not directly improve our understanding of protein spatial structure.
We present a novel graph-theoretical algorithm for the identification of domains in proteins with known three-dimensional structure. We represent the protein structure as an undirected, unweighted and unlabeled graph whose nodes correspond to the secondary structure elements and whose edges represent physical proximity of at least one pair of alpha carbon atoms from two elements. Domains are identified as constrained partitions of the graph, corresponding to sets of vertices obtained by the maximization of the cycle distributions found in the graph. The decision to accept or reject a tentative cut position is based on a specific classifier. When a partition is accepted, the algorithm is applied iteratively to each of the resulting subgraphs, and it terminates automatically when partitions are no longer accepted. The distribution of cycles is the only type of information on which the decision about protein dissection is based. Despite the barebone simplicity of the approach, our algorithm approaches the best heuristic algorithms in accuracy.
Our graph-theoretical algorithm uses only topological information present in the protein structure itself to find the domains and does not rely on any geometrical or physical information about the protein molecule. Perhaps unexpectedly, these drastic constraints on resources, which result in a seemingly approximate description of protein structures and leave only a handful of parameters available for analysis, do not lead to any significant deterioration of algorithm accuracy. It appears that protein structures can be rigorously treated as topological rather than geometrical objects and that the majority of information about protein domains can be inferred from the coarse-grained measure of pairwise proximity between secondary structure elements.
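The graph representation described above can be sketched as follows: nodes are secondary structure elements, and an edge is added when at least one pair of alpha carbon atoms from two elements lies within a distance cutoff. The 8.0 Å cutoff, the element labels and the coordinates are assumptions for illustration; the source does not state the threshold used, and the cycle-based partitioning step is not reproduced here.

```python
import math
from itertools import combinations

def build_sse_graph(sse_atoms, cutoff=8.0):
    """Return undirected edges between secondary structure elements (SSEs).

    sse_atoms: SSE label -> list of (x, y, z) alpha carbon coordinates.
    cutoff: distance threshold in angstroms (hypothetical value).
    """
    edges = set()
    for a, b in combinations(sorted(sse_atoms), 2):
        # One close pair of alpha carbons is enough to connect two elements.
        if any(math.dist(p, q) <= cutoff
               for p in sse_atoms[a] for q in sse_atoms[b]):
            edges.add((a, b))
    return edges

# Hypothetical toy structure: two neighbouring elements and one distant helix.
sse_atoms = {
    "H1": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "E1": [(5.0, 0.0, 0.0)],
    "H2": [(30.0, 0.0, 0.0)],
}
edges = build_sse_graph(sse_atoms)
```

Because the edges carry no weights or labels, all geometric detail beyond the yes-or-no proximity test is discarded at this step, which is what makes the downstream analysis purely topological.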
The spindle pole body (SPB) is the sole site of microtubule nucleation in Saccharomyces cerevisiae; yet, details of its assembly are poorly understood. Integral membrane proteins including Mps2 anchor the soluble core SPB in the nuclear envelope. Adjacent to the core SPB is a membrane-associated SPB substructure known as the half-bridge, where SPB duplication and microtubule nucleation during G1 occur. We found that the half-bridge component Mps3 is the budding yeast member of the SUN protein family (Sad1-UNC-84 homology) and provide evidence that it interacts with the Mps2 C terminus to tether the half-bridge to the core SPB. Mutants in the Mps3 SUN domain or Mps2 C terminus have SPB duplication and karyogamy defects that are consistent with the aberrant half-bridge structures we observe cytologically. The interaction between the Mps3 SUN domain and Mps2 C terminus is the first biochemical link known to connect the half-bridge with the core SPB. Association with Mps3 also defines a novel function for Mps2 during SPB duplication.
We present psi-square, a program for searching the space of gene vectors. The program starts with a gene vector, i.e., the set of measurements associated with a gene, and finds similar vectors, derives a probabilistic model of these vectors, then repeats search using this model as a query, and continues to update the model and search again, until convergence. When applied to three different pathway-discovery problems, psi-square was generally more sensitive and sometimes more specific than the ad hoc methods developed for solving each of these problems before.
This article was reviewed by King Jordan, Mikhail Gelfand, Nicolas Galtier and Sarah Teichmann.
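The iterative search loop described above (query vector, similar vectors, probabilistic model, repeated search until convergence) can be sketched for binary vectors as follows. This is an illustration of the general idea only: the log-odds scoring, the pseudocount scheme, the zero threshold and the toy vectors are assumptions and do not reproduce the actual psi-square scoring function.

```python
import math

def iterative_vector_search(vectors, query, threshold=0.0, pseudocount=1.0,
                            max_iterations=20):
    """Grow a set of vectors similar to the query until the set stops changing."""
    length = len(vectors[query])
    total = len(vectors)
    # Background probability of a 1 at each position, smoothed by pseudocounts.
    background = [(sum(v[i] for v in vectors.values()) + pseudocount)
                  / (total + 2 * pseudocount) for i in range(length)]
    members = {query}
    for _ in range(max_iterations):
        # Position-specific model derived from the current members.
        model = [(sum(vectors[m][i] for m in members) + pseudocount)
                 / (len(members) + 2 * pseudocount) for i in range(length)]

        def log_odds(vector):
            score = 0.0
            for i, bit in enumerate(vector):
                p, q = model[i], background[i]
                score += math.log(p / q) if bit else math.log((1 - p) / (1 - q))
            return score

        updated = {name for name, v in vectors.items() if log_odds(v) > threshold}
        updated.add(query)
        if updated == members:
            break  # convergence: the model no longer recruits or drops vectors
        members = updated
    return members

# Hypothetical gene vectors over six genomes or conditions.
vectors = {
    "g1": (1, 1, 0, 0, 1, 0),
    "g2": (1, 1, 0, 0, 1, 0),
    "g3": (1, 1, 0, 0, 0, 0),
    "g4": (0, 0, 1, 1, 0, 1),
    "g5": (0, 0, 1, 1, 0, 1),
}
hits = iterative_vector_search(vectors, "g1")
```

Pseudocounts keep every model probability strictly between 0 and 1, so a single mismatch lowers a vector's score instead of excluding it outright; this is why g3 is recruited despite differing from the query at one position.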
S-adenosylmethionine is a source of diverse chemical groups used in biosynthesis and modification of virtually every class of biomolecules. The most notable reaction requiring S-adenosylmethionine, transfer of a methyl group, is performed by a large class of enzymes, S-adenosylmethionine-dependent methyltransferases, which have been the focus of considerable structure-function studies. Evolutionary trajectories of these enzymes, and especially of other classes of S-adenosylmethionine-binding proteins, nevertheless, remain poorly understood. We addressed this issue by computational comparison of sequences and structures of various S-adenosylmethionine-binding proteins.
Two widespread folds, Rossmann fold and TIM barrel, have been repeatedly used in evolution for diverse types of S-adenosylmethionine conversion. There were also cases of recruitment of other relatively common folds for S-adenosylmethionine binding. Several classes of proteins have unique unrelated folds, specialized for just one type of chemistry and unified by the theme of internal domain duplications. In several cases, functional divergence is evident, when evolutionarily related enzymes have changed the mode of binding and the type of chemical transformation of S-adenosylmethionine. There are also instances of functional convergence, when biochemically similar processes are performed by drastically different classes of S-adenosylmethionine-binding proteins.
Comparison of remote sequence similarities and analysis of phyletic patterns suggests that the last universal common ancestor of cellular life had between 10 and 20 S-adenosylmethionine-binding proteins from at least 5 fold classes, providing for S-adenosylmethionine formation, polyamine biosynthesis, and methylation of several substrates, including nucleic acids and peptide chain release factor.
We have observed several novel relationships between families that were not known to be related before, and defined 15 large superfamilies of SAM-binding proteins, at least 5 of which may have been represented in the last common ancestor.
Most of the known prohead maturation proteases in double-stranded-DNA bacteriophages are shown, by computational methods, to fall into two evolutionarily independent clans of serine proteases, herpesvirus assemblin-like and ClpP-like. Phylogenetic analysis suggests that these two types of phage prohead protease genes displaced each other multiple times while preserving their exact location within the late operons of the phage genomes.
A hierarchy of 3,688 phyletic patterns was characterized encompassing more than 5,000 known protein-coding genes from 66 complete microbial genomes. The results indicate that gene loss and displacement have occurred in the evolution of most pathways.
Phyletic patterns denote the presence and absence of orthologous genes in completely sequenced genomes and are used to infer functional links between genes, on the assumption that genes involved in the same pathway or functional system are co-inherited by the same set of genomes. However, this basic premise has not been quantitatively tested, and the limits of applicability of the phyletic-pattern method remain unknown.
We characterized a hierarchy of 3,688 phyletic patterns encompassing more than 5,000 known protein-coding genes from 66 complete microbial genomes, using different distances, clustering algorithms, and measures of cluster quality. The most sensitive set of parameters recovered 223 clusters, each consisting of genes that belong to the same metabolic pathway or functional system. Fifty-six clusters included unexpected genes with plausible functional links to the rest of the cluster. Only a small percentage of known pathways and multiprotein complexes are co-inherited as one cluster; most are split into many clusters, indicating that gene loss and displacement have occurred in the evolution of most pathways.
Phyletic patterns of functionally linked genes are perturbed by differential gains, losses and displacements of orthologous genes in different species, reflecting the high plasticity of microbial genomes. Groups of genes that are co-inherited can, however, be recovered by hierarchical clustering, and may represent elementary functional modules of cellular metabolism. The phyletic patterns approach alone can confidently predict the functional linkages for about 24% of the entire data set.
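A minimal sketch of how co-inherited genes might be recovered by hierarchical clustering of phyletic patterns is given below. The study compared several distances and clustering algorithms; the Hamming distance, average linkage, stopping threshold and toy patterns used here are assumptions chosen for illustration, not the parameters reported above.

```python
def hamming_distance(u, v):
    """Number of genomes in which two phyletic patterns disagree."""
    return sum(a != b for a, b in zip(u, v))

def cluster_patterns(patterns, max_distance):
    """Agglomerative average-linkage clustering, stopped at max_distance."""
    clusters = [[name] for name in sorted(patterns)]

    def average_distance(first, second):
        total = sum(hamming_distance(patterns[a], patterns[b])
                    for a in first for b in second)
        return total / (len(first) * len(second))

    while len(clusters) > 1:
        distance, i, j = min(
            (average_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if distance > max_distance:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical phyletic patterns over five genomes (1 = ortholog present).
patterns = {
    "hisA": (1, 1, 0, 1, 0),
    "hisB": (1, 1, 0, 1, 0),
    "flgB": (0, 1, 1, 0, 1),
    "flgC": (0, 1, 1, 0, 1),
    "orfX": (1, 0, 0, 0, 1),
}
clusters = cluster_patterns(patterns, max_distance=1)
```

In this toy example genes with identical patterns merge into two clusters, while orfX stays on its own, which loosely corresponds to a gene whose pattern has been perturbed by loss or displacement.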