Through the course of their evolution, viruses with large genomes have acquired numerous host genes, most of which function in virus reproduction in a manner that is related to their original activities in the cells, but some of which have been exapted for new roles. Here we report the unexpected finding that protein F12, which is conserved among the chordopoxviruses and is implicated in the morphogenesis of enveloped intracellular virions, is a derived DNA polymerase, possibly of bacteriophage origin, in which the polymerase domain and probably the exonuclease domain have been inactivated. Thus, F12 appears to present a rare example of a drastic, exaptive functional change in virus evolution.
Reviewers: This article was reviewed by Frank Eisenhaber and Juergen Brosius. For complete reviews, go to the Reviewers’ Reports section.
DNA polymerase; Exaptation; Poxviruses; Evolution of viruses
Analysis of the genome sequence of the starlet sea anemone, Nematostella vectensis, reveals many genes whose products are phylogenetically closer to proteins encoded by bacteria or bacteriophages than to any metazoan homologs. One explanation for such sequence affinities could be that these genes have been horizontally transferred from bacteria to the Nematostella lineage. We show, however, that bacterium-like and phage-like genes sequenced by the N. vectensis genome project tend to cluster on separate scaffolds, which typically do not include eukaryotic genes and differ from the latter in their GC contents. Moreover, most of the bacterium-like genes in N. vectensis either lack introns or the introns annotated in such genes are false predictions that, when translated, often restore the missing portions of their predicted protein products. In a freshwater cnidarian, Hydra, for which a proteobacterial endosymbiont is known, these gene features have been used to delineate the DNA of that endosymbiont sampled by the genome sequencing project. We predict that a large fraction of bacterium-like genes identified in the N. vectensis genome similarly are drawn from the contemporary bacterial consorts of the starlet sea anemone. These uncharacterized bacteria associated with N. vectensis are a proteobacterium and a representative of the phylum Bacteroidetes, each represented in the database by an apparently random sample of informational and operational genes. A substantial portion of a putative bacteriophage genome was also detected, which would be especially unlikely to have been transferred to a eukaryote.
Modulation of NF-κB-dependent responses is critical to the success of attaching/effacing (A/E) human pathogenic E. coli (EPEC and EHEC) and the natural mouse pathogen Citrobacter rodentium. NleB, a highly conserved type III secretion system effector of A/E pathogens, suppresses NF-κB activation, but the underlying mechanisms are unknown. We identified the mammalian glycolysis enzyme glyceraldehyde 3-phosphate dehydrogenase (GAPDH) as an NleB interacting protein. Further, we discovered that GAPDH interacts with the TNF receptor associated factor 2 (TRAF2), a protein required for TNF-α-mediated NF-κB activation, and regulates TRAF2 polyubiquitination. During infection, NleB functions as a translocated N-acetyl-D-glucosamine (O-GlcNAc) transferase that modifies GAPDH. NleB-mediated GAPDH O-GlcNAcylation disrupts the TRAF2-GAPDH interaction to suppress TRAF2 polyubiquitination and NF-κB activation. Eliminating NleB O-GlcNAcylation activity attenuates C. rodentium colonization of mice. These data identify GAPDH as a TRAF2 signaling cofactor and reveal a virulence strategy employed by A/E pathogens to inhibit NF-κB dependent host innate immune responses.
The problem of probabilistic inference of gene content in the last common ancestor of several extant species with completely sequenced genomes is: for each gene that is conserved in all or some of the genomes, assign the probability that its ancestral gene was present in the genome of their last common ancestor.
We have developed a family of models of gene gain and gene loss in evolution, and applied the maximum-likelihood (ML) approach that uses a phylogenetic tree of prokaryotes and the record of orthologous relationships between their genes to infer the gene content of LUCA, the Last Universal Common Ancestor of all currently living cellular organisms. The crucial parameter, the ratio of gene losses to gene gains, was estimated from the data and was higher in models that take account of the number of in-paralogs in genomes than in models that treat gene presences and absences as a binary trait.
While the numbers of genes that are placed confidently into LUCA are similar in the ML methods and in previously published methods that use various parsimony-based approaches, the identities of genes themselves are different. Most of the models of either kind treat the genes found in many existing genomes in a similar way, assigning to them high probabilities of being ancestral (“high ancestrality”). The ML models are more likely than others to assign high ancestrality to the genes that are relatively rare in the present-day genomes.
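The inference problem stated above can be illustrated with a minimal sketch. It assumes a two-state (absent/present) continuous-time Markov model with constant gain and loss rates and uses the standard pruning recursion on a rooted tree. The function names, the tree encoding, the rates and the prior are all hypothetical choices for illustration; this is not the family of models developed in the study and it ignores in-paralog counts.

```python
import math

def transition(gain, loss, t):
    """2x2 matrix P[i][j] = P(state j after time t | state i), 0 = absent, 1 = present."""
    total = gain + loss
    decay = math.exp(-total * t)
    pi1 = gain / total  # stationary probability of presence
    pi0 = loss / total  # stationary probability of absence
    return [[pi0 + pi1 * decay, pi1 * (1 - decay)],
            [pi0 * (1 - decay), pi1 + pi0 * decay]]

def partial_likelihood(node, gain, loss):
    """Return [L0, L1]: likelihood of the leaf data below `node` given its state."""
    if "state" in node:  # leaf: observed presence (1) or absence (0)
        return [1.0, 0.0] if node["state"] == 0 else [0.0, 1.0]
    result = [1.0, 1.0]
    for child, branch_length in node["children"]:
        p = transition(gain, loss, branch_length)
        below = partial_likelihood(child, gain, loss)
        for s in (0, 1):
            result[s] *= p[s][0] * below[0] + p[s][1] * below[1]
    return result

def root_presence_probability(tree, gain, loss, prior_present=0.5):
    """Posterior probability that the gene was present at the root (assumed prior)."""
    l0, l1 = partial_likelihood(tree, gain, loss)
    numerator = prior_present * l1
    return numerator / (numerator + (1 - prior_present) * l0)

def cherry(a, b, t=1.0):
    """Hypothetical helper: an internal node with two leaves at distance t."""
    return {"children": [({"state": a}, t), ({"state": b}, t)]}

# Toy tree ((A,B),(C,D)) with unit branch lengths; the gene is seen in A, B and C.
tree = {"children": [(cherry(1, 1), 1.0), (cherry(1, 0), 1.0)]}
print(root_presence_probability(tree, gain=0.1, loss=0.3))
```

In this toy setting, raising the loss rate relative to the gain rate makes a sparse present-day distribution more compatible with an ancestral presence, which is one way to see why the loss-to-gain ratio is the crucial parameter.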
This article was reviewed by Martijn A Huynen, Toni Gabaldón and Fyodor Kondrashov.
The minimal bacterial gene set comprises the genetic elements needed for survival of an engineered bacterium on a rich medium. This set is estimated to include 300–350 protein-coding genes. One way of simplifying an organism with such a minimal genome even further is to constrain the amino acid content of its proteins. In this study, comparative genomics approaches and the results of gene knockout experiments were used to extrapolate the minimal gene set of mollicutes, and bioinformatics combined with the knowledge-based analysis of the structure-function relationships in these proteins and their orthologs, paralogs and analogs was applied to examine the challenges of completely replacing the rarest residue, cysteine. Among several known functions of cysteine residues, their roles in the active centers of the enzymes responsible for deoxyribonucleoside synthesis and transfer RNA modification appear to be crucial, as no alternative chemistry is known for these reactions. Thus, drastic reduction of the content of the rarest amino acid in a minimal proteome appears to be possible, but its complete elimination is challenging.
Viruses are the most abundant biological entities on earth and encompass a vast amount of genetic diversity. The recent rapid increase in the number of sequenced viral genomes has created unprecedented opportunities for gaining new insight into the structure and evolution of the virosphere. Here, we present an update of the phage orthologous groups (POGs), a collection of 4,542 clusters of orthologous genes from bacteriophages that now also includes viruses infecting archaea and encompasses more than 1,000 distinct virus genomes. Analysis of this expanded data set shows that the number of POGs keeps growing without saturation and that a substantial majority of the POGs remain specific to viruses, lacking homologues in prokaryotic cells, outside known proviruses. Thus, the great majority of virus genes apparently remains to be discovered. A complementary observation is that numerous viral genomes remain poorly, if at all, covered by POGs. The genome coverage by POGs is expected to increase as more genomes are sequenced. Taxon-specific, single-copy signature genes that are not observed in prokaryotic genomes outside detected proviruses were identified for two-thirds of the 57 taxa (those with genomes available from at least 3 distinct viruses), with half of these present in all members of the respective taxon. These signatures can be used to specifically identify the presence and quantify the abundance of viruses from particular taxa in metagenomic samples and thus gain new insights into the ecology and evolution of viruses in relation to their hosts.
G protein-coupled receptor (GPCR) kinases (GRKs) are best known for their role in homologous desensitization of GPCRs. GRKs phosphorylate activated receptors and promote high affinity binding of arrestins, which precludes G protein coupling. GRKs have a multidomain structure, with the kinase domain inserted into a loop of a regulator of G protein signaling homology domain. Unlike many other kinases, GRKs do not need to be phosphorylated in their activation loop to achieve an activated state. Instead, they are directly activated by docking with active GPCRs. In this manner they are able to selectively phosphorylate Ser/Thr residues on only the activated form of the receptor, unlike related kinases such as protein kinase A. GRKs also phosphorylate a variety of non-GPCR substrates and regulate several signaling pathways via direct interactions with other proteins in a phosphorylation-independent manner. Multiple GRK subtypes are present in virtually every animal cell, with the highest expression levels found in neurons, which have extensive and complex signal regulation. Insufficient or excessive GRK activity has been implicated in a variety of human disorders, ranging from heart failure to depression to Parkinson’s disease. As key regulators of GPCR-dependent and -independent signaling pathways, GRKs are emerging drug targets and promising molecular tools for therapy. Targeted modulation of the expression and/or activity of several GRK isoforms for therapeutic purposes was recently validated in cardiac disorders and Parkinson’s disease.
G protein-coupled receptors; G protein-coupled receptor kinases; signaling; regulation; phosphorylation; G proteins; regulator of G protein signaling
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
G protein-coupled receptor (GPCR) kinases (GRKs) play a key role in homologous desensitization of GPCRs. GRKs phosphorylate activated receptors, promoting high affinity binding of arrestins, which precludes G protein coupling. Direct binding to active GPCRs activates GRKs, so that they selectively phosphorylate only the activated form of the receptor regardless of the accessibility of the substrate peptides within it and their Ser/Thr-containing sequence. Mammalian GRKs were classified into three main lineages, but earlier GRK evolution has not been studied. Here we show that GRKs emerged at the early stages of eukaryotic evolution via an insertion of a kinase similar to ribosomal protein S6 kinase into a loop in the RGS domain. GRKs in Metazoa fall into two clades, one including GRK2 and GRK3, and the other consisting of all remaining GRKs, split into the GRK1-GRK7 and GRK4-GRK5-GRK6 lineages in vertebrates. One representative of each of the two ancient clades is found as early as the placozoan Trichoplax adhaerens. Several protists, two oomycetes and unicellular brown algae have one GRK-like protein, suggesting that the insertion of a kinase domain into the RGS domain preceded the origin of Metazoa. The two GRK families acquired distinct structural units in the N- and C-termini responsible for membrane recruitment and receptor association. Thus, GRKs apparently emerged before animals and rapidly expanded in true Metazoa, most likely due to the need for rapid signalling adjustments in fast-moving animals.
Metagenomic analysis of viruses suggests novel patterns of evolution, changes the existing ideas of the composition of the virus world and reveals novel groups of viruses and virus-like agents. The gene composition of the marine DNA virome is dramatically different from that of known bacteriophages. The virome is dominated by rare genes, many of which might be contained within virus-like entities such as gene transfer agents. Analysis of marine metagenomes thought to consist mostly of bacterial genes revealed a variety of sequences homologous to conserved genes of eukaryotic nucleocytoplasmic large DNA viruses, resulting in the discovery of diverse members of previously undersampled groups and suggesting the existence of new classes of virus-like agents. Unexpectedly, metagenomics of marine RNA viruses showed that representatives of only one superfamily of eukaryotic viruses, the picorna-like viruses, dominate the RNA virome.
We have identified conserved orthologs in completely sequenced genomes of double-strand DNA phages and arranged them into evolutionary families (phage orthologous groups [POGs]). Using this resource to analyze the collection of known phage genomes, we find that most orthologs are unique in their genomes (having no diverged duplicates [paralogs]), and while many proteins contain multiple domains, the evolutionary recombination of these domains does not appear to be a major factor in evolution of these orthologous families. The number of POGs has been rapidly increasing over the past decade, the percentage of genes in phage genomes that have orthologs in other phages has also been increasing, and the percentage of unknown “ORFans” is decreasing as more proteins find homologs and establish a family. Other properties of phage genomes have remained relatively stable over time, most notably the high fraction of genes that are never or only rarely observed in their cellular hosts. This suggests that despite the renowned ability of phages to transduce cellular genes, these cellular “hitchhiker” genes do not dominate the phage genomic landscape, and a large fraction of the genes in phage genomes maintain an evolutionary trajectory that is distinct from that of the host genes.
The large (about 2200 amino acids) L polymerase protein of nonsegmented negative-strand RNA viruses (order Mononegavirales) has six conserved sequence regions (“domains”) postulated to constitute the specific enzymatic activities involved in viral mRNA synthesis, 5′-end capping, cap methylation, 3′ polyadenylation, and genomic RNA replication. Previous studies with vesicular stomatitis virus identified amino acid residues within the L protein domain VI required for mRNA cap methylation. In our recent study we analyzed four amino acid residues within domain VI of the Sendai virus (SeV) L protein, and our data indicated that there could be differences in L protein sequence requirements for cap methylation in two different families of Mononegavirales: rhabdoviruses and paramyxoviruses. In this study, we conducted a more comprehensive mutational analysis by targeting the entire SeV L protein domain VI, creating twenty-four L mutants, and testing these mutations for their effects on viral mRNA synthesis, cap methylation, viral genome replication and virus growth kinetics. Our analysis identified several residues required for successful cap methylation and virus replication and clearly showed the importance of the K-D-K-E tetrad and glycine-rich motif in SeV cap methylation. This study is the first extensive sequence analysis of the L protein domain VI in the family Paramyxoviridae, and it confirms structural and functional similarity of this domain across different families of the order Mononegavirales.
Sendai virus; paramyxovirus; mRNA cap methylation; methyltransferase; L polymerase protein
Gene expression divergence is a phenotypic trait reflecting evolution of gene regulation and characterizing dissimilarity between species and between cells and tissues within the same species. Several distance measures, such as Euclidean and correlation-based distances, have been proposed for measuring expression divergence.
We show that different distance measures identify different trends in gene expression patterns. When comparing orthologous genes in eight rat and human tissues, the Euclidean distance identified genes uniformly expressed in all tissues near the expression background as genes with the most conserved expression pattern. In contrast, correlation-based distance and generalized-average distance identified genes with concerted changes among homologous tissues as those most conserved. On the other hand, correlation-based distance, Euclidean distance and generalized-average distance highlight quite well the relatively high similarity of gene expression patterns in homologous tissues between species, compared to non-homologous tissues within species.
Different trends exist in the high-dimensional numeric data, and to highlight a particular trend an appropriate distance measure needs to be chosen. The choice of the distance measure for measuring expression divergence can be dictated by the expression patterns that are of interest in a particular study.
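The contrast between the two most familiar measures can be made concrete with a small sketch. The profiles below are invented numbers, not data from the study, and the function names are illustrative. The generalized-average distance is not defined in the text and is therefore not shown.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson r; returns None if either profile has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return 1 - sxy / math.sqrt(sxx * syy)

# Hypothetical profiles across four tissues.
# Concerted change at different absolute levels in the two species:
concerted_a = [1.0, 2.0, 4.0, 8.0]
concerted_b = [10.0, 20.0, 40.0, 80.0]
# Nearly flat expression close to background, with opposite small fluctuations:
flat_a = [1.0, 1.1, 0.9, 1.0]
flat_b = [1.0, 0.9, 1.1, 1.0]

for name, (x, y) in {"concerted": (concerted_a, concerted_b),
                     "flat": (flat_a, flat_b)}.items():
    print(name, round(euclidean(x, y), 3), round(correlation_distance(x, y), 3))
```

The flat pair is close in Euclidean terms but maximally distant by correlation, while the concerted pair shows the reverse, which mirrors the trend described above for genes expressed uniformly near background.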
This article was reviewed by Mikhail Gelfand, Eugene Koonin and Subhajyoti De (nominated by Sarah Teichmann).
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g^3) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g^6). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g^3 log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
Supplementary information: Supplementary materials are available at Bioinformatics online.
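The notions of SymBets and SymBet triangles used above can be sketched in a few lines. This is an illustration of the definitions only, not the EdgeSearch implementation distributed at the address given: the input format, the function names and the toy scores are assumptions, and ties between equally scored hits are resolved arbitrarily in favour of the first one seen.

```python
from collections import defaultdict

def symmetric_best_matches(hits, genome_of):
    """hits: iterable of (query, subject, score). Returns a set of frozenset({a, b})."""
    best = {}  # (query gene, subject genome) -> (score, subject gene)
    for query, subject, score in hits:
        if genome_of[query] == genome_of[subject]:
            continue  # only intergenomic matches are considered
        key = (query, genome_of[subject])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)
    symbets = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, genome_of[query]))
        if back is not None and back[1] == query:
            symbets.add(frozenset((query, subject)))
    return symbets

def triangles(symbets):
    """Enumerate triangles by walking over edges and intersecting neighbour sets."""
    neighbours = defaultdict(set)
    for edge in symbets:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)
    found = set()
    for edge in symbets:
        a, b = tuple(edge)
        for c in neighbours[a] & neighbours[b]:
            found.add(frozenset((a, b, c)))
    return found

# Toy data: three genomes X, Y, Z; x2 is a weaker paralog-like hit to y1.
genome_of = {"x1": "X", "x2": "X", "y1": "Y", "z1": "Z"}
hits = [("x1", "y1", 100), ("y1", "x1", 100),
        ("x1", "z1", 90), ("z1", "x1", 90),
        ("y1", "z1", 80), ("z1", "y1", 80),
        ("x2", "y1", 50), ("y1", "x2", 40)]
symbets = symmetric_best_matches(hits, genome_of)
print(sorted(sorted(t) for t in triangles(symbets)))
```

Walking over edges and intersecting the neighbour sets of the two endpoints avoids materializing every triple of genomes, which is the general idea behind preferring edges to triangles in sparse graphs.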
Ethanolamine can be used as a source of carbon and nitrogen by phylogenetically diverse bacteria. Ethanolamine-ammonia lyase, the enzyme that breaks ethanolamine into acetaldehyde and ammonia, is encoded by the gene tandem eutBC. Despite extensive studies of ethanolamine utilization in Salmonella enterica serovar Typhimurium, much remains to be learned about EutBC structure and catalytic mechanism, about the evolutionary origin of ethanolamine utilization, and about regulatory links between the metabolism of ethanolamine itself and the ethanolamine-ammonia lyase cofactor adenosylcobalamin. We used computational analysis of sequences, structures, genome contexts, and phylogenies of ethanolamine-ammonia lyases to address these questions and to evaluate recent data-mining studies that have suggested an association between bacterial food poisoning and the diol utilization pathways. We found that EutBC evolution included recruitment of a TIM barrel and a Rossmann fold domain and their fusion to N-terminal α-helical domains to give EutB and EutC, respectively. This fusion was followed by recruitment and occasional loss of auxiliary ethanolamine utilization genes in Firmicutes and by several horizontal transfers, most notably from the firmicute stem to the Enterobacteriaceae and from Alphaproteobacteria to Actinobacteria. We identified a conserved DNA motif that likely represents the EutR-binding site and is shared by the ethanolamine and cobalamin operons in several enterobacterial species, suggesting a mechanism for coupling the biosyntheses of apoenzyme and cofactor in these species. Finally, we found that the food poisoning phenotype is associated with the structural components of the metabolosome more strongly than with ethanolamine utilization genes or with paralogous propanediol utilization genes per se.
The genomes of two closely related lytic Thermus thermophilus siphoviruses with exceptionally long (~800 nm) tails, bacteriophages P23-45 and P74-26, were completely sequenced. The P23-45 genome consists of 84,201 bp with 117 putative ORFs (Open Reading Frames), and the P74-26 genome has 83,319 bp and 116 putative ORFs. The two genomes are 92% identical with 113 ORFs shared. Only 25% of phage gene product functions can be predicted from similarities to proteins and protein domains with known functions. The structural genes of P23-45, most of which have no similarity to sequences from public databases, were identified by mass-spectrometric analysis of virions. An unusual feature of the P23-45 and P74-26 genomes is the presence, in their largest intergenic regions, of long polypurine-polypyrimidine (R-Y) sequences with mirror repeat symmetry. Such sequences, abundant in eukaryotic genomes but rare in prokaryotes, are known to form stable triple helices that block replication and transcription and induce genetic instability. Comparative analysis of the two phage genomes shows that the area around the triplex-forming elements is enriched in mutational variations. In vitro, phage R-Y sequences form triplexes and block DNA synthesis by Taq DNA polymerase in an orientation-dependent manner, suggesting that they may play a regulatory role during P23-45 and P74-26 development.
Thermus thermophilus; thermophages; virion proteomics; bioinformatics; triplex-forming sequence
A phyletic vector, also known as a phyletic (or phylogenetic) pattern, is a binary representation of the presences and absences of orthologous genes in different genomes. Joint occurrence of two or more genes in many genomes results in closely similar binary vectors representing these genes, and this similarity between gene vectors may be used as a measure of functional association between genes. Better understanding of quantitative properties of gene co-occurrences is needed for systematic studies of gene function and evolution. We used the probabilistic iterative algorithm Psi-square to find groups of similar phyletic vectors. An extended Psi-square algorithm, in which pseudocounts are implemented, shows better sensitivity in identifying proteins with known functional links than our earlier hierarchical clustering approach. At the same time, the specificity of inferring functional associations between genes in prokaryotic genomes is strongly dependent on the pathway: phyletic vectors of the genes involved in energy metabolism and in de novo biosynthesis of the essential precursors tend to be lumped together, whereas cellular modules involved in secretion, motility, assembly of cell surfaces, biosynthesis of some coenzymes, and utilization of secondary carbon sources tend to be identified with much greater specificity. It appears that the network of gene coinheritance in prokaryotes contains a giant connected component that encompasses most biosynthetic subsystems, along with a series of more independent modules involved in cell interaction with the environment.
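As a minimal illustration of how phyletic vectors can be compared, the sketch below scores pairs of binary vectors with the Jaccard coefficient. This is a simple stand-in chosen for clarity, not the Psi-square algorithm used in the study, and the gene names and vectors are invented.

```python
def jaccard(u, v):
    """Share of genomes containing both genes among genomes containing either."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

# Hypothetical phyletic vectors over six genomes (1 = ortholog present).
vectors = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [0, 0, 1, 0, 1, 0],
}

def most_similar(query, vectors):
    """Rank the other genes by similarity of their phyletic vectors to `query`."""
    scores = [(jaccard(vectors[query], vec), name)
              for name, vec in vectors.items() if name != query]
    return sorted(scores, reverse=True)

print(most_similar("geneA", vectors))
```

A coefficient of this kind ignores genomes in which both genes are absent, so it does not reward shared absences; whether that is desirable depends on how widespread the genes are.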
A novel phage infecting Escherichia coli was isolated during a large-scale screen for phages that may be used for therapy of mastitis in cattle. The 77,554 bp genome of the phage, named phiEco32, was sequenced and annotated, and its virions were characterized by electron microscopy and proteomics. Two phiEco32-encoded proteins that interact with host RNA polymerase were identified. One of them is an ECF-family σ-factor that may be responsible for transcription of some viral genes. Another RNA polymerase-binding protein is a novel transcription inhibitor whose mechanism of action remains to be defined.
bacteriophage; Podoviridae; E. coli; genome; MudPIT; RNA polymerase-binding proteins
While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied to the same microarray time series dataset four distinct mathematical methods to identify significant patterns in gene expression profiles. These methods are called: Phase consistency, Address reduction, Cyclohedron test and Stable persistence, and are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, are not dependent on the assumption of periodicity of the pattern of interest. Remarkably, these methods identified blindly the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos. 
Our results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.
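For readers unfamiliar with the Lomb-Scargle periodogram mentioned above, the following sketch implements its textbook normalized form for unevenly sampled data. It is not the pipeline used in the original analysis, and it does not implement any of the four methods named in the abstract. The sampling times, the test signal and the candidate periods are invented, and the time units are arbitrary.

```python
import math

def lomb_scargle_power(times, values, period):
    """Normalized Lomb-Scargle power of `values` sampled at `times` for one period."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    omega = 2 * math.pi / period
    # The time offset tau makes the sine and cosine terms orthogonal on the sample.
    tau = math.atan2(sum(math.sin(2 * omega * t) for t in times),
                     sum(math.cos(2 * omega * t) for t in times)) / (2 * omega)
    cos_terms = [math.cos(omega * (t - tau)) for t in times]
    sin_terms = [math.sin(omega * (t - tau)) for t in times]
    y_cos = sum((v - mean) * c for v, c in zip(values, cos_terms))
    y_sin = sum((v - mean) * s for v, s in zip(values, sin_terms))
    return (y_cos ** 2 / sum(c ** 2 for c in cos_terms)
            + y_sin ** 2 / sum(s ** 2 for s in sin_terms)) / (2 * variance)

# Hypothetical unevenly spaced time points and a profile oscillating with period 120.
times = [20 * i + (7 * i) % 11 for i in range(30)]
profile = [math.sin(2 * math.pi * t / 120) for t in times]
candidate_periods = [60, 80, 120, 180, 240]
best = max(candidate_periods, key=lambda p: lomb_scargle_power(times, profile, p))
print(best)
```

Because the periodogram fits a single sinusoid at each trial period, it is tuned to periodic profiles; methods that do not assume periodicity, as some of those described above, would not share this restriction.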
Members of the green fluorescent protein (GFP) family share sequence similarity and the 11-stranded β-barrel fold. Fluorescence or bright coloration, observed in many members of this family, is enabled by the intrinsic properties of the polypeptide chain itself, without the requirement for cofactors. The amino acid sequences of fluorescent proteins can be altered by genetic engineering to produce variants with different spectral properties, suitable for direct visualization of molecular and cellular processes. Naturally occurring GFP-like proteins include fluorescent proteins from cnidarians of the Hydrozoa and Anthozoa classes, and from copepods of the Pontellidae family, as well as non-fluorescent proteins from Anthozoa. Recently, an mRNA encoding a fluorescent GFP-like protein AmphiGFP, related to GFP from Pontellidae, has been isolated from the lancelet Branchiostoma floridae, a cephalochordate (Deheyn et al., Biol Bull, 2007 213:95).
We report that the nearly completely sequenced genome of Branchiostoma floridae encodes at least 12 GFP-like proteins. Evidence for expression of six of these genes can be found in the EST databases. Phylogenetic analysis suggests that a gene encoding a GFP-like protein was present in the common ancestor of Cnidaria and Bilateria. We synthesized and expressed two of the lancelet GFP-like proteins in mammalian cells and in bacteria. One protein, which we called LanFP1, exhibits bright green fluorescence in both systems. The other protein, LanFP2, is identical to AmphiGFP in amino acid sequence and is moderately fluorescent. Live imaging of the adult animals revealed bright green fluorescence at the anterior end and in the basal region of the oral cirri, as well as weaker green signals throughout the body of the animal. In addition, red fluorescence was observed in oral cirri, extending to the tips.
GFP-like proteins may have been present in the primitive Metazoa. Their evolutionary history includes losses in several metazoan lineages and expansion in cephalochordates that resulted in the largest repertoire of GFP-like proteins known thus far in a single organism. The lancelet expresses several of its GFP-like proteins, which appear to have distinct spectral properties and perhaps diverse functions.
This article was reviewed by Shamil Sunyaev, Mikhail Matz (nominated by I. King Jordan) and L. Aravind.
We determined the sequence of the 152,372-bp genome of ϕYS40, a lytic tailed bacteriophage of Thermus thermophilus. The genome contains 170 putative open reading frames and three tRNA genes. Functions for 25% of ϕYS40 gene products were predicted on the basis of similarity to proteins of known function from diverse phages and bacteria. ϕYS40 encodes a cluster of proteins involved in nucleotide salvage, such as flavin-dependent thymidylate synthase, thymidylate kinase, ribonucleotide reductase, and deoxycytidylate deaminase, and in DNA replication, such as DNA primase, helicase, type A DNA polymerase, and predicted terminal protein involved in initiation of DNA synthesis. The structural genes of ϕYS40, most of which have no similarity to sequences in public databases, were identified by mass-spectrometric analysis of purified virions. Various ϕYS40 proteins have different phylogenetic neighbors, including Myovirus, Podovirus, and Siphovirus gene products, bacterial genes, and in one case, a dUTPase from a eukaryotic virus. ϕYS40 has apparently arisen through multiple acts of recombination between different phage genomes as well as through acquisition of bacterial genes.
Thermus thermophilus; bacteriophage; genome; virion; proteomics; bioinformatics; DNA polymerase
Reconstruction of the evolutionary history of bacteriophages is a difficult problem because of fast sequence drift and the lack of omnipresent genes in phage genomes. Moreover, losses and recombinational exchanges of genes are so pervasive in phages that the plausibility of phylogenetic inference in the phage kingdom has been questioned.
We compiled the profiles of presence and absence of 803 orthologous genes in 158 completely sequenced phages with double-stranded DNA genomes and used these gene content vectors to infer the evolutionary history of phages. There were 18 well-supported clades, mostly corresponding to accepted genera, but in some cases appearing to define new taxonomic groups. Conflicts between this phylogeny and trees constructed from sequence alignments of phage proteins were exploited to infer 294 specific acts of intergenome gene transfer.
A notoriously reticulate evolutionary history of fast-evolving phages can be reconstructed in considerable detail by quantitative comparative genomics.
Open peer review
This article was reviewed by Eugene Koonin, Nicholas Galtier and Martijn Huynen.
Identification of the structural domains of proteins is important for our understanding of the organizational principles and mechanisms of protein folding, and for insights into protein function and evolution. Algorithmic methods of dissecting proteins of known structure into domains developed so far are based on an examination of multiple geometrical, physical and topological features. Successful as many of these approaches are, they employ a lot of heuristics, and it is not clear whether they illuminate any deep underlying principles of protein domain organization. Other well-performing domain dissection methods rely on comparative sequence analysis. These methods are applicable to sequences with known and unknown structure alike, and their success highlights a fundamental principle of protein modularity, but this does not directly improve our understanding of protein spatial structure.
We present a novel graph-theoretical algorithm for the identification of domains in proteins with known three-dimensional structure. We represent the protein structure as an undirected, unweighted and unlabeled graph whose nodes correspond to the secondary structure elements and whose edges represent physical proximity of at least one pair of alpha carbon atoms from two elements. Domains are identified as constrained partitions of the graph, corresponding to sets of vertices obtained by the maximization of the cycle distributions found in the graph. The decision to accept or reject a tentative cut position is based on a specific classifier. When a partition is accepted, the algorithm is applied iteratively to each of the resulting subgraphs, and it terminates automatically when partitions are no longer accepted. The distribution of cycles is the only type of information on which the decision about protein dissection is based. Despite the bare-bones simplicity of the approach, our algorithm approaches the best heuristic algorithms in accuracy.
Our graph-theoretical algorithm uses only topological information present in the protein structure itself to find the domains and does not rely on any geometrical or physical information about the protein molecule. Perhaps unexpectedly, these drastic constraints on resources, which result in a seemingly approximate description of protein structures and leave only a handful of parameters available for analysis, do not lead to any significant deterioration of algorithm accuracy. It appears that protein structures can be rigorously treated as topological rather than geometrical objects and that the majority of information about protein domains can be inferred from the coarse-grained measure of pairwise proximity between secondary structure elements.
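The graph representation described above can be sketched as follows. Only the construction of the graph is taken from the text. The 7.0 distance cutoff, the function names and the toy coordinates are assumptions, and the cycle rank computed at the end is a simple summary of how many independent cycles a graph contains; it is not the cycle-distribution criterion or the classifier used by the algorithm.

```python
import math
from itertools import combinations

def sse_contact_graph(sse_coords, cutoff=7.0):
    """sse_coords: name -> list of alpha-carbon (x, y, z) tuples. Returns a set of edges.

    Two secondary structure elements are linked if at least one pair of their
    alpha carbons lies within `cutoff` (an assumed threshold).
    """
    edges = set()
    for a, b in combinations(sorted(sse_coords), 2):
        if any(math.dist(p, q) <= cutoff
               for p in sse_coords[a] for q in sse_coords[b]):
            edges.add((a, b))
    return edges

def cycle_rank(nodes, edges):
    """Number of independent cycles: E - V + number of connected components."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    components = len(parent)
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(edges) - len(parent) + components

# Toy example: three elements packed together and one far away.
coords = {
    "H1": [(0.0, 0.0, 0.0)],
    "H2": [(5.0, 0.0, 0.0)],
    "E1": [(2.5, 4.0, 0.0)],
    "E2": [(50.0, 0.0, 0.0)],
}
graph_edges = sse_contact_graph(coords)
print(sorted(graph_edges), cycle_rank(coords, graph_edges))
```

Tightly packed elements form many cycles among themselves and few with the rest of the structure, which gives an intuition for why cycle content can carry information about domain boundaries.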