Tn916-like conjugative transposons carrying antibiotic resistance genes are found in a diverse range of bacteria. Orf14 within the conjugation module encodes a bifunctional cell-wall hydrolase, CwlT, that consists of an N-terminal bacterial lysozyme domain (N-acetylmuramidase, bLysG) and a C-terminal NlpC/P60 domain (γ-d-glutamyl-l-diamino acid endopeptidase) and is expected to play an important role in the spread of the transposons. We determined the crystal structures of two CwlT proteins, from the pathogens Staphylococcus aureus Mu50 (SaCwlT) and Clostridium difficile 630 (CdCwlT). These structures reveal that the NlpC/P60 and LysG domains are compact and conserved modules, connected by a short flexible linker. The LysG domain represents a novel family of widely distributed bacterial lysozymes. The overall structure and the active site of bLysG bear significant similarity to other members of the glycoside hydrolase family 23 (GH23), such as the g-type lysozyme (LysG) and Escherichia coli lytic transglycosylase MltE. The active site of bLysG contains a unique structural and sequence signature (DxxQSSES+S) that is important for coordinating a catalytic water. Molecular modeling suggests that the bLysG domain may recognize glycan in a similar manner to MltE. The C-terminal NlpC/P60 domain contains a conserved active site (Cys-His-His-Tyr) that appears to be specific for tetrapeptide substrates. Access to the active site is likely regulated by isomerism of a side chain atop the catalytic cysteine, which opens to allow substrate entry or product release and closes during catalysis.
bifunctional cell-wall lysin; bacterial lysozyme; muramidase; NlpC/P60 endopeptidase; Tn916 family conjugative transposons
The promise of personalized cancer medicine cannot be fulfilled until we gain better understanding of the connections between the genomic makeup of a patient's tumor and its response to anticancer drugs. Several datasets that include both pharmacologic profiles of cancer cell lines as well as their genomic alterations have been recently developed and extensively analyzed. However, most analyses of these datasets assume that mutations in a gene will have the same consequences regardless of their location. While this assumption might be correct in some cases, such analyses may miss subtler, yet still relevant, effects mediated by mutations in specific protein regions. Here we study such perturbations by separating effects of mutations in different protein functional regions (PFRs), including protein domains and intrinsically disordered regions. Using this approach, we have been able to identify 171 novel associations between mutations in specific PFRs and changes in the activity of 24 drugs that couldn't be recovered by traditional gene-centric analyses. Our results demonstrate how focusing on individual protein regions can provide novel insights into the mechanisms underlying the drug sensitivity of cancer cell lines. Moreover, while these new correlations are identified using only data from cancer cell lines, we have been able to validate some of our predictions using data from actual cancer patients. Our findings highlight how gene-centric experiments (such as systematic knock-out or silencing of individual genes) are missing relevant effects mediated by perturbations of specific protein regions. All the associations described here are available from http://www.cancer3d.org.
There is increasing evidence that altering different functional regions within the same protein can lead to dramatically distinct phenotypes. Here we show how, by focusing on individual regions instead of whole proteins, we are able to identify novel correlations that predict the activity of anticancer drugs. We have also used proteomic data from both cancer cell lines and actual cancer patients to explore the molecular mechanisms underlying some of these region-drug associations. We finally show how associations found between protein regions and drugs using only data from cancer cell lines can predict the survival of cancer patients.
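The contrast between gene-centric and region-centric analysis described above can be illustrated with a minimal sketch. It is not the method used in the study, whose statistical details are not given here: the function name, the region boundaries, and the data layout are all hypothetical, and a plain difference of means stands in for a proper association test.

```python
from statistics import mean

def region_effects(regions, mutations, activity):
    """Compare drug activity between cell lines mutated inside a region and all others.

    regions:   {region_name: (start, end)}, inclusive residue ranges (hypothetical)
    mutations: {cell_line: [mutated residue positions in the gene]}
    activity:  {cell_line: drug activity value}
    Returns {region_name: mean(activity | mutated in region) - mean(activity | rest)}.
    """
    effects = {}
    for name, (start, end) in regions.items():
        def hit(cell_line):
            return any(start <= p <= end for p in mutations.get(cell_line, []))
        inside = [activity[c] for c in activity if hit(c)]
        outside = [activity[c] for c in activity if not hit(c)]
        if inside and outside:  # skip regions with no contrast to measure
            effects[name] = mean(inside) - mean(outside)
    return effects

# Illustrative, made-up data: mutations in the "kinase" region lower activity,
# while a mutation in the "disordered" region does not.
regions = {"kinase": (100, 200), "disordered": (1, 50)}
mutations = {"A": [150], "B": [20], "C": [], "D": [160]}
activity = {"A": 1.0, "B": 5.0, "C": 5.0, "D": 2.0}
effects = region_effects(regions, mutations, activity)
```

In this toy example, pooling all mutated cell lines (A, B and D) would dilute the signal carried by the two lines mutated in the "kinase" region, which is the kind of effect a gene-centric analysis can miss.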
PF10014 is a novel family of 2-oxoglutarate-Fe2+-dependent dioxygenases that are involved in biosynthesis of antibiotics and regulation of biofilm formation, likely by catalyzing hydroxylation of free amino acids or other related ligands. The crystal structure of a PF10014 member from Methylibium petroleiphilum at 1.9 Å resolution shows strong structural similarity to cupin dioxygenases in overall fold and active site, despite very remote homology. However, one of the β-strands of the cupin catalytic core is replaced by a loop that displays conformational isomerism that likely regulates the active site.
PF10014/BsmA; cupin dioxygenase; free amino acids; 2-oxoglutarate; ferrous iron
Branched polymers of glucose are universally used for energy storage in cells, taking the form of glycogen in animals, fungi, Bacteria, and Archaea, and of amylopectin in plants. Some enzymes involved in glycogen and amylopectin metabolism are similarly conserved in all forms of life, but some, interestingly, are not. In this paper we focus on the phylogeny of glycogen branching and debranching enzymes, which respectively introduce and remove the α(1–6) bonds that give glucose polymers their unique branching structure.
We performed a large-scale phylogenomic analysis of branching and debranching enzymes in over 400 completely sequenced genomes, including more than 200 from eukaryotes. We show that branching and debranching enzymes can be found in all kingdoms of life, including all major groups of eukaryotes, and thus were likely to have been present in the last universal common ancestor (LUCA) but have been lost in seemingly random fashion in numerous single-celled eukaryotes. We also show how animal branching and debranching enzymes evolved from their LUCA ancestors by acquiring additional domains. Furthermore, we show that enzymes commonly perceived as orthologous, such as human branching enzyme GBE1 and E. coli branching enzyme GlgB, are in fact related by a gene duplication and consequently paralogous.
Despite being usually associated with animal liver glycogen and plant starch, energy storage in the form of branched glucose polymers is clearly an ancient process and has probably been present in the last universal common ancestor of all present life. The evolution of the enzymes enabling this form of energy storage is more complex than previously thought and illustrates the need for explicit phylogenomic analysis in the study of even seemingly “simple” metabolic enzymes. Patterns of conservation in the evolution of the glycogen/starch branching and debranching enzymes hint at some as yet unknown mechanisms, as mutations disrupting these patterns lead to a variety of genetic diseases in humans and other mammals.
Glycogen; Starch; Branching; Debranching; Glycogen storage disease; AGL; GBE1; GlgB; GlgX; TreX
PubServer, available at http://pubserver.burnham.org/, is a tool to automatically
collect, filter and analyze publications associated with groups of homologous
proteins. Protein entries in databases such as Entrez Protein database at NCBI
contain information about publications associated with a given protein. The
scope of these publications varies a lot: they include studies focused on
biochemical functions of individual proteins, but also reports from genome
sequencing projects that introduce tens of thousands of proteins. Collecting and
analyzing publications related to sets of homologous proteins help in functional
annotation of novel protein families and in improving annotations of
well-studied protein families or individual genes. However, performing such
collection and analysis manually is a tedious and time-consuming process.
PubServer automatically collects identifiers of homologous proteins using
PSI-Blast, retrieves literature references from corresponding database entries
and filters out publications unlikely to contain useful information about
individual proteins. It also prepares simple vocabulary statistics from titles,
abstracts and MeSH terms to identify the most frequently occurring keywords,
which may help to quickly identify common themes in these publications. The
filtering criteria applied to collected publications are user-adjustable. The
results of the server are presented as an interactive page that allows
re-filtering and different presentations of the output.
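The filtering and vocabulary steps described above can be sketched as follows. This is only an illustration under stated assumptions, not PubServer's implementation: the record layout (`title`, `n_proteins`), the threshold of 50 linked proteins, and the stop-word list are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much longer one.
STOPWORDS = {"the", "of", "and", "in", "to", "for", "from", "by", "with", "an", "on"}

def filter_publications(pubs, max_proteins=50):
    """Drop publications linked to many proteins (e.g. genome sequencing reports)."""
    return [p for p in pubs if p["n_proteins"] <= max_proteins]

def keyword_counts(pubs, top=5):
    """Count the most frequent non-stop-word tokens in publication titles."""
    words = Counter()
    for p in pubs:
        tokens = re.findall(r"[a-z][a-z0-9-]+", p["title"].lower())
        words.update(t for t in tokens if t not in STOPWORDS)
    return words.most_common(top)

# Made-up records for illustration only.
pubs = [
    {"title": "Crystal structure of a glycoside hydrolase", "n_proteins": 3},
    {"title": "Complete genome sequence of Escherichia coli", "n_proteins": 4000},
    {"title": "Glycoside hydrolase activity in gut bacteria", "n_proteins": 2},
]
kept = filter_publications(pubs)
top_words = keyword_counts(kept, top=2)
```

Making `max_proteins` a parameter mirrors the user-adjustable filtering criteria mentioned above.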
Gut microbiome metagenomics has revealed many protein families and domains found largely or exclusively in that environment. Proteins containing the GxGYxYP domain are over-represented in the gut microbiota, and are found in Polysaccharide Utilization Loci in the gut symbiont Bacteroides thetaiotaomicron, suggesting their involvement in polysaccharide metabolism, but little else is known of the function of this domain.
Genomic context and domain architecture analyses support a role for the GxGYxYP domain in carbohydrate metabolism. Sparse occurrences in eukaryotes are the result of lateral gene transfer. The structure of the GxGYxYP domain-containing protein encoded by the BT2193 locus reveals two structural domains, the first composed of three divergent repeats with no recognisable homology to previously solved structures, the second a more familiar seven-stranded β/α barrel. Structure-based analyses including conservation mapping localise a presumed functional site to a cleft between the two domains of BT2193. Matching to a catalytic site template from a GH9 cellulase and other analyses point to a putative catalytic triad composed of Glu272, Asp331 and Asp333.
We suggest that GxGYxYP-containing proteins constitute a novel glycoside hydrolase family of as yet unknown specificity.
Carbohydrate metabolism; Glycoside hydrolase; Polysaccharide Utilization Locus; PUL; Protein function prediction; JCSG; 3D structure; Protein family; Gut microbiota
Approximately 50% of cell wall peptidoglycan in Gram-negative bacteria is recycled with each generation. The primary substrates used for peptidoglycan biosynthesis and recycling in the cytoplasm are GlcNAc-MurNAc(anhydro)-tetrapeptide and its degradation product, the free tetrapeptide. This complex process involves ∼15 proteins, among which the cytoplasmic enzyme ld-carboxypeptidase A (LdcA) catabolizes the bond between the last two l- and d-amino acid residues in the tetrapeptide to form the tripeptide, which is then utilized as a substrate by murein peptide ligase (Mpl). LdcA has been proposed as an antibacterial target. The crystal structure of Novosphingobium aromaticivorans DSM 12444 LdcA (NaLdcA) was determined at 1.89-Å resolution. The enzyme was biochemically characterized and its interactions with the substrate modeled, identifying residues potentially involved in substrate binding. Unaccounted electron density at the dimer interface in the crystal suggested a potential site for disrupting protein-protein interactions should a dimer be required to perform its function in bacteria. Our analysis extends the identification of functional residues to several other homologs, which include enzymes from bacteria that are involved in hydrocarbon degradation and destruction of coral reefs. The NaLdcA crystal structure provides an alternate system for investigating the structure-function relationships of LdcA and increases the structural coverage of the protagonists in bacterial cell wall recycling.
POSA (Partial Order Structure Alignment), available at http://posa.godziklab.org, is a server for multiple protein structure alignment introduced in 2005 (Ye,Y. and Godzik,A. (2005) Multiple flexible structure alignment using partial order graphs. Bioinformatics, 21, 2362–2369). It is free and open to all users, and there is no login requirement, albeit there is an option to register and store results in individual, password-protected directories. In the updated POSA server described here, we introduce two significant improvements. First is an interface allowing the user to provide additional information by defining segments that anchor the alignment in one or more input structures. This interface allows users to take advantage of their intuition and biological insights to improve the alignment and guide it toward a biologically relevant solution. The second improvement is an interactive visualization with options that allow the user to view all superposed structures in one window (a typical solution for visualizing results of multiple structure alignments) or view them individually in a series of synchronized windows with extensive, user-controlled visualization options. The user can rotate structure(s) in any of the windows and study similarities or differences between structures clearly visible in individual windows.
The AIDA domain assembly server, available at http://ffas.burnham.org/AIDA/, is a tool that can identify domains in multi-domain proteins and then predict their 3D structures and relative spatial arrangements. The server is free and open to all users, and a user can optionally provide an e-mail address to receive a link to the results page. Domains are evolutionarily conserved and often functionally independent units in proteins. Most proteins, especially eukaryotic ones, consist of multiple domains, while most experimentally determined protein structures contain only one or two domains. As a result, structures of individual domains in multi-domain proteins can often be accurately predicted, but the mutual arrangement of different domains remains unknown. To address this issue we have developed the AIDA program, which combines the steps of identifying individual domains, predicting (separately) their structures and assembling them into multiple-domain complexes using an ab initio folding potential to describe domain–domain interactions. The AIDA server not only supports the assembly of a large number of continuous domains, but also allows the assembly of domains inserted into other domains. Users can also provide distance restraints to guide the AIDA energy minimization.
Molecular evolution is driven by mutations, which may affect the fitness of an organism and are then subject to natural selection or genetic drift. Analysis of primary protein sequences and tertiary structures has yielded valuable insights into the evolution of protein function, but little is known about evolution of functional mechanisms, protein dynamics and conformational plasticity essential for activity. We characterized the atomic-level motions across divergent members of the dihydrofolate reductase (DHFR) family. Despite structural similarity, E. coli and human DHFRs use different dynamic mechanisms to perform the same function, and human DHFR cannot complement DHFR-deficient E. coli cells. Identification of the primary sequence determinants of flexibility in DHFRs from several species allowed us to propose a likely scenario for the evolution of functionally important DHFR dynamics, following a pattern of divergent evolution that is tuned by the cellular environment.
Periodic proteins, characterized by the presence of multiple repeats of short motifs, form an interesting and seldom-studied group. Due to often extreme divergence in sequence, detection and analysis of such motifs is performed more reliably on the structural level. Yet, few algorithms have been developed for the detection and analysis of structures of periodic proteins.
ConSole recognizes modularity in protein contact maps, allowing for precise identification of repeats in solenoid protein structures, an important subgroup of periodic proteins. Tests on benchmarks show that ConSole has higher recognition accuracy as compared to Raphael, the only other publicly available solenoid structure detection tool. As a next step of ConSole analysis, we show how detection of solenoid repeats in structures can be used to improve sequence recognition of these motifs and to detect subtle irregularities of repeat lengths in three solenoid protein families.
The ConSole algorithm provides a fast and accurate tool to recognize solenoid protein structures as a whole and to identify individual solenoid repeat units from a structure. ConSole is available as a web-based, interactive server and for download at http://console.sanfordburnham.org.
Protein repeat detection; Solenoid structure; Contact map; Template matching; Machine learning
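To illustrate why contact maps expose solenoid periodicity, the sketch below builds a binary contact map from Cα coordinates and measures how often residues i and i+k are in contact. It is not the ConSole algorithm, which according to the keywords relies on template matching and machine learning; the 8 Å cutoff, the function names, and the idealized coil used as input are all assumptions made for this example.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Binary contact map: 1 if two distinct residues lie within `cutoff` angstroms."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)] for i in range(n)]

def offset_profile(cmap, min_offset=4):
    """Fraction of residue pairs (i, i+k) in contact, for each sequence offset k."""
    n = len(cmap)
    return {k: sum(cmap[i][i + k] for i in range(n - k)) / (n - k)
            for k in range(min_offset, n)}

# Idealized coil (not a real protein): 10 residues per turn, radius 10 A,
# 5 A rise per turn, so residue i stacks directly on top of residue i + 10.
coil = [(10 * math.cos(2 * math.pi * i / 10),
         10 * math.sin(2 * math.pi * i / 10),
         0.5 * i) for i in range(40)]
profile = offset_profile(contact_map(coil))
```

In a repetitive structure such as this coil, contacts concentrate in bands parallel to the main diagonal of the map, at offsets close to the repeat length, while offsets of about half a repeat show no contacts.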
Bacteroides spp. form a significant part of our gut microbiome and are well known for optimized metabolism of diverse polysaccharides. Initial analysis of the archetypal Bacteroides thetaiotaomicron genome identified 172 glycosyl hydrolases and a large number of uncharacterized proteins associated with polysaccharide metabolism.
BT_1012 from Bacteroides thetaiotaomicron VPI-5482 is a protein of unknown function and a member of a large protein family consisting entirely of uncharacterized proteins. Initial sequence analysis predicted that this protein has two domains, one at the N-terminus and one at the C-terminus. A PSI-BLAST search found over 150 full-length and over 90 half-size homologs consisting only of the N-terminal domain. The experimentally determined three-dimensional structure of the BT_1012 protein confirms its two-domain architecture, and structural analysis of both domains suggests their specific functions. The N-terminal domain is a putative catalytic domain with significant similarity to known glycoside hydrolases. The C-terminal domain has a beta-sandwich fold typically found in the C-terminal domains of other glycosyl hydrolases, where such domains are typically involved in substrate binding. We describe the structure of the BT_1012 protein and discuss its sequence-structure relationships and their possible functional implications.
Structural and sequence analyses of the BT_1012 protein identify it as a glycosyl hydrolase, expanding an already impressive catalog of enzymes involved in polysaccharide metabolism in Bacteroides spp. Based on this, we have renamed the Pfam families representing the two domains found in the BT_1012 protein, PF13204 and PF12904, as putative glycoside hydrolase and glycoside hydrolase-associated C-terminal domain, respectively.
Glycoside hydrolase; Carbohydrate metabolism; 3D structure; Protein family; Protein function prediction; Domain of unknown function; DUF
CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein sequence space.
The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The structure revealed two extra domains: an α+β domain inserted in the metallopeptidase-like domain and a C-terminal circularly permuted winged-helix-turn-helix domain.
Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide the groundwork for the experimental verification of the function.
CA_C2195; Peptidase; DUF4910; DUF2172; HTH_47; Structural genomics
Bacteroides thetaiotaomicron, a predominant member of the human gut microbiota, is characterized by its ability to utilize a wide variety of polysaccharides using the extensive saccharolytic machinery that is controlled by an expanded repertoire of transcription factors (TFs). The availability of genomic sequences for multiple Bacteroides species opens an opportunity for their comparative analysis to enable characterization of their metabolic and regulatory networks.
A comparative genomics approach was applied for the reconstruction and functional annotation of the carbohydrate utilization regulatory networks in 11 Bacteroides genomes. Bioinformatics analysis of promoter regions revealed putative DNA-binding motifs and regulons for 31 orthologous TFs in the Bacteroides. Among the analyzed TFs there are 4 SusR-like regulators, 16 AraC-like hybrid two-component systems (HTCSs), and 11 regulators from other families. Novel DNA motifs of HTCSs and SusR-like regulators in the Bacteroides have the common structure of direct repeats with a long spacer between two conserved sites.
The inferred regulatory network in B. thetaiotaomicron contains 308 genes encoding polysaccharide and sugar catabolic enzymes, carbohydrate-binding and transport systems, and TFs. The analyzed TFs control pathways for utilization of host and dietary glycans to monosaccharides and their further interconversions to intermediates of the central metabolism. The reconstructed regulatory network allowed us to suggest and refine specific functional assignments for sugar catabolic enzymes and transporters, providing a substantial improvement to the existing metabolic models for B. thetaiotaomicron. The obtained collection of reconstructed TF regulons is available in the RegPrecise database (http://regprecise.lbl.gov).
Regulatory network; Regulon; Transcription factor; BACTEROIDES; Carbohydrate utilization
Genome scale network reconstruction has enabled predictive modeling of metabolism for many systems. Traditionally, protein structural information has not been represented in such reconstructions. Expanding a genome-scale model of Escherichia coli metabolism by including experimental and predicted protein structures enabled the analysis of protein thermostability in a network context, allowing prediction of protein activities that limit network function at super-optimal temperature and mechanistic interpretations of mutations found in strains adapted to heat. Predicted growth-limiting factors for thermotolerance were validated through nutrient supplementation experiments and defined metabolic sensitivities to heat stress, providing evidence that metabolic enzyme thermostability is rate limiting at super-optimal temperature. Inclusion of structural information expanded the content and predictive capability of genome-scale metabolic networks enabling structural systems biology of metabolism.
A novel highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by a highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence, structure, and recent experimental evidence reported by other groups, we hereby redefine DUF162 as the LUD domain family.
JCSG solved the first crystal structure [PDB:2G40] from the LUD domain family: LutC protein, encoded by ORF DR_1909, of Deinococcus radiodurans. LutC shares features with domains in the functionally diverse ISOCOT superfamily. We have observed that the LUD domain has an increased abundance in the human gut microbiome.
We propose a model for substrate and cofactor binding and regulation in the LUD domain. The significance of LUD-containing proteins in the human gut microbiome and the implication of lactate metabolism in the radiation resistance of Deinococcus radiodurans are discussed.
LUD; DUF162; LutB; LutC; Domain of unknown function; Deinococcus radiodurans
The discovery of broadly neutralizing antibodies (bNAbs) has provided an enormous impetus to HIV vaccine research and to immunology as a whole. The bNAber database at http://bNAber.org provides open, user-friendly access to detailed data on the rapidly growing list of HIV bNAbs, including neutralization profiles, sequences and three-dimensional structures (when available). It also provides an extensive list of visualization and analysis tools, such as heatmaps to analyse neutralization data as well as structure and sequence viewers to correlate bNAb properties with structural and sequence features of individual antibodies. The goal of the bNAber database is to enable researchers in this field to easily compare and analyse available information on bNAbs, thereby supporting efforts to design an effective vaccine for HIV/AIDS. The bNAber database not only provides easy access to data that are currently scattered in the Supplementary Materials sections of individual papers, but also contributes to the development of general standards for the data that should be presented with the discovery of new bNAbs and of a universal mechanism for sharing such data.
Despite numerous attempts over many years to develop an HIV vaccine based on classical strategies, none has convincingly succeeded to date. A number of approaches are being pursued in the field, including building upon possible efficacy indicated by the recent RV144 clinical trial, which combined two HIV vaccines. Here, we argue for an approach based, in part, on understanding the HIV envelope spike and its interaction with broadly neutralizing antibodies (bnAbs) at the molecular level and using this understanding to design immunogens as possible vaccines. BnAbs can protect against virus challenge in animal models and many such antibodies have been isolated recently. We further propose that studies focused on how best to provide T cell help to B cells that produce bnAbs are crucial for optimal immunization strategies. The synthesis of rational immunogen design and immunization strategies, together with iterative improvements, offers great promise for advancing toward an HIV vaccine.
Every genome contains a large number of uncharacterized proteins that may encode entirely novel biological systems. Many of these uncharacterized proteins fall into related sequence families. By applying sequence and structural analysis we hope to provide insight into novel biology.
We analyze a previously uncharacterized Pfam protein family called DUF4424 [Pfam:PF14415]. The recently solved three-dimensional structure of the protein lpg2210 from Legionella pneumophila provides the first structural information pertaining to this family. This protein additionally includes the first representative structure of another Pfam family called the YARHG domain [Pfam:PF13308]. The Pfam family DUF4424 adopts a 19-stranded beta-sandwich fold that shows similarity to the N-terminal domain of leukotriene A-4 hydrolase. The YARHG domain forms an all-helical domain at the C-terminus. Structure analysis allows us to recognize distant similarities between the DUF4424 domain and individual domains of M1 aminopeptidases and tricorn proteases, which form massive proteasome-like capsids in both archaea and bacteria.
Based on our analyses we hypothesize that the DUF4424 domain may have a role in forming large, multi-component enzyme complexes. We suggest that the YARHG domain may play a role in binding a moiety in proximity to peptidoglycan, such as a hydrophobic outer membrane lipid or lipopolysaccharide.
Domain of unknown function; Protein family; Protein structure; DUF4424; YARHG domain; Sequence analysis
Proteins are known to be dynamic in nature, changing from one conformation to another while performing vital cellular tasks. It is important to understand these movements in order to better understand protein function. At the same time, experimental techniques provide us with only single snapshots of the whole ensemble of available conformations. Computational protein morphing provides a visualization of a protein structure transitioning from one conformation to another by producing a series of intermediate conformations.
We present a novel, efficient morphing algorithm, Morph-Pro, based on linear interpolation. We also show that, apart from visualization, morphing can be used to provide plausible intermediate structures. We test this by using the intermediate structures of a c-Jun N-terminal kinase (JNK1) conformational change in a virtual docking experiment. The structures are shown to dock with higher scores to known JNK1-binding ligands than structures solved using X-ray crystallography. This experiment demonstrates the potential applications of the intermediate structures in modeling or virtual screening efforts.
Visualization of protein conformational changes is important for characterization of protein function. Furthermore, the intermediate structures produced by our algorithm are good approximations to true structures. We believe there is great potential for these computationally predicted structures in protein-ligand docking experiments and virtual screening. The Morph-Pro web server can be accessed at http://morph-pro.bioinf.spbau.ru.
Protein morphing; Molecular docking; Virtual screening
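The idea of morphing by linear interpolation can be shown with a minimal sketch. It is not the Morph-Pro algorithm itself: the abstract does not say which quantities Morph-Pro interpolates, so the code below assumes the simplest possible case, interpolating the Cartesian coordinates of two conformations that are already superposed and share atom order. The function name and inputs are hypothetical.

```python
def linear_morph(start, end, n_intermediates):
    """Return evenly spaced intermediate conformations between two structures.

    start, end: lists of (x, y, z) tuples with matching atom order (assumed
    to be superposed beforehand). Endpoints are not included in the result.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    frames = []
    for step in range(1, n_intermediates + 1):
        t = step / (n_intermediates + 1)  # interpolation parameter in (0, 1)
        frames.append([tuple(a + t * (b - a) for a, b in zip(p, q))
                       for p, q in zip(start, end)])
    return frames

# Two-atom toy "structure" used purely for illustration.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 0.0, 4.0), (1.0, 4.0, 0.0)]
frames = linear_morph(start, end, 3)
```

Plain Cartesian interpolation can distort bond lengths and angles when the two conformations differ by large rotations, so a practical tool would need some additional treatment of geometry; that part is outside the scope of this sketch.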
The tad (tight adherence) locus encodes a protein translocation system that produces a novel variant of type IV pili. The pilus assembly protein TadZ (called CpaE in Caulobacter crescentus) is ubiquitous in tad loci, but is absent in other type IV pilus biogenesis systems. The crystal structure of TadZ from E. rectale (ErTadZ), in complex with ATP and Mg2+, was determined to 2.1 Å resolution. ErTadZ contains an atypical ATPase domain with a variant of a deviant Walker-A motif that retains ATP binding capacity while displaying only low intrinsic ATPase activity. The bound ATP plays an important role in dimerization of ErTadZ. The N-terminal atypical receiver domain resembles the canonical receiver domain of response regulators, but has a degenerate, stripped-down “active site”. Homology modeling of the N-terminal atypical receiver domain of CpaE indicates that it has a conserved protein-protein binding surface similar to that of the polar localization module of the social motility protein FrzS, suggesting a similar function. Our structural results also suggest that TadZ localizes to the pole through the atypical receiver domain during an early stage of pilus biogenesis, and functions as a hub for recruiting other pilus components, thus providing insights into the Tad pilus assembly process.
Type IV pili assembly; TadZ; atypical receiver domain; atypical ATPase; localization factor
MtfA of Escherichia coli (formerly YeeI) was previously identified as a regulator of the phosphoenolpyruvate (PEP)-dependent:glucose phosphotransferase system. MtfA homolog proteins are highly conserved, especially among beta- and gammaproteobacteria. We determined the crystal structures of the full-length MtfA apoenzyme from Klebsiella pneumoniae and its complex with zinc (holoenzyme) at 2.2 and 1.95 Å, respectively. MtfA contains a conserved H149E150XXH153+E212+Y205 metallopeptidase motif. The presence of zinc in the active site induces significant conformational changes in the region around Tyr205 compared to the conformation of the apoenzyme. Additionally, the zinc-bound MtfA structure is in a self-inhibitory conformation where a region that was disordered in the unliganded structure is now observed in the active site and a nonproductive state of the enzyme is formed. MtfA is related to the catalytic domain of the anthrax lethal factor and the Mop protein involved in the virulence of Vibrio cholerae, with conservation in both overall structure and in the residues around the active site. These results clearly provide support for MtfA as a prototypical zinc metallopeptidase (gluzincin clan).
Evolutionary innovation in eukaryotes and especially animals is at least partially driven by genome rearrangements and the resulting emergence of proteins with new domain combinations, and thus potentially novel functionality. Given the random nature of such rearrangements, one could expect that proteins with particularly useful multidomain combinations may have been rediscovered multiple times by parallel evolution. However, existing reports suggest a minimal role of this phenomenon in the overall evolution of eukaryotic proteomes. We assembled a collection of 172 complete eukaryotic genomes that is not only the largest, but also the most phylogenetically complete set of genomes analyzed so far. By employing a maximum parsimony approach to compare repertoires of Pfam domains and their combinations, we show that independent evolution of domain combinations is significantly more prevalent than previously thought. Our results indicate that about 25% of all currently observed domain combinations have evolved multiple times. Interestingly, this percentage is even higher for sets of domain combinations in individual species, with, for instance, 70% of the domain combinations found in the human genome having evolved independently at least once in other species. We also show that previous, much lower estimates of this rate are most likely due to the small number and biased phylogenetic distribution of the genomes analyzed. The independent emergence of identical domain combinations is widespread and is not limited to domains in specific functional categories. Besides data from large-scale analyses, we also present individual examples of independent domain combination evolution. The surprisingly large contribution of parallel evolution to the development of the domain combination repertoire in extant genomes has profound consequences for our understanding of the evolution of pathways and cellular processes in eukaryotes and for comparative functional genomics.
Most proteins in eukaryotes are composed of two or more domains, evolutionarily independent units that often have their own individual functions. The specific repertoire of multidomain proteins in a given species defines the topology of the pathways and networks that carry out its metabolic and regulatory processes. When proteins with new domain combinations emerge by gene fusion and fission, this directly affects the topology of cellular networks in the organism. To better understand the evolution of such networks, we analyzed a large set of eukaryotic genomes for the evolutionary history of known domain combinations. Our analysis shows that 70% of all domain combinations present in the human genome independently appeared in at least one other eukaryotic genome. Overall, over 25% of all known multidomain architectures emerged independently several times in the history of life. The difference between the global and the species-specific picture can be explained by the existence of a core set of domain combinations that keeps reemerging in different species, accompanied by a smaller number of unique domain combinations that do not appear anywhere else.
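How a parsimony reconstruction can reveal independent emergence of a domain combination is sketched below. This is a generic unweighted two-state parsimony on a small rooted tree, written for illustration; it is not the procedure used in the study, whose parsimony settings are not described here. The tree encoding (nested tuples of species names), the tie-breaking rules (an absent root and unchanged branch states are preferred on ties), and the species labels are all assumptions.

```python
INF = float("inf")

def _costs(node, present):
    """Bottom-up pass: minimum number of changes below `node` for state 0 or 1."""
    if isinstance(node, str):  # leaf: the observed state is the only allowed one
        has = node in present
        return (INF if has else 0, 0 if has else INF, None)
    kids = [_costs(child, present) for child in node]
    per_state = tuple(
        sum(min(k[t] + (s != t) for t in (0, 1)) for k in kids)
        for s in (0, 1)
    )
    return (per_state[0], per_state[1], kids)

def _gains(info, state):
    """Top-down pass: count absent-to-present changes in one optimal reconstruction."""
    kids = info[2]
    if kids is None:
        return 0
    total = 0
    for k in kids:
        # Cheapest child state; on ties keep the parent's state (assumption).
        child = min((0, 1), key=lambda t: (k[t] + (state != t), t != state))
        total += int(state == 0 and child == 1) + _gains(k, child)
    return total

def independent_emergences(tree, present):
    """Number of separate origins of a combination found in the `present` leaves."""
    info = _costs(tree, present)
    root = 1 if info[1] < info[0] else 0  # on ties, assume absence at the root
    return root + _gains(info, root)

# Hypothetical six-species tree, for illustration only.
tree = (("sp_A", "sp_B"), (("sp_C", "sp_D"), ("sp_E", "sp_F")))
scattered = independent_emergences(tree, {"sp_A", "sp_E"})
clade = independent_emergences(tree, {"sp_A", "sp_B"})
```

A combination present in two distant leaves is explained most cheaply by two separate gains, whereas one shared by a whole clade needs only a single origin. Counts of this kind depend on how gains and losses are weighted and on how densely the tree is sampled, which is consistent with the point above that earlier estimates were sensitive to the number and distribution of genomes.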
The human nuclear factor related to kappa-B-binding protein (NFRKB) is a 1299-residue protein that is a component of the metazoan INO80 complex involved in chromatin remodeling, transcription regulation, DNA replication and DNA repair. Although full length NFRKB is predicted to be around 65% disordered, comparative sequence analysis identified several potentially structured sections in the N-terminal region of the protein. These regions were targeted for crystallographic studies, and the structure of one of these regions spanning residues 370–495 was determined using the JCSG high-throughput structure determination pipeline. The structure reveals a novel, mostly helical domain reminiscent of the winged-helix fold typically involved in DNA binding. However, further analysis shows that this domain does not bind DNA, suggesting it may belong to a small group of winged-helix domains involved in protein-protein interactions.