This study uses the Pfam database to show that the sequence redundancy of protein structures deposited in the PDB is increasing. The possible reasons behind this trend are discussed.
High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.
Pfam families; structural coverage; protein-sequence space
Recent years have seen the establishment of structural genomics centers that explicitly target integral membrane proteins. Here, we review the advances in targeting these extremely high-hanging fruits of structural biology in high-throughput mode. We observe that the experimental determination of high-resolution structures of integral membrane proteins is increasingly successful both in terms of getting structures and of covering important protein families, e.g. from Pfam. Structural genomics has begun to contribute significantly toward this progress. An important component of this contribution is the set up of robotic pipelines that generate a wealth of experimental data for membrane proteins. We argue that prediction methods for the identification of membrane regions and for the comparison of membrane proteins largely suffice to meet the challenges of target selection for structural genomics of membrane proteins. In contrast, we need better methods to prioritize the most promising members in a family of closely related proteins and to annotate protein function from sequence and structure in absence of homology.
alpha-helical integral membrane proteins; structural genomics; protein families; protein structure; target selection; function prediction
It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ∼38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized.
Database URL: http://pfam.sanger.ac.uk/
Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under convergent evolution. Here, we test the ability of the profile hidden Markov model method HMMER3 to correctly assign homologous sequences to >13 000 manually curated families from the Pfam database. We identify problem families using protein regions that match two or more Pfam families not currently annotated as related in Pfam. We find that HMMER3 E-value estimates seem to be less accurate for families that feature periodic patterns of compositional bias, such as the ones typically observed in coiled-coils. These results support the continued use of manually curated inclusion thresholds in the Pfam database, especially on the subset of families that have been identified as problematic in experiments such as these. They also highlight the need for developing new methods that can correct for this particular type of compositional bias.
We have identified a new protein domain, which we have named the SHOCT domain (Short C-terminal domain). This domain is widespread in bacteria with over a thousand examples. But we found it is missing from the most commonly studied model organisms, despite being present in closely related species. It's predominantly C-terminal location, co-occurrence with numerous other domains and short size is reminiscent of the Gram-positive anchor motif, however it is present in a much wider range of species. We suggest several hypotheses about the function of SHOCT, including oligomerisation and nucleic acid binding. Our initial experiments do not support its role as an oligomerisation domain.
The plant SLAC1 anion channel controls turgor pressure in the aperture-defining guard cells of plant stomata, thereby regulating exchange of water vapor and photosynthetic gases in response to environmental signals such as drought or high levels of carbon dioxide. We determined the crystal structure of a bacterial homolog of SLAC1 at 1.20Å resolution, and we have used structure-inspired mutagenesis to analyze the conductance properties of SLAC1 channels. SLAC1 is a symmetric trimer composed from quasi-symmetric subunits, each having ten transmembrane helices arranged from helical hairpin pairs to form a central five-helix transmembrane pore that is gated by an extremely conserved phenylalanine residue. Conformational features suggest a mechanism for control of gating by kinase activation, and electrostatic features of the pore coupled with electrophysiological characteristics suggest that selectivity among different anions is largely a function of the energetic cost of ion dehydration.
Hepatitis C virus (HCV) p7 is a membrane-associated oligomeric protein harboring ion channel activity. It is essential for effective assembly and release of infectious HCV particles and an attractive target for antiviral intervention. Yet, the self-assembly and molecular mechanism of p7 ion channelling are currently only partially understood. Using molecular dynamics simulations (aggregate time 1.2 µs), we show that p7 can form stable oligomers of four to seven subunits, with a bias towards six or seven subunits, and suggest that p7 self-assembles in a sequential manner, with tetrameric and pentameric complexes forming as intermediate states leading to the final hexameric or heptameric assembly. We describe a model of a hexameric p7 complex, which forms a transiently-open channel capable of conducting ions in simulation. We investigate the ability of the hexameric model to flexibly rearrange to adapt to the local lipid environment, and demonstrate how this model can be reconciled with low-resolution electron microscopy data. In the light of these results, a view of p7 oligomerization is proposed, wherein hexameric and heptameric complexes may coexist, forming minimalist, yet robust functional ion channels. In the absence of a high-resolution p7 structure, the models presented in this paper can prove valuable as a substitute structure in future studies of p7 function, or in the search for p7-inhibiting drugs.
Hepatitis C remains a serious global health problem affecting more than 2% of the world's population, and current therapies are effective in only a subset of patients, necessitating an ongoing search for new treatments. The p7 viroporin is considered to be an attractive possible drug target, but rational drug design is hampered by the absence of a high-resolution p7 structure. In this paper, we explore possible structures of oligomeric p7 channels, and discuss the strengths and shortcomings of these models with respect to experimentally determined properties, such as pore-lining residues, ion conductance, and compatibility with low-resolution electron microscopy images. Our results present an image of p7 as a rudimentary, minimalistic ion channel, capable of existing in multiple oligomeric states but exhibiting a bias towards hexamers and heptamers. We believe that the work presented here will be valuable for future research by providing plausible 3-dimensional atomic-resolution models for the visualization of the p7 viroporin and serve as a basis for future computational studies.
Bacterial chemoreceptors provide an important model for understanding signalling processes. In the serine receptor Tsr from E. coli, a binding event in the periplasmic domain of the receptor dimer causes a shift in a single transmembrane helix of roughly 0.15 nm towards the cytoplasm. This small change is propagated through the ∼22 nm length of the receptor, causing downstream inhibition of the kinase CheA. This requires interactions within a trimer of receptor dimers. Additionally, the signal is amplified across a 53,000 nm2 array of chemoreceptor proteins, including ∼5,200 receptor trimers-of-dimers, at the cell pole. Despite a wealth of experimental data on the system, including high resolution structures of individual domains and extensive mutagenesis data, it remains uncertain how information is communicated across the receptor from the binding event to the downstream effectors. We present a molecular model of the entire Tsr dimer, and examine its behaviour using coarse-grained molecular dynamics and elastic network modelling. We observe a large bending in dimer models between the linker domain HAMP and coiled-coil domains, which is supported by experimental data. Models of the trimer of dimers, built from the dimer models, are more constrained and likely represent the signalling state. Simulations of the models in a 70 nm diameter vesicle with a biologically realistic lipid mixture reveal specific lipid interactions and oligomerisation of the trimer of dimers. The results indicate a mechanism whereby small motions of a single helix can be amplified through HAMP domain packing, to initiate large changes in the whole receptor structure.
To understand cell signalling events requires a physical model of the structure and behaviour of the signalling proteins involved. The methyl-accepting chemoreceptor proteins direct bacterial movement towards food sources and away from toxins. Based on experimental data we have built structural models of the serine chemoreceptor (Tsr) as a dimer, which is incapable of activating the downstream kinase CheA, and as a trimer of dimers, which can activate CheA. We have performed molecular dynamics simulation to reveal the behaviour of these two forms in a planar lipid bilayer and in a 70 nm diameter lipid vesicle with a mixture of lipids mimicking the E. coli inner membrane. We show that in isolation the dimers undergo a bending movement around the central HAMP domain, whereas the trimer-of-dimers model does not. Comparison with published experimental data suggests that these bending motions are real, and that they occur in the trimer of dimers only in response to ligand binding. Drawing together these observations with studies showing that the signalling event involves small piston motions in the transmembrane helices suggests that the bending motion is frustrated in the unliganded trimer of dimers, and that ligand binding induces bending by repacking the HAMP interface.
Nitric oxide reductases (NORs) are membrane proteins that catalyze the reduction of nitric oxide (NO) to nitrous oxide (N2O), which is a critical step of the nitrate respiration process in denitrifying bacteria. Using the recently determined first crystal structure of the cytochrome c-dependent NOR (cNOR) [Hino T, Matsumoto Y, Nagano S, Sugimoto H, Fukumori Y, et al. (2010) Structural basis of biological N2O generation by bacterial nitric oxide reductase. Science 330: 1666–70.], we performed extensive all-atom molecular dynamics (MD) simulations of cNOR within an explicit membrane/solvent environment to fully characterize water distribution and dynamics as well as hydrogen-bonded networks inside the protein, yielding the atomic details of functionally important proton channels. Simulations reveal two possible proton transfer pathways leading from the periplasm to the active site, while no pathways from the cytoplasmic side were found, consistently with the experimental observations that cNOR is not a proton pump. One of the pathways, which was newly identified in the MD simulation, is blocked in the crystal structure and requires small structural rearrangements to allow for water channel formation. That pathway is equivalent to the functional periplasmic cavity postulated in cbb3 oxidase, which illustrates that the two enzymes share some elements of the proton transfer mechanisms and confirms a close evolutionary relation between NORs and C-type oxidases. Several mechanisms of the critical proton transfer steps near the catalytic center are proposed.
Denitrification is an anaerobic process performed by several bacteria as an alternative to aerobic respiration. A key intermediate step is catalyzed by the nitric oxide reductase (NOR) enzyme, which is situated in the cytoplasmic membrane. Proton delivery to the catalytic site inside NOR is an important part of its functioning. In this work we use molecular dynamics simulations to describe water distribution and to identify proton transfer pathways in cNOR. Our results reveal two channels from the periplasmic side of the membrane and none from the cytoplasmic side, indicating that cNOR is not a proton pump. It is our hope that these results will provide a basis for further experimental and computational studies aimed to understand details of the NOR mechanism. Furthermore, this work sheds light on the molecular evolution of respiratory enzymes.
The soluble monomeric domain of lipoprotein YxeF from the Gram positive bacterium B. subtilis was selected by the Northeast Structural Genomics Consortium (NESG) as a target of a biomedical theme project focusing on the structure determination of the soluble domains of bacterial lipoproteins. The solution NMR structure of YxeF reveals a calycin fold and distant homology with the lipocalin Blc from the Gram-negative bacterium E.coli. In particular, the characteristic β-barrel, which is open to the solvent at one end, is extremely well conserved in YxeF with respect to Blc. The identification of YxeF as the first lipocalin homologue occurring in a Gram-positive bacterium suggests that lipocalins emerged before the evolutionary divergence of Gram positive and Gram negative bacteria. Since YxeF is devoid of the α-helix that packs in all lipocalins with known structure against the β-barrel to form a second hydrophobic core, we propose to introduce a new lipocalin sub-family named ‘slim lipocalins’, with YxeF and the other members of Pfam family PF11631 to which YxeF belongs constituting the first representatives. The results presented here exemplify the impact of structural genomics to enhance our understanding of biology and to generate new biological hypotheses.
As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
The identification of orthologs—genes pairs descended from a common ancestor through speciation, rather than duplication—has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods is hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.
Pfam is a widely used database of protein families, currently containing more than 13 000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the ‘sunburst’ representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
Saccharides play a central role in the nutrition of all living organisms. Whereas several saccharide uptake systems are shared between the different phylogenetic kingdoms, the phosphoenolpyruvate-dependent phosphotransferase system exists almost exclusively in bacteria. This multi-component system includes an integral membrane protein EIIC that transports saccharides and assists in their phosphorylation. Here we present the crystal structure of an EIIC from Bacillus cereus that transports diacetylchitobiose. The EIIC is a homodimer, with an expansive interface formed between the N-terminal halves of the two protomers. The C-terminal half of each protomer has a large binding pocket that contains a diacetylchitobiose, which is occluded from both sides of the membrane with its site of phosphorylation near the conserved His250 and Glu334 residues. The structure shows the architecture of this important class of transporters, identifies the determinants of substrate binding and phosphorylation, and provides a framework for understanding the mechanism of sugar translocation.
The TrkH/TrkG/KtrB proteins mediate K+ uptake in bacteria and likely evolved from simple K+ channels by multiple gene duplications or fusions. Here we present the crystal structure of a TrkH from Vibrio parahaemolyticus. TrkH is a homodimer, and each protomer contains an ion permeation pathway. A selectivity filter, similar in architecture to those of K+ channels but significantly shorter, is lined by backbone and side chain oxygen atoms. Functional studies showed that the TrkH allows permeation of K+ and Rb+ but not smaller ions such as Na+ or Li+. Immediately intracellular to the selectivity filter are an intramembrane loop and an arginine residue, both highly conserved, which constrict the permeation pathway. Substituting the arginine with an alanine significantly increases the rate of K+ flux. These results reveal the molecular basis of K+ selectivity and suggest a novel gating mechanism by this large and important family of membrane transport proteins.
The New York Consortium on Membrane Protein Structure (NYCOMPS) was formed to accelerate the acquisition of structural information on membrane proteins by applying a structural genomics approach. NY-COMPS comprises a bioinformatics group, a centralized facility operating a high-throughput cloning and screening pipeline, a set of associated wet labs that perform high-level protein production and structure determination by x-ray crystallography and NMR, and a set of investigators focused on methods development. In the first three years of operation, the NYCOMPS pipeline has so far produced and screened 7,250 expression constructs for 8,045 target proteins. Approximately 600 of these verified targets were scaled up to levels required for structural studies, so far yielding 24 membrane protein crystals. Here we describe the overall structure of NYCOMPS and provide details on the high-throughput pipeline.
Membrane proteins; Structural genomics; High throughput; NMR; X-ray
VPA0419; yiiS; PFAM 04175; structural genomics; GFT NMR
Intrinsically disordered proteins are predicted to be highly abundant and play broad biological roles in eukaryotic cells. In particular, by virtue of their structural malleability and propensity to interact with multiple binding partners, disordered proteins are thought to be specialized for roles in signaling and regulation. However, these concepts are based on in silico analyses of translated whole genome sequences, not on large-scale analyses of proteins expressed in living cells. Therefore, whether these concepts broadly apply to expressed proteins is currently unknown. Previous studies have shown that heat-treatment of cell extracts lead to partial enrichment of soluble, disordered proteins. Based on this observation, we sought to address the current dearth of knowledge about expressed, disordered proteins by performing a large-scale proteomics study of thermo-stable proteins isolated from mouse fibroblast cells. Using novel multidimensional chromatography methods and mass spectrometry, we identified a total of 1,320 thermo-stable proteins from these cells. Further, we used a variety of bioinformatics methods to analyze the structural and biological properties of these proteins. Interestingly, more than 900 of these expressed proteins were predicted to be substantially disordered. These were divided into two categories, with 514 predicted to be predominantly disordered and 395 predicted to exhibit both disordered and ordered/folded features. In addition, 411 of the thermo-stable proteins were predicted to be folded. Despite the use of heat treatment (60 min. at 98 °C) to partially enrich for disordered proteins, which might have been expected to select for small proteins, the sequences of these proteins exhibited a wide range of lengths (622 ± 555 residues (average length ± standard deviation) for disordered proteins and 569 ± 598 residues for folded proteins). Computational structural analyses revealed several unexpected features of the thermo-stable proteins: 1) disordered domains and coiled-coil domains occurred together in a large number of disordered proteins, suggesting functional interplay between these domains, and 2) more than 170 proteins contained lengthy domains (>300 residues) known to be folded. Reference to Gene Ontology Consortium functional annotations revealed that, while disordered proteins play diverse biological roles in mouse fibroblasts, they do exhibit heightened involvement in several functional categories, including, cytoskeletal structure and cell movement, metabolic and biosynthetic processes, organelle structure, cell division, gene transcription, and ribonucleoprotein complexes. We believe that these results reflect the general properties of the mouse intrinsically disordered proteome (IDP-ome) although they also reflect the specialized physiology of fibroblast cells. Large-scale identification of expressed, thermo-stable proteins from other cell types in the future, grown under varied physiological conditions, will dramatically expand our understanding of the structural and biological properties of disordered eukaryotic proteins.
intrinsically disordered proteins; intrinsically unstructured proteins; proteomics; mammalian proteome; thermo-stable proteins
The New York Consortium on Membrane Protein Structure (NYCOMPS), a part of the Protein Structure Initiative (PSI) in the USA, has as its mission to establish a high-throughput pipeline for determination of novel integral membrane protein structures. Here we describe our current target selection protocol, which applies structural genomics approaches informed by the collective experience of our team of investigators. We first extract all annotated proteins from our reagent genomes, i.e. the 96 fully sequenced prokaryotic genomes from which we clone DNA. We filter this initial pool of sequences and obtain a list of valid targets. NYCOMPS defines valid targets as those that, among other features, have at least two predicted transmembrane helices, no predicted long disordered regions and, except for community nominated targets, no significant sequence similarity in the predicted transmembrane region to any known protein structure. Proteins that feed our experimental pipeline are selected by defining a protein seed and searching the set of all valid targets for proteins that are likely to have a transmembrane region structurally similar to that of the seed. We require sequence similarity aligning at least half of the predicted transmembrane region of seed and target. Seeds are selected according to their feasibility and/or biological interest, and they include both centrally selected targets and community nominated targets. As of December 2008, over 6,000 targets have been selected and are currently being processed by the experimental pipeline. We discuss how our target list may impact structural coverage of the membrane protein space.
Electronic supplementary material
The online version of this article (doi:10.1007/s10969-009-9071-1) contains supplementary material, which is available to authorized users.
Membrane proteins; Target selection; Structural genomics; Structure determination
Summary: The web server MetalDetector classifies histidine residues in proteins into one of two states (free or metal bound) and cysteines into one of three states (free, metal bound or disulfide bridged). A decision tree integrates predictions from two previously developed methods (DISULFIND and Metal Ligand Predictor). Cross-validated performance assessment indicates that our server predicts disulfide bonding state at 88.6% precision and 85.1% recall, while it identifies cysteines and histidines in transition metal-binding sites at 79.9% precision and 76.8% recall, and at 60.8% precision and 40.7% recall, respectively.
Availability: Freely available at http://metaldetector.dsi.unifi.it
Supplementary Information: Details and data can be found at http://metaldetector.dsi.unifi.it/help.php
Disordered proteins are highly abundant in regulatory processes such as transcription and cell-signaling. Different methods have been developed to predict protein disorder often focusing on different types of disordered regions. Here, we present MD, a novel META-Disorder prediction method that molds various sources of information predominantly obtained from orthogonal prediction methods, to significantly improve in performance over its constituents. In sustained cross-validation, MD not only outperforms its origins, but it also compares favorably to other state-of-the-art prediction methods in a variety of tests that we applied. Availability: http://www.rostlab.org/services/md/