Specific use cases of TOPSAN, an innovative collaborative platform for creating, sharing and distributing annotations and insights about protein structures, such as those determined by high-throughput structural genomics in the Protein Structure Initiative (PSI), are described. TOPSAN is the main annotation platform for JCSG structures and serves as a conduit for initiating collaborations with the biological community, as illustrated in this special issue of Acta Crystallographica Section F. Developed at the JCSG with the goal of opening a dialogue on the novel protein structures with the broader biological community, TOPSAN is a unique tool for fostering distributed collaborations and provides an efficient pathway to peer-reviewed publications.
The NIH Protein Structure Initiative centers, such as the Joint Center for Structural Genomics (JCSG), have developed highly efficient technological platforms that are capable of experimentally determining the three-dimensional structures of hundreds of proteins per year. However, the overwhelming majority of the almost 5000 protein structures determined by these centers have yet to be described in the peer-reviewed literature. In a high-throughput structural genomics environment, the process of structure determination occurs independently of any associated experimental characterization of function, which creates a challenge for the annotation and analysis of structures and the publication of these results. This challenge has been addressed by developing TOPSAN (‘The Open Protein Structure Annotation Network’), which enables the generation of knowledge via collaborations among globally distributed contributors supported by automated amalgamation of available information. TOPSAN currently provides annotations for all protein structures determined by the JCSG in addition to preliminary annotations on a large number of structures from the other PSI production centers. TOPSAN-enabled collaborations have resulted in insightful structure–function analysis for many proteins and have led to numerous peer-reviewed publications, as exemplified by the articles included in this issue of Acta Crystallographica Section F.
collaborative annotations; structural genomics; Protein Structure Initiative
The crystal structure of BT_3984, a SusD-family protein, reveals a TPR N-terminal region providing support for a loop-rich C-terminal subdomain and suggests possible interfaces involved in sus complex formation.
The crystal structure of the Bacteroides thetaiotaomicron protein BT_3984 was determined to a resolution of 1.7 Å and was the first structure to be determined from the extensive SusD family of polysaccharide-binding proteins. SusD is an essential component of the sus operon that defines the paradigm for glycan utilization in dominant members of the human gut microbiota. Structural analysis of BT_3984 revealed an N-terminal region containing several tetratricopeptide repeats (TPRs), while the signature C-terminal region is less structured and contains extensive loop regions. Sequence and structure analysis of BT_3984 suggests the presence of binding interfaces for other proteins from the polysaccharide-utilization complex.
structural genomics; starch-utilization system; gut microbiome; metagenomics
The crystal structure of BT2081 from B. thetaiotaomicron reveals a two-domain protein with a putative carbohydrate-binding site in the C-terminal domain.
BT2081 from Bacteroides thetaiotaomicron (GenBank accession code NP_810994.1) is a member of a novel protein family consisting of over 160 members, most of which are found in the different classes of Bacteroidetes. Genome-context analysis lends support to the involvement of this family in carbohydrate metabolism, which plays a key role in B. thetaiotaomicron as a predominant bacterial symbiont in the human distal gut microbiome. The crystal structure of BT2081 at 2.05 Å resolution represents the first structure from this new protein family. BT2081 consists of an N-terminal domain, which adopts a β-sandwich immunoglobulin-like fold, and a larger C-terminal domain with a β-sandwich jelly-roll fold. Structural analyses reveal that both domains are similar to those found in various carbohydrate-active enzymes. The C-terminal β-jelly-roll domain contains a potential carbohydrate-binding site that is highly conserved among BT2081 homologs and is situated in the same location as the carbohydrate-binding sites that are found in structurally similar glycoside hydrolases (GHs). However, in BT2081 this site is partially occluded by surrounding loops, which results in a deep solvent-accessible pocket rather than a shallower solvent-exposed cleft.
gut microbiome; sugars; structural genomics; immunoglobulin-like fold; jelly-roll fold
The first structures from the FmdE Pfam family (PF02663) reveal that some members of this family form tightly intertwined dimers consisting of two domains (N-terminal α+β core and C-terminal zinc-finger domains), whereas others contain only the core domain. The presence of the zinc-finger domain suggests that some members of this family may perform functions associated with transcriptional regulation, protein–protein interaction, RNA binding or metal-ion sensing.
Examination of the genomic context for members of the FmdE Pfam family (PF02663), such as the protein encoded by the fmdE gene from the methanogenic archaeon Methanobacterium thermoautotrophicum, indicates that 13 of them are co-transcribed with genes encoding subunits of molybdenum formylmethanofuran dehydrogenase (EC 1.2.99.5), an enzyme that is involved in microbial methane production. Here, the first crystal structures from PF02663 are described, representing two bacterial and one archaeal species: B8FYU2_DESHY from the anaerobic dehalogenating bacterium Desulfitobacterium hafniense DCB-2, Q2LQ23_SYNAS from the syntrophic bacterium Syntrophus aciditrophicus SB and Q9HJ63_THEAC from the thermoacidophilic archaeon Thermoplasma acidophilum. Two of these proteins, Q9HJ63_THEAC and Q2LQ23_SYNAS, contain two domains: an N-terminal thioredoxin-like α+β core domain (NTD) consisting of a five-stranded, mixed β-sheet flanked by several α-helices and a C-terminal zinc-finger domain (CTD). B8FYU2_DESHY, on the other hand, is composed solely of the NTD. The CTD of Q9HJ63_THEAC and Q2LQ23_SYNAS is best characterized as a treble-clef zinc finger. Two significant structural differences between Q9HJ63_THEAC and Q2LQ23_SYNAS involve their metal binding. First, zinc is bound to the putative active site on the NTD of Q9HJ63_THEAC, but is absent from the NTD of Q2LQ23_SYNAS. Second, whereas the structure of the CTD of Q2LQ23_SYNAS shows four Cys side chains within coordination distance of the Zn atom, the structure of Q9HJ63_THEAC is atypical for a treble-clef zinc finger in that three Cys side chains and an Asp side chain are within coordination distance of the zinc.
Pfam family PF02663; metalloproteins; domain swapping; structural genomics; methanogenesis
The crystal structure of a novel MACPF protein, which may play a role in the adaptation of commensal bacteria to host environments in the human gut, was determined and analyzed.
Membrane-attack complex/perforin (MACPF) proteins are transmembrane pore-forming proteins that are important in both human immunity and the virulence of pathogens. Bacterial MACPFs are found in diverse bacterial species, including most human gut-associated Bacteroides species. The crystal structure of a bacterial MACPF-domain-containing protein BT_3439 (Bth-MACPF) from B. thetaiotaomicron, a predominant member of the mammalian intestinal microbiota, has been determined. Bth-MACPF contains a MACPF domain and two novel C-terminal domains that resemble ribonuclease H and interleukin 8, respectively. The entire protein adopts a flat crescent shape, characteristic of other MACPF proteins, that may be important for oligomerization. This Bth-MACPF structure provides new features and insights not observed in two previous MACPF structures. Genomic context analysis indicates that Bth-MACPF may be involved in a novel protein-transport or nutrient-uptake system, suggesting an important role for these MACPF proteins, which were likely to have been inherited from eukaryotes via horizontal gene transfer, in the adaptation of commensal bacteria to the host environment.
MACPF; membrane-attack complexes; perforins; transmembrane pores; pathogenesis
The crystal structure of the highly specific γ-d-glutamyl-l-diamino acid endopeptidase YkfC from Bacillus cereus in complex with l-Ala-γ-d-Glu reveals the structural basis for the substrate specificity of NlpC/P60-family cysteine peptidases.
Dipeptidyl-peptidase VI from Bacillus sphaericus and YkfC from Bacillus subtilis have both previously been characterized as highly specific γ-d-glutamyl-l-diamino acid endopeptidases. The crystal structure of a YkfC ortholog from Bacillus cereus (BcYkfC) at 1.8 Å resolution revealed that it contains two N-terminal bacterial SH3 (SH3b) domains in addition to the C-terminal catalytic NlpC/P60 domain that is ubiquitous in the very large family of cell-wall-related cysteine peptidases. A bound reaction product (l-Ala-γ-d-Glu) enabled the identification of conserved sequence and structural signatures for recognition of l-Ala and γ-d-Glu and, therefore, provides a clear framework for understanding the substrate specificity observed in dipeptidyl-peptidase VI, YkfC and other NlpC/P60 domains in general. The first SH3b domain plays an important role in defining substrate specificity by contributing to the formation of the active site, such that only murein peptides with a free N-terminal alanine are allowed. A conserved tyrosine in the SH3b domain of the YkfC subfamily is correlated with the presence of a conserved acidic residue in the NlpC/P60 domain and both residues interact with the free amine group of the alanine. This structural feature allows the definition of a subfamily of NlpC/P60 enzymes with the same N-terminal substrate requirements, including a previously characterized cyanobacterial l-alanine-γ-d-glutamate endopeptidase that contains the two key components (an NlpC/P60 domain attached to an SH3b domain) for assembly of a YkfC-like active site.
γ-d-glutamyl-l-diamino acid endopeptidase; cell-wall recycling; NlpC/P60; SH3b; cysteine peptidases; enzyme specificity
The crystal structure of the first representative of the DUF364 family reveals a combination of enolase N-terminal-like and C-terminal Rossmann-like folds. Analysis of the interdomain cleft, combined with sequence and genome-context conservation among homologs, suggests a unique catalytic site likely involved in the synthesis of a flavin or pterin derivative.
The crystal structure of Dhaf4260 from Desulfitobacterium hafniense DCB-2 was determined by single-wavelength anomalous diffraction (SAD) to a resolution of 2.01 Å using the semi-automated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG) as part of the NIGMS Protein Structure Initiative (PSI). This protein structure is the first representative of the PF04016 (DUF364) Pfam family and reveals a novel combination of two well known domains (an enolase N-terminal-like fold followed by a Rossmann-like domain). Structural and bioinformatic analyses reveal partial similarities to Rossmann-like methyltransferases, with residues from the enolase-like fold combining to form a unique active site that is likely to be involved in the condensation or hydrolysis of molecules implicated in the synthesis of flavins, pterins or other siderophores. The genome context of Dhaf4260 and homologs additionally supports a role in heavy-metal chelation.
structural genomics; domains of unknown function; rare metals; siderophores; pterins
The crystal structure of BT1062 from Bacteroides thetaiotaomicron revealed a conserved fold that is widely adopted by fimbrial components.
BT1062 from Bacteroides thetaiotaomicron is a homolog of Mfa2 (PGN0288 or PG0179), which is a component of the minor fimbriae in Porphyromonas gingivalis. The crystal structure of BT1062 revealed a conserved fold that is widely adopted by fimbrial components.
DUF1812; PF08842; pili; fimbriae; BT1062; Mfa2; PGN0288; PG0179
The crystal structure of SSO2064, the first structural representative of Pfam family PF01796 (DUF35), reveals a two-domain architecture comprising an N-terminal zinc-ribbon domain and a C-terminal OB-fold domain. Analysis of the domain architecture, operon organization and bacterial orthologs combined with the structural features of SSO2064 suggests a role involving acyl-CoA binding for this family of proteins.
SSO2064 is the first structural representative of PF01796 (DUF35), a large prokaryotic family with a wide phylogenetic distribution. The structure reveals a novel two-domain architecture comprising an N-terminal, rubredoxin-like, zinc ribbon and a C-terminal, oligonucleotide/oligosaccharide-binding (OB) fold domain. Additional N-terminal helical segments may be involved in protein–protein interactions. Domain architectures, genomic context analysis and functional evidence from certain bacterial representatives of this family suggest that these proteins form a novel fatty-acid-binding component that is involved in the biosynthesis of lipids and polyketide antibiotics and that they possibly function as acyl-CoA-binding proteins. This structure has led to a re-evaluation of the DUF35 family, which has now been split into two entries in the latest Pfam release (v.24.0).
structural genomics; domains of unknown function; acyl-carrier proteins; acyl-CoA; polyketide biosynthesis
Structures of the first representatives of PF06684 (DUF1185) reveal a Bacillus chorismate mutase-like fold with a potential role in amino-acid synthesis.
The crystal structures of BB2672 and SPO0826 were determined to resolutions of 1.7 and 2.1 Å by single-wavelength anomalous dispersion and multiple-wavelength anomalous dispersion, respectively, using the semi-automated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG) as part of the NIGMS Protein Structure Initiative (PSI). These proteins are the first structural representatives of the PF06684 (DUF1185) Pfam family. Structural analysis revealed that both structures adopt a variant of the Bacillus chorismate mutase fold (BCM). The biological unit of both proteins is a hexamer and analysis of homologs indicates that the oligomer interface residues are highly conserved. The conformation of the critical regions for oligomerization appears to be dependent on pH or salt concentration, suggesting that this protein might be subject to environmental regulation. Structural similarities to BCM and genome-context analysis suggest a function in amino-acid synthesis.
domain of unknown function; structural genomics; chorismate mutase; amino acids; pH-dependent; salt-dependent
The crystal structures of SPO0140 and Sbal_2486 revealed a two-domain structure that adopts a novel fold. Analysis of the interdomain cleft suggests a nucleotide-based ligand with a genome context indicating signaling as a possible role for this family.
The crystal structures of SPO0140 and Sbal_2486 were determined using the semi-automated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG) as part of the NIGMS Protein Structure Initiative (PSI). The structures revealed a conserved core with domain duplication and a superficial similarity of the C-terminal domain to pleckstrin homology-like folds. The conservation of the domain interface indicates a potential binding site that is likely to involve a nucleotide-based ligand, with genome-context and gene-fusion analyses additionally supporting a role for this family in signal transduction, possibly during oxidative stress.
structural genomics; domain of unknown function; domain duplication; signaling; oxidative stress
The crystal structure of the BVU2987 gene product from B. vulgatus (UniProt A6L4L1) reveals that members of the new Pfam family PF11396 (domain of unknown function; DUF2874) are similar to β-lactamase inhibitor protein and YpmB.
Proteins that contain the DUF2874 domain constitute a new Pfam family PF11396. Members of this family have predominantly been identified in microbes found in the human gut and oral cavity. The crystal structure of one member of this family, BVU2987 from Bacteroides vulgatus, has been determined, revealing a β-lactamase inhibitor protein-like structure with a tandem repeat of domains. Sequence analysis and structural comparisons reveal that BVU2987 and other DUF2874 proteins are related to β-lactamase inhibitor protein, PepSY and SmpA_OmlA proteins and hence are likely to function as inhibitory proteins.
BVU2987; DUF2874; PF11396; human gut microbiome; β-lactamase inhibitor protein-like fold; putative inhibitor proteins
NE1406, the first structural representative of PF09410, reveals a lipocalin-like fold with features that suggest involvement in lipid metabolism. In addition, NE1406 provides potential structural templates for two other protein families (PF07143 and PF08622).
The first structural representative of the domain of unknown function DUF2006 family, also known as Pfam family PF09410, comprises a lipocalin-like fold with domain duplication. The finding of the calycin signature in the N-terminal domain, combined with remote sequence similarity to two other protein families (PF07143 and PF08622) implicated in isoprenoid metabolism and the oxidative stress response, supports an involvement in lipid metabolism. Clusters of conserved residues that interact with ligand mimetics suggest that the binding and regulation sites map to the N-terminal domain and to the interdomain interface, respectively.
structural genomics; domains of unknown function; calycin; lipocalin; fatty-acid binding proteins
The crystal structure of the NGO1945 gene product from N. gonorrhoeae (UniProt Q5F5I0) reveals that the N-terminal domain assigned as a domain of unknown function (DUF2063) is likely to bind DNA and that the protein may be involved in transcriptional regulation.
Proteins with the DUF2063 domain constitute a new Pfam family, PF09836. The crystal structure of a member of this family, NGO1945 from Neisseria gonorrhoeae, has been determined and reveals that the N-terminal DUF2063 domain is likely to be a DNA-binding domain. In conjunction with the rest of the protein, NGO1945 is likely to be involved in transcriptional regulation, which is consistent with genomic neighborhood analysis. Of the 216 currently known proteins that contain a DUF2063 domain, the most significant sequence homologs of NGO1945 (∼40–99% sequence identity) are from various Neisseria and Haemophilus species. As these are important human pathogens, NGO1945 represents an interesting candidate for further exploration via biochemical studies and possible therapeutic intervention.
NGO1945; PF09836; DUF2063; putative DNA-binding proteins; putative transcription regulators; structural genomics
The crystal structure of the first representative of the Pfam PF07336 (DUF1470) family reveals a two-domain organization that contains a new fold, termed the ABATE domain, at the N-terminus and a treble-clef zinc finger that is likely to bind DNA at the C-terminus.
The crystal structure of Jann_2411 from Jannaschia sp. strain CCS1, a member of the Pfam PF07336 family classified as a domain of unknown function (DUF1470), was solved to a resolution of 1.45 Å by multiple-wavelength anomalous dispersion (MAD). This protein is the first structural representative of the DUF1470 Pfam family. Structural analysis revealed a two-domain organization, with the N-terminal domain presenting a new fold called the ABATE domain that may bind an as yet unknown ligand. The C-terminal domain forms a treble-clef zinc finger that is likely to be involved in DNA binding. Analysis of the Jann_2411 protein and the broader ABATE-domain family suggests a role as stress-induced transcriptional regulators.
structural genomics; environmental stress; domains of unknown function; Pfam; bound metal identification
The first structural representative of the PF08866 (DUF1831) protein family reveals a potential new α+β fold and indicates a possible involvement in amino-acid metabolism.
The structure of LP2179, a member of the PF08866 (DUF1831) family, suggests a novel α+β fold comprising two β-sheets packed against a single helix. A remote structural similarity to two other uncharacterized protein families specific to the Bacillus genus (PF08868 and PF08968), as well as to prokaryotic S-adenosylmethionine decarboxylases, is consistent with a role in amino-acid metabolism. Genomic neighborhood analysis of LP2179 supports this functional assignment, which might also then be extended to PF08868 and PF08968.
structural genomics; DUFs; S-adenosylmethionine decarboxylase; amino-acid metabolism; probiotics
PA1994, a Pfam PF06475 (DUF1089) family homolog from P. aeruginosa, reveals remote similarities to lipoprotein localization factors and a conserved putative glycolipid-binding site.
The crystal structure of PA1994 from Pseudomonas aeruginosa, a member of the Pfam PF06475 family classified as a domain of unknown function (DUF1089), reveals a novel fold comprising a 15-stranded β-sheet wrapped around a single α-helix that assembles into a tight dimeric arrangement. The remote structural similarity to lipoprotein localization factors, in addition to the presence of an acidic pocket that is conserved in DUF1089 homologs, phospholipid-binding and sugar-binding proteins, indicates a role for PA1994 and the DUF1089 family in glycolipid metabolism. Genome-context analysis lends further support to the involvement of this family of proteins in glycolipid metabolism and indicates possible activation of DUF1089 homologs under conditions of bacterial cell-wall stress or host–pathogen interactions.
structural genomics; DUFs; glycolipids; osmotic stress; host–pathogen interactions
KPN03535 is a protein unique to K. pneumoniae. The crystal structure reveals that KPN03535 represents a novel variant of the OB-fold and is likely to be a DNA-binding lipoprotein.
KPN03535 (gi|152972051) is a putative lipoprotein of unknown function that is secreted by Klebsiella pneumoniae MGH 78578. The crystal structure reveals that despite a lack of any detectable sequence similarity to known structures, it is a novel variant of the OB-fold and structurally similar to the bacterial Cpx-pathway protein NlpE, single-stranded DNA-binding (SSB) proteins and toxins. K. pneumoniae MGH 78578 forms part of the normal human skin, mouth and gut flora and is an opportunistic pathogen that is linked to about 8% of all hospital-acquired infections in the USA. This structure provides the foundation for further investigations into this divergent member of the OB-fold family.
lipoproteins; OB-fold; NlpE-like protein; single-stranded DNA-binding proteins; toxins; BOF; human gut pathogens; structural genomics