ECX21941 represents a very large family (over 600 members) of novel, ocean metagenome–specific proteins identified by clustering of the dataset from the Global Ocean Sampling expedition. The crystal structure of ECX21941 reveals unexpected similarity to Sm/LSm proteins, which are important RNA-binding proteins, despite no detectable sequence similarity. The ECX21941 protein assembles as a homopentamer in solution and in the crystal structure when expressed in Escherichia coli and represents the first pentameric structure for this Sm/LSm family of proteins, although the actual oligomeric form in vivo is currently not known. The genomic neighborhood analysis of ECX21941 and its homologs combined with sequence similarity searches suggest a cyanophage origin for this protein. The specific functions of members of this family are unknown, but our structure analysis of ECX21941 indicates nucleic acid-binding capabilities and suggests a role in RNA and/or DNA processing.
The ECX21941 gene from the Global Ocean Sampling (GOS) metagenome dataset 1,2 encodes a protein with a molecular weight of 11.5 kDa (residues 1−104) and a calculated isoelectric point of 5.75. ECX21941 was selected for structure determination in a pilot project to explore structural diversity of proteins from the ocean metagenome using the semiautomated, high-throughput pipeline of the Joint Center for Structural Genomics (JCSG; http://www.jcsg.org) 3 as part of the National Institute of General Medical Sciences’ Protein Structure Initiative. ECX21941 is a representative of a very large, novel, ocean metagenome–specific family (over 600 members), and its function is unknown. Genomic neighborhood analysis of ECX21941 and homologous proteins suggests a cyanophage origin for this protein.
The structure of ECX21941 is similar to that of Sm/LSm/Sm-like proteins despite the lack of any detectable sequence similarity, and further analysis confirmed that it is a very divergent member of this protein family. Sm and Sm-like (or Like Sm, LSm) proteins (PF01423 [PFAM], cd00600 [CDD]) form a very large (>1500 members) and evolutionarily diverse 4 protein family with an open β-barrel fold with SH3-like topology and diverse functions that center around RNA processing. The Sm/LSm family is classified into 23 different groups by the NCBI Conserved Domains Database 5 and into seven structurally characterized families of proteins with Sm-like fold by the SCOP database (sunid: 50181) 6. In eukaryotes, they are essential for pre-mRNA splicing 7, telomere formation 8, trans splicing 9, and mRNA degradation 10,11 and are implicated in human autoimmune diseases 12. Sm-like proteins have also been reported and characterized in bacteria and archaea, and share similar RNA-binding features with their eukaryotic counterparts 13–15.
ECX21941 is the first structural representative of Sm-like proteins with a pentameric assembly of protomers, as observed in the crystal structure and in solution from protein expressed in Escherichia coli. Other known functional assemblies are homohexameric (bacteria and archaea) 16,17, homoheptameric (archaea) 18–21, or heteroheptameric/octameric (eukaryota) 22–25. It is still unclear what drives different oligomer arrangements in Sm and LSm proteins, particularly in vivo, and how these different potential oligomerization states affect molecular activity. The crystal structure of ECX21941 presented here should aid in biochemical analyses to determine whether it is involved in RNA-mediated regulation and/or post-transcriptional processing of RNAs 26–31.
The DNA encoding ECX21941 (GenBank: ECX21941.1, GI:142318367, GOS_2577746) was synthesized with codons optimized for Escherichia coli expression and cloned into plasmid pSpeedET (CodonDevices, Cambridge, MA). Since crystallization trials with the full-length construct were unsuccessful, the polymerase incomplete primer extension (PIPE) 32 method was used to delete part of the gene encoding the C-terminal residues 100−104. The final construct used encodes residues 1–99 of ECX21941 in addition to MGSDKIHHHHHHENLYFQG of an expression and purification tag followed by a tobacco etch virus (TEV) protease cleavage site at its N-terminus. The cloning junctions were confirmed by DNA sequencing. Protein expression was performed in a selenomethionine-containing medium using the Escherichia coli strain GeneHogs (Invitrogen). At the end of fermentation, lysozyme was added to the culture to a final concentration of 250 µg/mL, and the cells were harvested. After one freeze/thaw cycle, the cells were homogenized in Lysis Buffer [50 mM HEPES pH 8.0, 50 mM NaCl, 10 mM imidazole, 1 mM Tris (2-carboxyethyl) phosphine hydrochloride (TCEP)] and passed through a Microfluidizer (Microfluidics). The lysate was clarified by centrifugation at 32,500 × g for 30 minutes and loaded onto nickel-chelating resin (GE Healthcare) pre-equilibrated with Lysis Buffer. The resin was washed with Wash Buffer [50 mM HEPES pH 8.0, 300 mM NaCl, 40 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP], and the protein was eluted with Elution Buffer [20 mM HEPES pH 8.0, 300 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP]. The eluate was buffer exchanged with HEPES Crystallization Buffer [20 mM HEPES pH 8.0, 200 mM NaCl, 40 mM imidazole, 1 mM TCEP] and treated with 1 mg of TEV protease per 15 mg of eluted protein. The digested protein was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with HEPES Crystallization Buffer, and the resin was washed with the same buffer. 
The flow-through and wash fractions were combined and concentrated for crystallization assays to 12.5 mg/mL by centrifugal ultrafiltration (Millipore). ECX21941 was crystallized using the nanodroplet vapor diffusion method 33 with standard JCSG crystallization protocols 3. Initial screening for diffraction was carried out using the Stanford Automated Mounting system (SAM) 34 at the Stanford Synchrotron Radiation Laboratory (SSRL, Menlo Park, CA). The crystallization reagent that produced the crystal used for structure solution contained 0.2 M calcium acetate and 20% (w/v) polyethylene glycol 3350 at pH 7.3. Ethylene glycol was added as a cryoprotectant to a final concentration of 10% (v/v). The crystal was indexed in the monoclinic space group C2 (Table I) 35,36. To determine its oligomeric state in solution, ECX21941 was analyzed using a 1 cm × 30 cm Superdex 200 column (GE Healthcare) coupled with miniDAWN static light scattering and Optilab differential refractive index detectors (Wyatt Technology). The mobile phase consisted of 20 mM Tris pH 8.0, 150 mM NaCl, and 0.02% (w/v) sodium azide. The molecular weight was calculated using ASTRA 5.1.5 software (Wyatt Technology).
Multi-wavelength anomalous diffraction (MAD) data were collected at the Advanced Photon Source (APS; Chicago, IL) on beamline 23-ID-D at wavelengths corresponding to the high-energy remote (λ1), inflection (λ2), and peak (λ3) of a selenium MAD experiment. The datasets were collected at 100 K using a MAR300 CCD detector. The MAD data were integrated and reduced using XDS and then scaled with the program XSCALE 37. Data statistics are summarized in Table I. Phasing was performed with SHELXD 38 and autoSHARP 39, and automated iterative model building was performed using ARP/wARP 40 and RESOLVE 41. The initial trace revealed five protein subunits in the asymmetric unit (ASU), with a main-chain completeness of ~75% (with ~60% side chains) and starting Rcryst/Rfree values of ~36%/40%. From this initial trace, one of the chains (chain A) was manually adjusted to correct sequence registry and side chain rotamers using Coot 42. Molecular replacement (PHASER 43) was then used to place the other four molecules in the ASU using this partially refined structure as the search molecule. Model adjustments and completion were performed with Coot 42. Structure refinement was carried out using REFMAC5, applying tight main-chain and loose side-chain NCS restraints and one TLS group per protomer chain throughout the refinement. Residues 0–1 and 91–99 are omitted from all five chains due to weak electron density. The tip (residues 40–43) of the loop region spanning residues 37 to 46 was disordered to a varying extent in each protomer and, therefore, was omitted from the structure since it could not be reliably modeled into the relatively weak, discontinuous electron density. In addition, some monomers have slightly larger omitted regions around residue 40 and at the C-terminus. A total of 57 residues have their side chains truncated due to lack of interpretable density. Refinement statistics are summarized in Table I.
Analysis of the stereochemical quality of the model was accomplished using AutoDepInputTool 44, MolProbity, SFcheck 4.0 35, and WHATIF 5.0 45. Protein quaternary structure analysis was performed using the PQS (Protein Quaternary Structure) server 46, the PISA (Protein Interfaces, Surfaces, and Assemblies) server 47, and PITA (Protein InTerfaces and Assemblies) software 48. Figure 1B was adapted from an analysis using PDBsum 49. Figure 1A, Figure 2B, and Figure 3 were prepared with PyMOL (DeLano Scientific 50). Electrostatic surface potentials (Figure 3B) were calculated using APBS 51 and rendered within PyMOL using the APBS plug-in. Missing side-chain atoms were added to the model in their favored rotamer positions using Coot prior to electrostatic calculations and rendering. Figure 2A was prepared using MUSTANG 52 for structure superposition, JOY 53 for structural features annotation, and PDBsum analysis or existing annotations from the CDD database to highlight functional residues 5.
Putative homologs of ECX21941 were clustered using pairwise sequence similarities (CLANS software 54 with P-value = 5e-6) into groups of close homologs (Fig. 4). The genomic scaffolds encoding putative homologs with ≤ 85% sequence identity to ECX21941 were used in the genomic neighborhood analysis. CLANS software was used to identify groups of homologous proteins encoded by such scaffolds. Sequence conservation in Fig. 2B and 3A was calculated using Rate4Site 55. Selected scaffolds with different arrangements of the most frequently observed neighbors are shown in Fig. 5.
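The idea of grouping sequences by thresholded pairwise similarity can be illustrated with a minimal sketch. This is not the CLANS procedure itself, which arranges sequences by a force-directed layout of all-against-all similarities; here a simpler single-linkage grouping is swapped in for illustration. The sequence identifiers and P-values are hypothetical toy data, and treating the quoted 5e-6 as an upper cutoff for linking two sequences is an assumption.

```python
from collections import defaultdict

# Cutoff quoted in the text; using it as "link if P-value <= cutoff" is an assumption.
P_VALUE_CUTOFF = 5e-6


def group_by_similarity(ids, pairwise_p_values, cutoff=P_VALUE_CUTOFF):
    """Single-linkage grouping: sequences joined by any chain of pairs
    with P-value <= cutoff end up in the same group."""
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p_value in pairwise_p_values.items():
        if p_value <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for seq_id in ids:
        groups[find(seq_id)].append(seq_id)
    # Largest groups first, then alphabetical, so the output is deterministic.
    return sorted((sorted(m) for m in groups.values()), key=lambda g: (-len(g), g))


# Hypothetical toy input, not taken from the GOS dataset.
toy_ids = ["seqA", "seqB", "seqC", "seqD", "seqE"]
toy_p_values = {
    ("seqA", "seqB"): 1e-20,
    ("seqB", "seqC"): 3e-8,
    ("seqC", "seqD"): 1e-3,   # too weak to link the two groups
    ("seqD", "seqE"): 2e-12,
}
toy_groups = group_by_similarity(toy_ids, toy_p_values)
print(toy_groups)
```

With a stricter cutoff, weakly connected groups split apart, which is one reason the number of "groups of close homologs" depends on the chosen threshold.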
Atomic coordinates and experimental structure factors for ECX21941 from the GOS ocean metagenome dataset have been deposited in the PDB and are accessible under code 3by7.
The crystal structure of ECX21941 was determined to 2.6 Å resolution using the MAD method (Fig. 1). Data collection, model, and refinement statistics are summarized in Table I. The final model includes five protomers and seven water molecules in the ASU. The Matthews coefficient (Vm) 56 for ECX21941 is 2.5 Å³/Da, and the estimated solvent content is 49.8%. The Ramachandran plot produced by MolProbity 57 shows that 97.7% of the residues are in favored regions with no Ramachandran outliers.
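The relationship between the Matthews coefficient and the solvent content quoted above can be checked with the commonly used approximation, solvent fraction ≈ 1 − 1.23/Vm, where 1.23 Å³/Da reflects an average protein partial specific volume. The sketch below is a back-of-the-envelope check under that assumption, not the calculation performed for Table I; the function names are illustrative.

```python
# Conventional conversion factor (Å^3/Da) for an average protein; an assumption,
# since the exact value depends on the protein's partial specific volume.
PROTEIN_VOLUME_FACTOR = 1.23


def solvent_fraction(matthews_coefficient):
    """Approximate solvent fraction of a crystal from Vm (Å^3/Da)."""
    if matthews_coefficient <= 0:
        raise ValueError("Matthews coefficient must be positive")
    return 1.0 - PROTEIN_VOLUME_FACTOR / matthews_coefficient


def matthews_from_solvent(fraction):
    """Inverse relation: the Vm implied by a given solvent fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("solvent fraction must be in [0, 1)")
    return PROTEIN_VOLUME_FACTOR / (1.0 - fraction)


reported_vm = 2.5          # Å^3/Da, as quoted in the text
reported_solvent = 0.498   # 49.8%, as quoted in the text

estimated_solvent = solvent_fraction(reported_vm)
implied_vm = matthews_from_solvent(reported_solvent)

print(f"Solvent fraction from Vm = {reported_vm}: {estimated_solvent:.3f}")
print(f"Vm implied by {reported_solvent:.1%} solvent: {implied_vm:.2f} Å^3/Da")
```

Under this approximation, a Vm of 2.5 Å³/Da gives a solvent content of about 51%, while the reported 49.8% corresponds to a Vm near 2.45 Å³/Da, so the two quoted values agree once Vm is rounded to one decimal place.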
The ECX21941 protomer is a single domain that, in general, adopts the characteristic twisted β-sheet seen in Sm and LSm proteins (Fig. 1; SCOP sunid 50181). This assignment is supported by a DALI 58 structure similarity search, which finds hits to numerous Sm and LSm proteins with Z-scores varying from 7.0 to 4.9, sequence identities ranging from 6% to 20%, and RMSDs ranging from 1.0 Å to 3.3 Å. Molecular weights of 50,360 Da and 50,020 Da were determined by two independent runs of analytical size exclusion chromatography in combination with static light scattering (SEC/SLS). Because ECX21941 has a calculated molecular weight of 11,137 Da (mass determined by LC/MS was 11,136 Da), SEC/SLS suggested that it forms a homo-pentamer in solution, consistent with quaternary structural analysis using the PQS, PISA, and PITA programs.
The structure and function of human and yeast hetero-heptameric/octameric Sm/LSm proteins are well characterized 24,25,59–61 (PDB codes: 2vc8, 1y96, 1n9r, 1d3b, 1b34, and 3bw1). In bacteria (E. coli, Staphylococcus aureus, and Pseudomonas aeruginosa), the Sm-like Hfq protein forms a homo-hexamer (PDB codes: 1hk9 62, 1kq1 16, 1u1s 63, and 1ycy). Archaeal homo-heptameric Sm-like proteins have been characterized by crystallographic studies (PDB codes: 1i81 19, 1i4k 21, 1ljo 17, 1i8f 64, 1h64 65, 1loj, 1jbm 66, 1m5q 18, 1th7 20, and 2qtx 67) or biochemically 68,69.
From the numerous previously solved crystal structures of Sm and LSm proteins (five eukaryotic, 10 archaeal, and four bacterial), a brief comparative structural analysis is presented here using the following representative structures from the three kingdoms of life: human small nuclear ribonucleoprotein-associated protein B (PDB code: 1d3bB) and human gem-associated protein gemin6 (1y96); archaeal SmAP1 from Methanothermobacter thermautotrophicus (1loj) and archaeal Sm-related protein from Pyrococcus abyssi (1h64); and bacterial Hfq from S. aureus (1kq1). A superposition of the structure of ECX21941 with these representatives (Fig. 2B) reveals that, despite the lack of any discernible sequence similarity between ECX21941 and other Sm-like proteins (Fig. 2A), the overall structure of all of the monomers is very similar. All secondary structure elements are of similar length and have very similar orientations. However, the ECX21941 structure has some key distinguishing features: (a) the absence of an N-terminal helix; (b) the presence of a very pronounced C-terminal helix; (c) an insertion between strands β3 and β4 (also seen in SmB, PDB code 1d3b, chain B), which forms loop 4 in other Sm/LSm proteins (Fig. 1A); and (d) an insertion between β4' and β4, which forms loop 4' (flanked by Pro51 and Lys59; Fig. 2B and 4C) that is involved in interaction with the adjacent subunit and, hence, participates in oligomer formation (Fig. 1, 2, and 3A).
The presence of charged and aromatic amino acids (76–86) in the C-terminal α-helix (Lys80, Tyr82, His85, Lys100, and Lys103) and in loop 4 (Trp52, Tyr55, and Lys59) indicates they may be involved in nucleic acid interactions. The variation in size of loop 4 between the typical Sm1 and Sm2 motifs (motifs seen in previously characterized Sm/LSm proteins, but not in ECX21941) has also been observed in other Sm/LSm proteins (PDB codes 2fwkA, 1b34B, 1d3bB, and 2fb7A; Fig. 2A). Several proteins that are structurally similar to the Sm/LSm proteins, such as the Tudor domain (PDB codes 2e6n and 2o4x) and gemin6 (PDB code 1y96), have an α-helix at both termini.
The interaction interface between the protomers in the pentameric ring is formed by residues from β4 in one subunit with β5 in the adjacent monomer and by loop 4' (Fig. 2A and 3A). The length of loop 4 contributes to the overall thickness of the pentameric ring by increasing its height. The absence of an N-terminal α-helix and the orientation of the C-terminal α-helix do not significantly impact the overall shape and diameter of the assembly. The ring formed by ECX21941 has a diameter of ~60 Å, a width of ~30 Å, and a central pore size of ~9.2 Å. In the hexameric E. coli Hfq (PDB code 1hk9 62), the ring has a diameter of ~65 Å, a width of ~28 Å, and a central pore size of ~11 Å at its narrowest region. The archaeal LSm protein 64 (PDB code 1i8f) has a heptameric ring structure of ~65 Å diameter and ~38 Å width, which is similar to the dimensions of the core of human Sm, as observed by electron microscopy 70.
In the absence of functional data, we cannot determine if the observed homopentameric assembly of ECX21941 represents its biologically relevant form. It could be a consequence of overexpressing the protein in E. coli. ECX21941 may form functional hetero-oligomers in vivo, as occurs in the eukaryotic Sm protein complexes, either with other cyanophage Sm-like proteins, where multiple paralogs are commonly found in a particular phage (Fig. 4C), or with host cyanobacterial Sm-like proteins. Recently, a cyanobacterial Sm-like protein similar to the bacterial RNA chaperone Hfq 71 (ssr3341, NP_441518) was identified and characterized, and a single homolog of this protein is found in various strains of Synechococcus sp. Interestingly, ssr3341 was found to regulate genes essential for motility of Synechocystis sp. PCC 6803. The loss of motility caused by insertional inactivation of ssr3341 was complemented by reintroduction of the wild-type gene, correlated with the re-establishment of type IV pili on the cell surface 72. Some of the type IV pili function as receptors for bacteriophages, including PO4 phage for P. aeruginosa and the cholera toxin phage (CTXΦ) for Vibrio cholerae 73,74. It is possible that the cyanophage-encoded ECX21941, or its homologs, could play a similar role in the regulation of type IV pili biogenesis that may affect the rate of transduction.
Analysis of the electrostatic surface of the ECX21941 assembly reveals the surface charge distributions that may be relevant for interaction with a ligand. The different views in Fig. 3B portray positively charged amino acids on the outer periphery of the ring and a region of charged residues at the entrance to the central pore. Lys67 constitutes the positively charged region at the entrance to the pore from the top side. A negatively charged region is composed of Asp64, Asp65 and Ser66 prior to Lys67 going from one side to the other. The positively charged patch on the outer periphery of the top surface is formed by Lys2, Lys5, Lys29, Lys30, Lys59, and Lys75, many of which are conserved in other Sm/LSm proteins (Fig. 2, 4C). Lys2 superimposes with Arg19 in 1hk9 and 1u1s (Fig. 2A). Lys29 is located in a similar position in 1b34A (Lys41), 1hk9A (Lys47), 1u1sA (Lys47), and 2qtxA (Lys53; Fig. 2A). Lys75 corresponds to Arg66 in 1u1s and 1hk9, but its side chain faces the opposite direction, whereas Asp64, Asp65, and Ser66 correspond to residues (with different physicochemical properties) that in other Sm-like proteins are involved in RNA binding and/or oligomerization (Fig. 2A). In Sm-like proteins, the 3₁₀-helix (H1) typically contains Lys67 that faces the entrance to the central pore in a similar position to Lys67 in 1d3bA (SmD3 protein). Site-directed mutagenesis coupled with oligonucleotide binding assays, which are beyond the scope of this study, should reveal the functional importance of these residues.
ECX21941 has several hundred predicted homologs in the GOS metagenome dataset, some of which have homologs in cyanophage species (Fig. 4C). The sequence similarity scores between cyanophage proteins and the HMMER (http://hmmer.janelia.org/) profile (calculated using alignment of all Sm-like proteins from the GOS) are between 17.5 and 116.6 (E-values are between 8.6e-6 and 3.6e-34). The HMMER score for the only known similar cyanobacterial protein (GenBank: ZP_01472537) is 31.4 (E-value = 3.6e-9). An Sm-like protein (GenBank: ECL08690) from the GOS with identified similarity to this cyanobacterial protein (BLAST score = 101, E-value = 0.004) has a much higher sequence similarity to a cyanophage protein (GenBank: YP_214412; BLAST score = 191, E-value = 2e-13). One protein from Prochlorococcus cyanophage P-SSM2 has a detectable sequence similarity to ECX21941 (BLAST score 85, E-value = 0.31) and significant similarity scores to other Sm-like proteins from the GOS (HMMER score = 39.9 and E-value = 4.4e-11; BLAST hit to GenBank number ECV68329, with score = 189 and E-value = 3e-13).
The protein sequence clustering identified several groups of close homologs (Fig. 4), indicating similar diversity of marine metagenome-specific Sm-like proteins, as observed in previously known Sm-like proteins. Thus, it is unlikely that all marine metagenome-specific Sm-like proteins have the same function.
The analysis of the genomic neighborhood of ECX21941 and its homologs identified several frequently observed proteins (Fig. 5). Such an analysis is limited to the immediate neighborhood because genomic scaffolds in the metagenomic dataset are relatively short. A similar sequential arrangement of the conserved genomic neighbors was found in one case between a scaffold with an ECX21941 homolog (GenBank scaffold ID: EP543697; Fig. 5) and a genome of cyanophage P-SSM2 (GenBank: AJ630128), where the order is regA (GenBank: CAF34194.1), small heat shock protein (HSP20-like chaperone, GenBank: CAF34195.1), hypothetical protein (GenBank: CAF34196.1), hypothetical protein (GenBank: CAF34197.1), and DNA polymerase gp43 (GenBank: CAF34198.1).
The JCSG has developed The Open Protein Structure Annotation Network (TOPSAN), a wiki-based community project to collect, share, and distribute information about protein structures determined at PSI centers. TOPSAN offers a combination of automatically generated, as well as comprehensive, expert-curated annotations, provided by JCSG personnel and members from the research community. Additional information about ECX21941 is available at http://www.topsan.org/explore?pdbID=3by7.
The crystal structure of ECX21941 reveals, for the first time, a pentameric assembly of protomers for Sm-like proteins. The weak, but statistically significant, sequence similarity between ECX21941 and cyanophage proteins (Fig. 4C), a strong similarity between its homologs and cyanophage proteins, and a strong similarity between proteins from an ECX21941 conserved neighborhood to cyanophage proteins (Fig. 5), led to the conclusion that ECX21941 is likely to be the first known structural representative of a viral (cyanophage) Sm-like protein. The bacterial Sm-like protein Hfq has long been known as a host factor for phage Qbeta RNA replication 75. The RNA-binding residues in previously characterized Sm-like proteins 16 correspond to Gln23 and Asp64, which are not conserved among ECX21941 homologs (Fig. 4C), and are on the opposite side of the ring of highly conserved residues (Arg8, Thr11, Glu13, and Asp14; Fig. 2B, 3A). Thus, the function of ECX21941 is likely to be different and remains unknown. However, the genomic neighborhood of ECX21941 and its homologs is enriched in ORFs encoding DNA-processing proteins (Fig. 5) with annotations similar to several proteins known to be involved in a non-homologous DNA repair pathway, or to genes putatively regulated by attenuation (such as Lhr-like helicases).
Portions of this research were performed at the APS Beamline 23-ID-D of the GM/CA CAT and SSRL. Use of the Advanced Photon Source was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Contract No. DE-AC02-06CH11357. GM/CA CAT has been funded in whole or in part with Federal funds from the National Cancer Institute (Y1-CO-1020) and the National Institute of General Medical Science (Y1-GM-1104). The SSRL is a national user facility operated by Stanford University on behalf of the United States Department of Energy, Office of Basic Energy Sciences. The SSRL Structural Molecular Biology Program is supported by the Department of Energy, Office of Biological and Environmental Research, and by the National Institutes of Health (National Center for Research Resources, Biomedical Technology Program, and the National Institute of General Medical Sciences). The GOS sequence dataset was initially made available by the J. Craig Venter Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health.
Grant Sponsor: National Institute of General Medical Sciences, Protein Structure Initiative; Grant Number: U54 GM074898.