A phage moron is a DNA element inserted between a pair of genes in one phage genome that are adjacent in other related phage genomes. Phage morons are commonly found within phage genomes, and in a number of cases, they have been shown to mediate phenotypic changes in the bacterial host. The temperate phage HK97 encodes a moron element, gp15, within its tail morphogenesis region that is absent in most closely related phages. We show that gp15 is actively expressed from the HK97 prophage and is responsible for providing the host cell with resistance to infection by phages HK97 and HK75, independent of repressor immunity. To identify the target(s) of this gp15-mediated resistance, we created a hybrid of HK97 and the related phage HK022. This hybrid phage revealed that the tail tube or tape measure proteins likely mediate the susceptibility of HK97 to inhibition by gp15. The N terminus of gp15 is predicted with high probability to contain a single membrane-spanning helix by several transmembrane prediction programs. Consistent with this putative membrane localization, gp15 acts to prevent the entry of phage DNA into the cytoplasm, in a manner reminiscent of that of several previously characterized superinfection exclusion proteins. The N terminus of gp15 and its phage homologues bear sequence similarity to YebO proteins, a family of proteins of unknown function found ubiquitously in enterobacteria. The divergence of their C termini suggests that phages have co-opted this bacterial protein and subverted its activity to their advantage.
With the advent of next-generation DNA sequencing, the pace of inherited orphan disease gene identification has increased dramatically, a situation that will continue for at least the next several years. At present, the number of such identified disease genes significantly outstrips the number of laboratories available to investigate a given disorder, an asymmetry that will only increase over time. The hope for any genetic disorder is, where possible and in addition to accurate diagnostic test formulation, the development of therapeutic approaches. To this end, we propose here the development of a strategic toolbox and preclinical research pathway for inherited orphan disease. Taking much of what has been learned from rare genetic disease research over the past two decades, we propose generalizable methods utilizing transcriptomic, system-wide chemical biology datasets combined with chemical informatics and, where possible, repurposing of FDA-approved drugs for preclinical orphan disease therapies. It is hoped that this approach may be of utility for the broader orphan disease research community and provide funding organizations and patient advocacy groups with suggestions for the optimal path forward. In addition to enabling academic preclinical research, strategies such as this may also aid in seeding startup companies, as well as further engaging the pharmaceutical industry in the treatment of rare genetic disease.
Orphan disease therapy; Preclinical drug development; Generalizable screening methods; Translational toolbox
We tested the general applicability of in situ proteolysis to form protein crystals suitable for structure determination by adding a protease (chymotrypsin or trypsin) digestion step to crystallization trials of 55 bacterial and 14 human proteins that had proven recalcitrant to our best efforts at crystallization or structure determination. This is a work in progress; so far we have determined structures of 9 bacterial proteins and the human aminoimidazole ribonucleotide synthetase (AIRS) domain.
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) and the associated proteins (Cas) comprise a system of adaptive immunity against viruses and plasmids in prokaryotes. Cas1 is a CRISPR-associated protein that is common to all CRISPR-containing prokaryotes but its function remains obscure. Here we show that the purified Cas1 protein of Escherichia coli (YgbT) exhibits nuclease activity against single-stranded and branched DNAs including Holliday junctions, replication forks, and 5′-flaps. The crystal structure of YgbT and site-directed mutagenesis have revealed the potential active site. Genome-wide screens show that YgbT physically and genetically interacts with key components of DNA repair systems, including recB, recC and ruvB. Consistent with these findings, the ygbT deletion strain showed increased sensitivity to DNA damage and impaired chromosomal segregation. Similar phenotypes were observed in strains with deletion of CRISPR clusters, suggesting that the function of YgbT in repair involves interaction with the CRISPRs. These results show that YgbT belongs to a novel, structurally distinct family of nucleases acting on branched DNAs and suggest that, in addition to antiviral immunity, at least some components of the CRISPR-Cas system have a function in DNA repair.
Cas1; CRISPR; DNA recombination; DNA repair; nuclease; YgbT
A crystal structure of the putative N-carbamoylsarcosine amidase (CSHase) Ta0454 from Thermoplasma acidophilum was solved by single-wavelength anomalous diffraction and refined at a resolution of 2.35 Å. CSHases are involved in the degradation of creatinine. Ta0454 shares a similar fold and a highly conserved C-D-K catalytic triad (Cys123, Asp9, and Lys90) with the structures of three cysteine hydrolases (PDB codes 1NBA, 1IM5, and 2H0R). Molecular dynamics (MD) simulations of Ta0454/N-carbamoylsarcosine and Ta0454/pyrazinamide complexes were performed to determine the structural basis of the substrate binding pattern for each ligand. Based on the MD-simulated trajectories, the MM/PBSA method predicts binding free energies of −24.5 and −17.1 kcal/mol for the two systems, respectively. The predicted binding free energies suggest that Ta0454 is selective for N-carbamoylsarcosine over pyrazinamide, and that zinc ions play an important role in the favorable substrate-bound states.
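The selectivity argument above rests on the gap between the two predicted binding free energies. The following is a minimal sketch of that arithmetic, assuming a temperature of 298.15 K (not stated in the abstract) and treating the Boltzmann factor exp(−ΔΔG/RT) as a purely illustrative measure. MM/PBSA estimates are generally regarded as more useful for ranking ligands than as absolute affinities, so the resulting ratio should not be read as a measured quantity. The function names are hypothetical.

```python
import math

# Gas constant in kcal/(mol*K)
R_KCAL = 1.987204e-3

def binding_energy_gap(dg_preferred, dg_other):
    """Difference between two binding free energies (kcal/mol).

    A negative result means the first ligand is predicted to bind more tightly.
    """
    return dg_preferred - dg_other

def relative_affinity(ddg_kcal, temperature_k=298.15):
    """Boltzmann factor exp(-ddG/RT); temperature is an assumed value."""
    return math.exp(-ddg_kcal / (R_KCAL * temperature_k))

# Values quoted in the abstract (kcal/mol)
dg_ncs = -24.5   # Ta0454 / N-carbamoylsarcosine
dg_pza = -17.1   # Ta0454 / pyrazinamide

ddg = binding_energy_gap(dg_ncs, dg_pza)   # about -7.4 kcal/mol
ratio = relative_affinity(ddg)             # much greater than 1: favors N-carbamoylsarcosine
```

A gap of several kcal/mol corresponds to a very large nominal preference, which is consistent with the qualitative conclusion that Ta0454 favors N-carbamoylsarcosine.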
N-carbamoylsarcosine amidase; C-D-K catalytic triad; creatinine degradation; crystal structure; MM/PBSA; molecular dynamics simulations
The protein TA0175 has a large number of sequence homologues, most of which are annotated as unknown and a few as belonging to the haloacid dehalogenase superfamily, but has no known biological function. Using a combination of amino acid sequence analysis, three-dimensional crystal structure information, and kinetic analysis, we have characterized TA0175 as phosphoglycolate phosphatase from Thermoplasma acidophilum. The crystal structure of TA0175 revealed two distinct domains, a larger core domain and a smaller cap domain. The large domain is composed of a centrally located five-stranded parallel β-sheet with strand order S10, S9, S8, S1, S2 and a small β-hairpin, strands S3 and S4. This central sheet is flanked by a set of three α-helices on one side and two helices on the other. The smaller domain is composed of an open-faced β-sandwich represented by three antiparallel β-strands, S5, S6, and S7, flanked by two oppositely oriented α-helices, H3 and H4. The topology of the large domain is conserved; however, structural variation is observed in the smaller domain among the different functional classes of the haloacid dehalogenase superfamily. Enzymatic assays on TA0175 revealed that this enzyme catalyzed the dephosphorylation of phosphoglycolate in vitro with kinetic properties similar to those seen for eukaryotic phosphoglycolate phosphatase. Activation by divalent cations, especially Mg2+, and competitive inhibition behavior with Cl− ions are similar between TA0175 and phosphoglycolate phosphatase. The experimental evidence presented for TA0175 is indicative of phosphoglycolate phosphatase.
The crystal structure of the hypothetical protein TA1238 from Thermoplasma acidophilum was solved with multiple-wavelength anomalous diffraction and refined at 2.0 Å resolution. The molecule consists of a typical four-helix antiparallel bundle with overhand connection. However, its oligomerization into a trimer leads to a coiled ‘super-helix’ which is novel for such bundles. Its central feature, a six-stranded coiled coil, is also novel for proteins. TA1238 does not have significant sequence relatives in databases, but has strong structural homologues among proteins in the Protein Data Bank. The function could not be inferred from the sequence but the structure, with some rearrangement, bears some resemblance to the active site region of cobalamin adenosyltransferase (TA1434). Specifically, TA1238 retains Arg104, which is structurally equivalent to functionally critical Arg119 of TA1434. For such a conformational change, the overhand connection of TA1238 might need to be involved in a gating mechanism that might be modulated by ligands and/or by interactions with the physiological partners. This allowed us to hypothesize that TA1238 could be involved in cobalamin biosynthesis.
cobalamin biosynthesis; crystal structure; four-helix bundle; gating mechanism; MAD phasing; overhand connection; six-stranded coiled coil
Ribose-5-phosphate isomerase A (RpiA; EC 5.3.1.6) interconverts ribose-5-phosphate and ribulose-5-phosphate. This enzyme plays essential roles in carbohydrate anabolism and catabolism; it is ubiquitous and highly conserved. The structure of RpiA from Escherichia coli was solved by multiwavelength anomalous diffraction (MAD) phasing, and refined to 1.5 Å resolution (R factor 22.4%, Rfree 23.7%). RpiA exhibits an α/β/(α/β)/β/α fold, some portions of which are similar to proteins of the alcohol dehydrogenase family. The two subunits of the dimer in the asymmetric unit have different conformations, representing the opening/closing of a cleft. Active site residues were identified in the cleft using sequence conservation, as well as the structure of a complex with the inhibitor arabinose-5-phosphate at 1.25 Å resolution. A mechanism for acid-base catalysis is proposed.
ribose-5-phosphate isomerase; MAD; X-ray crystallography; pentose phosphate pathway; Calvin cycle; arabinose-5-phosphate
Ribose-5-phosphate isomerases (EC 5.3.1.6) interconvert ribose 5-phosphate and ribulose 5-phosphate. This reaction permits the synthesis of ribose from other sugars, as well as the recycling of sugars from nucleotide breakdown. Two unrelated types of enzyme can catalyze the reaction. The most common, RpiA, is present in almost all organisms (including Escherichia coli), and is highly conserved. The second type, RpiB, is present in some bacterial and eukaryotic species and is well conserved. In E. coli, RpiB is sometimes referred to as AlsB, because it can take part in the metabolism of the rare sugar, allose, as well as the much more common ribose sugars. We report here the structure of RpiB/AlsB from E. coli, solved by multi-wavelength anomalous diffraction (MAD) phasing, and refined to 2.2 Å resolution. RpiB is the first structure to be solved from pfam02502 (the RpiB/LacAB family). It exhibits a Rossmann-type αβα-sandwich fold that is common to many nucleotide-binding proteins, as well as other proteins with different functions. This structure is quite distinct from that of the previously solved RpiA; although both are, to some extent, based on the Rossmann fold, their tertiary and quaternary structures are very different. The four molecules in the RpiB asymmetric unit represent a dimer of dimers. Active-site residues were identified at the interface between the subunits, such that each active site has contributions from both subunits. Kinetic studies indicate that RpiB is nearly as efficient as RpiA, despite its completely different catalytic machinery. The sequence and structural results further suggest that the two homologous components of LacAB (galactose-6-phosphate isomerase) will compose a bi-functional enzyme; the second activity is unknown.
ribose-5-phosphate isomerase; pentose phosphate pathway; galactose-6-phosphate isomerase; MAD; X-ray crystallography
Structural proteomics projects are generating three-dimensional structures of novel, uncharacterized proteins at an increasing rate. However, structure alone is often insufficient to deduce the specific biochemical function of a protein. Here we determined the function for a protein using a strategy that integrates structural and bioinformatics data with parallel experimental screening for enzymatic activity. BioH is involved in biotin biosynthesis in Escherichia coli and had no previously known biochemical function. The crystal structure of BioH was determined at 1.7 Å resolution. An automated procedure was used to compare the structure of BioH with structural templates from a variety of different enzyme active sites. This screen identified a catalytic triad (Ser82, His235, and Asp207) with a configuration similar to that of the catalytic triad of hydrolases. Analysis of BioH with a panel of hydrolase assays revealed a carboxylesterase activity with a preference for short acyl chain substrates. The combined use of structural bioinformatics with experimental screens for detecting enzyme activity could greatly enhance the rate at which function is determined from structure.
We have identified a novel family of proteins, in which the N-terminal Cystathionine Beta-Synthase (CBS) domain is fused to the C-terminal Zn ribbon domain. Four proteins were over-expressed in E. coli and purified: TA0289 from Thermoplasma acidophilum, TV1335 from Thermoplasma volcanium, PF1953 from Pyrococcus furiosus, and PH0267 from Pyrococcus horikoshii. The purified proteins had red/purple color in solution and an absorption spectrum typical of rubredoxins. Metal analysis of purified proteins revealed the presence of several metals with iron and zinc being the most abundant metals (2 to 67% of iron and 12 to 74% of zinc). Crystal structures of both mercury- and iron-bound TA0289 (1.5–2.0 Å resolution) revealed a dimeric protein whose inter-subunit contacts are formed exclusively by the α helices of two CBS sub-domains, whereas the C-terminal domain has a classical Zn-ribbon planar architecture. All proteins were reversibly reduced by chemical reductants (ascorbate or dithionite) or by the general rubredoxin reductase NorW from E. coli in the presence of NADH. Reduced TA0289 was found to be able to transfer electrons to cytochrome c from horse heart. Likewise, the purified Zn ribbon protein KTI11 from Saccharomyces cerevisiae had purple color in solution and a rubredoxin-like absorption spectrum, contained both iron and zinc, and was reduced by the rubredoxin reductase NorW from E. coli. Thus, recombinant Zn ribbon domains from archaea and yeast demonstrate a rubredoxin-like electron carrier activity in vitro. We suggest that in vivo some Zn ribbon domains might also bind iron and therefore possess an electron carrier activity, adding another physiological role to this large family of important proteins.
Gene expression profiling has the potential to unravel molecular mechanisms behind gene regulation and identify gene targets for therapeutic interventions. As microarray technology matures, the number of microarray studies has increased, resulting in many different datasets available for any given disease. The sensitivity and reliability of measurements of gene expression changes can be improved through a systematic integration of different microarray datasets that address the same or similar biological questions.
Traditional effect size models cannot be used to integrate array data that directly compare treatment to control samples expressed as log ratios of gene expression. Here we extend the traditional effect size model to integrate as many array datasets as possible. The extended effect size model (MAID) can integrate any array datatype generated with either single or two channel arrays using either direct or indirect designs across different laboratories and platforms. The model uses two standardized indices, the standard effect size score for experiments with two groups of data, and a new standardized index that measures the difference in gene expression between treatment and control groups for one-sample data with replicate arrays. The statistical significance of treatment effect across studies for each gene is determined by appropriate permutation methods depending on the type of data integrated. We apply our method to three different expression datasets from two different laboratories generated using three different array platforms and two different experimental designs. Our results indicate that the proposed integration model produces an increase in statistical power for identifying differentially expressed genes when integrating data across experiments and when compared to other integration models. We also show that genes found to be significant using our data integration method are of direct biological relevance to the three experiments integrated.
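To make the idea of combining standardized indices across studies concrete, here is a minimal sketch for a single gene. It uses Hedges' g for two-group data and inverse-variance (fixed-effect) weighting to pool studies, both standard meta-analysis tools. The one-sample index shown (mean log ratio divided by its standard deviation) is an assumption chosen for illustration; the abstract does not define MAID's new index, and this is not presented as the authors' formula. All function names and input values are hypothetical, and the permutation-based significance step is omitted.

```python
import math
from statistics import mean, variance

def hedges_g(treatment, control):
    """Two-group standardized effect size with small-sample correction.

    Returns (g, approximate variance of g).
    """
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    d = (mean(treatment) - mean(control)) / math.sqrt(pooled_var)
    g = d * (1 - 3 / (4 * (n1 + n2) - 9))          # Hedges' correction factor
    var_g = (n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2))
    return g, var_g

def one_sample_index(log_ratios):
    """Illustrative stand-in for a one-sample index on replicate log ratios.

    ASSUMPTION: mean / standard deviation, with the usual large-sample
    variance approximation; not the index defined in the paper.
    """
    n = len(log_ratios)
    z = mean(log_ratios) / math.sqrt(variance(log_ratios))
    return z, 1 / n + z ** 2 / (2 * n)

def combine_fixed_effect(effects):
    """Inverse-variance weighted mean of (effect, variance) pairs."""
    weights = [1 / v for _, v in effects]
    pooled = sum(w * e for w, (e, _) in zip(weights, effects)) / sum(weights)
    return pooled, 1 / sum(weights)

# Hypothetical measurements for one gene in two studies
study_a = hedges_g([2.1, 2.4, 2.2, 2.6], [1.0, 1.3, 0.9, 1.2])   # two-group design
study_b = one_sample_index([0.8, 1.1, 0.9, 1.2])                 # direct log-ratio design
pooled_effect, pooled_var = combine_fixed_effect([study_a, study_b])
z_score = pooled_effect / math.sqrt(pooled_var)
```

Pooling shrinks the variance of the combined estimate below that of either study alone, which is the source of the gain in statistical power the abstract describes.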
High-throughput genomics data provide a rich and complex source of information that could play a key role in deciphering intricate molecular networks behind disease. Here we propose an extension of the traditional effect size model to allow the integration of as many array experiments as possible with the aim of increasing the statistical power for identifying differentially expressed genes.
The human cytosolic sulfotransferases (hSULTs) comprise a family of 12 phase II enzymes involved in the metabolism of drugs and hormones, the bioactivation of carcinogens, and the detoxification of xenobiotics. Knowledge of the structural and mechanistic basis of substrate specificity and activity is crucial for understanding steroid and hormone metabolism, drug sensitivity, pharmacogenomics, and response to environmental toxins. We have determined the crystal structures of five hSULTs for which structural information was lacking, and screened nine of the 12 hSULTs for binding and activity toward a panel of potential substrates and inhibitors, revealing unique “chemical fingerprints” for each protein. The family-wide analysis of the screening and structural data provides a comprehensive, high-level view of the determinants of substrate binding, the mechanisms of inhibition by substrates and environmental toxins, and the functions of the orphan family members SULT1C3 and SULT4A1. Evidence is provided for structural “priming” of the enzyme active site by cofactor binding, which influences the spectrum of small molecules that can bind to each enzyme. The data help explain substrate promiscuity in this family and, at the same time, reveal new similarities between hSULT family members that were previously unrecognized by sequence or structure comparison alone.
We metabolize many hormones, drugs, and bioactive chemicals and toxins from the environment. One family of enzymes that participate in the metabolic process consists of the cytosolic sulfotransferases, or SULTs. SULTs have a variety of mechanisms of action—sometimes they inactivate the biological activity of the chemical (e.g., in the case of estrogen). At other times, the enzymes make the chemical more toxic (e.g., for certain carcinogens). Humans have 12 distinct SULT enzymes. Determining how each of these human enzymes recognizes and distinguishes between the thousands of chemicals we confront each day is essential for understanding hormone regulation, assessing environmental risk, and eventually developing better, more-effective drugs. We have studied the human SULT family of enzymes to profile which small molecules are recognized by each enzyme. We also visualized and compared the detailed structural features that determine which enzyme interacts with which molecule. By studying the entire family, we discovered new ways in which chemicals interact with each enzyme. Furthermore, we identified new inhibitors and inhibitory mechanisms. Finally, we discovered functions for many of the human enzymes that were previously uncharacterized.
Structural genomics and substrate screening provide "chemical fingerprints" and insights into substrate promiscuity for the human family of drug- and hormone-metabolizing cytosolic sulfotransferase enzymes.
High-throughput structural proteomics is expected to generate considerable amounts of data on the progress of structure determination for many proteins. For each protein this includes information about cloning, expression, purification, biophysical characterization and structure determination via NMR spectroscopy or X-ray crystallography. It will be essential to develop specifications and ontologies for standardizing this information to make it amenable to retrospective analysis. To this end we created the SPINE database and analysis system for the Northeast Structural Genomics Consortium. SPINE, which is available at bioinfo.mbb.yale.edu/nesg or nesg.org, is specifically designed to enable distributed scientific collaboration via the Internet. It was designed not just as an information repository but as an active vehicle to standardize proteomics data in a form that would enable systematic data mining. The system features an intuitive user interface for interactive retrieval and modification of expression construct data, query forms designed to track global project progress and external links to many other resources. Currently the database contains experimental data on 985 constructs, of which 740 are drawn from Methanobacterium thermoautotrophicum, 123 from Saccharomyces cerevisiae, 93 from Caenorhabditis elegans and the remainder from other organisms. We developed a comprehensive set of data mining features for each protein, including several related to experimental progress (e.g. expression level, solubility and crystallization) and 42 based on the underlying protein sequence (e.g. amino acid composition, secondary structure and occurrence of low complexity regions). We demonstrate in detail the application of a particular machine learning approach, decision trees, to the tasks of predicting a protein’s solubility and propensity to crystallize based on sequence features. We are able to extract a number of key rules from our trees, in particular that soluble proteins tend to have significantly more acidic residues and fewer hydrophobic stretches than insoluble ones. One of the characteristics of proteomics data sets, currently and in the foreseeable future, is their intermediate size (∼500–5000 data points). This creates a number of issues in relation to error estimation. Initially we estimate the overall error in our trees based on standard cross-validation. However, this leaves out a significant fraction of the data in model construction and does not give error estimates on individual rules. Therefore, we present alternative methods to estimate the error in particular
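The decision-tree and cross-validation steps described above can be sketched in a very reduced form. The example below computes two sequence features named in the abstract (acidic residue content and hydrophobic stretches), fits a one-split decision stump (the simplest possible decision tree) and estimates its error by leave-one-out cross-validation. It is an illustration only, not the SPINE implementation: the sequences and labels are invented, the hydrophobic residue set is an assumption, and all function names are hypothetical.

```python
ACIDIC = set("DE")
HYDROPHOBIC = set("AVILMFW")   # assumed definition of "hydrophobic"

def acidic_fraction(seq):
    """Fraction of residues that are Asp or Glu."""
    return sum(r in ACIDIC for r in seq) / len(seq)

def longest_hydrophobic_stretch(seq):
    """Length of the longest run of consecutive hydrophobic residues."""
    best = run = 0
    for r in seq:
        run = run + 1 if r in HYDROPHOBIC else 0
        best = max(best, run)
    return best

def fit_stump(xs, ys):
    """Pick the threshold and orientation with the fewest training errors.

    Returns (threshold, label_assigned_when_x_is_above_threshold).
    """
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between distinct feature values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for t in candidates:
        for above in (True, False):
            errors = sum((above if x > t else not above) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    return best[1], best[2]

def predict(stump, x):
    threshold, above = stump
    return above if x > threshold else not above

def leave_one_out_error(xs, ys):
    """Refit with each point held out and count held-out misclassifications."""
    wrong = 0
    for i in range(len(xs)):
        stump = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        wrong += predict(stump, xs[i]) != ys[i]
    return wrong / len(xs)

# Invented toy data: (sequence, soluble?)
toy = [("MDEEKDLSEEDT", True), ("MSDEELKKDEGS", True), ("MEDDSKTEEQDK", True),
       ("MLLVVIAAGKSL", False), ("MAIVLLFWGSTK", False), ("MKVVILLAAFGS", False)]
features = [acidic_fraction(seq) for seq, _ in toy]
labels = [soluble for _, soluble in toy]
rule = fit_stump(features, labels)
cv_error = leave_one_out_error(features, labels)
```

Leave-one-out validation holds out only one point per fit, which matters for the intermediate-sized data sets the abstract mentions, but like standard cross-validation it yields an overall error rather than an error for each individual rule.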
Virus infection induces an antiviral response that is predominantly associated with the synthesis and secretion of soluble interferon. Here, we report that herpes simplex virus type 1 virions induce an interferon-independent antiviral state in human embryonic lung cells that prevents plaquing of a variety of viruses. Microarray analysis of 19,000 human expressed sequence tags revealed induction of a limited set of host genes, the majority of which are also induced by interferon. Genes implicated in controlling the intracellular spread of virus and eliminating virally infected cells were among those induced. Induction of the cellular response occurred in the absence of de novo cellular protein synthesis and required viral penetration. In addition, this response was only seen when viral gene expression was inhibited, suggesting that a newly synthesized viral protein(s) may function as an inhibitor of this response.
RNA polymerase (RNAP) purified from Methanobacterium thermoautotrophicum ΔH has been shown to initiate transcription accurately in vitro from the hmtB archaeal histone promoter with either native or recombinant forms of the M. thermoautotrophicum TATA-binding protein and transcription factor TFB. Efforts to obtain transcription initiation from hydrogen-regulated methane gene promoters were, however, unsuccessful. Two previously unrecognized archaeal RNAP subunits have been identified, and complex formation by the M. thermoautotrophicum RNAP and TFB has been demonstrated.