Metabolomics offers a powerful means to investigate human malaria parasite biology and host-parasite interactions at the biochemical level, and to discover novel therapeutic targets and biomarkers of infection. Here, we used an approach based on liquid chromatography and mass spectrometry to perform an untargeted metabolomic analysis of metabolite extracts from Plasmodium falciparum–infected and uninfected patient plasma samples, and from an enriched population of in vitro cultured P. falciparum-infected and uninfected erythrocytes. Statistical modeling robustly segregated infected and uninfected samples based on metabolite species with significantly different abundances. Metabolites of the α-linolenic acid (ALA) pathway, known to exist in plants but not known to exist in P. falciparum until now, were enriched in infected plasma and erythrocyte samples. In vitro labeling with 13C-ALA showed evidence of plant-like ALA pathway intermediates in P. falciparum. Ortholog searches using ALA pathway enzyme sequences from 8 available plant genomes identified several genes in the P. falciparum genome that were predicted to encode the corresponding enzymes in the hitherto unannotated P. falciparum pathway. These data suggest that our approach can be used to discover novel facets of host/malaria parasite biology in a high-throughput manner.
Infection by a human papillomavirus (HPV) may result in a variety of clinical conditions ranging from benign warts to invasive cancer depending on the viral type. The HPV E2 protein represses transcription of the E6 and E7 genes in integrated papillomavirus genomes and together with the E1 protein is required for viral replication. E2 proteins bind with high affinity to palindromic DNA sequences consisting of two highly conserved four base pair sequences flanking a variable ‘spacer’ of identical length. The E2 proteins directly contact the conserved DNA but not the spacer DNA. However, variation in naturally occurring spacer sequences results in differential protein binding affinity. This discrimination in binding is dependent on their sensitivity to the unique conformational and/or dynamic properties of the spacer DNA in a process termed ‘indirect readout’. This article explores the structure of the E2 proteins and their interaction with DNA and other proteins, the effects of ions on affinity and specificity, and the phylogenetic and biophysical nature of this core viral protein. We have analyzed the sequence conservation and electrostatic features of three-dimensional models of the DNA binding domains of 146 papillomavirus types and variants with the goal of identifying characteristics associated with risk of virally caused malignancy. The amino acid sequence, three-dimensional structure, and the electrostatic features of E2 protein DNA binding domain showed high conservation among all papillomavirus types. This indicates that the specific interactions between the E2 protein and its binding sites on DNA have been conserved throughout PV evolution. Analysis of the E2 protein’s transactivation domain showed that unlike the DNA binding domain, the transactivation domain does not have extensive surfaces of highly conserved residues. Rather, the regions of high conservation are localized to small surface patches.
The invariance of the E2 DNA binding domain structure, electrostatics and sequence suggests that it may be a suitable target for the development of vaccines effective against a broad spectrum of HPV types.
Papillomavirus; DNA; Protein-DNA interactions; Electrostatics; E2; Review
Proteins can be decomposed into supersecondary structure modules. We used a generic definition of supersecondary structure elements, so-called Smotifs, which are composed of two flanking regular secondary structures connected by a loop, to explore the evolution and current variety of structure building blocks. Here, we discuss recent observations about the saturation of Smotif geometries in protein structures and how it opens new avenues in protein structure modeling and design. As a first application of these observations we describe our loop conformation modeling algorithm, ArchPred, which takes advantage of the Smotif classification. In this application, instead of focusing on specific loop properties the method narrows down possible template conformations in other, often not homologous structures, by identifying the most likely supersecondary structure environment that cradles the loop. Beyond identifying the correct starting supersecondary structure geometry, it takes into account the fit of anchor residues, steric clashes, the match of predicted and observed dihedral angle preferences, and local sequence signal.
Secondary structure; Supersecondary Structure; Smotif; Loop modeling; Protein Structure Evolution; Protein Structure Modeling; Protein Structure Design
Worldwide structural genomics projects continue to release new protein structures at an unprecedented pace, so far nearly 6000, but only about 60% of these proteins have any sort of functional annotation.
We explored a range of features that can be used for the prediction of functional residues given a known three-dimensional structure. These features include various centrality measures of nodes in graphs of interacting residues: closeness, betweenness and page-rank centrality. We also analyzed the distance of functional amino acids to the general center of mass (GCM) of the structure, relative solvent accessibility (RSA), and the use of relative entropy as a measure of sequence conservation. From the selected features, neural networks were trained to identify catalytic residues. We found that using distance to the GCM together with amino acid type provides a good discriminant function, when combined independently with sequence conservation. Using an independent test set of 29 annotated protein structures, the method returned 411 of the initial 9262 residues as the most likely to be involved in function. The output 411 residues contain 70 of the annotated 111 catalytic residues. This represents an approximately 14-fold enrichment of catalytic residues on the entire input set (corresponding to a sensitivity of 63% and a precision of 17%), a performance competitive with that of other state-of-the-art methods.
We found that several of the graph based measures utilize the same underlying feature of protein structures, which can be simply and more effectively captured with the distance to GCM definition. This also has the added advantage of simplicity and easy implementation. Meanwhile sequence conservation remains by far the most influential feature in identifying functional residues. We also found that due to the rapid changes in size and composition of sequence databases, conservation calculations must be recalibrated for specific reference databases.
Functional site; Catalytic residues; Neural network; Feature selection; Structural genomics
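The enrichment, sensitivity and precision quoted in the abstract above follow directly from the four reported counts. The short sketch below recomputes them; the function name and argument names are illustrative choices, not taken from the original work, and the fold enrichment is assumed to mean precision divided by the background frequency of catalytic residues.

```python
def enrichment_summary(total_residues, total_catalytic, predicted, predicted_catalytic):
    """Return (sensitivity, precision, fold enrichment) for a residue predictor.

    Fold enrichment is taken here as precision divided by the background
    fraction of catalytic residues in the whole input set (an assumption
    about how the abstract defines it).
    """
    sensitivity = predicted_catalytic / total_catalytic
    precision = predicted_catalytic / predicted
    background = total_catalytic / total_residues
    fold_enrichment = precision / background
    return sensitivity, precision, fold_enrichment


# Counts quoted in the abstract for the 29-structure independent test set.
sens, prec, fold = enrichment_summary(
    total_residues=9262,
    total_catalytic=111,
    predicted=411,
    predicted_catalytic=70,
)
print(f"sensitivity={sens:.0%} precision={prec:.0%} enrichment={fold:.1f}x")
```

With the reported counts this reproduces the rounded figures in the text: 63% sensitivity, 17% precision, and an enrichment of roughly 14-fold.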
Gene regulatory networks show robustness to perturbations. Previous work identified robustness as an emergent property of gene network evolution but the underlying molecular mechanisms are poorly understood. We used a multi-tier modeling approach that integrates molecular sequence and structure information with network architecture and population dynamics. Structural models of transcription factor-DNA complexes are used to estimate relative binding specificities. In this model, mutations in the DNA cause changes on two levels: (a) at the sequence level in individual binding sites (modulating binding specificity), and (b) at the network level (creating and destroying binding sites). We used this model to dissect the underlying mechanisms responsible for the evolution of robustness in gene regulatory networks. Results suggest that in sparse architectures (represented by short promoters), a mixture of local-sequence and network-architecture level changes is exploited. At the local-sequence level, robustness evolves by decreasing the probabilities of both the destruction of existent and generation of new binding sites. Meanwhile, in highly interconnected architectures (represented by long promoters), robustness evolves almost entirely via network level changes, deleting and creating binding sites that modify the network architecture.
Development from egg to embryo depends to a large extent on regulatory networks of genes called transcription factors. Previous research has shown these gene regulatory networks to be robust to perturbations at the level of the connections between transcription factors. Here, we investigate the mechanisms underlying the evolution of robustness in gene networks using a modeling approach, which considers three levels: binding of individual transcription factors to DNA, dynamics of gene expression levels, and fitness effects at the population level. In our model the gene regulatory network is determined by transcription factor binding sites within DNA sequences, which undergo mutation. We categorize these mutations in a continuum ranging from silent mutations, which have no effect on regulation and change only the DNA sequence (local-sequence level), to mutations that change connections between genes in the network (network-architecture level). We find that in sparse networks, containing few connections between genes, a balance of local-sequence and network-architecture level mechanisms is responsible for the evolution of robustness, but when the network is densely connected the network-architecture level mechanisms become dominant. We argue that the shift towards the network-architecture level for more densely-connected networks offers a potential explanation for the evolution of increased complexity.
Differential detergent fractionation (DDF) is frequently used to partition fresh cells and tissues into distinct compartments. We have tested whether DDF can reproducibly extract and fractionate cellular protein components from frozen tissues. Frozen kidneys were sequentially extracted with three different buffer systems. Analysis of the three fractions with LC-MS/MS identified 1,693 proteins, some of which were common to all fractions and others unique to specific fractions. Normalized spectral index values (SIN) obtained from these data were compared in order to evaluate both the reproducibility of the method as well as the efficiency of enrichment. SIN values between replicate fractions demonstrated a high correlation, confirming the reproducibility of the method. Correlation coefficients across the three fractions were significantly lower than those for the replicates, supporting the capability of DDF to differentially fractionate proteins into separate compartments. Subcellular annotation of the proteins identified in each fraction demonstrated a significant enrichment of cytoplasmic, cell membrane and nuclear proteins in the three respective buffer system fractions. We conclude that DDF can be applied to frozen tissue to generate reproducible proteome coverage discriminating subcellular compartments. This demonstrates the feasibility of analyzing cellular compartment specific proteins in archived tissue samples with the simple DDF method.
Differential detergent fractionation; Normalized spectral index; Frozen tissue; Subcellular location
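The reproducibility argument in the abstract above rests on comparing correlation coefficients of SIN values between replicate fractions with those across different fractions. The abstract does not state which coefficient was used, so the sketch below assumes a plain Pearson correlation; the SIN values are made-up toy numbers for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy SIN values for the same four proteins (hypothetical, not from the study).
replicate_a = [1.0, 2.0, 3.0, 4.0]      # fraction 1, replicate A
replicate_b = [1.1, 1.9, 3.2, 3.9]      # fraction 1, replicate B
other_fraction = [4.0, 1.0, 3.5, 0.5]   # same proteins in a different fraction

r_replicates = pearson(replicate_a, replicate_b)
r_fractions = pearson(replicate_a, other_fraction)
print(f"replicates r={r_replicates:.2f}, across fractions r={r_fractions:.2f}")
```

A high coefficient between replicates together with a clearly lower one across fractions is the pattern the abstract describes as evidence of reproducible, differential fractionation.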
Mass spectrometry analysis of cross-linked peptides can be used to probe protein contact sites in macromolecular complexes. We have developed a photo-cleavable cross-linker that enhances peptide enrichment, improving the signal-to-noise ratio of the cross-linked peptides in mass spectrometry analysis. This cross-linker utilizes a nitro-benzyl alcohol group that can be cleaved by UV irradiation and is stable during the multiple washing steps used for peptide enrichment. The enrichment method utilizes a cross-linker that aids in eliminating contamination resulting from protein based retrieval systems, and thus, facilitates the identification of cross-linked peptides. Homodimeric pilM protein from Pseudomonas aeruginosa 2192 (pilM) was investigated to test the specificity and experimental conditions. As predicted, the known pair of lysine side chains within 14Å was cross-linked. An unexpected cross-link involving the protein’s amino terminus was also detected. This is consistent with the predicted mobility of the amino terminus that may bring the amino groups within 19Å of one another in solution. These technical improvements allow this method to be used for investigating protein-protein interactions in complex biological samples.
cross-link; enrichment; photo-cleavable; transient protein complex
The X-ray structure of a putative BenF-like (gene name: PFL1329) protein from Pseudomonas fluorescens Pf-5 (PflBenF) has been determined at 2.6Å resolution. X-ray crystallography revealed a canonical 18-stranded β-barrel fold that forms a central pore with a diameter of ∼4.6Å, which is consistent with the size and physicochemical properties of the presumed aromatic acid substrate, benzoate. Detailed comparisons with the previously-determined structure of Pseudomonas aeruginosa OpdK, a vanillate influx channel, revealed an arginine-rich aromatic acid selectivity filter of nearly identical structure composed of seven highly conserved residues Arg∼Asp∼Arg∼Arg∼Ser∼Asp∼Arg (R∼D∼R∼R∼S∼D∼R sequence motif, where ∼ denotes intervening residues) that define the narrowest part of the pore.
BenF-like; substrate specific porin; OprD superfamily; OprD subfamily; OpdK subfamily; benzoate; Pseudomonas; integral membrane protein
Reciprocal interactions between glia and neurons are essential for the proper organization and function of the nervous system. Recently, the interaction between ErbB receptors (ErbB2 and ErbB3) on the surface of Schwann cells and neuronal Neuregulin-1 (NRG1) has emerged as the pivotal signal that controls Schwann cell development, association with axons, and myelination. To understand the function of the NRG1-ErbB2/3 signaling axis in adult Schwann cell biology we are studying the specific role of ErbB3 receptor tyrosine kinase (RTK) since it is the receptor for NRG1 on the surface of Schwann cells. Here we show that alternative transcription initiation results in the formation of a nuclear variant of ErbB3 (nuc-ErbB3) in rat primary Schwann cells. Nuc-ErbB3 possesses a functional nuclear localization signal sequence and binds to chromatin. Using ChIP-ChIP arrays we identified the promoters that associate with nuc-ErbB3 and clustered the active promoters in Schwann cell gene expression. Nuc-ErbB3 regulates the transcriptional activity of ezrin and HMGB1 promoters while inhibition of nuc-ErbB3 expression results in reduced myelination and altered distribution of ezrin in the nodes of Ranvier. Finally, we reveal that NRG1 regulates the translation of nuc-ErbB3 in rat Schwann cells. For the first time, to our knowledge, we show that alternative transcription initiation from a gene that encodes an RTK is capable of generating a protein variant of the receptor with a distinct role in molecular and cellular regulation. We propose a new concept for the molecular regulation of myelination through the expression and distinct role of nuc-ErbB3.
ErbB3; Schwann cells; myelination; nodes; transcription; signaling
VISTA suppresses T cell proliferation and cytokine production and can influence autoimmunity and antitumor responses in mice.
The immunoglobulin (Ig) superfamily consists of many critical immune regulators, including the B7 family ligands and receptors. In this study, we identify a novel and structurally distinct Ig superfamily inhibitory ligand, whose extracellular domain bears homology to the B7 family ligand PD-L1. This molecule is designated V-domain Ig suppressor of T cell activation (VISTA). VISTA is primarily expressed on hematopoietic cells, and VISTA expression is highly regulated on myeloid antigen-presenting cells (APCs) and T cells. A soluble VISTA-Ig fusion protein or VISTA expression on APCs inhibits T cell proliferation and cytokine production in vitro. A VISTA-specific monoclonal antibody interferes with VISTA-induced suppression of T cell responses by VISTA-expressing APCs in vitro. Furthermore, anti-VISTA treatment exacerbates the development of the T cell–mediated autoimmune disease experimental autoimmune encephalomyelitis in mice. Finally, VISTA overexpression on tumor cells interferes with protective antitumor immunity in vivo in mice. These findings show that VISTA, a novel immunoregulatory molecule, has functional activities that are nonredundant with other Ig superfamily members and may play a role in the development of autoimmunity and immune surveillance in cancer.
Toxoplasma gondii is an apicomplexan of both medical and veterinary importance which is classified as an NIH Category B priority pathogen. It is best known for its ability to cause congenital infection in immune competent hosts and encephalitis in immune compromised hosts. The highly stable and specialized microtubule-based cytoskeleton participates in the invasion process. The genome encodes three isoforms of both α- and β-tubulin, and here we show that the tubulin is extensively altered by specific post-translational modifications (PTMs). T. gondii tubulin PTMs were analyzed by mass spectrometry and immunolabeling using specific antibodies. The PTMs identified on α-tubulin included acetylation of Lys40, removal of the last C-terminal amino acid residue Tyr453 (detyrosinated tubulin) and truncation of the last five amino acid residues. Polyglutamylation was detected on both α- and β-tubulins. An antibody directed against mammalian α-tubulin lacking the last two C-terminal residues (Δ2-tubulin) labeled the apical region of this parasite. Detyrosinated tubulin was diffusely present in subpellicular microtubules and displayed an apparent accumulation at the basal end. Methylation, a PTM not previously described on tubulin, was also detected. Methylated tubulins were not detected in the host cells, human foreskin fibroblasts, suggesting that this may be a modification specific to the Apicomplexa.
Toxoplasma gondii; cytoskeleton; tubulin; post-translational modification; proteomics; microtubules; conoid
The microtubule cytoskeleton has proven to be an effective target for cancer therapeutics. One class of drugs, known as microtubule stabilizing agents (MSAs), binds to microtubule polymers and stabilizes them against depolymerization. The prototype of this group of drugs, Taxol, is an effective chemotherapeutic agent used extensively in the treatment of human ovarian, breast, and lung carcinomas. Although electron crystallography and photoaffinity labeling experiments determined that the binding site for Taxol is in a hydrophobic pocket in β-tubulin, little was known about the effects of this drug on the conformation of the entire microtubule. A recent study from our laboratory utilizing hydrogen-deuterium exchange (HDX) in concert with various mass spectrometry (MS) techniques has provided new information on the structure of microtubules upon Taxol binding. In the current study we apply this technique to determine the binding mode and the conformational effects on chicken erythrocyte tubulin (CET) of another MSA, discodermolide, whose synthetic analogues may have potential use in the clinic. We confirmed that like Taxol, discodermolide binds to the taxane binding pocket in β-tubulin. However, as opposed to Taxol, which has major interactions with the M-loop, discodermolide orients itself away from this loop and towards the N-terminal H1–S2 loop. Additionally, discodermolide stabilizes microtubules mainly via its effects on interdimer contacts, specifically on the α-tubulin side, and to a lesser extent on interprotofilament contacts between adjacent β-tubulin subunits. Also, our results indicate complementary stabilizing effects of Taxol and discodermolide on the microtubules, which may explain the synergy observed between the two drugs in vivo.
microtubules; discodermolide; Taxol; mass spectrometry; hydrogen-deuterium exchange
X-linked dyskeratosis congenita (DC) is a rare bone marrow failure syndrome caused by mostly missense mutations in the pseudouridine synthase NAP57 (dyskerin/Cbf5). As part of H/ACA ribonucleoproteins (RNPs), NAP57 is important for the biogenesis of ribosomes, spliceosomal small nuclear RNPs, microRNAs and the telomerase RNP. DC mutations concentrate in the N- and C-termini of NAP57 but not in its central catalytic domain raising questions as to their impact. We demonstrate that the N- and C-termini together form the binding surface for the H/ACA RNP assembly factor SHQ1 and that DC mutations modulate the interaction between the two proteins. Pinpointing impaired interaction between NAP57 and SHQ1 as a potential molecular basis for X-linked DC has implications for therapeutic approaches, e.g. by targeting the NAP57–SHQ1 interface with small molecules.
One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.
Folds are the basic building blocks of protein structures. Understanding the emergence of novel protein folds is an important step towards understanding the rules governing the evolution of protein structure and function and for developing tools for protein structure modeling and design. We explored the frequency of occurrences of an exhaustively classified library of supersecondary structural elements (Smotifs), in protein structures, in order to identify features that would define a fold as novel compared to previously known structures. We found that a surprisingly small set of Smotifs is sufficient to describe all known folds. Furthermore, novel folds do not require novel Smotifs, but rather are a new combination of existing ones. Novel folds can be typified by the inclusion of a relatively higher number of rarely occurring Smotifs in their structures and, to a lesser extent, by a novel topological combination of commonly occurring Smotifs. When investigating the structural features of Smotifs, we found that the top 10% of most frequent ones have a higher fraction of internal contacts, while some of the most rare motifs are larger, and contain a longer loop region.
Structural genomics efforts aim at exploring the repertoire of three-dimensional structures of protein molecules. While genome scale sequencing projects have already provided us with all the genes of many organisms, it is the three dimensional shape of gene encoded proteins that defines all the interactions among these components. Understanding the versatility and, ultimately, the role of all possible molecular shapes in the cell is a necessary step toward understanding how organisms function. In this work we explored the rules that identify certain shapes as novel compared to all already known structures. The findings of this work provide possible insights into the rules that can be used in future works to identify or design new molecular shapes or to relate folds with each other in a quantitative manner.
Toxoplasma gondii is a ubiquitous, Apicomplexan parasite that, in humans, can cause several clinical syndromes, including encephalitis, chorioretinitis and congenital infection. T. gondii was described a little over 100 years ago in the tissues of the gundi (Ctenodactylus gundi). There are a large number of applicable experimental techniques available for this pathogen and it has become a model organism for the study of intracellular pathogens. With the completion of the genomes for type I (GT-1), type II (ME49) and type III (VEG) strains, proteomic studies on this organism have been greatly facilitated. Several subcellular proteomic studies have been completed on this pathogen. These studies have helped elucidate specialized invasion organelles and their composition, as well as proteins associated with the cytoskeleton. Global proteomic studies are leading to improved strategies for genome annotation in this organism and an improved understanding of protein regulation in this pathogen. Web-based resources, such as EPIC-DB and ToxoDB, provide proteomic data and support for studies on T. gondii. This review will summarize the current status of proteomic research on T. gondii.
Apicomplexa; cell biology; genome; proteomic; Toxoplasma gondii
Scoring functions, such as molecular mechanics force fields and statistical potentials, are fundamentally important tools in protein structure modeling and quality assessment.
The performances of a number of publicly available scoring functions are compared with statistical rigor, with an emphasis on knowledge-based potentials. We explored the effect on accuracy of alternative choices for representing interaction center types and other features of scoring functions, such as using information on solvent accessibility, on torsion angles, accounting for secondary structure preferences and side chain orientation. Partially based on the observations made, we present a novel residue based statistical potential, which employs a shuffled reference state definition and takes into account the mutual orientation of residue side chains. Atom- and residue-level statistical potentials and Linux executables to calculate the energy of a given protein proposed in this work can be downloaded from http://www.fiserlab.org/potentials.
Among the most influential terms we observed a critical role of a proper reference state definition and the benefits of including information about the microenvironment of interaction centers. Molecular mechanical potentials were also tested and found to be over-sensitive to small local imperfections in a structure, requiring unfeasibly long energy relaxation before energy scores started to correlate with model quality.
Cross-linking analysis of protein complexes and structures by tandem mass spectrometry (MS/MS) has advantages in speed, sensitivity, specificity, and the capability of handling complicated protein assemblies. However, detection and accurate assignment of the cross-linked peptides are often challenging due to their low abundance and complicated fragmentation behavior in collision-induced dissociation (CID). To simplify the MS analysis and improve the signal-to-noise ratio of the cross-linked peptides, we developed a novel peptide enrichment strategy that utilizes a cross-linker with a cryptic thiol group and beads modified with a photocleavable cross-linker. The functional cross-linkers were designed to react with the primary amino groups in proteins. Human serum albumin was used as a model protein to detect intra- and intermolecular cross-linkages. Use of this protein-free selective retrieval method eliminates the contamination that can result from avidin–biotin based retrieval systems and simplifies data analysis. These features may make the method suitable to investigate protein–protein interactions in biological samples.
We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
The Protein Structure Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while the PSI-2 (2005-now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per structure contribution to novel leverage was over 4-fold higher than that for non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may just take another ~15 years to cover most sequences in the current UniProt database.
Protein structure determination; Structural genomics; Evolution; Protein universe
High throughput proteomics experiments are useful for analyzing the protein expression of an organism, identifying the correct gene structure of a genome, or locating possible post-translational modifications within proteins. High throughput methods necessitate publicly accessible and easily queried databases for efficiently and logically storing, displaying, and analyzing the large volume of data.
EPICDB is a publicly accessible, queryable, relational database that organizes and displays experimental, high throughput proteomics data for Toxoplasma gondii and Cryptosporidium parvum. Along with detailed information on mass spectrometry experiments, the database also provides antibody experimental results and analysis of functional annotations, comparative genomics, and aligned expressed sequence tag (EST) and genomic open reading frame (ORF) sequences. The database contains all available alternative gene datasets for each organism, which comprises a complete theoretical proteome for the respective organism, and all data is referenced to these sequences. The database is structured around clusters of protein sequences, which allows for the evaluation of redundancy, protein prediction discrepancies, and possible splice variants. The database can be expanded to include genomes of other organisms for which proteome-wide experimental data are available.
EPICDB is a comprehensive database of genome-wide T. gondii and C. parvum proteomics data and incorporates many features that allow for the analysis of the entire proteomes and/or annotation of specific protein sequences. EPICDB is complementary to other -genomics- databases of these organisms by offering complete mass spectrometry analysis on a comprehensive set of all available protein sequences.
Toxoplasma gondii is an obligate intracellular protozoan that infects 20 to 90% of the population. It can cause both acute and chronic infections, many of which are asymptomatic, and, in immunocompromized hosts, can cause fatal infection due to reactivation from an asymptomatic chronic infection. An essential step towards understanding molecular mechanisms controlling transitions between the various life stages and identifying candidate drug targets is to accurately characterize the T. gondii proteome.
We have explored the proteome of T. gondii tachyzoites with high throughput proteomics experiments and by comparison to publicly available cDNA sequence data. Mass spectrometry analysis validated 2,477 gene coding regions with 6,438 possible alternative gene predictions, approximately one third of the T. gondii proteome. The proteomics survey identified 609 proteins that are unique to Toxoplasma as compared to any known species, including other apicomplexans. Computational analysis identified 787 cases of possible gene duplication events and located at least 6,089 gene coding regions. Commonly used gene prediction algorithms produce very disparate sets of protein sequences, with pairwise overlaps ranging from 1.4% to 12%. Through this experimental and computational exercise we benchmarked gene prediction methods and observed false negative rates of 31 to 43%.
This study not only provides the largest proteomics exploration of the T. gondii proteome, but also illustrates how high throughput proteomics experiments can elucidate correct gene structures in genomes.
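The pairwise overlaps and false negative rates reported above can be illustrated with a minimal sketch. The abstract does not define either measure, so this example assumes that overlap is the Jaccard ratio of identical protein sequences shared by two prediction sets, and that the false negative rate is the fraction of experimentally validated sequences a predictor fails to report. The predictor names and sequences are made-up toy data, not results from the study.

```python
from itertools import combinations


def pairwise_overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap: shared sequences divided by all distinct sequences."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def false_negative_rate(predicted: set[str], validated: set[str]) -> float:
    """Fraction of validated sequences that the predictor did not report."""
    if not validated:
        return 0.0
    return len(validated - predicted) / len(validated)


# Hypothetical toy data: predictor name -> set of predicted protein sequences.
predictions = {
    "predictor_a": {"MKT", "MSL", "MAA", "MGG"},
    "predictor_b": {"MKT", "MPP", "MQQ"},
    "predictor_c": {"MSL", "MKT", "MRR", "MVV", "MEE"},
}
# Hypothetical set of sequences confirmed by mass spectrometry.
validated = {"MKT", "MSL", "MPP", "MRR"}

# Overlap for every unordered pair of predictors.
overlaps = {
    (x, y): pairwise_overlap(predictions[x], predictions[y])
    for x, y in combinations(sorted(predictions), 2)
}
# False negative rate of each predictor against the validated set.
fnr = {
    name: false_negative_rate(seqs, validated)
    for name, seqs in predictions.items()
}

for pair, value in overlaps.items():
    print(pair, f"{value:.1%}")
for name, value in fnr.items():
    print(name, f"{value:.1%}")
```

Exact sequence identity is a strict criterion; a real comparison might instead match predictions by genomic locus or by shared peptides, which would change the numbers.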
The Pentapeptide Repeat Protein (PRP) family has over 500 members in the prokaryotic and eukaryotic kingdoms. These proteins are composed of, or contain domains composed of, tandemly repeated amino acid sequences with a consensus sequence of [S,T,A,V][D,N][L,F][S,T,R][G]. The biochemical function of the vast majority of PRP family members is unknown. The first three-dimensional structure of a PRP family member was determined for the fluoroquinolone resistance protein (MfpA) from Mycobacterium tuberculosis. The structure revealed that the pentapeptide repeats encode the folding of a novel right-handed quadrilateral β-helix. MfpA binds to DNA gyrase and inhibits its activity. The rod-shaped, dimeric protein exhibits remarkable size, shape and electrostatic similarity to DNA.
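As an illustration of the consensus above, the following sketch scans a protein sequence for runs of two or more consecutive pentapeptides that match it exactly. The function name, the minimum of two repeats, and the toy sequence are assumptions made for this example. Real PRP family members often deviate from the strict consensus, so an exact regular expression is likely to under-report repeats compared with profile-based methods.

```python
import re

# One pentapeptide matching the consensus [S,T,A,V][D,N][L,F][S,T,R][G].
PENTAPEPTIDE = "[STAV][DN][LF][STR]G"
# Two or more back-to-back consensus pentapeptides (threshold is an assumption).
TANDEM = re.compile(f"(?:{PENTAPEPTIDE}){{2,}}")


def find_tandem_repeats(sequence: str) -> list[tuple[int, int, int]]:
    """Return (start, end, repeat_count) for each run of consensus repeats.

    Coordinates are 0-based and end-exclusive, as in Python slicing.
    """
    return [
        (m.start(), m.end(), (m.end() - m.start()) // 5)
        for m in TANDEM.finditer(sequence.upper())
    ]


# Hypothetical toy sequence: three tandem repeats, a spacer, one lone repeat.
toy = "MK" + "ADLSG" + "TNFTG" + "SDLRG" + "PPK" + "ADLSG"
print(find_tandem_repeats(toy))
```

The lone pentapeptide at the end of the toy sequence is not reported because it does not form a tandem run.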
The crystal structure of the ybeY protein from E. coli is reported at 2.7 Å resolution with a bound metal ion.
The three-dimensional crystallographic structure of the ybeY protein from Escherichia coli (SwissProt entry P77385) is reported at 2.7 Å resolution. YbeY is a hypothetical protein that belongs to the UPF0054 family. The structure reveals that the protein binds a metal ion in a tetrahedral geometry. Three coordination sites are provided by histidine residues, while the fourth might be a water molecule that is not seen in the diffraction map because of the map's relatively low resolution. X-ray fluorescence analysis of the purified protein suggests that the metal is a nickel ion. The structure of ybeY and its sequence similarity to a number of predicted metal-dependent hydrolases provide a functional assignment for this protein family. The figures and tables of this paper were prepared using semi-automated tools, termed the Autopublish server, developed by the New York Structural GenomiX Research Consortium, with the goal of facilitating the rapid publication of crystallographic structures that emanate from worldwide Structural Genomics efforts, including the NIH-funded Protein Structure Initiative.
Protein Structure Initiative; metalloproteins; nickel; UPF0054 family
MODBASE (http://salilab.org/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on the MODELLER package for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE uses the MySQL relational database management system for flexible querying and CHIMERA for viewing the sequences and structures (http://www.cgl.ucsf.edu/chimera/). MODBASE is updated regularly to reflect the growth in protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different data sets. The largest data set contains 1 262 629 models for domains in 659 495 out of 1 182 126 unique protein sequences in the complete Swiss-Prot/TrEMBL database (August 25, 2003); only models based on alignments with significant similarity scores and models assessed to have the correct fold despite insignificant alignments are included. Another model data set supports target selection and structure-based annotation by the New York Structural Genomics Research Consortium; e.g. the 53 new structures produced by the consortium allowed us to characterize structurally 24 113 sequences. MODBASE also contains binding site predictions for small ligands and a set of predicted interactions between pairs of modeled sequences from the same genome.
Our other resources associated with MODBASE include a comprehensive database of multiple protein structure alignments (DBALI, http://salilab.org/dbali) as well as web servers for automated comparative modeling with MODPIPE (MODWEB, http://salilab.org/modweb), modeling of loops in protein structures (MODLOOP, http://salilab.org/modloop) and predicting functional consequences of single nucleotide polymorphisms (SNPWEB, http://salilab.org/snpweb).
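The counts quoted for the largest MODBASE data set can be turned into coverage figures with a short calculation. The sketch below uses only the numbers stated in the abstract. The variable names and the derived ratios are introduced here for illustration and are not statistics reported by the authors.

```python
# Counts as stated for the largest data set (Swiss-Prot/TrEMBL, August 25, 2003).
models = 1_262_629
modeled_sequences = 659_495
total_sequences = 1_182_126

# Counts as stated for the structural genomics data set.
new_structures = 53
characterized_sequences = 24_113

# Fraction of unique sequences with at least one modeled domain.
coverage = modeled_sequences / total_sequences
# Average number of domain models per modeled sequence.
models_per_sequence = models / modeled_sequences
# Average number of sequences characterized per newly solved structure.
sequences_per_structure = characterized_sequences / new_structures

print(f"Sequence coverage: {coverage:.1%}")
print(f"Models per modeled sequence: {models_per_sequence:.2f}")
print(f"Sequences characterized per new structure: {sequences_per_structure:.0f}")
```

On these figures, slightly more than half of the unique sequences have at least one modeled domain, with about two domain models per modeled sequence on average.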