One of the major contributors to protein structure is the formation of disulphide bonds between selected pairs of
cysteines in the oxidized state. Prediction of such disulphide bridges from sequence is challenging because the number of possible
cysteine pairings grows rapidly as the number of cysteines in a protein increases. Here, we describe an SVM (support vector
machine) model for the prediction of cystine connectivity in a protein sequence with and without a priori knowledge of
the bonding state of the cysteines. We make use of a new encoding scheme based on physico-chemical properties and statistical features
(probability of occurrence of each amino acid residue in different secondary structure states along with PSI-BLAST profiles).
We evaluate our method on the SPX dataset (an extended dataset of SP39 (Swiss-Prot 39) and SP41 (Swiss-Prot 41) with known disulphide
information from PDB) and compare our results with the recursive neural network model described for the same
disulphide bridges; prediction; protein fold; SVM model; SPX dataset
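The growth in possible cysteine pairings can be made concrete with a short calculation. The sketch below assumes, for illustration only, that all 2n cysteines are bonded, in which case the number of distinct connectivity patterns is the double factorial (2n - 1)!!. The function name is hypothetical and is not part of the described SVM method.

```python
def count_connectivity_patterns(n_bonds: int) -> int:
    """Count the ways to pair 2 * n_bonds cysteines into n_bonds disulphide bonds.

    Assumes every cysteine is bonded, so the count is the double factorial
    (2n - 1)!! = 1 * 3 * 5 * ... * (2n - 1).
    """
    if n_bonds < 0:
        raise ValueError("n_bonds must be non-negative")
    count = 1
    for k in range(1, n_bonds + 1):
        # The k-th bond adds a factor of (2k - 1) possible partners.
        count *= 2 * k - 1
    return count


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, count_connectivity_patterns(n))
```

Under this assumption, two bonds allow 3 patterns and five bonds already allow 945, which illustrates why exhaustive enumeration becomes impractical for cysteine-rich proteins.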
We present the development of a web server, a protein short motif search tool that allows users to simultaneously search for a
protein sequence motif and its secondary structure assignments. The web server can run very short motif searches against
PDB structural data from the RCSB Protein Data Bank, with users defining the secondary structure type of the amino acids
in the sequence motif. The output uses 3D visualisation to highlight the position of the motif in the structure and on
the corresponding sequence. Researchers can easily observe the locations and conformations of multiple motifs among the results.
Protein short motif search also has an application programming interface (API) for interfacing with other bioinformatics tools.
The database is available for free at http://birg3.fbb.utm.my/proteinsms
Protein short motif search; protein secondary structure; visualization; application programming interface (API)
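The idea of searching a sequence motif together with its secondary structure assignment can be sketched as follows. This is a minimal illustration, not the server's implementation or API: the function name, the one-letter secondary structure codes (H, E, C) and the `?` wildcard are all assumptions made for the example.

```python
from typing import List


def find_motif_with_ss(sequence: str, ss: str, motif: str, ss_pattern: str) -> List[int]:
    """Return 0-based start positions where `motif` occurs in `sequence`
    and the secondary structure at the same positions matches `ss_pattern`.

    `ss` holds one secondary structure code per residue of `sequence`.
    A '?' in `ss_pattern` matches any secondary structure code.
    """
    if len(sequence) != len(ss):
        raise ValueError("sequence and ss must have the same length")
    if len(motif) != len(ss_pattern):
        raise ValueError("motif and ss_pattern must have the same length")
    if not motif:
        return []

    hits = []
    start = sequence.find(motif)
    while start != -1:
        window = ss[start:start + len(motif)]
        if all(p == "?" or p == s for p, s in zip(ss_pattern, window)):
            hits.append(start)
        # Advance by one so that overlapping occurrences are also found.
        start = sequence.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    seq = "AGSGKTAGSGKT"
    sec = "CCCHHHEEEEEE"
    print(find_motif_with_ss(seq, sec, "GSG", "EEE"))
    print(find_motif_with_ss(seq, sec, "GSG", "???"))
```

Filtering on both strings at once is what distinguishes this kind of query from a plain sequence search: the same three-residue motif can occur many times, but only some occurrences adopt the requested conformation.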
The Structure Integration with Function, Taxonomy and Sequences resource (SIFTS; http://pdbe.org/sifts) is a close collaboration between the Protein Data Bank in Europe (PDBe) and UniProt. The two teams have developed a semi-automated process for maintaining up-to-date cross-reference information to UniProt entries, for all protein chains in the PDB entries present in the UniProt database. This process is carried out for every weekly PDB release and the information is stored in the SIFTS database. The SIFTS process includes cross-references to other biological resources such as Pfam, SCOP, CATH, GO, InterPro and the NCBI taxonomy database. The information is exported in XML format, one file for each PDB entry, and is made available by FTP. Many bioinformatics resources use SIFTS data to obtain cross-references between the PDB and other biological databases so as to provide their users with up-to-date information.
Sequences and structures provide valuable complementary information on protein features and functions. However, it is not always straightforward for users to gather information concurrently from the sequence and structure levels. The UniProt knowledgebase (UniProtKB) strives to help users in this undertaking by providing complete cross-references to the Protein Data Bank (PDB) as well as coherent feature annotation using available structural information. In this study, SSMap, a new UniProt-PDB residue-residue level mapping, was generated. The primary objective of this mapping is not only to facilitate the two tasks mentioned above, but also to remedy a number of shortcomings of existing mappings. SSMap is the first isoform sequence-specific mapping resource and is up-to-date for UniProtKB annotation tasks. The method employed by SSMap differs from the other mapping resources in that it emphasises the correct reconstruction of the PDB sequence from structures, and the correct attribution of a UniProtKB entry to each PDB chain by using a series of post-processing steps.
SSMap was compared to other existing mapping resources in terms of the correctness of the attribution of PDB chains to UniProtKB entries, and of the quality of the pairwise alignments supporting the residue-residue mapping. It was found that SSMap shared about 80% of the mappings with other mapping sources. New and alternative mappings proposed by SSMap were mostly good as assessed by manual verification of data subsets. As for local pairwise alignments, it was shown that major discrepancies (both in terms of alignment lengths and boundaries), when present, were often due to differences in methodologies used for the mappings.
SSMap provides an independent, good quality UniProt-PDB mapping. The systematic comparison conducted in this study allows the further identification of general problems in UniProt-PDB mappings so that both the coverage and the quality of the mappings can be systematically improved for the benefit of the scientific community. SSMap mapping is currently used to provide PDB cross-references in UniProtKB.
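A residue-residue mapping of this kind is typically read off a pairwise alignment. The sketch below shows only that final step, assuming the two sequences have already been aligned and are given as equal-length gapped strings. The function name, the `-` gap character and the numbering defaults are assumptions for illustration, and the sketch does not reproduce the SSMap pipeline or its post-processing steps.

```python
from typing import Dict


def residue_map(aligned_uniprot: str, aligned_pdb: str,
                uniprot_start: int = 1, pdb_start: int = 1,
                gap: str = "-") -> Dict[int, int]:
    """Map UniProt residue numbers to PDB residue numbers from a pairwise alignment.

    Both inputs are rows of the same alignment, so they must have equal length.
    Columns containing a gap in either row produce no mapping entry.
    """
    if len(aligned_uniprot) != len(aligned_pdb):
        raise ValueError("aligned sequences must have the same length")

    mapping: Dict[int, int] = {}
    u, p = uniprot_start, pdb_start
    for a, b in zip(aligned_uniprot, aligned_pdb):
        if a != gap and b != gap:
            mapping[u] = p
        # Advance each counter only when its row holds a residue.
        if a != gap:
            u += 1
        if b != gap:
            p += 1
    return mapping


if __name__ == "__main__":
    print(residue_map("MKT-AYI", "-KTGAY-"))
```

In practice, PDB residue numbering is often non-sequential (insertion codes, missing residues), so a real mapping would carry the author-assigned residue identifiers rather than a simple counter.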
Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns.
We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this.
The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs.
Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema.
The primary mission of UniProt is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every 3 weeks and can be accessed online for searches or download at http://www.uniprot.org.
The primary mission of Universal Protein Resource (UniProt) is to support biological research by maintaining a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces freely accessible to the scientific community. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.
The ability to store and interconnect all available information on proteins is crucial to modern biological research. Accordingly, the Universal Protein Resource (UniProt) plays an increasingly important role by providing a stable, comprehensive, freely accessible central resource on protein sequences and functional annotation. UniProt is produced by the UniProt Consortium, formed in 2002 by the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt web site and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of three major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase and the UniProt Reference Clusters. An additional component consisting of metagenomic and environmental sequences has recently been added to UniProt to ensure availability of such sequences in a timely fashion. UniProt is updated and distributed on a bi-weekly basis and can be accessed online for searches or download at .
The Protein Circular Dichroism Data Bank (PCDDB) is a public repository that archives and freely distributes circular dichroism (CD) and synchrotron radiation CD (SRCD) spectral data and their associated experimental metadata. All entries undergo validation and curation procedures to ensure completeness, consistency and quality of the data included. A web-based interface enables users to browse and query sample types, sample conditions and experimental parameters, and provides spectra both as graphical displays and as downloadable text files. The entries are linked, when appropriate, to primary sequence (UniProt) and structural (PDB) databases, to secondary databases such as the Enzyme Commission functional classification database and the CATH fold classification database, and to literature citations. The PCDDB is available at: http://pcddb.cryst.bbk.ac.uk.
Finding related conformations in the Protein Data Bank (PDB) is essential in many areas of bioscience. To assist this task, we designed a search engine that uses a compact database to quickly identify protein segments obeying a set of primary, secondary and tertiary structure constraints. The database contains information such as amino acid sequence, secondary structure, disulfide bonds, hydrogen bonds and atoms in contact as calculated from all protein structures in the PDB. The search engine parses the database and returns hits that match the queried parameters. The conformation search engine, which is notable for its high speed and interactive feedback, is expected to assist scientists in discovering conformation homologs and predicting protein structure. The engine is publicly available at http://ari.stanford.edu/psf and it will also be used in-house in an automatic mode aimed at discovering new protein motifs.
PAR-3D (http://sunserver.cdfd.org.in:8080/protease/PAR_3D/index.html) is a web-based tool that exploits the fact that the relative juxtaposition of active site residues is a conserved feature in functionally related protein families. The server uses previously calculated and stored values of geometrical parameters of a set of known proteins (training set) to predict active site residues in a query protein structure. PAR-3D stores motifs for different classes of proteases, the ten glycolytic pathway enzymes and metal-binding sites. The server accepts structures in PDB format. The first step in the prediction is the extraction of probable active site residues from the query structure. The spatial arrangement of the probable active site residues is then determined in terms of geometrical parameters. These are compared with the stored geometries of the different motifs. Its speed and efficiency make it a beneficial tool for structural genomics projects, especially when the biochemical function of the protein has not been characterized.
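The comparison of a candidate residue arrangement against stored geometries can be sketched in a few lines. The abstract does not specify which geometrical parameters PAR-3D uses, so this example assumes pairwise distances between one representative point per residue and a fixed tolerance. All names, coordinates and stored values are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Tuple

Point = Tuple[float, float, float]
PairDistances = Dict[Tuple[str, str], float]


def pairwise_distances(coords: Dict[str, Point]) -> PairDistances:
    """Distances between every pair of labelled points, keyed by sorted label pair."""
    return {
        (a, b): math.dist(coords[a], coords[b])
        for a, b in combinations(sorted(coords), 2)
    }


def matches_motif(candidate: Dict[str, Point], stored: PairDistances,
                  tolerance: float = 1.0) -> bool:
    """True if every candidate distance is within `tolerance` of the stored value."""
    observed = pairwise_distances(candidate)
    if observed.keys() != stored.keys():
        return False
    return all(abs(observed[pair] - stored[pair]) <= tolerance for pair in stored)


if __name__ == "__main__":
    # Hypothetical coordinates (in angstroms) for three candidate residues.
    candidate = {"HIS": (0.0, 0.0, 0.0), "SER": (3.0, 0.0, 0.0), "ASP": (0.0, 4.0, 0.0)}
    # Hypothetical stored geometry for a three-residue motif.
    stored = {("ASP", "HIS"): 4.2, ("ASP", "SER"): 5.5, ("HIS", "SER"): 3.1}
    print(matches_motif(candidate, stored, tolerance=1.0))
    print(matches_motif(candidate, stored, tolerance=0.3))
```

Because the stored values are computed once from the training set, each query reduces to a handful of distance calculations, which is consistent with the speed the abstract emphasises.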
The mission of UniProt is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive cross-references and querying interfaces. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. A key development at UniProt is the provision of complete, reference and representative proteomes. UniProt is updated and distributed every 4 weeks and can be accessed online for searches or download at http://www.uniprot.org.
The Structural Descriptor Database (SDDB) is a web-based tool that predicts the function of proteins and functional site positions based on the structural properties of related protein families. Structural alignments and functional residues of a known protein set (defined as the training set) are used to build special Hidden Markov Models (HMM) called HMM descriptors. SDDB uses previously calculated and stored HMM descriptors for predicting active sites, binding residues, and protein function. The database integrates biologically relevant data filtered from several databases such as PDB, PDBSUM, CSA and SCOP. It accepts queries in FASTA format and predicts functional residue positions, protein-ligand interactions, and protein function, based on the SCOP database.
To assess the performance of SDDB, we used different data sets. The Trypsin-like Serine protease data set assessed how well SDDB predicts functional sites when curated data is available. The SCOP family data set was used to analyze SDDB performance with training data extracted from PDBSUM (binding sites) and from CSA (active sites). The ATP-binding experiment was used to compare our approach with the most current method. For all evaluations, significant improvements were obtained with SDDB.
SDDB performed better when reliable training data was available. SDDB predicted active sites better than binding sites because the former are more conserved than the latter. Nevertheless, with our prediction method we obtained results with precision above 70%.
Motivation: We noted that the sumoylation site in C/EBP homologues is conserved beyond the canonical consensus sequence for sumoylation. Therefore, we investigated whether this pattern might define a more general protein motif.
Results: We undertook a survey of the human proteome using a regular expression based on the C/EBP motif. This revealed significant enrichment of the motif using different Gene Ontology terms (e.g. ‘transcription’) that pertain to the nucleus. When considering requirements for the motif to be functional (evolutionary conservation, structural accessibility of the motif and proper cell localization of the protein), more than 130 human proteins were retrieved from the UniProt/Swiss-Prot database. These candidates were particularly enriched in transcription factors, including FOS, JUN, Hif-1α, MLL2 and members of the KLF, MAF and NFATC families; chromatin modifiers like CHD-8, HDAC4 and DNA Top1; and the transcriptional regulatory kinases HIPK1 and HIPK2. The KEPE motif appears to be restricted to the metazoan lineage and has three length variants (short, medium and long), which do not appear to interchange.
Supplementary information: Supplementary data are available at Bioinformatics online.
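A regular-expression survey of this kind can be illustrated with a short scan. The extended pattern used in the study is not given in the abstract, so the sketch below uses only a simplified form of the canonical sumoylation consensus ΨKxE, with the hydrophobic position Ψ approximated as I, L or V. That approximation, the function name and the test sequence are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Simplified canonical sumoylation consensus: hydrophobic residue (I, L or V),
# lysine, any residue, glutamate. The lookahead allows overlapping matches.
CANONICAL_SUMO = re.compile(r"(?=([ILV]K.E))")


def scan_sumo_sites(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based start position, matched segment) for each consensus hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in CANONICAL_SUMO.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Hypothetical sequence, not taken from any database entry.
    print(scan_sumo_sites("MAVKQEPLKTEAA"))
```

A proteome-wide survey would apply the same scan to every sequence and then filter hits by the additional requirements listed in the abstract, such as conservation, accessibility and localization.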
Protein structures have conserved features, or motifs, which have a significant influence on protein function. These motifs can be found in sequence as well as in 3D space. Understanding these fragments is essential for 3D structure prediction, modelling and drug design. The Protein Data Bank (PDB) is the source of this information; however, present search tools have limited 3D options for integrating protein sequence with its 3D structure.
We describe here a web application for querying the PDB for ligands, binding sites, small 3D structural and sequence motifs, and the underlying database. Novel algorithms for chemical fragments, 3D motifs, ϕ/ψ sequences, super-secondary structure motifs and small 3D structural motif association searches are incorporated. The interface provides functionality for visualization, search criteria creation, and sequence and 3D multiple alignment options. MSDmotif is an integrated system where a results page is also a search form. A set of motif statistics is available for analysis. This set includes molecule and motif binding statistics, distribution of motif sequences, occurrence of an amino acid within a motif, correlation of amino acid side-chain charges within a motif and Ramachandran plots for each residue. The binding statistics are presented in association with properties that include a ligand fragment library. Access is also provided through the Distributed Annotation System (DAS) protocol. An additional entry point facilitates XML requests with XML responses.
MSDmotif is unique in combining chemical, sequence and 3D data in a single search engine with a range of search and visualisation options. It provides multiple views of data found in the PDB archive for exploring protein structures.
The Universal Protein Resource (UniProt) provides a stable, comprehensive, freely accessible, central resource on protein sequences and functional annotation. The UniProt Consortium is a collaboration between the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt website, and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of four major components, each optimized for different uses: the UniProt Knowledgebase, the UniProt Reference Clusters, the UniProt Archive and the UniProt Metagenomic and Environmental Sequences database. UniProt is updated and distributed every three weeks, and can be accessed online for searches or download at http://www.uniprot.org.
Structural motifs are important for the integrity of a protein fold and can be employed to design and rationalize protein engineering and folding experiments. Such segments represent the conserved core of a family or superfamily and can be crucial for the recognition of potential new members in sequence and structure databases. We present a database, MegaMotifBase, that compiles a set of important structural segments or motifs for protein structures. Motifs are recognized on the basis of both sequence conservation and preservation of important structural features such as amino acid preference, solvent accessibility, secondary structural content, hydrogen-bonding pattern and residue packing. This database provides 3D orientation patterns of the identified motifs in terms of inter-motif distances and torsion angles. Important applications of structural motifs are also provided in several crucial areas such as similar sequence and structure search, multiple sequence alignment and homology modeling. MegaMotifBase can be a useful resource for gaining knowledge about the structure-function relationships of proteins. The database can be accessed from the URL http://caps.ncbs.res.in/MegaMotifbase/index.html
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information that is essential for modern biological research. UniProt is produced by the UniProt Consortium which consists of groups from the European Bioinformatics Institute, the Protein Information Resource and the Swiss Institute of Bioinformatics. The core activities include manual curation of protein sequences assisted by computational analysis, sequence archiving, a user-friendly UniProt website and the provision of additional value-added information through cross-references to other databases. UniProt is comprised of four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. One of the key achievements of the UniProt consortium in 2008 is the completion of the first draft of the complete human proteome in UniProtKB/Swiss-Prot. This manually annotated representation of all currently known human protein-coding genes was made available in UniProt release 14.0 with 20 325 entries. UniProt is updated and distributed every three weeks and can be accessed online for searches or downloaded at www.uniprot.org.
TbTDPX (Trypanosoma brucei tryparedoxin-dependent peroxidase) is a genetically validated drug target in the fight against African sleeping sickness. Despite its similarity to members of the GPX (glutathione peroxidase) family, TbTDPX2 is functional as a monomer, lacks a selenocysteine residue, relies instead on peroxidatic and resolving cysteine residues for catalysis, and uses tryparedoxin rather than glutathione as electron donor. Kinetic studies indicate a saturable Ping Pong mechanism, unlike selenium-dependent GPXs, which display infinite Km and Vmax values. The structure of the reduced enzyme at 2.1 Å (0.21 nm) resolution reveals that the catalytic thiol groups are widely separated [19 Å (1.9 nm)] and thus unable to form a disulphide bond without a large conformational change in the secondary-structure architecture, as reported for certain plant GPXs. A model of the oxidized enzyme structure is presented and the implications for small-molecule inhibition are discussed.
dithiol-dependent peroxidase; drug discovery; glutathione peroxidase; Leishmania; Trypanosoma; trypanothione; GPX, glutathione peroxidase; His6, hexahistidine; Lm, Leishmania major; PEG, poly(ethylene glycol); Pt, Populus trichocarpa × deltoides (hybrid poplar); r.m.s.d., root mean square deviation; Tb, Trypanosoma brucei; TDPX, tryparedoxin-dependent peroxidase; TryX, tryparedoxin
β-Defensins comprise a family of cationic, antimicrobial and chemoattractant peptides. The six cysteine canonical motif is retained throughout evolution and the disulphide connectivities stabilise the conserved monomer structure. A murine β-defensin gene (Defr1) present in the main defensin cluster of C57BL/6 mice encodes a peptide with only five of the canonical six cysteine residues. In other inbred strains of mice, the allele encodes Defb8, which has the six cysteine motif. We show here that, in common with six cysteine β-defensins, defensin-related peptide 1 (Defr1) displays chemoattractant activity for CD4+ T cells and immature DC (iDC), but not mature DC or neutrophils. Murine Defb2 replicates this pattern of attraction. Defb8 is also able to attract iDC but not mature DC. Synthetic analogues of Defr1 with the six cysteines restored (Defr1 Y5C) or with only a single cysteine (Defr1-1cV) chemoattract CD4+ T cells with reduced activity, but do not chemoattract DC. β-Defensins have previously been shown to attract iDC through CC receptor 6 (CCR6), but neither Defr1 and its related peptides nor Defb8 chemoattract cells overexpressing CCR6. Thus, we demonstrate that the canonical six cysteines of β-defensins are not required for the chemoattractant activity of Defr1 and that neither Defr1 nor the six cysteine polymorphic variant allele Defb8 acts through CCR6.
β-Defensin; Chemotaxis; DC
One of the promising methods of protein structure prediction involves the use of amino acid sequence-derived patterns. Here we report on the creation of non-degenerate motif descriptors derived through data mining of training sets of residues taken from the transmembrane-spanning segments of polytopic proteins. These residues correspond to short regions in which there is a deviation from the regular α-helical character (i.e. π-helices, 3₁₀-helices and kinks). A ‘search engine’ derived from these motif descriptors correctly identifies, and discriminates amongst, instances of the above ‘non-canonical’ helical motifs contained in the SwissProt/TrEMBL database of protein primary structures. Our results suggest that deviations from α-helicity are encoded locally in sequence patterns only about 7–9 residues long and can be determined in silico directly from the amino acid sequence. Delineation of such variations in helical habit is critical to understanding the complex structure–function relationships of polytopic proteins and for drug discovery. The success of our current methodology foretells development of similar prediction tools capable of identifying other structural motifs from sequence alone. The method described here has been implemented and is available on the World Wide Web at http://cbcsrv.watson.ibm.com/Ttkw.html.
Summary: The TOPDOM database is a collection of domains and sequence motifs located consistently on the same side of the membrane in α-helical transmembrane proteins. The database was created by scanning well-annotated transmembrane protein sequences in the UniProt database by specific domain or motif detecting algorithms. The identified domains or motifs were added to the database if they were uniformly annotated on the same side of the membrane of the various proteins in the UniProt database. The information about the location of the collected domains and motifs can be incorporated into constrained topology prediction algorithms, like HMMTOP, increasing the prediction accuracy.
Availability: The TOPDOM database and the constrained HMMTOP prediction server are available on the page http://topdom.enzim.hu
Transmembrane Helices in Genome Sequences (THGS) is an interactive web-based database developed to search for transmembrane helices in gene sequences of interest to the user that are available in the Genome Database (GDB). The database also allows sequence motifs to be searched in transmembrane and globular proteins. In addition, a motif can be searched in other sequence databases (Swiss-Prot and PIR) or in the macromolecular structure database, the Protein Data Bank (PDB). Further, the 3D structure of the queried motif, if it is available among the solved protein structures deposited in the Protein Data Bank, can be visualized using the widely used graphics package RASMOL. All the sequence databases used in the present work are updated frequently and hence the results produced are up to date. THGS is freely available via the World Wide Web and can be accessed at http://pranag.physics.iisc.ernet.in/thgs/ or http://126.96.36.199/thgs/.
RNA secondary structure is important for designing therapeutics, understanding protein–RNA binding and predicting tertiary structure of RNA. Several databases and downloadable programs exist that specialize in the three-dimensional (3D) structure of RNA, but none focus specifically on secondary structural motifs such as internal, bulge and hairpin loops. The RNA Characterization of Secondary Structure Motifs (RNA CoSSMos) database is a freely accessible and searchable online database and website of 3D characteristics of secondary structure motifs. To create the RNA CoSSMos database, 2156 Protein Data Bank (PDB) files were searched for internal, bulge and hairpin loops, and each loop's structural information, including sugar pucker, glycosidic linkage, hydrogen bonding patterns and stacking interactions, was included in the database. False positives were defined, identified and reclassified or omitted from the database to ensure the most accurate results possible. Users can search via general PDB information, experimental parameters, sequence and specific motif and by specific structural parameters in the subquery page after the initial search. Returned results for each search can be viewed individually or a complete set can be downloaded into a spreadsheet to allow for easy comparison. The RNA CoSSMos database is automatically updated weekly and is available at http://cossmos.slu.edu.
Disulphide bonds between cysteine residues in proteins play a key role in protein folding, stability, and function. Loss of a disulphide bond is often associated with functional differentiation of the protein. The evolution of disulphide bonds is still actively debated; analysis of naturally occurring variants can promote understanding of the protein evolutionary process. One of the disulphide bond-containing protein families is the potato proteinase inhibitor II (PI-II, or Pin2, for short) superfamily, which is found in most solanaceous plants and participates in plant development, stress response, and defence. Each PI-II domain contains eight cysteine residues (8C), and two similar PI-II domains form a functional protein that has eight disulphide bonds and two non-identical reaction centres. It is still unclear which patterns and processes affect cysteine residue loss in PI-II. Through cDNA sequencing and data mining, we found six natural variants missing cysteine residues involved in one or two disulphide bonds at the first reaction centre. We named these variants Pi7C and Pi6C for the proteins missing one or two pairs of cysteine residues, respectively. This PI-II-7C/6C family was found exclusively in potato. The missing cysteine residues were in bonding pairs but distant from one another at the nucleotide/protein sequence level. The non-synonymous/synonymous substitution (Ka/Ks) ratio analysis suggested a positive evolutionary gene selection for Pi6C and various Pi7C. The selective deletion of the first reaction centre cysteine residues that are structure-level-paired but sequence-level-distant in PI-II illustrates the flexibility of PI-II domains and suggests the functionality of their transient gene versions during evolution.