Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns.
We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this.
The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs.
Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema.
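Because the motifs above are defined in part by backbone dihedral angles, it may help to see how a single dihedral angle is obtained from four atomic positions. The following is a minimal sketch in plain Python, not the code used by Motivated Proteins; the function name `dihedral_degrees` and the example coordinates are assumptions made for illustration only.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle (degrees, in (-180, 180]) about the p1-p2 bond.

    For the backbone angle phi the four points would be C(i-1), N(i),
    CA(i), C(i); for psi they would be N(i), CA(i), C(i), N(i+1).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / b2_len
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: a cis arrangement and a right-angle twist.
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
twist = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

A motif definition would then combine ranges of such angles at successive residues with the required hydrogen bonding partners.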
Protein structures have conserved features, known as motifs, which have a significant influence on protein function. These motifs can be found in sequence as well as in 3D space. An understanding of these fragments is essential for 3D structure prediction, modelling and drug design. The Protein Data Bank (PDB) is the source of this information; however, present search tools offer limited 3D options for integrating a protein sequence with its 3D structure.
We describe here a web application for querying the PDB for ligands, binding sites, small 3D structural and sequence motifs and the underlying database. Novel algorithms for chemical fragments, 3D motifs, ϕ/ψ sequences, super-secondary structure motifs and for small 3D structural motif association searches are incorporated. The interface provides functionality for visualization, search criteria creation, and sequence and 3D multiple alignment options. MSDmotif is an integrated system where a results page is also a search form. A set of motif statistics is available for analysis. This set includes molecule and motif binding statistics, distribution of motif sequences, occurrence of an amino acid within a motif, correlation of amino acid side-chain charges within a motif and Ramachandran plots for each residue. The binding statistics are presented in association with properties that include a ligand fragment library. Access is also provided through the Distributed Annotation System (DAS) protocol. An additional entry point facilitates XML requests with XML responses.
MSDmotif is unique in combining chemical, sequence and 3D data in a single search engine with a range of search and visualisation options. It provides multiple views of data found in the PDB archive for exploring protein structures.
Our manuscript presents a novel approach to protein structure analysis. We have organized an 8-dimensional data cube with protein 3D-structural information from 8706 high-resolution non-redundant protein chains with the aim of identifying packing rules at the amino acid pair level. The cube contains information about amino acid type, solvent accessibility, spatial and sequence distance, secondary structure and sequence length. We are able to pose structural queries to the data cube using the program ProPack. The response is a 1D, 2D or 3D graph. Although the response is of a statistical nature, the user can obtain an instant list of all PDB structures in which such a pair is found. The user may select a particular structure, which is displayed with the pair in question highlighted. The user may pose millions of different queries, each of which is answered in a few seconds. In order to demonstrate the capabilities of the data cube as well as the programs, we have selected well-known structural features, disulphide bridges and salt bridges, to illustrate how the queries are posed and how answers are given. Motifs involving cysteines such as disulphide bridges, zinc-fingers and iron-sulfur clusters are clearly identified and differentiated. ProPack also reveals that whereas pairs of Lys residues virtually never appear in close spatial proximity, pairs of Arg are abundant and appear at close spatial distance, in contrast with the belief that electrostatic repulsion would prevent this juxtaposition and with the perception of Arg-Lys as a conservative mutation. The presented programs can find and visualize novel packing preferences in protein structures, allowing the user to unravel correlations between pairs of amino acids. The new tools allow the user to view statistical information and instantly visualize the structures that underpin it, which is far from trivial with most other software tools for protein structure analysis.
New methods are described for finding recurrent three-dimensional (3D) motifs in RNA atomic-resolution structures. Recurrent RNA 3D motifs are sets of RNA nucleotides with similar spatial arrangements. They can be local or composite. Local motifs comprise nucleotides that occur in the same hairpin or internal loop. Composite motifs comprise nucleotides belonging to three or more different RNA strand segments or molecules. We use a base-centered approach to construct efficient, yet exhaustive search procedures using geometric, symbolic, or mixed representations of RNA structure that we implement in a suite of MATLAB programs, “Find RNA 3D” (FR3D). The first modules of FR3D preprocess structure files to classify base-pair and -stacking interactions. Each base is represented geometrically by the position of its glycosidic nitrogen in 3D space and by the rotation matrix that describes its orientation with respect to a common frame. Base-pairing and base-stacking interactions are calculated from the base geometries and are represented symbolically according to the Leontis/Westhof basepairing classification, extended to include base-stacking. These data are stored and used to organize motif searches. For geometric searches, the user supplies the 3D structure of a query motif which FR3D uses to find and score geometrically similar candidate motifs, without regard to the sequential position of their nucleotides in the RNA chain or the identity of their bases. To score and rank candidate motifs, FR3D calculates a geometric discrepancy by rigidly rotating candidates to align optimally with the query motif and then comparing the relative orientations of the corresponding bases in the query and candidate motifs. Given the growing size of the RNA structure database, it is impossible to explicitly compute the discrepancy for all conceivable candidate motifs, even for motifs with fewer than ten nucleotides.
The screening algorithm that we describe finds all candidate motifs whose geometric discrepancy with respect to the query motif falls below a user-specified cutoff discrepancy. This technique can be applied to RMSD searches. Candidate motifs identified geometrically may be further screened symbolically to identify those that contain particular basepair types or base-stacking arrangements or that conform to sequence continuity or nucleotide identity constraints. Purely symbolic searches for motifs containing user-defined sequence, continuity and interaction constraints have also been implemented. We demonstrate that FR3D finds all occurrences, both local and composite and with nucleotide substitutions, of sarcin/ricin and kink-turn motifs in the 23S and 5S ribosomal RNA 3D structures of the H. marismortui 50S ribosomal subunit and assigns the lowest discrepancy scores to bona fide examples of these motifs. The search algorithms have been optimized for speed to allow users to search the non-redundant RNA 3D structure database on a personal computer in a matter of minutes.
One of the major contributors to protein structure is the formation of disulphide bonds between selected pairs of
cysteines in the oxidized state. Prediction of such disulphide bridges from sequence is challenging because the number of possible
combinations of cysteine pairs grows rapidly as the number of cysteines in a protein increases. Here, we describe an SVM (support vector
machine) model for the prediction of cystine connectivity in a protein sequence with and without a priori knowledge of
the bonding state of the cysteines. We make use of a new encoding scheme based on physico-chemical properties and statistical features
(probability of occurrence of each amino acid residue in different secondary structure states along with PSI-BLAST profiles).
We evaluate our method on the SPX dataset (an extended dataset of SP39 (Swiss-Prot 39) and SP41 (Swiss-Prot 41) with known disulphide
information from PDB) and compare our results with the recursive neural network model described for the same
disulphide bridges; prediction; protein fold; SVM model; SPX dataset
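The combinatorial difficulty of connectivity prediction can be made concrete. If all 2n cysteines are known to be bonded, the number of distinct ways to pair them into n disulphide bridges is (2n)!/(2^n n!) = (2n-1)!!. The short sketch below only illustrates this count and is not part of the SVM method; the function name `connectivity_patterns` is an assumption made for illustration.

```python
from math import factorial


def connectivity_patterns(n_bonds: int) -> int:
    """Number of ways to pair 2*n_bonds bonded cysteines into n_bonds bridges.

    This is the number of perfect matchings on 2n labelled items:
    (2n)! / (2**n * n!), also written (2n - 1)!!.
    """
    if n_bonds < 0:
        raise ValueError("n_bonds must be non-negative")
    return factorial(2 * n_bonds) // (2 ** n_bonds * factorial(n_bonds))


# Growth of the search space with the number of disulphide bridges.
table = {n: connectivity_patterns(n) for n in range(1, 6)}
```

With two bridges there are only 3 candidate patterns, but with five bridges there are already 945, which is why exhaustive enumeration quickly becomes impractical.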
Disulphide bridges are well known to play key roles in the stability, folding and function of proteins. Introduction or deletion of disulphides by site-directed mutagenesis has produced varying effects on stability and folding, depending upon the protein and the location of the disulphide in the 3-D structure. Given the lack of complete understanding, it is worthwhile to learn from an analysis of the extent of conservation of disulphides in homologous proteins. We have also addressed the question of which structural interactions replace a disulphide of one homologue in another homologue.
Using a dataset involving 34,752 pairwise comparisons of homologous protein domains corresponding to 300 protein domain families of known 3-D structures, we provide a comprehensive analysis of the extent of conservation of disulphide bridges and their structural features. We report that only 54% of all the disulphide bonds compared between the homologous pairs are conserved, even though a small fraction of the non-conserved disulphides do include cytoplasmic proteins. Also, only about one fourth of the distinct disulphides are conserved in all the members of a protein family. We note that while conservation of disulphides is common in many families, disulphide bond mutations are quite prevalent. Interestingly, we note that there is no clear relationship between the sequence identity of two homologous proteins and disulphide bond conservation. Our analysis of structural features at the sites where cysteines forming a disulphide in one homologue are replaced by non-Cys residues shows that the elimination of a disulphide in a homologue need not always result in stabilizing interactions between equivalent residues.
We observe that in homologous proteins, disulphide bonds are conserved only to a modest extent. Very interestingly, we note that the extent of conservation of disulphides in homologous proteins is unrelated to the overall sequence identity between homologues. The non-conserved disulphides are often associated with variable structural features that were recruited for differentiation or specialisation of protein function.
Toll-like receptors (TLRs) play a central role in innate immunity. TLRs are membrane glycoproteins and contain leucine-rich repeat (LRR) motifs in the ectodomain. TLRs recognize and respond to molecules such as lipopolysaccharide, peptidoglycan, flagellin, and RNA from bacteria or viruses. The LRR domains in TLRs have been inferred to be responsible for molecular recognition. All LRRs include the highly conserved segment, LxxLxLxxNxL, in which "L" is Leu, Ile, Val, or Phe and "N" is Asn, Thr, Ser, or Cys and "x" is any amino acid. There are seven classes of LRRs including "typical" ("T") and "bacterial" ("S"). All known domain structures adopt an arc or horseshoe shape. Vertebrate TLRs form six major families. The repeat numbers of LRRs and their "phasing" in TLRs differ with isoforms and species; they are aligned differently in various databases. We identified and aligned LRRs in TLRs by a new method described here.
The new method utilizes known LRR structures to recognize and align new LRR motifs in TLRs and incorporates multiple sequence alignments and secondary structure predictions. TLRs from thirty-four vertebrates were analyzed. The repeat numbers of the LRRs range from 16 to 28. The LRRs found in TLRs frequently consist of LxxLxLxxNxLxxLxxxxF/LxxLxx ("T") and sometimes short motifs including LxxLxLxxNxLxxLPx(x)LPxx ("S"). The members of the TLR7 family (TLR7, TLR8, and TLR9) contain 27 LRRs. The LRRs at the N-terminal part have a super-motif of STT with about 80 residues. The super-repeat is represented by STTSTTSTT or _TTSTTSTT. The LRRs in TLRs form one or two horseshoe domains and are mostly flanked by two cysteine clusters including two or four cysteine residues.
Each of the six major TLR families is characterized by its constituent LRR motifs, their repeat numbers, and their patterns of cysteine clusters. The central parts of the TLR1 and TLR7 families and of TLR4 have more irregular or longer LRR motifs. These central parts are inferred to play a key role in the structure and/or function of their TLRs. Furthermore, the super-repeat in the TLR7 family suggests strongly that "bacterial" and "typical" LRRs evolved from a common precursor.
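The highly conserved segment LxxLxLxxNxL, with the residue classes given above, can be written directly as a regular expression. The sketch below is only a naive sequence scan and is not the structure-guided alignment method described here; the function name `find_lrr_segments` and the example sequence are hypothetical.

```python
import re

# "L" is Leu, Ile, Val or Phe; "N" is Asn, Thr, Ser or Cys; "x" is any residue.
L_CLASS = "[LIVF]"
N_CLASS = "[NTSC]"
LRR_CONSENSUS = (
    f"{L_CLASS}..{L_CLASS}.{L_CLASS}..{N_CLASS}.{L_CLASS}"  # LxxLxLxxNxL
)

# A lookahead makes the search report overlapping candidates as well.
_LRR_RE = re.compile(f"(?=({LRR_CONSENSUS}))")


def find_lrr_segments(sequence: str):
    """Return (start, segment) for every 11-residue window matching LxxLxLxxNxL.

    Start positions are 0-based. Matching the consensus alone will produce
    false positives; it is a first-pass filter, not an LRR assignment.
    """
    return [(m.start(), m.group(1)) for m in _LRR_RE.finditer(sequence.upper())]


# Hypothetical sequence containing one consensus-like segment.
hits = find_lrr_segments("AAALKELDLSNNKIAAA")
```

Because the consensus is short and degenerate, hits from such a scan would normally be checked against known LRR structures and secondary structure predictions, as the method above does.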
Disulphide bonds between cysteine residues in proteins play a key role in protein folding, stability, and function. Loss of a disulphide bond is often associated with functional differentiation of the protein. The evolution of disulphide bonds is still actively debated; analysis of naturally occurring variants can promote understanding of the protein evolutionary process. One of the disulphide bond-containing protein families is the potato proteinase inhibitor II (PI-II, or Pin2, for short) superfamily, which is found in most solanaceous plants and participates in plant development, stress response, and defence. Each PI-II domain contains eight cysteine residues (8C), and two similar PI-II domains form a functional protein that has eight disulphide bonds and two non-identical reaction centres. It is still unclear which patterns and processes affect cysteine residue loss in PI-II. Through cDNA sequencing and data mining, we found six natural variants missing cysteine residues involved in one or two disulphide bonds at the first reaction centre. We named these variants Pi7C and Pi6C for the proteins missing one or two pairs of cysteine residues, respectively. This PI-II-7C/6C family was found exclusively in potato. The missing cysteine residues were in bonding pairs but distant from one another at the nucleotide/protein sequence level. The non-synonymous/synonymous substitution (Ka/Ks) ratio analysis suggested a positive evolutionary gene selection for Pi6C and various Pi7C. The selective deletion of the first reaction centre cysteine residues that are structure-level-paired but sequence-level-distant in PI-II illustrates the flexibility of PI-II domains and suggests the functionality of their transient gene versions during evolution.
We present the development of a web server, a protein short motif search tool that allows users to simultaneously search for a
protein sequence motif and its secondary structure assignments. The web server is able to run very short motif searches against
PDB structural data from the RCSB Protein Databank, with the users defining the type of secondary structures of the amino acids
in the sequence motif. The output utilises 3D visualisation ability that highlights the position of the motif in the structure and on
the corresponding sequence. Researchers can easily observe the locations and conformation of multiple motifs among the results.
Protein short motif search also has an application programming interface (API) for interfacing with other bioinformatics tools.
The database is available for free at http://birg3.fbb.utm.my/proteinsms
Protein short motif search; protein secondary structure; visualization; application programming interface (API)
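The idea of searching a sequence motif together with its secondary structure assignment can be sketched as follows. This is a minimal illustration under stated assumptions and does not use the server's API: the function name `find_motif_with_structure`, the one-letter secondary structure string (H for helix, E for strand, C for coil) and the `?` wildcard are all hypothetical choices.

```python
def find_motif_with_structure(sequence, structure, motif, ss_pattern):
    """Return 0-based start positions where `motif` occurs in `sequence`
    and the aligned secondary structure matches `ss_pattern`.

    `structure` holds one secondary structure letter per residue.
    In `ss_pattern`, '?' accepts any secondary structure state.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    if not motif or len(motif) != len(ss_pattern):
        raise ValueError("motif and ss_pattern must be non-empty and equal in length")

    hits = []
    start = sequence.find(motif)
    while start != -1:
        window = structure[start:start + len(motif)]
        if all(p == "?" or p == s for p, s in zip(ss_pattern, window)):
            hits.append(start)
        # Continue one residue further so overlapping occurrences are kept.
        start = sequence.find(motif, start + 1)
    return hits


# Hypothetical chain: the motif GSG occurs twice, once in a helix, once in coil.
seq = "MKGSGAAKGSG"
sst = "CCHHHHHCCCC"
helical_hits = find_motif_with_structure(seq, sst, "GSG", "HHH")
all_hits = find_motif_with_structure(seq, sst, "GSG", "???")
```

Restricting the same three-residue motif to helical positions removes the second occurrence, which is the kind of filtering that makes very short motifs searchable at all.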
The RNA Bricks database (http://iimcb.genesilico.pl/rnabricks) stores information about recurrent RNA 3D motifs and their interactions, found in experimentally determined RNA structures and in RNA–protein complexes. In contrast to other similar tools (RNA 3D Motif Atlas, RNA Frabase, Rloom), RNA motifs, i.e. ‘RNA bricks’, are presented in the molecular environment in which they were determined, including RNA, protein, metal ions, water molecules and ligands. All nucleotide residues in RNA bricks are annotated with structural quality scores that describe real-space correlation coefficients with the electron density data (if available), backbone geometry and possible steric conflicts, which can be used to identify poorly modeled residues. The database is also equipped with an algorithm for 3D motif search and comparison. The algorithm compares spatial positions of backbone atoms of the user-provided query structure and of stored RNA motifs, without relying on sequence or secondary structure information. This enables the identification of local structural similarities among evolutionarily related and unrelated RNA molecules. In addition, the search utility enables searching ‘RNA bricks’ according to sequence similarity, and makes it possible to identify motifs with modified ribonucleotide residues at specific positions.
The small leucine-rich repeat proteins and proteoglycans (SLRPs) form an important family of regulatory molecules that participate in many essential functions. They typically control the correct assembly of collagen fibrils, regulate mineral deposition in bone, and modulate the activity of potent cellular growth factors through many signalling cascades. SLRPs belong to the group of extracellular leucine-rich repeat proteins that are flanked at both ends by disulphide-bonded caps that protect the hydrophobic core of the terminal repeats. A capping motif specific to SLRPs has been recently described in the crystal structures of the core proteins of decorin and biglycan. This motif, designated as LRRCE, differs in both sequence and structure from other, more widespread leucine-rich capping motifs. To investigate if the LRRCE motif is a common structural feature found in other leucine-rich repeat proteins, we have defined characteristic sequence patterns and used them in genome-wide searches.
The LRRCE motif is a structural element exclusive to the main group of SLRPs. It appears to have evolved during early chordate evolution and is not found in protein sequences from non-chordate genomes. Our search has expanded the family of SLRPs to include new predicted protein sequences, mainly in fishes but with intriguing putative orthologs in mammals. The chromosomal locations of the newly predicted SLRP genes would support the large-scale genome or gene duplications that are thought to have occurred during vertebrate evolution. From this expanded list we describe a new class of SLRP sequences that could be representative of an ancestral SLRP gene.
Given its exclusivity the LRRCE motif is a useful annotation tool for the identification and classification of new SLRP sequences in genome databases. The expanded list of members of the SLRP family offers interesting insights into early vertebrate evolution and suggests an early chordate evolutionary origin for the LRRCE capping motif.
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data.
CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at .
Searching for structural motifs across known protein structures can be useful for identifying unrelated proteins with similar function and characterising secondary structures such as β-sheets. This is infeasible using conventional sequence alignment because linear protein sequences do not contain spatial information. β-residue motifs are β-sheet substructures that can be represented as graphs and queried using existing graph indexing methods; however, these approaches are designed for general graphs, do not incorporate the inherent structural constraints of β-sheets, and require computationally expensive filtering and verification procedures. 3D substructure search methods, on the other hand, allow β-residue motifs to be queried in a three-dimensional context but at significant computational cost.
We developed a new method for querying β-residue motifs, called BetaSearch, which leverages the natural planar constraints of β-sheets by indexing them as 2D matrices, thus avoiding much of the computational complexities involved with structural and graph querying. BetaSearch exhibits faster filtering, verification, and overall query time than existing graph indexing approaches whilst producing comparable index sizes. Compared to 3D substructure search methods, BetaSearch achieves 33 and 240 times speedups over index-based and pairwise alignment-based approaches, respectively. Furthermore, we have presented case-studies to demonstrate its capability of motif matching in sequentially dissimilar proteins and described a method for using BetaSearch to predict β-strand pairing.
We have demonstrated that BetaSearch is a fast method for querying substructure motifs. The improvements in speed over existing approaches make it useful for efficiently performing high-volume exploratory querying of possible protein substructural motifs or conformations. BetaSearch was used to identify a nearly identical β-residue motif between an entirely synthetic (Top7) and a naturally-occurring protein (Charcot-Leyden crystal protein), as well as identifying structural similarities between biotin-binding domains of avidin, streptavidin and the lipocalin gamma subunit of human C8.
High molecular weight thioredoxin reductases (TRs) are pyridine nucleotide disulfide oxidoreductases that catalyze the reduction of the disulfide bond of thioredoxin (Trx). It is Trx that is responsible for reducing multiple protein disulfide targets in the cell. TRs utilize NADPH as the source of reducing equivalents to reduce a bound flavin prosthetic group, which in turn reduces an N-terminal redox center that has the conserved sequence CICVNVGCCT, where CIC is denoted as the interchange thiol and CCT is the thiol involved in charge-transfer complexation. The reduced N-terminal redox center reduces a C-terminal redox center on the opposite subunit of the head-to-tail homodimer. It is the C-terminal redox center that catalyzes the reduction of the Trx-disulfide. Variations in the amino acid sequence of the C-terminal redox center differentiate high molecular weight TRs into different types. Type Ia TRs have tetrapeptide C-terminal redox centers of sequence GCUG, where U is the rare amino acid selenocysteine (Sec), while the tetrapeptide sequence in type Ib TRs replaces the Sec residue with a conventional cysteine (Cys) residue and can use small polar amino acids such as serine and threonine in place of the flanking glycine residues. The TR from P. falciparum (PfTR) is similar in structure and mechanism to type Ia and type Ib TRs except that the C-terminal redox center is different in its amino acid sequence. The C-terminal redox center of PfTR has the sequence G534CGGGKCG541 and we classify it as a type II high molecular weight TR. The oxidized type II redox motif will form a 20-membered disulfide ring, while the absence of spacer amino acids in the type I motif results in the formation of a rare 8-membered ring. We used site-directed mutagenesis and protein semisynthesis to investigate features of the distinctive type II C-terminal redox motif that help it perform catalysis.
Deletion of Gly541 reduces thioredoxin-reductase activity by ~50-fold, most likely due to disruption of an important hydrogen bond between the amide N-H of Gly541 and the carbonyl of Gly534 that helps to stabilize the β-turn-β motif. Alterations of the 20-membered disulfide ring by amino acid deletion or substitution resulted in impaired catalytic activity. Subtle changes in the ring structure and size via homocysteine for cysteine substitution using semisynthesis also caused significant reductions in catalytic activity, demonstrating the importance of the disulfide ring’s geometry in making the C-terminal redox center reactive for thiol/disulfide exchange. The data suggested to us that the transfer of electrons from the N-terminal redox center to the C-terminal redox center may be rate limiting. We propose that the transfer of electrons from the N-terminal redox center in PfTR to the type II C-terminal disulfide is accelerated by the use of an “electrophilic activation” mechanism. In this electrophilic activation mechanism, the type II C-terminal disulfide is polarized, making the sulfur atom of Cys540 electron deficient, highly electrophilic, and activated for thiol/disulfide exchange with the N-terminal redox center. This hypothesis was investigated by constructing chimeric PfTR mutant enzymes containing C-terminal type I sequences GCCG and GCUG, respectively. The PfTR-GCCG chimera had 500-fold less thioredoxin-reductase activity than the native enzyme, but still reduced selenocystine and lipoic acid efficiently. The PfTR-GCUG chimera had higher catalytic activity than the native enzyme with Trx, selenocystine, and lipoic acid as substrates. 
The results suggested to us that: (i) Sec in the mutant enzyme accelerated the rate of thiol/disulfide exchange between the N- and C-terminal redox centers, (ii) the type II redox center evolved for efficient catalysis utilizing Cys instead of Sec, and (iii) the type II redox center of PfTR is partly responsible for substrate recognition of the cognate PfTrx substrate relative to non-cognate thioredoxins.
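The ring sizes quoted for the type I and type II motifs follow from a simple atom count. Two adjacent redox-active residues close a ring of eight atoms (S, Cβ, Cα and C of the first residue; N, Cα, Cβ and S, or Se for Sec, of the second), and every spacer residue between them adds its three backbone atoms (N, Cα, C). The sketch below only reproduces this arithmetic; the function names are hypothetical and the counting scheme is the assumption just stated.

```python
def spacer_count(motif: str, redox_residues: str = "CU") -> int:
    """Number of residues between the two redox-active residues of a motif.

    `motif` is a one-letter sequence containing exactly two residues from
    `redox_residues` (C for Cys, U for Sec).
    """
    positions = [i for i, aa in enumerate(motif) if aa in redox_residues]
    if len(positions) != 2:
        raise ValueError("motif must contain exactly two redox-active residues")
    return positions[1] - positions[0] - 1


def ring_size(n_spacers: int) -> int:
    """Atoms in the ring closed by the S-S (or S-Se) bond.

    8 atoms for adjacent residues, plus 3 backbone atoms per spacer residue.
    """
    if n_spacers < 0:
        raise ValueError("n_spacers must be non-negative")
    return 8 + 3 * n_spacers


type_i = ring_size(spacer_count("GCUG"))       # type Ia tetrapeptide
type_ii = ring_size(spacer_count("GCGGGKCG"))  # PfTR residues 534-541
```

For GCUG and GCCG there are no spacer residues, giving the 8-membered ring, while the four spacers in GCGGGKCG give 8 + 3 × 4 = 20.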
High-molecular mass thioredoxin reductases (TRs) are pyridine nucleotide
disulfide oxidoreductases that catalyze the reduction of the disulfide
bond of thioredoxin (Trx). Trx is responsible for reducing multiple
protein disulfide targets in the cell. TRs utilize reduced β-nicotinamide
adenine dinucleotide phosphate to reduce a bound flavin prosthetic
group, which in turn reduces an N-terminal redox center that has the
conserved sequence CICVNVGCCT, where CIC is denoted as the interchange thiol while the thiol involved in
charge-transfer complexation is denoted as CCT. The reduced
N-terminal redox center reduces a C-terminal redox center on the opposite
subunit of the head-to-tail homodimer, the C-terminal redox center
that catalyzes the reduction of the Trx-disulfide. Variations in the
amino acid sequence of the C-terminal redox center differentiate high-molecular
mass TRs into different types. Type Ia TRs have tetrapeptide C-terminal
redox centers with a GCUG sequence, where U is the rare amino acid
selenocysteine (Sec), while the tetrapeptide sequence in type Ib TRs
has its Sec residue replaced with a conventional cysteine (Cys) residue
and can use small polar amino acids such as serine and threonine in
place of the flanking glycine residues. The TR from Plasmodium
falciparum (PfTR) is similar in structure and mechanism to
type Ia and type Ib TRs except that the C-terminal redox center is
different in its amino acid sequence. The C-terminal redox center
of PfTR has the sequence G534CGGGKCG541, and
we classify it as a type II high-molecular mass TR. The oxidized type
II redox motif will form a 20-membered disulfide ring, whereas the
absence of spacer amino acids in the type I motif results in the formation
of a rare eight-membered ring. We used site-directed mutagenesis and
protein semisynthesis to investigate features of the distinctive type
II C-terminal redox motif that help it perform catalysis. Deletion
of Gly541 reduces thioredoxin reductase activity by ∼50-fold,
most likely because of disruption of an important hydrogen bond between
the amide NH group of Gly541 and the carbonyl of Gly534 that helps
to stabilize the β-turn-β motif. Alterations of the 20-membered
disulfide ring either by amino acid deletion or by substitution resulted
in impaired catalytic activity. Subtle changes in the ring structure
and size caused by using semisynthesis to substitute homocysteine
for cysteine also caused significant reductions in catalytic activity,
demonstrating the importance of the disulfide ring’s geometry
in making the C-terminal redox center reactive for thiol–disulfide
exchange. The data suggested to us that the transfer of electrons
from the N-terminal redox center to the C-terminal redox center may
be rate-limiting. We propose that the transfer of electrons from the
N-terminal redox center in PfTR to the type II C-terminal disulfide
is accelerated by the use of an “electrophilic activation”
mechanism. In this mechanism, the type II C-terminal disulfide is
polarized, making the sulfur atom of Cys540 electron deficient, highly
electrophilic, and activated for thiol–disulfide exchange with
the N-terminal redox center. This hypothesis was investigated by constructing
chimeric PfTR mutant enzymes containing C-terminal type I sequences
GCCG and GCUG, respectively. The PfTR-GCCG chimera had 500-fold less
thioredoxin reductase activity than the native enzyme but still reduced
selenocystine and lipoic acid efficiently. The PfTR-GCUG chimera had
higher catalytic activity than the native enzyme with Trx, selenocystine,
and lipoic acid as substrates. The results suggested to us that (i)
Sec in the mutant enzyme accelerated the rate of thiol–disulfide
exchange between the N- and C-terminal redox centers, (ii) the type
II redox center evolved for efficient catalysis utilizing Cys instead
of Sec, and (iii) the type II redox center of PfTR is partly responsible
for substrate recognition of the cognate PfTrx substrate relative
to noncognate thioredoxins.
Many characterised proteins contain metal ions, small organic molecules or modified residues. In contrast, the huge amount of data generated by genome projects consists exclusively of sequences with almost no annotation. One of the goals of the structural genomics initiative is to provide representative three-dimensional (3-D) structures for as many protein/domain folds as possible to allow successful homology modelling. However, important functional features such as metal co-ordination or a type of prosthetic group are not always conserved in homologous proteins. So far, the problem of correct annotation of bioinorganic proteins has been largely ignored by the bioinformatics community and information on bioinorganic centres obtained by methods other than crystallography or NMR is only available in literature databases.
COMe (Co-Ordination of Metals) represents the ontology for bioinorganic and other small molecule centres in complex proteins. COMe consists of three types of entities: 'bioinorganic motif' (BIM), 'molecule' (MOL), and 'complex proteins' (PRX), with each entity being assigned a unique identifier. A BIM consists of at least one centre (metal atom, inorganic cluster, organic molecule) and two or more endogenous and/or exogenous ligands. BIMs are represented as one-dimensional (1-D) strings and 2-D diagrams. A MOL entity represents a 'small molecule' which, when in complex with one or more polypeptides, forms a functional protein. The PRX entities refer to the functional proteins as well as to separate protein domains and subunits. The complex proteins in COMe are subdivided into three categories: (i) metalloproteins, (ii) organic prosthetic group proteins and (iii) modified amino acid proteins. The data are currently stored in both XML format and a relational database and are available at .
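The three entity types described above can be pictured as simple records. The following is only a minimal sketch under stated assumptions: the class names, field names and identifier strings are hypothetical and do not reflect the actual COMe schema or its identifier format. The only rules taken from the text are that a BIM has at least one centre and two or more ligands, and that complex proteins fall into three categories.

```python
from dataclasses import dataclass

# The three categories of complex proteins named in the text.
CATEGORIES = (
    "metalloprotein",
    "organic prosthetic group protein",
    "modified amino acid protein",
)


@dataclass
class Bim:
    """Bioinorganic motif: one or more centres plus two or more ligands."""
    identifier: str  # hypothetical identifier, not a real COMe ID
    centres: list
    ligands: list

    def __post_init__(self):
        if len(self.centres) < 1:
            raise ValueError("a BIM needs at least one centre")
        if len(self.ligands) < 2:
            raise ValueError("a BIM needs two or more ligands")


@dataclass
class Mol:
    """Small molecule that forms a functional protein with polypeptides."""
    identifier: str
    name: str


@dataclass
class Prx:
    """Complex protein (or a separate domain or subunit)."""
    identifier: str
    name: str
    category: str
    motifs: list

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError("unknown category: " + self.category)


# Illustrative toy records only.
toy_bim = Bim("BIM-toy-1", centres=["Fe"], ligands=["Cys", "Cys", "Cys", "Cys"])
toy_prx = Prx("PRX-toy-1", "toy iron protein", "metalloprotein", [toy_bim])
print(toy_prx.category, toy_prx.motifs[0].centres)
```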
COMe provides the classification of proteins according to their 'bioinorganic' features and thus is orthogonal to other classification schemes, such as those based on sequence similarity, 3-D fold, enzyme activity, or biological process. The hierarchical organisation of the controlled vocabulary allows both for annotation and querying at different levels of granularity.
In Archaea and Bacteria, the repeated elements called CRISPRs, for "clustered regularly interspaced short palindromic repeats", are believed to participate in the defence against viruses. Short sequences called spacers are stored between the repeated elements. In the current model, motifs comprising spacers and repeats may target an invading DNA and lead to its degradation through a proposed mechanism similar to RNA interference. Analysis of intra-species polymorphism shows that new motifs (one spacer and one repeated element) are added in a polarised fashion. Although their principal characteristics have been described, much remains to be discovered about how CRISPRs are created and evolve. As new genome sequences become available, automated scanning tools are needed to make CRISPR-related information available and to facilitate further investigations.
We have produced a program, CRISPRFinder, which identifies CRISPRs and extracts the repeated and unique sequences. Using this software, a database is constructed which is automatically updated monthly from newly released genome sequences. Additional tools were created to allow the alignment of flanking sequences in search of similarities between different loci and to build dictionaries of unique sequences. To date, almost six hundred CRISPRs have been identified in 475 published genomes. Two Archaea out of thirty-seven and about half of Bacteria do not possess a CRISPR. Fine analysis of repeated sequences strongly supports the current view that new motifs are added at one end of the CRISPR adjacent to the putative promoter.
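To make the idea of extracting repeated and unique sequences concrete, the sketch below scans a DNA string for an exact k-mer that recurs at spacer-sized intervals and returns the repeat, its start positions and the intervening spacers. This is not the CRISPRFinder algorithm: exact matching, the fixed repeat length, the spacer length window and the function name are all simplifying assumptions made for illustration, and the input sequence is invented.

```python
def find_crispr_like(seq, repeat_len=8, min_spacer=5, max_spacer=12, min_repeats=3):
    """Return (repeat, starts, spacers) for the longest chain of exact repeats.

    Returns None when no k-mer recurs at least min_repeats times with
    spacer-sized gaps. Exact matching only (an assumption of this sketch).
    """
    best = None
    for i in range(len(seq) - repeat_len + 1):
        repeat = seq[i:i + repeat_len]
        starts = [i]
        pos = i + repeat_len  # first position after the current repeat copy
        while True:
            # The next copy must start between min_spacer and max_spacer
            # bases after the end of the previous copy.
            nxt = seq.find(repeat, pos + min_spacer, pos + max_spacer + repeat_len)
            if nxt == -1:
                break
            starts.append(nxt)
            pos = nxt + repeat_len
        if len(starts) >= min_repeats and (best is None or len(starts) > len(best[1])):
            spacers = [seq[a + repeat_len:b] for a, b in zip(starts, starts[1:])]
            best = (repeat, starts, spacers)
    return best


# Invented toy locus: three copies of one repeat separated by two spacers.
toy_repeat = "GTTCCAAT"
toy_locus = "TTTT" + toy_repeat + "ACGTAC" + toy_repeat + "TGGACGA" + toy_repeat + "AGAG"
result = find_crispr_like(toy_locus)
print(result)
```

A tool intended for real genomes would also need to tolerate mismatches between repeat copies and to search over a range of repeat lengths, which this sketch deliberately leaves out.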
It is hoped that the availability of a regularly updated public database that can be queried on the web will help in further dissecting and understanding the evolution of CRISPR structure and flanking sequences. Subsequent analyses of the intra-species CRISPR polymorphism will be facilitated by CRISPRFinder and the dictionary creator. CRISPRdb is accessible at
Several novel immunoglobulin-like transcripts (NILTs) which have previously been identified in the salmonid species rainbow trout (Oncorhynchus mykiss) contain either one or two extracellular Ig domains of the V-type. NILTs also possess either an immunoreceptor tyrosine-based activating motif (ITAM) or immunoreceptor tyrosine-based inhibitory motifs (ITIMs) in the cytoplasmic region resulting in different signalling abilities. Here we report for the first time the genomic organisation and structure of the multigene family of NILTs in Atlantic salmon (Salmo salar) using a BAC sequencing approach.
We have identified six novel Atlantic salmon NILT genes (Ssa-NILT1-6), two pseudogenes (Ssa-NILTp1 and Ssa-NILTp2) and seven genes encoding putative transposable elements in one BAC covering more than 200 kbp. Ssa-NILT1, 2, 4, 5 and 6 contain one Ig domain, all having a CX3C motif, whereas Ssa-NILT3 contains two Ig domains, having a CX6C motif in Ig1 and a CX7C motif in Ig2. Atlantic salmon NILTs possess several ITIMs in the cytoplasmic region and the ITIM-bearing exons are in phase 0. Amino acid sequence identity between the CX3C Ig domains of the NILTs ranges from 77% to 96%. Ssa-NILT1, 2, 3 and 4 were all confirmed to be expressed, either by their presence in EST databases (Ssa-NILT1) or by RT-PCR (Ssa-NILT2, 3 and 4) using cDNA as template. A survey of the repertoire of putative NILT genes in a single individual revealed three novel genes (Ssa-NILT7-9) represented by the Ig domain, which together with Ig domains from Ssa-NILT1-6 could be divided into different groups based on specific motifs.
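The CXnC notation used above denotes two cysteines separated by n arbitrary residues. As a minimal sketch, assuming one-letter amino acid sequences, such motifs can be located with a regular expression. The function name and the example sequence are hypothetical and are not taken from the salmon data.

```python
import re


def find_cxnc(sequence, n):
    """Return 0-based start positions of C-X(n)-C motifs, overlaps included."""
    # A lookahead reports every start position, even when motifs overlap.
    pattern = re.compile(r"(?=C[A-Z]{%d}C)" % n)
    return [match.start() for match in pattern.finditer(sequence)]


# Invented toy sequence, not a real NILT Ig domain.
toy_sequence = "MKCAYWCTTCLLLLLLC"
print(find_cxnc(toy_sequence, 3))
print(find_cxnc(toy_sequence, 6))
```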
This report reveals a tightly clustered, multigene NILT family in Atlantic salmon. By screening a highly redundant Atlantic salmon BAC library we have identified and characterised the genomic organisation of six genes encoding NILT receptors. The genes show similar characteristics to NILTs previously identified in rainbow trout, having highly conserved cysteines in the Ig domain and several inhibitory signalling motifs in the cytoplasmic region. In a single individual, three unique NILT Ig domain sequences were discovered at the genomic DNA level, which were divided into two different groups based on a four-residue motif after the third cysteine. Our results from the BAC screening and from the analysis of the repertoire of NILT genes in a single individual indicate that many genes of this expanding Ig-containing NILT family are still to be discovered in fish.
Tandem repetition of structural motifs in proteins is frequently observed across all forms of life. The topology of the repeating unit and its frequency of occurrence are associated with a wide range of structural and functional roles in diverse proteins, and defects in repeat proteins have been associated with a number of diseases. It is thus desirable to accurately identify the specific repeat type and its copy number. Weak evolutionary constraints on repeat units and insertions/deletions between them make their identification difficult at the sequence level, so structure-based approaches are desirable. The proposed graph spectral approach represents the protein structure as a graph in order to detect one of the most frequently observed structural repeats, the Ankyrin repeat.
It has been shown in a large number of studies that the 3-dimensional topology of a protein structure is well captured by a graph, making it possible to analyze a complex protein structure as a mathematical entity. In this study we show that the eigen spectra profile of a protein structure graph exhibits a unique repetitive profile for contiguous repeating units, enabling the detection of the repeat region and the repeat type. The proposed approach uses a non-redundant set of 58 Ankyrin proteins to define rules for the detection of Ankyrin repeat motifs. It is evaluated on a set of 370 proteins comprising 125 known Ankyrin proteins, the remainder being non-solenoid proteins, and the predictions are compared with UniProt annotation, a sequence-based approach (RADAR) and a structure-based approach (ConSole). To show the efficacy of the approach, we analyzed the complete PDB structural database and identified 641 previously unrecognized Ankyrin repeat proteins. We observe a unique eigen spectra profile for different repeat types and show that the method can be easily extended to detect other repeat types. It is implemented as a web server, AnkPred. It is freely available at ‘bioinf.iiit.ac.in/AnkPred’.
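As a rough illustration of what an eigen spectra profile of a protein structure graph can look like, the sketch below builds a residue contact graph from coordinates and computes the eigenvector of the largest adjacency eigenvalue by power iteration. It is not the AnkPred procedure: the choice of one point per residue, the 7 Å cutoff, the use of the principal eigenvector and the function names are assumptions made for this example, and the coordinates are an invented straight chain rather than a real structure.

```python
import math


def contact_graph(coords, cutoff=7.0):
    """Adjacency matrix with an edge where two residues lie within the cutoff."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def principal_eigenvector(adj, iterations=500):
    """Unit eigenvector for the largest eigenvalue, found by power iteration."""
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iterations):
        # Multiply by (A + I): the shift keeps the eigenvectors unchanged but
        # prevents oscillation on bipartite graphs such as a simple chain.
        nxt = [vec[i] + sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


# Invented toy chain: five points 3.8 Å apart, so only neighbours are in contact.
toy_coords = [(3.8 * k, 0.0, 0.0) for k in range(5)]
profile = principal_eigenvector(contact_graph(toy_coords))
print([round(x, 3) for x in profile])
```

Plotting such per-residue eigenvector components along the chain gives a profile in which repeated structural units would be expected to produce repeated shapes, which is the intuition the abstract describes.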
AnkPred provides an elegant and computationally efficient graph-based approach for detecting Ankyrin structural repeats in proteins. By analyzing the eigen spectra of the protein structure graph and secondary structure information, characteristic features of a known repeat family are identified. This method is especially useful in correctly identifying new members of a repeat family.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0440-9) contains supplementary material, which is available to authorized users.
Ankyrin repeat; Protein contact network; Graph theory
Posttranslational modifications (PTMs) are covalent chemical modifications of protein residues. They play important roles in modulating various biological functions. Current PTM databases contain important sequence annotations but do not provide an informative 3D structural resource for these modifications. The posttranslational modification structural database (PTM-SD) provides access to structurally solved modified residues that are experimentally annotated as PTMs. It combines PTM information and annotation gathered from other databases, e.g. the Protein DataBank for the protein structures, and dbPTM and PTMCuration for fine sequence annotation. PTM-SD gives an accurate detection of PTMs in structural data. PTM-SD can be browsed by PDB id, UniProt accession number, organism and classic PTM annotation. Advanced queries can also be performed on detailed PTM annotations, amino acid type, secondary structure, SCOP class classification, PDB chain length and number of PTMs per chain. Statistics and analyses can be computed on a selected dataset of PTMs. Each PTM entry is detailed in a dedicated page with information on the protein sequence and on the local conformation, described by secondary structure and Protein Blocks. PTM-SD gives valuable information on observed PTMs in protein 3D structures, which is of great interest for studying sequence–structure–function relationships in the light of PTMs, and could provide insights for comparative modeling and PTM prediction protocols.
Database URL: PTM-SD can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/PTM-SD/.
Structural motifs are important for the integrity of a protein fold and can be employed to design and rationalize protein engineering and folding experiments. Such conserved segments represent the conserved core of a family or superfamily and can be crucial for the recognition of potential new members in sequence and structure databases. We present a database, MegaMotifBase, that compiles a set of important structural segments or motifs for protein structures. Motifs are recognized on the basis of both sequence conservation and preservation of important structural features such as amino acid preference, solvent accessibility, secondary structural content, hydrogen-bonding pattern and residue packing. This database provides 3D orientation patterns of the identified motifs in terms of inter-motif distances and torsion angles. Important applications of structural motifs are also provided in several crucial areas such as similar sequence and structure search, multiple sequence alignment and homology modeling. MegaMotifBase can be a useful resource for gaining knowledge about the structure-function relationships of proteins. The database can be accessed from the URL http://caps.ncbs.res.in/MegaMotifbase/index.html
Finding related conformations in the Protein Data Bank (PDB) is essential in many areas of bioscience. To assist this task, we designed a search engine that uses a compact database to quickly identify protein segments obeying a set of primary, secondary and tertiary structure constraints. The database contains information such as amino acid sequence, secondary structure, disulfide bonds, hydrogen bonds and atoms in contact as calculated from all protein structures in the PDB. The search engine parses the database and returns hits that match the queried parameters. The conformation search engine, which is notable for its high speed and interactive feedback, is expected to assist scientists in discovering conformation homologs and predicting protein structure. The engine is publicly available at http://ari.stanford.edu/psf and it will also be used in-house in an automatic mode aimed at discovering new protein motifs.
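The kind of query described above, in which a segment must satisfy several constraints at once, can be sketched as a sliding-window match over parallel annotation strings. This is only an illustration under stated assumptions: the record layout, the wildcard convention ('X' matches anything), the function names and the toy entries are hypothetical and do not describe the actual search engine or its database format. Only primary and secondary structure constraints are shown; tertiary constraints such as hydrogen bonds or contacts are omitted.

```python
def matches(pattern, text):
    """Position-wise comparison in which 'X' in the pattern accepts anything."""
    return len(pattern) == len(text) and all(
        p in ("X", t) for p, t in zip(pattern, text)
    )


def search_segments(entries, seq_pattern, ss_pattern):
    """Return (entry_id, start) for every window meeting both constraints."""
    width = len(seq_pattern)
    if len(ss_pattern) != width:
        raise ValueError("patterns must have equal length")
    hits = []
    for entry_id, (seq, ss) in entries.items():
        for start in range(len(seq) - width + 1):
            window = slice(start, start + width)
            if matches(seq_pattern, seq[window]) and matches(ss_pattern, ss[window]):
                hits.append((entry_id, start))
    return hits


# Invented toy entries: (sequence, secondary structure as H/E/C per residue).
toy_db = {
    "toy1": ("MKGPLAEELLKK", "CCCCHHHHHHHH"),
    "toy2": ("AGPVAGPLTEEV", "EEECCCCCHHHH"),
}
# Gly-Pro followed by any two residues, in coil, ending at a helix start.
hits = search_segments(toy_db, "GPXX", "CCXH")
print(hits)
```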
Loops represent an important part of protein structures. The study of loops is critical for two main reasons. First, loops are often involved in protein function, stability and folding. Second, despite improvements in experimental and computational structure prediction methods, modeling the conformation of loops remains problematic. Here, we present a structural classification of loops, ArchDB, a mine of information with application in both of these fields: loop structure prediction and function prediction. ArchDB (http://sbi.imim.es/archdb) is a database of classified protein loop motifs. The current database provides four different classification sets tailored for different purposes. ArchDB-40 is a loop classification derived from SCOP40, well suited for modeling common loop motifs. Since features relevant to loop structure or function can be more easily determined on well-populated clusters, we have developed ArchDB-95, a loop classification derived from SCOP95. This new classification set shows a ~40% increase in the number of subclasses, and a 7-fold increase in the number of putative structure/function-related subclasses. We also present ArchDB-EC, a classification of loop motifs from enzymes, and ArchDB-KI, a manually annotated classification of loop motifs from kinases. Information about ligand contacts and PDB sites has been included in all classification sets. Improvements in our classification scheme are described, as well as several new database features, such as the ability to query by conserved annotations, by sequence similarity, or by uploading the 3D coordinates of a protein. The classified loops range in length from 0 to 36 residues. ArchDB offers an exhaustive sampling of loop structures. Functional information about loops and links with related biological databases are also provided.
All this information, together with the possibility to browse and query the database through a web server, makes ArchDB a useful tool for the comparative study of loops, for the analysis of loops involved in protein function and for obtaining templates for loop modeling.
function annotation; loop structure classification; loop modeling
The Structural Descriptor Database (SDDB) is a web-based tool that predicts the function of proteins and functional site positions based on the structural properties of related protein families. Structural alignments and functional residues of a known protein set (defined as the training set) are used to build special Hidden Markov Models (HMM) called HMM descriptors. SDDB uses previously calculated and stored HMM descriptors for predicting active sites, binding residues, and protein function. The database integrates biologically relevant data filtered from several databases such as PDB, PDBSUM, CSA and SCOP. It accepts queries in fasta format and predicts functional residue positions, protein-ligand interactions, and protein function, based on the SCOP database.
To assess the performance of SDDB, we used different data sets. The Trypsin-like Serine protease data set assessed how well SDDB predicts functional sites when curated data is available. The SCOP family data set was used to analyze SDDB performance with training data extracted from PDBSUM (binding sites) and from CSA (active sites). The ATP-binding experiment was used to compare our approach with the most current method. For all evaluations, significant improvements were obtained with SDDB.
SDDB performed better when reliable training data was available. SDDB was better at predicting active sites than binding sites because the former are more conserved than the latter. Nevertheless, by using our prediction method we obtained results with precision above 70%.
RNA secondary structure is important for designing therapeutics, understanding protein–RNA binding and predicting tertiary structure of RNA. Several databases and downloadable programs exist that specialize in the three-dimensional (3D) structure of RNA, but none focus specifically on secondary structural motifs such as internal, bulge and hairpin loops. The RNA Characterization of Secondary Structure Motifs (RNA CoSSMos) database is a freely accessible and searchable online database and website of 3D characteristics of secondary structure motifs. To create the RNA CoSSMos database, 2156 Protein Data Bank (PDB) files were searched for internal, bulge and hairpin loops, and each loop's structural information, including sugar pucker, glycosidic linkage, hydrogen bonding patterns and stacking interactions, was included in the database. False positives were defined, identified and reclassified or omitted from the database to ensure the most accurate results possible. Users can search via general PDB information, experimental parameters, sequence and specific motif and by specific structural parameters in the subquery page after the initial search. Returned results for each search can be viewed individually or a complete set can be downloaded into a spreadsheet to allow for easy comparison. The RNA CoSSMos database is automatically updated weekly and is available at http://cossmos.slu.edu.
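To illustrate what identifying a hairpin loop involves, the sketch below finds hairpin loops in a secondary structure written in dot-bracket notation. This is an assumption made for the example: the database described above was built by searching PDB files, not dot-bracket strings, and the function name and toy structure are hypothetical. A hairpin loop is taken here to be a base pair that encloses only unpaired nucleotides.

```python
def hairpin_loops(dot_bracket):
    """Return (open_index, close_index, loop_length) for each hairpin loop."""
    stack = []
    loops = []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            i = stack.pop()
            enclosed = dot_bracket[i + 1:j]
            # A hairpin closes when the pair encloses unpaired positions only.
            if enclosed and set(enclosed) == {"."}:
                loops.append((i, j, len(enclosed)))
    if stack:
        raise ValueError("unbalanced structure")
    return loops


# Invented toy structure with two stems, each capped by a hairpin loop.
toy_structure = "((...))..(((....)))"
print(hairpin_loops(toy_structure))
```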