Structural information is crucial in ribonucleic acid (RNA) analysis and functional annotation; nevertheless, how to include such structural data is still a debated problem. Dot-bracket notation is the most common and simplest representation of RNA secondary structures, but its simplicity also leads to ambiguity that requires further processing steps to resolve. Here we present BEAR (Brand nEw Alphabet for RNA), a new context-aware structural encoding represented by a string of characters. Each character in BEAR encodes a specific secondary structure element (loop, stem, bulge and internal loop) of a specific length. Furthermore, by exploiting this informative yet simple encoding in multiple alignments of related RNAs, we captured how much structural variation is tolerated in RNA families and converted it into transition rates among secondary structure elements. This allowed us to compute a substitution matrix for secondary structure elements called MBR (Matrix of BEAR-encoded RNA secondary structures), whose ability to align RNA secondary structures we tested. We propose BEAR and the MBR as powerful resources for RNA secondary structure analysis, comparison and classification, motif finding and phylogeny.
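The ambiguity of dot-bracket notation can be made concrete with a minimal Python sketch. It only parses a dot-bracket string into base pairs and groups stacked pairs into stems; it is not the BEAR encoding, whose alphabet is not reproduced here. The function names `parse_dot_bracket` and `stems` are hypothetical and chosen for illustration.

```python
def parse_dot_bracket(structure):
    """Return a sorted list of (i, j) base pairs from a dot-bracket string."""
    stack = []
    pairs = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return sorted(pairs)


def stems(pairs):
    """Group sorted pairs into stems: runs where (i, j) is followed by (i+1, j-1)."""
    result = []
    for i, j in pairs:
        if result and result[-1][-1] == (i - 1, j + 1):
            result[-1].append((i, j))
        else:
            result.append([(i, j)])
    return result


if __name__ == "__main__":
    example = "((..((...))..))"
    pairs = parse_dot_bracket(example)
    for stem in stems(pairs):
        print("stem of length", len(stem), "starting at pair", stem[0])
```

In the example string every unpaired position is written as the same `.` character, whether it belongs to the hairpin loop or to the internal loop; telling them apart requires the extra parsing step shown here, which is the kind of context a structure-aware alphabet is meant to carry directly.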
Fragile X syndrome (FXS), the leading cause of inherited intellectual disability, is caused by epigenetic silencing of the FMR1 gene, through expansion and methylation of a CGG triplet repeat (methylated full mutation). An antisense transcript (FMR1-AS1), starting from both the promoter and intron 2 of the FMR1 gene, was demonstrated in transcriptionally active alleles, but not in silent FXS alleles. Moreover, a DNA methylation boundary, which is lost in FXS, was recently identified upstream of the FMR1 gene. Several nuclear proteins bind to this region, such as the insulator protein CTCF. Here we demonstrate for the first time that rare unmethylated full mutation (UFM) alleles present the same boundary described in wild type (WT) alleles and that CTCF binds to this region, as well as to the FMR1 gene promoter, exon 1 and intron 2 binding sites. Conversely, DNA methylation prevents CTCF binding to FXS alleles. Drug-induced CpG demethylation does not restore this binding. CTCF knock-down experiments clearly established that CTCF does not act as an insulator at the active FMR1 locus, despite the presence of a CGG expansion. CTCF depletion induces a heterochromatic histone configuration of the FMR1 locus and results in a reduction of FMR1 transcription, which however is not accompanied by spreading of DNA methylation towards the FMR1 promoter. CTCF depletion is also associated with FMR1-AS1 mRNA reduction. Antisense RNA, like the sense transcript, is upregulated in UFM and absent in FXS cells, and its splicing is correlated with that of the FMR1 mRNA. We conclude that CTCF has a complex role in regulating FMR1 expression, probably through the organization of chromatin loops between sense/antisense transcriptional regulatory regions, as suggested by bioinformatics analysis.
Fragile X syndrome is the most common cause of inherited intellectual disability, affecting about 1:3000 males and 1:4000 females. It is caused by a dynamic mutation of FMR1, a gene mapping on the X chromosome and containing a CGG repeat in its promoter region. Expansion of this unstable sequence beyond 200 repeats (full mutation) is followed by DNA methylation and histone changes, leading to the transcriptional inactivation of FMR1 and to the lack of the FMRP protein. Recently, an antisense transcript (FMR1-AS1) spanning the CGG repeats and a region of transition of DNA methylation (boundary) located upstream of the CGG repeats have been identified in transcriptionally active FMR1 alleles. Several nuclear proteins bound to the methylation boundary have been described, such as the zinc-finger protein CTCF, the first known insulator in mammals. This protein is an important transcriptional regulator of genes harboring trinucleotide repeats and it is mostly active in chromatin organization. For the first time, we have investigated the role of the CTCF protein in the transcriptional regulation of the FMR1 gene. Our results define a complex role for CTCF acting through chromatin organization of the FMR1 locus.
Anecdotal evidence of the involvement of alternative splicing (AS) in the regulation of protein-protein interactions has been reported by several studies. AS events have been shown to occur significantly often in regions where a protein interaction domain or a short linear motif is present. Several AS variants show partial or complete loss of interface residues, suggesting that AS can play a major role in regulating interactions by selectively targeting the protein binding sites. In the present study we performed a statistical analysis of the alternative splicing of a non-redundant dataset of human protein-protein interfaces known at the molecular level, to determine how important AS is as a means of modulating protein-protein interactions.
Using a Cochran-Mantel-Haenszel chi-square test we demonstrated that the alternative splicing-mediated partial removal of both heterodimeric and homodimeric binding sites occurs at lower frequencies than expected. This holds true even if we consider only those isoforms whose sequence differs least from that of the canonical protein and which therefore allow functional regions of the protein to be selectively regulated. On the other hand, large removals of the binding site are not significantly prevented, possibly because they are associated with drastic structural changes of the protein. The observed protection of the binding sites from AS is not preferentially directed towards putative hot spot interface residues, and extends to all protein functional classes.
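For readers unfamiliar with the test named above, the following minimal Python sketch computes the textbook Cochran-Mantel-Haenszel statistic for a set of stratified 2x2 tables. It is an illustration only: the choice of strata, the table layout and the counts in the example are assumptions, not the data or the exact procedure of the study, and `cmh_test` is a hypothetical name.

```python
import math


def cmh_test(tables, continuity=True):
    """Cochran-Mantel-Haenszel chi-square test for K stratified 2x2 tables.

    tables: iterable of (a, b, c, d) counts, one tuple per stratum, read as
            [[a, b],
             [c, d]]
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    diff = 0.0  # sum over strata of observed minus expected count in cell a
    var = 0.0   # sum over strata of the hypergeometric variance of cell a
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if var == 0:
        raise ValueError("total variance is zero; the test is undefined")
    num = abs(diff)
    if continuity:
        num = max(num - 0.5, 0.0)
    stat = num * num / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Illustrative counts only (not from the study).
    example = [(20, 5, 5, 20), (12, 8, 7, 13)]
    print(cmh_test(example))
```

Stratifying in this way lets a single test pool evidence across groups of tables while controlling for the variable that defines the strata.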
Our findings indicate that protein-protein binding sites are generally protected from alternative splicing-mediated partial removals. However, some cases in which the binding site is selectively removed exist, and here we discuss one of them.
Alternative splicing; Protein-protein interaction; Hot spots; Protein three-dimensional structure; Disordered regions
The webPDBinder (http://pdbinder.bio.uniroma2.it/PDBinder) is a web server for the identification of small ligand-binding sites in a protein structure. webPDBinder searches a protein structure against a library of known binding sites and a collection of control non-binding pockets. The number of similarities identified with the residues in the two sets is then used to derive a propensity value for each residue of the query protein, associated with the likelihood that the residue is part of a ligand binding site. The predicted binding residues can be further refined using conservation scores derived from the multiple alignment of the PFAM protein family. webPDBinder correctly identifies residues belonging to the binding site in 77% of the cases and is able to identify binding pockets starting from holo or apo structures with comparable performance. This is important for all the real world cases where the query protein has been crystallized without a ligand and where it is also difficult to obtain clear similarities with bound pockets from holo pocket libraries. The input is either a PDB code or a user-submitted structure. The output is a list of predicted binding pocket residues with propensity and conservation values, in both text and graphical format.
Nucleos is a web server for the identification of nucleotide-binding sites in protein structures. Nucleos compares the structure of a query protein against a set of known template 3D binding sites representing nucleotide modules, namely the nucleobase, carbohydrate and phosphate. Structural features, clustering and conservation are used to filter and score the predictions. The predicted nucleotide modules are then joined to build whole nucleotide-binding sites, which are ranked by their score. The server takes as input either the PDB code of the query protein structure or a user-submitted structure in PDB format. The output of Nucleos is composed of ranked lists of predicted nucleotide-binding sites divided by nucleotide type (e.g. ATP-like). For each ranked prediction, Nucleos provides detailed information about the score, the template structure and the structural match for each nucleotide module composing the nucleotide-binding site. The predictions on the query structure and the template-binding sites can be viewed directly on the web through a graphical applet. In 98% of the cases, the modules composing correct predictions belong to proteins with no homology relationship between each other, meaning that the identification of brand-new nucleotide-binding sites is possible using information from non-homologous proteins. Nucleos is available at http://nucleos.bio.uniroma2.it/nucleos/.
The BITS2012 meeting, held in Catania on May 2-4, 2012, brought together almost 100 Italian researchers working in the field of Bioinformatics, as well as students in the same or related disciplines. About 90 original research works were presented either as oral communication or as posters, representing a landscape of Italian current research in bioinformatics.
This preface provides a brief overview of the meeting and introduces the manuscripts that were accepted for publication in this supplement, after a strict and careful peer-review by an International board of referees.
Nucleotides are involved in several cellular processes, ranging from the transmission of genetic information to energy transfer and storage. Both sequence and structure based methods have been developed to predict the location of nucleotide-binding sites in proteins. Here we propose a novel methodology that leverages the observation that nucleotide-binding sites have a modular structure. Nucleotides are composed of identifiable fragments, i.e. the phosphate, the nucleobase and the carbohydrate moieties. These fragments are bound by specific structural motifs that recur in proteins of different fold. Moreover, these motifs behave as modules and are found in different combinations across fold space. Our method predicts binding sites for each nucleotide fragment by comparing a query protein with a database of templates extracted from proteins of known structure. Whenever a similarity is found, the fragment bound by the template is transferred onto the query protein, thus identifying a putative binding site. Predictions falling inside the surface of the protein are discarded, and the remaining ones are scored using clustering and conservation. The method ranks a correct prediction first in 48%, 48% and 68% of the analyzed proteins for the nucleobase, carbohydrate and phosphate, respectively; when the first five predictions are considered, these figures rise to 71%, 65% and 86%, respectively. Furthermore, we attempted to reconstruct the full structure of the binding site, starting from the predicted positions of the fragments. We calculated that in 59% of the analyzed proteins the method ranks first a reconstructed binding site or a part of it. Finally, we tested the reliability of our method in a real world case in which it has to predict nucleotide-binding sites in unbound proteins. We analyzed proteins whose structure has been solved with and without the nucleotide and observed only small variations in the performance of the method.
Phosphatases control cell growth by a variety of mechanisms. A novel strategy is presented that combines multiparametric analysis of cell perturbations with logic modeling to achieve a detailed mapping of human phosphatase function on growth pathways.
siRNA-mediated downregulation of 298 phosphatase and phosphatase-related genes coupled to automated microscopy was used to characterize their impact on key growth pathways. In parallel, a literature-derived signed directed network was derived and optimized by training with experimental data. The resulting logic-based growth model was used to infer the cell state upon perturbation of each signaling node and compare it with the profiles obtained upon phosphatase perturbation. Mapping of 67% of the protein phosphatases onto the growth model shows that phosphatases are key modulators of growth pathways and affect cell-cycle progression. This novel approach is general and enables proteins to be mapped efficiently onto complex pathways.
Large-scale siRNA screenings allow linking the function of poorly characterized genes to phenotypic readouts. According to this strategy, genes are associated with a function of interest if the alteration of their expression perturbs the phenotypic readouts. However, given the intricacy of the cell regulatory network, the mapping procedure is low resolution and the resulting models provide little mechanistic insight. We have developed a new strategy that combines multiparametric analysis of cell perturbation with logic modeling to achieve a more detailed functional mapping of human genes onto complex pathways. A literature-derived optimized model is used to infer the cell activation state following upregulation or downregulation of the model entities. By matching this signature with the experimental profile obtained in the high-throughput siRNA screening it is possible to infer the target of each protein, thus defining its ‘entry point’ in the network. By this novel approach, 41 phosphatases that affect key growth pathways were identified and mapped onto a human epithelial cell-specific growth model, thus providing insights into the mechanisms underlying their function.
cancer; computational biology; functional genomics; imaging; modeling
The ability to predict immunogenic regions in selected proteins by in-silico methods has broad implications, such as allowing a quick selection of potential reagents to be used as diagnostics, vaccines, immunotherapeutics, or research tools in several branches of biological and biotechnological research. However, the prediction of antibody target sites in proteins using computational methodologies has proven to be a highly challenging task, which is likely due to the somewhat elusive nature of B-cell epitopes. This paper proposes a web-based platform for scoring potential immunological reagents based on the structures or 3D models of the proteins of interest. The method scores a protein’s peptides set, which is derived from a sliding window, based on the average solvent exposure, with a filter on the average local model quality for each peptide. The platform was validated on a custom-assembled database of 1336 experimentally determined epitopes from 106 proteins for which a reliable 3D model could be obtained through standard modeling techniques. Despite showing poor sensitivity, this method can achieve a specificity of 0.70 and a positive predictive value of 0.29 by combining these two simple parameters. These values are slightly higher than those obtained with other established sequence-based or structure-based methods that have been evaluated using the same epitopes dataset. This method is implemented in a web server called B-Pred, which is accessible at http://immuno.bio.uniroma2.it/bpred. The server contains a number of original features that allow users to perform personalized reagent searches by manipulating the sliding window’s width and sliding step, changing the exposure and model quality thresholds, and running sequential queries with different parameters. The B-Pred server should assist experimentalists in the rational selection of epitope antigens for a wide range of applications.
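The sliding-window scoring described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that per-residue solvent exposure and local model quality are already available as lists of numbers scaled between 0 and 1; the function name `score_peptides`, the default window width and step, and both thresholds are hypothetical and are not the values used by the B-Pred server.

```python
def score_peptides(exposure, quality, width=10, step=1,
                   min_exposure=0.5, min_quality=0.5):
    """Rank sliding-window peptides by mean solvent exposure.

    exposure, quality: per-residue values of equal length (assumed inputs).
    A window is kept only if its mean local model quality and its mean
    exposure both reach their thresholds (thresholds are assumptions).
    Returns a list of (start_index, mean_exposure), best first.
    """
    if len(exposure) != len(quality):
        raise ValueError("exposure and quality must have the same length")
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive")
    hits = []
    for start in range(0, len(exposure) - width + 1, step):
        mean_exp = sum(exposure[start:start + width]) / width
        mean_q = sum(quality[start:start + width]) / width
        if mean_q >= min_quality and mean_exp >= min_exposure:
            hits.append((start, mean_exp))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Toy values for illustration only.
    exposure = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
    quality = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
    print(score_peptides(exposure, quality, width=2, min_exposure=0.6))
```

Exposing the window width, step and thresholds as parameters mirrors the kind of user control the abstract describes, so that sequential queries can be run with different settings.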
B-cell epitopes; immunoinformatics; bioinformatics; web server; epitope prediction
This study is the first large-scale comparative analysis of multiple types of post-translational modifications in different eukaryotic species. The resulting network of co-evolving and functionally associated modifications reveals the global landscape of post-translational regulation.
In all, 115 149 non-redundant post-translational modifications (PTMs) of 13 different types were collected from 8 eukaryotes. Comparison of evolution speed reveals that carboxylation is the most conserved while SUMOylation is the fastest evolving PTM type. Co-evolution of PTM pairs that co-occur within proteins reveals a vastly interconnected global network of functionally associated PTM types in eukaryotes. Central to the network of functionally associated PTM types appear phosphorylation, acetylation, ubiquitination and O-linked glycosylation that control both temporal events and processes that govern protein localization.
Various post-translational modifications (PTMs) fine-tune the functions of almost all eukaryotic proteins, and co-regulation of different types of PTMs has been shown within and between a number of proteins. Aiming at a more global view of the interplay between PTM types, we collected modifications for 13 frequent PTM types in 8 eukaryotes, compared their speed of evolution and developed a method for measuring PTM co-evolution within proteins based on the co-occurrence of sites across eukaryotes. As many sites are still to be discovered, this is a considerable underestimate; yet, assuming that most co-evolving PTMs are functionally associated, we found that PTM types are vastly interconnected, forming a global network that comprises in human alone >50 000 residues in about 6000 proteins. We predict substantial PTM type interplay in secreted and membrane-associated proteins and in the context of particular protein domains and short linear motifs. The global network of co-evolving PTM types implies a complex and intertwined post-translational regulation landscape that is likely to regulate multiple functional states of many if not all eukaryotic proteins.
post-translational modifications; protein regulation; proteomics; PTM code; PTM crosstalk
The BITS2011 meeting, held in Pisa on June 20-22, 2011, brought together more than 120 Italian researchers working in the field of Bioinformatics, as well as students in Bioinformatics, Computational Biology, Biology, Computer Sciences, and Engineering, representing a landscape of Italian bioinformatics research.
This preface provides a brief overview of the meeting and introduces the peer-reviewed manuscripts that were accepted for publication in this Supplement.
Gene regulatory networks are widely used by biologists to describe the interactions among genes, proteins and other components at the intra-cellular level. Recently, a great effort has been devoted to give gene regulatory networks a formal semantics based on existing computational frameworks.
For this purpose, we consider Statecharts, which are a modular, hierarchical and executable formal model widely used to represent software systems. We use Statecharts for modeling small and recurring patterns of interactions in gene regulatory networks, called motifs.
We present an improved method for modeling gene regulatory network motifs using Statecharts and we describe the successful modeling of several motifs, including those which could not be modeled or whose models could not be distinguished using the method of a previous proposal.
We model motifs in an easy and intuitive way by taking advantage of the visual features of Statecharts. Our modeling approach is able to simulate some interesting temporal properties of gene regulatory network motifs: the delay in the activation and the deactivation of the "output" gene in the coherent type-1 feedforward loop, the pulse in the incoherent type-1 feedforward loop, the bistable nature of double positive and double negative feedback loops, the oscillatory behavior of the negative feedback loop, and the "lock-in" effect of positive autoregulation.
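The delay property of the coherent type-1 feedforward loop can be illustrated with a minimal Python sketch. Note that this is a plain synchronous Boolean simulation, not a Statecharts model and not the method of the paper; the update rules, the one-step response time and the name `simulate_c1_ffl` are simplifying assumptions made for illustration.

```python
def simulate_c1_ffl(x_signal, gate="and"):
    """Synchronous Boolean simulation of a coherent type-1 feedforward loop.

    X activates Y, and X and Y jointly regulate the output Z.
    gate="and": Z needs both X and Y; gate="or": either is enough.
    Returns the list of (x, y, z) states observed at each time step.
    """
    if gate not in ("and", "or"):
        raise ValueError("gate must be 'and' or 'or'")
    y, z = False, False
    trace = []
    for x in x_signal:
        trace.append((x, y, z))
        next_z = (x and y) if gate == "and" else (x or y)
        # Both genes respond one step after their inputs change.
        y, z = x, next_z
    return trace


if __name__ == "__main__":
    signal = [True] * 4 + [False] * 4
    for gate in ("and", "or"):
        output = [state[2] for state in simulate_c1_ffl(signal, gate)]
        print(gate, output)
```

Under these assumptions the AND gate switches the output on two steps after the input appears (one step later than direct regulation would) but switches it off after a single step, whereas the OR gate shows the opposite pattern, delaying only the deactivation.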
We present a Statecharts-based approach for the modeling of gene regulatory network motifs in biological systems. The basic motifs used to build more complex networks (that is, simple regulation, reciprocal regulation, feedback loop, feedforward loop, and autoregulation) can be faithfully described and their temporal dynamics can be analyzed.
The identification of ligand binding sites is a key task in the annotation of proteins with known structure but uncharacterized function. Here we describe a knowledge-based method exploiting the observation that unrelated binding sites share small structural motifs that bind the same chemical fragments irrespective of the nature of the ligand as a whole.
PDBinder compares a query protein against a library of binding and non-binding protein surface regions derived from the PDB. The results of the comparison are used to derive a propensity value for each residue which is correlated with the likelihood that the residue is part of a ligand binding site. The method was applied to two different problems: i) the prediction of ligand binding residues and ii) the identification of which surface cleft harbours the binding site. In both cases PDBinder performed consistently better than existing methods.
PDBinder has been trained on a non-redundant set of 1356 high-quality protein-ligand complexes and tested on a set of 239 holo and apo complex pairs. We obtained an MCC of 0.313 on the holo set with a PPV of 0.413 while on the apo set we achieved an MCC of 0.271 and a PPV of 0.372.
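For reference, the two performance measures quoted above follow the standard definitions, sketched here in Python. The confusion-matrix counts in the example are illustrative only and are not taken from the PDBinder evaluation; the function names are hypothetical.

```python
import math


def ppv(tp, fp):
    """Positive predictive value: fraction of predicted positives that are correct."""
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return tp / (tp + fp)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Illustrative counts: binding residues are usually a small minority,
    # which is why MCC is more informative than plain accuracy here.
    print(ppv(tp=40, fp=60), mcc(tp=40, fp=60, tn=850, fn=50))
```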
We show that PDBinder performs better than existing methods. The good performance on the unbound proteins is extremely important for real-world applications where the location of the binding site is unknown. Moreover, since our approach is orthogonal to those used in other programs, the PDBinder propensity value can be integrated in other algorithms further increasing the final performance.
Protein phosphorylation modulates protein function in organisms at all levels of complexity. Parasites of the Leishmania genus undergo various developmental transitions in their life cycle triggered by changes in the environment. The molecular mechanisms that these organisms use to process and integrate these external cues are largely unknown. However, Leishmania lacks transcription factors; therefore most regulatory processes may occur at a post-translational level, and phosphorylation has recently been demonstrated to be an important player in this process. Experimental identification of phosphorylation sites is a time-consuming task. Moreover, some sites could be missed due to the highly dynamic nature of this process or to difficulties in phospho-peptide enrichment.
Here we present PhosTryp, a phosphorylation site predictor specific for trypanosomatids. This method uses an SVM-based approach and has been trained with recent Leishmania phosphoproteomics data. PhosTryp achieved a 17% improvement in prediction performance compared with Netphos, a non organism-specific predictor. The analysis of the peptides correctly predicted by our method but missed by Netphos demonstrates that PhosTryp captures Leishmania-specific phosphorylation features. More specifically, our results show that Leishmania kinases have sequence specificities which are different from their counterparts in higher eukaryotes. Consequently, we were able to propose two possible Leishmania-specific phosphorylation motifs.
We further demonstrate that this improvement in performance extends to the related trypanosomatids Trypanosoma brucei and Trypanosoma cruzi. Finally, in order to maximize the usefulness of PhosTryp, we trained a predictor combining all the peptides from L. infantum, T. brucei and T. cruzi.
Our work demonstrates that training on organism-specific data results in an improvement that extends to related species. PhosTryp is freely available at http://phostryp.bio.uniroma2.it
Phosfinder is a web server for the identification of phosphate binding sites in protein structures. Phosfinder uses a structural comparison algorithm to scan a query structure against a set of known 3D phosphate binding motifs. Whenever a structural similarity between the query protein and a phosphate binding motif is detected, the phosphate bound by the known motif is added to the protein structure thus representing a putative phosphate binding site. Predicted binding sites are then evaluated according to (i) their position with respect to the query protein solvent-excluded surface and (ii) the conservation of the binding residues in the protein family. The server accepts as input either the PDB code of the protein to be analyzed or a user-submitted structure in PDB format. All the search parameters are user modifiable. Phosfinder outputs a list of predicted binding sites with detailed information about their structural similarity with known phosphate binding motifs, and the conservation of the residues involved. A graphical applet allows the user to visualize the predicted binding sites on the query protein structure. The results on a set of 52 apo/holo structure pairs show that the performance of our method is largely unaffected by ligand-induced conformational changes. Phosfinder is available at http://phosfinder.bio.uniroma2.it.
Nearly half of known protein structures interact with phosphate-containing ligands, such as nucleotides and other cofactors. Many methods have been developed for the identification of metal ion-binding sites and some for bigger ligands such as carbohydrates, but none is yet available for the prediction of phosphate-binding sites. Here we describe Pfinder, a method that predicts binding sites for phosphate groups, both in the form of ions and as parts of other non-peptide ligands, in proteins of known structure. Pfinder uses the Query3D local structural comparison algorithm to scan a protein structure for the presence of a number of structural motifs identified for their ability to bind the phosphate chemical group. Pfinder has been tested on a data set of 52 proteins for which both the apo and holo forms were available. We obtained at least one correct prediction in 63% of the holo structures and in 62% of the apo structures. The ability of Pfinder to recognize a phosphate-binding site in unbound protein structures makes it an ideal tool for functional annotation and for complementing docking and drug design methods. The Pfinder program is available at http://pdbfun.uniroma2.it/pfinder.
Phospho3D is a database of three-dimensional (3D) structures of phosphorylation sites (P-sites) derived from the Phospho.ELM database, which also collects information on the residues surrounding the P-site in space (3D zones). The database also provides the results of a large-scale structural comparison of the 3D zones versus a representative dataset of structures, thus associating to each P-site a number of structurally similar sites. The new version of Phospho3D presents an 11-fold increase in the number of 3D sites and incorporates several additional features, including new structural descriptors, the possibility of selecting non-redundant sets of 3D structures and the availability for download of non-redundant sets of structurally annotated P-sites. Moreover, it features P3Dscan, a new functionality that allows the user to submit a protein structure and scan it against the 3D zones collected in the Phospho3D database. Phospho3D version 2.0 is available at: http://www.phospho3d.org/.
Local structural comparison methods can be used to find structural similarities involving functional protein patches such as enzyme active sites and ligand binding sites. The outcome of such analyses is critically dependent on the representation used to describe the structure. Indeed different categories of functional sites may require the comparison program to focus on different characteristics of the protein residues. We have therefore developed superpose3D, a novel structural comparison software that lets users specify, with a powerful and flexible syntax, the structure description most suited to the requirements of their analysis. Input proteins are processed according to the user's directives and the program identifies sets of residues (or groups of atoms) that have a similar 3D position in the two structures. The advantages of using such a general purpose program are demonstrated with several examples. These test cases show that no single representation is appropriate for every analysis, hence the usefulness of having a flexible program that can be tailored to different needs. Moreover we also discuss how to interpret the results of a database screening where a known structural motif is searched against a large ensemble of structures. The software is written in C++ and is released under the open source GPL license. Superpose3D does not require any external library, runs on Linux, Mac OSX, Windows and is available at http://cbm.bio.uniroma2.it/superpose3D.
Cyclosporin A (CsA) has important anti-microbial activity against parasites of the genus Leishmania, suggesting CsA-binding cyclophilins (CyPs) as potential drug targets. However, no information is available on the genetic diversity of this important protein family, and the mechanisms underlying the cytotoxic effects of CsA on intracellular amastigotes are only poorly understood. Here, we performed a first genome-wide analysis of Leishmania CyPs and investigated the effects of CsA on host-free L. donovani amastigotes in order to elucidate the relevance of these parasite proteins for drug development.
Multiple sequence alignment and cluster analysis identified 17 Leishmania CyPs with significant sequence differences to human CyPs, but with highly conserved functional residues implicated in PPIase function and CsA binding. CsA treatment of promastigotes resulted in a dose-dependent inhibition of cell growth with an IC50 between 15 and 20 µM as demonstrated by proliferation assay and cell cycle analysis. Scanning electron microscopy revealed striking morphological changes in CsA-treated promastigotes reminiscent of developing amastigotes, suggesting a role for parasite CyPs in Leishmania differentiation. In contrast to promastigotes, CsA was highly toxic to amastigotes with an IC50 between 5 and 10 µM, revealing for the first time a direct lethal effect of CsA on the pathogenic mammalian stage linked to parasite thermotolerance, independent of host CyPs. Structural modeling, enrichment of CsA-binding proteins from parasite extracts by FPLC, and PPIase activity assays revealed direct interaction of the inhibitor with LmaCyP40, a bifunctional cyclophilin with potential co-chaperone function.
The evolutionary expansion of the Leishmania CyP protein family and the toxicity of CsA on host-free amastigotes suggest important roles of PPIases in parasite biology and implicate Leishmania CyPs in key processes relevant for parasite proliferation and viability. The requirement of Leishmania CyP functions for intracellular parasite survival and their substantial divergence from host CyPs define these proteins as prime drug targets.
Visceral leishmaniasis, also known as Kala Azar, is caused by the protozoan parasite Leishmania donovani. The L. donovani infectious cycle comprises two developmental stages, a motile promastigote stage that proliferates inside the digestive tract of the phlebotomine insect host, and a non-motile amastigote stage that differentiates inside the macrophages of mammalian hosts. Intracellular parasite survival in mouse and macrophage infection assays has been shown to be strongly compromised in the presence of the inhibitor cyclosporin A (CsA), which binds to members of the cyclophilin (CyP) protein family. It has been suggested that the toxic effects of CsA on amastigotes occur indirectly via host cyclophilins, which may be required for intracellular parasite development and growth. Using a host-free L. donovani culture system we revealed for the first time a direct and stage-specific effect of CsA on promastigote growth and amastigote viability. We provided evidence that parasite killing occurs through a heat sensitivity mechanism likely due to direct inhibition of the co-chaperone cyclophilin 40. Our data allow important new insights into the function of the Leishmania CyP protein family in differentiation, growth, and intracellular survival, and define this class of molecules as important drug targets.
Recently, modularity has emerged as a general attribute of complex biological systems. This is probably because modular systems lend themselves readily to optimization via random mutation followed by natural selection. Although they are not traditionally considered to evolve by this process, biological ligands are also modular, being composed of recurring chemical fragments, and moreover they exhibit similarities reminiscent of mutations (e.g. the few atoms differentiating adenine and guanine). Many ligands are also promiscuous in the sense that they bind to many different protein folds. Here, we investigated whether ligand chemical modularity is reflected in an underlying modularity of binding sites across unrelated proteins. We chose nucleotides as paradigmatic ligands, because they can be described as composed of well-defined fragments (nucleobase, ribose and phosphates) and are quite abundant both in nature and in protein structure databases. We found that nucleotide-binding sites do indeed show a modular organization and are composed of fragment-specific protein structural motifs, which parallel the modular structure of their ligands. Through an analysis of the distribution of these motifs in different proteins and in different folds, we discuss the evolutionary implications of these findings and argue that the structural features we observed can arise either as a result of divergence from a common ancestor or as a result of convergent evolution.
Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a ‘Bar Code’ format, which also displays known instances from homologous proteins through a novel ‘Instance Mapper’ protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation.
Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptides. Linear motifs, short sequence modules functioning independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. The prediction of novel occurrences of known linear motifs attempts the difficult task of distinguishing functional matches from stochastically occurring non-functional matches. Although functionality can only be confirmed experimentally, confidence in a putative motif is increased if a motif exhibits attributes associated with functional instances such as occurrence in the correct taxonomic range, cellular compartment, conservation in homologues and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs based on confidence of functionality.
Current methods assessing motif accessibility do not consider much of the information available, either predicting accessibility from primary sequence or regarding any motif occurring in a globular region as low confidence. We present a method considering accessibility and secondary structural context derived from experimentally solved protein structures to rectify this situation. Putatively functional motif occurrences are mapped onto a representative domain, given that a high quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent-accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. The structure filter is implemented as a pipeline with both a graphical interface via the ELM resource and through a Web Service protocol.
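The enrichment figures quoted above can be illustrated with a minimal sketch. It assumes one common definition of fold enrichment: the fraction of one set falling in a confidence class divided by the fraction of the other set falling in the same class. The function name `fold_enrichment` and all counts are hypothetical and invented for illustration; they are not the benchmark data or the scoring code of the ELM structure filter.

```python
def fold_enrichment(target_in_class, target_total,
                    background_in_class, background_total):
    """Fraction of target items in a class divided by the fraction of
    background items in the same class (an assumed definition)."""
    if target_total <= 0 or background_total <= 0:
        raise ValueError("set sizes must be positive")
    background_fraction = background_in_class / background_total
    if background_fraction == 0:
        raise ValueError("no background items fall in this class")
    target_fraction = target_in_class / target_total
    return target_fraction / background_fraction


# Invented counts: 75 of 100 verified instances and 250 of 1000 random
# matches are classified as high confidence.
high_confidence_enrichment = fold_enrichment(75, 100, 250, 1000)

# Invented counts for the low confidence class, where random matches
# are the set expected to be enriched.
low_confidence_enrichment = fold_enrichment(500, 1000, 20, 100)
```

With these made-up counts the two ratios are 3 and 2.5, chosen only to mirror the magnitudes reported in the text.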
New occurrences of known linear motifs require experimental validation as the bioinformatics tools currently have limited reliability. The ELM structure filter will aid users in assessing candidate motifs present in globular structural regions. Most importantly, it will help users to decide whether to expend their valuable time and resources on experimental testing of interesting motif candidates.
The structural analysis of protein-ligand binding sites can provide information relevant for assigning functions to unknown proteins, guiding the drug discovery process and inferring relations among distant protein folds. Previous approaches to the comparative analysis of binding pockets have usually focused on either the ligand or the protein component. Even though several useful observations have been made with these approaches, they both have limitations. In the former case the analysis is restricted to binding pockets interacting with similar ligands, while in the latter it is difficult to systematically check whether the observed structural similarities have a functional significance.
Here we propose a novel methodology that takes into account the structure of both the binding pocket and the ligand. We first look for local similarities in a set of binding pockets and then check whether the bound ligands, even if completely different, share a common fragment that can account for the presence of the structural motif. Thanks to this method we can identify structural motifs whose functional significance is explained by the presence of shared features in the interacting ligands.
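The second step of this methodology, checking whether the ligands bound to structurally similar pockets share a fragment, can be sketched as a set intersection. This is a simplified illustration under stated assumptions: ligands are represented as hand-written sets of fragment labels, and the names `LIGAND_FRAGMENTS` and `shared_fragments` are hypothetical. The actual method works on ligand structures and is not reproduced here.

```python
# Hypothetical fragment annotation: each ligand is reduced to a set of
# named chemical fragments (an assumption made for this sketch).
LIGAND_FRAGMENTS = {
    "ATP": {"adenine", "ribose", "phosphate"},
    "NAD": {"adenine", "ribose", "phosphate", "nicotinamide"},
    "FAD": {"adenine", "ribose", "phosphate", "ribitol", "isoalloxazine"},
    "G6P": {"glucose", "phosphate"},
}


def shared_fragments(ligand_names, fragment_table=LIGAND_FRAGMENTS):
    """Return the fragments common to every ligand in the list."""
    fragment_sets = [set(fragment_table[name]) for name in ligand_names]
    if not fragment_sets:
        return set()
    return set.intersection(*fragment_sets)


# Ligands bound to a group of pockets that share a local structural motif.
common_to_nucleotides = shared_fragments(["ATP", "NAD", "FAD"])
common_with_sugar_phosphate = shared_fragments(["ATP", "G6P"])
```

A non-empty intersection suggests a ligand fragment that could account for the shared pocket motif; an empty one leaves the structural similarity without a ligand-based explanation.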
The application of this method to a large dataset of binding pockets allows the identification of recurring protein motifs that bind specific ligand fragments, even in the context of molecules with a different overall structure. In addition some of these motifs are present in a high number of evolutionarily unrelated proteins.
The occurrence of very similar structural motifs brought about by different parts of non homologous proteins is often indicative of a common function. Indeed, relatively small local structures can mediate binding to a common partner, be it a protein, a nucleic acid, a cofactor or a substrate. While it is relatively easy to identify short amino acid or nucleotide sequence motifs in a given set of proteins or genes, and many methods do exist for this purpose, much more challenging is the identification of common local substructures, especially if they are formed by non consecutive residues in the sequence.
Here we describe a publicly available tool, able to identify common structural motifs shared by different non homologous proteins in an unsupervised mode. The motifs can be as short as three residues and need not be contiguous or even present in the same order in the sequence. Users can submit a set of protein structures, whether or not they are deemed to share a common function (e.g. they bind similar ligands, or share a common epitope). The server finds and lists structural motifs composed of three or more spatially well conserved residues shared by at least three of the submitted structures. The method uses a local structural comparison algorithm to identify subsets of similar amino acids between each pair of input protein chains and a clustering procedure to group similarities shared among different structure pairs.
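The idea of grouping pairwise similarities into motifs shared by at least three structures can be sketched as follows. This is a minimal illustration that substitutes single-linkage grouping with a union-find structure for the clustering procedure, which the text does not specify. The function name, the input format (pairs of `(chain_id, residue_numbers)` sites) and the chain identifiers are all hypothetical.

```python
from collections import defaultdict


def cluster_pairwise_matches(matches, min_structures=3):
    """Group sites linked by pairwise matches (single linkage) and keep
    the groups that span at least `min_structures` distinct chains."""
    parent = {}

    def find(site):
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]  # path halving
            site = parent[site]
        return site

    for site_a, site_b in matches:
        root_a, root_b = find(site_a), find(site_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for site in list(parent):
        groups[find(site)].add(site)

    return [
        sorted(group)
        for group in groups.values()
        if len({chain for chain, _ in group}) >= min_structures
    ]


# Invented pairwise matches: each site is (chain id, matched residue numbers).
pairwise_matches = [
    (("chainA", (10, 45, 80)), ("chainB", (12, 50, 91))),
    (("chainB", (12, 50, 91)), ("chainC", (7, 33, 60))),
    (("chainD", (5, 6, 9)), ("chainE", (1, 2, 3))),
]
motifs = cluster_pairwise_matches(pairwise_matches)
```

Here the first two matches chain together into one motif spanning three structures, while the isolated match between two structures is discarded by the `min_structures` threshold.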
FunClust is fast, completely sequence independent, and does not need a priori knowledge of the motif to be found. The output consists of a list of aligned structural matches displayed in both tabular and graphical form. We show here examples of its usefulness by searching for the largest common structural motifs in test sets of non homologous proteins and showing that the identified motifs correspond to a known common functional feature.