Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a ‘Bar Code’ format, which also displays known instances from homologous proteins through a novel ‘Instance Mapper’ protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation.
Linear motifs are short, evolutionarily plastic components of regulatory proteins and provide low-affinity interaction interfaces. These compact modules play central roles in mediating every aspect of the regulatory functionality of the cell. They are particularly prominent in mediating cell signaling, controlling protein turnover and directing protein localization. Given their importance, our understanding of motifs is surprisingly limited, largely as a result of the difficulty of discovery, both experimentally and computationally. The Eukaryotic Linear Motif (ELM) resource at http://elm.eu.org provides the biological community with a comprehensive database of known experimentally validated motifs, and an exploratory tool to discover putative linear motifs in user-submitted protein sequences. The current update of the ELM database comprises 1800 annotated motif instances representing 170 distinct functional classes, including approximately 500 novel instances and 24 novel classes. Several older motif class entries have been also revisited, improving annotation and adding novel instances. Furthermore, addition of full-text search capabilities, an enhanced interface and simplified batch download has improved the overall accessibility of the ELM data. The motif discovery portion of the ELM resource has added conservation, and structural attributes have been incorporated to aid users to discriminate biologically relevant motifs from stochastically occurring non-functional instances.
The recent expansion in our knowledge of protein–protein interactions (PPIs) has allowed the annotation and prediction of hundreds of thousands of interactions. However, the function of many of these interactions remains elusive. The interactions of Eukaryotic Linear Motif (iELM) web server provides a resource for predicting the function and positional interface for a subset of interactions mediated by short linear motifs (SLiMs). The iELM prediction algorithm is based on the annotated SLiM classes from the Eukaryotic Linear Motif (ELM) resource and allows users to explore both annotated and user-generated PPI networks for SLiM-mediated interactions. By incorporating the annotated information from the ELM resource, iELM provides functional details of PPIs. This can be used in proteomic analysis, for example, to infer whether an interaction promotes complex formation or degradation. Furthermore, details of the molecular interface of the SLiM-mediated interactions are also predicted. This information is displayed in a fully searchable table, as well as graphically with the modular architecture of the participating proteins extracted from the UniProt and Phospho.ELM resources. A network figure is also presented to aid the interpretation of results. The iELM server supports single protein queries as well as large-scale proteomic submissions and is freely available at http://i.elm.eu.org.
Protein-protein interactions through short linear motifs (SLiMs) are an emerging concept that is different from interactions between globular domains. The SLiMs encode a functional interaction interface in a short (three to ten residues) poorly conserved sequence. This characteristic makes them much more likely to arise/disappear spontaneously via mutations, and they may be more evolutionarily labile than globular domains. The diversity of SLiM composition may provide functional diversity for a viral protein from different viral strains. This study is designed to determine the different SLiM compositions of ribonucleoproteins (RNPs) from influenza A viruses (IAVs) from different hosts and with different levels of virulence.
The 96 consensus sequences (regular expressions) of SLiMs from the ELM server were used to conduct a comprehensive analysis of the 52,513 IAV RNP sequences. The SLiM compositions of RNPs from IAVs from different hosts and with different levels of virulence were compared. The SLiM compositions of 845 RNPs from highly virulent/pandemic IAVs were also analyzed. In total, 292 highly conserved SLiMs were found in RNPs regardless of the IAV host range. These SLiMs may be basic motifs that are essential for the normal functions of RNPs. Moreover, several SLiMs that are rare in seasonal IAV RNPs but are present in RNPs from highly virulent/pandemic IAVs were identified.
The SLiMs identified in this study provide a useful resource for experimental virologists to study the interactions between IAV RNPs and host intracellular proteins. Moreover, the SLiM compositions of IAV RNPs also provide insights into signal transduction pathways and protein interaction networks with which IAV RNPs might be involved. Information about SLiMs might be useful for the development of anti-IAV drugs.
Post-translational phosphorylation is one of the most common protein modifications. Phosphoserine, threonine and tyrosine residues play critical roles in the regulation of many cellular processes. The fast growing number of research reports on protein phosphorylation points to a general need for an accurate database dedicated to phosphorylation to provide easily retrievable information on phosphoproteins.
Phospho.ELM is a new resource containing experimentally verified phosphorylation sites manually curated from the literature and is developed as part of the ELM (Eukaryotic Linear Motif) resource. Phospho.ELM constitutes the largest searchable collection of phosphorylation sites available to the research community. The Phospho.ELM entries store information about substrate proteins with the exact positions of residues known to be phosphorylated by cellular kinases. Additional annotation includes literature references, subcellular compartment, tissue distribution, and information about the signaling pathways involved as well as links to the molecular interaction database MINT. Phospho.ELM version 2.0 contains 1703 phosphorylation site instances for 556 phosphorylated proteins.
Phospho.ELM will be a valuable tool both for molecular biologists working on protein phosphorylation sites and for bioinformaticians developing computational predictions on the specificity of phosphorylation reactions.
post-transcriptional modification; protein kinase; bioinformatics
Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs.
We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments, however little correlation was observed between LM and overall alignment quality for individual alignment test cases.
We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.
Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptides. Linear motifs, short sequence modules functioning independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. The prediction of novel occurrences of known linear motifs attempts the difficult task of distinguishing functional matches from stochastically occurring non-functional matches. Although functionality can only be confirmed experimentally, confidence in a putative motif is increased if a motif exhibits attributes associated with functional instances such as occurrence in the correct taxonomic range, cellular compartment, conservation in homologues and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs based on confidence of functionality.
Current methods assessing motif accessibility do not consider much of the information available, either predicting accessibility from primary sequence or regarding any motif occurring in a globular region as low confidence. We present a method considering accessibility and secondary structural context derived from experimentally solved protein structures to rectify this situation. Putatively functional motif occurrences are mapped onto a representative domain, given that a high quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent-accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. The structure filter is implemented as a pipeline with both a graphical interface via the ELM resource and through a Web Service protocol.
New occurrences of known linear motifs require experimental validation as the bioinformatics tools currently have limited reliability. The ELM structure filter will aid users assessing candidate motifs presenting in globular structural regions. Most importantly, it will help users to decide whether to expend their valuable time and resources on experimental testing of interesting motif candidates.
A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface—GlobPipe—for the advanced user to do whole proteome analysis. GlobPlot can also be used as a generic infrastructure package for graphical displaying of any possible propensity.
PRINTS is a database of protein family 'fingerprints' offering a diagnostic resource for newly-determined sequences. By contrast with PROSITE, which uses single consensus expressions to characterise particular families, PRINTS exploits groups of motifs to build characteristic signatures. These signatures offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 800 fingerprints have been constructed and stored in PRINTS. The current version, 17.0, encodes approximately 4500 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via the UCL Bioinformatics World Wide Web (WWW) Server at http://www. biochem.ucl.ac.uk/bsm/dbbrowser/ . We have recently enhanced the usefulness of PRINTS by making available new, intuitive search software. This allows both individual query sequence and bulk data submission, permitting easy analysis of single sequences or complete genomes. Preliminary results indicate that use of the PRINTS system is able to assign additional functions not found by other methods, and hence offers a useful adjunct to current genome analysis protocols.
The PRINTS database of protein family 'fingerprints' is a diagnostic resource that complements the PROSITE dictionary of sites and patterns. Unlike regular expressions, fingerprints exploit groups of conserved motifs within sequence alignments to build characteristic signatures of family membership. Thus fingerprints inherently offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 600 fingerprints have been constructed and stored in PRINTS, representing a 50% increase in the size of the database in the last year. The current version, 13.0, encodes approximately 3000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via UCL's Bioinformatics World Wide Web (WWW) server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser / . We describe here progress with the database, its Web interface, and a recent exciting development: the integration of a novel colour alignment editor (http://www.biochem.ucl.ac.uk/bsm/dbbrowser++ +/CINEMA ), which allows visualisation and interactive manipulation of PRINTS alignments over the Internet.
The structure of many eukaryotic cell regulatory proteins is highly modular. They are assembled from globular domains, segments of natively disordered polypeptides and short linear motifs. The latter are involved in protein interactions and formation of regulatory complexes. The function of such proteins, which may be difficult to define, is the aggregate of the subfunctions of the modules. It is therefore desirable to efficiently predict linear motifs with some degree of accuracy, yet sequence database searches return results that are not significant.
We have developed a method for scoring the conservation of linear motif instances. It requires only primary sequence-derived information (e.g. multiple alignment and sequence tree) and takes into account the degenerate nature of linear motif patterns. On our benchmarking, the method accurately scores 86% of the known positive instances, while distinguishing them from random matches in 78% of the cases. The conservation score is implemented as a real time application designed to be integrated into other tools. It is currently accessible via a Web Service or through a graphical interface.
The conservation score improves the prediction of linear motifs, by discarding those matches that are unlikely to be functional because they have not been conserved during the evolution of the protein sequences. It is especially useful for instances in non-structured regions of the proteins, where a domain masking filtering strategy is not applicable.
Full length, eukaryotic proteins generally consist of several autonomously folding and functioning domains. Many of these domains are known to function by binding and/or modifying other partner proteins based on the recognition of a short, linear amino sequence contained within the target protein. This article reviews the many bioinformatic tools and resources which discover, define and catalogue the various, known protein domains as well as assist users by identifying domain signatures within proteins of interest. We also review the smaller subset of bioinformatic tools which catalogue and help identify the short linear motifs used for domain targeting. It has been suggested that these short, functional, peptide-sequence motifs are normally found in unstructured regions of the target. The role of protein structure in the activity of one representative of these short, functional motifs is explored through an examination of known structures deposited in the Protein Data Bank.
Protein Domains; Protein Domains; Protein Structure; Bioinformatics; review
Phospho.ELM is a manually curated database of eukaryotic phosphorylation sites. The resource includes data collected from published literature as well as high-throughput data sets.
The current release of Phospho.ELM (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites. The entries provide information about the phosphorylated proteins and the exact position of known phosphorylated instances, the kinases responsible for the modification (where known) and links to bibliographic references. The database entries have hyperlinks to easily access further information from UniProt, PubMed, SMART, ELM, MSD as well as links to the protein interaction databases MINT and STRING.
A new BLAST search tool, complementary to retrieval by keyword and UniProt accession number, allows users to submit a protein query (by sequence or UniProt accession) to search against the curated data set of phosphorylated peptides.
Phospho.ELM is available on line at: http://phospho.elm.eu.org
PRINTS is a compendium of protein motif 'fingerprints' derived from the OWL composite sequence database. Fingerprints are groups of motifs within sequence alignments whose conserved nature allows them to be used as signatures of family membership. To date, 400 fingerprints have been constructed and stored in Prints, the size of which has doubled in the last year. The current version, 9.0, encodes approximately 2000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. Fingerprints inherently offer improved diagnostic reliability over single motif methods by virtue of the mutual context provided by motif neighbours. PRINTS thus provides a useful adjunct to the widely used PROSITE dictionary of patterns. The database is now accessible via the Database Browser on the UCL Bioinformatics server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser .
The Phospho.ELM resource (http://phospho.elm.eu.org) is a relational database designed to store in vivo and in vitro phosphorylation data extracted from the scientific literature and phosphoproteomic analyses. The resource has been actively developed for more than 7 years and currently comprises 42 574 serine, threonine and tyrosine non-redundant phosphorylation sites. Several new features have been implemented, such as structural disorder/order and accessibility information and a conservation score. Additionally, the conservation of the phosphosites can now be visualized directly on the multiple sequence alignment used for the score calculation. Finally, special emphasis has been put on linking to external resources such as interaction networks and other databases.
PRINTS is a diagnostic collection of protein fingerprints. Fingerprints exploit groups of motifs to build characteristic family signatures, offering improved diagnostic reliability over single-motif approaches by virtue of the mutual context provided by motif neighbours. Around 1000 fingerprints have now been created and stored in PRINTS. The September 1998 release (version 20.0), encodes approximately 5700 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. The database is accessible via the DbBrowser Web Server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser /. In addition to supporting its continued growth, recent enhancements to the resource include a BLAST server, and more efficient fingerprint search software, with improved statistics for estimating the reliability of retrieved matches. Current efforts are focused on the design of more automated methods for database maintenance; implementation of an object-relational schema for efficient data management; and integration with PROSITE, profiles, Pfam and ProDom, as part of the international InterPro project, which aims to unify protein pattern databases and offer improved tools for genome analysis.
Many aspects of cell signalling, trafficking, and targeting are governed by interactions between globular protein domains and short peptide segments. These domains often bind multiple peptides that share a common sequence pattern, or “linear motif” (e.g., SH3 binding to PxxP). Many domains are known, though comparatively few linear motifs have been discovered. Their short length (three to eight residues), and the fact that they often reside in disordered regions in proteins makes them difficult to detect through sequence comparison or experiment. Nevertheless, each new motif provides critical molecular details of how interaction networks are constructed, and can explain how one protein is able to bind to very different partners. Here we show that binding motifs can be detected using data from genome-scale interaction studies, and thus avoid the normally slow discovery process. Our approach based on motif over-representation in non-homologous sequences, rediscovers known motifs and predicts dozens of others. Direct binding experiments reveal that two predicted motifs are indeed protein-binding modules: a DxxDxxxD protein phosphatase 1 binding motif with a KD of 22 μM and a VxxxRxYS motif that binds Translin with a KD of 43 μM. We estimate that there are dozens or even hundreds of linear motifs yet to be discovered that will give molecular insight into protein networks and greatly illuminate cellular processes.
Many protein interactions are mediated by short amino acid motifs. The authors describe a new approach to identify these interaction motifs and experimentally validate some of their binding predictions.
MEME (Multiple EM for Motif Elicitation) is one of the most widely used tools for searching for novel ‘signals’ in sets of biological sequences. Applications include the discovery of new transcription factor binding sites and protein domains. MEME works by searching for repeated, ungapped sequence patterns that occur in the DNA or protein sequences provided by the user. Users can perform MEME searches via the web server hosted by the National Biomedical Computation Resource () and several mirror sites. Through the same web server, users can also access the Motif Alignment and Search Tool to search sequence databases for matches to motifs encoded in several popular formats. By clicking on buttons in the MEME output, users can compare the motifs discovered in their input sequences with databases of known motifs, search sequence databases for matches to the motifs and display the motifs in various formats. This article describes the freely accessible web server and its architecture, and discusses ways to use MEME effectively to find new sequence patterns in biological sequences and analyze their significance.
We present the development of a web server, a protein short motif search tool that allows users to simultaneously search for a
protein sequence motif and its secondary structure assignments. The web server is able to query very short motifs searches against
PDB structural data from the RCSB Protein Databank, with the users defining the type of secondary structures of the amino acids
in the sequence motif. The output utilises 3D visualisation ability that highlights the position of the motif in the structure and on
the corresponding sequence. Researchers can easily observe the locations and conformation of multiple motifs among the results.
Protein short motif search also has an application programming interface (API) for interfacing with other bioinformatics tools.
The database is available for free at http://birg3.fbb.utm.my/proteinsms
Protein short motif search; protein secondary structure; visualization; application programming interface (API)
Motivation: Eukaryotic proteins are highly modular, containing multiple interaction interfaces that mediate binding to a network of regulators and effectors. Recent advances in high-throughput proteomics have rapidly expanded the number of known protein–protein interactions (PPIs); however, the molecular basis for the majority of these interactions remains to be elucidated. There has been a growing appreciation of the importance of a subset of these PPIs, namely those mediated by short linear motifs (SLiMs), particularly the canonical and ubiquitous SH2, SH3 and PDZ domain-binding motifs. However, these motif classes represent only a small fraction of known SLiMs and outside these examples little effort has been made, either bioinformatically or experimentally, to discover the full complement of motif instances.
Results: In this article, interaction data are analysed to identify and characterize an important subset of PPIs, those involving SLiMs binding to globular domains. To do this, we introduce iELM, a method to identify interactions mediated by SLiMs and add molecular details of the interaction interfaces to both interacting proteins. The method identifies SLiM-mediated interfaces from PPI data by searching for known SLiM–domain pairs. This approach was applied to the human interactome to identify a set of high-confidence putative SLiM-mediated PPIs.
Availability: iELM is freely available at http://elmint.embl.de
Supplementary data are available at Bioinformatics online.
PredictProtein (PP, http://cubic.bioc.columbia.edu/pp/) is an internet service for sequence analysis and the prediction of aspects of protein structure and function. Users submit protein sequence or alignments; the server returns a multiple sequence alignment, PROSITE sequence motifs, low-complexity regions (SEG), ProDom domain assignments, nuclear localisation signals, regions lacking regular structure and predictions of secondary structure, solvent accessibility, globular regions, transmembrane helices, coiled-coil regions, structural switch regions and disulfide-bonds. Upon request, fold recognition by prediction-based threading is available. For all services, users can submit their query either by electronic mail or interactively from World Wide Web.
Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch 2.0 (Short, Linear Motif Search) web server allows researchers to identify occurrences of a user-defined SLiM in a proteome, using conservation and protein disorder context statistics to rank occurrences. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. For each motif occurrence, overlapping UniProt features and annotated SLiMs are displayed. Visualization also includes annotated multiple sequence alignments surrounding each occurrence, showing conservation and protein disorder statistics in addition to known and predicted SLiMs, protein domains and known post-translational modifications. In addition, enrichment of Gene Ontology terms and protein interaction partners are provided as indicators of possible motif function. All web server results are available for download. Users can search motifs against the human proteome or a subset thereof defined by Uniprot accession numbers or GO term. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch2.html.
The PRINTS database houses a collection of protein fingerprints. These may be used to make family and tentative functional assignments for uncharacterised sequences. The September 2001 release (version 32.0) includes 1600 fingerprints, encoding ∼10 000 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. In addition to its continued steady growth, we report here its use as a source of annotation in the InterPro resource, and the use of its relational cousin, PRINTS-S, to model relationships between families, including those beyond the reach of conventional sequence analysis approaches. The database is accessible for BLAST, fingerprint and text searches at http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/.
The PRINTS database houses a collection of protein fingerprints. These may be used to assign uncharacterised sequences to known families and hence to infer tentative functions. The September 2002 release (version 36.0) includes 1800 fingerprints, encoding ∼11 000 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. In addition to its continued steady growth, we report here the development of an automatic supplement, prePRINTS, designed to increase the coverage of the resource and reduce some of the manual burdens inherent in its maintenance. The databases are accessible for interrogation and searching at http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/.
Short, linear motifs (SLiMs) play a critical role in many biological processes, particularly in protein–protein interactions. The Short, Linear Motif Finder (SLiMFinder) web server is a de novo motif discovery tool that identifies statistically over-represented motifs in a set of protein sequences, accounting for the evolutionary relationships between them. Motifs are returned with an intuitive P-value that greatly reduces the problem of false positives and is accessible to biologists of all disciplines. Input can be uploaded by the user or extracted directly from UniProt. Numerous masking options give the user great control over the contextual information to be included in the analyses. The SLiMFinder server combines these with user-friendly output and visualizations of motif context to allow the user to quickly gain insight into the validity of a putatively functional motif. These visualizations include alignments of motif occurrences, alignments of motifs and their homologues and a visual schematic of the top-ranked motifs. Returned motifs can also be compared with known SLiMs from the literature using CompariMotif. All results are available for download. The SLiMFinder server is available at: http://bioware.ucd.ie/slimfinder.html.