Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptides. Linear motifs, short sequence modules functioning independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. The prediction of novel occurrences of known linear motifs attempts the difficult task of distinguishing functional matches from stochastically occurring non-functional matches. Although functionality can only be confirmed experimentally, confidence in a putative motif is increased if a motif exhibits attributes associated with functional instances such as occurrence in the correct taxonomic range, cellular compartment, conservation in homologues and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs based on confidence of functionality.
Current methods assessing motif accessibility do not consider much of the information available, either predicting accessibility from primary sequence or regarding any motif occurring in a globular region as low confidence. We present a method considering accessibility and secondary structural context derived from experimentally solved protein structures to rectify this situation. Putatively functional motif occurrences are mapped onto a representative domain, provided that a high quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. The structure filter is implemented as a pipeline accessible both through a graphical interface via the ELM resource and through a Web Service protocol.
New occurrences of known linear motifs require experimental validation as the bioinformatics tools currently have limited reliability. The ELM structure filter will aid users assessing candidate motifs presenting in globular structural regions. Most importantly, it will help users to decide whether to expend their valuable time and resources on experimental testing of interesting motif candidates.
Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a ‘Bar Code’ format, which also displays known instances from homologous proteins through a novel ‘Instance Mapper’ protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation.
The recent expansion in our knowledge of protein–protein interactions (PPIs) has allowed the annotation and prediction of hundreds of thousands of interactions. However, the function of many of these interactions remains elusive. The interactions of Eukaryotic Linear Motif (iELM) web server provides a resource for predicting the function and positional interface for a subset of interactions mediated by short linear motifs (SLiMs). The iELM prediction algorithm is based on the annotated SLiM classes from the Eukaryotic Linear Motif (ELM) resource and allows users to explore both annotated and user-generated PPI networks for SLiM-mediated interactions. By incorporating the annotated information from the ELM resource, iELM provides functional details of PPIs. This can be used in proteomic analysis, for example, to infer whether an interaction promotes complex formation or degradation. Furthermore, details of the molecular interface of the SLiM-mediated interactions are also predicted. This information is displayed in a fully searchable table, as well as graphically with the modular architecture of the participating proteins extracted from the UniProt and Phospho.ELM resources. A network figure is also presented to aid the interpretation of results. The iELM server supports single protein queries as well as large-scale proteomic submissions and is freely available at http://i.elm.eu.org.
Linear motifs are short, evolutionarily plastic components of regulatory proteins and provide low-affinity interaction interfaces. These compact modules play central roles in mediating every aspect of the regulatory functionality of the cell. They are particularly prominent in mediating cell signaling, controlling protein turnover and directing protein localization. Given their importance, our understanding of motifs is surprisingly limited, largely as a result of the difficulty of discovery, both experimentally and computationally. The Eukaryotic Linear Motif (ELM) resource at http://elm.eu.org provides the biological community with a comprehensive database of known experimentally validated motifs, and an exploratory tool to discover putative linear motifs in user-submitted protein sequences. The current update of the ELM database comprises 1800 annotated motif instances representing 170 distinct functional classes, including approximately 500 novel instances and 24 novel classes. Several older motif class entries have also been revisited, improving annotation and adding novel instances. Furthermore, addition of full-text search capabilities, an enhanced interface and simplified batch download has improved the overall accessibility of the ELM data. In the motif discovery portion of the ELM resource, conservation and structural attributes have been incorporated to help users discriminate biologically relevant motifs from stochastically occurring non-functional instances.
Linear motifs (LMs) are abundant short regulatory sites used for modulating the functions of many eukaryotic proteins. They play important roles in post-translational modification, cell compartment targeting, docking sites for regulatory complex assembly and protein processing and cleavage. Methods for LM detection are now being developed that are strongly dependent on scores for motif conservation in homologous proteins. However, most LMs are found in natively disordered polypeptide segments that evolve rapidly, unhindered by structural constraints on the sequence. These regions of modular proteins are difficult to align using classical multiple sequence alignment programs that are specifically optimised to align the globular domains. As a consequence, poor motif alignment quality is hindering efforts to detect new LMs.
We have developed a new benchmark, as part of the BAliBASE suite, designed to assess the ability of standard multiple alignment methods to detect and align LMs. The reference alignments are organised into different test sets representing real alignment problems and contain examples of experimentally verified functional motifs, extracted from the Eukaryotic Linear Motif (ELM) database. The benchmark has been used to evaluate and compare a number of multiple alignment programs. With distantly related proteins, the worst alignment program correctly aligns 48% of LMs compared to 73% for the best program. However, the performance of all the programs is adversely affected by the introduction of other sequences containing false positive motifs. The ranking of the alignment programs based on LM alignment quality is similar to that observed when considering full-length protein alignments; however, little correlation was observed between LM and overall alignment quality for individual alignment test cases.
We have shown that none of the programs currently available is capable of reliably aligning LMs in distantly related sequences and we have highlighted a number of specific problems. The results of the tests suggest possible ways to improve program accuracy for difficult, divergent sequences.
The eukaryotic linear motif (ELM; http://elm.eu.org) resource is a hub for collecting, classifying and curating information about short linear motifs (SLiMs). For >10 years, this resource has provided the scientific community with a freely accessible guide to the biology and function of linear motifs. The current version of ELM contains ∼200 different motif classes with over 2400 experimentally validated instances manually curated from >2000 scientific publications. Furthermore, detailed information about motif-mediated interactions has been annotated and made available in standard exchange formats. Where appropriate, links are provided to resources such as switches.elm.eu.org and KEGG pathways.
Protein-protein interactions through short linear motifs (SLiMs) are an emerging concept that is different from interactions between globular domains. The SLiMs encode a functional interaction interface in a short (three to ten residues) poorly conserved sequence. This characteristic makes them much more likely to arise/disappear spontaneously via mutations, and they may be more evolutionarily labile than globular domains. The diversity of SLiM composition may provide functional diversity for a viral protein from different viral strains. This study is designed to determine the different SLiM compositions of ribonucleoproteins (RNPs) from influenza A viruses (IAVs) from different hosts and with different levels of virulence.
The 96 consensus sequences (regular expressions) of SLiMs from the ELM server were used to conduct a comprehensive analysis of the 52,513 IAV RNP sequences. The SLiM compositions of RNPs from IAVs from different hosts and with different levels of virulence were compared. The SLiM compositions of 845 RNPs from highly virulent/pandemic IAVs were also analyzed. In total, 292 highly conserved SLiMs were found in RNPs regardless of the IAV host range. These SLiMs may be basic motifs that are essential for the normal functions of RNPs. Moreover, several SLiMs that are rare in seasonal IAV RNPs but are present in RNPs from highly virulent/pandemic IAVs were identified.
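Scanning protein sequences with motif regular expressions, as described above, can be sketched roughly as follows. The pattern names, the expressions and the test sequence are illustrative assumptions only; they are not actual ELM class definitions or IAV sequences, and the study's own pipeline may differ.

```python
import re

# Hypothetical motif patterns for illustration only; these are NOT taken
# from the ELM resource.
MOTIF_PATTERNS = {
    "example_proline_rich": r"P..P",
    "example_basic_ser": r"[RK][RK].[ST]",
}


def scan_sequence(sequence, patterns):
    """Return (motif_name, start, matched_text) for every match.

    Overlapping matches are reported; start positions are 1-based.
    """
    hits = []
    for name, pattern in patterns.items():
        # A zero-width lookahead with a capture group finds overlapping matches.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return hits


# Made-up sequence used only to exercise the function.
hits = scan_sequence("MKRASPLPPAPKKTS", MOTIF_PATTERNS)
for name, start, text in hits:
    print(name, start, text)
```

Counting such hits per motif across groups of sequences (for example by host) would then give a simple SLiM composition table for comparison.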
The SLiMs identified in this study provide a useful resource for experimental virologists to study the interactions between IAV RNPs and host intracellular proteins. Moreover, the SLiM compositions of IAV RNPs also provide insights into signal transduction pathways and protein interaction networks with which IAV RNPs might be involved. Information about SLiMs might be useful for the development of anti-IAV drugs.
Motivation: Eukaryotic proteins are highly modular, containing multiple interaction interfaces that mediate binding to a network of regulators and effectors. Recent advances in high-throughput proteomics have rapidly expanded the number of known protein–protein interactions (PPIs); however, the molecular basis for the majority of these interactions remains to be elucidated. There has been a growing appreciation of the importance of a subset of these PPIs, namely those mediated by short linear motifs (SLiMs), particularly the canonical and ubiquitous SH2, SH3 and PDZ domain-binding motifs. However, these motif classes represent only a small fraction of known SLiMs and outside these examples little effort has been made, either bioinformatically or experimentally, to discover the full complement of motif instances.
Results: In this article, interaction data are analysed to identify and characterize an important subset of PPIs, those involving SLiMs binding to globular domains. To do this, we introduce iELM, a method to identify interactions mediated by SLiMs and add molecular details of the interaction interfaces to both interacting proteins. The method identifies SLiM-mediated interfaces from PPI data by searching for known SLiM–domain pairs. This approach was applied to the human interactome to identify a set of high-confidence putative SLiM-mediated PPIs.
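The idea of searching PPI data for known SLiM–domain pairs can be illustrated with a minimal sketch. All identifiers below (protein names, motif and domain labels, the function name) are hypothetical placeholders; this is not the iELM implementation, which the abstract does not describe in detail.

```python
from itertools import product

# Hypothetical annotations for illustration; none of these identifiers
# come from iELM or ELM.
KNOWN_SLIM_DOMAIN_PAIRS = {("motif_A", "domain_X"), ("motif_B", "domain_Y")}

protein_motifs = {"P1": {"motif_A"}, "P2": {"motif_B"}, "P3": set()}
protein_domains = {"P1": {"domain_Y"}, "P2": {"domain_X"}, "P3": {"domain_X"}}


def candidate_interfaces(interactions, motifs, domains, known_pairs):
    """List (motif_protein, motif, domain_protein, domain) tuples for every
    interacting pair that carries a known SLiM-domain combination."""
    results = []
    for a, b in interactions:
        # Check both orientations: motif in a with domain in b, and vice versa.
        for motif_prot, domain_prot in ((a, b), (b, a)):
            pairs = product(
                sorted(motifs.get(motif_prot, ())),
                sorted(domains.get(domain_prot, ())),
            )
            for motif, domain in pairs:
                if (motif, domain) in known_pairs:
                    results.append((motif_prot, motif, domain_prot, domain))
    return results


ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
interfaces = candidate_interfaces(
    ppis, protein_motifs, protein_domains, KNOWN_SLIM_DOMAIN_PAIRS
)
for row in interfaces:
    print(row)
```

An interaction with no matching pair in either orientation, such as P2 with P3 here, simply yields no candidate interface.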
Availability: iELM is freely available at http://elmint.embl.de
Supplementary data are available at Bioinformatics online.
Biology is encoded in molecular sequences: deciphering this encoding remains a grand scientific challenge. Functional regions of DNA, RNA, and protein sequences often exhibit characteristic but subtle motifs; thus, computational discovery of motifs in sequences is a fundamental and much-studied problem. However, most current algorithms do not allow for insertions or deletions (indels) within motifs, and the few that do have other limitations. We present a method, GLAM2 (Gapped Local Alignment of Motifs), for discovering motifs allowing indels in a fully general manner, and a companion method GLAM2SCAN for searching sequence databases using such motifs. GLAM2 is a generalization of the gapless Gibbs sampling algorithm. It re-discovers variable-width protein motifs from the PROSITE database significantly more accurately than the alternative methods PRATT and SAM-T2K. Furthermore, it usefully refines protein motifs from the ELM database: in some cases, the refined motifs make orders of magnitude fewer overpredictions than the original ELM regular expressions. GLAM2 performs respectably on the BAliBASE multiple alignment benchmark, and may be superior to leading multiple alignment methods for “motif-like” alignments with N- and C-terminal extensions. Finally, we demonstrate the use of GLAM2 to discover protein kinase substrate motifs and a gapped DNA motif for the LIM-only transcriptional regulatory complex: using GLAM2SCAN, we identify promising targets for the latter. GLAM2 is especially promising for short protein motifs, and it should improve our ability to identify the protein cleavage sites, interaction sites, post-translational modification attachment sites, etc., that underlie much of biology. It may be equally useful for arbitrarily gapped motifs in DNA and RNA, although fewer examples of such motifs are known at present. GLAM2 is public domain software, available for download at http://bioinformatics.org.au/glam2.
In recent decades, scientists have extracted genetic sequences—DNA, RNA, and protein sequences—from numerous organisms. These sequences hold the information for the construction and functioning of these organisms, but as yet we are mostly unable to read them. It has long been known that these sequences contain many kinds of “motifs”, i.e. re-occurring patterns, associated with specific biological functions. Thus, much research has been devoted to computer algorithms for automatically discovering subtle, recurring motifs in sequences. However, previous algorithms search for rigid motifs whose instances vary only by substitutions, and not by insertions or deletions. Real motifs are flexible, and do vary by insertions and deletions. This study describes a new computer algorithm for discovering motifs, which allows for arbitrary insertions and deletions. This algorithm can discover real, flexible motifs, and should be able to help us determine the functions of many biological molecules.
The structure of many eukaryotic cell regulatory proteins is highly modular. They are assembled from globular domains, segments of natively disordered polypeptides and short linear motifs. The latter are involved in protein interactions and formation of regulatory complexes. The function of such proteins, which may be difficult to define, is the aggregate of the subfunctions of the modules. It is therefore desirable to efficiently predict linear motifs with some degree of accuracy, yet sequence database searches return results that are not significant.
We have developed a method for scoring the conservation of linear motif instances. It requires only primary sequence-derived information (e.g. multiple alignment and sequence tree) and takes into account the degenerate nature of linear motif patterns. On our benchmarking, the method accurately scores 86% of the known positive instances, while distinguishing them from random matches in 78% of the cases. The conservation score is implemented as a real time application designed to be integrated into other tools. It is currently accessible via a Web Service or through a graphical interface.
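To make the notion of motif conservation concrete, the sketch below computes a deliberately simplified stand-in: the fraction of homologues whose aligned residues still match the degenerate motif pattern. It is an assumption-laden illustration, not the published scoring scheme, which also uses the sequence tree; the alignment, pattern and function name are all invented.

```python
import re


def conservation_fraction(alignment, query_id, pattern, start, end):
    """Fraction of non-query sequences whose residues in alignment columns
    [start, end) still match `pattern` once gap characters are removed."""
    regex = re.compile(pattern)
    others = [seq for name, seq in alignment.items() if name != query_id]
    if not others:
        return 0.0
    conserved = 0
    for seq in others:
        segment = seq[start:end].replace("-", "")
        if regex.fullmatch(segment):
            conserved += 1
    return conserved / len(others)


# Toy alignment (made up); every row has the same length.
alignment = {
    "query": "MSPLPPAK-T",
    "hom1": "MAPVPPAKRT",
    "hom2": "MSALPGAK-T",
    "hom3": "MTPIPPSR-T",
}

# The query motif instance occupies alignment columns 2..5 (0-based).
score = conservation_fraction(alignment, "query", r"P..P", 2, 6)
print(round(score, 3))
```

Matching against the pattern rather than requiring identical residues is what lets a score of this kind respect the degenerate nature of linear motifs.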
The conservation score improves the prediction of linear motifs, by discarding those matches that are unlikely to be functional because they have not been conserved during the evolution of the protein sequences. It is especially useful for instances in non-structured regions of the proteins, where a domain masking filtering strategy is not applicable.
Post-translational phosphorylation is one of the most common protein modifications. Phosphoserine, threonine and tyrosine residues play critical roles in the regulation of many cellular processes. The fast growing number of research reports on protein phosphorylation points to a general need for an accurate database dedicated to phosphorylation to provide easily retrievable information on phosphoproteins.
Phospho.ELM is a new resource containing experimentally verified phosphorylation sites manually curated from the literature and is developed as part of the ELM (Eukaryotic Linear Motif) resource. Phospho.ELM constitutes the largest searchable collection of phosphorylation sites available to the research community. The Phospho.ELM entries store information about substrate proteins with the exact positions of residues known to be phosphorylated by cellular kinases. Additional annotation includes literature references, subcellular compartment, tissue distribution, and information about the signaling pathways involved as well as links to the molecular interaction database MINT. Phospho.ELM version 2.0 contains 1703 phosphorylation site instances for 556 phosphorylated proteins.
Phospho.ELM will be a valuable tool both for molecular biologists working on protein phosphorylation sites and for bioinformaticians developing computational predictions on the specificity of phosphorylation reactions.
post-transcriptional modification; protein kinase; bioinformatics
Viruses interact extensively with host proteins, but the mechanisms controlling these interactions are not well understood. We present a comprehensive analysis of eukaryotic linear-peptide motifs (ELMs) in 2,208 viral genomes and reveal that viruses exploit molecular mimicry of host-like ELMs to possibly assist in host-virus interactions. Using a statistical genomics approach, we identify a large number of potentially functional ELMs and observe that the occurrence of ELMs is often evolutionarily conserved but not uniform across virus families. Some viral proteins contain multiple types of ELMs, in striking similarity to complex regulatory modules in host proteins, suggesting that ELMs may act combinatorially to assist viral replication. Furthermore, a simple evolutionary model suggests that the inherent structural simplicity of ELMs often enables them to tolerate mutations and evolve quickly. Our findings suggest that ELMs may allow fast rewiring of host-virus interactions, which likely assists rapid viral evolution and adaptation to diverse environments.
Many important interactions of proteins are facilitated by short, linear motifs (SLiMs) within a protein's primary sequence. Our aim was to establish robust methods for discovering putative functional motifs. The strongest evidence for such motifs is obtained when the same motifs occur in unrelated proteins, evolving by convergence. In practice, searches for such motifs are often swamped by motifs shared in related proteins that are identical by descent. Predictions of motifs among sets of biologically related proteins, including those both with and without detectable similarity, were made using the TEIRESIAS algorithm. The number of motif occurrences arising through common evolutionary descent was normalized based on treatment of BLAST local alignments. Motifs were ranked according to a score derived from the product of the normalized number of occurrences and the information content. The method was shown to significantly outperform methods that do not discount evolutionary relatedness, when applied to known SLiMs from a subset of the eukaryotic linear motif (ELM) database. An implementation of Multiple Spanning Tree weighting outperformed two other weighting schemes, in a variety of settings.
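The ranking score, a product of the normalized number of occurrences and the information content, can be sketched as below. The abstract does not define information content, so the per-position formula used here (log2 of alphabet size over the number of allowed residues) is an assumption, and the normalized occurrence counts are invented inputs rather than values derived from BLAST alignments.

```python
import math

WILDCARD = "ACDEFGHIKLMNPQRSTVWY"  # all 20 standard amino acids


def information_content(positions, alphabet_size=20):
    """Sum over positions of log2(alphabet_size / number of allowed residues).

    A wildcard position allows every residue and contributes 0 bits.
    This definition is an assumption made for illustration.
    """
    return sum(math.log2(alphabet_size / len(allowed)) for allowed in positions)


def motif_score(normalized_occurrences, positions):
    """Product of normalized occurrence count and information content."""
    return normalized_occurrences * information_content(positions)


# Each motif is a list of allowed-residue strings, one per position.
pxxp = ["P", WILDCARD, WILDCARD, "P"]  # P..P
basic = ["RK", WILDCARD, "ST"]         # [RK].[ST]

# Hypothetical normalized occurrence counts (relatedness already discounted).
scores = {
    "P..P": motif_score(3.5, pxxp),
    "[RK].[ST]": motif_score(4.0, basic),
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy case the more specific pattern ranks first despite having fewer normalized occurrences, which is the trade-off a product score of this kind encodes.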
PRINTS is a database of protein family 'fingerprints' offering a diagnostic resource for newly-determined sequences. By contrast with PROSITE, which uses single consensus expressions to characterise particular families, PRINTS exploits groups of motifs to build characteristic signatures. These signatures offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 800 fingerprints have been constructed and stored in PRINTS. The current version, 17.0, encodes approximately 4500 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via the UCL Bioinformatics World Wide Web (WWW) Server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/. We have recently enhanced the usefulness of PRINTS by making available new, intuitive search software. This allows both individual query sequence and bulk data submission, permitting easy analysis of single sequences or complete genomes. Preliminary results indicate that use of the PRINTS system is able to assign additional functions not found by other methods, and hence offers a useful adjunct to current genome analysis protocols.
Models of protein evolution are used to describe evolutionary processes, for phylogenetic analyses and homology detection. Widely used general models of protein evolution are biased toward globular domains and lack resolution to describe evolutionary processes for other protein types. As three-dimensional structure is a major constraint to protein evolution, specific models have been proposed for other types of proteins. Here, we consider evolutionary patterns in coiled-coil forming proteins. Coiled-coils are widespread structural domains, formed by a repeated motif of seven amino acids (heptad repeat). Coiled-coil forming proteins are frequently rods and spacers, structuring both the intracellular and the extracellular spaces that often form protein interaction interfaces. We tested the hypothesis that due to their specific structure the associated evolutionary constraints differ from those of globular proteins. We showed that substitution patterns in coiled-coil regions are different than those observed in globular regions, beyond the simple heptad repeat. Based on these substitution patterns we developed a coiled-coil specific (CC) model that in the context of phylogenetic reconstruction outperforms general models in tree likelihood, often leading to different topologies. For multidomain proteins containing both a coiled-coil region and a globular domain, we showed that a combination of the CC model and a general one gives higher likelihoods than a single model. Finally, we showed that the model can be used for homology detection to increase search sensitivity for coiled-coil proteins. The CC model, software, and other supplementary materials are available at http://www.evocell.org/cgl/resources (last accessed January 29, 2015).
coiled-coil; protein evolution; phylogenetic inference; homology detection; amino acid substitutions; protein structure
Full length, eukaryotic proteins generally consist of several autonomously folding and functioning domains. Many of these domains are known to function by binding and/or modifying other partner proteins based on the recognition of a short, linear amino sequence contained within the target protein. This article reviews the many bioinformatic tools and resources which discover, define and catalogue the various, known protein domains as well as assist users by identifying domain signatures within proteins of interest. We also review the smaller subset of bioinformatic tools which catalogue and help identify the short linear motifs used for domain targeting. It has been suggested that these short, functional, peptide-sequence motifs are normally found in unstructured regions of the target. The role of protein structure in the activity of one representative of these short, functional motifs is explored through an examination of known structures deposited in the Protein Data Bank.
Protein Domains; Protein Structure; Bioinformatics; review
While many authors have discussed models and tools for studying protein evolution at the sequence level, molecular function is usually mediated by complex, higher order features such as independently folding domains and linear motifs that are based on or embedded in a particular arrangement of features such as secondary structure elements, transmembrane domains and regions with intrinsic disorder. This ‘protein architecture’ can, in its most simplistic representation, be visualized as domain organization cartoons that can be used to compare proteins in terms of the order of their mostly globular domains.
Here, we describe a visual approach and a webserver for protein comparison that extend the domain organization cartoon concept. By developing an information-rich, compact visualization of different protein features above the sequence level, potentially related proteins can be compared at the level of propensities for secondary structure, transmembrane domains and intrinsic disorder, in addition to PFAM domains. A public Web server is available at www.proteinarchitect.net, while the code is provided at protarchitect.sourceforge.net.
Due to recent advances in sequencing technologies we are now flooded with millions of predicted proteins that await comparative analysis. In many cases, mature tools focused on revealing hits with considerable global or local similarity to well-characterized proteins will not be able to lead us to testable hypotheses about a protein's function, or the function of a particular region. The visual comparison of different types of protein features with ProteinArchitect will be useful when assessing the relevance of similarity search hits, to discover subgroups in protein families and superfamilies, and to understand protein regions with conserved features outside globular regions. Therefore, this approach is likely to help researchers to develop testable hypotheses about a protein's function even if it is somewhat distant from the more characterized proteins, by facilitating the discovery of features that are conserved above the sequence level for comparison and further experimental investigation.
Phospho.ELM is a manually curated database of eukaryotic phosphorylation sites. The resource includes data collected from published literature as well as high-throughput data sets.
The current release of Phospho.ELM (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites. The entries provide information about the phosphorylated proteins and the exact position of known phosphorylated instances, the kinases responsible for the modification (where known) and links to bibliographic references. The database entries have hyperlinks to easily access further information from UniProt, PubMed, SMART, ELM, MSD as well as links to the protein interaction databases MINT and STRING.
A new BLAST search tool, complementary to retrieval by keyword and UniProt accession number, allows users to submit a protein query (by sequence or UniProt accession) to search against the curated data set of phosphorylated peptides.
Phospho.ELM is available on line at: http://phospho.elm.eu.org
A major challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Non-globular sequence segments often contain short linear peptide motifs (e.g. SH3-binding sites) which are important for protein function. We present here a new tool for discovery of such unstructured, or disordered regions within proteins. GlobPlot (http://globplot.embl.de) is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. We show examples with known proteins where it successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface—GlobPipe—for the advanced user to do whole proteome analysis. GlobPlot can also be used as a generic infrastructure package for graphical displaying of any possible propensity.
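One simple way to plot a tendency for order versus disorder along a sequence is a running sum of per-residue propensities, sketched below. The propensity values are placeholders invented for this example and are not the scale used by GlobPlot; the function name is likewise hypothetical.

```python
from itertools import accumulate

# Hypothetical per-residue disorder propensities (positive = disorder-prone).
# Placeholder values for illustration only.
PROPENSITY = {"P": 0.5, "S": 0.3, "G": 0.2, "L": -0.4, "V": -0.3, "I": -0.4}


def cumulative_propensity(sequence, scale, default=0.0):
    """Running sum of propensities along the sequence.

    Rising stretches of the curve suggest disorder-prone segments and
    falling stretches suggest ordered segments under this toy scale.
    Residues missing from `scale` contribute `default`.
    """
    return list(accumulate(scale.get(res, default) for res in sequence))


# Made-up sequence: a hydrophobic stretch followed by a P/S/G-rich stretch.
curve = cumulative_propensity("LVIPSGP", PROPENSITY)
print([round(x, 2) for x in curve])
```

Plotting `curve` against residue position gives a profile in which the turning point (here after the third residue) marks a candidate boundary between an ordered and a disordered segment.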
Expressed sequence tags (ESTs) are an effective approach for discovery of novel genes. In the current study, approximately
250 ESTs of the cattle parasitic nematode Setaria digitata were examined and a cDNA clone identified whose coding
sequence could not be functionally annotated by searching over publicly available genome, protein, EST and STS databases.
Here, we report the extensive characterization of this ORF (UP) and its homologues using a bioinformatic approach.
Uncharacterized protein (SDUP) of S. digitata consists of 204 amino acids with a predicted molecular weight and isoelectric
point of 22.8KDa and 9.94, respectively. A search carried out using SDUP over nucleotide, EST and protein databases at
NCBI, NEMBASE3 and Parasite Genome Database (PGD) identified homologous counterparts from the human parasitic
nematodes Wuchereria bancrofti (WB), Brugia malayi (BM), Onchocerca volvulus (OV), the mouse filarial worm
Litomosoides sigmodontis (LS), swine parasitic nematodes Ascaris suum (AS) and diverged counterparts from the plant
parasitic nematode Meloidogyne hapla (MH) and free living nematodes Caenorhabditis elegans (CE) and Caenorhabditis
briggsae (CB). Phylogenetic analyses revealed the UPs to be undergoing divergent evolution. A search of the ESTs at PGD
showed that UP is expressed in all the stages of BM. Secondary structure analyses of multiply-aligned sequences of
homologues using Jpred server indicated UPs to be rich in beta-pleated structures. TMMHH server and beta barrel finder
programme indicated, UPs to be neither transmembrane or beta barrels proteins but are likely to be globular proteins.
Further, the Motif discovery tool of MEME identified three novel potential motifs for UPS, of which only two are present in
CE, CB & MH. Analyses of UPs using Signal IP, TargetP, Psort servers predicted this group of proteins to be devoid of
signal peptide cleavage sites, are not mitochondrial targeting peptides but appear to be localized to the nucleus, respectively.
Further analyses of the UPs using the ScanProsite server for phosphorylation revealed potential sites for cAMP- and cGMP-dependent
protein kinase, protein kinase C and casein kinase II. Putative functional analysis using the ProtFun 2.1 server
indicated UPs to be nonenzymatic, growth factor-like proteins. Finally, collating all the information derived from
bioinformatic analyses, we conclude that the UPs of nematodes are most likely to be expressed at all stages in the life cycle,
localized to the nucleus, regulated by phosphorylation, rich in beta-pleated strands and are growth factor-like nematode
nematodes; Setaria digitata; bioinformatics; servers
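The phosphorylation-site scan described in the abstract above can be illustrated with a minimal sketch. It assumes the short consensus patterns commonly quoted for these three kinases in PROSITE ([RK](2)-x-[ST], [ST]-x-[RK] and [ST]-x(2)-[DE]); the patterns should be checked against the current PROSITE release, and the function name and the toy sequence are hypothetical, not taken from the study. Matches to such short patterns occur frequently by chance, so hits are only candidates.

```python
import re

# Assumed PROSITE-style consensus patterns, written as regular expressions.
# A lookahead with a capture group is used so that overlapping sites are reported.
PHOSPHO_PATTERNS = {
    "cAMP/cGMP-dependent protein kinase": r"(?=([RK]{2}.[ST]))",
    "Protein kinase C": r"(?=([ST].[RK]))",
    "Casein kinase II": r"(?=([ST].{2}[DE]))",
}


def scan_phospho_sites(sequence):
    """Return (kinase, 1-based start, matched peptide) tuples sorted by position."""
    hits = []
    for kinase, pattern in PHOSPHO_PATTERNS.items():
        for match in re.finditer(pattern, sequence):
            hits.append((kinase, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toy sequence, not the SDUP sequence.
toy_sequence = "MKRRASLDETAK"
for kinase, start, peptide in scan_phospho_sites(toy_sequence):
    print(f"{kinase}: {peptide} at position {start}")
```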
Pur-α is a nucleic acid-binding protein involved in cell cycle control, transcription, and neuronal function. Initially no prediction of the three-dimensional structure of Pur-α was possible. However, recently we solved the X-ray structure of Pur-α from the fruitfly Drosophila melanogaster and showed that it contains a so-called PUR domain. Here we explain how we exploited bioinformatics tools in combination with X-ray structure determination of a bacterial homolog to obtain diffracting crystals and the high-resolution structure of Drosophila Pur-α. First, we used sensitive methods for remote-homology detection to find three repetitive regions in Pur-α. We realized that our lack of understanding of how these repeats interact to form a globular domain was a major problem for crystallization and structure determination. With our information on the repeat motifs we then identified a distant bacterial homolog that contains only one repeat. We determined the bacterial crystal structure and found that two of the repeats interact to form a globular domain. Based on this bacterial structure, we calculated a computational model of the eukaryotic protein. The model allowed us to design a crystallizable fragment and to determine the structure of Drosophila Pur-α. Key to success was the fact that single repeats of the bacterial protein self-assembled into a globular domain, instructing us on the number and boundaries of repeats to be included for crystallization trials with the eukaryotic protein. This study demonstrates that the simpler structural domain arrangement of a distant prokaryotic protein can guide the design of eukaryotic crystallization constructs. Since many eukaryotic proteins contain multiple repeats or repeating domains, this approach might be instructive for structural studies of a range of proteins.
Cell growth and proliferation require a complex series of tightly regulated and well-orchestrated events. Accordingly, proteins governing such events are evolutionarily conserved, even among distant organisms. By contrast, it is more unusual for “core functions” to be exerted by functionally analogous proteins that are not homologous and do not share any kind of structural similarity. This is the case for the proteins regulating the G1/S transition in higher eukaryotes, i.e., the retinoblastoma (Rb) tumor suppressor, and in budding yeast, i.e., Whi5. The interaction landscape of Rb and Whi5 is quite large, with more than one hundred proteins interacting either genetically or physically with each protein. The Whi5 interactome has been used to construct a concept map of Whi5 function and regulation. Comparison of the physical and genetic interactors of Rb and Whi5 highlights a significant core of conserved, common functionalities associated with the interactors, indicating that the structure and function of the network, rather than individual proteins, are conserved during evolution. A combined bioinformatics and biochemical approach has shown that the whole Whi5 protein is highly disordered, except for a small region containing the protein family signature. Comparison with Whi5 homologs from Saccharomycetales has prompted the hypothesis of a modular organization of structural disorder, with most evolutionarily conserved regions alternating with highly variable ones. The finding of a consensus sequence points to the conservation of a specific phosphorylation rhythm along with two disordered sequence motifs, probably acting as phosphorylation-dependent seeds in Whi5 folding/unfolding. Thus, the widely disordered Whi5 appears to act as a hierarchical “date hub” that has evolutionarily assayed an original way of modular organization before being supplanted by the globular, multi-domain structured Rb, more suitable to cover the role of a “party hub”.
structural disorder; protein evolution; protein hub; date hub; party hub; multisite phosphorylation; systems biology; cell cycle
The PRINTS database of protein family 'fingerprints' is a diagnostic resource that complements the PROSITE dictionary of sites and patterns. Unlike regular expressions, fingerprints exploit groups of conserved motifs within sequence alignments to build characteristic signatures of family membership. Thus fingerprints inherently offer improved diagnostic reliability by virtue of the mutual context provided by motif neighbours. To date, 600 fingerprints have been constructed and stored in PRINTS, representing a 50% increase in the size of the database in the last year. The current version, 13.0, encodes approximately 3000 motifs, covering a range of globular and membrane proteins, modular polypeptides, and so on. The database is accessible via UCL's Bioinformatics World Wide Web (WWW) server at http://www.biochem.ucl.ac.uk/bsm/dbbrowser/. We describe here progress with the database, its Web interface, and a recent exciting development: the integration of a novel colour alignment editor (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA), which allows visualisation and interactive manipulation of PRINTS alignments over the Internet.
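The contrast between a single pattern and a fingerprint can be made concrete with a small sketch. It is a deliberate simplification: real fingerprints are built from aligned motifs rather than regular expressions, and the motif patterns, function names and sequences below are hypothetical. The point shown is only that requiring several motifs in the correct order rejects a sequence that a lone pattern would accept.

```python
import re


def match_fingerprint(sequence, motifs):
    """Greedy left-to-right scan: each motif is searched for after the end of
    the previously matched motif. Returns (motif index, 0-based start) pairs.
    Greedy matching is a simplification and may miss the best ordering."""
    position = 0
    hits = []
    for index, pattern in enumerate(motifs):
        match = re.compile(pattern).search(sequence, position)
        if match:
            hits.append((index, match.start()))
            position = match.end()
    return hits


def is_family_member(sequence, motifs, min_motifs=None):
    """Accept a sequence only if enough motifs are found in the expected order."""
    required = len(motifs) if min_motifs is None else min_motifs
    return len(match_fingerprint(sequence, motifs)) >= required


# Hypothetical three-motif fingerprint and toy sequences.
fingerprint = ["G.GK[ST]", "DE.D", "W..F"]
member = "AAGAGKSAAADEADAAWAAF"
chance_match = "AADEADAAAA"

print(match_fingerprint(member, fingerprint))
print(is_family_member(member, fingerprint))
print(is_family_member(chance_match, fingerprint))
```

Lowering `min_motifs` would tolerate partial matches, which trades specificity for sensitivity.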
Many biological responses to intra- and extracellular stimuli are regulated through complex networks of transient protein interactions where a globular domain in one protein recognizes a linear peptide from another, creating a relatively small contact interface. These peptide stretches are often found in unstructured regions of proteins, and contain a consensus motif complementary to the interaction surface displayed by their binding partners. While most current methods for the de novo discovery of such motifs exploit their tendency to occur in disordered regions, our work here focuses on another observation: upon binding to their partner domain, motifs adopt a well-defined structure. Indeed, through the analysis of all peptide-mediated interactions of known high-resolution three-dimensional (3D) structure, we found that the structure of the peptide may be as characteristic as the consensus motif, and help identify target peptides even though they do not match the established patterns. Our analyses of the structural features of known motifs reveal that they tend to have a particular stretched and elongated structure, unlike most other peptides of the same length. Accordingly, we have implemented a strategy based on a Support Vector Machine that uses these features, along with other structure-encoded information about binding interfaces, to search the set of protein interactions of known 3D structure and to identify unnoticed peptide-mediated interactions among them. We have also derived consensus patterns for these interactions, whenever enough information was available, and compared our results with established linear motif patterns and their binding domains. Finally, to cross-validate our identification strategy, we scanned interactome networks from four model organisms with our newly derived patterns to see if any of them occurred more often than expected.
Indeed, we found significant over-representations for 64 domain-motif interactions, 46 of which had not been described before, involving over 6,000 interactions in total for which we could suggest the molecular details determining the binding.
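The idea of a pattern occurring "more often than expected" in an interactome can be sketched as a one-sided hypergeometric test. This is an assumption made for illustration; the abstract does not state which statistical test was used, and the function name and the counts below are hypothetical.

```python
from math import comb


def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: all protein pairs considered
    K: pairs in which one protein carries the domain and the other the motif
    n: pairs observed to interact
    k: interacting pairs that are also domain-motif pairs
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts: 40 of 500 interactions are domain-motif pairs,
# while only 300 of 10,000 candidate pairs are domain-motif pairs overall.
p_value = hypergeom_tail(k=40, K=300, n=500, N=10_000)
print(f"p = {p_value:.3g}")
```

When many domain-motif combinations are tested, the resulting p-values would additionally need a multiple-testing correction.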
Protein-protein interactions are paramount in every aspect of cellular life. Some proteins form large macromolecular complexes that execute core functionalities of the cell, while others transmit information in signalling networks to co-ordinate these processes. The latter type, of more transient nature, often occurs through the recognition of a small linear sequence motif in one protein by a specialized globular domain in the other. These peptide stretches often contain a consensus pattern complementary to the interaction surface displayed by their binding partners, and adopt a well-defined structure upon binding. This structural information is currently available only from high-resolution three-dimensional (3D) structures, and can be as characteristic as the consensus motif itself. In this manuscript, we present a strategy to identify novel domain-motif interactions (DMIs) among the set of protein complexes of known 3D structure, which provides information on the consensus motif and binding domain and also allows ready identification of the key interacting residues. A detailed knowledge of the interface is critical for planning further functional studies and for developing interfering elements, be they drug-like compounds or novel engineered binding proteins or peptides. The small interfaces typical of DMIs make them interesting candidates for all these applications.
Emerging viral diseases, most of which are caused by the transmission of viruses from animals to humans, pose a threat to public health. Discovering pathogenic viruses through surveillance is the key to preparedness for this potential threat. Next generation sequencing (NGS) helps us to identify viruses without the design of a specific PCR primer. The major task in NGS data analysis is taxonomic identification for vast numbers of sequences. However, taxonomic identification via a BLAST search against all the known sequences is a computational bottleneck.
Here we propose an enhanced lowest-common-ancestor based method (ELM) to effectively identify viruses from massive sequence data. To reduce the computational cost, ELM uses a customized database composed only of viral sequences for the BLAST search. At the same time, ELM adopts a novel criterion to suppress the rise in false positive assignments caused by the small database. As a result, identification by ELM is more than 1,000 times faster than the conventional methods without loss of accuracy.
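The lowest-common-ancestor step that such methods build on can be sketched as follows. This is a generic LCA assignment over BLAST hits and does not reproduce ELM's novel criterion, which the abstract does not specify; the function names, the `top_fraction` score cut-off and the toy lineages are assumptions made for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each ordered root to leaf)."""
    if not lineages:
        return None
    lca = None
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            lca = ranks[0]
        else:
            break
    return lca


def assign_read(hits, top_fraction=0.9):
    """Assign a read to the LCA of its best hits.

    hits: list of (bit_score, lineage) pairs from a BLAST search.
    Only hits scoring at least top_fraction of the best bit score are kept
    (an assumed, commonly used filter, not ELM's criterion).
    """
    if not hits:
        return None
    best = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits if score >= top_fraction * best]
    return lowest_common_ancestor(kept)


# Toy hits with simplified lineages.
hits = [
    (200, ["Viruses", "Riboviria", "Flaviviridae", "Flavivirus", "Dengue virus"]),
    (195, ["Viruses", "Riboviria", "Flaviviridae", "Flavivirus", "Zika virus"]),
    (90, ["Viruses", "Riboviria", "Coronaviridae", "Betacoronavirus"]),
]
print(assign_read(hits))
print(assign_read(hits, top_fraction=0.4))
```

A stricter cut-off keeps fewer hits and yields a more specific assignment, while a looser one pushes the assignment towards the root.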
We anticipate that ELM will contribute to direct diagnosis of viral infections. The web server and the customized viral database are freely available at http://bioinformatics.czc.hokudai.ac.jp/ELM/.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-254) contains supplementary material, which is available to authorized users.
Next generation sequencing; Virus discovery; Diagnostic virology; Virome; Taxonomic identification