Life sciences are yielding huge data sets that underpin scientific discoveries fundamental to improvement in human health, agriculture and the environment. In support of these discoveries, a plethora of databases and tools are deployed, in technically complex and diverse implementations, across a spectrum of scientific disciplines. The corpus of documentation of these resources is fragmented across the Web, with much redundancy, and has lacked a common standard of information. The outcome is that scientists must often struggle to find, understand, compare and use the best resources for the task at hand.
Here we present a community-driven curation effort, supported by ELIXIR—the European infrastructure for biological information—that aspires to a comprehensive and consistent registry of information about bioinformatics resources. The sustainable upkeep of this Tools and Data Services Registry is assured by a curation effort driven by and tailored to local needs, and shared amongst a network of engaged partners.
As of November 2015, the registry includes 1785 resources, with depositions from 126 individual registrations including 52 institutional providers and 74 individuals. With community support, the registry can become a standard for dissemination of information about bioinformatics resources: we welcome everyone to join us in this common endeavour. The registry is freely available at https://bio.tools.
Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of experimental findings. Often these activities are perceived as a role that bioinformaticians and computer scientists have to take on with little or no input from the experimental biologist. On the contrary, biological researchers, being the producers and often the end users of such data, have a major role in enabling biological data integration. The quality and usefulness of data integration depend on the existence and adoption of standards, shared formats, and mechanisms that are suitable for biological researchers to submit and annotate the data, so that they can be easily searched, conveniently linked and consequently used for further biological analysis and discovery. Here, we provide background on what data integration is from a computational science point of view, how it has been applied to biological research, which key aspects contributed to its success, and future directions.
Data integration; Standards; Bioinformatics; Data driven; Open sciences
protein phosphorylation; disease; evolution; cell signaling; systems biology; bioinformatics; phosphorylation networks
Despite the investments in malaria research, an effective vaccine has not yet been developed and the causative parasites are becoming increasingly resistant to most of the available drugs. PfATP6, the sarco/endoplasmic reticulum Ca2+ pump (SERCA) of P. falciparum, has been recently genetically validated as a potential antimalarial target and cyclopiazonic acid (CPA) has been found to be a potent inhibitor of SERCAs in several organisms, including P. falciparum. In position 263, PfATP6 displays a leucine residue, whilst the corresponding position in the mammalian SERCA is occupied by a glutamic acid. The PfATP6 L263E mutation has been studied in relation to the artemisinin inhibitory effect on P. falciparum and recent studies have provided evidence that the parasite with this mutation is more susceptible to CPA. Here, we characterized, for the first time, the interaction of CPA with PfATP6 and its mammalian counterpart to understand similarities and differences in the mode of binding of the inhibitor to the two Ca2+ pumps. We found that, even though CPA does not directly interact with the residue in position 263, the presence of a hydrophobic residue in this position in PfATP6 rather than a negatively charged one, as in the mammalian SERCA, entails a conformational arrangement of the binding pocket which, in turn, determines a relaxation of CPA leading to a different binding mode of the compound. Our findings highlight differences between the plasmodial and human SERCA CPA-binding pockets that may be exploited to design CPA derivatives more selective toward PfATP6. Proteins 2015; 83:564–574. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
SERCA; pfATP6; malaria; CPA; homology modeling; molecular dynamics; MM-GBSA
Summary: Rapid technological advances have led to an explosion of biomedical data in recent years. The pace of change has inspired new collaborative approaches for sharing materials and resources to help train life scientists both in the use of cutting-edge bioinformatics tools and databases and in how to analyse and interpret large datasets. A prototype platform for sharing such training resources was recently created by the Bioinformatics Training Network (BTN). Building on this work, we have created a centralized portal for sharing training materials and courses, including a catalogue of trainers and course organizers, and an announcement service for training events. For course organizers, the portal provides opportunities to promote their training events; for trainers, the portal offers an environment for sharing materials, for gaining visibility for their work and promoting their skills; for trainees, it offers a convenient one-stop shop for finding suitable training resources and identifying relevant training events and activities locally and worldwide.
Availability and implementation:
The 14-3-3s are a family of dimeric, evolutionarily conserved pSer/pThr binding proteins that play a key role in multiple biological processes by interacting with a plethora of client proteins. Giardia duodenalis is a flagellated protozoan that affects millions of people worldwide, causing an acute and chronic diarrheal disease. The single giardial 14-3-3 isoform (g14-3-3), unique in the 14-3-3 family, needs the constitutive phosphorylation of Thr214 and the polyglycylation of its C-terminus to be fully functional in vivo. Alteration of the phosphorylation and polyglycylation status affects the parasite's differentiation into the cyst stage. To further investigate the role of these post-translational modifications, the crystal structure of g14-3-3 was solved in the unmodified apo form. Oligomers of g14-3-3 were observed due to domain swapping events at the protein C-terminus. The formation of filaments was supported by TEM. Mutational analysis, in combination with native PAGE and chemical cross-linking, proved that polyglycylation prevents oligomerization. In silico phosphorylation and molecular dynamics simulations supported a structural role for the phosphorylation of Thr214 in promoting target binding. Our findings highlight unique structural features of g14-3-3, opening novel perspectives on the evolutionary history of this protein family and suggesting the possibility of developing anti-giardial drugs targeting g14-3-3.
The mountains of data thrusting from the new landscape of modern high-throughput biology are irrevocably changing biomedical research and creating a near-insatiable demand for training in data management and manipulation and data mining and analysis. Among life scientists, from clinicians to environmental researchers, a common theme is the need not just to use, and gain familiarity with, bioinformatics tools and resources but also to understand their underlying fundamental theoretical and practical concepts. Providing bioinformatics training to empower life scientists to handle and analyse their data efficiently, and progress their research, is a challenge across the globe. Delivering good training goes beyond traditional lectures and resource-centric demos, using interactivity, problem-solving exercises and cooperative learning to substantially enhance training quality and learning outcomes. In this context, this article discusses various pragmatic criteria for identifying training needs and learning objectives, for selecting suitable trainees and trainers, for developing and maintaining training skills and evaluating training quality. Adherence to these criteria may help not only to guide course organizers and trainers on the path towards bioinformatics training excellence but, importantly, also to improve the training experience for life scientists.
bioinformatics; training; bioinformatics courses; training life scientists; train the trainers
Summary: We present iAnn, an open source community-driven platform for dissemination of life science events, such as courses, conferences and workshops. iAnn allows automatic visualisation and integration of customised event reports. A central repository lies at the core of the platform: curators add submitted events, and these are subsequently accessed via web services. Thus, once an iAnn widget is incorporated into a website, it permanently shows timely relevant information as if it were native to the remote site. At the same time, announcements submitted to the repository are automatically disseminated to all portals that query the system. To facilitate the visualization of announcements, iAnn provides powerful filtering options and views, integrated in Google Maps and Google Calendar. All iAnn widgets are freely available.
Motivation: The need for new drugs and new targets is particularly compelling in an era that is witnessing an alarming increase of drug resistance in human pathogens. The identification of new targets of known drugs is a promising approach, which has proven successful in several cases. Here, we describe a database that includes information on 5153 putative drug–target pairs for 150 human pathogens derived from available drug–target crystallographic complexes.
Availability and implementation: The TiPs database is freely available at http://biocomputing.it/tips.
Monitoring resistance phenotypes for Plasmodium falciparum, using in vitro growth assays, and relating findings to parasite genotype has proved particularly challenging for the study of resistance to artemisinins.
Plasmodium falciparum isolates cultured from 28 returning travellers diagnosed with malaria were assessed for sensitivity to artemisinin, artemether, dihydroartemisinin and artesunate and findings related to mutations in pfatp6 and pfmdr1.
Resistance to artemether in vitro was significantly associated with a pfatp6 haplotype encoding two amino acid substitutions, pfatp6 A623E and S769N (mean IC50 (95% CI) values of 8.2 (5.7–10.7) nM for A623/S769 versus 13.5 (9.8–17.3) nM for 623E/769N, a mean increase of 65%; p = 0.012). Increased copy number of pfmdr1 was not itself associated with increased IC50 values for artemether, but when the pfatp6 haplotype and increased copy number of pfmdr1 were examined together, a highly significant association was noted with IC50 values for artemether (mean IC50 (95% CI) values of 8.7 (5.9–11.6) versus 16.3 (10.7–21.8) nM, a mean increase of 87%; p = 0.0068). Previously described SNPs in pfmdr1 are also associated with differences in sensitivity to some artemisinins.
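The percentage increases quoted above follow directly from the reported mean IC50 values. A minimal check, in which the function and variable names are illustrative and not from the study:

```python
def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Mean IC50 values (nM) as reported in the abstract.
haplotype_only = percent_increase(8.2, 13.5)
haplotype_and_copy_number = percent_increase(8.7, 16.3)

print(round(haplotype_only), round(haplotype_and_copy_number))  # prints: 65 87
```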
These findings were further explored in molecular modelling experiments that suggest mutations in pfatp6 are unlikely to affect differential binding of artemisinins at their proposed site, whereas there may be differences in such binding associated with mutations in pfmdr1. Implications for a hypothesis that artemisinin resistance may be exacerbated by interactions between PfATP6 and PfMDR1 and for epidemiological studies to monitor emerging resistance are discussed.
Artemisinin resistance; pfmdr1; pfatp6; Gene copy number; Malaria; Travellers; Plasmodium
Funding bodies are increasingly recognizing the need to provide graduates and researchers with access to short intensive courses in a variety of disciplines, in order both to improve the general skills base and to provide solid foundations on which researchers may build their careers. In response to the development of ‘high-throughput biology’, the need for training in the field of bioinformatics, in particular, is seeing a resurgence: it has been defined as a key priority by many institutions and research programmes and is now an important component of many grant proposals. Nevertheless, when it comes to planning and preparing to meet such training needs, tension arises between the reward structures that predominate in the scientific community, which compel individuals to publish or perish, and the time that must be devoted to the design, delivery and maintenance of high-quality training materials. Conversely, there is much relevant teaching material and training expertise available worldwide that, were it properly organized, could be exploited by anyone who needs to provide training or needs to set up a new course. To do this, however, the materials would have to be centralized in a database and clearly tagged in relation to target audiences, learning objectives, etc. Ideally, they would also be peer reviewed, and easily and efficiently accessible for downloading. Here, we present the Bioinformatics Training Network (BTN), a new enterprise that has been initiated to address these needs, and review it in relation to similar initiatives and collections.
Bioinformatics; training; end users; bioinformatics courses; learning bioinformatics
Linear motifs are short, evolutionarily plastic components of regulatory proteins and provide low-affinity interaction interfaces. These compact modules play central roles in mediating every aspect of the regulatory functionality of the cell. They are particularly prominent in mediating cell signaling, controlling protein turnover and directing protein localization. Given their importance, our understanding of motifs is surprisingly limited, largely as a result of the difficulty of discovery, both experimentally and computationally. The Eukaryotic Linear Motif (ELM) resource at http://elm.eu.org provides the biological community with a comprehensive database of known experimentally validated motifs, and an exploratory tool to discover putative linear motifs in user-submitted protein sequences. The current update of the ELM database comprises 1800 annotated motif instances representing 170 distinct functional classes, including approximately 500 novel instances and 24 novel classes. Several older motif class entries have also been revisited, improving annotation and adding novel instances. Furthermore, the addition of full-text search capabilities, an enhanced interface and simplified batch download has improved the overall accessibility of the ELM data. Conservation and structural attributes have been incorporated into the motif discovery portion of the ELM resource to help users discriminate biologically relevant motifs from stochastically occurring non-functional instances.
The function of proteins is often mediated by short linear segments of their amino acid sequence, called Short Linear Motifs or SLiMs, the identification of which can provide important information about a protein's function. However, the short length of the motifs and their variable degree of conservation make their identification hard, since it is difficult to correctly estimate the statistical significance of their occurrence. Consequently, only a small fraction of them have been discovered so far. We describe here an approach for the discovery of SLiMs based on their occurrence in evolutionarily unrelated proteins belonging to the same biological, signalling or metabolic pathway and give specific examples of its effectiveness in both rediscovering known motifs and in discovering novel ones. An automatic implementation of the procedure, available for download, allows significant motifs to be identified, automatically annotated with functional, evolutionary and structural information and organized in a database that can be inspected and queried. An instance of the database populated with pre-computed data on seven organisms is accessible through a publicly available server and we believe it constitutes by itself a useful resource for the life sciences (http://www.biocomputing.it/modipath).
Genes involved in post-mating processes of multiple mating organisms are known to evolve rapidly due to coevolution driven by sexual conflict among male-female interacting proteins. In the malaria mosquito Anopheles gambiae - a monandrous species in which sexual conflict is expected to be absent or minimal - recent data strongly suggest that proteolytic enzymes specifically expressed in the female lower reproductive tissues are involved in the processing of male products transferred to females during mating. In order to better understand the role of selective forces underlying the evolution of proteins involved in post-mating responses, we analysed a cluster of genes encoding three serine proteases that are down-regulated after mating, two of which are specifically expressed in the atrium and one in the spermatheca of A. gambiae females.
The analysis of polymorphisms and divergence of these female-expressed proteases in closely related species of the A. gambiae complex revealed a high level of replacement polymorphisms consistent with relaxed evolutionary constraints on duplicated genes, allowing novel replacements to be rapidly fixed to perform new or more specific functions. Adaptive evolution was detected in several codons of the three genes and hints of episodic selection were also found. In addition, the structural modelling of these proteases highlighted some important differences in their substrate specificity, and provided evidence that a number of sites evolving under selective pressures lie relatively close to the catalytic triad and/or on the edge of the specificity pocket, known to be involved in substrate recognition or binding. The observed patterns suggest that these proteases may interact with factors transferred by males during mating (e.g. substrates, inhibitors or pathogens) and that they may have evolved differently in independent A. gambiae lineages.
Our results - also examined in light of constraints in the application of selection-inference methods to the closely related species of the A. gambiae complex - reveal an unexpectedly intricate evolutionary scenario. Further experimental analyses are needed to investigate the biological functions of these genes in order to better interpret their molecular evolution and to assess whether they represent possible targets for limiting the fertility of Anopheles mosquitoes in malaria vector control strategies.
molecular evolution; reproduction; adaptive evolution; gene duplication; Anopheles gambiae complex
Resistance of malaria strains to chloroquine is known to be associated with a parasite protein named PfCRT, the mutated form of which is able to reduce chloroquine accumulation in the digestive vacuole of the pathogen. Whether the protein mediates extrusion of the drug by acting as a channel or as a carrier, and what the protonation state of its chloroquine substrate is, are the subject of scientific debate. We present here an analytical approach that explores which combinations of hypotheses on the mechanism of transport and the protonation state of chloroquine are consistent with available equilibrium experimental data. We show that the available experimental data are not, by themselves, sufficient to conclude whether the protein acts as a channel or as a transporter, which explains the origin of their different interpretation by different authors. Interestingly, though, each of the two models is only consistent with a subset of hypotheses on the protonation state of the transported molecule. The combination of these results with a sequence and structure analysis of PfCRT, which strongly suggests that the molecule is a carrier, indicates that the transported species is the mono-protonated form of chloroquine, the di-protonated form, or both. We believe that our results, besides shedding light on the mechanism of chloroquine resistance in P. falciparum, have implications for the development of novel therapies against resistant malaria strains and demonstrate the usefulness of an approach combining systems biology strategies with structural bioinformatics and experimental data.
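As background to the protonation-state question, the relative abundance of the neutral, mono-protonated and di-protonated forms of a diprotic weak base at a given pH follows from its two pKa values. The sketch below is not the analytical approach of the paper. The pKa values (8.1 and 10.2) and the pH values (7.4 and 5.2) are approximate figures often quoted for chloroquine and for cytosol-like and vacuole-like compartments; they are assumptions here, not data from the abstract.

```python
def protonation_fractions(ph: float, pka1: float, pka2: float) -> dict[str, float]:
    """Equilibrium fractions of a diprotic weak base B at a given pH.

    pka1 describes BH2(2+) <-> BH(+) + H(+); pka2 describes BH(+) <-> B + H(+).
    """
    h = 10.0 ** -ph
    ka1 = 10.0 ** -pka1
    ka2 = 10.0 ** -pka2
    denom = h * h + ka1 * h + ka1 * ka2
    return {
        "di_protonated": h * h / denom,
        "mono_protonated": ka1 * h / denom,
        "neutral": ka1 * ka2 / denom,
    }


# Assumed, approximate values (not taken from the abstract).
CQ_PKA1, CQ_PKA2 = 8.1, 10.2
for label, ph in [("cytosol-like", 7.4), ("vacuole-like", 5.2)]:
    fractions = protonation_fractions(ph, CQ_PKA1, CQ_PKA2)
    print(label, {name: round(value, 4) for name, value in fractions.items()})
```

Under these assumed values the di-protonated form dominates at acidic pH, which is why the protonation state of the transported species is a meaningful variable in the channel-versus-carrier analysis.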
The Phospho.ELM resource (http://phospho.elm.eu.org) is a relational database designed to store in vivo and in vitro phosphorylation data extracted from the scientific literature and phosphoproteomic analyses. The resource has been actively developed for more than 7 years and currently comprises 42 574 serine, threonine and tyrosine non-redundant phosphorylation sites. Several new features have been implemented, such as structural disorder/order and accessibility information and a conservation score. Additionally, the conservation of the phosphosites can now be visualized directly on the multiple sequence alignment used for the score calculation. Finally, special emphasis has been put on linking to external resources such as interaction networks and other databases.
Phospho3D is a database of three-dimensional (3D) structures of phosphorylation sites (P-sites) derived from the Phospho.ELM database, which also collects information on the residues surrounding the P-site in space (3D zones). The database also provides the results of a large-scale structural comparison of the 3D zones versus a representative dataset of structures, thus associating each P-site with a number of structurally similar sites. The new version of Phospho3D presents an 11-fold increase in the number of 3D sites and incorporates several additional features, including new structural descriptors, the possibility of selecting non-redundant sets of 3D structures and the availability for download of non-redundant sets of structurally annotated P-sites. Moreover, it features P3Dscan, a new functionality that allows the user to submit a protein structure and scan it against the 3D zones collected in the Phospho3D database. Phospho3D version 2.0 is available at: http://www.phospho3d.org/.
Linear motifs are short segments of multidomain proteins that provide regulatory functions independently of protein tertiary structure. Much of intracellular signalling passes through protein modifications at linear motifs. Many thousands of linear motif instances, most notably phosphorylation sites, have now been reported. Although clearly very abundant, linear motifs are difficult to predict de novo in protein sequences due to the difficulty of obtaining robust statistical assessments. The ELM resource at http://elm.eu.org/ provides an expanding knowledge base, currently covering 146 known motifs, with annotation that includes >1300 experimentally reported instances. ELM is also an exploratory tool for suggesting new candidates of known linear motifs in proteins of interest. Information about protein domains, protein structure and native disorder, cellular and taxonomic contexts is used to reduce or deprecate false positive matches. Results are graphically displayed in a ‘Bar Code’ format, which also displays known instances from homologous proteins through a novel ‘Instance Mapper’ protocol based on PHI-BLAST. ELM server output provides links to the ELM annotation as well as to a number of remote resources. Using the links, researchers can explore the motifs, proteins, complex structures and associated literature to evaluate whether candidate motifs might be worth experimental investigation.
Many proteins are highly modular, being assembled from globular domains and segments of natively disordered polypeptides. Linear motifs, short sequence modules functioning independently of protein tertiary structure, are most abundant in natively disordered polypeptides but are also found in accessible parts of globular domains, such as exposed loops. The prediction of novel occurrences of known linear motifs attempts the difficult task of distinguishing functional matches from stochastically occurring non-functional matches. Although functionality can only be confirmed experimentally, confidence in a putative motif is increased if a motif exhibits attributes associated with functional instances such as occurrence in the correct taxonomic range, cellular compartment, conservation in homologues and accessibility to interacting partners. Several tools now use these attributes to classify putative motifs based on confidence of functionality.
Current methods assessing motif accessibility do not consider much of the information available, either predicting accessibility from primary sequence or regarding any motif occurring in a globular region as low confidence. We present a method considering accessibility and secondary structural context derived from experimentally solved protein structures to rectify this situation. Putatively functional motif occurrences are mapped onto a representative domain, given that a high quality reference SCOP domain structure is available for the protein itself or a close relative. Candidate motifs can then be scored for solvent-accessibility and secondary structure context. The scores are calibrated on a benchmark set of experimentally verified motif instances compared with a set of random matches. A combined score yields 3-fold enrichment for functional motifs assigned to high confidence classifications and 2.5-fold enrichment for random motifs assigned to low confidence classifications. The structure filter is implemented as a pipeline with both a graphical interface via the ELM resource and through a Web Service protocol.
New occurrences of known linear motifs require experimental validation as the bioinformatics tools currently have limited reliability. The ELM structure filter will aid users assessing candidate motifs presenting in globular structural regions. Most importantly, it will help users to decide whether to expend their valuable time and resources on experimental testing of interesting motif candidates.
The occurrence of very similar structural motifs brought about by different parts of non homologous proteins is often indicative of a common function. Indeed, relatively small local structures can mediate binding to a common partner, be it a protein, a nucleic acid, a cofactor or a substrate. While it is relatively easy to identify short amino acid or nucleotide sequence motifs in a given set of proteins or genes, and many methods do exist for this purpose, much more challenging is the identification of common local substructures, especially if they are formed by non consecutive residues in the sequence.
Here we describe a publicly available tool that identifies common structural motifs shared by different non homologous proteins in an unsupervised mode. The motifs can be as short as three residues and need not be contiguous or even present in the same order in the sequence. Users can submit a set of protein structures, whether or not they are thought to share a common function (e.g. they bind similar ligands, or share a common epitope). The server finds and lists structural motifs composed of three or more spatially well conserved residues shared by at least three of the submitted structures. The method uses a local structural comparison algorithm to identify subsets of similar amino acids between each pair of input protein chains and a clustering procedure to group similarities shared among different structure pairs.
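The grouping step described above can be sketched as follows. This is a simplified illustration, not the tool's actual clustering procedure: it assumes the pairwise structural comparison has already been run, that each match is reported as a pair of sites, and that a site is identified by a structure ID plus a set of residue labels. All names and identifiers below are hypothetical.

```python
from collections import defaultdict

# A site is (structure_id, residue labels); a pairwise match links two sites.
Site = tuple[str, frozenset[str]]


def cluster_matches(
    matches: list[tuple[Site, Site]], min_structures: int = 3
) -> list[set[Site]]:
    """Group pairwise matches into clusters with a union-find structure and
    keep only clusters that span at least `min_structures` structures."""
    parent: dict[Site, Site] = {}

    def find(site: Site) -> Site:
        parent.setdefault(site, site)
        while parent[site] != site:
            parent[site] = parent[parent[site]]  # path halving
            site = parent[site]
        return site

    for site_a, site_b in matches:
        root_a, root_b = find(site_a), find(site_b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups: dict[Site, set[Site]] = defaultdict(set)
    for site in list(parent):
        groups[find(site)].add(site)

    return [
        group
        for group in groups.values()
        if len({structure_id for structure_id, _ in group}) >= min_structures
    ]
```

In this sketch two matches are merged only when they share an identical site, which is stricter than a real implementation would likely be; a partial-overlap criterion would be a natural refinement.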
FunClust is fast, completely sequence independent, and does not need an a priori knowledge of the motif to be found. The output consists of a list of aligned structural matches displayed in both tabular and graphical form. We show here examples of its usefulness by searching for the largest common structural motifs in test sets of non homologous proteins and showing that the identified motifs correspond to a known common functional feature.
Phospho.ELM is a manually curated database of eukaryotic phosphorylation sites. The resource includes data collected from published literature as well as high-throughput data sets.
The current release of Phospho.ELM (version 7.0, July 2007) contains 4078 phospho-protein sequences covering 12 025 phospho-serine, 2362 phospho-threonine and 2083 phospho-tyrosine sites. The entries provide information about the phosphorylated proteins and the exact position of known phosphorylated instances, the kinases responsible for the modification (where known) and links to bibliographic references. The database entries have hyperlinks to easily access further information from UniProt, PubMed, SMART, ELM, MSD as well as links to the protein interaction databases MINT and STRING.
A new BLAST search tool, complementary to retrieval by keyword and UniProt accession number, allows users to submit a protein query (by sequence or UniProt accession) to search against the curated data set of phosphorylated peptides.
Phospho.ELM is available online at: http://phospho.elm.eu.org
3dLOGO is a web server for the identification and analysis of conserved protein 3D substructures. Given a set of residues in a PDB (Protein Data Bank) chain, the server detects the matching substructure(s) in a set of user-provided protein structures, generates a multiple structure alignment centered on the input substructures and highlights other residues whose structural conservation becomes evident after the defined superposition. Conserved residues are proposed to the user for highlighting functional areas, deriving refined structural motifs or building sequence patterns. Residue structural conservation can be visualized through an expressly designed Java application, 3dProLogo, which is a 3D implementation of a sequence logo. The 3dLOGO server, with related documentation, is available at http://3dlogo.uniroma2.it/
SH3-Hunter (http://cbm.bio.uniroma2.it/SH3-Hunter/) is a web server for the recognition of putative SH3 domain interaction sites on protein sequences. Given an input query consisting of one or more protein sequences, the server identifies peptides containing poly-proline binding motifs and associates them with a list of SH3 domains, in order to compose peptide–domain pairs. The server can also accept a list of peptides and allows users to upload an input file in the required format. An accurate selection of SH3 domains is available and users can also submit their own SH3 domain sequence.
SH3-Hunter evaluates which peptide–domain pair represents a possible interaction pair and produces as output a list of significant interaction sites for each query protein. Each proposed interaction site is associated with a propensity score and with sensitivity and precision levels for the prediction. The server prediction capability is based on a neural network model integrating high-throughput pep-spot data with structural information extracted from known SH3-peptide complexes.
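The first step, extracting candidate peptides that contain a poly-proline motif, can be illustrated with a simple pattern scan. The sketch below is not the server's implementation and does not reproduce its neural network scoring. It assumes the widely used PxxP core as the motif definition, and the function name, flank length and example sequence are made up for illustration.

```python
import re

# Lookahead so that overlapping PxxP cores are all reported.
PXXP_CORE = re.compile(r"(?=(P..P))")


def find_pxxp_peptides(sequence: str, flank: int = 3) -> list[tuple[int, str]]:
    """Return (1-based start of the PxxP core, peptide including flanks)
    for every PxxP core found in a protein sequence."""
    sequence = sequence.upper()
    hits = []
    for match in PXXP_CORE.finditer(sequence):
        start = match.start()
        left = max(0, start - flank)
        right = min(len(sequence), start + 4 + flank)
        hits.append((start + 1, sequence[left:right]))
    return hits


# Made-up sequence, for illustration only.
print(find_pxxp_peptides("AAPPLPPRAA", flank=0))  # prints: [(3, 'PPLP'), (4, 'PLPP')]
```

Each candidate peptide would then be paired with SH3 domains and scored; that scoring is the part handled by the trained model and is outside the scope of this sketch.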
We performed an exhaustive search for local structural similarities in an ensemble of non-redundant protein functional sites. With the purpose of finding new examples of convergent evolution, we selected only those matching sites composed of structural regions whose residue order is inverted in the relative protein sequences.
A novel case of local analogy was detected between members of the ABC transporter and of the HprK/P families in their ATP binding site. This case cannot be explained by circular permutation events, since the residues of one of the region pairs are located in reverse order in the sequences of the two protein families. One of the analogous binding sites, the one identified in HprK/P, is known to also bind pyrophosphate, which is used as the preferred energy source in its kinase and phosphorylase activities.
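The distinction drawn above between circular permutation and order inversion can be made concrete. A minimal sketch, assuming each structurally matched residue pair is given as (position in sequence A, position in sequence B); the function name and the example positions are hypothetical and are not taken from the ABC transporter or HprK/P structures. A circular permutation preserves the order of the matched residues up to a rotation, whereas a reversal does not.

```python
def classify_residue_order(pairs: list[tuple[int, int]]) -> str:
    """Classify the sequence order of structurally matched residues.

    Residues are sorted by their position in sequence A; the resulting order
    of positions in sequence B is then inspected.
    """
    b_order = [pos_b for _, pos_b in sorted(pairs)]
    if b_order == sorted(b_order):
        return "collinear"
    n = len(b_order)
    # A rotation of an increasing sequence has exactly one cyclic descent.
    cyclic_descents = sum(b_order[i] > b_order[(i + 1) % n] for i in range(n))
    if cyclic_descents == 1:
        return "compatible with circular permutation"
    return "not explained by circular permutation"


# Hypothetical positions, for illustration only.
print(classify_residue_order([(10, 40), (20, 50), (30, 60)]))
print(classify_residue_order([(10, 50), (20, 60), (30, 40)]))
print(classify_residue_order([(10, 60), (20, 50), (30, 40)]))
```

The test is only informative for three or more matched residues, because any order of two residues is a rotation of the sorted order.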
The discovery of this striking molecular similarity, which is also associated with a functional similarity, may help in suggesting new experiments aimed at a deeper understanding of members of the ABC transporter family, which are known to be involved in many serious human diseases.