► Identifies key considerations in target selection and optimisation.
► Approaches to assign useful protein features and structure/function relationships.
► Comparison of latest crystallisation propensity predictors on nonredundant data.
► Discusses single point of reference target selection/optimisation resources.
► Guidance on using the SSPF Target Optimisation Utility (TarO).
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline.
Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
Abbreviations: MSA, Multiple Sequence Alignment; PTM, Post Translational Modification; SSPF, Scottish Structural Proteomics Facility; MCC, Matthews correlation coefficient; AROC, Area Under the Receiver Operating Characteristic curve
Keywords: Target selection; Crystallisation; Structural genomics; Structural biology; Bioinformatics; Construct design
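The MCC listed among the abbreviations is a standard way to score a binary predictor, such as a crystallisation propensity predictor, against known outcomes. The following is a minimal sketch of the textbook formula only; the function name and the example counts are hypothetical and are not taken from the SSPF tools or their benchmarks.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn are the numbers of true positives, true negatives,
    false positives and false negatives. The result ranges from -1
    (total disagreement) through 0 (no better than chance) to +1
    (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the
        # confusion matrix is empty and the ratio is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
example_mcc = matthews_corrcoef(tp=40, tn=30, fp=20, fn=10)
```

Unlike raw accuracy, the MCC uses all four cells of the confusion matrix, which makes it less misleading when positive and negative classes differ greatly in size.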
Summary: Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server.
Availability: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org
Jpred (http://www.compbio.dundee.ac.uk/jpred) is a secondary structure prediction server powered by the Jnet algorithm. Jpred performs over 1000 predictions per week for users in more than 50 countries. The recently updated Jnet algorithm provides a three-state (α-helix, β-strand and coil) prediction of secondary structure at an accuracy of 81.5%. Given either a single protein sequence or a multiple sequence alignment, Jpred derives alignment profiles from which predictions of secondary structure and solvent accessibility are made. The predictions are presented as coloured HTML, plain text, PostScript, PDF and via the Jalview alignment editor to allow flexibility in viewing and applying the data. The new Jpred 3 server includes significant usability improvements that include clearer feedback of the progress or failure of submitted requests. Functional improvements include batch submission of sequences, summary results via email and updates to the search databases. A new software pipeline will enable Jnet/Jpred to continue to be updated in sync with major updates to SCOP and UniProt and so ensures that Jpred 3 will maintain high-accuracy predictions.
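Three-state prediction accuracy of the kind quoted above is commonly reported as a per-residue score (often called Q3): the fraction of residues whose predicted state matches the observed state. The sketch below assumes that reading and uses a hypothetical H/E/C string encoding; it is not code from Jpred or Jnet.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of positions where the predicted state equals the observed state.

    Both arguments are equal-length strings over a three-letter alphabet,
    here assumed to be H (helix), E (strand) and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical six-residue example: five of six states agree.
example_q3 = q3_accuracy("HHHEEC", "HHHECC")
```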
SCANPS performs iterative profile searching similar to PSI-BLAST but with full dynamic programming on each cycle and on-the-fly estimation of significance. This combination gives good sensitivity and selectivity that outperforms PSI-BLAST in domain-searching benchmarks. Although computationally expensive, SCANPS exploits on-chip parallelism (MMX and SSE2 instructions on Intel chips) as well as MPI parallelism to give acceptable turnround times even for large databases. A web server developed to run SCANPS searches is now available at http://www.compbio.dundee.ac.uk/www-scanps. The server interface allows a range of different protein sequence databases to be searched including the SCOP database of protein domains. The server provides the user with regularly updated versions of the main protein sequence databases and is backed up by significant computing resources which ensure that searches are performed rapidly. For SCOP searches, the results may be viewed in a new tree-based representation that reflects the structure of the SCOP hierarchy; this aids the user in placing each hit in the context of its SCOP classification and understanding its relationship to other domains in SCOP.
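To illustrate what "full dynamic programming" means here, the sketch below computes a Smith-Waterman local alignment score between two sequences. It is a simplified illustration, not the SCANPS implementation: the match/mismatch scores and the linear gap penalty are assumed values, whereas a real profile search would use position-specific scores and typically affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b.

    Fills the full dynamic-programming matrix one row at a time, so every
    cell is evaluated (no heuristic pre-filtering of candidate regions).
    Scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The cost is proportional to the product of the two sequence lengths, which is why the abstract describes the approach as computationally expensive and why vector instructions and MPI are used to keep search times acceptable.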
Rising drug resistance is limiting treatment options for infections caused by methicillin-resistant Staphylococcus aureus (MRSA). Herein we provide new evidence that wall teichoic acid (WTA) biogenesis is a remarkable antibacterial target with the capacity to destabilize the cooperative action of penicillin-binding proteins (PBPs) that underlie β-lactam resistance in MRSA. Deletion of gene tarO, encoding the first step of WTA synthesis, resulted in the restoration of sensitivity of MRSA to a unique profile of β-lactam antibiotics with a known selectivity for penicillin-binding protein 2 (PBP2). Of these, cefuroxime was used as a probe to screen for previously approved drugs with a cryptic capacity to potentiate its activity against MRSA. Ticlopidine, the antiplatelet drug Ticlid, strongly potentiated cefuroxime, and this synergy was abolished in strains lacking tarO. The combination was also effective in a Galleria mellonella model of infection. Using both genetic and biochemical strategies, we determined the molecular target of ticlopidine as the N-acetylglucosamine-1-phosphate transferase encoded in gene tarO and provide evidence that WTA biogenesis represents an Achilles heel supporting the cooperative function of PBP2 and PBP4 in creating highly cross-linked muropeptides in the peptidoglycan of S. aureus. This approach represents a new paradigm to tackle MRSA infection.
Methicillin resistance in Staphylococcus aureus depends on the production of mecA, which encodes penicillin-binding protein 2A (PBP2A), an acquired peptidoglycan transpeptidase (TP) with reduced susceptibility to beta-lactam antibiotics. PBP2A crosslinks nascent peptidoglycan when the native TPs are inhibited by beta-lactams. Although mecA expression is essential for beta-lactam resistance, it is not sufficient. Here we show that blocking the expression of wall teichoic acids (WTAs) by inhibiting the first enzyme in the pathway, TarO, sensitizes MRSA strains to beta-lactams even though the beta-lactam-resistant transpeptidase, PBP2A, is still expressed. The dramatic synergy between TarO inhibitors and beta-lactams is noteworthy not simply because strategies to overcome methicillin-resistant S. aureus (MRSA) are desperately needed, but because neither TarO nor the activities of the native TPs are essential in MRSA strains. The “synthetic lethality” of inhibiting TarO and the native TPs suggests a functional connection between ongoing WTA expression and peptidoglycan assembly in S. aureus. Indeed, transmission electron microscopy shows that S. aureus cells blocked in WTA synthesis have extensive defects in septation and cell separation, indicating dysregulated cell wall assembly and degradation. Our studies imply that WTAs play a fundamental role in S. aureus cell division and raise the possibility that synthetic lethal compound combinations may have therapeutic utility for overcoming antibiotic resistant bacterial infections.
The PIPs database (http://www.compbio.dundee.ac.uk/www-pips) is a resource for studying protein–protein interactions in human. It contains predictions of >37 000 high probability interactions of which >34 000 are not reported in the interaction databases HPRD, BIND, DIP or OPHID. The interactions in PIPs were calculated by a Bayesian method that combines information from expression, orthology, domain co-occurrence, post-translational modifications and sub-cellular location. The predictions also take account of the topology of the predicted interaction network. The web interface to PIPs ranks predictions according to their likelihood of interaction broken down by the contribution from each information source and with easy access to the evidence that supports each prediction. Where data exists in OPHID, HPRD, DIP or BIND for a protein pair this is also reported in the output tables returned by a search. A network browser is included to allow convenient browsing of the interaction network for any protein in the database. The PIPs database provides a new resource on protein–protein interactions in human that is straightforward to browse, or can be exploited completely, for interaction network modelling.
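One common way to combine several evidence sources in a Bayesian framework is to multiply a prior odds ratio by a likelihood ratio from each source. The sketch below shows that calculation under a conditional-independence (naive Bayes) assumption, which the abstract does not confirm; the function names, evidence labels and all numbers are hypothetical and are not taken from PIPs.

```python
import math
from typing import Mapping


def posterior_odds(prior_odds: float,
                   likelihood_ratios: Mapping[str, float]) -> float:
    """Posterior odds of interaction = prior odds x product of likelihood ratios.

    Assumes the evidence sources are conditionally independent.
    Sums logarithms rather than multiplying directly to limit
    floating-point underflow when many small ratios are combined.
    """
    if prior_odds <= 0 or any(lr <= 0 for lr in likelihood_ratios.values()):
        raise ValueError("odds and likelihood ratios must be positive")
    log_odds = math.log(prior_odds)
    for lr in likelihood_ratios.values():
        log_odds += math.log(lr)
    return math.exp(log_odds)


def odds_to_probability(odds: float) -> float:
    """Convert odds to a probability in [0, 1)."""
    return odds / (1.0 + odds)


# Hypothetical evidence for one protein pair.
evidence = {"expression": 10.0, "orthology": 100.0}
example_odds = posterior_odds(prior_odds=0.001, likelihood_ratios=evidence)
```

Keeping the per-source ratios separate, as in the dictionary above, also makes it straightforward to report each source's contribution alongside the combined score.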
The MyHits web server (http://myhits.isb-sib.ch) is a new integrated service dedicated to the annotation of protein sequences and to the analysis of their domains and signatures. Guest users can use the system anonymously, with full access to (i) standard bioinformatics programs (e.g. PSI-BLAST, ClustalW, T-Coffee, Jalview); (ii) a large number of protein sequence databases, including standard (Swiss-Prot, TrEMBL) and locally developed databases (splice variants); (iii) databases of protein motifs (Prosite, Interpro); (iv) a precomputed list of matches (‘hits’) between the sequence and motif databases. All databases are updated on a weekly basis and the hit list is kept up to date incrementally. The MyHits server also includes a new collection of tools to generate graphical representations of pairwise and multiple sequence alignments including their annotated features. Free registration enables users to upload their own sequences and motifs to private databases. These are then made available through the same web interface and the same set of analytical tools. Registered users can manage their own sequences and annotations using only web tools and freeze their data in their private database for publication purposes.
Summary: JABAWS is a web services framework that simplifies the deployment of web services for bioinformatics. JABAWS:MSA provides services for five multiple sequence alignment (MSA) methods (Probcons, T-coffee, Muscle, Mafft and ClustalW), and is the system employed by the Jalview multiple sequence analysis workbench since version 2.6. A fully functional, easy to set up server is provided as a Virtual Appliance (VA), which can be run on most operating systems that support a virtualization environment such as VMware or Oracle VirtualBox. JABAWS is also distributed as a Web Application aRchive (WAR) and can be configured to run on a single computer and/or a cluster managed by Grid Engine, LSF or other queuing systems that support DRMAA. JABAWS:MSA provides clients full access to each application's parameters, and allows administrators to specify named parameter preset combinations and execution limits for each application through simple configuration files. The JABAWS command-line client allows integration of JABAWS services into conventional scripts.
Availability and Implementation: JABAWS is made freely available under the Apache 2 license and can be obtained from: http://www.compbio.dundee.ac.uk/jabaws.
TropGENE-DB is a crop information system created to store genetic, molecular and phenotypic data of the numerous yet poorly documented tropical crop species. The most common data stored in TropGENE-DB are information on genetic resources (agro-morphological data, parentages, allelic diversity), molecular markers, genetic maps, results of quantitative trait loci analyses, data from physical mapping, sequences, genes, as well as the corresponding references. TropGENE-DB is organized on a crop basis with currently three running modules (sugarcane, cocoa and banana), with plans to create additional modules for rice, cotton, oil palm, coconut, rubber tree, pineapple, taro, yam and sorghum. The TropGENE-DB information system is accessible for consultation via the internet at http://tropgenedb.cirad.fr. Specific web consultation interfaces have been designed to allow quick consultations as well as complex queries.
Profile–profile methods are well suited to detect remote evolutionary relationships between protein families. Profile Comparer (PRC) is an existing stand-alone program for scoring and aligning hidden Markov models (HMMs), which are based on multiple sequence alignments. Since PRC compares profile HMMs instead of sequences, it can be used to find distant homologues. For this purpose, PRC is used by, for example, the CATH and Pfam-domain databases. As PRC is a profile comparer, it only reports profile HMM alignments and does not produce multiple sequence alignments. We have developed the webPRC server, which makes it straightforward to search for distant homologues or similar alignments in a number of domain databases. In addition, it provides the results both as multiple sequence alignments and aligned HMMs. Furthermore, the user can view the domain annotation, evaluate the PRC hits with the Jalview multiple alignment editor and generate logos from the aligned HMMs or the aligned multiple alignments. Thus, this server assists in detecting distant homologues with PRC as well as in evaluating and using the results. The webPRC interface is available at http://www.ibi.vu.nl/programs/prcwww/.
An extensive study of teichoic acid biosynthesis in the model organism Bacillus subtilis has established teichoic acid polymers as essential components of the gram-positive cell wall. However, similar studies pertaining to therapeutically relevant organisms, such as Staphylococcus aureus, are scarce. In this study we have carried out a meticulous examination of the dispensability of teichoic acid biosynthetic enzymes in S. aureus. By use of an allelic replacement methodology, we examined all facets of teichoic acid assembly, including intracellular polymer production and export. Using this approach we confirmed that the first-acting enzyme (TarO) was dispensable for growth, in contrast to dispensability studies in B. subtilis. Upon further characterization, we demonstrated that later-acting gene products (TarB, TarD, TarF, TarIJ, and TarH) responsible for polymer formation and export were essential for viability. We resolved this paradox by demonstrating that all of the apparently indispensable genes became dispensable in a tarO null genetic background. This work suggests a lethal gain-of-function mechanism where lesions beyond the initial step in wall teichoic acid biosynthesis render S. aureus nonviable. This discovery poses questions regarding the conventional understanding of essential gene sets, garnered through single-gene knockout experiments in bacteria and higher organisms, and points to a novel drug development strategy targeting late steps in teichoic acid synthesis for the infectious pathogen S. aureus.
The use of sequence alignments to understand protein families is ubiquitous in molecular biology. High quality alignments are difficult to build and protein alignment remains one of the largest open problems in computational biology. Misalignments can lead to inferential errors about protein structure, folding, function, phylogeny, and residue importance. Identifying alignment errors is difficult because alignments are built and validated on the same primary criterion: sequence conservation. Local covariation identifies systematic misalignments and is independent of conservation. We demonstrate an alignment curation tool, LoCo, that integrates local covariation scores with the Jalview alignment editor. Using LoCo, we illustrate how local covariation is capable of identifying alignment errors due to the reduction of positional independence in the region of misalignment. We highlight three alignments from the benchmark database, BAliBASE 3, that contain regions of high local covariation, and investigate the causes to illustrate these types of scenarios. Two alignments contain sequential and structural shifts that cause elevated local covariation. Realignment of these misaligned segments reduces local covariation; these alternative alignments are supported with structural evidence. We also show that local covariation identifies active site residues in a validated alignment of paralogous structures. LoCo is available at https://sourceforge.net/projects/locoprotein/files/
Functional annotation is routinely performed for large-scale genomics projects and databases. Researchers working on more specific problems, for instance on an individual pathway or complex, also need to be able to quickly, completely and accurately annotate sequences. The Bioverse sequence annotation server (http://bioverse.compbio.washington.edu) provides a web-based interface to allow users to submit protein sequences to the Bioverse framework. Sequences are functionally and structurally annotated and potential contextual annotations are provided. Researchers can also submit candidate genomes for annotation of all proteins encoded by the genome (proteome).
Multiple sequence alignment (MSA) plays a central role in nearly all bioinformatics and molecular evolutionary applications. MSA reconstruction is thus one of the most heavily scrutinized bioinformatics fields. Evaluating the quality of MSA reconstruction is often hindered by the lack of good reference MSAs. The use of sequence evolution simulation can provide such reference MSAs. Furthermore, none of the MSA viewing/editing programs currently available allows the user to make direct comparisons between two or more MSAs. Considering the importance of MSA quality in a wide range of research, it is desirable if MSA assessment can be performed more easily.
We have developed SuiteMSA, a java-based application that provides unique MSA viewers. Users can directly compare multiple MSAs and evaluate where the MSAs agree (are consistent) or disagree (are inconsistent). Several alignment statistics are provided to assist such comparisons. SuiteMSA also includes a graphical phylogeny editor/viewer as well as a graphical user interface for a sequence evolution simulator that can be used to construct reference MSAs.
SuiteMSA provides researchers easy access to a sequence evolution simulator, reference alignments generated by the simulator, and a series of tools to evaluate the performance of the MSA reconstruction programs. It will help us improve the quality of MSAs, often the most important first steps of bioinformatics and other biological research.
The 3D-GENOMICS database (http://www.sbg.bio.ic.ac.uk/3dgenomics/) provides structural annotations for proteins from sequenced genomes. In August 2003 the database included data for 93 proteomes. The annotations stored in the database include homologous sequences from various sequence databases, domains from SCOP and Pfam, patterns from Prosite and other predicted sequence features such as transmembrane regions and coiled coils. In addition to annotations at the sequence level, several precomputed cross-proteome comparative analyses are available based on SCOP domain superfamily composition. Annotations are available to the user via a web interface to the database. Multiple points of entry are available so that a user is able to: (i) directly access annotations for a single protein sequence via keywords or accession codes, (ii) examine a sequence of interest chosen from a summary of annotations for a particular proteome, or (iii) access precomputed frequency-based cross-proteome comparative analyses.
The MyHits web site (http://myhits.isb-sib.ch) is an integrated service dedicated to the analysis of protein sequences. Since its first description in 2004, both the user interface and the back end of the server were improved. A number of tools (e.g. MAFFT, Jacop, Dotlet, Jalview, ESTScan) were added or updated to improve the usability of the service. The MySQL schema and its associated API were revamped and the database engine (HitKeeper) was separated from the web interface. This paper summarizes the current status of the server, with an emphasis on the new services.
INTERFEROME is an open access database of types I, II and III Interferon regulated genes (http://www.interferome.org) collected from analysing expression data sets of cells treated with IFNs. This database of interferon regulated genes integrates information from high-throughput experiments with annotation, ontology, orthologue sequences from 37 species, tissue expression patterns and gene regulatory information to enable a detailed investigation of the molecular mechanisms underlying IFN biology. INTERFEROME fulfils a need in infection, immunity, development and cancer research by providing computational tools to assist in identifying interferon signatures in gene lists generated by high-throughput expression technologies, and their potential molecular and biological consequences.
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.
The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation, and installations exist around the world at both companies and academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focused on developing automatic comparative genome analysis and visualisation.
Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences.
Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text.
Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/
Supplementary information: Supplementary data are available at Bioinformatics online.
Sequence signature databases such as PROSITE, which include amino acid segments that are indicative of a protein's function, are useful for protein annotation. Lamentably, the annotation is not always accurate. A signature may be falsely detected in a protein that does not carry out the associated function (false positive prediction, FP) or may be overlooked in a protein that does carry out the function (false negative prediction, FN). A new approach has emerged in which a signature is replaced with a sequence profile, calculated based on multiple sequence alignment (MSA) of homologous proteins that share the same function. This approach, which is superior to the simple pattern search, essentially searches with the sequence of the query protein against an MSA library. We suggest here an alternative approach, implemented in the QuasiMotiFinder web server, which is based on a search with an MSA of homologous query proteins against the original PROSITE signatures. The explicit use of the average evolutionary conservation of the signature in the query proteins significantly reduces the rate of FP prediction compared with the simple pattern search. QuasiMotiFinder also has a reduced rate of FN prediction compared with simple pattern searches, since the traditional search for precise signatures has been replaced by a permissive search for signature-like patterns that are physicochemically similar to known signatures. Overall, QuasiMotiFinder and the profile search are comparable to each other in terms of performance. They are also complementary to each other in that signatures that are falsely detected in (or overlooked by) one may be correctly detected by the other.
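For context, the "simple pattern search" that serves as the baseline above can be sketched by translating a PROSITE pattern into a regular expression and scanning a sequence with it. This is an illustrative sketch of that baseline only, not of QuasiMotiFinder's conservation-weighted method. The function name is hypothetical, the test sequence is made up, and the translation does not handle anchors placed inside square brackets.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression.

    Handles the core syntax: '-' separators, 'x' wildcards, [..] allowed
    sets, {..} forbidden sets, (n) and (n,m) repeats, and '<' / '>'
    terminal anchors. A trailing '.' terminator is dropped.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Forbidden sets first, so the braces introduced for repeat
        # counts in the next step are not rewritten.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")  # amino acids are upper case
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


# PROSITE PS00001 (N-glycosylation site): N-{P}-[ST]-{P}.
n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
# Made-up sequence containing one match starting at index 2.
hits = [m.start() for m in re.finditer(n_glyc, "AANGSAA")]
```

An exact match of this kind is all-or-nothing, which is the source of both error types the abstract describes: short patterns match by chance (FP), and a single substitution in a true site breaks the match (FN).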
The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.
Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotators to assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240 000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface enables queries to be made using sequence identifiers, gene symbols, protein functions or amino acid and nucleotide sequences. For the latter query type BLAST similarity search and QuickMatch identity search services have been incorporated, for finding sequences similar (or identical) to a query sequence. ProGMap is meant to help users of high throughput methodologies who deal with partially annotated genomic data.