Summary: JABAWS is a web services framework that simplifies the deployment of web services for bioinformatics. JABAWS:MSA provides services for five multiple sequence alignment (MSA) methods (Probcons, T-coffee, Muscle, Mafft and ClustalW), and is the system employed by the Jalview multiple sequence analysis workbench since version 2.6. A fully functional, easy to set up server is provided as a Virtual Appliance (VA), which can be run on most operating systems that support a virtualization environment such as VMware or Oracle VirtualBox. JABAWS is also distributed as a Web Application aRchive (WAR) and can be configured to run on a single computer and/or a cluster managed by Grid Engine, LSF or other queuing systems that support DRMAA. JABAWS:MSA provides clients full access to each application's parameters, allows administrators to specify named parameter preset combinations and execution limits for each application through simple configuration files. The JABAWS command-line client allows integration of JABAWS services into conventional scripts.
Availability and Implementation: JABAWS is made freely available under the Apache 2 license and can be obtained from: http://www.compbio.dundee.ac.uk/jabaws.
TarO (http://www.compbio.dundee.ac.uk/taro) offers a single point of reference for key bioinformatics analyses relevant to selecting proteins or domains for study by structural biology techniques. The protein sequence is analysed by 17 algorithms and compared to 8 databases. TarO gathers putative homologues, including orthologues, and then obtains predictions of properties for these sequences including crystallisation propensity, protein disorder and post-translational modifications. Analyses are run on a high-performance computing cluster, the results integrated, stored in a database and accessed through a web-based user interface. Output is in tabulated format and in the form of an annotated multiple sequence alignment (MSA) that may be edited interactively in the program Jalview. TarO also simplifies the gathering of additional annotations via the Distributed Annotation System, both from the MSA in Jalview and through links to Dasty2. Routes to other information gateways are included, for example to relevant pages from UniProt, COG and the Conserved Domains Database. Open access to TarO is available from a guest account with private accounts for academic use available on request. Future development of TarO will include further analysis steps and integration with the Protein Information Management System (PIMS), a sister project in the BBSRC ‘Structural Proteomics of Rational Targets’ initiative
Phage display is widely used in basic research such as the exploration of protein-protein interaction sites and networks, and applied research such as the development of new drugs, vaccines, and diagnostics. It has also become a promising method for epitope mapping. Research on new algorithms that assist and automate phage display based epitope mapping has attracted many groups. Most of the existing tools have not been implemented as an online service until now however, making it less convenient for the community to access, utilize, and evaluate them.
We present MIMOX, a free web tool that helps to map the native epitope of an antibody based on one or more user supplied mimotopes and the antigen structure. MIMOX was coded in Perl using modules from the Bioperl project. It has two sections. In the first section, MIMOX provides a simple interface for ClustalW to align a set of mimotopes. It also provides a simple statistical method to derive the consensus sequence and embeds JalView as a Java applet to view and manage the alignment. In the second section, MIMOX can map a single mimotope or a consensus sequence of a set of mimotopes, on to the corresponding antigen structure and search for all of the clusters of residues that could represent the native epitope. NACCESS is used to evaluate the surface accessibility of the candidate clusters; and Jmol is embedded to view them interactively in their 3D context. Initial case studies show that MIMOX can reproduce mappings from existing tools such as FINDMAP and 3DEX, as well as providing novel, rational results.
A web-based tool called MIMOX has been developed for phage display based epitope mapping. As a publicly available online service in this area, it is convenient for the community to access, utilize, and evaluate, complementing other existing programs. MIMOX is freely available at .
Many protein families contain sub-families with functional specialization, such as binding different ligands or being involved in different protein–protein interactions. A small number of amino acids generally determine functional specificity. The identification of these residues can aid the understanding of protein function and help finding targets for experimental analysis. Here, we present multi-Harmony, an interactive web sever for detecting sub-type-specific sites in proteins starting from a multiple sequence alignment. Combining our Sequence Harmony (SH) and multi-Relief (mR) methods in one web server allows simultaneous analysis and comparison of specificity residues; furthermore, both methods have been significantly improved and extended. SH has been extended to cope with more than two sub-groups. mR has been changed from a sampling implementation to a deterministic one, making it more consistent and user friendly. For both methods Z-scores are reported. The multi-Harmony web server produces a dynamic output page, which includes interactive connections to the Jalview and Jmol applets, thereby allowing interactive analysis of the results. Multi-Harmony is available at http://www.ibi.vu.nl/ programs/shmrwww.
Profile–profile methods are well suited to detect remote evolutionary relationships between protein families. Profile Comparer (PRC) is an existing stand-alone program for scoring and aligning hidden Markov models (HMMs), which are based on multiple sequence alignments. Since PRC compares profile HMMs instead of sequences, it can be used to find distant homologues. For this purpose, PRC is used by, for example, the CATH and Pfam-domain databases. As PRC is a profile comparer, it only reports profile HMM alignments and does not produce multiple sequence alignments. We have developed webPRC server, which makes it straightforward to search for distant homologues or similar alignments in a number of domain databases. In addition, it provides the results both as multiple sequence alignments and aligned HMMs. Furthermore, the user can view the domain annotation, evaluate the PRC hits with the Jalview multiple alignment editor and generate logos from the aligned HMMs or the aligned multiple alignments. Thus, this server assists in detecting distant homologues with PRC as well as in evaluating and using the results. The webPRC interface is available at http://www.ibi.vu.nl/programs/prcwww/.
The MyHits web server (http://myhits.isb-sib.ch) is a new integrated service dedicated to the annotation of protein sequences and to the analysis of their domains and signatures. Guest users can use the system anonymously, with full access to (i) standard bioinformatics programs (e.g. PSI-BLAST, ClustalW, T-Coffee, Jalview); (ii) a large number of protein sequence databases, including standard (Swiss-Prot, TrEMBL) and locally developed databases (splice variants); (iii) databases of protein motifs (Prosite, Interpro); (iv) a precomputed list of matches (‘hits’) between the sequence and motif databases. All databases are updated on a weekly basis and the hit list is kept up to date incrementally. The MyHits server also includes a new collection of tools to generate graphical representations of pairwise and multiple sequence alignments including their annotated features. Free registration enables users to upload their own sequences and motifs to private databases. These are then made available through the same web interface and the same set of analytical tools. Registered users can manage their own sequences and annotations using only web tools and freeze their data in their private database for publication purposes.
In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family.
MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIM results, while a graphical display using the JalView program allows manual analysis.
MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at .
We present TranslatorX, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. Many comparisons between biological sequences (nucleic acids and proteins) involve the construction of multiple alignments. Alignments represent a statement regarding the homology between individual nucleotides or amino acids within homologous genes. As protein-coding DNA sequences evolve as triplets of nucleotides (codons) and it is known that sequence similarity degrades more rapidly at the DNA than at the amino acid level, alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence; (ii) a battery of different multiple alignment programs; (iii) translation of ambiguous codons when possible; (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information; and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. The TranslatorX server is freely available at http://translatorx.co.uk.
Jpred (http://www.compbio.dundee.ac.uk/jpred) is a secondary structure prediction server powered by the Jnet algorithm. Jpred performs over 1000 predictions per week for users in more than 50 countries. The recently updated Jnet algorithm provides a three-state (α-helix, β-strand and coil) prediction of secondary structure at an accuracy of 81.5%. Given either a single protein sequence or a multiple sequence alignment, Jpred derives alignment profiles from which predictions of secondary structure and solvent accessibility are made. The predictions are presented as coloured HTML, plain text, PostScript, PDF and via the Jalview alignment editor to allow flexibility in viewing and applying the data. The new Jpred 3 server includes significant usability improvements that include clearer feedback of the progress or failure of submitted requests. Functional improvements include batch submission of sequences, summary results via email and updates to the search databases. A new software pipeline will enable Jnet/Jpred to continue to be updated in sync with major updates to SCOP and UniProt and so ensures that Jpred 3 will maintain high-accuracy predictions.
The use of sequence alignments to understand protein families is ubiquitous in molecular biology. High quality alignments are difficult to build and protein alignment remains one of the largest open problems in computational biology. Misalignments can lead to inferential errors about protein structure, folding, function, phylogeny, and residue importance. Identifying alignment errors is difficult because alignments are built and validated on the same primary criteria: sequence conservation. Local covariation identifies systematic misalignments and is independent of conservation. We demonstrate an alignment curation tool, LoCo, that integrates local covariation scores with the Jalview alignment editor. Using LoCo, we illustrate how local covariation is capable of identifying alignment errors due to the reduction of positional independence in the region of misalignment. We highlight three alignments from the benchmark database, BAliBASE 3, that contain regions of high local covariation, and investigate the causes to illustrate these types of scenarios. Two alignments contain sequential and structural shifts that cause elevated local covariation. Realignment of these misaligned segments reduces local covariation; these alternative alignments are supported with structural evidence. We also show that local covariation identifies active site residues in a validated alignment of paralogous structures. Loco is available at https://sourceforge.net/projects/locoprotein/files/
By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis.
Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue.
Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition.
PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotation. The annotations can be combined with sophisticated analysis to test hypothesis that relate to sequence, structure and function.
Motivation: High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than possible with general-purpose genome browsers currently available.
Results: Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications.
Supplementary data are available at Bioinformatics online.
Carbohydrates play a critical role in human diseases and their potential utility as biomarkers for pathological conditions is a major driver for characterization of the glycome. However, the additional complexity of glycans compared to proteins and nucleic acids has slowed the advancement of glycomics in comparison to genomics and proteomics. The branched nature of carbohydrates, the great diversity of their constituents and the numerous alternative symbolic notations, make the input and display of glycans not as straightforward as for example the amino-acid sequence of a protein. Every glycoinformatic tool providing a user interface would benefit from a fast, intuitive, appealing mechanism for input and output of glycan structures in a computer readable format.
A software tool for building and displaying glycan structures using a chosen symbolic notation is described here. The "GlycanBuilder" uses an automatic rendering algorithm to draw the saccharide symbols and to place them on the drawing board. The information about the symbolic notation is derived from a configurable graphical model as a set of rules governing the aspect and placement of residues and linkages. The algorithm is able to represent a structure using only few traversals of the tree and is inherently fast. The tool uses an XML format for import and export of encoded structures.
The rendering algorithm described here is able to produce high-quality representations of glycan structures in a chosen symbolic notation. The automated rendering process enables the "GlycanBuilder" to be used both as a user-independent component for displaying glycans and as an easy-to-use drawing tool. The "GlycanBuilder" can be integrated in web pages as a Java applet for the visual editing of glycans. The same component is available as a web service to render an encoded structure into a graphical format. Finally, the "GlycanBuilder" can be integrated into other applications to create intuitive and appealing user interfaces: an example is the "GlycoWorkbench", a software tool for assisted annotation of glycan mass spectra. The "GlycanBuilder" represent a flexible, reliable and efficient solution to the problem of input and output of glycan structures in any glycomic tool or database.
Phylogeny-aware progressive alignment has been found to perform well in phylogenetic alignment benchmarks and to produce superior alignments for the inference of selection on codon sequences. Its implementation in the PRANK alignment program package also allows modelling of complex evolutionary processes and inference of posterior probabilities for sequence sites evolving under each distinct scenario, either simultaneously with the alignment of sequences or as a post-processing step for an existing alignment. This has led to software with many advanced features, and users may find it difficult to generate optimal alignments, visualise the full information in their alignment results, or post-process these results, e.g. by objectively selecting subsets of alignment sites.
We have created a web server called webPRANK that provides an easy-to-use interface to the PRANK phylogeny-aware alignment algorithm. The webPRANK server supports the alignment of DNA, protein and codon sequences as well as protein-translated alignment of cDNAs, and includes built-in structure models for the alignment of genomic sequences. The resulting alignments can be exported in various formats widely used in evolutionary sequence analyses. The webPRANK server also includes a powerful web-based alignment browser for the visualisation and post-processing of the results in the context of a cladogram relating the sequences, allowing (e.g.) removal of alignment columns with low posterior reliability. In addition to de novo alignments, webPRANK can be used for the inference of ancestral sequences with phylogenetically realistic gap patterns, and for the annotation and post-processing of existing alignments. The webPRANK server is freely available on the web at http://tinyurl.com/webprank .
The webPRANK server incorporates phylogeny-aware multiple sequence alignment, visualisation and post-processing in an easy-to-use web interface. It widens the user base of phylogeny-aware multiple sequence alignment and allows the performance of all alignment-related activity for small sequence analysis projects using only a standard web browser.
The analysis of transcriptome data involves many steps and various programs, along with organization of large amounts of data and results. Without a methodical approach for storage, analysis and query, the resulting ad hoc analysis can lead to human error, loss of data and results, inefficient use of time, and lack of verifiability, repeatability, and extensibility.
The Transcriptome Computational Workbench (TCW) provides Java graphical interfaces for methodical analysis for both single and comparative transcriptome data without the use of a reference genome (e.g. for non-model organisms). The singleTCW interface steps the user through importing transcript sequences (e.g. Illumina) or assembling long sequences (e.g. Sanger, 454, transcripts), annotating the sequences, and performing differential expression analysis using published statistical programs in R. The data, metadata, and results are stored in a MySQL database. The multiTCW interface builds a comparison database by importing sequence and annotation from one or more single TCW databases, executes the ESTscan program to translate the sequences into proteins, and then incorporates one or more clusterings, where the clustering options are to execute the orthoMCL program, compute transitive closure, or import clusters. Both singleTCW and multiTCW allow extensive query and display of the results, where singleTCW displays the alignment of annotation hits to transcript sequences, and multiTCW displays multiple transcript alignments with MUSCLE or pairwise alignments. The query programs can be executed on the desktop for fastest analysis, or from the web for sharing the results.
It is now affordable to buy a multi-processor machine, and easy to install Java and MySQL. By simply downloading the TCW, the user can interactively analyze, query and view their data. The TCW allows in-depth data mining of the results, which can lead to a better understanding of the transcriptome. TCW is freely available from www.agcol.arizona.edu/software/tcw.
GOblet is a comprehensive web server application providing the annotation of anonymous sequence data with Gene Ontology (GO) terms. It uses a variety of different protein databases (human, murines, invertebrates, plants, sp-trembl) and their respective GO mappings. The user selects the appropriate database and alignment threshold and thereafter submits single or multiple nucleotide or protein sequences. Results are shown in different ways, e.g. as survey statistics for the main GO categories for all sequences or as detailed results for each single sequence that has been submitted. In its newest version, GOblet allows the batch submission of sequences and provides an improved display of results with the aid of Java applets. All output data, together with the Java applet, are packed to a downloadable archive for local installation and analysis. GOblet can be accessed freely at http://goblet.molgen.mpg.de.
Contemporary biological research integrates neighboring scientific domains to answer complex questions in fields such as systems biology and drug discovery. This calls for tools that are intuitive to use, yet flexible to adapt to new tasks.
Bioclipse is a free, open source workbench with advanced features for the life sciences. Version 2.0 constitutes a complete rewrite of Bioclipse, and delivers a stable, scalable integration platform for developers and an intuitive workbench for end users. All functionality is available both from the graphical user interface and from a built-in novel domain-specific language, supporting the scientist in interdisciplinary research and reproducible analyses through advanced visualization of the inputs and the results. New components for Bioclipse 2 include a rewritten editor for chemical structures, a table for multiple molecules that supports gigabyte-sized files, as well as a graphical editor for sequences and alignments.
Bioclipse 2 is equipped with advanced tools required to carry out complex analysis in the fields of bio- and cheminformatics. Developed as a Rich Client based on Eclipse, Bioclipse 2 leverages on today's powerful desktop computers for providing a responsive user interface, but also takes full advantage of the Web and networked (Web/Cloud) services for more demanding calculations or retrieval of data. The fact that Bioclipse 2 is based on an advanced and widely used service platform ensures wide extensibility, making it easy to add new algorithms, visualizations, as well as scripting commands. The intuitive tools for end users and the extensible architecture make Bioclipse 2 ideal for interdisciplinary and integrative research.
Bioclipse 2 is released under the Eclipse Public License (EPL), a flexible open source license that allows additional plugins to be of any license. Bioclipse 2 is implemented in Java and supported on all major platforms; Source code and binaries are freely available at http://www.bioclipse.net.
The visualisation of genetic and genomic maps aligned within and between species and across data sources can be used to inform studies of genome evolution, assist genome assembly projects and aid gene discovery and identification. Whilst annotation, integration and exploration of assembled genome sequences is well supported, there are fewer tools available which can display genetic maps for less well-characterized species, and integrate these maps with annotated reference genomes to support cross species comparisons.
We have developed a desktop application to draw and align genetic and genomic maps, retrieved from remote data sources or loaded as local files. Maps can be retrieved from our public map database ArkDB or from any Ensembl data source (i.e. Ensembl and Ensembl Genomes). By using the JEnsembl API, maps can be drawn for any release version of any of the thousands of species present in Ensembl data sources, allowing not only inter-specific comparisons, but also comparisons between different versions/revisions of assembled genomes. Maps can be aligned by relating identical or synonymous markers across maps, or through the gene homology/orthology relationship data stored in the Ensembl Compara databases, allowing ready visualization of regions of conserved synteny between species. The map drawing canvas is highly configurable, supports interactive exploration of maps, markers and relationships and allows export of publication quality graphics.
ArkMAP allows users to draw and interactively explore gene and variation maps for any version of any annotated genome curated in the Ensembl data sources, and to integrate local mapping data. The maps and inter-map relationships drawn are highly configurable and ArkMAP may be used to produce publication quality graphics. ArkMAP is freely available as an auto-updating Java ‘Web Start’ application, or as a standalone archived application.
Software; Map drawing; Genetic map; Genomic map; Synteny; Conserved synteny; JEnsembl; Ensembl
Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time consuming and not reproducible. We present here an automated editing method based on the classification of ‘valid’ and ‘invalid’ sites.
Results: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments.
Availability: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch job is available at http://fester.cs.dal.ca/manuel.
Supplementary information: Supplementary data are available at Bioinformatics online.
So-called next-generation sequencing (NGS) has provided the ability to sequence on a massive scale at low cost, enabling biologists to perform powerful experiments and gain insight into biological processes. BamView has been developed to visualize and analyse sequence reads from NGS platforms, which have been aligned to a reference sequence. It is a desktop application for browsing the aligned or mapped reads [Ruffalo, M, LaFramboise, T, Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790–6] at different levels of magnification, from nucleotide level, where the base qualities can be seen, to genome or chromosome level where overall coverage is shown. To enable in-depth investigation of NGS data, various views are provided that can be configured to highlight interesting aspects of the data. Multiple read alignment files can be overlaid to compare results from different experiments, and filters can be applied to facilitate the interpretation of the aligned reads. As well as being a standalone application it can be used as an integrated part of the Artemis genome browser, BamView allows the user to study NGS data in the context of the sequence and annotation of the reference genome. Single nucleotide polymorphism (SNP) density and candidate SNP sites can be highlighted and investigated, and read-pair information can be used to discover large structural insertions and deletions. The application will also calculate simple analyses of the read mapping, including reporting the read counts and reads per kilobase per million mapped reads (RPKM) for genes selected by the user.
Availability: BamView and Artemis are freely available software. These can be downloaded from their home pages:
Requirements: Java 1.6 or higher.
genome browser; next-generation sequencing; visualization; Artemis; BamView
Mathematically optimal alignments do not always properly align active site residues or well-recognized structural elements. Most near-optimal sequence alignment algorithms display alternative alignment paths, rather than the conventional residue-by-residue pairwise alignment. Typically, these methods do not provide mechanisms for finding effectively the most biologically meaningful alignment in the potentially large set of options.
We have developed Web-based software that displays near optimal or alternative alignments of two protein or DNA sequences as a continuous moving picture. A WWW interface to a C++ program generates near optimal alignments, which are sent to a Java Applet, which displays them in a series of alignment frames. The Applet aligns residues so that consistently aligned regions remain at a fixed position on the display, while variable regions move. The display can be stopped to examine alignment details.
Available at http://fasta.bioch.virginia.edu/ noptalign. For source code contact the authors at firstname.lastname@example.org
Base-By-Base is a Java-based multiple sequence alignment editor. It is capable of working with protein and DNA molecules, but many of its unique features relate to the manipulation of the genomes of large DNA viruses such as poxviruses, herpesviruses, baculoviruses and asfarviruses (1-400 kb). The tool was built to serve as a platform for comparative genomics at the level of individual nucleotides.
In version 2, BBB-v2, of Base-By-Base we have added a series of new features aimed at providing the bench virologist with a better platform to view, annotate and analyze these complex genomes. Although a poxvirus genome, for example, may be less than 200 kb, it probably encodes close to 200 proteins using multiple classes of promoters with frequent overlapping of promoters and coding sequences and even some overlapping of genes. The new features allow users to 1) add primer annotations or other data sets in batch mode, 2) export differences between sequences to other genome browsers, 3) compare multiple genomes at a single nucleotide level of detail, 4) create new alignments from subsets/subsequences of a very large master alignment and 5) allow display of summaries of deep RNA sequencing data sets on a genome sequence.
BBB-v2 significantly improves the ability of virologists to work with genome sequences and provides a platform with which they can use a multiple sequence alignment as the basis for their own editable documents. Also, a .bbb document, with a variety of annotations in addition to the basic coding regions, can be shared among collaborators or made available to an entire research community. The program is available via Virology.ca using Java Web Start and is platform independent; the Java 1.5 virtual machine is required.
Phylogenetic and population genetic studies often deal with multiple sequence alignments that require manipulation or processing steps such as sequence concatenation, sequence renaming, sequence translation or consensus sequence generation. In recent years phylogenetic data sets have expanded from single genes to genome wide markers comprising hundreds to thousands of loci. Processing of these large phylogenomic data sets is impracticable without using automated process pipelines. Currently no stand-alone or pipeline compatible program exists that offers a broad range of manipulation and processing steps for multiple sequence alignments in a single process run.
Here we present FASconCAT-G, a system independent editor, which offers various processing options for multiple sequence alignments. The software provides a wide range of possibilities to edit and concatenate multiple nucleotide, amino acid, and structure sequence alignment files for phylogenetic and population genetic purposes. The main options include sequence renaming, file format conversion, sequence translation between nucleotide and amino acid states, consensus generation of specific sequence blocks, sequence concatenation, model selection of amino acid replacement with ProtTest, two types of RY coding as well as site exclusions and extraction of parsimony informative sites. Convieniently, most options can be invoked in combination and performed during a single process run. Additionally, FASconCAT-G prints useful information regarding alignment characteristics and editing processes such as base compositions of single in- and outfiles, sequence areas in a concatenated supermatrix, as well as paired stem and loop regions in secondary structure sequence strings.
FASconCAT-G is a command-line driven Perl program that delivers computationally fast and user-friendly processing of multiple sequence alignments for phylogenetic and population genetic applications and is well suited for incorporation into analysis pipelines.
Electronic supplementary material
The online version of this article (doi:10.1186/s12983-014-0081-x) contains supplementary material, which is available to authorized users.
Multiple sequence alignment; Phylogenetic reconstruction; Sequence processing; Consensus sequence; Sequence translation; Sequence concatenation; File format conversion
G protein-coupled receptors (GPCRs) transduce a wide variety of extracellular signals to within the cell and therefore have a key role in regulating cell activity and physiological function. GPCR malfunction is responsible for a wide range of diseases including cancer, diabetes and hyperthyroidism and a large proportion of drugs on the market target these receptors. The three dimensional structure of GPCRs is important for elucidating the molecular mechanisms underlying these diseases and for performing structure-based drug design. Although structural data are restricted to only a handful of GPCRs, homology models can be used as a proxy for those receptors not having crystal structures. However, many researchers working on GPCRs are not experienced homology modellers and are therefore unable to benefit from the information that can be gleaned from such three-dimensional models. Here, we present a comprehensive database called the GPCR-SSFE, which provides initial homology models of the transmembrane helices for a large variety of family A GPCRs.
Extending on our previous theoretical work, we have developed an automated pipeline for GPCR homology modelling and applied it to a large set of family A GPCR sequences. Our pipeline is a fragment-based approach that exploits available family A crystal structures. The GPCR-SSFE database stores the template predictions, sequence alignments, identified sequence and structure motifs and homology models for 5025 family A GPCRs. Users are able to browse the GPCR dataset according to their pharmacological classification or search for results using a UniProt entry name. It is also possible for a user to submit a GPCR sequence that is not contained in the database for analysis and homology model building. The models can be viewed using a Jmol applet and are also available for download along with the alignments.
The data provided by GPCR-SSFE are useful for investigating general and detailed sequence-structure-function relationships of GPCRs, performing structure-based drug design and for better understanding the molecular mechanisms underlying disease-associated mutations in GPCRs. The effectiveness of our multiple template and fragment approach is demonstrated by the accuracy of our predicted homology models compared to recently published crystal structures.