Innovations in biological and biomedical imaging produce complex high-content and multivariate image data. For decision-making and generation of hypotheses, scientists need novel information technology tools that enable them to visually explore and analyze the data and to discuss and communicate results or findings with collaborating experts from various places.
In this paper, we present a novel Web 2.0 approach, BioIMAX, for the collaborative exploration and analysis of multivariate image data. It combines the web's collaboration and distribution architecture with the interface interactivity and computational power of desktop applications, a combination recently termed rich internet application.
BioIMAX allows scientists to discuss and share data or results with collaborating experts and to visualize, annotate, and explore multivariate image data within one web-based platform from any location via a standard web browser requiring only a username and a password. BioIMAX can be accessed at http://ani.cebitec.uni-bielefeld.de/BioIMAX with the username "test" and the password "test1" for testing purposes.
Summary: We describe ChromA, a web-based alignment tool for chromatography–mass spectrometry data from the metabolomics and proteomics domains. Users can supply their data in open and standardized file formats for retention time alignment using dynamic time warping with different configurable local distance and similarity functions. Additionally, user-defined anchors can be used to constrain and speed up the alignment. A neighborhood around each anchor can be added to increase the flexibility of the constrained alignment. ChromA offers different visualizations of the alignment for easier qualitative interpretation and comparison of the data. For the multiple alignment of more than two data files, the center-star approximation is applied to select, among the input files, a reference to which the others are aligned.
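The dynamic time warping step mentioned above can be illustrated with a minimal sketch. This is not ChromA's implementation: the function name `dtw`, the default absolute-difference distance and the toy retention-time-like sequences are assumptions made for illustration, and anchors and neighborhoods are left out.

```python
import math


def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Align sequences a and b; return (total cost, warping path).

    dist is a pluggable local distance (hypothetical default: absolute
    difference), mirroring the idea of configurable distance functions.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # match both
                                 cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1])      # repeat a[i-1]
    # Trace back from the end to recover one optimal warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


total, path = dtw([1, 2, 3], [1, 2, 2, 3])
print(total, path)
```

An anchor constraint could be added by restricting the inner loops to cells near the anchored index pairs, which shrinks the table that has to be filled.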
Availability: ChromA is available at http://bibiserv.techfak.uni-bielefeld.de/chroma. Executables and source code under the L-GPL v3 license are provided for download at the same location.
Supplementary information: Supplementary data are available at Bioinformatics online.
With the advent of low-cost, fast sequencing technologies, metagenomic analyses have become possible. However, the large data volumes gathered by these techniques and the unpredictable diversity captured in them remain a challenge for computational biology.
In this paper we address the problem of rapid taxonomic assignment with small and adaptive data models (< 5 MB) and present the accelerated k-mer explorer (AKE). Acceleration in AKE’s taxonomic assignments is achieved by a special machine learning architecture, which is well suited to model data collections that are intrinsically hierarchical. We report reasonably good classification accuracy for ranks down to order, observed in a study on real-world data (Acid Mine Drainage, Cow Rumen).
We show that the execution time of this approach is orders of magnitude shorter than that of competing approaches and that accuracy is comparable. The tool is presented to the public as a web application (url: https://ani.cebitec.uni-bielefeld.de/ake/, username: bmc, password: bmcbioinfo).
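As a rough illustration of k-mer based assignment, the sketch below turns a read into a normalized k-mer frequency vector and assigns it to the closest stored profile. The nearest-prototype rule is a simple stand-in chosen for this example and is not the hierarchical architecture used by AKE; the function names, the value of k and the toy sequences are assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector over A, C, G, T (length 4**k)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def nearest_prototype(profile, prototypes):
    """Return the label whose prototype profile is closest (Euclidean)."""
    return min(prototypes,
               key=lambda label: math.dist(profile, prototypes[label]))


# Hypothetical prototypes built from toy sequences.
prototypes = {
    "at_rich": kmer_profile("ATATATATAT", k=2),
    "gc_rich": kmer_profile("GCGCGCGCGC", k=2),
}
print(nearest_prototype(kmer_profile("ATATTATA", k=2), prototypes))
```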
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0384-0) contains supplementary material, which is available to authorized users.
Metagenomics; Classification; Acceleration; Web-based; H2SOM; k-mer
RNA pseudoknots are an important structural feature of RNAs, but are often neglected in computer predictions for reasons of efficiency. Here, we present the pknotsRG Web Server for single sequence RNA secondary structure prediction including pseudoknots. pknotsRG employs the newest Turner energy rules for finding the structure of minimal free energy. The algorithm has been improved in several ways recently. First, it has been reimplemented in the C programming language, resulting in a 60-fold increase in speed. Second, all suboptimal foldings up to a user-defined threshold can be enumerated. For large scale analysis, a fast sliding window mode is available. Further improvements to the Web Server include a new output visualization using the PseudoViewer Web Service, or RNAmovies for a movie-like animation of several suboptimal foldings.
The tool is available as source code, binary executable, online tool or as Web Service. The latter alternative allows for an easy integration into bioinformatics pipelines. pknotsRG is available at the Bielefeld Bioinformatics Server (http://bibiserv.techfak.uni-bielefeld.de/pknotsrg).
Summary: UniMoG is a software tool that combines five genome rearrangement models: double cut and join (DCJ), restricted DCJ, Hannenhalli and Pevzner (HP), inversion and translocation. It can compute the pairwise genomic distances and a corresponding optimal sorting scenario for an arbitrary number of genomes. All five models can be unified through the DCJ model; thus the implementation is based on DCJ and, where reasonable, uses the most efficient existing algorithms for each distance and sorting problem. Both textual and graphical output is possible for visualizing the operations.
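To make the DCJ distance concrete, here is a small sketch for a deliberately restricted case: two genomes that each consist of a single circular chromosome and contain the same genes. In that case the distance is the number of genes minus the number of cycles formed by alternating between the adjacencies of the two genomes. The sketch does not reproduce UniMoG's code; the function names and the signed-integer genome encoding are assumptions, and linear chromosomes (which add a path term to the formula) are not handled.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour in a circular chromosome.

    genome is a list of signed integers, e.g. [1, -2, 3]; a negative sign
    means the gene is read in reverse orientation.
    """
    def left(g):   # extremity met first when reading along the chromosome
        return (abs(g), "t") if g > 0 else (abs(g), "h")

    def right(g):  # extremity met last
        return (abs(g), "h") if g > 0 else (abs(g), "t")

    partner = {}
    n = len(genome)
    for i in range(n):
        a, b = right(genome[i]), left(genome[(i + 1) % n])
        partner[a] = b
        partner[b] = a
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance N - C for two circular, single-chromosome genomes."""
    if sorted(map(abs, genome_a)) != sorted(map(abs, genome_b)):
        raise ValueError("genomes must contain the same genes")
    pa, pb = adjacencies(genome_a), adjacencies(genome_b)
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Follow an adjacency of A, then one of B, until the cycle closes.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(genome_a) - cycles


print(dcj_distance_circular([1, 2, 3, 4], [1, -3, -2, 4]))
```

In the usage line, the second genome differs from the first by one inversion of the segment 2, 3, which a single DCJ operation can undo.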
Availability and implementation: The software is available through the Bielefeld University Bioinformatics Web Server at http://bibiserv.techfak.uni-bielefeld.de/dcj with instructions and example data.
Motivation: Dynamic programming is ubiquitous in bioinformatics. Developing and implementing non-trivial dynamic programming algorithms is often error prone and tedious. Bellman’s GAP is a new programming system, designed to ease the development of bioinformatics tools based on the dynamic programming technique.
Results: In Bellman’s GAP, dynamic programming algorithms are described in a declarative style by tree grammars, evaluation algebras and products formed thereof. This bypasses the design of explicit dynamic programming recurrences and yields programs that are free of subscript errors, modular and easy to modify. The declarative modules are compiled into C++ code that is competitive with carefully hand-crafted implementations.
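The separation of search space and scoring described above can be mimicked in a few lines of Python. The sketch below is not GAP-L and says nothing about how the Bellman's GAP compiler works; it only illustrates the idea that one grammar-like recursion can be evaluated under interchangeable algebras. The toy grammar (no pseudoknots, no minimum hairpin size), the pairing rules and all names are assumptions.

```python
from functools import lru_cache

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
         ("C", "G"), ("G", "U"), ("U", "G")}


def fold(seq, algebra):
    """Evaluate a toy RNA structure grammar under the given algebra.

    algebra = (nil, unpaired, paired, choice):
      nil()          value of the empty structure
      unpaired(x)    an unpaired base followed by structure x
      paired(x, y)   a base pair enclosing x, followed by structure y
      choice(values) combines the values of all alternatives
    """
    nil, unpaired, paired, choice = algebra

    @lru_cache(maxsize=None)
    def struct(i, j):
        # All structures over seq[i:j]; each has exactly one derivation.
        if i == j:
            return choice([nil()])
        candidates = [unpaired(struct(i + 1, j))]
        for k in range(i + 1, j):
            if (seq[i], seq[k]) in PAIRS:
                candidates.append(paired(struct(i + 1, k),
                                         struct(k + 1, j)))
        return choice(candidates)

    return struct(0, len(seq))


# Two algebras for the same grammar: counting and base-pair maximization.
count = (lambda: 1, lambda x: x, lambda x, y: x * y, sum)
max_pairs = (lambda: 0, lambda x: x, lambda x, y: 1 + x + y, max)

print(fold("GGCC", count), fold("GGCC", max_pairs))
```

Swapping the algebra changes what is computed without touching the recursion, which is the kind of re-use the declarative style is meant to provide.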
This article introduces the Bellman’s GAP system and its language, GAP-L. It then demonstrates the ease of development and the degree of re-use by creating variants of two common bioinformatics algorithms. Finally, it evaluates Bellman’s GAP as an implementation platform of ‘real-world’ bioinformatics tools.
Availability: Bellman’s GAP is available under GPL license from http://bibiserv.cebitec.uni-bielefeld.de/bellmansgap. This Web site includes a repository of re-usable modules for RNA folding based on thermodynamics.
Supplementary data are available at Bioinformatics online.
Motivation: The research area of metabolomics has achieved tremendous popularity and development in the last couple of years. Owing to its unique interdisciplinarity, it requires combining knowledge from various scientific disciplines. Advances in high-throughput technology and the consequently growing quality and quantity of data put new demands on applied analytical and computational methods. Exploration of the finally generated and analyzed datasets furthermore relies on powerful tools for data mining and visualization.
Results: To cover and keep up with these requirements, we have created MeltDB 2.0, a next-generation web application addressing storage, sharing, standardization, integration and analysis of metabolomics experiments. New features improve both the efficiency and the effectiveness of the entire processing pipeline of chromatographic raw data, from pre-processing to the derivation of new biological knowledge. First, the generation of high-quality metabolic datasets has been vastly simplified. Second, the new statistics toolbox allows these datasets to be investigated according to a wide spectrum of scientific and explorative questions.
Availability: The system is publicly available at https://meltdb.cebitec.uni-bielefeld.de. A login is required but freely available.
Intra-cellular and inter-cellular protein translocation can be observed by microscopic imaging of tissue sections prepared immunohistochemically. A manual densitometric analysis is time-consuming, subjective and error-prone. An automated quantification is faster, more reproducible, and should yield results comparable to manual evaluation. The automated method presented here was developed on rat liver tissue sections to study the translocation of bile salt transport proteins in hepatocytes. For validation, the cholestatic liver state was compared to the normal biological state.
An automated quantification method was developed to analyze the translocation of membrane proteins and evaluated in comparison to an established manual method. Firstly, regions of interest (membrane fragments) are identified in confocal microscopy images. Further, densitometric intensity profiles are extracted orthogonally to membrane fragments, following the direction from the plasma membrane to cytoplasm. Finally, several different quantitative descriptors were derived from the densitometric profiles and were compared regarding their statistical significance with respect to the transport protein distribution. Stable performance, robustness and reproducibility were tested using several independent experimental datasets. A fully automated workflow for the information extraction and statistical evaluation has been developed and produces robust results.
New descriptors for the intensity distribution profiles were found to be more discriminative, i.e. more significant, than those used in previous research publications for the translocation quantification. The slow manual calculation can be substituted by the fast and unbiased automated method.
e2g is a web-based server which efficiently maps large expressed sequence tag (EST) and cDNA datasets to genomic DNA. It significantly extends the volume of data that can be mapped in reasonable time, and makes this improved efficiency available as a web service. Our server hosts large collections of EST sequences (e.g. 4.1 million mouse ESTs of 1.87 Gb) in precomputed indexed data structures for efficient sequence comparison. The user can upload a genomic DNA sequence of interest and rapidly compare this to the complete collection of ESTs on the server. This delivers a mapping of the ESTs on the genomic DNA. The e2g web interface provides a graphical overview of the mapping. Alignments of the mapped EST regions with parts of the genomic sequence are visualized. Zooming functions allow the user to interactively explore the results. Mapped sequences can be downloaded for further analysis. e2g is available on the Bielefeld University Bioinformatics Server at http://bibiserv.techfak.uni-bielefeld.de/e2g/.
RNA Movies is a simple yet powerful visualization tool, similar to a media player application, which enables browsing of sequential paths through RNA secondary structure landscapes. It can be used to visualize structural rearrangement processes of RNA, such as folding pathways and conformational switches, or to browse lists of alternative structure candidates. Besides extending the feature set, retaining and improving usability and availability on the web is the main aim of this new version. RNA Movies now supports the DCSE and RNAStructML input formats besides its own RNM format. Pseudoknots and ‘entangled helices’ can be superimposed on the RNA secondary structure layout. Publication quality output is provided through the Scalable Vector Graphics output format understood by most current drawing programs. The software has been completely re-implemented in Java to enable pure client-side operation as applet and web-start application available at the Bielefeld Bioinformatics Server http://bibiserv.techfak.uni-bielefeld.de/rnamovies
Motivation: Abstract shape analysis allows efficient computation of a representative sample of low-energy foldings of an RNA molecule. More comprehensive information is obtained by computing shape probabilities, accumulating the Boltzmann probabilities of all structures within each abstract shape. Such information is superior to free energies because it is independent of sequence length and base composition. However, up to this point, computation of shape probabilities evaluates all shapes simultaneously and comes with a computation cost which is exponential in the length of the sequence.
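The accumulation of Boltzmann probabilities per shape can be sketched as follows. The input is assumed to be a list of (shape, free energy) pairs; for exact shape probabilities that list would have to cover the complete structure ensemble, which is what makes the computation expensive. The function name, the toy energies and the shape strings are hypothetical.

```python
import math
from collections import defaultdict

# Gas constant in kcal/(mol K) times temperature in K (37 degrees Celsius).
RT = 0.0019872 * 310.15


def shape_probabilities(structures):
    """Sum Boltzmann weights per shape and normalize.

    structures: iterable of (shape, free_energy_in_kcal_per_mol) pairs.
    """
    weights = defaultdict(float)
    for shape, energy in structures:
        weights[shape] += math.exp(-energy / RT)
    z = sum(weights.values())  # partition function over the given structures
    return {shape: w / z for shape, w in weights.items()}


# Hypothetical toy ensemble: two structures share the shape "[]".
ensemble = [("[]", -3.0), ("[]", -3.0), ("[][]", -3.0)]
print(shape_probabilities(ensemble))
```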
Results: We devise an approach called RapidShapes that computes the shapes above a specified probability threshold T by generating a list of promising shapes and constructing specialized folding programs for each shape to compute its share of Boltzmann probability. This aims at a heuristic improvement of runtime, while still computing exact probability values.
Conclusion: Evaluating this approach and several substrategies, we find that only a small proportion of shapes have to be actually computed. For an RNA sequence of length 400, this leads, depending on the threshold, to a 10–138 fold speed-up compared with the previous complete method. Thus, probabilistic shape analysis has become feasible in medium-scale applications, such as the screening of RNA transcripts in a bacterial genome.
Availability: RapidShapes is available via http://bibiserv.cebitec.uni-bielefeld.de/rnashapes
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Recent parallel pyrosequencing methods and the increasing number of finished genomes encourage the sequencing and investigation of closely related strains. Although the sequencing itself becomes easier and cheaper with each machine generation, the finishing of the genomes remains difficult. Instead of the desired whole genomic sequence, a set of contigs is the result of the assembly. In this applications note, we present the tool r2cat (related reference contig arrangement tool) that helps in the task of comparative assembly and also provides an interactive visualization for synteny inspection.
Cancer immunotherapy has recently entered a remarkable renaissance phase with the approval of several agents for treatment. Cancer treatment platforms have demonstrated profound tumor regressions including complete cure in patients with metastatic cancer. Moreover, technological advances in next-generation sequencing (NGS) as well as the development of devices for scanning whole-slide bioimages from tissue sections and image analysis software for quantitation of tumor-infiltrating lymphocytes (TILs) allow, for the first time, the development of personalized cancer immunotherapies that target patient specific mutations. However, there is currently no bioinformatics solution that supports the integration of these heterogeneous datasets.
We have developed a bioinformatics platform – Personalized Oncology Suite (POS) – that integrates clinical data, NGS data and whole-slide bioimages from tissue sections. POS is a web-based platform that is scalable, flexible and expandable. The underlying database is based on a data warehouse schema, which is used to integrate information from different sources. POS stores clinical data, genomic data (SNPs and INDELs identified from NGS analysis), and scanned whole-slide images. It features a genome browser as well as access to several instances of the bioimage management application Bisque. POS provides different visualization techniques and offers sophisticated upload and download possibilities. The modular architecture of POS allows the community to easily modify and extend the application.
The web-based integration of clinical, NGS, and imaging data represents a valuable resource for clinical researchers and future application in medical oncology. POS can be used not only in the context of cancer immunology but also in other studies in which NGS data and images of tissue sections are generated. The application is open-source and can be downloaded at http://www.icbi.at/POS.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-306) contains supplementary material, which is available to authorized users.
Personalized oncology; Data integration; Next-generation sequencing; Whole-slide bioimaging; Application; Open-source
We present four tools for the analysis of RNA secondary structure. They provide animated visualization of multiple structures, prediction of potential conformational switching, structure comparison (including local structure alignment) and prediction of structures potentially containing a certain kind of pseudoknots. All are available via the Bielefeld University Bioinformatics Server (http://bibiserv.techfak.uni-bielefeld.de).
Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive.
Results: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet, results could be obtained in a fraction of runtime with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92.
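For readers unfamiliar with PSSM matching, the following sketch scores every window of a sequence against a small matrix and reports those at or above a score threshold. It is a naive linear scan for illustration only: the full-text index, the p-value computation and the fragment chaining of the described method are not shown, and the matrix values, the penalty for unknown residues and the function name are assumptions.

```python
UNKNOWN_RESIDUE_SCORE = -4  # hypothetical penalty for residues not in a column


def pssm_scan(pssm, sequence, threshold):
    """Return (start, score) for every window scoring at least threshold.

    pssm is a list of dicts, one per matrix column, mapping residue -> score.
    """
    m = len(pssm)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        score = sum(column.get(residue, UNKNOWN_RESIDUE_SCORE)
                    for column, residue in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy two-column matrix with integer scores.
toy_pssm = [{"A": 2, "C": -1}, {"C": 3, "A": -2}]
print(pssm_scan(toy_pssm, "ACAC", threshold=4))
```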
Availability: The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Nowadays it is possible to unravel complex information at all levels of cellular organization by obtaining multi-dimensional image information. At the macromolecular level, three-dimensional (3D) electron microscopy, together with other techniques, is able to reach resolutions at the nanometer or subnanometer level. The information is delivered in the form of 3D volumes containing samples of a given function, for example, the electron density distribution within a given macromolecule. The same situation happens at the cellular level with the new forms of light microscopy, particularly confocal microscopy, all of which produce biological 3D volume information. Furthermore, it is possible to record sequences of images over time (videos), as well as sequences of volumes, bringing key information on the dynamics of living biological systems. It is in this context that work on BioImage started two years ago, and that its first version is now presented here. In essence, BioImage is a database specifically designed to contain multi-dimensional images, perform queries and interactively work with the resulting multi-dimensional information on the World Wide Web, as well as accomplish the required cross-database links. Two sister home pages of BioImage can be accessed at http://www.bioimage.org and http://www-embl.bioimage.org
Metagenomics is a new field of research on natural microbial communities. High-throughput sequencing techniques like 454 or Solexa-Illumina promise new possibilities as they are able to produce huge amounts of data in much shorter time and with less effort and cost than the traditional Sanger technique. But the data produced comes in even shorter reads (35-100 basepairs with Illumina, 100-500 basepairs with 454-sequencing). CARMA is a new software pipeline for the characterisation of species composition and the genetic potential of microbial samples using short, unassembled reads.
In this paper, we introduce WebCARMA, a refined version of CARMA available as a web application for the taxonomic and functional classification of unassembled (ultra-)short reads from metagenomic communities. In addition, we have analysed the applicability of ultra-short reads in metagenomics.
We show that unassembled reads as short as 35 bp can be used for the taxonomic classification of a metagenome. The web application is freely available at http://webcarma.cebitec.uni-bielefeld.de.
The vast majority of microbes are unculturable and thus cannot be sequenced by means of traditional methods. High-throughput sequencing techniques like 454 or Solexa-Illumina make it possible to explore those microbes by studying whole natural microbial communities and analysing their biological diversity as well as the underlying metabolic pathways. Over the past few years, different methods have been developed for the taxonomic and functional characterization of metagenomic shotgun sequences. However, the taxonomic classification of metagenomic sequences from novel species without close homologues in the biological sequence databases poses a challenge due to the high number of wrong taxonomic predictions on lower taxonomic ranks. Here we present CARMA3, a new method for the taxonomic classification of assembled and unassembled metagenomic sequences that has been adapted to work with both BLAST and HMMER3 homology searches. We show that our method makes fewer wrong taxonomic predictions (at the same sensitivity) than other BLAST-based methods. CARMA3 is freely accessible via the web application WebCARMA from http://webcarma.cebitec.uni-bielefeld.de.
In recent years, the deluge of complicated molecular and cellular microscopic images creates compelling challenges for the image computing community. There has been an increasing focus on developing novel image processing, data mining, database and visualization techniques to extract, compare, search and manage the biological knowledge in these data-intensive problems. This emerging new area of bioinformatics can be called ‘bioimage informatics’. This article reviews the advances of this field from several aspects, including applications, key techniques, available tools and resources. Application examples such as high-throughput/high-content phenotyping and atlas building for model organisms demonstrate the importance of bioimage informatics. The essential techniques to the success of these applications, such as bioimage feature identification, segmentation and tracking, registration, annotation, mining, image data management and visualization, are further summarized, along with a brief overview of the available bioimage databases, analysis tools and other resources.
Supplementary information: Supplementary data are available at Bioinformatics online.
In order to understand the phenotype of any living system, it is essential to not only investigate its genes, but also the specific metabolic pathway variant of the organism of interest, ideally in comparison with other organisms. The Comparative Pathway Analyzer, CPA, calculates and displays the differences in metabolic reaction content between two sets of organisms. Because results are highly dependent on the distribution of organisms into these two sets and the appropriate definition of these sets often is not easy, we provide hierarchical clustering methods for the identification of significant groupings. CPA also visualizes the reaction content of several organisms simultaneously allowing easy comparison. Reaction annotation data and maps for visualizing the results are taken from the KEGG database. Additionally, users can upload their own annotation data. This website is free and open to all users and there is no login requirement. It is available at https://www.cebitec.uni-bielefeld.de/groups/brf/software/cpa/index.html.
Adduct formation, fragmentation events and matrix effects impose special challenges to the identification and quantitation of metabolites in LC-ESI-MS datasets. An important step in compound identification is the deconvolution of mass signals. During this processing step, peaks representing adducts, fragments, and isotopologues of the same analyte are allocated to a distinct group, in order to separate peaks from coeluting compounds. From these peak groups, neutral masses and pseudo spectra are derived and used for metabolite identification via mass decomposition and database matching. Quantitation of metabolites is hampered by matrix effects and nonlinear responses in LC-ESI-MS measurements. A common approach to correct for these effects is the addition of a U-13C-labeled internal standard and the calculation of mass isotopomer ratios for each metabolite. Here we present a new web-platform for the analysis of LC-ESI-MS experiments. ALLocator covers the workflow from raw data processing to metabolite identification and mass isotopomer ratio analysis. The integrated processing pipeline for spectra deconvolution “ALLocatorSD” generates pseudo spectra and automatically identifies peaks emerging from the U-13C-labeled internal standard. Information from the latter improves mass decomposition and annotation of neutral losses. ALLocator provides an interactive and dynamic interface to explore and enhance the results in depth. Pseudo spectra of identified metabolites can be stored in user- and method-specific reference lists that can be applied on succeeding datasets. The potential of the software is exemplified in an experiment, in which abundance fold-changes of metabolites of the l-arginine biosynthesis in C. glutamicum type strain ATCC 13032 and l-arginine producing strain ATCC 21831 are compared. Furthermore, the capability for detection and annotation of uncommon large neutral losses is shown by the identification of (γ-)glutamyl dipeptides in the same strains. 
ALLocator is available online at: https://allocator.cebitec.uni-bielefeld.de. A login is required, but freely available.
The Signal Transduction Classification Database (STCDB) is a database of information relative to the classification of signal transduction. It is based primarily on a proposed classification of signal transduction and it describes each type of characterized signal transduction for which a unique ST number has been provided. This document presents, in its first version, the classification of signal transduction in eukaryotic cells. Approved classifications are available for web browsing at http://www.techfak.uni-bielefeld.de/~mchen/STCDB.
Summary: We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data.
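The tables named above can be illustrated on a toy string. The sketch below builds a suffix array by plain sorting, which is far slower than the Deep-Shallow algorithm and is used here only for readability, then derives the inverse suffix array and the LCP table with the linear-time method of Kasai et al. The function names are assumptions and unrelated to mkESA's source code.

```python
def suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """inv[pos] = rank of the suffix starting at pos."""
    inv = [0] * len(sa)
    for rank, pos in enumerate(sa):
        inv[pos] = rank
    return inv


def lcp_table(text, sa):
    """lcp[r] = longest common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    n = len(text)
    inv = inverse_suffix_array(sa)
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        r = inv[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp


sa = suffix_array("banana")
print(sa, inverse_suffix_array(sa), lcp_table("banana", sa))
```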
Availability: The source code of mkESA is freely available under the terms of the GNU General Public License (GPL) version 2 at http://bibiserv.techfak.uni-bielefeld.de/mkesa/.
Every year, approximately two million people die from tuberculosis, a disease caused by the bacterium Mycobacterium tuberculosis. There is a tremendous need for new anti-tuberculosis therapies (antituberculotica) and drugs to cope with the spread of tuberculosis. Despite many efforts to obtain a better understanding of M. tuberculosis' pathogenicity and its survival strategy in humans, many questions are still unresolved. Among other cellular processes in bacteria, pathogenicity is controlled by transcriptional regulation. Thus, various studies on M. tuberculosis concentrate on the analysis of transcriptional regulation in order to gain new insights into pathogenicity and other essential processes ensuring mycobacterial survival. We designed a bioinformatics pipeline for the reliable transfer of gene regulations between taxonomically closely related organisms that incorporates (i) a prediction of orthologous genes and (ii) the prediction of transcription factor binding sites. In total, 460 regulatory interactions were identified for M. tuberculosis using our comparative approach. Based on that, we designed a publicly available platform that aims at data integration, analysis, visualization and finally the reconstruction of mycobacterial transcriptional gene regulatory networks: MycoRegNet. It is a comprehensive database system and analysis platform that offers several methods for data exploration and the generation of novel hypotheses. MycoRegNet is publicly available at http://mycoregnet.cebitec.uni-bielefeld.de.
In recent years, new microscopic imaging techniques have evolved to allow us to visualize several different proteins (or other biomolecules) in a visual field. Analysis of protein co-localization becomes viable because molecules can interact only when they are located close to each other. We present a novel approach to align images in a multi-tag fluorescence image stack. The proposed approach is applicable to multi-tag bioimaging systems which (a) acquire fluorescence images by sequential staining and (b) simultaneously capture a phase contrast image corresponding to each of the fluorescence images. To the best of our knowledge, there is no existing method in the literature that addresses simultaneous registration of multi-tag bioimages and selection of the reference image in order to maximize the overall overlap between the images.
We employ a block-based method for registration, which yields a confidence measure to indicate the accuracy of our registration results. We derive a shift metric in order to select the Reference Image with Maximal Overlap (RIMO), in turn minimizing the total amount of non-overlapping signal for a given number of tags. Experimental results show that the Robust Alignment of Multi-Tag Bioimages (RAMTaB) framework is robust to variations in contrast and illumination, yields sub-pixel accuracy, and successfully selects the reference image resulting in maximum overlap. The registration results are also shown to significantly improve any follow-up protein co-localization studies.
For the discovery of protein complexes and of functional protein networks within a cell, alignment of the tag images in a multi-tag fluorescence image stack is a key pre-processing step. The proposed framework is shown to produce accurate alignment results on both real and synthetic data. Our future work will use the aligned multi-channel fluorescence image data for normal and diseased tissue specimens to analyze molecular co-expression patterns and functional protein networks.