In recent years, clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the life sciences and bioinformatics, where new technologies generate rapidly growing volumes of raw data at increasing speed. To compensate for the small gains in clock rate, the number of cores per processor has increased. This technological shift demands changes in software development, especially in high-performance computing, where parallelization techniques are gaining importance due to the large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists applications employing them in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration for two use cases to show the typical problems and gains of transforming a serial application into a parallel one. The paper should aid the reader in choosing a suitable technique for the problem at hand. We compare four state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC) and discuss their performance as well as their applicability to selected use cases. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example; however, GPU performance is only superior above a certain problem size because of data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for the CPU are mature, and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.
GPU; FPGA; multi-core; parallelization; automatic; high throughput; bioinformatics; sequence analysis
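The k-mer use case above hinges on the property that automatic parallelizers exploit: the work decomposes into independent chunks. A language-agnostic sketch of that decomposition, assuming chunks that overlap by k-1 positions so no k-mer is lost or double-counted (illustrative only; the paper's use cases are compiled programs tuned by OpenMP, PluTo-SICA, PPCG and OpenACC):

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def count_kmers_chunked(seq, k, n_chunks):
    """Split the sequence into chunks overlapping by k-1 positions.

    Each chunk is independent, so the per-chunk calls could run on
    separate cores; here they run serially to keep the sketch simple.
    """
    total = Counter()
    step = max(1, len(seq) // n_chunks)
    for start in range(0, len(seq) - k + 1, step):
        # chunk length step+k-1 yields exactly `step` k-mer start positions
        chunk = seq[start:start + step + k - 1]
        total.update(count_kmers(chunk, k))
    return total
```

Because the chunks share no k-mer start positions, merging the per-chunk counters reproduces the serial result exactly; this independence is what a parallelizing compiler has to prove before distributing the loop.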
This paper presents a machine learning based approach for the analysis of photos collected from laboratory experiments conducted to assess the potential impact of water-based drill cuttings on deep-water rhodolith-forming calcareous algae. This pilot study uses imaging technology to quantify and monitor the stress levels of the calcareous alga Mesophyllum engelhartii (Foslie) Adey caused by varying degrees of light exposure, flow intensity and amount of sediment. A machine learning based algorithm was applied to automatically assess the temporal variation of the calcareous algae's size (∼ mass) and color. Measured size and color were correlated with the photosynthetic efficiency (maximum quantum yield of charge separation in photosystem II, ΦPSIImax) and the degree of sediment coverage using multivariate regression. The multivariate regression revealed correlations between time and algae size, as well as between fluorescence and algae color.
We present results of our machine learning approach to the problem of classifying GC-MS data originating from wheat grains of different farming systems. The aim is to investigate the potential of learning algorithms to classify GC-MS data as stemming from either conventionally or organically grown samples, taking different cultivars into account. The motivation of our work is obvious nowadays: the increased demand for organic food in post-industrialized societies and the necessity to prove organic food authenticity. Our data set comprises up to 11 wheat cultivars that were cultivated in both farming systems, organic and conventional, over 3 years. More than 300 GC-MS measurements were recorded and subsequently processed and analyzed in the MeltDB 2.0 metabolomics analysis platform, which is briefly outlined in this paper. We further describe how unsupervised (t-SNE, PCA) and supervised (SVM) methods can be applied for sample visualization and classification. Our results clearly show that the year has the greatest and the wheat cultivar the second-greatest influence on the metabolic composition of a sample. We also show that, for a given year and cultivar, organic and conventional cultivation can be distinguished by machine learning algorithms.
metabolome informatics; statistics; metabolomics; computational metabolomics; organic farming; food authentication; machine learning
The study of the mapping and interaction of co-localized proteins at a sub-cellular level is important for understanding complex biological phenomena. One of the recent techniques for mapping co-localized proteins is to apply standard immuno-fluorescence microscopy in a cyclic manner (Nat Biotechnol 24:1270–8, 2006; Proc Natl Acad Sci 110:11982–7, 2013). Unfortunately, these techniques suffer from variability in the intensity and positioning of signals from protein markers within a run and across different runs. It is therefore necessary to standardize protocols for preprocessing the multiplexed bioimaging (MBI) data from multiple runs to a comparable scale before any further analysis can be performed. In this paper, we compare various normalization protocols and, on the basis of the obtained results, propose a robust normalization technique that produces consistent results on MBI data collected from different runs using the Toponome Imaging System (TIS). Normalization results produced by the proposed method on a sample TIS data set from colorectal cancer patients were ranked favorably by two pathologists and two biologists. We show that the proposed method yields higher between-class Kullback-Leibler (KL) divergence and lower within-class KL divergence on distributions of cell phenotypes from colorectal cancer and histologically normal samples.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-016-0088-2) contains supplementary material, which is available to authorized users.
Multiplexed fluorescence imaging; Protein signatures; Toponome imaging system; Normalization protocols; Bioimage informatics
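The between-class/within-class KL criterion used above to rank normalization protocols can be sketched in a few lines. A minimal implementation for discrete cell-phenotype histograms (function name and the eps guard are illustrative, not the paper's code):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same bins.

    Inputs are raw counts or weights; both are renormalized first, and
    eps guards against division by zero and log(0).
    """
    sp, sq = sum(p), sum(q)
    return sum((pi / sp) * math.log((pi / sp + eps) / (qi / sq + eps))
               for pi, qi in zip(p, q) if pi > 0)
```

A good normalization should push phenotype distributions from different classes (tumor vs. normal) apart, giving a large between-class divergence, while keeping replicate runs of the same class similar, giving a small within-class divergence.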
MALDI mass spectrometry imaging was performed to localize metabolites during the first seven days of barley germination. Up to 100 mass signals were detected, of which 85 were identified as 48 different metabolites with highly tissue-specific localizations. Oligosaccharides were observed in the endosperm and in parts of the developing embryo. Lipids in the endosperm co-localized according to their fatty acid compositions, with changes in the distributions of diacyl phosphatidylcholines during germination. 26 potentially antifungal hordatines were detected in the embryo with tissue-specific localizations of their glycosylated, hydroxylated, and O-methylated derivatives. To reveal spatio-temporal patterns in local metabolite compositions, multiple MSI data sets from a time series were analyzed in one batch. This required a new preprocessing strategy to achieve comparability between data sets, as well as a new strategy for unsupervised clustering. The resulting spatial segmentation for each time-point sample is visualized in an interactive cluster map, enabling simultaneous interactive exploration of all time points. Using this new analysis approach and visualization tool, germination-dependent developments of metabolite patterns were discovered with single-MS-position accuracy. This is the first study to present metabolite profiling of a cereal's germination process over time by MALDI MSI, with the identification of a large number of peaks of agronomically and industrially important compounds such as oligosaccharides, lipids and antifungal agents. Their detailed localization, together with the MS cluster analyses for on-tissue metabolite profile mapping, revealed important information for understanding the germination process, which is of high scientific interest.
Adduct formation, fragmentation events and matrix effects impose special challenges to the identification and quantitation of metabolites in LC-ESI-MS datasets. An important step in compound identification is the deconvolution of mass signals. During this processing step, peaks representing adducts, fragments, and isotopologues of the same analyte are allocated to a distinct group, in order to separate peaks from coeluting compounds. From these peak groups, neutral masses and pseudo spectra are derived and used for metabolite identification via mass decomposition and database matching. Quantitation of metabolites is hampered by matrix effects and nonlinear responses in LC-ESI-MS measurements. A common approach to correct for these effects is the addition of a U-13C-labeled internal standard and the calculation of mass isotopomer ratios for each metabolite. Here we present a new web-platform for the analysis of LC-ESI-MS experiments. ALLocator covers the workflow from raw data processing to metabolite identification and mass isotopomer ratio analysis. The integrated processing pipeline for spectra deconvolution “ALLocatorSD” generates pseudo spectra and automatically identifies peaks emerging from the U-13C-labeled internal standard. Information from the latter improves mass decomposition and annotation of neutral losses. ALLocator provides an interactive and dynamic interface to explore and enhance the results in depth. Pseudo spectra of identified metabolites can be stored in user- and method-specific reference lists that can be applied on succeeding datasets. The potential of the software is exemplified in an experiment, in which abundance fold-changes of metabolites of the l-arginine biosynthesis in C. glutamicum type strain ATCC 13032 and l-arginine producing strain ATCC 21831 are compared. Furthermore, the capability for detection and annotation of uncommon large neutral losses is shown by the identification of (γ-)glutamyl dipeptides in the same strains. 
ALLocator is available online at: https://allocator.cebitec.uni-bielefeld.de. A login is required, but one is freely available.
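One idea behind the automatic recognition of internal-standard peaks can be sketched simply: a metabolite with n carbon atoms measured together with its fully U-13C-labeled counterpart yields a partner peak shifted by n times the 13C–12C mass difference. The function below is a hypothetical illustration of that pairing rule, not ALLocatorSD's actual API; the tolerance value is an assumption.

```python
# Mass difference between 13C (13.00335 u) and 12C (12.0 u) per carbon atom.
C13_SHIFT = 13.00335 - 12.0

def find_labeled_partner(mz, n_carbons, peaks, tol=0.005):
    """Return the peak matching the expected U-13C partner mass, or None."""
    expected = mz + n_carbons * C13_SHIFT
    for p in peaks:
        if abs(p - expected) <= tol:
            return p
    return None
```

From such unlabeled/labeled peak pairs, mass isotopomer ratios per metabolite can then be derived, which is what corrects for matrix effects and nonlinear ESI response in quantitation.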
Motivation: The research area of metabolomics has achieved tremendous popularity and development over the last couple of years. Owing to its unique interdisciplinarity, it requires combining knowledge from various scientific disciplines. Advances in high-throughput technology and the consequently growing quality and quantity of data put new demands on the analytical and computational methods applied. Exploration of the resulting datasets furthermore relies on powerful tools for data mining and visualization.
Results: To meet these requirements, we have created MeltDB 2.0, a next-generation web application addressing the storage, sharing, standardization, integration and analysis of metabolomics experiments. New features improve both the efficiency and the effectiveness of the entire processing pipeline for chromatographic raw data, from pre-processing to the derivation of new biological knowledge. First, the generation of high-quality metabolic datasets has been vastly simplified. Second, the new statistics toolbox makes it possible to investigate these datasets according to a wide spectrum of scientific and exploratory questions.
Availability: The system is publicly available at https://meltdb.cebitec.uni-bielefeld.de. A login is required, but one is freely available.
Megafauna play an important role in benthic ecosystem function and are sensitive indicators of environmental change. Non-invasive monitoring of benthic communities can be accomplished by seafloor imaging. However, manual quantification of megafauna in images is labor-intensive, and therefore this organism size class is often neglected in ecosystem studies. Automated image analysis has been proposed as a possible approach to such analysis, but the heterogeneity of megafaunal communities poses a non-trivial challenge for such automated techniques. Here, the potential of a generalized object detection architecture, referred to as iSIS (intelligent Screening of underwater Image Sequences), for the quantification of a heterogeneous group of megafauna taxa is investigated. The iSIS system is tuned for a particular image sequence (i.e. a transect) using a small subset of the images, in which megafauna taxa positions were previously marked by an expert. To investigate the potential of iSIS and compare its results with those obtained from human experts, a group of eight different taxa from one camera transect of seafloor images taken at the Arctic deep-sea observatory HAUSGARTEN is used. The results show that the inter- and intra-observer agreements of human experts exhibit considerable variation between the species, with a similar degree of variation apparent in the automatically derived results obtained by iSIS. Whilst some taxa (e.g. Bathycrinus stalks, Kolga hyalina, the small white sea anemone) were well detected by iSIS (overall sensitivity: 87%, overall positive predictive value: 67%), some taxa, such as the small sea cucumber Elpidia heckeri, remain challenging for both human observers and iSIS.
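The two detection scores reported above derive directly from true-positive, false-positive and false-negative counts of detected objects. A minimal sketch (function name is illustrative):

```python
def detection_scores(tp, fp, fn):
    """Sensitivity (recall) and positive predictive value (precision)
    from true-positive, false-positive and false-negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true objects found
        "ppv": tp / (tp + fp),           # fraction of detections that are correct
    }
```

Sensitivity penalizes missed animals, PPV penalizes spurious detections; reporting both is necessary because either one alone can be driven to 100% trivially.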
Motivation: Bioimaging techniques are rapidly developing toward higher resolution and dimension. The increase in dimension is achieved by techniques such as multitag fluorescence imaging, Matrix-Assisted Laser Desorption/Ionization (MALDI) imaging or Raman imaging, which record an N-dimensional intensity array for each pixel, representing local abundances of molecules, residues or interaction patterns. The analysis of such multivariate bioimages (MBIs) calls for new approaches to support users in the analysis of both feature domains: space (i.e. sample morphology) and molecular colocation or interaction. In this article, we present our approach WHIDE (Web-based Hyperbolic Image Data Explorer), which combines principles from computational learning, dimension reduction and visualization in a free web application.
Results: We applied WHIDE to a set of MBIs recorded using the multitag fluorescence imaging Toponome Imaging System. The MBIs show fields of view in tissue sections from a colon cancer study, in which we compare tissue from normal/healthy colon with tissue classified as tumor. Our results show that WHIDE efficiently reduces the complexity of the data by mapping each pixel to a cluster, referred to as a Molecular Co-Expression Phenotype, and provides a structural basis for a sophisticated multimodal visualization that combines topology-preserving pseudocoloring with information visualization. The wide range of WHIDE's applicability is demonstrated with examples from toponome imaging, high-content screens and MALDI imaging (shown in the Supplementary Material).
Availability and implementation: The WHIDE tool can be accessed via the BioIMAX website http://ani.cebitec.uni-bielefeld.de/BioIMAX/; Login: whidetestuser; Password: whidetest.
Supplementary data are available at Bioinformatics online.
In recent years, new microscopic imaging techniques have evolved that allow us to visualize several different proteins (or other biomolecules) in a visual field. Analysis of protein co-localization becomes viable because molecules can interact only when they are located close to each other. We present a novel approach to aligning images in a multi-tag fluorescence image stack. The proposed approach is applicable to multi-tag bioimaging systems which (a) acquire fluorescence images by sequential staining and (b) simultaneously capture a phase contrast image corresponding to each fluorescence image. To the best of our knowledge, no existing method in the literature addresses the simultaneous registration of multi-tag bioimages and the selection of the reference image so as to maximize the overall overlap between the images.
We employ a block-based method for registration, which yields a confidence measure to indicate the accuracy of our registration results. We derive a shift metric in order to select the Reference Image with Maximal Overlap (RIMO), in turn minimizing the total amount of non-overlapping signal for a given number of tags. Experimental results show that the Robust Alignment of Multi-Tag Bioimages (RAMTaB) framework is robust to variations in contrast and illumination, yields sub-pixel accuracy, and successfully selects the reference image resulting in maximum overlap. The registration results are also shown to significantly improve any follow-up protein co-localization studies.
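The reference-selection step can be sketched independently of the registration itself: given pairwise shift magnitudes between all images of a stack, the RIMO idea is to pick the image whose total shift to all others is smallest, i.e. the one that maximizes expected overlap. The sketch below assumes the shift matrix as input (the real framework derives it from block-based registration); the function name is illustrative.

```python
def select_reference(shift):
    """shift[i][j]: magnitude of the displacement between images i and j.

    Returns the index of the image minimizing the total shift to all
    other images, a simple stand-in for the paper's shift metric.
    """
    totals = [sum(row) for row in shift]
    return min(range(len(shift)), key=lambda i: totals[i])
```

With three images where image 1 sits between the others, image 1 is chosen, since aligning everything to it minimizes the non-overlapping signal.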
For the discovery of protein complexes and of functional protein networks within a cell, alignment of the tag images in a multi-tag fluorescence image stack is a key pre-processing step. The proposed framework is shown to produce accurate alignment results on both real and synthetic data. Our future work will use the aligned multi-channel fluorescence image data for normal and diseased tissue specimens to analyze molecular co-expression patterns and functional protein networks.
Innovations in biological and biomedical imaging produce complex high-content and multivariate image data. For decision-making and generation of hypotheses, scientists need novel information technology tools that enable them to visually explore and analyze the data and to discuss and communicate results or findings with collaborating experts from various places.
In this paper, we present a novel Web 2.0 approach, BioIMAX, for the collaborative exploration and analysis of multivariate image data, combining the web's collaboration and distribution architecture with the interface interactivity and computational power of desktop applications, an approach recently termed a rich internet application.
BioIMAX allows scientists to discuss and share data or results with collaborating experts and to visualize, annotate, and explore multivariate image data within one web-based platform from any location via a standard web browser requiring only a username and a password. BioIMAX can be accessed at http://ani.cebitec.uni-bielefeld.de/BioIMAX with the username "test" and the password "test1" for testing purposes.
Mass spectrometry-based proteomics has reached a stage where it is possible to comprehensively analyze the whole proteome of a cell in one experiment. Here, the employment of stable isotopes has become a standard technique to yield relative abundance values of proteins. In recent times, more and more experiments are conducted that depict not only a static image of the up- or down-regulated proteins at a distinct time point but instead compare developmental stages of an organism or varying experimental conditions.
Although the scientific questions behind these experiments are of course manifold, there are, nevertheless, two questions that commonly arise: 1) which proteins are differentially regulated regarding the selected experimental conditions, and 2) are there groups of proteins that show similar abundance ratios, indicating that they have a similar turnover? We give advice on how these two questions can be answered and comprehensively compare a variety of commonly applied computational methods and their outcomes.
This work provides guidance through the jungle of computational methods to analyze mass spectrometry-based isotope-labeled datasets and recommends an effective and easy-to-use evaluation strategy. We demonstrate our approach with three recently published datasets on Bacillus subtilis [1,2] and Corynebacterium glutamicum. Special focus is placed on the application and validation of cluster analysis methods. All applied methods were implemented within the rich internet application QuPE. Results can be found at
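Question 1 above admits a very simple baseline that the compared methods refine: flag proteins whose isotope-labeled abundance ratio deviates strongly from 1 on a log2 scale. This is a hedged sketch of that cutoff rule only (names and threshold are illustrative; the paper evaluates several statistical methods and cluster analyses):

```python
import math

def differentially_regulated(ratios, log2_cutoff=1.0):
    """ratios: {protein: heavy/light abundance ratio}.

    Returns the log2 ratio of every protein whose absolute log2 ratio
    reaches the cutoff, i.e. at least a two-fold change by default.
    """
    return {p: math.log2(r) for p, r in ratios.items()
            if abs(math.log2(r)) >= log2_cutoff}
```

Question 2 would then operate on the per-protein vectors of such log ratios across conditions, grouping proteins with similar profiles by cluster analysis.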
Measurements at the gene level are widely used to gain new insights into complex diseases, e.g. cancer. A promising approach to understanding basic biological mechanisms is to combine gene expression profiles with classical clinical parameters. However, the computation of a correlation coefficient between high-dimensional data and such parameters is not covered by traditional statistical methods.
We propose a novel index, the Normalized Tree Index (NTI), to compute a correlation coefficient between the clustering result of high-dimensional microarray data and nominal clinical parameters. The NTI detects correlations between hierarchically clustered microarray data and nominal clinical parameters (labels) and measures the significance of the identified correlations in terms of an empiric p-value. To this end, the microarray data are clustered by hierarchical agglomerative clustering using standard settings. In a second step, the computed cluster tree is evaluated: for each label, an NTI is computed measuring the correlation between that label and the clustered microarray data.
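The empiric p-value idea can be illustrated without the tree machinery: score the agreement between a clustering and a nominal label, then ask how often random label assignments reach that score. The sketch below uses cluster purity on a flat clustering as a simple stand-in for the NTI score (it is not the actual NTI formula; all names are illustrative):

```python
import random
from collections import Counter

def purity(clusters, labels):
    """Fraction of samples carrying the majority label of their cluster."""
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    hit = sum(Counter(ls).most_common(1)[0][1] for ls in by_cluster.values())
    return hit / len(labels)

def empiric_p_value(clusters, labels, n_perm=1000, seed=0):
    """P(permuted score >= observed score) under random label order."""
    rng = random.Random(seed)
    observed = purity(clusters, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += purity(clusters, shuffled) >= observed
    return hits / n_perm
```

A label that follows the cluster structure yields a purity rarely matched by permutations, hence a small p-value; an unrelated label gives a p-value near chance level.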
The NTI successfully identifies correlated clinical parameters at different levels of significance when applied to two real-world breast cancer microarray data sets. Some of the identified highly correlated labels confirm the current state of knowledge, whereas others help to identify new risk factors and provide a good basis for formulating new hypotheses.
The NTI is a valuable tool in the domain of biomedical data analysis. It allows the identification of correlations between high-dimensional data and nominal labels, while at the same time a p-value measures the level of significance of the detected correlations.
Non-linear relations between multiple biochemical parameters form the basis for the diagnosis of many diseases, and traditional linear analytical methods are not reliable predictors of them. Novel nonlinear techniques are increasingly used to improve the diagnostic accuracy of automated data interpretation, as exemplified in particular by the classification and diagnostic prediction of cancers based on expression profiling data. Our objective was to predict the genotype from complex biochemical data by comparing the performance of experienced clinicians with that of traditional linear analysis and of novel non-linear analytical methods.
Design and methods
As a model, we used a well-defined set of interconnected data consisting of unstimulated serum levels of steroid intermediates assessed in 54 subjects heterozygous for a mutation of the 21-hydroxylase gene (CYP21B) and in 43 healthy controls.
The genetic alteration was predicted from the pattern of steroid levels with an accuracy of 39% by clinicians and 64% by linear analysis. In contrast, non-linear methods such as self-organizing artificial neural networks, support vector machines, and nearest neighbour classifiers achieved accuracies of up to 83%.
The successful application of these non-linear adaptive methods to specific biochemical problems may have generalized implications for biochemical testing in many areas. Nonlinear analytical techniques such as neural networks, support vector machines, and nearest neighbour classifiers may serve as an important adjunct to the decision process of a human investigator not ‘trained’ in a specific complex clinical or laboratory setting and may aid in classifying the problem more directly.
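The simplest of the compared nonlinear methods, the nearest neighbour classifier, fits in a few lines. A hedged sketch with leave-one-out evaluation, as one might apply it to steroid-level profiles (data, scaling and names are illustrative, not the study's setup):

```python
import math

def nn_predict(train_x, train_y, x):
    """Label of the Euclidean-nearest training profile (1-NN)."""
    d = [math.dist(t, x) for t in train_x]
    return train_y[d.index(min(d))]

def loo_accuracy(xs, ys):
    """Leave-one-out accuracy of the 1-NN classifier."""
    hits = sum(nn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], x) == y
               for i, (x, y) in enumerate(zip(xs, ys)))
    return hits / len(xs)
```

Unlike a linear discriminant, this rule needs no assumption about the shape of the boundary between carriers and controls, which is why such methods can capture the nonlinear relations among steroid intermediates.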
Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has become an important tool in breast cancer diagnosis, but evaluation of multitemporal 3D image data holds new challenges for human observers. To aid the image analysis process, we apply supervised and unsupervised pattern recognition techniques for computing enhanced visualizations of suspicious lesions in breast MRI data. These techniques represent an important component of future sophisticated computer-aided diagnosis (CAD) systems and support the visual exploration of spatial and temporal features of DCE-MRI data stemming from patients with confirmed lesion diagnosis. By taking into account the heterogeneity of cancerous tissue, these techniques reveal signals with malignant, benign and normal kinetics. They also provide a regional subclassification of pathological breast tissue, which is the basis for pseudo-color presentations of the image data. Intelligent medical systems are expected to have substantial implications in healthcare politics by contributing to the diagnosis of indeterminate breast lesions by non-invasive imaging.
Classification; clustering; computer-aided diagnosis; magnetic resonance imaging; breast
Metagenomics, the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field has the potential to lay a solid foundation for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomics data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.
Our novel strategy was extensively evaluated using leave-one-out cross validation on fragments of variable length (800 bp – 50 Kbp) from 373 completely sequenced genomes. TACOA is able to classify genomic fragments of length 800 bp and 1 Kbp with high accuracy down to the rank of class. For longer fragments (≥ 3 Kbp), accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, classifying such fragments to their known broader taxonomic class or simply as "unknown". We compared the classification accuracy of TACOA with that of the latest intrinsic classifier, PhyloPythia, using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp, the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieve comparably high specificity up to the rank of class, and low false negative rates are also obtained.
An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict the taxonomic origin of genomic fragments as short as 800 bp with high reliability. The proposed method is transparent, fast and accurate, and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method proved competitive when compared to the most recent classifier, PhyloPythia, and has the advantage that it can be installed locally and its reference set kept up to date.
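The combination of k-nearest neighbor with kernel-based learning can be illustrated on the usual intrinsic signal: oligonucleotide (k-mer) frequency vectors. The sketch below is only a stand-in for TACOA, not its published formula: it profiles a fragment by 4-mer frequencies and picks the reference taxon with the highest cosine-kernel similarity (all names are illustrative).

```python
import math
from collections import Counter

def kmer_profile(seq, k=4):
    """Relative k-mer frequency vector of a sequence, as a dict."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def cosine_kernel(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(v * b.get(kmer, 0.0) for kmer, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def classify(fragment, references, k=4):
    """references: {taxon: genome_sequence}; returns the most similar taxon."""
    prof = kmer_profile(fragment, k)
    return max(references,
               key=lambda t: cosine_kernel(prof, kmer_profile(references[t], k)))
```

Because the reference side is just a set of precomputed profiles, adding a newly sequenced genome only requires adding one more profile, which is the updatability advantage noted above.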
Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification.
In this work we present machine learning techniques for peak intensity prediction for MALDI mass spectra. Features encoding the peptides' physico-chemical properties as well as string-based features were extracted. A feature subset was obtained from multiple forward feature selections on the extracted features. Based on these features, two advanced machine learning methods (support vector regression and local linear maps) are shown to yield good results for this problem (Pearson correlation of 0.68 in a ten-fold cross validation).
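The feature extraction step above can be sketched for a single peptide. This is a hedged illustration only: the Kyte-Doolittle hydrophobicity scale is a standard choice for physico-chemical descriptors, but the function name and the small feature set are assumptions, not the paper's actual features.

```python
# Kyte-Doolittle hydropathy values per amino-acid residue.
KYTE_DOOLITTLE = {
    'A': 1.8, 'C': 2.5, 'D': -3.5, 'E': -3.5, 'F': 2.8,
    'G': -0.4, 'H': -3.2, 'I': 4.5, 'K': -3.9, 'L': 3.8,
    'M': 1.9, 'N': -3.5, 'P': -1.6, 'Q': -3.5, 'R': -4.5,
    'S': -0.8, 'T': -0.7, 'V': 4.2, 'W': -0.9, 'Y': -1.3,
}

def peptide_features(pep):
    """A few physico-chemical and string-based descriptors of a peptide."""
    hyd = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in pep]
    return {
        'length': len(pep),
        'mean_hydrophobicity': sum(hyd) / len(pep),
        'n_basic': sum(pep.count(aa) for aa in 'KR'),  # string-based feature
    }
```

Vectors of such descriptors are what a regressor like support vector regression is trained on to map a peptide sequence to an expected peak intensity.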
The techniques presented here are a useful first step going beyond the binary prediction of proteotypic peptides towards a more quantitative prediction of peak intensities. These predictions will in turn prove beneficial for mass spectrometry-based quantitative proteomics.
Metagenomics is providing striking insights into the ecology of microbial communities. The recently developed massively parallel 454 pyrosequencing technique offers the opportunity to obtain metagenomic sequences rapidly, at low cost and without cloning bias. However, the phylogenetic analysis of the short reads produced represents a significant computational challenge. We describe the phylogenetic algorithm CARMA for predicting the source organisms of environmental 454 reads. The algorithm searches for conserved Pfam domain and protein families in the unassembled reads of a sample. These gene fragments (environmental gene tags, EGTs) are classified into a higher-order taxonomy based on the reconstruction of a phylogenetic tree for each matching Pfam family. The method exhibits high accuracy for a wide range of taxonomic groups, and EGTs as short as 27 amino acids can be phylogenetically classified down to the rank of genus. The algorithm was applied in a comparative study of three aquatic microbial samples obtained by 454 pyrosequencing, clearly revealing profound differences in the taxonomic composition of these samples.
We present the novel prokaryotic gene finder GISMO, which combines searches for protein family domains with composition-based classification using a support vector machine. GISMO is highly accurate, exhibiting high sensitivity and specificity in gene identification. We found that it performs well for complete prokaryotic chromosomes, irrespective of their GC content, as well as for plasmids as short as 10 kb, for short genes and for genes with atypical sequence composition. Using GISMO, we found several thousand new predictions for the published genomes that are supported by extrinsic evidence, strongly suggesting that these are biologically active genes. The source code for GISMO is freely available under the GPL license.
Determination of the Sinorhizobium meliloti genome sequence has provided the basis for several functional genomics approaches for this symbiotic nitrogen-fixing alpha-proteobacterium. One of these approaches is gene disruption with subsequent analysis of mutant phenotypes. This method is efficient for single genes; however, it is laborious and time-consuming when used on a large scale. Here, we applied a signature-tagged transposon mutagenesis method that allows the survival and competitiveness of many mutants to be analyzed in a single experiment. A novel set of signature tags characterized by similar melting temperatures and G+C contents of the tag sequences was developed, so that the amplification efficiencies of all tags could be expected to be similar. Thus, no preselection of the tags was necessary to create a library of 412 signature-tagged transposons. To achieve high specificity of tag detection, each transposon was bar-coded with two signature tags. To generate defined, non-redundant sets of signature-tagged S. meliloti mutants for subsequent experiments, 12,000 mutants were constructed and the insertion sites of more than 5,000 mutants were determined. One set, consisting of 378 mutants, was used in a validation experiment to identify mutants showing altered growth patterns.
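The two tag-design criteria named above are easy to compute per candidate tag. A hedged sketch: G+C content is straightforward, while the melting temperature below uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which strictly holds only for short oligos, so treat it as a rough screening estimate rather than the method used in the study.

```python
def gc_content(tag):
    """Fraction of G and C bases in a signature-tag sequence."""
    return sum(tag.count(b) for b in "GC") / len(tag)

def wallace_tm(tag):
    """Approximate melting temperature in degrees C by the Wallace rule."""
    return (2 * sum(tag.count(b) for b in "AT")
            + 4 * sum(tag.count(b) for b in "GC"))
```

Selecting tags whose values fall in narrow bands of both measures is what makes their PCR amplification efficiencies comparable without preselection.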
Multivariate imaging techniques such as dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) have been shown to provide valuable information for medical diagnosis. Even though these techniques provide new information, integrating and evaluating the much wider range of information is a challenging task for the human observer. This task may be assisted with the use of image fusion algorithms.
In this paper, image fusion based on Kernel Principal Component Analysis (KPCA) is proposed for the first time. It is demonstrated that a priori knowledge about the data domain can be easily incorporated into the parametrisation of the KPCA, leading to task-oriented visualisations of the multivariate data. The results of the fusion process are compared with those of the well-known and established standard linear Principal Component Analysis (PCA) by means of temporal sequences of 3D MRI volumes from six patients who took part in a breast cancer screening study.
The PCA and KPCA algorithms are able to integrate information from a sequence of MRI volumes into informative gray value or colour images. By incorporating a priori knowledge, the fusion process can be automated and optimised in order to visualise suspicious lesions with high contrast to normal tissue.
Our machine learning based image fusion approach maps the full signal space of a temporal DCE-MRI sequence to a single meaningful visualisation with good tissue/lesion contrast and thus supports the radiologist during manual image evaluation.
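The linear PCA variant of the fusion step can be sketched end to end: every voxel carries a temporal intensity vector, and projecting each vector onto the first principal component of the whole volume yields one fused gray value per voxel. The power-iteration implementation and toy data below are illustrative (the study used full PCA/KPCA on 3D MRI volumes, not this code):

```python
def first_pc(vectors, n_iter=200):
    """Leading eigenvector of the sample covariance, via power iteration."""
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    centred = [[v[d] - means[d] for d in range(dim)] for v in vectors]
    cov = [[sum(c[i] * c[j] for c in centred) / len(centred)
            for j in range(dim)] for i in range(dim)]
    w = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * w[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    return w, means

def fuse(vectors):
    """Map each temporal vector to a single scalar (fused gray value)."""
    w, means = first_pc(vectors)
    return [sum((v[d] - means[d]) * w[d] for d in range(len(w)))
            for v in vectors]
```

KPCA replaces the covariance step with an eigendecomposition of a kernel matrix, which is where a priori knowledge about lesion kinetics can enter through the choice and parametrisation of the kernel.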