RNA-seq, a massively parallel sequencing-based transcriptome profiling method, provides digital data in the form of aligned sequence read counts. Comparative analysis of these data requires appropriate statistical methods to estimate the differential expression of transcript variants across different cell/tissue types and disease conditions.
We developed a novel nonparametric empirical Bayes approach (NPEBseq) to model RNA-seq data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric assumption, and hence the method is “nonparametric” in nature. Based on this model, we proposed a method for detecting differentially expressed genes across different conditions. We also extended this method to detect differential usage of exons from RNA-seq data. Evaluation of NPEBseq on both simulated and publicly available RNA-seq datasets, and comparison with three popular methods, showed improved results for experiments with or without biological replicates.
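To make the empirical-Bayes idea concrete, the sketch below estimates a nonparametric prior over Poisson rates from the empirical distribution of counts across all genes, then returns a posterior-mean expression estimate for one gene. This is an illustrative simplification under an assumed Poisson sampling model, not the NPEBseq model itself; `eb_posterior_mean` and the histogram prior are hypothetical.

```python
import numpy as np
from math import lgamma, log

def poisson_logpmf(x, lam):
    """log P(X = x) for X ~ Poisson(lam)."""
    return x * log(lam) - lam - lgamma(x + 1)

def eb_posterior_mean(x, all_counts):
    """Posterior-mean rate for one gene's count x, using the empirical
    distribution of counts across all genes as a nonparametric prior
    over Poisson rates (an illustrative simplification)."""
    grid = np.unique(all_counts[all_counts > 0]).astype(float)
    prior = np.array([(all_counts == g).sum() for g in grid], dtype=float)
    prior /= prior.sum()
    loglik = np.array([poisson_logpmf(x, lam) for lam in grid])
    w = prior * np.exp(loglik - loglik.max())   # numerically stable weights
    w /= w.sum()
    return float((w * grid).sum())
```

Because the prior is read off the data rather than assumed Gamma or log-normal, the shrinkage follows whatever multimodal shape the count distribution actually has.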
NPEBseq can successfully detect differential expression between different conditions not only at the gene level but also at the exon level from RNA-seq datasets. In addition, NPEBseq performs significantly better than current methods and can be applied to genome-wide RNA-seq datasets. Sample datasets and an R package are available at http://bioinformatics.wistar.upenn.edu/NPEBseq.
The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.
This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied to four cancer microarray datasets: a childhood leukaemia dataset (the main target of this paper) collected from The Children’s Hospital at Westmead, the NCI 60 dataset, a colon dataset and a lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The BIRF approach outperforms these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that BIRF achieves 7–12% better accuracy than MSVM-RFE, with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating the training experiments three times to see whether they are globally informative or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in all three lists.
The designed BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis of the selected genes provides a way to distinguish between predictive genes and those that only appear to be predictive.
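The balanced-iterative scheme can be sketched as follows. This is a hypothetical simplification: absolute correlation with the class label stands in for random-forest variable importance, and the 50% drop schedule and function names are illustrative rather than taken from the paper.

```python
import numpy as np

def balanced_sample(y, rng):
    """Indices of a class-balanced subsample (each class downsampled
    to the size of the minority class)."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    idx = np.concatenate([rng.choice(np.where(y == c)[0], size=n, replace=False)
                          for c in classes])
    rng.shuffle(idx)
    return idx

def iterative_select(X, y, n_keep=4, drop_frac=0.5, seed=0):
    """Iteratively discard the least 'important' features, re-balancing
    classes at every round.  |correlation with the label| is a stand-in
    for random-forest importances (an illustrative simplification)."""
    rng = np.random.default_rng(seed)
    features = np.arange(X.shape[1])
    while len(features) > n_keep:
        idx = balanced_sample(y, rng)                 # re-balance each round
        Xb, yb = X[idx][:, features], y[idx]
        imp = np.abs(np.corrcoef(Xb.T, yb)[-1, :-1])  # label-vs-feature corr
        keep = max(n_keep, int(len(features) * drop_frac))
        features = features[np.argsort(imp)[::-1][:keep]]
    return features
```

Re-drawing a balanced subsample at every elimination round is what keeps importance scores from being dominated by the majority class.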
Texture within biological specimens may reveal critical insights, while being very difficult to quantify. This is a particular problem in histological analysis. For example, cross-polar images of picrosirius stained skin reveal exquisite structure, allowing changes in the basketweave conformation of healthy collagen to be assessed. Existing techniques measure gross pathological changes, such as fibrosis, but are not sufficiently sensitive to detect more subtle and progressive pathological changes in the dermis, such as those seen in ageing. Moreover, screening methods for cutaneous therapeutics require accurate, unsupervised and high-throughput image analysis techniques.
By analyzing image spectra after Gabor filtering and Fast Fourier Transformation, we were able to measure subtle changes in collagen fibre orientation that are inaccessible to existing techniques. We detected the progressive loss of collagen basketweave structure in a series of chronologically aged skin samples, as well as in skin derived from a model of type 2 diabetes mellitus.
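A minimal version of such an orientation analysis might look like the following, assuming a bank of real Gabor kernels applied via FFT-based convolution, with per-orientation response energy as the readout; all parameter values and function names are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def gabor_kernel(theta, freq=0.2, sigma=4.0, size=21):
    """Real Gabor kernel with carrier frequency `freq` along direction theta."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def orientation_energy(img, n_orient=8):
    """Mean squared response per orientation, via FFT-based convolution."""
    F = np.fft.fft2(img)
    energies = []
    for k in range(n_orient):
        g = gabor_kernel(np.pi * k / n_orient)
        G = np.fft.fft2(g, s=img.shape)        # zero-pad kernel to image size
        resp = np.fft.ifft2(F * G).real
        energies.append((resp**2).mean())
    return np.array(energies)
```

The spread of energy across orientations (for example its entropy or circular variance) can then serve as a scalar read-out of how far the fibre arrangement departs from an isotropic basketweave.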
We describe a novel bioimaging approach with implications for the evaluation of pathology in a broader range of biological situations.
Fast Fourier transform; Gabor filter; Dermis; Ageing; Histology; Collagen; Basketweave; Type 2 diabetes mellitus
DNA methylation is an epigenetic event that adds a methyl group to the 5-carbon of cytosine. This epigenetic modification can significantly affect gene expression in both normal and diseased cells. Hence, it is important to study methylation signals at the level of single cytosine sites, which is now possible using the bisulfite conversion technique (i.e., converting unmethylated Cs to Us and then to Ts after PCR amplification) and next generation sequencing (NGS) technologies. Despite the advances in NGS technologies, certain quality issues remain. Some of the more prevalent quality issues involve low per-base sequencing quality at the 3’ end, PCR amplification bias, and bisulfite conversion rates. Therefore, it is important to conduct quality assessment before downstream analysis. To the best of our knowledge, no existing software package can generally assess the quality of methylation sequencing data generated from different bisulfite-treated protocols.
To conduct the quality assessment of bisulfite methylation sequencing data, we have developed a pipeline named MethyQA. MethyQA combines currently available open-source software packages with our own custom programs written in Perl and R. The pipeline can provide quality assessment results for tens of millions of reads in under an hour. The novelty of our pipeline lies in its examination of bisulfite conversion rates and of the DNA sequence structure of regions that have different conversion rates or coverage.
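One of the metrics mentioned above, the bisulfite conversion rate, can be sketched per aligned read as follows. This illustration assumes that cytosines outside CpG context are essentially unmethylated (approximately true in most mammalian genomes) and is not MethyQA's actual implementation.

```python
def conversion_rate(ref, read):
    """Estimate bisulfite conversion from one aligned read: among reference
    cytosines outside CpG context (assumed unmethylated), count how many
    appear as T (converted) versus C (unconverted) in the read."""
    assert len(ref) == len(read)
    converted = unconverted = 0
    for i, base in enumerate(ref):
        if base != 'C':
            continue
        if i + 1 < len(ref) and ref[i + 1] == 'G':
            continue                      # CpG sites may be truly methylated
        if read[i] == 'T':
            converted += 1
        elif read[i] == 'C':
            unconverted += 1
    total = converted + unconverted
    return converted / total if total else float('nan')
```

Aggregating this ratio over all reads gives a library-level conversion estimate; rates well below ~99% usually indicate incomplete bisulfite treatment.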
MethyQA is a new software package that provides users with a unique insight into the methylation sequencing data they are researching. It allows the users to determine the quality of their data and better prepares them to address the research questions that lie ahead. Due to the speed and efficiency at which MethyQA operates, it will become an important tool for studies dealing with bisulfite methylation sequencing data.
DNA methylation; Next generation sequencing; Alignment; BRAT; Quality assessment
Large-scale chromosomal deletions or other non-specific perturbations of the transcriptome can alter the expression of hundreds or thousands of genes, and it is of biological interest to understand which genes are most profoundly affected. We present a method for predicting a gene’s expression as a function of other genes, thereby accounting for the effect of transcriptional regulation that confounds the identification of genes differentially expressed relative to a regulatory network. The challenge in constructing such models is that the number of possible regulator transcripts within a global network is on the order of thousands, while the number of biological samples is typically on the order of 10. Nevertheless, there are large gene expression databases that can be used to construct networks that could be helpful in modeling transcriptional regulation in smaller experiments.
We demonstrate a type of penalized regression model that can be estimated from large gene expression databases and then applied to smaller experiments. The ridge parameter is selected by minimizing the cross-validation error of the predictions in the independent hold-out sample. This tends to increase model stability and leads to a much greater degree of parameter shrinkage, but the resulting biased estimation is mitigated by a second round of regression. Nevertheless, the proposed computationally efficient “over-shrinkage” method outperforms previously used LASSO-based techniques. In two independent datasets, we find that the median proportion of explained variability in expression is approximately 25%, and this results in a substantial increase in the signal-to-noise ratio, allowing more powerful inferences on differential gene expression and leading to biologically intuitive findings. We also show that a large proportion of gene dependencies are conditional on the biological state, which would be impossible to detect with standard differential expression methods.
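A toy version of the two-stage idea can be written in a few lines: a heavily penalised ridge fit ranks predictors, and a second, lightly penalised regression on the top-ranked ones mitigates the shrinkage bias. The top-k cut-off and all parameter values here are illustrative stand-ins for the paper's cross-validated selection.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def over_shrinkage_fit(X, y, lam, top=3):
    """Two-stage sketch: a heavily shrunk ridge fit ranks predictors, then a
    near-unpenalised refit on the top-ranked ones removes most of the
    shrinkage bias.  The top-k cut-off is illustrative."""
    beta1 = ridge(X, y, lam)                       # stage 1: over-shrink
    keep = np.argsort(np.abs(beta1))[::-1][:top]   # rank by shrunk magnitude
    beta2 = np.zeros(X.shape[1])
    beta2[keep] = ridge(X[:, keep], y, lam=1e-6)   # stage 2: debiasing refit
    return beta2
```

The first stage trades bias for stability; the second stage restores approximately unbiased coefficients for the retained predictors.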
By adjusting for the effects of the global network on individual genes, both the sensitivity and reliability of differential expression measures are greatly improved.
Since their first commercialization, the taxonomic diversity and the genetic composition of transgene sequences in genetically modified plants (GMOs) have been constantly increasing. To date, the detection of GMOs and derived products is commonly performed by PCR-based methods targeting specific DNA sequences introduced into the host genome. The information available regarding the molecular characterization of GMOs is dispersed and not appropriately organized. For this reason, GMO testing is very challenging and requires complex screening strategies and decision-making schemes, demanding in turn the use of efficient bioinformatics tools relying on reliable information.
The GMOseek matrix was built as a comprehensive, online, open-access tabulated database which provides a reliable, comprehensive and user-friendly overview of 328 GMO events and 247 different genetic elements (status: 18/07/2013). The GMOseek matrix aims to facilitate the detection of GMOs of plant origin at different phases of the analysis. It assists in selecting the targets for a screening analysis, interpreting the screening results, checking the occurrence of a screening element in a group of selected GMOs, identifying gaps in the available pool of GMO detection methods, and designing a decision tree. The GMOseek matrix is an independent database with effective functionalities in a format facilitating transferability to other platforms. Data were collected from all available sources and experimentally tested where detection methods and certified reference materials (CRMs) were available.
The GMOseek matrix is currently a unique and very valuable tool with reliable information on GMOs of plant origin and the genetic elements they contain, enabling the further development of appropriate strategies for GMO detection. It is flexible enough to be further updated with new information and integrated into different applications and platforms.
Genetically modified organism (GMO); GMO screening; Matrix approach; Genetically modified plant
Cheminformaticians routinely have to process and analyse libraries of small molecules. Among other things, this includes the standardization of molecules, calculation of various descriptors, visualisation of molecular structures, and downstream analysis. For this purpose, scientific workflow platforms such as the Konstanz Information Miner can be used if provided with the right plug-in. A workflow-based cheminformatics tool provides the advantages of ease-of-use and interoperability between complementary cheminformatics packages within the same framework, hence facilitating the analysis process.
KNIME-CDK comprises functions for molecule conversion to/from common formats, generation of signatures, fingerprints, and molecular properties. It is based on the Chemistry Development Toolkit and uses the Chemical Markup Language for persistence. A comparison with the cheminformatics plug-in RDKit shows that KNIME-CDK supports a similar range of chemical classes and adds new functionality to the framework. We describe the design and integration of the plug-in, and demonstrate the usage of the nodes on ChEBI, a library of small molecules of biological interest.
KNIME-CDK is an open-source plug-in for the Konstanz Information Miner, a free workflow platform. KNIME-CDK is built on top of the open-source Chemistry Development Toolkit and allows for efficient cross-vendor structural cheminformatics. Its ease-of-use and modularity enable researchers to automate routine tasks and data analysis, bringing complementary cheminformatics functionality to the workflow environment.
Cheminformatics; Workflows; Data integration; Software library
Whole-cell models promise to accelerate biomedical science and engineering. However, discovering new biology from whole-cell models and other high-throughput technologies requires novel tools for exploring and analyzing complex, high-dimensional data.
We developed WholeCellViz, a web-based software program for visually exploring and analyzing whole-cell simulations. WholeCellViz provides 14 animated visualizations, including metabolic and chromosome maps. These visualizations help researchers analyze model predictions by displaying predictions in their biological context. Furthermore, WholeCellViz enables researchers to compare predictions within and across simulations by allowing users to simultaneously display multiple visualizations.
WholeCellViz was designed to facilitate the exploration, analysis, and communication of whole-cell model data. Taken together, these capabilities help researchers use whole-cell model simulations to drive advances in biology and bioengineering.
Whole-cell modeling; Data visualization; Cell physiology; Computational biology; Mycoplasma; Bacteria; Systems biology
Primer design for highly variable DNA sequences is difficult, and experimental success requires attention to many interacting constraints. The advent of next-generation sequencing methods allows the investigation of rare variants otherwise hidden deep in large populations, but requires attention to population diversity and primer localization in relatively conserved regions, in addition to recognized constraints typically considered in primer design.
Design constraints include degenerate sites to maximize population coverage, matching of melting temperatures, optimizing de novo sequence length, finding optimal bio-barcodes to allow efficient downstream analyses, and minimizing the risk of dimerization. To facilitate primer design addressing these and other constraints, we created a novel computer program (PrimerDesign) that automates this complex procedure. We demonstrate its strengths and limitations and give examples of successful designs for the analysis of HIV-1 populations.
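Two of the listed constraints, population coverage of degenerate sites and melting-temperature matching, can be illustrated with a short sketch. The Wallace rule used here is only a rough screen (production designs typically use nearest-neighbour thermodynamics), and the averaging over degenerate bases is an assumption for illustration.

```python
# IUPAC nucleotide ambiguity codes -> the bases each one stands for.
IUPAC = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
         'R': 'AG', 'Y': 'CT', 'S': 'GC', 'W': 'AT', 'K': 'GT', 'M': 'AC',
         'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT'}

def degeneracy(primer):
    """Number of distinct sequences a degenerate primer encodes."""
    d = 1
    for b in primer:
        d *= len(IUPAC[b])
    return d

def wallace_tm(primer):
    """Wallace-rule Tm (2 degrees per A/T, 4 per G/C), averaged over the
    bases each degenerate position can take."""
    tm = 0.0
    for b in primer:
        opts = IUPAC[b]
        gc = sum(c in 'GC' for c in opts) / len(opts)
        tm += 4 * gc + 2 * (1 - gc)
    return tm
```

A designer can then search for primer pairs whose estimated Tm values match within a tolerance while keeping the degeneracy product low enough for synthesis.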
PrimerDesign is useful for researchers who want to design DNA primers and probes for analyzing highly variable DNA populations. It can be used to design primers for PCR, RT-PCR, Sanger sequencing, next-generation sequencing, and other experimental protocols targeting highly variable DNA samples.
Primer design; DNA sequencing; Amplicon sequencing; Next-generation sequencing; PCR; Primer dimer; Bio-barcodes; Multiplex
Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications.
We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method’s capacity for missing data imputation, data fusion and clustering. The method can impute data that are missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method’s ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications.
The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in Python, and are available from the authors’ website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.
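The core construction, a shared gene-level function plus independent replicate-level deviations, amounts to summing covariance functions, for example as below (the kernel choice and parameter values are illustrative, not the paper's):

```python
import numpy as np

def rbf(t, var, ls):
    """Squared-exponential covariance between all pairs of times in t."""
    d = t[:, None] - t[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def hierarchical_cov(t, rep, var_g=1.0, ls_g=2.0, var_r=0.3, ls_r=2.0):
    """Covariance of a two-layer hierarchical GP: a shared gene-level
    function plus an independent per-replicate deviation, so observations
    from the same replicate are extra-correlated."""
    same_rep = rep[:, None] == rep[None, :]
    return rbf(t, var_g, ls_g) + same_rep * rbf(t, var_r, ls_r)
```

Points from the same replicate receive the additional replicate-level covariance, so they correlate more strongly than points from different replicates at the same times — and since GPs need no grid, the time vector `t` can be irregular and differ between replicates.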
Biologists may need to know the set of genes that are semantically related to a given set of genes. For instance, a biologist may need to know the set of genes related to another set of genes known to be involved in a specific disease. Some works use the concept of gene clustering to identify semantically related genes; others propose tools that return the set of genes semantically related to a given set of genes. Most of these gene similarity measures determine the semantic similarities among genes based solely on the proximity to each other of the GO terms annotating the genes, while overlooking the structural dependencies among these GO terms, which may lead to low recall and precision of results.
We propose in this paper a search engine called GRank, which overcomes the limitations of the current gene similarity measures outlined above as follows. It employs the concept of existence dependency to determine the structural dependencies among the GO terms annotating a given set of genes. After determining the set of genes that are semantically related to the input genes, GRank uses microarray experiment data to rank these genes based on their degree of relatedness to the input genes. We evaluated GRank experimentally and compared it with a comparable gene prediction tool called DynGO, which retrieves the genes and gene products that are relatives of the input genes. Results showed marked improvement.
The experimental results demonstrated that GRank overcomes the limitations of current gene similarity measures. We attribute this performance to GRank’s use of the existence dependency concept for determining the semantic relationships among gene annotations. The recall and precision values for two benchmarking datasets showed that GRank outperforms the DynGO tool, which does not employ the concept of existence dependency. The demo of GRank using 11000 KEGG yeast genes and a Gene Expression Omnibus (GEO) microarray file named “GSM34635.pad” is available at: http://ecesrvr.kustar.ac.ae:8080/ (click on the link labelled Gene Ontology 2).
Biologists make frequent use of databases containing large and complex biological networks. One popular database is the Kyoto Encyclopedia of Genes and Genomes (KEGG), which uses its own graphical representation and manual layout for pathways. While some general drawing conventions exist for biological networks, arbitrary graphical representations are very common. Recently, a new standard has been established for displaying biological processes, the Systems Biology Graphical Notation (SBGN), which aims to unify the look of such maps. Ideally, online repositories such as KEGG would automatically provide networks in a variety of notations, including SBGN. Unfortunately, this is non-trivial, since converting between notations may add, remove or otherwise alter map elements so that the existing layout cannot simply be reused.
Here we describe a methodology for automatic translation of KEGG metabolic pathways into the SBGN format. We infer important properties of the KEGG layout and treat these as layout constraints that are maintained during the conversion to SBGN maps.
This allows for the drawing and layout conventions of SBGN to be followed while creating maps that are still recognizably the original KEGG pathways. This article details the steps in this process and provides examples of the final result.
The visualisation of genetic and genomic maps aligned within and between species and across data sources can be used to inform studies of genome evolution, assist genome assembly projects and aid gene discovery and identification. Whilst the annotation, integration and exploration of assembled genome sequences are well supported, there are fewer tools available which can display genetic maps for less well-characterized species, and integrate these maps with annotated reference genomes to support cross-species comparisons.
We have developed a desktop application to draw and align genetic and genomic maps, retrieved from remote data sources or loaded as local files. Maps can be retrieved from our public map database ArkDB or from any Ensembl data source (i.e. Ensembl and Ensembl Genomes). By using the JEnsembl API, maps can be drawn for any release version of any of the thousands of species present in Ensembl data sources, allowing not only inter-specific comparisons, but also comparisons between different versions/revisions of assembled genomes. Maps can be aligned by relating identical or synonymous markers across maps, or through the gene homology/orthology relationship data stored in the Ensembl Compara databases, allowing ready visualization of regions of conserved synteny between species. The map drawing canvas is highly configurable, supports interactive exploration of maps, markers and relationships and allows export of publication quality graphics.
ArkMAP allows users to draw and interactively explore gene and variation maps for any version of any annotated genome curated in the Ensembl data sources, and to integrate local mapping data. The maps and inter-map relationships drawn are highly configurable and ArkMAP may be used to produce publication quality graphics. ArkMAP is freely available as an auto-updating Java ‘Web Start’ application, or as a standalone archived application.
Software; Map drawing; Genetic map; Genomic map; Synteny; Conserved synteny; JEnsembl; Ensembl
Dynamic protein phosphorylation is an essential regulatory mechanism in various organisms. In this capacity, it is involved in a multitude of signal transduction pathways. Kinase-specific phosphorylation data lay the foundation for reconstruction of signal transduction networks. For this reason, precise annotation of phosphorylated proteins is the first step toward simulating cell signaling pathways. However, the vast majority of kinase-specific phosphorylation data remain undiscovered and existing experimental methods and computational phosphorylation site (P-site) prediction tools have various limitations with respect to addressing this problem.
To address this issue, a novel protein kinase identification web server, PKIS, is presented here for the high-specificity identification of the protein kinases responsible for experimentally verified P-sites. It incorporates the composition of monomer spectrum (CMS) encoding strategy and support vector machines (SVMs). Compared to widely used P-site prediction tools, including KinasePhos 2.0, Musite, and GPS2.1, PKIS largely outperformed these tools in identifying protein kinases associated with known P-sites. In addition, PKIS was applied to all the P-sites in Phospho.ELM that currently lack kinase information. It successfully identified 14 potential SYK substrates with 36 known P-sites. A further literature search showed that 5 of them were indeed phosphorylated by SYK. Finally, an enrichment analysis was performed and 6 significant SYK-related signaling pathways were identified.
In general, PKIS can identify protein kinases for experimental phosphorylation sites efficiently. It is a valuable bioinformatics tool suitable for the study of protein phosphorylation. The PKIS web server is freely available at http://bioinformatics.ustc.edu.cn/pkis.
Teaching bioinformatics at universities is complicated by typical computer classroom settings. As well as running software locally and online, students should gain experience of systems administration. For a future career in biology or bioinformatics, the installation of software is a useful skill. We propose that this may be taught by running the course on GNU/Linux running on inexpensive Raspberry Pi computer hardware, for which students may be granted full administrator access.
We release 4273π, an operating system image for the Raspberry Pi based on Raspbian Linux. This includes minor customisations for classroom use and our Open Access bioinformatics course, 4273π Bioinformatics for Biologists. The course is based on the final-year undergraduate module BL4273, run on Raspberry Pi computers at the University of St Andrews in Semester 1 of the 2012–2013 academic year.
4273π is a means to teach bioinformatics, including systems administration tasks, to undergraduates at low cost.
Bioinformatics education; Teaching material; Raspberry Pi; Linux
With recent improvements of protocols for the assembly of transcriptional parts, synthetic biological devices can now be assembled more reliably according to a given design. The standardization of parts opens up the way for in silico design tools that improve constructs and optimize devices with respect to given formal design specifications. The simplest such optimization is the selection of kinetic parameters and protein abundances such that the specified design constraints are robustly satisfied. In this work we address the problem of determining parameter values that fulfill specifications expressed in terms of a functional on the trajectories of a dynamical model. We solve this inverse problem by linearizing the forward operator that maps parameter sets to specifications, and then inverting it locally. This approach has two advantages over brute-force random sampling. First, the linearization approach allows us to map back intervals instead of points; second, every obtained value in the parameter region satisfies the specifications by construction. The method is general and can hence be incorporated into a pipeline for the rational forward design of arbitrary devices in synthetic biology.
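The linearize-and-invert step can be sketched as a Gauss-Newton-style update through the pseudo-inverse of a finite-difference Jacobian of the forward map. This is a generic numerical sketch under smoothness assumptions, not the authors' implementation (which maps back intervals rather than single points):

```python
import numpy as np

def jacobian(f, theta, eps=1e-6):
    """Finite-difference Jacobian of the forward map f: parameters -> specs."""
    f0 = np.asarray(f(theta))
    J = np.zeros((f0.size, theta.size))
    for j in range(theta.size):
        tp = theta.copy()
        tp[j] += eps
        J[:, j] = (np.asarray(f(tp)) - f0) / eps
    return J

def local_inverse_step(f, theta, target):
    """One Gauss-Newton-style step: linearize f at theta and map the spec
    residual back through the pseudo-inverse of the Jacobian."""
    r = np.asarray(target) - np.asarray(f(theta))
    return theta + np.linalg.pinv(jacobian(f, theta)) @ r
```

Iterating this step from a feasible starting point drives the forward map toward the target specification; the interval-mapping variant described above applies the same local linearization to the endpoints of a specification interval.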
DNA classification methods most commonly compare differences in symbolic DNA records, which requires global multiple sequence alignment. This solution is often inappropriate, causes a number of imprecisions and requires additional user intervention for the exact alignment of similar segments. Similar segments in DNA represented as a signal are characterized by a similar curve shape. Alignment of genomic signals can adjust whole sections, not only individual symbols. Dynamic time warping (DTW) is suitable for this purpose and can replace the multiple alignment of symbolic sequences in applications such as phylogenetic analysis.
The proposed method is composed of three main parts. The first converts the symbolic representation of DNA sequences (strings of A, C, G, T symbols) into a signal representation: the cumulated phase of the complex components defined for each symbol. The second adjusts signal sizes using standard signal preprocessing methods: median filtering, detrending and resampling. The final part, necessary for comparing genomic signals, aligns the positions and lengths of the signals by dynamic time warping (DTW).
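The conversion and alignment steps can be sketched as follows. The phase increments per nucleotide follow one common complex-representation convention and may differ from the paper's exact mapping; the preprocessing stage (median filtering, detrending, resampling) is omitted for brevity.

```python
import numpy as np

# Illustrative phase increment per nucleotide (one complex-representation
# convention; the paper's exact mapping may differ).
PHASE = {'A': np.pi / 4, 'G': 3 * np.pi / 4,
         'C': -3 * np.pi / 4, 'T': -np.pi / 4}

def cumulated_phase(seq):
    """Cumulated-phase signal of a DNA string."""
    return np.cumsum([PHASE[b] for b in seq])

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because DTW warps whole sections of the curves onto each other, it tolerates insertions and length differences that would force gaps in a symbolic alignment; the pairwise DTW distances can then feed directly into hierarchical clustering.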
The application of DTW to a set of genomic signals was evaluated through dendrogram construction using cluster analysis. The resulting tree was compared with a classical phylogenetic tree reconstructed using multiple alignment. The classification of genomic signals using DTW is evolutionarily closer to the phylogeny of the organisms. The method is also more resistant to errors in the sequences and less dependent on the number of input sequences.
Classification of genomic signals using dynamic time warping is an adequate alternative to phylogenetic analysis based on symbolic DNA sequence alignment; in addition, it is a robust, quick and more precise technique.
Zebrafish embryos have recently been established as a xenotransplantation model of the metastatic behaviour of primary human tumours. Current tools for automated data extraction from the microscope images are restrictive concerning the developmental stage of the embryos, usually require laborious manual image preprocessing, and, in general, cannot characterize the metastasis as a function of the internal organs.
We present a tool, ZebIAT, that allows both automatic and semi-automatic registration of the outer contour and inner organs of zebrafish embryos. ZebIAT provides registration at different stages of development and an automatic analysis of cancer metastasis per organ, thus allowing the study of cancer progression. The semi-automation relies on a graphical user interface.
We quantified the performance of the registration method and found it to be accurate, except in some of the smallest organs. Our results show that the accuracy of registering small organs can be improved by introducing a few manual corrections. We also demonstrate the applicability of the tool to studies of cancer progression.
ZebIAT offers major improvement relative to previous tools by allowing for an analysis on a per-organ or region basis. It should be of use in high-throughput studies of cancer metastasis in zebrafish embryos.
Cell imaging is becoming an indispensable tool for cell and molecular biology research. However, most processes studied are stochastic in nature, and require the observation of many cells and events. Ideally, extraction of information from these images ought to rely on automatic methods. Here, we propose a novel segmentation method, MAMLE, for detecting cells within dense clusters.
MAMLE executes cell segmentation in two stages. The first relies on state-of-the-art filtering techniques: multi-resolution edge detection with morphological operators and threshold decomposition for adaptive thresholding. From this result, a correction procedure is applied that uses a maximum likelihood estimate as its objective function; it acquires morphological features from the initial segmentation to construct the likelihood parameters, after which the final segmentation is obtained.
We performed an empirical evaluation that includes sample images from different imaging modalities and diverse cell types. The new method attained very high (above 90%) cell segmentation accuracy in all cases. Finally, its accuracy was compared to several existing methods, and in all tests, MAMLE outperformed them in segmentation accuracy.
High-throughput genome-wide screening to study gene-specific functions, e.g. for drug discovery, demands fast automated image analysis methods to assist in unraveling the full potential of such studies. Image segmentation is typically at the forefront of such analysis as the performance of the subsequent steps, for example, cell classification, cell tracking etc., often relies on the results of segmentation.
We present a cell cytoplasm segmentation framework which first separates cell cytoplasm from the image background using a novel approach based on image enhancement and the coefficient of variation of a multi-scale Gaussian scale-space representation. A novel outline-learning-based classification method is developed using regularized logistic regression with embedded feature selection, which classifies image pixels as outline/non-outline to give cytoplasm outlines. Refinement of the detected outlines to separate cells from each other is performed in a post-processing step, where the nuclei segmentation is used as contextual information.
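One plausible reading of the coefficient-of-variation feature is a pixel-wise CV computed across a stack of Gaussian blurs at increasing scales. The sketch below is an assumption-laden illustration, not the authors' code (note that the FFT-based blur wraps at image edges):

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Gaussian blur implemented in the Fourier domain (wraps at edges)."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    H = np.exp(-2 * np.pi**2 * sigma**2 * (fy**2 + fx**2))
    return np.fft.ifft2(np.fft.fft2(img) * H).real

def scale_space_cv(img, sigmas=(1, 2, 4, 8)):
    """Pixel-wise coefficient of variation across Gaussian scales: near zero
    in flat background, large where intensity changes quickly with scale
    (cells and their edges)."""
    stack = np.stack([gaussian_blur(img, s) for s in sigmas])
    mu = stack.mean(axis=0)
    return stack.std(axis=0) / (np.abs(mu) + 1e-9)
```

Thresholding such a CV map gives a cheap first separation of foreground structure from flat background before any learned classification is applied.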
We evaluate the proposed segmentation methodology using two challenging test cases, presenting images with completely different characteristics, with cells of varying size, shape, texture and degrees of overlap. The feature selection and classification framework for outline detection produces very simple sparse models which use only a small subset of the large, generic feature set: only 7 and 5 features for the two cases. Quantitative comparison of the results for the two test cases against state-of-the-art methods shows that our methodology outperforms them with an increase of 4-9% in segmentation accuracy and a maximum accuracy of 93%. Finally, the results obtained for diverse datasets demonstrate that our framework not only produces accurate segmentation but also generalizes well to different segmentation tasks.
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are best, how the models should be built and what they can be applied to. In this study, a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented a standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide much better discrimination between bound and unbound sequences, as model prediction accuracies increased with cycle number for all proteins. We compared the prediction performance of k-mer and position-specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, the performance of the k-mer model was on average nine percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for k-mer models and PWMs, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For the protein GATA1, the model can distinguish between true ChIP-seq peaks and negative peaks. For the proteins RFX3 and NFATC1, the performance of the model was no better than chance.
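As a rough illustration of how a linear k-mer model of this kind can discriminate bound from unbound sequences, here is a hedged sketch on toy data; the planted GATA motif, sequence lengths, and the ridge-regression fit are illustrative choices, not the study's actual training procedure.

```python
import numpy as np
from itertools import product

def kmer_counts(seq, k):
    """Represent a DNA sequence by its vector of 4**k k-mer counts."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    v = np.zeros(4 ** k)
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1
    return v

# Toy data: "bound" sequences carry a planted GATA motif, "unbound" do not.
rng = np.random.default_rng(1)
def random_seq(n=20):
    return "".join(rng.choice(list("ACGT"), size=n))

bound = [s[:8] + "GATA" + s[12:] for s in (random_seq() for _ in range(50))]
unbound = [random_seq() for _ in range(50)]
X = np.array([kmer_counts(s, 4) for s in bound + unbound])
y = np.array([1.0] * 50 + [0.0] * 50)

# Fit the linear k-mer model by ridge-regularized least squares.
w = np.linalg.solve(X.T @ X + 1e-2 * np.eye(X.shape[1]), X.T @ y)
acc = np.mean(((X @ w) > 0.5) == (y == 1))
print("training accuracy:", acc)
```

The model is simply a weight per k-mer; reducing the k-mer set, as the study does via cross-validation, amounts to dropping columns of this count matrix.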
Diffusion is a key component of many biological processes such as chemotaxis, developmental differentiation and tissue morphogenesis. Recently, it has become possible to assess the spatial gradients caused by diffusion in vitro and in vivo using microscopy-based imaging techniques. The resulting time series of two-dimensional, high-resolution images, in combination with mechanistic models, enable the quantitative analysis of the underlying mechanisms. However, such a model-based analysis is still challenging due to measurement noise and sparse observations, which result in uncertainties in the model parameters.
We introduce a likelihood function for image-based measurements with log-normally distributed noise. Based upon this likelihood function, we formulate the maximum likelihood estimation problem, which is solved using PDE-constrained optimization methods. To assess the uncertainty and practical identifiability of the parameters, we introduce profile likelihoods for diffusion processes.
Results and conclusion
As a proof of concept, we model certain aspects of the guidance of dendritic cells towards lymphatic vessels, an example of haptotaxis. Using a realistic set of artificial measurement data, we estimate the five kinetic parameters of this model and compute profile likelihoods. Our novel approach for estimating model parameters from image data, as well as the proposed identifiability analysis approach, is widely applicable to diffusion processes. The profile-likelihood-based method provides more rigorous uncertainty bounds than local approximation methods.
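To illustrate the profile-likelihood idea on a much simpler problem than the PDE model, the sketch below fits a one-parameter-of-interest exponential-decay model (a stand-in for the PDE solution) to data with multiplicative log-normal noise and profiles out the amplitude analytically; all model choices and parameter values are invented for illustration.

```python
import numpy as np

# Stand-in for the PDE solution: exponential decay m(t) = a * exp(-k * t).
def model(t, a, k):
    return a * np.exp(-k * t)

# Simulate measurements with multiplicative log-normal noise.
rng = np.random.default_rng(2)
t = np.linspace(0.1, 5.0, 40)
a_true, k_true, sigma = 2.0, 0.7, 0.1
y = model(t, a_true, k_true) * np.exp(sigma * rng.normal(size=t.size))

def nll(a, k):
    """Negative log-likelihood under log-normal noise (constants dropped)."""
    r = np.log(y) - np.log(model(t, a, k))
    return np.sum(r ** 2) / (2 * sigma ** 2)

def profile(k):
    """Profile out a: log a enters the log-residual linearly, so the
    inner optimization has a closed form."""
    log_a = np.mean(np.log(y) + k * t)
    return nll(np.exp(log_a), k)

# Profile likelihood of k on a grid; its minimum gives the estimate,
# and its curvature around the minimum gives uncertainty bounds.
ks = np.linspace(0.3, 1.1, 201)
pl = np.array([profile(k) for k in ks])
k_hat = ks[np.argmin(pl)]
print("profiled estimate of k:", round(k_hat, 2))
```

For the actual PDE model, each profile evaluation requires re-solving a constrained optimization over the remaining parameters, which is where PDE-constrained optimization methods come in.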
The circadian clock is an important molecular mechanism that enables many organisms to anticipate and adapt to environmental change. Pokhilko et al. recently built a deterministic ODE mathematical model of the plant circadian clock in order to understand the behaviour, mechanisms and properties of the system. The model comprises 30 molecular species (genes, mRNAs and proteins) and over 100 parameters. The parameters have been fitted heuristically to available gene expression time series data, and the calibrated model has been shown to reproduce the behaviour of the clock components. Ongoing work is extending the clock model to cover downstream effects, in particular metabolism, necessitating further parameter estimation and model selection. This work investigates the challenges facing a full Bayesian treatment of parameter estimation. Using an efficient adaptive MCMC scheme proposed by Haario et al. and working in a high-performance computing setting, we quantify the posterior distribution around the proposed parameter values and explore the basin of attraction. We investigate whether Bayesian inference is feasible in this high-dimensional setting and thoroughly assess convergence and mixing with different statistical diagnostics, to prevent apparent convergence in some domains from masking poor mixing in others.
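A minimal sketch in the spirit of the Haario et al. adaptive Metropolis algorithm is shown below on a toy two-dimensional Gaussian posterior; the clock model's actual 100-plus-dimensional posterior, tuning constants and diagnostics are not reproduced here.

```python
import numpy as np

def adaptive_metropolis(log_post, x0, n_steps=5000, adapt_start=500):
    """Adaptive Metropolis (in the spirit of Haario et al.): the Gaussian
    proposal covariance is tuned from the chain's own empirical covariance."""
    rng = np.random.default_rng(3)
    d = len(x0)
    chain = np.empty((n_steps, d))
    x, lp = np.asarray(x0, float), log_post(x0)
    cov = 0.1 * np.eye(d)                            # initial proposal covariance
    for i in range(n_steps):
        prop = rng.multivariate_normal(x, cov)
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:      # Metropolis accept/reject
            x, lp = prop, lp_prop
        chain[i] = x
        if i >= adapt_start:                         # adapt from chain history
            cov = (2.38 ** 2 / d) * np.cov(chain[:i + 1].T) + 1e-6 * np.eye(d)
    return chain

# Toy posterior: standard 2-D Gaussian, a stand-in for the clock posterior.
log_post = lambda x: -0.5 * np.sum(np.asarray(x) ** 2)
chain = adaptive_metropolis(log_post, [3.0, -3.0])
print("posterior mean estimate:", chain[2000:].mean(axis=0).round(1))
```

The 2.38²/d scaling is the standard choice for Gaussian targets; in high dimensions the adaptation and the convergence diagnostics discussed above become the crux of the problem.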
In this paper we present an algorithm, based on the sum-product algorithm, that finds elements in the preimage of a feed-forward Boolean network given an output of the network. Our probabilistic method runs in linear time with respect to the number of nodes in the network. We evaluated our algorithm on randomly constructed Boolean networks and on a regulatory network of Escherichia coli, and found that it gives a valid solution in most cases.
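For intuition, the preimage problem can be stated on a toy feed-forward network as below; note that this brute-force enumeration is exponential in the number of inputs and serves only as a reference on tiny networks, not as the linear-time sum-product method of the paper. The example network is invented for illustration.

```python
from itertools import product

# Toy feed-forward Boolean network: 3 inputs feeding 2 output gates.
def network(x1, x2, x3):
    return (x1 and x2, x2 or not x3)

def preimage(net, n_inputs, target):
    """Brute-force reference: all input vectors that map to `target`.
    Enumerates all 2**n_inputs assignments, so only viable for tiny n."""
    return [x for x in product([False, True], repeat=n_inputs)
            if net(*x) == target]

# Which inputs produce the output (True, True)?
sols = preimage(network, 3, (True, True))
print(sols)
```

Here the target (True, True) forces x1 = x2 = True while leaving x3 free, so the preimage contains exactly two input vectors.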
The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influence on complex diseases. It is challenging to explore the relationships between these different types of genomic datasets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly less than the number of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalizations with the l-1 norm (CCA-l1) or the combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group).
We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human glioma data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features.
The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even when the group effect does not exist or irrelevant features are grouped with truly correlated features. Compared with our proposed CCA-sparse group models, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
Keywords: Group sparse CCA; Genomic data integration; Feature selection; SNP
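For readers unfamiliar with the l1-penalized sCCA baseline (CCA-l1) discussed above, here is a hedged sketch of sparse CCA via alternating soft-thresholded power iterations on the cross-covariance matrix; the toy data, penalty form and all parameter values are illustrative and do not reproduce the paper's group-sparse method.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator: shrinks toward zero, zeroing small entries."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca(X, Y, lam=0.3, n_iter=50):
    """Sketch of l1-penalized sparse CCA: alternating soft-thresholded power
    iterations on the cross-covariance, yielding unit-norm canonical vectors
    with many exactly-zero loadings."""
    C = X.T @ Y                                       # cross-covariance matrix
    u = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(n_iter):
        cu = C @ v
        u = soft(cu, lam * np.abs(cu).max())
        u /= np.linalg.norm(u) + 1e-12
        cv = C.T @ u
        v = soft(cv, lam * np.abs(cv).max())
        v /= np.linalg.norm(v) + 1e-12
    return u, v

# Toy data: feature 0 of X and feature 0 of Y share a latent signal (e.g. an
# SNP driving one gene's expression); all other features are noise.
rng = np.random.default_rng(4)
z = rng.normal(size=200)
X = rng.normal(size=(200, 8)); X[:, 0] += 2 * z
Y = rng.normal(size=(200, 6)); Y[:, 0] += 2 * z
u, v = sparse_cca(X - X.mean(0), Y - Y.mean(0))
print("selected X features:", np.flatnonzero(u))
print("selected Y features:", np.flatnonzero(v))
```

A group-sparse penalty would instead threshold whole blocks of loadings at once (e.g. all SNPs in a gene), which is the structural effect the paper's method exploits.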