With the availability of gene expression data from RNA-seq, powerful statistical approaches for grouping similar gene expression profiles across different environments have become increasingly important. We describe and assess a computational model for clustering genes into distinct groups based on their pattern of expression in response to a changing environment. The model capitalizes on the Poisson distribution to capture the count nature of RNA-seq data. A two-stage hierarchical expectation–maximization (EM) algorithm is implemented to estimate the optimal number of groups and the mean expression level of each group across two environments. A procedure is formulated to test whether and how a given group shows a plastic response to environmental changes. The impact of gene–environment interactions on the phenotypic plasticity of the organism can also be visualized and characterized. The model was used to analyse an RNA-seq dataset measured from two breast cancer cell lines that respond differently to an anti-cancer drug, from which genes associated with the resistance and sensitivity of the cell lines were identified. We performed simulation studies to validate the statistical behaviour of the model. The model provides a useful tool for clustering RNA-seq gene expression data, facilitating our understanding of gene functions and networks.
RNA-seq; Poisson distribution; EM algorithm; breast cancer cell lines
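The clustering idea above can be sketched with a one-stage Poisson-mixture EM. The paper's model is a two-stage hierarchical EM with a formal test for plasticity, which this minimal sketch omits; the function name and the quantile-based initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.stats import poisson

def poisson_mixture_em(counts, k, n_iter=200):
    """Cluster genes (rows of `counts`; one column per environment)
    with a k-component Poisson mixture fitted by EM."""
    counts = np.asarray(counts)
    n, d = counts.shape
    # Deterministic initialization: seed the k mean vectors with genes
    # taken at k quantiles of total expression.
    order = np.argsort(counts.sum(axis=1))
    lam = counts[order[np.linspace(0, n - 1, k).astype(int)]].astype(float) + 1e-6
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability that each gene belongs to each group
        logp = np.log(pi) + poisson.logpmf(counts[:, None, :], lam).sum(axis=2)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and per-environment Poisson means
        pi = resp.mean(axis=0)
        lam = (resp.T @ counts) / (resp.sum(axis=0)[:, None] + 1e-12) + 1e-9
    return resp.argmax(axis=1), lam
```

A group whose two mean columns differ markedly is a candidate for a plastic response; the model described above formalizes this with a dedicated test rather than eyeballing the means.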
With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short read data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns, with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
alignment-free; word patterns; Markov model; genome comparison; statistical power; NGS data
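A minimal illustration of the word-pattern idea: count k-mers and compute the classical D2 statistic, the inner product of the two count vectors. The dissimilarity measures reviewed here also include centred and standardized variants (such as D2* and D2S) that model word probabilities under a background Markov model; those refinements are omitted in this sketch.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count all overlapping k-mers (word patterns) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2(seq_a, seq_b, k):
    """Classical D2 statistic: inner product of the k-mer count vectors."""
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    return sum(n * cb[w] for w, n in ca.items() if w in cb)
```

Because D2 grows with sequence length and with shared background composition, studies of its statistical power typically centre and standardize it before interpreting it as a similarity score.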
Among alignment-free methods, Iterated Maps (IMs) are at a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, ‘Chaos Game Representation’. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of Big Data and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of several orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing next-generation sequencing results.
sequence analysis; iterated maps; chaos game; mapreduce; big data; alignment-free
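The iterated map at the heart of the Chaos Game Representation is tiny: assign each base a corner of the unit square, then repeatedly jump halfway from the current point toward the corner of the next base. The corner assignment below is one common convention; published CGRs vary in which base gets which corner.

```python
def cgr_points(seq):
    """Chaos Game Representation: starting at the square's centre, each
    base pulls the current point halfway toward its assigned corner."""
    corners = {'A': (0.0, 0.0), 'C': (0.0, 1.0),
               'G': (1.0, 1.0), 'T': (1.0, 0.0)}
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

The resulting point cloud is scale free: the quadrant containing a point identifies the last base, the sub-quadrant the last two bases, and so on, which is what makes the representation order free.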
Epigenetic mechanisms play an important role in the regulation of cell type-specific gene activities, yet how epigenetic patterns are established and maintained remains poorly understood. Recent studies have supported a role of DNA sequences in recruitment of epigenetic regulators. Alignment-free methods have been applied to identify distinct sequence features that are associated with epigenetic patterns and to predict epigenomic profiles. Here, we review recent advances in such applications, including the methods to map DNA sequence to feature space, sequence comparison and prediction models. Computational studies using these methods have provided important insights into the epigenetic regulatory mechanisms.
epigenetics; nucleosome; DNA sequence; alignment-free method; machine learning
Epigenetic modifications may play an important role in the formation and progression of complex diseases through the regulation of gene expression. The systematic identification of epigenetic variants that contribute to human diseases can be made possible using genome-wide association studies (GWAS), although epigenetic effects are currently not included in commonly used case–control designs for GWAS. Here, we show that epigenetic modifications can be integrated into a case–control setting by decomposing the overall genetic effect into its components: additive, dominant and epigenetic. We describe a general procedure for estimating each component and testing its significance based on a conventional chi-squared test approach. Simulation studies were performed to investigate the power and false-positive rate of this procedure, providing recommendations for its practical use. The integration of epigenetic variants into GWAS can potentially improve our understanding of how genetic, environmental and stochastic factors interact with epialleles to construct the genetic architecture of complex diseases.
Case-control design; epigenetic effect; quantitative genetics
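As a concrete illustration of testing one such component, here is the additive part of a case–control decomposition for genotype counts, via the Cochran–Armitage trend test (a standard 1-df chi-squared test with allele-dosage scores 0, 1, 2). The epigenetic component described above requires epiallele information beyond plain genotype counts and is not shown; the function name and layout are our own, not the paper's procedure.

```python
import numpy as np
from scipy.stats import chi2

def trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for an additive genotype effect.
    `cases`/`controls` are counts for the genotypes (aa, Aa, AA)."""
    cases, controls = np.asarray(cases, float), np.asarray(controls, float)
    s = np.asarray(scores, float)
    n_i = cases + controls            # per-genotype totals
    n = n_i.sum()                     # total sample size
    r = cases.sum()                   # total number of cases
    # Deviation of the score-weighted case counts from their null expectation
    t = (s * (cases - r / n * n_i)).sum()
    # Variance of t under the null (large-sample approximation)
    var = r * (n - r) / n * ((s**2 * n_i).sum() - (s * n_i).sum()**2 / n) / n
    stat = t * t / var
    return stat, chi2.sf(stat, df=1)
```

Scores (0, 1, 2) encode an additive allele-dosage effect; a dominance component can be tested analogously with scores such as (0, 1, 0).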
We compared the performance of template-free (docking) and template-based methods for the prediction of protein–protein complex structures. We found similar performance for a template-based method based on threading (COTH) and another template-based method based on structural alignment (PRISM). The template-based methods showed similar performance to a docking method (ZDOCK) when the latter was allowed one prediction for each complex, but when the same number of predictions was allowed for each method, the docking approach outperformed template-based approaches. We identified strengths and weaknesses in each method. Template-based approaches were better able to handle complexes that involved conformational changes upon binding. Furthermore, the threading-based and docking methods were better than the structural-alignment-based method for enzyme–inhibitor complex prediction. Finally, we show that the near-native (correct) predictions were generally not shared by the various approaches, suggesting that integrating their results could be the superior strategy.
protein–protein structure; template-based prediction; protein–protein docking; ZDOCK; PRISM; COTH
The formation of phenotypic traits, such as biomass production, tumor volume and viral abundance, undergoes a complex process in which interactions between genes and developmental stimuli take place at each level of biological organization from cells to organisms. Traditional studies emphasize the impact of genes by directly linking DNA-based markers with static phenotypic values. Functional mapping, derived to detect genes that control developmental processes using growth equations, has proven powerful for addressing questions about the roles of genes in development. By treating phenotypic formation as a cohesive system using differential equations, a different approach—systems mapping—dissects the system into interconnected elements and then maps genes that determine a web of interactions among these elements, facilitating our understanding of the genetic machineries for phenotypic development. Here, we argue that genetic mapping can play a more important role in studying the genotype–phenotype relationship by filling the gaps in the biochemical and regulatory process from DNA to end-point phenotype. We describe a new framework, named network mapping, to study the genetic architecture of complex traits by integrating the regulatory networks that cause a high-order phenotype. Network mapping makes use of a system of differential equations to quantify the rule by which transcriptional, proteomic and metabolomic components interact with each other to organize into a functional whole. The synthesis of functional mapping, systems mapping and network mapping provides a novel avenue to decipher a comprehensive picture of the genetic landscape of complex phenotypes that underlie economically and biomedically important traits.
network mapping; complex traits; differential equations; DNA polymorphism; systems biology
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence–structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that can be used to produce informative animations.
protein conformational sampling; parametric models; assessment tools; hidden Markov models; principal component analysis; dihedral and planar angles
A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. To make an objective and reproducible performance assessment, we have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling publicly available SNP microarray data from genomic regions with known copy-number state. The original data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. This article describes this framework and its application to a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. This comparison study may be reproduced using the open source and cross-platform R package jointseg, which implements the proposed data generation and evaluation framework: http://r-forge.r-project.org/R/?group_id=1562.
DNA copy number; segmentation; realistic data generation; performance evaluation
Advancements in high-throughput nucleotide sequencing techniques have brought with them state-of-the-art bioinformatics programs and software packages. Given the importance of molecular sequence data in contemporary life science research, these software suites are becoming an essential component of many labs and classrooms, and as such are frequently designed for non-computer specialists and marketed as one-stop bioinformatics toolkits. Although beautifully designed and powerful, user-friendly bioinformatics packages can be expensive and, as more arrive on the market each year, it can be difficult for researchers, teachers and students to choose the right software for their needs, especially if they do not have a bioinformatics background. This review highlights some of the currently available and most popular commercial bioinformatics packages, discussing their prices, usability, features and suitability for teaching. Although several commercial bioinformatics programs are arguably overpriced and overhyped, many are well designed, sophisticated and, in my opinion, worth the investment. Whether you are just beginning your foray into molecular sequence analysis or are an experienced genomicist, I encourage you to explore proprietary software bundles. They have the potential to streamline your research, increase your productivity, energize your classroom and, if anything, add a bit of zest to the often dry, detached world of bioinformatics.
bioinformatics software; CLC bio; Geneious; genome assembly; nucleotide alignment; phylogenetics software
Amino acid repeats (AARs) are abundant in protein sequences. They have particular roles in protein function and evolution. Simple repeat patterns generated by DNA slippage tend to introduce length variations and point mutations in repeat regions. Loss of normal and gain of abnormal function owing to their variable length are potential risks leading to diseases. Repeats with complex patterns mostly refer to the functional domain repeats, such as the well-known leucine-rich repeat and WD repeat, which are frequently involved in protein–protein interaction. They are mainly derived from internal gene duplication events and stabilized by ‘gate-keeper’ residues, which play crucial roles in preventing inter-domain aggregation. AARs are widely distributed in different proteomes across a variety of taxonomic ranges, and especially abundant in eukaryotic proteins. However, their specific evolutionary and functional scenarios are still poorly understood. Identifying AARs in protein sequences is the first step for the further investigation of their biological function and evolutionary mechanism. In principle, this is an NP-hard problem, as most of the repeat fragments are shaped by a series of sophisticated evolutionary events and become latent periodic patterns. It is not possible to define a uniform criterion for detecting and verifying various repeat patterns. Instead, different algorithms based on different strategies have been developed to cope with different repeat patterns. In this review, we attempt to describe the amino acid repeat-detection algorithms currently available and compare their strategies based on an in-depth analysis of the biological significance of protein repeats.
amino acid repeat; detection algorithm; low complexity sequence; repeat containing protein; protein domain repeats
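As a toy example of the simplest case, here is a scan for perfect tandem repeats of short units. Real AAR detectors must go much further, scoring degenerate and latent periodic patterns rather than exact copies; the function and its parameters are illustrative assumptions.

```python
def perfect_tandem_repeats(seq, min_unit=1, max_unit=6, min_copies=3):
    """Report perfect tandem repeats as (start, unit, copies).
    Non-primitive units (e.g. 'QQ') are skipped so that a run is
    reported in terms of its shortest repeating unit."""
    hits = []
    for i in range(len(seq)):
        for u in range(min_unit, max_unit + 1):
            unit = seq[i:i + u]
            if len(unit) < u or any(unit == unit[:d] * (u // d)
                                    for d in range(1, u) if u % d == 0):
                continue  # truncated at sequence end, or not primitive
            copies = 1
            # Greedily extend: count exact back-to-back copies of the unit
            while seq[i + copies * u:i + (copies + 1) * u] == unit:
                copies += 1
            if copies >= min_copies:
                hits.append((i, unit, copies))
    return hits
```

The primitivity check stops a run such as 'QQQQ' from also being reported as copies of 'QQ'; handling approximate copies, which slippage and point mutations produce in real proteins, is precisely where the reviewed algorithms diverge.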
The discipline of bioinformatics has developed rapidly since the complete sequencing of the first genomes in the 1990s. The development of many high-throughput techniques during the last decades has ensured that bioinformatics has grown into a discipline that overlaps with, and is required for, the modern practice of virtually every field in the life sciences. This has placed a scientific premium on the availability of skilled bioinformaticians, a qualification that is extremely scarce on the African continent. The reasons for this are numerous, although the absence of a skilled bioinformatician at academic institutions to initiate a training process and build sustained capacity seems to be a common African shortcoming. This dearth of bioinformatics expertise has had a knock-on effect on the establishment of many modern high-throughput projects at African institutes, including the comprehensive and systematic analysis of genomes from African populations, which are among the most genetically diverse anywhere on the planet. Recent funding initiatives from the National Institutes of Health and the Wellcome Trust are aimed at ameliorating this shortcoming. In this paper, we discuss the problems that have limited the establishment of the bioinformatics field in Africa, as well as propose specific actions that will help with the education and training of bioinformaticians on the continent. This is an absolute requirement in anticipation of a boom in high-throughput approaches to human health issues unique to data from African populations.
bioinformatics education; bioinformatics in Africa; postgraduate program
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
biclustering; microarray; gene expression; clustering
Glycosylation of proteins is involved in immune defense, cell–cell adhesion, cellular recognition and pathogen binding and is one of the most common and complex post-translational modifications. Science is still struggling to assign detailed mechanisms and functions to this form of conjugation. Even the structural analysis of glycoproteins—glycoproteomics—remains in its infancy due to the scarcity of high-throughput analytical platforms capable of determining glycopeptide composition and structure, especially platforms for complex biological mixtures. Glycopeptide composition and structure can be determined with high mass-accuracy mass spectrometry, particularly when combined with chromatographic separation, but the sheer volume of generated data necessitates computational software for interpretation. This review discusses the current state of glycopeptide assignment software—advances made to date and issues that remain to be addressed. The various software and algorithms developed so far provide important insights into glycoproteomics. However, there is currently no freely available software that can analyze spectral data in batch and unambiguously determine glycopeptide compositions for N- and O-linked glycopeptides from relevant biological sources such as human milk and serum. Few programs are capable of aiding in structural determination of the glycan component. To significantly advance the field of glycoproteomics, analytical software and algorithms are required that: (i) solve for both N- and O-linked glycopeptide compositions, structures and glycosites in biological mixtures; (ii) are high-throughput and process data in batches; (iii) can interpret mass spectral data from a variety of sources and (iv) are open source and freely available.
glycopeptide; glycoproteomics; glycopeptidomics; bioinformatics; N-linked; O-linked
A number of supervised machine learning models have recently been introduced for the prediction of drug–target interactions based on chemical structure and genomic sequence information. Although these models could offer improved means for many network pharmacology applications, such as repositioning of drugs for new therapeutic uses, the prediction models are often constructed and evaluated under overly simplified settings that do not reflect the real-life problem in practical applications. Using quantitative drug–target bioactivity assays for kinase inhibitors, as well as a popular benchmarking data set of binary drug–target interactions for enzyme, ion channel, nuclear receptor and G protein-coupled receptor targets, we illustrate here the effects of four factors that may lead to dramatic differences in the prediction results: (i) problem formulation (standard binary classification or more realistic regression formulation), (ii) evaluation data set (drug and target families in the application use case), (iii) evaluation procedure (simple or nested cross-validation) and (iv) experimental setting (whether training and test sets share common drugs and targets, only drugs or targets or neither). Each of these factors should be taken into consideration to avoid reporting overoptimistic drug–target interaction prediction results. We also suggest guidelines on how to make the supervised drug–target interaction prediction studies more realistic in terms of such model formulations and evaluation setups that better address the inherent complexity of the prediction task in the practical applications, as well as novel benchmarking data sets that capture the continuous nature of the drug–target interactions for kinase inhibitors.
drug–target interaction; kinase bioactivity assays; nested cross-validation; predictive modeling; supervised machine learning
In an interesting and quite exhaustive review on Random Forests (RF) methodology in bioinformatics, Touw et al. address—among other topics—the problem of the detection of interactions between variables based on RF methodology. We feel that some important statistical concepts, such as ‘interaction’, ‘conditional dependence’ or ‘correlation’, are sometimes employed inconsistently in the bioinformatics literature in general and in the literature on RF in particular. In this letter to the Editor, we aim to clarify some of the central statistical concepts and point out some confusing interpretations concerning RF given by Touw et al. and other authors.
random forest; statistics; interaction; correlation; conditional inference trees; conditional variable importance
The rise of personalized medicine and the availability of high-throughput molecular analyses in the context of clinical care have increased the need for adequate tools for translational researchers to manage and explore these data. We reviewed the biomedical literature for translational platforms allowing the management and exploration of clinical and omics data, and identified seven platforms: BRISK, caTRIP, cBio Cancer Portal, G-DOC, iCOD, iDASH and tranSMART. We analyzed these platforms along seven major axes. (1) The community axis covered information regarding initiators and funders of the project, as well as availability status and references. (2) The information content axis covered the nature of the clinical and omics data handled by each system. (3) The privacy management environment axis encompassed functionalities allowing control over data privacy. (4) In the analysis support axis, we detailed the analytical and statistical tools provided by the platforms. We also explored (5) interoperability support and (6) system requirements. The final axis (7) platform support listed the availability of documentation and installation procedures. A large heterogeneity was observed with regard to the capability to manage phenotype information in addition to omics data, as well as to security and interoperability features. The analytical and visualization features strongly depend on the considered platform. Similarly, the availability of the systems is variable. This review aims to provide the reader with the background needed to choose the platform best suited to their needs. To conclude, we discuss the desiderata for optimal translational research platforms, in terms of privacy, interoperability and technical features.
translational medical research; biomedical research; clinical data; high-throughput technologies; information storage and retrieval
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
DNA Sequencing; genome assembly; assembly forensics; visual analytics
JBrowse is a web-based genome browser, allowing many sources of data to be visualized, interpreted and navigated in a coherent visual framework. JBrowse uses efficient data structures, pre-generation of image tiles and client-side rendering to provide a fast, interactive browsing experience. Many of JBrowse's design features make it well suited for visualizing high-volume data, such as aligned next-generation sequencing reads.
genome browser; web; next-generation sequencing
Plants have been used as a source of medicine since historic times and several commercially important drugs are of plant-based origin. The traditional approach towards discovery of plant-based drugs often involves a significant amount of time and expenditure. These labor-intensive approaches have struggled to keep pace with the rapid development of high-throughput technologies. In the era of high volume, high-throughput data generation across the biosciences, bioinformatics plays a crucial role. This has generally been the case in the context of drug design and discovery. However, there has been limited attention to date to the potential application of bioinformatics approaches that can leverage plant-based knowledge. Here, we review bioinformatics studies that have contributed to medicinal plants research. In particular, we highlight areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.
medicinal plants; bioinformatics; drug discovery
The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.
suffix array construction; linear-time algorithm; text index; spaced seeds; subset seeds
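The data structure itself is easy to state: the suffix array of a text is the list of suffix start positions in lexicographic order, and exact matching then becomes binary search. The naive construction below runs in O(n² log n); SA-IS produces the identical array in linear time via induced sorting, which is the subject of the exposition.

```python
def suffix_array(text):
    """Suffix array: start positions of all suffixes, lexicographically
    sorted.  Naive O(n^2 log n) construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of `pattern` in `text`, found by binary
    search over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # leftmost suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # leftmost suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])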
Deep sequencing has become a popular tool for novel miRNA detection, but its data must be viewed carefully as the state of the field is still undeveloped. Using three different programs, miRDeep (v1, v2), miRanalyzer and DSAP, we have analyzed seven data sets (six biological and one simulated) to provide a critical evaluation of the programs' performance. We selected these programs based on their popularity and overall approach toward the detection of novel and known miRNAs using deep-sequencing data. The program comparisons suggest that, despite differing stringency levels, they all identify a similar set of known and novel predictions. Comparisons between the first and second version of miRDeep suggest that the stringency level of each of these programs may, in fact, be a result of the algorithm used to map the reads to the target. Different stringency levels are likely to affect the number of possible novel candidates for functional verification, causing undue strain on resources and time. With that in mind, we propose that an intersection across multiple programs be taken, especially if considering novel candidates that will be targeted for additional analysis. Using this approach, we identified and performed initial validation of 12 novel predictions in our in-house data with real-time PCR, six of which were previously unreported.
deep sequencing; software; miRNA detection; comparison
RNA-sequencing (RNA-seq) has rapidly become a popular tool to characterize transcriptomes. A fundamental research problem in many RNA-seq studies is the identification of reliable molecular markers that show differential expression between distinct sample groups. Together with the growing popularity of RNA-seq, a number of data analysis methods and pipelines have already been developed for this task. Currently, however, there is no clear consensus about best practices, which makes the choice of an appropriate method a daunting task, especially for a basic user without a strong statistical or computational background. To assist this choice, we perform here a systematic comparison of eight widely used software packages and pipelines for detecting differential expression between sample groups in a practical research setting, and provide general guidelines for choosing a robust pipeline. In general, our results demonstrate how the data analysis tool utilized can markedly affect the outcome of the data analysis, highlighting the importance of this choice.
RNA-seq; gene expression; differential expression