The formation of phenotypic traits, such as biomass production, tumor volume and viral abundance, undergoes a complex process in which interactions between genes and developmental stimuli take place at each level of biological organization from cells to organisms. Traditional studies emphasize the impact of genes by directly linking DNA-based markers with static phenotypic values. Functional mapping, derived to detect genes that control developmental processes using growth equations, has proven powerful for addressing questions about the roles of genes in development. By treating phenotypic formation as a cohesive system using differential equations, a different approach—systems mapping—dissects the system into interconnected elements and then map genes that determine a web of interactions among these elements, facilitating our understanding of the genetic machineries for phenotypic development. Here, we argue that genetic mapping can play a more important role in studying the genotype–phenotype relationship by filling the gaps in the biochemical and regulatory process from DNA to end-point phenotype. We describe a new framework, named network mapping, to study the genetic architecture of complex traits by integrating the regulatory networks that cause a high-order phenotype. Network mapping makes use of a system of differential equations to quantify the rule by which transcriptional, proteomic and metabolomic components interact with each other to organize into a functional whole. The synthesis of functional mapping, systems mapping and network mapping provides a novel avenue to decipher a comprehensive picture of the genetic landscape of complex phenotypes that underlie economically and biomedically important traits.
network mappin; complex traits; differential equations; DNA polymorphism; systems biology
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence–structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that can be used to produce informative animations.
protein conformational sampling; parametric models; assessment tools; hidden Markov models; principal component analysis; dihedral and planar angles
Amino acid repeats (AARs) are abundant in protein sequences. They have particular roles in protein function and evolution. Simple repeat patterns generated by DNA slippage tend to introduce length variations and point mutations in repeat regions. Loss of normal and gain of abnormal function owing to their variable length are potential risks leading to diseases. Repeats with complex patterns mostly refer to the functional domain repeats, such as the well-known leucine-rich repeat and WD repeat, which are frequently involved in protein–protein interaction. They are mainly derived from internal gene duplication events and stabilized by ‘gate-keeper’ residues, which play crucial roles in preventing inter-domain aggregation. AARs are widely distributed in different proteomes across a variety of taxonomic ranges, and especially abundant in eukaryotic proteins. However, their specific evolutionary and functional scenarios are still poorly understood. Identifying AARs in protein sequences is the first step for the further investigation of their biological function and evolutionary mechanism. In principle, this is an NP-hard problem, as most of the repeat fragments are shaped by a series of sophisticated evolutionary events and become latent periodical patterns. It is not possible to define a uniform criterion for detecting and verifying various repeat patterns. Instead, different algorithms based on different strategies have been developed to cope with different repeat patterns. In this review, we attempt to describe the amino acid repeat-detection algorithms currently available and compare their strategies based on an in-depth analysis of the biological significance of protein repeats.
amino acid repeat; detection algorithm; low complexity sequence; repeat containing protein; protein domain repeats
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
biclustering; microarray; gene expression; clustering
Glycosylation of proteins is involved in immune defense, cell–cell adhesion, cellular recognition and pathogen binding and is one of the most common and complex post-translational modifications. Science is still struggling to assign detailed mechanisms and functions to this form of conjugation. Even the structural analysis of glycoproteins—glycoproteomics—remains in its infancy due to the scarcity of high-throughput analytical platforms capable of determining glycopeptide composition and structure, especially platforms for complex biological mixtures. Glycopeptide composition and structure can be determined with high mass-accuracy mass spectrometry, particularly when combined with chromatographic separation, but the sheer volume of generated data necessitates computational software for interpretation. This review discusses the current state of glycopeptide assignment software—advances made to date and issues that remain to be addressed. The various software and algorithms developed so far provide important insights into glycoproteomics. However, there is currently no freely available software that can analyze spectral data in batch and unambiguously determine glycopeptide compositions for N- and O-linked glycopeptides from relevant biological sources such as human milk and serum. Few programs are capable of aiding in structural determination of the glycan component. To significantly advance the field of glycoproteomics, analytical software and algorithms are required that: (i) solve for both N- and O-linked glycopeptide compositions, structures and glycosites in biological mixtures; (ii) are high-throughput and process data in batches; (iii) can interpret mass spectral data from a variety of sources and (iv) are open source and freely available.
glycopeptide; glycoproteomics; glycopeptidomics; bioinformatics; N-linked; O-linked
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
DNA Sequencing; genome assembly; assembly forensics; visual analytics
JBrowse is a web-based genome browser, allowing many sources of data to be visualized, interpreted and navigated in a coherent visual framework. JBrowse uses efficient data structures, pre-generation of image tiles and client-side rendering to provide a fast, interactive browsing experience. Many of JBrowse's design features make it well suited for visualizing high-volume data, such as aligned next-generation sequencing reads.
genome browser; web; next-generation sequencing
Plants have been used as a source of medicine since historic times and several commercially important drugs are of plant-based origin. The traditional approach towards discovery of plant-based drugs often times involves significant amount of time and expenditure. These labor-intensive approaches have struggled to keep pace with the rapid development of high-throughput technologies. In the era of high volume, high-throughput data generation across the biosciences, bioinformatics plays a crucial role. This has generally been the case in the context of drug designing and discovery. However, there has been limited attention to date to the potential application of bioinformatics approaches that can leverage plant-based knowledge. Here, we review bioinformatics studies that have contributed to medicinal plants research. In particular, we highlight areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.
medicinal plants; bioinformatics; drug discovery
The suffix array and its variants are text-indexing data structures that have become indispensable in the field of bioinformatics. With the uninitiated in mind, we provide an accessible exposition of the SA-IS algorithm, which is the state of the art in suffix array construction. We also describe DisLex, a technique that allows standard suffix array construction algorithms to create modified suffix arrays designed to enable a simple form of inexact matching needed to support ‘spaced seeds’ and ‘subset seeds’ used in many biological applications.
suffix array construction; linear-time algorithm; text index; spaced seeds; subset seeds
Deep sequencing has become a popular tool for novel miRNA detection but its data must be viewed carefully as the state of the field is still undeveloped. Using three different programs, miRDeep (v1, 2), miRanalyzer and DSAP, we have analyzed seven data sets (six biological and one simulated) to provide a critical evaluation of the programs performance. We selected these software based on their popularity and overall approach toward the detection of novel and known miRNAs using deep-sequencing data. The program comparisons suggest that, despite differing stringency levels they all identify a similar set of known and novel predictions. Comparisons between the first and second version of miRDeep suggest that the stringency level of each of these programs may, in fact, be a result of the algorithm used to map the reads to the target. Different stringency levels are likely to affect the number of possible novel candidates for functional verification, causing undue strain on resources and time. With that in mind, we propose that an intersection across multiple programs be taken, especially if considering novel candidates that will be targeted for additional analysis. Using this approach, we identify and performed initial validation of 12 novel predictions in our in-house data with real-time PCR, six of which have been previously unreported.
deep sequencing; software; miRNA detection; comparison
RNA-sequencing (RNA-seq) has rapidly become a popular tool to characterize transcriptomes. A fundamental research problem in many RNA-seq studies is the identification of reliable molecular markers that show differential expression between distinct sample groups. Together with the growing popularity of RNA-seq, a number of data analysis methods and pipelines have already been developed for this task. Currently, however, there is no clear consensus about the best practices yet, which makes the choice of an appropriate method a daunting task especially for a basic user without a strong statistical or computational background. To assist the choice, we perform here a systematic comparison of eight widely used software packages and pipelines for detecting differential expression between sample groups in a practical research setting and provide general guidelines for choosing a robust pipeline. In general, our results demonstrate how the data analysis tool utilized can markedly affect the outcome of the data analysis, highlighting the importance of this choice.
RNA-seq; gene expression; differential expression
Capturing the dynamism that pervades biological systems requires a computational approach that can accommodate both the continuous features of the system environment as well as the flexible and heterogeneous nature of component interactions. This presents a serious challenge for the more traditional mathematical approaches that assume component homogeneity to relate system observables using mathematical equations. While the homogeneity condition does not lead to loss of accuracy while simulating various continua, it fails to offer detailed solutions when applied to systems with dynamically interacting heterogeneous components. As the functionality and architecture of most biological systems is a product of multi-faceted individual interactions at the sub-system level, continuum models rarely offer much beyond qualitative similarity. Agent-based modelling is a class of algorithmic computational approaches that rely on interactions between Turing-complete finite-state machines—or agents—to simulate, from the bottom-up, macroscopic properties of a system. In recognizing the heterogeneity condition, they offer suitable ontologies to the system components being modelled, thereby succeeding where their continuum counterparts tend to struggle. Furthermore, being inherently hierarchical, they are quite amenable to coupling with other computational paradigms. The integration of any agent-based framework with continuum models is arguably the most elegant and precise way of representing biological systems. Although in its nascence, agent-based modelling has been utilized to model biological complexity across a broad range of biological scales (from cells to societies). In this article, we explore the reasons that make agent-based modelling the most precise approach to model biological systems that tend to be non-linear and complex.
agent-based model; biological complexity; computational modeling; cell; emergence; hybrid models
Semantic web technologies offer an approach to data integration and sharing, even for resources developed independently or broadly distributed across the web. This approach is particularly suitable for scientific domains that profit from large amounts of data that reside in the public domain and that have to be exploited in combination. Translational medicine is such a domain, which in addition has to integrate private data from the clinical domain with proprietary data from the pharmaceutical domain. In this survey, we present the results of our analysis of translational medicine solutions that follow a semantic web approach. We assessed these solutions in terms of their target medical use case; the resources covered to achieve their objectives; and their use of existing semantic web resources for the purposes of data sharing, data interoperability and knowledge discovery. The semantic web technologies seem to fulfill their role in facilitating the integration and exploration of data from disparate sources, but it is also clear that simply using them is not enough. It is fundamental to reuse resources, to define mappings between resources, to share data and knowledge. All these aspects allow the instantiation of translational medicine at the semantic web-scale, thus resulting in a network of solutions that can share resources for a faster transfer of new scientific results into the clinical practice. The envisioned network of translational medicine solutions is on its way, but it still requires resolving the challenges of sharing protected data and of integrating semantic-driven technologies into the clinical practice.
semantic web; translational medicine; data integration; data sharing; data interoperability; knowledge discovery
With the development of the Internet and the growth of online resources, bioinformatics training for wet-lab biologists became necessary as a part of their education. This article describes a one-semester course ‘Applied Bioinformatics Course’ (ABC, http://abc.cbi.pku.edu.cn/) that the author has been teaching to biological graduate students at the Peking University and the Chinese Academy of Agricultural Sciences for the past 13 years. ABC is a hands-on practical course to teach students to use online bioinformatics resources to solve biological problems related to their ongoing research projects in molecular biology. With a brief introduction to the background of the course, detailed information about the teaching strategies of the course are outlined in the ‘How to teach’ section. The contents of the course are briefly described in the ‘What to teach’ section with some real examples. The author wishes to share his teaching experiences and the online teaching materials with colleagues working in bioinformatics education both in local and international universities.
bioinformatics education; introductory course; hands-on course; project-based learning; on-site teaching
Genome-wide association studies are becoming computationally more demanding with the growing amounts of data. Combinatorial traits can increase the data dimensions beyond the computational capabilities of the current tools. We addressed this issue by creating an application for quick association analysis that is ten to hundreds of times faster than the leading fast methods. Our tool (RegScan) is designed for performing basic linear regression analysis with continuous traits maximally fast on large data sets. RegScan specifically targets association analysis of combinatorial traits in metabolomics. It can both generate and analyze the combinatorial traits efficiently. RegScan is capable of analyzing any number of traits together without the need to specify each trait individually. The main goal of the article is to show that RegScan can be the preferred analytical tool when large amounts of data need to be analyzed quickly using the allele frequency test.
Availability: Precompiled RegScan (all major platforms), source code, user guide and examples are freely available at www.biobank.ee/regscan.
Requirements: Qt 4.4.3 or newer for dynamic compilations.
GWAS; genome-wide analysis; linear regression; continuous traits; combinatorial traits; metabolomics
Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository to share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. http) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP), and custom-built solutions over HTTP are utilized in providing access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web accessible data: BioMart, caBIG, and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of the technology are dependent on the uses each was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.
BioMart; Google Data Sources; caBIG; data access; proteomics
Affymetrix GeneChip microarrays are the most widely used high-throughput technology to measure gene expression, and a wide variety of preprocessing methods have been developed to transform probe intensities reported by a microarray scanner into gene expression estimates. There have been numerous comparisons of these preprocessing methods, focusing on the most common analyses—detection of differential expression and gene or sample clustering. Recently, more complex multivariate analyses, such as gene co-expression, differential co-expression, gene set analysis and network modeling, are becoming more common; however, the same preprocessing methods are typically applied. In this article, we examine the effect of preprocessing methods on some of these multivariate analyses and provide guidance to the user as to which methods are most appropriate.
microarray; preprocessing; gene expression; multivariate analysis
Access to gene expression data has become increasingly common in recent years; however, analysis has become more difficult as it is often desirable to integrate data from different platforms. Probe mapping across microarray platforms is the first and most crucial step for data integration. In this article, we systematically review and compare different approaches to map probes across seven platforms from different vendors: U95A, U133A and U133 Plus 2.0 from Affymetrix, Inc.; HT-12 v1, HT-12v2 and HT-12v3 from Illumina, Inc.; and 4112A from Agilent, Inc. We use a unique data set, which contains 56 lung cancer cell line samples—each of which has been measured by two different microarray platforms—to evaluate the consistency of expression measurement across platforms using different approaches. Based on the evaluation from the empirical data set, the BLAST alignment of the probe sequences to a recent revision of the Transcriptome generated better results than using annotations provided by Vendors or from Bioconductor's Annotate package. However, a combination of all three methods (deemed the ‘Consensus Annotation’) yielded the most consistent expression measurement across platforms. To facilitate data integration across microarray platforms for the research community, we develop a user-friendly web-based tool, an API and an R package to map data across different microarray platforms from Affymetrix, Illumina and Agilent. Information on all three can be found at http://qbrc.swmed.edu/software/probemapper/.
microarray; gene expression; probe; integrated analysis; probe mapping
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
Infinium HumanMethylation450 beadarray is a popular technology to explore DNA methylomes in health and disease, and there is a current explosion in the use of this technique. Despite experience acquired from gene expression microarrays, analyzing Infinium Methylation arrays appeared more complex than initially thought and several difficulties have been encountered, as those arrays display specific features that need to be taken into consideration during data processing. Here, we review several issues that have been highlighted by the scientific community, and we present an overview of the general data processing scheme and an evaluation of the different normalization methods available to date to guide the 450K users in their analysis and data interpretation.
Epigenomics; Genome-wide DNA methylation technology
Large quantities of information describing the mechanisms of biological pathways continue to be collected in publicly available databases. At the same time, experiments have increased in scale, and biologists increasingly use pathways defined in online databases to interpret the results of experiments and generate hypotheses. Emerging computational techniques that exploit the rich biological information captured in reaction systems require formal standardized descriptions of pathways to extract these reaction networks and avoid the alternative: time-consuming and largely manual literature-based network reconstruction. Here, we systematically evaluate the effects of commonly used knowledge representations on the seemingly simple task of extracting a reaction network describing signal transduction from a pathway database. We show that this process is in fact surprisingly difficult, and the pathway representations adopted by various knowledge bases have dramatic consequences for reaction network extraction, connectivity, capture of pathway crosstalk and in the modelling of cell–cell interactions. Researchers constructing computational models built from automatically extracted reaction networks must therefore consider the issues we outline in this review to maximize the value of existing pathway knowledge.
databases; signal transduction; modelling; reaction networks
The emergence of high-throughput technologies for measuring biological systems has introduced problems for data interpretation that must be addressed for proper inference. First, analysis techniques need to be matched to the biological system, reflecting in their mathematical structure the underlying behavior being studied. When this is not done, mathematical techniques will generate answers, but the values and reliability estimates may not accurately reflect the biology. Second, analysis approaches must address the vast excess in variables measured (e.g. transcript levels of genes) over the number of samples (e.g. tumors, time points), known as the ‘large-p, small-n’ problem. In large-p, small-n paradigms, standard statistical techniques generally fail, and computational learning algorithms are prone to overfit the data. Here we review the emergence of techniques that match mathematical structure to the biology, the use of integrated data and prior knowledge to guide statistical analysis, and the recent emergence of analysis approaches utilizing simple biological models. We show that novel biological insights have been gained using these techniques.
Bayesian analysis; computational molecular biology; signal pathways; metabolic pathways; databases
The patterns of variation within a molecular sequence data set result from the interplay between population genetic, molecular evolutionary and macroevolutionary processes—the standard purview of evolutionary biologists. Elucidating these patterns, particularly for large data sets, requires an understanding of the structure, assumptions and limitations of the algorithms used by bioinformatics software—the domain of mathematicians and computer scientists. As a result, bioinformatics often suffers a ‘two-culture’ problem because of the lack of broad overlapping expertise between these two groups. Collaboration among specialists in different fields has greatly mitigated this problem among active bioinformaticians. However, science education researchers report that much of bioinformatics education does little to bridge the cultural divide, the curriculum too focused on solving narrow problems (e.g. interpreting pre-built phylogenetic trees) rather than on exploring broader ones (e.g. exploring alternative phylogenetic strategies for different kinds of data sets). Herein, we present an introduction to the mathematics of tree enumeration, tree construction, split decomposition and sequence alignment. We also introduce off-line downloadable software tools developed by the BioQUEST Curriculum Consortium to help students learn how to interpret and critically evaluate the results of standard bioinformatics analyses.
bioinformatics education; discrete mathematics; quantitative reasoning; off-line downloadable free and open-source software; evolutionary problem solving