The formation of phenotypic traits, such as biomass production, tumor volume and viral abundance, undergoes a complex process in which interactions between genes and developmental stimuli take place at each level of biological organization from cells to organisms. Traditional studies emphasize the impact of genes by directly linking DNA-based markers with static phenotypic values. Functional mapping, derived to detect genes that control developmental processes using growth equations, has proven powerful for addressing questions about the roles of genes in development. By treating phenotypic formation as a cohesive system using differential equations, a different approach—systems mapping—dissects the system into interconnected elements and then map genes that determine a web of interactions among these elements, facilitating our understanding of the genetic machineries for phenotypic development. Here, we argue that genetic mapping can play a more important role in studying the genotype–phenotype relationship by filling the gaps in the biochemical and regulatory process from DNA to end-point phenotype. We describe a new framework, named network mapping, to study the genetic architecture of complex traits by integrating the regulatory networks that cause a high-order phenotype. Network mapping makes use of a system of differential equations to quantify the rule by which transcriptional, proteomic and metabolomic components interact with each other to organize into a functional whole. The synthesis of functional mapping, systems mapping and network mapping provides a novel avenue to decipher a comprehensive picture of the genetic landscape of complex phenotypes that underlie economically and biomedically important traits.
network mappin; complex traits; differential equations; DNA polymorphism; systems biology
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? and What is the order of the local sequence–structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/∼madoliat/LagSVD) that can be used to produce informative animations.
protein conformational sampling; parametric models; assessment tools; hidden Markov models; principal component analysis; dihedral and planar angles
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
biclustering; microarray; gene expression; clustering
Glycosylation of proteins is involved in immune defense, cell–cell adhesion, cellular recognition and pathogen binding and is one of the most common and complex post-translational modifications. Science is still struggling to assign detailed mechanisms and functions to this form of conjugation. Even the structural analysis of glycoproteins—glycoproteomics—remains in its infancy due to the scarcity of high-throughput analytical platforms capable of determining glycopeptide composition and structure, especially platforms for complex biological mixtures. Glycopeptide composition and structure can be determined with high mass-accuracy mass spectrometry, particularly when combined with chromatographic separation, but the sheer volume of generated data necessitates computational software for interpretation. This review discusses the current state of glycopeptide assignment software—advances made to date and issues that remain to be addressed. The various software and algorithms developed so far provide important insights into glycoproteomics. However, there is currently no freely available software that can analyze spectral data in batch and unambiguously determine glycopeptide compositions for N- and O-linked glycopeptides from relevant biological sources such as human milk and serum. Few programs are capable of aiding in structural determination of the glycan component. To significantly advance the field of glycoproteomics, analytical software and algorithms are required that: (i) solve for both N- and O-linked glycopeptide compositions, structures and glycosites in biological mixtures; (ii) are high-throughput and process data in batches; (iii) can interpret mass spectral data from a variety of sources and (iv) are open source and freely available.
glycopeptide; glycoproteomics; glycopeptidomics; bioinformatics; N-linked; O-linked
JBrowse is a web-based genome browser, allowing many sources of data to be visualized, interpreted and navigated in a coherent visual framework. JBrowse uses efficient data structures, pre-generation of image tiles and client-side rendering to provide a fast, interactive browsing experience. Many of JBrowse's design features make it well suited for visualizing high-volume data, such as aligned next-generation sequencing reads.
genome browser; web; next-generation sequencing
Plants have been used as a source of medicine since historic times and several commercially important drugs are of plant-based origin. The traditional approach towards discovery of plant-based drugs often times involves significant amount of time and expenditure. These labor-intensive approaches have struggled to keep pace with the rapid development of high-throughput technologies. In the era of high volume, high-throughput data generation across the biosciences, bioinformatics plays a crucial role. This has generally been the case in the context of drug designing and discovery. However, there has been limited attention to date to the potential application of bioinformatics approaches that can leverage plant-based knowledge. Here, we review bioinformatics studies that have contributed to medicinal plants research. In particular, we highlight areas in medicinal plant research where the application of bioinformatics methodologies may result in quicker and potentially cost-effective leads toward finding plant-based remedies.
medicinal plants; bioinformatics; drug discovery
Deep sequencing has become a popular tool for novel miRNA detection but its data must be viewed carefully as the state of the field is still undeveloped. Using three different programs, miRDeep (v1, 2), miRanalyzer and DSAP, we have analyzed seven data sets (six biological and one simulated) to provide a critical evaluation of the programs performance. We selected these software based on their popularity and overall approach toward the detection of novel and known miRNAs using deep-sequencing data. The program comparisons suggest that, despite differing stringency levels they all identify a similar set of known and novel predictions. Comparisons between the first and second version of miRDeep suggest that the stringency level of each of these programs may, in fact, be a result of the algorithm used to map the reads to the target. Different stringency levels are likely to affect the number of possible novel candidates for functional verification, causing undue strain on resources and time. With that in mind, we propose that an intersection across multiple programs be taken, especially if considering novel candidates that will be targeted for additional analysis. Using this approach, we identify and performed initial validation of 12 novel predictions in our in-house data with real-time PCR, six of which have been previously unreported.
deep sequencing; software; miRNA detection; comparison
High-throughput studies have been extensively conducted in the research of complex human diseases. As a representative example, consider gene-expression studies where thousands of genes are profiled at the same time. An important objective of such studies is to rank the diagnostic accuracy of biomarkers (e.g. gene expressions) for predicting outcome variables while properly adjusting for confounding effects from low-dimensional clinical risk factors and environmental exposures. Existing approaches are often fully based on parametric or semi-parametric models and target evaluating estimation significance as opposed to diagnostic accuracy. Receiver operating characteristic (ROC) approaches can be employed to tackle this problem. However, existing ROC ranking methods focus on biomarkers only and ignore effects of confounders. In this article, we propose a model-based approach which ranks the diagnostic accuracy of biomarkers using ROC measures with a proper adjustment of confounding effects. To this end, three different methods for constructing the underlying regression models are investigated. Simulation study shows that the proposed methods can accurately identify biomarkers with additional diagnostic power beyond confounders. Analysis of two cancer gene-expression studies demonstrates that adjusting for confounders can lead to substantially different rankings of genes.
ranking biomarkers; ROC; confounders; high-throughput data
An international consortium released the first draft sequence of the human genome 10 years ago. Although the analysis of this data has suggested the genetic underpinnings of many diseases, we have not yet been able to fully quantify the relationship between genotype and phenotype. Thus, a major current effort of the scientific community focuses on evaluating individual predispositions to specific phenotypic traits given their genetic backgrounds. Many resources aim to identify and annotate the specific genes responsible for the observed phenotypes. Some of these use intra-species genetic variability as a means for better understanding this relationship. In addition, several online resources are now dedicated to collecting single nucleotide variants and other types of variants, and annotating their functional effects and associations with phenotypic traits. This information has enabled researchers to develop bioinformatics tools to analyze the rapidly increasing amount of newly extracted variation data and to predict the effect of uncharacterized variants. In this work, we review the most important developments in the field—the databases and bioinformatics tools that will be of utmost importance in our concerted effort to interpret the human variome.
genomic variation; genome interpretation; genomic variant databases; gene prioritization; deleterious variants
The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much data exist in silos and is not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.
network biology; bioinformatics
Many complex diseases such as cancer are associated with changes in biological pathways and molecular networks rather than being caused by single gene alterations. A major challenge in the diagnosis and treatment of such diseases is to identify characteristic aberrancies in the biological pathways and molecular network activities and elucidate their relationship to the disease. This review presents recent progress in using high-throughput biological assays to decipher aberrant pathways and network activities. In particular, this review provides specific examples in which high-throughput data have been applied to identify relationships between diseases and aberrant pathways and network activities. The achievements in this field have been remarkable, but many challenges have yet to be addressed.
pathways; biological networks; biomarker discovery; omics studies; systems biology
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
text mining; information extraction; knowledge discovery from texts; text analytics; biomedical natural language processing; pharmacogenomics; pharmacogenetics
Recent advances in high-throughput biotechnologies have led to the rapid growing research interest in reverse engineering of biomolecular systems (REBMS). ‘Data-driven’ approaches, i.e. data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution while ‘design-driven’ approaches, i.e. systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to –omic data may lead to novel insights in reverse engineering biological systems that could not be expected before using low-throughput platforms. However, there exist several challenges in this fast growing field of reverse engineering biomolecular systems: (i) to integrate heterogeneous biochemical data for data mining, (ii) to combine top–down and bottom–up approaches for systems modeling and (iii) to validate system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validate and analyze theoretical system models directly through experimental synthesis, i.e. analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems (REBMS) using integrated workflow of data mining, systems modeling and synthetic biology.
reverse engineering biological systems; high-throughput technology; –omic data; synthetic biology; analysis-by-synthesis
Understanding biological complexity demands a combination of high-throughput data and interdisciplinary skills. One way to bring to bear the necessary combination of data types and expertise is by encapsulating domain knowledge in software and composing that software to create a customized data analysis environment. To this end, simple flexible strategies are needed for interconnecting heterogeneous software tools and enabling data exchange between them. Drawing on our own work and that of others, we present several strategies for interoperability and their consequences, in particular, a set of simple data structures—list, matrix, network, table and tuple—that have proven sufficient to achieve a high degree of interoperability. We provide a few guidelines for the development of future software that will function as part of an interoperable community of software tools for biological data analysis and visualization.
interoperability; software engineering; bioinformatics; integration; systems biology; data analysis
More than a decade ago, a number of methods were proposed for the inference of protein interactions, using whole-genome information from gene clusters, gene fusions and phylogenetic profiles. This structural and evolutionary view of entire genomes has provided a valuable approach for the functional characterization of proteins, especially those without sequence similarity to proteins of known function. Furthermore, this view has raised the real possibility to detect functional associations of genes and their corresponding proteins for any entire genome sequence. Yet, despite these exciting developments, there have been relatively few cases of real use of these methods outside the computational biology field, as reflected from citation analysis. These methods have the potential to be used in high-throughput experimental settings in functional genomics and proteomics to validate results with very high accuracy and good coverage. In this critical survey, we provide a comprehensive overview of 30 most prominent examples of single pairwise protein interaction cases in small-scale studies, where protein interactions have either been detected by gene fusion or yielded additional, corroborating evidence from biochemical observations. Our conclusion is that with the derivation of a validated gold-standard corpus and better data integration with big experiments, gene fusion detection can truly become a valuable tool for large-scale experimental biology.
genome analysis; comparative genomics; gene fusion; protein interactions; proteomics; validation study
With the development of novel assay technologies, biomedical experiments and analyses have gone through substantial evolution. Today, a typical experiment can simultaneously measure hundreds to thousands of individual features (e.g. genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to try to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite the global demand, too many published studies remain non-reproducible mainly due to the lack of availability of experimental protocol, data and/or computer code. Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.
Analysis pipeline; accuracy; open science; precision; protocol; standardization
microRNAs (miRNAs) are small endogenous non-coding RNAs that function as the universal specificity factors in post-transcriptional gene silencing. Discovering miRNAs, identifying their targets and further inferring miRNA functions have been a critical strategy for understanding normal biological processes of miRNAs and their roles in the development of disease. In this review, we focus on computational methods of inferring miRNA functions, including miRNA functional annotation and inferring miRNA regulatory modules, by integrating heterogeneous data sources. We also briefly introduce the research in miRNA discovery and miRNA-target identification with an emphasis on the challenges to computational biology.
miRNA; functional annotation; functional miRNA–mRNA regulatory modules
Metagenomic approaches are increasingly recognized as a baseline for understanding the
ecology and evolution of microbial ecosystems. The development of methods for pathway
inference from metagenomics data is of paramount importance to link a phenotype to a
cascade of events stemming from a series of connected sets of genes or proteins.
Biochemical and regulatory pathways have until recently been thought and modelled within
one cell type, one organism, one species. This vision is being dramatically changed by the
advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial
populations in fundamental biochemical functions. The new landscape we face requires a
clear picture of the potentialities of existing tools and development of new tools to
characterize, reconstruct and model biochemical and regulatory pathways as the result of
integration of function in complex symbiotic interactions of ontologically and
evolutionary distinct cell types.
metagenomics; next-generation sequencing; microbiome; pathway analysis; gene annotation
Genome-scale metabolic network reconstructions are now routinely used in the study of metabolic pathways, their evolution and design. The development of such reconstructions involves the integration of information on reactions and metabolites from the scientific literature as well as public databases and existing genome-scale metabolic models. The reconciliation of discrepancies between data from these sources generally requires significant manual curation, which constitutes a major obstacle in efforts to develop and apply genome-scale metabolic network reconstructions. In this work, we discuss some of the major difficulties encountered in the mapping and reconciliation of metabolic resources and review three recent initiatives that aim to accelerate this process, namely BKM-react, MetRxn and MNXref (presented in this article). Each of these resources provides a pre-compiled reconciliation of many of the most commonly used metabolic resources. By reducing the time required for manual curation of metabolite and reaction discrepancies, these resources aim to accelerate the development and application of high-quality genome-scale metabolic network reconstructions and models.
data integration; data interoperability; metabolic resources; metabolic networks; cheminformatics
Good accessibility of publicly funded research data is essential to secure an open scientific system and eventually becomes mandatory [Wellcome Trust will Penalise Scientists Who Don’t Embrace Open Access. The Guardian 2012]. By the use of high-throughput methods in many research areas from physics to systems biology, large data collections are increasingly important as raw material for research. Here, we present strategies worked out by international and national institutions targeting open access to publicly funded research data via incentives or obligations to share data. Funding organizations such as the British Wellcome Trust therefore have developed data sharing policies and request commitment to data management and sharing in grant applications. Increased citation rates are a profound argument for sharing publication data. Pre-publication sharing might be rewarded by a data citation credit system via digital object identifiers (DOIs) which have initially been in use for data objects. Besides policies and incentives, good practice in data management is indispensable. However, appropriate systems for data management of large-scale projects for example in systems biology are hard to find. Here, we give an overview of a selection of open-source data management systems proved to be employed successfully in large-scale projects.
data management; data sharing; open access; data citation; systems biology
Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such ‘ecosystems biology’ approaches bear the potential to not only advance our understanding of environmental microbes to a new level but also impose challenges due to increasing data complexities, in particular with respect to bioinformatic post-processing. This mini review aims to address selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and hopefully will serve as a useful resource for microbial ecologists and bioinformaticians alike.
16S rRNA biodiversity; binning; bioinformatics; Genomic Standards Consortium; metagenomics; next-generation sequencing
Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.
metagenomics; next-generation sequencing (NGS); high-throughput sequencing (HTS); functional analysis; environmental bioinformatics
Ontologies are now pervasive in biomedicine, where they serve as a means to standardize terminology, to enable access to domain knowledge, to verify data consistency and to facilitate integrative analyses over heterogeneous biomedical data. For this purpose, research on biomedical ontologies applies theories and methods from diverse disciplines such as information management, knowledge representation, cognitive science, linguistics and philosophy. Depending on the desired applications in which ontologies are being applied, the evaluation of research in biomedical ontologies must follow different strategies. Here, we provide a classification of research problems in which ontologies are being applied, focusing on the use of ontologies in basic and translational research, and we demonstrate how research results in biomedical ontologies can be evaluated. The evaluation strategies depend on the desired application and measure the success of using an ontology for a particular biomedical problem. For many applications, the success can be quantified, thereby facilitating the objective evaluation and comparison of research in biomedical ontology. The objective, quantifiable comparison of research results based on scientific applications opens up the possibility for systematically improving the utility of ontologies in biomedical research.
biomedical ontology; quantitative biology; ontology evaluation; evaluation criteria; ontology-based applications