Recent advances in massively parallel sequencing technology have created new opportunities to probe the hidden world of microbes. Taxonomy-independent clustering of the 16S rRNA gene is usually the first step in analyzing microbial communities. Dozens of algorithms have been developed in the last decade, but a comprehensive benchmark study is lacking. Here, we survey algorithms currently used by microbiologists, and compare seven representative methods in a large-scale benchmark study that addresses several issues of concern. A new experimental protocol was developed that allows different algorithms to be compared using the same platform, and several criteria were introduced to facilitate a quantitative evaluation of the clustering performance of each algorithm. We found that existing methods vary widely in their outputs, and that inappropriate use of distance levels for taxonomic assignments likely resulted in substantial overestimates of biodiversity in many studies. The benchmark study identified our recently developed ESPRIT-Tree, a fast implementation of the average linkage-based hierarchical clustering algorithm, as one of the best algorithms available in terms of computational efficiency and clustering accuracy.
pyrosequencing; 16S rRNA; taxonomy-independent analysis; massive data; clustering; microbial diversity estimation; human microbiome
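The abstract above concerns binning 16S rRNA reads into operational taxonomic units (OTUs) at a distance cutoff, with average-linkage hierarchical clustering (the strategy ESPRIT-Tree accelerates) among the best performers. The following is a minimal, naive sketch of that idea, assuming a precomputed pairwise distance matrix; the function name and the toy matrix are illustrative, not any tool's actual API:

```python
def average_linkage_otus(dist, threshold):
    """Naive average-linkage agglomerative clustering: repeatedly merge
    the pair of clusters with the smallest mean pairwise distance until
    that distance exceeds the chosen OTU threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Toy pairwise-distance matrix for four reads: two tight pairs, 10% apart.
dist = [[0.00, 0.01, 0.10, 0.10],
        [0.01, 0.00, 0.10, 0.10],
        [0.10, 0.10, 0.00, 0.02],
        [0.10, 0.10, 0.02, 0.00]]
```

The same data yields two OTUs at a 3% cutoff but a single OTU at 10%, which is the mechanism behind the overestimates of diversity the abstract describes when an inappropriate distance level is chosen.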
Over the past two decades, there has been a long-standing debate about the impact of taxon sampling on phylogenetic inference. Studies have drawn on both real and simulated data sets, in empirical and theoretical contexts, and have applied different inference methods. In some cases, conflicting conclusions have been drawn from the same data set. The main questions explored to date concern the effects of using sparse data, adding new taxa, including more characters from genome sequences and using different (or concatenated) locus regions. These questions can be reduced to more fundamental ones about the assessment of data quality and the design of taxon-sampling guidelines for phylogenetic inference experiments. This review summarizes progress to date in understanding the impact of taxon sampling on the accuracy of phylogenetic analysis.
Phylogenetics; taxonomic sampling; bioinformatics
Genetic imprinting, by which the expression of a gene depends on the parental origin of its alleles, may be subjected to reprogramming through each generation. Currently, such reprogramming is limited to qualitative description only, lacking more precise quantitative estimation for its extent, pattern and mechanism. Here, we present a computational framework for analyzing the magnitude of genetic imprinting and its transgenerational inheritance mode. This quantitative model is based on the breeding scheme of reciprocal backcrosses between reciprocal F1 hybrids and original inbred parents, in which the transmission of genetic imprinting across generations can be tracked. We define a series of quantitative genetic parameters that describe the extent and transmission mode of genetic imprinting and further estimate and test these parameters within a genetic mapping framework using a new powerful computational algorithm. The model and algorithm described will enable geneticists to identify and map imprinted quantitative trait loci and dictate a comprehensive atlas of developmental and epigenetic mechanisms related to genetic imprinting. We illustrate the new discovery of the role of genetic imprinting in regulating hyperoxic acute lung injury survival time using a mouse reciprocal backcross design.
With the development of novel assay technologies, biomedical experiments and analyses have gone through substantial evolution. Today, a typical experiment can simultaneously measure hundreds to thousands of individual features (e.g. genes) in dozens of biological conditions, resulting in gigabytes of data that need to be processed and analyzed. Because of the multiple steps involved in the data generation and analysis and the lack of details provided, it can be difficult for independent researchers to reproduce a published study. With the recent outrage following the halt of a cancer clinical trial due to the lack of reproducibility of the published study, researchers are now facing heavy pressure to ensure that their results are reproducible. Despite the global demand, too many published studies remain non-reproducible, mainly due to the lack of availability of experimental protocols, data and/or computer code. Scientific discovery is an iterative process, where a published study generates new knowledge and data, resulting in new follow-up studies or clinical trials based on these results. As such, it is important for the results of a study to be quickly confirmed or discarded to avoid wasting time and money on novel projects. The availability of high-quality, reproducible data will also lead to more powerful analyses (or meta-analyses) where multiple data sets are combined to generate new knowledge. In this article, we review some of the recent developments regarding biomedical reproducibility and comparability and discuss some of the areas where the overall field could be improved.
Analysis pipeline; accuracy; open science; precision; protocol; standardization
Metagenomic approaches are increasingly recognized as a baseline for understanding the ecology and evolution of microbial ecosystems. The development of methods for pathway inference from metagenomics data is of paramount importance to link a phenotype to a cascade of events stemming from a series of connected sets of genes or proteins. Biochemical and regulatory pathways have until recently been conceived and modelled within one cell type, one organism, one species. This vision is being dramatically changed by the advent of whole-microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires a clear picture of the potential of existing tools, and the development of new tools, to characterize, reconstruct and model biochemical and regulatory pathways as the result of the integration of function in complex symbiotic interactions of ontologically and evolutionarily distinct cell types.
metagenomics; next-generation sequencing; microbiome; pathway analysis; gene annotation
In the analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use genomic study with gene expression measurements as a representative example but note that analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent ‘non-standard’ applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including the supervised PCA, sparse PCA and functional PCA. The supervised PCA and sparse PCA have been shown to have better empirical performance than the standard PCA. The functional PCA can analyze time-course gene expression data. Last, we raise the awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and more importantly its most recent development, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
principal component analysis; dimension reduction; bioinformatics methodologies; gene expression
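The standard PCA construction described in the preceding abstract can be sketched in a few lines via the singular value decomposition of the centered data matrix; the function below is an illustrative implementation, not any particular package's API, and the random matrix merely stands in for a real expression data set:

```python
import numpy as np

def pca(X, k):
    """Top-k PCA of a samples-by-genes matrix via SVD of the centered data.
    Returns sample scores, gene loadings and per-component variance fractions."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]            # sample coordinates on the PCs
    loadings = Vt[:k].T                  # orthogonal gene-weight vectors
    explained = s**2 / (s**2).sum()      # fraction of variance per component
    return scores, loadings, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))           # 20 samples, 100 "genes"
scores, loadings, explained = pca(X, 3)
```

The loadings are the orthogonal linear combinations the abstract refers to, and `explained` shows how the leading components capture a disproportionate share of the variation, which is what makes the reduced representation useful.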
Metagenomics has become an indispensable tool for studying the diversity and metabolic potential of environmental microbes, whose bulk is as yet non-cultivable. Continual progress in next-generation sequencing allows for generating increasingly large metagenomes and studying multiple metagenomes over time or space. Recently, a new type of holistic ecosystem study has emerged that seeks to combine metagenomics with biodiversity, meta-expression and contextual data. Such ‘ecosystems biology’ approaches bear the potential to not only advance our understanding of environmental microbes to a new level but also impose challenges due to increasing data complexities, in particular with respect to bioinformatic post-processing. This mini review aims to address selected opportunities and challenges of modern metagenomics from a bioinformatics perspective and hopefully will serve as a useful resource for microbial ecologists and bioinformaticians alike.
16S rRNA biodiversity; binning; bioinformatics; Genomic Standards Consortium; metagenomics; next-generation sequencing
Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.
metagenomics; next-generation sequencing (NGS); high-throughput sequencing (HTS); functional analysis; environmental bioinformatics
Accurate inference of orthologous genes is a pre-requisite for most comparative genomics studies, and is also important for functional annotation of new genomes. Identification of orthologous gene sets typically involves phylogenetic tree analysis, heuristic algorithms based on sequence conservation, synteny analysis, or some combination of these approaches. The most direct tree-based methods typically rely on the comparison of an individual gene tree with a species tree. Once the two trees are accurately constructed, orthologs are straightforwardly identified by the definition of orthology as those homologs that are related by speciation, rather than gene duplication, at their most recent point of origin. Although ideal for the purpose of orthology identification in principle, phylogenetic trees are computationally expensive to construct for large numbers of genes and genomes, and they often contain errors, especially at large evolutionary distances. Moreover, in many organisms, in particular prokaryotes and viruses, evolution does not appear to have followed a simple ‘tree-like’ mode, which makes conventional tree reconciliation inapplicable. Other, heuristic methods identify probable orthologs as the closest homologous pairs or groups of genes in a set of organisms. These approaches are faster and easier to automate than tree-based methods, with efficient implementations provided by graph-theoretical algorithms enabling comparisons of thousands of genomes. Comparisons of these two approaches show that, despite conceptual differences, they produce similar sets of orthologs, especially at short evolutionary distances. Synteny also can aid in identification of orthologs. Often, tree-based, sequence similarity- and synteny-based approaches can be combined into flexible hybrid methods.
homolog; ortholog; paralog; xenolog; orthologous groups; tree reconciliation; comparative genomics
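The graph-theoretical heuristics mentioned in the abstract above commonly start from reciprocal (bidirectional) best hits. The sketch below shows that core idea under the assumption that pairwise homology scores (e.g. BLAST bit scores) are already available; the gene names and score values are hypothetical:

```python
def best_hits(scores):
    """scores: {query: {subject: bit_score}} -> best-scoring subject per query."""
    return {q: max(subj, key=subj.get) for q, subj in scores.items() if subj}

def reciprocal_best_hits(scores_ab, scores_ba):
    """Putative orthologs between genomes A and B: gene pairs that are
    each other's highest-scoring homolog, the simplest graph-based heuristic."""
    ab, ba = best_hits(scores_ab), best_hits(scores_ba)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)

# Hypothetical homology scores between two small genomes.
scores_ab = {'a1': {'b1': 100, 'b2': 50}, 'a2': {'b2': 90}}
scores_ba = {'b1': {'a1': 95}, 'b2': {'a1': 60, 'a2': 80}}
```

Real tools cluster such reciprocal pairs across many genomes into orthologous groups, but the pairwise reciprocity test is the building block that makes these methods fast enough for thousands of genomes.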
The UCSC Genome Browser (http://genome.ucsc.edu) is a graphical viewer for genomic data now in its 13th year. Since the early days of the Human Genome Project, it has presented an integrated view of genomic data of many kinds. Now home to assemblies for 58 organisms, the Browser presents visualization of annotations mapped to genomic coordinates. The ability to juxtapose annotations of many types facilitates inquiry-driven data mining. Gene predictions, mRNA alignments, epigenomic data from the ENCODE project, conservation scores from vertebrate whole-genome alignments and variation data may be viewed at any scale from a single base to an entire chromosome. The Browser also includes many other widely used tools, including BLAT, which is useful for alignments from high-throughput sequencing experiments. Private data uploaded as Custom Tracks and Data Hubs in many formats may be displayed alongside the rich compendium of precomputed data in the UCSC database. The Table Browser is a full-featured graphical interface, which allows querying, filtering and intersection of data tables. The Saved Session feature allows users to store and share customized views, enhancing the utility of the system for organizing multiple trains of thought. Binary Alignment/Map (BAM), Variant Call Format and the Personal Genome Single Nucleotide Polymorphisms (SNPs) data formats are useful for visualizing a large sequencing experiment (whole-genome or whole-exome), where the differences between the data set and the reference assembly may be displayed graphically. Support for high-throughput sequencing extends to compact, indexed data formats, such as BAM, bigBed and bigWig, allowing rapid visualization of large datasets from RNA-seq and ChIP-seq experiments via local hosting.
UCSC genome browser; bioinformatics; genetics; human genome; genomics; sequencing
Network-based intervention has become a trend in treating systemic diseases, but it relies on regimen optimization and valid multi-target actions of the drugs. The complex multi-component nature of medicinal herbs may make them a valuable resource for network-based multi-target drug discovery, owing to their potential synergistic treatment effects. Recently, multiple systems biology platforms have proved powerful for uncovering molecular mechanisms and the connections between drugs and the dynamic networks they target. However, methods for optimizing drug combinations remain insufficient, owing to the lack of tight integration across multiple ‘-omics’ databases. Newly developed algorithm- and network-based computational models can tightly integrate ‘-omics’ databases and optimize combination regimens in drug development, encouraging the development of medicinal herbs into a new wave of network-based multi-target drugs. Challenges to further integrating medicinal herb databases with multiple systems biology platforms for multi-target drug optimization remain, including the uncertain reliability of individual data sets and the limited breadth, depth and degree of standardization of herbal medicine. Standardizing the methodology and terminology of systems biology and herbal databases would facilitate this integration, as would expanding publicly accessible databases and the number of studies applying systems biology platforms to herbal medicine. Further integration across ‘-omics’ platforms and computational tools would accelerate the development of network-based drug discovery and network medicine.
network-based drug discovery; systems biology; bioinformatics; computational technologies; network medicine
In the Life Sciences, ‘omics’ data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details several RF properties that are, to the best of our knowledge, rarely or never used, yet allow maximizing the biological insights that can be extracted from complex omics data sets with RF.
Random Forest; variable importance; local importance; conditional relationships; variable interaction; proximity
Metagenomic sequencing provides a unique opportunity to explore earth’s limitless environments harboring scores of yet unknown and mostly unculturable microbes and other organisms. Functional analysis of the metagenomic data plays a central role in projects aiming to explore the most essential questions in microbiology, namely ‘In a given environment, among the microbes present, what are they doing, and how are they doing it?’ Toward this goal, several large-scale metagenomic projects have recently been conducted or are currently underway. Functional analysis of metagenomic data mainly suffers from the vast amount of data generated in these projects. The sheer amount of data requires much computational time and storage space. These problems are compounded by other factors potentially affecting the functional analysis, including sample preparation, sequencing method and average genome size of the metagenomic samples. In addition, the read-lengths generated during sequencing influence sequence assembly, gene prediction and subsequently the functional analysis. The level of confidence for functional predictions increases with increasing read-length. Usually, the most reliable functional annotations for metagenomic sequences are achieved using homology-based approaches against publicly available reference sequence databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence data, bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. Finally, we provide some examples of applications from recent metagenomic studies which have been successfully conducted in spite of the known difficulties.
functional annotation; metagenomics; bioinformatics; next-generation sequencing; pathway-mapping; comparative analysis
The rapid advances of high-throughput sequencing technologies have dramatically accelerated metagenomic studies of microbial communities in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. Clustering analysis also addresses the challenges in metagenomics. For example, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.
clustering; metagenomics; next-generation sequencing; protein families; artificial duplicates; OTU
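The redundancy-reduction idea in the abstract above, where each cluster is represented by a single entry, is typically implemented as greedy incremental clustering in the style of CD-HIT or UCLUST. Below is a deliberately simplified sketch: the identity function is a crude positional match over equal-length toy sequences, and all names are illustrative rather than any tool's interface:

```python
def identity(a, b):
    """Crude identity: fraction of matching positions (toy sequences only;
    real tools use alignment-based identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.97):
    """CD-HIT-style greedy clustering: sequences are processed longest
    first; each joins the first representative it matches at >= threshold
    identity, otherwise it founds a new cluster."""
    clusters = {}
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters

s1 = "ACGTACGTACGTACGTACGT"
s2 = "ACGTACGTACGTACGTACGA"   # one mismatch vs s1 (95% identity)
s3 = "TTTTTTTTTTTTTTTTTTTT"   # unrelated read
```

Because each new sequence is compared only against the current representatives rather than against every other sequence, this scheme scales to the large redundant data sets the abstract describes; the representatives then form the small non-redundant set used downstream.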
Finding new uses for existing drugs, or drug repositioning, has been used as a strategy for decades to get drugs to more patients. As the ability to measure molecules in high-throughput ways has improved over the past decade, it is logical that such data might be useful for enabling drug repositioning through computational methods. Many computational predictions for new indications have been borne out in cellular model systems, though extensive animal model and clinical trial-based validation are still pending. In this review, we show that computational methods for drug repositioning can be classified in two axes: drug based, where discovery initiates from the chemical perspective, or disease based, where discovery initiates from the clinical perspective of disease or its pathology. Newer algorithms for computational drug repositioning will likely span these two axes, will take advantage of newer types of molecular measurements, and will certainly play a role in reducing the global burden of disease.
bioinformatics; drug repositioning; drug development; microarrays; gene expression; systems biology; genomics
A recent study examined the stability of rankings from random forests using two variable importance measures, mean decrease accuracy (MDA) and mean decrease Gini (MDG), and concluded that rankings based on the MDG were more robust than those based on the MDA. However, few studies have examined how data-specific characteristics affect ranking stability. Rankings based on the MDG measure showed sensitivity to within-predictor correlation and to differences in category frequencies, even when the number of categories was held constant, and thus may produce spurious results. The MDA measure was robust to these data characteristics. Further, under strong within-predictor correlation, MDG rankings were less stable than those based on the MDA.
Random forest; variable importance measures; stability; ranking; correlation; linkage disequilibrium
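The MDA measure discussed above is a permutation importance: shuffle one predictor's values and record how much accuracy drops. The mechanism is independent of any particular forest implementation, so it can be sketched with a stand-in predictor; everything here (the rule-based `predict`, the simulated data) is illustrative:

```python
import random

def permutation_importance(rows, labels, predict, feature, trials=30, seed=1):
    """MDA-style importance: mean accuracy drop after shuffling one
    feature column, averaged over several random permutations."""
    base = sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)
    rng = random.Random(seed)
    drops = []
    for _ in range(trials):
        col = [r[feature] for r in rows]
        rng.shuffle(col)
        shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, col)]
        acc = sum(predict(r) == y for r, y in zip(shuffled, labels)) / len(rows)
        drops.append(base - acc)
    return sum(drops) / trials

# Feature 0 fully determines the label; feature 1 is pure noise.
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]
predict = lambda r: int(r[0] > 0.5)  # stands in for a fitted forest
```

Permuting the informative feature destroys accuracy while permuting the noise feature changes nothing, which is the behavior that makes MDA comparatively robust: its score reflects predictive contribution rather than how often a variable is selected for splits, the property that makes MDG sensitive to correlation and category frequencies.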
Proposed molecular classifiers may be overfit to idiosyncrasies of noisy genomic and proteomic data. Cross-validation methods are often used to obtain estimates of classification accuracy, but both simulations and case studies suggest that, when inappropriate methods are used, bias may ensue. Bias can be bypassed and generalizability can be tested by external (independent) validation. We evaluated 35 studies that have reported on external validation of a molecular classifier. We extracted information on study design and methodological features, and compared the performance of molecular classifiers in internal cross-validation versus external validation for 28 studies where both had been performed. We demonstrate that the majority of studies pursued cross-validation practices that are likely to overestimate classifier performance. Most studies were markedly underpowered to detect a 20% decrease in sensitivity or specificity between internal cross-validation and external validation [median power was 36% (IQR, 21–61%) and 29% (IQR, 15–65%), respectively]. The median reported classification performance for sensitivity and specificity was 94% and 98%, respectively, in cross-validation and 88% and 81% for independent validation. The relative diagnostic odds ratio was 3.26 (95% CI 2.04–5.21) for cross-validation versus independent validation. Finally, we reviewed all studies (n = 758) which cited those in our study sample, and identified only one instance of additional subsequent independent validation of these classifiers. In conclusion, these results document that many cross-validation practices employed in the literature are potentially biased and genuine progress in this field will require adoption of routine external validation of molecular classifiers, preferably in much larger studies than in current practice.
predictive medicine; genes; gene expression; proteomics
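The diagnostic odds ratio used in the abstract above has a simple closed form. As a rough illustration only, the sketch below applies it to the reported median sensitivities and specificities; note that the paper's relative DOR of 3.26 comes from paired per-study comparisons in a meta-analysis, so this crude ratio of medians will not reproduce that figure:

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR: odds of a positive test in diseased vs non-diseased subjects,
    (sens / (1 - sens)) * (spec / (1 - spec))."""
    return (sensitivity / (1 - sensitivity)) * (specificity / (1 - specificity))

# Median reported performance: 94%/98% (cross-validation), 88%/81% (external).
dor_cv = diagnostic_odds_ratio(0.94, 0.98)
dor_ext = diagnostic_odds_ratio(0.88, 0.81)
```

Even this back-of-the-envelope comparison shows how sharply apparent performance deflates when classifiers leave the data they were tuned on, which is the central argument for routine external validation.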
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell’s concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
predictive medicine; survival risk classification; cross-validation; gene expression
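The key point of the abstract above, that survival curves and ROC statistics must be computed from cross-validated risk scores rather than re-substitution scores, can be sketched as a generic K-fold scheme. The `fit`/`predict` stand-ins below are placeholders for a real survival model (e.g. a penalized Cox fit), and all names are illustrative:

```python
def cross_validated_risk_groups(X, fit, predict, k=5):
    """Each sample's risk score comes from a model fitted with that
    sample's fold held out; samples are then split at the median of
    the cross-validated scores. It is these held-out scores, not the
    re-substitution scores, that should feed survival curves and
    time-dependent ROC analyses."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]
    scores = [None] * n
    for fold in folds:
        held = set(fold)
        model = fit([X[i] for i in range(n) if i not in held])
        for i in fold:
            scores[i] = predict(model, X[i])
    cutoff = sorted(scores)[n // 2]
    return [int(s >= cutoff) for s in scores]

# Trivial stand-ins: the "model" is ignored and risk equals the first feature.
fit = lambda train: None
predict = lambda model, x: x[0]
groups = cross_validated_risk_groups([[float(v)] for v in range(10)], fit, predict)
```

With real data the separation between the resulting high- and low-risk groups' survival curves is then an honest estimate, because no sample's group assignment was influenced by a model that saw its outcome.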
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein–DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
motif discovery; transcription factor binding sites; chromatin immunoprecipitation; ChIP-Seq
Data visualization is an essential component of genomic data analysis. However, the size and diversity of the data sets produced by today’s sequencing and array-based profiling methods present major challenges to visualization tools. The Integrative Genomics Viewer (IGV) is a high-performance viewer that efficiently handles large heterogeneous data sets, while providing a smooth and intuitive user experience at all levels of genome resolution. A key characteristic of IGV is its focus on the integrative nature of genomic studies, with support for both array-based and next-generation sequencing data, and the integration of clinical and phenotypic data. Although IGV is often used to view genomic data from public sources, its primary emphasis is to support researchers who wish to visualize and explore their own data sets or those from colleagues. To that end, IGV supports flexible loading of local and remote data sets, and is optimized to provide high-performance data visualization and exploration on standard desktop systems. IGV is freely available for download from http://www.broadinstitute.org/igv, under a GNU LGPL open-source license.
visualization; next-generation sequencing; NGS; genome viewer; IGV
A variety of genome-wide profiling techniques are available to investigate complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we highlight common approaches to genomic data integration and provide a transparent benchmarking procedure to quantitatively compare method performances in cancer gene prioritization. Algorithms, data sets and benchmarking results are available at http://intcomp.r-forge.r-project.org.
DNA copy number; gene expression; microarrays; data integration; algorithms; cancer
This article reviews recent advances in ‘microbiome studies’: molecular, statistical and graphical techniques to explore and quantify how microbial organisms affect our environments and ourselves given recent increases in sequencing technology. Microbiome studies are moving beyond mere inventories of specific ecosystems to quantifications of community diversity and descriptions of their ecological function. We review the last 24 months of progress in this sort of research, and anticipate where the next 2 years will take us. We hope that bioinformaticians will find this a helpful springboard for new collaborations with microbiologists.
microbial ecology; biodiversity; metagenomics; next generation sequencing; microbiome; visual analytics
With the development of ultra-high-throughput technologies, the cost of sequencing bacterial genomes has been vastly reduced. As more genomes are sequenced, less time can be spent manually annotating those genomes, resulting in an increased reliance on automatic annotation pipelines. However, automatic pipelines can produce inaccurate genome annotation and their results often require manual curation. Here, we discuss the automatic and manual annotation of bacterial genomes, identify common problems introduced by the current genome annotation process and suggest potential solutions.
bacteria; genomics; annotation; automatic; errors
So-called next-generation sequencing (NGS) has provided the ability to sequence on a massive scale at low cost, enabling biologists to perform powerful experiments and gain insight into biological processes. BamView has been developed to visualize and analyse sequence reads from NGS platforms that have been aligned to a reference sequence. It is a desktop application for browsing the aligned or mapped reads [Ruffalo, M, LaFramboise, T, Koyutürk, M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790–6] at different levels of magnification, from the nucleotide level, where the base qualities can be seen, to the genome or chromosome level, where overall coverage is shown. To enable in-depth investigation of NGS data, various views are provided that can be configured to highlight interesting aspects of the data. Multiple read alignment files can be overlaid to compare results from different experiments, and filters can be applied to facilitate the interpretation of the aligned reads. As well as being a standalone application, BamView can be used as an integrated part of the Artemis genome browser, allowing the user to study NGS data in the context of the sequence and annotation of the reference genome. Single nucleotide polymorphism (SNP) density and candidate SNP sites can be highlighted and investigated, and read-pair information can be used to discover large structural insertions and deletions. The application will also calculate simple analyses of the read mapping, including reporting the read counts and reads per kilobase per million mapped reads (RPKM) for genes selected by the user.
Availability: BamView and Artemis are freely available software. These can be downloaded from their home pages:
Requirements: Java 1.6 or higher.
genome browser; next-generation sequencing; visualization; Artemis; BamView
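The RPKM values BamView reports for selected genes follow the standard normalization, which divides a raw read count by gene length (in kilobases) and library size (in millions of mapped reads). A minimal sketch of that formula, with an illustrative function name:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads: normalizes
    a raw read count for both gene length and sequencing depth."""
    return read_count / (gene_length_bp / 1_000) / (total_mapped_reads / 1_000_000)
```

For example, 1000 reads mapping to a 2 kb gene in a library of 10 million mapped reads give an RPKM of 50, making counts comparable across genes of different lengths and across libraries of different depths.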