Classification of mitochondrial DNA (mtDNA) sequences into their respective haplogroups makes it possible to address a variety of anthropological and forensic questions. Unique to mtDNA are its abundance and its non-recombining, uniparental mode of inheritance; consequently, mutations are the only changes observed in the genetic material. These individual mutations are classified into cladistic haplogroups, allowing the tracing of different genetic branch points in the evolution of humans (and other organisms). Given the large number of samples, it becomes necessary to automate the classification process. Using 5-fold cross-validation, we investigated two classification techniques on the consented database of 21 141 samples published by the Genographic Project. The support vector machine (SVM) algorithm achieved a macro-accuracy of 88.06% and a micro-accuracy of 96.59%, while the random forest (RF) algorithm achieved a macro-accuracy of 87.35% and a micro-accuracy of 96.19%. In addition to being faster and more memory-efficient at prediction time, SVM and RF are better than or comparable to the nearest-neighbor method employed by the Genographic Project in terms of prediction accuracy.
mitochondrial DNA; ensemble learning; classification algorithms; support vector machines; random forest; Genographic Project
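The evaluation protocol described above can be sketched with scikit-learn. The synthetic data here merely stand in for mtDNA mutation profiles; the study's actual features, kernels and hyperparameters are not reproduced, and micro-/macro-accuracy are interpreted as overall and class-balanced accuracy, respectively.

```python
# Sketch: 5-fold cross-validated classification with SVM and random forest.
# All data below are synthetic placeholders, not the Genographic samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Binary-like features standing in for presence/absence of mtDNA mutations.
X, y = make_classification(n_samples=500, n_features=40, n_informative=20,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

for name, clf in [("SVM", SVC(kernel="linear")),
                  ("RF", RandomForestClassifier(n_estimators=100, random_state=0))]:
    pred = cross_val_predict(clf, X, y, cv=5)    # out-of-fold predictions
    micro = accuracy_score(y, pred)              # micro-accuracy: overall fraction correct
    macro = balanced_accuracy_score(y, pred)     # macro-accuracy: mean per-class recall
    print(f"{name}: micro={micro:.3f} macro={macro:.3f}")
```

The micro/macro distinction matters here because haplogroup frequencies are highly unbalanced: a classifier can score a high overall accuracy while performing poorly on rare haplogroups, which the macro figure exposes.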
In cancer research, high-throughput genomic studies have been extensively conducted, searching for markers associated with cancer diagnosis, prognosis and variation in response to treatment. In this article, we analyze cancer prognosis studies and investigate ranking markers based on their marginal prognostic power. To avoid ambiguity, we focus on microarray gene expression studies where genes are the markers, but note that the methodology and results are applicable to other high-throughput studies. The objectives of this study are twofold. First, we investigate ranking markers under three commonly adopted semiparametric models, namely the Cox, accelerated failure time and additive risk models. Data analysis shows that the ranking may vary significantly under different models. Second, we describe a nonparametric concordance measure, which has roots in the time-dependent ROC (receiver operating characteristic) framework and relies on much weaker assumptions than the semiparametric models. In simulations, it is shown that ranking using the concordance measure is not sensitive to model specification, whereas ranking under the semiparametric models is. In data analysis, the concordance measure generates rankings significantly different from those under the semiparametric models.
cancer prognosis markers; semiparametric survival analysis; concordance measure
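A marginal concordance measure of the kind discussed above can be illustrated with Harrell's C-index for a single marker under right-censored survival data: among usable pairs, it counts how often the subject with the higher marker value fails first. This is a generic sketch, not the specific time-dependent ROC estimator of the article, and the data are invented.

```python
# Sketch: concordance (C-index) of one marker against censored survival times.
def concordance(time, event, marker):
    """Fraction of usable pairs in which the higher-marker subject fails first."""
    conc = usable = 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair (i, j) is usable if subject i is observed to fail
            # before subject j's follow-up time.
            if event[i] == 1 and time[i] < time[j]:
                usable += 1
                if marker[i] > marker[j]:
                    conc += 1
                elif marker[i] == marker[j]:
                    conc += 0.5          # ties in the marker count half
    return conc / usable if usable else float("nan")

time   = [2.0, 5.0, 3.5, 8.0, 1.0]   # follow-up times (illustrative)
event  = [1,   0,   1,   1,   1]     # 1 = failure observed, 0 = censored
marker = [3.1, 0.4, 2.2, 2.5, 4.0]   # higher value = worse prognosis
print(round(concordance(time, event, marker), 3))
```

Because the measure depends only on the ordering of marker values and failure times, it is invariant to monotone transformations of the marker, which is one reason rankings based on it are insensitive to the choice among the Cox, additive risk and accelerated failure time models.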
The PHylogenetic Analysis with Space/Time models (PHAST) software package consists of a collection of command-line programs and supporting libraries for comparative genomics. PHAST is best known as the engine behind the Conservation tracks in the University of California, Santa Cruz (UCSC) Genome Browser. However, it also includes several other tools for phylogenetic modeling and functional element identification, as well as utilities for manipulating alignments, trees and genomic annotations. PHAST has been in development since 2002 and has now been downloaded more than 1000 times, but so far it has been released only as provisional (‘beta’) software. Here, we describe the first official release (v1.0) of PHAST, with improved stability, portability and documentation and several new features. We outline the components of the package and detail recent improvements. In addition, we introduce a new interface to the PHAST libraries from the R statistical computing environment, called RPHAST, and illustrate its use in a series of vignettes. We demonstrate that RPHAST can be particularly useful in applications involving both large-scale phylogenomics and complex statistical analyses. The R interface also makes the PHAST libraries accessible to non-C programmers, and is useful for rapid prototyping. PHAST v1.0 and RPHAST v1.0 are available for download at http://compgen.bscb.cornell.edu/phast, under the terms of an unrestrictive BSD-style license. RPHAST can also be obtained from the Comprehensive R Archive Network (CRAN; http://cran.r-project.org).
statistical phylogenetics; functional element identification
The amount of biological data is increasing rapidly, and will continue to increase as new rapid technologies are developed. Professionals in every area of bioscience will have data management needs that require publicly available bioinformatics resources. Not all scientists desire a formal bioinformatics education, but many would benefit from more informal sources of learning. Effective bioinformatics education formats will address a broad range of scientific needs, will be aimed at a variety of user skill levels, and will be delivered in a number of different formats to address different learning styles. Effective informal sources of bioinformatics education are available, and are explored in this review.
bioinformatics education; training and learning; outreach; genomics; data management; computational biology resources
The National Center for Biotechnology Information (NCBI) hosts 39 literature and molecular biology databases containing almost half a billion records. As the complexity of these data and associated resources and tools continues to expand, so does the need for educational resources to help investigators, clinicians, information specialists and the general public make use of the wealth of public data available at the NCBI. This review describes the educational resources available at NCBI via the NCBI Education page (www.ncbi.nlm.nih.gov/Education/). These resources include materials designed for new users, such as About NCBI and the NCBI Guide, as well as documentation, Frequently Asked Questions (FAQs) and writings on the NCBI Bookshelf such as the NCBI Help Manual and the NCBI Handbook. NCBI also provides teaching materials such as tutorials and problem sets, and educational tools such as the Amino Acid Explorer, PSSM Viewer and Ebot. In addition, NCBI offers training programs including the Discovery Workshops, webinars and tutorials at conferences. To help users keep up-to-date, NCBI produces the online NCBI News and offers RSS feeds and mailing lists, along with a presence on Facebook, Twitter and YouTube.
Bioinformatics; education; tutorials; NCBI; databases; GenBank
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge in the analysis of these data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software packages has subsequently been developed over the past two years. In this article, we systematically review the current development of these algorithms and introduce their practical applications to different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
new sequencing technologies; alignment algorithm; sequence analysis
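The core indexing idea behind many of the short-read aligners reviewed above can be reduced to a toy exact-match k-mer lookup: index every k-mer of the reference, then seed each read at its candidate positions. Real aligners use compressed structures such as FM-indexes or suffix arrays and tolerate mismatches and gaps; everything below is an illustration only.

```python
# Sketch: seed lookup via a k-mer index, the simplest form of the
# index-and-seed strategy used by short-read aligners.
from collections import defaultdict

def build_index(reference, k):
    """Map each k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, index, k):
    # Seed with the read's first k-mer; the returned candidate positions
    # would then be verified by extension/alignment in a real tool.
    return index.get(read[:k], [])

ref = "ACGTACGTTGCA"
idx = build_index(ref, 4)
print(map_read("ACGTTGCA", idx, 4))   # candidate start positions
```

Note that seeding alone is not alignment: here the read truly matches at position 4, but the seed also reports position 0, so an extension step must score each candidate against the full read.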
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
massively parallel sequencing; next generation sequencing; human genome; variant detection; short read alignment; whole genome sequencing
The EB-eye is a fast and efficient search engine that provides easy and uniform access to the biological data resources hosted at the EMBL-EBI. Currently, users can access information from more than 62 distinct datasets covering some 400 million entries. The data resources represented in the EB-eye include: nucleotide and protein sequences at both the genomic and proteomic levels, structures ranging from chemicals to macromolecular complexes, gene-expression experiments, binary-level molecular interactions as well as reaction maps and pathway models, functional classifications, biological ontologies, and comprehensive literature libraries covering the biomedical sciences and related intellectual property. The EB-eye can be accessed over the web or programmatically using a SOAP Web Services interface. This allows its search and retrieval capabilities to be exploited in workflows and analytical pipelines. The EB-eye is a novel alternative to existing biological search and retrieval engines. In this article we describe in detail how to exploit its powerful capabilities.
text search; biological databases; integration; interoperability; web services; Apache Lucene
Rat models have been used to investigate physiological and pathophysiological mechanisms for decades. With the availability of the rat genome and other online resources, tools that identify rat models mimicking human disease represent an important step in translational research. Despite the large number of papers published each year using rat models, integrating this information remains a problem. Resources for the rat genome are continuing to grow rapidly, while resources providing access to rat phenotype data are just emerging. An overview of rat models of disease, tools to characterize strains by phenotype and genotype, and steps being taken to integrate rat physiological data is presented in this article. Integrating functional and physiological data with the rat genome will build a solid research platform to facilitate innovative studies to unravel the mechanisms resulting in disease.
phenotype; physiological genomics; database; rat strains; disease models; genome
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulation and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
genomic studies; semiparametric prognosis models; model comparison
Modeling tools can play an important role in synthetic biology in the same way that modeling helps in other engineering disciplines: simulations can quickly probe mechanisms and provide a clear picture of how different components influence the behavior of the whole. We briefly review available tools and then present SynBioSS Designer. The Synthetic Biology Software Suite (SynBioSS) is used for the generation, storing, retrieval and quantitative simulation of synthetic biological networks. SynBioSS consists of three distinct components: the Desktop Simulator, the Wiki, and the Designer. SynBioSS Designer takes as input molecular parts involved in gene expression and regulation (e.g. promoters, transcription factors, ribosome binding sites, etc.), and automatically generates complete networks of reactions that represent transcription, translation, regulation, induction and degradation of those parts. Effectively, Designer uses DNA sequences as input and generates networks of biomolecular reactions as output. In this paper, we describe how Designer uses universal principles of molecular biology to generate models of any arbitrary synthetic biological system. These models are useful as they explain biological phenotypic complexity in mechanistic terms. In turn, such mechanistic explanations can assist in designing synthetic biological systems. We also discuss, giving practical guidance to users, how Designer interfaces with the Registry of Standard Biological Parts, the de facto compendium of parts used in synthetic biology applications.
synthetic biology; computational biology; multiscale models; automated design
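The parts-to-reactions idea described above can be illustrated with a minimal rule: every promoter–gene cassette expands into transcription, translation and degradation reactions. The part names, reaction notation and rate-constant labels below are invented placeholders; SynBioSS Designer's actual expansion rules and output format differ.

```python
# Sketch: expanding a gene-expression cassette into a mechanistic
# reaction network, in the spirit of SynBioSS Designer (toy rules only).
def expand_cassette(promoter, gene):
    mrna, prot = f"mRNA_{gene}", f"Protein_{gene}"
    return [
        (f"{promoter} -> {promoter} + {mrna}", "k_tx"),  # transcription
        (f"{mrna} -> {mrna} + {prot}",         "k_tl"),  # translation
        (f"{mrna} -> 0",                       "k_dm"),  # mRNA degradation
        (f"{prot} -> 0",                       "k_dp"),  # protein degradation
    ]

for rxn, k in expand_cassette("P_lac", "GFP"):
    print(f"{rxn}  [{k}]")
```

The value of such an expansion is that the output is directly simulatable: once rate constants are attached, the reaction list can be fed to a stochastic or deterministic kinetics engine such as the SynBioSS Desktop Simulator.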
Dynamic molecular interactions play a central role in regulating the functioning of cells and organisms. The availability of experimentally determined large-scale cellular networks, along with other high-throughput experimental data sets that provide snapshots of biological systems at different times and conditions, is increasingly helpful in elucidating interaction dynamics. Here we review the beginnings of a new subfield within computational biology, one focused on the global inference and analysis of the dynamic interactome. This burgeoning research area, which entails a shift from static to dynamic network analysis, promises to be a major step forward in our ability to model and reason about cellular function and behavior.
network analysis; network dynamics; interaction networks; systems biology