An international consortium released the first draft sequence of the human genome 10 years ago. Although the analysis of these data has suggested the genetic underpinnings of many diseases, we have not yet been able to fully quantify the relationship between genotype and phenotype. Thus, a major current effort of the scientific community focuses on evaluating individuals' predispositions to specific phenotypic traits given their genetic backgrounds. Many resources aim to identify and annotate the specific genes responsible for the observed phenotypes. Some of these use intra-species genetic variability as a means for better understanding this relationship. In addition, several online resources are now dedicated to collecting single nucleotide variants and other types of variants, and annotating their functional effects and associations with phenotypic traits. This information has enabled researchers to develop bioinformatics tools to analyze the rapidly increasing amount of newly extracted variation data and to predict the effect of uncharacterized variants. In this work, we review the most important developments in the field—the databases and bioinformatics tools that will be of utmost importance in our concerted effort to interpret the human variome.
genomic variation; genome interpretation; genomic variant databases; gene prioritization; deleterious variants
The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, many data exist in silos and are not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites, provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.
network biology; bioinformatics
Many complex diseases such as cancer are associated with changes in biological pathways and molecular networks rather than being caused by single gene alterations. A major challenge in the diagnosis and treatment of such diseases is to identify characteristic aberrancies in the biological pathways and molecular network activities and elucidate their relationship to the disease. This review presents recent progress in using high-throughput biological assays to decipher aberrant pathways and network activities. In particular, this review provides specific examples in which high-throughput data have been applied to identify relationships between diseases and aberrant pathways and network activities. The achievements in this field have been remarkable, but many challenges have yet to be addressed.
pathways; biological networks; biomarker discovery; omics studies; systems biology
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
text mining; information extraction; knowledge discovery from texts; text analytics; biomedical natural language processing; pharmacogenomics; pharmacogenetics
Recent advances in high-throughput biotechnologies have led to rapidly growing research interest in reverse engineering of biomolecular systems (REBMS). ‘Data-driven’ approaches, i.e. data mining, can be used to extract patterns from large volumes of biochemical data at molecular-level resolution, while ‘design-driven’ approaches, i.e. systems modeling, can be used to simulate emergent system properties. Consequently, both data- and design-driven approaches applied to –omic data may lead to novel insights in reverse engineering biological systems that could not be attained using low-throughput platforms. However, there exist several challenges in this fast-growing field of reverse engineering biomolecular systems: (i) to integrate heterogeneous biochemical data for data mining, (ii) to combine top–down and bottom–up approaches for systems modeling and (iii) to validate system models experimentally. In addition to reviewing progress made by the community and opportunities encountered in addressing these challenges, we explore the emerging field of synthetic biology, which is an exciting approach to validate and analyze theoretical system models directly through experimental synthesis, i.e. analysis-by-synthesis. The ultimate goal is to address the present and future challenges in reverse engineering biomolecular systems (REBMS) using an integrated workflow of data mining, systems modeling and synthetic biology.
reverse engineering biological systems; high-throughput technology; –omic data; synthetic biology; analysis-by-synthesis
The mountains of data thrusting from the new landscape of modern high-throughput biology are irrevocably changing biomedical research and creating a near-insatiable demand for training in data management and manipulation and data mining and analysis. Among life scientists, from clinicians to environmental researchers, a common theme is the need not just to use, and gain familiarity with, bioinformatics tools and resources but also to understand their underlying fundamental theoretical and practical concepts. Providing bioinformatics training to empower life scientists to handle and analyse their data efficiently, and progress their research, is a challenge across the globe. Delivering good training goes beyond traditional lectures and resource-centric demos, using interactivity, problem-solving exercises and cooperative learning to substantially enhance training quality and learning outcomes. In this context, this article discusses various pragmatic criteria for identifying training needs and learning objectives, for selecting suitable trainees and trainers, for developing and maintaining training skills and evaluating training quality. Adherence to these criteria may help not only to guide course organizers and trainers on the path towards bioinformatics training excellence but, importantly, also to improve the training experience for life scientists.
bioinformatics; training; bioinformatics courses; training life scientists; train the trainers
Next-generation sequencing (NGS) is increasingly being adopted as the backbone of biomedical research. With the commercialization of various affordable desktop sequencers, NGS will come within reach of increasing numbers of cellular and molecular biologists, necessitating community consensus on bioinformatics protocols to tackle the exponential increase in quantity of sequence data. The current resources for NGS informatics are extremely fragmented, and finding a centralized synthesis is difficult. A multitude of tools exist for NGS data analysis; however, none of these satisfies all possible uses and needs. This gap in functionality could be filled by integrating different methods in customized pipelines, an approach helped by the open-source nature of many NGS programmes. Drawing from community spirit and with the use of the Wikipedia framework, we have initiated a collaborative NGS resource: The NGS WikiBook. We have collected a sufficient amount of text to incentivize a broader community to contribute to it. Users can search, browse, edit and create new content, so as to facilitate self-learning and feedback to the community. The overall structure and style for this dynamic material is designed for bench biologists and non-bioinformaticians. The flexibility of online material allows readers to ignore details in a first read, yet have immediate access to the information they need. Each chapter comes with practical exercises so readers may familiarize themselves with each step. The NGS WikiBook aims to create a collective laboratory book and protocol that explains the key concepts and describes best practices in this fast-evolving field.
next-generation sequencing; bioinformatics; training; collaborative learning; best practice
The extremely high error rates reported by Keegan et al. in ‘A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE’ (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented.
next-generation sequencing; sequencing error; adapter ligation; PCR; quality score
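To make the adapter effect concrete, here is a small, hypothetical Python sketch (the adapter string is the widely used Illumina universal adapter prefix; the reads are invented). Duplicate-based error estimators such as DRISEE interpret divergence among near-identical reads as sequencing error, so reads carrying conserved artificial sequence can inflate the estimate and should be screened out first.

```python
# Screen reads for a known adapter motif before estimating error rates.
ADAPTER = "AGATCGGAAGAGC"  # common Illumina universal adapter prefix

def flag_adapter_reads(reads, adapter=ADAPTER):
    """Return indices of reads containing the adapter motif."""
    return [i for i, r in enumerate(reads) if adapter in r]

reads = [
    "ACGTACGTACGTACGT",
    "TTTTAGATCGGAAGAGCACAC",  # adapter read-through
    "GGGGCCCCGGGGCCCC",
]
print(flag_adapter_reads(reads))  # -> [1]
```

In practice, dedicated trimming tools are used for this step; the point here is only that such motifs are detectable and abundant enough to distort naive error estimates.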
Inference of gene regulatory network from expression data is a challenging task. Many methods have been developed for this purpose, but a comprehensive evaluation that covers unsupervised, semi-supervised and supervised methods, and provides guidelines for their practical application, is lacking.
We performed an extensive evaluation of inference methods on simulated and experimental expression data. The results reveal low prediction accuracies for unsupervised techniques, with the notable exception of the Z-SCORE method on knockout data. In all other cases, the supervised approach achieved the highest accuracies and, even in a semi-supervised setting with small numbers of only positive samples, outperformed the unsupervised techniques.
gene regulatory networks; simulation; gene expression data; machine learning
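A minimal sketch of the Z-SCORE idea on knockout data, under assumed data structures rather than the implementation benchmarked above: a gene is predicted as a target of the knocked-out regulator when its expression in the knockout deviates strongly from its wild-type distribution.

```python
from statistics import mean, stdev

def zscore_targets(wildtype, knockout, threshold=2.0):
    """wildtype: dict gene -> wild-type expression replicates (list).
    knockout: dict gene -> expression measured in the knockout strain.
    Returns genes whose |z| exceeds the threshold, with their z-scores."""
    hits = {}
    for gene, wt in wildtype.items():
        mu, sd = mean(wt), stdev(wt)
        if sd == 0:
            continue  # no variability: z-score undefined
        z = (knockout[gene] - mu) / sd
        if abs(z) >= threshold:
            hits[gene] = round(z, 2)
    return hits

wt = {"gA": [1.0, 1.1, 0.9], "gB": [5.0, 5.2, 4.8]}
ko = {"gA": 1.05, "gB": 2.0}   # gB drops sharply in the knockout
print(zscore_targets(wt, ko))  # only gB exceeds the threshold
```

The invented numbers show the mechanism only; real knockout compendia require multiple-testing control and many replicates per condition.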
A central goal of biology is understanding and describing the molecular basis of plasticity: the sets of genes that are combinatorially selected by exogenous and endogenous environmental changes, and the relations among the genes. The most viable current approach to this problem consists of determining whether sets of genes are connected by some common theme, e.g. genes from the same pathway are overrepresented among those whose differential expression in response to a perturbation is most pronounced. There are many approaches to this problem, and the results they produce show a fair amount of dispersion, but they all fall within a common framework consisting of a few basic components. We critically review these components, suggest best practices for carrying out each step, and propose a voting method for meeting the challenge of assessing different methods on a large number of experimental data sets in the absence of a gold standard.
gene set enrichment analysis; pathway enrichment analysis; expression analysis; GSEA; PWEA; performance evaluation; controlled mutual coverage; CMC
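The overrepresentation component mentioned above is commonly assessed with a one-sided hypergeometric test; a minimal sketch with invented toy numbers:

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N total, of which K are in the pathway."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy numbers (illustrative only): 20 genes total, a pathway of 5,
# 6 genes flagged as differentially expressed, 4 of them in the pathway.
p = hypergeom_pvalue(N=20, K=5, n=6, k=4)
print(f"{p:.4f}")  # -> 0.0139
```

This is only the simplest member of the framework the review dissects; threshold-free statistics such as enrichment scores replace the fixed cut-off on differential expression.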
Over the last decade, multiple functional genomic datasets studying chromosomal aberrations and their downstream effects on gene expression have accumulated for several cancer types. A vast majority of them are in the form of paired gene expression profiles and somatic copy number alterations (CNA) information on the same patients identified using microarray platforms. In response, many algorithms and software packages are available for integrating these paired data. Surprisingly, there has been no serious attempt to review the currently available methodologies or the novel insights brought using them. In this work, we discuss the quantitative relationships observed between CNA and gene expression in multiple cancer types and biological milestones achieved using the available methodologies. We discuss the conceptual evolution of both the step-wise and the joint data integration methodologies over the last decade. We conclude by providing suggestions for building efficient data integration methodologies and asking further biological questions.
data integration; copy number; gene expression; integrative analysis; cancer
The presence of different transcripts of a gene across samples can be analysed by whole-transcriptome microarrays. Reproducing results from published microarray data represents a challenge owing to the vast amounts of data and the large variety of preprocessing and filtering steps used before the actual analysis is carried out. To guarantee a firm basis for methodological development where results with new methods are compared with previous results, it is crucial to ensure that all analyses are completely reproducible for other researchers. Here we give a detailed workflow on how to perform reproducible analysis of the GeneChip® Human Exon 1.0 ST Array at probe and probeset level solely in R/Bioconductor, choosing packages based on their simplicity of use. To exemplify the use of the proposed workflow, we analyse differential splicing and differential gene expression in a publicly available dataset using various statistical methods. We believe this study will provide other researchers with an easy way of accessing gene expression data at different annotation levels and with the sufficient details needed for developing their own tools for reproducible analysis of the GeneChip® Human Exon 1.0 ST Array.
reproducible research; exon array; differential splicing; ANOSVA; FIRMA; probe-level analysis
Time is of the essence in biology as in so much else. For example, monitoring disease progression or the timing of developmental defects is important for the processes of drug discovery and therapy trials. Furthermore, an understanding of the basic dynamics of biological phenomena that are often strictly time regulated (e.g. circadian rhythms) is needed to make accurate inferences about the evolution of biological processes. Recent advances in technologies have enabled us to measure timing effects more accurately and in more detail. This has driven related advances in visualization and analysis tools that try to effectively exploit this data. Beyond timeline plots, notable attempts at more involved temporal interpretation have been made in recent years, but awareness of the available resources is still limited within the scientific community. Here, we review some advances in biological visualization of time-driven processes and consider how they aid data analysis and interpretation.
visualization software; representations of time; dynamics of processes
The widespread adoption of high-throughput next-generation sequencing (NGS) technology among the Australian life science research community is highlighting an urgent need to up-skill biologists in tools required for handling and analysing their NGS data. There is currently a shortage of cutting-edge bioinformatics training courses in Australia as a consequence of a scarcity of skilled trainers with time and funding to develop and deliver training courses. To address this, a consortium of Australian research organizations, including Bioplatforms Australia, the Commonwealth Scientific and Industrial Research Organisation and the Australian Bioinformatics Network, have been collaborating with the EMBL-EBI training team. A group of Australian bioinformaticians attended the train-the-trainer workshop to improve their skills in developing and delivering bioinformatics workshop curricula. A 2-day NGS workshop was jointly developed to provide hands-on knowledge and understanding of typical NGS data analysis workflows. The road show–style workshop was successfully delivered at five geographically distant venues in Australia using the newly established Australian NeCTAR Research Cloud. We highlight the challenges we had to overcome at different stages from design to delivery, including the establishment of an Australian bioinformatics training network and the computing infrastructure and resource development. A virtual machine image, workshop materials and scripts for configuring a machine with workshop contents have all been made available under a Creative Commons Attribution 3.0 Unported License. This means participants continue to have convenient access to an environment with which they had become familiar, and bioinformatics trainers are able to access and reuse these resources.
training; next-generation sequencing; NGS; cloud; workshop
High-throughput technologies are widely used in the field of functional genomics and in an increasing number of applications. For many ‘wet lab’ scientists, the analysis of the large amount of data generated by such technologies is a major bottleneck that can only be overcome through very specialized training in advanced data analysis methodologies and the use of dedicated bioinformatics software tools. In this article, we wish to discuss the challenges related to delivering training in the analysis of high-throughput sequencing data and how we addressed these challenges in the hands-on training courses that we have developed at the European Bioinformatics Institute.
bioinformatics training; high-throughput sequencing analysis; statistical methodologies; practical courses; open-source software
With the advent of YouTube channels in bioinformatics, open platforms for problem solving in bioinformatics, active web forums in computing analyses and online resources for learning to code or use a bioinformatics tool, the more traditional continuing education bioinformatics training programs have had to adapt. Bioinformatics training programs that solely rely on traditional didactic methods are being superseded by these newer resources. Yet such face-to-face instruction is still invaluable in the learning continuum. Bioinformatics.ca, which hosts the Canadian Bioinformatics Workshops, has blended more traditional learning styles with current online and social learning styles. Here we share our growing experiences over the past 12 years and look toward what the future holds for bioinformatics training programs.
continuing education; bioinformatics; online learning; massive open online courses (MOOCs)
The recent advent of high-throughput microarray data has enabled the global analysis of the transcriptome, driving the development and application of computational approaches to study transcriptional regulation on the genome scale, by reconstructing in silico the regulatory interactions of the gene network. Although there are many in-depth reviews of such ‘reverse-engineering’ methodologies, most have focused on the practical aspect of data mining, and few on the biological problem and the biological relevance of the methodology. Therefore, in this review, from a biological perspective, we used a set of yeast microarray data as a working example to evaluate the fundamental assumptions implicit in associating transcription factor (TF)–target gene expression levels and estimating TFs’ activity, and further explore cooperative models. Finally, we conclude that the detailed transcription mechanism is overly complex for expression data alone to reveal; nevertheless, future network reconstruction studies could benefit from the incorporation of context-specific information, the modeling of multiple layers of regulation (e.g. micro-RNA), or the development of approaches for context-dependent analysis, to uncover the mechanisms of gene regulation.
transcription factors; transcriptional regulation; network reconstruction; gene expression
Organisms usually cope with change in the environment by altering the dynamic trajectory of gene expression to adjust the complement of active proteins. The identification of particular sets of genes whose expression is adaptive in response to environmental changes helps to understand the mechanistic basis of gene–environment interactions essential for organismic development. We describe a computational framework for clustering the dynamics of gene expression in distinct environments through Gaussian mixture fitting to the expression data measured at a set of discrete time points. We outline a number of quantitative testable hypotheses about the patterns of dynamic gene expression in changing environments and gene–environment interactions causing developmental differentiation. The future directions of gene clustering in terms of incorporations of the latest biological discoveries and statistical innovations are discussed. We provide a set of computational tools that are applicable to modeling and analysis of dynamic gene expression data measured in multiple environments.
dynamic gene expression; functional clustering; gene–environment interaction; mixture model
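As an illustration of the mixture-fitting idea, here is a bare-bones EM fit of a two-component one-dimensional Gaussian mixture; the framework reviewed above models whole expression trajectories across time points and environments, which this toy example deliberately does not.

```python
import math

def em_gmm_1d(xs, iters=50):
    # deterministic initialization: means at the data extremes
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                 for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixing weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, pi

data = [0.1, 0.2, -0.1, 0.0, 5.0, 5.2, 4.9, 5.1]  # two clear clusters
mu, var, pi = em_gmm_1d(data)
print(sorted(mu))  # component means land near the two cluster centres
```

Genes are then assigned to the component with the highest posterior responsibility; the published framework additionally fits mean curves over time within each component.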
Network motifs are statistically overrepresented sub-structures (sub-graphs) in a network, and have been recognized as ‘the simple building blocks of complex networks’. Study of biological network motifs may reveal answers to many important biological questions. The main difficulty in detecting larger network motifs in biological networks lies in the facts that the number of possible sub-graphs increases exponentially with the network or motif size (node count, in general), and that no known polynomial-time algorithm exists for deciding whether two graphs are topologically equivalent. This article discusses the biological significance of network motifs, the motivation behind solving the motif-finding problem, and strategies to solve the various aspects of this problem. A simple classification scheme is designed to analyze the strengths and weaknesses of several existing algorithms. Experimental results derived from a few comparative studies in the literature are discussed, with conclusions that lead to future research directions.
Network motifs; biological networks; graph isomorphism
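A stdlib-only sketch of brute-force motif counting for 3-node sub-graphs of a small directed network: enumerate node triples and canonicalize each induced sub-graph over all relabelings of its nodes. Exhaustive permutation sidesteps a general isomorphism test but illustrates exactly the combinatorial cost described above; the network is invented.

```python
from itertools import combinations, permutations

def motif_census(edges):
    """Count isomorphism classes of connected 3-node induced sub-graphs."""
    nodes = sorted({n for e in edges for n in e})
    eset = set(edges)
    census = {}
    for triple in combinations(nodes, 3):
        # canonical form: lexicographically smallest edge pattern
        # over all 3! relabelings of the three nodes
        best = None
        for perm in permutations(triple):
            idx = {n: i for i, n in enumerate(perm)}
            pattern = [(idx[a], idx[b]) for a, b in eset
                       if a in idx and b in idx]
            key = tuple(sorted(pattern))
            if best is None or key < best:
                best = key
        if best:  # skip triples with no internal edges
            census[best] = census.get(best, 0) + 1
    return census

# Toy network containing a feed-forward loop X->Y, X->Z, Y->Z
edges = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")]
print(motif_census(edges))
```

Real motif finders avoid this full enumeration through sampling, symmetry-breaking and clever sub-graph encodings; a motif call additionally requires comparing counts against randomized networks.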
Creating useful software is a major activity of many scientists, including bioinformaticians. Nevertheless, software development in an academic setting is often unsystematic, which can lead to problems associated with maintenance and long-term availability. Unfortunately, well-documented software development methodology is difficult to adopt, and technical measures that directly improve bioinformatic programming have not been described comprehensively. We have examined 22 software projects and have identified a set of practices for software development in an academic environment. We found these practices useful for planning a project, supporting the involvement of experts (e.g. experimentalists) and promoting higher quality and maintainability of the resulting programs. This article describes 12 techniques that facilitate a quick start into software engineering. We describe 3 of the 22 projects in detail and give many examples to illustrate the usage of particular techniques. We expect this toolbox to be useful for many bioinformatics programming projects and for the training of scientific programmers.
software development; programming; project management; software quality
The Rat Genome Database (RGD) was started >10 years ago to provide a core genomic resource for rat researchers. Currently, RGD combines genetic, genomic, pathway, phenotype and strain information with a focus on disease. RGD users are provided with access to structured and curated data from the molecular level through the organismal level. Those users access RGD from all over the world. End users are not only rat researchers but also researchers working with mouse and human data. Translational research is supported by RGD’s comparative genetics/genomics data in disease portals, in GBrowse, in VCMap and on gene report pages. The impact of RGD also goes beyond the traditional biomedical researcher, as the influence of RGD reaches bioinformaticians, tool developers and curators. Import of RGD data into other publicly available databases expands the influence of RGD to a larger set of end users than those who avail themselves of the RGD website. The value of RGD continues to grow as more types of data and more tools are added, while reaching more types of end users.
database; genome; rat; disease; human
Many methods of gene set analysis developed in recent years have been compared empirically in a number of comprehensive review articles. Although it is recognized that different methods tend to identify different gene sets as significant, no consensus has been worked out as to which method is preferable, as the recommendations are often contradictory. In this article, we want to group and compare different methods in terms of the methodological assumptions pertaining to the definition of a sample and the formulation of the actual null hypothesis. We discuss four models of statistical experiment explicitly or implicitly assumed by most if not all currently available methods of gene set analysis. We analyse validity of the models in the context of the actual biological experiment. Based on this, we recommend a group of methods that provide biologically interpretable results in a statistically sound way. Finally, we demonstrate how correlated or low signal-to-noise data affects performance of different methods, observed in terms of the false-positive rate and power.
gene set analysis; high-throughput data; gene expression; GWAS; competitive methods; self-contained methods
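The contrast between the two most common null hypotheses can be sketched as follows (toy scoring function and invented data, not any particular published method): a self-contained test permutes sample labels, asking whether the set is associated with the phenotype at all, while a competitive test draws random gene sets of the same size, asking whether the set is more associated than the background.

```python
import random

def set_score(expr, labels, gene_set):
    """Mean absolute difference of group means, averaged over the set."""
    score = 0.0
    for g in gene_set:
        a = [x for x, l in zip(expr[g], labels) if l == 1]
        b = [x for x, l in zip(expr[g], labels) if l == 0]
        score += abs(sum(a) / len(a) - sum(b) / len(b))
    return score / len(gene_set)

def pvalue(observed, null_scores):
    return (1 + sum(s >= observed for s in null_scores)) / (1 + len(null_scores))

def self_contained_p(expr, labels, gene_set, n=199, seed=1):
    rng = random.Random(seed)
    obs = set_score(expr, labels, gene_set)
    null = []
    for _ in range(n):
        perm = labels[:]
        rng.shuffle(perm)            # permute sample labels
        null.append(set_score(expr, perm, gene_set))
    return pvalue(obs, null)

def competitive_p(expr, labels, gene_set, n=199, seed=1):
    rng = random.Random(seed)
    obs = set_score(expr, labels, gene_set)
    genes = list(expr)
    null = [set_score(expr, labels, rng.sample(genes, len(gene_set)))
            for _ in range(n)]       # random gene sets of the same size
    return pvalue(obs, null)

expr = {"g1": [0, 0, 0, 5, 5, 5], "g2": [0, 0, 0, 5, 5, 5],
        "g3": [1, 0, 1, 0, 1, 0], "g4": [1, 0, 1, 0, 1, 0]}
labels = [0, 0, 0, 1, 1, 1]
print(self_contained_p(expr, labels, ["g1", "g2"]))
print(competitive_p(expr, labels, ["g1", "g2"]))
```

Note how the two procedures define "sample" differently: sample permutation preserves inter-gene correlation, whereas gene permutation treats genes as the sampling unit, which is exactly the modeling assumption the article asks readers to scrutinize.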
GBrowse is a mature web-based genome browser that is suitable for deployment on both public and private web sites. It supports most standard genome browser features, including qualitative and quantitative (wiggle) tracks, track uploading, track sharing, interactive track configuration, semantic zooming and limited smooth track panning. As of version 2.0, GBrowse supports next-generation sequencing (NGS) data by providing for the direct display of SAM and BAM sequence alignment files. SAM/BAM tracks provide semantic zooming and support both local and remote data sources. This article provides step-by-step instructions for configuring GBrowse to display NGS data.
bioinformatics; genomics; DNA sequencing; genome browser; data visualization; data sharing
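For orientation, a BAM track stanza might look roughly like the following; this is a hedged sketch based on GBrowse 2 conventions for the Bio::DB::Sam adaptor, with placeholder paths and track names, and the exact options should be checked against the GBrowse configuration documentation.

```
# Database stanza: point the Bio::DB::Sam adaptor at the BAM file
# and the reference FASTA it was aligned against (paths are placeholders).
[bam_reads:database]
db_adaptor    = Bio::DB::Sam
db_args       = -bam   /path/to/alignments.bam
                -fasta /path/to/reference.fa

# Track stanza: draw the alignments as segments, showing mismatches
# against the reference at close zoom levels.
[BAM_Reads]
database      = bam_reads
feature       = match
glyph         = segments
draw_target   = 1
show_mismatch = 1
key           = NGS read alignments
```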
MicroRNAs (miRNAs) are short regulatory RNAs that down-regulate gene expression. They are essential for cell homeostasis and active in many disease states. A major discovery is the ability of miRNAs to determine the efficacy of drugs, which has given rise to the field of ‘miRNA pharmacogenomics’ through ‘Pharmaco-miRs’. miRNAs play a significant role in pharmacogenomics by down-regulating genes that are important for drug function. These interactions can be described as triplet sets consisting of a miRNA, a target gene and a drug associated with the gene. We have developed a web server that links miRNA expression and drug function by combining data on miRNA targeting and protein–drug interactions. miRNA targeting information derives from both experimental data and computational predictions, and protein–drug interactions are annotated by the Pharmacogenomics Knowledgebase (PharmGKB). Pharmaco-miR’s input consists of miRNAs, genes and/or drug names, and the output consists of miRNA pharmacogenomic sets or a list of unique associated miRNAs, genes and drugs. We have furthermore built a database, named Pharmaco-miR Verified Sets (VerSe), which contains miRNA pharmacogenomic data manually curated from the literature, can be searched and downloaded via Pharmaco-miR, and informs on trends and generalities published in the field. Overall, we present examples of how Pharmaco-miR provides possible explanations for previously published observations, including how the cisplatin and 5-fluorouracil resistance induced by miR-148a may be caused by miR-148a targeting of the gene KIT. The information is available at www.Pharmaco-miR.org.
microRNAs; pharmacogenomics; database; web server; miRNA pharmacogenomic set; Pharmaco-miR
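The triplet construction can be sketched as a simple join on the shared gene; the miRNA–target and gene–drug tables below are invented for illustration (only the miR-148a/KIT/cisplatin and 5-fluorouracil association is taken from the text above).

```python
# Hypothetical miRNA -> target-gene table (not PharmGKB's schema)
mirna_targets = {
    "miR-148a": ["KIT", "DNMT1"],
    "miR-21":   ["PTEN"],
}
# Hypothetical gene -> drug annotations
gene_drugs = {
    "KIT":  ["cisplatin", "5-fluorouracil"],
    "PTEN": ["trastuzumab"],
}

def pharmaco_triplets(mirna_targets, gene_drugs):
    """Return (miRNA, gene, drug) triplets joined on the shared gene."""
    return [(m, g, d)
            for m, genes in mirna_targets.items()
            for g in genes
            for d in gene_drugs.get(g, [])]

for t in pharmaco_triplets(mirna_targets, gene_drugs):
    print(t)
```

Genes with no drug annotation (here DNMT1) simply produce no triplet, mirroring how the server only reports sets where both halves of the join exist.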