It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or ‘BioNLP’ in general, focusing primarily on papers published within the past year.
text mining; natural language processing; information extraction; text summarization; image mining; question answering; literature-based discovery; evaluation; user orientation
genetic disorders; medical disorders; genome; geneticist; adolescence
The rate of change in genomics, and 'omics generally, shows no signs of slowing down. Related analysis software is struggling to keep apace. This paper provides a brief review of the field.
genomics; software; systems genetics; 'omics; genome-wide association studies; SNP annotation; networks
In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but these static contents cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses the production and distribution of high-quality tutorial videos, which would be useful to viewer, with examples. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability.
screencast; vodcast; tutorial; YouTube; QuickTime; Flash
Gene fusions are important genomic events in human cancer because their fusion gene products can drive the development of cancer and thus are potential prognostic tools or therapeutic targets in anti-cancer treatment. Major advancements have been made in computational approaches for fusion gene discovery over the past 3 years due to improvements and widespread applications of high-throughput next generation sequencing (NGS) technologies. To identify fusions from NGS data, existing methods typically leverage the strengths of both sequencing technologies and computational strategies. In this article, we review the NGS and computational features of existing methods for fusion gene detection and suggest directions for future development.
gene fusion; next generation sequencing; cancer; whole genome sequencing; transcriptome sequencing; computational tools
The need to analyze high-dimension biological data is driving the development of new data mining methods. Biclustering algorithms have been successfully applied to gene expression data to discover local patterns, in which a subset of genes exhibit similar expression levels over a subset of conditions. However, it is not clear which algorithms are best suited for this task. Many algorithms have been published in the past decade, most of which have been compared only to a small number of algorithms. Surveys and comparisons exist in the literature, but because of the large number and variety of biclustering algorithms, they are quickly outdated. In this article we partially address this problem of evaluating the strengths and weaknesses of existing biclustering methods. We used the BiBench package to compare 12 algorithms, many of which were recently published or have not been extensively studied. The algorithms were tested on a suite of synthetic data sets to measure their performance on data with varying conditions, such as different bicluster models, varying noise, varying numbers of biclusters and overlapping biclusters. The algorithms were also tested on eight large gene expression data sets obtained from the Gene Expression Omnibus. Gene Ontology enrichment analysis was performed on the resulting biclusters, and the best enrichment terms are reported. Our analyses show that the biclustering method and its parameters should be selected based on the desired model, whether that model allows overlapping biclusters, and its robustness to noise. In addition, we observe that the biclustering algorithms capable of finding more than one model are more successful at capturing biologically relevant clusters.
biclustering; microarray; gene expression; clustering
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
text mining; information extraction; knowledge discovery from texts; text analytics; biomedical natural language processing; pharmacogenomics; pharmacogenetics
The presence of different transcripts of a gene across samples can be analysed by whole-transcriptome microarrays. Reproducing results from published microarray data represents a challenge owing to the vast amounts of data and the large variety of preprocessing and filtering steps used before the actual analysis is carried out. To guarantee a firm basis for methodological development where results with new methods are compared with previous results, it is crucial to ensure that all analyses are completely reproducible for other researchers. We here give a detailed workflow on how to perform reproducible analysis of the GeneChip®Human Exon 1.0 ST Array at probe and probeset level solely in R/Bioconductor, choosing packages based on their simplicity of use. To exemplify the use of the proposed workflow, we analyse differential splicing and differential gene expression in a publicly available dataset using various statistical methods. We believe this study will provide other researchers with an easy way of accessing gene expression data at different annotation levels and with the sufficient details needed for developing their own tools for reproducible analysis of the GeneChip®Human Exon 1.0 ST Array.
reproducible research; exon array; differential splicing; ANOSVA; FIRMA; probe-level analysis
Good accessibility of publicly funded research data is essential to secure an open scientific system and eventually becomes mandatory [Wellcome Trust will Penalise Scientists Who Don’t Embrace Open Access. The Guardian 2012]. By the use of high-throughput methods in many research areas from physics to systems biology, large data collections are increasingly important as raw material for research. Here, we present strategies worked out by international and national institutions targeting open access to publicly funded research data via incentives or obligations to share data. Funding organizations such as the British Wellcome Trust therefore have developed data sharing policies and request commitment to data management and sharing in grant applications. Increased citation rates are a profound argument for sharing publication data. Pre-publication sharing might be rewarded by a data citation credit system via digital object identifiers (DOIs) which have initially been in use for data objects. Besides policies and incentives, good practice in data management is indispensable. However, appropriate systems for data management of large-scale projects for example in systems biology are hard to find. Here, we give an overview of a selection of open-source data management systems proved to be employed successfully in large-scale projects.
data management; data sharing; open access; data citation; systems biology
The amount of biological data is increasing rapidly, and will continue to increase as new rapid technologies are developed. Professionals in every area of bioscience will have data management needs that require publicly available bioinformatics resources. Not all scientists desire a formal bioinformatics education but would benefit from more informal educational sources of learning. Effective bioinformatics education formats will address a broad range of scientific needs, will be aimed at a variety of user skill levels, and will be delivered in a number of different formats to address different learning styles. Informal sources of bioinformatics education that are effective are available, and will be explored in this review.
bioinformatics education; training and learning; outreach; genomics; data management; computational biology resources
The EB-eye is a fast and efficient search engine that provides easy and uniform access to the biological data resources hosted at the EMBL-EBI. Currently, users can access information from more than 62 distinct datasets covering some 400 million entries. The data resources represented in the EB-eye include: nucleotide and protein sequences at both the genomic and proteomic levels, structures ranging from chemicals to macro-molecular complexes, gene-expression experiments, binary level molecular interactions as well as reaction maps and pathway models, functional classifications, biological ontologies, and comprehensive literature libraries covering the biomedical sciences and related intellectual property. The EB-eye can be accessed over the web or programmatically using a SOAP Web Services interface. This allows its search and retrieval capabilities to be exploited in workflows and analytical pipe-lines. The EB-eye is a novel alternative to existing biological search and retrieval engines. In this article we describe in detail how to exploit its powerful capabilities.
text search; biological databases; integration; interoperability; web services; Apache Lucene
A scientific database can be a powerful tool for biologists in an era where large-scale genomic analysis, combined with smaller-scale scientific results, provides new insights into the roles of genes and their products in the cell. However, the collection and assimilation of data is, in itself, not enough to make a database useful. The data must be incorporated into the database and presented to the user in an intuitive and biologically significant manner. Most importantly, this presentation must be driven by the user’s point of view; that is, from a biological perspective. The success of a scientific database can therefore be measured by the response of its users – statistically, by usage numbers and, in a less quantifiable way, by its relationship with the community it serves and its ability to serve as a model for similar projects. Since its inception ten years ago, the Saccharomyces Genome Database (SGD) has seen a dramatic increase in its usage, has developed and maintained a positive working relationship with the yeast research community, and has served as a template for at least one other database. The success of SGD, as measured by these criteria, is due in large part to philosophies that have guided its mission and organisation since it was established in 1993. This paper aims to detail these philosophies and how they shape the organisation and presentation of the database.
S. cerevisiae; database; genome-wide analysis; bioinformatics; yeast
The unanimous agreement that cellular processes are (largely) governed by interactions between proteins has led to enormous community efforts culminating in overwhelming information relating to these proteins; to the regulation of their interactions, to the way in which they interact and to the function which is determined by these interactions. These data have been organized in databases and servers. However, to make these really useful, it is essential not only to be aware of these, but in particular to have a working knowledge of which tools to use for a given problem; what are the tool advantages and drawbacks; and no less important how to combine these for a particular goal since usually it is not one tool, but some combination of tool-modules that is needed. This is the goal of this review.
protein–protein interactions; protein–protein interfaces; binding site prediction; docking; web servers; databases
Hemoglobinopathies are important inherited disorders with high prevalence in many tropical countries. Prediction of protein nanostructure and function is a great challenge in proteomics and structural genomics. Identifying the point vulnerable to mutation is a new trend in research on disorders at the genomic and proteomic level. A bioinformatics analysis was performed to determine the positions that tend to correspond with peptide motifs in the amino acid sequence of alpha and beta globin chains. To identify the weak linkage in alpha globin and beta globin chains, a new bioinformatics tool, GlobPlot, was used. For the alpha globin chain, 22 positions were identified: the disorders were found at positions 3–8, 38–42, 46–51, and 75–79. For the beta globin chain, 46 positions were identified: the disorders were found at positions 61–146. The study showed that weak linkages in alpha globin and beta globin chains can be identified and provide good information for predicting possible new mutations that could lead to new hemoglobinopathies.
globin; hemoglobinopathy; protein structure; weak linkage
This paper reports on an analysis of the bioinformatics and medical informatics literature with the objective to identify upcoming trends that are shared among both research fields to derive benefits from potential collaborative initiatives for their future. Our results present the main characteristics of the two fields and show that these domains are still relatively separated.
Computational Biology; classification; statistics & numerical data; trends; Databases, Bibliographic; trends; Internationality; MEDLINE; Medical Informatics; classification; statistics & numerical data; trends; Natural Language Processing; Periodicals as Topic; statistics & numerical data; trends; Vocabulary, Controlled; medicine; informatics; biology; bioinformatics; correspondence analysis; bibliometrics; Principal Component Analysis; PCA; MCA
We announce the draft genome sequence of Lactobacillus casei W56 in one contig. This strain shows immunomodulatory and probiotic properties. The strain is also an ingredient of commercially available probiotic products.
Zymomonas mobilis ZM401 is a flocculating strain which can be self-immobilized within fermentors for a high-cell-density culture to improve ethanol productivity, as well as high-gravity fermentation to increase ethanol titer, due to its improved ethanol tolerance associated with the morphological change. Here, we report its draft genome sequence.
Fibrisoma limi strain BUZ 3T, a Gram-negative bacterium, was isolated from coastal mud from the North Sea (Fedderwardersiel, Germany) and characterized using a polyphasic approach in 2011. The genome consists of a chromosome of about 7.5 Mb and three plasmids.
Fibrella aestuarina BUZ 2T is the type strain of the recently characterized genus Fibrella. Here we report the draft genome sequence of this strain, which consists of a single scaffold representing the chromosome (with 11 gaps) and a 161-kb circular plasmid.
We report the complete genome sequences of TI0902, a highly virulent type A1 strain, and TIGB03, a related, attenuated chemical mutant strain. Compared to the wild type, the mutant strain had 45 point mutations and a 75.9-kb duplicated region that had not been previously observed in Francisella species.