XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS, with programming interfaces to R and web services, or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
xQTL workbench is a scalable web platform for the mapping of quantitative trait loci (QTLs) at multiple levels: for example gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL) and phenotype (phQTL) data. Popular QTL mapping methods for model organism and human populations are accessible via the web user interface. Large calculations scale easily onto multi-core computers, clusters and the cloud. All data involved can be uploaded and queried online: markers, genotypes, microarrays, NGS, LC-MS, GC-MS, NMR, etc. When new data types become available, xQTL workbench can quickly be customized using the MOLGENIS software generator.
xQTL workbench runs on all common platforms, including Linux, Mac OS X and Windows. An online demo system, installation guide, tutorials, software and source code are available under the LGPL3 license from http://www.xqtl.org.
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
BioHackathon; Bioinformatics; Semantic Web; Web services; Ontology; Visualization; Knowledge representation; Databases; Semantic interoperability; Data models; Data sharing; Data integration
The Generation Challenge Programme (GCP) is a global crop research consortium directed toward crop improvement through the application of comparative biology and genetic resources characterization to plant breeding. A key consortium research activity is the development of a GCP crop bioinformatics platform to support GCP research. This platform includes the following: (i) shared, public platform-independent domain models, ontology, and data formats to enable interoperability of data and analysis flows within the platform; (ii) web service and registry technologies to identify, share, and integrate information across diverse, globally dispersed data sources, as well as to access high-performance computing (HPC) facilities for computationally intensive, high-throughput analyses of project data; (iii) platform-specific middleware reference implementations of the domain model integrating a suite of public (largely open-access/-source) databases and software tools into a workbench to facilitate biodiversity analysis, comparative analysis of crop genomic data, and plant breeding decision making.
Reproducibility verification is essential to the practice of the scientific method. Researchers report their findings, which are strengthened as other independent groups in the scientific community share similar outcomes. In the many scientific fields where software has become a fundamental tool for capturing and analyzing data, this requirement of reproducibility implies that reliable and comprehensive software platforms and tools should be made available to the scientific community. Such tools empower researchers and the public to verify, through practice, the reproducibility of observations that are reported in the scientific literature. Medical image analysis is one of the fields in which the use of computational resources, both software and hardware, is an essential platform for performing experimental work. In this arena, the introduction of the Insight Toolkit (ITK) in 1999 has transformed the field and facilitated its progress by accelerating the rate at which algorithmic implementations are developed, tested, disseminated and improved. By building on the efficiency and quality of open source methodologies, ITK has provided the medical image community with an effective platform on which to build a daily workflow that incorporates the true scientific practices of reproducibility verification. This article describes the multiple tools, methodologies, and practices that the ITK community has adopted, refined, and followed during the past decade, in order to become one of the research communities with the most modern reproducibility verification infrastructure. For example, 207 contributors have created over 2400 unit tests that provide over 84% code line test coverage. The Insight Journal, an open publication journal associated with the toolkit, has seen over 360,000 publication downloads. The median normalized closeness centrality, a measure of knowledge flow, resulting from the distributed peer code review system was high, at 0.46.
reproducibility; ITK; insight toolkit; insight journal; code review; open science
The work showed that the integrated suite of software tools for detecting criminals using DNA databases achieved its overall objective by providing a working platform for sequence analysis. It also demonstrated that by integrating BLAST and FASTA (two widely used, freely available algorithms) with a custom-built pairwise sequence alignment (PSA) implementation, tandem repeat (TR) analysis tools, and the supporting utilities developed for database and file management, it is entirely possible to build an initial working version of a software tool for criminal DNA analysis and detection. The integrated software tool has great potential, and the results obtained during testing were satisfactory. The recent South Asia tsunami has renewed the need to establish a quick and reliable system for DNA matching and comparison. This work may also contribute to the rapid identification of victims in many disasters.
Future work will further enhance the existing tools by adding more options and controls, improving the visualisation display, and building a robust software architecture to better manage system load. Fault-tolerance enhancement is one of the key areas that can further help make the entire application efficient, robust and reliable.
In systems biology, and many other areas of research, there is a need for the interoperability of tools and data sources that were not originally designed to be integrated. Due to the interdisciplinary nature of systems biology, and its association with high throughput experimental platforms, there is an additional need to continually integrate new technologies. As scientists work in isolated groups, integration with other groups is rarely a consideration when building the required software tools.
We illustrate an approach, through the discussion of a purpose-built software architecture, which allows disparate groups to reuse tools and access data sources in a common manner. The architecture allows for: the rapid development of distributed applications; interoperability, so it can be used by a wide variety of developers and computational biologists; development using standard tools, so that it is easy to maintain and does not require a large development effort; extensibility, so that new technologies and data types can be incorporated; and non-intrusive development, insofar as researchers need not adhere to a pre-existing object model.
By using a relatively simple integration strategy, based upon a common identity system and dynamically discovered interoperable services, a light-weight software architecture can become the focal point through which scientists can both get access to and analyse the plethora of experimentally derived data.
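The integration strategy above can be sketched in a few lines: services advertise themselves under a shared data-type identity and are discovered dynamically, so independently developed tools never need a common object model. This is an illustrative Python sketch only; the class and method names (`ServiceRegistry`, `discover`, etc.) are assumptions, not part of any published API.

```python
# Minimal sketch of identity-based service discovery: tools register
# handlers under a shared data-type name, and clients discover and
# invoke them at run time without prior coordination.

class ServiceRegistry:
    def __init__(self):
        self._services = {}

    def register(self, data_type, name, handler):
        """Advertise a handler for a given data type."""
        self._services.setdefault(data_type, {})[name] = handler

    def discover(self, data_type):
        """Dynamically discover all services that consume a data type."""
        return sorted(self._services.get(data_type, {}))

    def invoke(self, data_type, name, payload):
        return self._services[data_type][name](payload)

registry = ServiceRegistry()
# Two independent groups register tools without coordinating beforehand:
registry.register("gene_list", "enrichment", lambda genes: len(genes))
registry.register("gene_list", "id_mapper", lambda genes: [g.upper() for g in genes])

print(registry.discover("gene_list"))   # ['enrichment', 'id_mapper']
print(registry.invoke("gene_list", "id_mapper", ["brca1", "tp53"]))
```

Because the only shared contract is the data-type identity, new tools can join the federation by registering a handler, without changes to existing clients.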
Summary: seeQTL is a comprehensive and versatile eQTL database, including various eQTL studies and a meta-analysis of HapMap eQTL information. The database presents eQTL association results in a convenient browser, using both segmented local-association plots and genome-wide Manhattan plots.
Availability and implementation: seeQTL is freely available for non-commercial use at http://www.bios.unc.edu/research/genomic_software/seeQTL/.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited access to computational infrastructure and high-quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation, remain serious bottlenecks. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via Amazon Web Services using the Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
The annotation of genomes from next-generation sequencing platforms needs to
be rapid, high-throughput, and fully integrated and automated. Although a
few Web-based annotation services have recently become available, they may
not be the best solution for researchers who need to annotate a large
number of genomes, possibly including proprietary data, and store them
locally for further analysis. To address this need, we developed a
standalone software application, the Annotation of microbial Genome
Sequences (AGeS) system, which incorporates publicly available and
in-house-developed bioinformatics tools and databases, many of which are
parallelized for high-throughput performance.
The AGeS system supports three main capabilities. The first is the storage of
input contig sequences and the resulting annotation data in a central,
customized database. The second is the annotation of microbial genomes using
an integrated software pipeline, which first analyzes contigs from
high-throughput sequencing by locating genomic regions that code for
proteins, RNA, and other genomic elements through the Do-It-Yourself
Annotation (DIYA) framework. The identified protein-coding regions are then
functionally annotated using the in-house-developed Pipeline for Protein
Annotation (PIPA). The third capability is the visualization of annotated
sequences using GBrowse. To date, we have implemented these capabilities for
bacterial genomes. AGeS was evaluated by comparing its genome annotations
with those provided by three other methods. Our results indicate that the
software tools integrated into AGeS provide annotations that are in general
agreement with those provided by the compared methods. This is demonstrated
by a >94% overlap in the number of identified genes, a significant
number of identical annotated features, and a >90% agreement in
enzyme function predictions.
Motivation: R/qtl is free and powerful software for mapping and exploring quantitative trait loci (QTL). R/qtl provides a fully comprehensive range of methods for a wide range of experimental cross types. We recently added multiple QTL mapping (MQM) to R/qtl. MQM offers higher statistical power to detect and disentangle the effects of multiple linked and unlinked QTL than many other methods. MQM for R/qtl adds many new features, including improved handling of missing data, analysis of tens of thousands of molecular traits, permutation for determining significance thresholds for QTL and QTL hot spots, and visualizations for cis–trans and QTL interaction effects. MQM for R/qtl is the first free and open source implementation of MQM that is multi-platform, scalable and suitable for automated procedures and large genetical genomics datasets.
Availability: R/qtl is free and open source multi-platform software for the statistical language R, and is made available under the GPLv3 license. R/qtl can be installed from http://www.rqtl.org/. R/qtl queries should be directed at the mailing list, see http://www.rqtl.org/list/.
Single nucleotide polymorphisms (SNPs) represent the most abundant type of genetic variation that can be used as molecular markers. The SNPs that are hidden in sequence databases can be unlocked using bioinformatic tools. For efficient application of these SNPs, the sequence set should be as error-free as possible, target single loci and suit the SNP scoring platform of choice. We have developed a pipeline to effectively mine SNPs from public EST databases, with or without quality information, using QualitySNP software, select reliable SNPs and prepare the loci for analysis on the Illumina GoldenGate genotyping platform. The applicability of the pipeline was demonstrated using publicly available potato EST data, genotyping individuals from two diploid mapping populations and subsequently mapping the SNP markers (putative genes) in both populations. Over 7000 reliable SNPs were identified that met the criteria for genotyping on the GoldenGate platform. Of the 384 SNPs on the SNP array, approximately 12% dropped out. For the two potato mapping populations, 165 and 185 segregating SNP loci could be mapped on the respective genetic maps, illustrating the effectiveness of our pipeline for SNP selection and validation.
Electronic supplementary material
The online version of this article (doi:10.1007/s11032-009-9377-5) contains supplementary material, which is available to authorized users.
EST database; Illumina GoldenGate assay; QualitySNP; Potato
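The reliability filtering step in a pipeline of this kind can be illustrated with a short sketch: keep only bi-allelic SNPs with enough reads supporting each allele and enough clean flanking sequence for probe design. The thresholds and field names below are illustrative assumptions, not those of QualitySNP or the GoldenGate design rules themselves.

```python
# Hypothetical sketch of SNP reliability filtering before array design:
# a SNP passes if it is bi-allelic, each allele is seen in at least
# `min_allele_count` reads, and both flanks offer `min_flank` bases of
# clean sequence for probe design. Thresholds are illustrative only.

def reliable(snp, min_allele_count=2, min_flank=60):
    counts = snp["allele_counts"]
    return (
        len(counts) == 2                      # bi-allelic locus
        and min(counts.values()) >= min_allele_count
        and snp["flank_left"] >= min_flank    # clean sequence on both sides
        and snp["flank_right"] >= min_flank
    )

candidates = [
    {"id": "snp1", "allele_counts": {"A": 5, "G": 4}, "flank_left": 80, "flank_right": 75},
    {"id": "snp2", "allele_counts": {"C": 9, "T": 1}, "flank_left": 90, "flank_right": 90},
    {"id": "snp3", "allele_counts": {"A": 6, "G": 3}, "flank_left": 40, "flank_right": 88},
]
passed = [s["id"] for s in candidates if reliable(s)]
print(passed)   # only snp1 meets all criteria
```

Requiring a minimum count per allele guards against sequencing errors masquerading as rare alleles, which is one reason such pipelines can keep array drop-out rates low.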
In vitro selection has been an essential tool in the development of recombinant antibodies against various antigen targets. Deep sequencing has recently been gaining ground as an alternative and valuable method to analyze such antibody selections. The analysis provides a novel and extremely detailed view of selected antibody populations, and allows the identification of specific antibodies using only sequencing data, potentially eliminating the need for expensive and laborious low-throughput screening methods such as enzyme-linked immunosorbent assay. The high cost and the need for bioinformatics experts and powerful computer clusters, however, have limited the general use of deep sequencing in antibody selections. Here, we describe the AbMining ToolBox, an open source software package for the straightforward analysis of antibody libraries sequenced by the three main next generation sequencing platforms (454, Ion Torrent, MiSeq). The ToolBox is able to identify heavy chain CDR3s as effectively as more computationally intense software, and can be easily adapted to analyze other portions of antibody variable genes, as well as the selection outputs of libraries based on different scaffolds. The software runs on all common operating systems (Microsoft Windows, Mac OS X, Linux), on standard personal computers, and sequence analysis of 1–2 million reads can be accomplished in 10–15 min, a fraction of the time of competing software. Use of the ToolBox will allow the average researcher to incorporate deep sequence analysis into routine selections from antibody display libraries.
HCDR3; antibody library; deep sequencing; regular expression; AbMining ToolBox
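The regular-expression idea behind HCDR3 identification can be sketched briefly: the heavy-chain CDR3 lies between a conserved cysteine at the end of framework 3 and the Trp-Gly-x-Gly motif that opens framework 4. This is an illustrative Python sketch, not the AbMining ToolBox code itself; the exact anchors and length limits the ToolBox uses may differ.

```python
import re

# Illustrative HCDR3 finder: match a conserved C, a 4-30 residue loop,
# and a W whose following residues fit the W-G-x-G framework-4 motif.
# The pattern and bounds are assumptions for illustration only.
HCDR3_PATTERN = re.compile(r"C([A-Z]{4,30})W(?=G.G)")

def find_hcdr3(protein_seq):
    """Return candidate HCDR3 regions (conserved C/W anchors included)."""
    return [m.group(0) for m in HCDR3_PATTERN.finditer(protein_seq)]

# Toy translated read containing one CDR3-like region:
read = "SLRAEDTAVYYCARDRGYSSGWYFDVWGQGTLVTVSS"
print(find_hcdr3(read))
```

Because the scan is a single compiled regular expression per read, it runs in roughly linear time over millions of translated reads, which is consistent with the short run times reported above.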
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used in the shell, and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows; with JRuby, BioRuby also runs on the Java Virtual Machine. The source code is available from http://www.bioruby.org/.
Using DNA markers in plant breeding with marker-assisted selection (MAS) could greatly improve the precision and efficiency of selection, leading to the accelerated development of new crop varieties. The numerous examples of MAS in rice have prompted many breeding institutes to establish molecular breeding labs. The last decade has produced an enormous amount of genomics research in rice, including the identification of thousands of QTLs for agronomically important traits, the generation of large amounts of gene expression data, and cloning and characterization of new genes, including the detection of single nucleotide polymorphisms. The pinnacle of genomics research has been the completion and annotation of genome sequences for indica and japonica rice. This information—coupled with the development of new genotyping methodologies and platforms, and the development of bioinformatics databases and software tools—provides even more exciting opportunities for rice molecular breeding in the 21st century. However, the great challenge for molecular breeders is to apply genomics data in actual breeding programs. Here, we review the current status of MAS in rice, current genomics projects and promising new genotyping methodologies, and evaluate the probable impact of genomics research. We also identify critical research areas to “bridge the application gap” between QTL identification and applied breeding that need to be addressed to realize the full potential of MAS, and propose ideas and guidelines for establishing rice molecular breeding labs in the postgenome sequence era to integrate molecular breeding within the context of overall rice breeding and research programs.
To identify genetic loci that regulate spontaneous arthritis in interleukin-1 receptor antagonist (IL-1ra)-deficient mice, an F2 population was created from a cross between Balb/c IL-1ra-deficient mice and DBA/1 IL-1ra-deficient mice. Spontaneous arthritis in the F2 population was examined and recorded. Genotypes of those F2 mice were determined using microsatellite markers. Quantitative trait locus (QTL) analysis was conducted with R/qtlbim. Functions of genes within QTL chromosomal regions were evaluated using a bioinformatics tool, PGMapper, and microarray analysis. Potential candidate genes were further evaluated using GeneNetwork. A total of 137 microsatellite markers with an average of 12 cM spacing along the whole genome were used for determining the correlation of arthritis phenotypes with genotypes of 191 F2 progenies. By whole-genome mapping, we obtained QTLs on chromosomes 1 and 6 that were above the significance threshold for strong Bayesian evidence. The QTL on chromosome 1 had a peak near D1Mit55 and D1Mit425 at 82.6 cM. It may account for as much as 12% of the phenotypic variation in susceptibility to spontaneous arthritis. The QTL region contained 208 known transcripts. According to their functions, Mr1, Pla2g4a and Fasl are outstanding candidate genes. From microarray analysis, 11 genes were selected as favourable candidates based on their function and expression profiles. Three of those 11 genes, Prg4, Ptgs2 and Mr1, correlated with the IL-1ra pathway. Those genes were considered to be the best candidates.
Quantitative trait locus (QTL) mapping identifies genomic regions that likely contain genes regulating a quantitative trait. However, QTL regions may encompass tens to hundreds of genes. To find the most promising candidate genes that regulate the trait, the biologist typically collects information from multiple resources about the genes in the QTL interval. This process is very laborious and time consuming.
QTLminer is a bioinformatics tool that automatically performs QTL region analysis. It is available in GeneNetwork and it integrates information such as gene annotation, gene expression and sequence polymorphisms for all the genes within a given genomic interval.
QTLminer substantially speeds up discovery of the most promising candidate genes within a QTL region.
The Statistical Analysis System (SAS) is one of the most comprehensive statistical analysis software packages available. It offers data analysis for almost all experiments under various statistical models. Each analysis is performed using a particular subroutine, called a procedure (PROC). For example, PROC ANOVA performs analysis of variance. PROC QTL is a user-defined SAS procedure for mapping quantitative trait loci (QTL). It allows users to perform QTL mapping for continuous and discrete traits within the SAS platform. Users of PROC QTL are able to take advantage of all existing features offered by the general SAS software, for example data management and graphical treatment. The current version of PROC QTL can perform QTL mapping for all line-crossing experiments using maximum likelihood (ML), least squares (LS), iteratively reweighted least squares (IRLS), Fisher scoring (FISHER), Bayesian (BAYES) and empirical Bayes (EBAYES) methods.
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance are also publicly available for download and use by researchers with access to private clouds.
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
Recent advances in genomics and structural biology have resulted in an unprecedented increase in biological data available from Internet-accessible databases. In order to help students effectively use this vast repository of information, undergraduate biology students at Drake University were introduced to bioinformatics software and databases in three courses, beginning with an introductory course in cell biology. The exercises and projects that were used to help students develop literacy in bioinformatics are described. In a recently offered course in bioinformatics, students developed their own simple sequence analysis tool using the Perl programming language. These experiences are described from the point of view of the instructor as well as the students. A preliminary assessment has been made of the degree to which students had developed a working knowledge of bioinformatics concepts and methods. Finally, some conclusions have been drawn from these courses that may be helpful to instructors wishing to introduce bioinformatics within the undergraduate biology curriculum.
undergraduate; bioinformatics; genomics; Perl
Differences in gene expression in the CNS influence behavior and disease susceptibility. To systematically explore the role of normal variation in expression on hippocampal structure and function, we generated an online microarray database for a diverse panel of strains of mice, including most common inbred strains and numerous recombinant inbred lines (www.genenetwork.org). Using this resource, coexpression networks for families of genes can be generated rapidly to test causal models related to function. The data set is optimized for quantitative trait locus (QTL) mapping and was used to identify over 5500 QTLs that modulate mRNA levels. We describe a wide variety of analyses and novel synthetic approaches that take advantage of this resource, and demonstrate how both the data and associated tools can be applied to the study of gene regulation in the hippocampus and relations to structure and function.
recombinant inbred mice; hippocampus; QTL; genetical genomics; transcript expression
Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as microarray analysis, metabolic network modelling or protein structure prediction. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing.
In this paper we describe the Gaggle, a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily extended environment.
We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein, a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools.
We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and Pubmed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies and is freely available online.
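The "four simple data types" idea can be sketched compactly: tools (geese) exchange only names, matrices, networks and associative arrays through a central broker, so a selection made in one tool appears in all the others. The Python class and method names below are illustrative stand-ins, not the Gaggle's actual Java RMI API.

```python
# Sketch of broker-mediated broadcasting between loosely coupled tools:
# each "goose" registers with the broker and receives any data type
# broadcast by another goose. Only four data kinds are permitted.

class Broker:
    def __init__(self):
        self.geese = []

    def connect(self, goose):
        self.geese.append(goose)

    def broadcast(self, sender, kind, data):
        assert kind in {"names", "matrix", "network", "tuple"}
        for goose in self.geese:
            if goose is not sender:      # sender does not echo to itself
                goose.receive(kind, data)

class Goose:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def receive(self, kind, data):
        self.inbox.append((kind, data))

broker = Broker()
viewer, stats = Goose("MatrixViewer"), Goose("RGoose")
broker.connect(viewer)
broker.connect(stats)

# A gene-name selection made in one tool appears in all the others:
broker.broadcast(viewer, "names", ["VNG0101G", "VNG0102C"])
print(stats.inbox)
```

Restricting the vocabulary to a handful of simple data types is what keeps the coupling loose: any new tool that can consume a name list or a matrix can join without knowing anything about the tools already connected.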
As advances in life sciences and information technology bring profound influences on bioinformatics due to its interdisciplinary nature, bioinformatics is experiencing a new leap-forward from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. Albeit relatively new, cloud computing promises to address big data storage and analysis issues in the bioinformatics field. Here we review extant cloud-based services in bioinformatics, classify them into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and present our perspectives on the adoption of cloud computing in bioinformatics.
This article was reviewed by Frank Eisenhaber, Igor Zhulin, and Sandor Pongor.
Cloud computing; Bioinformatics; Big data; Data storage; Data analysis
Transcriptomics, proteomics, and metabolomics are three major platforms of comprehensive omics analysis in the science of food and complementary medicine. Other omics disciplines, including epigenetics and microRNA analysis, are attracting increasing attention. The increased use of the omics approach in food science owes much to recent advances in technology and bioinformatic methodologies. Moreover, many researchers now combine multiple omics analyses (integrated omics) in practice to exhaustively understand the functionality of food components. However, data analysis for integrated omics requires a huge amount of work and considerable skill in data handling. The authors constructed a database of nutritional omics data, which should help food scientists analyze their own omics data more effectively. In addition, the authors' group developed a novel tool for the easy visualization of omics data. The tool enables one to overview changes across multiple omics layers in the KEGG pathway. Research in traditional and complementary medicine will be further facilitated by promoting integrated omics research of food functionality. Such integrated research will only be possible with the effective collaboration of scientists with different backgrounds.
Nutrigenomics; Transcriptomics; Proteomics; Metabolomics; Database
Summary: Systems glycobiology studies the interaction of various pathways that regulate glycan biosynthesis and function. Software tools for the construction and analysis of such pathways are not yet available. We present GNAT, a platform-independent, user-extensible MATLAB-based toolbox that provides an integrated computational environment to construct, manipulate and simulate glycans and their networks. It enables integration of XML-based glycan structure data into SBML (Systems Biology Markup Language) files that describe glycosylation reaction networks. Curation and manipulation of networks is facilitated using class definitions and glycomics database query tools. High quality visualization of networks and their steady-state and dynamic simulation are also supported.
Availability: The software package, including source code, help documentation and demonstrations, is available at http://sourceforge.net/projects/gnatmatlab/files/.
email@example.com or firstname.lastname@example.org