xQTL workbench is a scalable web platform for the mapping of quantitative trait loci (QTLs) at multiple levels: for example gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL) and phenotype (phQTL) data. Popular QTL mapping methods for model organism and human populations are accessible via the web user interface. Large calculations scale easily onto multi-core computers, clusters and the cloud. All data involved can be uploaded and queried online: markers, genotypes, microarrays, NGS, LC-MS, GC-MS, NMR, etc. When new data types become available, xQTL workbench can be quickly customized using the Molgenis software generator.
xQTL workbench runs on all common platforms, including Linux, Mac OS X and Windows. An online demo system, installation guide, tutorials, software and source code are available under the LGPL3 license from http://www.xqtl.org.
XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
The present study examines the structure and covariance of endogenous variation in gene expression across the recently expanded family of C57BL/6J (B) X DBA/2J (D) Recombinant Inbred (BXD RI) strains of mice. This work is accompanied by a highly interactive database that can be used to generate and test specific hypotheses. For example, we define the genetic network regulating growth associated protein 43 (Gap43) and phosphatase tensin homolog (Pten).
The Hamilton Eye Institute (HEI) Retina Database within GeneNetwork features the data analysis of 346 Illumina Sentrix BeadChip Arrays (mouse whole genome-6 version 2). Eighty strains of mice are presented, including 75 BXD RI strains, the parental strains (C57BL/6J and DBA/2J), the reciprocal crosses, and the BALB/cByJ mice. Independent biologic samples for at least two animals from each gender were obtained with a narrow age range (48 to 118 days). Total RNA was prepared followed by the production of biotinylated cRNAs, which were pipetted into the Mouse WG-6V2 arrays. The data was globally normalized with rank invariant and stabilization (2z+8).
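The (2z+8) rescaling mentioned above can be illustrated with a minimal sketch. It assumes the input values for one array have already been log-transformed and rank-invariant normalized; the function name `two_z_plus_eight`, the sample values, and the use of the population standard deviation are assumptions made for illustration, not details taken from the study.

```python
from statistics import mean, pstdev


def two_z_plus_eight(values):
    """Rescale one array's values to z-scores, then map them to 2z + 8.

    After the transform the values have mean 8 and standard deviation 2.
    """
    m = mean(values)
    s = pstdev(values)  # assumption: population SD; the source does not say
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [2 * (v - m) / s + 8 for v in values]


# Hypothetical log-scale intensities for a single array.
example = [6.0, 7.0, 8.0, 9.0, 10.0]
scaled = two_z_plus_eight(example)
```

Under this convention, a difference of one unit corresponds to half a standard deviation on the original scale.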
The HEI Retina Database is located on the GeneNetwork website. The database was used to extract unique transcriptome signatures for specific cell types in the retina (retinal pigment epithelial, amacrine, and retinal ganglion cells). Two genes associated with axonal outgrowth (Gap43 and Pten) were used to display the power of this new retina database. Bioinformatic tools located within GeneNetwork in conjunction with the HEI Retina Database were used to identify the unique signature Quantitative Trait Loci (QTLs) for Gap43 and Pten on chromosomes 1, 2, 12, 15, 16, and 19. Gap43 and Pten possess networks that are similar to ganglion cell networks that may be associated with axonal growth in the mouse retina. This network involves high correlations of transcription factors (SRY sex determining region Y-box 2 [Sox2], paired box gene 6 [Pax6], and neurogenic differentiation 1 [Neurod1]), and genes involved in DNA binding (proliferating cell nuclear antigen [Pcna] and zinc finger, BED-type containing 4 [Zbed4]), as well as an inhibitor of DNA binding (inhibitor of DNA binding 2, dominant negative helix–loop–helix protein [Id2]). Furthermore, we identified the potential upstream modifiers on chromosome 2 (teashirt zinc finger homeobox 2 [Tshz2], RNA export 1 homolog [Rae1] and basic helix–loop–helix domain containing, class B4 [Bhlhb4]) and on chromosome 15 (RAB, member of RAS oncogene family-like 2a [Rabl2a], phosphomannomutase 1 [Pmm1], copine VIII [Cpne8], and fibulin 1 [Fbln1]).
The endogenous variation in mRNA levels among BXD RI strains can be used to explore and test expression networks underlying variation in retina structure, function, and disease susceptibility. The Gap43 and Pten network highlights the covariance of gene expression and forms a molecular network associated with axonal outgrowth in the adult retina.
A common approach to understanding the genetic basis of complex traits is through identification of associated quantitative trait loci (QTL). Fine mapping QTLs requires several generations of backcrosses and analysis of large populations, which is a time-consuming and costly effort. Furthermore, as entire genomes are being sequenced and an increasing amount of genetic and expression data are being generated, a challenge remains: linking phenotypic variation to the underlying genomic variation. To identify candidate genes and understand the molecular basis underlying the phenotypic variation of traits, bioinformatic approaches are needed to exploit information such as genetic map, expression and whole genome sequence data of organisms in biological databases.
The Sol Genomics Network (SGN, http://solgenomics.net) is a primary repository for phenotypic, genetic, genomic, expression and metabolic data for the Solanaceae family and other related Asterids species and houses a variety of bioinformatics tools. SGN has implemented a new approach to QTL data organization, storage, analysis, and cross-links with other relevant data in internal and external databases. The new QTL module, solQTL, http://solgenomics.net/qtl/, employs a user-friendly web interface for uploading raw phenotype and genotype data to the database, R/QTL mapping software for on-the-fly QTL analysis and algorithms for online visualization and cross-referencing of QTLs to relevant datasets and tools such as the SGN Comparative Map Viewer and Genome Browser. Here, we describe the development of the solQTL module and demonstrate its application.
solQTL allows Solanaceae researchers to upload raw genotype and phenotype data to SGN, perform QTL analysis and dynamically cross-link to relevant genetic, expression and genome annotations. Exploration and synthesis of the relevant data is expected to help facilitate identification of candidate genes underlying phenotypic variation and markers more closely linked to QTLs. solQTL is freely available on SGN and can be used in private or public mode.
Quantitative trait loci (QTL) detection on a huge amount of phenotypes, like eQTL detection on transcriptomic data, can be dramatically impaired by the statistical properties of interval mapping methods. One of these major outcomes is the high number of QTL detected at marker locations. The present study aims at identifying and specifying the sources of this bias, in particular in the case of analysis of data issued from outbred populations. Analytical developments were carried out in a backcross situation in order to specify the bias and to propose an algorithm to control it. The outbred population context was studied through simulated data sets in a wide range of situations.
The likelihood ratio test was firstly analyzed under the "one QTL" hypothesis in a backcross population. Designs of sib families were then simulated and analyzed using the QTL Map software. On the basis of the theoretical results in backcross, parameters such as the population size, the density of the genetic map, the QTL effect and the true location of the QTL, were taken into account under the "no QTL" and the "one QTL" hypotheses. A combination of two non parametric tests - the Kolmogorov-Smirnov test and the Mann-Whitney-Wilcoxon test - was used in order to identify the parameters that affected the bias and to specify how much they influenced the estimation of QTL location.
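To make the likelihood ratio test in a backcross concrete, the sketch below computes a LOD score at a single marker. This is a deliberately simplified stand-in: it uses marker regression (testing only at the marker itself) under a normal model, not interval mapping and not the QTL Map software used in the study. The function names and the toy data are hypothetical.

```python
import math


def residual_sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def marker_lod(phenotypes, genotypes):
    """LOD score for 'one QTL at this marker' versus 'no QTL' in a backcross.

    genotypes holds 0 or 1 for the two backcross genotype classes.
    Under a normal model, LOD = (n / 2) * log10(RSS_null / RSS_alt).
    """
    group0 = [y for y, g in zip(phenotypes, genotypes) if g == 0]
    group1 = [y for y, g in zip(phenotypes, genotypes) if g == 1]
    if not group0 or not group1:
        raise ValueError("both genotype classes must be observed")
    rss_null = residual_sum_of_squares(phenotypes)  # one common mean
    rss_alt = residual_sum_of_squares(group0) + residual_sum_of_squares(group1)
    if rss_alt == 0:
        raise ValueError("perfect fit; LOD is unbounded")
    return (len(phenotypes) / 2) * math.log10(rss_null / rss_alt)


# Toy data (hypothetical): the marker separates low and high phenotypes.
phenotypes = [1.0, 1.2, 0.8, 2.0, 2.2, 1.8]
linked = marker_lod(phenotypes, [0, 0, 0, 1, 1, 1])
unlinked = marker_lod(phenotypes, [0, 1, 0, 1, 0, 1])
```

In interval mapping the same ratio is evaluated at positions between markers, with genotype probabilities inferred from the flanking markers; that step is where the location bias studied here arises.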
A theoretical expression of the bias of the estimated QTL location was obtained for a backcross type population. We demonstrated a common source of bias under the "no QTL" and the "one QTL" hypotheses and qualified the possible influence of several parameters. Simulation studies confirmed that the bias exists in outbred populations under both the hypotheses of "no QTL" and "one QTL" on a linkage group. The QTL location was systematically closer to marker locations than expected, particularly in the case of low QTL effect, small population size or low density of markers, i.e. designs with low power. Practical recommendations for experimental designs for QTL detection in outbred populations are given on the basis of this bias quantification. Furthermore, an original algorithm is proposed to adjust the location of a QTL, obtained with interval mapping, that co-locates with a marker.
Therefore, one should be attentive when one QTL is mapped at the location of one marker, especially under low power conditions.
QTL; linkage analysis; QTL location; bias
Here, we present WormQTL (http://www.wormqtl.org), an easily accessible database enabling search, comparative analysis and meta-analysis of all data on variation in Caenorhabditis spp. Over the past decade, Caenorhabditis elegans has become instrumental for molecular quantitative genetics and the systems biology of natural variation. These efforts have resulted in a valuable amount of phenotypic, high-throughput molecular and genotypic data across different developmental worm stages and environments in hundreds of C. elegans strains. WormQTL provides a workbench of analysis tools for genotype–phenotype linkage and association mapping based on but not limited to R/qtl (http://www.rqtl.org). All data can be uploaded and downloaded using simple delimited text or Excel formats and are accessible via a public web user interface for biologists and R statistic and web service interfaces for bioinformaticians, based on open source MOLGENIS and xQTL workbench software. WormQTL welcomes data submissions from other worm researchers.
A typical genetical genomics experiment results in four separate data sets; genotype, gene expression, higher-order phenotypic data and metadata that describe the protocols, processing and the array platform. Used in concert, these data sets provide the opportunity to perform genetic analysis at a systems level. Their predictive power is largely determined by the gene expression dataset where tens of millions of data points can be generated using currently available mRNA profiling technologies. Such large, multidimensional data sets often have value beyond that extracted during their initial analysis and interpretation, particularly if conducted on widely distributed reference genetic materials. Besides quality and scale, access to the data is of primary importance as accessibility potentially allows the extraction of considerable added value from the same primary dataset by the wider research community. Although the number of genetical genomics experiments in different plant species is rapidly increasing, none to date has been presented in a form that allows quick and efficient on-line testing for possible associations between genes, loci and traits of interest by an entire research community.
Using a reference population of 150 recombinant doubled haploid barley lines, we generated novel phenotypic, mRNA abundance and SNP-based genotyping data sets, added them to a considerable volume of legacy trait data and entered them into GeneNetwork. GeneNetwork is a unified on-line analytical environment that enables the user to test genetic hypotheses about how component traits, such as mRNA abundance, may interact to condition more complex biological phenotypes (higher-order traits). Here we describe these barley data sets and demonstrate some of the functionalities GeneNetwork provides as an easily accessible and integrated analytical environment for exploring them.
By integrating barley genotypic, phenotypic and mRNA abundance data sets directly within GeneNetwork's analytical environment we provide simple web access to the data for the research community. In this environment, a combination of correlation analysis and linkage mapping provides the potential to identify and substantiate gene targets for saturation mapping and positional cloning. By integrating datasets from an unsequenced crop plant (barley) in a database that has been designed for an animal model species (mouse) with a well established genome sequence, we prove the importance of the concept and practice of modular development and interoperability of software engineering for biological data sets.
New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed.
We developed ProteoLens as a Java-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Language (DDL) and Data Manipulation Language (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in the standard Graph Modeling Language (GML) format, and this enables interoperation with a wide range of other visual layout tools. The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions, etc.; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: a human disease association network, a drug-target interaction network and a protein-peptide mapping network.
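The two-phase design described above can be sketched in a few lines. This is an illustration of the idea only, not ProteoLens code: it uses Python's built-in `sqlite3` as a stand-in for Oracle or PostgreSQL, and the table names, column names, gene identifiers and the `build_network` helper are all hypothetical.

```python
import sqlite3

# An in-memory database standing in for the user's relational store.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE interaction (source TEXT, target TEXT);
    CREATE TABLE expression (gene TEXT PRIMARY KEY, level REAL);
    INSERT INTO interaction VALUES ('GENE_A', 'GENE_B'), ('GENE_B', 'GENE_C');
    INSERT INTO expression VALUES ('GENE_A', 1.5), ('GENE_B', -0.7), ('GENE_C', 2.1);
    """
)

# Phase 1: declare association rules as SQL. One query yields the edges,
# the other maps node IDs to a data attribute (here, an expression level).
edge_query = "SELECT source, target FROM interaction ORDER BY source, target"
node_rule = "SELECT gene, level FROM expression"


def build_network(connection, edge_sql, node_sql):
    """Phase 2: apply the rules to build an annotated node/edge structure."""
    edges = [tuple(row) for row in connection.execute(edge_sql)]
    annotations = dict(connection.execute(node_sql))  # node ID -> attribute
    nodes = {}
    for source, target in edges:
        for node_id in (source, target):
            # Nodes without a matching annotation row get None.
            nodes[node_id] = {"level": annotations.get(node_id)}
    return nodes, edges


nodes, edges = build_network(conn, edge_query, node_rule)
```

Keeping the rules as plain SQL means the same network can be re-annotated with a different attribute by swapping only the `node_rule` query.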
The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
Interactions between proteins are highly conserved across species. As a result, the molecular basis of multiple diseases affecting humans can be studied in model organisms that offer many alternative experimental opportunities. One such organism, Caenorhabditis elegans, has been used to produce much molecular quantitative genetics and systems biology data over the past decade. We present WormQTLHD (Human Disease), a database that quantitatively and systematically links expression Quantitative Trait Loci (eQTL) findings in C. elegans to gene–disease associations in man. WormQTLHD, available online at http://www.wormqtl-hd.org, is a user-friendly set of tools to reveal functionally coherent, evolutionarily conserved gene networks. These can be used to predict novel gene-to-gene associations and the functions of genes underlying the disease of interest. We created a new database that links C. elegans eQTL data sets to human diseases (34 337 gene–disease associations from OMIM, DGA, GWAS Central and NHGRI GWAS Catalogue) based on overlapping sets of orthologous genes associated to phenotypes in these two species. We utilized QTL results, high-throughput molecular phenotypes, classical phenotypes and genotype data covering different developmental stages and environments from the WormQTL database. All software is available as open source, built on MOLGENIS and xQTL workbench.
Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as of microarrays, of metabolic networks and of predicted protein structure. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing.
In this paper we describe the Gaggle, a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily-extended environment.
We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein, a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools.
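The four data types can be pictured with a small sketch. The Gaggle itself is written in Java; the Python classes below are only an illustration of how a shared list of names lets otherwise unrelated tools exchange selections. All class names, function names and example identifiers are hypothetical and are not part of the Gaggle API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class NameList:
    """A plain list of identifiers, e.g. gene names."""
    names: List[str]


@dataclass
class Matrix:
    """Numeric values with named rows and columns, e.g. expression data."""
    row_names: List[str]
    column_names: List[str]
    values: List[List[float]]


@dataclass
class Network:
    """Pairwise relationships between named nodes."""
    edges: List[Tuple[str, str]]

    def nodes(self) -> Set[str]:
        return {name for edge in self.edges for name in edge}


# Key/value annotations, e.g. gene name -> description.
AssociativeArray = Dict[str, str]


def select_rows(matrix: Matrix, selection: NameList) -> Matrix:
    """What a matrix viewer might do when it receives a name list."""
    wanted = set(selection.names)
    kept = [(n, row) for n, row in zip(matrix.row_names, matrix.values) if n in wanted]
    return Matrix([n for n, _ in kept], matrix.column_names, [row for _, row in kept])


def neighbors(network: Network, selection: NameList) -> NameList:
    """What a network viewer might send back: partners of the selected names."""
    wanted = set(selection.names)
    found = set()
    for a, b in network.edges:
        if a in wanted:
            found.add(b)
        if b in wanted:
            found.add(a)
    return NameList(sorted(found - wanted))


expression = Matrix(["geneA", "geneB", "geneC"], ["t0", "t1"],
                    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
interactions = Network([("geneA", "geneB"), ("geneB", "geneC")])
selection = NameList(["geneB"])

sub_matrix = select_rows(expression, selection)
partners = neighbors(interactions, selection)
```

Because each tool only needs to understand these few types, adding a new tool does not require changes to the existing ones.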
We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and Pubmed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies and can be found at .
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
BioHackathon; Bioinformatics; Semantic Web; Web services; Ontology; Visualization; Knowledge representation; Databases; Semantic interoperability; Data models; Data sharing; Data integration
Computational Quantum Chemistry has developed into a powerful, efficient, reliable and increasingly routine tool for exploring the structure and properties of small to medium sized molecules. Many thousands of calculations are performed every day, some offering results which approach experimental accuracy. However, in contrast to other disciplines, such as crystallography, or bioinformatics, where standard formats and well-known, unified databases exist, this QC data is generally destined to remain locally held in files which are not designed to be machine-readable. Only a very small subset of these results will become accessible to the wider community through publication.
In this paper we describe how the Quixote Project is developing the infrastructure required to convert output from a number of different molecular quantum chemistry packages to a common semantically rich, machine-readable format and to build repositories of QC results. Such an infrastructure offers benefits at many levels. The standardised representation of the results will facilitate software interoperability, for example making it easier for analysis tools to take data from different QC packages, and will also help with archival and deposition of results. The repository infrastructure, which is lightweight and built using Open software components, can be implemented at individual researcher, project, organisation or community level, offering the exciting possibility that in future many of these QC results can be made publicly available, to be searched and interpreted just as crystallography and bioinformatics results are today.
Although we believe that quantum chemists will appreciate the contribution the Quixote infrastructure can make to the organisation and exchange of their results, we anticipate that greater rewards will come from enabling their results to be consumed by a wider community. As the repositories grow they will become a valuable source of chemical data for use by other disciplines in both research and education.
The Quixote project is unconventional in that the infrastructure is being implemented in advance of a full definition of the data model which will eventually underpin it. We believe that a working system which offers real value to researchers based on tools and shared, searchable repositories will encourage early participation from a broader community, including both producers and consumers of data. In the early stages, searching and indexing can be performed on the chemical subject of the calculations, and well defined calculation meta-data. The process of defining more specific quantum chemical definitions, adding them to dictionaries and extracting them consistently from the results of the various software packages can then proceed in an incremental manner, adding additional value at each stage.
Not only will these results help to change the data management model in the field of Quantum Chemistry, but the methodology can be applied to other pressing problems related to data in computational and experimental science.
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and a VirtualBox Appliance are also publicly available for download and use by researchers with access to private clouds.
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
Dense marker maps require efficient statistical methods for QTL fine mapping that work fast and efficiently with a large number of markers. In this study, the simulated dataset for the XIIth QTLMAS workshop was analyzed using a QTL fine mapping set of tools.
The QTL fine-mapping strategy was based on the use of statistical methods combining linkage and linkage disequilibrium analysis. Variance component based linkage analysis provided confidence intervals for the QTL. Within these regions, two additional analyses combining both linkage analysis and linkage disequilibrium information were applied. The first method estimated identity-by-descent probabilities among base haplotypes that were used to group them in different clusters. The second method constructed haplotype groups based on identity-by-state probabilities.
Two QTL explaining 9.4% and 3.3% of the genetic variance were found with high significance on chromosome 1 at positions 19.5 and 76.6 cM. On chromosome 2, two QTL were also detected at positions 26.0 and 53.2 cM, explaining respectively 9.0% and 7.8% of total genetic variance. The QTL detected on chromosome 3 at position 11.9 cM (5% of variance) was less important. The QTL with the highest effect (37% of variance) was detected on chromosome 4 at position 3.1 cM and another QTL (13.6% of variance) was detected on chromosome 5 at position 93.9 cM.
The proposed strategy for fine-mapping of QTL combining linkage and linkage disequilibrium analysis allowed detecting the most important QTL with an additive effect in a short period but it should be extended in the future in order to fine-map linked and epistatic QTL.
Matlab is one of the most advanced development tools for applications in engineering practice. From our point of view, its most important component is the image processing toolbox, which offers many built-in functions, including mathematical morphology, together with implementations of many artificial neural networks as AI tools. It is a very popular platform for creating specialized programs for image analysis, including in pathology. With the latest version of the Matlab Builder Java toolbox, it is possible to create software that serves as a remote system for image analysis in pathology via internet communication. The internet platform can be realized with JavaServer Pages (JSP) and a Tomcat server as the servlet container.
In the presented software implementation, we propose remote image analysis performed by Matlab algorithms. These algorithms can be compiled into an executable jar file with the help of the Matlab Builder Java toolbox. The Matlab function must be declared with a set of input data, an output structure with numerical results, and a Matlab web figure. Any function prepared in this manner can be used as a Java function in JavaServer Pages (JSP). The graphical user interface that provides the input data and displays the results (also in graphical form) must be implemented in JSP. Additionally, data storage to a database can be implemented within the Matlab algorithm, alongside the image processing, with the help of the Matlab Database Toolbox. The complete JSP page can be run by a Tomcat server.
The proposed tool for remote image analysis was tested on the Computerized Analysis of Medical Images (CAMI) software developed by the author. The user provides the image and case information (diagnosis, staining, image parameters, etc.). When analysis is initialized, the input data and image are sent to a servlet on Tomcat. When analysis is done, the client obtains the graphical results as an image with the recognized cells marked, together with the quantitative output. Additionally, the results are stored in a server database. The internet platform was tested on a PC server (Intel Core2 Duo T9600, 2.8 GHz, 4 GB RAM) with 768x576 pixel, 1.28 Mb tiff format images of meningioma tumour (x400, Ki-67/MIB-1). The time consumption was as follows: analysis by CAMI locally on the server took 3.5 seconds, and remote analysis took 26 seconds, of which 22 seconds were used for data transfer via the internet connection. With a jpg format image (102 Kb), the time was reduced to 14 seconds.
The results have confirmed that the designed remote platform can be useful for pathology image analysis. The time consumption depends mainly on the image size and the speed of the internet connection. The presented implementation can be used for many types of analysis with different staining, tissue, morphometry approaches, etc. A significant remaining problem is the implementation of the JSP page in a multithreaded form that can be used in parallel by many users. The presented platform for image analysis in pathology can be especially useful for small laboratories without their own image analysis system.
Natural variation in the absolute and relative size of different parts of the human brain is substantial, with a range that often exceeds a factor of two. Much of this variation is generated by the cumulative effects of sets of unknown gene variants that modulate the proliferation, growth, and death of neurons and glial cells. Discovering and testing the functions of these genes should contribute significantly to our understanding of differences in brain development, behavior, and disease susceptibility. We have exploited a large population of genetically well-characterized strains of mice (BXD recombinant inbred strains) to map gene variants that influence the volume of the dorsal striatum (caudate-putamen without nucleus accumbens). We used unbiased methods to estimate volumes bilaterally in a sex-balanced sample taken from the Mouse Brain Library (www.mbl.org). We generated a matched microarray data set to efficiently evaluate candidate genes (www.genenetwork.org). As in humans, volume of the striatum is highly heritable, with greater than two-fold differences among strains. We mapped a locus that modulates striatal volume on chromosome (Chr) 6 at 88 ± 5 Mb. We also uncovered an epistatic interaction between loci on Chr 6 and Chr 17 that modulates striatal volume. Using bioinformatic tools and the corresponding expression database, we have identified positional candidates in these QTL intervals.
Striatum; Quantitative Trait Loci; Systems Genetics; Stereology; Point Counting
Flow cytometry bioinformatics is the application of bioinformatics to flow cytometry data, which involves storing, retrieving, organizing, and analyzing flow cytometry data using extensive computational resources and tools. Flow cytometry bioinformatics requires extensive use of and contributes to the development of techniques from computational statistics and machine learning. Flow cytometry and related methods allow the quantification of multiple independent biomarkers on large numbers of single cells. The rapid growth in the multidimensionality and throughput of flow cytometry data, particularly in the 2000s, has led to the creation of a variety of computational analysis methods, data standards, and public databases for the sharing of results. Computational methods exist to assist in the preprocessing of flow cytometry data, identifying cell populations within it, matching those cell populations across samples, and performing diagnosis and discovery using the results of previous steps. For preprocessing, this includes compensating for spectral overlap, transforming data onto scales conducive to visualization and analysis, assessing data for quality, and normalizing data across samples and experiments. For population identification, tools are available to aid traditional manual identification of populations in two-dimensional scatter plots (gating), to use dimensionality reduction to aid gating, and to find populations automatically in higher dimensional space in a variety of ways. It is also possible to characterize data in more comprehensive ways, such as the density-guided binary space partitioning technique known as probability binning, or by combinatorial gating. Finally, diagnosis using flow cytometry data can be aided by supervised learning techniques, and discovery of new cell types of biological importance by high-throughput statistical methods, as part of pipelines incorporating all of the aforementioned methods.
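The binning step of probability binning, mentioned above as a density-guided binary space partitioning technique, can be sketched as follows. The sketch assumes the commonly described variant in which each split is made at the median of the dimension with the largest variance, so that every bin ends up with roughly the same number of events; the function name, the `depth` parameter and the synthetic events are hypothetical, and the later step of comparing a test sample against the bins is omitted.

```python
from statistics import pvariance


def probability_bins(events, depth):
    """Recursively partition events into up to 2**depth near-equal bins.

    events is a list of equal-length tuples (one value per measured channel).
    Each split halves the events along the channel with the largest variance,
    so dense regions receive smaller bins and sparse regions larger ones.
    """
    if depth == 0 or len(events) < 2:
        return [events]
    dimensions = range(len(events[0]))
    split_dim = max(dimensions, key=lambda d: pvariance([e[d] for e in events]))
    ordered = sorted(events, key=lambda e: e[split_dim])
    half = len(ordered) // 2  # median split: equal counts on each side
    return (probability_bins(ordered[:half], depth - 1)
            + probability_bins(ordered[half:], depth - 1))


# Synthetic two-channel events (hypothetical values, not real cytometry data).
control_events = [(i, (i * 7) % 16) for i in range(16)]
bins = probability_bins(control_events, depth=2)
bin_sizes = [len(b) for b in bins]
```

Because the bins hold equal numbers of control events, a test sample drawn from the same distribution is expected to fill them evenly, and deviations from that can be scored bin by bin.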
Open standards, data, and software are also key parts of flow cytometry bioinformatics. Data standards include the widely adopted Flow Cytometry Standard (FCS) defining how data from cytometers should be stored, but also several new standards under development by the International Society for Advancement of Cytometry (ISAC) to aid in storing more detailed information about experimental design and analytical steps. Open data is slowly growing with the opening of the CytoBank database in 2010 and FlowRepository in 2012, both of which allow users to freely distribute their data, and the latter of which has been recommended as the preferred repository for MIFlowCyt-compliant data by ISAC. Open software is most widely available in the form of a suite of Bioconductor packages, but is also available for web execution on the GenePattern platform.
The Cancer Bioinformatics Grid (caBIG™) program was created by the National Cancer Institute to facilitate sharing of IT infrastructure, data, and applications among the National Cancer Institute-sponsored cancer research centers. The program was launched in February 2004 and now links more than 50 cancer centers. In April 2005, the In Vivo Imaging Workspace was added to promote the use of imaging in cancer clinical trials. At the inaugural meeting, four special interest groups (SIGs) were established. The Software SIG was charged with identifying projects that focus on open-source software for image visualization and analysis. To date, two projects have been defined by the Software SIG. The eXtensible Imaging Platform project has produced a rapid application development environment that researchers may use to create targeted workflows customized for specific research projects. The Algorithm Validation Tools project will provide a set of tools and data structures that will be used to capture measurement information and the associated data needed to allow a gold standard to be defined for the given database against which change analysis algorithms can be tested. Through these and future efforts, the caBIG™ In Vivo Imaging Workspace Software SIG endeavors to advance imaging informatics and provide new open-source software tools to advance cancer research.
Open source; digital imaging and communications in medicine (DICOM); grid computing; image analysis; imaging informatics; caBIG; XIP; AVT
Association analysis using genome-wide expression quantitative trait locus (eQTL) data investigates the effect that genetic variation has on cellular pathways and leads to the discovery of candidate regulators. Traditional analysis of eQTL data via pairwise statistical significance tests or linear regression does not leverage the availability of the structural information of the transcriptome, such as presence of gene networks that reveal correlation and potentially regulatory relationships among the study genes. We employ a new eQTL mapping algorithm, GFlasso, which we have previously developed for sparse structured regression, to reanalyze a genome-wide yeast dataset. GFlasso fully takes into account the dependencies among expression traits to suppress false positives and to enhance the signal/noise ratio. Thus, GFlasso leverages the gene-interaction network to discover the pleiotropic effects of genetic loci that perturb the expression level of multiple (rather than individual) genes, which enables us to gain more power in detecting previously neglected signals that are marginally weak but pleiotropically significant.
While eQTL hotspots in yeast have been reported previously as genomic regions controlling multiple genes, our analysis reveals additional novel eQTL hotspots and, more interestingly, uncovers groups of multiple contributing eQTL hotspots that affect the expression level of functional gene modules. To our knowledge, our study is the first to report this type of gene regulation stemming from multiple eQTL hotspots. Additionally, we report the results from in-depth bioinformatics analysis for three groups of these eQTL hotspots: ribosome biogenesis, telomere silencing, and retrotransposon biology. We suggest candidate regulators for the functional gene modules that map to each group of hotspots. Not only do we find that many of these candidate regulators contain mutations in the promoter and coding regions of the genes, but in the case of the Ribi group, we also provide experimental evidence suggesting that the identified candidates do regulate the target genes predicted by GFlasso.
Thus, this structured association analysis of a yeast eQTL dataset via GFlasso, coupled with extensive bioinformatics analysis, discovers a novel regulation pattern between multiple eQTL hotspots and functional gene modules. Furthermore, this analysis demonstrates the potential of GFlasso as a powerful computational tool for eQTL studies that exploit the rich structural information among expression traits due to correlation, regulation, or other forms of biological dependencies.
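The idea of letting a gene network guide sparse regression can be made concrete with a small sketch of a graph-guided fused lasso style objective. The exact penalty used by GFlasso is not given in this text, so the form below is an assumption: squared error, plus an L1 sparsity term, plus a fusion term that pulls the coefficients of two network-connected traits toward each other, weighted here by the absolute edge correlation. All names, data, and tuning values are hypothetical.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def gflasso_style_objective(X, Y, B, edges, lam, gamma):
    """Evaluate an assumed graph-guided fused lasso objective.

    X: samples x markers genotype matrix (nested lists)
    Y: samples x traits expression matrix
    B: markers x traits coefficient matrix
    edges: list of (trait_m, trait_l, correlation) tuples from the trait network
    lam, gamma: weights of the sparsity and fusion penalties
    """
    n_samples, n_markers, n_traits = len(X), len(B), len(B[0])

    # Residual sum of squares over all samples and traits.
    squared_error = 0.0
    for i in range(n_samples):
        for k in range(n_traits):
            predicted = sum(X[i][j] * B[j][k] for j in range(n_markers))
            squared_error += (Y[i][k] - predicted) ** 2

    # Lasso term: encourages few marker-trait associations overall.
    sparsity = sum(abs(B[j][k]) for j in range(n_markers) for k in range(n_traits))

    # Fusion term: connected traits are encouraged to share marker effects.
    fusion = 0.0
    for m, l, r in edges:
        fusion += abs(r) * sum(
            abs(B[j][m] - sign(r) * B[j][l]) for j in range(n_markers)
        )

    return squared_error + lam * sparsity + gamma * fusion


# Toy data: marker 0 drives both traits, which are connected in the network.
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 1.0], [0.0, 0.0]]
edges = [(0, 1, 0.8)]

shared_effect = [[1.0, 1.0], [0.0, 0.0]]   # marker 0 affects both traits
single_effect = [[1.0, 0.0], [0.0, 0.0]]   # marker 0 affects only trait 0

score_shared = gflasso_style_objective(X, Y, shared_effect, edges, lam=0.5, gamma=2.0)
score_single = gflasso_style_objective(X, Y, single_effect, edges, lam=0.5, gamma=2.0)
```

On this toy input the coefficient matrix with a shared (pleiotropic) effect scores lower than the one that assigns the marker to a single trait. That is the behaviour a network-guided penalty is meant to reward. The sketch only evaluates the objective and does not optimize it.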
To identify genetic loci that regulate spontaneous arthritis in interleukin-1 receptor antagonist (IL-1ra)-deficient mice, an F2 population was created from a cross between Balb/c IL-1ra-deficient mice and DBA/1 IL-1ra-deficient mice. Spontaneous arthritis in the F2 population was examined and recorded. Genotypes of those F2 mice were determined using microsatellite markers. Quantitative trait locus (QTL) analysis was conducted with R/qtlbim. Functions of genes within QTL chromosomal regions were evaluated using a bioinformatics tool, PGMapper, and microarray analysis. Potential candidate genes were further evaluated using GeneNetwork. A total of 137 microsatellite markers with an average of 12 cM spacing along the whole genome were used for determining the correlation of arthritis phenotypes with genotypes of 191 F2 progeny. By whole-genome mapping, we obtained QTLs on chromosomes 1 and 6 that were above the significance threshold for strong Bayesian evidence. The QTL on chromosome 1 had a peak near D1Mit55 and D1Mit425 at 82.6 cM. It may account for as much as 12% of the phenotypic variation in susceptibility to spontaneous arthritis. The QTL region contained 208 known transcripts. According to their functions, Mr1, Pla2g4a and Fasl are outstanding candidate genes. From microarray analysis, 11 genes were selected as favourable candidates based on their function and expression profiles. Three of those 11 genes, Prg4, Ptgs2 and Mr1, correlated with the IL-1ra pathway. Those genes were considered to be the best candidates.
The Generation Challenge Programme (GCP) is a global crop research consortium directed toward crop improvement through the application of comparative biology and genetic resources characterization to plant breeding. A key consortium research activity is the development of a GCP crop bioinformatics platform to support GCP research. This platform includes the following: (i) shared, public platform-independent domain models, ontology, and data formats to enable interoperability of data and analysis flows within the platform; (ii) web service and registry technologies to identify, share, and integrate information across diverse, globally dispersed data sources, as well as to access high-performance computational (HPC) facilities for computationally intensive, high-throughput analyses of project data; (iii) platform-specific middleware reference implementations of the domain model integrating a suite of public (largely open-access/-source) databases and software tools into a workbench to facilitate biodiversity analysis, comparative analysis of crop genomic data, and plant breeding decision making.
Explosive growth of next-generation sequencing data has resulted in ultra-large-scale data sets and ensuing computational problems. Cloud computing provides an on-demand and scalable environment for large-scale data analysis. Using a MapReduce framework, data and workload can be distributed via a network to computers in the cloud to substantially reduce computational latency. Hadoop/MapReduce has been successfully adopted in bioinformatics for genome assembly, mapping reads to genomes, and finding single nucleotide polymorphisms. Major cloud providers offer Hadoop cloud services to their users. However, it remains technically challenging to deploy a Hadoop cloud for those who prefer to run MapReduce programs in a cluster without built-in Hadoop/MapReduce.
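The MapReduce pattern mentioned above can be sketched in a few lines. The example counts k-mers in sequencing reads using explicit map, shuffle, and reduce phases. It runs in a single process and does not use Hadoop. The function names and reads are illustrative assumptions, and in a real deployment the map and reduce phases would run on different nodes.

```python
from collections import defaultdict


def map_phase(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single count."""
    return key, sum(values)


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = (pair for read in reads for pair in map_phase(read, k))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical reads, for illustration only.
reads = ["ACGTAC", "GTACGT"]
kmer_counts = count_kmers(reads, 3)
```

Because each read is mapped independently and each key is reduced independently, both phases can be spread across machines. This independence is what allows a framework such as Hadoop to distribute the workload.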
We present CloudDOE, a platform-independent software package implemented in Java. CloudDOE encapsulates technical details behind a user-friendly graphical interface, thus liberating scientists from having to perform complicated operational procedures. Users are guided through the user interface to deploy a Hadoop cloud within in-house computing environments and to run applications specifically targeted for bioinformatics, including CloudBurst, CloudBrush, and CloudRS. One may also use CloudDOE on top of a public cloud. CloudDOE consists of three wizards: Deploy, Operate, and Extend. The Deploy wizard helps the system administrator deploy a Hadoop cloud. It installs Java runtime environment version 1.6 and Hadoop version 0.20.203, and starts the service automatically. The Operate wizard allows the user to run a MapReduce application on the dashboard list. To extend the dashboard list, the administrator may install a new MapReduce application using the Extend wizard.
CloudDOE is a user-friendly tool for deploying a Hadoop cloud. Its smart wizards substantially reduce the complexity and costs of deployment, execution, enhancement, and management. Interested users may collaborate to improve the source code of CloudDOE to further incorporate more MapReduce bioinformatics tools into CloudDOE and support next-generation big data open source tools, e.g., Hadoop BigTop and Spark. Availability: CloudDOE is distributed under Apache License 2.0 and is freely available at http://clouddoe.iis.sinica.edu.tw/.
There is a huge demand on bioinformaticians to provide their biologists with user-friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. Here we present MOLGENIS, a generic, open source software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed.
The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS’ generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, JAVA, R, or HTML code that would require much effort to write by hand. This ‘model-driven’ method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software.
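The model-driven approach described above can be illustrated with a toy generator that turns a small data model into SQL and a tab-delimited file header. The XML element names, the type mapping, and the functions below are hypothetical. They show the general idea only and are not the MOLGENIS modeling language or its generators.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: one entity with three typed fields.
MODEL_XML = """
<model>
  <entity name="Marker">
    <field name="name" type="string"/>
    <field name="chromosome" type="string"/>
    <field name="position_cm" type="decimal"/>
  </entity>
</model>
"""

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {
    "string": "VARCHAR(255)",
    "int": "INTEGER",
    "decimal": "DOUBLE PRECISION",
}


def generate_sql(model_xml):
    """Generate one CREATE TABLE statement per entity in the model."""
    root = ET.fromstring(model_xml.strip())
    statements = []
    for entity in root.findall("entity"):
        columns = ["id INTEGER PRIMARY KEY"]
        for field in entity.findall("field"):
            columns.append(f"{field.get('name')} {SQL_TYPES[field.get('type')]}")
        body = ",\n  ".join(columns)
        statements.append(f"CREATE TABLE {entity.get('name')} (\n  {body}\n);")
    return statements


def generate_tsv_headers(model_xml):
    """Generate the tab-delimited header line for each entity's exchange file."""
    root = ET.fromstring(model_xml.strip())
    return {
        entity.get("name"): "\t".join(f.get("name") for f in entity.findall("field"))
        for entity in root.findall("entity")
    }


sql_statements = generate_sql(MODEL_XML)
tsv_headers = generate_tsv_headers(MODEL_XML)
```

Both outputs derive from the same model, so adding a field to the model and re-running the generators keeps the database schema and the exchange format in step. This is the reuse benefit that the model-driven method aims for.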
In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist’s satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS generated interfaces using the ‘ExtractModel’ procedure.
The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
Expression Quantitative Trait Locus (eQTL) analysis is a powerful tool to study the biological mechanisms linking the genotype with gene expression. Such analyses can identify genomic locations where genotypic variants influence the expression of genes, both in close proximity to the variant (cis-eQTL), and on other chromosomes (trans-eQTL). Many traditional eQTL methods are based on a linear regression model. In this study, we propose a novel method to identify eQTL associations with information theory and machine learning approaches. Mutual Information (MI) is used to describe the association between genetic marker and gene expression. MI can detect both linear and non-linear associations. Moreover, it can capture the heterogeneity of the population. Advanced feature selection methods, Maximum Relevance Minimum Redundancy (mRMR) and Incremental Feature Selection (IFS), were applied to optimize the selection of the genes affected by the genetic marker. When we applied our method to a study of apoE-deficient mice, we found that the cis-acting eQTLs are stronger than trans-acting eQTLs but there are more trans-acting eQTLs than cis-acting eQTLs. We compared our results (mRMR.eQTL) with R/qtl and MatrixEQTL (modelLINEAR and modelANOVA). In female mice, 67.9% of mRMR.eQTL results can be confirmed by at least two other methods while only 14.4% of R/qtl results can be confirmed by at least two other methods. In male mice, 74.1% of mRMR.eQTL results can be confirmed by at least two other methods while only 18.2% of R/qtl results can be confirmed by at least two other methods. Our methods provide a new way to identify the association between genetic markers and gene expression. Our software is available in the supporting information.
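The claim that mutual information can detect non-linear marker-expression associations can be illustrated with a short sketch. It assumes genotypes coded 0/1/2 and expression values discretized into equal-width bins. The binning scheme, function names, and data are hypothetical and do not reproduce the mRMR.eQTL software.

```python
import math
from collections import Counter


def discretize(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins (an assumed scheme)."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    if width == 0:
        return [0 for _ in values]
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences of equal length."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), joint in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (joint / n) * math.log2(joint * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical marker with a U-shaped effect: heterozygotes have low expression.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [5.0, 5.1, 4.9, 1.0, 1.1, 0.9, 5.0, 5.2, 4.8]

mi_bits = mutual_information(genotypes, discretize(expression))
```

A straight-line fit of expression on genotype has a slope close to zero for these data, because the two homozygous groups have nearly the same mean. The mutual information is nevertheless clearly positive, since knowing the genotype largely determines the expression bin.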
Many studies have found segregating quantitative trait loci (QTL) for milk production traits in different dairy cattle populations. However, even for relatively large effects with a saturated marker map the confidence interval for QTL location by linkage analysis spans tens of map units, or hundreds of genes. Combining mapping and arraying has been suggested as an approach to identify candidate genes. Thus, gene expression analysis in the mammary gland of genes positioned in the confidence interval of the QTL can bridge the gap between fine mapping and quantitative trait nucleotide (QTN) determination.
We hybridized Affymetrix microarrays (MG-U74v2), containing 12,488 murine probes, with RNA derived from mammary gland of virgin, pregnant, lactating and involuting C57BL/6J mice in a total of nine biological replicates. We combined microarray data from two additional studies that used the same design in mice with a total of 75 biological replicates. The same filtering and normalization were applied to each microarray dataset using GeneSpring software. Analysis of variance identified 249 differentially expressed probe sets common to the three experiments along the four developmental stages of puberty, pregnancy, lactation and involution. 212 genes were assigned to their bovine map positions through comparative mapping, and thus form a list of candidate genes for previously identified QTLs for milk production traits. A total of 82 of the genes showed mammary gland-specific expression with at least 3-fold expression over the median representing all tissues tested in GeneAtlas.
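The tissue-specificity criterion above, at least 3-fold expression over the median across all tissues, can be written as a small filter. The sketch assumes the median is taken over every tissue in the atlas, including the target tissue. The gene labels and expression values are invented for illustration and are not data from the study.

```python
from statistics import median


def fold_over_median(expression_by_tissue, target_tissue):
    """Ratio of expression in the target tissue to the median over all tissues."""
    baseline = median(expression_by_tissue.values())
    if baseline <= 0:
        raise ValueError("median expression must be positive to compute a fold change")
    return expression_by_tissue[target_tissue] / baseline


def is_tissue_specific(expression_by_tissue, target_tissue, min_fold=3.0):
    """Apply the fold-over-median threshold (3-fold, as in the text)."""
    return fold_over_median(expression_by_tissue, target_tissue) >= min_fold


# Hypothetical atlas: expression values per tissue for two made-up genes.
atlas = {
    "geneA": {"mammary_gland": 900.0, "liver": 100.0, "brain": 120.0,
              "heart": 80.0, "kidney": 110.0},
    "geneB": {"mammary_gland": 150.0, "liver": 140.0, "brain": 100.0,
              "heart": 160.0, "kidney": 130.0},
}

specific_genes = [
    gene for gene, profile in atlas.items()
    if is_tissue_specific(profile, "mammary_gland")
]
```

Using the median rather than the mean as the baseline keeps a single very high tissue value from inflating the reference level. This matters here because the gene of interest is expected to be an outlier in the target tissue.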
This work presents a web tool for candidate genes for QTL (cgQTL) that allows navigation between the map of bovine milk production QTL, potential candidate genes and their level of expression in mammary gland arrays and in GeneAtlas. Three out of four confirmed genes that affect QTL in livestock (ABCG2, DGAT1, GDF8, IGF2) were overexpressed in the target organ. Thus, cgQTL can be used to prioritize candidate genes for QTN analysis based on differential expression in the target organ.