Related Articles
Swertz, Morris A | Velde, K Joeri van der | Tesson, Bruno M | Scheltema, Richard A | Arends, Danny | Vera, Gonzalo | Alberts, Rudi | Dijkstra, Martijn | Schofield, Paul | Schughart, Klaus | Hancock, John M | Smedley, Damian | Wolstencroft, Katy | Goble, Carole | de Brock, Engbert O | Jones, Andrew R | Parkinson, Helen E | Jansen, Ritsert C
XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
doi:10.1186/gb-2010-11-3-r27
PMCID: PMC2864567
PMID: 20214801
Summary:
xQTL workbench is a scalable web platform for the mapping of quantitative trait loci (QTLs) at multiple levels: for example gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL) and phenotype (phQTL) data. Popular QTL mapping methods for model organism and human populations are accessible via the web user interface. Large calculations scale easily on to multi-core computers, clusters and Cloud. All data involved can be uploaded and queried online: markers, genotypes, microarrays, NGS, LC-MS, GC-MS, NMR, etc. When new data types come available, xQTL workbench is quickly customized using the Molgenis software generator.
Availability:
xQTL workbench runs on all common platforms, including Linux, Mac OS X and Windows. An online demo system, installation guide, tutorials, software and source code are available under the LGPL3 license from http://www.xqtl.org.
Contact:
m.a.swertz@rug.nl
doi:10.1093/bioinformatics/bts049
PMCID: PMC3315722
PMID: 22308096
Bruskiewich, Richard | Senger, Martin | Davenport, Guy | Ruiz, Manuel | Rouard, Mathieu | Hazekamp, Tom | Takeya, Masaru | Doi, Koji | Satoh, Kouji | Costa, Marcos | Simon, Reinhard | Balaji, Jayashree | Akintunde, Akinnola | Mauleon, Ramil | Wanchana, Samart | Shah, Trushar | Anacleto, Mylah | Portugal, Arllet | Ulat, Victor Jun | Thongjuea, Supat | Braak, Kyle | Ritter, Sebastian | Dereeper, Alexis | Skofic, Milko | Rojas, Edwin | Martins, Natalia | Pappas, Georgios | Alamban, Ryan | Almodiel, Roque | Barboza, Lord Hendrix | Detras, Jeffrey | Manansala, Kevin | Mendoza, Michael Jonathan | Morales, Jeffrey | Peralta, Barry | Valerio, Rowena | Zhang, Yi | Gregorio, Sergio | Hermocilla, Joseph | Echavez, Michael | Yap, Jan Michael | Farmer, Andrew | Schiltz, Gary | Lee, Jennifer | Casstevens, Terry | Jaiswal, Pankaj | Meintjes, Ayton | Wilkinson, Mark | Good, Benjamin | Wagner, James | Morris, Jane | Marshall, David | Collins, Anthony | Kikuchi, Shoshi | Metz, Thomas | McLaren, Graham | van Hintum, Theo
The Generation Challenge programme (GCP) is a global crop research consortium directed toward crop improvement through the application of comparative biology and genetic resources characterization to plant breeding. A key consortium research activity is the development of a GCP crop bioinformatics platform to support GCP research. This platform includes the following: (i) shared, public platform-independent domain models, ontology, and data formats to enable interoperability of data and analysis flows within the platform; (ii) web service and registry technologies to identify, share, and integrate information across diverse, globally dispersed data sources, as well as to access high-performance computational (HPC) facilities for computationally intensive, high-throughput analyses of project data; (iii) platform-specific middleware reference implementations of the domain model integrating a suite of public (largely open-access/-source) databases and software tools into a workbench to facilitate biodiversity analysis, comparative analysis of crop genomic data, and plant breeding decision making.
doi:10.1155/2008/369601
PMCID: PMC2375972
PMID: 18483570
The work showed that the integrated suite of software tools for detecting criminals using DNA databases has achieved the overall objective by providing a working platform for sequence analysis. The work also demonstrated that by integrating BLAST and FASTA (two widely used and freely available algorithms), plus an additional implementation of PSA (custom-built pairwise sequence alignment algorithms) and TR analysis tools (for detecting tandem repeats) with the rest of the utilities supporting tools (databases and files management) developed, it is entirely possible to have an initial working version of the software tool for criminal DNA analysis and detection work. The integrated software tool has great potential and that the results obtained during the tests were satisfactory. The recent South Asia Tsunami incident has renewed the need to establish a quick and reliable system for DNA matching and comparison. This work may also contribute towards the quick identification of victims in many disasters.
Future works are to further enhance the existing tools by adding more options and controls, improve upon the visualisation display, and to build robust software architecture to better manage the system loadings. Fault tolerance enhancement to the system is one of the key areas that can further help to make the entire application efficient, robust and reliable.
doi:10.2174/1874431100802010001
PMCID: PMC2666961
PMID: 19415131
Background
In systems biology, and many other areas of research, there is a need for the interoperability of tools and data sources that were not originally designed to be integrated. Due to the interdisciplinary nature of systems biology, and its association with high throughput experimental platforms, there is an additional need to continually integrate new technologies. As scientists work in isolated groups, integration with other groups is rarely a consideration when building the required software tools.
Results
We illustrate an approach, through the discussion of a purpose built software architecture, which allows disparate groups to reuse tools and access data sources in a common manner. The architecture allows for: the rapid development of distributed applications; interoperability, so it can be used by a wide variety of developers and computational biologists; development using standard tools, so that it is easy to maintain and does not require a large development effort; extensibility, so that new technologies and data types can be incorporated; and non intrusive development, insofar as researchers need not to adhere to a pre-existing object model.
Conclusion
By using a relatively simple integration strategy, based upon a common identity system and dynamically discovered interoperable services, a light-weight software architecture can become the focal point through which scientists can both get access to and analyse the plethora of experimentally derived data.
doi:10.1186/1471-2105-9-295
PMCID: PMC2478690
PMID: 18578887
Recent advances in genomics and structural biology have resulted in an unprecedented increase in biological data available from Internet-accessible databases. In order to help students effectively use this vast repository of information, undergraduate biology students at Drake University were introduced to bioinformatics software and databases in three courses, beginning with an introductory course in cell biology. The exercises and projects that were used to help students develop literacy in bioinformatics are described. In a recently offered course in bioinformatics, students developed their own simple sequence analysis tool using the Perl programming language. These experiences are described from the point of view of the instructor as well as the students. A preliminary assessment has been made of the degree to which students had developed a working knowledge of bioinformatics concepts and methods. Finally, some conclusions have been drawn from these courses that may be helpful to instructors wishing to introduce bioinformatics within the undergraduate biology curriculum.
doi:10.1187/cbe.03-06-0026
PMCID: PMC256976
PMID: 14673489
undergraduate; bioinformatics; genomics; Perl
Xia, Kai | Shabalin, Andrey A. | Huang, Shunping | Madar, Vered | Zhou, Yi-Hui | Wang, Wei | Zou, Fei | Sun, Wei | Sullivan, Patrick F. | Wright, Fred A.
Summary: seeQTL is a comprehensive and versatile eQTL database, including various eQTL studies and a meta-analysis of HapMap eQTL information. The database presents eQTL association results in a convenient browser, using both segmented local-association plots and genome-wide Manhattan plots.
Availability and implementation: seeQTL is freely available for non-commercial use at http://www.bios.unc.edu/research/genomic_software/seeQTL/.
Contact: fred_wright@unc.edu; kxia@bios.unc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr678
PMCID: PMC3268245
PMID: 22171328
Evani, Uday S | Challis, Danny | Yu, Jin | Jackson, Andrew R | Paithankar, Sameer | Bainbridge, Matthew N | Jakkamsetti, Adinarayana | Pham, Peter | Coarfa, Cristian | Milosavljevic, Aleksandar | Yu, Fuli
Background
Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation remains a serious bottleneck. To this end, the cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
Results
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
Conclusions
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
doi:10.1186/1471-2164-13-S6-S19
PMCID: PMC3481437
PMID: 23134663
Kumar, Kamal | Desai, Valmik | Cheng, Li | Khitrov, Maxim | Grover, Deepak | Satya, Ravi Vijaya | Yu, Chenggang | Zavaljevski, Nela | Reifman, Jaques | Aerts, Stein
Background
The annotation of genomes from next-generation sequencing platforms needs to
be rapid, high-throughput, and fully integrated and automated. Although a
few Web-based annotation services have recently become available, they may
not be the best solution for researchers that need to annotate a large
number of genomes, possibly including proprietary data, and store them
locally for further analysis. To address this need, we developed a
standalone software application, the Annotation of microbial Genome
Sequences (AGeS) system, which incorporates publicly available and
in-house-developed bioinformatics tools and databases, many of which are
parallelized for high-throughput performance.
Methodology
The AGeS system supports three main capabilities. The first is the storage of
input contig sequences and the resulting annotation data in a central,
customized database. The second is the annotation of microbial genomes using
an integrated software pipeline, which first analyzes contigs from
high-throughput sequencing by locating genomic regions that code for
proteins, RNA, and other genomic elements through the Do-It-Yourself
Annotation (DIYA) framework. The identified protein-coding regions are then
functionally annotated using the in-house-developed Pipeline for Protein
Annotation (PIPA). The third capability is the visualization of annotated
sequences using GBrowse. To date, we have implemented these capabilities for
bacterial genomes. AGeS was evaluated by comparing its genome annotations
with those provided by three other methods. Our results indicate that the
software tools integrated into AGeS provide annotations that are in general
agreement with those provided by the compared methods. This is demonstrated
by a >94% overlap in the number of identified genes, a significant
number of identical annotated features, and a >90% agreement in
enzyme function predictions.
doi:10.1371/journal.pone.0017469
PMCID: PMC3049762
PMID: 21408217
Motivation: R/qtl is free and powerful software for mapping and exploring quantitative trait loci (QTL). R/qtl provides a fully comprehensive range of methods for a wide range of experimental cross types. We recently added multiple QTL mapping (MQM) to R/qtl. MQM adds higher statistical power to detect and disentangle the effects of multiple linked and unlinked QTL compared with many other methods. MQM for R/qtl adds many new features including improved handling of missing data, analysis of 10 000 s of molecular traits, permutation for determining significance thresholds for QTL and QTL hot spots, and visualizations for cis–trans and QTL interaction effects. MQM for R/qtl is the first free and open source implementation of MQM that is multi-platform, scalable and suitable for automated procedures and large genetical genomics datasets.
Availability: R/qtl is free and open source multi-platform software for the statistical language R, and is made available under the GPLv3 license. R/qtl can be installed from http://www.rqtl.org/. R/qtl queries should be directed at the mailing list, see http://www.rqtl.org/list/.
Contact: kbroman@biostat.wisc.edu
doi:10.1093/bioinformatics/btq565
PMCID: PMC2982156
PMID: 20966004
Single nucleotide polymorphisms (SNPs) represent the most abundant type of genetic variation that can be used as molecular markers. The SNPs that are hidden in sequence databases can be unlocked using bioinformatic tools. For efficient application of these SNPs, the sequence set should be error-free as much as possible, targeting single loci and suitable for the SNP scoring platform of choice. We have developed a pipeline to effectively mine SNPs from public EST databases with or without quality information using QualitySNP software, select reliable SNP and prepare the loci for analysis on the Illumina GoldenGate genotyping platform. The applicability of the pipeline was demonstrated using publicly available potato EST data, genotyping individuals from two diploid mapping populations and subsequently mapping the SNP markers (putative genes) in both populations. Over 7000 reliable SNPs were identified that met the criteria for genotyping on the GoldenGate platform. Of the 384 SNPs on the SNP array approximately 12% dropped out. For the two potato mapping populations 165 and 185 SNPs segregating SNP loci could be mapped on the respective genetic maps, illustrating the effectiveness of our pipeline for SNP selection and validation.
Electronic supplementary material
The online version of this article (doi:10.1007/s11032-009-9377-5) contains supplementary material, which is available to authorized users.
doi:10.1007/s11032-009-9377-5
PMCID: PMC2869401
PMID: 20502512
EST database; Illumina GoldenGate assay; QualitySNP; Potato
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used in the shell, and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows. And, with JRuby, BioRuby runs on the Java Virtual Machine. The source code is available from http://www.bioruby.org/.
Contact: katayama@bioruby.org
doi:10.1093/bioinformatics/btq475
PMCID: PMC2951089
PMID: 20739307
Using DNA markers in plant breeding with marker-assisted selection (MAS) could greatly improve the precision and efficiency of selection, leading to the accelerated development of new crop varieties. The numerous examples of MAS in rice have prompted many breeding institutes to establish molecular breeding labs. The last decade has produced an enormous amount of genomics research in rice, including the identification of thousands of QTLs for agronomically important traits, the generation of large amounts of gene expression data, and cloning and characterization of new genes, including the detection of single nucleotide polymorphisms. The pinnacle of genomics research has been the completion and annotation of genome sequences for indica and japonica rice. This information—coupled with the development of new genotyping methodologies and platforms, and the development of bioinformatics databases and software tools—provides even more exciting opportunities for rice molecular breeding in the 21st century. However, the great challenge for molecular breeders is to apply genomics data in actual breeding programs. Here, we review the current status of MAS in rice, current genomics projects and promising new genotyping methodologies, and evaluate the probable impact of genomics research. We also identify critical research areas to “bridge the application gap” between QTL identification and applied breeding that need to be addressed to realize the full potential of MAS, and propose ideas and guidelines for establishing rice molecular breeding labs in the postgenome sequence era to integrate molecular breeding within the context of overall rice breeding and research programs.
doi:10.1155/2008/524847
PMCID: PMC2408710
PMID: 18528527
Background
Quantitative trait locus (QTL) mapping identifies genomic regions that likely contain genes regulating a quantitative trait. However, QTL regions may encompass tens to hundreds of genes. To find the most promising candidate genes that regulate the trait, the biologist typically collects information from multiple resources about the genes in the QTL interval. This process is very laborious and time consuming.
Results
QTLminer is a bioinformatics tool that automatically performs QTL region analysis. It is available in GeneNetwork and it integrates information such as gene annotation, gene expression and sequence polymorphisms for all the genes within a given genomic interval.
Conclusions
QTLminer substantially speeds up discovery of the most promising candidate genes within a QTL region.
doi:10.1186/1471-2105-11-516
PMCID: PMC2964687
PMID: 20950438
Background
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Results
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.
Conclusions
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
doi:10.1186/1471-2105-13-42
PMCID: PMC3372431
PMID: 22429538
Statistical analysis system (SAS) is the most comprehensive statistical analysis software package in the world. It offers data analysis for almost all experiments under various statistical models. Each analysis is performed using a particular subroutine, called a procedure (PROC). For example, PROC ANOVA performs analysis of variances. PROC QTL is a user-defined SAS procedure for mapping quantitative trait loci (QTL). It allows users to perform QTL mapping for continuous and discrete traits within the SAS platform. Users of PROC QTL are able to take advantage of all existing features offered by the general SAS software, for example, data management and graphical treatment. The current version of PROC QTL can perform QTL mapping for all line crossing experiments using maximum likelihood (ML), least square (LS), iteratively reweighted least square (IRLS), Fisher scoring (FISHER), Bayesian (BAYES), and empirical Bayes (EBAYES) methods.
doi:10.1155/2009/141234
PMCID: PMC2796467
PMID: 20037656
Jones, Andrew R. | Eisenacher, Martin | Mayer, Gerhard | Kohlbacher, Oliver | Siepen, Jennifer | Hubbard, Simon J. | Selley, Julian N. | Searle, Brian C. | Shofstahl, James | Seymour, Sean L. | Julian, Randall | Binz, Pierre-Alain | Deutsch, Eric W. | Hermjakob, Henning | Reisinger, Florian | Griss, Johannes | Vizcaíno, Juan Antonio | Chambers, Matthew | Pizarro, Angel | Creasy, David
We report the release of mzIdentML, an exchange standard for peptide and protein identification data, designed by the Proteomics Standards Initiative. The format was developed by the Proteomics Standards Initiative in collaboration with instrument and software vendors, and the developers of the major open-source projects in proteomics. Software implementations have been developed to enable conversion from most popular proprietary and open-source formats, and mzIdentML will soon be supported by the major public repositories. These developments enable proteomics scientists to start working with the standard for exchanging and publishing data sets in support of publications and they provide a stable platform for bioinformatics groups and commercial software vendors to work with a single file format for identification data.
doi:10.1074/mcp.M111.014381
PMCID: PMC3394945
PMID: 22375074
Abstract
As advances in life sciences and information technology bring profound influences on bioinformatics due to its interdisciplinary nature, bioinformatics is experiencing a new leap-forward from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. Albeit relatively new, cloud computing promises to address big data storage and analysis issues in the bioinformatics field. Here we review extant cloud-based services in bioinformatics, classify them into Data as a Service (DaaS), Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS), and present our perspectives on the adoption of cloud computing in bioinformatics.
Reviewers
This article was reviewed by Frank Eisenhaber, Igor Zhulin, and Sandor Pongor.
doi:10.1186/1745-6150-7-43
PMCID: PMC3533974
PMID: 23190475
Cloud computing; Bioinformatics; Big data; Data storage; Data analysis
Laulederkind, Stanley J. F. | Shimoyama, Mary | Hayman, G. Thomas | Lowry, Timothy F. | Nigam, Rajni | Petri, Victoria | Smith, Jennifer R. | Wang, Shur-Jen | de Pons, Jeff | Kowalski, George | Liu, Weisong | Rood, Wes | Munzenmaier, Diane H. | Dwinell, Melinda R. | Twigger, Simon N. | Jacob, Howard J.
The Rat Genome Database (RGD) is the premier repository of rat genomic and genetic data and currently houses over 40 000 rat gene records as well as human and mouse orthologs, 1771 rat and 1911 human quantitative trait loci (QTLs) and 2209 rat strains. Biological information curated for these data objects includes disease associations, phenotypes, pathways, molecular functions, biological processes and cellular components. A suite of tools has been developed to aid curators in acquiring and validating data objects, assigning nomenclature, attaching biological information to objects and making connections among data types. The software used to assign nomenclature, to create and edit objects and to make annotations to the data objects has been specifically designed to make the curation process as fast and efficient as possible. The user interfaces have been adapted to the work routines of the curators, creating a suite of tools that is intuitive and powerful.
Database URL: http://rgd.mcw.edu
doi:10.1093/database/bar002
PMCID: PMC3041158
PMID: 21321022
Background
Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as of microarrays, of metabolic networks and of predicted protein structure. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing.
Results
In this paper we describe the Gaggle -a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily-extended environment.
We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein -a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools.
Conclusion
We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and Pubmed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies and can be found at .
doi:10.1186/1471-2105-7-176
PMCID: PMC1464137
PMID: 16569235
Background
Statistical bioinformatics is the study of biological data sets obtained by new micro-technologies by means of proper statistical methods. For a better understanding of environmental adaptations of proteins, orthologous sequences from different habitats may be explored and compared. The main goal of the DeltaProt Toolbox is to provide users with important functionality that is needed for comparative screening and studies of extremophile proteins and protein classes. Visualization of the data sets is also the focus of this article, since visualizations can play a key role in making the various relationships transparent. This application paper is intended to inform the reader of the existence, functionality, and applicability of the toolbox.
Results
We present the DeltaProt Toolbox, a software toolbox that may be useful in importing, analyzing and visualizing data from multiple alignments of proteins. The toolbox has been written in MATLAB™ to provide an easy and user-friendly platform, including a graphical user interface, while ensuring good numerical performance. Problems in genome biology may be easily stated thanks to a compact input format. The toolbox also offers the possibility of utilizing structural information from the SABLE or other structure predictors. Different sequence plots can then be viewed and compared in order to find their similarities and differences. Detailed statistics are also calculated during the procedure.
Conclusions
The DeltaProt package is open source and freely available for academic, non-commercial use. The latest version of DeltaProt can be obtained from http://services.cbu.uib.no/software/deltaprot/. The website also contains documentation, and the toolbox comes with real data sets that are intended for training in applying the models to carry out bioinformatical and statistical analyses of protein sequences.
Equipped with the new algorithms proposed here, DeltaProt serves as an auxiliary analysis tool capable of visualizing and comparing orthologus protein sequences. The framework of the algorithms also enables easy incorporation of extra information on structure, if such data is available.
doi:10.1186/1471-2105-11-573
PMCID: PMC3001747
PMID: 21092291
Background
New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed.
Results
We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network.
Conclusion
The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
doi:10.1186/1471-2105-9-S9-S5
PMCID: PMC2537576
PMID: 18793469
Motivation: Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures.
Results: CAMPAIGN is a library of data clustering algorithms and tools, written in ‘C for CUDA’ for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource.
New modules from the community will be accepted into the library and the layout of it is such that it can easily be extended to promising future platforms such as OpenCL.
Availability: Releases of the CAMPAIGN library are freely available for download under the LGPL from https://simtk.org/home/campaign. Source code can also be obtained through anonymous subversion access as described on https://simtk.org/scm/?group_id=453.
Contact: kjk33@cantab.net
doi:10.1093/bioinformatics/btr386
PMCID: PMC3150041
PMID: 21712246
Background
qtl.outbred is an extendible interface in the statistical environment, R, for combining quantitative trait loci (QTL) mapping tools. It is built as an umbrella package that enables outbred genotype probabilities to be calculated and/or imported into the software package R/qtl.
Findings
Using qtl.outbred, the genotype probabilities from outbred line cross data can be calculated by interfacing with a new and efficient algorithm developed for analyzing arbitrarily large datasets (included in the package) or imported from other sources such as the web-based tool, GridQTL.
Conclusion
qtl.outbred will improve the speed for calculating probabilities and the ability to analyse large future datasets. This package enables the user to analyse outbred line cross data accurately, but with similar effort than inbred line cross data.
doi:10.1186/1756-0500-4-154
PMCID: PMC3117720
PMID: 21615912
Background
A common approach to understanding the genetic basis of complex traits is through identification of associated quantitative trait loci (QTL). Fine mapping QTLs requires several generations of backcrosses and analysis of large populations, which is time-consuming and costly effort. Furthermore, as entire genomes are being sequenced and an increasing amount of genetic and expression data are being generated, a challenge remains: linking phenotypic variation to the underlying genomic variation. To identify candidate genes and understand the molecular basis underlying the phenotypic variation of traits, bioinformatic approaches are needed to exploit information such as genetic map, expression and whole genome sequence data of organisms in biological databases.
Description
The Sol Genomics Network (SGN, http://solgenomics.net) is a primary repository for phenotypic, genetic, genomic, expression and metabolic data for the Solanaceae family and other related Asterids species and houses a variety of bioinformatics tools. SGN has implemented a new approach to QTL data organization, storage, analysis, and cross-links with other relevant data in internal and external databases. The new QTL module, solQTL, http://solgenomics.net/qtl/, employs a user-friendly web interface for uploading raw phenotype and genotype data to the database, R/QTL mapping software for on-the-fly QTL analysis and algorithms for online visualization and cross-referencing of QTLs to relevant datasets and tools such as the SGN Comparative Map Viewer and Genome Browser. Here, we describe the development of the solQTL module and demonstrate its application.
Conclusions
solQTL allows Solanaceae researchers to upload raw genotype and phenotype data to SGN, perform QTL analysis and dynamically cross-link to relevant genetic, expression and genome annotations. Exploration and synthesis of the relevant data is expected to help facilitate identification of candidate genes underlying phenotypic variation and markers more closely linked to QTLs. solQTL is freely available on SGN and can be used in private or public mode.
doi:10.1186/1471-2105-11-525
PMCID: PMC2984588
PMID: 20964836