Transcriptional regulation is one of the most basic regulatory mechanisms in the cell. The accumulation of multiple metazoan genome sequences and the advent of high-throughput experimental techniques have motivated the development of a large number of bioinformatics methods for the detection of regulatory motifs. The regulatory process is extremely complex and individual computational algorithms typically have very limited success in genome-scale studies. Here, we argue the importance of integrating multiple computational algorithms and present an infrastructure that integrates eight web services covering key areas of transcriptional regulation. We have adopted the client-side integration technology and built a consistent input and output environment with a versatile visualization tool named SeqVISTA. The infrastructure will allow for easy integration of gene regulation analysis software that is scattered over the Internet. It will also enable bench biologists to perform an arsenal of analysis using cutting-edge methods in a familiar environment and bioinformatics researchers to focus on developing new algorithms without the need to invest substantial effort on complex pre- or post-processors. SeqVISTA is freely available to academic users and can be launched online at http://zlab.bu.edu/SeqVISTA/web.jnlp, provided that Java Web Start has been installed. In addition, a stand-alone version of the program can be downloaded and run locally. It can be obtained at http://zlab.bu.edu/SeqVISTA.
Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is “cloud computing”, which, in principle, offers on demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate “as is” use by experimental biologists.
We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon’s EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add or customize an otherwise available cloud system to better meet their needs.
The expected knowledge and associated effort with deploying a compute cluster in the Amazon EC2 cloud is not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.
The enormous throughput and low cost of second-generation sequencing platforms now allow research and clinical geneticists to routinely perform single experiments that identify tens of thousands to millions of variant sites. Existing methods to annotate variant sites using information from publicly available databases via web browsers are too slow to be useful for the large sequencing datasets being routinely generated by geneticists. Because sequence annotation of variant sites is required before functional characterization can proceed, the lack of a high-throughput pipeline to efficiently annotate variant sites can act as a significant bottleneck in genetics research.
SeqAnt (Sequence Annotator) is an open source web service and software package that rapidly annotates DNA sequence variants and identifies recessive or compound heterozygous loci in human, mouse, fly, and worm genome sequencing experiments. Variants are characterized with respect to their functional type, frequency, and evolutionary conservation. Annotated variants can be viewed on a web browser, downloaded in a tab-delimited text file, or directly uploaded in a BED format to the UCSC genome browser. To demonstrate the speed of SeqAnt, we annotated a series of publicly available datasets that ranged in size from 37 to 3,439,107 variant sites. The total time to completely annotate these data completely ranged from 0.17 seconds to 28 minutes 49.8 seconds.
SeqAnt is an open source web service and software package that overcomes a critical bottleneck facing research and clinical geneticists using second-generation sequencing platforms. SeqAnt will prove especially useful for those investigators who lack dedicated bioinformatics personnel or infrastructure in their laboratories.
There is a huge demand on bioinformaticians to provide their biologists with user friendly and scalable software infrastructures to capture, exchange, and exploit the unprecedented amounts of new *omics data. We here present MOLGENIS, a generic, open source, software toolkit to quickly produce the bespoke MOLecular GENetics Information Systems needed.
The MOLGENIS toolkit provides bioinformaticians with a simple language to model biological data structures and user interfaces. At the push of a button, MOLGENIS’ generator suite automatically translates these models into a feature-rich, ready-to-use web application including database, user interfaces, exchange formats, and scriptable interfaces. Each generator is a template of SQL, JAVA, R, or HTML code that would require much effort to write by hand. This ‘model-driven’ method ensures reuse of best practices and improves quality because the modeling language and generators are shared between all MOLGENIS applications, so that errors are found quickly and improvements are shared easily by a re-generation. A plug-in mechanism ensures that both the generator suite and generated product can be customized just as much as hand-written software.
In recent years we have successfully evaluated the MOLGENIS toolkit for the rapid prototyping of many types of biomedical applications, including next-generation sequencing, GWAS, QTL, proteomics and biobanking. Writing 500 lines of model XML typically replaces 15,000 lines of hand-written programming code, which allows for quick adaptation if the information system is not yet to the biologist’s satisfaction. Each application generated with MOLGENIS comes with an optimized database back-end, user interfaces for biologists to manage and exploit their data, programming interfaces for bioinformaticians to script analysis tools in R, Java, SOAP, REST/JSON and RDF, a tab-delimited file format to ease upload and exchange of data, and detailed technical documentation. Existing databases can be quickly enhanced with MOLGENIS generated interfaces using the ‘ExtractModel’ procedure.
The MOLGENIS toolkit provides bioinformaticians with a simple model to quickly generate flexible web platforms for all possible genomic, molecular and phenotypic experiments with a richness of interfaces not provided by other tools. All the software and manuals are available free as LGPLv3 open source at http://www.molgenis.org.
The advancement of the computational biology field hinges on progress in three fundamental directions – the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources–data, software tools and web-services. The iTools design, implementation and resource meta - data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existent computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project both in terms of the source code development as well as its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management. We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.
Laboratory Information Management Systems (LIMS) are an increasingly important part of modern laboratory infrastructure. As typically very sophisticated software products, LIMS often require considerable resources to select, deploy and maintain. Larger organisations may have access to specialist IT support to assist with requirements elicitation and software customisation, however smaller groups will often have limited IT support to perform the kind of iterative development that can resolve the difficulties that biologists often have when specifying requirements. Translational medicine aims to accelerate the process of treatment discovery by bringing together multiple disciplines to discover new approaches to treating disease, or novel applications of existing treatments. The diverse set of disciplines and complexity of processing procedures involved, especially with the use of high throughput technologies, bring difficulties in customizing a generic LIMS to provide a single system for managing sample related data within a translational medicine research setting, especially where limited IT support is available.
We have designed and developed a LIMS, BonsaiLIMS, around a very simple data model that can be easily implemented using a variety of technologies, and can be easily extended as specific requirements dictate. A reference implementation using Oracle 11 g database and the Python framework, Django is presented.
By focusing on a minimal feature set and a modular design we have been able to deploy the BonsaiLIMS system very quickly. The benefits to our institute have been the avoidance of the prolonged implementation timescales, budget overruns, scope creep, off-specifications and user fatigue issues that typify many enterprise software implementations. The transition away from using local, uncontrolled records in spreadsheet and paper formats to a centrally held, secured and backed-up database brings the immediate benefits of improved data visibility, audit and overall data quality. The open-source availability of this software allows others to rapidly implement a LIMS which in itself might sufficiently address user requirements. In situations where this software does not meet requirements, it can serve to elicit more accurate specifications from end-users for a more heavyweight LIMS by acting as a demonstrable prototype.
The AMBIT web services package is one of the several existing independent implementations of the OpenTox Application Programming Interface and is built according to the principles of the Representational State Transfer (REST) architecture. The Open Source Predictive Toxicology Framework, developed by the partners in the EC FP7 OpenTox project, aims at providing a unified access to toxicity data and predictive models, as well as validation procedures. This is achieved by i) an information model, based on a common OWL-DL ontology ii) links to related ontologies; iii) data and algorithms, available through a standardized REST web services interface, where every compound, data set or predictive method has a unique web address, used to retrieve its Resource Description Framework (RDF) representation, or initiate the associated calculations.
The AMBIT web services package has been developed as an extension of AMBIT modules, adding the ability to create (Quantitative) Structure-Activity Relationship (QSAR) models and providing an OpenTox API compliant interface. The representation of data and processing resources in W3C Resource Description Framework facilitates integrating the resources as Linked Data. By uploading datasets with chemical structures and arbitrary set of properties, they become automatically available online in several formats. The services provide unified interfaces to several descriptor calculation, machine learning and similarity searching algorithms, as well as to applicability domain and toxicity prediction models. All Toxtree modules for predicting the toxicological hazard of chemical compounds are also integrated within this package. The complexity and diversity of the processing is reduced to the simple paradigm "read data from a web address, perform processing, write to a web address". The online service allows to easily run predictions, without installing any software, as well to share online datasets and models. The downloadable web application allows researchers to setup an arbitrary number of service instances for specific purposes and at suitable locations. These services could be used as a distributed framework for processing of resource-intensive tasks and data sharing or in a fully independent way, according to the specific needs. The advantage of exposing the functionality via the OpenTox API is seamless interoperability, not only within a single web application, but also in a network of distributed services. Last, but not least, the services provide a basis for building web mashups, end user applications with friendly GUIs, as well as embedding the functionalities in existing workflow systems.
High-throughput, image-based screens of cellular responses to genetic or chemical perturbations generate huge numbers of cell images. Automated analysis is required to quantify and compare the effects of these perturbations. However, few of the current freely-available bioimage analysis software tools are optimized for efficient handling of these images. Even fewer of them are designed to transform the phenotypic features measured from these images into discriminative profiles that can reveal biologically meaningful associations among the tested perturbations.
We present a fast and user-friendly software platform called "cellXpress" to segment cells, measure quantitative features of cellular phenotypes, construct discriminative profiles, and visualize the resulting cell masks and feature values. We have also developed a suite of library functions to load the extracted features for further customizable analysis and visualization under the R computing environment. We systematically compared the processing speed, cell segmentation accuracy, and phenotypic-profile clustering performance of cellXpress to other existing bioimage analysis software packages or algorithms. We found that cellXpress outperforms these existing tools on three different bioimage datasets. We estimate that cellXpress could finish processing a genome-wide gene knockdown image dataset in less than a day on a modern personal desktop computer.
The cellXpress platform is designed to make fast and efficient high-throughput phenotypic profiling more accessible to the wider biological research community. The cellXpress installation packages for 64-bit Windows and Linux, user manual, installation guide, and datasets used in this analysis can be downloaded freely from http://www.cellXpress.org.
Dynamic visualisation interfaces are required to explore the multiple microbial genome data now available, especially those obtained by high-throughput sequencing — a.k.a. “Next-Generation Sequencing” (NGS) — technologies; they would also be useful for “standard” annotated genomes whose chromosome organizations may be compared. Although various software systems are available, few offer an optimal combination of feature-rich capabilities, non-static user interfaces and multi-genome data handling.
We developed SynTView, a comparative and interactive viewer for microbial genomes, designed to run as either a web-based tool (Flash technology) or a desktop application (AIR environment). The basis of the program is a generic genome browser with sub-maps holding information about genomic objects (annotations). The software is characterised by the presentation of syntenic organisations of microbial genomes and the visualisation of polymorphism data (typically Single Nucleotide Polymorphisms — SNPs) along these genomes; these features are accessible to the user in an integrated way. A variety of specialised views are available and are all dynamically inter-connected (including linear and circular multi-genome representations, dot plots, phylogenetic profiles, SNP density maps, and more). SynTView is not linked to any particular database, allowing the user to plug his own data into the system seamlessly, and use external web services for added functionalities. SynTView has now been used in several genome sequencing projects to help biologists make sense out of huge data sets.
The most important assets of SynTView are: (i) the interactivity due to the Flash technology; (ii) the capabilities for dynamic interaction between many specialised views; and (iii) the flexibility allowing various user data sets to be integrated. It can thus be used to investigate massive amounts of information efficiently at the chromosome level. This innovative approach to data exploration could not be achieved with most existing genome browsers, which are more static and/or do not offer multiple views of multiple genomes. Documentation, tutorials and demonstration sites are available at the URL: http://genopole.pasteur.fr/SynTView.
Genome browser; Microbial genomics; Synteny; Next-Generation Sequencing (NGS); Single Nucleotide Polymorphism (SNP); Flash; Interactive graphical user interface
The ability to combine physiology and engineering analyses with computer sciences has opened the door to the possibility of creating the "Virtual Human" reality. This paper presents a broad foundation for a full-featured biomechanical simulator for the human musculoskeletal system physiology. This simulation technology unites the expertise in biomechanical analysis and graphic modeling to investigate joint and connective tissue mechanics at the structural level and to visualize the results in both static and animated forms together with the model. Adaptable anatomical models including prosthetic implants and fracture fixation devices and a robust computational infrastructure for static, kinematic, kinetic, and stress analyses under varying boundary and loading conditions are incorporated on a common platform, the VIMS (Virtual Interactive Musculoskeletal System). Within this software system, a manageable database containing long bone dimensions, connective tissue material properties and a library of skeletal joint system functional activities and loading conditions are also available and they can easily be modified, updated and expanded. Application software is also available to allow end-users to perform biomechanical analyses interactively. Examples using these models and the computational algorithms in a virtual laboratory environment are used to demonstrate the utility of these unique database and simulation technology. This integrated system, model library and database will impact on orthopaedic education, basic research, device development and application, and clinical patient care related to musculoskeletal joint system reconstruction, trauma management, and rehabilitation.
Quantitative proteomics holds great promise for identifying proteins that are differentially abundant between populations representing different physiological or disease states. A range of computational tools is now available for both isotopically labeled and label-free liquid chromatography mass spectrometry (LC-MS) based quantitative proteomics. However, they are generally not comparable to each other in terms of functionality, user interfaces, information input/output, and do not readily facilitate appropriate statistical data analysis. These limitations, along with the array of choices, present a daunting prospect for biologists, and other researchers not trained in bioinformatics, who wish to use LC-MS-based quantitative proteomics.
We have developed Corra, a computational framework and tools for discovery-based LC-MS proteomics. Corra extends and adapts existing algorithms used for LC-MS-based proteomics, and statistical algorithms, originally developed for microarray data analyses, appropriate for LC-MS data analysis. Corra also adapts software engineering technologies (e.g. Google Web Toolkit, distributed processing) so that computationally intense data processing and statistical analyses can run on a remote server, while the user controls and manages the process from their own computer via a simple web interface. Corra also allows the user to output significantly differentially abundant LC-MS-detected peptide features in a form compatible with subsequent sequence identification via tandem mass spectrometry (MS/MS). We present two case studies to illustrate the application of Corra to commonly performed LC-MS-based biological workflows: a pilot biomarker discovery study of glycoproteins isolated from human plasma samples relevant to type 2 diabetes, and a study in yeast to identify in vivo targets of the protein kinase Ark1 via phosphopeptide profiling.
The Corra computational framework leverages computational innovation to enable biologists or other researchers to process, analyze and visualize LC-MS data with what would otherwise be a complex and not user-friendly suite of tools. Corra enables appropriate statistical analyses, with controlled false-discovery rates, ultimately to inform subsequent targeted identification of differentially abundant peptides by MS/MS. For the user not trained in bioinformatics, Corra represents a complete, customizable, free and open source computational platform enabling LC-MS-based proteomic workflows, and as such, addresses an unmet need in the LC-MS proteomics field.
Reliable access to basic services can improve a community's resilience to HIV/AIDS. Accordingly, work is being done to upgrade the physical infrastructure in affected areas, often employing a strategy of decentralised service provision. Spatial characteristics are one of the major determinants in implementing services, even in the smaller municipal areas, and good quality spatial information is needed to inform decision making processes. However, limited funds, technical infrastructure and human resource capacity result in little or no access to spatial information for crucial infrastructure development decisions at local level.
This research investigated whether it would be possible to develop a GIS for basic infrastructure planning and management at local level. Given the resource constraints of the local government context, particularly in small municipalities, it was decided that open source software should be used for the prototype system.
The design and development of a prototype system illustrated that it is possible to develop an open source GIS system that can be used within the context of local information management. Usability tests show a high degree of usability for the system, which is important considering the heavy workload and high staff turnover that characterises local government in South Africa. Local infrastructure management stakeholders interviewed in a case study of a South African municipality see the potential for the use of GIS as a communication tool and are generally positive about the use of GIS for these purposes. They note security issues that may arise through the sharing of information, lack of skills and resource constraints as the major barriers to adoption.
The case study shows that spatial information is an identified need at local level. Open source GIS software can be used to develop a system to provide local-level stakeholders with spatial information. However, the suitability of the technology is only a part of the system – there are wider information and management issues which need to be addressed before the implementation of a local-level GIS for infrastructure management can be successful.
New high throughput pyrosequencers such as the 454 Life Sciences GS 20 are capable of massively parallelizing DNA sequencing providing an unprecedented rate of output data as well as potentially reducing costs. However, these new pyrosequencers bear a different error profile and provide shorter reads than those of a more traditional Sanger sequencer. These facts pose new challenges regarding how the data are handled and analyzed, in addition, the steep increase in the sequencers throughput calls for much computation power at a low cost.
To address these challenges, we created an automated multi-step computation pipeline integrated with a database storage system. This allowed us to store, handle, index and search (1) the output data from the GS20 sequencer (2) analysis projects, possibly multiple on every dataset (3) final results of analysis computations (4) intermediate results of computations (these allow hand-made comparisons and hence further searches by the biologists). Repeatability of computations was also a requirement. In order to access the needed computation power, we ported the pipeline to the European Grid: a large community of clusters, load balanced as a whole. In order to better achieve this Grid port we created Vnas: an innovative Grid job submission, virtual sandbox manager and job callback framework.
After some runs of the pipeline aimed at tuning the parameters and thresholds for optimal results, we successfully analyzed 273 sequenced amplicons from a cancerous human sample and correctly found punctual mutations confirmed by either Sanger resequencing or NCBI dbSNP. The sequencing was performed with our 454 Life Sciences GS 20 pyrosequencer.
We handled the steep increase in throughput from the new pyrosequencer by building an automated computation pipeline associated with database storage, and by leveraging the computing power of the European Grid. The Grid platform offers a very cost effective choice for uneven workloads, typical in many scientific research fields, provided its peculiarities can be accepted (these are discussed). The mentioned infrastructure was used to analyze human amplicons for mutations. More analyses will be performed in the future.
The Center for Genome Research and Biocomputing (CGRB) Core Lab at Oregon State University provides services for fee in genomic technologies (DNA sequencing, DNA genotyping (fragment analysis)) and in functional genomic technologies (microarray). Sequencing is provided both for traditional Sanger sequencing on an AB 3730 and for Ultra high throughput sequencing on the Illumina Genome Analyzer GAIIx. This year we upgraded the GA by adding the ′x′ capability for longer reads and more images per cycle. DNA genotyping has transitioned from the AB 3100 to the AB 3730. Our microarray services include Affymetrix GeneChip and Roche NimbleGen platforms, and sample labeling, hybridization and scanning are offered. We purchased ArrayStar microarray analysis software which is accessible to our researchers through our Biocomputing cluster network. The Agilent Bioanalyzer service now offers High Sensitivity DNA and Plant RNA applications. Our Biocomputing capabilities have expanded. This past year we expanded our server room, doubled the floor space, increased the cooling infrastructure and increased the UPS power back up. To our cluster we added new nodes (now have 420+ processors), new service servers (mysql, http, mailist), new domains, a new Gbrowse development server, and we increased our back-up space. The Biocomputing changes to our High Throughput Sequencing service include removing the IPAR unit from the network and simplifying data pathways both for data management and researcher analysis capabilities. The CGRB Core Lab also maintains multi-user instruments available to researchers. These include a Zeiss LSM510Meta confocal microscope, an AB 7500 FAST qPCR instrument that was upgraded to include High Resolution Melting (HRM) capability, a Storm 820 Phosphoimager, a Nanodrop, a Genetix Q-Pix colony picker, an Axon Genepix 4200A microarray scanner and a fluorescent plate reader.
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
Many biological databases that provide comparative genomics information and tools are now available on the internet. While certainly quite useful, to our knowledge none of the existing databases combine results from multiple comparative genomics methods with manually curated information from the literature. Here we describe the Princeton Protein Orthology Database (P-POD, http://ortholog.princeton.edu), a user-friendly database system that allows users to find and visualize the phylogenetic relationships among predicted orthologs (based on the OrthoMCL method) to a query gene from any of eight eukaryotic organisms, and to see the orthologs in a wider evolutionary context (based on the Jaccard clustering method). In addition to the phylogenetic information, the database contains experimental results manually collected from the literature that can be compared to the computational analyses, as well as links to relevant human disease and gene information via the OMIM, model organism, and sequence databases. Our aim is for the P-POD resource to be extremely useful to typical experimental biologists wanting to learn more about the evolutionary context of their favorite genes. P-POD is based on the commonly used Generic Model Organism Database (GMOD) schema and can be downloaded in its entirety for installation on one's own system. Thus, bioinformaticians and software developers may also find P-POD useful because they can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools.
Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers.
We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated.
Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE. RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at .
Many contemporary neuroscientific investigations face significant challenges in terms of data management, computational processing, data mining, and results interpretation. These four pillars define the core infrastructure necessary to plan, organize, orchestrate, validate, and disseminate novel scientific methods, computational resources, and translational healthcare findings. Data management includes protocols for data acquisition, archival, query, transfer, retrieval, and aggregation. Computational processing involves the necessary software, hardware, and networking infrastructure required to handle large amounts of heterogeneous neuroimaging, genetics, clinical, and phenotypic data and meta-data. Data mining refers to the process of automatically extracting data features, characteristics and associations, which are not readily visible by human exploration of the raw dataset. Result interpretation includes scientific visualization, community validation of findings and reproducible findings. In this manuscript we describe the novel high-throughput neuroimaging-genetics computational infrastructure available at the Institute for Neuroimaging and Informatics (INI) and the Laboratory of Neuro Imaging (LONI) at University of Southern California (USC). INI and LONI include ultra-high-field and standard-field MRI brain scanners along with an imaging-genetics database for storing the complete provenance of the raw and derived data and meta-data. In addition, the institute provides a large number of software tools for image and shape analysis, mathematical modeling, genomic sequence processing, and scientific visualization. A unique feature of this architecture is the Pipeline environment, which integrates the data management, processing, transfer, and visualization. Through its client-server architecture, the Pipeline environment provides a graphical user interface for designing, executing, monitoring validating, and disseminating of complex protocols that utilize diverse suites of software tools and web-services. These pipeline workflows are represented as portable XML objects which transfer the execution instructions and user specifications from the client user machine to remote pipeline servers for distributed computing. Using Alzheimer's and Parkinson's data, we provide several examples of translational applications using this infrastructure1.
aging; pipeline; neuroimaging; genetics; computation solutions; Alzheimer's disease; big data; visualization
BLAST is one of the most common and useful tools for Genetic Research. This paper describes a software application we have termed Windows .NET Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), which enhances the BLAST utility by improving usability, fault recovery, and scalability in a Windows desktop environment. Our goal was to develop an easy to use, fault tolerant, high-throughput BLAST solution that incorporates a comprehensive BLAST result viewer with curation and annotation functionality.
W.ND-BLAST is a comprehensive Windows-based software toolkit that targets researchers, including those with minimal computer skills, and provides the ability increase the performance of BLAST by distributing BLAST queries to any number of Windows based machines across local area networks (LAN). W.ND-BLAST provides intuitive Graphic User Interfaces (GUI) for BLAST database creation, BLAST execution, BLAST output evaluation and BLAST result exportation. This software also provides several layers of fault tolerance and fault recovery to prevent loss of data if nodes or master machines fail. This paper lays out the functionality of W.ND-BLAST. W.ND-BLAST displays close to 100% performance efficiency when distributing tasks to 12 remote computers of the same performance class. A high throughput BLAST job which took 662.68 minutes (11 hours) on one average machine was completed in 44.97 minutes when distributed to 17 nodes, which included lower performance class machines. Finally, there is a comprehensive high-throughput BLAST Output Viewer (BOV) and Annotation Engine components, which provides comprehensive exportation of BLAST hits to text files, annotated fasta files, tables, or association files.
W.ND-BLAST provides an interactive tool that allows scientists to easily utilizing their available computing resources for high throughput and comprehensive sequence analyses. The install package for W.ND-BLAST is freely downloadable from . With registration the software is free, installation, networking, and usage instructions are provided as well as a support forum.
Summary: Complex computational experiments in Systems Biology, such as fitting model parameters to experimental data, can be challenging to perform. Not only do they frequently require a high level of computational power, but the software needed to run the experiment needs to be usable by scientists with varying levels of computational expertise, and modellers need to be able to obtain up-to-date experimental data resources easily. We have developed a software suite, the Systems Biology Software Infrastructure (SBSI), to facilitate the parameter-fitting process. SBSI is a modular software suite composed of three major components: SBSINumerics, a high-performance library containing parallelized algorithms for performing parameter fitting; SBSIDispatcher, a middleware application to track experiments and submit jobs to back-end servers; and SBSIVisual, an extensible client application used to configure optimization experiments and view results. Furthermore, we have created a plugin infrastructure to enable project-specific modules to be easily installed. Plugin developers can take advantage of the existing user-interface and application framework to customize SBSI for their own uses, facilitated by SBSI’s use of standard data formats.
Availability and implementation: All SBSI binaries and source-code are freely available from http://sourceforge.net/projects/sbsi under an Apache 2 open-source license. The server-side SBSINumerics runs on any Unix-based operating system; both SBSIVisual and SBSIDispatcher are written in Java and are platform independent, allowing use on Windows, Linux and Mac OS X. The SBSI project website at http://www.sbsi.ed.ac.uk provides documentation and tutorials.
Supplementary information: Supplementary data are available at Bioinformatics online.
As the availability, affordability and magnitude of genomics and genetics research increases so does the need to provide online access to resulting data and analyses. Availability of a tailored online database is the desire for many investigators or research communities; however, managing the Information Technology infrastructure needed to create such a database can be an undesired distraction from primary research or potentially cost prohibitive. Tripal provides simplified site development by merging the power of Drupal, a popular web Content Management System with that of Chado, a community-derived database schema for storage of genomic, genetic and other related biological data. Tripal provides an interface that extends the content management features of Drupal to the data housed in Chado. Furthermore, Tripal provides a web-based Chado installer, genomic data loaders, web-based editing of data for organisms, genomic features, biological libraries, controlled vocabularies and stock collections. Also available are Tripal extensions that support loading and visualizations of NCBI BLAST, InterPro, Kyoto Encyclopedia of Genes and Genomes and Gene Ontology analyses, as well as an extension that provides integration of Tripal with GBrowse, a popular GMOD tool. An Application Programming Interface is available to allow creation of custom extensions by site developers, and the look-and-feel of the site is completely customizable through Drupal-based PHP template files. Addition of non-biological content and user-management is afforded through Drupal. Tripal is an open source and freely available software package found at http://tripal.sourceforge.net
Research biologists increasingly face the arduous process of assessing and implementing a combination of freeware, commercial software, and web services for the management and analysis of data from high-throughput experiments. Laboratories can spend a remarkable amount of their research budgets on software, data analysis, and data management systems. The National Institutes of Health (NIH) and the National Science Foundation (NSF) have emphasized the need for contemporary software to be well-documented, interoperable, and extensible. However, laboratories often invest significant resources for personnel to build bespoke bioinformatics tools. This can have a marked impact on productivity and ROI, because these software tools often do not perform the way needed or hidden costs arise unexpectedly because of inefficiencies in the software. In this poster, we present a framework that Biomatters developed as a practical evaluation process to assist core facility managers and principal investigators to determine the best tools for DNA/RNA/protein sequence analysis and molecular cloning. The evaluation was performed on commercial & open source software packages using six criteria: 1) user interface, 2) data management, 3) data analysis, 4) feature availability, 5) extensibility, and 6) support. Thirteen software packages were evaluated (six commercial and seven free packages) using our six-tiered framework. This framework is useful for efficiently discriminating the strengths and weaknesses of the various packages, standardizes this process, and is helpful in reducing the amount of time spent on the evaluation process.
Molecular dynamics (MD) simulations provide valuable insight into biomolecular systems at the atomic level. Notwithstanding the ever-increasing power of high performance computers current MD simulations face several challenges: the fastest atomic movements require time steps of a few femtoseconds which are small compared to biomolecular relevant timescales of milliseconds or even seconds for large conformational motions. At the same time, scalability to a large number of cores is limited mostly due to long-range interactions. An appealing alternative to atomic-level simulations is coarse-graining the resolution of the system or reducing the complexity of the Hamiltonian to improve sampling while decreasing computational costs. Native structure-based models, also called Gō-type models, are based on energy landscape theory and the principle of minimal frustration. They have been tremendously successful in explaining fundamental questions of, e.g., protein folding, RNA folding or protein function. At the same time, they are computationally sufficiently inexpensive to run complex simulations on smaller computing systems or even commodity hardware. Still, their setup and evaluation is quite complex even though sophisticated software packages support their realization.
Here, we establish an efficient infrastructure for native structure-based models to support the community and enable high-throughput simulations on remote computing resources via GridBeans and UNICORE middleware. This infrastructure organizes the setup of such simulations resulting in increased comparability of simulation results. At the same time, complete workflows for advanced simulation protocols can be established and managed on remote resources by a graphical interface which increases reusability of protocols and additionally lowers the entry barrier into such simulations for, e.g., experimental scientists who want to compare their results against simulations. We demonstrate the power of this approach by illustrating it for protein folding simulations for a range of proteins.
We present software enhancing the entire workflow for native structure-based simulations including exception-handling and evaluations. Extending the capability and improving the accessibility of existing simulation packages the software goes beyond the state of the art in the domain of biomolecular simulations. Thus we expect that it will stimulate more individuals from the community to employ more confidently modeling in their research.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-292) contains supplementary material, which is available to authorized users.
Protein folding; RNA folding; Native structure-based model; Molecular dynamics; GridBeans
The High-performance Integrated Virtual Environment (HIVE) is a high-throughput cloud-based infrastructure developed for the storage and analysis of genomic and associated biological data. HIVE consists of a web-accessible interface for authorized users to deposit, retrieve, share, annotate, compute and visualize Next-generation Sequencing (NGS) data in a scalable and highly efficient fashion. The platform contains a distributed storage library and a distributed computational powerhouse linked seamlessly. Resources available through the interface include algorithms, tools and applications developed exclusively for the HIVE platform, as well as commonly used external tools adapted to operate within the parallel architecture of the system. HIVE is composed of a flexible infrastructure, which allows for simple implementation of new algorithms and tools. Currently, available HIVE tools include sequence alignment and nucleotide variation profiling tools, metagenomic analyzers, phylogenetic tree-building tools using NGS data, clone discovery algorithms, and recombination analysis algorithms. In addition to tools, HIVE also provides knowledgebases that can be used in conjunction with the tools for NGS sequence and metadata analysis.
big data; bioinformatics; high-performance cloud computing; high-throughput sequencing; next-generation sequencing; genomics
The accurate determination of orthology and inparalogy relationships is essential for comparative sequence analysis, functional gene annotation and evolutionary studies. Various methods have been developed based on either simple blast all-versus-all pairwise comparisons and/or time-consuming phylogenetic tree analyses.
We have developed OrthoInspector, a new software system incorporating an original algorithm for the rapid detection of orthology and inparalogy relations between different species. In comparisons with existing methods, OrthoInspector improves detection sensitivity, with a minimal loss of specificity. In addition, several visualization tools have been developed to facilitate in-depth studies based on these predictions. The software has been used to study the orthology/in-paralogy relationships for a large set of 940,855 protein sequences from 59 different eukaryotic species.
OrthoInspector is a new software system for orthology/paralogy analysis. It is made available as an independent software suite that can be downloaded and installed for local use. Command line querying facilitates the integration of the software in high throughput processing pipelines and a graphical interface provides easy, intuitive access to results for the non-expert.