High-throughput genomic technologies continue to move in a direction where data yield from the instruments is increasing, while the cost of acquiring the technology is continuously decreasing. For example, the introduction of benchtop genome sequencers such as the MiSeq from Illumina [1] has made complete sequencing of viral, bacterial, and small fungal genomes affordable to small laboratories. Nonetheless, acquiring the sequence is only the first step, and must be followed by large-scale computational analysis to process the data, test hypotheses and draw scientific insights. Therefore, investment in a sequencing instrument would normally be accompanied by substantial investment in computer hardware, skilled informatics support, and bioinformaticians competent in configuring and using specific software to analyze the data.
An alternative model is now available: computational capacity can be purchased as a service from a cloud computing provider, and specialized computational systems can be run on such platforms [2]. Cloud infrastructures provide researchers with the ability to perform computations using a practically unlimited pool of Virtual Machines (VMs), without the burden of owning or maintaining hardware [3]. Cloud computing services use a charge model similar to utilities such as electricity, and thus customers are billed based on the amount of computing resources consumed [4]. Along these lines, the Cloud BioLinux project offers an on-demand cloud computing solution for the bioinformatics community, and is available for use on private or publicly accessible, commercially hosted cloud computing infrastructure such as Amazon EC2. For small laboratories without access to large computational resources, running Cloud BioLinux through a commercial cloud platform provides a cost-effective route from data to knowledge, while those with access to private clouds will still benefit from the abundance of pre-configured software and the user-friendly desktop interface available.
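To make the utility-style charge model concrete, the following minimal Python sketch estimates the bill for a single analysis run. It is illustrative only: the function name `estimate_cost`, the hourly rate, and the assumption that usage is rounded up to whole billing increments are placeholders, not the pricing rules of any particular provider.

```python
from math import ceil


def estimate_cost(runtime_hours, hourly_rate, n_vms=1, billing_increment_hours=1.0):
    """Estimate a pay-per-use bill for running n_vms identical VMs.

    Assumption (hypothetical): each VM is billed in whole increments,
    so partial increments are rounded up before applying the rate.
    """
    if runtime_hours < 0 or hourly_rate < 0 or n_vms < 0:
        raise ValueError("runtime, rate and VM count must be non-negative")
    if billing_increment_hours <= 0:
        raise ValueError("billing increment must be positive")
    # Round the runtime up to the next full billing increment.
    billed_units = ceil(runtime_hours / billing_increment_hours)
    billed_hours = billed_units * billing_increment_hours
    return billed_hours * hourly_rate * n_vms


if __name__ == "__main__":
    # Placeholder figures: 4 VMs for 2.5 hours at 0.50 currency units per hour.
    print(estimate_cost(2.5, 0.50, n_vms=4))
```

Under these assumed figures, the cost scales only with the time and number of VMs actually used, which is the contrast with an up-front hardware purchase drawn in the text.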
Cloud BioLinux takes advantage of the fact that VMs provide a mechanism for whole system snapshot exchange [5]. With this approach, the operating system, software tools and databases are encapsulated into a single digital image of the computing system that is readily archived and restored for later use. A snapshot captures all changes made inside a VM server from its initial execution up to the point of snapshot creation. These changes include, for example, user-uploaded data, configuration settings and analysis results generated by running bioinformatic pipelines. A snapshot is also an executable VM, and can be shared with other users of the cloud, therefore allowing collaborating researchers to share uploaded data, analysis results and bioinformatics tools as a single digital image. Having access to specialized VMs with scientific results for a particular scientific domain can greatly speed up research, as it substantially decreases, and in many cases removes, the time required for an individual to configure the computing system with data and software to meet their research needs.
An early pioneering effort to provide such a system for the bioinformatics community was NEBC BioLinux [6]. NEBC BioLinux contains over 135 bioinformatics packages, including the blastall and blast+ NCBI applications, the Staden toolkit, EMBOSS, hmmer, and phylip collections of software, many stand-alone applications for tasks such as sequence alignment, clustering, assembly, display, editing, and phylogeny, as well as tools for working with next generation sequencing data. The system is also designed to allow setting up and maintaining a data analysis environment with very little informatics expertise, running a "live system" from a DVD or USB stick (without modifications to the user's workstation), or installing it to the hard drive with a simple graphical installer.
Building on the bioinformatics packages, documentation and desktop interface of NEBC BioLinux release 6.0, we developed Cloud BioLinux by leveraging VM technology and the cloud to offer a pre-configured, high-performance bioinformatics computing solution. Included by default are all bioinformatic software packages from NEBC BioLinux, in addition to next-generation sequencing data analysis tools including, for example, the Fastx utilities, SAM and BAM toolsets, Genome Analysis Toolkit (GATK), BWA, Novoalign and Bowtie aligners, the Mummer toolkit, and the Velvet, SSAKE, Mira, Newbler and Cap3 genome sequence assemblers. Furthermore, bioinformatic code libraries such as BioPython, BioPerl, BioRuby and BioJava, as well as the R language with Bioconductor, are included. Besides the pre-installed software, Cloud BioLinux provides scripts for accessing a repository of reference genomes (human, mouse, D. melanogaster, A. thaliana, X. tropicalis, S. cerevisiae and C. elegans) on an Amazon S3 bucket. The reference genomes are pre-indexed for a number of popular sequence alignment software packages, including BWA, Bowtie and Novoalign. A script and configuration files are included as part of the Cloud BioLinux framework for selecting indexed genomes and installing them directly from the cloud storage on a running VM.
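The selection step described above, choosing which pre-indexed genomes to install for which aligners, can be sketched as follows. This is not the script shipped with Cloud BioLinux: the bucket name, key layout, destination directory and the function `plan_downloads` are all hypothetical, and the sketch only computes a download plan rather than contacting any storage service.

```python
# Genomes and aligners named in the text; everything else below is assumed.
GENOMES = (
    "human", "mouse", "D. melanogaster", "A. thaliana",
    "X. tropicalis", "S. cerevisiae", "C. elegans",
)
ALIGNERS = ("bwa", "bowtie", "novoalign")


def plan_downloads(selection, bucket="example-genome-bucket", dest="/tmp/genomes"):
    """Turn a {genome: [aligner, ...]} selection into (source, target) pairs.

    The URL scheme '<genome>/<aligner>/index.tar.gz' is a hypothetical
    layout chosen for illustration, not the real repository structure.
    """
    plan = []
    for genome, aligners in selection.items():
        if genome not in GENOMES:
            raise ValueError(f"unknown genome: {genome}")
        # Build a filesystem-friendly name, e.g. 'S. cerevisiae' -> 's_cerevisiae'.
        slug = genome.lower().replace(". ", "_")
        for aligner in aligners:
            if aligner not in ALIGNERS:
                raise ValueError(f"no index for aligner: {aligner}")
            source = f"s3://{bucket}/{slug}/{aligner}/index.tar.gz"
            target = f"{dest}/{slug}/{aligner}"
            plan.append((source, target))
    return plan


if __name__ == "__main__":
    # Example selection, standing in for a user-edited configuration file.
    for source, target in plan_downloads({"S. cerevisiae": ["bwa", "bowtie"]}):
        print(source, "->", target)
```

Separating the plan from the transfer keeps the selection logic testable without network access; an actual installer would pass each pair to whatever storage client is available on the VM.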
Detailed documentation for each tool included in Cloud BioLinux is available as a set of HTML pages structured as a mini-website, which is linked from the main Cloud BioLinux website and from an icon on the graphical interface of the VM (in addition, readers can download the complete set of documentation as a compressed file from Additional file 1; Suppl.1). This self-contained mini-website allows users to select documentation for the installed packages from a drop-down menu, where the applications are grouped based on their functionality, with some example groups including Statistics, Alignment, Clustering, Databases, Microarrays and Phylogeny.
End-users can instantiate Cloud BioLinux VMs using only a web browser on a local desktop computer to access the Amazon EC2 cloud console, and then log in to the rich graphical interface using a remote connection, without the need for any advanced technical knowledge. An example remote desktop client is the one by NoMachine [7], which is available at no charge for Windows, Mac or Linux computers. For advanced users and developers, we have implemented an automated software management framework, which allows complete customization of the bioinformatics tools included in the Cloud BioLinux VM, while also enabling easy updates and deployment on different cloud platforms. Since the project is fully open-source, researchers and software developers at their laboratories can freely download, modify, and run the VMs on a public or private cloud.
In the following sections we first present the technical details of the Cloud BioLinux software management framework, and how it can be leveraged for creating customized VMs and deploying them to different cloud platforms. Then we detail how end-users without access to local computing infrastructure can run Cloud BioLinux by simply using a desktop computer with Internet access. Finally, we discuss our plans for further development of this project.