Which brings us, at last, to 'cloud computing.' This is a general term for computation-as-a-service. There are several types of cloud computing, but the one closest to the way computational biologists currently work depends on the concept of a 'virtual machine'. In the traditional economic model of computation, customers purchase server, storage and networking hardware, configure it the way they need, and run software on it. In computation-as-a-service, customers instead rent the hardware and storage for as long or as short a time as they need to achieve their goals. Customers pay only for the time the rented systems are running and only for the storage they actually use.
This model would be lunatic if the rented machines were physical ones. However, in cloud computing, the rentals are virtual: without ever touching a power cable, customers can power up a fully functional 10-computer server farm with a terabyte of shared storage, upgrade the cluster in minutes to 100 servers when needed for some heavy-duty calculations, and then return to the baseline 10-server system when the extra virtual machines are no longer needed.
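The pay-per-use arithmetic behind this model is easy to make concrete. The sketch below is a back-of-the-envelope calculation only; the hourly and storage rates are hypothetical placeholders, not actual provider prices.

```python
# Back-of-the-envelope sketch of the pay-per-use rental model described
# above. All rates are hypothetical placeholders, not real provider prices.

HOURLY_RATE = 0.10     # $ per server-hour (hypothetical)
STORAGE_RATE = 0.10    # $ per GB-month of storage actually used (hypothetical)

def monthly_cost(baseline_servers, burst_servers, burst_hours, storage_gb,
                 hours_in_month=720):
    """Rent a baseline cluster around the clock, plus a temporary burst."""
    baseline = baseline_servers * hours_in_month * HOURLY_RATE
    burst = burst_servers * burst_hours * HOURLY_RATE
    storage = storage_gb * STORAGE_RATE
    return baseline + burst + storage

# A 10-server farm with 1 TB of storage, bursting to 100 servers
# (90 extra machines) for a single 24-hour heavy-duty calculation:
cost = monthly_cost(baseline_servers=10, burst_servers=90,
                    burst_hours=24, storage_gb=1024)
```

The point of the exercise: the 24-hour burst to 100 servers adds only a modest increment to the monthly bill, whereas owning 100 physical servers would mean paying for 90 idle machines the rest of the month.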
The model works as follows: a service provider makes the capital expenditure of building an extremely large compute and storage farm (tens of thousands of nodes and petabytes of storage) with all the frills needed to maintain an operation of this size, including a dedicated system administration staff, storage redundancy, data centers distributed to strategically placed parts of the world, and broadband network connectivity. The service provider then implements the infrastructure that gives users the ability to create, upload and launch virtual machines on this compute farm. Because of economies of scale, the service provider can obtain highly discounted rates on hardware, electricity and network connectivity, and can pass these savings on to the end users to make virtual machine rental economically competitive with purchasing the real thing.
A virtual machine is a piece of software running on the host computer (the real hardware) that emulates the properties of a computer: the emulator provides a virtual central processing unit (CPU), network card, hard disk, keyboard and so forth. You can run the operating system of your choice on the virtual machine, log into it remotely via the internet, and configure it to run web servers, databases, load management software, parallel computation libraries, and any other software you favor. You may be familiar with virtual machines from working with consumer products such as VMware [35] or open source projects such as KVM [36]. A single physical machine can host multiple virtual machines, and software running on the physical server farm can distribute requests for new virtual machines across the farm in a way that intelligently balances load.
The experience of working with virtual machines is relatively painless. Choose the physical aspects of the virtual machine you wish to create, including CPU type, memory size and hard disk capacity, specify the operating system you wish to run, and power up one or more machines. Within a couple of minutes, your virtual machines are up and running. Log into them over the network and get to work. When a virtual machine is not running, you can store an image of its bootable hard disk. You can then use this image as a template from which to start up multiple virtual machines, which is how you can launch a virtual compute cluster in a matter of minutes.
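The image-as-template workflow just described can be sketched as a minimal in-memory model. The provider API below is invented for illustration; real services expose analogous operations (store a bootable disk image, launch N instances from it, terminate them).

```python
# Minimal in-memory model of the image-as-template workflow: a stored disk
# image serves as the template for launching many identical virtual machines.
# The API names here are invented for illustration, not a real provider's.
import itertools

class CloudSim:
    def __init__(self):
        self._ids = itertools.count(1)
        self.images = {}      # image_id -> description of the bootable disk
        self.instances = {}   # instance_id -> (image_id, state)

    def register_image(self, description):
        """Store a bootable hard disk image for later reuse."""
        image_id = f"img-{next(self._ids)}"
        self.images[image_id] = description
        return image_id

    def launch(self, image_id, count):
        """Boot `count` virtual machines from one stored disk image."""
        ids = []
        for _ in range(count):
            instance_id = f"vm-{next(self._ids)}"
            self.instances[instance_id] = (image_id, "running")
            ids.append(instance_id)
        return ids

    def terminate(self, instance_ids):
        """Power down machines; only running time would be billed."""
        for instance_id in instance_ids:
            image_id, _ = self.instances[instance_id]
            self.instances[instance_id] = (image_id, "terminated")

cloud = CloudSim()
image = cloud.register_image("Linux + analysis software")
cluster = cloud.launch(image, count=10)   # a 10-node virtual cluster
running = [i for i, (_, s) in cloud.instances.items() if s == "running"]
```

The essential property the model captures is that launching ten machines is no more work than launching one: the image is configured once and cloned on demand.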
For the field of genome informatics, a key feature of cloud computing is the ability of service providers and their customers to store large datasets in the cloud. These datasets typically take the form of virtual disk images that can be attached to virtual machines as local hard disks and/or shared as networked volumes. For example, the entire GenBank archive could be (and in fact is, see below) stored in the cloud as a disk image that can be loaded and unloaded as needed.
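The attach/unload idea can be expressed in a few lines. This is a toy model only: the dataset name, sizes and methods are invented for illustration, and a real provider would handle attachment at the block-device or network-volume level.

```python
# Toy model of datasets stored as virtual disks that machines attach and
# detach as needed. Names, sizes and methods are illustrative placeholders.

class Volume:
    """A virtual disk image holding a public dataset."""
    def __init__(self, name, size_gb):
        self.name, self.size_gb = name, size_gb

class Machine:
    """A virtual machine that can mount dataset volumes."""
    def __init__(self):
        self.mounts = {}   # mount point -> Volume

    def attach(self, volume, mount_point):
        self.mounts[mount_point] = volume

    def detach(self, mount_point):
        return self.mounts.pop(mount_point)

genbank = Volume("genbank-archive", size_gb=500)   # size is a placeholder
vm1, vm2 = Machine(), Machine()
vm1.attach(genbank, "/data/genbank")   # attached as a local hard disk...
vm2.attach(genbank, "/data/genbank")   # ...or shared between machines
```

A single stored copy of the archive can thus back many running machines, which is what removes the need for every group to download and maintain its own mirror.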
Figure 3 shows what the genome informatics ecosystem might look like in a cloud computing environment. Here, instead of separate copies of genome datasets being stored at diverse locations and copied by each group to local machines for analysis, most datasets are stored in the cloud as virtual disks and databases. Web services that run on top of these datasets, including both the primary archives and the value-added integrators, run as virtual machines within the cloud. Casual users, who are accustomed to accessing the data via the web pages at NCBI, DDBJ, Ensembl or UCSC, continue to work with the data in their accustomed way; the fact that these servers are now located inside the cloud is invisible to them.
Figure 3 The 'new' genome informatics ecosystem based on cloud computing. In this model, the community's storage and compute resources are co-located in a 'cloud' maintained by a large service provider. The sequence archives and value-added integrators maintain ...
Power users can continue to download the data, but they now have an attractive alternative. Instead of moving the data to the compute cluster, they move the compute cluster to the data. Using the facilities provided by the service provider, they configure a virtual machine image that contains the software they wish to run, launch as many copies as they need, mount the disks and databases containing the public datasets they need, and do the analysis. When the job is complete, their virtual cluster sends them the results and then vanishes until it is needed again.
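The power user's pattern of fanning an analysis out across a temporary cluster and collecting the results is, in essence, scatter/gather. The sketch below simulates it with a deliberately simple task (GC content of sequence records); the function names and the task itself are illustrative, not part of the ecosystem described in the article.

```python
# Toy scatter/gather sketch of the "move the compute to the data" pattern:
# split an analysis across N (here simulated) worker machines, merge the
# results, and let the workers vanish. The GC-content task is illustrative.

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def scatter(records, n_workers):
    """Deal (name, sequence) records round-robin across n_workers machines."""
    shards = [[] for _ in range(n_workers)]
    for i, rec in enumerate(records):
        shards[i % n_workers].append(rec)
    return shards

def gather(shards):
    """Each 'worker' analyses its shard; merged results come back to the user."""
    results = {}
    for shard in shards:
        for name, seq in shard:
            results[name] = gc_content(seq)
    return results

records = [("r1", "GGCC"), ("r2", "ATAT"), ("r3", "GATC")]
results = gather(scatter(records, n_workers=2))
```

Because the public datasets are already mounted inside the cloud, only the small merged result needs to travel back over the network to the user; the terabytes of input never leave the provider's data center.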
Cloud computing also creates a new niche in the ecosystem for genome software developers to package their work in the form of virtual machines. For example, many genome annotation groups have developed pipelines for identifying and classifying genes and other functional elements. Although many of these pipelines are open source, packaging and distributing them for use by other groups has been challenging given their many software dependencies and site-specific configuration options. In a cloud computing environment these pipelines can be packaged into virtual machine images and stored in a way that lets anyone copy them, run them and customize them for their own needs, thus avoiding the software installation and configuration complexities.