|Home | About | Journals | Submit | Contact Us | Français|
Today we can generate hundreds of gigabases of DNA and RNA sequencing data in a week for less than US$5,000. The astonishing rate of data generation by these low-cost, high-throughput technologies in genomics is being matched by that of other technologies, such as real-time imaging and mass spectrometry-based flow cytometry. Success in the life sciences will depend on our ability to properly interpret the large-scale, high-dimensional data sets that are generated by these technologies, which in turn requires us to adopt advances in informatics. Here we discuss how we can master the different types of computational environments that exist — such as cloud and heterogeneous computing — to successfully tackle our big data problems.
The wave of new technologies in genomics — such as ‘third-generation’ sequencing technologies1, sophisticated imaging systems and mass spectrometry-based flow cytometry2 — are enabling data to be generated at unprecedented scales. As a result, we can monitor the expression of tens of thousands of genes simultaneously3,4, score hundreds of thousands of SNPs in individual samples5, sequence an entire human genome for less than US$5,000 (REF. 6) and relate these data patterns to other biologically relevant information.
In under a year, genomics technologies will enable individual laboratories to generate terabyte or even petabyte scales of data at a reasonable cost. However, the computational infrastructure that is required to maintain and process these large-scale data sets, and to integrate them with other large-scale sets, is typically beyond the reach of small laboratories and is increasingly posing challenges even for large institutes.
Luckily, the computational field is rife with possibilities for addressing these needs. Life scientists have begun to borrow solutions from fields such as high-energy particle physics and climatology, which have already passed through similar inflection points. Companies such as Microsoft, Amazon, Google and Facebook have also become masters of petabyte-scale data sets — as they have been linking pieces of data that are distributed over a massively parallel architecture in response to a user’s requests and presenting them to the user in a matter of seconds. Following these advances made by others, we provide an overview and guidance on the types of computational environments that currently exist and that, in the immediate future, can tackle many of the big data problems now being faced by the life sciences.
Computational solutions range from cloud-based computing to an emerging revolution in high-speed, low-cost heterogeneous computational environments. But are life scientists ready to embrace these possibilities? If you are a life scientist confronted with the task of analysing a Mount Everest of data, and you are wondering how to derive meaning from them using Microsoft Excel 2007’s 1,048,576 row and 16,384 column limit, then this article will take you through the steps needed to allow you to compete with more computationally savvy groups.
In this Review, we define the typical workflows associated with the generation of high-throughput biological data, the challenges in those workflows, and how cloud computing and heterogeneous computational environments can help us to overcome these challenges. We then describe how complex data sets can be distilled to obtain higher-order biological relationships, and we discuss the direction that computation must take to help us to further our understanding at the cellular, tissue, organism, population and community levels.
Understanding how living systems operate will require the integration of the many layers of biological information that high-throughput technologies are generating.
As an example, the amount of data from large projects such as 1000 Genomes will collectively approach the petabyte scale for the raw information alone. The situation will soon be exacerbated by third-generation sequencing technologies7 that will enable us to scan entire genomes, microbiomes and transcriptomes and to assess epigenetic changes directly8 in just minutes, and for less than US$100. To this should be added data from imaging technologies, other high-dimensional sensing methods and personal medical records. Although processing individual data dimensions is complex (for example, uncovering functional DNA variation in multiple cancer samples using whole-genome sequencing), the true challenge is in integrating the multiple sources of data. Mining such large high-dimensional data sets poses several hurdles for storage and analysis. Among the most pressing challenges are: data transfer, access control and management; standardization of data formats; and accurate modelling of biological systems by integrating data from multiple dimensions.
Analysis results can markedly increase the size of the raw data, given that all relationships among DNA, RNA and other variables of interest are stored and mined. Therefore, it is important to efficiently move these big data sets around the internet, to provide for access control if the data are stored centrally (to reduce storage costs) and to properly organize large-scale data in ways that facilitate analyses. Network speeds are too slow to enable terabytes of data to be routinely transferred over the web. Currently, the most efficient mode of transferring large quantities of data is to copy the data to a big storage drive and then ship the drive to the destination. This is inefficient and presents a barrier for data exchange between groups. One solution is to house the data sets centrally and bring the high-performance computing (HPC) to the data. Although this is an attractive solution, it presents access control challenges, as groups generating the data may want to retain control over who can access the data before they are published. Furthermore, controlling access to big data sets requires IT support, which is costly.
Mining the data for discovery crucially requires managing and organizing big data sets — consider, for example, the task of comparing whole-genome sequence data from multiple tumour and matched adjacent normal tissue pairs. Retrieving sequences, over all pairs, that map to many different genomic regions would not be trivial on inappropriately organized data.
Different centres generate data in different formats, and some analysis tools require data to be in particular formats or require different types of data to be linked together. Thus, time is wasted reformatting and re-integrating data multiple times during a single analysis. For example, next-generation sequencing companies do not deliver raw sequencing data in a format common to all platforms, as there is no industry-wide standard beyond simple text files that include the nucleotide sequence and the corresponding quality values. As a result, carrying out sequence analyses across different platforms requires tools to be adapted to specific platforms.
It is therefore crucial to develop interoperable sets of analysis tools that can be run on different computational platforms depending on which is best suited for a given application, and then stitch those tools together to form analysis pipelines.
A primary goal for biological researchers is to integrate diverse, large-scale data sets to construct models that can predict complex phenotypes such as disease. As mentioned above, constructing predictive models can be computationally demanding. Consider, for example, reconstructing Bayesian networks using large-scale DNA or RNA variation, DNA–protein binding, protein interaction, metabolite and other types of data. As the scales and diversity of the data grow, this type of modelling will become increasingly important for representing complex systems and predicting their behaviour. Computationally, however, the need for this type of modelling poses an intense problem that falls into the category of NP hard problems9 (FIG. 1). Finding the best Bayesian network by searching through all possible networks is a complex process; this is true even in cases in which there are only ten genes (or nodes), given that there would be in the order of 1018 possible networks. As the number of nodes increases, the number of networks to consider grows superexponentially. The computational environments that are required to organize vast amounts of data, build complex models from them and then enable others to interpret their data in the more informative context of existing models are beyond what is available today in the life sciences.
Addressing big data and computational challenges requires efficiently targeting limited resources — money, power, space and people — to solve an application of interest. In turn, this requires understanding and exploiting the nature of the data and the analysis algorithms. Factors that must be taken into account to solve a particular problem most efficiently include: the size and complexity of the data; the ease with which data can be efficiently transported over the internet; whether the algorithm to apply to the data can be efficiently parallelized; and whether the algorithm is simple (for example, an algorithm used to compute the mean and standard deviation of a vector of numbers) or complex (for example, an algorithm applied to reconstructing Bayesian networks through the integration of diverse types of large-scale data) (BOX 1).
Selecting the best computational platform for your problem of interest requires an understanding of the magnitude and complexity of the data, as well as the memory, network bandwidth and computational constraints of the problem. This is because different platforms have different strengths and weaknesses with respect to these constraints. One of the key technical metrics to consider is OPs/byte. Others are described below.
Disk- and network-bound applications may be more dependent on the broader system than on the type of processor. These types of applications benefit from targeted investment in system components or a distributed approach that assembles large, aggregate memory or disk bandwidth from clusters of low-cost, low-power components37. In some instances, such as during the construction of weighted co-expression networks, expensive special-purpose supercomputing resources may be required.
One of the most important aspects to consider for computing large data sets is the parallelization of the analysis algorithms. Computationally or data-intensive problems are primarily solved by distributing tasks over many computer processors. Because different algorithms used to solve a problem are amenable to different types of parallelization, different computational platforms (TABLE 1) are needed to achieve the best performance.
We can classify the types of parallelism into two broad categories: loosely coupled (or coarse-grained) parallelism and tightly coupled (or fine-grained) parallelism. In loosely coupled parallelism, little effort is required to break up a problem into parallel tasks. For example, consider the problem of computing all genetic associations between thousands of gene expression traits and hundreds of thousands of SNP genotypes assayed in a tissue-specific cohort10. Each SNP–trait pair (or a given SNP tested against all traits) can be computed independently of the other pairs, so the computation can be carried out on independent processors or even completely separate computers. Conversely, tightly coupled parallelism requires a substantial programming effort, and possibly specialized hardware, because communication between the different parallel tasks using specialized frameworks must be maintained with minimal delay. The message passing interface (MPI)11 is an example of such a framework. In the context of the example just given involving the genetics of gene expression, instead of testing for SNP–trait associations independently, one could seek to partition the traits into modules of interconnected expression traits that are associated with common sets of SNPs12. This same problem was also solved through a Bayesian approach12 using a Markov chain Monte Carlo (MCMC) method that combines the two types of parallelism: the construction of each Markov chain is a coarsely parallel problem, but the construction of each chain is also an example of a fine-grained parallelism in which the algorithm was coded using MPI.
Solutions to integrating the new generation of large-scale data sets require approaches akin to those used in physics, climatology and other quantitative disciplines that have mastered the collection of large data sets. Cloud computing and heterogeneous computational environments are relatively recent inventions that address many of the limitations mentioned above relating to data transfer, access control, data management, standardization of data formats and advanced model building (FIG. 2).
Compared to general purpose processors (GPPs), heterogeneous systems can deliver a tenfold increase or greater in peak arithmetic throughput for a few hundred US dollars. Cloud computing can make large-scale computational clusters readily available on a pay-as-you-need basis. But both approaches have trade-offs that result from trying to optimize for peak performance (heterogeneous systems) or low-cost and flexibility (cloud computing).
It is important that the working scientists understand the advantages and disadvantages of these different computational platforms and the problems to which they are best suited (TABLE 1). Computational platforms can be classified by the performance characteristic that they aim to optimize and how they allocate resources to do so. In the following sections we attempt to address exactly this issue, with particular attention to how heterogeneous systems and cloud computing have changed the traditional cost–capability trade-offs that users consider when deciding on the approach that best suits their computing needs. We provide an overview of these computational systems and directions for choosing the ideal computational architecture for an application of interest.
HPC has been transformed in the past decade by the maturation of ‘cluster-based computing’ (FIG. 2). With a wide range of components available (for example, different networks, computational nodes and storage systems), clusters can be optimized for many classes of computationally intense applications. Consider the annotation of genes predicted in novel bacterial genomes, an important research problem for understanding microbiomes and how they interact with our own genomes to alter phenotypes of interest13–15. Using BLASTP16 to search 6,000 predicted genes against, say, the non-redundant protein (NR) database is a moderately computationally intense problem. Searching one gene against the NR database on a standard desktop computer would take ~30 s, so searching all 6,000 predictions could take 4 days. Distributing the searches over 1,000 central processing units (CPUs) would complete the search in less than 10 min. Although not all applications experience improvements on this scale, cluster-based computing can markedly accelerate these types of operations. However, there is a significant cost associated with building and maintaining a cluster. Even after the hardware components are acquired and assembled to build an HPC cluster, substantial costs are associated with operating the cluster, including those of space, power, cooling, fault recovery, backup and IT support. In this sense a cluster contrasts with grid computing, an earlier form of cluster-like computing. Here, a combination of loosely coupled networked computers that are administrated independently work together on common computational tasks at low or no cost, as individual computer owners volunteer their systems for such efforts (for example, the Folding@Home project).
More recently, supercomputing has been made more accessible and affordable through the development of virtualization technology. Virtualization software allows systems to behave like a true physical computer, but with the flexible specification of details such as number of processors, memory and disk size, and operating system. Multiple virtual machines can often be run from a single physical server, providing significant cost savings in server hardware, administration and maintenance.
The use of these on-demand virtual computers17,18 is known as cloud computing. The combination of virtual machines and large numbers of affordable CPUs has made it possible for internet-based companies such as Amazon, Google and Microsoft to invest in ‘mega-scale’ computational clusters and advanced, large-scale data-storage systems. These companies offer on-demand computing to tens of thousands of users simultaneously, with flexible computer architectures that can manipulate and process petabyte scales of data (BOX 2).
With the rise of remote, or cloud, computational resources, the ‘where’ has become as important a question for scientific computing as the ‘how’. At the terabyte, or even gigabyte, scale it is often more efficient to bring the computation to the data.
As a rule of thumb, 100,000 central processing unit (CPU) cycles are required to amortize the cost of transferring a single byte of data to a remote computational resource via the internet (equivalent to roughly 1 second per megabyte on current high-end laptops)38. For example, constructing a Bayesian network using state-of-the-art algorithms can require in the order of 1018 computational cycles on current high-end general purpose processors (GPPs). Given such computational demands on problems that may involve no more than hundreds of megabytes of data3,10,39, it is cost effective to pipe such data through the internet to high-end computational environments to carry out such an operation.
However, if summarizing the distribution of allele frequencies in a population of 50,000 individuals genotyped at 1,000,000 SNPs (about one trillion bytes, or a terabyte, of data), only 1012 computational cycles are required, orders of magnitude less than the 1017 computational cycles that one would need to run to break even in piping a terabyte of data over the internet.
Even so, remote systems can offer compelling ease of use and significant cost efficiencies, and in many cases large shared data sets are already being hosted in the cloud. Again, understanding how best to spend one’s resources is key. The online presentation associated with this paper (‘Computational solutions to large-scale data management’) provides a decision tree that can be used to help users decide on the most appropriate platform for their problem.
Cloud computing platforms offer a convenient solution for addressing computational challenges in modern genetics research, beyond what could be achieved with traditional on-site clusters. For example, paying for cloud computing is typically limited to what you use — that is, the user requests a computer system type on demand to fit their needs and only pays for the time in which they used an instance of that system. Administrative functions such as backup and recovery are included in this cost. This pay-as-you-go model provides enormous flexibility: a computational job that would normally take 24 hours to complete on a single virtual computer can be sped up to 1 hour on 24 virtual computers for the same nominal cost. Such flexibility is difficult to achieve in the life sciences outside the large data centres in the genome sequencing centres or larger research institutes, but it can be achieved by anyone leveraging the cloud computing services offered by large information centres such as Amazon or Microsoft. In fact, Microsoft Research and the US National Science Foundation (NSF) recently initiated a programme to offer individual researchers and research groups selected through NSF’s merit review process free access to advanced cloud computing resources (see ‘Microsoft and the National Science Foundation enable research in the cloud’ on the Microsoft News Center website).
In addition to flexibility, cloud computing addresses one of the challenges relating to transferring and sharing data, because data sets and analysis results held in the cloud can be shared with others. For example, Amazon Web Services provides access to many useful data sets, such as the Ensembl and 1000 Genomes data. Also, to minimize cost and maximize flexibility, cloud vendors offer almost all aspects of the computer system, not just the CPU, as a service. For example, persistent data are often maintained in networked storage services, such as Amazon S3.
The downsides of cloud computing are a reduced control over the distribution of the computation and the underlying hardware, and the time and cost that are required to transfer large volumes of data to and from the cloud. Although cloud computing offers flexibility and easy access to big computational resources, it does not solve the data transfer problem. Network bandwidth issues make the transfer of large data sets into and out of the cloud or between clouds impractical, making it difficult for groups using different cloud services to collaborate easily.
Furthermore, there are privacy concerns relating to the hosting of data sets on publicly accessible servers, as well as issues related to storage of data from human studies (for example, those posting human data must ensure compliance with the Health Insurance Portability and Accountability Act (HIPAA)).
Several years ago, a distributed computing paradigm known as MapReduce emerged to simplify the development of massively parallel computing applications and provide better scalability and fault tolerance for computing procedures involving many (>100) simultaneous processes19. In the context of distributed computing, MapReduce refers to the splitting of a problem into many homogeneous sub-problems in a ‘map’ step, followed by a ‘reduce’ step that combines the output of the smaller problems into the whole desired output. Whereas the Google implementation of distributed MapReduce remains proprietary, an open-source implementation of the concept is widely available through the Hadoop project19. In addition to meeting the required computational demand, this project addresses several of the data access and management challenges, as it provides for useful abstractions such as distributed file systems, distributed query language and distributed databases. An example of a problem that can be solved using MapReduce is that of aligning raw sequence reads that have been generated from a whole-genome sequencing run against those of a reference genome. The homogeneous sub-problems in this case (the map step) consist of aligning individual reads against the reference genome. Once all reads are aligned, the reduce step consists of aggregating all of the aligned reads into a single alignment file. In a later section, we cover in detail how you could implement this problem using the Amazon Web Services’ Elastic MapReduce resource.
The combination of distributed MapReduce and cloud computing can be an effective answer for providing petabyte-scale computing to a wider set of practitioners. Several recent papers have demonstrated the feasibility of this concept by implementing MapReduce workflows on cloud-based resources for searching sequence databases20 and aligning raw sequencing reads to reference genomes21,22. The work by Langmead21 advances this trend one step further by converting a traditional two-stage workflow — sequence alignment followed by consensus calling — into a single application of a MapReduce workflow. The resultant pipeline scales well with additional computational resources: a whole-genome SNP analysis that started with 38× sequencing coverage of the human genome was completed in less than 3 hours using a 320-CPU cluster.
The MapReduce model is often a straightforward fit for data-parallel approaches in which a single task, such as read alignment, is split into smaller identical sub-tasks that operate on an independent subset of the data. Many large-scale analyses in biology fit into this ‘embarrassingly parallel’ decomposition (for example, sequence searches, image recognition, read alignments and protein ID by mass spectrometry). The use of distributed file systems (implicit in distributed MapReduce) becomes a necessity when operating on data sets of a terabyte scale or larger.
One of the bigger advantages of combining MapReduce and cloud computing is the reuse of a growing community of developers and software tools, and with just a few mouse clicks you can set up a MapReduce cluster with many nodes. For example, the Amazon EC2 cloud-computing environment provides a specific service for streamlining the set-up and running of Hadoop-based workflows (see the Amazon Machine Images website). The ease with which these workflows can be established has no doubt fuelled the movement of companies such as Eli Lilly to carry out their bioinformatics analyses on Amazon’s EC2 cloud-computing environment23.
As a measure of how ready this platform is for widespread use, Hadoop is now being taught to undergraduate students across the United States24. A new generation of computer users is therefore being trained to view the internet, rather than a laptop computer, as its computing device.
However, not all large-scale computations are best treated with a simple MapReduce approach. For example, algorithms with complex data access patterns, such as those for reconstructing probabilistic gene networks, require special treatment. Distributed MapReduce should be viewed as an emerging successful model for petabyte-scale computing, but not necessarily the only model to consider when mapping a particular problem to larger-scale computing.
Several emerging trends in cloud computing will influence petabyte-scale computing for biology. The analysis of large biological data sets retrieved from standard gateways, such as Ensembl, GenBank, Protein Data Bank (PDB), Gene Expression Omnibus (GEO), UniGene and the database of Genotypes and Phenotypes (dbGAP), can incur high network traffic and, consequently, generate redundant copies of the data that are distributed around the world. By placing these types of data sets into cloud-based storage and attaching them to cloud-based analysis clusters, biological analyses can be brought to the data without creating redundant copies and with a greatly reduced transfer start-up time25. The use of cloud resources in this context is also being seen as an ideal solution for data storage from biomedical consortia efforts that require the consolidation of large data sets from multiple distributed sites around the globe26,27.
Standardization and easy access to petabyte-scale computing will soon enable an ecosystem of reusable scientific tools and workflows. One would simply transfer or point to their large data set, select their favourite workflow of choice and receive results back in the form of files in cloud-based storage. The foundations for this model of reusable petabyte-scale workflows can be seen in the success of virtual appliance marketplaces such as those hosted by Amazon, 3Tera, VMware and Microsoft.
A complementary system to cloud-based computing is the use of heterogeneous multiple core (one-CPU) computers — integrated specialized accelerators that increase peak arithmetic throughput by 10-fold to 100-fold and can turn individual computers, the work-horses of both desktop and cluster computers, into mini-supercomputers28.
The accelerators consist of graphics processing units (GPUs) that operate alongside the multi-core GPPs that are commonly found in desktop and laptop computers. Similar to cloud computing, heterogeneous systems are helping to expand access to HPC capabilities over a broad range of applications. Modern GPUs were driven by the videogame industry to deliver ever more realistic, real-time gaming environments to consumers. Although the working scientist is more likely to encounter GPUs, some vendors sell field-programmable gate array (FPGA)-based accelerators for genomic applications (for example, the CLC Bioinformatics Cube). Given that every modern computer includes a GPU, many of which are usable for general purpose computing, GPUs can be purchased for between US$100 and US$1,500. GPUs from NVIDIA and AMD/ATI offer ‘cluster-scale’ performance (more than one trillion floating point operations (FLOPS) per second) in an add-on card that costs only several hundred dollars.
General purpose heterogeneous computing has become feasible only in the past few years, but it has already yielded notable successes for biological data processing. The Folding@Home project uses a distributed client for protein folding simulation, with clients available for GPUs. Although GPUs constitute only 5% of the processors that are actively engaged in this project, they contribute 60% of the total FLOPS29,30. The GPU port of NAMD, a widely used program for molecular dynamics simulation, running on a 4-GPU cluster outperforms a cluster with 16 quad-core GPPs (48 cores). A GPU version of the sequence alignment program MUMmer was developed and achieved a roughly three- to fourfold speed increase over the serial-CPU version31. Several important algorithms have been developed for the GPU for annotating genomes, identifying SNPs and identifying RNA isoforms. They include CUDASW++, a GPU version of the Smith–Waterman sequence database search algorithm32, and Infernal, a novel RNA alignment tool33. Finally, in our own work in Bayesian network learning, the GPU implementation is 5 to 7.5 times faster than the heavily optimized GPP implementation34.
In addition to these types of applications, programmers are writing GPU-accelerated plug-ins for commonly used functions in high-level programming frameworks such as R and Matlab. Furthermore, GPU-optimized libraries of software functions are also being developed, including cuBLAS for basic linear algebra operations, cuFFT for carrying out fast Fourier transforms and Thrust for large vector operations.
With speed increases of fivefold to tenfold, a laptop can be used in place of a dedicated workstation, a workstation in place of a small cluster, or a small cluster in place of a larger cluster. The improved performance results in significant cost savings to the researcher. Data can be stored and analysed locally without requiring any specialized infrastructure, such as heavy-duty cooling, high-amperage power or professional system administration. In this respect, heterogeneous systems complement cloud computing by radically changing the cost–capability trade-offs.
Because GPUs were primarily designed for real-time graphics and gaming, they are optimal for solving problems involving tightly coupled (or fine-grained) parallelism (TABLE 1). However, for such an application to benefit from a GPU, it must require sufficient computation to amortize the cost of transferring data between the GPP and the accelerator (the local analogue of computing on a local machine versus transferring data to the cloud to carry out computations on the data). For example, in constructing Bayesian networks with fewer than 25 nodes, the overhead is too great, in relation to the amount of work done by the GPU, to achieve an advantage over a GPP. By contrast, for networks with more than 25 nodes, the GPUs provide a substantial advantage. This highlights an important point: ideally, applications such as Bayesian network reconstructions would be implemented to run on several different computer architectures, as the characteristics for any given problem define the architecture that is best suited to solve that particular instance. Although not all applications can take advantage of heterogeneous systems, when they can they will benefit from an efficient alternative to assembling similar capabilities from clusters of general purpose computers.
One of the primary challenges in leveraging heterogeneous computational environments for scientific computing is that most applications of interest to geneticists and others in the life sciences have not been ported to these environments. Significant informatics expertise is required to develop or modify applications to run effectively on GPUs or FPGAs. Heterogeneous systems improve performance and efficiency by exposing to programmers certain architectural features, such as vector arithmetic units, that are unavailable in GPPs. Not only is the code that executes on these accelerators different from its GPP counterpart, but often entirely different algorithms are needed to take advantage of the unique capabilities of the specialized accelerators. As a result, developing applications for these architectures is more challenging than for traditional GPPs.
The implementation of algorithms to be run in a general purpose GPU environment is carried out using specialized programming languages, such as CUDA programming (a proprietary programming model for NVIDIA GPUs)35. To implement an algorithm using CUDA, the programmer must write sequential code for a single ‘thread’ that processes a small subset of the total data set. Then, when the associated program is run on a data set of interest, large numbers of these individual processes are created in a grid to process the entire data set. The GPU operates in a separate memory space from the GPP, and often uses its own, much faster graphics memory. However, as noted above, the disadvantage here is that, before launching the program, the programmer must explicitly copy to the GPU the necessary data, and then copy back the results after completion.
The number of cloud service providers, big data centres and third-party vendors aiming to facilitate your entry into cloud computing is growing quickly. The options for transferring data to a cloud and begin computing on it can be overwhelming (TABLE 2). However, service providers have now found simpler means of allowing users to access the cloud. An example of a simpler approach to cloud computing is the Amazon Web Services Management Console (FIG. 3). The management console — which can be accessed from any web browser — provides a simple and intuitive interface for several uses: moving data and applications into and out of the Amazon S3 storage system; creating instances in Amazon EC2; and running data-processing programs on those instances, including the analysis of big data sets using MapReduce-based algorithms.
The best way to understand how to begin processing data in the cloud is to run through an example: aligning raw sequencing reads to a reference genome. This application is on the rise owing to the drop in whole-genome sequencing costs and the development of third-generation sequencing technologies, such as the Pacific Biosciences single-molecule, real-time sequencing (SMRT sequencing) approach1,8. One of the first processing steps to be carried out on raw sequencing reads is to align the reads to a reference genome to derive a consensus sequence that can then be used to identify novel mutations, structural variations or other interesting genomic features.
As discussed above, this application fits well into the MapReduce framework21. The input data are the raw reads from the sequencing run and a reference genome to which to map the reads. Because the reads are mapped independently to the reference genome, the input reads can be partitioned into sets that are then distributed over multiple processors or cores (the map part of MapReduce) to carry out the alignments more rapidly than could be done on a single processor. The alignment files generated in this way can then be aggregated into a single alignment file after all reads are mapped (the reduce part of MapReduce). The lower panel of FIG. 3 details the steps needed to carry out this type of process using the Amazon Elastic MapReduce (EMR) resource. These steps are done in a user-guided manner using the management console to create a job flow consisting of three simple steps: uploading your input data and applications to Amazon S3; configuring the job flow and submitting it; and retrieving results from S3.
In our example, the first step consists of assembling the input files and applications to carry out the alignment procedure and uploading these files into your own personal ‘bucket’ on Amazon S3. Using the management console, you simply select an option to create buckets for the input files, for the applications and scripts used to process the data and for the output results (FIG. 3). In the alignment problem for SMRT reads, the input data are the raw reads from the SMRT sequencer and a reference genome of interest (in FASTA format). For MapReduce, the raw reads need to be split into, say, hundreds of raw reads files so that they can be distributed by the Elastic MapReduce resource over many different cores on Amazon EC2. The long-read aligner application (ReadMatcher) and documentation for this tool can be downloaded from the Pacific Biosciences Developers Network website. With all files assembled, the management console can be used to upload the files onto Amazon S3.
The next step is defining the job flow via the management console, which again is a user-guided series of steps that are initiated by clicking a ‘create workflow’ button in the management console. In configuring the workflow the user specifies which buckets contain the input files, the applications and the output location, and then specifies the size and number of instances on Amazon EC2 to allocate for the job flow, which will define the amount of memory and number of cores to make available for computing the alignments. In the map portion of MapReduce, a wrapper script calls the ReadMatcher application with each of the segmented raw read files. Because each segmented file contains different information, multiple instances of ReadMatcher can operate on each file independently on each Amazon EC2 instance. In the reduce component of MapReduce, another script takes all of the independent alignment results and combines — or reduces — them into a single alignment file. The mapper and reducer scripts are the only programming steps that need to be taken for this example. The scripting for this example is straightforward and involves substantially less development time than would traditional multi-process or multi-threading programming.
After the job flow is configured it can be launched, again by submitting the job flow via the management console. Several tools are provided in the management console that can be used to monitor the progress of the run. The final step is retrieving the alignment results from Amazon S3. The results can be downloaded to your local system or can remain on S3 for shared use or for additional processing. For example, the alignment results in this example can be fed into another application, EviCons (available at the Pacific Biosciences Development Network), to derive a consensus sequence from the alignments, which can then be used to identify SNPs and other types of variation of interest.
The problems with data storage and analysis will continue to grow at a superexponential pace, with the complexity of the data only increasing as we are able to isolate and sequence individual cells, monitor the dynamics of single molecules in real time and lower the cost of the technologies that generate all of these data, such that hundreds of millions of individuals can be profiled. Sequencing DNA, RNA, the epigenome, the metabolome and the proteome from numerous cells in millions of individuals, and sequencing environmentally collected samples routinely from thousands of locations a day, will take us into the exabyte scales of data in the next 5–10 years. Integrating these data will demand unprecedented high-performance computational environments. Big genome centres, such as the Beijing Genomics Institute (BGI), are already constructing their own cloud-based computational environments with exabyte-scale data-storage capabilities and hundreds of thousands of cores.
Choosing the optimal computer architecture for storing, organizing and analysing big data sets requires an understanding of the problems one wishes to solve and the advantages of each of the architectures or mixture of architectures for efficiently solving such problems. The optimal solution may not always be obvious or may require a mixture of advanced computational environments. We anticipate the creation of more versatile cloud-based services that make use of computer architectures that are not limited to map-reducible problems, as well as low-cost, high-performance heterogeneous computational solutions that can serve as an adequate local solution for many laboratories.
Ultimately, our ability to consider very large-scale, diverse types of data collected on whole populations for the construction of predictive disease models will demand an open, data-sharing environment — not only from the perspective of industry but from academic communities as well, in which there are strong incentives to restrict data distribution to maintain competitive advantages. This will require the development of tools and software platforms that enable the integration of large-scale, diverse data into complex models that can then be operated on and refined by experimental researchers in an iterative fashion. This is perhaps the most crucial milestone we must achieve in the biomedical and life sciences if large-scale data and the results derived from them are to routinely affect biological research at all levels.
Competing interests statement
E.E.S., J.S. and L.L. declare competing financial interests: see Web version for details.
1000 Genomes Project: http://www.1000genomes.org
3Tera Application Store: http://appstore.3tera.com
Amazon Machine Images: http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=171
Amazon Web Services Management Console (MC): http://aws.amazon.com/console
CLC Bioinformatics Cube: http://www.clccube.com
Collaboration between the National Science Foundation and Microsoft Research for access to cloud computing: http://www.microsoft.com/presspass/press/2010/feb10/02-04nsfpr.mspx
Condor Project: http://www.cs.wisc.edu/condor
Database of Genotypes and Phenotypes (dbGAP): http://www.ncbi.nlm.nih.gov/gap
Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo
Nature Reviews Genetics audio slide show on ‘Computational solutions to large-scale data management’: http://www.nature.com/nrg/multimedia/compsolutions/index.html
NIST Technical Report: http://csrc.nist.gov/groups/SNS/cloud-computing
NVIDIA Bio WorkBench: http://www.nvidia.com/object/tesla_bio_workbench.html
Pacific Biosciences Developers Network: http://www.pacbiodevnet.com
Protein Data Bank (PDB): http://www.pdb.org
Public data sets available through Amazon Web Services: http://aws.amazon.com/publicdatasets
VMware Virtual Applicances: http://www.vmware.com/appliances
ALL LINKS ARE ACTIVE IN THE ONLINE PDF