Advances in next generation sequencing technologies and their use in discovering genetic variations associated with complex diseases are generating enormous amounts of data. The major bottleneck in genome sequencing is no longer data generation, but the computational challenges around data analysis, display, and integration of disparate data types [Green and Guyer, 2011]. As GWAS data have stretched informatics capacity, meeting the storage and analytical needs for next generation sequencing data will be even more challenging. The average next generation sequencing experiment generates terabytes of data [Zhang et al., 2011]. In fact, the rate of increase in DNA sequencing and genotyping capacity is outstripping the rate of increase in disk storage [Richter and Sexton, 2009; Stein, 2010]. In some cases, storing and archiving raw data costs more than repeating the experiment, which is not ideal given finite biospecimen resources and the need to evaluate changes in base-calling methods over time. A review of informatics requirements for next generation sequence data outlined needs that included scalable, dense, and inexpensive disk storage systems; high-performance and archival storage systems; improved software and data analysis tools; and increased staffing to handle the large increase in data [Richter and Sexton, 2009].
Adequate computational infrastructure is necessary to analyze and manage large-scale data sets. Participants strongly agreed that there is a need for investigator-led regional or statewide shared computing clusters, to support small- to medium-sized laboratories in particular (Recommendation 5.1). The group specified that researchers, rather than administrators, should design these clusters because of their better understanding of analytical needs. Exploiting the power of graphics processing units (GPUs) may enhance computational power [Sinnott-Armstrong et al., 2009; Greene et al., 2010] for some analytical studies. However, to effectively use clusters and GPU technologies (or GPGPU/FPGA), an increased emphasis on cluster-friendly software and conversion of applications for GPU computing is required (see the sketch following this paragraph). Overall, meeting participants noted that computing power, relative to other challenges discussed, is less of a limitation because they felt that the technology is already moving in this direction. However, it was noted that while the goal of the $1,000 genome sequencing platform is likely within reach, this figure does not include the cost of data analysis or storage, which some suggested could be substantial [Mardis, 2010; Pennisi, 2011]. Finally, given the computational demands of scientific research, the group advocated for improved and increased training of graduate students and postdoctoral fellows in computer programming, either through training programs or other grant support, and for increased support of computational personnel (Recommendations 5.2 and 5.3).
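To make the GPU point concrete, the minimal sketch below (hypothetical, not from the meeting) shows why dense linear algebra tasks such as an all-pairs SNP correlation scan port naturally to GPUs: the whole computation reduces to one matrix product, the operation GPU hardware accelerates best. All names and sizes are invented; libraries such as CuPy aim to mirror this NumPy API on the GPU.

```python
import numpy as np

# Sketch: score all pairwise SNP-SNP correlations as a single matrix
# product. Dense matrix multiplication is exactly the workload that maps
# well onto GPU hardware; CuPy exposes a NumPy-like API for this on GPUs.

rng = np.random.default_rng(0)
n_samples, n_snps = 1_000, 500

# Simulated genotypes coded 0/1/2 (minor-allele counts); real data would
# come from a genotype file.
genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(np.float32)

# Standardize each SNP, then one matrix multiply yields every pairwise
# correlation at once.
z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
corr = (z.T @ z) / n_samples  # n_snps x n_snps correlation matrix

# Report the most correlated (non-identical) SNP pair.
np.fill_diagonal(corr, 0.0)
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
print(f"strongest pairwise correlation: SNPs {i} and {j}, r = {corr[i, j]:.3f}")
```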
Meeting participants agreed that impending issues surrounding data storage and networking were the most daunting as the field migrates to denser genotyping and sequencing platforms and more complex analyses. Historical methods of data storage, e.g. relational databases, are no longer adequate for the next generation of studies. Newer models being explored to improve storage capacity include compressed and binary data formats (a minimal packing example follows this paragraph), hybrid database architectures (e.g. row/column-oriented designs or chunking formats), data virtualization, and cloud computing. Participants identified needs for cost-effective and increased data storage capacity, including new, more efficient data structures and formats (Recommendation 5.4). In addition to new formats and structures, participants concluded that standardized data formats for data storage and networking need to be established. One approach discussed for developing standards was to organize a conference of the many interested groups to reach a consensus on the best methods to store, deliver, archive, describe, and distribute data (Recommendation 5.5).
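As a minimal illustration of why binary formats shrink genotype storage, the toy packer below stores a biallelic genotype (0, 1, or 2 copies of the minor allele) in 2 bits, so four genotypes fit in one byte. This is the same idea underlying formats such as PLINK's .bed file, but the code is an invented sketch, not a real file format.

```python
import numpy as np

def pack_genotypes(genotypes: np.ndarray) -> bytes:
    """Pack an array of genotypes (values 0-2) into 2 bits each."""
    g = np.asarray(genotypes, dtype=np.uint8)
    padded = np.zeros(((len(g) + 3) // 4) * 4, dtype=np.uint8)  # pad to multiple of 4
    padded[:len(g)] = g
    quads = padded.reshape(-1, 4)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((quads << shifts).sum(axis=1).astype(np.uint8)).tobytes()

def unpack_genotypes(packed: bytes, n: int) -> np.ndarray:
    """Recover the first n genotypes from a packed byte string."""
    b = np.frombuffer(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((b[:, None] >> shifts) & 0b11).reshape(-1)[:n]

genos = np.array([0, 1, 2, 0, 2, 2, 1])          # 7 genotypes
packed = pack_genotypes(genos)                   # stored in 2 bytes, not 7
assert np.array_equal(unpack_genotypes(packed, len(genos)), genos)
print(f"{len(genos)} genotypes -> {len(packed)} bytes")
```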
In addition to issues regarding data storage, combining data sets from multiple sites and sources creates further challenges that require careful attention to QC assessment. Best practices for QC of GWAS data were recently published based on lessons learned from the eMERGE network and the Gene Environment Association Studies (GENEVA) program [Laurie et al., 2010; Turner et al., 2011]. Integrating GWAS and sequencing data with other data sources, e.g. -omics data, also leads to data management challenges, and these additional -omic data sets are themselves very large. Improved methods for combining meta-dimensional data from multiple data sources (e.g. dbGaP, gene ontology databases) and data types (e.g. DNA, RNA, protein, and clinical data) are needed, along with standard QC procedures (Recommendations 5.6 and 5.7); a minimal integration sketch follows this paragraph.
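The sketch below is a minimal, hypothetical illustration of meta-dimensional integration: aligning genotype, expression, and clinical tables on a shared sample identifier. The sample IDs and column names are invented; real integration must also reconcile identifiers, genome builds, and QC flags across sources.

```python
import pandas as pd

# Toy stand-ins for three data types keyed by a shared sample identifier.
genotypes = pd.DataFrame({"sample_id": ["S1", "S2", "S3"],
                          "rs123_dosage": [0, 1, 2]})
expression = pd.DataFrame({"sample_id": ["S1", "S2", "S4"],
                           "GENE_A_expr": [5.2, 7.9, 6.1]})
clinical = pd.DataFrame({"sample_id": ["S1", "S2", "S3", "S4"],
                         "case_status": [1, 0, 1, 0]})

# Inner joins keep only samples present in every data type, making the
# (often severe) loss of overlap explicit rather than silent.
merged = (genotypes.merge(expression, on="sample_id", how="inner")
                   .merge(clinical, on="sample_id", how="inner"))
print(merged)            # only S1 and S2 survive all three joins
print(f"{len(merged)} of {len(clinical)} samples have complete data")
```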
Analytical software tools are needed to efficiently manage large genotype and sequence data sets, including the ability to efficiently subset, merge, annotate, and harmonize data, perform variant calling, run standard QC tasks, fit standard models to test for relationships and population structure, and detect association with phenotypes (a minimal QC-filtering sketch follows this paragraph). Meeting participants advocated for the development of user-friendly, and ideally open-source, tools available to the research community to accommodate next generation sequencing data and more complex forms of data analysis (Recommendation 5.8). The group outlined needs for baseline tools and environments for core libraries, data management, QC, data analysis, and methods development. These tools would ideally be extensible to more sophisticated tasks and models. It was suggested that a new funding mechanism should be used to support such development (Recommendation 5.9). In addition to funding support, the group suggested that NIH should coordinate the development of these resources by organizing a conference among multiple groups to form the consensus standards and formats that are needed (Recommendation 5.10). The group emphasized that more collaboration is needed among computer scientists, statisticians, and biologists to more effectively leverage biological knowledge and interpret the vast amounts of data obtained.
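As a minimal, hypothetical sketch of two of the standard QC tasks named above, the code below applies per-SNP call-rate and minor-allele-frequency (MAF) filters. The thresholds (95% call rate, MAF at least 1%) are common defaults rather than universal rules, and production pipelines (e.g. PLINK) add Hardy-Weinberg, relatedness, and sample-level checks.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_snps = 200, 1_000

# Genotypes coded 0/1/2; -1 marks a missing call.
genotypes = rng.integers(0, 3, size=(n_samples, n_snps)).astype(np.int8)
genotypes[rng.random(genotypes.shape) < 0.03] = -1   # ~3% missingness

observed = genotypes >= 0
call_rate = observed.mean(axis=0)

# Allele frequency from observed calls only; fold to the minor allele.
alt_freq = np.where(observed, genotypes, 0).sum(axis=0) / (2 * observed.sum(axis=0))
maf = np.minimum(alt_freq, 1 - alt_freq)

keep = (call_rate >= 0.95) & (maf >= 0.01)
filtered = genotypes[:, keep]
print(f"retained {keep.sum()} of {n_snps} SNPs after call-rate and MAF filters")
```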
As described during the session on complex phenotypes, most genetic epidemiology studies have focused on a “one SNP at a time” approach, ignoring the complexity of disease pathways. Biological pathways are typically nonlinear, and an assumption of linearity may hinder our ability to detect complex relationships. Therefore, some participants suggested that analytical approaches that accommodate different models may improve our ability to detect genetic variants or pathways with important roles in disease [Moore et al., 2010]. Alternatives to traditional linear models mentioned included symbolic modeling of epistasis [Moore et al., 2007], computational evolution systems [Moore et al., 2008], and logic regression [Kooperberg et al., 2001]; a toy logic-model example follows this paragraph. A strength of these approaches is that they are not bound by the assumptions of an underlying linear model. However, some cautioned that testing such a large number of models, potentially millions, may be hindered by a large false-positive rate (i.e. there is a low prior probability that any individual model is correct). This limitation may be partially addressed by using biological knowledge to limit the model space of the search, i.e. limiting the number of potential models. Improved high-throughput biological systems to test different models, such as those being assessed in the National Toxicology Program’s toxicogenomics program (http://ntp.niehs.nih.gov/?objectid=7E6CAEBD-BDB5-82F8-F8C29152153B80B1), may be used to add to the knowledge base for analysis. Additional analytical approaches may be found in computer science, such as quantum computing or artificial immune systems.
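The toy example below illustrates the idea behind logic-style models (it is not Kooperberg et al.'s logic regression algorithm itself): test a Boolean combination of loci rather than each locus alone. The simulated pattern is a purely epistatic XOR, with risk elevated only when exactly one of two loci carries a minor allele; the carrier probability is set to 0.5 so the marginal effects vanish exactly and single-locus tests see nothing.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n = 4_000
carrier1 = rng.random(n) < 0.5        # carries >=1 minor allele at locus 1
carrier2 = rng.random(n) < 0.5        # carries >=1 minor allele at locus 2

logic_term = carrier1 ^ carrier2      # the Boolean (logic) predictor
case = rng.random(n) < np.where(logic_term, 0.25, 0.10)

def chi2_p(predictor: np.ndarray) -> float:
    """Chi-square p-value for a binary predictor vs. case status."""
    table = np.array([[np.sum(predictor & case), np.sum(predictor & ~case)],
                      [np.sum(~predictor & case), np.sum(~predictor & ~case)]])
    return chi2_contingency(table)[1]

for name, pred in [("locus 1 alone", carrier1),
                   ("locus 2 alone", carrier2),
                   ("locus1 XOR locus2", logic_term)]:
    print(f"{name:18s} p = {chi2_p(pred):.2e}")
# Only the XOR logic term shows a strong association; each locus alone is
# consistent with the null (up to sampling noise).
```

This is also a concrete instance of the false-positive caveat above: with millions of candidate Boolean models, many such terms would reach nominal significance by chance, which is why restricting the model space with biological knowledge was suggested.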
Leveraging biological knowledge was suggested as an approach for managing and analyzing large, complex genetic data sets. Biological annotation may be used to guide searches for interactions or to combine rare variants obtained by sequencing (a minimal pathway-lookup sketch follows this paragraph). However, many different databases exist (e.g. KEGG, BioCarta, Ensembl, Entrez, Gene Ontology), and these databases often use different genome builds and have inconsistencies in gene nomenclature. Another challenge in using these databases is the uncertainty in current pathway knowledge and ontologies; the uncertainty in the models within existing databases should be quantified and accounted for in analysis. Importantly, a large percentage of known genes do not map to functional annotation within these databases, so improved annotation is needed along with other sources of biological information to better assign pathways. Combining data based on biological pathway information is an emerging field; consistent, reliable, and well-curated annotation resources are needed (Recommendation 5.11).
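The sketch below is a minimal, hypothetical illustration of annotation-guided analysis: grouping rare variants by pathway so they can be tested collectively (a burden-style grouping) instead of one variant at a time. The gene-to-pathway map and variant list are invented stand-ins for lookups against resources such as KEGG or Gene Ontology, and genes missing from the map are reported explicitly, mirroring the annotation gaps discussed above.

```python
from collections import defaultdict

# Invented annotation map and rare-variant list (gene symbols are placeholders).
gene_to_pathway = {"GENE_A": "DNA repair", "GENE_B": "DNA repair",
                   "GENE_C": "lipid metabolism"}
rare_variants = [("rs001", "GENE_A"), ("rs002", "GENE_B"),
                 ("rs003", "GENE_C"), ("rs004", "GENE_X")]  # GENE_X unannotated

by_pathway = defaultdict(list)
unmapped = []
for variant, gene in rare_variants:
    pathway = gene_to_pathway.get(gene)
    if pathway is None:
        unmapped.append(variant)       # no functional annotation available
    else:
        by_pathway[pathway].append(variant)

for pathway, variants in by_pathway.items():
    print(f"{pathway}: {len(variants)} rare variants -> one collapsed test")
print(f"unmapped variants (annotation gap): {unmapped}")
```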
Data visualization is an integral part of scientific discovery, but it becomes challenging for large data sets. One novel visualization method described was the 3D Heat Map, which enables exploration of high-throughput data in an interactive medium that allows information or annotation to be added to the map [Moore et al., 2011]. The group discussed ways to utilize emerging 3D visualization tools in genomic epidemiology. The majority agreed that innovative methods for visualizing data may lead to novel scientific discoveries and emphasized the need for baseline charts and visualizations that require no scripting but can be used for a variety of applications, improving current capabilities for visualizing increasingly complex data sets (Recommendation 5.12); a minimal 3D heat-map sketch follows this paragraph.
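The following minimal sketch is in the spirit of a 3D heat map (it is not the Moore et al. 2011 tool itself): a matrix of simulated association scores is rendered as a 3D surface, so peaks stand out in a way a flat 2D heat map can obscure. The data, axis labels, and planted signal are all invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
scores = rng.random((30, 30))         # e.g. -log10(p) for SNP-by-trait tests
scores[12, 20] = 6.0                  # one strong, planted signal

x, y = np.meshgrid(np.arange(scores.shape[1]), np.arange(scores.shape[0]))

# Render the score matrix as an interactive 3D surface.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(x, y, scores, cmap="viridis")
ax.set_xlabel("trait index")
ax.set_ylabel("SNP index")
ax.set_zlabel("-log10(p)")
ax.set_title("3D heat map sketch: association scores")
plt.show()
```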
Sharing genomic data sets must be facilitated by information technology, which must also enforce data access protections based on the guidance of IRBs, data access committees, legal requirements, and institutional policies. Participants debated best practices for data sharing and discussed roadblocks to accessing existing databases or data sets. The accessibility of data housed in dbGaP was discussed: some meeting participants believed that although access to dbGaP is straightforward for users of the resource, the process may be a barrier to data sharing among nonusers. For example, in computer science fields, data sets often are made available simply by clicking on links. It was also suggested that some simple data sets could be made available or identified for sharing, allowing researchers to use comparable sets for methods-development research (Recommendation 5.13). Additionally, the group opined that, at present, computing architectures, software development, data cleaning methods, and the like are not generally published, despite being critically important. Participants suggested that the field should support a mechanism to publish and/or share this type of knowledge, perhaps through additional analytical conferences (Recommendation 5.14).