High-throughput genomic technologies have been used to explore personal human genomes for the past few years. Although the integration of technologies is important for high-accuracy detection of personal genomic variations, no databases have been prepared to systematically archive genomes and to facilitate the comparison of personal genomic data sets prepared using a variety of experimental platforms. We describe here the Total Integrated Archive of Short-Read and Array (TIARA; http://tiara.gmi.ac.kr) database, which contains personal genomic information obtained from next generation sequencing (NGS) techniques and ultra-high-resolution comparative genomic hybridization (CGH) arrays. This database improves the accuracy of detecting personal genomic variations, such as SNPs, short indels and structural variants (SVs). At present, 36 individual genomes have been archived and may be displayed in the database. TIARA supports a user-friendly genome browser, which retrieves read-depths (RDs) and log2 ratios from NGS and CGH arrays, respectively. In addition, this database provides information on all genomic variants and the raw data, including short reads and feature-level CGH data, through anonymous file transfer protocol. More personal genomes will be archived as more individuals are analyzed by NGS or CGH array. TIARA provides a new approach to the accurate interpretation of personal genomes for genome research.
Recently developed high-throughput DNA technologies have revolutionized human genomics. Massively parallel sequencing—next generation sequencing (NGS)—has been used to analyze nearly 20 personal genomes (1–10). The cost of sequencing a single genome is decreasing dramatically, and we are now approaching an era in which personal genomic sequencing will cost US$1000. The sequencing of a large number of individual genomes, possibly more than 1000, is expected to be complete within the next year (http://www.1000genomes.org). Current sequencing technologies, which provide sufficient read depth (RD), enable the detection of genome-wide SNPs and short indels with >99.9% accuracy (3,4).
Comparative genomic hybridization (CGH) arrays have been used to detect copy number variants (CNVs), a major type of structural variant (SV) in the human genome (11–16). CNVs are irregular in size and often reside in ambiguous regions (e.g. repetitive sequences) making them difficult to detect by NGS technologies alone. Although several sequencing approaches have attempted to detect CNVs (6,17–19), CGH arrays remain a standard approach to CNV detection (11–16).
Human genomic variants are believed to have important functional impacts on human biology and medicine. To evaluate the potential biological functions of the large number of variants, it is essential to develop intuitive methods for comparing multiple genomes using raw-level data generated by diverse technologies. Moreover, the cooperative integration of different genomic technologies is necessary for high-accuracy detection of variants, especially of CNVs (6,10,16). Although many genomic databases and browsers have been developed (20–24), the comparison and integration of genomic data sets from different platforms are not yet feasible.
TIARA contains massively parallel sequencing data from five individuals, three of whom, AK1 (6), AK2 (16) and NA10851 (10,16), have been described previously. The other two genomes deposited in TIARA, AK4 and AK6, were sequenced using the Illumina Genome Analyzer. The average RDs for these five individuals were 27.8x, 27.5x, 25.0x, 22.3x and 23.1x, respectively. The details of the whole genome sequencing process have been described previously (6,10,16). Briefly, short reads from the Illumina Genome Analyzer and AB SOLiD were aligned against the human reference genome build 36.3 using the GSNAP and BioScope alignment tools, respectively (6,16,25). The RDs of sequencing coverage were obtained after adjusting for the effects of GC content as described previously (10,18).
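The GC adjustment mentioned above can be illustrated with a minimal sketch. It assumes one common form of correction, in which the RD of each window is scaled by the ratio of the overall median RD to the median RD of windows with the same GC percentage. Whether TIARA applies exactly this form is an assumption; the procedure actually used is described in the cited references (10,18). The function name and the window representation are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_adjust_read_depth(read_depths, gc_percents):
    """Scale each window's RD by (overall median RD / median RD of its GC bin).

    read_depths: raw read depth per genomic window.
    gc_percents: GC percentage (0-100) of the same windows.
    Assumed form of the correction, not necessarily TIARA's exact procedure.
    """
    if len(read_depths) != len(gc_percents):
        raise ValueError("read_depths and gc_percents must have equal length")

    overall_median = median(read_depths)

    # Group raw depths by GC percentage rounded to the nearest integer.
    depths_by_gc = defaultdict(list)
    for rd, gc in zip(read_depths, gc_percents):
        depths_by_gc[round(gc)].append(rd)
    gc_median = {gc: median(values) for gc, values in depths_by_gc.items()}

    adjusted = []
    for rd, gc in zip(read_depths, gc_percents):
        m_gc = gc_median[round(gc)]
        # Windows in a GC bin with zero median depth cannot be rescaled.
        adjusted.append(rd * overall_median / m_gc if m_gc > 0 else 0.0)
    return adjusted
```

After adjustment, windows from GC-poor and GC-rich regions with the same relative depth within their own bin receive the same value, which is what makes RDs comparable across the genome.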
CGH array data from 33 individuals (11 Koreans, 10 Chinese, 10 Japanese, 1 European and 1 West African) were obtained using a whole genome tiling CGH array comprising 24 million probes (16) (Supplementary Table S1). In addition to the usual type of CGH data, which depends on a comparison with a reference sample (NA10851), absolute (reference-free) CGH array data are also provided.
SNPs and indels were discovered by applying conservative filter criteria to the NGS data as described elsewhere (6). Briefly, four matches from uniquely aligned short reads with a quality score ≥20 were required for SNP identification. CNVs were identified in the CGH array using the ADM2 algorithm (16,26) in the Agilent Genomic Workbench Standard Edition 5.0.14. The summary statistics of each individual genome are provided in Table 1.
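The read-support criterion for SNPs can be sketched as a simple filter. The example below assumes that the criterion means at least four uniquely aligned reads, each with a base quality score of 20 or higher, supporting the variant allele at a position. This reading, together with the `BaseCall` type and the function name, is an assumption made for illustration; the full filter criteria are those of reference (6).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseCall:
    """One read's observation at a genomic position (hypothetical record type)."""
    base: str
    quality: int
    uniquely_aligned: bool


def passes_snp_filter(calls, alt_base, min_reads=4, min_quality=20):
    """Return True if enough high-quality, uniquely aligned reads support alt_base."""
    support = sum(
        1
        for call in calls
        if call.uniquely_aligned
        and call.quality >= min_quality
        and call.base == alt_base
    )
    return support >= min_reads
```

A position with three qualifying reads and one low-quality read supporting the same allele would be rejected under these thresholds, which reflects the conservative intent of the filter.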
The TIARA system mainly consists of a ‘genome data repository’ and a ‘genome browser’ (Figure 1). The genome data repository has three types of storage archive: (i) a ‘Lucene index file system’, (ii) a ‘MySQL database’ and (iii) an ‘anonymous file transfer protocol (FTP) archive’. These archives were built on a virtualization file system designed to support high-performance computing clusters. The Lucene index file system includes inverted index files for real-time query processing of genomic data, including SNPs, indels, RDs and log2 ratios. Inverted index files are generated using an ‘index build module’. The MySQL database stores information about the aligned short reads, such as read length, alignment position and quality. The anonymous FTP archive enables downloading of the raw CGH and short read data as well as the filtered genome variants, including SNPs, non-synonymous SNPs, indels and CNVs.
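To illustrate why an inverted index suits real-time region queries, the following toy sketch maps each (chromosome, bin) key to the records located in that bin, so that a query touches only the bins overlapping the requested region. It is not Lucene's API and does not reproduce the TIARA index build module; the class name, bin size and record values are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in base pairs


class RegionIndex:
    """Toy inverted index: (chromosome, bin) -> list of (position, record)."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.postings = defaultdict(list)

    def add(self, chrom, pos, record):
        """Register a record (e.g. a SNP, indel or RD value) at a position."""
        self.postings[(chrom, pos // self.bin_size)].append((pos, record))

    def query(self, chrom, start, end):
        """Return records with start <= position <= end, sorted by position."""
        hits = []
        first_bin = start // self.bin_size
        last_bin = end // self.bin_size
        for bin_id in range(first_bin, last_bin + 1):
            for pos, record in self.postings.get((chrom, bin_id), []):
                if start <= pos <= end:
                    hits.append((pos, record))
        return sorted(hits, key=lambda hit: hit[0])
```

The cost of a query depends on the number of bins in the requested region rather than on the total number of stored records, which is the property that makes this kind of index attractive for interactive browsing.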
In this section, we describe the user interface of TIARA, the structure of which is displayed in Figure 2a. In area (A) of Figure 2a, the user can specify the genomic region and the individuals of interest for browsing. Areas (B), (C), (D) and (E) present, respectively, the RefSeq genes, SNPs, indels and RDs from the high-throughput sequencing data. Areas (F) and (G) present the CNV regions and log2 ratios from the high-resolution CGH array data, respectively. Once the user selects or deselects an individual genome data set, the personal genome data are displayed in or removed from areas (C), (D), (E) and (G). The ‘GeneSearch’ button allows the user to browse the genome data for a specific gene; for example, the user can browse the TP53 gene locus (Figure 2b). The ‘XMLDownload’ button exports an XML document that contains structured information describing the SNPs, indels, RDs and log2 ratios visualized in the genome browser. The downloaded XML document permits analysis of the selected genomic region using other genome browsers or the user’s own scripts. The schema of each XML document is shown in Supplementary Figures 1 and 2.
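As an illustration of how an exported XML document might be processed by a custom script, the sketch below counts SNPs per individual with the Python standard library. The element and attribute names (`region`, `snp`, `sample`, `pos`, `ref`, `alt`) and the sample data are hypothetical placeholders; the real schema is the one given in the Supplementary Figures, and the names would need to be changed to match it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical export; element and attribute names are NOT the real TIARA schema.
SAMPLE_XML = """<region chrom="chr17" start="100" end="200">
  <snp sample="AK1" pos="120" ref="A" alt="G"/>
  <snp sample="AK1" pos="150" ref="C" alt="T"/>
  <snp sample="AK2" pos="150" ref="C" alt="T"/>
</region>"""


def count_snps_per_sample(xml_text):
    """Count <snp> elements per value of their 'sample' attribute."""
    root = ET.fromstring(xml_text)
    return Counter(element.get("sample") for element in root.iter("snp"))


if __name__ == "__main__":
    for sample, count in sorted(count_snps_per_sample(SAMPLE_XML).items()):
        print(sample, count)
```

The same pattern, iterating over the elements of interest and grouping by individual, would apply to indels, RDs and log2 ratios once the actual element names are known.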
We have described the development of the TIARA genome database, into which massively parallel sequencing data, high-resolution array CGH data and genomic variants of human whole genomes have been deposited. The TIARA genome browser is a unique visualization tool that facilitates multi-individual and cross-technology analysis of complex human genomic variations. TIARA will be upgraded to improve the efficiency of genome research by developing advanced genome browser functions and by adding more personal genomes. GMI-SNU has recently completed sequencing of the entire genomes of 10 Korean individuals using NGS and high-resolution CGH arrays. Our group plans to analyze 1000 Asian genomes and release the data through TIARA before the end of next year. We believe that TIARA and its genomic data will prove to be an invaluable resource for human genome research.
Supplementary Data are available at NAR Online.
Korean Ministry of Knowledge Economy (grant number 0411-20100061); Korean Ministry of Education, Science and Technology (grant number 2010-0013662); Green Cross Therapeutics (0411-20080023). Funding for open access charge: Korean Ministry of Education, Science and Technology (grant number 2010-0013662).
Conflict of interest statement. None declared.