To evaluate the performance of the Atlas2 Suite, we processed 92 samples from the 1000 Genomes (1000G) Phase 1 Whole Exome Project (http://www.1000genomes.org), an ongoing large-scale population resequencing project that aims to provide the most comprehensive human variant call set [14]. These 92 samples were chosen because they are also included in the 1000G Exon Pilot Project [14]. For INDELs, we additionally analyzed 10 samples from the 1000G Whole Exome Project for comparison against other INDEL callers.
In the 92 samples, a total of 134,182 SNPs were discovered, with an average of 14,867 SNPs per sample (Figure ). Previous studies have established an expected transition/transversion (Ts/Tv) ratio of 3-4 for coding regions [15]. The Atlas-SNP2 call set has a Ts/Tv ratio of 3.49 and a dbSNP v129b re-discovery rate of 90.1%. We also checked the SNP density by normalizing the number of SNPs against each sample's callable region. The average SNP density of the 92 exomes was 0.63 SNPs per 1,000 bp, which is consistent with previous results from the 1000G Exon Pilot data (Figure ).
Figure 3 SNP Call Metrics. (a) SNP metrics of 92 1000 Genomes exome samples. The four figures are SNP number, Ts/Tv, dbSNP% and SNP density distributions respectively. SNPs were called and compared in the callable region with a variant read depth of at least 2.
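The two quality metrics used above, the Ts/Tv ratio and the SNP density per 1,000 bp, can be illustrated with a minimal sketch. The function names and the toy inputs are assumptions made for illustration; this is not the Atlas-SNP2 implementation.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) base pairs."""
    ts = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            raise ValueError(f"not a substitution: {ref}>{alt}")
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ts/Tv is undefined")
    return ts / tv


def snp_density_per_kb(n_snps: int, callable_bp: int) -> float:
    """SNPs per 1,000 bp of callable region."""
    if callable_bp <= 0:
        raise ValueError("callable region must be positive")
    return 1000.0 * n_snps / callable_bp


# Toy input (hypothetical): three transitions and one transversion.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
toy_ratio = ts_tv_ratio(toy_snps)
```

Normalizing by each sample's own callable region, rather than by a fixed target size, keeps the density comparable across samples with different coverage.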
We compared the Atlas-SNP2 SNP calls to the latest official release of the 1000G Exon Pilot SNP calls (Aug 2010) on a sample-by-sample basis (Figure ). Only SNP calls in the consensus high-coverage genomic region of the two data sets were compared. On average, Atlas-SNP2 re-discovered 96.7% of the SNPs called by the 1000G Exon Pilot project in the consensus region, and 89.5% of the SNPs that Atlas-SNP2 discovered in the 1000G exome data are confirmed by the Exon Pilot data. The remaining 10.5% of SNPs, which are unique to the exome call set, have a Ts/Tv ratio of 3.21 and a dbSNP re-discovery rate of 71.1%. They are therefore likely to be true SNPs that the Exon Pilot project did not discover, owing to its stringent calling procedure [15] and to the evolving nature of capture sequencing technology.
Figure 4 Comparison of 92 1000 Genomes Exome samples to Exon Pilot Data. We made SNP calls using the Atlas2 Suite on 92 samples from 1000 Genomes Phase 1 Exome project, and compared the result to the most recent release from the 1000 Genomes Exon Pilot project.
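The per-sample comparison described above reduces to set arithmetic once both call sets are restricted to the consensus region. The sketch below assumes that restriction has already been applied and that a site is identified by chromosome, position, and alternate allele; the names and toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Set, Tuple

# (chromosome, 1-based position, alternate allele); an assumed site key.
Site = Tuple[str, int, str]


@dataclass(frozen=True)
class Concordance:
    rediscovery_rate: float   # fraction of Exon Pilot SNPs also in the exome set
    confirmation_rate: float  # fraction of exome SNPs also in the Exon Pilot set
    unique_to_exome: int      # exome SNPs absent from the Exon Pilot set


def compare_call_sets(exome: Set[Site], exon_pilot: Set[Site]) -> Concordance:
    """Compare two call sets that already cover the same consensus region."""
    if not exome or not exon_pilot:
        raise ValueError("both call sets must be non-empty")
    shared = exome & exon_pilot
    return Concordance(
        rediscovery_rate=len(shared) / len(exon_pilot),
        confirmation_rate=len(shared) / len(exome),
        unique_to_exome=len(exome - exon_pilot),
    )


# Toy data (hypothetical): three shared sites, two exome-only, one pilot-only.
exome_calls = {("1", 100, "A"), ("1", 200, "C"), ("2", 50, "T"),
               ("2", 75, "G"), ("3", 10, "A")}
pilot_calls = {("1", 100, "A"), ("1", 200, "C"), ("2", 50, "T"),
               ("3", 99, "C")}
toy_result = compare_call_sets(exome_calls, pilot_calls)
```

The two rates use different denominators, which is why a call set can re-discover most of a reference set while a smaller fraction of its own calls is confirmed.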
In the 92 samples run through the Atlas2 Suite pipeline, we called a total of 2,971 INDELs, with an average of 197 INDELs per sample (autosomal exons only). Frameshift INDELs in coding DNA nearly always render the resulting protein non-functional, so they are expected to be significantly less common in coding DNA than in non-coding DNA. Previous studies have indicated that approximately 50% of coding INDELs cause frameshifts [16]. The samples analyzed by Atlas-Indel2 had an average in-frame rate (the fraction of INDELs that do not cause a frameshift) of 46.7%, which is close to that expectation and suggests that the call set is of high quality.
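As a minimal sketch of the in-frame rate defined above, an INDEL can be treated as in-frame when its length change is a multiple of three. This assumes VCF-style reference and alternate alleles and ignores effects such as splice-site disruption; the function names and toy alleles are hypothetical.

```python
from typing import Iterable, Tuple


def is_in_frame(ref: str, alt: str) -> bool:
    """True when the inserted or deleted length is a multiple of 3."""
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        raise ValueError(f"not an INDEL: {ref}>{alt}")
    return length_change % 3 == 0


def in_frame_rate(indels: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (ref, alt) INDELs that do not shift the reading frame."""
    flags = [is_in_frame(ref, alt) for ref, alt in indels]
    if not flags:
        raise ValueError("no INDELs supplied")
    return sum(flags) / len(flags)


# Toy alleles (hypothetical): 3 bp insertion, 2 bp deletion,
# 1 bp insertion, 3 bp deletion.
toy_indels = [("A", "ATCG"), ("ACT", "A"), ("G", "GA"), ("CTTA", "C")]
toy_rate = in_frame_rate(toy_indels)
```

If INDEL lengths were random, only about one third would be in-frame, so a rate well below the expected value for coding regions points to excess false positives.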
For comparison against other variant calling tools, 10 samples (5 European, 5 African) were processed by Atlas-Indel2, GATK [12], and SAMtools mPileup [8] (see Methods). Results were compared on the basis of the total number of exome INDELs called and the percentage of INDELs that are in-frame (Table ). The Atlas-Indel2 call set has a significantly higher average in-frame rate, 47.52%, compared with 10.39% for GATK and 25.82% for SAMtools mPileup (p < 3.9e-13, Student's t-test). In these 10 samples, Atlas-Indel2 called an average of 194 INDELs per sample, whereas GATK called 1,947 and SAMtools called 1,560. The Atlas-Indel2 figure of 194 INDELs per sample is much closer to the number found in previous exome INDEL studies [14]. The low in-frame rates and large call sets for GATK and SAMtools indicate a much higher false-positive rate than for Atlas-Indel2 (Additional file 1, Figure S9a and Additional file 2).
Comparison to other INDEL Callers
Atlas-Indel2 is also specifically tuned for short INDEL calling and genotyping on the Illumina platform. The model is described in detail in the Supplement (Additional file 1, Table S3). As with the SOLiD data, we analyzed a small number of Illumina samples and compared the Atlas2 results to those of other widely used INDEL callers, including Dindel [18] and GATK [12]. All callers performed very similarly, calling an average of 221-241 coding INDELs per sample with an average in-frame rate of 57-61% (Additional file 1, Table S5). Of the INDEL sites called by Atlas-Indel2, 86% were also called by Dindel and 89% were also called by GATK (Additional file 1, Figure S9b).
WECS BAM files are typically 10-20 gigabytes each. Despite the enormous size of high-coverage sequencing data, the Atlas2 Suite is engineered to process WECS data on a standard desktop computer or a small server in a reasonable amount of time. Sequencing reads are processed one at a time, with a minimal number stored in memory. As a result, run time increases linearly with BAM file size while memory usage remains constant (Figure ). Memory usage depends on the reference genome used; for example, with the human reference genome the maximum memory usage is about 250 MB. A 28 GB whole exome BAM file takes approximately 2 hours to process with Atlas-SNP2 and 5.5 hours with Atlas-Indel2, using one core of a 2.27 GHz Intel Xeon processor. Using 64 CPU cores on a computational cluster, we are able to process all 92 samples in ~4 hours for SNPs and ~11 hours for INDELs, demonstrating that the Atlas2 Suite is well suited to population-scale analysis.
Figure 5 Computational resources. Both Atlas-SNP2 (a.) and Atlas-Indel2 (b.) were tested on a series of BAM files to evaluate run time and maximum memory usage. The algorithm for both applications is designed so that runtime increases linearly with the number ...
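The cluster timings quoted above are consistent with a simple back-of-the-envelope model, sketched here under stated assumptions: one sample runs on one core, every sample takes about as long as the 28 GB benchmark, and jobs run in full waves. With 92 samples on 64 cores this gives two waves. The scheduling model and function names are illustrative assumptions, not a description of how the jobs were actually dispatched.

```python
import math


def scaled_hours(bam_gb: float, ref_gb: float, ref_hours: float) -> float:
    """Run time under the assumption that it is proportional to BAM size."""
    return bam_gb * ref_hours / ref_gb


def wall_clock_hours(n_samples: int, n_cores: int,
                     hours_per_sample: float) -> float:
    """Wall-clock time when samples run one per core in successive waves."""
    if n_samples < 0 or n_cores <= 0:
        raise ValueError("need n_samples >= 0 and n_cores > 0")
    waves = math.ceil(n_samples / n_cores)
    return waves * hours_per_sample


# Benchmark figures from the text: 2 h (Atlas-SNP2) and 5.5 h (Atlas-Indel2)
# for a 28 GB BAM on one core; 92 samples on 64 cores.
snp_hours = wall_clock_hours(92, 64, 2.0)      # two waves of 2 h
indel_hours = wall_clock_hours(92, 64, 5.5)    # two waves of 5.5 h
half_size_hours = scaled_hours(14.0, 28.0, 2.0)
```

Under these assumptions the estimates come out at 4 and 11 hours, matching the reported figures. Because real BAM sizes vary between samples, the actual schedule would differ from this uniform model.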
Using Atlas2 in the Genboree Workbench for Integrative Genomic Analysis
To make the Atlas2 Suite tools accessible to a wide range of researchers, we integrated them into the Genboree Workbench (http://www.genboree.org, registration required). Primarily aimed at collaborative genomics research, Genboree provides web-based services for groups of researchers to share, visualize, and analyze genomic annotations and raw data files. Using the Workbench's graphical user interface, researchers can run Atlas2 Suite tools on their BAM and SAM files (Additional file 1, Figure S7a). The interface provides help information, validates parameters, and allows the SNP and INDEL calls to be uploaded as annotation tracks. Upon submission, the configured job is added to the Genboree job queue and executes on a modest compute cluster, after which the user is notified of job completion by email. The result files are also made available to collaborating researchers via the Workbench and are organized in a directory structure using the Study and Job names provided by the researcher (Additional file 1, Figure S7b).
By running the Atlas2 Suite tools via the Workbench, researchers benefit from integrative analysis. For example, if the SNP calls are uploaded as an annotation track, they can be viewed in the internal genome browser (Additional file 1, Figure S7b). Researchers can also configure Genboree to export their SNP tracks to the UCSC browser and can use the tracks as inputs to other Workbench tools. Additionally, the Workbench provides us with a framework for adding new Atlas2 Suite models and additional external tools. Because the Workbench is implemented on top of the Genboree HTTP REST API, researchers can automate this kind of analysis. For example, uploading a BAM file and launching an Atlas2 Suite tool can be done through the API, as can checking whether result files are available and, if so, downloading them. Using the API, researchers can also extend the analysis capabilities with their own software.
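The upload, launch, check, and download sequence described above could be automated along the following lines. This is only a sketch of the control flow: the `WorkbenchClient` interface, its method names, and the `FakeClient` stand-in are all hypothetical and do not correspond to actual Genboree REST API resources, which are not specified here. A real client would wrap the documented API calls behind the same four operations.

```python
import time
from typing import Callable, Protocol


class WorkbenchClient(Protocol):
    """Hypothetical interface; NOT the real Genboree REST API."""

    def upload(self, local_path: str) -> str: ...
    def launch(self, tool: str, remote_file: str) -> str: ...
    def results_ready(self, job_id: str) -> bool: ...
    def download(self, job_id: str, dest_dir: str) -> str: ...


def run_remote_job(client: WorkbenchClient, bam_path: str, tool: str,
                   dest_dir: str, poll_seconds: float = 60.0,
                   max_polls: int = 120,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Upload a BAM, launch a tool, poll until results exist, download them."""
    remote_file = client.upload(bam_path)
    job_id = client.launch(tool, remote_file)
    for _ in range(max_polls):
        if client.results_ready(job_id):
            return client.download(job_id, dest_dir)
        sleep(poll_seconds)  # wait before checking again
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


class FakeClient:
    """In-memory stand-in used only to exercise the control flow."""

    def __init__(self, polls_until_ready: int) -> None:
        self.polls_left = polls_until_ready

    def upload(self, local_path: str) -> str:
        return "remote/" + local_path

    def launch(self, tool: str, remote_file: str) -> str:
        return "job-1"

    def results_ready(self, job_id: str) -> bool:
        self.polls_left -= 1
        return self.polls_left < 0

    def download(self, job_id: str, dest_dir: str) -> str:
        return f"{dest_dir}/{job_id}.results"


waits: list = []
result_path = run_remote_job(FakeClient(2), "sample.bam", "Atlas-SNP2",
                             "out", poll_seconds=5.0, sleep=waits.append)
```

Passing the sleep function as a parameter lets the polling loop be tested without real delays, and the poll limit prevents a stalled job from blocking a script indefinitely.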