An optimized sample-pooling strategy
We utilized a PCR-based amplicon-ligation method because PCR remains the most reliable method of template enrichment for selected regions in a complex genome. This approach ensures low cost and maximal flexibility in study design compared to other techniques [9
]. Additionally, PCR of pooled samples alleviates known technical issues associated with PCR multiplexing [12
]. We sequenced 24 exon-containing regions (250 to 300 bp) of a gene on chromosome 3, GRIP2
(encoding glutamate-receptor interacting protein 2; [GenBank: AB051506
]) in 480 unrelated individuals (Figure ). The total targeted region is 6.7 kb per sample. We pooled 40 DNA samples at equal concentration into 12 pools, which was done conveniently by combining samples from the same columns of five 96-well plates. We separately amplified each of the 24 regions for each pool, then normalized and combined resulting PCR products at equal molar ratio. The 12 pools of amplicons were individually blunt-end ligated and randomly fragmented for construction of sequencing libraries, each with a unique Illumina barcode [13
]. These 12 indexed libraries were combined at equal molar concentrations and sequenced on one lane of a GAII (Illumina) using a 47-bp single-end module. We aimed for 30-fold coverage for each allele. Examples of amplicon ligation, distribution of fragmented products, and 12 indexed libraries are shown in Figure .
Figure 1 Schematic diagram of the sequencing strategy. Sample pools of 40 samples × 12 pools were generated from a cohort of 480 individuals for PCR amplification of individual exons. After blunt-ended ligation and random fragmentation, PCR amplicons from (more ...)
Figure 2 Amplicon ligation, fragmentation and indexed Illumina libraries. (a) Amplicon ligation and fragmentation: L-1, low molecular weight marker; lane 1, PCR amplicons before ligation; lane 2, PCR amplicons after ligation; lane 3, random fragmentation using (more ...)
Data analysis and variant calling
Sequence reads were mapped by Bowtie using strict alignment parameters (-v 3: entire read must align with three or fewer mismatches) [14
]. We chose strict alignment to focus on high quality reads. Variants were called using SAMtools (deprecated algorithms [pileup -A -N 80]; see Materials and methods) [15
]. A total of 11.1 million reads that passed Illumina filtering and had identifiable barcodes were aligned to the human genome (hg19), generating approximately 520 megabases of data. The distribution of reads for each indexed library ranged from 641 k to 978 k and 80% of reads had a reported read score (Phred) greater than 25 (Figure ). The aggregate nucleotide content of all reads in the four channels across sequencing cycles was constant (Figure ), indicating a lack of global biases in the data. There was little variability in total coverage per amplicon pool, and sufficient coverage was achieved to make variant calling possible from all amplicon pools (Additional file 1
). Our data indicated that 98% of exonic positions had an expected minimum coverage of 15× per allele (approximately 1,200× minimum coverage per position) and 94% had an expected minimum coverage of 30× (approximately 2,400× minimum coverage per position). Overall average expected allelic coverage was 68×. No exonic positions had zero coverage. To filter potential false positive variants from SAMtools, we included only high-quality variant calls by retaining variants with consensus quality (cq) and SNP quality (sq) scores in 95% of the score distributions (cq ≥ 196, sq ≥ 213; Figure ). This initially generated 388 variant calls across the 12 pools. A fraction of these variant calls (n
= 39) were limited to single pools, indicating potential rare variants.
Figure 3 Quality assessment of the Illumina sequence data. (a) Number of reads with barcodes that passed Illumina filtering and aligned to the reference templates using Bowtie from individually indexed libraries (n = 12). Range, 641 k to 978 k reads; mean ± (more ...)
Figure 4 Distribution of quality score from SAMtools Pileup. Filtering was conducted at the 95th percentile of the consensus and SNP quality distributions reported by SAMtools; only the distribution of SNP quality values is depicted here. The blue bar is the 95th (more ...)
Initial validations by Sanger sequencing indicated that approximately 25% or more of these variant calls were false positives. Sequencing errors contribute to false positive calls and are particularly problematic for pooled samples where rare variant frequencies approach the error rate. To determine the effect of cycle-dependent errors on variant calls [7
], we analyzed the proportions of each nucleotide called at each of the 47 sequencing cycles in each variant. We refer to this analysis as a tailcurve analysis due to the characteristic profile of these proportion curves in many false-positive variant calls (Figure ; Additional file 2
). This analysis indicated that many false positive calls arise from cycle-dependent errors during later sequencing cycles (Figure ). The default base-calling algorithm (BUSTARD) and the quality values it generates make existing variant detection software prone to false positive calls because of these technical biases. Examples of tailcurves reflecting base composition by cycle at specific genetic loci for wild type, common SNP, rare variant, and false positive calls are shown in Figure .
Representative base reads and tailcurves for common and rare variants and error calls. (a) Position with no variant. (b) Position with a common variant. (c) Position with a rare variant. (d) Position with a false positive call.
Quality assessment and base calling using SRFIM
To overcome this problem, we utilized Srfim, a quality assessment and base-calling algorithm based on a statistical model of fluorescence intensity measurements that captures the technical effects leading to base-calling biases [7
]. Srfim explicitly models cycle-dependent effects to create read-specific estimates that yield a probability of nucleotide identity for each position along the read. The algorithm identifies nucleotides with highest probability as the final base call, and uses these probabilities to define highly discriminatory quality metrics. Srfim increased the total number of mapped reads by 1% (to 11.2 million), reflecting improved base-calling and quality metrics, and reduced the number of variant calls by 20% (308 variants across 12 pools; 33 variant calls present in only a single pool).
Cross-pool filtering using SERVIC4E
Further validation by Sanger sequencing indicated the persistence of a few false positive calls from this dataset. Analysis of these variant calls allowed us to define statistics that capture regularities in the base calls and quality values at false positive positions compared to true variant positions. We developed SERVIC4E, an automated filtering algorithm designed for high sensitivity and reliable detection of rare variants using these statistics.
Our filtering methods are based on four statistics derived from the coverage and qualities of variant calls at each position and pool: (1) continuity, defined as the number of cycles in which the variant nucleotide is called (ranges from 1 to 47); (2) weighted allele frequency, defined as the ratio of the sum of Phred quality scores of the variant base call to the sum of Phred quality scores of all base calls; (3) average quality, defined as the average quality of all base calls for a variant; and (4) tailcurve ratio, a metric that captures strand-specific tailcurve profiles that are characteristic of falsely called variants. SERVIC4E employs filters based on these four statistics to remove potential false-positive variant calls. Additionally, SERVIC4E searches for patterns of close-proximity variant calls, a hallmark of errors that have been observed across different sequenced libraries and sequencing chemistries (Figure ), and uses these patterns to further filter out remaining false positive variants. In the next few paragraphs we provide rationales for our filtering statistics, and then define the various filters employed.
Figure 6 Local pool patterns for error analysis. X-axes denote position in a local sequence. Position 16 is the variant site being analyzed, positions 1 to 15 are immediately upstream and positions 17 to 31 are downstream. Y-axes denote the weighted allele frequency (more ...)
The motivation for using continuity and weighted allele frequency is based on the observation that a true variant is generally called evenly across all cycles, leading to a continuous representation of the variant nucleotide along the 47 cycles, and is captured by a high continuity score. However, continuity is coverage-dependent and should only be reliable when the variant nucleotide has sufficient sequencing quality. For this reason, continuity is assessed in the context of the variant's weighted allele frequency. Examples of continuity versus weighted allele frequency curves for common and rare variants are shown in Figure . Using these two statistics, SERVIC4E can use those pools lacking the variant allele (negative pools) as a baseline to isolate those pools that possess the variant allele (positive pools).
Figure 7 Continuity versus weighted allele frequency curves for select variants. (a) Very common variant present in all 12 pools. (b) Modestly common variant present in the majority of pools. (c) Infrequent variant present in a minority of pools. (d) Rare variant (more ...) SERVIC4E
uses a clustering analysis of continuity and weighted allele frequency to filter variant calls between pools. We use k-medioid clustering and decide the number of clusters using average silhouette width [16
]. For common variants, negative pools tend to cluster and are filtered out while all other pools are retained as positives (Figure ). Rare variant pools, due to their lower allele frequency, will have a narrower range in continuity and weighted allele frequency. Negative pools will appear to cluster less, while positive pools cluster more. SERVIC4E
will retain as positive only the cluster with highest continuity and weighted allele frequency (Figure ).
The second filter used by SERVIC4E is based on the average quality of the variant base calls at each position. One can expect that the average quality score is not static, and can differ substantially between different sequencing libraries and even different base-calling algorithms. As such, the average quality cutoff is best determined by the aggregate data for an individual project (Figure ). Based on the distribution of average qualities analyzed, SERVIC4E again uses cluster analysis to separate and retain the highest quality variants from the rest of the data. Alternatively, if the automated clustering method is deemed unsatisfactory for a particular set of data, a more refined average quality cutoff score can be manually provided to SERVIC4E, which will override the default clustering method. For our datasets, we used automated clustering to retain variants with high average quality.
Figure 8 Average quality versus weighted allele frequency for variant pools after filtering by clustering. The X-axis is average Phred sequencing quality score and the Y-axis is weighted allele frequency (ratio of the sum of Phred quality scores for the variant (more ...)
The third filtering step used by SERVIC4E captures persistent cycle-dependent errors in variant tailcurves that are not eliminated by Srfim. Cycle-specific nucleotide proportions (tailcurves) from calls in the first half of sequencing cycles are compared to the proportions from calls in the second half of sequencing cycles. The ratio of nucleotide proportions between both halves of cycles is calculated separately for plus and minus strands, thereby providing the tailcurve ratio added sensitivity to strand biases. By default, variant calls are filtered out if the tailcurve ratio differs more than ten-fold; we do not anticipate that this default will need adjustment with future sequencing applications, as it is already fairly generous, chiefly eliminating variant pools with clearly erroneous tailcurve ratios. This default was used for all our datasets.
The combination of filtering by average quality and tailcurve structure eliminates a large number of false variant calls. Additional file 3
demonstrates the effect of these filtering steps applied sequentially on two sets of base call data.
In addition to these filtering steps, SERVIC4E employs limited error modeling. The pattern of errors observed in many libraries may be dependent on the sequence context of the reads, the preparation of the library being sequenced, the sequencing chemistry used, or a combination of these three factors. We have observed that certain erroneous variant calls tend to aggregate in proximity. These clusters of errors can sometimes occur in the same positions across multiple pools. These observations appeared in two independent datasets in our studies. Importantly, many of the false positive calls that escaped our tailcurve and quality filtering fell within these clusters of errors. To overcome this problem, SERVIC4E conducts error filtering by analyzing mismatch rates in proximity to a variant position of interest and then determining the pattern of error across multiple pools. This pattern is defined as the most frequently occurring combination of pools with high mismatch rates at multiple positions within the isolated regions. The similarity between a variant call of interest and the local pattern or error across pools can then be used to eliminate that variant call (Figure ). The consequences of these sequential filtering steps on variant output are outlined in Table for both cohorts tested in this study.
Effect of sequential filtering by SERVIC4E on variant output
Finally, SERVIC4E provides a trim parameter that masks a defined length of sequence from the extremes of target regions from variant calling. This allows for SERVIC4E to ignore spurious variant calling that may occur in primer regions as a result of the concatenation of amplicons. By default, this parameter is set to 0; for our datasets, we used a trim value of 25, which is the approximate length of our primers.
Reliable detection of rare variants in pooled samples
, we identified 68 unique variants (total of 333 among 12 pools), of which 34 were exonic variants in our first dataset of 480 samples (Additional file 4
). For validation, we performed Sanger sequencing for all exonic variants in individual samples in at least one pool. A total of 4,050 medium/high-quality Sanger traces were generated, targeting approximately 3,380 individual amplicons. Total coverage in the entire study by Sanger sequencing was approximately 930 kb (approximately 7.3% of total coverage obtained by high-throughput sequencing). Sanger sequencing confirmed 31 of the 34 variants. Fifteen rare exonic variants were identified as heterozygous in a single sample in the entire cohort.
A comparison with available variant calling algorithms
We compared our variant calling method to publicly available algorithms, including SAMtools, SNPSeeker, CRISP, and Syzygy [1
]. Because some variants are present and validated in multiple pools and each pool is considered as an independent discovery step, we determined the detection sensitivity and specificity on a variant pool basis. Results are shown in Table .
Validation analysis of variant calling from first cohort samples
To call variants with SAMtools [15
], we used the deprecated Maq algorithms (SAMtools pileup -A -N 80), as the regular SAMtools algorithms failed to identify all but the most common variants. As a filtering cutoff we retained only the top 95th percentile of variants by consensus quality and SNP quality score (cq ≥ 196 and sq ≥ 213 for standard Illumina base calls, Figure ; cq ≥ 161 and sq ≥ 184 for Srfim base calls, Figure ).
] uses large deviation theory to identify rare variants. It reduces the effect of sequencing errors by generating an error model based on internal negative controls. We used exons 6 and 7 as the negative controls in our analysis (total length = 523 bp) as both unfiltered SAMtools analysis and subsequent Sanger validation indicated a complete absence of variants in both exons across all 12 pools. Only Illumina base calls were used in this comparison because of a compatibility issue with the current version of Srfim. The authors of SNPSeeker recently developed a newer variant caller called SPLINTER [18
], which requires both negative and positive control DNA to be added to the sequencing library. SPLINTER was not tested due to the lack of a positive control in our libraries.
] conducts variant calling using multiple criteria, including the distribution of reads and pool sizes. Most importantly, it analyzes variants across multiple pools, a strategy also employed by SERVIC4E
. CRISP was run on both Illumina base calls and Srfim base calls using default parameters.
] uses likelihood computation to determine the probability of a non-reference allele at each position for a given number of alleles in each pool, in this case 80 alleles. Additionally, Syzygy conducts error modeling by analyzing strand consistency (correlation of mismatches between the plus and minus strands), error rates for dinucleotide and trinucleotide sequences, coverage consistency, and cycle positions for mismatches in the read [19
]. Syzygy was run on both Illumina and Srfim base calls, using the number of alleles in each pool (80) and known dbSNP positions as primary input parameters.
SERVIC4E was run using a trim value of 25 and a total allele number of 80. All other parameters were run at default. The focus of our library preparation and analysis strategy is to identify rare variants in large sample cohorts, which necessitates variant calling software with very high sensitivity. At the same time, specificity must remain high, primarily to ease the burden during validation of potential variants. In addition to calculating sensitivity and specificity, we calculated the Matthews correlation coefficient (MCC; see Materials and methods) for each method (Table ) in order to provide a more balanced comparison between the nine methods.
For validation of our dataset, we focused primarily on changes in the exonic regions of our amplicons. Any intronic changes that were collaterally sequenced successfully were also included in our final analysis (Table ). Sixty-one exonic positions were called as having a variant allele in at least one pool by one or more of the nine combinations of algorithms tested. We generated Sanger validation data in at least one pool for 49 of the 61 positions identified. Genotypes for validated samples are indicated in Additional file 5
SNPSeeker (with Illumina base calls) performed with the highest specificity (97.3%), but with the worst sensitivity (62.2%), identifying less than half of the 15 valid rare exonic variants (Table ). This is likely due to an inability of this algorithm to discriminate variants with very low allele frequencies in a pool; 84% of SNPSeeker's true positive calls have an allele frequency ≥ 1/40, while only 13% of the false negative calls have a frequency ≥ 1/40 (Additional files 4
). SNPSeeker's MCC score was low (61.8%), due in large part to its very low false positive rate.
SAMtools alone with Illumina base calls achieved a 92.2% sensitivity, identifying all 15 rare exonic variants; however, these results were adulterated with the highest number of false positives, resulting in the worst specificity (56.2%) and MCC score (52.8%) among the nine methods (Table ). Incorporation of Srfim base calls cut the number of false positives by 60% (from 32 to 13) without a sizeable reduction in the number of true positive calls (from 83 to 80). Fourteen of the fifteen valid rare exonic variants were successfully identified, which while not perfect, is an acceptably high sensitivity (Table ). Srfim made noticeable improvements to individual base quality assessment as reflected in a substantial reduction in low quality variant calls (Figure ) by reducing the contribution of low quality base calls to the average quality distribution (Figure ) and by reducing the tailcurve effect that leads to many false positives (Additional file 3a, b
). Most low quality variant calls eliminated when transitioning to Srfim were not valid; nonetheless, three low quality valid variant calls were similarly affected by Srfim, and their loss resulted in a slight reduction in the true positive rate.
CRISP using Illumina base calls achieved a sensitivity slightly lower than SAMtools (87.8% versus 92.2%). Additionally, CRISP identified only 13 of the 15 valid rare exonic variants. Though this is lower than SAMtools, it is a large improvement over SNPSeeker; for the purposes set forth in our protocol, the > 75% sensitivity for extremely rare variants achieved by CRISP (using either base-calling method) is acceptable (Table ).
Syzygy achieved the second highest sensitivity (94.4%) using Illumina base calls, but specificity remained low (67.1%). Fourteen of the fifteen rare exonic variants were successfully identified. CRISP and Syzygy achieved relatively average MCC values (50.5% and 65.0%, respectively), reflecting better performance than SAMtools with Illumina base calls.
SERVIC4E using Illumina base calls achieved the highest sensitivity (97.8%) and identified all 15 valid rare exonic variants. Both sensitivity and specificity were improved over SAMtools, CRISP, and Syzygy (Table ), reflected in the highest MCC score of all the tested methods (84.2%). Taken together, the combination of SERVIC4E with either base-calling algorithm provides the highest combination of sensitivity and specificity in the dataset from pooled samples.
As previously mentioned, Srfim greatly improved variant calling in SAMtools, as is reflected in the 19% increase in SAMtools' MCC value (from 52.8% to 71.4%). CRISP, Syzygy, and SERVIC4E benefited little from using Srfim base calls: the MCC value for CRISP improved by only 6% (from 50.5% to 56.5%), Syzygy diminished by 4.6% (from 65.0% to 60.4%), and SERVIC4E diminished by 6.5% (from 84.2% to 77.7%). Importantly, use of Srfim base calls with Syzygy diminished its capacity to detect rare variants by a third. These three programs are innately designed to distinguish low frequency variants from errors using many different approaches. As such, it can be inferred from our results that any initial adjustments to raw base calls and quality scores by the current version of Srfim will do little to improve that innate capacity. In contrast, SAMtools, which is not specifically built for rare variant detection and would therefore have more difficulty distinguishing such variants from errors, benefits greatly from the corrective pre-processing provided by Srfim.
In addition to performance metrics like sensitivity and specificity, we analyzed annotated SNP rates, transition-transversion rates, and synonymous-non-synonymous rates of the nine algorithms on a variant-pool basis (Additional file 7
The variant pools with the greatest discrepancies between the various detection methods tended to have an estimated allele frequency within the pool that is less than the minimum that should be expected (1/80; Additional files 4
, and 8
). Such deviations are inevitable, even with normalization steps, given the number of samples being pooled. This underscores the importance of having careful, extensive normalization of samples to minimize these deviations as much as possible, and the importance of using variant detection methods that are not heavily reliant on allele frequency as a filtering parameter or are otherwise confounded by extremely low allele frequencies.
Validation using data from an independent cohort of samples
To further assess the strength of our method and analysis software, we sequenced the same 24 GRIP2
exons in a second cohort of 480 unrelated individuals. The same protocol for the first cohort was followed, with minor differences. Firstly, we pooled 20 DNA samples at equal concentration into 24 pools. The first 12 pools were sequenced in one lane of a GAII and the last 12 pools were sequenced in a separate lane (Additional file 9
). Additionally, the libraries were sequenced using the 100-bp paired-end module, and sequencing was conducted using a newer version of Illumina's sequencing chemistry. These 24 libraries occupied approximately 5% of the total sequencing capacity of the two lanes. The remaining capacity was occupied by unrelated libraries that lacked reads originating from the GRIP2
To map reads from this dataset, we initially used Bowtie's strict alignment parameters (-v 3), as we had done with our first dataset, but this resulted in a substantial loss of coverage in the perimeters of target regions. This is likely due to reads that cross the junctions between our randomly concatenated amplicons; such reads, which have sequence from two distant amplicons, appear to have extensive mismatching that would result in their removal. This effect became pronounced when using long read lengths (100 bp), but was not noticeable when using the shorter reads in our first dataset (Additional file 10
). This effect should not be an issue when using hybridization enrichment, where ligation of fragments is not needed.
In order to improve our coverage, we used Bowtie's default parameter, which aligns the first 28 bases of each read, allowing no more than two mismatches. To focus on GRIP2
alignments, we provided a fasta reference of 60 kb covering the GRIP2
locus. A total of 6.4 million reads (5.6% of all reads) aligned to our reference template of the GRIP2
locus. The depth of coverage for each amplicon pool is shown in Additional file 11
. For exonic positions, the average allelic coverage was 60.8×, and the minimum coverage was 10×; 99.9% of exonic positions were covered at least 15× per allele, and 98.5% were covered at least 30× per allele.
We did not apply Srfim base calls to our variant calling as Srfim has not yet been fully adapted to the newer sequencing chemistry used with this cohort. For variant calling, we tested Syzygy and SERVIC4E
, the two most sensitive software identified in our first dataset when using only the standard Illumina base calls (Table ). Syzygy was provided with a template-adjusted dbSNP file and a total allele number of 40 as input parameters. All other parameters were run at default. Syzygy made a total of 474 variant calls across 24 pools (74 unique variant calls). Of the 74 unique calls made, 36 were exonic changes. SERVIC4E
was run using a trim value of 25 and a total allele number of 40. All other parameters were run at default. SERVIC4E
made a total of 378 variant calls across 24 pools (68 unique variant calls). Of the 68 unique calls made, 33 were exonic changes. Between Syzygy and SERVIC4E
, a total of 42 unique exonic sequence variant calls were made (Additional files 12
For validation of these results, we again targeted variants within exons for Sanger sequencing. Sanger data were successfully obtained from individual samples in at least one pool for 41 of the 42 exonic variants. Genotypes for validated samples are indicated in Additional file 14
. Results are summarized in Table and include any intronic variant pools that were collaterally Sanger sequenced successfully. Of the 41 exonic variants checked, 29 were valid. Sixteen were identified as occurring only once in the entire cohort of 480 individuals. Syzygy achieved a high sensitivity of 85.5% but a fairly low specificity of 59.4%. Of the 16 valid rare exonic variants, 13 (81.25%) were identified. The MCC score was low (45.9%), primarily as a result of the low specificity (Table ). SERVIC4E
achieved a higher sensitivity of 96.4% and a higher specificity of 93.8%. All 16 valid rare exonic variants were identified and a high MCC score (89.9%) was obtained. The combined analysis of the first and second cohorts identified 47 valid coding variants, of which 30 were present only once in each cohort.
Validation analysis of variant calling from second cohort samples