Our experimental protocol for indexing is summarized in and further detailed in the supplementary methods
. We amplified multiple 5kb regions (supplementary table 1 and 2
) by long-range PCR, for 46 individuals genotyped by the ENCODE projects1,8
. Amplicons were equimolar pooled for each individual, digested, blunt end-repaired, flanked by an adenosine overhang, and ligated to one of the 46 indexed adapters (supplementary table 3 and 4
). Following ligation, samples from all individuals were pooled into a single sample (referred to as an indexed library), purified, enriched by PCR, and sequenced on the Illumina GA on a single lane of an 8 lane flow-cell. We prepared two libraries, Library A, consisting of 10 5kb amplicons covering 50 kb, and Library B, consisting of 14 5kb amplicons covering 70kb (supplementary table 2
). Library A contains both regions that were previously capillary sequenced and regions that were not sequenced within the ENCODE project, whereas Library B contains only regions previously sequenced within the ENCODE project.
Schematic describing the preparation of indexed libraries. The red box indicates the indexing step, where for each person a unique indexed adapter was ligated to the fragmented genomic DNA.
We used a six-base design, which allows us to control, tolerate, and measure error during base-calling of the index. The six-base index provides substantial degeneracy: only 46 of the 4096 possible nucleotide combinations were utilized (see supplementary table 4
for indexes). Moreover, indexes were chosen so that 1, and in some cases 2, sequencing errors could be tolerated without an index being incorrectly identified as being a different valid index. While not implemented in this study, utilizing each of the four nucleotides within an index may provide for higher accuracy base-calling since each base would have to be correctly called at least once within a sequenced read.
Using this design strategy, 48 of the 4096 possible 6-mers were synthesized and used as indexes for multiplexed sequencing. Perfect alignment of any index should occur at ~0.1% by chance. The 6th base of the index was an obligate thymidine necessary for ligation of the adenosine overhang. The first and fifth bases were identical to detect biases during normalization and calculation of the deconvolution matrix. In practice, we used 46 of the 48 indexes to allow for plate layouts that included positive and negative controls.
Typically, 3–10 million short-read (32 or 42 base) sequences were generated for each lane of an 8-lane flow cell, though early sequencing runs exhibited greater variability in the number of sequenced reads. After filtering using Illumina analysis pipeline defaults, approximately 45–50% of the reads remained. We observed a large spread in the number of counts per index (). Although a systematic reason for the initial spread in index performance was not identified, weaknesses in index design were obvious in some cases. For example, ‘AAAAAT’ which was frequently read as ‘AAAAAAT’, perhaps due to an oligonucleotide synthesis bias. A few indexes that were not well represented were complementary to other sections of the adapter sequence, possibly hindering adapter formation. Resequencing the same library gave nearly the identical distribution of reads regardless of run performance, indicating that the distribution is likely not due to a post-PCR enrichment step. Furthermore, recreating libraries and sequencing different individuals in additional sequencing runs did not substantially alter performance for indexes that were substantially under-represented or over-represented. Of the 46 initial indexes, 19 indexes varied by less than a factor of 5 between the most and least common index and 13 indexes varying by less than a factor of 2. While some of the initial index variability was consistent between sequencing runs, retrospective analysis of gel images suggests that a portion of the index variance may be due to subtle differences in DNAse digestion of pooled amplicons, whereby the number of available ligation targets is higher for samples that are digested with higher efficiency. In runs subsequent to these initial libraries (data not shown), we observed that using gel-images of the PCR-enriched products or qPCR, to quantify the ligated adapter prior to pooling, reduced index variability such that the best covered index was observed 5-fold more frequently than the least covered index. By comparison the same ratio was 11-fold without quantification of the ligated primers prior to pooling. While future studies may improve index variability still further, it may be effectively managed without substantially affecting workflow, by requiring higher average coverage within a study, by sequencing on two lanes with different indexes, or by sequestering samples with deficient coverage for later runs.
As shown for a subset of library A, coverage across individual 5kb amplicons was even and generally free of large gaps (). We did observe base-to-base variability in the coverage, as expected from alignment of short reads. Both between amplicons and within an amplicon, some deviation from the expected Poisson distribution was observed. Clearly amplicon-to-amplicon variability contributes to some extent to the departure from the expected Poisson distribution. For a given index, we observed approximately a 1.5 to 2.0 fold difference between the amplicons with the most and fewest number of reads. Inspecting gel images for selected amplicons confirmed that these observed differences within regions were largely due to uneven pooling of amplicons. The observed amplicon-to-amplicon variability is likely to be due to the fact that we utilized median concentrations across the plate when pooling amplicons for an individual, rather than individually pipetting each amplicon based on its specific concentration.
Relationship between mean and local coverage
Comparing a given amplicon across indexes (i.e. across individuals), there is clearly some base-level correlation in coverage based on the positions of spikes and valleys within the coverage plots (). Within a single amplicon there was also departure from a Poisson distribution, evident from the fact that the same bases had little or no coverage across individuals. Indeed, there is consistency between individuals with regard to bases that are under or over-represented. The rank correlation coefficient between indexes at a given base averaged 0.408, suggesting that local sequence (or base order) accounts for slightly less than half of the base-to-base variability in coverage.
Error reduction/Alignment strategy
Depending on alignment rules, aligning a short read to a reference sequence reduces the sequencing error rate at the cost of limiting discovery. We aligned 35-base pair sequences, allowing for only a single error. We are thus essentially limited to identifying single base substitutions in an aligned read, while limiting error to 1/35 or 2.8% as explained below. We further required that two stretches of 11 or more consecutive bases match the reference sequence or that the read have at least one stretch of 15 consecutive matches to the reference sequence. In both cases, our aligner required that the final 2 bases match the reference sequence to insure we did not over-align an error at the final base. The rules for alignment were largely chosen to control error, and would falsely align a randomly generated sequence in less than 0.1% of alignments in a 100kb region. Given our tolerance for 1 error in alignment, we expect a maximum per-base error rate of 2-3% (1 error in 35bases ≈ 2.8%).
Alignment of short-reads has advantages. For example, one would expect that we would have greater difficulty detecting closely neighboring single nucleotide polymorphisms (SNPs) since we mostly limit our aligner to 1 non-consecutive mismatch. However, the short-reads stochastically overlap and these various types of neighboring genetic variants are observed by alignment of multiple sequences not spanning both variants.
Polymorphism discovery is a primary goal for resequencing an association interval for a GWA study, particularly under the common variant hypothesis. Indeed, in some cases one may only wish to know which bases are polymorphic for custom-genotyping on a separate platform.
We first provide an intuitive explanation of our analysis approach for polymorphism discovery, noting that detailed equations are provided in the methods
. We utilized a Bayes factors to compare the probability that the distribution of mismatched bases arises from sequencing error to the probability that the distribution of mismatches arises from diploid polymorphism. For example, if 20% of reads for a given base were non-concordant with the reference sequence across all individuals, and the non-concordant bases were due to the presence of a SNP, one would expect each individual to be homozygous (0% or 100% concordance with reference) or heterozygous (concordance split 50/50). On the other hand, if the 20% non-concordant bases were due to sequencing error, then the number of non-concordant bases for each individual would follow a binomial distribution around 20% (e.g. person 1~20.5%, person 2~19.3%, person 3~20.7%, etc). As described below, the error estimates required to calculate the probability of a genetic variant being a true variant are readily obtainable when individuals are indexed and multiplex-sequenced. Further, indexed and multiplexed sequencing removes run-to-run biases which would confound these estimates if all aspects of experimental design were not properly randomized. Bayes factors are particularly effective for the uneven coverage inherent to short read sequencing, and provide a mechanism to control false positives in light of more or less evidence.
Sequenced regions were analyzed base-by-base for all individuals by calculating a polymorphism discovery Bayes factor (defined as Ks
in equation 2
). An example plot of Ks
across each base (of 50kb) is shown in for Library A; a similar analysis was conducted for Library B (supplemental figure 1
Discovery of variant bases by simultaneous analysis of all individuals
We next evaluated false-positive and false negative rates to assess our experimental and analytical framework for variant discovery ( for Library B and for both Library A and B). False positives are particularly difficult to quantify since not all polymorphic sites are known, even in previously resequenced regions. In our analysis, to be defined as a false positive, a variant must not exactly match the location of variants within dbSNP, and must not have trace sequencing data indicating a previously missed variant. In some cases trace sequence data was not available or unreliable. Consequently, the false positive rate is expected to be an upper estimate since the exact position must be validated as polymorphic by an existing database. False negative rates were determined by calculating if a base known to be polymorphic in our library of HapMap individuals reached previously specified Ks thresholds. This calculation of false-negative rates does have some bias, since it does not take into account coverage of the polymorphic base. plots the dependence of Ks on coverage.
Table 1 Evaluation of false positive and false negative rates for polymorphism discovery at various Ks and Ki thresholds, irrespective of coverage. Rates are calculated using Library B since all regions had been previously resequenced within the ENCODE project. (more ...)
Relationship between base-level coverage and Bayes-factor for polymorphism discovery and variant genotyping
As expected, setting a higher threshold for Ks
gives fewer false positives. In for Library A, as Ks
increases from 10 to 1,000 the false positive rate decreases from 69.6% to 11.3%. Likewise, with fixed coverage we observe the false negative rate increasing from 10.8% to 90.8% as Ks
increases from 10 to 1,000. A more detailed discussion of false-negative and false-positive rates is provided in the supplementary methods
. Referring to , all the false negatives occur when the cumulative coverage of individuals with the rarer variant is less than 10 reads. Further highlighting the dependence of false-negatives on coverage, all polymorphisms that were covered by 20+ reads (summed across individuals known to differ from the reference) have a Ks
>1,000. Overall, we observed that 90% of variants were detectable, though designing for >20 reads will be essential for controlling false negatives.
Through the course of analyzing bases with a Ks > 100 for false positives, using NCBI archived ENCODE traces, new SNPs were discovered that were evident in visual reinspection of capillary traces, but that had not been annotated in dbSNP (). These examples demonstrate that index-based resequencing can identify novel variants even in heavily sequenced and heavily annotated regions. Within Library B, it is intriguing to note that two variants with a Ks>100 were not SNPs but actually insertions (rs11279266 is a 1bp insertion and rs10555419 is a 6bp insertion). Thus it is possible to identify genetic variants explicitly not allowed within the alignment scheme.
Genotyping individuals at known polymorphisms
Since false negatives are clearly tied to coverage, we explored the influence of coverage further by analyzing the above regions in an individual-by-individual analysis. Derived in Equation 3
within the methods, Ki
is an analogous Bayes-factor for an individual having the rarer allele at a known polymorphic base. Conceptually, it can be thought of as a specific individual’s contribution to Ks
. Shown in high granularity (), we calculate the percentage of variants correctly identified in an individual given a certain number of reads. For example, when the coverage for a base was ~20 reads (averaging from 16 to 24), we detected >80–90% of the bases at Ki
>10, with a false-positive rate of 1.6%. In comparison to polymorphism discovery, the low false-positive rates of genotyping at a known polymorphic base are due to the fact that we are no longer assessing thousands of bases for a rare event, but rather assessing a few dozen individuals for a more frequent event.