Complementary techniques that deepen information content and minimize reagent costs are required to realize the full potential of massively parallel sequencing. Here, we describe a resequencing approach that directs focus to genomic regions of high interest by combining hybridization-based purification of multi-megabase regions with sequencing on the Illumina Genome Analyzer (GA). The capture matrix is created by a microarray on which probes can be programmed as desired to target any non-repeat portion of the genome, while the method requires only a basic familiarity with microarray hybridization. We present a detailed protocol suitable for 1–2 µg of input genomic DNA and highlight key design tips by which high specificity (>65% of reads stemming from enriched exons) and high sensitivity (98% targeted base pair coverage) can be achieved. We have successfully applied this approach to the enrichment of coding regions, in both human and mouse, ranging from 0.5 to 4 Mb in length. From genomic DNA library production to base-called sequences, this procedure takes approximately 9–10 d, inclusive of array captures and one Illumina flow cell run.
The notion that the personal genome of a human individual could be sequenced in less than a year was virtually unthinkable 5 years ago. Even more unthinkable was the idea of doing a low-coverage sequence survey for <$20,000. The commercial availability and widespread adoption of massively parallel sequencing by synthesis (SBS) platforms has turned our collective focus toward the development of tools and infrastructure to apply their capacity to rapidly and cost-effectively tackle a wide variety of questions related to genome biology. To this end, multiple consortia have assembled to accelerate the generation of sequence catalogs of human genetic variation as exemplified by the 1000 Genomes Project1,2 and The Cancer Genome Atlas3,4. Moreover, a recent surge of publications describing the resequencing of the diploid genomes of Craig Venter5, James Watson6, the first genome of an Asian individual7, and the genome of a malignant tumor8 firmly demonstrates the unprecedented opportunity presented by the use of this technology.
Though many would argue that whole genome sequencing provides the most comprehensive and unbiased approach for the discovery of disease-causing mutations, this method presently has substantial barriers to its routine application. Despite their considerable throughput, the large data sets produced by SBS instruments comprise relatively short sequence read lengths that suffer from higher error rates than conventional methods9. The combined informatic challenges associated with these two characteristics necessitate high levels of sequence sampling to obtain definitive evidence for detection of a variant. In fact, several groups have estimated the minimal required read depth to be around 20-fold coverage for sufficient error compensation to generate accurate base calls10. Furthermore, detecting a heterozygous genotype with high statistical confidence will require even deeper coverage, as adequate sampling of both alleles is needed11. On the basis of these estimates, the cost and sequencing run time required for such endeavors remain too prohibitive for an individual investigator or a small group wishing to undertake a survey of genetic variation in many individuals. Therefore, the ability to fractionate a complex eukaryotic genome for resequencing provides a potential approach to a variety of biological questions.
Recently, we and others demonstrated a means of directing sequencing efforts to sets of defined genomic intervals, boosting coverage of high-value regions by excluding others through a subtractive hybridization strategy12–17. This selection scheme uses either complex libraries of soluble probes or high-density tiling DNA microarrays to purify large continuous or discontinuous genomic regions. After hybridization, the captured material is recovered for direct sequencing on massively parallel SBS platforms. Through array-based hybrid selection and deep sequencing, we achieve significant enrichment and ample coverage of target exon regions, illustrating the potential breadth of this resequencing approach to uncover variations that might otherwise escape detection.
Earlier versions of solution-based hybrid selection have used BAC/YAC (bacterial/yeast artificial chromosomes) clones as bait for the isolation of specific transcripts from cDNA libraries or regions of total genomic DNA16. More recently, the highly parallel in situ oligonucleotide synthesis characteristic of current microarray printing technologies has enabled the production of complex nucleic acid libraries containing tens of thousands of custom-defined probes. Some synthesis platforms achieve lengths up to 200 nucleotides. In addition, the probes can be cleaved and solubilized by soaking the arrays in an alkaline solution18. This complex pool of probes may then be PCR-amplified to create a renewable source of material for capture, and may be tagged with a biotin label for subsequent affinity purification. Several adaptations of array-released oligo libraries for genomic enrichment have been reported, including a ‘molecular inversion probe’ strategy15 and probes transcribed into long (170 bp) RNA baits17. The advantage of hybridization in solution is that high specificity can be achieved with minimal amounts of input material. The disadvantage, however, is that nonuniform representation of probes, resulting from their initial molecular preparation and handling, may produce unequal recovery of targets, leading to disparities in coverage of genomic intervals.
Tiling arrays consist of chemically synthesized and spatially immobilized oligonucleotides in which predefined sequences reflect regions of the genome at frequent and uniform intervals. Such arrays may be programmed to exclude repetitive elements, thereby optimizing available array capacity and performance. Earlier, we described the genome-wide selection and focal resequencing of more than 200,000 protein-coding human exons13. We validated this approach using 385k NimbleGen arrays, which were printed with custom-designed, variable-length probes tiling exons at 20 nucleotide intervals. In brief, genomic DNA (20 µg input) was first sheared to a specified length (~600 bp) and fragments were hybridized to arrays over a period of 3–4 d, followed by stringent array washing and fragment recovery by thermal elution. In the captured material from the initial set of eight arrays, more than 50% of all sequenced fragments originated from the exonic regions selected by probes. In addition, 25% of the total bases within those intervals were covered by reads. These results are based on the initially described procedure and do not take into account our observation that fragment size and read distribution can influence the interpretation of the data. On the basis of several key lessons gained from this study, we were motivated to develop an improved protocol that better combines array capture with sequencing on short read SBS platforms (Fig. 1).
Improvements to the method and their rationale include the following:
Here, we describe in detail our array capture protocol. This procedure assumes initial DNA quantities of 1–2 µg. We are applying this protocol to tackle numerous genomic questions in multiple organisms, including human, mouse and rat. In fact, highly comparable enrichment data were obtained from capture experiments probing human exons within a region below 1 Mb (see ANTICIPATED RESULTS) and ~1,000 mouse exons totaling 4 Mb of target sequence. Therefore, we find that the improvements described here yield better, more comparable results that are largely species- and target-independent.
This section outlines key points for the array capture setup that, in our experience, provide the best results we have attained thus far.
The array configurations currently in use in our lab are designed in a manner similar to earlier described methods13, with several important exceptions. First, Agilent arrays have 243,504 features available for probe design, ~140,000 fewer than the NimbleGen arrays used earlier. However, the Agilent features cover a larger surface area, and thus are likely to contain >50 times more molecules per feature. In addition, we have relied on a standard 60-mer oligonucleotide production format that omits synthesis of variable-length probes for Tm matching, reducing complications during probe design and array synthesis. We have designed arrays that target genomic regions ranging from 0.5 to 4 Mb, with tiling intervals (the distance in nucleotides separating the start positions of adjacent probes) ranging from 3 to 20 bp depending on the size of the areas under selection. Earlier, we noted that tiling across individual regions (in this case, exons) generates unequal read depth within the boundaries of each interval. This may be explained by the fact that probes overlap, so that the center of the region is covered by more overlapping probes than the edges (where probes begin to taper off). To compensate for this trend, we begin tiling each interval in flanking regions beginning 60 bp upstream and ending 60 bp downstream of the interval boundaries. Furthermore, if an interval is shorter than 150 bp, we extend the region at both ends to include a minimum of 150 bases of genomic sequence.
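The tiling scheme described above can be sketched in a few lines of Python. The helper below is purely illustrative (`tile_interval` is our own hypothetical name, not part of any published pipeline), assuming 60-mer probes, 60-bp flanking extensions and a 150-bp minimum interval length:

```python
def tile_interval(start, end, tile_step, probe_len=60, flank=60, min_len=150):
    """Generate probe start positions for one target interval.

    Illustrative sketch of the tiling scheme: tiling begins 60 bp
    upstream and ends 60 bp downstream of the interval boundaries,
    and intervals shorter than 150 bp are symmetrically extended.
    """
    # Extend short intervals to the 150-bp minimum
    if end - start < min_len:
        pad = (min_len - (end - start) + 1) // 2
        start, end = start - pad, end + pad
    # Tile from 60 bp upstream to 60 bp downstream of the interval
    first = start - flank
    last = end + flank - probe_len
    return list(range(first, last + 1, tile_step))

# A 200-bp interval tiled at a 20-bp interval yields 14 probe starts
starts = tile_interval(1000, 1200, tile_step=20)
```

With a denser tiling interval (e.g., 3 bp for small targets), the same function simply emits proportionally more probe starts per interval.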
Probably the most critical aspect of array design is the filtering of repetitive elements to reduce nonspecific binding. Not only do we exclude nonunique probes but we also eliminate probe sequences that correspond to highly repetitive regions in the genome. There are two methods by which this can be accomplished. The first method, called RepeatMasker filtering (Smit, A.F.A., Hubley, R. & Green, P. RepeatMasker Open-3.0, 1996–2004), results in conservative coverage of the genome with regard to repeat-rich regions. An alternative approach, based on the frequency of 15-mer sequence combinations in the genome, calculates the 15-mer frequency for all 46 potential stretches of contiguous 15 bases within an individual 60-mer probe13,19. If the average genomic frequency for the 46 15-mers is greater than a set threshold, in our case 100, the probe is filtered out. The 15-mer frequency tables that have been generated for both the mouse and human genomes are available upon request. In general, this method is slightly less stringent than the RepeatMasker filtering, but can be adjusted by setting the threshold lower or higher depending on the desired stringency and the sequence content of the regions of interest. For example, if the desired target is an entire genic region, containing both exons and introns, and a considerable portion of this is repetitive, it may be necessary to relax this threshold to include more of the region or achieve better coverage. On the other hand, if several paralogs exist for a particular gene, a lower, more stringent threshold may be necessary.
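The 15-mer frequency filter lends itself to a compact sketch. The function below is an illustrative implementation (the name `passes_kmer_filter` and the toy frequency table are our own assumptions): it averages the genomic frequency of the 46 overlapping 15-mers within a 60-mer probe and rejects the probe if the mean exceeds the threshold of 100:

```python
def passes_kmer_filter(probe, freq_table, threshold=100, k=15):
    """Return True if the probe survives the 15-mer frequency filter.

    freq_table maps each 15-mer to its genome-wide occurrence count
    (analogous to the precomputed tables mentioned in the text).
    A 60-mer contains 60 - 15 + 1 = 46 overlapping 15-mers; the probe
    is rejected when their mean genomic frequency exceeds threshold.
    """
    kmers = [probe[i:i + k] for i in range(len(probe) - k + 1)]
    mean_freq = sum(freq_table.get(m, 0) for m in kmers) / len(kmers)
    return mean_freq <= threshold
```

Raising the threshold admits more probes in repeat-rich genic targets; lowering it enforces the stricter behavior desirable when paralogs are present.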
To control for performance and reproducibility between experiments, a number of randomly dispersed array features may be reserved for a small subset of informative intervals. These intervals may serve either as positive controls of enrichment (enrichment standards of an expected level), i.e., regions/exons displaying high enrichment in earlier captured material, or as sample-content controls, whereby contamination levels can be estimated by examining known/expected haplotype regions. The ‘control grid’ can be used to unify results obtained from arrays of different target origin and size (within the same species).
The quality of the genomic DNA library strongly influences the success of the array capture. Our library protocol is historically based on the standard Illumina procedure11 for single-end sequencing, but can be easily adapted for paired-end applications. The first step in this process is DNA fragmentation, either by nebulization or by sonication. Nebulization consumes more DNA than sonication owing to the atomizing process and the wide size distributions generated, which is not favorable when size selection is necessary. Thus, for many samples, nebulization is unsuitable because of the large amounts of required input DNA. For this reason, we use sonication as an alternate fragmentation method for most library preparation. Although many options exist, we routinely generate high-quality libraries using a focused acoustics system (Covaris) or a closed system ultrasonic disruptor (Bioruptor, Diagenode). The latter system is lower in cost, while achieving higher energy transfer efficiency and more reproducible performance than standard probe sonicators. In addition, multiple samples can be processed simultaneously in a uniform manner. Subsequently, DNA is subjected to a series of enzymatic reactions that repair frayed ends, phosphorylate the fragments, and add a single nucleotide A overhang (Fig. 1). In these protocols, we have used both the kit reagents purchased from Illumina11 and off-the-shelf reagents as described below with comparable results. After ligating Illumina adaptors, 150–300 bp fragments are selected and purified by gel extraction. At this stage, we carry out multiple parallel PCR amplifications (8–10 per sample) of the ligated fragments, after which the amplified products are pooled. By carrying out multiple reactions in parallel and pooling, fewer cycles are required to generate the optimal 20 µg for capture without compromising the complexity or introducing a representational bias.
In addition, barcoding strategies, which enable sample multiplexing, are suitable for array capture and may offset the per sample number of PCRs required. It is also worth noting that significant improvements to the library preparation protocol have been recently reported and could easily be implemented here20.
Although the arrays are processed under stringent hybridization and washing conditions, certain factors can influence the specificity of the capture experiment. There are several scenarios in which unintended cross-hybridization of fragments can occur. First, although repetitive probe filtering helps to reduce nonspecific hybridization to the array, it does not prevent repeats from confounding the results altogether. This may be because the average size of the input DNA is generally larger (2.5–5×) than the size of the probes. Therefore, the complementary oligo segment affixed to the array accounts for less than half of the DNA fragment. This leaves an unbound area of single-stranded DNA free to bind to any complement in the applied sample. If a repeat is contained within the unbound segment of the fragment, it may hybridize to complementary repeats in other library fragments (Fig. 2). This is especially challenging on the periphery of interval borders where bound DNA fragments can extend to regions beyond probe coverage into areas not filtered for repeats. To counteract this potential problem, species-specific Cot-1 DNA is added in excess of the input DNA to the hybridization mixture. The Cot-1 DNA is the repeat-rich fraction of genomic DNA that has been isolated on the basis of its re-annealing characteristics. We have used up to five-fold excess Cot-1, achieving our best results at 2.5-fold excess (see ANTICIPATED RESULTS). A second source of cross-hybridization stems from the common adaptor sequences present on the hybridized fragments. When the DNA becomes denatured before hybridization, the complementary adaptor sequences can bind indiscriminately to each other, regardless of the insert sequence (Fig. 2). The Illumina adaptors are also easily long enough to remain annealed under the conditions used for hybridization. 
Therefore, to compete for adaptor binding, we supplement the hybridization mixture with a molar excess of four distinct ‘blocking oligos’ that complement each strand of the adaptor sequence.
The elution step can be challenging without the appropriate equipment. Our goal was to develop a straightforward means of denaturing and recovering the hybridized material without using a specialized elution apparatus. The strategy we use takes advantage of the standard Agilent slide gasket chamber system (Fig. 3). This system comprises a steel chamber base and a rubber gasket-slide that, when assembled, forms a sandwich with the microarray and creates a hybridization compartment for the printed surface of the array. The rubber gasket creates a space through which a small-gauge syringe can be inserted without compromising the integrity of the seal. The procedure we outline requires the use of a hybridization oven that must be capable of reaching 95 °C. The elution mixture is withdrawn from the chamber using a syringe, and the liquid eluate is concentrated by lyophilization from ~490 µl down to 50 µl. Subsequently, an additional PCR step is carried out to accurately quantify the captured material. However, as the captured fragments already contain the full sequences necessary for cluster generation, this step may be eliminated provided the fragments can be quantified by quantitative PCR (qPCR).
Capture performance can be assessed either by qPCR for a defined set of intervals or by sequencing the captured material on an SBS instrument. For qPCR, primer pairs are selected for a small set of target loci and the CT values for each locus are compared between the input DNA and the captured DNA (see ANTICIPATED RESULTS). Larger exons are ideal for designing well-matched pairs. Negative controls should also be included. These can include any genomic region for which there are no selected probes. Each primer should be ~20 bp with a melting temperature of 58–60 °C. Optimal amplicon sizes are around 100 bp. Further instructions on designing primers can be found in the SYBR Green PCR master mix user guide. qPCR is useful for acquiring an initial snapshot of success, and as an experimental quality control for determining whether sequencing the eluate will be informative. However, qPCR does not provide a global sense of enrichment specificity and sensitivity. Therefore, massively parallel sequence analysis from single captured molecules is the only accurate and comprehensive way to estimate performance. The sequence analysis procedure will be described in further detail below.
After the hybrid selection process, enriched samples undergo cluster generation on the Illumina flow cell. Cluster generation is followed by 36 cycles of single base extension and cluster imaging on the Illumina Genome Analyzer (GA). Next, the cluster images are compiled and analyzed, and the called bases are assigned a quality score. Filtered sequences are mapped to the reference human genome (Hg18) using Eland, a built-in mapping algorithm for the Illumina analysis pipeline. Only reads mapping to unique positions in the genome with at most two mismatches to the reference are considered. We evaluate the success of an individual capture experiment by several metrics. First, the specificity is measured by the percentage of reads (the number of reads in target/the number of reads mapped) that overlap with targeted intervals by at least 1 bp (see ANTICIPATED RESULTS). Another way to describe specificity is by generating a ‘fold enrichment’ score that is calculated by dividing the percentage of reads in target by the percentage of bases targeted in the genome.
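Under these definitions, the specificity metrics can be computed directly. The sketch below uses illustrative read counts (the function name and the 65.37% figure are taken from the worked example in ANTICIPATED RESULTS; 829,564 targeted bases of a 3.3 × 10⁹ bp genome):

```python
def enrichment_metrics(reads_in_target, reads_mapped, target_bp,
                       genome_bp=3.3e9):
    """Specificity and fold-enrichment score as defined in the text.

    Specificity = reads overlapping targeted intervals / mapped reads.
    Fold enrichment = specificity / fraction of the genome targeted.
    """
    pct_in_target = reads_in_target / reads_mapped
    pct_genome_targeted = target_bp / genome_bp
    fold_enrichment = pct_in_target / pct_genome_targeted
    return pct_in_target, fold_enrichment

# Illustrative counts reproducing the 65.37% / ~2,600-fold example
spec, fold = enrichment_metrics(6537, 10000, target_bp=829_564)
```

Because fold enrichment scales inversely with target length, it should only be compared between arrays of similar target size, as discussed in ANTICIPATED RESULTS.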
The sensitivity of the array selection process is evaluated on two levels: (1) sequence coverage of the regions selected and (2) higher resolution coverage at the actual base pair level. The first level of coverage is determined by the number of targeted intervals covered by at least one overlapping read. Similarly, the base pair coverage is measured by the number of total target bases covered by at least one read. Both numbers provide a general estimate of the extensiveness and the complexity of target purification (see ANTICIPATED RESULTS). However, as the overall objective is to enable high-quality variant detection through enhanced coverage, the most valuable measurement will be the median sequencing depth, or the median number of times (X) an individual base within the target interval is covered by a unique read. For confident single nucleotide polymorphism (SNP) detection, we require high-quality base calls (> Q20) with a sequencing depth of at least 20×.
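The sensitivity metrics above reduce to simple per-base summaries. The function below is an illustrative sketch (our own naming; a real analysis would derive the depth vector from mapped reads) that reports the fraction of targeted bases covered at all, the median sequencing depth, and the fraction meeting the 20× SNP-calling cutoff:

```python
import statistics

def coverage_summary(depth_by_base, min_depth=20):
    """Per-base sensitivity metrics for a set of target intervals.

    depth_by_base: list of read depths, one entry per targeted base.
    Returns (fraction covered by >=1 read, median depth, fraction of
    bases at or above the min_depth cutoff).
    """
    n = len(depth_by_base)
    covered = sum(1 for d in depth_by_base if d > 0) / n
    median_depth = statistics.median(depth_by_base)
    callable_frac = sum(1 for d in depth_by_base if d >= min_depth) / n
    return covered, median_depth, callable_frac

covered, med, frac = coverage_summary([0, 5, 25, 30])
```

The same interval-level calculation (fraction of targets hit by at least one read) follows by applying the first metric per interval rather than per base.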
0.25% (wt/vol) bromophenol blue, 0.25% (wt/vol) xylene cyanol FF, 30% (vol/vol) glycerol in water. Add 1 vol of 6× DNA-loading dye to 5 vol of DNA sample. Store at RT for regular use or −20 °C for long-term storage.
50 mM Tris (pH 8.0), 40 mM EDTA, 40% (wt/vol) sucrose: can be stored at RT (20–25 °C) for several months.
Dissolve 96.8 g Tris in 250 ml water. Add 22.8 ml acetic acid and 40 ml of 0.5 M EDTA (pH 8.0). Adjust to a final volume of 1 liter. Vacuum filter the buffer before use. Prepare a 1× TAE working dilution; can be stored at RT for several months.
Add 3 g agarose to 150 ml 1× TAE electrophoresis buffer. Heat in a microwave oven until completely melted. Add ethidium bromide (final concentration 0.4–0.5 µg ml−1) to facilitate visualization of DNA after electrophoresis. After cooling to 50–60 °C, pour gel into a casting tray containing a gel comb and allow it to solidify at RT. Make a fresh gel before loading samples.
Make 1 mM working dilutions from 100 mM stock. Store at −20 °C until further usage.
Dissolve the primers in nuclease-free water to obtain 100 µM stock solutions. Prepare 50 µM working dilutions. Freeze at −20 °C until further usage.
Dissolve the primers in nuclease-free water to obtain 200 µM working dilutions for the hybridization master mix. Be sure to vortex thoroughly. Freeze at −20 °C until further usage.
Add 1,350 µl nuclease-free water to the lyophilized pellet. Leave at RT for 60 min. Mix gently by vortexing. Aliquot ~110 µl into several 1.5-ml centrifuge tubes to avoid freeze–thaw cycles. Store at −20 °C until further usage.
Dissolve the real-time primers in nuclease-free water to obtain 100 µM stock solutions. Prepare 10 µM working dilutions for real-time PCR master mix. Freeze at −20 °C until further usage.
We use the Diagenode Bioruptor UCD-200 equipped with a 1.5-ml microtube unit, which holds up to six 1.5-ml centrifuge tubes. Set the power to high (200 W) with alternating cycles of a 30-s burst of sonication followed by a 30-s pause.
Refer to the Agilent microarray chamber user guide (G2354-90001) for detailed instructions on how to load samples, assemble and disassemble chambers, as well as for other useful tips. The user guide is also available with the purchase of the Agilent microarray hybridization chamber kit (G2534A).
| Reagent | Volume (µl) per sample | Final concentration in reaction |
| --- | --- | --- |
| T4 DNA ligase buffer with 10 mM ATP (10×) | 10 | 1 mM ATP (1×) |
| dNTP mix (10 mM each) | 4 | 400 µM each |
| T4 DNA polymerase (3 U µl−1) | 5 | 0.15 U µl−1 |
| Klenow DNA polymerase (5 U µl−1) | 1 | 0.05 U µl−1 |
| T4 polynucleotide kinase (10 U µl−1) | 5 | 0.5 U µl−1 |
| Reagent | Volume (µl) per sample | Final concentration in reaction |
| --- | --- | --- |
| Klenow buffer (10×) | 5 | 1× |
| dATP (1 mM) | 10 | 0.2 mM |
| Klenow fragment exo− (5 U µl−1) | 3 | 0.3 U µl−1 |
| Reagent | Volume (µl) per sample | Final concentration in reaction |
| --- | --- | --- |
| T4 DNA ligase buffer (2×) | 25 | ~1× |
| DNA dilution buffer (5×) | 5 | 0.5× |
| Adapter oligo mix (10 µM) | 10 | 2 µM |
| T4 DNA ligase (5 U µl−1) | 1 | 0.1 U µl−1 |
| Reagent | Volume (µl) per sample | Final concentration in reaction |
| --- | --- | --- |
| Phusion DNA polymerase (2×) | 25 | 1× |
| PCR primer 1.1 (50 µM) | 1 | 1 µM |
| PCR primer 2.1 (50 µM) | 1 | 1 µM |
| Reagent | Volume (µl) per sample | Final concentration in reaction |
| --- | --- | --- |
| 20 µg adapter-modified genomic DNA | 138 | ~38.5 ng µl−1 |
| BO 1 (200 µM) | 5 | ~2 µM |
| BO 2 (200 µM) | 5 | ~2 µM |
| BO 3 (200 µM) | 5 | ~2 µM |
| BO 4 (200 µM) | 5 | ~2 µM |
| Cot-1 DNA (1 mg ml−1) | 50 | ~0.1 mg ml−1 |
| Agilent blocking agent (10×) | 52 | 1× |
| Agilent hybridization buffer (2×) | 260 | 1× |
BO, blocking oligonucleotide.
| Reagent | Volume (µl) per sample | Final concentration |
| --- | --- | --- |
| Phusion DNA polymerase (2×) | 25.0 | 1× |
| PCR primer 1.1 (50 µM) | 1.0 | 1 µM |
| PCR primer 2.1 (50 µM) | 1.0 | 1 µM |
SUMMARY 5–6 d
Steps 1–22: 5 h (1 d)
Fragment genomic DNA, repair ends and add 3′A-overhangs (Steps 1–13): 3 h
Ligate adapters (Steps 14 and 15): 0.5 h
Size-select and gel-purify adapted DNA fragments (Steps 16–22): 1.5 h
Steps 23–42: 68.5 h (3 d)
Enrich adapter-modified DNA and prepare it for hybridization (Steps 23–33): 3.5 h
Set up hybridization and pre-warm wash buffer (Steps 34–42): 65 h
Steps 43–68: 1 d
Post-hybridization wash and elution (Steps 43–58): 1 h
Concentrate and amplify eluted material (Steps 59–68): 4.5–6.5 h
Steps 69–72: 3.5 h (0.5 d)
Validate enrichment with real-time PCR (Steps 69–72): 3.5 h
Troubleshooting advice can be found in Table 2.
After elution and amplification of the captured material, qPCR may be carried out for a set of selected and depleted exons. The exons depleted from the captured genomic DNA are negative controls containing regions not selected by probes (Fig. 4). For example, in Figure 4, a difference of ~10 cycles is observed between the pre- and post-hybridization templates from a single experiment, indicating that ~1,000-fold more copies (2^10 ≈ 1,000) of the selected exon are present in the enriched sample. If the cycle threshold is lower in the captured library than in the input library, the selection has passed an initial quality control and may proceed through sequencing.
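Assuming ideal PCR efficiency (one doubling per cycle), the relationship between the observed difference in cycle thresholds and fold enrichment can be expressed as a one-line calculation (the Ct values below are illustrative, not measured):

```python
def fold_enrichment_from_ct(ct_input, ct_captured):
    """Estimate qPCR fold enrichment from cycle thresholds.

    Assumes ideal doubling per cycle, so a delta-Ct of ~10 between
    pre- and post-hybridization templates corresponds to 2**10,
    i.e. ~1,000-fold more copies of the selected exon.
    """
    return 2 ** (ct_input - ct_captured)

fold = fold_enrichment_from_ct(28.0, 18.0)  # delta-Ct = 10, 2**10 = 1024
```

In practice, amplification efficiency is below 100%, so this estimate is an upper bound on the true enrichment.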
To show how various components of hybridization influence capture efficiency, we performed a series of experiments shown in Table 3. These results stem from iterative optimization trials carried out on 244k arrays designed to select ~0.8 Mb of human exonic sequence from three independent chromosomal regions. The genomic DNA used for all captures was obtained from a single HapMap individual from the CEPH group (Coriell NA12762). It should be noted that the results described here are based on sequence data obtained from one lane of the eight-lane flow cell for each experiment/array capture. For the fourth experiment, two replicates (4A and 4B) are shown. These captures were performed and sequenced independently with two genomic DNA libraries generated separately from the same HapMap sample. Although there were differences in the number of total and mappable reads produced (a normal variation between sequencing runs), the results are highly comparable, indicating the reproducibility of the procedure given the optimal components and conditions.
Optimal results were achieved in the presence of excess Cot-1 DNA and adaptor blocking oligos. To show how enrichment specificity is determined, if 65.37% of the reads overlap with targeted intervals and the total target length is 829,564 bp, or 0.025% of the 3.3 × 109 bp human genome, the enrichment score will be 65.37%/0.025% or ~2,600, as indicated in Table 3. In other words, of the 328 Mb of sequence output generated in experiment 4A, 215 Mb are potentially informative. Importantly, the enrichment score is inversely proportional to the target length printed on the array. Therefore, it should only be used as a parallel comparison between two arrays with comparable target lengths, and not as a cross-comparison between two arrays directed to markedly different genomic target sizes.
For many assays of this kind, specificity and sensitivity are often at odds with one another. Skewing the representation disproportionately in favor of some regions over others can result in some target regions being missed while others become overrepresented in the data. However, we show here that high sensitivity may also be achieved by this method, as all of the targeted exons are occupied by at least one captured fragment (% regions with read) and >98% of the targeted bases are covered by at least one read (Fig. 5).
Importantly, both the total number of reads obtained and the percentage of those reads that can be mapped to the genome for any given sequencing run directly influence sequencing depth. After recent upgrades to the reagent chemistry of the GA, imaging hardware and accompanying software, we achieve nearly 3× higher read depth than what was generated by the earlier GA. The upgrade from GA1 to GA2 is exemplified by the discrepancy in total read depth reported in Table 3 between experiments 1 and 4. At present, we typically see 14–15 million reads per lane, of which 50–60% map uniquely to the reference genome with ≤2 mismatches. A majority of ‘background’ reads are likely derived from the genome as well, but are cast aside as a result of either ambiguous mapping positions or lower read quality stemming from high per-read error rates. In our experience, one GA flow cell lane of array-captured sequence is sufficient to achieve >20× sequencing depth for >93% of the bases inside a 0.7–0.8-Mb target. Consequently, for larger genomic regions, the specificity is only marginally affected, while the sequencing depth will be reduced. Thus, it is necessary to sequence more lanes in order to achieve deeper coverage.
The ultimate and most informative measure of sensitivity is the ability to effectively determine the full extent of polymorphisms present in a sample. To test this, we compared our sequence data with the set of known SNPs registered within the genomic intervals captured from the HapMap library (Table 4). From these data, >90% of the known SNPs, both homozygous and heterozygous, were correctly called, whereas <3% were not called due to lack of coverage. Table 4 also illustrates the accuracy of SNP detection when read depth is considered as a criterion/cut-off for calling SNPs. In other words, a SNP must be supported by a minimal number of reads covering that base position in order to be considered a confident call. We performed SNP analysis using a minimal requirement of either 20× or 50× read depth. The more stringent requirement of at least 50× read depth reduces the rate of false positives, but the total number of identified SNPs is also significantly reduced, especially in experiment 3, where three times fewer reads in target were obtained compared with 4A and 4B. Our data show that a sufficient compromise between accuracy and coverage can be achieved when the read depth cutoff is set to 20×.
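As an illustration of the read-depth cutoff, the toy caller below is our own simplified sketch, not the analysis pipeline used in the study; the 20% allele-fraction threshold is an assumed illustrative parameter. A position is only considered callable when its depth of high-quality (>Q20) reads meets the minimum:

```python
def call_snp(base_counts, ref_base, min_depth=20, min_frac=0.2):
    """Toy genotype caller illustrating the read-depth cutoff.

    base_counts: dict mapping base -> number of high-quality reads
    supporting it at one position. Returns None when depth is below
    min_depth (not callable), otherwise a diploid genotype string.
    min_frac is a hypothetical allele-fraction threshold.
    """
    depth = sum(base_counts.values())
    if depth < min_depth:
        return None                       # insufficient coverage: no call
    alt = {b: c for b, c in base_counts.items() if b != ref_base}
    if not alt:
        return ref_base + "/" + ref_base  # homozygous reference
    best, count = max(alt.items(), key=lambda kv: kv[1])
    if count / depth >= min_frac:
        if base_counts.get(ref_base, 0) / depth >= min_frac:
            return ref_base + "/" + best  # heterozygous call
        return best + "/" + best          # homozygous variant
    return ref_base + "/" + ref_base
```

Raising `min_depth` from 20 to 50 mimics the stricter analysis in Table 4: fewer false positives, but more positions returned as uncallable.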
We are thankful to Danea Rebolini, Laura Cardone and Melissa Kramer for sequencing and informatic support. We also thank Mona Spector for helpful discussions. This work was supported by an NIH postdoctoral training grant (E.H.) and by kind gifts from the Stanley Foundation and Kathryn W. Davis (G.J.H.). G.J.H. is an investigator of the Howard Hughes Medical Institute.
COMPETING FINANCIAL INTERESTS The authors declare competing financial interests (see the HTML version of this article for details).
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions