Both Nimblegen and Agilent have released commercial capture products. However, Nimblegen's protocol is specific to the Roche 454 sequencer, and no details of the hybridization mix contents are provided. Agilent's protocol uses solution-based oligos, and although it can be adapted for either Illumina GA or ABI SOLiD sequencing, it is not yet cost effective for small numbers of samples. Here, we describe the improved capture protocol in full, together with a troubleshooting guide (Table ). Given access to either Agilent or Nimblegen hybridization equipment and any of the next-generation sequencers, these instructions should facilitate the preparation of enriched genomic libraries and should be applicable to other genomes.
Two simple optimizations of the hybridization protocol improved capture performance significantly. First, by blocking the adapter sequences flanking each genomic fragment, we reduced non-specific pull-down through adapter-adapter hybridization. Blocking nonspecific DNA is a long-established way to reduce background in microarray experiments, with human cot-1 being the most commonly used reagent for blocking repetitive sequences [20]. Recently, Hodges et al. have shown similar results with the same approach, validating our experimental protocol [21]. Second, we repeated the hybridization step to further enrich the genomic fragment pool. While specificity was enhanced up to 90%, this step introduced ~1% variant loss and some degree of bias in the relative abundance of specific amplicons. For example, the fold difference observed for the EGFR gene was weakened by 2.5-fold when the double hybridization capture protocol was applied, suggesting that saturation of the hybridization step effectively normalizes the yield from each amplicon. The overall correlation coefficient between the single and double hybridization experiments, after excluding the ~100 outlier exons, was 0.82. This somewhat interferes with the ability to reliably call the copy number state of individual exons from the pull-down sequence data, so two-round hybridization should be used with caution when copy number detection is critical.
The arrays designed for our current experiments and those in previous reports were all masked for repeats. To test whether including repeat regions would affect capture, we tiled a probe every 15 bp across a 4 Mb region of a single chromosome on an Agilent 244 K custom array without rigorously masking repeats. Specificity was significantly reduced, to 15-30%, even with the addition of the primer blockers and increased human cot-1 DNA in the hybridization mix (data not shown). This effect should be taken into account when targeting repeat regions is unavoidable.
Throughout the experiments, sequence reads mapped tightly around the intended probe regions. For each probe, the local sequence coverage extends outward to an extent that depends on the length of the fragments in the genomic library initially created. In the absence of major variations in the genomic fragments that could disrupt hybridization to the probe, sequence coverage peaks within the probe region and decreases with increasing distance from it.
The RefSeq database contains ~18,000 genes comprising ~33 Mb of coding sequence. Tiling a probe every 30 bp requires ~914 K probes, which can be accommodated on a set of four Agilent 244 K custom arrays or one Agilent 1 M custom array. Figure shows the proportion of 8 million targeted bases covered at various minimum coverage levels for different mean coverages within the targeted regions. For example, 76% of the targeted bases were considered completely sequenced, with a sequence depth of 20× or more, when the mean coverage within the targeted regions was 55×. From these data, we can project how many sequence reads are required to comprehensively sequence all RefSeq exons. In this report, we used 36 bp single-end reads generated by the Illumina GA I. Currently, the Illumina GA IIx can generate longer, 76 bp paired-end reads of sufficient quality for resequencing. This improvement not only increases the total sequence read per flowcell channel but also significantly facilitates alignment to the genome. On average, one channel of an Illumina GA IIx run generates 2.5 Gb of sequence. About half of this maps uniquely to the human genome, so assuming 60-85% capture specificity, each channel yields 0.75-1.06 Gb of sequence within the targeted region. Targeting the 33 Mb of all RefSeq coding exons will therefore require 2 channels (a quarter of a machine run) on the Illumina Genome Analyzer (GA) IIx to achieve 20× or more coverage on ~80% of the targeted sequences for one sample; equivalently, four samples can be sequenced per machine run. Alternatively, each run of the ABI SOLiD 3 Plus instrument can generate up to 1 billion 50-base paired-end reads and a total of 40 Gb of mapped genomic sequence, such that 12 exomes can be resequenced at comparable coverage with each machine run (S. Nelson, unpublished results).
Thus, whole transcriptome resequencing is economically feasible with the current generation of capture tools and sequencing devices and, in principle, can be performed for under $2000 per genome.
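The per-channel yield projection above can be retraced with a few lines of arithmetic. The following is a minimal sketch rather than part of the published protocol: the function names are hypothetical, 1 Gb is taken as 1000 Mb, and the mean coverage is assumed to be simply the on-target sequence divided by the target size.

```python
# Illustrative arithmetic only; function names are hypothetical and
# 1 Gb is assumed to equal 1000 Mb.

def on_target_gb(raw_gb_per_channel, unique_fraction, specificity):
    """On-target sequence (Gb) produced by one flowcell channel."""
    return raw_gb_per_channel * unique_fraction * specificity


def mean_coverage(on_target_gb_total, target_mb):
    """Mean depth over the target: on-target bases / target size."""
    return on_target_gb_total * 1000.0 / target_mb


RAW_GB_PER_CHANNEL = 2.5   # Illumina GA IIx, per channel (from the text)
UNIQUE_FRACTION = 0.5      # about half of reads map uniquely (from the text)
TARGET_MB = 33.0           # RefSeq coding exons (from the text)
CHANNELS = 2               # a quarter of an 8-channel run

for specificity in (0.60, 0.85):
    per_channel = on_target_gb(RAW_GB_PER_CHANNEL, UNIQUE_FRACTION, specificity)
    coverage = mean_coverage(per_channel * CHANNELS, TARGET_MB)
    print(f"specificity {specificity:.0%}: "
          f"{per_channel:.2f} Gb on target per channel, "
          f"~{coverage:.0f}x mean coverage with {CHANNELS} channels")
```

Under these assumptions, two channels give a mean coverage of roughly 45× to 64×, which brackets the 55× mean coverage at which 76% of targeted bases reached a depth of 20× or more.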
Figure 4. Percentage of targeted bases sequenced at various minimum coverage for different mean coverages. The x-axis represents the per-base coverage level, and the corresponding y-axis represents the percentage of targeted bases that were covered at greater or equal ...
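The quantity plotted in this figure, the fraction of targeted bases covered at or above a minimum depth, could be computed from per-base depths along the following lines. This is a hedged sketch: the function names and the toy depth values are illustrative assumptions and are not data from this study.

```python
# Hypothetical helper functions; the depth values below are made up
# purely to illustrate the calculation.

def fraction_covered(depths, min_depth):
    """Fraction of targeted bases with depth >= min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def coverage_curve(depths, thresholds):
    """Map each minimum-depth threshold to the fraction of bases reaching it."""
    return {t: fraction_covered(depths, t) for t in thresholds}


# Toy per-base depths for ten targeted positions (not real data).
toy_depths = [0, 5, 12, 20, 25, 40, 55, 70, 90, 130]

curve = coverage_curve(toy_depths, [1, 10, 20, 50])
for threshold, fraction in curve.items():
    print(f">= {threshold}x: {fraction:.0%} of targeted bases")
```

In practice the depths would come from a per-base pileup restricted to the targeted regions, and one such curve would be drawn for each mean coverage level.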