Identifying genetic variants and mutations that underlie human diseases requires development of robust, cost-effective tools for routine resequencing of regions of interest in the human genome. Here, we demonstrate that coupling Applied Biosystems SOLiD™ system-sequencing platform with microarray capture of targeted regions provides an efficient and robust method for high-coverage resequencing and polymorphism discovery in human protein-coding exons.
Recent advances in high-throughput sequencing technologies have resulted in development of a number of accurate and sensitive methods for polymorphism discovery in the whole human genome. Still, even with massively parallel sequencing technologies, there remains a trade-off between the power to detect localized variation in thousands of patients versus the power to detect all variation throughout the genome in a few individuals. One way of addressing this issue is to focus resequencing efforts on smaller genomic regions, selected as a result of prior investigations, such as previous linkage studies.
The enrichment approach described above has been used previously to provide up to sevenfold median-resequencing coverage of 0.05–80 Mb genomic regions in a single sequencing run,1–4 with an estimated average theoretical enrichment of 300- to 400-fold. However, it is not obvious from these studies if the enriched sample maintained an accurate representation of polymorphism profiles after resequencing and could be used as a surrogate for polymorphism detection. Here, we demonstrate that SOLiD™ resequencing of microarray-enriched regions of the human genome provides a sensitive, accurate, and cost-efficient tool for detecting human polymorphisms and investigate if any possible biases in polymorphism detection can result from this procedure.
A custom oligonucleotide array (Agilent Technologies, Santa Clara, CA, USA; 244 K format) was designed with 60 bp probes targeting 4.3 Mb of human protein-coding exons. A portion (30 μg) of a fragment library from human DNA (NA18507) was hybridized to the array, and the remainder was reserved as a control. Hybridized (enriched) fragments were eluted from the array and concentrated by precipitation, and the control and hybridized samples were then interrogated with SOLiD™ system sequencing. The 35-bp sequencing reads were aligned against the target sequence with up to three mismatches; single nucleotide polymorphism (SNP) detection was performed on the alignment; and identified SNPs were verified by comparison with HapMap release 23a for NA18507.
The empirical enrichment was calculated as follows: (percent of sequence reads uniquely matching target regions in the enriched sample/percent of sequence reads uniquely matching target regions in the control sample).
The theoretical enrichment for the array-hybridized sample was calculated as in ref. 1: (percent of sequence reads uniquely matching target regions in the enriched sample/percent of sequence reads uniquely matching the human genome in the enriched sample) × maximum enrichment, where maximum enrichment is defined as the ratio of genome length to target length and is equal to 722 for our 4,258,893 bp enrichment target.
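The two enrichment measures defined above can be written as a short Python sketch. The function names and the input percentages in the usage lines are hypothetical and chosen only for illustration; only the value 722 for the maximum enrichment comes from the text.

```python
def empirical_enrichment(pct_on_target_enriched, pct_on_target_control):
    """Ratio of the percent of uniquely mapped on-target reads in the
    enriched sample to the same percent in the control sample."""
    if pct_on_target_control <= 0:
        raise ValueError("control on-target percentage must be positive")
    return pct_on_target_enriched / pct_on_target_control


def theoretical_enrichment(pct_on_target_enriched, pct_on_genome_enriched,
                           maximum_enrichment):
    """On-target share of the uniquely mapped reads in the enriched sample,
    scaled by the maximum enrichment (genome length / target length)."""
    if pct_on_genome_enriched <= 0:
        raise ValueError("genome-mapped percentage must be positive")
    return (pct_on_target_enriched / pct_on_genome_enriched) * maximum_enrichment


# Illustrative inputs only (not values measured in this study).
print(empirical_enrichment(50.0, 0.25))            # 200.0
print(theoretical_enrichment(50.0, 100.0, 722.0))  # 361.0
```

The sketch makes the difference between the two measures explicit: the empirical value needs a sequenced control sample, whereas the theoretical value is computed from the enriched sample alone and is bounded above by the maximum enrichment.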
A random fragment library is obtained (Figure 1) by ligating generic adapters (highlighted in magenta) to sheared genomic fragments and is divided into two samples. The control is set aside, and the other sample undergoes hybridization to a custom-designed, high-density oligonucleotide array to enrich for the target DNA regions (highlighted in blue). The eluted captured regions are then SOLiD™ system-sequenced in parallel with the control sample.
For Figure 2, the x axis indicates the chromosomal position of the mapped reads, and the y axis represents per-base coverage. (Top) Coverage in the unhybridized control sample; (middle) coverage in the enriched sample; (bottom) placement of hybridization probes in the genome. The per-base sequence coverage for the enriched sample (middle) exceeds 340× (reaching almost 1800×) and stays under 11× in the control sample (top).
Comparison with the chromosomal location of the probe targets (Figure 2, bottom) demonstrates that the vast majority of sequencing reads in the enriched sample map to regions overlapping array hybridization probes. Averaging over the whole 4.3-Mb enrichment target, we obtained 138-fold per-base coverage (median coverage, 59-fold), with 91% of the target sequence covered by at least one tag and 87% of covered bases having coverage within 1 SD of the average. This level of coverage enabled highly accurate and sensitive SNP detection, with 99.5% of identified HapMap SNPs called correctly, establishing the suitability of a combined enrichment/SOLiD™ system-resequencing approach for polymorphism discovery (Table 1).
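Coverage statistics of this kind can be computed from a list of per-base read depths, as in the following minimal sketch. The function name is hypothetical, and the sketch assumes that the mean and standard deviation are taken over all target bases (including uncovered ones); the text does not state which convention was used.

```python
from statistics import mean, median, pstdev


def coverage_summary(depths):
    """Summarize per-base coverage over a target region.

    depths: list of non-negative read depths, one per target base.
    Assumption: mean and SD are computed over all target bases.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    avg = mean(depths)
    sd = pstdev(depths)
    covered = [d for d in depths if d > 0]
    within = [d for d in covered if abs(d - avg) <= sd]
    return {
        "mean": avg,
        "median": median(depths),
        "fraction_covered": len(covered) / len(depths),
        "fraction_covered_within_1sd": (
            len(within) / len(covered) if covered else 0.0
        ),
    }


# Toy depths for illustration only.
print(coverage_summary([0, 2, 4, 6, 8]))
```

A mean well above the median, as reported here (138-fold versus 59-fold), indicates a right-skewed depth distribution in which a minority of bases receive very high coverage.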
We evaluated the accuracy and sensitivity of SNP discovery by comparing our results with the HapMap database and, where necessary, by resequencing with conventional Sanger sequencing. Two approaches were used. First, we tested whether our sequencing data corroborated the presence and identity of known HapMap SNPs (we applied this strategy to SNPs homozygous for the reference allele, as these SNPs are indistinguishable from the reference when a single individual is resequenced). Second, we performed de novo SNP detection, as described in Materials and Methods, aiming to identify all heterozygous SNPs and all SNPs homozygous for the alternative allele in the target regions. Those “newly found” SNPs were then compared with the HapMap reference, and only SNPs present in HapMap were selected to evaluate the accuracy of our SNP discovery.
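The second approach, in which only de novo calls that are also present in the reference set are scored, can be illustrated with a small sketch. The data layout (dictionaries mapping a position to a two-letter genotype string) and the function name are assumptions made for this example and do not describe the software used in the study.

```python
def genotype_concordance(called, reference):
    """Compare de novo genotype calls with a reference genotype set.

    called, reference: dicts mapping position -> genotype string such as
    "AG" (hypothetical layout). Calls absent from the reference are ignored,
    and allele order within a genotype does not matter.
    Returns (number of shared positions, fraction called correctly).
    """
    shared = [pos for pos in called if pos in reference]
    if not shared:
        return 0, 0.0
    correct = sum(
        1 for pos in shared if sorted(called[pos]) == sorted(reference[pos])
    )
    return len(shared), correct / len(shared)


# Toy genotypes for illustration only.
calls = {1: "AG", 2: "CC", 3: "TT", 4: "GA"}
known = {1: "GA", 2: "CT", 4: "AG"}
print(genotype_concordance(calls, known))  # 3 shared positions, 2 of 3 correct
```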
The only HapMap SNP homozygous for the alternative allele and erroneously called heterozygous by the SOLiD™ system appears to overlap a 500-bp homologous duplication (88–98% identity) on chromosomes 7, 10, and 22 (Table 2), which probably contributed to its erroneous identification by HapMap and the SOLiD™ system (Table 3).
Since, at a minimum, de novo SOLiD™ SNP discovery required three sequencing reads to call a SNP, we estimated how many HapMap SNPs had three-plus coverage in our target regions (Table 4). Positions of 83% of the HapMap SNPs homozygous for the alternative allele and 86% of the heterozygous HapMap SNPs were covered by at least three reads. The proportion of “unknown” HapMap SNPs with three-plus coverage was lower at 74%, suggesting that sequences surrounding unknown HapMap SNPs, where conventional SNP calling was not successful, might also present some challenge to our platform. Indeed, the SOLiD™ system found only 36% of the unknown HapMap SNPs with three-plus coverage. Still, of the SNPs found, 89% were identified correctly, thus demonstrating suitability of our SNP detection approach, even in the regions with complex sequences not amenable to SNP discovery by conventional methods.
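The proportion of known SNP positions that meet the three-read minimum can be estimated as sketched below. The function name and the position-to-depth dictionary are hypothetical; positions missing from the dictionary are treated as uncovered.

```python
def fraction_with_min_coverage(snp_positions, depth_by_position, min_reads=3):
    """Fraction of SNP positions covered by at least min_reads reads.

    depth_by_position: dict mapping position -> read depth (hypothetical
    layout); positions not in the dict count as depth 0.
    """
    if not snp_positions:
        return 0.0
    n_covered = sum(
        1 for pos in snp_positions
        if depth_by_position.get(pos, 0) >= min_reads
    )
    return n_covered / len(snp_positions)


# Toy data for illustration only.
depths = {10: 5, 20: 2, 30: 3}
print(fraction_with_min_coverage([10, 20, 30, 40], depths))  # 0.5
```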
The empirically calculated enrichment (the ratio of the percentage of enriched-sample reads matching the target to the percentage of control-sample reads matching the target) was equal to 184. The theoretical enrichment, commonly used to evaluate the efficiency of the enrichment procedure,1 was predicted to be 391 (see Materials and Methods). Thus, the theoretical enrichment was approximately twice the empirical enrichment derived from direct comparison of the enriched and the control samples (Table 5). The empirical enrichment better reflects how much more sequence would be required if whole genome shotgun sequencing were used instead of array enrichment.
The coverage obtained by SOLiD™ system sequencing (Figure 3) allows accurate detection of indels in fragment libraries for heterozygous and homozygous polymorphisms.
In summary, by coupling the Applied Biosystems SOLiD™ system-sequencing platform with microarray capture of targeted regions, we developed an efficient and customizable method for high-coverage resequencing and polymorphism discovery in human protein-coding exons. The theoretical enrichment was calculated to be 391-fold, which is higher than or on par with data reported previously.1 However, although this measure is commonly used, we believe it is misleading, as it does not take into account the sequence complexity of the target. Here, we propose an alternative metric for estimating the efficacy of the enrichment procedure by comparing amplification of the target regions in the enriched sample and in the control. In this study, we report that 184× more sequencing reads mapped to the target after the microarray selection procedure, thus providing an accurate empirical evaluation of the enrichment efficiency of our protocol. Our results demonstrate that a combination of array enrichment with SOLiD™ system sequencing provides an accurate representation of the polymorphism profile, as evidenced by the 99.5% accuracy of SNP discovery for the enriched sample. The errors in SNP discovery result primarily from low per-base coverage, when there are not enough sequencing reads to form a consensus on the genotype of the SNP, or from genomic duplications highly homologous to the probes. Overall, our results demonstrate that SOLiD™ system resequencing of microarray-enriched genomic regions provides a powerful tool for genetic analysis, which may facilitate the search for genes contributing to inherited common diseases and diseases in which somatic mutations play a role, such as atherosclerosis and cancer.
We thank members of the High Throughput Discovery Department at Applied Biosystems and in particular, Stephen McLaughlin and Jonathan Manning for their contributions. AB (Design) and Applied Biosystems are registered trademarks, and SOLiD is a trademark of Applied Biosystems and its affiliates in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners.