|Home | About | Journals | Submit | Contact Us | Français|
Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing.
Exome resequencing is increasingly becoming a standard tool for the discovery of genes underlying rare monogenic disease and the discovery of coding variants associated with common disease.1, 2, 3 Although the cost of whole-genome sequencing has fallen dramatically over the last 2–3 years, it is still too expensive to be a useful approach for the identification of variants associated with different phenotypes in large cohorts. However, the combination of ‘second-generation' sequencing technologies (reviewed in refs 4,5) with robust and efficient methods of sequence capture6, 7, 8, 9 has enabled the widespread targeting of the exome.
Exome resequencing studies are, however, currently being performed using designs based on an incomplete exome, and consequently many medically relevant genes are not being screened in ongoing large-scale disease studies. The two most widely used commercial kits for capturing the exome target exons from genes in the consensus coding sequence (CCDS10) consortium database, in addition to a selection of miRNAs and non-coding RNAs, are NimbleGen Sequence Capture 2.1M Human Exome Array, http://www.nimblegen.com/products/seqcap/ and Agilent SureSelect Human All Exon Kit, http://www.genomics.agilent.com. Although the collaborative effort behind the CCDS database has provided a high-quality set of consistently annotated protein-coding regions, there are still many annotated genes, with solid evidence of transcription, that are not yet part of this set. In addition, only 21% of CCDS genes have an alternative spliced variant annotated. To address this shortcoming, we have designed and experimentally tested a more complete set of target regions for the human exome, based on the GENCODE annotation11 (release 2). The GENCODE collaboration is part of the Encode project and responsible for the annotation and experimental validation of gene loci on the human genome.
To generate the coordinates for the GENCODE exome, we extracted the coordinates for a total of 288654 unique exons from 46275 transcripts of 20921 Ensembl12 protein-coding genes (release 53) and 33621 transcripts of 13772 manually annotated protein-coding genes (HAVANA,13 database version February 2009), together with an additional 1635 miRNA genes (Ensembl/miRBase). If the coordinates of any of these exons overlapped by one or more base pairs, regardless of strand, the overlapping exons were clustered together into expressed cluster regions (ECRs). A 10bp flank was added on both sides of each ECR. Any ECR that now overlapped as a result of this flank by at least 1bp was merged. This resulted in 207108 ECRs, covering ~39.3Mb as the design target (35.2Mb of exonic sequence plus 4.1Mb of flanking sequence). The coordinates of these regions were used for bait design. The baits were created using the Agilent SureSelect design algorithm in three rounds of design using RepeatMasker- and WindowMasker-defined repeats in an attempt to avoid repetitive regions as well as to increase coverage of the target exons. Each successive round of design was more permissive of repeat overlap (0, 20 and 40bp). After sequencing, the underperforming baits were boosted at specific ratios to even out the coverage across all targets. Depending on the placement of baits relative to repeat regions, the boosting was done either by direct replication or by shifting the booster bait either up or downstream by 30bp. It was possible to design baits to 205031 of the ECRs or 99% of the GENCODE exome target regions (Table 1). The total size of these final bait regions is 47.9Mb.
In all, 15μg of DNA diluted in TE was sheared to 100–400bp using a Covaris S2 (Covaris, Woburn, MA, USA). Sheared samples were quantified on a Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA), and 7.5μl of COT 1 DNA at 100ng/μl was added. This library was lyophilised in a vacuum concentrator to a pellet and resuspended in 3.4μl of ultrapure water. Following Agilent's SureSelect protocol,14 10μg of sheared DNA were end repaired and polyA tailed, and Illumina-sequencing adapters were ligated to the resulting fragments using the Illumina (San Diego, CA, USA) Paired-End DNA Sample-Prep protocol, except that the gel-size selection step was replaced with a purification using magnetic bead-based solid phase reversible immobilisation (SPRI) beads (following Agilent's protocol14). The capture library was prepared by mixing 5μl of the oligo capture library, 1.5μl of ultrapure water and 1μl of 1:1 dilution of RNase block. In all, 500ng of each sample library was hybridised to the appropriate bait set in PCR plates on a thermocycler at 65°C for 24h (following the manufacturer's protocol with the modification that no prehybridisation PCR was performed). The capture was performed according to the manufacturer's protocol with streptavidin-coated Dynal beads (Invitrogen, Paisley, UK), and captured samples were washed three times using SureSelect wash buffers with a series of incubation steps. The samples were cleaned up using Mini Elute columns (Qiagen, Hilden, Germany) and eluted in 50μl of PCR-grade water. Eluted samples were amplified using a master-mix containing 2m MgCl2, 0.2m dNTPs, 0.5μ PE.1, 0.5μ PE.2 and 3 units of Platinum Pfx DNA Polymerase (Invitrogen) per sample. Samples were aliquoted into three individual wells of a plate and amplified using the following conditions: 94°C for 5min, followed by 20 cycles of 94°C for 15s, 58°C for 30s, 72°C for 30s and a final extension of 72°C for 5min. PCR products were purified using SPRI beads before sequencing. All data represent results of one capture reaction for each sample. Captured libraries were sequenced on the Illumina Genome Analyzer 2 platform as paired-end 54-bp reads according to the manufacturer's protocol. The presequencing preparation time is about 3 days, where sonication and library creation take ~1 day, and hybridisation and amplification 1 day each.
Reads were aligned to the human genome (NCBI36) using the MAQ software package v0.7.1.15 Base qualities were recalibrated using the Genome Analysis Toolkit v1.0.3540 (http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit) and duplicate fragments marked using Picard v1.17 (http://picard.sourceforge.net/). SNPs were called using SAMtools v0.1.716 and GATK, and the intersection of the resulting call sets in the target regions (39.3Mb) with a sequence read depth of ≥8 × were reported. Coverage comparisons of the different target set locations were done using BEDTools v.220.127.116.11
A comparison of the coverage of the bait/oligonucleotide positions of the available CCDS-based exome sets and the GENCODE exome with the set of GENCODE design targets (Table 1) illustrates the increased coverage of our extended target set. The bait positions of the GENCODE exome cover 99% of the targets, which represents an additional 59600 exons available for capture that are not present in either one of the CCDS-based sets (Supplementary Data 3). The missing 1% consists of regions where reliable bait design was not possible. Comparison of exon and transcript coverage between bait/oligonucleotide locations of the available exome sets and the GENCODE exome, and three current reference gene sets (Figure 1), shows that in all cases, the GENCODE exome covers a greater percentage of the reference gene sets. For example, there is an additional 9% of the exons from the CCDS database and 12% of the exons from RefSeq covered by our expanded target set.
The content present in the GENCODE exome exclusively consists of 38933 cluster regions, which contain 5594 additional genes of the design target. The 4363 distinct Ensembl-53-based genes of this set contain 1881 (43%) genes that have an official HGNC identifier, 711 (16%) that are linked to an OMIM entry and 1410 (32%) that have Gene Ontology annotation (Supplementary Data 4). In all, 41% (1809) of these genes have no external annotation of this kind and as such represent novel genes, which could prove to be an important source of variation. The content of repetitive/low-complexity sequence in the bait sets is comparable. The ratios of bases masked by RepeatMasker, Dust and TRF against the total bases in the sets are Nimblegen CCDS: 0.027, Agilent CCDS: 0.021 and GENCODE exome: 0.027 (Supplementary Table 3). A comparison with a sequence uniqueness mask is given in Supplementary Table 4 and supports these findings. The list of 5594 genes and regions targeted by the GENCODE exome exclusively is available as supplementary data and on our ftp site (http://ftp.sanger.ac.uk/gencode/exome), as well as data for the full GENCODE exome and the initial design target. The 406539 bait locations are supplied as a Distributed Annotation System data source as well (das.sanger.ac.uk/das/Exome), which can be displayed in genome browsers like Ensembl (version 53; http://tinyurl.com/browse-exome).
To evaluate the performance of the GENCODE exome, DNA from three HapMap individuals (NA12878, NA07000 and NA19240) was subjected to sequence capture using both the Agilent SureSelect Human All Exon kit and baits designed to the GENCODE exome. In addition, to evaluate the performance using DNA from clinical samples, DNA from seven individuals recruited from a clinical neurological unit was subjected to sequence capture using baits designed to the GENCODE exome. All samples were sequenced as described in the methods section. On average, 97% of reads could be successfully mapped back for both the GENCODE and the Agilent CCDS set. Full details of the sequence yield and reads mapping back to target are given in Supplementary Table 2 (coverage was reported only using reads with a mapping quality of ≥10). The average fold coverage for the HapMap exomes for the CCDS-based targets was 73-fold from 9.2Gb of sequence and for the GENCODE exome, 82-fold from 11.5Gb of sequence. The average fold coverage for the clinical samples was 58-fold from 7.5Gb of sequence. On average for the HapMap samples, 96% of targeted bases were covered at least once and 90% were covered at greater than or equal to eightfold for the CCDS exome, with similar figures for the GENCODE exome of 92 and 83% (Figure 2a). The clinical samples gave an average for the GENCODE exome of 95% of targeted bases covered at least once and 88% covered at greater than or equal to eightfold (Supplementary Figure 1). The results demonstrate that on average, the GENCODE-only regions perform equally to the CCDS regions.
An average of 22271 SNPs, of which 2.6% were novel, were found for the HapMap GENCODE exomes compared with 18554, of which 1.7% were novel, for the CCDS-based exome (Table 2; it should be noted that for most samples, only one lane of the sequencing machine was used. Thus, the sequencing depth does not allow to identify all possible variants, slightly underestimating the number of identified variants). In this instance, novel is defined as not being present in dbSNP18 (version 130) or 1000 Genomes project (1000 Genomes Project Consortium, http://www.1000genomes.org, released on 26 March 2010). An average of 21866 variants, of which 4.2% were novel, was found in the clinical samples. The clinical samples had been previously genotyped on the Illumina 660K chip that allowed the concordance rate of the variants found in common with exome sequencing using the GENCODE exome to be calculated at 99.8%. Of the 62 sites, which were discrepant between array genotyping and sequencing, 47 were discrepant only in one sample, suggesting that the number of systematic genotyping errors is low. The ratio of STOP codons gained is approximately in proportion to the size of the exome being captured, suggesting that the extra material in the GENCODE exome does not represent or select for a significant excess of pseudogenes (1.2:1 for the CCDS-based exome in comparison with 1.8:1 for the GENCODE exome). The 22002 SNPs found on average in the GENCODE exome-captured samples included a mean per sample of 9006 non-synonymous variants, 9424 synonymous variants and 91 stop-gained variants. Therefore, on average, 268 synonymous variants, 256 non-synonymous variants and 2.6 stop-gained variants were found per megabase of the 35.2-Mb targeted genomic sequence, corresponding to a total of 626.6 variants/Mb. In the CCDS-based exome-captured sample among the 18554 coding SNPs found on average, there was a mean per sample of 7585 non-synonymous variants, 8880 synonymous variants and 45 stop-gained variants, corresponding to 512 variants/Mb.
The GENCODE gene set as the basis for our exome design provides a more complete set of targets, as it is a merger of the slower but thorough manual Havana and the genome-wide automatic Ensembl annotation. Both Havana and Ensembl are part of the CCDS consortium, and all of the agreed CCDS structures at the time of construction have been incorporated into the new target set.
The new design includes relevant genes for disease-associated variant and mutation discovery. Among the genes in the new, expanded set are members of well-characterised gene families, which are associated with important medical conditions. For example, 43 genes are linked to ion channel activity. Mutations in ion channel genes are known to cause a range of channelopathies, including arrhythmias and inherited paroxysmal neurological disorders.19 Also, the recently identified MLL2 gene linked to the Kabuki syndrome20 was not covered in the CCDS exome but is now represented in the expanded GENCODE exome. There are 70 genes linked to kinase activity. Deregulated protein kinase activity is frequently associated with, and members of the protein kinase family are commonly mutated in, cancer and are thus desired to be covered in cancer sequencing studies. Furthermore, protein kinases are considered to be the targets for the development of new anticancer therapies.21
The sequence capture of the clinical samples for two genes that are targeted by the GENCODE exome only, ABCB11 and XPC, (Figures 2b and c) demonstrates that we have been able to design baits for genes that are unique to the GENCODE exome, and capture them efficiently, with a high degree of specificity. Each exon is covered by a read depth of substantially more than eightfold, which is preferred for variant calling. ABCB11 and XPC genes are already associated with disease, and provide examples of candidate disease genes that are missing from the existing exome-sequencing kits. Indeed, use of the GENCODE exome has already allowed the identification of a pathogenic mutation in a gene causing an autosomal recessive Dwarfism syndrome, which would not have been discovered using a standard CCDS-based exome.22
The advent of the GENCODE exome represents a substantial improvement to the currently available designs for exome sequencing, allowing the capture of a more complete target. We estimate that we were able to call variants in 84% of the total GENCODE exome. This fraction of callable exome regions is likely to increase further with improving sequencing technology and sequencing depths. The GENCODE exome design is being used by the International Cancer Genome Consortium (ICGC; http://www.icgc.org) for their exome-sequencing programme as part of their aim to obtain a comprehensive description of different tumour types and by the UK10K project (http://www.uk10k.org).
We thank Hazel Arbery and Matt Humphries for their work on sequence capture, the sequencing teams at the WTSI, especially Mike Quail and John Burton, as well as Barbara Novak, Carlos Pabon-Pena and Angelica Giuffre for probe design, formulation and validation. This work was supported by the Wellcome Trust (grant numbers WT089062, WT062023 and WT077198). FK is supported by the National Institute of Health (grant number 5U54HG004555). MSC is supported by the National Institute for Health Research Cambridge Biomedical Research Centre. PP is supported by the ECOGENE project (EC grant number 205419), by the EU through the Estonian Centre of Excellence in Genomics and by the Estonian Ministry of Education and Research (grant number SF0180026s09). AP and A-EL acknowledge support from the Academy of Finland (200923). Sequencing data have been deposited at the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega/) under accession number EGAS00001000016 and the European Nucleotide Archive (http://www.ebi.ac.uk/ena) under accession ERP000523.
EML is employed by the company selling a product based on the GENCODE exome. The remaining authors declare no conflict of interest.
Supplementary Information accompanies the paper on European Journal of Human Genetics website (http://www.nature.com/ejhg)