Protein coding regions constitute ~1% of the human genome or ~30 Mb, split across ~180,000 exons. A brute-force approach to exome sequencing with conventional technology6
is expensive relative to what may be possible with second-generation platforms3
. However, the efficient isolation of this fragmentary genomic subset is technically challenging7
. Hodges et al.
(2007) described enrichment of an exome by hybridization of shotgun libraries constructed from 140 micrograms of genomic DNA to seven microarrays8
. To improve the practicality of hybridization capture, we developed a protocol to enrich for coding sequences at a genome-wide scale starting with 10 micrograms of DNA and using two microarrays. Our initial target was 27.9 Mb of coding sequence defined by CCDS (the NCBI Consensus CDS database)9
. This curated set avoids the inclusion of spurious hypothetical genes that contaminate broader exome definitions10
. The target is reduced to 26.6 Mb upon exclusion of regions that are poorly mapped to with our anticipated read length due to paralogous sequences elsewhere in the genome (Supplementary Data 1
We captured and sequenced the exomes of eight individuals previously characterized by the HapMap4
and Human Genome Structural Variation11
projects. We also analyzed four unrelated individuals affected with Freeman-Sheldon syndrome (FSS; OMIM #193700) or distal arthrogryposis type 2A, a rare autosomal dominant disorder caused by mutations in MYH35
. Unpaired, 76 base-pair (bp) reads12
from post-enrichment shotgun libraries were aligned to the reference genome13
. On average, 6.4 gigabases (Gb) of mappable sequence was generated per individual (20-fold less than whole genome sequencing with the same platform12
), and 49% of reads mapped to targets (Supplementary Table 1
). After removing duplicate reads that represent potential PCR artifacts14
, the average fold-coverage of each exome was 51× (Supplementary Fig. 1
). On average per exome, 99.7% of targeted bases were covered at least once, and 96.3% (25.6 Mb) were covered sufficiently for variant calling (>=8× coverage and phred
consensus quality>=30). This corresponded to 78% of genes having >95% of their coding bases called (Supplementary Fig. 2
, Supplementary Data 2
). The average pairwise correlation coefficient between individuals for gene-by-gene coverage was 0.87, consistent with systematic bias in coverage between individual exomes.
False positives and false negatives are critical issues in genomic resequencing. We assessed the quality of our exome data in four ways. First, comparing sequence-based calls for the eight HapMap exomes to array-based genotyping, we observed a high concordance with both homozygous (99.94%; n = 219,077) and heterozygous (99.57%; n = 43,070) genotypes (). Second, we compared our coding single-nucleotide polymorphism (cSNP) catalogue to ~1 megabase of coding sequence determined in each of the eight HapMap individuals by molecular inversion probe (MIP) capture and direct resequencing16
. At coordinates called in both datasets, 99.9% of all cSNPs (n = 4,620) and 100% of novel cSNPs (n = 334) identified here were concordant, consistent with a low false discovery rate (FDR). Third, we compared the NA18507 cSNPs identified here to those called by recent whole genome sequencing of this individual12
, and found substantial overlap (Supplementary Fig. 3
). The relative numbers of cSNPs called by only one approach, and the proportions of these represented in dbSNP, indicate that exome sequencing has equivalent sensitivity for cSNP detection as compared to whole genome sequencing. Fourth, we compared our data to cSNPs in high quality Sanger sequence of single haplotype regions from fosmid clones of the same HapMap individuals17
. 38 of 40 fosmid-defined cSNPs were at coordinates with sufficient coverage in our data for variant calling. Of these, 38 of 38 were correctly identified as variant.
Sequence coverage and array-based validation
A comparison of our data to past reports on exonic18
array-based capture revealed roughly equivalent capture specificity, but greater completeness in terms of coverage and variant calling (Supplementary Table 2
). These improvements likely arise from a combination of greater sequencing depth, differences in array designs and in experimental conditions for capture. Within the set of called positions, the high concordance with heterozygous array-based genotypes (>99%) provides an estimate of our sensitivity for rare variant detection, as rare variants are overwhelmingly expected to be heterozygous. However, sensitivity was limited in that ~4% of known heterozygous genotypes were at coordinates where there was insufficient coverage to make a confident call.
56,240 cSNPs were called in one or more individuals, of which 13,347 were novel. On average, 17,272 cSNPs were called per individual, of which 92% were already annotated in a public database (dbSNP v129) (). The proportion of previously annotated cSNPs was consistent by population, and higher for European (94%; n = 6) and Asian (93%; n = 2) than Yoruba (88%; n = 4) ancestry. These confirmation rates are ~10% higher than recent whole genome analyses12,19–22
. The most likely explanation is that coding sequences have historically been more heavily ascertained than non-coding sequences, although other factors such as dbSNP version, prior ascertainment of HapMap individuals, and different FDRs may contribute as well. For the subset of cSNPs at coordinates with sufficient coverage for variant calling in all 12 individuals (n = 47,079), 32% of annotated variants and 86% of novel variants were singleton observations across 24 chromosomes ().
Coding variation across 12 human exomes
Minor allele frequency and coding indel length distributions
We also estimated the total number of cSNPs in each individual relative to the reference genome (). As the precise and comprehensive definition of the human exome remains incomplete, we extrapolated our data to an estimated exome size of exactly 30 Mb. The results were remarkably consistent by population. As expected, a higher number of nonsynonymous cSNPs were estimated for Yoruba (avg. 10,254; n = 4) than non-Africans (avg. 8,489; n = 8). More heterozygous cSNPs were estimated for four Yoruba (avg. 14,995) than six European Americans (avg. 11,586) and two Asians (avg. 10,631). The ratio of synonymous to nonsynonymous cSNPs was 1.2 within any single individual, and 1.1 when calculated for a non-redundant list of variants identified across all individuals. The difference results from the slightly shifted allele frequency distribution of nonsynonymous variants (). Consistent with expectation23
, the trend is more pronounced for nonsynonymous variants predicted to be damaging (by PolyPhen24
Nonsense mutations (NMs) and splice-site disruptions (SSDs) are often assumed to be deleterious, but have a broad range of potential fitness effects25–27
. Our non-redundant cSNP catalogue included 225 NMs (112 novel) and 102 SSDs (49 novel). Excluding 86 nonsense alleles that are common in this dataset (2+ observations) or in a recent study by Yngvadottir et al.25
(>5% allele frequency), our genome-wide estimate (projected to 30 Mb) for the average number of relatively rare mutations introducing premature nonsense codons in an individual genome was 10 for non-Africans (n = 8) and 20 for Yoruba (n = 4). However, these are likely overestimates, given that our catalogue of common nonsense mutations remains incomplete.
Short insertion-deletions (indels) in coding sequence are likely to be functionally important when they cause frameshifts but are difficult to detect with short reads. We developed and applied an approach for identifying indels from our unpaired 76 bp reads. In total, 664 coding indels were called in 1+ individuals. On average, 166 coding indels were called per individual, of which 63% were previously annotated in dbSNP (Supplementary Table 3
). To assess our sensitivity, we compared our data for NA18507 to Bentley et al.12
. 73% of their coding indels were also observed in our data (136 of 187). To assess specificity, we attempted PCR and Sanger sequencing of 28 novel coding indels chosen at random. Of 21 successful assays, 20 coding indels were verified, and 1 was a false positive. We anticipate that future use of paired-end reads will improve detection of coding indels.
The shape of the distribution of coding indel lengths was consistent with other studies10,20
as well as across the 12 exomes (), demonstrating a preference for multiples of 3 (“3n”). Of the 664 coding indels observed here, 65% were 3n in length. The allele frequency distribution for novel indels relative to annotated indels was markedly shifted towards rarer variants (Supplementary Fig. 4
). However, the length histogram for novel versus annotated coding indels were similar (Supplementary Fig. 5
), reinforcing that our set of novel coding indels is not excessively contaminated with false positives (as these would not be expected to have the observed 3n bias). Excluding indels that were common in this dataset (2+ observations), the average number of relatively rare frameshifting indels identified per individual was 8 for non-Africans (n = 8) and 17 for Yoruba (n = 4).
The number of synonymous, missense, nonsense, splice site, frameshifting indel, and non-frameshifting indel variants observed in each individual (as well as the size of the subsets that are novel and singleton observations) are presented in Supplementary Table 4
. Also shown are the average numbers of variants of each class for non-Africans and Yoruba.
Phenotypes inherited in an apparently Mendelian pattern often lack sufficiently sized pedigrees to pinpoint the causal locus. We evaluated whether exome sequencing could be applied to directly identify the causative gene underlying a monogenic human disease (FSS), i.e. with neither linkage data nor candidate gene analysis. Even in this simple scenario for “whole exome/genome genetics”, the key challenge that arises immediately is that the large number of apparently private mutations present by chance in any single human genome makes it difficult to identify which variant is causal, even when only considering nonsynonymous variants. Jones et al.
recently overcame this in the context of hereditary pancreatic cancer by restricting focus to only nonsense mutations and also resequencing tumor DNA from the same individual, but this approach greatly limits sensitivity and is only relevant to a subset of mechanisms within one disease class28
To quantify this background of non-causal variants in our exome data, we first asked how many genes had one or more nonsynonymous cSNPs, splice site disruptions, or coding indels in one or several FSS exomes (, row 1). Simply requiring that a gene contain variants in multiple affected individuals was clearly insufficient, as over 2,000 candidate genes remained even after intersecting four FSS exomes. We then applied filters to remove presumably common variants, as these are unlikely to be causative. Removing dbSNP catalogued variants from consideration reduced the number of candidates considerably (, row 2). Remarkably, the 8 HapMap exomes provided a filter nearly equivalent to dbSNP (, row 3). Combining the two catalogues had a synergistic effect (, row 4), such that the candidate list could be narrowed to a single gene (MYH3
, previously identified by a candidate gene approach as causative for FSS5
). Specifically, MYH3
is the only gene where: (a) at least one (but not necessarily the same) nonsynonymous cSNP, splice-site disruption, or coding indel is observed in all four individuals with FSS; (b) the mutations are not in dbSNP, nor in the eight HapMap exomes. Taking the predicted deleteriousness of individual mutations into account served as an effective filter as well (, row 5), but was not required to identify MYH3
. Ranges of candidate list sizes when other permutations of individuals are used are shown in Supplementary Fig. 6
Direct identification of the causal gene for a monogenic disorder by exome sequencing
was well-covered in our data. To assess our sensitivity more globally, we calculated the probability that a mutation would have been identified in all four FSS-affected individuals for each gene, based on our overall coverage of that gene in each individual (Supplementary Data 2
). The average probability across all genes was 86%. This is likely still an overestimate of sensitivity, as functional non-coding or structural mutations would be missed. It also remains challenging to detect mutations in segmentally duplicated regions of the genome with short read sequencing.
Nevertheless, our analysis suggests that direct sequencing of exomes of small numbers of unrelated individuals (but more than one) with a shared monogenic disorder can serve as a genome-wide scan for the causative gene. The availability of the 8 HapMap exomes was clearly helpful, suggesting that the power of this approach will improve as the 1000 Genomes Project29
generates a catalogue of common variation that is more complete and evenly ascertained than dbSNP. Also, FSS is inherited in an autosomal dominant pattern so the presence of only one mutant allele is sufficient to cause disease. Applying this strategy to a recessive disease would likely be easier, because there are far fewer genes in each exome that are homozygous or compound heterozygous for rare nonsynonymous variants. We also note that modeling of even a modest degree of genetic heterogeneity or data incompleteness is observed to significantly impact performance (, column offset to right). Moving along the spectrum from rare monogenic disorders to complex common diseases, it is likely that the increasing extent of genetic heterogeneity will need to be matched by increasingly large sample sizes30
, and/or more sophisticated weighting of predicted mutational impact.
A clear limitation of exome sequencing is that it does not identify the structural and non-coding variants found by whole genome sequencing. At the same time, it allows a given amount of sequencing to be extended across at least 20 times as many samples as compared to whole genome sequencing. In studies focused on identifying rare variants or somatic mutations with medical relevance, sample size and the interpretability of functional impact may be critical to achieving meaningful success. It is the context of such studies that exome sequencing may be most valuable.
In summary, we demonstrate that targeted capture and massively parallel sequencing represents a cost-effective, reproducible, and robust strategy for the sensitive and specific identification of variants causing protein-coding changes in individual human genomes. The 307 megabases determined here across 12 individuals is the largest dataset reported to date of human coding sequence ascertained by second-generation sequencing methods. Finally, our successful demonstration that the causative gene for a monogenic disorder can be identified directly by exome sequencing of several unrelated individuals provides increasing context to the possibility that exome or genome sequencing may represent a new approach for identifying gene-disease relationships.