The allelic spectrum of variants causing common human diseases has long been a topic of debate [1]. Whereas many monogenic diseases are typically caused by extremely rare (<<1%), heterogeneous, and highly penetrant alleles, the genetic basis of common diseases remains largely unexplained [3]. The results of hundreds of genome-wide association scans have demonstrated that common genetic variation accounts for a non-negligible but modest proportion of inherited risk [4], which has recently led many to suggest that rare variants may contribute substantially to the genetic burden underlying common disease. Data from deep sampling of small numbers of loci have confirmed the population-genetic prediction [6] that rare variants constitute the vast majority of polymorphic sites in human populations. Most are absent from current databases [8], which are dominated by sites discovered in smaller population samples and are consequently biased toward common variants. Analysis of whole exome data from a modest number of samples (n = 35) suggests that natural selection is likely to constrain the vast majority of deleterious alleles (at least those that alter amino acid identity and, therefore, possibly protein function) to low frequencies (<1%) under a wide range of evolutionary models for the distribution of fitness effects consistent with patterns of human exomic variation [9]. However, in order to broadly characterize the contribution of rare variants to human genetic variability and to inform medical sequencing projects seeking to identify disease-causing alleles, one must first be able to systematically sample variants below an alternative allele frequency (AF) of 1%.
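To make the sampling requirement concrete, the following is a minimal sketch of how sample size limits the discovery of rare alleles. It assumes that the 2n chromosomes of n diploid individuals are drawn independently from a population in which the allele has frequency AF, and that any allele present in the sample is detected without error; the function name and the sample sizes in the loop are illustrative choices, not part of the study design.

```python
def detection_probability(allele_frequency: float, n_individuals: int) -> float:
    """Probability that an allele at the given population frequency appears
    at least once among the 2 * n_individuals sampled chromosomes.

    Assumes independent sampling of chromosomes and perfect detection.
    """
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele_frequency must lie in [0, 1]")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = 2 * n_individuals
    # P(at least one copy) = 1 - P(no copy on any sampled chromosome)
    return 1.0 - (1.0 - allele_frequency) ** n_chromosomes


if __name__ == "__main__":
    # Illustrative sample sizes and allele frequencies (assumptions).
    for n in (35, 100, 800):
        for af in (0.01, 0.001):
            p = detection_probability(af, n)
            print(f"n = {n:4d}, AF = {af:.3f}: P(observed at least once) = {p:.3f}")
```

Under these simplifying assumptions, the probability of observing an allele rises steeply with sample size, which is why variants below 1% AF are poorly represented in catalogs built from small samples.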
Recent technical developments have produced a series of new DNA sequencing platforms that can generate hundreds of gigabases of data per instrument run at a rapidly diminishing cost. Innovations in oligonucleotide synthesis have also enabled a series of laboratory methods for targeted enrichment of specific DNA sequences (Figure S1 in Additional file 1). These capture methods can be applied at low cost and at large scale to analyze the coding regions of genes, where the genomic changes most likely to influence gene function can be recognized. Together, these two technologies present the opportunity to obtain full exome sequence for population samples sufficiently large to capture a substantial collection of rare variants.
The 1000 Genomes Exon Pilot (Exon Pilot) Project set out to use capture sequencing to compile a large catalog of coding sequence variants with four goals in mind: (1) to drive the development of capture technologies; (2) to develop tools for effective downstream analysis of targeted capture sequencing data; (3) to better understand the distribution of coding variation across populations; and (4) to assess the functional qualities of coding variants and their allele frequencies, based on the representation of common (AF > 10%), intermediate (1% < AF < 10%), and low-frequency (AF < 1%) sites. To attain these objectives while simultaneously improving DNA enrichment methods, we targeted approximately 1,000 genes in 800 individuals from seven populations representing Africa (LWK, YRI), Asia (CHB, CHD, JPT), and Europe (CEU, TSI) in roughly equal proportions (Table ).
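The three frequency classes named above can be expressed as a simple binning rule. The sketch below is illustrative only: the function name and the example frequencies are hypothetical, and because the text states strict inequalities without assigning the boundary values, sites at exactly 1% or 10% are placed in the intermediate class here as an assumption.

```python
from collections import Counter


def frequency_class(af: float) -> str:
    """Assign an alternative allele frequency (AF) to a frequency class.

    Thresholds follow the text: common (AF > 10%), intermediate
    (1% < AF < 10%), low frequency (AF < 1%). Assumption: AF values of
    exactly 1% or 10% are counted as intermediate.
    """
    if not 0.0 <= af <= 1.0:
        raise ValueError("af must lie in [0, 1]")
    if af > 0.10:
        return "common"
    if af >= 0.01:
        return "intermediate"
    return "low"


if __name__ == "__main__":
    # Hypothetical allele frequencies, for illustration only.
    example_afs = [0.42, 0.15, 0.07, 0.02, 0.004, 0.0006]
    counts = Counter(frequency_class(af) for af in example_afs)
    for label in ("common", "intermediate", "low"):
        print(f"{label}: {counts[label]} site(s)")
```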
Samples, read coverage, SNP calls, and nucleotide diversity in the Exon Pilot dataset