|Home | About | Journals | Submit | Contact Us | Français|
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
The allelic spectrum of variants causing common human diseases has long been a topic of debate [1,2]. Whereas many monogenic diseases are typically caused by extremely rare (<<1%), heterogeneous, and highly penetrant alleles, the genetic basis of common diseases remains largely unexplained . The results of hundreds of genome-wide association scans have demonstrated that common genetic variation accounts for a non-negligible but modest proportion of inherited risk [4,5], leading many to suggest recently that rare variants may contribute substantially to the genetic burden underlying common disease. Data from deep sampling of small numbers of loci have confirmed the population-genetic prediction [6,7] that rare variants constitute the vast majority of polymorphic sites in human populations. Most are absent from current databases , which are dominated by sites discovered from smaller population samples, and are consequently biased toward common variants. Analysis of whole exome data from a modest number of samples (n = 35) suggests that natural selection is likely to constrain the vast majority of deleterious alleles (at least those that alter amino acid identity and, therefore, possibly protein function) to low frequencies (<1%) under a plethora of evolutionary models for the distribution of fitness effects consistent with patterns of human exomic variation . However, in order to broadly characterize the contribution of rare variants to human genetic variability and to inform medical sequencing projects seeking to identify disease-causing alleles, one must first be able to systematically sample variants below an alternative allele frequency (AF) of 1%.
Recent technical developments have produced a series of new DNA sequencing platforms that can generate hundreds of gigabases of data per instrument run at a rapidly diminishing cost. Innovations in oligonucleotide synthesis have also enabled a series of laboratory methods for targeted enrichment of specific DNA sequences (Figure S1 in Additional file 1). These capture methods can be applied at low cost, and large scale, to analyze the coding regions of genes, where genomic changes that most likely influence gene function can be recognized. Together, these two technologies present the opportunity to obtain full exome sequence for population samples sufficiently large to capture a substantial collection of rare variants.
The 1000 Genomes Exon Pilot (Exon Pilot) Project set out to use capture sequencing to compile a large catalog of coding sequence variants with four goals in mind: (1) to drive the development of capture technologies; (2) to develop tools for effective downstream analysis of targeted capture sequencing data; (3) to better understand the distribution of coding variation across populations; and (4) to assess the functional qualities of coding variants and their allele frequencies, based on the representation of both common (AF > 10%), intermediate (1% < AF < 10%) and low frequency (AF < 1%) sites. To attain these objectives, while simultaneously improving DNA enrichment methods, we targeted approximately 1,000 genes in 800 individuals, from seven populations representing Africa (LWK, YRI), Asia (CHB, CHD, JPT), and Europe (CEU, TSI) in roughly equal proportions (Table (Table11).
Four data collection centers, the Baylor College of Medicine (BCM), the Broad Institute (BI), the Wellcome Trust Sanger Institute, and Washington University applied different combinations of solid-phase or liquid-phase capture, and Illumina or 454 sequencing procedures on subsets of the samples (Materials and methods). In order to aggregate the data for a comparison of analytical methods, a set of consensus exon target regions was derived (Materials and methods; Figure S2 in Additional file 1). After filtering out genes that could not be fully tested because of failed capture or low sequence coverage, and samples that showed evidence of cross-contamination, a final sequence data set was assembled that corresponded to a total of 1.43 Mb of exonic sequence (8,279 exons representing 942 genes) in 697 samples (see section 3, 'Data quality control' and Figure S3 in Additional file 1 for details of our quality control procedures). The project was closely coordinated with two related Pilot programs in the ongoing 1000 Genomes Project, the Trio Sequencing Pilot and the Low Coverage Sequencing Pilot, enabling quality control and performance comparisons.
Two separate and complementary pipelines (Materials and methods; Figure Figure1a),1a), developed at Boston College (BC) and the BI, were used to identify SNPs in the sequence data. The main functional steps in both pipelines were as follows: (1) read mapping to align the sequence reads to the genome reference sequence; (2) alignment post-processing to remove duplicate sequence fragments and recalibrate base quality values; (3) variant calling to identify putative polymorphic sites; and (4) variant filtering to remove likely false positive calls.
In both pipelines, the individual sequence reads were first mapped to the genome (using the entire human reference sequence, as opposed to just the targeted regions), with the MOSAIK  program (at BC), and a combination of the MAQ  and SSAHA2  mapping programs (at BI) (Materials and methods).
Mapped reads were filtered to remove duplicate reads resulting from clonal amplification of the same fragments during library construction and sequencing. If kept, such duplicate reads would interfere with variant detection. We also applied a base quality re-calibration procedure that resulted in a much better correspondence of the base quality values to actual base error rates (Figure S4 in Additional file 1), a property that is essential for accurate variant detection.
There was substantial heterogeneity in the depth of coverage of different regions that were targeted for capture (Figure (Figure2a),2a), reflecting different affinities for individual probes. Although the coverage variance was generally reproducible from experiment to experiment, additional variance could be attributed to individual samples, capture reagents, or sequencing platforms (Table (Table1).1). Despite this variance, >87% of the target sites in all samples have at least 5× read coverage, >80% at least 10×, and >62% at least 20× (Figure (Figure2b2b).
The two pipelines differed in the variant calling procedures. Two different Bayesian algorithms (Unified Genotyper  at BI, GigaBayes at BC: see Materials and methods) were used to identify SNPs based on read alignments produced by the two different read mapping procedures. Another important difference between the BI and BC call sets was that the BI calls were made separately within each of the seven study populations, and the called sites merged post hoc, whereas the BC calls were made simultaneously in all 697 samples.
Both raw SNP call sets were filtered using variant quality (representing the probability that the called variant is a true polymorphism as opposed to a false positive call). The BC set was only filtered on this variant quality and required a high-quality variant genotype call from at least one sample. The BI calls were additionally filtered to remove spurious calls that most likely stem from mapping artifacts (for example, calls that lie in the proximity of a homopolymer run, in low sequence coverage, or where the balance of reads for the alternative versus the reference allele was far from the expected proportions; see Materials and methods for more details). Results from the two pipelines, for each of the seven population-specific sample sets, are summarized in Table Table2.2. The overlap between the two data sets (that is, sites called by both algorithms) represented highly confident calls, as characterized by a high ratio of transitions to transversions, and was designated as the Exon Pilot SNP release (Table (Table1).1). This set comprised 12,758 distinct genomic locations containing variants in one or more samples in the exon target regions, with 70% of these (8,885) representing previously unknown (that is, novel) sites. All data corresponding to the release, including sequence alignments and variant calls, are available through the 1000 Genomes Project ftp site .
A series of validation experiments (see Materials and methods; Table S1 in Additional file 1), based on random subsets of the calls, demonstrated that the sequence-based identification of SNPs in the Exon Pilot SNP release was highly accurate. More than 91% of the experimental assays were successful (that is, provided conclusive positive or negative confirmation of the variant) and therefore could be used to assess validation rates. The overall variant validation rate (see Table S2 in Additional file 1 for raw outcomes; see Table S3 in Additional file 1 and Table Table33 for rates) was estimated at 96.6% (98.8% for alternative allele count (AC) 2 to 5, and 93.8% for singletons (AC = 1) in the full set of 697 samples). The validation experiments also allowed us to estimate the accuracy of genotype calling in the samples, at sites called by both algorithms, as >99.8% (see Table S4 in Additional file 1 for raw outcomes; see Table S5 in Additional file 1 for rates). Reference allele homozygotes were the most accurate (99.9%), followed by heterozygote calls (97.0%), and then alternative allele homozygotes (92.3%) (Table S5 in Additional file 1). Although the main focus of our validation experiments was to estimate the accuracy of the Exon Pilot SNP release calls, a small number of sites only called by the BC or the BI pipeline were also assayed (Table S2 in Additional file 1). Although there were not enough sites to thoroughly understand all the error modes, these experiments suggest that the homopolymer and allele balance filters described above are effective in identifying false positive sites from the unfiltered call set.
We performed in silico analyses (see Materials and methods) to estimate the sensitivity of our calls. In particular, a comparison with variants from the CEU samples that overlap those in HapMap3.2 indicated that our average variant detection sensitivity was 96.8%. A similar comparison with shared samples in the 1000 Genomes Trio Pilot data also showed a sensitivity >95% (see section 7, 'SNP quality metrics - sensitivity of SNP calls', in Additional file 1). When the sensitivity was examined as a function of alternative allele count within the CEU sample (Figure (Figure3),3), most missed sites were singletons and doubletons. The sensitivity of the intersection call set was 31% for singletons and 60% for doubletons. For AC > 2, sensitivity was better than 95%. The strict requirement that variants had to be called by both pipelines weighted accuracy over sensitivity and was responsible for the majority of the missed sites. Using less strict criteria, there was evidence for 73% of singletons and 89% of doubletons in either the BC or the BI unfiltered dataset.
We investigated other, data-related determinants of singleton detection sensitivity, beyond the impact of the Project's decision to form the official Exon Pilot variant list as the intersection of the two independently derived call sets (see section 7.1, 'Sensitivity of singleton detection', in Additional file 1, and Figure S7 in Additional file 1). Singleton detection sensitivity improves significantly from low (1× to 9×) to medium (10× to 29×) read coverage (although there is no further improvement beyond 30× coverage). Importantly, approximately 9% (9 of 97) of HapMap3.2 singletons in the 84 samples shared with the Exon Pilot CEU sample panel had zero read coverage in our data. There was no significant difference in sensitivity between the Illumina and 454 reads, at comparable sequence coverage. Based on these observations, the main data-related reason for lower singleton sensitivity is lack of sufficient read coverage in the samples that have the singleton. Finally, our analysis (data not shown) revealed that, even at some of the sites with >100× read coverage in the sample with the putative HapMap3 singleton, there were no reads showing the alternative allele, and therefore it would not be possible to call the sites from the primary data. These cases represent either sites with allele-specific capture (that is, fragments with the alternative allele were not captured) or false positive sites in the HapMap3 study.
The high quality of the data enabled us to accurately estimate values of nucleotide diversity, a commonly used measure of genetic variability within populations, in the coding regions (using pair-wise heterozygosity as our metric (section 8, 'Heterozygosity estimates', in Additional file 1) within each of the seven populations (Table (Table1).1). These estimates were confirmed in the 1000 Genomes Low Coverage Pilot data in the Exon Pilot target regions (Table S9a in Additional file 1). Nucleotide diversity in the coding regions was 47.3 to 48.4% of the genome-averaged value for the corresponding population (Table S9b in Additional file 1). As expected, diversity was substantially higher in African than in European and Asian populations. It was, however, very similar for populations within the same continent (Table S9c in Additional file 1). Missense variation is substantially reduced (for example, compared to four-fold degenerate sites, where a single base substitution does not alter the amino acid) as a result of purifying selection. In turn, diversity at four-fold degenerate sites is comparable to average genomic diversity, consistent with very weak selection, if any. Diversity ratios across site types (for example, missense, four-fold degenerate) and datasets (for example, Exon Pilot, Low Coverage Pilot) are highly consistent between populations.
We compared the allele frequency spectrum (AFS) in the sequenced coding regions among the Exon Pilot populations (Figure (Figure4a).4a). The high sensitivity assures us that the observed AFS are accurate for AC > 2 (or AF > approximately 1%). The AFS were very similar for populations from the same continent, except for the JPT population, where we observed a significantly lower fraction of rare alleles than in the two other Asian populations, consistent with reduced recent population growth in Japanese. Despite the large difference among continents at low AF, they converged at higher AF, reflecting the greater age of common variants, many of which pre-date the expansion of modern humans out of Africa. In all seven populations, there was a notable excess of rare variants compared to predictions for a constant-size, neutrally evolving population. This effect was enhanced at missense sites (Figure (Figure4b),4b), which were more highly represented at low alternative allele frequency than silent variants, as well as intergenic variants from the HapMap Encyclopedia of Coding Elements Project (ENCODE) re-sequencing study. The apparent excess of high frequency derived sites has often been observed in studies of human AFS, and may in part be due to ancestral misidentification .
Recent reports  have also recognized an excess of rare, missense variants at frequencies in the range of 2 to 5%, and suggested that such variants arose recently enough to escape negative selection pressures . The present study is the first to broadly ascertain the fraction of variants down to approximately 1% frequency across nearly 700 samples. Based on the observed AFS (Figure (Figure4c),4c), 73.7% of the variants in our collection are in the sub-1% category, and an overwhelming majority of them novel (Figure (Figure4c,4c, inset). The discovery of so many sites at low allele frequency provided a unique opportunity to compare functional properties of common and rare variants.
We used three approaches to classify the functional spectrum (see Materials and methods): (i) impact on the amino acid sequence (silent, missense, nonsense); (ii) functional prediction based on evolutionary conservation and effect on protein structure by computational methods (SIFT  and PolyPhen-2 ); and (iii) presence in a database of human disease mutations (Human Gene Mutation Database (HGMD)). All three indicators showed a substantial enrichment of functional variants in the low frequency category within our data (Figure (Figure5).5). First, and as noted by other studies [19,20], we saw a highly significant difference (P << 10-16) in the AFS of silent versus missense variants (Figure (Figure5a)5a) with a skew towards rare alleles in the latter, so that approximately 63% of missense variants were <1% in frequency whereas approximately 53% of silent variants fell into this category. The same patterns held for nonsense versus either silent or missense variants (P << 10-16) where approximately 78% of nonsense variants were below AF = 1%. Second, we found that PolyPhen-2/SIFT damaging predictions (Figure (Figure5b)5b) were likewise enriched in the rare part of the spectrum (approximately 72% for damaging versus 63% for possibly damaging, and 61% benign). This observation goes an important step beyond the enrichment of amino acid changing variants because the PolyPhen-2/SIFT programs make specific predictions about whether or not such a variant is damaging to protein function. Error rate variation between different AFS bins was not a significant confounder for these conclusions: error rates were estimated at 6.2%, 3.2% and 3.4% for different AFS bins (Tables S3, S4 and S5 in Additional file 1) and highly significant differences were still found after correcting for this error rate variation (P << 10-16 for missense, and P < 10-5 for nonsense SNPs). Third, 99 coding variants in our dataset were also present in HGMD, and therefore linked with a disease in the literature (although not necessarily causative). We tested these variants with SIFT and PolyPhen-2, and obtained predictions for 89 (Figure (Figure5c).5c). All 14 variants classified as damaging were below 1% frequency in our dataset, and found only in a heterozygous state. This observation strongly suggests that the majority of variants that are directly damaging to protein structure and therefore may result in deleterious phenotypic effects (that is, actual causative variants, as opposed to merely disease-linked markers) are likely to occur at low AF in the population. It is also noteworthy that only a very small fraction (<20% in each category, marked on all three panels of Figure Figure5)5) of the putatively damaging variants in the Exon Pilot dataset were detected with an alternative, low coverage whole genome sampling strategy employed in the Low Coverage Pilot in the 1000 Genome Project , which was designed to find common variants but not powered to systematically detect low frequency sites (also see Figure Figure4b).4b). The higher performance in detecting rare damaging variants in the Exon Pilot compared to the Low Coverage Pilot underlines the utility of targeted exome sequencing for disease studies.
We next examined the patterns of allele sharing (Materials and methods) among the Exon Pilot populations and between continents (Figure (Figure6),6), and observed an expected reduction in the degree of allele sharing at low frequency. Comparison to intergenic variants from the HapMap3 ENCODE re-sequencing project  revealed that allele sharing at high and intermediate frequency was similar, but that at AF <1% it was substantially reduced in the coding regions, relative to intergenic regions (P < 10-6). This suggests that the low level of allele sharing of rare coding variants cannot be explained by allele frequency alone, and that such variants are likely to be younger than would be expected from neutral models, presumably because of negative selection acting at these sites.
In addition to SNPs, the data also supported the identification of multiple, 1- to 30-bp insertions and deletions (INDELs; Materials and methods). The BCM and BI INDEL calling pipelines were applied (Figure (Figure1b),1b), and identified a total of 21 insertions and 75 deletions in the 1.43 Mb target regions (Tables S6 and S7 in Additional file 1). Comparisons with dbSNP and the other pilot projects showed high concordance rates. The overall experimental INDEL validation rate (Table S8 in Additional file 1) was 81.3%. Secondary visual inspection revealed that many of the events that did not validate were cases where multiple INDEL events were incorrectly merged, and the wrong coordinates were submitted for validation. This visual inspection confirmed all such alleles as true positives, substantially raising the effective validation rate. Coding INDEL variants change the amino acid sequence of the gene, and therefore these variants are very likely to impact protein function. Indeed, the majority of the events were non-frameshift variants (Figure S5 in Additional file 1) altering, but not terminating, the protein sequence. In agreement with our observations for SNPs, most INDELs were present at low population allele frequency (Figure S6 in Additional file 1).
In addition to its goal of generating an extensive catalog of human population variations, the 1000 Genomes Project has served as an intensive technology development project in terms of both molecular methodologies and informatics methods for high-throughput data collection and data analysis. Although it is not a main focus of our manuscript, development and refinement of the DNA capture methods for this project have led to the current whole-exome capture reagents available for the community. The Exon Pilot project also led to the construction of informatics pipelines for effective analysis of targeted exon sequencing data, and these pipelines are now routinely used for whole-exome datasets. This study clearly lays out the informatics steps required to analyze such datasets and avoid the many pitfalls due to capture biases, coverage fluctuations, INDELs and alignment issues, population biases, and sequencing errors.
The extensive collection of SNPs in the 8,000 exons, detected with accurate and sensitive algorithms, allowed us to characterize fundamental variation properties in coding regions, and to compare them to overall genomic variation. The most important contribution of this study concerns the functional properties of rare variations, and their population specificity. We see a substantial depletion of putatively functional variants at intermediate and high AF, and a corresponding enrichment at low AF, which is expected as a result of negative selection, and has been noted recently [20,21]. However, our ability to study variants at 1% frequency revealed more direct signals, strongly suggesting that variants conferring direct changes on protein function will be present mostly at low population frequency. We were also able to note a significant reduction in the level of between-population allele sharing of rare coding variants, compared to intergenic variants, an effect that was not visible for variants above 1% in frequency. This effect is likely to reflect a combination of more recent origin and stronger negative selection for rare alleles in coding, compared to intergenic regions. Our complete dataset, including a list of SNP and INDEL variants with well characterized ascertainment properties is providing a useful substrate for more specialized analyses  to interpret functional and population aspects of low frequency coding variation.
NimbleGen 385 K capture chips were designed to target the coding regions of the 1,000 genes. Target enrichment was performed following the Short Library Construction Protocol and NimbleGen Arrays User's Guide. Capture libraries were then sequenced on the 454 FLX/Titanium platform using standard vendor emPCR, enrichment and sequencing methods (GS FLX Titanium Sample Preparation Manual).
Single-stranded RNA 'bait' was produced using the Agilent microarray-based method. Genomic DNA was sheared and ligated to Illumina sequencing adapters. This 'pond' of DNA was hybridized with an excess of bait in solution. The sequencing was done using the Illumina GA-II sequencers to produce either 36-bp fragment reads or 76-bp paired-end reads.
A custom Nimblegen 385-K array was used following the manufacturer's protocols (Roche/Nimblegen, Madison, Wisconsin, USA), with the modification that no pre-hybridization PCR was performed. Captured libraries were sequenced on the Illumina GA platform as paired-end 37-bp reads.
Whole genome shotgun libraries for Illumina sequencing were prepared according to the manufacturer's instructions. The pool of synthetic oligos was amplified by PCR and incorporated biotin-14-dCTP to produce a biotinylated capturing library. Each target library was hybridized with the biotinylated capturing library, isolated using streptavidin magnetic beads, and then amplified by PCR. The captured library fragments were reclaimed by denaturation and sequenced as fragment end reads on the Illumina GAIIx sequencer.
A substantial amount of technological heterogeneity existed among different centers' production pipelines. The Exon Pilot initially selected 1,000 genes as targeted sequences. However, the capture target designs used in the four production centers were significantly different. To account for the heterogeneity introduced by different capture designs, we defined a set of consensus exon target sequences by intersecting the initial designs (the individual .bed files) with the exonic sequences based on the CCDS database to create the consensus exon target sequences (Figure S2 in Additional file 1), which form the basis of all the analyses described in this study. The consensus has approximately 1.43 Mb of exonic sequence, covering 86.1% of the coding regions in the initial 1,000 genes (the consensus target definition file is available through the 1000 Genomes Project technical release ftp directory .
The SNP calls were a result of intersecting SNP calls from the BI using the GATK  and from BC using the MOSAIK  read mapper and the GigaBayes variant detection algorithm  (a new version of the PolyBayes SNP discovery program ). The BC call set was generated by calling all 697 individuals together, and per-population call sets were generated by a straightforward projection algorithm: a variant was called in a population if at least one individual in the population carried a non-reference allele (Figure (Figure1a).1a). The BI calls were made separately within each of the seven populations and a superset call set was generated as the union of all seven individual population call sets (Figure (Figure1a).1a). Variants were only called in the consensus target regions.
MOSAIK hash size was 15 with minimum mismatches of 4, 6, and 12 for 36-, 51-, and 76-/101-mer read lengths. MOSAIK parameters for Roche 454 reads were set to 15 with at least 70% of the read being aligned with a 5% mismatch rate.
MOSAIK Illumina alignments were duplicate-marked using the MarkDuplicates program from the Picard software suite . MOSAIK Roche 454 alignments were duplicate-marked with BCMRemoveDuplicates program (M Bainbridge, personal communication).
MOSAIK Illumina alignments were re-calibrated using GATK  (with the CountCovariates and TableRecalibration commands). Roche 454 reads aligned with MOSAIK were not recalibrated.
GigaBayes was used at BC for SNP calls. Briefly, it calculates genotype likelihoods, excluding reads with a mapping quality of <20 and nucleotides with a base quality <20. It then calculates genotypes using the previously calculated genotype likelihoods and a prior on variant frequency. Summing the probabilities of sample genotypes with at least one non-reference allele generates the posterior probability.
Variant calls were filtered out if they did not meet the criteria of a PHRED scaled quality score of at least 40 with at least one individual with a non-reference genotype with a genotype quality score of at least 10.
The Broad Institute employed a five-step protocol consisting of alignment, PCR duplicate marking, base quality score recalibration, application of the SNP calling algorithm, and filtration of the results.
Reads were aligned by the Sanger Institute using MAQ and SSAHA2 for Illumina and Roche 454 data, respectively. All aligned reads and metadata (sequencing center, sequencing technology, run identifier, lane identifier, library identifier, and so on) were written in BAM format.
We applied the Picard  MarkDuplicates algorithm. This algorithm locates reads from the same sequencing library with precisely the same starting position on the genome. When more than one read is found to have the same start position, all but one are flagged as duplicates in the BAM file and therefore ignored in downstream processing.
To correct for inaccuracies in the base quality scores, we developed and applied a base quality score recalibrator. Comparison of the estimated quality scores to the empirical quality scores allowed us to compute corrected quality scores, which were recorded in the BAM files.
We developed a multi-sample Bayesian SNP calling algorithm, now part of the GATK package . This algorithm considers reads from the provided samples simultaneously, attempting to ascertain the likelihood of a site harboring an alternative allele with a frequency of at least 1/N, where N is the number of samples provided. Once the presence of a variant is established, the likelihood for each sample's genotype is determined by a greedy combinatorial search algorithm (approximately behaving like Expectation-Maximization).
SNP calls were generated per population. The specific parameters used were: minimum base quality, 10; minimum mapping quality, 10; minimum confidence threshold, 50.
The SNP calling stage provided a list of any site in the target region that may plausibly be variant. These sites were then filtered to identify a set of true variants, discarding the ones deemed to be false-positives. To this end, we developed several heuristic filters by comparing the behavior of different covariates for known variants versus novel variants. Putative variants failing the following filters were ignored in downstream analysis: QD (discovery confidence of the variant/depth of coverage) ≥5; HRun (length of adjacent, allele-sharing homopolymer run) >3; AB (allele balance of variant, averaged over all heterozygous samples, polarized for the reference allele) ≥75%; SnpCluster (N or more variants found within M bases of each other) 3, 10.
Next, we intersected the BC and BI SNP call sets within the target consensus regions (Figure (Figure1a).1a). This intersecting operation greatly improved the SNP call accuracy (Table (Table2),2), and the calls within the intersection were used in our official Exon Pilot release in March 2010. Table Table22 presents the SNP calls of the seven population-specific call sets (that is, CEU, TSI, CHB, CHD, JPT, LWK, and YRI) that were generated by BC and BI pipelines independently. Across each of the seven populations, the intersection calls (BC ∩ BI) range from 50 to 79% of the total SNP calls made by BC and BI; more than 50% of the calls were in dbSNP (build 129), and show a high transition/transversion ratio (Ts/Tv) above 3.00. The large fraction of overlapping SNPs, with a high fraction of dbSNP entries and high Ts/Tv ratio, indicated high quality in the intersection call sets. These call sets were thus highly confident due to being generated from two independent pipelines with quite different and complementary algorithms. Several iterations of comparisons and tuning of the pipelines led to convergence of these call sets. In addition, the intersection call sets have yielded high validation rates (Table (Table3;3; Table S2 in Additional file 1).
The BC unique SNP call set (BC\BI) or BI unique SNP call set (BI\BC) accounted for the remaining 30 to 50% of the SNPs. About 20% of BC unique calls and 8% of BI unique calls were present in dbSNP build 129. Both unique call sets had a much lower Ts/Tv of 1.00, indicating relatively lower quality in the unique call sets (Table (Table22).
We designed five series of validation experiments in order to examine the false positive and false negative rate, both globally in the officially released call sets, and in the SNP calls specific to the BC or BI call set, as well as in the rare and singleton SNPs and almost all the SNPs altering codons (Table S1 in Additional file 1). The validation experiments were carried out at the BCM Human Genome Sequencing Center (BCM-HGSC) and BI, using PCR-Sanger sequencing and Sequenom genotyping, respectively.
We randomly chose 105 non-dbSNP sites in the intersection (that is, regardless of the frequency spectrum), and tested them by Sequenom at BI across the entire sample set.
Approximately 135 non-dbSNP sites were chosen regardless of the frequency spectrum from each of CEU, YRI + LWK, and CHB + CHD + JPT populations. They were selected to represent both the BC/BI intersection, BC-specific and BI-specific call sets. The sites were genotyped using Sequenom at BI across the samples in the populations where they were discovered.
We tested 510 sites at low frequency (1 to 5 alleles/occurrences; approximately 300 in the intersection and approximately 200 in the BC-specific/BI-specific sets) using PCR and Sanger sequencing at the BCM-HGSC, in the particular samples where they were discovered. We allocated approximately 50% of the sites to singletons, and approximately 50% to sites with alternative allele count 2 to 5.
We chose 33 sites with alternative allele count 2 to 5 and 35 singletons from the intersection call set, and tested across all samples using Sequenom at BI.
We drew 227 sites at low frequency (singletons and SNPs with an alternative allele count of 2 to 5) from different functional annotation classes (such as missense, silent, promoter regions, and so on), and examined them using PCR-Sanger sequencing at the BCM-HGSC.
The overall validation rate in the official released data set (that is, the intersection) was very high at 96.8% (Table (Table3;3; Tables S3 and S4 in Additional file 1), meeting and exceeding the 1000 Genomes Project goal of >95% validation. The validation rates at the low-frequency categories were also high, greater than 93.0% for singletons and SNPs with alternative allele count 2 to 5 (series 3, 4 and 5 in Table S2 in Additional file 1). The exceedingly high validation percentages indicated that 1) the high coverage targeted resequencing methods were effective in accurately detecting SNPs at both common and rare allele frequencies; and 2) the intersection calls were highly accurate, and the vast majority of correctly called low frequency alleles were indeed at low frequency. Most of the non-validated sites (Table S2 in Additional file 1) were in the unique fractions of the BC and BI call sets.
The genotype call accuracies were calculated by comparing the called genotypes to the genotype measurements in the validation assays for all four series (series 1 to 4; Table S5 in Additional file 1). In total, 33,938 called genotypes were compared, and the vast majority of the genotypes agreed with the validation results: 32,532, 1,320 and 12 for Ref/Ref (Homozygote Reference), Ref/Alt (heterozygote) and Alt/Alt (Homozygote NonReference) classes, respectively. The accuracy rate for all called genotypes was as high as 99.8%, with 99.9% accuracy for Homozygote Reference (HomRef), 97.0% for heterozygote (Het), and 92.3% for Homozygote NonReference (HomNonRef). The overall false discovery rate of variant genotypes was <3% and the missed variant genotype rate was <1% as measured in series 1. The variant genotypes in low-frequency categories in series 3 were confirmed for 133 of 133 (100%) singleton sites, and 395 of 419 (94.3%) SNPs with alternative allele count 2 to 5. The accuracy compared to series 4 validated sites showed the false discovery rate for these categories was approximately 6.0% with a missed variant genotype rate of 0.1%.
Per-base heterozygosity estimates for the Exon Pilot were calculated at missense, two-fold, three-fold, and four-fold degenerate sites, and all base pairs in the autosomal targeted regions. We included only targeted base pairs with ≥10× coverage in at least 100 chromosomes based on the MOSAIK alignments. The same analysis was performed on the Low Coverage Pilot, but excluding base pairs that were masked in the Low Coverage callability files . Base pairs were masked if >20% of Illumina reads had a mapping quality of 0 and/or read depth was greater than twice the average depth at HapMap3 sites. Also, a base pair had to be callable in all three Low Coverage populations in order to be included in our analysis. Per-base estimates of heterozygosity of ENCODE regions in HapMap3 were normalized by the nominal sequence length of 1 Mbp.
Degeneracy was calculated based on the hg18 reference sequence and the Gencode gene model annotations . Note that some base pair positions may have been counted in multiple categories due to differing reading frames in alternative splice variants at a locus, but this number was less than 1% in each category and should have negligible effects on the resulting analyses.
In the Exon Pilot SNP data set, not all variant sites had the same number of genotypes in each of the seven populations studied. In order to make comparisons of spectra from different populations easier, the unfolded AF spectrum (using orthologous bases from the panTro2 assembly as the ancestral alleles) for each population was projected to a common sample size of 100 chromosomes using the software Dadi . The projection is based off the hyper-geometric distribution, without correcting for ancestral misidentifications.
SIFT and PolyPhen-2 were used to predict possible impacts of missense SNPs on the function of human proteins. Both programs utilize sequence and/or structure information in prediction. SIFT uses sequence homology to build a position-specified scoring matrix with Dirichlet priors, whereas PolyPhen-2 uses both phylogenetic and structural features combined with machine learning. In total, 3,708 and 5,990 missense SNPs in the Exon Pilot were evaluated by either SIFT or PolyPhen-2. We evaluated 3,176 missense SNPs by both SIFT and PolyPhen-2, which had a concordance rate in functional prediction of 55%.
The overlaps of the Exon Pilot SNP and INDEL sets with the HGMD Professional 2009.4 version missense/nonsense SNPs, small insertions, small deletions and small INDELs were identified based on their locations in the reference genome sequence (build 36). There were no overlapping insertions, deletions or INDELs; however, 99 overlapping SNPs within the HGMD-DM class were found, and these were used in subsequent analyses. Four led to premature stop codons and the remaining 95 to missense amino acid changes; the consequences of these for protein structure were predicted using SIFT and PolyPhen-2. The predicted consequences were combined into three classes: (1) Benign: 'benign' from PolyPhen-2 + 'tolerated' from SIFT, or one of these plus no prediction from the other program; (2) Possibly damaging: 'possibly damaging' from PolyPhen-2 plus 'damaging (low confidence)' from SIFT, or a conflict between the predictions; (3) Damaging: 'probably damaging' from PolyPhen-2 plus 'damaging' from SIFT, or one of these plus no prediction from the other program. AFs were determined in each population from the number of disease and non-disease allele calls, excluding individuals with missing data. These AFs were averaged across all populations.
Allele sharing was measured as a function of alternative allele frequency using the following steps. Singletons, which cannot be shared, were removed from the catalog of 12,758 Exon Pilot exonic variants. The remaining 7,137 variants were further filtered using stringent coverage requirements (section 9, 'Allele sharing among populations', in Additional file 1) to ensure that coverage fluctuations between populations would not impact sampling. As a measure of sharing, we considered the likelihood that two minor alleles, when sampled at random without replacement among all minor alleles, belonged to the same population, to different populations from the same continent, or to different continents. In a panmictic population, every pair of sampled chromosomes is equally likely to be sampled, and the expected sharing depends only on the number of pairs of chromosomes in each sharing category - a combinatorial property of sample sizes, but independent of allele frequency.
We compared the Exon Pilot data with published data obtained by resequencing ten 100-kb ENCODE regions as part of the International HapMap 3 Consortium study. We extracted 3,618 HapMap SNPs based on a noncoding annotation. Since the HapMap and Exon Pilot data differ in their sample sizes, we calculated the expected amount of sharing for each dataset based on subsampling each population panel to 90% of the minimum population size between the two datasets, namely CEU:134, CHB:162, CHD:54, JPT:152, LWK:108, TSI:98, YRI:170. The probability of sharing was averaged over all sites, weighted by the probability that a site had two minor alleles in the down-sampled set. Confidence intervals were obtained by bootstrap over the different variant sites.
INDELs were called on the Exon Pilot data from both the Illumina and the Roche 454 platforms, and the results were merged to create the final call set (Figure (Figure1b).1b). Only INDELs inside the consensus target regions were included in the official release. The Illumina data were processed with two independent pipelines in a parallel fashion, by BCM-HGSC and BI (Figure (Figure1b;1b; Table S7 in Additional file 1). The Roche 454 INDELs were processed by BCM-HGSC. The results were combined by taking the union of the three call sets (Figure (Figure1b;1b; Table S7 in Additional file 1).
The BCM-HGSC Illumina INDEL calling pipeline used the MOSAIK alignments created at BC as explained in the SNP calling methods.
Duplicate reads were marked in the alignment using the Picard MarkDuplicates tool  as explained in the SNP calling methods.
The base qualities reported by the instrument were recalibrated using GATK as explained in the SNP calling methods.
INDELs were called using Atlas-Indel2 (Challis et al., submitted), which uses logistic regression models trained on validated exon capture data to identify true INDELs and remove false INDELs arising from sequencing or mapping errors.
INDEL calls were further filtered to require at least two variant reads in a sample. We additionally filtered out all singleton INDELs with a length of 1, in order to remove the high number of false positive INDELs in this category.
The BI Illumina INDEL calling pipeline used the MAQ alignments created at Sanger as explained in the SNP calling methods.
Duplicate reads were marked in the alignment using the Picard MarkDuplicates  as explained in the SNP calling methods.
The base qualities reported by the instrument were recalibrated using GATK as explained in the SNP calling methods.
Reads in the alignment were realigned by GATK IndelRealigner around putative INDELs.
INDELs were called using IndelGenotyperV2.
INDEL calls were further filtered based on local mismatch rate, nearby homopolymer runs, strand bias and other similar features.
The Roche 454 INDEL data were aligned using BLAT-CrossMatch at the BCM-HGSC.
Duplicate reads were removed from the alignment using the BCMRemoveDuplicates script.
INDELs were called using the Atlas-Indel program at the BCM-HGSC.
Initial calls were further filtered by removing lower quality reads, singleton INDELs, 2-bp low frequency INDELs, and any INDELs that may have arisen due to flow-space errors.
The intersection of the BCM-HGSC and BI Illumina INDEL call sets was taken as the consensus for the Illumina data. The union of the Illumina consensus set and the Roche 454 call set formed the final call set. When merging call sets any INDELs of the same type (insertion or deletion) within 5 bp of each other were considered equivalent and merged together.
In total, we detected 96 INDELs (21 insertions and 75 deletions) from the 697 individuals (Table S7 in Additional file 1). The call set had a dbSNP (build 129) concordance rate of 26%. On the Illumina platform, 9 insertions and 39 deletions were called by BCM-HGSC and 11 insertions and 37 deletions by BI. A total of 10 insertions and 24 deletions were called on the Roche 454 data. The Roche 454 INDEL set appeared to be enriched with 2-bp INDELs. This is likely due to flowspace errors on the sequencing platform, which may make 1- or 3-bp INDELs appear to be 2 bp long.
When combining call sets from BCM-HGSC and BI, and calculating concordance, INDELs within 5 bp of each other and of the same type (insertion or deletion) were considered equivalent. The INDEL call set for each population was combined by continent for the alternative allele count analysis, and all seven sets were combined into one set for the INDEL size analysis (Figures S5 and S6 in Additional file 1). When INDELs were found to be equivalent, they were combined to remove the duplication. When combined to the continental level, 51 INDELs were found in Africa, 46 in Asia, and 30 in Europe (Figure (Figure1b;1b; Table S7 in Additional file 1).
The Illumina union INDEL calls were assessed by two methods (Table S8a in Additional file 1). First, the 31 INDELs called by both centers were validated via Sequenom assays for the haplotypes resulting from the INDEL event. The assays were designed using the GATK, and dbSNP sites were masked to avoid bias due to nearby SNPs. The 13 INDELs exclusive to the JPT population and exclusive to a single center were validated via Sequenom assays following the same protocol. Second, the remaining INDEL sites unique to either center were validated by targeted resequencing using PCR and the Roche 454 platform.
Sequenom probe design resulted in probes for 31 sites in the overlapping call set, of which five failed quality control checks. The remaining 26 sites all validated as variants, though genotype concordance between sequencing calls and validation was very low (Table S8b in Additional file 1). Of the 13 probes designed to assess the unique coding INDEL calls in the JPT population, 10 passed quality control filters, and 6 validated as true variants.
Unique INDELs from the Illumina BCM and BI call sets underwent PCR-Roche 454 validation at BCM-HGSC (Table S8a in Additional file 1). Some additional low-confidence INDELs that were filtered out of the BCM call set were also included for software tuning purposes. Equivalent INDELs within any of these sets were merged. A total of 114 on-target sample-sites were submitted for validation; 94 INDELs had conclusive results. The BI unique call set had a confirmation rate of 78.6% and the BCM-HGSC call set had a confirmation rate of 80.0% (Table S8c in Additional file 1). In addition to these INDELs, 405 off-target non-coding sample sites underwent validation. Of these, 227 gave conclusive results, BI INDELs had a confirmation rate of 88.6% and BCM-HGSC had a confirmation rate of 59.6%. BCM's low confirmation rate was due to the exon-specific nature of the Atlas-Indel2 pipeline.
AC: allele count; AF: allele frequency; AFS: allele frequency spectrum; BC: Boston College; BCM: Baylor College of Medicine; BCM-HGSC: Baylor College of Medicine Human Genome Sequencing Center; BI: Broad Institute; bp: base pair; CEU: Utah residents with Northern and Western European ancestry from the CEPH collection; CHB: Han Chinese in Beijing, China; CHD: Chinese in Metropolitan Denver, Colorado; ENCODE: The Encyclopedia of Coding Elements Project; HGMD: Human Gene Mutation Database; INDEL: insertion/deletion polymorphism; JPT: Japanese in Tokyo, Japan; LWK: Luhya in Webuye, Kenya; PCR: polymerase chain reaction; SNP: single-nucleotide polymorphism; Ts/Tv: transition/transversion ratio; TSI: Tuscans in Italy; YRI: Yoruba in Ibadan, Nigeria.
The authors declare that they have no competing interests.
GTM is co-chair of the 1000 Genomes Exon Sequencing Pilot Subgroup, designed the analysis, and coordinated the preparation of the manuscript. FY carried out much of the analysis and coordinated the supplement. ARI, KG, CH, WFL, MB, TB, YC, DC, DK, and ChS performed the data processing and variant calling. SG performed the population genetic analyses. CTS, EVB, DNC, CH, AW, JY, YC, and YX performed the functional impact analyses. CDB, AGC, and MD oversaw population genetic analysis. RG initiated and oversaw the project. DA, SG, EM, AP, and MDP oversaw data production and the data producing center's contributions. XZB and LC provided data management support, overseen by PF, KC, BF, DM, and RS, and CaS performed and oversaw data production. GTM, FY, CTS, SG, and RG wrote the manuscript. All authors read and approved the final manuscript.
Supplemental information. Additional methodological details, figures, tables and citations .
1000 Genomes Project members. List of the member of the pilot phase of the 1000 Genomes Project.
This research was supported by the National Institutes of Health grants R01 HG004719 and RC2 HG005552 (GTM). R01 HG003229 (AGC and CDB). CTS, YC and YX were supported by The Wellcome Trust (WT 077009). FY, DC, JY, and RG were supported by the National Human Genome Research Institute, National Institutes of Health, under grants 5U54HG003273 and 1U01HG005211-0109. A list of members of the pilot phase of the1000 Genomes Project is provided in Additional file 2.