The Levant is a region in the Near East with an impressive record of continuous human existence and major cultural developments since the Paleolithic period. Genetic and archeological studies present solid evidence placing the Middle East and the Arabian Peninsula as the first stepping-stone outside Africa. There is, however, little understanding of demographic changes in the Middle East, particularly the Levant, after the first Out-of-Africa expansion and how the Levantine peoples relate genetically to each other and to their neighbors. In this study we analyze more than 500,000 genome-wide SNPs in 1,341 new samples from the Levant and compare them to samples from 48 populations worldwide. Our results show recent genetic stratifications in the Levant are driven by the religious affiliations of the populations within the region. Cultural changes within the last two millennia appear to have facilitated/maintained admixture between culturally similar populations from the Levant, Arabian Peninsula, and Africa. The same cultural changes seem to have resulted in genetic isolation of other groups by limiting admixture with culturally different neighboring populations. Consequently, Levant populations today fall into two main groups: one sharing more genetic characteristics with modern-day Europeans and Central Asians, and the other with closer genetic affinities to other Middle Easterners and Africans. Finally, we identify a putative Levantine ancestral component that diverged from other Middle Easterners ∼23,700–15,500 years ago during the last glacial period, and diverged from Europeans ∼15,900–9,100 years ago between the last glacial warming and the start of the Neolithic.
Population stratification caused by nonrandom mating between groups of the same species is often due to geographical distances leading to physical separation followed by genetic drift of allele frequencies in each group. In humans, population structures are also often driven by geographical barriers or distances; however, humans might also be structured by abstract factors such as culture, a consequence of their reasoning and self-awareness. Religion in particular, is one of the unusual conceptual factors that can drive human population structures. This study explores the Levant, a region flanked by the Middle East and Europe, where individual and population relationships are still strongly influenced by religion. We show that religious affiliation had a strong impact on the genomes of the Levantines. In particular, conversion of the region's populations to Islam appears to have introduced major rearrangements in populations' relations through admixture with culturally similar but geographically remote populations, leading to genetic similarities between remarkably distant populations like Jordanians, Moroccans, and Yemenis. Conversely, other populations, like Christians and Druze, became genetically isolated in the new cultural environment. We reconstructed the genetic structure of the Levantines and found that a pre-Islamic expansion Levant was more genetically similar to Europeans than to Middle Easterners.
The Middle East was a funnel of human expansion out of Africa, a staging area for the Neolithic Agricultural Revolution, and the home to some of the earliest world empires. Post LGM expansions into the region and subsequent population movements created a striking genetic mosaic with distinct sex-based genetic differentiation. While prior studies have examined the mtDNA and Y-chromosome contrast in focal populations in the Middle East, none have undertaken a broad-spectrum survey including North and sub-Saharan Africa, Europe, and Middle Eastern populations. In this study 5,174 mtDNA and 4,658 Y-chromosome samples were investigated using PCA, MDS, mean-linkage clustering, AMOVA, and Fisher exact tests of FST's, RST's, and haplogroup frequencies. Geographic differentiation in affinities of Middle Eastern populations with Africa and Europe showed distinct contrasts between mtDNA and Y-chromosome data. Specifically, Lebanon's mtDNA shows a very strong association to Europe, while Yemen shows very strong affinity with Egypt and North and East Africa. Previous Y-chromosome results showed a Levantine coastal-inland contrast marked by J1 and J2, and a very strong North African component was evident throughout the Middle East. Neither of these patterns were observed in the mtDNA. While J2 has penetrated into Europe, the pattern of Y-chromosome diversity in Lebanon does not show the widespread affinities with Europe indicated by the mtDNA data. Lastly, while each population shows evidence of connections with expansions that now define the Middle East, Africa, and Europe, many of the populations in the Middle East show distinctive mtDNA and Y-haplogroup characteristics that indicate long standing settlement with relatively little impact from and movement into other populations.
The evolutionary history of variation in the human Rh blood group system, determined by variants in the RHD and RHCE genes, has long been an unresolved puzzle in human genetics. Prior to medical treatments and interventions developed in the last century, the D-positive children of D-negative women were at risk for hemolytic disease of the newborn, if the mother produced anti-D antibodies following sensitization to the blood of a previous D-positive child. Given the deleterious fitness consequences of this disease, the appreciable frequencies in European populations of the responsible RHD gene deletion variant (for example, 0.43 in our study) seem surprising. In this study, we used new molecular and genomic data generated from four HapMap population samples to test the idea that positive selection for an as-of-yet unknown fitness benefit of the RHD deletion may have offset the otherwise negative fitness effects of hemolytic disease of the newborn. We found no evidence that positive natural selection affected the frequency of the RHD deletion. Thus, the initial rise to intermediate frequency of the RHD deletion in European populations may simply be explained by genetic drift/ founder effect, or by an older or more complex sweep that we are insufficiently powered to detect. However, our simulations recapitulate previous findings that selection on the RHD deletion is frequency dependent, and weak or absent near 0.5. Therefore, once such a frequency was achieved, it could have been maintained by a relatively small amount of genetic drift. We unexpectedly observed evidence for positive selection on the C allele of RHCE in non-African populations (on chromosomes with intact copies of the RHD gene) in the form of an unusually high FST value and the high frequency of a single haplotype carrying the C allele. RhCE function is not well understood, but the C/c antigenic variant is clinically relevant and can result in hemolytic disease of the newborn, albeit much less commonly and severely than that related to the D-negative blood type. Therefore, the potential fitness benefits of the RHCE C allele are currently unknown but merit further exploration.
Blood group polymorphism; copy number variation; human evolution; balancing selection
Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies.
To see a world in a grain of sand ...
William Blake, Auguries of Innocence
Geneticists have long sought to identify the genetic changes that made us human, but pinpointing the functional-relevant changes has been challenging. Two papers in this issue suggest that partial duplication of SRGAP2, producing an incomplete protein that antagonizes the original, contributed to human brain evolution.
This report is of a round-table discussion held in Cardiff in September 2009 for Cesagen, a research centre within the Genomics Network of the UK’s Economic and Social Research Council. The meeting was arranged to explore ideas as to the likely future course of human genomics. The achievements of genomics research were reviewed, and the likely constraints on the pace of future progress were explored. New knowledge is transforming biology and our understanding of evolution and human disease. The difficulties we face now concern the interpretation rather than the generation of new sequence data. Our understanding of gene-environment interaction is held back by our current primitive tools for measuring environmental factors, and in addition, there may be fundamental constraints on what can be known about these complex interactions.
The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in south and southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in southeast Asia with a later dispersal to south Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from south Asia. To test the two alternative models, this study combines the analysis of uniparentally inherited markers with 610,000 common single nucleotide polymorphism loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17–28 thousand years ago) in southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and “structure-like” analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterized by two ancestral components—one represented in the pattern of Y chromosomal and EDAR results and the other by mitochondrial DNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations.
Austroasiatic; mtDNA; Y chromosome; autosomes; admixture
We analyzed 40 SNP and 19 STR Y-chromosomal markers in a large sample of 1,525 indigenous individuals from 14 populations in the Caucasus and 254 additional individuals representing potential source populations. We also employed a lexicostatistical approach to reconstruct the history of the languages of the North Caucasian family spoken by the Caucasus populations. We found a different major haplogroup to be prevalent in each of four sets of populations that occupy distinct geographic regions and belong to different linguistic branches. The haplogroup frequencies correlated with geography and, even more strongly, with language. Within haplogroups, a number of haplotype clusters were shown to be specific to individual populations and languages. The data suggested a direct origin of Caucasus male lineages from the Near East, followed by high levels of isolation, differentiation and genetic drift in situ. Comparison of genetic and linguistic reconstructions covering the last few millennia showed striking correspondences between the topology and dates of the respective gene and language trees, and with documented historical events. Overall, in the Caucasus region, unmatched levels of gene-language co-evolution occurred within geographically isolated populations, probably due to its mountainous terrain.
Y chromosome; glottochronology; Caucasus; gene geography
TSPY1 is a tandemly-repeated gene on the human Y chromosome forming an array of approximately 21–35 copies. The testicular expression pattern and the inferred function of the TSPY1 protein suggest possible involvement in spermatogenesis. However, data are scarce on TSPY1 copy number variation in different Y lineages and its role in spermatogenesis.
We sought to define: 1) the extent of TSPY1 copy number variation within and among Y chromosome haplogroups; and 2) the role of TSPY1 dosage in spermatogenic efficiency.
Materials and Methods
A total of 154 idiopathic infertile men and 130 normozoospermic controls from Central Italy were analyzed. We used a quantitative PCR assay to measure TSPY1 copy number and also defined Y haplogroups in all subjects.
We provide evidence that TSPY1 copy number shows substantial variation among Y haplogroups and thus that population stratification does represent a potential bias in case-control association studies. We also found: 1) a significant positive correlation between TSPY1 copy number and sperm count (P < 0.001); 2) a significant difference in mean TSPY1 copy number between patients and controls (28.4 ± 8.3 vs. 33.9 ± 10.7; P < 0.001); and 3) a 1.5-fold increased risk of abnormal sperm parameters in men with less than 33 copies (P < 0.001).
TSPY copy number variation significantly influences spermatogenic efficiency. Low TSPY1 copy number is a new risk factor for male infertility with potential clinical consequences.
Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.
Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3× coverage) and two Asian genomes (AK1, with 27.8× coverage and AK2, with 32.0× coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine.
Afghanistan has held a strategic position throughout history. It has been inhabited since the Paleolithic and later became a crossroad for expanding civilizations and empires. Afghanistan's location, history, and diverse ethnic groups present a unique opportunity to explore how nations and ethnic groups emerged, and how major cultural evolutions and technological developments in human history have influenced modern population structures. In this study we have analyzed, for the first time, the four major ethnic groups in present-day Afghanistan: Hazara, Pashtun, Tajik, and Uzbek, using 52 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y-chromosome. A total of 204 Afghan samples were investigated along with more than 8,500 samples from surrounding populations important to Afghanistan's history through migrations and conquests, including Iranians, Greeks, Indians, Middle Easterners, East Europeans, and East Asians. Our results suggest that all current Afghans largely share a heritage derived from a common unstructured ancestral population that could have emerged during the Neolithic revolution and the formation of the first farming communities. Our results also indicate that inter-Afghan differentiation started during the Bronze Age, probably driven by the formation of the first civilizations in the region. Later migrations and invasions into the region have been assimilated differentially among the ethnic groups, increasing inter-population genetic differences, and giving the Afghans a unique genetic diversity in Central Asia.
We analysed 67 short tandem repeat polymorphisms from the non-recombining part of the Y-chromosome (Y-STRs), including 49 rarely-studied simple single-copy (ss)Y-STRs and 18 widely-used Y-STRs, in 590 males from 51 populations belonging to 8 worldwide regions (HGDP-CEPH panel). Although autosomal DNA profiling provided no evidence for close relationship, we found 18 Y-STR haplotypes (defined by 67 Y-STRs) that were shared by two to five men in 13 worldwide populations, revealing high and widespread levels of cryptic male relatedness. Maximal (95.9%) haplotype resolution was achieved with the best 25 out of 67 Y-STRs in the global dataset, and with the best 3-16 markers in regional datasets (89.6-100% resolution). From the 49 rarely-studied ssY-STRs, the 25 most informative markers were sufficient to reach the highest possible male lineage differentiation in the global (92.2% resolution), and 3-15 markers in the regional datasets (85.4-100%). Considerably lower haplotype resolutions were obtained with the three commonly-used Y-STR sets (Minimal Haplotype, PowerPlex Y®, and AmpFlSTR® Yfiler®). Six ssY-STRs (DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643) were most informative to supplement the existing Y-STR kits for increasing haplotype resolution, or – together with additional ssY-STRs - as a new set for maximizing male lineage differentiation. Mutation rates of the 49 ssY-STRs were estimated from 403 meiotic transfers in deep-rooted pedigrees, and ranged from ~4.8×10−4 for 31 ssY-STRs with no mutations observed to 1.3×10−2 and 1.5×10−2 for DYS570 and DYS576, respectively, the latter representing the highest mutation rates reported for human Y-STRs so far. Our findings thus demonstrate that ssY-STRs are useful for maximizing global and regional resolution of male lineages, either as a new set, or when added to commonly-used Y-STR sets, and support their application to forensic, genealogical and anthropological studies.
Y-STR; microsatellites; Y-chromosome; haplotype resolution; lineage differentiation; HGDP-CEPH, mutation rates
A recently-published study has used next-gen sequencing technology to resequence two Y chromosomes separated by 13 generations and discovered four single-base differences in ~10 Mb DNA, suggesting that the Y chromosome euchromatin accumulates around one mutation per generation. Y-SNPs therefore now offer the best resolution of Y haplotypes and promise to distinguish almost every Y chromosome. This work illustrates the promise of current sequencing technology for forensically-relevant applications.
Next-gen sequencing; Y-SNP; Y-STR; Haplotype resolution; forensic applications
We have examined the male-specific phylogeography of the Levant and its surroundings by analyzing Y-chromosomal haplogroup distributions using 5,874 samples (885 new) from 23 countries. The diversity within some of these haplogroups was also examined. The Levantine populations showed clustering in SNP and STR analyses when considered against a broad Middle-East and North African background. However, we also found a coastal-inland, east-west pattern of diversity and frequency distribution in several haplogroups within the small region of the Levant. Since estimates of effective population size are similar in the two regions, this strong pattern is likely to have arisen mainly from differential migrations, with different lineages introduced from the east and west.
Y chromosome; Y-SNP; Y-STR; Levant
Cultural expansions, including of religions, frequently leave genetic traces of differentiation and in-migration. These expansions may be driven by complex doctrinal differentiation, together with major population migrations and gene flow. The aim of this study was to explore the genetic signature of the establishment of religious communities in a region where some of the most influential religions originated, using the Y chromosome as an informative male-lineage marker. A total of 3139 samples were analyzed, including 647 Lebanese and Iranian samples newly genotyped for 28 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y chromosome. Genetic organization was identified by geography and religion across Lebanon in the context of surrounding populations important in the expansions of the major sects of Lebanon, including Italy, Turkey, the Balkans, Syria, and Iran by employing principal component analysis, multidimensional scaling, and AMOVA. Timing of population differentiations was estimated using BATWING, in comparison with dates of historical religious events to determine if these differentiations could be caused by religious conversion, or rather, whether religious conversion was facilitated within already differentiated populations. Our analysis shows that the great religions in Lebanon were adopted within already distinguishable communities. Once religious affiliations were established, subsequent genetic signatures of the older differentiations were reinforced. Post-establishment differentiations are most plausibly explained by migrations of peoples seeking refuge to avoid the turmoil of major historical events.
Y-chromosome; population genetics; human migrations; cultural diffusion; religion; Maronites
Genomic structure in a global collection of domesticated sheep reveals a history of artificial selection for horn loss and traits relating to pigmentation, reproduction, and body size.
Through their domestication and subsequent selection, sheep have been adapted to thrive in a diverse range of environments. To characterise the genetic consequence of both domestication and selection, we genotyped 49,034 SNP in 2,819 animals from a diverse collection of 74 sheep breeds. We find the majority of sheep populations contain high SNP diversity and have retained an effective population size much higher than most cattle or dog breeds, suggesting domestication occurred from a broad genetic base. Extensive haplotype sharing and generally low divergence time between breeds reveal frequent genetic exchange has occurred during the development of modern breeds. A scan of the genome for selection signals revealed 31 regions containing genes for coat pigmentation, skeletal morphology, body size, growth, and reproduction. We demonstrate the strongest selection signal has occurred in response to breeding for the absence of horns. The high density map of genetic variability provides an in-depth view of the genetic history for this important livestock species.
During the process of domestication, mankind recruited animals from the wild into a captive environment, changing their morphology, behaviour, and genetics. In the case of sheep, domestication and subsequent selection by their animal handlers over thousands of years has produced a spectrum of breeds specialised for the production of wool, milk, and meat. We sought to use this population history to search for the genes that directly underpin phenotypic variation. We collected DNA from 2,819 sheep, belonging to 74 breeds sampled from around the world, and assessed the genotype of each animal at nearly 50,000 locations across the genome. Our results show that sheep breeds have maintained high levels of genetic diversity, in contrast to other domestic animals such as dogs. We also show that particular regions of the genome contain strong evidence for accelerated change in response to artificial selection. The most prominent example was identified in response to breeding for the absence of horns, a trait now common across many modern breeds. Furthermore, we demonstrate that other genomic regions under selection in sheep contain genes controlling pigmentation, reproduction, and body size.
Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians.
Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Application of Approximate Bayesian Computation Markov chain Monte Carlo based to these sequence data using a simple forward simulator revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests.
Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.
Autism is a common, severe and highly heritable neurodevelopmental disorder in children, affecting up to 100 children per 10,000. The MET gene has been regarded as a promising candidate gene for this disorder because it is located within a replicated linkage interval, is involved in pathways affecting the development of the cerebral cortex and cerebellum in ways relevant to autism patients, and has shown significant association signals in previous studies.
Here, we present new ASD patient and control samples from Heilongjiang, China and use them in a case-control and family-based replication study of two MET variants. One SNP, rs38845, was successfully replicated in a case-control association study, but failed to replicate in a family-based study, possibly due to small sample size. The other SNP, rs1858830, failed to replicate in both case-control and family-based studies.
This is the first attempt to replicate associations in Chinese autism samples, and our result provides evidence that MET variants may be relevant to autism susceptibility in the Chinese Han population.
We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1111-9) contains supplementary material, which is available to authorized users.
Genetic variants predicted to seriously disrupt the function of human protein-coding genes—so-called loss-of-function (LOF) variants—have traditionally been viewed in the context of severe Mendelian disease. However, recent large-scale sequencing and genotyping projects have revealed a surprisingly large number of these variants in the genomes of apparently healthy individuals—at least 100 per genome, including more than 30 in a homozygous state—suggesting a previously unappreciated level of variation in functional gene content between humans. These variants are mostly found at low frequency, suggesting that they are enriched for mildly deleterious polymorphisms suppressed by negative natural selection, and thus represent an attractive set of candidate variants for complex disease susceptibility. However, they are also enriched for sequencing and annotation artefacts, so overall present serious challenges for clinical sequencing projects seeking to identify severe disease genes amidst the ‘noise’ of technical error and benign genetic polymorphism. Systematic, high-quality catalogues of LOF variants present in the genomes of healthy individuals, built from the output of large-scale sequencing studies such as the 1000 Genomes Project, will help to distinguish between benign and disease-causing LOF variants, and will provide valuable resources for clinical genomics.
Exome sequencing, which allows the global analysis of protein coding sequences in the human genome, has become an effective and affordable approach to detecting causative genetic mutations in diseases. Currently, there are several commercial human exome capture platforms; however, the relative performances of these have not been characterized sufficiently to know which is best for a particular study.
We comprehensively compared three platforms: NimbleGen's Sequence Capture Array and SeqCap EZ, and Agilent's SureSelect. We assessed their performance in a variety of ways, including number of genes covered and capture efficacy. Differences that may impact on the choice of platform were that Agilent SureSelect covered approximately 1,100 more genes, while NimbleGen provided better flanking sequence capture. Although all three platforms achieved similar capture specificity of targeted regions, the NimbleGen platforms showed better uniformity of coverage and greater genotype sensitivity at 30- to 100-fold sequencing depth. All three platforms showed similar power in exome SNP calling, including medically relevant SNPs. Compared with genotyping and whole-genome sequencing data, the three platforms achieved a similar accuracy of genotype assignment and SNP detection. Importantly, all three platforms showed similar levels of reproducibility, GC bias and reference allele bias.
We demonstrate key differences between the three platforms, particularly advantages of solutions over array capture and the importance of a large gene target set.
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.