Understanding the relationship between genotype and phenotype is one of the central goals in biology and medicine. The reference human genome sequence [1] provides a foundation for the study of human genetics, but systematic investigation of human variation requires full knowledge of DNA sequence variation across the entire spectrum of allele frequencies and types of DNA differences. Substantial progress has already been made. By 2008 the public catalogue of variant sites (dbSNP 129) contained approximately 11 million single nucleotide polymorphisms (SNPs) and 3 million short insertions and deletions (indels) [2-4]. Databases of structural variants (SVs) (e.g., dbVAR) indexed the locations of large genomic variants. The International HapMap Project catalogued both allele frequencies and the correlation patterns between nearby variants, a phenomenon known as linkage disequilibrium (LD), across several populations for 3.5 million SNPs [3, 4].
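As an illustrative aside, the correlation between two nearby biallelic variants is commonly summarised by the statistic r^2, computed from phased haplotypes. The sketch below is a minimal example under stated assumptions: alleles are coded 0/1, haplotypes are already phased, and the function name `ld_r_squared` and the toy data are hypothetical rather than taken from the project's analysis.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic sites.

    `haplotypes` is a list of (a, b) pairs, one per phased chromosome,
    where a and b are the alleles (0 or 1) at the first and second site.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")

    # Frequencies of allele 1 at each site and of the 1-1 haplotype.
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")

    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    return d * d / denom


# Toy data (hypothetical): perfectly correlated versus independent sites.
perfect = [(0, 0), (0, 0), (1, 1), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_r_squared(perfect), ld_r_squared(independent))
```

When r^2 is close to 1, genotyping one of the two variants is nearly as informative as genotyping both, which is why a limited set of genotyped sites can stand in for many ungenotyped common variants.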
These resources have driven disease gene discovery in the first generation of genome-wide association studies (GWAS), wherein genotypes at several hundred thousand variant sites, combined with the knowledge of LD structure, allow the vast majority of common variants (here, those with > 5% minor allele frequency, or MAF) to be tested for association [4] with disease. Over the last five years association studies have identified more than a thousand genomic regions associated with disease susceptibility and other common traits [5]. Genome-wide collections of both common and rare SVs have similarly been tested for association with disease [6].
Despite these successes, much work is still needed to achieve a deep understanding of the genetic contribution to human phenotypes [7]. Once a region has been identified as harbouring a risk locus, detailed study of all genetic variants in the locus is required to discover the causal variant(s), to quantify their contribution to disease susceptibility, and to elucidate their roles in functional pathways. Low frequency and rare variants (here defined as 0.5% to 5% MAF, and below 0.5% MAF respectively) vastly outnumber common variants and also contribute significantly to the genetic architecture of disease, but it has not yet been possible to study them systematically [7-9]. Meanwhile, advances in DNA sequencing technology have enabled the sequencing of individual genomes [10-13], illuminating the gaps in the first generation of databases that contain mostly common variant sites. A much more complete catalogue of human DNA variation is a prerequisite to fully understanding the role of common and low frequency variants in human phenotypic variation.
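The frequency classes used in this paper (common: above 5% MAF; low frequency: 0.5% to 5% MAF; rare: below 0.5% MAF) can be written down as a short sketch. The function names, and the choice to compute MAF from alternate allele counts, are illustrative assumptions and not part of the project's pipeline.

```python
def minor_allele_frequency(alt_count, total_chromosomes):
    """MAF at a biallelic site: the frequency of the less common allele."""
    if total_chromosomes <= 0 or not 0 <= alt_count <= total_chromosomes:
        raise ValueError("allele count must lie between 0 and the total")
    p = alt_count / total_chromosomes
    return min(p, 1 - p)


def classify_maf(maf):
    """Assign the frequency class defined in the text.

    common:        MAF above 5%
    low frequency: MAF from 0.5% to 5% (both bounds included)
    rare:          MAF below 0.5%
    """
    if not 0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.005:
        return "low frequency"
    return "rare"


# Hypothetical example: 90 alternate alleles among 120 chromosomes.
maf = minor_allele_frequency(90, 120)
print(maf, classify_maf(maf))
```

Note that a site is classified by its minor allele, so an alternate allele seen on 75% of chromosomes is treated as a common variant with a MAF of 25%.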
The aim of the 1000 Genomes Project is to discover, genotype and provide accurate haplotype information on all forms of human DNA polymorphism in multiple human populations. Specifically, the goal is to characterise over 95% of variants that are in genomic regions accessible to current high-throughput sequencing technologies and that have allele frequency of 1% or higher (the classical definition of polymorphism) in each of five major population groups (populations in or with ancestry from Europe, East Asia, South Asia, West Africa and the Americas). Because functional alleles are often found in coding regions and have reduced allele frequencies, lower frequency alleles (down to 0.1%) will also be catalogued in such regions.
Here we report the results of the pilot phase of the project, the aim of which was to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. To this end we undertook three projects: low coverage sequencing of 179 individuals, deep sequencing of six individuals in two trios, and exon sequencing of 1,000 genes in 697 individuals (Box 1). The results give us a much deeper, more uniform picture of human genetic variation than was previously available, enabling new insights into the landscapes of functional variation, genetic association and natural selection in humans.
Box 1. The 1000 Genomes pilot projects
To develop and assess multiple strategies to detect and genotype variants of various types and frequencies using high-throughput sequencing, we carried out three projects, using samples from the extended HapMap collection [14]:
- Trio project: whole genome shotgun sequencing at high coverage (average 42x) of two families (one Yoruba from Ibadan, Nigeria (YRI), one of European ancestry in Utah (CEU)), each including two parents and one daughter. Each of the offspring was sequenced using three platforms and by multiple centres.
- Low coverage project: whole genome shotgun sequencing at low coverage (2-6x) of 59 unrelated individuals from YRI, 60 unrelated individuals from CEU, 30 unrelated Han Chinese individuals in Beijing (CHB) and 30 unrelated Japanese individuals in Tokyo (JPT).
- Exon project: targeted capture of the exons from nearly 1000 randomly selected genes (total of 1.4 Mb) followed by sequencing at high coverage (average > 50x) in 697 individuals from 7 populations of African (YRI, Luhya in Webuye, Kenya (LWK)), European (CEU, Toscani in Italia (TSI)) and East Asian (CHB, JPT, Chinese in Denver, Colorado (CHD)) ancestry.
The three experimental designs differ substantially both in their ability to obtain data for variants of different types and frequencies and in the analytical methods we used to infer individual genotypes. The figure shows a schematic representation of the projects and the type of information obtained from each. Colours in the left region indicate different haplotypes in individual genomes, and line width indicates depth of coverage (not to scale). The shaded region to the right gives an example of genotype data that could be generated for the same sample under the three strategies (dots indicate missing data, dashes indicate phase information, i.e., whether heterozygous variants can be assigned to the correct haplotype). Within a short region of the genome, each individual carries two haplotypes, typically shared by others in the population. In the trio design, high sequence coverage and the use of multiple platforms enable accurate discovery of multiple variant types across most of the genome, with Mendelian transmission aiding genotype estimation, inference of haplotypes and quality control. The low coverage project, in contrast, efficiently identifies shared variants on common haplotypes [15, 16] (red or blue), but has lower power to detect rare haplotypes (light green) and associated variants (indicated by the missing alleles), and will give some inaccurate genotypes (indicated by the red allele incorrectly assigned G). The exon design enables accurate discovery of common, rare and low frequency variation in the targeted portion of the genome, but lacks the ability to observe variants outside the targeted regions or assign haplotype phase.
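The trade-off between depth and detection power described above can be illustrated with a deliberately simplified model; it is a sketch for intuition and not the project's variant-calling method. The assumptions are ours: read depth at a site is Poisson distributed, each carrier haplotype contributes variant-supporting reads at half the per-sample mean depth, sequencing errors are ignored, and a variant is called once a hypothetical threshold `min_alt_reads` of supporting reads is reached across all samples.

```python
import math


def detection_probability(carrier_haplotypes, mean_depth, min_alt_reads=2):
    """Probability of seeing at least `min_alt_reads` variant-supporting reads.

    carrier_haplotypes: number of sampled chromosomes carrying the variant
    mean_depth:         mean per-sample sequencing depth (e.g. 4 for 4x)
    min_alt_reads:      assumed calling threshold (hypothetical)
    """
    # Each diploid sample spreads its depth over two haplotypes, so every
    # carrier haplotype yields Poisson(mean_depth / 2) supporting reads.
    lam = carrier_haplotypes * mean_depth / 2.0
    p_below = sum(
        math.exp(-lam) * lam ** i / math.factorial(i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_below


# A variant on a single chromosome at 4x versus 42x, and a variant
# shared by ten chromosomes at 4x.
for carriers, depth in [(1, 4), (1, 42), (10, 4)]:
    print(carriers, depth, round(detection_probability(carriers, depth), 3))
```

Under these assumptions a variant carried on a single chromosome is often missed at low depth, whereas a variant shared across many sampled chromosomes accumulates enough supporting reads to be found, which matches the qualitative contrast drawn between the low coverage and trio designs.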