The 1000 Genomes Project (1000 Genomes Project Consortium, 2010
) sets an excellent example on how to design a sequencing project to get the maximum output pertinent to human populations. An important lesson from this project is to sequence many human samples at relatively low coverage instead of a few samples at high coverage. We adopt this strategy because with higher coverage, we will mostly reconfirm information from other reads, but with more samples, we will be able to reduce the sampling fluctuations, gain power on variants present in multiple samples and get access to many more rare variants. On the other hand, sequencing errors counteract the power in variant calling, which necessitates a minimum coverage. The optimal balancing point is broadly regarded to be in the 2–6 fold range per sample (Le and Durbin, 2010
; Li et al., 2011
), depending on the sequencing error rate, level of linkage disequilibrium (LD) and the purpose of the project.
A major concern with this design is that at 2–6 fold coverage per sample, non-reference alleles may not always be covered by sequence reads, especially at heterozygous loci. Calling variants from each individual and then combining the calls usually yield poor results. The preferred strategy is to enhance the power of variant discovery by jointly considering all samples (Depristo et al., 2011
; Le and Durbin, 2010
; Li et al., 2011
; Nielsen et al., 2011
). This strategy largely solves the variant discovery problem, but acquiring accurate genotypes for each individual remains unsolved. Without accurate genotypes, most of the previous methods [e.g. testing Hardy–Weinberg equilibrium (HWE) and association mapping] would not work.
To reuse the rich methods developed for genotyping data, the 1000 Genomes Project proposes to impute genotypes utilizing LD across loci (Browning and Yu, 2009
; Howie et al., 2009
; Li et al., 2009b
). Suppose at a site A
, one sample has a low coverage. If some samples at A
have high coverage and there exists a site B
that is linked with A
and has sufficient sequence support, we can transfer information across sites and between individuals, and thus make a reliable inference for the low-coverage sample at A
. The overall genotype accuracy can be greatly improved.
However, imputation is not without potential concerns. First, imputation cannot be used to infer the regional allele frequency spectrum (AFS) because imputation as of now can only be applied to candidate variant sites, while we need to consider non-variants to infer AFS. Second, the effectiveness of imputation depends on the pattern of LD, which may lead to potential bias in population genetical inferences. Third, the current imputation algorithms are slow. For a thousand samples, the fastest algorithm may be slower than read mapping algorithms, which is frequently the bottleneck of analyzing NGS data (H.M.Kang, personal communication). Considering more samples and using more accurate algorithms will make imputation even slower.
These potential concerns make us reconsider if imputation is always preferred. We notice that we perform imputation mainly to reuse the methods developed for genotyping data, but would it be possible to derive new methods to solve classical medical and population genetical problems without precise genotypes?
Another application of NGS that requires genotype data is to discover somatic mutations or germline mutations between a few related samples (Conrad et al., 2011
; Ley et al., 2008
; Mardis et al., 2009
; Pleasance et al., 2010a
; Roach et al., 2010
; Shah et al., 2009
). For such an application, samples are often sequenced to high coverage. Although it is not hard to achieve an error rate one per 100 000 bases (Bentley et al., 2008
), mutations occur at a much lower rate, typically of the order of 10−6
or even 10−7
. Naively calling genotypes and then comparing samples frequently would not work well (Ajay et al., 2011
), because subtle uncertainty in genotypes may lead to a bulk of errors. From another angle, however, when discovering rare mutations, we only care about the difference between samples. Genotypes are just a way of measuring the difference. Is it really necessary to go through the genotype calling step?
This article explores the answer to these questions. We will show in the following how to compute various statistics directly from sequencing data without knowing genotypes. We will also evaluate our methods on real data.