To sample genotypic space within species, empirical population genetics has closely followed the current state-of-the-art molecular techniques for surveying genetic variation (Avise 1994). From the early days of protein electrophoresis (Lewontin and Hubby 1966), to DNA sequencing (Kreitman 1983), to surveys of microsatellite variation (Schlotterer et al. 1997; Irvin et al. 1998), to large-scale resequencing screens (Hutter et al. 2007), two major goals in population genetics have been to characterize patterns of genetic variation in natural populations and, subsequently, to infer processes of evolutionary change. Efforts to unravel evolutionary processes at a finer scale have motivated the development of tools to increase both the number of sampled individuals and the fraction of the genome covered. New sequencing technologies (Mardis 2008) are now instigating a step change in the scope of population genetics by generating sample coverage and depth at a much higher scale than ever before. As the quantity of data available for population genetic analysis grows to multiple full genome proportions (Liti et al. 2009), new techniques and analytical approaches will be required. Several studies have already begun to develop methods for estimating nucleotide diversity from sparse data (Hellmann et al. 2008; Jiang et al. 2009; Lynch 2009), but none have yet applied these approaches to real short-read data sets. As data begin to appear from large-scale resequencing projects such as the 1000 Genomes Project in humans, the 1001 Genome Project in Arabidopsis, and the Drosophila Genetic Reference Panel Project in flies, understanding the practical application and limitations of these short-read methods will become increasingly important.
Here, we present the first attempt to make population genetic inference on a genomic scale from low-coverage alignments. Using two populations of D. melanogaster, sampled at different coverage levels, we employed stringent approaches and criteria, including conservative alignments, probabilistic SNP models, and a correction to estimate nucleotide diversity. In many cases, we recapitulate patterns of SNP variation previously observed in Drosophila: reduced diversity on the X chromosome relative to autosomes, reduced diversity in non-African populations relative to ancestral African populations, and positive correlations between recombination rate and diversity. We also report novel results that depend on broad-scale sampling, in particular our observation that correlations between recombination rate (based on the standard genetic map of D. melanogaster) and diversity appear to be stronger for non-African autosomes than for other populations and chromosomes.
However, our approach also suffers from important limitations. Our estimates of θ appear to be influenced by the conservative choices made during alignment and SNP calling: we tend to observe lower estimates of θ than previously reported. In future studies, it will be important to recognize that alignment and SNP calling methods can have significant impacts on downstream estimates of diversity. Additionally, given current methods and the sparse nature of our data set, we cannot make inferences that depend on frequency-based statistics. Deeper coverage and methods that allow for the calculation of full data likelihoods (as opposed to just the probability of a site being a SNP or not, relative to the reference) will be necessary to fully capture allele frequency information in sparse data sets.
Sampling entire genomes from natural populations via an increasing number of new sequencing platforms is likely to become the norm in population genomics. As sequencing significantly decreases in cost, it may soon be feasible to generate full, high-coverage resequencing data for model organisms with relatively small genomes. However, sparse data sets such as the one we describe here will undoubtedly become the norm in nonmodel organisms and in organisms with large genomes. It is therefore imperative that we continue to develop rigorous statistical methods that deal with this onslaught of random genomic sequences. In this paper, we highlight the potential problems of sparse coverage population genomics, which include alignment issues, sequencing quality, variable depth of coverage, and missing sites. We show that solutions to these problems (a conservative Mosaik assembly incorporating sequencing errors, a Bayesian model for SNP identification, and unbiased estimators of nucleotide diversity, i.e., θ) allow us to infer the expected patterns of variation from even very sparse coverage across two populations of D. melanogaster, although further work will be required to develop methods to allow inference based on allele frequencies and to address the challenges inherent in a probabilistic approach to alignment and data quality.
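As a rough illustration of how variable depth of coverage and missing sites can enter an estimate of θ, the sketch below computes a Watterson-type estimator in which each site contributes its own sample-size correction. It is a simplified stand-in and not the estimator used in this study: it assumes that the depth at a site equals the number of distinct chromosomes sampled there, that SNP calls are error free, and that sites covered by fewer than two chromosomes are uninformative. The function names and input layout are hypothetical.

```python
from typing import Sequence


def harmonic_correction(n: int) -> float:
    """Return a(n) = sum_{k=1}^{n-1} 1/k, the Watterson correction for n chromosomes."""
    return sum(1.0 / k for k in range(1, n))


def theta_variable_depth(depths: Sequence[int], is_snp: Sequence[bool]) -> float:
    """Per-site Watterson-type estimate of theta under site-specific sample sizes.

    depths[i] -- assumed number of distinct chromosomes sampled at site i
    is_snp[i] -- True if site i was called as segregating in the sample

    Assumption (not from the paper): theta = S / sum_i a(n_i), where the sum
    runs over sites with n_i >= 2 and S counts SNPs among those same sites.
    """
    if len(depths) != len(is_snp):
        raise ValueError("depths and is_snp must have the same length")

    segregating = 0
    denominator = 0.0
    for n, snp in zip(depths, is_snp):
        if n < 2:
            # A site seen on fewer than two chromosomes cannot reveal a
            # polymorphism within the sample, so it is treated as missing.
            continue
        denominator += harmonic_correction(n)
        if snp:
            segregating += 1

    if denominator == 0.0:
        return float("nan")  # no informative sites
    return segregating / denominator


if __name__ == "__main__":
    # Toy data: four sites, one of which has depth 1 and is therefore skipped.
    toy_depths = [2, 2, 3, 1]
    toy_snps = [True, False, False, True]
    print(theta_variable_depth(toy_depths, toy_snps))
```

In real short-read data, several reads at a site may derive from the same chromosome and sequencing errors can mimic SNPs, so a practical estimator would need to model both effects rather than take depths and SNP calls at face value.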
Even current methods demonstrate the ample promise of short-read population genomics, especially for organisms where resources for high-quality and deep-coverage resequencing projects are not available. Sparse-coverage population genomic projects will always face some limitations: de novo assembly of low-coverage data is not feasible, and thus any population genomic study of this sort will require a reference genome for mapping purposes. Although the reference genome need not necessarily be the same species as the surveyed populations, more distant reference genomes will reduce mapping efficiency. Mapping efficiency is also likely to be reduced in organisms with very large genomes and especially those with high repetitive DNA content, as repetitive sequences generally cannot be uniquely mapped to the reference. However, beyond the availability of a suitable reference, we believe that sparse-coverage short-read approaches provide a cost-effective and accessible way to survey genome-wide variation in a wide range of organisms. The method for inferring θ described here is easily applicable to heterozygous organisms (Hellmann et al. 2008), obviating the need for inbreeding prior to sequencing. Furthermore, genome-wide sampling has important advantages over alternative approaches, such as sequencing targeted genomic regions: as we demonstrate, a single experiment can provide information about SNPs, CNPs, and variation in TE content.
Population genetics has historically focused on mutational variants consisting of single-nucleotide changes. By utilizing random sequences aligned to a reference assembly, new genomic data hold the promise to provide a richer snapshot of extant genetic variation beyond single-nucleotide variants. With genome-wide data amassed on a population scale, we can also characterize patterns of genomic variation such as TE diversity and CNP. By sampling structural and sequence variation in an integrated manner and by providing cost-effective ways for population-genomic inference in nonmodel organisms, next-generation sequencing is ushering us into a new era in population genomics that will allow comprehensive insights into the molecular variation underlying genome and organismal evolution.