Reported here is FREQ-Seq, a method for the quantitative determination of allele frequencies. FREQ-Seq uses standard PCR and a novel bar-coding system to generate locus-specific Illumina sequencing libraries. The approach focuses the sequencing depth of next-generation sequencing platforms on just the key loci of interest across timepoints or populations in order to accurately infer allele frequencies. A key aspect of FREQ-Seq is its capacity for multiplexing a large number of samples. For academic laboratories engaged in experimental evolution or environmental microbiology, for example, FREQ-Seq provides a low-cost, highly extendable method for investigating genetic variation over large temporal and spatial scales.
As a proof of principle, we have demonstrated the utility of FREQ-Seq in tracking the emergence of four beneficial alleles from a previously characterized evolution experiment. We also observed that, while the order of allele emergence matched the order expected for the optimal adaptive trajectory among the epistatic combinations generated by Chou and coworkers, a previously unknown period of clonal interference existed early during adaptation. Based on genotyping of clones during this period, one prominent source of interfering lineages was IS insertions into the replication machinery (trfA) of the RK2-based plasmid harboring the exogenous formaldehyde oxidation pathway. Similar IS insertions have recently been found in many of the replicate populations from this experiment, and provide 17.4–24.1% fitness increases by reducing the cost associated with exogenous pathway expression. Interestingly, while these IS mutational variants were the eventual evolutionary ‘winners’ in other populations, they were unable to outcompete the fghAEVO gshAEVO genotype despite being simultaneously present in gshAEVO backgrounds. These results underscore the importance of allele frequency data in revealing the complex processes that occur during adaptation. Furthermore, by simultaneously determining the population growth rate with high resolution, we were able both to estimate the fitness of the genotype containing the allele of interest and to demonstrate that the reversal in its frequency actually occurred during a period of continued rise in fitness. As more studies move toward characterizing improvement accurately and with fine time resolution, FREQ-Seq estimates of allele frequencies will become even more important for connecting phenotypic and genotypic change.
This test case highlights several advantages of the FREQ-Seq method. First, our results show that a wide range of allelic types can be detected, ranging from SNPs and small in-dels to novel IS junctions. Second, operating at approximately 5% of an Illumina flow cell capacity, we were able to achieve >10^5 reads per locus per time-point. Third, we observed minimal bias in our control ratios. This suggests that FREQ-Seq is fully capable of quantitatively detecting these types of alleles with little or no calibration required, particularly for SNPs or small in-dels. Where needed, a simple quadratic fit can be used to account for small, systematic bias. We encourage potential users to include a control mixture to account for possible skews in observed frequencies for these types of alleles. Fourth, the sensitivity available is quite reasonable. Our data allowed us to quantify allele frequencies reliably from 1–99%. More sophisticated statistical error models or increased flow cell capacity can likely increase the sensitivity of FREQ-Seq. Fifth, the publicly available FREQout analysis package allows researchers to easily recover data and receive a full report of allele frequency types.
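The quadratic correction mentioned above can be illustrated with a minimal sketch. It assumes that a set of control mixtures with known input frequencies has been sequenced, and fits a second-order polynomial mapping observed to expected frequency by ordinary least squares. The function names and the control values below are hypothetical and serve only as illustration; they are not data from this study, nor the procedure implemented in FREQout.

```python
def solve_linear(a, b):
    """Solve a small square system a.x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_quadratic(observed, expected):
    """Least-squares fit of expected = c0 + c1*obs + c2*obs**2,
    solved through the 3x3 normal equations."""
    powers = [sum(o ** k for o in observed) for k in range(5)]
    a = [[powers[i + j] for j in range(3)] for i in range(3)]
    b = [sum(e * o ** i for o, e in zip(observed, expected)) for i in range(3)]
    return solve_linear(a, b)


def calibrate(freq, coeffs):
    """Apply the fitted correction and clamp the result to [0, 1]."""
    c0, c1, c2 = coeffs
    corrected = c0 + c1 * freq + c2 * freq ** 2
    return min(1.0, max(0.0, corrected))


# Hypothetical control mixtures (illustrative values only, NOT study data):
# known input frequencies and the frequencies a slightly skewed assay reports.
expected = [0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99]
observed = [0.012, 0.11, 0.27, 0.52, 0.76, 0.905, 0.99]

coeffs = fit_quadratic(observed, expected)
print("fitted coefficients:", coeffs)
print("corrected estimate for an observed 0.40:", calibrate(0.40, coeffs))
```

In practice the fit would be made separately for each allele type that shows a skew, since the text above suggests SNPs and small in-dels need little or no correction.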
A critical component of any new method is the cost of implementation, and the associated tradeoff in accuracy and throughput, relative to existing techniques and technologies. With this in mind, we explored the cost of FREQ-Seq implementation compared to existing high-throughput platforms (Fluidigm AccessArray™ and Raindance Thunderstorm™) and sequencing-based allele frequency detection methods (Sanger sequencing and whole genome resequencing). As an initial cost comparison, we chose to examine two commercial systems from Fluidigm and Raindance that use a very similar strategy for PCR enrichment and barcoding of chromosomal loci. FREQ-Seq and these two platforms all require the use of locus-specific oligonucleotide primers containing long overhangs. Unlike FREQ-Seq, however, both the Fluidigm and Raindance systems require long barcoded primers, containing both Illumina adapter sequences, for every locus of interest. Although this difference appears minor, the latter primers routinely exceed most standard oligonucleotide synthesis lengths (currently 60 nucleotides at $0.35 US/base, IDT), resulting in increased synthesis and purification costs (>$1 US/base). Additionally, as these primers are locus specific, each locus must have a new set of barcoded primers. As a direct comparison, each Fluidigm or Raindance barcoded primer set would cost approximately $3,120 US (48 × 65 nt barcoded primers at approximately $1 US/base) for a single locus, whereas a FREQ-Seq locus primer set of similar size costs $32 US (1 × 37 nt primer + 1 × 54 nt primer at $0.35 US/base), as all barcoding is accomplished with the universal bridging primer in a locus-independent manner. Finally, a comparison of equipment costs (assuming equal amounts of PCR reagents and user time) shows that the aforementioned commercial technologies, not considering annual custom consumables and instrument servicing costs, range from $70,000 to $100,000 US.
In contrast, FREQ-Seq relies only on a standard 96-well PCR thermal cycler already present in, or generally available to, most laboratories. Collectively, FREQ-Seq is two orders of magnitude less expensive in oligonucleotide costs and obviates the purchase of expensive equipment designed specifically for this purpose. Further, as FREQ-Seq is an open-source platform, no reagents or equipment are proprietary, thereby sparing small research laboratories and institutions significant annual service and licensing costs.
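The oligonucleotide cost comparison above can be reproduced with a few lines of arithmetic. The sketch below uses only the primer counts, lengths and per-base prices quoted in the text; the function name is an illustrative assumption.

```python
def primer_set_cost(lengths_nt, price_per_base):
    """Total synthesis cost (US$) for a set of primers of the given lengths."""
    return sum(lengths_nt) * price_per_base


# Commercial platforms: 48 locus-specific barcoded primers of 65 nt each,
# priced at roughly $1 US/base because they exceed standard synthesis lengths.
commercial = primer_set_cost([65] * 48, 1.00)

# FREQ-Seq: one 37 nt and one 54 nt locus primer at the standard $0.35 US/base;
# barcoding is carried by the universal bridging primers, not by locus primers.
freq_seq = primer_set_cost([37, 54], 0.35)

print(f"commercial primer set per locus: ${commercial:,.2f}")
print(f"FREQ-Seq primer set per locus:   ${freq_seq:,.2f}")
print(f"fold difference:                 {commercial / freq_seq:.0f}x")
```

The per-locus difference is close to 100-fold, which is the basis for the "two orders of magnitude" statement; the one-time cost of the shared bridging primers is not included in this per-locus figure.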
Additionally, one may consider alternative strategies such as traditional Sanger sequencing and whole genome resequencing to infer allele frequencies in mixed populations. Using the analysis of the experimentally evolved population demonstrated here as a metric, Sanger sequencing of four loci for 22 time points would cost $440 US for a single set of replicates ($5 US/reaction, GeneWiz). However, to achieve any statistical accuracy, multiple replicate samples for each allele and time point would need to be run. Moreover, accurate quantification by Sanger sequencing is routinely difficult to achieve below 10% allele abundance. Likewise, applying the same logic to whole genome resequencing of mixed population samples on the Illumina HiSeq platform, approximately 12 samples can be run per lane to produce an average coverage of 100–200 fold. Conservatively, this experiment would use two full lanes of an Illumina flow cell (costing approximately $2,400 US using Harvard FAS Core pricing) and would require 22 barcoded Illumina libraries to be generated (an additional cost of $1,200 US with TruSeq preparations at $55 US/sample). An immediate benefit of this strategy is that nearly all chromosomal mutations can be seen simultaneously. The drawback is that 100–200 fold coverage both results in substantial sampling error when estimating alleles of intermediate frequency and is insufficient to detect alleles at low frequency, so that increased coverage becomes necessary (with costs increasing linearly). In contrast, using only ~3% of a single Illumina flow cell lane ($40 US), FREQ-Seq is able to achieve an average of 100,000 fold coverage per allele, per time point. If we thus consider the cost per 1,000 fold coverage, whole genome sequencing would cost ~$700 US per time point, whereas FREQ-Seq costs $0.005 US per locus of interest in the population. By focusing on loci of interest, FREQ-Seq uses this five orders of magnitude difference in cost per coverage to achieve robust quantitative results and sub-1% frequency detection. Taken together, these comparisons demonstrate that FREQ-Seq is a cost-effective, quantitative strategy for localized allele frequency determination.
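The link between coverage, sampling error and cost can be made concrete with a short sketch. It assumes that read counts follow simple binomial sampling, which ignores PCR and sequencing error and is therefore an optimistic lower bound on the true uncertainty. The 150-fold value is an assumed midpoint of the 100–200 fold range quoted above, and the function names are illustrative.

```python
import math


def binomial_se(freq, coverage):
    """Standard error of an allele frequency estimate under binomial sampling."""
    return math.sqrt(freq * (1.0 - freq) / coverage)


def cost_per_1000x(total_cost, n_measurements, coverage):
    """Cost (US$) of 1,000-fold coverage for a single locus-by-time-point measurement."""
    return total_cost / n_measurements / (coverage / 1000.0)


LOCI, TIME_POINTS = 4, 22  # experiment size used as the metric in the text

# Sampling error for an allele at 50% and expected supporting reads for one at 1%.
for label, coverage in [("whole genome (assumed 150x)", 150),
                        ("FREQ-Seq (100,000x)", 100_000)]:
    print(f"{label}: SE at 50% = {binomial_se(0.5, coverage):.4f}, "
          f"expected reads for a 1% allele = {0.01 * coverage:.1f}")

# Sanger sequencing: one $5 reaction per locus per time point.
sanger_total = LOCI * TIME_POINTS * 5
print(f"Sanger, single replicate set: ${sanger_total}")

# FREQ-Seq: $40 of sequencing shared across all loci and time points.
freq_seq = cost_per_1000x(40.0, LOCI * TIME_POINTS, 100_000)
print(f"FREQ-Seq per 1,000-fold coverage per locus: ${freq_seq:.4f}")
```

Under these assumptions, a 1% allele is expected to be supported by only one or two reads at 150-fold coverage, which illustrates why low-frequency alleles cannot be detected reliably without far deeper sequencing.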
In the present implementation of FREQ-Seq, we have described a set of 48 uniquely bar-coded bridging primers to produce localized Illumina sequencing libraries compatible with either single-end or paired-end read flow cells. Although currently designed for single-end reads, changing compatibility to paired-end flow cells would only require a simple modification of the reverse locus primer to change its 5′ overhang, and an identical change in primer ‘B’ for enrichment (Table S1). Though employed here to explore the frequencies of known alleles in a laboratory evolution experiment, we expect FREQ-Seq to be useful in other diagnostic applications. Multi-way competitions between bar-coded strains could be readily performed. Moving out of the lab entirely, analysis of microbial species variation in environmental samples is entirely feasible. For example, phylogenetic analyses of 16S rRNA from communities could be performed with excellent throughput and no subsequent library preparation. As a result, FREQ-Seq should be a useful tool for many biomedical and environmental microbiologists.