Taken together, these bioinformatic and experimental results suggest that phenotype sequencing can be a practical and effective method for identifying the genetic causes of a phenotype, provided several requirements are met: 1. a sufficient number of mutant strains with the desired phenotype, independently generated from a common ancestor, with a low density of random mutations; 2. a small enough genome (or region of genetic interest) to enable sequencing of this number of mutant strains at an acceptable cost; 3. a reference genome sequence that closely matches the ancestral genome, with gene annotations. We now discuss each of these requirements in turn.
The statistical power of phenotype sequencing depends entirely on the number of independent selection events (producing the same phenotype) that are sequenced. This can be achieved by performing independent mutagenesis experiments starting from a single parental strain, and screening each experiment for the desired phenotype. This both ensures that each mutant strain constitutes an independent mutation event, and permits control over the density of mutagenesis. Lowering the density of mutagenesis reduces the number of mutant strains that are needed to obtain a desired target gene discovery yield (but may also increase phenotype screening costs, due to the larger number of mutants that must be screened to find the desired phenotype).
Phenotype sequencing may also be applicable to mutant strains isolated from wild populations, tissue samples, or laboratory evolution under specific conditions. Existing examples illustrate that it is possible to obtain a sufficient number of independent mutant strains from such sources. However, naturally occurring mutant strains may require more costly sequencing analysis. Unless it is previously known that a given set of mutant strains form a star topology (i.e. their sequences are conditionally independent given the sequence of their most recent common ancestor (MRCA)), it would be necessary to reconstruct their detailed phylogeny, which is not possible using library-pooling. Instead, it would require a pure tag-pooling design, tagging each strain in a given lane uniquely, to obtain its individual sequence. In this case, target genes can be identified by calculating p-values based on the number of independent mutation events in each gene, inferred from the phylogenetic tree. Furthermore, we note that if a subset of mutant strains are believed to be conditionally independent given their MRCA, that subset can be pooled as a single library, reducing the cost without loss of information.
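As an illustration of scoring genes by their number of independent mutation events, the following minimal sketch assumes a simple null model that is not specified in this section: mutation events fall on genes in proportion to gene length, the count per gene is approximately Poisson, and a Bonferroni correction is applied over all genes. The function names and the toy data are hypothetical.

```python
import math

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def gene_p_values(event_counts, gene_lengths):
    """Score each observed gene under a length-proportional Poisson null (an assumption).

    event_counts: gene -> number of independent mutation events in that gene
    gene_lengths: gene -> gene length in bp, for every gene in the genome
    Returns gene -> Bonferroni-corrected p-value.
    """
    total_events = sum(event_counts.values())
    total_length = sum(gene_lengths.values())
    n_genes = len(gene_lengths)
    corrected = {}
    for gene, k in event_counts.items():
        expected = total_events * gene_lengths[gene] / total_length
        corrected[gene] = min(1.0, poisson_tail(k, expected) * n_genes)
    return corrected

# Hypothetical toy genome: 100 genes of 1000 bp, 10 mutation events in total.
toy_lengths = {f"g{i}": 1000 for i in range(100)}
toy_counts = {"g0": 5, "g1": 1, "g2": 1, "g3": 1, "g4": 1, "g5": 1}
toy_scores = gene_p_values(toy_counts, toy_lengths)
print(sorted(toy_scores.items(), key=lambda item: item[1])[:3])
```

For strains with shared ancestry, the counts passed in would be the independent events inferred from the phylogenetic tree rather than raw per-strain observations.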
It should be emphasized that evidence of such phylogenetic structure (i.e. non-independence among mutant strains) can be easily detected even in library-pooled sequence data. Since independent mutation events are very unlikely to hit the exact same nucleotide site, each observed mutation should be found in only a single mutant strain. By contrast, if different strains share common ancestry subsequent to the MRCA (i.e. are not independent), by definition they will share some fraction of their mutations. Thus, detection of the exact same mutations in two or more strains constitutes a signature of non-independence. This can be detected either qualitatively, if the same mutations are separately detected in two different libraries, or quantitatively (if the two strains are in the same library, their shared mutations will be observed on average at double the expected read count). It should be noted that in some cases observation of the same mutation in two different strains might be due to selection (e.g. if a specific mutation is much more likely to cause the phenotype than other mutations are, or if only a small number of different mutations in the genome are capable of causing the phenotype), rather than due to common inheritance.
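The qualitative check described above can be sketched as follows. This is a minimal illustration, assuming mutation calls are already available per library as sets of hypothetical `(position, ref, alt)` tuples. Mutations present in every library are skipped here on the assumption that they reflect differences between the reference and the parent rather than shared ancestry among a subset of strains.

```python
from collections import defaultdict

def shared_mutations(library_calls):
    """Find mutations called in two or more libraries, but not in all of them.

    library_calls: library name -> set of mutation keys, e.g. (position, ref, alt)
    Returns mutation -> sorted list of the libraries in which it was called.
    """
    seen_in = defaultdict(set)
    for library, calls in library_calls.items():
        for mutation in calls:
            seen_in[mutation].add(library)
    n_libraries = len(library_calls)
    return {
        mutation: sorted(libraries)
        for mutation, libraries in seen_in.items()
        if 2 <= len(libraries) < n_libraries
    }

# Hypothetical calls from three libraries.
example_calls = {
    "lib1": {(100, "A", "G"), (200, "C", "T"), (5, "G", "A")},
    "lib2": {(100, "A", "G"), (300, "T", "C"), (5, "G", "A")},
    "lib3": {(400, "G", "C"), (5, "G", "A")},
}
flagged = shared_mutations(example_calls)
print(flagged)
```

A non-empty result is only a flag for follow-up: as noted above, a recurrent mutation can also arise from selection rather than common inheritance.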
The cost of phenotype sequencing scales according to the size of the genome (or region of interest) being sequenced. Thus, it is clearly most useful for microbial and other small genomes. Increasing genome size proportionally increases not only the baseline sequencing cost, but also the genome-wide false positive rate due to sequencing error. This means that when the sequencing error rate per nucleotide site is held constant, a larger genome requires reducing the pooling factor (in order to raise the mutation-call threshold enough to suppress false positives). This implies that for phenotype sequencing of larger genomes, it will be very valuable to reduce the per-nucleotide sequencing error rate, as discussed below.
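The scaling of false positives with genome size can be illustrated with a rough calculation. The sketch below assumes a simplified error model that is not taken from this paper: every site has the same read coverage, each read is independently wrong with a fixed probability, and a site is called as a mutation when at least a threshold number of reads disagree with the reference. The function names and the parameter values are hypothetical.

```python
import math

def binom_tail(n, p, t):
    """Return P(X >= t) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(t, n + 1))

def expected_false_positives(genome_size, coverage, error_rate, threshold):
    """Expected number of sites whose error reads alone reach the call threshold."""
    return genome_size * binom_tail(coverage, error_rate, threshold)

def min_call_threshold(genome_size, coverage, error_rate, max_false_pos=1.0):
    """Smallest read-count threshold keeping expected false positives <= max_false_pos."""
    for threshold in range(1, coverage + 1):
        if expected_false_positives(genome_size, coverage, error_rate, threshold) <= max_false_pos:
            return threshold
    return None  # no threshold at this coverage is strict enough

# Illustrative genome sizes only: a bacterial-scale and a much larger genome.
small_genome_threshold = min_call_threshold(4_600_000, coverage=100, error_rate=0.01)
large_genome_threshold = min_call_threshold(3_000_000_000, coverage=100, error_rate=0.01)
print(small_genome_threshold, large_genome_threshold)
```

Under this model the larger genome needs a higher threshold at the same error rate. Since a real mutation in one strain of a pooled library is expected in roughly coverage divided by pooling factor reads, a higher threshold leaves room for fewer strains per library.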
Local variations in sequencing coverage might also raise the sequencing cost needed for obtaining a desired target discovery yield. Systematic studies of existing next-gen sequencing platforms have shown that they robustly detect >95% of SNPs despite local variations in coverage, with anomalously low coverage at approximately 0.1% (Illumina) to 1% (SOLiD) of nucleotide positions, especially AT-rich repeats. If poor coverage regions constitute only 5% of each gene region, they will not degrade target discovery yield significantly, since 95% of mutations in a target gene will still be detected. On the other hand, if a large fraction of each target gene fell into a poor coverage zone, that would reduce the target discovery yield proportionally. If an experiment gives poor discovery yield and suffers poor coverage across a large fraction of potential candidate genes, using a different sequencing platform would probably resolve the problem by supplying improved coverage in these regions (because the platforms differ markedly in their coverage biases). However, existing data suggest that such problematic cases are likely to be uncommon.
Interpreting the results of phenotype sequencing requires a reference genome sequence annotated with gene regions. Although it is possible to obtain results from phenotype sequencing without this, doing so would both require extra work and dilute the biological meaningfulness of the results. First of all, it is not strictly necessary to have a reference genome sequence that exactly matches the actual parent of the mutant strains. Mismatches between the reference and the parent will simply be observed in each tagged library with an apparent allele frequency of 100%, and can be automatically excluded from consideration. For example, in our phenotype sequencing experiment we detected 23 mutations observed with 100% allele frequency in at least one library, and each such mutation was detected identically (at 100% frequency) in all ten libraries. We excluded these parental mutations from our analysis. Thus, the primary value of a reference genome sequence is that it greatly facilitates and accelerates phenotype sequencing, by enabling rapid alignment of reads and detection of mutations. In the absence of a reference genome, one would first have to assemble the reads ab initio, a considerably more complicated task. Similarly, accurate gene annotations with meaningful functional information are required not so much for obtaining phenotype sequencing results as for biological interpretation of the results. In principle, for a completely unannotated genome, one could predict open reading frames (ORFs) and detect clustering of multiple mutations within individual ORFs just as effectively as with annotated gene regions. However, it might be harder to interpret the biological meaning of a discovered target gene, if little or no functional information could be found for it.
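The exclusion of parental mutations described above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used in our experiment: the input layout, the function name, and the 0.95 cutoff (used in place of an exact 100%, to tolerate sequencing error) are all assumptions.

```python
def split_parental(allele_freqs, libraries, fixed_cutoff=0.95):
    """Separate reference/parent mismatches from candidate mutations.

    allele_freqs: mutation -> {library name -> observed allele frequency}
    libraries: names of all tagged libraries in the experiment
    fixed_cutoff: frequency treated as "fixed" (assumed value)
    A mutation is treated as parental only if it is fixed in every library.
    """
    parental, candidates = set(), set()
    for mutation, freqs in allele_freqs.items():
        if all(freqs.get(library, 0.0) >= fixed_cutoff for library in libraries):
            parental.add(mutation)
        else:
            candidates.add(mutation)
    return parental, candidates

# Hypothetical observations across three libraries.
example_libraries = ["lib1", "lib2", "lib3"]
example_freqs = {
    "pos1200_A>G": {"lib1": 1.0, "lib2": 0.99, "lib3": 1.0},  # fixed everywhere
    "pos5300_C>T": {"lib1": 0.31},                             # one library only
    "pos8100_G>A": {"lib1": 1.0, "lib2": 0.0, "lib3": 0.0},    # fixed in one library
}
parental_set, candidate_set = split_parental(example_freqs, example_libraries)
print(parental_set, candidate_set)
```

Requiring fixation in every library is deliberately strict: a mutation fixed in only one library is kept as a candidate, since it could not have come from the common parent.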
While phenotype sequencing can be useful for well-established microbial systems such as E. coli, it may have special value for genetically intractable organisms like Chlamydia, an important human pathogen. For example, in Chlamydia, researchers have identified a variety of potentially revealing mutant phenotypes, but deeper understanding of their genetic causes is impeded by the lack of powerful genetic systems for these bacteria. For such organisms, phenotype sequencing can open up a fast path for directly identifying a phenotype's genetic causes, for any phenotype where a good screen exists for generating multiple independent mutant strains.
Our mathematical model of phenotype sequencing makes a number of assumptions that may be overly conservative relative to real-world phenotype sequencing experiments. We deliberately chose our model to represent the hardest possible case for phenotype sequencing, via the following conservative assumptions: 1. a maximum entropy split of the selection signal between all target genes; 2. only a single mutation is required to produce the phenotype; 3. a relatively high mutagenesis density and effective number of target genes. We now discuss each of these in turn. (We also note that while we only analyzed our experimental data for single nucleotide substitutions, in principle the same p-value scoring approach could be applied to other types of mutation events, e.g. deletions or insertions).
In our initial analysis, we assumed that each target gene is equally likely to be mutated, and equally likely to produce the phenotype. Both of these assumptions could be wrong. Splitting the selection signal equally among all target genes ensures that no target gene is any more detectable than any other target gene, and thus minimizes the detectability of the most detectable target gene. Introducing variability in either the probability of mutation or the probability of producing the desired phenotype increases the probability of detecting the top target gene. It seems unlikely that real-world phenotype sequencing targets will exactly match the hardest-case category. Many sources of gene variation are likely to create variability in the effective target size for a given phenotype. Empirically, we know that genes vary widely in size. We also expect that the contributions of different proteins to a given phenotype are likely to vary: whereas one protein might be absolutely central to that phenotype, such that a large fraction of amino acid mutations could cause the phenotype, in a protein that participates in only part of that function, perhaps only a small fraction of mutations could cause that specific phenotype. In our isobutanol tolerance mutants, we observed that one gene (acrB) showed a dramatically higher detectability than the other two validated targets (marC, acrA). Finally, whereas loss-of-function mutations may be possible in many genes within a pathway, gain-of-function mutations may be possible at only a subset of sites in a specific gene. Thus, a gain-of-function phenotype may display much stronger selection bias to a subset of target genes, making such target(s) easier to detect. Overall, we expect that real-world phenotype sequencing experiments will be easier (and more successful) than the estimates we have reported here from our uniform target size model.
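The claim that an equal split minimizes the detectability of the most detectable target gene can be checked with a small calculation. The sketch below is a simplification and not our full model: it assumes exactly one causal mutation per strain, ignores background (non-causal) mutations, and treats a gene as "detected" when it is hit in at least `k` strains. The weights, strain count, and `k` are hypothetical.

```python
import math

def prob_at_least(n_strains, p_hit, k):
    """Return P(X >= k) for X ~ Binomial(n_strains, p_hit)."""
    return sum(
        math.comb(n_strains, i) * p_hit**i * (1 - p_hit) ** (n_strains - i)
        for i in range(k, n_strains + 1)
    )

def top_gene_detectability(weights, n_strains, k):
    """Probability that the highest-weight target gene is hit in at least k strains.

    weights: relative chance that each target gene carries a strain's causal mutation
    """
    p_top = max(weights) / sum(weights)
    return prob_at_least(n_strains, p_top, k)

# 20 target genes, 30 strains, detection requires at least 3 hits (assumed values).
uniform_split = [1.0] * 20
skewed_split = [10.0] + [1.0] * 19  # one dominant target gene
p_uniform = top_gene_detectability(uniform_split, n_strains=30, k=3)
p_skewed = top_gene_detectability(skewed_split, n_strains=30, k=3)
print(round(p_uniform, 3), round(p_skewed, 3))
```

Any departure from equal weights raises the top gene's share of the causal mutations, so the uniform case gives the lowest value this function can return for a fixed number of target genes.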
We also assumed that the phenotype is produced via only a single mutational step from the parental strain. In other words, if a given mutant strain contains 100 mutations relative to the parent, we assume that only one of those mutations is causal (i.e. needs to be in a true target gene). This minimizes the “signal-to-noise” ratio (in this example, to just one causal mutation out of 100 total mutations), making the signal harder to detect. By contrast, if two or more mutations are required to produce the phenotype, that would multiply the signal-to-noise ratio proportionally, by two-fold or more. Our assumption of a single causal mutation means that the probability that each target gene is mutated in a given strain should sum to 1.0 (100%) over all the target genes. Empirically, in our isobutanol tolerance mutants we observed a target gene mutation probability sum much larger than 1.0: one gene (acrB) was itself mutated in nearly all the strains, and several more statistically significant genes were mutated in a third to a fifth of the strains each (marC, experimentally validated, plus stfP, ykgC, aes, not yet tested experimentally). Furthermore, Atsumi et al. have independently dissected the genetic causes in a single mutant strain, and found that five different mutations (in five genes) were responsible for the observed phenotype. Similarly, Conrad et al. found that enhanced E. coli growth in lactate minimal media typically arose in each mutant strain via 5 to 8 contributory mutations in different genes. Thus, we think that real-world phenotype sequencing experiments are likely to contain a higher signal-to-noise ratio than assumed by our single-causal-mutation model.
Our values for the mutagenesis density and total target gene number may also be larger than necessary. For example, our target gene model assumes that the selection signal is split equally over 20 genes, making each true target gene 20-fold harder to detect than turned out to actually be the case for acrB in our validation experiment. Are there really phenotypes in which 20 different genes can each cause the phenotype with equal probability? This seems like an extreme, difficult case, yet our results show that even it can be solved by sequencing a practical number of mutant strains. Similarly, in our bioinformatic analyses and our experimental validation, we considered mutagenesis densities of greater than 100 mutations per strain. It should first be noted that such a density of potentially functional mutations (in our case, we restricted our analysis to non-synonymous mutations) corresponds to an even higher total mutation density (e.g. for 100 non-synonymous mutations, we might expect 150 total mutations). Since the experimenter can control the mutagenesis directly by reducing the concentration or time of mutagenesis, we suggest that future phenotype sequencing experiments should use a substantially lower mutagenesis density than we employed, to boost the signal-to-noise ratio.
Two additional trends appear to favor successful phenotype sequencing. First, the ongoing trend of decreasing sequencing cost per read (or equivalently, increased reads per unit cost) appears likely to continue for some time. We have sought to project the effect of this cost reduction on phenotype sequencing by considering the effect of a two-fold reduction in sequencing cost. Second, sequencing technologies offer several ways to reduce the baseline sequencing error rate. For example, multibase encoding schemes can greatly increase the ability to distinguish real mutations from sequencing errors, assuming that the reference sequence is known. Our projections indicate that reducing the sequencing error rate to 0.1% has a similar effect on phenotype sequencing as reducing the sequencing cost two-fold.
To demonstrate the utility of phenotype sequencing, we have applied it to an important real-world problem in biofuels research, namely the production of long chain alcohols from well-characterized fermentation bacteria (in this case, E. coli). Recently, the UCLA-DOE Lab has engineered strains of E. coli that produce long-chain alcohols such as isobutanol and isopropanol. We believe phenotype sequencing brings several advantages to this work and to biofuels research in general: 1. It makes no assumptions about exactly what genes or pathways affect the yield, and can experimentally discover the factors that actually improve biofuel yield. 2. It utilizes the organism's own ability to evolve under externally applied selection pressure, to produce the desired result. 3. It employs an inexpensive, highly scalable technology (next-gen sequencing) to rapidly identify genes that actually cause the phenotype. In principle, this approach has an exciting ability to survey the factors that can improve yield of a desired biofuel.