Pyrosequencing constitutes a promising approach to estimating the genetic diversity of a community. However, sequencing errors and short read lengths impose considerable challenges on inferring the population structure from a set of pyrosequencing reads. We have approached this task by identifying and solving consecutively three computational problems: error correction, assembly of candidate haplotypes, and estimation of haplotype frequencies. Our methods focus on the situation where a reference genome is available for the alignment of reads. This is the case, for example, for many important pathogens, such as bacterial and viral populations.
The procedure consists of three steps. First, error correction is performed locally. We take windows of fixed width over the aligned reads and cluster reads within the windows in order to resolve the local haplotype structure. This approach is based on previous methods
[27],
[28] that are specifically tailored to pyrosequencing reads. Next, haplotypes are reconstructed using a new application of a classic combinatorial algorithm. This step is the main theoretical advance in this paper. Finally, haplotype frequencies are inferred as the ML estimates of a statistical model that mimics the pyrosequencing process. We have developed an EM algorithm for solving this ML problem.
Haplotype reconstruction is based on two assumptions: consistency and parsimony. We require that each haplotype be constructible from a sequence of overlapping reads and that the set of explaining haplotypes be as small as possible. The Lander–Waterman model of sequencing implies lower bounds on the number of reads necessary to meet the first requirement. The fundamental object for haplotype reconstruction is the read graph. A minimal set of explaining haplotypes corresponds to a minimal path cover in the read graph, and this path cover can be found efficiently using combinatorial optimization. Moreover, the cardinality of the minimal path cover is an important invariant of the haplotype reconstruction problem related to the genetic diversity of the population.
We believe that these methods are also applicable to many metagenomic projects. In such projects, estimation of the diversity of a population is a fundamental question. The size of a minimal cover of a fragment assembly graph provides an intuitive and computable measure of this diversity.
We have validated our methods by extensive simulations of the pyrosequencing process, as well as by comparing haplotypes inferred from pyrosequencing data to sequences obtained from direct clonal sequencing of the same samples. Our results show that pyrosequencing is an effective technology for quantitatively assessing the diversity of a population of RNA viruses, such as HIV.
Resistance to most antiretroviral drugs is caused by specific mutational patterns comprising several mutations, rather than one single mutation. Thus, an important question that can be addressed efficiently by pyrosequencing is which of the resistance mutations actually occur on the same haplotype in the population. Since our methods avoid costly clonal sequencing of the HIV populations for determining the co-occurrence of HIV resistance mutations
[36], pyrosequencing may become an attractive alternative to the traditional clonal Sanger sequencing.
The sample size of approximately 10,000 reads we have considered provides us with the opportunity of detecting variants present in only 1% of the population. Pyrosequencing can produce 200,000 reads and thus twenty populations could be sequenced to a good resolution using a process less labor intensive than a limiting dilution clonal sequencing to a similar resolution of a single population.
The simulations suggest that the method works best with populations that are suitably diverse. Intuitively, the information linking two reads together on the same haplotype decays rapidly in sections of the genome where there are few identifying features of that haplotype (as in a region of low diversity). In particular, repeats of sufficient length in the reference genome can completely destroy linkage information. However, at some point the benefits of increased diversity will be partially reduced by the increased difficulty of the alignment problem. With more diverse populations or true indels, alignment to single reference genome will become less accurate.
The HIV
pol gene analyzed here is on the low end of the diversity spectrum. The
env gene with its higher variability may be a better target for some applications. We expect the proposed methods to improve early detection of emerging drug resistant variants
[37],
[38], and to support the genetic and epidemiological study of acute infections, in particular the detection of dual infections
[39].
Since our computational procedure produces an estimate of the entire virus population, it allows the study of fundamental questions about the evolution of viral populations in general
[40]. For example, mathematical models of virus evolution can be tested directly within the accuracy of estimated viral haplotype frequencies
[41]. Predicting viral evolution is considered an important step in HIV vaccine development
[10].
In addition to the promising biological applications, there are many interesting theoretical questions about reconstructing populations from pyrosequencing data. The errors in pyrosequencing reads tend to be highly correlated, as they occur predominately in homopolymeric regions. While this can make correction more difficult (a fact which can be counteracted by the use of quality scores), we believe that it can make haplotype reconstruction more accurate than if the errors were uniform. If errors are isolated to a few sites in the genome, fewer additional explaining haplotypes are needed than if the errors were distributed throughout. The exact relationship between the error process of pyrosequencing, error correction, and haplotype reconstruction is worthy of further study.
As pyrosequencing datasets can contain 200,000 reads, it is worthwhile to investigate how our methods scale to such large datasets. Haplotype reconstruction is the only step that is not immediately practical on such a large number of reads, since it is at worst cubic in the number of irredundant reads. However, problems of this size are approachable with our methods as follows.
The theoretical resolution of the algorithms depends on two factors: first, the ability to differentiate between errors and rare variants; and second, whether there are enough reads so that we can assemble all haplotypes. We have seen that the number of reads necessary for assembly scales with the inverse of the desired resolution: if N reads cover all haplotypes of frequency at least ρ, then kN reads are needed to cover all haplotypes of frequency at least ρk. However, the resolution of error correction is at most the overall error rate as the number of reads grows; see .
| Table 3Resolution of viral haplotype estimation. |
The limited resolution of error correction combined with the elimination of redundant reads makes haplotype reconstruction feasible for large datasets. For example, error correction on 200,000 reads with ε

=

0.0025 and α

=

0.001 will erase all variants with frequency below 0.365 (;
Methods). In order to have enough information to reconstruct these variants under the Lander–Waterman model, we would expect to need only about 30,000 reads. Furthermore, in regions of low diversity, many of the reads will be redundant and are thus discarded before building the graph. For example, with 30,000 error-free reads simulated from 275≈1/0.00365 haplotypes at 5% diversity, typically about 13,000 reads are irredundant. This number of irredundant reads is near the limits of our current implementation.
Current and future improvements to pyrosequencing technology will lead to longer reads (250 bp), more reads, and lower errors. However, in order for huge numbers of reads to be of great help in the ultra-deep sequencing of a population, the error rates must also decrease. The performance of our methods as read length varies is an important question, given the availability of sequencing technologies with different read lengths (e.g., Solexa sequencing with 30 to 50 bp reads) and the desire to assemble haplotypes of greater size (e.g., the entire 10 kb HIV genome).
Notice that haplotype reconstruction seems to be quite good locally ( and
Figure S1) in that many reconstructed haplotypes contain large contiguous regions where they agree with a real haplotype. However, the measures of performance considered in this paper all deal with the entire haplotype and ignore partial results of this type. This would seem to imply that longer reads will improve the reconstruction performance on a fixed length genome; new performance measures will have to be developed to analyze these problems.