We have presented a comparison of two sequencing platforms for the study of viral diversity, highlighting the trade-offs between sequencing depth, sequencing errors, and read length. If the analysis is focused on a local region of the genome covered by the reads, then Illumina's higher accuracy and higher throughput, which enables deep coverage, are advantageous with respect to 454/Roche. In this case, haplotype reconstruction is both more sensitive and more specific for Illumina data. On the other hand, read length has a tremendous impact when one tries to match the diversity detected at sites more distant than the read length, and in this case, the 454/Roche platform has a clear advantage. Even the experimental Illumina datasets obtained from the highly diverse population analyzed here do not allow for reliable reconstruction of the haplotypes. For example, with 36 bp long reads, regardless of the coverage and even assuming a low error rate, one can hardly reconstruct 50% of the population reliably. Thus, for long-range haplotype reconstruction in clinical samples, which will often display less diversity, read length appears to be the most critical factor.
Although both NGS technologies analyzed here have been improving rapidly in the last few years, their main distinctions remain. 454/Roche is still characterized by a higher indel error rate in homopolymeric regions. Illumina has a smaller total error rate and a lower cost per sequenced base. Both platforms have increased their read length, with 454 now generating reads of average length 800 bp and Illumina of 150 bp, but their relative advantages and disadvantages are virtually unaltered. Of course, the performance of either platform can be boosted by increasing the coverage, but the sequencing error patterns remain a limiting factor. Importantly, increasing coverage is more cost-effective and less labor-intensive with Illumina than with 454/Roche.
To compare the relatively long-read 454/Roche sequencing platform with the short-read Illumina technology, we have considered a genomic region covered entirely by the long reads but not by the short reads. Since a head-to-head comparison is not possible, we have explored two approaches. First, we defined a local window of maximal average entropy, with the aim of capturing the population diversity within that window by applying local reconstruction methods to the short reads. This approach is particularly useful for diverse populations, and although it will not result in the set of global haplotypes, it can be sufficient to estimate the population diversity, i.e., the number and frequencies of clones. Second, we have attempted to assemble short reads into global haplotypes. This approach is statistically and computationally more challenging, but has the potential to recover all full-length haplotype sequences.
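The window selection in the first approach can be sketched as follows. This is only an illustration under stated assumptions: diversity at each site is taken to be the Shannon entropy of the observed base frequencies in an alignment column, and the names `column_entropy` and `max_entropy_window` are hypothetical, not part of any tool used in this study.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the bases A, C, G, T in one alignment column."""
    counts = Counter(base for base in column if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def max_entropy_window(columns, width):
    """Return (start, mean_entropy) of the window of `width` consecutive
    columns with the highest average entropy; ties go to the leftmost window."""
    if width <= 0 or width > len(columns):
        raise ValueError("width must be between 1 and the number of columns")
    entropies = [column_entropy(col) for col in columns]
    best_start, best_sum = 0, sum(entropies[:width])
    for start in range(1, len(entropies) - width + 1):
        window_sum = sum(entropies[start:start + width])
        if window_sum > best_sum:
            best_start, best_sum = start, window_sum
    return best_start, best_sum / width


# Toy alignment: each string holds the bases observed at one site.
columns = ["AAAA", "AACC", "ACGT", "AAAA", "AAAA"]
start, mean_entropy = max_entropy_window(columns, width=2)
```

In practice the window width would presumably be chosen no larger than the read length, so that individual short reads can cover the whole window.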
In dealing with NGS data, one has to take into account sequencing errors. Without proper treatment, they would artificially inflate the estimated diversity. The approach presented here uses clustering of reads as a method to correct errors. Further measures would be to take quality scores into account, or to correct for strand bias. Variants that are observed predominantly on one strand are more likely to be artifacts than real biological variants. The improved results obtained for the non-PCR amplified samples show that PCR amplification is an additional source of noise, which can contribute in different ways to inflate and distort the observed diversity. Amplification efficiency can vary among different haplotypes, leading to an amplification bias. Moreover, PCR can introduce artificial variants into the sample by point mutations and, to a much larger extent, by recombination. These in vitro chimeras resulted in a larger number of false positives for 454/Roche than for Illumina, because recombination is more likely to occur and to be detected in longer reads. Carefully chosen PCR conditions can minimize the impact of these artifacts.
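As an illustration of the strand-bias check mentioned above, the sketch below applies a two-sided Fisher's exact test to the forward and reverse read counts of a variant and of the reference base. Fisher's exact test is one common choice and is an assumption here; it is not the procedure used in this study, and the function name `strand_bias_pvalue` is hypothetical.

```python
from math import comb


def strand_bias_pvalue(var_fwd, var_rev, ref_fwd, ref_rev):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[var_fwd, var_rev], [ref_fwd, ref_rev]].

    A small p-value means the variant's strand distribution differs from
    that of the reference base, which suggests a possible artifact."""
    row_var = var_fwd + var_rev
    row_ref = ref_fwd + ref_rev
    col_fwd = var_fwd + ref_fwd
    denom = comb(row_var + row_ref, col_fwd)

    def prob(x):
        # Hypergeometric probability of x forward-strand variant reads.
        return comb(row_var, x) * comb(row_ref, col_fwd - x) / denom

    lowest = max(0, col_fwd - row_ref)
    highest = min(row_var, col_fwd)
    p_observed = prob(var_fwd)
    # Sum all tables that are at most as likely as the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


balanced = strand_bias_pvalue(10, 10, 100, 100)  # variant seen on both strands
skewed = strand_bias_pvalue(20, 0, 100, 100)     # variant seen on one strand only
```

A variant with a very small p-value could then be flagged for removal or closer inspection; the threshold is a design choice that depends on coverage and on the number of sites tested.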
For global haplotype reconstruction, we employed a combinatorial inference algorithm based on the read graph. This approach can easily generate recombinant sequences that are not part of the true underlying population, especially if diversity is low and not all read errors have been corrected. Such artificial in silico chimeras are responsible for a large number of false positives in global haplotype reconstruction at deep coverage and might explain the decreasing global reconstruction performance with increasing coverage in some situations. Global haplotype inference may be improved by using alternative methods, or by exploiting paired-end reads to phase variants detected at large genomic distances. The results presented here are subject to the specific limitations of ShoRAH's reconstruction algorithm. Other computational tools, including improved error correction, might perform better under some circumstances, but the general limitation observed in this study will remain. Future studies are needed to delineate the feasibility of global haplotype reconstruction in terms of the underlying population diversity, the employed sequencing technology and parameters, and the computational strategy for haplotype inference.
The ability to detect and reconstruct diversity improves with decreasing sequencing error rate and with increasing number of polymorphic sites. As a consequence, for any given level of viral diversity in the sample and any given error rate, sequencing a longer region will result in better diversity estimates. Since the diversity is usually unknown in advance, it is generally impossible to determine a priori the expected performance of a specific platform in reconstructing the viral population. We have highlighted and quantified here the trade-off between read length and depth of coverage: long reads give higher accuracy in global haplotype reconstruction, whereas deep coverage gives improved sensitivity and specificity in local haplotype reconstruction, especially for low-frequency variants.