Hepatitis C virus (HCV) is a single-stranded RNA virus belonging to the Flaviviridae family. HCV infects 2.2% of the world's population and is a major cause of liver disease worldwide [20]. HCV is genetically very heterogeneous and is classified into 6 genotypes and numerous subgenotypes [21]. The most studied HCV region is the hypervariable region 1 (HVR1), located at amino acid (aa) positions 384-410 in the structural protein E2. Sequence variation in HVR1 correlates with neutralization escape and is associated with viral persistence during chronic infection [22]. NGS methods allow an unprecedented number of HVR1 sequence variants from infected patients to be analyzed and present a novel opportunity for understanding HCV evolution, drug resistance and immune escape [1].
Most current methods are optimized for shotgun analysis and assume that errors are randomly distributed. This assumption compromises the accuracy of shotgun sequencing less than it compromises the accuracy of amplicon sequencing: sequencing errors in amplicons are not randomly distributed [3], and the error rate should vary among amplicons of different primary structure.
In addition, current error-correction algorithms report performance measures related to their ability to find true sequences rather than the number of false haplotypes [1]. However, the biological applications of viral amplicons necessitate the use of error-free individual reads. All three methods studied here could find the correct sequences in both single-clone and mixture samples but showed marked differences in the estimated frequencies of the true haplotypes and in the number of false haplotypes. We found that both ET and KEC are suitable for rapid recovery of high-quality haplotypes from reads obtained by 454-sequencing.
The highly non-random nature of 454-sequencing errors calls for internal controls tailored to the amplicon of interest. The error distribution of single-clone samples helped us to calibrate the ET algorithm, which contributed to its high accuracy in detecting true sequences in the HVR1 amplicons. ET found the correct set of haplotypes in all 10 mixtures and in 10 of 14 single-clone samples, but found one false haplotype in the other 4 single-clone samples. KEC was correct for 13 of 14 single-clone samples (with 2 false haplotypes for one sample) and for 9 of 10 mixtures (it could not find the low-frequency clones in mixture M5), and it has the additional advantage of not needing an experimental calibration step. SHORAH found all correct haplotypes for all single-clone samples and for 9 of 10 mixtures, but it produced a very large number of false haplotypes and a significant divergence between expected and observed frequencies. Introducing a frequency cutoff for SHORAH results in the loss of true haplotypes. With a frequency cutoff of 1%, SHORAH was correct for 5 single-clone samples and for 3 mixtures; for the other samples it both missed true haplotypes and reported false ones.
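The two quantities compared above, recovery of true haplotypes and the number of false haplotypes, can be computed for any control sample whose composition is known. The following is a minimal sketch under stated assumptions, not the evaluation code used in this study: the function name `evaluate_haplotypes`, its arguments, and the toy sequences and frequencies are all hypothetical.

```python
def evaluate_haplotypes(reported, expected, min_freq=0.0):
    """Compare reported haplotypes with the known composition of a control.

    reported: dict mapping haplotype sequence -> reported frequency (0..1)
    expected: set of true haplotype sequences in the control sample
    min_freq: optional frequency cutoff applied to the reported haplotypes
    """
    # Keep only haplotypes at or above the cutoff.
    kept = {seq for seq, freq in reported.items() if freq >= min_freq}
    return {
        "true_found": len(kept & expected),
        "true_missed": len(expected - kept),
        "false_haplotypes": len(kept - expected),
    }


# Hypothetical control: two true haplotypes plus one homopolymer artifact.
expected = {"ACGTTTTA", "ACGATTTA"}
reported = {"ACGTTTTA": 0.950, "ACGATTTA": 0.008, "ACGTTTA": 0.042}

no_cutoff = evaluate_haplotypes(reported, expected)
with_cutoff = evaluate_haplotypes(reported, expected, min_freq=0.01)
```

In this toy case the 1% cutoff removes a low-frequency true haplotype while the more frequent artifact survives, which is the kind of trade-off described for SHORAH above.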
We strongly encourage sequencing single-clone samples of the desired amplicon in order to understand the nature and distribution of the errors and to measure the performance of the algorithm on that particular amplicon.
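As an illustration of how a single-clone control exposes the distribution of errors, the sketch below tabulates a per-position error rate. It is a simplified example, not the calibration procedure used for ET: it assumes that the reads have already been aligned to the known clone sequence, that every aligned read has the same length as the reference, and that `-` marks a deleted base. Insertions are not represented. The name `positional_error_rates` and the toy reads are hypothetical.

```python
from collections import Counter


def positional_error_rates(reference, aligned_reads):
    """Fraction of reads that disagree with the reference at each position.

    Assumes reads are pre-aligned to the reference ('-' marks a deletion).
    """
    if not aligned_reads:
        raise ValueError("at least one aligned read is required")
    errors = Counter()
    for read in aligned_reads:
        if len(read) != len(reference):
            raise ValueError("reads must be aligned to the reference length")
        for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
            if ref_base != read_base:
                errors[pos] += 1
    n_reads = len(aligned_reads)
    return [errors[pos] / n_reads for pos in range(len(reference))]


# Hypothetical single-clone control with a 5-base homopolymer.
reference = "ACGTTTTTA"
reads = ["ACGTTTTTA", "ACGTTTT-A", "ACGTTTT-A", "ACCTTTTTA"]
rates = positional_error_rates(reference, reads)
```

A profile in which the error rate is concentrated at a few positions, such as the end of a homopolymer, is the signature of non-randomly distributed errors.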
Most algorithms are successful in identifying and removing low-frequency errors. However, reads with high-frequency homopolymer errors should not be removed but rather corrected, which preserves valuable data. The three algorithms correct reads with homopolymer errors in different ways. KEC uses the k-mer distribution to discern between erroneous and correct k-mers and then fixes homopolymers using a heuristic algorithm. ET fixes the homopolymers based on pairwise alignments with high-quality internal haplotypes. SHORAH clusters reads with similar sequences, effectively creating a consensus haplotype. Sample S4 is particularly interesting because it included a false haplotype with a raw frequency of 25.85%. This false haplotype contained a deletion in a long homopolymer (n = 7). Both KEC and ET fixed this haplotype, but the clustering algorithm SHORAH preserved it because of its high frequency and made it the center of a cluster of other reads, so that it reached a final frequency of 33.25%. The same situation occurs in most single-clone samples: in samples S1, S2, S3, S4, S6, S8, S11 and S12, the second most frequent haplotypes, with frequencies of 13.3%, 16.2%, 11.6%, 33.25%, 5.2%, 2.79%, 2.9% and 3.89%, respectively, differ from the most frequent haplotype by one indel in a long homopolymer. The main assumption of clustering algorithms is that the observed set of reads represents a statistical sample of the underlying population and that probabilistic models can be used to assign observed reads to unobserved haplotypes in the presence of sequencing errors [5]. However, these algorithms assume that error rates are low and randomly distributed, which is not true for the 454-sequencing of amplicons. Some homopolymer errors reach very high frequencies, making it very difficult to separate these false haplotypes from true ones using a clustering model.
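The pattern described above, a haplotype that differs from the dominant one by a single indel inside a long homopolymer, can be flagged with a run-length comparison. The sketch below only illustrates that check; it is not the correction procedure of KEC, ET or SHORAH. The function names and the `min_run` threshold of 3 are assumptions chosen for the example.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse a sequence into (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def homopolymer_indel(seq_a, seq_b, min_run=3):
    """Return (base, len_a, len_b) if the two sequences differ only by a
    one-base length change in a single homopolymer run of at least
    min_run bases; otherwise return None."""
    runs_a = run_length_encode(seq_a)
    runs_b = run_length_encode(seq_b)
    if len(runs_a) != len(runs_b):
        return None
    diffs = []
    for (base_a, len_a), (base_b, len_b) in zip(runs_a, runs_b):
        if base_a != base_b:
            return None  # a substitution, not a homopolymer length change
        if len_a != len_b:
            diffs.append((base_a, len_a, len_b))
    if len(diffs) != 1:
        return None
    _, len_a, len_b = diffs[0]
    if abs(len_a - len_b) == 1 and max(len_a, len_b) >= min_run:
        return diffs[0]
    return None


# Hypothetical haplotypes: a 7-base T homopolymer and its 6-base variant.
dominant = "ACGTTTTTTTAC"
variant = "ACGTTTTTTAC"
hit = homopolymer_indel(dominant, variant)
```

A check of this kind identifies candidate artifacts but cannot decide by itself whether the shorter or the longer run is the true sequence, which is why a calibration control or a frequency model is still needed.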