A first step in the analysis of next-generation sequencing (NGS) data is the transformation of the measured intensity signals to a sequence of nucleotides. This process, referred to as base-calling, is an important task, as systematic base-calling errors may mislead downstream analysis
[
1], e.g. in genome assembly and sequence mapping. More accurate base-calling and more reliable base-calling quality scores result in a better distinction between sequencing errors and true polymorphisms between the base-called reads and a reference sequence. This is an essential merit in the detection of single nucleotide polymorphisms (SNPs) or sequence variants
[
2-
4]. A myriad of applications such as the characterization of HIV mutation spectra
[
5], the detection of somatic mutations in cancer
[
6], and the identification of operational taxonomic units in metagenomics
[
7] have the potential to benefit from improved base-calling and more informative quality scores.
The 454 Life Sciences system is based on the sequencing-by-synthesis principle. In each flow of the sequencing process, light produced by a pyrosequencing reaction is emitted if one or more identical nucleotides are incorporated into the DNA template. The addition of each of the 4 possible nucleotide solutions A, C, G or T occurs in a fixed and known order. Hence, 454 base-calling is a matter of discerning the number of incorporated nucleotides or homopolymer length (HPL) of a known nucleotide type from the measured intensity signal in each flow
[
8]. Consequently, the principal sources of 454 sequencing errors are insertion and deletion errors (indels). These are more frequent in sequences containing long homopolymers
[
9,
10], because the increase of intensity signal when more nucleotides are incoporated, attenuates at higher HPLs. This makes it harder to discriminate between subsequent homopolymer lengths, resulting in an inflation of undercalls or overcalls as the HPL increases (Additional file
1: Figure S1).
In the default 454 sequencing pipeline the raw intensities are preprocessed to flowgram values by correcting for the major error sources. These include spatial and read-specific effects such as the abundance of long homopolymers in a read
[
8,
11]. This preprocessing eliminates much obscuring noise, but removes some useful information as well (Figure
). In the next step the base-calling takes place, where, roughly speaking, the flowgram values are rounded to the nearest integer. Subsequently, a quality score is assigned to each called base. Quality score calculation in current 454 base-calling is solely based on the flowgram values without considering information from the preprocessing step
[
11]. An alternative method for quality score assignment has been proposed that focuses on the distribution of observed flowgram values for every possible HPL
[
4], but it does not account for additional error sources. A common feature is that such methods are designed as a next step in the pipeline after the base-calling is finished. Hence, the base-calling uncertainties inherent to the base-calling model or algorithm are not directly utilized in the construction of the quality scores.
A second shortcoming of the current base-callers is that they only produce quality scores in the
Phred format
[
12].
Phred scores can be interpreted in terms of the probability that the called base is not an overcall. Although well-known and widely used, they provide a measure for the quality of the base-call, but they lack additional information on whether there is an undercall or an overcall, and on how likely it is for other HPLs to be the correct call, instead of the HPL being called. This feature is particularly essential in 454 pyrosequencing. An example of overcalls of reference sequences with HPLref 3 shows that the distributions of 454 quality scores are nearly identical for the called bases associated with position 2, 3 and 4 in the homopolymer run (HPL 2, 3 and 4) (Figure
). Hence, these quality scores do not give any insight into whether it is more likely to have an undercall or an overcall, given that a base-calling error was made. This information would, however, be very useful in downstream tasks such as sequence alignment and sequence variant calling. Quite some methods have recently been proposed that consider quality scores to increase accuracies in these downstream analyses, e.g.
[
13-
15]. By adding more detailed information about the probabilities of having an undercall or overcall, these methods could be improved even further.
For these reasons we have developed HPCall, a general probabilistic framework that seamlessly integrates the base-calling with more informative quality score assignment. HPCall is based on the classification of the calls in groups representing the possible HPLs. To this end, a statistical model for count data is used that predicts the HPL in each sequencing cycle as a function of a number of explanatory variables. A particular property of the 454 pyrosequencing process is the high abundance of background intensities (HPL 0). Whenever a nucleotide flow does not match the nucleotide at the interrogated position of the DNA template, no nucleotide is incorporated and no sequencing reaction takes place. The resulting light intensity mainly reflects background optical noise. Consequently, there are more zero counts than expected for a Poisson distribution. Hurdle Poisson regression models
[
16] are one way to deal with these excess zeros.
For each possible HPL, the model produces an estimated probability that this HPL is truly present in the DNA sequence at the interrogated position. The called HPL is then the HPL with the largest estimated probability. The calculation of these probabilities allows the simultaneous construction of quality scores. These scores directly reflect the base-calling’s uncertainties and provide information about potential undercall or overcall errors. In the model we combine information of flowgrams and earlier-stage raw intensities. By including the raw intensities we employ the additional information otherwise removed by the preprocessing (Figure
), both for the base-calling and for the calculation of the quality scores. However, they are not strictly necessary for the method to provide valid results.