PANDAseq aligns each set of paired-end sequence reads in a three-step process. First, it determines the locations of the amplification primers, if they are specified and were sequenced. Then, it identifies the optimal overlap. Finally, it reconstructs the complete sequence, correcting any errors, and checks for various constraints, such as length and quality.
To score alignments, we calculate the probability that the true nucleotides,
, are the same, given the observed nucleotides, X
. We estimate this with the included quality information found in the Illumina reads. For each base, CASAVA provides an encoded quality score, which is the probability of the base being miscalled. This probability (ϵ
) is approximated by
is the ASCII quality value in the Illumina analysis pipeline versions before CASAVA 1.8 and A1
is the ASCII value used in CASAVA 1.8. [7
Assuming all nucleotides are equally likely (i.e., the prior probability that the true bases match is
), and that sequencing errors are independent and result in equiprobable choices over the other three nucleotides, the probability that the true bases match, given that the sequenced bases match, is:
and the probability that the true bases match, given the sequenced bases mismatch, is:
If one of the bases is an uncalled base, N, then the probability that the bases match is:
Using these probabilities, PANDAseq begins the assembly process by determining the positions of forward and reverse primers, if supplied. To accomplish this, the program finds the first offset, x, where the primer aligns. For a primer P and a sequence S, the program calculates
while assuming that
, which is the highest value score assigned by Illumina [8
] and, intuitively, assuming that
The program then finds the best overlap greater than a specified threshold for the forward and reverse sequences, F and R, respectively. If no suitable overlap is found, then the read pair is discarded. This is done for the entire read, even if there are primers to be removed, as it is possible for the overlap to be sufficiently long to be in the primer region. A schematic is shown in Figure .
The value of c
|)) is chosen which maximises this formula:
and the remainder is as above with e fixed at a value determined empirically to be the average error rate. This value of ϵ
was calculated by counting the mismatch rate in known index tags in a defined community data set (described below). This parameter need not be retuned as it is only an estimate of the error. Because the index read is short and sequenced earlier in the process, it likely has fewer errors and, therefore, its error rate should underestimate the true error rate. Regardless, the error rate specified for this step should not negatively affect the ability of PANDAseq to identify the best overlap for the forward and reverse reads.
Once the overlap is selected, the output sequence is constructed and an overall quality score is calculated. During this process, the primer regions are disregarded if primers were specified. The unpaired regions are copied from the available strands and the quality score for these regions is the product of the probability of those bases being correct. For the overlapping region, the decision-making process is more complex. If the bases agree, the base is included and the quality of this base is assumed to be
. If the bases disagree, the base with the higher quality score is chosen and the quality of this base is assumed to be
. If either or both bases are uncalled, they are considered to always match, noting that unassigned bases are always associated with the lowest quality score by CASAVA [8
In certain cases, the CASAVA pipeline masks the quality score at the end of the read, replacing all quality scores with the lowest quality score [8
]. In this case, special quality scoring is used by PANDAseq. If one base is masked, the probability of the other base is used if the bases match or uniformly random,
, is used if they do not match. If both are in the masked region, the quality is assumed to be uniformly random,
The constructed sequence can then be validated against user-specified criteria. The quality score assigned to the assembled sequence is the geometric mean of the quality scores calculated above, which compensates for the variable lengths of the final sequences. PANDAseq enables users to reject sequences based on low quality score, lengths that are too short or too long, or the presence of uncalled bases. A module system is also available within PANDAseq to allow more sophisticated validation of user sequences, such as verification of known secondary structure or conserved regions. Note that there is a detailed manual included with the software that describes example usage scenarios.