If a read is error-free then it can be decoded unambiguously and the calls from the two-base-encoding chemistry and ECC codes will agree. When an error occurs in any of the two-base-encoding chemistry rounds, it translates into multiple base miscalls. The structure of the ECC code allows such errors to be detected, recovered from, and, in certain circumstances, corrected. The syndrome equivalence classes are defined by the types of error that can occur and so, by examining them, we can classify the cases where errors can be corrected unambiguously and those where additional information is needed to resolve the ambiguity.
Since the elements of the syndrome consist solely of the summation of the observed and expected parity colors, there is a one-to-one correspondence between them and parity colors. An error in a block can only affect two elements of the syndrome, those corresponding to the two parity colors that overlap the block’s first and last elements. We refer to these as the upstream and downstream syndromes, respectively, so it is sufficient to concentrate on how errors occur in only one block. All statements we make about the error correcting properties of the encoding are on a per-block basis, so a code that can correct one error per block can correct multiple errors if spread between blocks. The final block only has a upstream syndrome and so has more limited error correction.
The value of the syndromes for all possible single-color errors that could occur in a block or its parity colors are shown in Table . For example, an error of +1 in the second color produces the syndrome βα, identical to that produced by an error of +1 in the third position. If a syndrome is unique (for example, syndrome ββ arises only from error + β in position
c5) then, assuming only a single error has occurred in that block or parity color, that error can be determined and so corrected. Note that while errors in parity color
p−, overlapping a block and the previous one, appear to have unique syndromes, this parity color is also
p + for the previous block and so the syndrome can actually be caused by multiple different errors.
The syndrome equivalence classes show the types of single-color error that cannot be disambiguated without additional information; a error has been detected but is not correctable. For example, the syndromes 0β, 0α and 01 can be generated by single errors at
p + , and so errors at these positions cannot be distinguished. The equivalence classes define the possible simple changes to the observed sequence that will produce a valid sequence that can be decoded into a string of bases, only one of which produces the correct sequence. When an error is wrongly corrected, the code sequence is changed in two places, the original error and the correction; the structure of the code ensures that the changes should be complementary (both ‘+1’ or both ‘ + α’, for example).
Using the inverse of the block generator matrix, the pattern of changes that are induced in base-space by miscorrections to the observed sequence can be derived. If errors occur and are miscorrected, such that the difference between the correct and observed values of the data colors is d
, then the pattern of differences p
induced in base-space is given by
. For example, an error of type + α
at the second position might be erroneously corrected by an error of type + α
at the third position, so the differences are 0αα
00 and the pattern induced in base-space is 00α
00 (the sum of the second and third rows of
multiplied by α
). If this occurred to the nucleotide sequence ATGCG then the sequence ATACG would result.
All patterns induced by single-color errors are listed in Table ; for example, an error of type + α
at the first position may be wrongly corrected at either the fourth position or the parity color, leading to the triple error 0ααα
0 or the quadruple error 0αααα
, respectively. Note that no pattern ever affects the first position of the block; the structure of the matrix
shows that this position can only be changed by errors in the second, third or fifth positions: single-color errors at the fifth position can be always be unambiguously corrected, and errors at the second or third positions result in either corrections or changes that cancel at the first position.
Patterns of error for the ECC generator
One notable feature of these more complex errors is that the error type is the same at all affected positions, a property that might help distinguish sequencing errors from genuine variants if the reads are later mapped to a reference. By adding the mapped read to the reference in GF4, the pattern should be evident if it was due to simple miscall. If the pattern is not evident, the differences are either due to sequence variants or multiple miscalls.
While the consideration of syndromes provides useful information about a code’s properties and syndrome decoding is computationally efficient, it is not a replacement for probabilistic methods of decoding since the latter incorporates the quality information about each call and can use it to make better decisions about correction. However, syndrome decoding techniques may still be of use in conjunction with the more computationally expensive probabilistic methods since the syndrome provides a quick test of whether the observed sequence is valid (no correction needed). Syndrome equivalence classes could also be used to restrict the paths through a trellis to a plausible set, providing a heuristic to speed up the dynamic programming algorithm to find the most probable decoding.
We have shown the type of errors that the ECC code can correct but have not yet addressed whether it is optimal. Before examining specific alternative codes, it is interesting to look at what can possibly be achieved and there are several mathematical results that restrict the performance of any code. Firstly the Hamming [10
], Johnson [11
] and Singleton [12
] bounds place an upper limit on the number of errors that a code can hope to detect or correct but codes meeting these bounds may not exist; the Gilbert–Varshamov bound [13
] is a lower limit on the performance of the best code that does exist.
For reads of 50 bases encoded into 60 letters (as with the ECC code), the lower bound guarantees that a code exists that can detect any two errors and correct any single error. The upper bounds show that no code can guarantee to correct more than three errors. For reads of length 75 bases encoded into 90 letters, then a code exists that can detect and correct any two errors but no code can guarantee to detect and correct more than four errors. There is no guarantee that a convolutional code can meet this bound and the two most common classes of codes that come close to attaining these bounds, Turbo codes [15
] and Low-Density Parity-Check codes [16
], require long-range dependencies between positions in the sequence and so cannot be implemented in any plausible sequencing chemistry.
Examining the syndromes in Table , we notice that, despite many possible single errors having ambiguous syndromes, not all syndromes are present: the three syndromes αβ, β1 and 1α do not occur. The missing syndromes are not truly unused, being generated by multiple errors, but the failure to use them to distinguish single errors suggests that there may exist alternative codes with a greater ability to correct single calling errors. The advantage would derive from using the extra syndromes to partition single-color errors more evenly into equivalence classes. Such alternative codes can be analysed using the same techniques as the ECC code.
Rather than consider all possible convolutional codes, we will focus our attention on a subset that satisfy some reasonable restrictions inspired by the reality of the SOLiD platform. It is desirable that any new chemistry would be backwards compatible with the two-base-encoding chemistry, meaning that the first five rounds of sequencing must use an unaltered two-base-encoding probe set. This backwards compatibility restriction is equivalent to requiring that a new code must have an unpunctured color stream.
While the number of rounds and probe sets could be varied, and probes of differing lengths could be used, to remain comparable to the current ECC chemistry we will focus only on chemistries where a single additional round (and so only one additional probe set) will be used and the probes remain based on pentamers. Analogous to the code structure shown in figure , alternative codes that can be implemented with a single additional round are punctured so only every fifth element of the second stream is produced.
By redefining the boundaries of the block structure, the number of different alternative codes that need to be considered can be further reduced: a code whose probe generator starts with one or more 0s is identical to one starting at the first non-zero element with the tail padded with zeros. The probe generator of any code that starts with α or β can be written as αW or βW, where W is the probe generator of a code starting with 1, and the linearity of convolutional codes ensures that the two codes have identical sets of code words: although the mapping between nucleotide sequence and encoded sequence will differ, the error correcting characteristics will be the same.
The block generator for alternative codes that satisfies all the restrictions, and its inverse, can be written as
p5 is the generator for the second set of probes, L is the matrix whose upper triangle consists of 0 and whose diagonal and lower triangular elements are all 1, x=Lp and y=
x5. The ECC code has the probe generator
p5=1β0β0. The inverse generator only exists when y is invertible (i.e. y≠0); when y is not invertible, blocks of colors cannot be individually inverted and the syndrome analysis is not applicable. Due to the structure of L, requiring y to be invertible is the same as the sum of the elements of the probe generator not being zero.
One further restriction will be placed on the probe generator of alternative codes: the ECC probe generator contains 0 at its fifth position and this is probably a consequence how accurately pentamers with differing final bases can be ligated to the sequence. Codes whose generators use the final position might theoretically have better error correction properties but the increased rate of calling errors, due to incorrect probes being ligated, may cancel any improvement their use may offer; consequently, we will require
There are 48 probe generators satisfying all the restrictions and the invertibility condition, of which the syndromes and equivalence class for two interesting alternatives are shown in Table . The first code has generator
p5=10β00 and has similar equivalence classes to the ECC code but only uses the first three positions of the generator. While the set of syndromes for single-color errors is also incomplete (syndromes 1β, α1 and βα are unused) and thus the error correcting properties will be similar to the ECC code, the shorter length of the generator means that calculations on the trellis, to determine the most probable decoding, can be carried out four times quicker.
Syndromes for alternative codes
The second alternative code shown in Table , with probe generator 1β010, uses all possible syndromes and potentially has better error correcting properties than the ECC code. Comparing the equivalence classes of the new code to those for the ECC code, the new code uses the extra syndromes to split the largest class into two. Whereas the ECC code is unable to distinguish errors at the first, fourth or parity positions, the new code can unambiguously correct errors at the first position and the ambiguity of fourth and parity position errors is reduced.
Table shows the error patterns induced by single-color errors for the two alternative codes considered. Note the reduced number of base-space errors relative to the ECC code (Table ).
Patterns of error for alternative codes
The theory described suggests that the code with probe generator 1β010 is more powerful than the ECC code but this does not necessarily mean that it performs better in practice since actual performance will depend on the distribution of different types of error and at which positions they occur. To help quantify the difference in performance between codes, they were compared on artificial data, simulated so that it had an error profile similar to real data but for which the original sequence of bases is known, the error profile being estimated by mapping to a SNP-corrected genome.
One million fragments were sampled uniformly from the positive strand of the genome of E. coli
DH10b with qualities being sampled from a real sequencing run of the same genome. The probability of generalised error was set to be equivalent to
, close to observed values for prepared samples. Three sets of reads were produced, using different codes on the same set of fragments to generate the ECC information, and sequence was called into both base-space and color-space using the Maximum A Posteriori (MAP) criterion. Reads were then mapped to an appropriately encoded reference using the BWA aligner [17
] with an edit distance of five. Results, in terms of the total proportion of reads mapping and the proportion mapping with a given number of errors, are shown in Table .
Number of errors for simulated data
Using the ECC information (probe generator 1β0β0) to help call reads makes a considerable improvement in both color-space and base-space. Without any correction, 77.1% of simulated reads map to the reference in color-space and only 38.4% of these are perfect, with a further 16.7% having one error. After correction, 48.6% of reads are perfect, with 10.2% containing one error. In base-space, the number of mapped reads increases from 47.2% to 64.3%. This large increase is due to correcting single-color errors that would otherwise induce multiple base errors and prevent the read from being mapped using the ‘five differences or fewer’ criterion.
All three codes perform better than uncorrected sequence in both color-space and base-space. There was little difference between their performance in absolute terms, although the code generated by 1β010 produced more error-free reads than the other two. This small absolute difference disguises a larger increase in the proportion of corrected reads: under the 1β0β0 code, 10.2% of reads were corrected to being perfect compared to 11.4% for the 1β010 code, an 11.8% relative improvement even though the absolute improvement is only 1.2%. The 10β00 code produces more mappable base-space sequence than either of the other codes and the reason for this may be in the pattern of errors in base-space that single-color errors make: comparing the patterns in Tables and shows that, for the two alternative codes, single-color errors predominantly cause a single base error when wrongfully corrected, rather than more complex errors, and so there are fewer base-space errors in total.
While the code generated by 1β010 does perform better than the ECC code, the improvement is not especially dramatic despite the syndrome equivalence classes suggesting superior error-correcting properties. The decoding algorithm uses soft information as well as the hard information from color calls when determining the most probable base at each position of decoded sequence and this is a possible explanation for the small difference between the performance of the codes. The equivalence classes narrow down the possible errors but, rather than randomly picking the correction, MAP decoding uses the quality information to guide the choice and the right correction might be picked the majority of the time even without this extra assistance.