To solve the DNA sequence comparison problem for encoded sequences, we follow a constructive approach. Given an encoded DNA sequence

*c *=

*c*_{1},...,

*c*_{n}, we wish to maximize the similarity between

*c *and some regular DNA sequence

*y *=

*y*_{1},...,

*y*_{m}, with the valid edit operators Σ. In this case the alphabet is {

*A, C, G, T*} corresponding to the bases in DNA, and the encoded alphabet is {

*0, 1, 2, 3*}. We assume the encoded sequence is composed of a two base encoding, referred to as colors, as well as assume a known start base

*p*, which is known in practice [

16,

17,

27]. The valid edit operators are:

1. A base substitution, which substitutes one base for another in the encoded sequence after decoding.

2. An insertion, which inserts a base into the encoded sequence after decoding.

3. A deletion, which deletes a base from the encoded sequence after decoding.

4. A color substitution, which substitutes one encoded color for another.

Operators 1–3 can be applied to base sequence and therefore we assume that all color substitutions are applied to the encoded sequence, then the sequence is decoded to allow the application of operators 1–3. We assign scores to each operator. The function Δ (

*B*_{1},

*B*_{2}) that returns the base substitution score for substituting base

*B*_{2 }for base

*B*_{1}. The score ρ is applied for the first insertion or deletion operator used. Any insertion or deletion operator that is applied so that the insertion or deletion is extended has a score

*ε*. Therefore, for a length

*g*>0 base insertion or deletion, the cost of the entire insertion or deletion is

*ρ *+

*ε *(

*g*-1) and has an average per-gap cost of (

*ρ *+

*ε *(

*g*-1))/

*g*. In practice, this affine gap penalty is useful to penalize a start of an insertion or deletion more heavily than extending the insertion or deletion. The function Π(

*C*_{1},

*C*_{2}) returns the color substitution score for substituting color

*C*_{2 }for color

*C*_{1}. The base and color substitutions functions are both symmetric, and are defined even if

*B*_{1 }=

*B*_{2 }for Δ, or

*C*_{1 }=

*C*_{2 }for Π. To decode an encoded sequence, we define the function Γ(

*B*,

*C*) that returns the decoded base using the encoded color

*C *and the previous base

*B *(see Figure ). For example, to decode the encoded sequence

*c *=

*c*_{1},...,

*c*_{n }with a known start base

*p*, we iteratively use Γ. The decoded sequence will be

*x*_{1 }= Γ(

*p*,

*c*_{1}),

*x*_{2 }= Γ(

*x*_{1},

*c*_{2}),...,

*x*_{n }= Γ(

*x*_{n-1},

*c*_{n}). To encode a sequence, we define the function Φ(

*B*_{1},

*B*_{2}) that returns a color using the bases

*B*_{1 }and

*B*_{2}, where

*B*_{1 }occurs before

*B*_{2 }in the sequence (see Figure ). For example, to encode DNA sequence

*x *=

*x*_{1},...,

*x*_{n}, we assume a known start base

*p *and iteratively use Φ to encode x. Here we have

*c*_{1 }= Φ(

*p*,

*x*_{1}),

*c*_{2 }= Φ(

*x*_{1},

*x*_{2}),...,

*c*_{n }= Φ(

*x*_{n-1},

*x*_{n}). This encoding function is analogous to the Klein Four Group under addition or the X-OR function when the colors and DNA are represented as binary numbers [

14,

15,

17]. The function Φ is used to encode the base sequence whereas the function Γ is used to decode the color sequence. To represent the transformation of

*x *into

*y*, we pair bases in

*x *with bases in

*y *as well as including dashes to indicate that an insertion or deletion occurred. If

*x*_{i }and

*y*_{j }are matched, then we pair

*x*_{i }and

*y*_{j }and draw:

. A deletion of a base in

*x *relative to

*y *is represented using a dash (-) and the base

*y*_{j}, and is drawn as:

. An insertion into

*x *relative to

*y *is represented using a dash and the base

*x*_{i}, and is drawn as:

. For example, for

*x *=

*GATTACA *and

*y *=

*GATACA*, a valid alignment may be:

. In this example, we apply three base substitution operators, one insertion operator, and then three base substitution operators. The base substitution operators do not change the bases in this example, but are defined for completeness when

*x*_{i }=

*y*_{j}. In this manner, we describe an alignment using the base substitution, insertion and deletion operators. To model encoding errors, we assume a two-base encoding scheme; therefore, the encoding can be visualized by placing the colors in between the bases assuming the starting base is an

*A*. For the reference sequence

*y*, we place colors of the encoded version of

*y *in between the bases of

*y*. Let

*c' *be the encoded DNA sequence resulting from applying all color substitution operators to

*c*. Below we place the colors of the encoded sequence

*c' *between the bases of the decoded version of c'. Finally we place the original encoded sequence

*c *below

*c'*. Given an encoded sequence

*c *=

*2030311 *and target DNA sequence

*y *=

*GATACA *a valid alignment may be:

. The placement of the color (in

*y*) within the insertion (relative to

*c*) is arbitrary since it is compared to the composition of the colors within insertion in

*c *as will be seen later. In the above alignment, the second color is changed using a color substitution, where the second color encodes for the first and second base. Without the color substitution, the alignment would be:

illustrating the necessity to model encoding errors.

Our goal is to transform

*x *into

*y *by maximizing the similarity score, thus maximizing sequence similarity. In practice,

*x *is an observed encoded sequence, and

*y *is a decoded target or reference sequence. We prefer to penalize applications of the edit operators where base substitutions or color substitutions occur. Therefore, for all

*B*_{1 }≠

*B*_{2 }and

*C*_{1 }≠

*C*_{2}, we assume that Δ(

*B*_{1},

*B*_{2}) ≤ 0, 0 ≤ Δ(

*B*_{1},

*B*_{1}),

*ε *≤ 0,

*ρ *≤ 0, Π(

*C*_{1},

*C*_{2}) ≤ 0 and 0 ≤ Π(

*C*_{1},

*C*_{1}). Furthermore, to avoid always placing an insertion, we must have that for any

*C*_{1 }that

*ε *+ Π(

*C*_{1},

*C*_{1}) ≤ 0 and

*ρ *+ Π(

*C*_{1},

*C*_{1}) ≤ 0. A subtle but important point is that two adjacent color substitutions in the encoded sequence in some cases are equivalent to a base substitution in-between the two colors. An example of this equivalence can be seen in the following two sub-alignments

and

. In practice we make the assumption that for any bases

*B*_{1},

*B*_{2},

,

*B*_{3 }with

*B*_{2 }≠

, and for any colors

*C*_{2},

,

*C*_{3},

with

*C*_{2 }≠

and

*C*_{3 }≠

such that Γ (

*B*_{1},

*C*_{2}) =

*B*_{2}, Γ (

*B*_{2},

*C*_{3}) =

*B*_{3},

,

: