We formulate the question of devising an optimal synthesis strategy in oligo microarrays as a combinatorial state space search problem. This computational formulation provides insight into the complexity of the problem and enables a range of discrete optimization and heuristic search solutions.
In this paper, we assume that the input to the optimization software is a collection of N oligo sequences of arbitrary length, which have been pre-selected in a probe selection process. For simplicity, we discuss the case of uniform length Kmers, but our framework is readily applicable to the more general case. An optimal synthesis strategy involves L cycles of synthesis where, in each cycle, a single and identical nucleotide is added to all unmasked oligos. The exact spatial location of each oligo on the array is not important, as long as it can be retrieved during actual synthesis. Therefore, we assume that the input to the optimization code is a list of oligos arranged in one dimension, such as shown in Figure A. We define a strategy for constructing Kmers in L cycles as an L long vector S, consisting of elements A, C, G and T. S[j] is the nucleotide added in cycle j. For instance, the vector [A,C,A,T,G] corresponds to using A, C, A, T and G in cycles 1–5, respectively. We define the height of each partially constructed oligo as the number of nucleotides that have been added thus far by the synthesis strategy. We define a frontier F of a partially synthesized array to be an integer vector of size N where F[i] is the height of the ith oligo thus far. For example, the frontier of the four oligos in Figure B is F = [3,1,1,2].
Figure 1 Formulation of synthesis strategy. (A) Four 3mer oligonucleotides. (B) Four partially constructed oligos, defining frontier F = [3,1,1,2]. (C) The synthesis strategy [A,C,G,T] can synthesize the array in four cycles, (more ...)
The ‘traditional’ way to create a chip of N
oligos, each of height K
, is to perform 4K
cycles of synthesis. After the addition of A, C, G and T to the appropriate spots of the array, all oligos will be one base long. We then proceed to synthesize layer two in four cycles and all oligos will be two bases long. We continue until the entire chip is synthesized in 4K
cycles. It is easy to observe that by a slight modification of the order of base addition (8
), we can expedite the above process. As a simplified example (Fig. C), we can synthesize an array of K
= 3 in four cycles using a modified synthesis strategy, compared to six cycles with the ‘traditional’ approach.
In order to introduce the optimization framework for masking, we need to measure the ‘work’ that has been accomplished after several cycles of synthesis. We therefore give two definitions of ‘frontier height’: (i) the min height of a frontier constructed after L cycles is the length of the shortest oligo [in Fig. B, min height (L) = 1]; (ii) the sum height of a frontier is the sum of the lengths of all oligos constructed so far [in Fig. B, sum height (L) = 3 + 1 + 1 + 2 = 7].
The objective optimization criterion we desire to minimize is the number of cycles required to create a frontier of min height = K, i.e., we seek the shortest length strategy vector that is sufficient to synthesize all oligos on the chip. It is obvious that the best possible strategy for a Kmer oligo chip is of length between K and 4K. It is easy to construct an example where the shortest strategy is of length 4K (Fig. A), although genomic sequences typically do not exhibit such extremely low complexity. In general, as the number of oligos on a chip grows, the length of the optimal strategy vector is expected to grow as well.
Figure 2 Example arrays that challenge synthesis strategies. (A) A worst-case scenario requiring 4K stages for Kmer synthesis. (B) The optimal solution for this chip requires 9 cycles whereas the greedy solution produces a 12-cycle strategy. However, the sum (more ...)