We designed a computer program for 3C PCR primer design that fulfills all aforementioned constraints. Our program can be used in two different modes: global mode, most useful for unbiased 3D structure modeling, and targeted mode, to be used for targeted detection of candidate interactors. Depending on the design setup specified by the user, some steps of the algorithm are omitted (like the TaqMan probe design, in case only upstream and downstream primers are required by the user, or the selection of homogeneously distributed restriction sites, in case a targeted screen for candidate interactors is performed). The program flow can be separated into three key steps:
1. Selection of homogeneously distributed restriction sites for the restriction enzyme specified by the user (in 'global mode' only)
2. Enumeration and validation of candidate primers in the vicinity of these restriction sites (depending on design task, two or three primers are designed around each restriction site)
3. Picking of the most-homogeneous set of primer tuples, one tuple for each restriction site.
The most important steps of our software can be found in the flowchart of Figure .
Figure 2 Primer Design Flowchart. A flowchart visualizing the main steps in our program. Starting with one sequence ('global mode') or several sequences ('targeted mode'), primers are enumerated and pairs are created. Thereafter, the most-homogeneous tuple set (more ...)
Selection of restriction sites
To select the optimal set of restriction sites for the global mode, the genomic region of interest is first divided into equally sized intervals (one for each primer pair to be designed). Second, each interval is scanned for the presence of restriction sites for the user-defined restriction enzyme. Third, within each interval, the restriction site with the least distance to the center of the interval is chosen. It should be noted that this step is omitted in case a targeted primer design is required by the user. Instead of automatic selection of restriction sites by our algorithm, the user has to provide distinct sequences. These sequences have to contain at least one restriction site each for the target enzyme. The number of sequences provided must correspond to the number of primer tuples to be designed.
For the set of restriction sites from step 1, or as provided by the user, all valid upstream, downstream and, if desired, hybridization probe primers within a scan window are enumerated, considering the user-defined parameters for GC-content, melting temperature and self alignment cut-offs. The scan window is restricted by the user-defined parameters for maximum and minimum distance of a primer to the restriction site and maximum and minimum primer length. The hybridization probe has to be located between the restriction site and the scan region for the upstream primer on the other (reverse) DNA strand and these two scan regions must not overlap. The reverse primer, of course, has to be also located on the other (reverse) DNA strand in a scan window of equal size than the forward primer scan window but downstream of the restriction site (see Figure ).
In order to avoid redundant computations, we precompute the alignments of all primer enumeration regions with itselves and the relevant combinations of cross-region alignments. Thereafter, we can extract a sub-alignment for each primer self- and pair-alignment (used during primer tuple picking) in linear time. Primer melting temperature calculation was done using the method of SantaLucia et al.
]. This method outperforms the frequently used Breslauer et al.
] in accuracy (data not shown) on a set of experimentally determined oligonucleotide melting temperatures [13
As already mentioned in the introduction, the 3C methodology generates four different ligation products for interacting DNA segments (see Figure ). For this reason, the 3C protocol only uses the upstream primers (and the TaqMan probes, if desired) of the primer pairs to quantify interaction frequencies. The downstream primers are only used to control digestion efficiencies before and after the restriction step. This is a critical test to assure proper digestion of each site to be studied.
A primer is considered valid if: all user-defined requirements are met, the primer does neither contain repetitive elements nor restriction sites within the binding region, and no mispriming is found in the genome under study. Additionally, the maximum length of mononucleotide repeats is restricted to 4 bases. Primer misprimings are checked in the target genome. The overall idea of our extended mispriming checks is to prune primers having misprimings, which are sufficiently close to any restriction site of the target enzyme, since these misprimings (after the re-ligation step) could amplify the wrong hybrid template along with the true positive one.
In case no valid primers can be found around any restriction site, a fall-back mechanism will repeat this enumeration procedure using the 'next best restriction site' if present. In global mode, this 'next best site' is chosen from the list of restriction sites for a sequence region. This list is ordered by increasing distance to interval mean. In targeted mode, the same strategy is used. Here, each interval corresponds to one sequence provided by the user. This procedure is iterated until at least one valid primer pair has been found for a restriction site of the respective interval/sequence or no restriction sites remain to be scanned in this interval/sequence. In case of an interval/sequence where no valid primer pairs can be found (meaning that at least one of the verified candidate primer lists is empty), the program terminates with an error message.
By ordering the restriction sites by increasing distance to the interval/sequence mean, we perform a directed search from the most-centered restriction site in each interval/sequence towards the least-centered.
This search strategy makes sense in 'global mode', given the fact that we are looking for primers which are most homogeneously distributed on the genomic region of interest. In 'targeted mode', this mechanism provides a fall-back but can be omitted by providing sequences containing only one restriction site each.
Primer Tuple Picking
In order to find the best set of primer tuples, we first have to create and screen primer tuples from the lists of valid primers for each restriction site to be studied. Each primer tuple consist of: one forward primer, one reverse primer and (if desired) a TaqMan probe (see Figure ). An acceptable candidate primer tuple to be used for screening has low intra-tuple alignment values to all other primers of the sample tuple (below user-defined threshold). Additionally, the difference in melting temperatures between forward and reverse primer has to be below the user-defined threshold and the difference between the melting temperature of the TaqMan probe (if desired) and the forward and reverse primers has to be above a user-defined threshold.
Using these lists of valid primer tuples, one for each restriction site to be studied, we 'only' have to choose the most homogeneous set of primer tuples containing one primer tuple per interval/restriction site. This indeed is the most demanding task in our primer-design process, since the exhaustive enumeration of all primer pair sets is O(mn) with m the average number of valid primer tuples per restriction site in an interval and n the number of intervals/sequences to be studied. The worst case number of valid primer tuples per restriction site is j * k * l with j and k being the number of valid forward and reverse primers respectively and l being the number of valid TaqMan probes. We made our software more scalable by avoiding an exhaustive enumeration of all primer tuple sets. We incrementally build primer tuple set candidates by adding the 'most suitable' primer tuple as defined by our scoring function (Figure ) at each step (see Figure ). This scoring function includes various parameters of the primer which are also frequently used by other primer design algorithms. Each parameter is weighted and weights are defined by the user - details can be found in the caption of Figure .
Figure 3 Scoring Function. The scoring function used to compare primers and primer tuples. Each component of the scoring function is weighted and weights are refinable by the user. The components are (in order of appearance): difference in melting temperature(s) (more ...)
Our strategy is to sort the array of primer lists by the number of primer tuples in decreasing order. We start with the largest list of valid primer tuples and create a new primer tuple set of size one for each tuple and extend each set by the 'most suitable' primer tuple from the second largest list and so forth. The 'most suitable' primer tuple is defined as the tuple with the most similar score to the average score of the current primer tuples set, while satisfying cross-tuple alignment constraints as defined by the user. The score of a single primer tuple here corresponds to the score between this primer tuple and a virtual optimal primer tuple satisfying all design constraints best.
While creating these candidate primer tuple sets, only the one having smallest score and highest homogeneity is retained after each set enumeration. This procedure can be executed in parallel by splitting the first list of primer tuples into x shares of approximate equal size and running a unique thread for each share. After all threads terminate, the set with lowest score and highest homogeneity is returned as the optimal set of primer tuples. It should be noted that this threading does not influence the outcome of our algorithm but significantly speeds up the screening phase.
In case no valid primer tuples can be created for one restriction site, this sequence region is refined and the 'next closest' restriction site (if existing) with respect to the interval mean is scanned for valid primers, and pairs are created and checked for validity.