Cloning of poly(CA)- and poly(A)-containing inserts

A primer {5′-TGAGGGAAGCTTCTG (CA)_{n} [or (A)_{n}] GCTAGTACTGCAGG-3′} and an overlapping primer (5′-CCA CAGGAATTCCTGCAGTACTAGC-3′) were obtained from Operon Technologies Inc., CA. The underlined nucleotides are complementary to one another. (CA)_{n} inserts ranged in size up to 14 repeats and (A)_{n} inserts up to 12 repeats. A control plasmid with the same insert lacking a microsatellite sequence was also constructed.

Single-stranded templates were converted into double-stranded inserts in 20 µl primer extension reactions containing 5 pmol each of template and primer, 7.8 U of T7 Sequenase enzyme (USB), 1× T7 Sequenase buffer, 0.2 mg/ml BSA and 200 µM each dNTP. The reaction was carried out at 37°C for 60 min, after which the enzyme was inactivated at 65°C for 30 min. *Eco*RI and *Hin*dIII sites were engineered into the primer extension product for ease of directional cloning into the vector pUC18, such that all the clones had identical orientation of the inserts. This also had the advantage of replacing the palindrome-rich poly-cloning site of the vector.

Two microliters of the primer extension reaction and 17 ng of vector pUC18 (Invitrogen Top 10™ Cloning Kit) were then added to a 20 µl restriction digestion reaction containing 1× Promega MULTI-CORE™ buffer, 6 U of *Eco*RI, 5 U of *Hin*dIII, 5 U of *Bam*HI restriction enzymes (*Bam*HI was added to digest the liberated cloning site) and 0.1 mg/ml BSA. The reaction was incubated at 37°C for 60 min and the restriction enzymes were inactivated for 30 min at 80°C.

Ten microliters of the digested sample was added to a 15 µl ligation reaction containing 1× T4 ligase buffer (Promega) and 3 U of T4 DNA ligase. The reaction was held at 15°C overnight after which 3 µl was used to transform competent *Escherichia coli* cells (Invitrogen Top 10™ Cloning Kit) according to the manufacturer’s protocol.

The transformants were plated on LB-agar plates containing 100 µg/ml ampicillin and 30 µg/ml X-Gal. The plates were incubated at 37°C for 12–16 h and white colonies were picked and grown overnight in liquid LB containing ampicillin. The inserts from individual plasmid clones were sequenced.

PCR

For single-molecule PCR, the plasmid clones were serially diluted to an average of approximately 0.5 molecules per microliter. Only 9% of the samples with a PCR product would have been expected to contain more than one initial starting template. One microliter of DNA dilution was amplified using two rounds of nested PCR. Fifty microliter PCRs consisted of 1× PCR buffer (10 mM Tris–HCl pH 8.3, 50 mM KCl, 0.01 mg/ml gelatin), 2.5 mM MgCl_{2}, 100 µM each dNTP and 4 pmol each of forward primer (5′-CGGCATCAGAGCAGATTGTA-3′) and reverse primer (5′-GCGTTGGCCGATTCATTAA-3′). The primers are complementary to pUC18 sequences. The 5′ end of one primer is 31 bp away from the beginning of the *Hin*dIII site of the insert, while the 5′ end of the other is 70 bp away from the end of the *Eco*RI site of the insert. First-round PCR conditions were 95°C for 3 min, followed by 10 cycles of 95°C for 30 s and 63°C for 3 min, and 20 cycles of 95°C for 30 s and 63°C for 2 min. Final extension was at 72°C for 5 min. One microliter of the first-round product was further amplified in the second round using 8 pmol each of forward primer (5′-GTCACGACGTTGTAAAACGA-3′) and reverse primer (5′-GGCTCGTATGTTGTGTGGAA-3′). The reverse primer used in the second round was labeled at its 5′ end with the Beckman CEQ WellRED Dye D4 (blue). The second-round amplification conditions were 95°C for 3 min, followed by 30 cycles of 95°C for 30 s, 62°C for 30 s and 72°C for 30 s. The usual final extension step at 72°C for 5–10 min was omitted in order to minimize non-templated nucleotide addition by *Taq* polymerase. Amplification of the control plasmid gave rise to a 132 bp PCR product. The size of PCR products from plasmids with microsatellite markers was equal to 132 bp + the number of (A)_{n} repeats or, in the case of (CA)_{n} tracts, 132 bp + two times the number of repeats. The PCR products were resolved on a Beckman CEQ2000 denaturing microcapillary electrophoresis system. 
Molecular sizes (in nucleotides) and peak areas of all bands were collected for each of the clones. Clones without any repeat units in the insert were used as controls.
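Two pieces of arithmetic in this section can be sketched in a few lines of Python. The Poisson model for the number of templates per aliquot is an assumption made here for illustration (the text does not name the distribution); under it, a mean of 0.5 molecules per reaction gives a probability of about 0.09 for two or more starting templates. The function names are hypothetical.

```python
import math

def prob_at_least(k_min: int, mean: float) -> float:
    """P(K >= k_min) for a Poisson-distributed template count K (assumed model)."""
    below = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below

def expected_product_size(repeats: int, unit_len: int, control_bp: int = 132) -> int:
    """Second-round product size: the 132 bp control product plus the repeat tract."""
    return control_bp + repeats * unit_len

# 0.5 molecules/µl x 1 µl of dilution per reaction
mean_templates = 0.5 * 1.0
p_multi = prob_at_least(2, mean_templates)
print(f"P(two or more templates) = {p_multi:.3f}")

size_ca14 = expected_product_size(14, unit_len=2)  # (CA)14: dinucleotide repeat
size_a12 = expected_product_size(12, unit_len=1)   # (A)12: mononucleotide repeat
print(size_ca14, size_a12)
```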

In a second set of experiments, a single round of PCR was performed starting with 100 or 1000 molecules of each clone for 40, 50 and 60 cycles using the second-round PCR conditions described above.

Kinetic PCR

Kinetic PCR (kt-PCR) (18) was used to calculate PCR efficiency (Perkin-Elmer 5700) in the second round using 1 µl of first-round product (initiated with a single target molecule). 5(6)Carboxy–X-rhodamine (2 µM) and 0.2× Sybr Green I were added to 50 µl of a second-round PCR mix. Real-time relative dye fluorescence intensities were used to estimate the efficiency (λ) of the PCR between PCR cycles *n*1 and *n*2 (*n*2 > *n*1) according to the equation:

Rf_{n2} = Rf_{n1} × (1 + λ)^{n2–n1}

that is,

λ = (Rf_{n2} / Rf_{n1})^{1 / (n2–n1)} – 1

where Rf_{n2} is the relative dye fluorescence intensity value at the PCR cycle *n*2, and Rf_{n1} is the relative fluorescence intensity value at the PCR cycle *n*1. In reactions with a known number of starting templates, the efficiency values were obtained for various intervals along the PCR curve. These efficiency values were then used in the mathematical model for calculating mutation rates/cycle for templates with different number of repeats.
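As a minimal sketch, solving Rf_{n2} = Rf_{n1} × (1 + λ)^{n2–n1} for λ can be written as a small function. The function name and the fluorescence readings below are hypothetical and chosen only so the result is easy to check by hand.

```python
def pcr_efficiency(rf_n1: float, rf_n2: float, n1: int, n2: int) -> float:
    """Average per-cycle efficiency between cycles n1 and n2 (n2 > n1)."""
    if n2 <= n1:
        raise ValueError("n2 must be greater than n1")
    if rf_n1 <= 0 or rf_n2 <= 0:
        raise ValueError("fluorescence values must be positive")
    # Invert Rf_n2 = Rf_n1 * (1 + lambda)**(n2 - n1)
    return (rf_n2 / rf_n1) ** (1.0 / (n2 - n1)) - 1.0

# Hypothetical readings: a 16-fold rise over 4 cycles is perfect doubling
eff_doubling = pcr_efficiency(0.10, 1.60, n1=20, n2=24)
# Hypothetical late-cycle readings: no increase means zero efficiency
eff_plateau = pcr_efficiency(5.0, 5.0, n1=38, n2=40)
print(round(eff_doubling, 6), round(eff_plateau, 6))
```

An efficiency of 1 corresponds to every template being copied in each cycle, and 0 to no amplification over the interval.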

A mathematical model and quasi-maximum likelihood estimation method

We consider a mathematical model for PCR similar to that described by Sun (19). During the *n*th PCR cycle, each template generates a new copy with probability λ_{n}. λ_{n} is referred to as the efficiency of PCR at the *n*th PCR cycle. During the copying process the newly synthesized copy, but not the template, can undergo a mutation.

Let *S*(*n*) be the expected total number of templates after *n* PCR cycles. We have:

*S*(*n*) = (1 + λ_{n})*S*(*n* – 1)  **1**

Next, we consider the mutation process. It is observed from the experiment that the mutation rate of a template increases with the number of repeat units. We do not assume any specific relationship between the mutation rate and the number of repeat units. Let µ_{j} be the mutation rate for a template with ‘*j*’ number of repeat units. We assume that when a mutation occurs, the probability that it is an expansion is *e* and the probability that it is a contraction is 1 – *e*.

With the above model, the expected number of template molecules with *j* repeat units after *n* – 1 PCR cycles, *S*_{j}(*n*), satisfies the following recursive equation:

*S*_{j}(*n* + 1) = *S*_{j}(*n*) + *S*_{j}(*n*)λ_{n}(1 – µ_{j}) + *S*_{j – 1}(*n*)λ_{n}µ_{j – 1}*e* + *S*_{j + 1}(*n*)λ_{n}µ_{j + 1}(1 – *e*)  **2**

The above equation can be explained as follows. The template molecules with *j* repeat units after *n* PCR cycles are composed of four sets of molecules: (i) those with *j* repeat units after the (*n* – 1)-st PCR cycle [*S*_{j}(*n*)], (ii) newly generated templates from parent molecules of *j* repeat units with no mutations [*S*_{j}(*n*)λ_{n}(1 – µ_{j})], (iii) newly generated templates from parent molecules of *j* – 1 repeat units with one repeat unit added [*S*_{j – 1}(*n*)λ_{n}µ_{j – 1}*e*], and (iv) newly generated templates from parent molecules of *j* + 1 repeat units with one repeat unit deleted [*S*_{j + 1}(*n*)λ_{n}µ_{j + 1}(1 – *e*)]. Let *f*_{j}(*n*) be the fraction of molecules with *j* repeats after *n* PCR cycles. Then *f*_{j}(*n*) can be approximated by *S*_{j}(*n*) / *E*[*S*(*n*)]. From equations **1** and **2**, *f*_{j}(*n*) satisfies the following recursive equation:

*f*_{j}(*n* + 1) = {*f*_{j}(*n*)[1 + λ_{n}(1 – µ_{j})] + *f*_{j – 1}(*n*)λ_{n}µ_{j – 1}*e* + *f*_{j + 1}(*n*)λ_{n}µ_{j + 1}(1 – *e*)} / (1 + λ_{n})  **3**

For given efficiencies at different PCR cycles, we can calculate the values of *f*_{j}(*n*) at any values of (µ_{α}, µ_{α + 1}, …, µ_{β}, *e*) = (µ, *e*), where α and β are the lower and upper range of the number of repeat units for PCR products of any given template and µ is the vector (µ_{α}, µ_{α + 1}, …, µ_{β}).
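The calculation of the predicted fractions can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each term of equation **2** is divided by the total number of templates, which grows by a factor (1 + λ_{n}) in the same cycle. The function name, the efficiency values and the mutation rates are hypothetical. Mutations that would leave the range α to β are simply dropped, so the fractions can sum to slightly less than 1 if the range is too narrow.

```python
def stutter_fractions(start_repeats, efficiencies, mu, e):
    """Expected fraction of molecules per repeat count after the listed cycles.

    start_repeats: repeat count of the starting template (fraction 1 initially)
    efficiencies:  per-cycle efficiencies lambda_n, one entry per PCR cycle
    mu:            dict {repeat count j: mutation rate per newly made copy},
                   covering a contiguous range alpha..beta
    e:             probability that a mutation is an expansion
    """
    alpha, beta = min(mu), max(mu)
    f = {j: 0.0 for j in range(alpha, beta + 1)}
    f[start_repeats] = 1.0
    for lam in efficiencies:
        nxt = {}
        for j in f:
            # old molecules plus unmutated new copies of j-repeat templates
            unchanged = f[j] * (1.0 + lam * (1.0 - mu[j]))
            # new copies of (j - 1)-repeat templates that gained one unit
            expanded = f.get(j - 1, 0.0) * lam * mu.get(j - 1, 0.0) * e
            # new copies of (j + 1)-repeat templates that lost one unit
            contracted = f.get(j + 1, 0.0) * lam * mu.get(j + 1, 0.0) * (1.0 - e)
            nxt[j] = (unchanged + expanded + contracted) / (1.0 + lam)
        f = nxt
    return f

# Hypothetical inputs: constant efficiency, mutation rate rising with repeat count
mu_demo = {j: 0.002 * j for j in range(3, 10)}
fractions = stutter_fractions(6, [0.9] * 30, mu_demo, e=0.1)
for j in sorted(fractions):
    print(j, round(fractions[j], 4))
```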

Let *o*_{j}^{(i)} be the observed fraction of molecules with *j* repeats in the *i*th PCR experiment, *i* ∈ *I* and *j* ∈ *J*, where *I* is the set of all the experiments and *J* is the set of repeat units of interest. Let *f*_{j}^{(i)} be the predicted fraction of molecules using equation **3**, for *i* ∈ *I* and *j* ∈ *J*. The quasi-likelihood *L*(µ, *e*) is then defined by:

*L*(µ, *e*) = ∏_{i ∈ I} ∏_{j ∈ J} (*f*_{j}^{(i)})^{o_{j}^{(i)}}  **4**

µ and *e* were estimated by maximizing *L*(µ, *e*). Note that *L*(µ, *e*) is not the true likelihood but rather a quasi-likelihood of the observed data, because the branching process of mutations during PCR creates a dependency among the different sized PCR products. There are β – α + 1 mutation rates (µ_{α}, µ_{α + 1}, …, µ_{β}) and one expansion probability *e*, for a total of β – α + 2 parameters to be estimated. Due to the high dimension of the parameter space, we use the Kiefer-Wolfowitz (20) algorithm to locate the maximum point of *L*(µ, *e*). Theoretical studies have shown that the above approach can accurately estimate the mutation rates as well as the expansion probabilities if the number of PCR cycles is greater than 40 (Y. Lai, D. Shinde, N. Arnheim and F. Sun, unpublished results).

For a simple example of how equation **4** (above) is evaluated, suppose we have experimental data only on (CA)_{6} and (CA)_{8}. For (CA)_{6}, the fractions of PCR products with five and six repeats are 17 and 83%, respectively. For (CA)_{8}, the fractions of products with six, seven and eight repeats are 3, 20 and 77%, respectively. The goal is to use the stutter pattern frequencies to estimate the mutation rate/template/cycle for all possible templates from (CA)_{5} to (CA)_{8}. We use the quasi-likelihood function:

*L*(µ_{5},µ_{6},µ_{7},µ_{8},*e*)=(*f*_{5}^{(1)})^{0.17}(*f*_{6}^{(1)})^{0.83}(*f*_{6}^{(2)})^{0.03}(*f*_{7}^{(2)})^{0.20}(*f*_{8}^{(2)})^{0.77}

where *f*_{5}^{(1)} and *f*_{6}^{(1)} are given by the recursive formula **3** with initial value *f*_{6}^{(1)} = 1, and *f*_{6}^{(2)}, *f*_{7}^{(2)} and *f*_{8}^{(2)} are given by the recursive formula **3** with initial value *f*_{8}^{(2)} = 1.
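Evaluating this function for the two-template example can be sketched as below, working on the log scale for numerical convenience. The observed fractions are those quoted above. The predicted fractions in `predicted_off` are hypothetical placeholders standing in for the output of the recursion at some trial parameter values, and the function name is an assumption.

```python
import math

# Observed stutter fractions, keyed by experiment index i, then repeat count j
observed = {
    1: {5: 0.17, 6: 0.83},           # (CA)6 template
    2: {6: 0.03, 7: 0.20, 8: 0.77},  # (CA)8 template
}

def log_quasi_likelihood(observed, predicted):
    """log L = sum over experiments i and repeat counts j of o_j^(i) * log f_j^(i)."""
    total = 0.0
    for i, fractions in observed.items():
        for j, o in fractions.items():
            total += o * math.log(predicted[i][j])
    return total

# Hypothetical predictions: one set that matches the data, one that does not
predicted_match = observed
predicted_off = {
    1: {5: 0.30, 6: 0.70},
    2: {6: 0.10, 7: 0.30, 8: 0.60},
}

ll_match = log_quasi_likelihood(observed, predicted_match)
ll_off = log_quasi_likelihood(observed, predicted_off)
print(ll_match > ll_off)
```

Predictions that reproduce the observed fractions give the larger value, which is the property that makes maximizing the function a sensible way to fit the mutation rates.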

There are a total of five parameters to be estimated and we can use the Kiefer-Wolfowitz algorithm (20) to find the maximum point of *L*(µ_{5}, µ_{6}, µ_{7}, µ_{8}, *e*).
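A Kiefer-Wolfowitz style search climbs an objective using finite-difference slope estimates with shrinking step and probe sizes. The sketch below is a generic illustration, not the settings used in this study: the gain sequences a/k and c/k^{1/3}, the constants, the bounds that keep each parameter inside (0, 1), and the toy objective are all assumptions. In practice the objective would be the log quasi-likelihood as a function of (µ, *e*).

```python
def kiefer_wolfowitz(objective, theta0, n_iter=200, a=0.5, c=0.05,
                     lower=1e-6, upper=1.0 - 1e-6):
    """Maximize objective(theta) by finite-difference stochastic approximation."""
    theta = list(theta0)
    for k in range(1, n_iter + 1):
        a_k = a / k                  # step size, shrinking with iteration
        c_k = c / k ** (1.0 / 3.0)   # probe width, shrinking more slowly
        slopes = []
        for d in range(len(theta)):
            up, down = theta.copy(), theta.copy()
            up[d] = min(upper, theta[d] + c_k)
            down[d] = max(lower, theta[d] - c_k)
            # two-sided difference quotient along coordinate d
            slopes.append((objective(up) - objective(down)) / (up[d] - down[d]))
        # move uphill, then clip back into the allowed box
        theta = [min(upper, max(lower, t + a_k * s)) for t, s in zip(theta, slopes)]
    return theta

# Toy concave objective with a known maximum at (0.3, 0.6); hypothetical stand-in
def toy_objective(theta):
    x, y = theta
    return -(x - 0.3) ** 2 - 2.0 * (y - 0.6) ** 2

best = kiefer_wolfowitz(toy_objective, [0.5, 0.5])
print([round(v, 4) for v in best])
```

Each iteration costs two objective evaluations per parameter, so the number of evaluations grows linearly with the dimension of the parameter space.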

We first analyzed the experimental data assuming that all the mutation rates (for instance µ_{5}, µ_{6}, µ_{7}, µ_{8} in the above example) are independent. When the whole data set was examined we found that this analysis yielded an obvious linear relationship between the mutation rates and the number of repeat units. We therefore assumed µ_{j} = *aj* + *b* and estimated *a* and *b* using the quasi-likelihood approach (above). The theoretical basis of the computational method and simulation studies to test the validity of this approach will be published elsewhere (Y. Lai, D. Shinde, N. Arnheim and F. Sun, unpublished results).