The ruler array protocol generates a population of labeled DNA fragments in which the probability that the population contains a specific sequence is inversely related to that sequence’s physical distance from a selected restriction site (). When the labeled material is hybridized to a microarray, probe sequences proximal to restriction sites yield correspondingly higher intensities than distal probe sequences (). The observed intensity falloff is roughly log-linear and consistent with a model in which the extension terminates with equal probability at each base (Figure S1). The protocol generates this population of fragments by first digesting a genomic sample with a restriction enzyme and ligating an adapter to the resulting ends. Polymerase extensions are then initiated from a primer that is complementary to the adapter. Because the polymerase terminates stochastically as it approaches the limits of its processivity, the extensions produce many copies of sequence proximal to the adapter but fewer copies of distal sequence.
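The equal-termination model described above can be illustrated with a short simulation. The following is a minimal sketch, not the analysis code used in this work: the per-base stop probability `p_stop`, the number of extensions, and all function names are assumptions chosen for illustration. If each base ends the extension with probability `p_stop`, the chance of copying a probe at distance d is (1 - p_stop)^d, so the log of the expected signal falls linearly with distance.

```python
import math
import random


def expected_log_intensity(distance_bp, p_stop, log_i0=0.0):
    """Expected log-intensity for a probe `distance_bp` bases from the site.

    Under the model, the chance of reaching the probe is
    (1 - p_stop) ** distance_bp, which is linear in distance after the log.
    """
    return log_i0 + distance_bp * math.log(1.0 - p_stop)


def simulate_coverage(n_extensions, p_stop, max_bp, seed=0):
    """Count how many simulated extensions copy each base position."""
    rng = random.Random(seed)
    log_keep = math.log(1.0 - p_stop)
    stops = [0] * (max_bp + 1)
    for _ in range(n_extensions):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        # Inverse-transform sample of the number of bases copied before stopping.
        length = min(max_bp, int(math.log(u) / log_keep))
        stops[length] += 1
    coverage, reaching = [], n_extensions
    for pos in range(max_bp):
        reaching -= stops[pos]  # extensions of exactly `pos` bases stop short
        coverage.append(reaching)
    return coverage


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    coverage = simulate_coverage(50_000, p_stop=0.002, max_bp=1500)
    positions = list(range(0, 1000, 10))
    slope = fit_slope(positions, [math.log(coverage[p]) for p in positions])
    # The fitted slope should be close to log(1 - p_stop).
    print(slope, math.log(1.0 - 0.002))
```

The simulation ignores hybridization and labeling effects, so it only shows why a constant per-base termination probability would give a roughly straight line on a log scale.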
The ruler array method uses a digest-ligate-extend protocol to generate a labeled DNA sample to detect distances between genomic points.
Log-intensities from a ruler array experiment over part of chromosome VII demonstrate the log-linear decrease in observed array intensity as distance increases from the restriction site.
Comparing ruler array hybridization data from two genomic samples reveals differences between the corresponding genomes ( and S2). When a sequence is farther from the restriction site in one genome than in the other, the observed probe intensities beyond that sequence will be lower in the corresponding channel. Thus, a discontinuity in a line fitted to the intensities in one channel, together with the absence of a discontinuity in the other channel, indicates a sudden jump in the distance of the probes from the restriction site (). The intensity drop does not generally depend on the content of the insertion or deletion, only on the change in distance between genomic points.
Schematic ruler array probe intensities at an insertion (top) show a drop over the insertion site.
The ruler array analysis recognizes structural variants by fitting line segments to the microarray data and detecting differences in those segments between channels.
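The idea of fitting line segments per channel and looking for a discontinuity present in only one channel can be sketched as follows. This is an illustrative simplification under stated assumptions, not the analysis used here: it allows at most one break per region, uses an exhaustive least-squares search, and the names `best_break`, `channel_specific_jump`, and the threshold `min_jump` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, residual sum of squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, rss


def best_break(xs, ys, min_pts=4):
    """Try every split into two line segments; return (index, jump) of the best.

    `jump` is the right-segment minus left-segment prediction at the midpoint
    between the two probes flanking the split. Needs at least 2 * min_pts probes.
    """
    best = None
    for k in range(min_pts, len(xs) - min_pts + 1):
        sl, il, rss_l = fit_line(xs[:k], ys[:k])
        sr, ir, rss_r = fit_line(xs[k:], ys[k:])
        if best is None or rss_l + rss_r < best[0]:
            x_mid = 0.5 * (xs[k - 1] + xs[k])
            best = (rss_l + rss_r, k, (sr * x_mid + ir) - (sl * x_mid + il))
    return best[1], best[2]


def channel_specific_jump(xs, ys_a, ys_b, min_jump=0.5):
    """Report a break seen in exactly one channel, else None."""
    k_a, jump_a = best_break(xs, ys_a)
    k_b, jump_b = best_break(xs, ys_b)
    a_hit, b_hit = abs(jump_a) >= min_jump, abs(jump_b) >= min_jump
    if a_hit and not b_hit:
        return ("A", k_a, jump_a)
    if b_hit and not a_hit:
        return ("B", k_b, jump_b)
    return None
```

Under the log-linear model, dividing the reported jump by the fitted slope gives a rough estimate of the change in distance; for example, a jump of -1.0 with a slope of -0.002 per bp would correspond to about 500 bp of additional sequence in that channel's genome.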
We compared the genomes of the haploid S. cerevisiae yeast strains S288c and Σ1278b using ruler arrays, with strain-specific genome assemblies serving as a control. Ruler array performance was calculated by comparing ruler array variant predictions to two sets of assembly-derived predictions. During curation of the long-read Σ1278b assembly, we selected 106 apparent indels of more than 100 bp relative to S288c for PCR confirmation. These indels were identified by several alignment programs (FSA, Blast, Blat, custom code) and by manual inspection of the alignment results. Thirty-six of the 106 resulted in PCR gel bands whose length differed by roughly 100 bp or more, giving a false positive rate of 66% for the early Σ1278b assembly (Table S1 lists the confirmed changes). We detected a total of 114 additional indels between the genomes beyond the 106 selected for confirmation based on the final Σ1278b assembly.
Two ruler array replicates identified roughly 75% of the PCR-confirmed changes (28 and 25 of 36) and a smaller share (28 and 20 of 114) of the set of 100 bp changes. Because of noise and protocol variations between the replicates (such as the polymerase used), the two replicates discover similar but not identical sets of indels, and their intersection represents a set of high-quality calls. The two replicates also generated a number of false positive calls, that is, predictions that do not correspond to a change of more than 4 bp: 553 for the first replicate and 414 for the second.
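The percentages quoted above follow directly from the reported counts. The short calculation below is only a worked check of that arithmetic; the helper name `fraction` is an assumption, and no counts beyond those stated in the text are used.

```python
def fraction(part, whole):
    """Plain ratio, kept as a helper so each rate below reads like the text."""
    return part / whole


tested, confirmed = 106, 36

# Candidates from the early assembly that PCR did not confirm.
false_positive_rate = fraction(tested - confirmed, tested)

# Share of the 36 PCR-confirmed changes found by each replicate.
recall_confirmed = [fraction(hits, confirmed) for hits in (28, 25)]

# Share of the 114 additional changes found by each replicate.
recall_additional = [fraction(hits, 114) for hits in (28, 20)]

print(f"{false_positive_rate:.0%}")                # 66%
print([f"{r:.0%}" for r in recall_confirmed])      # ['78%', '69%']
print([f"{r:.0%}" for r in recall_additional])     # ['25%', '18%']
```

The two per-replicate values of 78% and 69% bracket the "roughly 75%" figure given for the PCR-confirmed set.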
We used a single replicate of an aCGH experiment between FY4 and Σ1278b to compare aCGH’s performance against that of the Ruler Array. The experimental protocol used the non-enzymatic ULS labeling system to avoid amplification or dye incorporation biases.
Our HMM analysis of the aCGH experiment produced 183 calls. Twelve appear incorrect given the two genome assemblies and 33 are confirmed by the assemblies. The remainder occur in repetitive regions (e.g. TY, sigma, tau, and delta elements) such that both the CGH data and the assembly are likely to be incorrect.
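The text does not specify the structure of the HMM, so the following is only a generic sketch of how an HMM can segment aCGH log-ratios into calls. Every modeling choice here is an assumption for illustration: three states (loss, neutral, gain), Gaussian emissions with fixed means and a shared standard deviation, a uniform start, and a single `p_stay` self-transition probability.

```python
import math

STATES = ("loss", "neutral", "gain")


def log_gauss(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi(log_ratios, means=(-1.0, 0.0, 1.0), sd=0.3, p_stay=0.99):
    """Most likely state index per probe for a non-empty list of log-ratios."""
    n_states = len(means)
    log_stay = math.log(p_stay)
    log_move = math.log((1.0 - p_stay) / (n_states - 1))
    score = [math.log(1.0 / n_states) + log_gauss(log_ratios[0], m, sd) for m in means]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for j, m in enumerate(means):
            cands = [score[i] + (log_stay if i == j else log_move)
                     for i in range(n_states)]
            best_i = max(range(n_states), key=cands.__getitem__)
            new_score.append(cands[best_i] + log_gauss(x, m, sd))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segment_calls(path, neutral=1):
    """Collapse runs of non-neutral states into (first, last, state name) calls."""
    calls, start = [], None
    for i, s in enumerate(path + [neutral]):  # sentinel closes a trailing run
        if start is not None and s != path[start]:
            calls.append((start, i - 1, STATES[path[start]]))
            start = None
        if start is None and s != neutral:
            start = i
    return calls
```

With these assumed parameters, a run of five probes at a log-ratio of -1.0 is reported as a single loss call, while one isolated low probe is absorbed into the neutral state, because the transition penalty outweighs the emission gain from a single probe.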
The aCGH experiment found 21 of the 35 “must-find” indels and missed the remaining 14. Thirteen of the 35 were originally added to our list of known indels because of the aCGH experiment, so their detection is not surprising. shows two examples of insertions that the aCGH experiment misses because there is no change in the unique probes surrounding the changes.
The ruler array can detect structural variants that array-CGH misses.
To compare the aCGH experiment to the Ruler Array experiment more accurately, we re-ran the analysis using only array probes with a unique genomic location; this excludes probes that map to TY or other repetitive elements. By including only unique probes, we know the location of any change that the aCGH experiment detects. On this input, the same HMM analysis produced only 18 calls and found 6 of the 35 “must-find” events.
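Restricting the analysis to uniquely placed probes amounts to a simple filter over probe alignments. The sketch below assumes a hypothetical input format, a dictionary from probe identifier to the list of genomic positions where the probe sequence aligns; neither the format nor the function name comes from this work.

```python
def unique_probes(probe_hits):
    """Keep probes that align to exactly one genomic location.

    `probe_hits` maps a probe id to a list of (chromosome, start) alignments.
    Returns a dict from probe id to its single location.
    """
    return {
        probe: hits[0]
        for probe, hits in probe_hits.items()
        if len(hits) == 1
    }


# Hypothetical example: the second probe lies in a repeated element.
example = {
    "probe_1": [("chrVII", 1200)],
    "probe_2": [("chrVII", 5600), ("chrII", 90210)],
}
print(unique_probes(example))  # {'probe_1': ('chrVII', 1200)}
```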
Our ruler array experiments comparing S288c to Σ1278b revealed non-uniform polymerase processivity at particular sequence elements. Poly A, AT, or AAT repeats, often found at transcription stop and start sites, sometimes caused rapid termination of the polymerase extension and a corresponding drop in observed intensity. In many cases, a small change in the length of such a repeat sequence leads to a discontinuity in the ruler array signal such as one might expect from a large insertion. Thus, we detect certain insertions and deletions as small as 2 bp when they occur in these repeats. Figure 6 shows two such examples. These repeats may also cause reduced signal in downstream sequence.
Figure 6. Two examples of Ruler Array data (S288c in red, Σ1278b in green) and genomic sequence demonstrating the impact of AT repeat length changes on polymerase processivity.
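One way to check whether a ruler array discontinuity coincides with a repeat length change is to locate poly A, AT, and AAT tracts in the two strain sequences and compare their lengths. The sketch below is an assumption-laden illustration rather than the procedure used here: it scans only the forward strand, the minimum tract length `min_bp` is arbitrary, and it pairs tracts by order, which presumes the two sequences cover the same locus and contain the same tracts.

```python
import re


def find_repeats(seq, units=("A", "AT", "AAT"), min_bp=8):
    """Return sorted (start, length, unit) tuples for tandem tracts in `seq`."""
    hits = []
    for unit in units:
        copies = -(-min_bp // len(unit))  # ceiling division
        pattern = f"(?:{unit}){{{copies},}}"
        for m in re.finditer(pattern, seq.upper()):
            hits.append((m.start(), m.end() - m.start(), unit))
    return sorted(hits)


def repeat_length_changes(seq_a, seq_b, **kwargs):
    """Length difference (b minus a) for tracts paired by order of appearance."""
    a = find_repeats(seq_a, **kwargs)
    b = find_repeats(seq_b, **kwargs)
    if [h[2] for h in a] != [h[2] for h in b]:
        raise ValueError("repeat tracts do not pair up; align the sequences first")
    return [(ha[0], ha[2], hb[1] - ha[1]) for ha, hb in zip(a, b)]


# Hypothetical sequences: the second carries one extra AT copy (2 bp).
locus_a = "GCGC" + "AT" * 6 + "GGCC"
locus_b = "GCGC" + "AT" * 7 + "GGCC"
print(repeat_length_changes(locus_a, locus_b))  # [(4, 'AT', 2)]
```

For real data the two sequences would first need to be aligned so that tracts are matched by position rather than by order.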