Curvature

Given three consecutive points A, B and C on a discrete curve, the curvature at B can be approximated by the inverse of the radius of the circle that goes through A, B and C. A kink should induce high curvature in a short portion of the minicircle double-helix axis. We analyzed the distribution of such curvature values in the reconstructed minicircles. Each minicircle provided 200 entries, one for the curvature measured at each of the 200 indexed points. The curvature distributions of the points belonging to the TATA circles and to the CAP circles are computed separately and compared in . For both sequences, the curvature distribution is peaked; the maximum corresponds to the curvature of a perfect 158 bp circle (0.12 nm^{−1}). The distribution of curvature is very similar for both TATA and CAP (). Note that the shape data have no reference that indicates the location of the TATA box or CAP (CRP) site sequences.
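The three-point curvature estimate can be illustrated with a short NumPy sketch. It is not the code used for the analysis; the function name `discrete_curvature` and the closed-curve indexing (the neighbours of the first point wrap around to the last) are assumptions made for illustration. The curvature of the circle through A, B and C is computed as 4 × (triangle area) / (product of the three side lengths).

```python
import numpy as np

def discrete_curvature(points):
    """Three-point curvature at every point of a closed discrete curve.

    points: (N, 3) array; point i uses neighbours i-1 and i+1 (indices wrap).
    Returns the inverse radius of the circle through each consecutive triple.
    """
    pts = np.asarray(points, dtype=float)
    prev_pts = np.roll(pts, 1, axis=0)    # A: previous point
    next_pts = np.roll(pts, -1, axis=0)   # C: next point
    u = pts - prev_pts                    # A -> B
    v = next_pts - pts                    # B -> C
    w = next_pts - prev_pts               # A -> C
    # |u x v| is twice the triangle area, so 4 * area = 2 * |u x v|.
    twice_area = np.linalg.norm(np.cross(u, v), axis=1)
    sides = (np.linalg.norm(u, axis=1)
             * np.linalg.norm(v, axis=1)
             * np.linalg.norm(w, axis=1))
    return 2.0 * twice_area / sides

# Usage: a perfect circle of diameter 17.1 nm sampled at 200 points.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
radius = 17.1 / 2.0
circle = np.column_stack([radius * np.cos(theta),
                          radius * np.sin(theta),
                          np.zeros_like(theta)])
kappa = discrete_curvature(circle)        # every entry equals 1 / radius
```

For the perfect circle every entry is 1/8.55 nm, about 0.12 nm^{−1}, which matches the position of the peak quoted above.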

Superposition of DNA minicircle shapes along their principal axes of inertia

shows axial paths of reconstructed minicircles that have been translated and rotated so that their centers of mass (assuming uniform mass density) and their principal axes of inertia coincide. Such a presentation allows us to visually compare many minicircle shapes at the same time. The resulting picture does not show a clear difference between the shapes of TATA (red) and CAP (blue) minicircles.
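A minimal sketch of such an alignment is given below, assuming that each discretized point carries equal mass. The function name `align_to_principal_axes` is hypothetical. For equal point masses the principal axes of inertia are the eigenvectors of the coordinate covariance matrix, so the sketch diagonalizes that matrix instead of forming the inertia tensor explicitly.

```python
import numpy as np

def align_to_principal_axes(points):
    """Move a curve to its center of mass and rotate it into its principal frame.

    points: (N, 3) array of equally weighted points (an assumption).
    Returns the (N, 3) coordinates in the principal-axis frame, with the
    direction of largest extent along the first coordinate.
    """
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)             # center of mass at the origin
    cov = centered.T @ centered / len(centered)   # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()                # reorder: largest spread first
    if np.linalg.det(axes) < 0:                   # keep a right-handed frame
        axes[:, -1] *= -1.0
    return centered @ axes
```

Each principal axis is defined only up to sign, so superposing many shapes would also need a convention for orienting the axes; this sketch only enforces a right-handed frame.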

The shape-distance for curves: minimum RMSD over all rigid-body motions, index shifts and curve orientations

Although curvature analysis and visualization did not reveal the presence of a kink in TATA in comparison to CAP minicircles, there may be a more subtle sequence-dependent shape pattern. Therefore, rather than looking for a particular shape, we designed a method to identify groups of similar shapes, and examined whether the sequence correlates with the groups. We first chose a distance for the determination of shape similarity, then we clustered the shapes according to their mutual similarities measured in terms of this distance.

Because we do not know the correspondence between the sequence and the curve in each image, we need to adapt the standard root mean square deviation (RMSD) minimization procedure, which is often used to compare the geometries of two solid objects, in order to estimate the similarity between two minicircle shapes. The standard method is as follows: for two ordered sets of *N* points **x** and **y**, the RMSD is the square root of the mean over *i* of the squared Euclidean distances between corresponding points **x**_{i} and **y**_{i}. Then, to eliminate rigid-body motions, one computes a 3 × 3 rotation matrix **R** and a translation vector **r** which, when applied to **x**, minimize the RMSD function defined in Equation 1, producing the best superposition of the two structures:

$$\mathrm{RMSD}(\mathbf{x}, \mathbf{y}) = \min_{\mathbf{R},\,\mathbf{r}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{R}\,\mathbf{x}_i + \mathbf{r} - \mathbf{y}_i \right\|^2} \qquad (1)$$

A Fortran 95 code given in (32) was used to compute this minimum RMSD.
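The rigid-body minimization can be sketched in a few lines of NumPy. This is not the Fortran 95 routine of (32); it uses the SVD-based Kabsch construction, a standard alternative that reaches the same minimum, and the function name `min_rmsd` is an assumption made for illustration. Centering both point sets handles the optimal translation, and the sign correction restricts the fit to proper rotations (no mirror images).

```python
import numpy as np

def min_rmsd(x, y):
    """Minimum RMSD between ordered point sets x and y, each of shape (N, 3),
    over all rigid-body rotations and translations applied to x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean(axis=0)                  # optimal translation: match centroids
    yc = y - y.mean(axis=0)
    u, _, vt = np.linalg.svd(xc.T @ yc)      # SVD of the 3 x 3 cross-covariance
    d = np.sign(np.linalg.det(u @ vt))       # -1 if the best fit would be a reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T    # optimal proper rotation
    diff = xc @ rot.T - yc                   # residuals after superposition
    return np.sqrt((diff ** 2).sum() / len(x))
```

Because reflections are excluded, a chiral shape and its mirror image keep a non-zero RMSD, which matters when comparing non-planar curves.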

Our shape-distance function is then defined in Equation 2 via minimization over all possible rigid-body rotations and translations in 3D, plus further minimizations over all shifts of the index (the variable δ) and over the two curve orientations, clockwise or counter-clockwise (the variable α):

$$d(\mathbf{x}, \mathbf{y}) = \min_{\alpha \in \{+1,\,-1\}} \; \min_{\delta \in \{1, \ldots, N\}} \; \min_{\mathbf{R},\,\mathbf{r}} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{R}\,\mathbf{x}_i + \mathbf{r} - \mathbf{y}_{\alpha i + \delta} \right\|^2} \qquad (2)$$

where **R** is a rotation matrix, **r** is a translation vector, and the index of **y** is taken modulo *N*. The additional minimization over δ is necessary in our case because we do not know which point of the discretized curve **y** should correspond to the first point of the curve **x**. However, if there is a common pattern between shapes of minicircles, a particular mapping of **x** onto **y** should give a minimal RMSD. The minimization over δ in Equation 2 allows all possible phasing differences in index to compete in the fit. Minimization over α recognizes that a given curve can be discretized with two distinct orientations. Except for particular symmetrical shapes, identical curves that happen to be discretized with opposite orientations cannot be perfectly superposed by standard RMSD.

As a matter of implementation, the additional minimizations in Equation 2 were achieved by calling the RMSD function given in (32) inside a Matlab loop for all possible shifts δ (δ = 1, …, 200 in our data, with the index of **y** to be understood *modulo* 200), and the two choices of α. The smallest RMSD value found in the loop defines the distance between the two shapes.
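The loop structure can be sketched as follows. This is a Python illustration rather than the Matlab and Fortran code described above; the names `min_rmsd` and `shape_distance` are assumptions, and the rigid-body step uses an SVD-based (Kabsch) superposition in place of the routine from (32). Reversing **y** plays the role of the orientation variable α and `np.roll` implements the index shift δ modulo *N*.

```python
import numpy as np

def min_rmsd(x, y):
    """Minimum RMSD over rotations and translations (SVD-based Kabsch fit)."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    u, _, vt = np.linalg.svd(xc.T @ yc)
    d = np.sign(np.linalg.det(u @ vt))           # exclude reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = xc @ rot.T - yc
    return np.sqrt((diff ** 2).sum() / len(x))

def shape_distance(x, y):
    """Smallest rigid-body RMSD over all index shifts and both orientations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best = np.inf
    for oriented in (y, y[::-1]):                # two orientations (alpha)
        for shift in range(len(y)):              # all index shifts (delta), modulo N
            best = min(best, min_rmsd(x, np.roll(oriented, shift, axis=0)))
    return best

# Usage with arbitrary synthetic points (not minicircle data): y is a copy of x
# that has been reversed, re-indexed, rotated and translated.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
c, s = np.cos(0.4), np.sin(0.4)
rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
y = np.roll(x[::-1], 17, axis=0) @ rot_z.T + np.array([1.0, -2.0, 3.0])
d_shape = shape_distance(x, y)   # close to zero: the correct mapping is found
d_plain = min_rmsd(x, y)         # clearly larger: indices do not correspond
```

For *N* = 200 the double loop performs 400 superpositions per pair of shapes, which stays cheap because each one only requires the SVD of a 3 × 3 matrix.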

Error of reconstruction measurements

To measure similarity or dissimilarity between different reconstructed minicircles it is important to determine the error of reconstruction and to see how much this error could affect the comparison between different reconstructed minicircles. In order to estimate the reconstruction error, we applied our distance function to two reconstructed shapes coming from the same image pair, but obtained by two different users of the reconstruction program. We computed the user error for six image pairs (). We find that the average error is 0.9 nm, with SD 0.3 nm.

Analysis of shape-distances with respect to TATA and CAP sequences

We analyzed a set of 95 distinct minicircles (64 TATA, 31 CAP) all reconstructed by the same user. We therefore have a set of 4465 (or 95 * 94/2) pairwise shape-distances. gives the normalized histograms, i.e. probability distributions of pairwise distances in three groups: TATA to TATA, CAP to CAP and TATA to CAP. The average shape-distances are 2.03 nm for TATA–TATA (SD 0.57 nm), 1.96 nm for CAP–CAP (SD 0.52 nm) and 1.98 nm for TATA–CAP (SD 0.55 nm). TATA–TATA and CAP–CAP shape-distances are not significantly smaller than TATA–CAP distances. Therefore, we do not observe increased shape similarity between minicircles with the same sequence.
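The bookkeeping behind the three groups can be sketched as below. The helper `group_pairwise` is hypothetical, and the zero matrix is only a placeholder standing in for the real shape-distance matrix; the sketch shows how the 4465 pairs split by sequence type, not any measured value.

```python
import numpy as np

def group_pairwise(dist, labels):
    """Split the upper triangle of a symmetric distance matrix by label pair.

    Returns a dict mapping e.g. 'TATA-TATA', 'CAP-CAP' and 'CAP-TATA' to
    arrays of the corresponding pairwise distances.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)     # each unordered pair once
    groups = {}
    for a, b, d in zip(labels[i], labels[j], dist[i, j]):
        key = "-".join(sorted((str(a), str(b))))
        groups.setdefault(key, []).append(d)
    return {key: np.array(vals) for key, vals in groups.items()}

# Placeholder matrix (all zeros) with the group sizes from the text.
labels = ["TATA"] * 64 + ["CAP"] * 31
groups = group_pairwise(np.zeros((95, 95)), labels)
counts = {key: len(vals) for key, vals in groups.items()}
# counts: 2016 TATA-TATA, 465 CAP-CAP and 1984 mixed pairs, 4465 in total
```

With a real distance matrix, `groups[key].mean()` and `groups[key].std()` would give the per-group averages and SDs of the kind quoted above.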

Shape clustering

We cannot use classical methods for clustering our shapes, as we do not have a sensible way to represent them as vectors in a multidimensional space. We also do not have reference shapes to build clusters. Accordingly, we adopt the reference-free SPIN algorithm (33), which is capable of ordering the elements of a set using only their pairwise distances. For an ordered list of shapes and a shape-distance function, there exists a unique shape-distance matrix defined as follows: each element (*i, j*) of the matrix is the shape-distance between minicircles *i* and *j*. By definition, the matrix is symmetric and the elements on the diagonal vanish; the *i*-th row (or column) is a list of the distances between minicircle *i* and all others. SPIN finds a permutation of an initial ordered list of shapes that minimizes the elements near the diagonal. If the resulting matrix has a block of low (dark blue) values near the diagonal, with comparatively higher values above and below (and therefore necessarily by symmetry to right and left), the shapes in the block can be considered a cluster. A SPIN sorted shape-distance matrix and the corresponding clusters are represented in . Three columns were added on the left of the matrix to show some properties of the shapes. Each row and each column of the matrix corresponds to a minicircle. For each row *i* of the matrix, the corresponding element *i* of the column ‘Minicircle type’ shows whether minicircle *i* is of type TATA (gray) or CAP (white). It is clear that the TATA and CAP minicircles are spread throughout each cluster. Similarly, the *i*-th element of the column ‘Circle’ (respectively ‘Ellipse’) shows the distance between minicircle *i* and a circle (respectively an ellipse). The circle diameter is 17.1 nm (corresponding to a perimeter of 158 bp). The longer ellipse axis is also 17.1 nm while the shorter axis is 13.7 nm. These two columns and the lower part of suggest that the method was able to identify clusters of circular and elliptical shapes, and to find another non-planar cluster. Stereo images of the cluster 7–15 are presented in .
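The reference circle and ellipse used for the ‘Circle’ and ‘Ellipse’ columns can be generated along the following lines. This is a sketch under assumptions: the helper names are hypothetical, and the choice of 200 points equally spaced in arc length mirrors the 200 indexed points of the reconstructed curves but is not stated for the reference shapes.

```python
import numpy as np

def resample_closed_curve(dense_points, n_points=200):
    """Pick n_points equally spaced in arc length along a densely sampled closed curve."""
    dense = np.asarray(dense_points, dtype=float)
    closed = np.vstack([dense, dense[:1]])               # repeat first point to close
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, arc[-1], n_points, endpoint=False)
    return np.column_stack([np.interp(targets, arc, closed[:, k])
                            for k in range(closed.shape[1])])

def planar_ellipse(long_axis, short_axis, n_points=200, oversample=50):
    """Planar ellipse with the given full axis lengths (nm), as an (n_points, 3) array."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points * oversample, endpoint=False)
    dense = np.column_stack([0.5 * long_axis * np.cos(t),
                             0.5 * short_axis * np.sin(t),
                             np.zeros_like(t)])
    return resample_closed_curve(dense, n_points)

ref_circle = planar_ellipse(17.1, 17.1)    # circle of diameter 17.1 nm
ref_ellipse = planar_ellipse(17.1, 13.7)   # axes of 17.1 nm and 13.7 nm
```

As a consistency check on the quoted diameter, π × 17.1 nm ≈ 53.7 nm, which equals 158 bp at the usual B-DNA rise of about 0.34 nm per base pair (the rise value is an assumption here, not taken from the text above).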

Interestingly, the distance matrix apparently reveals the presence of multiple clusters of shapes. It is known that DNA circles with non-uniform sequence have multiple local energy minima (34). For this reason, we believe that our clustering analysis detected sampling of at least two and possibly more energy wells in the configuration space. However, the small difference between the majority of the clusters (comparable with the error of the reconstruction method) warns against over-interpretation of the distance matrix data. Importantly, each detected cluster contains both TATA and CAP minicircles, so that the different clusters seem to be associated with the sequence-dependent features that are shared between the two sequences, e.g. the six phased A-tracts, rather than with the differences between TATA and CAP sequences. We therefore conclude that TATA and CAP sequences produce minicircles with similar 3D shapes.