In the previous sections we gave a qualitative description of the Misty Mountain algorithm. To help the reader follow the logic of the algorithm, we now discuss its key features in detail.
The main part of the program reads in the coordinates of the data points, creates an optimal histogram from the data, analyzes the consecutive cross sections of the histogram by calling two major routines, LABELING and ANALYZE, and finally outputs the result (see the flowchart in Figure 6). These major steps of the program are discussed below.
Figure 6 Flow chart of the main part of the Misty Mountain program. IDIM - dimension of the data space. N - the number of equidistant meshes along each coordinate for creating the optimal histogram. LEVELMAX - highest frequency of the histogram. LEVEL - frequency of the actual cross section.
By using the Bayesian framework, Knuth [35] proposed an optimal data-based binning for histograms. He derived the posterior probability, p, for the number of bins of similar shape given the data, d. If there is a similar number of bins, N, along each coordinate axis, the logarithm of the posterior probability is:

log p(N|d) = M log(N^D) + log Γ(N^D/2) − N^D log Γ(1/2) − log Γ(M + N^D/2) + Σ_{k=1}^{N^D} log Γ(n_k + 1/2) + constant,

where M is the number of data points, n_k is the number of data points in the k-th bin, and D is the dimension of the data space. The N that maximizes this probability is the optimal bin number along each coordinate axis. There are other optimal data-based binning methods, such as Wand's method [44]. We prefer using Knuth's method because its implementation is particularly easy for any dimension of data.
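As an illustration, the log posterior can be maximized by a brute-force scan over candidate bin numbers. The sketch below is a minimal one-dimensional Python version; the routine names are ours, not taken from the program:

```python
import math
import random

def log_posterior(data, n_bins, lo, hi):
    """Knuth's log posterior for n_bins equal-width bins on [lo, hi)."""
    m = len(data)
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in data:
        k = min(int((x - lo) / width), n_bins - 1)
        counts[k] += 1
    lp = (m * math.log(n_bins)
          + math.lgamma(n_bins / 2.0)
          - n_bins * math.lgamma(0.5)
          - math.lgamma(m + n_bins / 2.0))
    for nk in counts:
        lp += math.lgamma(nk + 0.5)
    return lp

def optimal_bin_number(data, n_max=50):
    """Return the bin number N that maximizes the log posterior."""
    lo, hi = min(data), max(data) + 1e-9
    return max(range(1, n_max + 1),
               key=lambda n: log_posterior(data, n, lo, hi))
```

For D-dimensional data the same formula applies with N^D total bins, which is why the method extends easily to any dimension.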
The LABELING routine
The LABELING routine separately analyzes each cross section of the histogram. As an example, let us consider a two-dimensional histogram (e.g. Figure ). Each cross section of the histogram is mathematically represented by an N×N square matrix whose (I,J)-th element is equal to 1 if the respective bin content is higher than LEVEL (the frequency of the actual cross section), and 0 otherwise. In this matrix the cross section of a peak appears as a group, or aggregate, of 1's. The aim of the aggregate-labeling algorithm is to assign the same positive integer to every element of an aggregate, while different aggregates are labeled by different integers. The N×N label matrix, NSTRA, is created in three steps.
Step 1. Initialization of the label matrix.
NSTRA(I,J) = NBIN if the content of the (I,J)-th bin is larger than the level of the cross section; otherwise NSTRA(I,J) = 0. Here NBIN = N×N is the number of bins. Set the aggregate counter to zero, i.e. IL = 0.
Step 2. First scanning of the label matrix.
Starting from the first matrix element, NSTRA(1,1) let us scan the matrix from left to right and from the top to the bottom. Let us change the values of the matrix elements according to the following rules:
a) If NSTRA(I, J) = 0 then it remains zero
b) If NSTRA(I, J) = NBIN, and the NSTRA(I-1, J-1), NSTRA(I-1, J), NSTRA(I-1, J+1) and NSTRA(I, J-1) matrix elements (if they exist) are all equal to 0, then first increase the value of IL by 1; second, change the value of NSTRA(I, J) from NBIN to IL; and finally set the IL-th element of the ICOUNT vector equal to IL, i.e. ICOUNT(IL) = IL.
c) If NSTRA(I, J) = NBIN, and any of the NSTRA(I-1, J-1), NSTRA(I-1, J), NSTRA(I-1, J+1) and NSTRA(I, J-1) matrix elements (if they exist) are not equal to 0, then we determine the proper aggregate label for these non-zero neighbor matrix elements by applying routine CLASSIFY (described in Step 3). Then we select the smallest of the proper labels, called JM, and we set NSTRA(I, J) = JM and ICOUNT(L) = JM for every proper label L found among these neighbors.
Step 3: Second scanning of the label matrix.
After the first scanning, an aggregate may have more than one label. During the second scan, we assign a single label to each element of an aggregate. In this scan the zero elements remain unchanged, while the new value of each nonzero element NSTRA(I, J) is determined by means of the following procedure, called CLASSIFY:
This simple procedure finds the smallest label among the labels of the aggregate in which the (I,J)-th bin is situated. The labeling routine is similar to the one used in percolation theory [36] for labeling spin clusters. The difference is that in Steps 2b and 2c of spin-cluster labeling, usually only the two nearest neighbors of the (I,J)-th matrix element, NSTRA(I-1,J) and NSTRA(I,J-1), are considered. In our algorithm we also consider the two next-nearest-neighbor matrix elements, NSTRA(I-1,J-1) and NSTRA(I-1,J+1). With this important modification, elongated slanted aggregates are properly labeled. The FORTRAN source code of the LABELING routine is able to label bin aggregates of any dimension.
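The three steps above can be sketched as follows. This is a minimal two-dimensional Python rendering for illustration; the published routine is FORTRAN and handles any dimension, so the names and details here are ours:

```python
def labeling(binary):
    """Label aggregates of 1's in a square 0/1 cross-section matrix."""
    n = len(binary)
    NBIN = n * n                      # Step 1: marker for unlabeled set bins
    nstra = [[NBIN if binary[i][j] else 0 for j in range(n)] for i in range(n)]
    icount = [0] * (NBIN + 1)         # label-equivalence bookkeeping vector
    il = 0                            # aggregate counter

    def classify(label):
        # CLASSIFY: follow the ICOUNT chain down to the proper (smallest) label
        while icount[label] != label:
            label = icount[label]
        return label

    # Step 2: first scan, left to right and top to bottom
    for i in range(n):
        for j in range(n):
            if nstra[i][j] == 0:
                continue
            # previously scanned neighbors, including the two next-nearest ones
            labels = set()
            for di, dj in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < n and nstra[a][b] != 0:
                    labels.add(classify(nstra[a][b]))
            if not labels:
                il += 1               # a new aggregate starts here
                nstra[i][j] = il
                icount[il] = il
            else:
                jm = min(labels)      # smallest proper label wins
                nstra[i][j] = jm
                for lab in labels:
                    icount[lab] = jm

    # Step 3: second scan, replace every label by its proper label
    for i in range(n):
        for j in range(n):
            if nstra[i][j]:
                nstra[i][j] = classify(nstra[i][j])
    return nstra
```

Note how the diagonal neighbors let a slanted chain of bins, e.g. (0,2), (1,1), (2,0), collect into a single aggregate, which plain nearest-neighbor spin-cluster labeling would split.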
The ANALYZE routine
This routine performs a comparative analysis of the actual and previous cross sections, and stores the largest but still separated aggregates of the major peaks. It also recognizes small noisy peaks and eliminates them from the analysis. The distinction between small and major peaks is explained below in Sec. Major and Small Peaks of the Histogram. The flowchart in Figure 7 shows the logic of the ANALYZE routine.
Figure 7 Flow chart of the ANALYZE routine of the Misty Mountain program. ISZM - the largest label of the aggregates in a cross section. (For simplicity this flowchart assumes that ISZM is also the number of aggregates. In reality ISZM is frequently larger than the number of aggregates.)
First we give a brief description of the flowchart in Figure 7. The cross sections of the histogram are created consecutively from the highest to the lowest level, i.e. from LEVELMAX to 0. In the grey region of the flow chart, aggregates emerging at LEVELMAX are handled. When a new aggregate appears at a lower level, the yellow part of the flow chart is active. The cyan colored part of the flow chart is active when a single peak that emerged at a previous level belongs to the aggregate. The rest of the flow chart is active when more than one peak belongs to the aggregate. The green colored part is active when, at the previous level, every peak was single and no more than one of them was a major peak. The red, purple and pink colored parts are active when, at the previous level, either not every peak was single or more than one single peak was major.
Now, in the rest of this section, a more detailed description of the flowchart (Figure 7) is given. At the highest level, one or more aggregates appear in the cross section. The counter of the peaks, NCL, is increased from zero, and the types and positions of the emerging peaks are registered in three vectors:
IIPOINT1(ICLU) - location (or characteristic position) of the emerging ICLU-th peak
IIPOINT2(ICLU) - label of the aggregate of the ICLU-th peak
IIPOINT3(ICLU) = 1 - type code of a single peak (the other two type codes are defined below). In the flow chart (Figure 7) these steps are highlighted in grey.
At every other cross section level the analysis of the labeled aggregates starts by creating the IHELP vector. IHELP(ILAB) is the number of characteristic peak positions that fall into the aggregate labeled by ILAB. There are three possibilities:
a) There is no characteristic peak position in the aggregate labeled by ILAB.
This signifies that a peak is just emerging. In this case NCL is increased by one, and the proper values are assigned to the NCL-th elements of the IIPOINT1, IIPOINT2 and IIPOINT3 vectors. In Figure 7 the respective part of the flow chart is highlighted in yellow.
b) There is one characteristic peak position in the aggregate labeled by ILAB.
This means that the respective peak, with code number ICLU, emerged in one of the previous cross sections and is still a single peak. Thus the peak type IIPOINT3(ICLU) remains equal to 1. If the peak remains single down to the lowest cross-section level (i.e. until LEVEL = 0), the aggregate belonging to this peak is copied into the final label matrix, NSTRF, where it is labeled by ICLU. In the NSTRF matrix we store the final result of our cluster analysis. In the flow chart (Figure 7) the above steps are highlighted in cyan.
c) There is more than one characteristic peak position in the aggregate labeled by ILAB.
This is the most important part of the algorithm that handles the merger of major peaks and the elimination of small noisy peaks. The counter of the peak positions falling into the aggregate is denoted by IP.
The analysis of the aggregate starts by creating the ISZVEC vector. The IP-th element of the vector refers to the type of the respective peak at the previous cross section: ISZVEC(IP) = -1 when the IP-th peak has merged with other peak(s), ISZVEC(IP) = 0 when the IP-th peak was a small single peak, and ISZVEC(IP) = 1 when the peak was a single major peak. The number of 1's in ISZVEC is denoted by IHIGH, while the number of 0's and 1's is denoted by INEW.
c1) If at the previous level every peak was single (INEW = IHELP(ILAB)) and no more than one of them was a major peak (IHIGH ≤ 1).
In this case the highest peak, encoded by ICLUm, is retained, while all the other peaks are eliminated, i.e. IIPOINT3(ICLUm) = 1 and IIPOINT3(ICLU) = 0 for all the remaining small peaks. In Figure 7 the respective part of the flow chart is highlighted in green.
This strategy is particularly useful at cross sections where a major peak appears. Frequently, as a first sign of a major peak, small nearby aggregates appear that merge at lower levels. This is also our usual strategy for retaining major peaks, while eliminating the frequently appearing small noisy peaks.
If LEVEL = 0 the aggregate belonging to the retained ICLUm-th peak is also copied into the final label matrix, NSTRF (see part of the flow chart highlighted by cyan).
c2) In all other cases, i.e. either not every peak was single at the previous level or more than one single peak was major, there are three options.
c21) Small and previously single peaks are eliminated, i.e. the value of the respective elements of IIPOINT3 vector change from 1 to 0. This part of the flow chart is highlighted by red.
c22) Major and previously single peaks become merged peaks, i.e. the value of the respective elements of IIPOINT3 vector change from 1 to -1. The aggregates belonging to these peaks at the previous cross section are copied into the final label matrix, NSTRF. The respective part of the flow chart is highlighted by pink.
c23) Handling of previously merged peaks is shown and highlighted by purple in the flow chart.
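The bookkeeping in c1) and c2) can be sketched as follows. This Python fragment is a simplified illustration of the type-code updates only; the actual routine also maintains IIPOINT1 and IIPOINT2 and copies aggregates into NSTRF:

```python
def update_peak_types(types, heights):
    """Update type codes of peaks whose aggregates merge at the current level.
    types[ip]: 1 = major single, 0 = small single, -1 = previously merged."""
    n = len(types)
    ihigh = sum(1 for t in types if t == 1)       # number of major single peaks
    inew = sum(1 for t in types if t in (0, 1))   # number of single peaks
    new_types = list(types)
    if inew == n and ihigh <= 1:
        # c1) every peak single, at most one major: keep only the highest peak
        keep = max(range(n), key=lambda ip: heights[ip])
        for ip in range(n):
            new_types[ip] = 1 if ip == keep else 0   # others are eliminated
    else:
        for ip in range(n):
            if types[ip] == 0:
                new_types[ip] = 0    # c21) small single peaks are eliminated
            elif types[ip] == 1:
                new_types[ip] = -1   # c22) major single peaks become merged
            # c23) previously merged peaks (-1) remain merged
    return new_types
```

The c1) branch implements the noise-suppression strategy described above: when several small single aggregates merge, only the highest survives.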
Major and Small Peaks of the Histogram
A histogram contains major peaks, such as the four peaks in Figure , and small peaks that are superimposed on the major peaks. A small peak is a consequence of the fluctuation of the number of data points in the respective bins. One can observe this fluctuation of bin contents by comparing the histograms of repeated experiments.
The fluctuation of the bin content can be estimated as follows.
First we point out that the content of each bin follows a binomial distribution. Let us assume that we measure n cells to create our FCM data set. The probability that the measured fluorescent intensities of a cell fall into the ε-th bin is p_ε. If the measurements on different cells are statistically independent events, the probability that out of n measurements the result of b measurements will fall into the ε-th bin and n − b measurements will fall outside the ε-th bin is:

P(b) = [n! / (b!(n − b)!)] p_ε^b (1 − p_ε)^(n−b)
This is the binomial distribution. If the mean bin content, ⟨b⟩, is larger than 10, the binomial distribution can be approximated by its limit, the Poisson distribution [45]. The mean of the Poisson distribution can be estimated by the average of the contents of the actual and the nearest-neighbor bins, ⟨b⟩ ≈ (1/(2D + 1)) Σ_i b_i, where the sum runs over the actual bin and its 2D nearest neighbors, while the standard deviation of the Poisson distribution is estimated by the square root of this average, σ ≈ √⟨b⟩.
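This estimate can be sketched as follows; the sparse-dictionary histogram representation and the function name are ours, for illustration:

```python
import math

def local_poisson_stats(hist, index):
    """Estimate the Poisson mean and standard deviation of a bin's content
    from the actual bin and its 2D nearest-neighbor bins.
    hist: dict mapping D-dimensional bin-index tuples to counts."""
    d = len(index)
    contents = [hist.get(index, 0)]
    for k in range(d):
        for step in (-1, 1):
            nb = list(index)
            nb[k] += step
            contents.append(hist.get(tuple(nb), 0))
    mean = sum(contents) / len(contents)   # average over 2D + 1 bins
    return mean, math.sqrt(mean)
```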
Every time two or more aggregates merge, we have to decide whether the merging aggregates belong to small and/or major peaks. A peak is considered major if

b_p − b_s > 2√⟨b⟩,

where b_p and b_s are the peak height and the height of the saddle between the merging peaks, respectively. On the other hand, a peak is considered small if

b_p − b_s ≤ 2√⟨b⟩.

When ⟨b⟩ ≤ 10, the Poisson approximation fails and the respective peak is always considered small.
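A minimal sketch of this decision follows. The function name and the exact 2√⟨b⟩ threshold are our assumptions, not taken from the source code; the point is only that the peak-saddle height difference is compared to the estimated bin-content fluctuation:

```python
import math

def is_major_peak(b_peak, b_saddle, local_mean):
    """Decide whether a merging peak is major (assumed threshold: 2*sqrt(mean))."""
    if local_mean <= 10:
        # Poisson approximation fails: always treat the peak as small
        return False
    return (b_peak - b_saddle) > 2.0 * math.sqrt(local_mean)
```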
Simulation of Data
In each simulated data set the data points follow a sum of regular or distorted-Gaussian distributions. As the first step of the simulation, the means and standard deviations of the distorted-Gaussians are arbitrarily or randomly assigned along each coordinate axis. Also, two coordinate axes are randomly selected for each distorted-Gaussian; these are the directions along which the Gaussian will be distorted. XIK, the K-th coordinate of a data point belonging to the I-th distorted-Gaussian, is simulated as follows:
where Δ and Δ1 are normal deviates generated by the Box-Muller method [46]. The remaining symbols in the formula are the K-th coordinate of the mean of the I-th distorted-Gaussian; SDIK, the standard deviation of the I-th distorted-Gaussian along the K-th axis; and the first and second axes, respectively, that are randomly selected for the I-th distorted-Gaussian. The parameter s scales the strength of the distortion. In our 2D and 10D simulations, s = 0.002 and s = 0.004 have been used. Note that by using the above procedure one can simulate a sum of regular Gaussian distributions by setting the distortion parameter s = 0.
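The sampling step can be sketched as follows. The Box-Muller step is standard [46], while the quadratic form of the distortion term and the function names are our assumptions for illustration:

```python
import math
import random

def box_muller():
    """One standard normal deviate via the Box-Muller method."""
    u1 = max(random.random(), 1e-12)  # avoid log(0)
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def simulate_point(mean, sd, distortion_axes, s):
    """Sample one point of a distorted-Gaussian.
    mean[k], sd[k]: per-axis mean and standard deviation;
    distortion_axes = (k1, k2): the two randomly selected axes;
    s: strength of the distortion (s = 0 gives a regular Gaussian).
    The quadratic coupling below is a hypothetical distortion form."""
    x = [mean[k] + sd[k] * box_muller() for k in range(len(mean))]
    k1, k2 = distortion_axes
    x[k2] += s * (x[k1] - mean[k1]) ** 2  # distortion term (assumption)
    return x
```

With s = 0 the distortion term vanishes and the procedure reduces to sampling a regular Gaussian, as noted above.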