We first present some simulations to highlight the possibilities of our methodology. They have been chosen to illustrate cases where BIC and ICL do not select the same number of components.

4.1 SIMULATED EXAMPLE WITH OVERLAPPING COMPONENTS

The data, shown in , were simulated from a two-dimensional Gaussian mixture. There are six components, four of which are axis-aligned with diagonal variance matrices (the four components of the two “crosses”), and two of which are not axis-aligned and so do not have diagonal variance matrices. There were 600 points, with mixing proportions 1/5 for each non-axis-aligned component, 1/5 for each of the upper-left cross components, and 1/10 for each of the lower-right cross components.
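A data-generating process of this shape can be sketched as follows. The component means, variances, and the rotation angle of the non-axis-aligned components are illustrative assumptions on our part, since the exact parameter values are not given here; only the component structure and mixing proportions follow the text.

```python
import numpy as np

def simulate_crosses(n=600, seed=0):
    """Simulate a six-component 2D Gaussian mixture: two non-axis-aligned
    components plus two 'crosses' of two diagonal components each.
    Means, variances, and rotation angle are illustrative guesses."""
    rng = np.random.default_rng(seed)
    # mixing proportions from the text: 1/5, 1/5, 1/5, 1/5, 1/10, 1/10
    props = np.array([0.2, 0.2, 0.2, 0.2, 0.1, 0.1])
    means = np.array([[0.0, 0.0], [1.5, 1.5],     # non-axis-aligned pair (assumed)
                      [-6.0, 6.0], [-6.0, 6.0],   # upper-left cross
                      [6.0, -6.0], [6.0, -6.0]])  # lower-right cross
    a = 0.7  # assumed rotation angle for the non-diagonal components
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    covs = [R @ np.diag([3.0, 0.2]) @ R.T,        # rotated, elongated
            R @ np.diag([0.2, 3.0]) @ R.T,
            np.diag([3.0, 0.2]), np.diag([0.2, 3.0]),  # cross: one arm each
            np.diag([3.0, 0.2]), np.diag([0.2, 3.0])]
    z = rng.choice(6, size=n, p=props)                  # component labels
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z
```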

We fitted Gaussian mixture models to this simulated dataset. This experiment was repeated with 100 different such datasets, but we first present a single one of them to illustrate the method. Although not all the features of our approach can be tabulated, results illustrating the stability of the method are reported and discussed at the end of this subsection.

For the dataset at hand, the BIC selected a six-component mixture model, which was the correct model; this is shown in . ICL selected a four-cluster model, as shown in . The four clusters found by ICL are well separated.
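For concreteness, one common formulation of the two criteria can be sketched as follows, on the log-likelihood scale, with ICL approximated as BIC penalized by the entropy of the soft classification (following Biernacki et al. 2000); the function names are ours.

```python
import numpy as np

def bic_score(loglik, n_params, n):
    """BIC on the log-likelihood scale: log L - (nu/2) log n (larger is better)."""
    return loglik - 0.5 * n_params * np.log(n)

def icl_score(loglik, n_params, n, tau):
    """ICL approximated as BIC minus the entropy of the soft classification,
    where tau is the n x K matrix of posterior component probabilities."""
    t = np.clip(tau, 1e-300, 1.0)       # guard against log(0)
    entropy = -np.sum(t * np.log(t))
    return bic_score(loglik, n_params, n) - entropy
```

With a hard (0/1) classification the entropy term vanishes and ICL reduces to BIC; the entropy term is what penalizes solutions with poorly separated components, which is why ICL tends to select fewer, better-separated clusters.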

Starting from the BIC six-component solution, we combined two components to get the five-cluster solution shown in . To decide which two components to merge, each pair of components was considered, and the entropy after combining these components into one cluster was computed. The two components for which the resulting entropy was the smallest were combined.
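Assuming `tau` is the n × K matrix of posterior component membership probabilities from the fitted mixture, the pairwise search described above can be sketched as a naive O(K²) scan; merging two components simply sums their posterior columns. The function names are ours.

```python
import numpy as np

def soft_entropy(tau):
    """Entropy of a soft classification: -sum_i sum_k tau_ik log tau_ik."""
    t = np.clip(tau, 1e-300, 1.0)   # guard against log(0)
    return -np.sum(t * np.log(t))

def best_merge(tau):
    """Consider every pair of components (columns of tau); merging a pair
    replaces the two columns by their sum.  Return the pair giving the
    smallest entropy after merging, together with that entropy."""
    n, K = tau.shape
    best_pair, best_e = None, np.inf
    for j in range(K):
        for k in range(j + 1, K):
            merged = np.delete(tau, k, axis=1)   # drop column k (k > j)
            merged[:, j] = tau[:, j] + tau[:, k]
            e = soft_entropy(merged)
            if e < best_e:
                best_pair, best_e = (j, k), e
    return best_pair, best_e
```

As a quick check, when two columns always split the mass between them, merging them drives the entropy to (essentially) zero, so that pair is chosen first.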

The same thing was done again to find a four-cluster solution, shown in . This is the number of clusters identified by ICL. Note that there is no conventional formal statistical inferential basis for choosing between different numbers of clusters, as the likelihood and the distribution of the observations are the same for all the numbers of clusters considered.

However, the decrease of the entropy at each step of the procedure may help guide the choice of the number of clusters, or of a small number of solutions to be considered. The entropies of the combined solutions are shown in , together with the differences between successive entropy values. There seems to be an elbow in the plot at *K* = 4, and together with the choice of ICL, this leads us to focus on this solution.

A finer examination of these graphics gives more information about the merging process. The first merging (from six to five clusters) is clearly warranted, since the decrease in entropy is large (compared, for example, with the minimal decrease obtained when merging from two clusters to one). The second merging (from five to four clusters) also seems warranted for the same reason, although it yields a smaller decrease in entropy (about half of the first one). This decrease is far from zero, and indicates either that the components involved in this merging overlap less than the first two to be merged, or simply that this merging involves only about half as many observations as the first one.

To further analyze the situation, we suggest changing the scale of the first of those graphics so that the difference between the abscissas of two successive points is proportional to the number of observations involved in the corresponding merging step: see . This plot leads to the conclusion that the reason why the second merging step gives rise to a smaller entropy decrease than the first one is that it involves fewer observations. The mean decrease in entropy for each observation involved in the corresponding merging step is about the same in both cases, since the last three points of this graphic are almost collinear. The same result can be seen in a slightly different way by plotting the differences of entropies divided by the number of observations involved at each step, as shown in . These new graphical representations accentuate the elbow at *K* = 4.
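The rescaled quantity can be sketched in a few lines; the orderings of the two inputs are assumptions on our part (entropies listed from the starting solution down to one cluster, one merging step between consecutive entries).

```python
import numpy as np

def per_observation_decreases(entropies, n_involved):
    """entropies[i] is the entropy of the combined solution after i merging
    steps (so the sequence is decreasing); n_involved[i] is the number of
    observations involved in the i-th merging step.  Returns the entropy
    decrease per involved observation at each step, the quantity plotted
    in the rescaled elbow plot."""
    e = np.asarray(entropies, dtype=float)
    n = np.asarray(n_involved, dtype=float)
    return (e[:-1] - e[1:]) / n
```

Near-constant values across successive steps correspond to the almost-collinear points noted above; a sharp drop between steps marks the elbow.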

In the four-cluster solution, the clusters are no longer all Gaussian; now two of them are modeled as mixtures of two Gaussian components each. Note that this four-cluster solution is not the same as the four-cluster solution identified by ICL: ICL identifies a mixture of four Gaussians, while our method identifies four clusters of which two are not Gaussian. shows the true classification. Only three of the 600 points were misclassified.

It will often be scientifically useful to examine the full sequence of clusterings that our method yields and assess them on substantive grounds, as well as by inspection of the entropy plots. However, in some cases an automatic way of choosing the number of clusters may be desired. A simple approach to this was proposed by Byers and Raftery (1998) in a different context, namely to fit a two-part piecewise linear regression model to the values in the entropy plot, and use the estimated breakpoint as the selected number of clusters.
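A simple version of this rule can be sketched as follows: for each candidate breakpoint, fit a separate least-squares line to the entropy values on each side (sharing the breakpoint) and keep the breakpoint minimizing the total residual sum of squares. This is a minimal reading of the Byers and Raftery (1998) idea, not their exact implementation.

```python
import numpy as np

def select_K_breakpoint(K_values, entropies):
    """Fit a two-part piecewise linear regression of entropy against the
    number of clusters K; return the breakpoint K with the smallest total
    residual sum of squares (the breakpoint belongs to both pieces)."""
    K = np.asarray(K_values, dtype=float)
    y = np.asarray(entropies, dtype=float)
    best_K, best_rss = None, np.inf
    for i in range(2, len(K) - 2):              # keep >= 3 points per side
        rss = 0.0
        for xs, ys in ((K[:i + 1], y[:i + 1]), (K[i:], y[i:])):
            A = np.column_stack([xs, np.ones_like(xs)])
            coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
            rss += float(np.sum((ys - A @ coef) ** 2))
        if rss < best_rss:
            best_K, best_rss = K[i], rss
    return best_K
```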

For simulated example 1, this is shown as the dashed line in for the raw entropy plot and in for the rescaled entropy plot. The method chooses *K* = 4 using both the raw and rescaled entropy plots, but the fit of the piecewise linear regression model is better for the rescaled entropy plot, as expected.

We repeated this experiment 100 times to assess the stability of the method, simulating new data from the same model each time. The piecewise linear regression model fit to the rescaled entropy plot selected *K* = 4, 95 times out of 100.

We carried out an analogous experiment in dimension 6. The “crosses” involved two components each, with four discriminant directions between them and two noisy directions. The proportions of the components were equal. Our piecewise linear regression model method almost always selected 4 clusters.

4.2 SIMULATED EXAMPLE WITH OVERLAPPING COMPONENTS AND RESTRICTIVE MODELS

We now consider the same data again, but this time with more restrictive models. Only Gaussian mixture models with diagonal variance matrices are considered. This illustrates what happens when the mixture model generating the data is not in the set of models considered.

BIC selects more components than before, namely 10 (). This is because the true generating model is not considered, and so more components are needed to approximate the true distribution. For example, the top right non-axis-aligned component cannot be represented correctly by a single Gaussian with a diagonal variance matrix, and BIC selects three diagonal Gaussians to represent it. ICL still selects four clusters ().

In the hierarchical merging process, the two components of one of the “crosses” were combined first (), followed by the components of the other cross (). The nondiagonal cluster on the lower left was optimally represented by three diagonal mixture components in the BIC solution. In the next step, two of these three components were combined (). Next, two of the three mixture components representing the upper right cluster were combined (). After the next step there were five clusters, and all three mixture components representing the lower left cluster had been combined ().

The next step took us to four clusters, the number identified by ICL (). After this last combination, all three mixture components representing the upper right cluster had been combined. Note that this four-cluster solution is not the same as the four-cluster solution obtained by optimizing ICL directly. Strikingly, this solution is almost identical to that obtained with the less restrictive set of models considered in Section 4.1.

The plot of the combined solution entropies against the number of components in suggests an elbow at *K* = 8, with a possible second, less apparent one at *K* = 4. In the *K* = 8 solution the two crosses have been merged, and in the *K* = 4 solution all four visually apparent clusters have been merged. Recall that the choice of the number of clusters is not based on formal statistical inference, unlike the choice of the number of mixture components. Our method generates a small set of possible solutions that can be compared on substantive grounds. The entropy plot is an exploratory device that can help to assess separation between clusters, rather than a formal inference tool.

In this example, the elbow graphics () exhibit three different stages in the merging process (a piecewise linear fit with two change points is needed to describe them well):

- The first two merging steps (from ten to eight clusters) correspond to a large decrease in entropy (). They are clearly necessary. The mean entropy decrease per observation involved is about the same in each of these two steps (). Indeed, shows that they correspond to the formation of the two crosses.
- The next four merging steps (from eight to four clusters) correspond to smaller decreases in entropy (). Their per-observation mean decreases are comparable to one another, but smaller than in the first stage, so a separate linear segment fits these four steps (as appears in ). They correspond to the merging of components that overlap differently from those merged in the first stage ().
- The last three merging steps (from four clusters down to one) should not be applied.

In this case the user can consider the solutions with four and eight clusters, and make a final decision according to the needs of the application. The automatic rule of Section 4.1 (see ) selects *K* = 6 clusters, which splits the difference between the two solutions we identified by inspection of the plot. This seems reasonable if a single automatic choice is desired, but either four or eight clusters might be better in specific contexts.

4.3 CIRCLE/SQUARE EXAMPLE

The data shown in were simulated from a mixture of a uniform distribution on a square and a spherical Gaussian distribution. Here, for illustrative purposes, we restricted the models considered to Gaussian mixtures with spherical variance matrices with the same determinant. Note that the true generating model does not belong to this model class.

In the simulation results of Biernacki et al. (2000), BIC chose two components in only 60% of the simulated cases. Here we show one simulated dataset in which BIC approximated the underlying non-Gaussian density using a mixture of five normals (). ICL always selected two clusters ().

The progress of the combining algorithm is shown in . The final two-cluster solution, obtained by hierarchical merging starting from the BIC solution, is slightly different from the clustering obtained by optimizing ICL directly. It also seems slightly better: ICL classifies seven observations into the uniform cluster that clearly do not belong to it, while the solution shown misclassifies only three observations in the same way. The true labels are shown in . The entropy plot in does not have a clear elbow.

4.4 COMPARISON WITH LI’S METHOD

In this section, our methodology is compared with the related method of Li (2005). Similar to our approach, Li proposed modeling clusters as Gaussian mixtures, starting with the BIC solution, and then merging mixture components. However, unlike us, Li assumed that the true number of clusters is known in advance. The author also used *k*-means clustering to merge components; this works well when the mixture components are spherical but may have problems when they are not.

In the framework of the so-called multilayer mixture model, Li (2005) proposed two methods for partitioning the components of a mixture model into a fixed number of clusters. Both are initialized with the same double-layer *k*-means procedure. The first method computes the maximum likelihood estimator of a Gaussian mixture model with more components than the desired number of clusters; the components are then merged by minimizing a within-cluster inertia criterion (sum of squares) on the mean vectors of the mixture components. The second method fits the Gaussian mixture model through a CEM-like algorithm (Celeux and Govaert 1992) to maximize the classification likelihood, where the clusters are taken as mixtures of components. The total number of components (for each method) and the number of components per cluster (for the CEM method) are selected through either BIC or ICL.
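The merging step of Li's first method can be sketched as follows: run k-means on the fitted component mean vectors, so that components whose means land in the same k-means cluster are merged. This is a minimal Lloyd-style implementation with a deterministic farthest-point initialization, which is our own simplification rather than Li's double-layer procedure.

```python
import numpy as np

def merge_by_kmeans_on_means(component_means, n_clusters, n_iter=50):
    """Group mixture components into clusters by running k-means on their
    mean vectors; labels[j] gives the cluster of component j.  Note that
    only the means are used, ignoring component shapes and overlap."""
    M = np.asarray(component_means, dtype=float)
    # deterministic farthest-point initialization (our simplification)
    centers = [M[0]]
    while len(centers) < n_clusters:
        d = np.min([np.linalg.norm(M - c, axis=1) for c in centers], axis=0)
        centers.append(M[int(d.argmax())])
    centers = np.array(centers)
    for _ in range(n_iter):                       # Lloyd iterations
        dist = np.linalg.norm(M[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = M[labels == c].mean(axis=0)
    return labels
```

Because only the means enter the criterion, the distances between means can become irrelevant when several components with scattered means jointly represent one non-Gaussian cluster; this is the failure mode observed in the experiments below.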

4.4.1 First Experiment: Gaussian Mixture We simulated 100 samples of size *n* = 800 of a four-component Gaussian mixture in **R**^{2}. An example of such a sample is shown in .

Since Li’s method imposes a fixed number of clusters, we fixed it to three and stopped our algorithm as soon as it yielded three clusters. For each simulated sample we obtained the same kind of result for both methods. Our method always gave the same result, depicted in . The results for the four variants of Li’s method are shown in . Li’s method with the CEM-like algorithm always gave rise to the solution in . Li’s method with *k*-means on the means and selection through BIC found the same solution in 93 of the 100 cases. The method with *k*-means on the means and selection through ICL found such a solution in 27 cases, but in most other cases found a different solution whose fit was poorer ().

4.4.2 Second Experiment: 3D Uniform Cross We simulated data in **R**^{3} from a mixture of two uniform components: see . One is a horizontal thick cross (red in ) with proportion 0.75 in the mixture, while the other is a vertical pillar (black in ) with proportion 0.25. We simulated 100 datasets of size 300, and we applied Li’s procedures, Ward’s sum of squares method, and ours. We fixed the number of clusters at its true value (two), and we then fitted general Gaussian mixture models.

BIC selected 4 components for 69 of the 100 datasets, and 3 components for 18 of them. ICL selected 4 components for 60 of the datasets, and 3 components for 29 of them.

As in the preceding example, Li’s approach did not recover the true clusters. Li’s CEM-like methods always yielded a bad solution: sometimes one of the arms of the cross merged to the pillar, and sometimes two, as in . Li’s BIC + *k*-means method recovered the true clusters in 19 cases out of 100, and Li’s ICL + *k*-means method did so in 33 cases out of 100. This occurred almost every time the number of Gaussian components was 3 (two for the cross, which then have almost the same mean, and one for the pillar). When the number of fitted components is higher, the distance between the means of the components is no longer a relevant criterion, and those methods yielded clusterings such as .

Our merging procedure almost always (95 times out of 100) recovered the true clusters.

4.4.3 Conclusions on the Comparisons with Other Methods Here are some comments on the comparison between Li’s methods and ours based on these simulations. Our method takes into account the overlap between components to choose which ones to merge, whereas Li’s method is based on the distances between the component means, through the initialization step in each method, and also through the merging procedure in the first method. This sometimes leads to mergings that are not relevant from a clustering point of view.

Our method is appreciably faster, since only one EM estimation has to be run for each number of components considered, whereas numerous runs are needed with Li’s method. For the examples we considered, and with code that could still be optimized, Li’s method took at least twice as long as ours.

Our procedure can also be applied when the number of clusters is unknown, unlike Li’s method.

We also compared our results with those of a non-model-based clustering method: Ward’s hierarchical method (Ward 1963). We used Matlab’s *datacluster* function to apply this procedure in each of the experiments described in this section. Ward’s method always found irrelevant solutions, close to Li’s, for each of the 200 (= 2 × 100) datasets.