
J Am Stat Assoc. Author manuscript; available in PMC 2010 August 31.

Published in final edited form as:

J Am Stat Assoc. 2010 June 1; 105(490): 713–726.

doi: 10.1198/jasa.2010.tm09415. PMCID: PMC2930825

NIHMSID: NIHMS201124


We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse *K*-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.

Let **X** denote an *n* × *p* data matrix, with *n* observations and *p* features. Suppose that we wish to cluster the observations, and we suspect that the true underlying clusters differ only with respect to some of the features. We propose a method for *sparse clustering*, which allows us to group the observations using only an adaptively-chosen subset of the features. This method is most useful for the high-dimensional setting where *p* ≫ *n*, but can also be used when *p* < *n*. Sparse clustering has a number of advantages. If the underlying groups differ only in terms of some of the features, then it might result in more accurate identification of these groups than standard clustering. It also yields interpretable results, since one can determine precisely which features are responsible for the observed differences between the groups or clusters. In addition, fewer features are required to assign a new observation to a pre-existing cluster.

As a motivating example, we generated 500 independent observations from a bivariate normal distribution. A mean shift on the first feature defines the two classes. The resulting data, as well as the clusters obtained using standard 2-means clustering and our sparse 2-means clustering proposal, can be seen in Figure 1. Unlike standard 2-means clustering, our proposal for sparse 2-means clustering automatically identifies a subset of the features to use in clustering the observations. Here it uses only the first feature, and consequently agrees quite well with the true class labels. In this example, one could use an elliptical metric in order to identify the two classes without using feature selection. However, this will not work in general.

In a two-dimensional example, two classes differ only with respect to the first feature. Sparse 2-means clustering selects only the first feature, and therefore yields a superior result.

Clustering methods require some concept of the *dissimilarity* between pairs of observations. Let *d*(**x**_{i}, **x**_{i′}) denote the dissimilarity between observations *i* and *i*′. Throughout this paper, we assume that the dissimilarity measure is additive in the features: *d*(**x**_{i}, **x**_{i′}) = Σ_{j} *d*_{i,i′,j}, where *d*_{i,i′,j} denotes the dissimilarity between observations *i* and *i*′ along feature *j*. Squared Euclidean distance, for which *d*_{i,i′,j} = (*X*_{ij} − *X*_{i′j})², is one such measure.

The rest of this paper is organized as follows. In Section 2, we briefly review existing methods for sparse clustering and we present the general framework of our proposal. We present sparse *K*-means clustering in Section 3 and sparse hierarchical clustering in Section 4. Section 5 contains an application of our method to a gene expression data set, and Section 6 contains an application to a single nucleotide polymorphism data set. The discussion is in Section 7.

A number of authors have noted the necessity of specialized clustering techniques for the high-dimensional setting. Here, we briefly review previous proposals for feature selection and dimensionality reduction in clustering.

One way to reduce the dimensionality of the data before clustering is by performing a matrix decomposition. One can approximate the *n* × *p* data matrix **X** as **X** ≈ **AB** where **A** is an *n* × *q* matrix and **B** is a *q* × *p* matrix, *q* ≪ *p*. Then, one can cluster the observations using **A** as the data matrix, rather than **X**. For instance, Ghosh & Chinnaiyan (2002) and Liu et al. (2003) propose performing PCA in order to obtain a matrix **A** of reduced dimensionality; then, the *n* rows of **A** can be clustered. Similarly, Tamayo et al. (2007) suggest decomposing **X** using the non-negative matrix factorization (Lee & Seung 1999, Lee & Seung 2001), followed by clustering the rows of **A**. However, these approaches have a number of drawbacks. First of all, the resulting clustering is not sparse in the features, since each of the columns of **A** is a function of the full set of *p* features. Moreover, there is no guarantee that **A** contains the signal that one is interested in detecting via clustering. In fact, Chang (1983) studies the effect of performing PCA to reduce the data dimension before clustering, and finds that this procedure is not justified since the principal components with largest eigenvalues do not necessarily provide the best separation between subgroups.

The model-based clustering framework has been studied extensively in recent years, and many of the proposals for feature selection and dimensionality reduction for clustering fall in this setting. An overview of model-based clustering can be found in McLachlan & Peel (2000) and Fraley & Raftery (2002). The basic idea is as follows. One can model the rows of **X** as independent multivariate observations drawn from a mixture model with *K* components; usually a mixture of Gaussians is used. That is, given the data, the log likelihood is

$$\sum _{i=1}^{n}\log \left[\sum _{k=1}^{K}{\pi}_{k}{f}_{k}({\mathbf{X}}_{i};{\mu}_{k},{\mathbf{\Sigma}}_{k})\right]$$

(1)

where *f*_{k} is a Gaussian density parametrized by its mean *μ*_{k} and covariance matrix **Σ**_{k}, and *π*_{k} is the prior probability of membership in the *k*th class; the parameters can be estimated by maximizing (1), for instance via the EM algorithm.

However, when *p* ≈ *n* or *p* ≫ *n*, a problem arises because the *p* × *p* covariance matrix **Σ**_{k} cannot be estimated from only *n* observations.

It turns out that model-based clustering lends itself easily to feature selection. Rather than seeking *μ*_{k} and **Σ**_{k} that simply maximize the log likelihood (1), Pan & Shen (2007) propose maximizing a penalized version of it,

$$\sum _{i=1}^{n}\log \left[\sum _{k=1}^{K}{\pi}_{k}{f}_{k}({\mathbf{X}}_{i};{\mu}_{k},{\mathbf{\Sigma}}_{k})\right]-\lambda \sum _{k=1}^{K}\sum _{j=1}^{p}|{\mu}_{kj}|$$

(2)

where **Σ**_{1} = … = **Σ**_{K} is taken to be a diagonal matrix. That is, an *L*_{1}, or *lasso*, penalty is applied to the elements of the cluster means. If the data are standardized so that each feature has overall mean zero, then a feature *j* for which *μ*_{1j} = … = *μ*_{Kj} = 0 does not affect the clustering, and so sparsity in the cluster means yields a clustering that is sparse in the features.

Raftery & Dean (2006) also present a method for feature selection in the model-based clustering setting, using an entirely different approach. They recast the variable selection problem as a model selection problem: models containing nested subsets of variables are compared. The nested models are sparse in the features, and so this yields a method for sparse clustering. A related proposal is made in Maugis et al. (2009).

Friedman & Meulman (2004) propose *clustering objects on subsets of attributes (COSA)*. Let *C*_{k} denote the indices of the observations in the *k*th cluster. COSA seeks to solve

$$\underset{{C}_{1},\dots ,{C}_{K},\mathbf{w}}{\text{minimize}}\;\left\{\sum _{k=1}^{K}{a}_{k}\sum _{i,{i}^{\prime}\in {C}_{k}}\sum _{j=1}^{p}\left({w}_{j}{d}_{i,{i}^{\prime},j}+\lambda {w}_{j}\log {w}_{j}\right)\right\}\quad \text{subject to}\;\sum _{j=1}^{p}{w}_{j}=1,\;{w}_{j}\ge 0\;\forall j.$$

(3)

(Actually, this is a simplified version of the COSA proposal, which allows for different feature weights within each cluster.) Here, *a*_{k} is some function of the number of elements in cluster *k*, and *λ* ≥ 0 is a tuning parameter. The term *λw*_{j} log *w*_{j} is a negative entropy penalty that pulls the weights towards one another; it does not yield weights that are exactly zero unless *λ* = 0.

Our proposal can be thought of as a much simpler version of (3). It is a general framework that can be applied in order to obtain sparse versions of a number of clustering methods. The resulting algorithms are efficient even when *p* is quite large.

Suppose that we wish to cluster *n* observations on *p* dimensions; recall that **X** is of dimension *n* × *p*. In this paper, we take a general approach to the problem of sparse clustering. Let **X**_{j} ∈ ℝ^{n} denote feature *j*. Many clustering methods can be expressed as an optimization problem of the form

$$\underset{\mathbf{\Theta}\in D}{\text{maximize}}\left\{\sum _{j=1}^{p}{f}_{j}({\mathbf{X}}_{j},\mathbf{\Theta})\right\}$$

(4)

where *f*_{j}(**X**_{j}, **Θ**) is some function that involves only the *j*th feature of the data, and **Θ** is a parameter restricted to lie in a set *D*. We define *sparse clustering* as the solution to the problem

$$\underset{\mathbf{w};\mathbf{\Theta}\in D}{\text{maximize}}\left\{\sum _{j=1}^{p}{w}_{j}{f}_{j}({\mathbf{X}}_{j},\mathbf{\Theta})\right\}\quad \text{subject to}\;{\|\mathbf{w}\|}^{2}\le 1,\;{\|\mathbf{w}\|}_{1}\le s,\;{w}_{j}\ge 0\;\forall j,$$

(5)

where *w*_{j} is a weight corresponding to feature *j*, and *s* is a tuning parameter, 1 ≤ *s* ≤ √*p*. We make several observations about (5):

- The *L*_{1}, or *lasso*, penalty on **w** results in sparsity for small values of the tuning parameter *s*: that is, some of the *w*_{j} will equal zero. The *L*_{2} penalty also serves an important role, since without it, at most one element of **w** would be non-zero in general. A criterion involving a linear objective subject to both an *L*_{1} and an *L*_{2} constraint was also used in Witten et al. (2009).
- The value of *w*_{j} can be interpreted as the contribution of feature *j* to the resulting sparse clustering: a large value of *w*_{j} indicates a feature that contributes greatly, and *w*_{j} = 0 means that feature *j* is not involved in the clustering.
- In general, for the formulation (5) to result in a non-trivial sparse clustering, it is necessary that *f*_{j}(**X**_{j}, **Θ**) > 0 for some or all *j*. If *f*_{j}(**X**_{j}, **Θ**) ≤ 0, then *w*_{j} = 0; if *f*_{j}(**X**_{j}, **Θ**) > 0, then the non-negativity constraint on *w*_{j} has no effect.

We optimize (5) using an iterative algorithm: holding **w** fixed, we optimize (5) with respect to **Θ**, and holding **Θ** fixed, we optimize (5) with respect to **w**. In general, we do not achieve a global optimum of (5) using this iterative approach; however, we are guaranteed that each iteration increases the objective function. The first optimization typically involves application of a standard clustering procedure to a weighted version of the data. To optimize (5) with respect to **w** with **Θ** held fixed, we note that the problem can be re-written as

$$\underset{\mathbf{w}}{\text{maximize}}\;\{{\mathbf{w}}^{T}\mathbf{a}\}\quad \text{subject to}\;{\|\mathbf{w}\|}^{2}\le 1,\;{\|\mathbf{w}\|}_{1}\le s,\;{w}_{j}\ge 0\;\forall j,$$

(6)

where ${a}_{j}={f}_{j}({\mathbf{X}}_{j},\mathbf{\Theta})$. The following proposition gives the solution in closed form.

**Proposition.** The solution to the convex problem (6) is $\mathbf{w}=\frac{S({\mathbf{a}}_{+},\Delta)}{{\|S({\mathbf{a}}_{+},\Delta)\|}_{2}}$, where *x*_{+} denotes the positive part of *x* and where Δ = 0 if that results in ||**w**||_{1} ≤ *s*; otherwise, Δ > 0 is chosen to yield ||**w**||_{1} = *s*. Here, *S* is the soft-thresholding operator, defined as *S*(*x*, *c*) = sign(*x*)(|*x*| − *c*)_{+}.

The proposition follows from the Karush-Kuhn-Tucker conditions (see e.g. Boyd & Vandenberghe 2004).
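The closed form in the proposition is straightforward to implement. The following is a minimal numpy sketch of our own (the function names are ours, not from the paper), with Δ located by binary search:

```python
import numpy as np

def soft_threshold(x, c):
    """S(x, c) = sign(x) * (|x| - c)_+, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - c, 0)

def solve_w(a, s, tol=1e-10):
    """Maximize w'a subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Returns w = S(a_+, delta) / ||S(a_+, delta)||_2, with delta = 0 if
    that already satisfies the L1 constraint, and otherwise delta > 0
    found by binary search so that ||w||_1 = s.
    """
    a_plus = np.maximum(a, 0)            # positive part of a
    if np.linalg.norm(a_plus) == 0:
        return a_plus                    # degenerate case: all a_j <= 0
    w = a_plus / np.linalg.norm(a_plus)  # delta = 0 candidate
    if np.sum(w) <= s:
        return w
    lo, hi = 0.0, float(a_plus.max())    # bisect for delta with ||w||_1 = s
    while hi - lo > tol:
        delta = (lo + hi) / 2
        w = soft_threshold(a_plus, delta)
        w = w / np.linalg.norm(w)
        if np.sum(w) > s:
            lo = delta
        else:
            hi = delta
    return w
```

For example, with **a** = (3, 1, 0.5, −2) and *s* = 1.1, the negative component receives weight zero and the *L*_{1} constraint is active.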

In the next two sections we show that *K*-means clustering and hierarchical clustering optimize criteria of the form (4). We then propose sparse versions of *K*-means clustering and hierarchical clustering using (5). The resulting criteria for sparse clustering take on simple forms, are easily optimized, and involve a single tuning parameter *s* that controls the number of features involved in the clustering. Moreover, our proposal is a general framework that can be applied to any clustering procedure for which a criterion of the form (4) is available.

In this paper we consider the sparse clustering criterion (5), where sparsity in the features results from the *L*_{1} penalty on **w**. Other penalties on **w** could be used that would also yield sparsity (see e.g. Fan & Li 2001, Lv & Fan 2009). Such alternatives are not considered in this paper.

*K*-means clustering minimizes the *within-cluster sum of squares* (WCSS). That is, it seeks to partition the *n* observations into *K* sets, or clusters, such that the WCSS

$$\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}\sum _{j=1}^{p}{d}_{i,{i}^{\prime},j}$$

(7)

is minimal, where *n*_{k} is the number of observations in cluster *k* and *C*_{k} contains the indices of the observations in cluster *k*. If we define the *between-cluster sum of squares* (BCSS) as

$$\sum _{j=1}^{p}\left(\frac{1}{n}\sum _{i=1}^{n}\sum _{{i}^{\prime}=1}^{n}{d}_{i,{i}^{\prime},j}-\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}{d}_{i,{i}^{\prime},j}\right),$$

(8)

then minimizing the WCSS is equivalent to maximizing the BCSS.
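When *d* is squared Euclidean distance, the decomposition behind (7) and (8) can be verified numerically. The following self-contained check (our own construction, not code from the paper) confirms that the per-feature BCSS computed from the dissimilarities equals twice the *n*_{k}-weighted squared deviation of the cluster means from the overall mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))        # n = 30 observations, p = 4 features
labels = np.arange(30) % 3          # an arbitrary partition into K = 3 clusters

# Per-feature squared Euclidean dissimilarities d_{i,i',j} = (X_ij - X_i'j)^2
d = (X[:, None, :] - X[None, :, :]) ** 2          # shape (n, n, p)

total = d.sum(axis=(0, 1)) / len(X)               # (1/n) sum_{i,i'} d_{i,i',j}
wcss = np.zeros(X.shape[1])                       # the WCSS of (7), per feature
for k in range(3):
    idx = np.where(labels == k)[0]
    wcss += d[np.ix_(idx, idx)].sum(axis=(0, 1)) / len(idx)
bcss = total - wcss                               # the BCSS of (8), per feature

# Equivalent "spread of cluster means" form of the BCSS
means = np.array([X[labels == k].mean(axis=0) for k in range(3)])
n_k = np.bincount(labels)
check = 2 * (n_k[:, None] * (means - X.mean(axis=0)) ** 2).sum(axis=0)
assert np.allclose(bcss, check)
```

Since cluster means minimize the within-cluster sum of squares, each coordinate of `bcss` is non-negative, which matters below when weights are soft-thresholded.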

One could try to develop a method for sparse *K*-means clustering by optimizing a weighted WCSS, subject to constraints on the weights: that is,

$$\underset{{C}_{1},\dots ,{C}_{K},\mathbf{w}}{\text{maximize}}\;\left\{\sum _{j=1}^{p}{w}_{j}\left(-\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}{d}_{i,{i}^{\prime},j}\right)\right\}\quad \text{subject to}\;{\|\mathbf{w}\|}^{2}\le 1,\;{\|\mathbf{w}\|}_{1}\le s,\;{w}_{j}\ge 0\;\forall j.$$

(9)

Here, *s* is a tuning parameter. Since each element of the weighted sum is negative, the maximum occurs when all weights are zero, regardless of the value of *s*. This is not an interesting solution. We instead maximize a weighted BCSS, subject to constraints on the weights. Our *sparse K-means clustering criterion* is as follows:

$$\underset{{C}_{1},\dots ,{C}_{K},\mathbf{w}}{\text{maximize}}\;\left\{\sum _{j=1}^{p}{w}_{j}\left(\frac{1}{n}\sum _{i=1}^{n}\sum _{{i}^{\prime}=1}^{n}{d}_{i,{i}^{\prime},j}-\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}{d}_{i,{i}^{\prime},j}\right)\right\}\quad \text{subject to}\;{\|\mathbf{w}\|}^{2}\le 1,\;{\|\mathbf{w}\|}_{1}\le s,\;{w}_{j}\ge 0\;\forall j.$$

(10)

The weights will be sparse for an appropriate choice of the tuning parameter *s*. Note that if *w*_{1} = … = *w*_{p}, then (10) simply reduces to the standard *K*-means criterion.

The criterion (10) assigns a weight to each feature, based on the increase in BCSS that the feature can contribute. First, consider the criterion with the weights *w*_{1}, …, *w*_{p} fixed. It reduces to a clustering problem, using a weighted dissimilarity measure. Second, consider the criterion with the clusters *C*_{1}, …, *C*_{K} fixed: maximizing over **w** assigns the largest weights to the features with the greatest between-cluster sums of squares. This suggests the following iterative algorithm.

- Initialize **w** as ${w}_{1}=\dots ={w}_{p}=\frac{1}{\sqrt{p}}$.
- Iterate until convergence:
  - Holding **w** fixed, optimize (10) with respect to *C*_{1}, …, *C*_{K}: that is,
    $$\underset{{C}_{1},\dots ,{C}_{K}}{\text{minimize}}\left\{\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}\sum _{j=1}^{p}{w}_{j}{d}_{i,{i}^{\prime},j}\right\}$$ (11)
    by applying the standard *K*-means algorithm to the *n* × *n* dissimilarity matrix with (*i*, *i*′) element Σ_{j} *w*_{j}*d*_{i,i′,j}.
  - Holding *C*_{1}, …, *C*_{K} fixed, optimize (10) with respect to **w** by applying the Proposition: $\mathbf{w}=\frac{S({\mathbf{a}}_{+},\Delta)}{{\|S({\mathbf{a}}_{+},\Delta)\|}_{2}}$, where
    $${a}_{j}=\frac{1}{n}\sum _{i=1}^{n}\sum _{{i}^{\prime}=1}^{n}{d}_{i,{i}^{\prime},j}-\sum _{k=1}^{K}\frac{1}{{n}_{k}}\sum _{i,{i}^{\prime}\in {C}_{k}}{d}_{i,{i}^{\prime},j}$$ (12)
    and Δ = 0 if that results in ||**w**||_{1} ≤ *s*; otherwise, Δ > 0 is chosen so that ||**w**||_{1} = *s*.
- The clusters are given by *C*_{1}, …, *C*_{K}, and the feature weights corresponding to this clustering are given by *w*_{1}, …, *w*_{p}.

When *d* is squared Euclidean distance, Step 2(a) can be optimized by performing *K*-means on the data after scaling each feature *j* by
$\sqrt{{w}_{j}}$. In our implementation of sparse *K*-means, we iterate Step 2 until the stopping criterion

$$\frac{{\sum}_{j=1}^{p}|{w}_{j}^{r}-{w}_{j}^{r-1}|}{{\sum}_{j=1}^{p}|{w}_{j}^{r-1}|}<{10}^{-4}$$

(13)

is satisfied. Here, **w**^{r} denotes the weight vector obtained at iteration *r*.
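Assembled end to end, steps 1 and 2 together with the stopping rule (13) can be sketched in numpy as follows. This is our own illustrative implementation, not the authors' code: a plain restarted Lloyd's algorithm stands in for the *K*-means step, and all function names are ours.

```python
import numpy as np

def soft_threshold(x, c):
    return np.sign(x) * np.maximum(np.abs(x) - c, 0)

def weights_from_bcss(a, s, n_bisect=60):
    """The Proposition: w = S(a_+, delta)/||.||_2 with ||w||_1 <= s."""
    a_plus = np.maximum(a, 0)
    w = a_plus / np.linalg.norm(a_plus)
    if np.sum(w) <= s:
        return w
    lo, hi = 0.0, float(a_plus.max())
    for _ in range(n_bisect):
        delta = (lo + hi) / 2
        w = soft_threshold(a_plus, delta)
        w = w / np.linalg.norm(w)
        lo, hi = (delta, hi) if np.sum(w) > s else (lo, delta)
    return w

def kmeans(X, K, n_restarts=5, n_iter=50, seed=0):
    """Restarted Lloyd's algorithm, standing in for step 2(a)."""
    rng = np.random.default_rng(seed)
    best_wcss, best_labels = np.inf, None
    for _ in range(n_restarts):
        centers = X[rng.choice(len(X), K, replace=False)].copy()
        for _ in range(n_iter):
            labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            for k in range(K):
                if np.any(labels == k):
                    centers[k] = X[labels == k].mean(0)
        wcss = ((X - centers[labels]) ** 2).sum()
        if wcss < best_wcss:
            best_wcss, best_labels = wcss, labels
    return best_labels

def bcss_per_feature(X, labels, K):
    """a_j of (12), using squared Euclidean per-feature dissimilarity."""
    total = ((X - X.mean(0)) ** 2).sum(0)
    within = sum(((X[labels == k] - X[labels == k].mean(0)) ** 2).sum(0)
                 for k in range(K) if np.any(labels == k))
    return 2 * (total - within)

def sparse_kmeans(X, K, s, max_iter=15):
    w = np.full(X.shape[1], 1 / np.sqrt(X.shape[1]))       # step 1
    for _ in range(max_iter):
        labels = kmeans(X * np.sqrt(w), K)                 # step 2(a)
        w_new = weights_from_bcss(bcss_per_feature(X, labels, K), s)  # step 2(b)
        converged = np.abs(w_new - w).sum() / np.abs(w).sum() < 1e-4  # rule (13)
        w = w_new
        if converged:
            break
    return labels, w
```

On data with two classes that differ on the first 5 of 20 features, the recovered weights concentrate on those 5 features.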

Note the similarity between the COSA criterion (3) and (10): when ${a}_{k}=\frac{1}{{n}_{k}}$ in (3), both criteria involve minimizing a weighted function of the WCSS, where the feature weights reflect the importance of each feature in the clustering. However, (3) does not result in weights that are exactly equal to zero unless *λ* = 0, in which case only one weight is non-zero. The combination of *L*_{1} and *L*_{2} constraints in (10) yields the desired effect.

The Appendix contains an additional remark about our criterion for sparse *K*-means clustering.

The sparse *K*-means clustering algorithm has one tuning parameter, *s*, which is the *L*_{1} bound on **w** in (10). We assume that *K*, the number of clusters, is fixed. The problem of selecting *K* is outside of the scope of this paper, and has been discussed extensively in the literature for standard *K*-means clustering; we refer the interested reader to Milligan & Cooper (1985), Kaufman & Rousseeuw (1990), Tibshirani et al. (2001), Sugar & James (2003), and Tibshirani & Walther (2005).

A method for choosing the value of *s* is required. Note that one cannot simply select *s* to maximize the objective function in (10), since as *s* is increased, the objective will increase as well. Instead, we apply a permutation approach that is closely related to the gap statistic of Tibshirani et al. (2001) for selecting the number of clusters *K* in standard *K*-means clustering.

- Obtain permuted data sets **X**_{1}, …, **X**_{B} by independently permuting the observations within each feature.
- For each candidate tuning parameter value *s*:
  - Compute $O(s)={\sum}_{j}{w}_{j}\left(\frac{1}{n}{\sum}_{i=1}^{n}{\sum}_{{i}^{\prime}=1}^{n}{d}_{i,{i}^{\prime},j}-{\sum}_{k=1}^{K}\frac{1}{{n}_{k}}{\sum}_{i,{i}^{\prime}\in {C}_{k}}{d}_{i,{i}^{\prime},j}\right)$, the objective obtained by performing sparse *K*-means with tuning parameter value *s* on the data **X**.
  - For *b* = 1, 2, …, *B*, compute *O*_{b}(*s*), the objective obtained by performing sparse *K*-means with tuning parameter value *s* on the data **X**_{b}.
  - Calculate $\text{Gap}(s)=\log (O(s))-\frac{1}{B}{\sum}_{b=1}^{B}\log ({O}_{b}(s))$.
- Choose *s*^{*} corresponding to the largest value of Gap(*s*). Alternatively, one can choose *s*^{*} to equal the smallest value for which Gap(*s*^{*}) is within one standard deviation of log(*O*_{b}(*s*^{*})) of the largest value of Gap(*s*).

Note that while there may be strong correlations between the features in the original data **X**, the features in the permuted data sets **X**_{1}, …, **X**_{B} are uncorrelated with each other. The gap statistic measures the strength of the clustering obtained on the real data relative to the clustering obtained on null data that does not contain subgroups. The optimal tuning parameter value occurs when this quantity is greatest.
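To make the permutation scheme concrete, here is a small numpy sketch of our own. For brevity, the objective *O*(*s*) below is computed from a single pass of 2-means followed by the weights from the Proposition, rather than the full alternating algorithm of Section 3.1, and all function names are ours:

```python
import numpy as np

def two_means(X, n_iter=30, seed=0):
    """Bare-bones 2-means (Lloyd's algorithm), a stand-in for step 2(a)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), 2, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(0)
    return labels

def objective(X, s):
    """O(s): cluster, compute per-feature BCSS a_j, weight by the
    Proposition's w, and return the weighted BCSS w'a."""
    labels = two_means(X)
    total = ((X - X.mean(0)) ** 2).sum(0)
    within = sum(((X[labels == k] - X[labels == k].mean(0)) ** 2).sum(0)
                 for k in (0, 1) if np.any(labels == k))
    a = 2 * (total - within)               # per-feature BCSS, as in (12)
    w = a / np.linalg.norm(a)
    if w.sum() > s:                        # bisect for the Proposition's delta
        lo, hi = 0.0, float(a.max())
        for _ in range(60):
            mid = (lo + hi) / 2
            v = np.maximum(a - mid, 0)
            v = v / np.linalg.norm(v)
            lo, hi = (mid, hi) if v.sum() > s else (lo, mid)
        w = v
    return float(w @ a)

def gap(X, s, B=8, seed=0):
    """Gap(s) = log O(s) - (1/B) sum_b log O_b(s), where O_b is computed
    on data permuted independently within each feature."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(B):
        Xb = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        null.append(np.log(objective(Xb, s)))
    return np.log(objective(X, s)) - float(np.mean(null))
```

On data containing well-separated subgroups, Gap(*s*) comes out positive, since the attained objective exceeds what is achievable on the permuted null data.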

In Figure 2, we apply this method to a simple example with 6 equally-sized classes, where *n* = 120, *p* = 2000, and 200 features differ between classes. In the figure we have used the *classification error rate* (CER) for two partitions of a set of *n* observations. This is defined as follows. Let *P* and *Q* denote the two partitions; *P* might be the true class labels, and *Q* might be a partition obtained by clustering. Let 1_{P}_{(}_{i,i}_{′)} be an indicator for whether partition *P* places observations *i* and *i*′ in the same group, and define 1_{Q}_{(}_{i,i}_{′)} analogously. Then, the CER (used for example in Chipman & Tibshirani 2005) is defined as
${\sum}_{i>{i}^{\prime}}|{1}_{P(i,{i}^{\prime})}-{1}_{Q(i,{i}^{\prime})}|/\binom{n}{2}$. The CER equals 0 if the partitions *P* and *Q* agree perfectly; a high value indicates disagreement. Note that the CER is one minus the Rand index (Rand 1971).
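Computed directly from this definition, the CER takes a few lines of code (a sketch of our own, with names of our choosing):

```python
import numpy as np
from itertools import combinations

def cer(p_labels, q_labels):
    """Classification error rate between two partitions: the fraction of
    pairs (i, i') on which the partitions disagree about co-membership.
    Invariant to relabeling of the clusters; equals 1 - Rand index."""
    p_labels, q_labels = np.asarray(p_labels), np.asarray(q_labels)
    pairs = list(combinations(range(len(p_labels)), 2))
    disagreements = sum((p_labels[i] == p_labels[j]) != (q_labels[i] == q_labels[j])
                        for i, j in pairs)
    return disagreements / len(pairs)
```

For instance, `cer([0, 0, 1, 1], [1, 1, 0, 0])` is 0, since the two partitions group the observations identically up to label names.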

We compare the performances of standard and sparse *K*-means in a simulation study where *q* = 50 features differ between *K* = 3 classes. The elements *X*_{ij} are independent and normally distributed, with a class-specific mean shift of size *μ* in the *q* differentiating features; the size of the shift *μ* and the dimension *p* are varied.

Standard 3-means results for Simulation 1. The reported values are the mean (and standard error) of the CER over 20 simulations. The μ/p combinations for which the CER of standard 3-means is significantly less than that of sparse 3-means (at level **...**

Sparse 3-means results for Simulation 1. The reported values are the mean (and standard error) of the CER over 20 simulations. The μ/p combinations for which the CER of sparse 3-means is significantly less than that of standard 3-means (at level **...**

We compare the performance of sparse *K*-means to a number of competitors:

- **The COSA proposal of Friedman & Meulman (2004).** COSA was run using the R code available from the website http://www-stat.stanford.edu/~jhf/COSA.html, in order to obtain a reweighted dissimilarity matrix. Then, two methods were used to obtain a clustering:
  - 3-medoids clustering (using the *partitioning around medoids* algorithm described in Kaufman & Rousseeuw 1990) was performed on the reweighted dissimilarity matrix.
  - Hierarchical clustering with average linkage was performed on the reweighted dissimilarity matrix, and the dendrogram was cut so that 3 groups were obtained.
- **The model-based clustering approach of Raftery & Dean (2006).** It was run using the R package clustvarsel, available from http://cran.r-project.org/.
- **The penalized log-likelihood approach of Pan & Shen (2007).** R code implementing this method was provided by the authors.
- **PCA followed by 3-means clustering.** Only the first principal component was used, since in the simulations considered the first principal component contained the signal. This is similar to several proposals in the literature (see e.g. Ghosh & Chinnaiyan 2002, Liu et al. 2003, Tamayo et al. 2007).

The set-up is similar to that of Section 3.3.1, in that there are *K* = 3 classes and the elements *X*_{ij} are independent and normally distributed, with a class-specific mean shift in the features that differ between classes.

Results for Simulation 2. The quantities reported are the mean and standard error (given in parentheses) of the CER, and of the number of non-zero coefficients, over 25 simulated data sets.

We make a few comments about Table 4. First of all, neither variant of COSA performed well in this example, in terms of CER. This is somewhat surprising. However, COSA allows the features to take on a different set of weights with respect to each cluster; in the simulation, each cluster is defined on the same set of features, and COSA may have lost power by allowing different weights for each cluster. The method of Raftery & Dean (2006) also does quite poorly in this example, although its performance seems to improve somewhat as the signal-to-noise ratio in the simulation is increased (results not shown). The penalized model-based clustering method of Pan & Shen (2007) resulted in low CER as well as sparsity in both simulations. In addition, the simple method of PCA followed by 3-means clustering yielded quite low CER. However, since the principal components are linear combinations of all of the features, the resulting clustering is not sparse in the features and thus does not achieve the stated goal in this paper of performing feature selection.

In both simulations, sparse *K*-means performed quite well, in that it resulted in a low CER and sparsity. The tuning parameter was chosen to maximize the gap statistic; however, greater sparsity could have been achieved by choosing the smallest tuning parameter value within one standard deviation of the maximal gap statistic, as described in Section 3.2. Our proposal also has the advantage of generalizing to other types of clustering, as described in the next section.

Hierarchical clustering produces a dendrogram that represents a nested set of clusters: depending on where the dendrogram is cut, between 1 and *n* clusters can result. One could develop a method for sparse hierarchical clustering by cutting the dendrogram at some height and maximizing a weighted version of the resulting BCSS, as in Section 3. However, it is not clear where the dendrogram should be cut, nor whether multiple cuts should be made and somehow combined. Instead, we pursue a simpler and more natural approach to sparse hierarchical clustering in this section.

Note that hierarchical clustering takes as input an *n* × *n* dissimilarity matrix **U**. The clustering can use any type of linkage: complete, average, or single. If **U** is the *overall dissimilarity matrix* with (*i*, *i*′) element Σ_{j} *d*_{i,i′,j}, then standard hierarchical clustering results.

Since scaling the dissimilarity matrix by a factor does not affect the shape of the resulting dendrogram, we ignore proportionality constants in the following discussion. Consider the criterion

$$\underset{\mathbf{U}}{\text{maximize}}\{\sum _{j}\sum _{i,{i}^{\prime}}{d}_{i,{i}^{\prime},j}{U}_{i,{i}^{\prime}}\}\phantom{\rule{0.16667em}{0ex}}\text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.16667em}{0ex}}\sum _{i,{i}^{\prime}}{U}_{i,{i}^{\prime}}^{2}\le 1.$$

(14)

Let **U**^{*} optimize (14). It is not hard to show that ${U}_{i,{i}^{\prime}}^{*}\propto {\sum}_{j}{d}_{i,{i}^{\prime},j}$, and so performing hierarchical clustering on **U**^{*} results in standard hierarchical clustering. So we can think of standard hierarchical clustering as resulting from the criterion (14). To obtain sparsity in the features, we modify (14) by multiplying each element of the summation over *j* by a weight *w*_{j}, subject to constraints on the weights:

$$\underset{\mathbf{w},\mathbf{U}}{\text{maximize}}\{\sum _{j}{w}_{j}\sum _{i,{i}^{\prime}}{d}_{i,{i}^{\prime},j}{U}_{i,{i}^{\prime}}\}\phantom{\rule{0.16667em}{0ex}}\text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.16667em}{0ex}}\sum _{i,{i}^{\prime}}{U}_{i,{i}^{\prime}}^{2}\le 1,{\left|\right|\mathbf{w}\left|\right|}^{2}\le 1,{\left|\right|\mathbf{w}\left|\right|}_{1}\le s,{w}_{j}\ge 0\phantom{\rule{0.16667em}{0ex}}\forall j.$$

(15)

The **U**^{**} that optimizes (15) is proportional to {Σ_{j} *w*_{j}*d*_{i,i′,j}}_{i,i′}, a weighted version of the overall dissimilarity matrix; when some of the *w*_{j} equal zero, it involves only a subset of the features.

By inspection, (15) is *bi-convex* in **U** and **w**: with **w** fixed, it is convex in **U**, and with **U** fixed, it is convex in **w**. This leads to an extremely simple algorithm for optimizing it. However, before we present this algorithm, we introduce some additional notation that will prove useful. Let **D** ∈ ℝ^{n² × p} be the matrix in which column *j* consists of the elements {*d*_{i,i′,j}}_{i,i′}, strung out into a vector, and let **u** ∈ ℝ^{n²} similarly denote the elements {*U*_{i,i′}}_{i,i′}. Then (15) can be rewritten as

$$\underset{\mathbf{w},\mathbf{u}}{\text{maximize}}\{{\mathbf{u}}^{T}\mathbf{Dw}\}\phantom{\rule{0.16667em}{0ex}}\text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.16667em}{0ex}}{\left|\right|\mathbf{u}\left|\right|}^{2}\le 1,{\left|\right|\mathbf{w}\left|\right|}^{2}\le 1,{\left|\right|\mathbf{w}\left|\right|}_{1}\le s,{w}_{j}\ge 0\phantom{\rule{0.16667em}{0ex}}\forall j.$$

(16)

We now present our algorithm for sparse hierarchical clustering.

- Initialize **w** as ${w}_{1}=\dots ={w}_{p}=\frac{1}{\sqrt{p}}$.
- Iterate until convergence:
  - Update $\mathbf{u}=\frac{\mathbf{Dw}}{{\|\mathbf{Dw}\|}_{2}}$.
  - Update $\mathbf{w}=\frac{S({\mathbf{a}}_{+},\Delta)}{{\|S({\mathbf{a}}_{+},\Delta)\|}_{2}}$, where **a** = **D**^{T}**u** and Δ = 0 if this results in ||**w**||_{1} ≤ *s*; otherwise, Δ > 0 is chosen such that ||**w**||_{1} = *s*.
- Re-write **u** as an *n* × *n* matrix, **U**.
- Perform hierarchical clustering on the *n* × *n* dissimilarity matrix **U**.

Observe that (16) is the *sparse principal components* (SPC) criterion of Witten et al. (2009), with an additional non-negativity constraint on **w**. If *d*_{i,i′,j} ≥ 0 for all *i*, *i*′, *j* (as is the case for squared Euclidean distance), then the non-negativity constraint has no effect, since **u**, and hence **a** = **D**^{T}**u**, has non-negative elements.

When viewed in this way, our method for sparse hierarchical clustering is quite simple. The first SPC of the *n*² × *p* matrix **D** is denoted **w**. Then **u** ∝ **Dw** can be re-written as an *n* × *n* matrix **U**, which is a weighted linear combination of the feature-wise dissimilarity matrices. When *s* is small, some of the *w*_{j} will equal zero, and so **U** involves only a subset of the features.

In our implementation of this algorithm, we used (13) as a stopping criterion in Step 2. In the examples that we considered, the stopping criterion generally was satisfied within 10 iterations. As mentioned earlier, the criterion (16) is bi-convex in **u** and **w**, and we are not guaranteed convergence to a global optimum using this iterative algorithm.
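Steps 1 through 3 of this algorithm fit in a short numpy sketch (our own construction, using squared Euclidean per-feature dissimilarities; step 4 can then be carried out by any standard hierarchical clustering routine applied to **U**):

```python
import numpy as np

def soft_threshold(x, c):
    return np.sign(x) * np.maximum(np.abs(x) - c, 0)

def sparse_hier_dissimilarity(X, s, max_iter=20):
    """Alternate the u- and w-updates on the n^2 x p matrix D, then
    reshape u into the n x n weighted dissimilarity matrix U."""
    n, p = X.shape
    # Column j of D holds the n^2 values d_{i,i',j} = (X_ij - X_i'j)^2
    D = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(n * n, p)
    w = np.full(p, 1 / np.sqrt(p))                        # step 1
    for _ in range(max_iter):
        u = D @ w
        u = u / np.linalg.norm(u)                         # step 2(a)
        a_plus = np.maximum(D.T @ u, 0)                   # step 2(b): a = D'u
        w_new = a_plus / np.linalg.norm(a_plus)
        if np.sum(w_new) > s:                             # bisect for delta
            lo, hi = 0.0, float(a_plus.max())
            for _ in range(60):
                mid = (lo + hi) / 2
                v = soft_threshold(a_plus, mid)
                v = v / np.linalg.norm(v)
                lo, hi = (mid, hi) if np.sum(v) > s else (lo, mid)
            w_new = v
        if np.abs(w_new - w).sum() / np.abs(w).sum() < 1e-4:   # rule (13)
            w = w_new
            break
        w = w_new
    U = (D @ w).reshape(n, n)                             # step 3
    return U, w
```

On two classes separated on the first few features, the returned weights concentrate on those features, and the between-class entries of **U** dominate the within-class entries.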

We study the behaviors of sparse and standard hierarchical clustering under a simple model. Suppose that the *n* observations fall into two classes, *C*_{1} and *C*_{2}, which differ only with respect to the first *q* features. The elements *X*_{ij} are independent and normally distributed, with a mean shift between the two classes in the first *q* features:

$${X}_{ij}\sim \begin{cases}N({\mu}_{j}+c,{\sigma}^{2})&\text{if}\ j\le q,\ i\in {C}_{1},\\ N({\mu}_{j},{\sigma}^{2})&\text{otherwise}.\end{cases}$$

(17)

Note that for *i* ≠ *i*′,

$${X}_{ij}-{X}_{{i}^{\prime}j}~\{\begin{array}{ll}N(\pm c,2{\sigma}^{2})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}j\le q\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}i,{i}^{\prime}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{different}\phantom{\rule{0.16667em}{0ex}}\text{classes},\hfill \\ N(0,2{\sigma}^{2})\hfill & \text{otherwise}.\hfill \end{array}$$

(18)

Let ${d}_{i,{i}^{\prime},j}={({X}_{ij}-{X}_{{i}^{\prime}j})}^{2}$. Then

$${d}_{i,{i}^{\prime},j}~\{\begin{array}{ll}2{\sigma}^{2}{\chi}_{1}^{2}({\scriptstyle \frac{{c}^{2}}{2{\sigma}^{2}}})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}j\le q\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}i,{i}^{\prime}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{different}\phantom{\rule{0.16667em}{0ex}}\text{classes},\hfill \\ 2{\sigma}^{2}{\chi}_{1}^{2}\hfill & \text{otherwise},\hfill \end{array}$$

(19)

where
${\chi}_{1}^{2}(\lambda )$ denotes the noncentral
${\chi}_{1}^{2}$ distribution with noncentrality parameter *λ*. This means that the overall dissimilarity matrix used by standard hierarchical clustering has off-diagonal elements

$$\sum _{j}{d}_{i,{i}^{\prime},j}~\{\begin{array}{ll}2{\sigma}^{2}{\chi}_{p}^{2}({\scriptstyle \frac{{qc}^{2}}{2{\sigma}^{2}}})\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}i,{i}^{\prime}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{different}\phantom{\rule{0.16667em}{0ex}}\text{classes},\hfill \\ 2{\sigma}^{2}{\chi}_{p}^{2}\hfill & \text{otherwise},\hfill \end{array}$$

(20)

and so for *i* ≠ *i*′,

$$\text{E}({d}_{i,{i}^{\prime},j})=\{\begin{array}{ll}2{\sigma}^{2}+{c}^{2}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}j\le q\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}i,{i}^{\prime}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{different}\phantom{\rule{0.16667em}{0ex}}\text{classes},\hfill \\ 2{\sigma}^{2}\hfill & \text{otherwise},\hfill \end{array}$$

(21)

and

$$\text{E}(\sum _{j}{d}_{i,{i}^{\prime},j})=\{\begin{array}{ll}2p{\sigma}^{2}+{qc}^{2}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}i,{i}^{\prime}\phantom{\rule{0.16667em}{0ex}}\text{in}\phantom{\rule{0.16667em}{0ex}}\text{different}\phantom{\rule{0.16667em}{0ex}}\text{classes},\hfill \\ 2p{\sigma}^{2}\hfill & \text{otherwise}.\hfill \end{array}$$

(22)
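The expectations in (21) follow from the identity E(*Z*²) = Var(*Z*) + [E(*Z*)]² applied to the differences in (18): for *i*, *i*′ in different classes and *j* ≤ *q*,

$$\text{E}({d}_{i,{i}^{\prime},j})=\text{E}\left[{({X}_{ij}-{X}_{{i}^{\prime}j})}^{2}\right]=2{\sigma}^{2}+{c}^{2},$$

and otherwise the mean term vanishes, giving E(*d*_{i,i′,j}) = 2*σ*². Summing over the *p* features yields (22).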

We now consider the behavior of sparse hierarchical clustering. Suppose that ${w}_{j}\propto {1}_{(j\le q)}$; that is, suppose that the weights exactly identify the *q* features that differ between the classes. Then, up to a proportionality constant,

$$\sum _{j}{w}_{j}{d}_{i,{i}^{\prime},j}\sim \begin{cases}2{\sigma}^{2}{\chi}_{q}^{2}\left(\frac{{qc}^{2}}{2{\sigma}^{2}}\right)&\text{if}\ i,{i}^{\prime}\ \text{in different classes},\\ 2{\sigma}^{2}{\chi}_{q}^{2}&\text{otherwise}.\end{cases}$$

(23)

So in this ideal setting, the dissimilarity matrix used for sparse hierarchical clustering (23) is a denoised version of the dissimilarity matrix used for standard hierarchical clustering (20). Of course, in practice, *w*_{j} is not exactly proportional to ${1}_{(j\le q)}$, since **w** must be estimated from the data.

We now allow **w** to take a more general form. Recall that **w** is the first SPC of **D**, obtained by writing the elements {*d*_{i,i′,j}} as an *n*² × *p* matrix. Suppose that we replace **D** with its expectation E(**D**). Then the resulting weights satisfy

$${w}_{1}=\dots ={w}_{q}>{w}_{q+1}=\dots ={w}_{p}$$

(24)

from (21). To see this latter point, note that by the sparse hierarchical clustering algorithm, **w** is obtained by repeating the operation $\mathbf{w}=\frac{S({(\text{E}{(\mathbf{D})}^{T}\text{E}(\mathbf{D})\mathbf{w})}_{+},\Delta)}{{\|S({(\text{E}{(\mathbf{D})}^{T}\text{E}(\mathbf{D})\mathbf{w})}_{+},\Delta)\|}_{2}}$ until convergence. By (21), the first *q* columns of E(**D**) are identical to each other, as are the last *p* − *q* columns, and the former have the larger elements; iterating this operation therefore yields weights of the form (24).

From (21), the expectations of the off-diagonal elements of the dissimilarity matrix used for sparse hierarchical clustering are therefore

$$\mathrm{E}\left(\sum _{j}{w}_{j}{d}_{i,{i}^{\prime},j}\right)=\sum _{j}{w}_{j}\mathrm{E}({d}_{i,{i}^{\prime},j})=\begin{cases}2{\sigma}^{2}\sum _{j}{w}_{j}+{c}^{2}\sum _{j\le q}{w}_{j} & \text{if } i,{i}^{\prime}\text{ in different classes},\\ 2{\sigma}^{2}\sum _{j}{w}_{j} & \text{otherwise}.\end{cases}$$

(25)

By comparing (22) to (25), and using (24), we see that the expected dissimilarity between observations in different classes relative to observations in the same class is greater for sparse hierarchical clustering than for standard hierarchical clustering. Note that we have taken the weight vector **w** to be the first SPC of E(**D**), rather than the first SPC of **D**.

We now consider the problem of selecting a value for *s*, the *L*_{1} bound for **w** in the sparse hierarchical clustering criterion. We take an approach similar to that of Section 3.2 for sparse *K*-means, in this case letting $O(s) = \sum_j w_j \sum_{i,i^{\prime}} d_{i,i^{\prime},j}$ denote the objective value obtained with tuning parameter *s*, and comparing it to its value on permuted data in order to form a gap statistic.
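A small sketch of such a permutation comparison follows (squared-difference dissimilarities assumed; `weight_fn` is a hypothetical interface standing in for the weight-producing step of sparse clustering, not the sparcl API):

```python
import numpy as np

def dissimilarity_matrix(X):
    """n^2 x p matrix whose column j holds d_{i,i',j} = (X_ij - X_i'j)^2."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).reshape(-1, X.shape[1])

def gap_statistic(X, weight_fn, n_perm=10, seed=0):
    """log of the objective sum_j w_j sum_{i,i'} d_{i,i',j} on the data,
    minus its average over data sets with each column permuted separately.
    weight_fn(D) returns the feature weights for the n^2 x p matrix D."""
    rng = np.random.default_rng(seed)

    def objective(Z):
        D = dissimilarity_matrix(Z)
        return float(weight_fn(D) @ D.sum(axis=0))

    o = objective(X)
    perms = []
    for _ in range(n_perm):
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
        perms.append(objective(Xp))
    return np.log(o) - np.mean(np.log(perms))
```

One instructive property: for any *fixed* weight vector the gap is identically zero, since permuting values within a column leaves that column's sum of pairwise squared differences unchanged; the statistic is informative only because the estimated **w** adapts to the cluster structure present in the unpermuted data.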

We demonstrate the performance of this tuning parameter selection method on the simulated 6-class data set used for Figure 2. We performed standard, COSA, and sparse hierarchical clustering with complete linkage. The results can be seen in Figure 3. Sparse hierarchical clustering results in better separation between the subgroups. Moreover, the correct features are given non-zero weights.

**Figure 3.** Standard hierarchical clustering, COSA, and sparse hierarchical clustering with complete linkage were performed on simulated 6-class data. **1**, **2**, **3**: The color of each leaf indicates its class identity. CERs were computed by cutting each dendrogram at the **...**

In this example and throughout this paper, we used the squared difference $d_{i,i^{\prime},j} = (X_{ij} - X_{i^{\prime}j})^2$ as the dissimilarity measure; other dissimilarity measures, such as the absolute difference, could be used instead.
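As a concrete illustration (a minimal sketch of ours, not the sparcl implementation), the weighted dissimilarity matrix $\sum_j w_j d_{i,i^{\prime},j}$ for squared-difference dissimilarities can be computed as:

```python
import numpy as np

def weighted_dissimilarity(X, w):
    """Squared-difference dissimilarities d_{i,i',j} = (X_ij - X_i'j)^2,
    combined into the n x n matrix sum_j w_j d_{i,i',j}."""
    d = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p)
    return d @ w                               # shape (n, n)

# Tiny example: three observations, two features, weight only feature 1.
X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 0.0]])
U = weighted_dissimilarity(X, np.array([1.0, 0.0]))
print(U)  # [[0, 1, 4], [1, 0, 1], [4, 1, 0]]
```

The resulting matrix U can be passed directly to any hierarchical clustering routine that accepts a dissimilarity matrix.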

Standard hierarchical clustering is often dominated by a single group of features that have high variance and are highly correlated with each other. The same is true of sparse hierarchical clustering. Nowak & Tibshirani (2008) propose *complementary clustering*, a method that allows for the discovery of a secondary clustering after removing the signal found in the standard hierarchical clustering. Here we provide a method for *complementary sparse clustering*, an analogous approach for the sparse clustering framework. This simple method follows directly from our sparse hierarchical clustering proposal.

As in Section 4.1, we let **D** denote the *n*^{2}×*p* matrix of which column *j* consists of {*d _{i,i′,j}*}_{i,i′}, and we let **u**_{1} denote the first sparse principal component of **D**, found by the sparse hierarchical clustering algorithm. Then the criterion

$$\underset{{\mathbf{u}}_{2},{\mathbf{w}}_{2}}{\text{maximize}}\left\{{\mathbf{u}}_{2}^{T}\mathbf{D}{\mathbf{w}}_{2}\right\}\text{ subject to }\|{\mathbf{u}}_{2}\|^{2}\le 1,\ \|{\mathbf{w}}_{2}\|^{2}\le 1,\ \|{\mathbf{w}}_{2}\|_{1}\le s,\ {\mathbf{u}}_{1}^{T}{\mathbf{u}}_{2}=0,\ {w}_{2j}\ge 0\ \forall j$$

(26)

results in a dissimilarity matrix **U**_{2}, obtained by writing **u**_{2} as an *n* × *n* matrix, that yields a complementary sparse clustering. The feature weights for this secondary clustering are given by **w**_{2}. Note that (26) is simply the proposal of Witten et al. (2009) for finding the second SPC of **D** subject to orthogonality constraints, with an additional non-negativity constraint on **w**_{2}.

Observe that **U**_{2} is symmetric with zeroes on the diagonal, and that due to the constraint that
${\mathbf{u}}_{1}^{T}{\mathbf{u}}_{2}=0$, some elements of **U**_{2} will be negative. However, since only the off-diagonal elements of a dissimilarity matrix are used in hierarchical clustering, and since the shape of the dendrogram is not affected by adding a constant to the off-diagonal elements, in practice this is not a problem. The algorithm for complementary sparse clustering is as follows:

- Apply the sparse hierarchical clustering algorithm, and let **u**_{1} denote the resulting linear combination of the *p* feature-wise dissimilarity matrices, written in vector form.
- Initialize **w**_{2} as ${w}_{21}=\dots ={w}_{2p}=\frac{1}{\sqrt{p}}$.
- Iterate until convergence:
  - Update ${\mathbf{u}}_{2}=\frac{(\mathbf{I}-{\mathbf{u}}_{1}{\mathbf{u}}_{1}^{T})\mathbf{D}{\mathbf{w}}_{2}}{\|(\mathbf{I}-{\mathbf{u}}_{1}{\mathbf{u}}_{1}^{T})\mathbf{D}{\mathbf{w}}_{2}\|_{2}}$.
  - Update ${\mathbf{w}}_{2}=\frac{S({\mathbf{a}}_{+},\Delta)}{\|S({\mathbf{a}}_{+},\Delta)\|_{2}}$, where **a** = **D**^{T}**u**_{2} and Δ = 0 if this results in ||**w**_{2}||_{1} ≤ *s*; otherwise, Δ > 0 is chosen such that ||**w**_{2}||_{1} = *s*.
- Re-write **u**_{2} as an *n* × *n* matrix, **U**_{2}.
- Perform hierarchical clustering on **U**_{2}.

Of course, one could easily extend this procedure in order to obtain further complementary clusterings.
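The iteration above translates almost directly into code. The sketch below is our own simplification (the toy data, the bound *s* = 1.5, and the helper names are illustrative assumptions; the first component is computed by the same alternating updates without the orthogonality constraint) and recovers a secondary grouping hidden behind the dominant one:

```python
import numpy as np

def soft(a, delta):
    """Soft-thresholding operator S(a, delta)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def w_update(a, s, tol=1e-6):
    """w = S(a_+, delta)/||S(a_+, delta)||_2, with delta = 0 if that gives
    ||w||_1 <= s, and otherwise delta found by bisection so ||w||_1 = s."""
    a = np.maximum(a, 0.0)
    w = a / np.linalg.norm(a)
    if w.sum() <= s:
        return w
    lo, hi = 0.0, a.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        v = soft(a, mid)
        nrm = np.linalg.norm(v)
        if nrm > 0 and v.sum() / nrm > s:
            lo = mid
        else:
            hi = mid
    v = soft(a, hi)
    return v / np.linalg.norm(v)

def first_spc(D, s, n_iter=30):
    """First sparse principal component (u1, w1) of D, with w >= 0."""
    w = np.full(D.shape[1], 1.0 / np.sqrt(D.shape[1]))
    for _ in range(n_iter):
        u = D @ w
        u /= np.linalg.norm(u)
        w = w_update(D.T @ u, s)
    return u, w

def complementary_spc(D, u1, s, n_iter=30):
    """Second component (u2, w2), with u2 kept orthogonal to u1."""
    w = np.full(D.shape[1], 1.0 / np.sqrt(D.shape[1]))
    for _ in range(n_iter):
        u = D @ w - u1 * (u1 @ (D @ w))   # (I - u1 u1^T) D w2
        u /= np.linalg.norm(u)
        w = w_update(D.T @ u, s)
    return u, w

# Toy data with two overlaid groupings: features 0-2 strongly separate the
# two halves of the samples; features 3-5 more weakly separate even- from
# odd-indexed samples. (Illustrative data, not from the paper.)
rng = np.random.default_rng(0)
X = np.zeros((12, 6))
X[6:, :3] = 5.0
X[1::2, 3:] = 3.0
X = X + 0.05 * rng.normal(size=X.shape)   # small jitter to break exact ties
D = ((X[:, None, :] - X[None, :, :]) ** 2).reshape(-1, 6)   # n^2 x p

u1, w1 = first_spc(D, s=1.5)
u2, w2 = complementary_spc(D, u1, s=1.5)
# w1 concentrates on features 0-2; w2 concentrates on features 3-5,
# revealing the secondary grouping masked in the initial clustering.
```

Reshaping u2 into an *n* × *n* matrix and feeding it to a hierarchical clustering routine completes the procedure.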

In a well-known paper, Perou et al. (2000) used gene expression microarrays to profile 65 surgical specimens of human breast tumors. Some of the samples were taken from the same tumor before and after chemotherapy. The data are available at http://genome-www.stanford.edu/breast_cancer/molecularportraits/download.shtml. The 65 samples were hierarchically clustered using what we will refer to as “Eisen” linkage; this is a centroid-based linkage that is implemented in Michael Eisen’s Cluster program (Eisen et al. 1998). Two sets of genes were used for the clustering: the full set of 1753 genes, and an *intrinsic gene set* consisting of 496 genes. The intrinsic genes were defined as having the greatest level of variation in expression between different tumors relative to variation in expression between paired samples taken from the same tumor before and after chemotherapy. The dendrogram obtained using the intrinsic gene set was used to identify four classes – basal-like, Erb-B2, normal breast-like, and ER+ – to which 62 of the 65 samples belong. It was determined that the remaining three observations did not belong to any of the four classes. These four classes are not visible in the dendrogram obtained using the full set of genes, and the authors concluded that the intrinsic gene set is necessary to observe the classes. In Figure 4, two dendrograms obtained by clustering on the intrinsic gene set are shown. The first was obtained by clustering all 65 observations, and the second was obtained by clustering the 62 observations that were assigned to one of the four classes. The former figure appears in the original paper; the latter does not. In particular, note that the four classes are not clearly visible in the dendrogram obtained using only 62 observations.

**Figure 5.** Four hierarchical clustering methods were used to cluster the 62 observations that were assigned to one of four classes in Perou et al. (2000). Sparse clustering results in the best separation between the four classes. The color coding is as in Figure **...**

We wondered whether our proposal for sparse hierarchical clustering could yield a dendrogram that reflects the four classes, without any knowledge of the paired samples or of the intrinsic genes. We performed four versions of hierarchical clustering with Eisen linkage on the 62 observations that were assigned to the four classes:

- Sparse hierarchical clustering of all 1753 genes, with the tuning parameter chosen to yield 496 non-zero genes.
- Standard hierarchical clustering using all 1753 genes.
- Standard hierarchical clustering using the 496 genes with highest marginal variance.
- COSA hierarchical clustering using all 1753 genes.

The resulting dendrograms are shown in Figure 5. Sparse clustering of all 1753 genes with the tuning parameter chosen to yield 496 non-zero genes does best at capturing the four classes; in fact, a comparison with Figure 4 reveals that it does quite a bit better than clustering based on the intrinsic genes only! Figure 6 displays the result of performing the automated tuning parameter selection method. This resulted in 93 genes having non-zero weights.

**Figure 4.** Using the intrinsic gene set, hierarchical clustering was performed on all 65 observations (left panel) and on only the 62 observations that were assigned to one of the four classes (right panel). Note that the classes identified using all 65 observations **...**

**Figure 6.** The gap statistic was used to determine the optimal value of the tuning parameter for sparse hierarchical clustering. **Left:** The largest value of the gap statistic corresponds to 93 genes with non-zero weights. **Right:** The dendrogram corresponding to 93 **...**

Figure 7 shows that the gene weights obtained using sparse clustering are highly correlated with the marginal variances of the genes. However, the results obtained from sparse clustering are different from the results obtained by simply clustering on the high-variance genes (Figure 5). The reason for this lies in the form of the criterion (15). Though the non-zero *w _{j}*’s tend to correspond to genes with high marginal variances, sparse clustering does not simply cluster the genes with highest marginal variances. Rather, it weights each gene-wise dissimilarity matrix by a different amount.

We also performed complementary sparse clustering on the full set of 1753 genes, using the method of Section 4.4. Tuning parameters for the initial and complementary sparse clusterings were selected to yield 496 genes with non-zero weights. The complementary sparse clustering dendrogram is shown in Figure 8, along with a plot of **w**_{1} and **w**_{2} (the feature weights for the initial and complementary clusterings). The dendrogram obtained using complementary sparse clustering suggests a previously unknown pattern in the data. (Recall that the dendrogram for the initial sparse clustering can be found in Figure 5.)

**Figure 8.** Complementary sparse clustering was performed. Tuning parameters for the initial and complementary clusterings were selected to yield 496 genes with non-zero weights. **Left:** A plot of **w**_{1} against **w**_{2}. **Right:** The dendrogram for complementary sparse clustering. **...**

In response to a reviewer’s inquiry about the robustness of sparse hierarchical clustering, we repeatedly resampled the 62 observations that belong to one of the four classes and performed sparse hierarchical clustering on the resampled data sets. (A small jitter was added to the resampled observations in order to avoid samples having correlation one with each other.) We found that for the most part, the resulting clusters accurately reflected the true class labels.

We wondered whether one could use sparse clustering in order to identify distinct populations in single nucleotide polymorphism (SNP) data, and also to identify the SNPs that differ between the populations. A SNP is a nucleotide position in a DNA sequence at which genetic variability exists in the population. We used the publicly-available Haplotype Map (“HapMap”) data of the International HapMap Consortium (International HapMap Consortium 2005, International HapMap Consortium 2007). The Phase III SNP data for eleven populations are available from http://ftp.hapmap.org/genotypes/2008-07_phaseIII/hapmap_format/forward/; we used only the data for chromosome 22. We restricted the analysis to three of the populations: African ancestry in southwest USA, Utah residents with European ancestry, and Han Chinese from Beijing. We used the SNPs for which measurements are available in all three populations. The resulting data have dimension 315×17026. We coded AA as 2, Aa as 1, and aa as 0. Missing values were imputed using 5-nearest neighbors (Troyanskaya et al. 2001). Sparse and standard 3-means clustering were performed on the data. The CERs obtained using standard 3-means and sparse 3-means are shown in Figure 9; CER was computed by comparing the clustering class labels to the true population identity for each sample. When the tuning parameter in sparse clustering was chosen to yield between 198 and 2809 SNPs with non-zero weights, sparse clustering resulted in slightly lower CER than standard 3-means clustering. The main improvement of sparse clustering over standard clustering is in interpretability, since the non-zero elements of **w** determine the SNPs involved in the sparse clustering. We can use the weights obtained from sparse clustering to identify SNPs on chromosome 22 that distinguish between the populations (Figure 9). SNPs in a few genomic regions appear to be responsible for the clustering obtained.
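The preprocessing described above can be sketched as follows. This is a simplified stand-in: the genotype strings, the `encode`/`knn_impute` helpers, and the row-wise neighbor scheme are illustrative assumptions of ours, not the KNNimpute implementation of Troyanskaya et al. (2001):

```python
import numpy as np

# Code genotypes AA/Aa/aa as 2/1/0; "NN" marks a missing call (assumed label).
CODE = {"AA": 2.0, "Aa": 1.0, "aa": 0.0, "NN": np.nan}

def encode(genotypes):
    """genotypes: list of per-sample lists of genotype strings."""
    return np.array([[CODE[g] for g in row] for row in genotypes])

def knn_impute(X, k=5):
    """Fill each sample's missing SNPs with the average of its k nearest
    samples, where distance is the mean squared difference over SNPs
    observed in both samples."""
    X = X.copy()
    obs = ~np.isnan(X)
    for i in range(X.shape[0]):
        miss = ~obs[i]
        if not miss.any():
            continue
        shared = obs & obs[i]                        # SNPs observed in both rows
        diff = np.where(shared, X - X[i], 0.0)
        d = (diff ** 2).sum(axis=1) / np.maximum(shared.sum(axis=1), 1)
        d[i] = np.inf                                # exclude the sample itself
        nbrs = np.argsort(d)[:k]
        X[i, miss] = np.nanmean(X[nbrs][:, miss], axis=0)
    return X
```

The encoded, imputed matrix can then be handed to sparse or standard *K*-means as-is.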

In this example, the tuning parameter selection method of Section 3.2 does not perform well. Rather than selecting a tuning parameter that yields between 198 and 2809 SNPs with non-zero weights (resulting in the lowest CER), the highest gap statistic is obtained when all SNPs are used. The one standard deviation rule described in Section 3.2 results in a tuning parameter that yields 7160 SNPs with non-zero weights. The fact that the gap statistic seemingly overestimates the number of features with non-zero weights may reflect the need for a more accurate method for tuning parameter selection, or it may suggest the presence of further population substructure beyond the three population labels.

In this example, we applied sparse clustering to SNP data for which the populations were already known. However, the presence of unknown subpopulations in SNP data is often a concern, as population substructure can confound attempts to identify SNPs that are associated with diseases and other outcomes (see e.g. Price et al. 2006). In general, one could use sparse clustering to identify subpopulations in SNP data in an unsupervised way before further analyses are performed.

We have proposed a general framework for sparse clustering, and have introduced methods for sparse *K*-means clustering and sparse hierarchical clustering that are special cases in this framework. Other clustering methods can also be made sparse in this way; for instance, a method for sparse *K*-medoids is presented in the Appendix.

The approach of Section 4.1 involves reweighting a dissimilarity matrix. Hence it can be applied to any method that takes a dissimilarity matrix as its input, such as multidimensional scaling. Since this method involves computations on an *n*^{2} × *p* matrix, it has the potential to become quite slow if *n*, in addition to *p*, is very large.

Though our sparse clustering methods seem to perform well on real and simulated data, the performance of the gap statistic of Section 3.2 for selecting the tuning parameter is mixed. This is not surprising, since tuning parameter selection in the unsupervised setting is known to be a very difficult problem. This is an area in which more work is needed.

The R package sparcl implementing the methods proposed in this paper will be made available at http://cran.r-project.org/.

This manuscript was improved by helpful comments from the editor, an associate editor, and two anonymous referees. We thank Jerome Friedman, Nema Dean, Wei Pan, and Hui Zhou for making R code available and for useful comments. Trevor Hastie provided helpful suggestions. Daniela Witten was partially supported by a National Defense Science and Engineering Graduate Fellowship. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

In the case where *d* is squared Euclidean distance, the *K*-means criterion (7) is equivalent to

$$\underset{{C}_{1},\dots ,{C}_{K},{\mu}_{1},\dots ,{\mu}_{K}}{\text{minimize}}\left\{\sum _{k=1}^{K}\sum _{i\in {C}_{k}}\sum _{j=1}^{p}{({X}_{ij}-{\mu}_{kj})}^{2}\right\}$$

(27)

where *μ _{k}* ∈ ℝ^{p} is the centroid for cluster *k*; at the optimum, *μ _{k}* is the mean of the observations in cluster *k*.
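This equivalence rests on the identity $\sum_{i\in C_k}\|X_i-\mu_k\|^2 = \frac{1}{2|C_k|}\sum_{i,i^{\prime}\in C_k}\|X_i-X_{i^{\prime}}\|^2$ (summing over ordered pairs), which links the centroid form (27) to the pairwise-dissimilarity form. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))       # one cluster C_k with n_k = 8 observations in R^3
mu = x.mean(axis=0)               # the optimal centroid is the cluster mean

lhs = ((x - mu) ** 2).sum()       # sum_i ||x_i - mu_k||^2
d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
rhs = d.sum() / (2 * len(x))      # (1/(2 n_k)) * sum over ordered pairs

print(np.isclose(lhs, rhs))       # True
```

The identity follows by expanding $\|X_i - X_{i^{\prime}}\|^2$ and cancelling the cross terms against the cluster mean.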

In Section 2.2, we mentioned that any clustering method of the form (4) could be modified to obtain a sparse clustering method of the form (5). (However, for the resulting sparse method to have a non-zero weight for feature *j*, it is necessary that *f _{j}*(**X**_{j}, Θ) can take on positive values.) As an example, we now develop a sparse version of *K*-medoids clustering (see e.g. Kaufman & Rousseeuw 1990). The *K*-medoids criterion can be written as

$$\underset{{C}_{1},\dots ,{C}_{K},{i}_{1},\dots ,{i}_{K}}{\text{minimize}}\left\{\sum _{k=1}^{K}\sum _{i\in {C}_{k}}\sum _{j=1}^{p}{d}_{i,{i}_{k},j}\right\},$$

(28)

or equivalently

$$\underset{{C}_{1},\dots ,{C}_{K},{i}_{1},\dots ,{i}_{K},{i}_{0}}{\text{maximize}}\left\{\sum _{j=1}^{p}\left(\sum _{i=1}^{n}{d}_{i,{i}_{0},j}-\sum _{k=1}^{K}\sum _{i\in {C}_{k}}{d}_{i,{i}_{k},j}\right)\right\}$$

(29)

where *i*_{0} ∈ {1, …, *n*} is the index of the medoid for the full set of *n* observations. Since (29) is of the form (4), the criterion

$$\begin{array}{l}\underset{\mathbf{w},{C}_{1},\dots ,{C}_{K},{i}_{1},\dots ,{i}_{K},{i}_{0}}{\text{maximize}}\left\{\sum _{j=1}^{p}{w}_{j}\left(\sum _{i=1}^{n}{d}_{i,{i}_{0},j}-\sum _{k=1}^{K}\sum _{i\in {C}_{k}}{d}_{i,{i}_{k},j}\right)\right\}\\ \text{subject to }\|\mathbf{w}\|^{2}\le 1,\ \|\mathbf{w}\|_{1}\le s,\ {w}_{j}\ge 0\ \forall j\end{array}$$

(30)

results in a method for sparse *K*-medoids clustering, which can be optimized by an iterative approach.
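A minimal sketch of one such iterative approach follows. The initialization, the medoid-search heuristic, and the helper names are our own simplifications, not the authors' implementation; it alternates a weighted *K*-medoids step with the closed-form soft-thresholding update of **w**:

```python
import numpy as np

def soft(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def w_update(b, s, tol=1e-6):
    """w maximizing w^T b subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0:
    soft-threshold and normalize, with delta found by bisection."""
    b = np.maximum(b, 0.0)
    w = b / np.linalg.norm(b)
    if w.sum() <= s:
        return w
    lo, hi = 0.0, b.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        v = soft(b, mid)
        nrm = np.linalg.norm(v)
        if nrm > 0 and v.sum() / nrm > s:
            lo = mid
        else:
            hi = mid
    v = soft(b, hi)
    return v / np.linalg.norm(v)

def sparse_kmedoids(X, K, s, n_iter=10):
    """Sketch of an iterative optimization of (30): holding w fixed, run a
    simple weighted K-medoids step; holding the clusters fixed, update w."""
    n, p = X.shape
    d = (X[:, None, :] - X[None, :, :]) ** 2        # d_{i,i',j}, shape (n, n, p)
    w = np.full(p, 1.0 / np.sqrt(p))
    medoids = np.arange(K)                           # crude initialization
    for _ in range(n_iter):
        dw = d @ w                                   # weighted dissimilarities
        labels = np.argmin(dw[:, medoids], axis=1)   # assign to closest medoid
        for k in range(K):                           # re-choose each medoid
            members = np.where(labels == k)[0]
            if len(members):
                within = dw[np.ix_(members, members)].sum(axis=0)
                medoids[k] = members[np.argmin(within)]
        i0 = int(np.argmin(dw.sum(axis=0)))          # medoid of all n observations
        # b_j = sum_i d_{i,i0,j} - sum_k sum_{i in C_k} d_{i,i_k,j}, as in (30)
        b = d[:, i0, :].sum(axis=0) - sum(
            d[labels == k, medoids[k], :].sum(axis=0) for k in range(K))
        w = w_update(b, s)
    return labels, medoids, w
```

On data whose clusters differ only on a subset of features, the returned **w** concentrates on those features while uninformative features receive weight zero.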

- Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
- Chang W-C. On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society, Series C (Applied Statistics). 1983;32:267–275.
- Chipman H, Tibshirani R. Hybrid hierarchical clustering with applications to microarray data. Biostatistics. 2005;7:286–301.
- Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Statist Soc B. 1977;39:1–38.
- Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
- Fraley C, Raftery A. Model-based clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611–631.
- Friedman J, Meulman J. Clustering objects on subsets of attributes. J Roy Stat Soc, Ser B. 2004;66:815–849.
- Ghosh D, Chinnaiyan AM. Mixture modelling of gene expression data from microarray experiments. Bioinformatics. 2002;18:275–286.
- International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
- International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861.
- Kaufman L, Rousseeuw P. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley; New York: 1990.
- Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791.
- Lee DD, Seung HS. Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems (NIPS 2001). 2001.
- Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. Bayesian clustering with variable and transformation selections. Bayesian Statistics. 2003;7:249–275.
- Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics. 2009.
- MacQueen J. Some methods for classification and analysis of multivariate observations. In: LeCam LM, Neyman J, editors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Univ. of California Press; 1967. pp. 281–297.
- Maugis C, Celeux G, Martin-Magniette M-L. Variable selection for clustering with Gaussian mixture models. Biometrics. 2009;65:701–709.
- McLachlan GJ, Bean RW, Peel D. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002;18:413–422.
- McLachlan GJ, Peel D. Finite Mixture Models. John Wiley & Sons; New York, NY: 2000.
- McLachlan GJ, Peel D, Bean RW. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis. 2003;41:379–388.
- Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika. 1985;50:159–179.
- Nowak G, Tibshirani R. Complementary hierarchical clustering. Biostatistics. 2008;9(3):467–483.
- Pan W, Shen X. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research. 2007;8:1145–1164.
- Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX, Lonning PE, Borresen-Dale A-L, Brown PO, Botstein D. Molecular portraits of human breast tumours. Nature. 2000;406:747–752.
- Price AL, Patterson NJ, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909.
- Raftery A, Dean N. Variable selection for model-based clustering. J Amer Stat Assoc. 2006;101:168–178.
- Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66:846–850.
- Sugar CA, James GM. Finding the number of clusters in a dataset: an information-theoretic approach. Journal of the American Statistical Association. 2003;98:750–763.
- Tamayo P, Scanfeld D, Ebert BL, Gillette MA, Roberts CWM, Mesirov JP. Metagene projection for cross-platform, cross-species characterization of global transcriptional states. PNAS. 2007;104:5959–5964.
- Tibshirani R, Walther G. Cluster validation by prediction strength. J Comp Graph Stat. 2005;14(3):511–528.
- Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a dataset via the gap statistic. J Royal Statist Soc B. 2001;63(2):411–423.
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17:520–525.
- Wang S, Zhu J. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics. 2008;64:440–448.
- Witten D, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534.
- Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics. 2008;2:168–212.
