We consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored.
It has long been known that in the presence of outliers, classical statistical methods such as linear regression can fail. For this reason, many proposals for detecting outliers in regression have been made (for a survey, see Rousseeuw & Leroy 1987, Rousseeuw & Hubert 2011). Recently, She & Owen (2011) proposed a new approach for detecting outliers in regression problems. In regression, one generally assumes the model

y = Xβ + ε,  (1)
where y is an outcome vector of length n, X is an n × p design matrix, and ε an n-vector of error terms. If some outliers are present among the observations, then a more accurate model might be

y = Xβ + γ + ε,  (2)
where γ is a sparse n-vector whose nonzero elements correspond to outlying observations. To fit the model (2), She & Owen (2011) propose solving the optimization problem

minimize_{β, γ} (1/2)‖y − Xβ − γ‖²_2 + Σ_{i=1}^n P(γ_i; λ),  (3)
where P(γ_i; λ) is a penalty on γ_i that encourages sparsity, and λ is a tuning parameter. It is shown in She & Owen (2011) that there is a close connection between M-estimation and (3). For instance, if P(γ_i; λ) = λ|γ_i|, then solving (3) is equivalent to computing the M-estimate based on Huber’s loss function (Huber 1964).
In this paper, we consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. Specifically, we investigate K-means clustering and principal components analysis (PCA) in the presence of outliers. Both of these methods can perform very poorly when outliers are present. A number of proposals have been made for modifying these techniques to accommodate outliers (among others, Jolion & Rosenfeld 1989, Dave 1991, Garcia-Escudero & Gordaliza 1999, Jiang et al. 2001, Fraley & Raftery 2002, Tseng & Wong 2005, Tseng 2007). We instead propose an approach for unsupervised learning in the presence of outliers that is motivated by the work of She & Owen (2011). The flexible framework that we propose can be applied to K-means clustering, PCA, and other unsupervised learning techniques.
In recent years, much effort has focused upon the use of penalties to perform feature selection in regression problems, and to a lesser extent in the classification setting (for instance, see Tibshirani 1996, Fan & Li 2001, Zou & Hastie 2005, Zou 2006, Witten & Tibshirani 2011). Other work has involved transferring some of the ideas developed for performing feature selection in supervised contexts to the unsupervised setting (examples include Friedman et al. 2007, Pan & Shen 2007, Xie et al. 2008, Witten et al. 2009, Witten & Tibshirani 2010, Guo et al. 2011). In this paper, rather than using penalties to perform feature selection in the unsupervised setting, we develop an approach for performing observation selection in the unsupervised context – that is, an approach to obtain not sparsity in the features, but rather sparsity in the observations, where observations are excluded if they appear not to arise from the unsupervised model being fit to the majority of the observations.
Figure 1 illustrates the performances of K-means clustering and our outlier K-means clustering proposal on a small simulated example that consists of two clusters, along with a number of outliers. Our outlier K-means proposal is able to successfully identify outliers and cluster the non-outlying observations.
The rest of this paper is organized as follows. In Section 2, we propose our approach for unsupervised learning in the presence of outliers. In Section 3, we discuss K-means clustering in the presence of outliers, as well as its connection to a generalized version of K-means clustering using Huber’s loss function. A simulation study and an application to gene expression data are also presented. In Section 4 we discuss PCA in the presence of outliers, and present a simulation study. The discussion is in Section 5.
Suppose that we have an n × p data matrix X, consisting of p feature measurements on a set of n observations. We wish to perform an unsupervised analysis of this data set, such as clustering or PCA. We assume that the procedure of interest solves an optimization problem of the form

minimize_{θ ∈ D} f(X; θ),  (4)
where θ represents a set of parameters for the unsupervised learning operation that is restricted to belong to a set D. For instance, K-means clustering involves solving the problem

minimize over C_1, …, C_K and μ_1, …, μ_K: Σ_{k=1}^K Σ_{i ∈ C_k} ‖X_i − μ_k‖²_2,  (5)
where X_i ∈ ℝ^p denotes the ith observation (row) of the data matrix X, μ_1, …, μ_K ∈ ℝ^p denote the mean vectors for the K clusters, and C_1, …, C_K denote a partition of the n observations into K clusters, such that C_k ∩ C_{k′} = ∅ for k ≠ k′ and C_1 ∪ C_2 ∪ … ∪ C_K = {1, …, n}. PCA can also be written as the solution to an optimization problem: the first K principal components of X are the columns of the matrix V that solves

minimize_{U, D, V} ‖X − U D V^T‖²_F,  (6)
where D is a K × K diagonal matrix, and U and V are n × K and p × K matrices with orthonormal columns, respectively.
Now, suppose that some of the observations in X are outliers. A simple model for this situation is as follows. We wish to learn the underlying signal in an n × p data matrix W via an unsupervised approach, but we do not observe W; we instead observe X = W + E, where E is an n × p matrix of errors for the observations. Most of the observations do not have errors, and so most rows of E contain only zeros. However, some subset of the observations are outliers, and hence contain errors. These correspond to nonzero rows of E. These errors might potentially be very large.
If E were known, then our task would be simple: we could just perform an unsupervised analysis on W = X − E instead of X, leading to the optimization problem

minimize_{θ ∈ D} f(X − E; θ)  (7)
instead of (4). Unfortunately, this is not possible, because E is unknown. In particular, we do not know which observations are outliers, let alone the error terms associated with those outliers.
Therefore, rather than solving (7) as written, we propose to optimize it with respect to θ and E jointly. That is, we will estimate the errors for the observations, subject to a penalty intended to encourage just a few observations to be outliers. In particular, we propose to solve

minimize_{θ ∈ D, E} f(X − E; θ) + Σ_{i=1}^n P(‖E_i‖_2; λ),  (8)
where λ is a nonnegative tuning parameter, and P(·; λ) is a penalty function that encourages sparsity and that is applied to ‖E_i‖_2, the ℓ_2 norm of the ith error term; that is, it encourages the ith error term to equal zero. There are a number of possible choices for the penalty function, which may be convex or nonconvex; some possibilities are surveyed in She (2011) and Mazumder et al. (2011). In the examples presented in this paper, we will take P(‖E_i‖_2; λ) = λ‖E_i‖_2. This is a group lasso penalty (Yuan & Lin 2007). As λ → ∞, E_ij = 0 for all i and j, and so (8) just reduces to (4). When λ = 0, the errors can become arbitrarily large; each observation will be assigned an error, leading in general to useless results. However, for an intermediate value of λ, just a subset of the observations will be assigned nonzero errors. These observations with nonzero errors are thought to be outliers. Unsupervised learning is performed at the same time as the outliers are identified, so outliers are defined with respect to the unsupervised approach used.
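For the group lasso penalty, the update of a single error term has a simple closed form. Given a residual vector r (the observation minus its current fitted value), the minimizer of (1/2)‖r − e‖²_2 + λ‖e‖_2 over e is obtained by block soft-thresholding. A minimal sketch in Python/NumPy; the function name and the 1/2 normalization of the loss are our own conventions, not the paper's:

```python
import numpy as np

def group_soft_threshold(r, lam):
    """Minimize 0.5*||r - e||_2^2 + lam*||e||_2 over e.

    The solution is block soft-thresholding: shrink the whole
    residual vector toward zero, and set it exactly to zero
    once its norm falls below lam."""
    nrm = np.linalg.norm(r)
    if nrm <= lam:
        return np.zeros_like(r)
    return (1.0 - lam / nrm) * r
```

Note that the entire vector is either shrunk as a block or zeroed out as a block; this is what makes the penalty select whole observations rather than individual matrix entries.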
In general, we will take an iterative approach to solving the problem (8), as outlined in Algorithm 1. Each iteration of Step 2 of Algorithm 1 decreases the objective. In the next two sections, we will apply this formulation for unsupervised learning with outliers to two specific problems, K-means clustering and PCA.

Suppose that we wish to cluster a set of observations, but we think that some of the observations are outliers that do not belong to any cluster. Rather than solving (5), we can instead solve the problem

minimize over C_1, …, C_K, μ_1, …, μ_K, and E_1, …, E_n: Σ_{k=1}^K Σ_{i ∈ C_k} ‖X_i − E_i − μ_k‖²_2 + Σ_{i=1}^n P(‖E_i‖_2; λ).  (9)
As previously described, E is a matrix of elements that allows for “errors” in X. That is, if the observation X_i does not seem to belong to any cluster, then E_i will take on a nonzero value such that X_i − E_i seems to belong to a cluster. P(‖E_i‖_2; λ) is a penalty function that encourages sparsity in ‖E_i‖_2; throughout this paper, we take P(‖E_i‖_2; λ) = λ‖E_i‖_2, a group lasso penalty. Then as λ → ∞, (9) becomes equivalent to the K-means clustering criterion (5), since the penalty for having a nonzero error becomes arbitrarily large and so all errors are zero. On the other hand, when λ = 0 we obtain a trivial result: there is no penalty for having nonzero errors, and so the errors will be such that the within-cluster sum of squares of X − E is zero, i.e. E_i = X_i − μ_{C(i)} for each observation. For an intermediate value of λ, some of the observations – those that do not seem to belong to any cluster – will be assigned nonzero errors. We refer to (9) as the outlier K-means clustering optimization problem, and to the clustering procedure based on this criterion as outlier K-means clustering.
An example is shown in Figure 1, using a group lasso penalty. Outlier K-means clustering successfully identifies the outlying observations and assigns large errors to them, allowing for correct discovery of the two true clusters.
Algorithm 2 is a descent algorithm for performing outlier K-means clustering. That is, each step will decrease the objective of (9). Note that in Step 2(b), the problem (11) is convex if P(·; λ) is convex, in which case (12) solves it exactly. However, in Step 2(a), (10) is a nonconvex problem, since K-means clustering is nonconvex, and so only a local optimum will be obtained.
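To make the alternating descent concrete, the following is a hedged sketch of outlier K-means with a group lasso penalty. It is our own simplification, not the paper's reference implementation: it performs a single Lloyd step per outer iteration rather than a full K-means run in Step 2(a), and the initialization scheme (random rows, or user-supplied centers) is an assumption:

```python
import numpy as np

def outlier_kmeans(X, K, lam, init=None, n_iter=50, seed=0):
    """Alternating descent sketch for outlier K-means with a group
    lasso penalty: (a) a Lloyd update of labels and centers on the
    de-noised data X - E; (b) a block soft-threshold of each
    observation's residual to update its error term E_i."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    E = np.zeros_like(X, dtype=float)
    if init is None:
        centers = X[rng.choice(n, K, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float).copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        W = X - E
        # (a) assign each de-noised observation to its nearest center,
        # then recompute each cluster mean
        d = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = W[labels == k].mean(axis=0)
        # (b) group-lasso update: block soft-threshold each residual
        R = X - centers[labels]
        nrm = np.linalg.norm(R, axis=1, keepdims=True)
        E = np.maximum(0.0, 1.0 - lam / np.maximum(nrm, 1e-12)) * R
    return centers, labels, E
```

On a toy data set with two tight clusters and one distant point, only the distant point receives a nonzero error row, mirroring the behavior shown in Figure 1.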

We now consider the problem of selecting the tuning parameter λ, assuming that K is known. We would like to be conservative in calling observations outliers – we do not want to call an observation an outlier unless we are quite confident that it does not belong to any cluster. We therefore choose the largest value of λ (corresponding to the smallest number of observations called outliers) such that no observation that is not called an outlier appears to be one. That is, we choose the largest λ such that no observation assigned zero error has ‖X_i − μ_{C(i)}‖_2 larger than m(λ) + 3s(λ), where, for a given value of λ, m(λ) is the mean of ‖X_i − μ_{C(i)}‖_2 over all observations with zero error, and s(λ) is the standard deviation of ‖X_i − μ_{C(i)}‖_2 over all observations with zero error. To implement this approach, we perform outlier K-means clustering over a grid of λ values.
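The selection rule above can be sketched as follows. The interface is hypothetical (names are ours): `dists[j][i]` holds ‖X_i − μ_{C(i)}‖_2 for observation i after fitting at `lams[j]`, and `zero_mask[j][i]` is True when that observation was assigned zero error:

```python
import numpy as np

def select_lambda(lams, dists, zero_mask):
    """Return the largest lambda on the grid such that, among the
    observations assigned zero error, no residual distance exceeds
    the mean plus three standard deviations of those distances."""
    for lam, d, z in sorted(zip(lams, dists, zero_mask),
                            key=lambda trip: -trip[0]):
        dz = np.asarray(d)[np.asarray(z)]
        if dz.size and dz.max() <= dz.mean() + 3.0 * dz.std():
            return lam
    return min(lams)  # fall back to the smallest lambda on the grid
```

With a small number of observations the maximum standardized distance is bounded, so this rule is most meaningful when the number of zero-error observations is not tiny.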
Throughout this paper, we assume that the number of clusters K is known, but in real applications this is often not the case. If K is unknown, then λ and K must be jointly selected in some way. Consider Figure 1. Given that K = 2, it is clear that a number of observations are outliers; however, if K were much larger, then each of the observations shown in green in panel (a) could be assigned to its own cluster, and no observations would be assigned nonzero errors. More simply put, it is impossible to distinguish a data set consisting of K clusters and a single outlier from a data set consisting of K + 1 clusters and no outliers, unless one is willing to make an assumption about the number of clusters or the properties of those clusters. This is a complex issue, and in what follows we assume that K is known.
Our proposal for unsupervised learning in the presence of outliers is motivated by a recent proposal in the regression setting (She & Owen 2011). In that paper, it was shown that there is a deep connection between performing regression in the presence of outliers, using the model (3), and M-estimation. We now show that there is a very close connection between our proposal for outlier K-means and a generalized version of K-means given by

minimize over C_1, …, C_K and μ_1, …, μ_K: Σ_{k=1}^K Σ_{i ∈ C_k} ρ(‖X_i − μ_k‖_2; λ),  (13)
where ρ(t; λ) is some loss function (Garcia-Escudero & Gordaliza 1999). Suppose that for a given penalty function P(·; λ), the problem

minimize_γ (1/2)(t − γ)² + P(γ; λ)  (14)
has the solution

γ̂ = Θ(t; λ),  (15)
where Θ(·; λ) is a thresholding function (discussed extensively in She 2009, She 2011). Consider the optimization problem (9) with C_1, …, C_K fixed. Then it is not hard to see that an iterative algorithm that successively holds E_1, …, E_n fixed and solves for μ_1, …, μ_K, and then holds μ_1, …, μ_K fixed and solves for E_1, …, E_n, will decrease the objective at each step. If this algorithm is iterated until the objective converges, then by inspection, the solution will satisfy, for k = 1, …, K,

μ_k = (1/|C_k|) Σ_{i ∈ C_k} (X_i − E_i),  E_i = (Θ(‖X_i − μ_k‖_2; λ)/‖X_i − μ_k‖_2) (X_i − μ_k).  (16)
Suppose that ρ′(t; λ) = t − Θ(t; λ). Then, (16) implies that

Σ_{i ∈ C_k} ρ′(‖X_i − μ_k‖_2; λ) (X_i − μ_k)/‖X_i − μ_k‖_2 = 0.  (17)
In other words, the solution to (9) with C_{1}, …, C_{K} held fixed satisfies the score equation associated with (13) with C_{1}, …, C_{K} held fixed.
Proposition 1 indicates that there is a connection between the outlier K-means clustering problem (9) and the generalized K-means problem (13) when C_1, …, C_K are held fixed. For example, consider the use of a group lasso penalty P(‖E_i‖_2; λ) = λ‖E_i‖_2, and let ρ(t; λ) be Huber’s loss function (Huber 1964), given by

ρ(t; λ) = t²/2 if |t| ≤ λ,  ρ(t; λ) = λ|t| − λ²/2 if |t| > λ.  (18)
Then, it is easily shown that the condition of Proposition 1 is satisfied, since ρ′(t; λ) = t − Θ(t; λ), where

Θ(t; λ) = sign(t) (|t| − λ)_+  (19)

is the soft-thresholding operator associated with the group lasso penalty.
In other words, with C_1, …, C_K held fixed, generalized K-means with Huber’s loss function and outlier K-means with a group lasso penalty yield the same estimates μ_1, …, μ_K.
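The identity underlying this equivalence – that Huber's ψ function is exactly what soft-thresholding removes, ψ(t; λ) = t − Θ(t; λ) – can be checked numerically (a small verification sketch, with our own function names):

```python
import numpy as np

def soft_threshold(t, lam):
    """Soft-thresholding operator: sign(t) * max(|t| - lam, 0)."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def huber_psi(t, lam):
    """Derivative (psi function) of Huber's loss with threshold lam:
    equal to t on [-lam, lam] and clipped to +/- lam outside it."""
    return np.clip(t, -lam, lam)

# psi(t; lam) = t - Theta(t; lam) on a grid of t values
t = np.linspace(-10.0, 10.0, 101)
assert np.allclose(huber_psi(t, 2.0), t - soft_threshold(t, 2.0))
```

This is the scalar version of the relationship that Proposition 1 exploits row-wise through the ℓ_2 norms of the error terms.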
Now, holding μ_{1}, …, μ_{K} fixed, suppose we wish to solve (9) for C_{1}, …, C_{K} and E_{1}, …, E_{n}.
Holding μ_1, …, μ_K fixed and minimizing (9), with a group lasso penalty, with respect to C_1, …, C_K and E_1, …, E_n amounts to assigning the ith observation to the class for which ρ(‖X_i − μ_k‖_2; λ) is smallest, where ρ is Huber’s loss function.
Minimizing (9) with respect to E_1, …, E_n and C_1, …, C_K amounts to assigning the ith observation to the class k for which the quantity min_{E_i} {‖X_i − E_i − μ_k‖²_2 + P(‖E_i‖_2; λ)} is smallest. With a group lasso penalty, a direct calculation shows that this minimized quantity is Huber’s loss applied to ‖X_i − μ_k‖_2 (up to a rescaling of λ), which yields Proposition 2.
Together, Propositions 1 and 2 indicate that a generalized version of K-means with Huber’s loss function (13) is essentially equivalent to our outlier K-means clustering proposal with a group lasso penalty.
Tseng (2007) proposed performing K-means clustering in the presence of outliers by solving

minimize over S, C_1, …, C_K, and μ_1, …, μ_K: Σ_{k=1}^K Σ_{i ∈ C_k} ‖X_i − μ_k‖²_2 + λ|S|,  (22)

in which C_1, …, C_K partition the observations not in S,
where λ is a nonnegative tuning parameter and S is a set of observations thought to be outliers. That is, if an observation is in S, then it does not belong to any of the clusters C_1, …, C_K. In (22), |S| denotes the number of observations in the set S. Solving (22) is not directly addressed, but a descent algorithm could be obtained by clustering all of the observations, assigning any observation whose distance from the corresponding cluster mean exceeds a threshold determined by λ to the set S, re-clustering all of the observations not in S, and so on. It turns out that this algorithm is very closely related to our outlier K-means proposal (9) using a hard-thresholding penalty, given by P(‖E_i‖_2; λ) = (λ/2) 1{‖E_i‖_2 > 0} (see e.g. She 2009). Using this penalty, Step 2(b) in Algorithm 2 yields E_i = 0 if ‖X_i − μ_{C(i)}‖_2 is below that threshold, and E_i = X_i − μ_{C(i)} otherwise. This is identical to the first iteration of Tseng (2007)’s procedure. However, further iterations of our procedure will differ, because in Step 2(a) clustering is performed on X_1 − E_1, …, X_n − E_n rather than only on the observations currently assigned zero errors.
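Under the hard-thresholding penalty, the Step 2(b) update is an all-or-nothing comparison of two costs. A sketch, assuming the loss is written as (1/2)‖r − e‖²_2, so that e = 0 wins exactly when ‖r‖²_2 ≤ λ (the threshold depends on how the objective is normalized, so this particular cutoff is our assumption):

```python
import numpy as np

def hard_threshold_row(r, lam):
    """Update for the hard-thresholding penalty (lam/2)*1{||e||_2 > 0}
    with loss 0.5*||r - e||_2^2: either keep the full residual as the
    error (cost lam/2) or set it to zero (cost 0.5*||r||_2^2),
    whichever is cheaper."""
    return r.copy() if np.dot(r, r) > lam else np.zeros_like(r)
```

Unlike the group lasso update, there is no shrinkage: an observation is either left untouched or declared an outlier with its entire residual absorbed into E_i, which is what makes the first iteration coincide with Tseng (2007)'s procedure.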
We generated data with 25 p-dimensional observations in each of K classes, along with q outliers. We used several values of K, p, and q: K = 2 with p = 10, K = 5 with p = 50, and q = 0, 5, 10. Each observation was generated independently; for an observation in class k,

X_ij ∼ N(μ_kj, σ²),  (23)
where σ = 1 if K = 2 and σ = 0.5 if K = 5. The outlying observations were also drawn according to (23), after random assignment to one of the K clusters, with an additional independent noise term for each feature. These noise terms were drawn according to a Unif[(−6, −3) ∪ (3, 6)] distribution if K = 2, and according to a Unif[(−2, −1) ∪ (1, 2)] distribution if K = 5. Three clustering approaches were compared: K-means clustering, our outlier K-means proposal, and model-based clustering (MCLUST).
To assess the accuracy of the clusters obtained by each method, the clustering error rate (CER) is used. The CER measures the extent to which two partitions R and Q of a set of n observations are in agreement with each other. For instance, R might be a partition obtained by clustering, whereas Q could be the true class labels. Let 1_{R(i,i′)} be an indicator for whether partition R places the ith and i′th observations in the same group, and define 1_{Q(i,i′)} analogously. Then the CER is defined as Σ_{i > i′} |1_{R(i,i′)} − 1_{Q(i,i′)}| / (n(n − 1)/2). It equals zero if the two partitions are in perfect agreement, and takes a positive value if they disagree. Note that the CER is one minus the Rand index (Rand 1971). Table 1 reports the CER in each simulation setting, where the outliers are coded as a separate class. For purposes of computing the CER, the partition estimated by K-means uses only K classes, since standard K-means clustering does not identify outliers; the other clustering methods use K + 1 classes. Table 1 also reports the outlier error rate (OER) – that is, the number of outliers erroneously thought to be non-outliers, plus the number of non-outliers erroneously thought to be outliers, divided by the total number of observations. Outlier K-means results in far lower CER than ordinary K-means, and tends to yield lower OER and CER than does model-based clustering (Table 1). However, a direct comparison between the performances of outlier K-means and MCLUST is challenging, since each approach identifies a different number of observations as outliers.
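The CER can be computed directly from its definition as a sum over pairs of observations; a short sketch (quadratic in n, which is fine at the sample sizes considered here):

```python
import numpy as np
from itertools import combinations

def cer(r, q):
    """Clustering error rate between two partitions r and q, encoded
    as label vectors: the fraction of observation pairs (i, i') on
    which the two partitions disagree about co-membership. Equals
    one minus the Rand index."""
    r, q = np.asarray(r), np.asarray(q)
    n = len(r)
    pairs = list(combinations(range(n), 2))
    disagree = sum((r[i] == r[j]) != (q[i] == q[j]) for i, j in pairs)
    return disagree / len(pairs)
```

Because only co-membership of pairs matters, the CER is invariant to relabeling of the clusters, which is why no matching between estimated and true labels is needed.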
We now study the performance of outlier K-means on two gene expression data sets.
The first data set consists of colon tissue samples for which 2000 gene expression measurements are available, and can be obtained from http://genomicspubs.princeton.edu/oncology/ (Alon et al. 1999). There are 40 tumor samples and 22 normal samples. The data were log-transformed, and each observation was standardized to have mean zero and standard deviation one. Applying our outlier K-means proposal with a group lasso penalty to this data set (with the automated tuning parameter selection procedure described earlier) identifies two outliers, namely observations 3 and 57; both are tumor samples. K-means clustering results in a CER of 0.508, whereas outlier K-means results in a CER of 0.183. Results are displayed in Figure 2.
We also performed K-means clustering on the glioma gene expression data set of Sun et al. (2006), which consists of 180 samples and 52613 gene expression measurements. The samples fall into four classes, one non-tumor class and three types of glioma. The data are available from Gene Expression Omnibus (Barrett et al. 2005) with accession number GDS1962. Genes were standardized to have mean zero and standard deviation one; K-means clustering and outlier K-means clustering were then performed using only the 2% of genes with the highest variance before standardization. Outlier K-means identified seven outlying observations; these “outliers” were drawn from three of the four classes. The results are displayed in Figure 3.
We now consider the problem of performing PCA when some of the observations are outliers. Rather than solving the problem (6), we instead solve the problem

minimize_{U, D, V, E} ‖X − E − U D V^T‖²_F + Σ_{i=1}^n P(‖E_i‖_2; λ),  (24)
where, as in (6), D is a K × K diagonal matrix, and U and V are n × K and p × K orthogonal matrices, respectively. We call this the outlier PCA optimization problem, and the columns of the matrix V obtained by solving this problem the outlier principal components. Algorithm 3 provides an iterative approach that decreases the objective of (24) at each step, but in general will not attain the global optimum, since (24) is nonconvex (indeed, PCA itself, as given in (6), is a nonconvex problem). We illustrate outlier PCA on a simple toy example in Figure 4.
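A hedged sketch in the spirit of Algorithm 3 (our own simplification, not the paper's implementation) alternates between a rank-K SVD of the de-noised data and a block soft-threshold of each observation's residual, assuming a group lasso penalty:

```python
import numpy as np

def outlier_pca(X, K, lam, n_iter=100):
    """Alternating descent sketch for outlier PCA with a group lasso
    penalty: (a) the rank-K truncated SVD of X - E gives the current
    fit U D V^T; (b) each row of the residual X - U D V^T is block
    soft-thresholded to update the error matrix E."""
    E = np.zeros_like(X)
    fit = np.zeros_like(X)
    for _ in range(n_iter):
        # (a) best rank-K approximation of the de-noised data
        U, s, Vt = np.linalg.svd(X - E, full_matrices=False)
        fit = (U[:, :K] * s[:K]) @ Vt[:K]
        # (b) group-lasso update of the errors, row by row
        R = X - fit
        nrm = np.linalg.norm(R, axis=1, keepdims=True)
        E = np.maximum(0.0, 1.0 - lam / np.maximum(nrm, 1e-12)) * R
    return fit, E
```

On data that are exactly rank one apart from a single corrupted row, only that row receives a nonzero error, and the fitted subspace is essentially unaffected by the corruption.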

In the examples that follow, we assume that K is known, we take P(·; λ) to be a group lasso penalty, and we select λ to be the largest value (corresponding to the smallest number of observations declared outliers) such that no observation assigned zero error has ‖X_i − U_i D V^T‖_2 greater than m(λ) + 3s(λ). For a given value of λ, m(λ) is defined to be the mean of ‖X_i − U_i D V^T‖_2 over all observations assigned zero errors, and s(λ) is defined to be the standard deviation of ‖X_i − U_i D V^T‖_2 over all observations assigned zero errors. In other words, we choose the smallest possible number of outliers so that the low-rank model fits the observations assigned zero errors well.
In a series of recent papers, a number of authors have considered the problem of performing PCA in the case that an n × p data matrix W that is exactly low rank is observed with noise. That is, rather than observing W, we instead observe X = W + E, where E is an n × p sparse noise matrix that results from the corruption of certain elements scattered at random throughout the data matrix. For this problem, one can solve (Lin et al. 2009, Wright et al. 2009, Candes et al. 2011)

minimize_{W, E} ‖W‖_* + λ‖E‖_1 subject to W + E = X,  (27)
where λ is a nonnegative tuning parameter, and where ‖·‖_* indicates the nuclear norm of a matrix, i.e. the sum of its singular values. It has been shown that under certain conditions one can exactly recover the low-rank matrix W. We will refer to the solution to the problem (27) as exact robust PCA, to emphasize its assumption that the underlying matrix W is exactly low rank. If we modify (27) in order to allow the matrix W to be approximately low rank rather than exactly low rank, and to encourage sparsity in the rows of E rather than in its individual elements, then we obtain our outlier PCA proposal (24). We note that our proposal is closely related to a formulation in Xu et al. (2010).
To compare the performance of our outlier PCA proposal (24) to that of the exact robust PCA proposal (27), we implemented the algorithm described in Candes et al. (2011) to solve (27). Figure 5 shows a comparison of the results obtained for the two approaches on a simulated example where 100 observations are generated according to a rank-one model with Gaussian noise in 10 dimensions, and there are 15 outliers. Outlier PCA performs better because it can accommodate the fact that the data, even with the outliers removed, are only approximately rank one, and because the group lasso penalty exploits the fact that we expect entire observations to be outliers.
To evaluate the performance of outlier PCA, we generated an n × p data matrix X, with n = 50 or n = 100 observations in p = 5 dimensions, and q = 0, q = 5, or q = 10 outliers. The n + q observations were generated according to X_ij = 50 u_i1 v_j1 + 10 u_i2 v_j2 + ε_ij + δ_ij 1{i > n}, where u_1, u_2 and v_1, v_2 are orthogonal unit vectors of lengths n + q and p, respectively, and ε_ij, δ_ij are independent error terms: ε_ij ∼ N(0, 1) and δ_ij ∼ Unif[(−5, −3) ∪ (3, 5)]. The last q observations are outliers, due to the additional noise terms δ_ij. Four approaches for estimating the first two principal components were compared:
These latter two approaches were included in the comparisons since they constitute simple alternatives to outlier PCA.
To evaluate the accuracy of the principal component estimates, we computed the vector space agreement (VSA): namely, trace(P_{true}P_{est})/2, where P_{true} is the orthogonal projection matrix onto the space spanned by the two true principal components used to generate the data, and P_{est} is the orthogonal projection matrix onto the space spanned by the two estimated principal components. This quantity lies between 0 and 1. It will be 0 if the spaces spanned by the true and estimated principal components are orthogonal to each other, and will be 1 if the two spaces are identical. We also computed the OER, defined in the previous section. Results are shown in Table 2. In general, outlier PCA and OSPCA yield the highest VSA. Outlier PCA and PCAOS have comparable OERs; that of OSPCA is substantially worse.
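The VSA is straightforward to compute from the two bases. The helper below generalizes the trace(P_true P_est)/2 formula to K components by dividing by K (a sketch with our own naming):

```python
import numpy as np

def vsa(V_true, V_est):
    """Vector space agreement between two p x K matrices with
    orthonormal columns: trace(P_true @ P_est) / K, where P = V V^T
    is the orthogonal projection onto the span of V. Equals 1 when
    the two spans coincide and 0 when they are orthogonal."""
    P_true = V_true @ V_true.T
    P_est = V_est @ V_est.T
    return np.trace(P_true @ P_est) / V_true.shape[1]
```

Because the VSA depends only on the projection matrices, it is invariant to rotations of the basis within each subspace, so it compares the estimated subspace itself rather than the individual component vectors.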
In recent years, much effort has focused upon using shrinkage penalties to develop statistical methods that are sparse in the features. Such approaches were initially developed for the supervised setting (see e.g. Tibshirani 1996, Fan & Li 2001, Zou & Hastie 2005) but more recent work has focused upon feature selection in the unsupervised setting (see e.g. Pan & Shen 2007, Xie et al. 2008, Witten et al. 2009, Witten & Tibshirani 2010). Here instead we develop an approach for unsupervised learning that uses shrinkage penalties in order to obtain a result that is sparse in the observations, and is intended for use if outliers may be present among the observations. This general formulation can be used in order to combine outlier detection with any unsupervised learning method that can be written as the solution to an optimization problem.
Our proposal for unsupervised learning with outliers allows for any penalty to be applied to the ℓ_2 norm of the error associated with an individual observation. For simplicity, in the examples throughout the paper, we used a group lasso penalty in order to identify outliers. However, it is argued in She & Owen (2011) that convex penalties are inherently non-robust in the context of outlier detection in regression, and the use of nonconvex penalties is espoused. Therefore, it may be preferable to use a nonconvex penalty, such as hard-thresholding or SCAD, on the ℓ_2 norm of the error associated with each observation in (8). In the context of K-means clustering or PCA, this would require simply a different update in Step 2(b) of Algorithm 2 or 3 (She 2009, She 2011). Alternatively, if one has prior knowledge about the probability that a given observation is an outlier, then one might wish to modify the optimization problem (8) to allow a different penalty to be applied to each observation. Or, if we believe that individual elements of the data matrix X, rather than entire observations, contain errors, then we could apply a penalty to the individual elements of E, e.g. Σ_{i,j} P(|E_ij|; λ). Indeed, this approach was taken in Candes et al. (2011).
We demonstrated that there is a close connection between our proposal for K-means clustering with outliers and a generalized version of K-means discussed by Garcia-Escudero & Gordaliza (1999). Those authors showed that, regardless of the loss function used, “generalized K-means do[es] not inherit the robustness properties of the M-estimator from which they came”. This calls into question the extent to which our outlier K-means clustering proposal inherits those robustness properties; more investigation into this issue is needed.
Our proposal for unsupervised learning in the presence of outliers is simple and elegant, and can be applied to a range of problems. However, it also has some drawbacks. Though the problem (8) naturally lends itself to an iterative algorithm that decreases the objective at each step, this iterative approach will not, in general, yield the global optimum. Furthermore, the problem of selecting the tuning parameter that controls the number of outliers identified is quite challenging. Finally, in our simulation studies we have found that in many cases, outliers in the unsupervised context can be prohibitively difficult to identify, or else so easy to identify that even very simple methods (such as PCA followed by outlier screening) can do quite well.
This work was supported by NIH grant DP5OD009145. The author thanks an anonymous reviewer for helpful comments that improved the quality of this manuscript.