Stat Interface. Author manuscript; available in PMC Jul 19, 2013.
Penalized unsupervised learning with outliers
Daniela M. Witten*
*Box 357232, Department of Biostatistics, University of Washington, Seattle WA 98195-7232
Daniela M. Witten: dwitten@u.washington.edu
We consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. It is known that in this setting, standard approaches for unsupervised learning can yield unsatisfactory results. For instance, in the presence of severe outliers, K-means clustering will often assign each outlier to its own cluster, or alternatively may yield distorted clusters in order to accommodate the outliers. In this paper, we take a new approach to extending existing unsupervised learning techniques to accommodate outliers. Our approach is an extension of a recent proposal for outlier detection in the regression setting. We allow each observation to take on an “error” term, and we penalize the errors using a group lasso penalty in order to encourage most of the observations’ errors to exactly equal zero. We show that this approach can be used in order to develop extensions of K-means clustering and principal components analysis that result in accurate outlier detection, as well as improved performance in the presence of outliers. These methods are illustrated in a simulation study and on two gene expression data sets, and connections with M-estimation are explored.
Keywords: unsupervised learning, outliers, robust, group lasso, k-means clustering, principal components analysis, M-estimation
It has long been known that in the presence of outliers, classical statistical methods such as linear regression can fail. For this reason, many proposals for detecting outliers in regression have been made (for a survey, see Rousseeuw & Leroy 1987, Rousseeuw & Hubert 2011). Recently, She & Owen (2011) proposed a new approach for detecting outliers in regression problems. In regression, one generally assumes the model
y = X\beta + \epsilon,    (1)
where y is an outcome vector of length n, X is an n × p design matrix, and ε is an n-vector of error terms. If some outliers are present among the observations, then a more accurate model might be
y = X\beta + \gamma + \epsilon,    (2)
where γ is a sparse n-vector whose nonzero elements correspond to outlying observations. To fit the model (2), She & Owen (2011) propose solving the optimization problem
\underset{\beta, \gamma}{\operatorname{minimize}} \left\{ \|y - X\beta - \gamma\|_2^2 + \sum_{i=1}^n P(\gamma_i; \lambda) \right\},    (3)
where P(γi; λ) is a penalty on γi that encourages sparsity, and λ is a tuning parameter. It is shown in She & Owen (2011) that there is a close connection between M-estimation and (3). For instance, if P(γi; λ) = λ|γi|, then solving (3) is equivalent to computing the M-estimate based on Huber’s loss function (Huber 1964).
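To make the role of γ concrete, the following minimal Python sketch (ours, not She & Owen’s implementation) fits the model (2) by alternating between a least squares step for β and a thresholding step for γ, under the assumption that the lasso-type penalty P(γi; λ) = λ|γi| is used; the function name and the fixed iteration count are our own choices.

import numpy as np

def outlier_regression(X, y, lam, n_iter=100):
    # Illustrative fit of model (2) by alternating minimization of (3),
    # assuming the lasso-type penalty P(gamma_i; lambda) = lambda * |gamma_i|.
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(n)
    for _ in range(n_iter):
        # beta-step: ordinary least squares on the adjusted response y - gamma
        beta, *_ = np.linalg.lstsq(X, y - gamma, rcond=None)
        # gamma-step: minimizing (3) over gamma soft-thresholds the residuals
        # at lambda / 2 (the squared error term in (3) carries no factor 1/2)
        r = y - X @ beta
        gamma = np.sign(r) * np.maximum(np.abs(r) - lam / 2, 0.0)
    return beta, gamma  # nonzero entries of gamma flag candidate outliers

Observations with nonzero γi are the ones flagged as outliers, and the resulting β behaves like a robust regression estimate.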
In this paper, we consider the problem of performing unsupervised learning in the presence of outliers – that is, observations that do not come from the same distribution as the rest of the data. Specifically, we investigate K-means clustering and principal components analysis (PCA) in the presence of outliers. Both of these methods can perform very poorly when outliers are present. A number of proposals have been made for modifying these techniques to accommodate outliers (among others, Jolion & Rosenfeld 1989, Dave 1991, Garcia-Escudero & Gordaliza 1999, Jiang et al. 2001, Fraley & Raftery 2002, Tseng & Wong 2005, Tseng 2007). We instead propose an approach for unsupervised learning in the presence of outliers that is motivated by the work of She & Owen (2011). The flexible framework that we propose can be applied to K-means clustering, PCA, and other unsupervised learning techniques.
In recent years, much effort has focused upon the use of penalties to perform feature selection in regression problems, and to a lesser extent in the classification setting (for instance, see Tibshirani 1996, Fan & Li 2001, Zou & Hastie 2005, Zou 2006, Witten & Tibshirani 2011). Other work has involved transferring some of the ideas developed for performing feature selection in supervised contexts to the unsupervised setting (examples include Friedman et al. 2007, Pan & Shen 2007, Xie et al. 2008, Witten et al. 2009, Witten & Tibshirani 2010, Guo et al. 2011). In this paper, rather than using penalties to perform feature selection in the unsupervised setting, we develop an approach for performing observation selection in the unsupervised context – that is, an approach to obtain not sparsity in the features, but rather sparsity in the observations, where observations are excluded if they appear not to arise from the unsupervised model being fit to the majority of the observations.
Figure 1 illustrates the performances of K-means clustering and our outlier K-means clustering proposal on a small simulated example that consists of two clusters, along with a number of outliers. Our outlier K-means proposal is able to successfully identify outliers and cluster the non-outlying observations.
Figure 1. (a): Three sets of observations were generated in two dimensions: two clusters (shown in red and in black) and a set of outliers that belong to neither cluster (shown in green). (b): The cluster assignments from K-means are shown in red and in black. (…)
The rest of this paper is organized as follows. In Section 2, we propose our approach for unsupervised learning in the presence of outliers. In Section 3, we discuss K-means clustering in the presence of outliers as well as its connection to a generalized version of K-means clustering using Huber’s loss function. A simulation study and an application to gene expression data are also presented. In Section 4 we discuss PCA in the presence of outliers, and present a simulation study. The discussion is in Section 5.
Suppose that we have an n × p data matrix X, consisting of p feature measurements on a set of n observations. We wish to perform an unsupervised analysis of this data set, such as clustering or PCA. We assume that the procedure of interest solves an optimization problem of the form
\underset{\theta \in D}{\operatorname{minimize}} \; f(X; \theta),    (4)
where θ represents a set of parameters for the unsupervised learning operation that is restricted to belong to a set D. For instance, K-means clustering involves solving the problem
\underset{C_1, \ldots, C_K, \mu_1, \ldots, \mu_K}{\operatorname{minimize}} \sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|_2^2,    (5)
where Xi ∈ Rp denotes the ith observation (row) of the data matrix X, μ1, …, μK ∈ Rp denote the mean vectors for the K clusters, and C1, …, CK denote a partition of the n observations into K clusters, such that Ck ∩ Ck′ = ∅ for k ≠ k′ and C1 ∪ C2 ∪ ⋯ ∪ CK = {1, …, n}. PCA can also be written as the solution to an optimization problem: the first K principal components of X are the columns of the matrix V that solves
\underset{U, D, V}{\operatorname{minimize}} \; \|X - UDV^T\|_F^2,    (6)
where D is a K × K diagonal matrix, and U and V are n × K and p × K orthogonal matrices, respectively.
Now, suppose that some of the observations in X are outliers. A simple model for this situation is as follows. We wish to learn the underlying signal in an n × p data matrix W via an unsupervised approach, but we do not observe W; we instead observe X = W + E, where E is an n × p matrix of errors for the observations. Most of the observations do not have errors, and so most rows of E contain only zeros. However, some subset of the observations are outliers, and hence contain errors. These correspond to nonzero rows of E. These errors might potentially be very large.
If E were known then our task would be simple: we could just perform an unsupervised analysis on W = X − E instead of X, leading to the optimization problem
\underset{\theta \in D}{\operatorname{minimize}} \; f(X - E; \theta)    (7)
instead of (4). Unfortunately, this is not possible because E is unknown. In particular, we do not know which observations are outliers, and much less the error terms associated with these outliers.
Therefore, rather than solving (7) as written, we propose to optimize it with respect to θ and E jointly. That is, we will estimate the errors for the observations, subject to a penalty intended to encourage just a few observations to be outliers. In particular, we propose to solve
\underset{\theta \in D, \, E}{\operatorname{minimize}} \left\{ f(X - E; \theta) + \sum_{i=1}^n P(\|E_i\|_2; \lambda) \right\},    (8)
where λ is a nonnegative tuning parameter, and where P(·; λ) is a penalty function that encourages sparsity and that is applied to the ℓ2 norm of the ith error term, i.e. it encourages the ith error term to be zero. There are a number of possible choices for the penalty function, which may be convex or non-convex; some possibilities are surveyed in She (2011) and Mazumder et al. (2011). In the examples presented in this paper, we will take P(||Ei||2; λ) = λ||Ei||2. This is a group lasso penalty (Yuan & Lin 2007). As λ → ∞, Eij = 0 for all i and j, and so (8) reduces to (4). When λ = 0, the errors can become arbitrarily large; each observation will be assigned an error, leading in general to useless results. However, for an intermediate value of λ, just a subset of the observations will be assigned nonzero errors. The observations with nonzero errors are thought to be outliers. Unsupervised learning is performed at the same time as the outliers are identified, so outliers are defined with respect to the unsupervised approach used.
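Because the group lasso penalty will be used repeatedly below, it is worth noting that it leads to a simple closed-form update for a single error vector. The following sketch (our own helper, not part of the original software) computes the minimizer of ||a − e||2^2 + λ||e||2; the same operation reappears in (12) and (26).

import numpy as np

def group_soft_threshold(a, lam):
    # Solve min_e ||a - e||_2^2 + lam * ||e||_2 for a vector a.
    # The minimizer shrinks a toward zero and is exactly zero when
    # ||a||_2 <= lam / 2, which is what makes most rows of E zero in (8).
    norm = np.linalg.norm(a)
    if norm <= lam / 2:
        return np.zeros_like(a)
    return a * (1.0 - lam / (2.0 * norm))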
In general, we will take an iterative approach to solving the problem (8), as outlined in Algorithm 1. Each iteration of Step 2 of Algorithm 1 decreases the objective. In the next two sections, we will apply this formulation for unsupervised learning with outliers to two specific problems, K-means clustering and PCA.
Algorithm 1
A descent algorithm for unsupervised learning with outliers
  1. Initialize E, an n × p matrix of errors.
  2. Iterate until convergence:
     (a) Holding E fixed, solve (8) with respect to θ. This amounts to performing unsupervised learning on the matrix X − E.
     (b) Holding θ fixed, solve (8) with respect to E. This amounts to updating the estimates of which observations are outliers, given the current value of θ.
  3. Perform unsupervised learning (solve (4)) on the observations that were assigned zero errors, i.e. on the observations in the set {i : ||Ei||2 = 0}.
3.1 The proposal
Suppose that we wish to cluster a set of observations, but we think that some of the observations are outliers that do not belong to any cluster. Rather than solving (5), we can instead solve the problem
\underset{C_1, \ldots, C_K, \mu_1, \ldots, \mu_K, E}{\operatorname{minimize}} \left\{ \sum_{k=1}^K \sum_{i \in C_k} \|X_i - E_i - \mu_k\|_2^2 + \sum_{i=1}^n P(\|E_i\|_2; \lambda) \right\}.    (9)
As previously described, E is a matrix of elements that allows for “errors” in X. That is, if the observation Xi does not seem to belong to any cluster, then Ei will take on a nonzero value such that Xi − Ei seems to belong to a cluster. P(||Ei||2; λ) is a penalty function that encourages sparsity in ||Ei||2; throughout this paper, we take P(||Ei||2; λ) = λ||Ei||2, a group lasso penalty. Then as λ → ∞, (9) becomes equivalent to the K-means clustering criterion (5), since the penalty for having a nonzero error becomes arbitrarily large and so all errors are zero. On the other hand, when λ = 0 we obtain a trivial result: there is no penalty for having nonzero errors, and so the errors will be such that the within-cluster sum of squares of X − E is zero, i.e. Ei = Xi − μC(i) for all i. For an intermediate value of λ, some of the observations – those that do not seem to belong to any cluster – will be assigned nonzero errors. We refer to (9) as the outlier K-means clustering optimization problem, and to the clustering procedure based on this criterion as outlier K-means clustering.
An example is shown in Figure 1, using a group lasso penalty. Outlier K-means clustering successfully identifies the outlying observations and assigns large errors to them, allowing for correct discovery of the two true clusters.
Algorithm 2 is a descent algorithm for performing outlier K-means clustering. That is, each step will decrease the objective of (9). Note that in Step 2(b), the problem (11) is convex if P(·; λ) is convex, in which case (12) solves it exactly. However, in Step 2(a), (10) is a non-convex problem, since K-means clustering is non-convex, and so only a local optimum will be obtained.
Algorithm 2
A descent algorithm for outlier K-means clustering
  1. Initialize the errors – e.g. take Ei = 0 for the 90% of observations that are closest to the overall mean of the observations in terms of Euclidean distance, and Ei = Xi for the others.
  2. Iterate until the objective (9) converges:
     (a) Perform K-means clustering on the matrix X − E. That is, solve
         \underset{C_1, \ldots, C_K, \mu_1, \ldots, \mu_K}{\operatorname{minimize}} \sum_{k=1}^K \sum_{i \in C_k} \|X_i - E_i - \mu_k\|_2^2.    (10)
     (b) For each observation i = 1, …, n, solve the problem
         \underset{E_i}{\operatorname{minimize}} \left\{ \|X_i - E_i - \mu_{C(i)}\|_2^2 + P(\|E_i\|_2; \lambda) \right\},    (11)
         where C(i) indicates the current cluster assignment of the ith observation, i.e. C(i) = k if i ∈ Ck. If P(||Ei||2; λ) = λ||Ei||2, the solution takes the form (Yuan & Lin 2007)
         \hat{E}_i = \left( X_i - \mu_{C(i)} \right) \left( 1 - \frac{\lambda}{2 \|X_i - \mu_{C(i)}\|_2} \right)_+.    (12)
  3. Perform K-means clustering on the observations that were assigned zero errors, i.e. on the observations in the set {i : ||Ei||2 = 0}.
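A rough Python sketch of Algorithm 2 is given below (ours, not the implementation used for the results in this paper); the use of scikit-learn’s KMeans, the fixed number of outer iterations, and the small numerical guard are our own choices, and the error update is (12).

import numpy as np
from sklearn.cluster import KMeans

def outlier_kmeans(X, K, lam, max_iter=20):
    # Sketch of Algorithm 2 with the group lasso penalty.
    n, p = X.shape
    # Step 1: zero errors for the 90% of observations closest to the overall
    # mean; E_i = X_i for the remaining 10%.
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    E = np.zeros((n, p))
    far = d > np.quantile(d, 0.9)
    E[far] = X[far]
    for _ in range(max_iter):  # in practice, iterate until (9) converges
        # Step 2(a): K-means clustering on X - E
        km = KMeans(n_clusters=K, n_init=10).fit(X - E)
        labels, centers = km.labels_, km.cluster_centers_
        # Step 2(b): group lasso update (12) of each error row
        R = X - centers[labels]                     # X_i - mu_C(i)
        norms = np.maximum(np.linalg.norm(R, axis=1), 1e-12)
        E = R * np.maximum(0.0, 1.0 - lam / (2.0 * norms))[:, None]
    # Step 3: re-cluster the observations assigned zero errors
    zero = np.linalg.norm(E, axis=1) == 0
    final = KMeans(n_clusters=K, n_init=10).fit(X[zero])
    return zero, final, E   # observations with zero == False are the outliers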
3.2 Tuning parameter selection for outlier K-means clustering
We now consider the problem of selecting the tuning parameter λ, assuming that K is known. We would like to be conservative in calling observations outliers – we do not want to call an observation an outlier unless we are quite confident that it does not belong to any cluster. We choose the largest value of λ (corresponding to the smallest number of observations called outliers) such that no observation that is not called an outlier appears to be one. That is, we choose λ at the largest value such that no observation assigned zero error has ||XiμC(i)||2 larger than m(λ) + 3s(λ), where for a given value of λ, m(λ) is the mean of ||XiμC(i)||2 over all observations with zero error, and s(λ) is the standard deviation of ||XiμC(i)||2 over all observations with zero error. To implement this approach, we perform outlier K-means clustering over a grid of λ values.
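Continuing the sketch given after Algorithm 2 (and inheriting its assumptions), this selection rule can be written as a loop over a user-supplied grid of λ values; the helper below is ours.

import numpy as np

def select_lambda(X, K, lam_grid):
    # Return the largest lambda for which no zero-error observation lies more
    # than three standard deviations above the mean distance to its cluster
    # center, using the outlier_kmeans sketch defined earlier.
    for lam in sorted(lam_grid, reverse=True):
        zero, km, _ = outlier_kmeans(X, K, lam)
        d = np.linalg.norm(X[zero] - km.cluster_centers_[km.labels_], axis=1)
        if np.all(d <= d.mean() + 3 * d.std()):
            return lam
    return min(lam_grid)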
Throughout this paper, we assume that the number of clusters K is known, but in real applications this is often not the case. If K is unknown, then λ and K must be jointly selected in some way. Consider Figure 1. Given that K = 2, it is clear that a number of observations are outliers; however, if K were much larger, then each of the observations shown in green in panel (a) could be assigned to its own cluster, and no observations would be assigned nonzero errors. More simply put, it is impossible to distinguish a data set consisting of K clusters and a single outlier from a data set consisting of K + 1 clusters and no outliers, unless one is willing to make an assumption about the number of clusters or the properties of those clusters. This is a complex issue, and in what follows we assume that K is known.
3.3 Connection with M-estimation
Our proposal for unsupervised learning in the presence of outliers is motivated by a recent proposal in the regression setting (She & Owen 2011). In that paper, it was shown that there is a deep connection between performing regression in the presence of outliers, using the model (3), and M-estimation. We now show that there is a very close connection between our proposal for outlier K-means, and a generalized version of K-means given by
\underset{C_1, \ldots, C_K, \mu_1, \ldots, \mu_K}{\operatorname{minimize}} \sum_{k=1}^K \sum_{i \in C_k} \rho(\|X_i - \mu_k\|_2; \lambda),    (13)
where ρ(t; λ) is some loss function (Garcia-Escudero & Gordaliza 1999). Suppose that for a given penalty function P(·; λ), the problem
\underset{e \in \mathbb{R}^p}{\operatorname{minimize}} \left\{ \|x - e\|_2^2 + P(\|e\|_2; \lambda) \right\}    (14)
has the solution
\hat{e} = \frac{x}{\|x\|_2} \, \Theta(\|x\|_2; \lambda),    (15)
where Θ(·; λ) is a thresholding function (discussed extensively in She 2009, She 2011). Consider the optimization problem (9) with C1, …, CK fixed. Then it is not hard to see that an iterative algorithm that successively holds E1, …, En fixed and solves for μ1, …, μK and then holds μ1, …, μK fixed and solves for E1, …, En will decrease the objective at each step. If this algorithm is iterated until the objective converges, then by inspection, the solution will satisfy, for k = 1, …, K,
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} \left( X_i - \frac{X_i - \mu_k}{\|X_i - \mu_k\|_2} \, \Theta(\|X_i - \mu_k\|_2; \lambda) \right).    (16)
Proposition 1
Suppose that ρ′(t; λ) = 2(t − Θ(t; λ)). Then, (16) implies that
\sum_{i \in C_k} \rho'(\|X_i - \mu_k\|_2; \lambda) \, \frac{X_i - \mu_k}{\|X_i - \mu_k\|_2} = 0.    (17)
In other words, the solution to (9) with C1, …, CK held fixed satisfies the score equation associated with (13) with C1, …, CK held fixed.
Proposition 1 indicates that there is a connection between the outlier K-means clustering problem (9) and the generalized K-means problem (13) when C1, …, CK are held fixed. For example, consider the use of a group lasso penalty P(||Ei||2; λ) = λ||Ei||2, and let ρ(t; λ) be Huber’s loss function, given by (Huber 1964)
\rho(t; \lambda) = \begin{cases} t^2 & \text{if } |t| \le \lambda/2, \\ \lambda |t| - \lambda^2/4 & \text{if } |t| > \lambda/2. \end{cases}    (18)
Then, it is easily shown that the condition of Proposition 1 is satisfied: the group lasso penalty yields the thresholding function Θ(t; λ) = (t − λ/2)+ for t ≥ 0, and ρ′(t; λ) = 2(t − Θ(t; λ)), where
\rho'(t; \lambda) = \begin{cases} 2t & \text{if } |t| \le \lambda/2, \\ \lambda \, \mathrm{sign}(t) & \text{if } |t| > \lambda/2. \end{cases}    (19)
In other words, with C1, …, CK held fixed, generalized K-means with Huber’s loss function and outlier K-means with a group lasso penalty yield the same estimates μ1, …, μK.
Now, holding μ1, …, μK fixed, suppose we wish to solve (9) for C1, …, CK and E1, …, En.
Proposition 2
Holding μ1, …, μK fixed and minimizing (9), with a group lasso penalty, with respect to C1, …, CK and E1, …, En amounts to assigning the ith observation to the class for which ρ(||Xiμk||2; λ) is smallest, where ρ is Huber’s loss function.
Proof
Minimizing (9) with respect to E1, …, En and C1, …, CK amounts to assigning the ith observation to the class for which the quantity
\|X_i - \hat{E}_{ik} - \mu_k\|_2^2 + \lambda \|\hat{E}_{ik}\|_2    (20)
is minimized, where \hat{E}_{ik} = \left( X_i - \mu_k \right) \left( 1 - \frac{\lambda}{2 \|X_i - \mu_k\|_2} \right)_+. Note that (20) can be rewritten as
\rho(\|X_i - \mu_k\|_2; \lambda),    (21)
where ρ is Huber’s loss function (18). This completes the proof.
Together, Propositions 1 and 2 indicate that a generalized version of K-means with Huber’s loss function (13) is essentially equivalent to our outlier K-means clustering proposal with a group lasso penalty.
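For completeness, here is a short verification of the rewriting step used in the proof (our own, under the group lasso penalty). Substituting the minimizing error from (12), with μC(i) replaced by μk, gives

\min_{E_i} \left\{ \|X_i - E_i - \mu_k\|_2^2 + \lambda \|E_i\|_2 \right\}
  = \begin{cases}
      \|X_i - \mu_k\|_2^2 & \text{if } \|X_i - \mu_k\|_2 \le \lambda/2, \\
      \lambda \|X_i - \mu_k\|_2 - \lambda^2/4 & \text{if } \|X_i - \mu_k\|_2 > \lambda/2,
    \end{cases}

which is ρ(||Xi − μk||2; λ) as defined in (18); in the second case, ||Xi − Êik − μk||2 = λ/2 and ||Êik||2 = ||Xi − μk||2 − λ/2.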
3.4 A related proposal
Tseng (2007) proposed performing K-means clustering in the presence of outliers by solving
\underset{C_1, \ldots, C_K, \mu_1, \ldots, \mu_K, S}{\operatorname{minimize}} \left\{ \sum_{k=1}^K \sum_{i \in C_k} \|X_i - \mu_k\|_2^2 + \lambda |S| \right\},    (22)
where λ is a nonnegative tuning parameter and S is a set of observations thought to be outliers. That is, if an observation is in S, then it does not belong to any of the clusters C1, …, CK. In (22), |S| indicates the number of observations in the set S. Solving (22) is not directly addressed, but a descent algorithm could be obtained by clustering all of the observations, assigning any observations that are more than a distance √λ from the corresponding cluster mean to the set S, re-clustering all of the observations not in S, and so on. It turns out that this algorithm is very closely related to our outlier K-means proposal (9) using a hard-thresholding penalty, given by P(||Ei||2; λ) = λ1{||Ei||2 > 0} (see e.g. She 2009). Using this penalty, Step 2(b) in Algorithm 2 yields Ei = 0 if ||Xi − μC(i)||2 ≤ √λ and Ei = Xi − μC(i) otherwise. This is identical to the first iteration of the procedure of Tseng (2007). However, further iterations of our procedure will differ, because in Step 2(a) clustering is performed on X1 − E1, …, Xn − En, rather than only on the observations currently assigned zero errors.
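For illustration, the hard-thresholding version of the Step 2(b) update could be sketched as follows (a sketch only, assuming the √λ threshold implied by (22); the function name is ours).

import numpy as np

def hard_threshold_errors(X, centers, labels, lam):
    # Step 2(b) under a hard-thresholding penalty: an observation keeps a zero
    # error if it lies within sqrt(lambda) of its cluster mean, and otherwise
    # absorbs its entire deviation into E_i.
    R = X - centers[labels]
    norms = np.linalg.norm(R, axis=1)
    return np.where((norms > np.sqrt(lam))[:, None], R, 0.0)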
3.5 Simulation study
We generated data with 25 p-dimensional observations in each of K classes, along with q outliers. We used several values of K, p, and q: K = 2 with p = 10, K = 5 with p = 50, and q = 0, 5, 10. Each observation was generated independently; for an observation in class k,
X_{ij} \sim N(\mu_{kj}, \sigma^2) \quad \text{independently for } j = 1, \ldots, p,    (23)
where μk ∈ Rp denotes the mean vector for the kth class, and σ = 1 if K = 2 and σ = 0.5 if K = 5. The outlying observations were also drawn according to (23), after random assignment to one of the K clusters, with an additional independent noise term for each feature. These noise terms were drawn according to a Unif[(−6, −3) ∪ (3, 6)] distribution if K = 2, and according to a Unif[(−2, −1) ∪ (1, 2)] distribution if K = 5. Three clustering approaches were compared:
  • KM: K-means clustering.
  • OKM: Outlier K-means clustering, with the tuning parameter selection approach described previously and using a group lasso penalty.
  • MCLUST: Model-based clustering, allowing for outliers, as described in Fraley & Raftery (2002). The R package mclust was used, under the assumption of spherical covariance matrices with common variance (the same assumption that is made by K-means clustering). Note that the software automatically determines the number of outliers.
To assess the accuracy of the clusters obtained by each method, the clustering error rate (CER) is used. CER measures the extent to which two partitions R and Q of a set of n observations are in agreement with each other. For instance, R might be a partition obtained by clustering, whereas Q could be the true class labels. Let 1R(i,i′) be an indicator for whether partition R places the ith and i′th observation in the same group, and define 1Q(i,i′) analogously. Then the CER is defined as Σi>i′ |1R(i,i′) − 1Q(i,i′)| divided by the total number of pairs, n(n − 1)/2. It equals zero if the two partitions are in perfect agreement, and will take on a positive value if they disagree. Note that the CER is one minus the Rand Index (Rand 1971). Table 1 reports the CER in each simulation setting, where the outliers are coded as a separate class. For purposes of computing the CER, the partition estimated by K-means uses only K classes, since standard K-means clustering does not identify outliers; the other clustering methods use K + 1 classes. Table 1 also reports the outlier error rate (OER) – that is, the number of outliers erroneously thought to be non-outliers, plus the number of non-outliers erroneously thought to be outliers, divided by the total number of observations. Outlier K-means results in far lower CER than ordinary K-means, and tends to yield lower OER and CER than does model-based clustering (Table 1). However, a direct comparison between the performances of outlier K-means and MCLUST is challenging, since each approach identifies a different number of observations as outliers.
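The two error measures can be computed directly from their definitions; the helpers below are our own.

import numpy as np
from itertools import combinations

def cer(labels_r, labels_q):
    # Clustering error rate: fraction of observation pairs on which the two
    # partitions disagree about co-membership (one minus the Rand index).
    n = len(labels_r)
    disagree = sum(
        (labels_r[i] == labels_r[j]) != (labels_q[i] == labels_q[j])
        for i, j in combinations(range(n), 2)
    )
    return disagree / (n * (n - 1) / 2)

def oer(is_outlier_est, is_outlier_true):
    # Outlier error rate: misclassified outliers plus misclassified
    # non-outliers, divided by the total number of observations.
    return np.mean(np.asarray(is_outlier_est) != np.asarray(is_outlier_true))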
Table 1. In the simulation study described in the text, for various numbers of clusters (K) and outliers (q), K-means clustering (KM), outlier K-means clustering (OKM), and model-based clustering (MCLUST) were performed. Means and standard errors (over 50 repetitions) (…)
3.6 Application to gene expression data sets
We now study the performance of outlier K-means on two gene expression data sets.
The first data set consists of colon tissue samples for which 2000 gene expression measurements are available, and can be obtained from http://genomics-pubs.princeton.edu/oncology/ (Alon et al. 1999). There are 40 tumor samples and 22 normal samples. The data were log-transformed, and each observation was standardized to have mean zero and standard deviation one. Applying our outlier K-means proposal with a group lasso penalty to this data set (with the automated tuning parameter selection procedure described earlier) identifies two outliers, namely observations 3 and 57; both are tumor samples. K-means clustering results in a CER of 0.508, whereas outlier K-means results in a CER of 0.183. Results are displayed in Figure 2.
Figure 2. The colon gene expression data set of Alon et al. (1999). The 62 observations are projected onto the first two principal components. (a): Normal samples are shown in black and tumor samples are in red. (b): K-means clustering was performed, and the two (…)
We also performed K-means clustering on the glioma gene expression data set of Sun et al. (2006), which consists of 180 samples and 52613 gene expression measurements. The samples fall into four classes: one non-tumor class and three types of glioma. The data are available from Gene Expression Omnibus (Barrett et al. 2005) with accession number GDS1962. Only the 2% of genes with the highest variance (computed before standardization) were retained; these genes were standardized to have mean zero and standard deviation one, and K-means clustering and outlier K-means clustering were then performed. Outlier K-means identified seven outlying observations; these “outliers” were drawn from three of the four classes. The results are displayed in Figure 3.
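A sketch of the preprocessing just described (the samples-by-genes orientation of the matrix and the function name are our assumptions):

import numpy as np

def preprocess_expression(X, top_frac=0.02):
    # Keep the top 2% of genes by variance (computed before standardization),
    # then standardize each retained gene to mean zero, standard deviation one.
    v = X.var(axis=0)
    keep = v >= np.quantile(v, 1.0 - top_frac)
    Z = X[:, keep]
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)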
Figure 3. The glioma gene expression data set of Sun et al. (2006). The 180 observations are projected onto the first two principal components, after taking only the 1093 genes with highest variance. (a): The four classes are shown in distinct colors. (b): K-means (…)
4.1 The proposal
We now consider the problem of performing PCA when some of the observations are outliers. Rather than solving the problem (6), we instead solve the problem
\underset{U, D, V, E}{\operatorname{minimize}} \left\{ \|X - E - UDV^T\|_F^2 + \sum_{i=1}^n P(\|E_i\|_2; \lambda) \right\},    (24)
where as in (6), D is a K × K diagonal matrix, and U and V are n×K and p×K orthogonal matrices, respectively. We call this the outlier PCA optimization problem, and columns of the matrix V obtained by solving this problem the outlier principal components. Algorithm 3 provides an iterative approach that will decrease the objective of (24) at each step, but in general will not attain the global optimum since (24) is non-convex (indeed, PCA itself as given in (6) is a non-convex problem). We illustrate outlier PCA on a simple toy example in Figure 4.
Figure 4. A two-dimensional example. In each figure, non-outliers are shown in black and outliers are in red. (a): The observations are plotted and the first principal component is shown. (b): Outlier PCA was performed with a group lasso penalty, and Xi (…)
Algorithm 3
A descent algorithm for outlier PCA
  1. Initialize the errors – e.g. take Ei = 0 for the 90% of observations that are closest to the overall mean of the observations in terms of Euclidean distance, and Ei = Xi for the others.
  2. Iterate until the objective (24) converges:
     (a) Compute U, D, and V, the components of the rank-K singular value decomposition of the matrix X − E.
     (b) For i = 1, …, n, solve the problem
         \underset{E_i}{\operatorname{minimize}} \left\{ \|X_i - E_i - U_i D V^T\|_2^2 + P(\|E_i\|_2; \lambda) \right\},    (25)
         where Ui denotes the ith row of U. If P(||Ei||2; λ) = λ||Ei||2, then the solution takes the form (Yuan & Lin 2007)
         \hat{E}_i = \left( X_i - U_i D V^T \right) \left( 1 - \frac{\lambda}{2 \|X_i - U_i D V^T\|_2} \right)_+.    (26)
  3. Compute the principal components of the observations that were assigned zero errors, i.e. perform PCA on the observations in the set {i : ||Ei||2 = 0}.
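As with Algorithm 2, a rough Python sketch of Algorithm 3 is given below (ours, not the implementation used for the results in this paper); the fixed number of outer iterations and the numerical guard are our own choices, and the error update is (26).

import numpy as np

def outlier_pca(X, K, lam, max_iter=50):
    # Sketch of Algorithm 3 with the group lasso penalty.
    n, p = X.shape
    # Step 1: same initialization as in Algorithm 2.
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    E = np.zeros((n, p))
    far = d > np.quantile(d, 0.9)
    E[far] = X[far]
    for _ in range(max_iter):  # in practice, iterate until (24) converges
        # Step 2(a): rank-K SVD of X - E
        U, s, Vt = np.linalg.svd(X - E, full_matrices=False)
        low_rank = (U[:, :K] * s[:K]) @ Vt[:K]     # rows are U_i D V^T
        # Step 2(b): group lasso update (26) of each error row
        R = X - low_rank
        norms = np.maximum(np.linalg.norm(R, axis=1), 1e-12)
        E = R * np.maximum(0.0, 1.0 - lam / (2.0 * norms))[:, None]
    # Step 3: ordinary PCA, as in (6), on the zero-error observations
    zero = np.linalg.norm(E, axis=1) == 0
    V = np.linalg.svd(X[zero], full_matrices=False)[2][:K].T
    return V, zero, E   # columns of V are the outlier principal components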
In the examples that follow, we assume that K is known, we take P(·; λ) to be a group lasso penalty, and we select λ to be the largest value (corresponding to the smallest number of observations declared outliers) such that no observation assigned zero error has ||Xi − UiDV^T||2 greater than m(λ) + 3s(λ). For a given value of λ, m(λ) is defined to be the mean of ||Xi − UiDV^T||2 over all observations assigned zero errors, and s(λ) is defined to be the standard deviation of ||Xi − UiDV^T||2 over all observations assigned zero errors. In other words, we choose the smallest possible number of outliers so that the low-rank model fits the observations assigned zero errors well.
4.2 Related work
In a series of recent papers, a number of authors have considered the problem of performing PCA when an exactly low-rank n × p data matrix W is observed with noise. That is, rather than observing W, we instead observe X = W + E, where E is an n × p sparse noise matrix that results from the corruption of certain elements scattered at random throughout the data matrix. For this problem, one can solve (Lin et al. 2009, Wright et al. 2009, Candes et al. 2011)
\underset{W, E}{\operatorname{minimize}} \left\{ \|W\|_* + \lambda \sum_{i,j} |E_{ij}| \right\} \text{ subject to } X = W + E,    (27)
where λ is a nonnegative tuning parameter, and where ||·||* indicates the nuclear norm of a matrix, i.e. the sum of its singular values. It has been shown that under certain conditions one can exactly recover the low-rank matrix W. We will refer to the solution to the problem (27) as exact robust PCA, to emphasize its assumption that the underlying matrix W is exactly low rank. If we modify (27) in order to allow the matrix W to be approximately low-rank rather than exactly low-rank, and to encourage sparsity in the rows of E rather than in the individual elements, then we obtain our outlier PCA proposal (24). We note that our proposal is closely related to a formulation in Xu et al. (2010).
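For reference, the following is a minimal sketch of an augmented Lagrangian scheme of the kind used by Lin et al. (2009) and Candes et al. (2011) to solve (27); the default choices of λ and μ and the stopping rule are assumptions on our part, and the published implementations differ in their details.

import numpy as np

def exact_robust_pca(X, lam=None, mu=None, n_iter=500, tol=1e-7):
    # Solve (27) approximately: alternate singular value thresholding for W,
    # entrywise soft thresholding for E, and a dual update for Y.
    n, p = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(n, p))   # assumed default
    mu = mu if mu is not None else n * p / (4.0 * np.abs(X).sum())
    W, E, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_iter):
        # W-step: singular value thresholding of X - E + Y/mu at level 1/mu
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        W = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: entrywise soft thresholding at level lam/mu
        R = X - W + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual update and convergence check on the constraint X = W + E
        Y += mu * (X - W - E)
        if np.linalg.norm(X - W - E) <= tol * np.linalg.norm(X):
            break
    return W, E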
To compare the performances of our outlier PCA proposal (24) to that of the exact robust PCA proposal (27), we implemented the algorithm described in Candes et al. (2011) to solve (27). Figure 5 shows a comparison of the results obtained for the two approaches on a simulated example where 100 observations are generated according to a rank-one model with Gaussian noise in 10 dimensions, and there are 15 outliers. Outlier PCA performs better because it can accommodate the fact that the data, even with the outliers removed, is only approximately rank-one, and because the group lasso penalty exploits the fact that we expect entire observations to be outliers.
Figure 5. A comparison of outlier PCA with a group lasso penalty and exact robust PCA. Results are averaged over 50 simulated data sets. (a): Number of true positives versus number of false positives for outlier PCA and exact robust PCA. “True positive” (…)
4.3 Simulation study
To evaluate the performance of outlier PCA, we generated an (n + q) × p data matrix X, with n = 50 or n = 100 non-outlying observations in p = 5 dimensions, and q = 0, q = 5, or q = 10 outliers. The n + q observations were generated according to Xij = 50ui1vj1 + 10ui2vj2 + εij + δij1{i>n}, where u1, u2 and v1, v2 are orthogonal unit vectors of lengths n + q and p, respectively, and εij, δij are independent error terms: εij ~ N(0, 1) and δij ~ Unif[(−5, −3) ∪ (3, 5)]. The last q observations are outliers due to the additional noise terms δij. Four approaches for estimating the first two principal components were compared:
  • PCA: PCA with K = 2.
  • OPCA: Outlier PCA with K = 2, using a group lasso penalty and the tuning parameter selection approach described previously. Let q̂ denote the number of outliers identified.
  • OSPCA: “Outlier screening” followed by PCA. The q̂ observations with the largest mean distance to all other observations were called “outliers”. Then PCA with K = 2 was performed on the observations that were not identified as outliers.
  • PCAOS: PCA followed by “outlier screening”. PCA with K = 2 was performed on the full set of n + q observations. Then the q̂ observations for which the rank-two approximation fit the worst (in terms of Euclidean distance) were called “outliers”.
These latter two approaches were included in the comparisons since they constitute simple alternatives to outlier PCA.
To evaluate the accuracy of the principal component estimates, we computed the vector space agreement (VSA): namely, trace(P_true P_est)/2, where P_true is the orthogonal projection matrix onto the space spanned by the two true principal components used to generate the data, and P_est is the orthogonal projection matrix onto the space spanned by the two estimated principal components. This quantity lies between 0 and 1. It will be 0 if the spaces spanned by the true and estimated principal components are orthogonal to each other, and will be 1 if the two spaces are identical. We also computed the OER, defined in the previous section. Results are shown in Table 2. In general, outlier PCA and OSPCA yield the highest VSA. Outlier PCA and PCAOS have comparable OERs; that of OSPCA is substantially worse.
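The VSA can be computed directly from the two bases; the helper below is ours and assumes each basis is stored as a p × 2 matrix with the components as columns.

import numpy as np

def vsa(V_true, V_est):
    # Vector space agreement: trace(P_true P_est) / 2, where each P is the
    # orthogonal projection onto the column space of the corresponding basis.
    P_true = V_true @ np.linalg.pinv(V_true)
    P_est = V_est @ np.linalg.pinv(V_est)
    return np.trace(P_true @ P_est) / 2.0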
Table 2. In the simulation study described in the text, for various numbers of observations (n) and outliers (q), we performed PCA, outlier PCA (OPCA), outlier screening followed by PCA (OSPCA), and PCA followed by outlier screening (PCAOS). The means and standard (…)
In recent years, much effort has focused upon using shrinkage penalties to develop statistical methods that are sparse in the features. Such approaches were initially developed for the supervised setting (see e.g. Tibshirani 1996, Fan & Li 2001, Zou & Hastie 2005) but more recent work has focused upon feature selection in the unsupervised setting (see e.g. Pan & Shen 2007, Xie et al. 2008, Witten et al. 2009, Witten & Tibshirani 2010). Here instead we develop an approach for unsupervised learning that uses shrinkage penalties in order to obtain a result that is sparse in the observations, and is intended for use if outliers may be present among the observations. This general formulation can be used in order to combine outlier detection with any unsupervised learning method that can be written as the solution to an optimization problem.
Our proposal for unsupervised learning with outliers allows for any penalty to be applied to the ℓ2 norm of the error associated with an individual observation. For simplicity, in the examples throughout the paper, we used a group lasso penalty in order to identify outliers. However, it is argued in She & Owen (2011) that convex penalties are inherently non-robust in the context of outlier detection in regression, and the use of non-convex penalties is espoused. Therefore, it may be preferable to use a non-convex penalty, such as hard-thresholding or SCAD, on the ℓ2 norm of the error associated with each observation in (8). In the context of K-means clustering or PCA, this would require simply a different update in Step 2(b) of Algorithm 2 or 3 (She 2009, She 2011). Alternatively, if one has prior knowledge about the probability that a given observation is an outlier, then one might wish to modify the optimization problem (8) to allow a different penalty to be applied to each observation. Or if we believe that individual elements of the data matrix X, rather than entire observations, contain errors, then we could apply a penalty to individual elements of E, e.g. Σi,j P(|Eij|; λ). Indeed, this approach was taken in Candes et al. (2011).
We demonstrated that there is a close connection between our proposal for K-means clustering with outliers and a generalized version of K-means discussed by Garcia-Escudero & Gordaliza (1999). Those authors showed that regardless of the loss function used, “generalized K-means do[es] not inherit the robustness properties of the M-estimator from which they came”. This calls into question the extent to which our outlier K-means clustering proposal inherits those robustness properties; more investigation into this issue is needed.
Our proposal for unsupervised learning in the presence of outliers is simple and elegant, and can be applied to a range of problems. However, it also has some drawbacks. Though the problem (8) naturally lends itself to an iterative algorithm that decreases the objective at each step, this iterative approach will not, in general, yield the global optimum. Furthermore, the problem of selecting the tuning parameter that controls the number of outliers identified is quite challenging. Finally, in our simulation studies we have found that in many cases, outliers in the unsupervised context can be prohibitively difficult to identify, or else so easy to identify that even very simple methods (such as PCA followed by outlier screening) can do quite well.
Acknowledgments
This work was supported by NIH grant DP5OD009145. The author thanks an anonymous reviewer for helpful comments that improved the quality of this manuscript.
References
  • Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Nat Acad Sciences. 1999;96:6745–6750.
  • Barrett T, Suzek T, Troup D, Wilhite S, Ngau W, Ledoux P, Rudnev D, Lash A, Fujibuchi W, Edgar R. NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Research. 2005;33:D562–D566.
  • Candes E, Li X, Ma Y, Wright J. Robust principal component analysis? Journal of the ACM. 2011;58(3):1–37.
  • Dave R. Characterization and detection of noise in clustering. Pattern Recognition Letters. 1991;12:657–664.
  • Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  • Fraley C, Raftery A. Model-based clustering, discriminant analysis, and density estimation. J Amer Statist Assoc. 2002;97:611–631.
  • Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2007;9:432–441.
  • Garcia-Escudero L, Gordaliza A. Robustness properties of k means and trimmed k means. Journal of the American Statistical Association. 1999;94:956–969.
  • Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1–15.
  • Huber P. Robust estimation of a location parameter. Annals of Math Stat. 1964;35:73–101.
  • Jiang M, Tseng S, Su C. Two-phase clustering process for outliers detection. Pattern Recognition Letters. 2001;22:691–700.
  • Jolion J, Rosenfeld A. Cluster detection in background noise. Pattern Recognition. 1989;22(5):603–607.
  • Lin Z, Chen M, Wu L, Ma Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices; 2009.
  • Mazumder R, Friedman J, Hastie T. Sparsenet: coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106:1125–1138.
  • Pan W, Shen X. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research. 2007;8:1145–1164.
  • Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66:846–850.
  • Rousseeuw P, Hubert M. Robust statistics for outlier detection. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2011;1(1):73–79.
  • Rousseeuw P, Leroy A. Robust regression and outlier detection. Wiley; New York: 1987.
  • She Y. Thresholding-based iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics. 2009;3:384–415.
  • She Y. An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Computational Statistics and Data Analysis. 2011.
  • She Y, Owen A. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association. 2011;106(494):626–639.
  • Sun L, Hui A, Su Q, Vortmeyer A, Kotliarov Y, Pastorino S, Passaniti A, Menon J, Walling J, Bailey R, Rosenblum M, Mikkelsen T, Fine H. Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain. Cancer Cell. 2006;9:287–300.
  • Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statist Soc B. 1996;58:267–288.
  • Tseng G. Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data. Bioinformatics. 2007;23:2247–2255.
  • Tseng G, Wong W. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics. 2005;61:10–16.
  • Witten D, Tibshirani R. A framework for feature selection in clustering. Journal of the American Statistical Association. 2010;105(490):713–726.
  • Witten D, Tibshirani R. Penalized classification using Fisher’s linear discriminant. Journal of the Royal Statistical Society, Series B. 2011;73(5):753–772.
  • Witten D, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–534.
  • Wright J, Ganesh A, Rao S, Peng Y, Ma Y. Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. Proceedings of Neural Information Processing Systems; 2009.
  • Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electronic Journal of Statistics. 2008;2:168–212.
  • Xu H, Caramanis C, Sanghavi S. Robust PCA via outlier pursuit. Advances in Neural Information Processing Systems. 2010;23:2496–2504.
  • Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B. 2007;68:49–67.
  • Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
  • Zou H, Hastie T. Regularization and variable selection via the elastic net. J Royal Stat Soc B. 2005;67:301–320.