Suppose that we have an n × p data matrix X, consisting of p feature measurements on a set of n observations. We wish to perform an unsupervised analysis of this data set, such as clustering or PCA. We assume that the procedure of interest solves an optimization problem of the form

\[
\underset{\theta \in D}{\text{minimize}} \; f(X; \theta), \tag{4}
\]

where θ represents a set of parameters for the unsupervised learning operation that is restricted to belong to a set D. For instance, K-means clustering involves solving the problem

\[
\underset{C_1, \ldots, C_K;\; \mu_1, \ldots, \mu_K}{\text{minimize}} \; \sum_{k=1}^{K} \sum_{i \in C_k} \| X_i - \mu_k \|_2^2, \tag{5}
\]

where X_i ∈ ℝ^p denotes the ith observation (row) of the data matrix X, μ_1, …, μ_K ∈ ℝ^p denote the mean vectors for the K clusters, and C_1, …, C_K denote a partition of the n observations into K clusters, such that C_k ∩ C_{k′} = ∅ for k ≠ k′ and C_1 ∪ C_2 ∪ ⋯ ∪ C_K = {1, …, n}. PCA can also be written as the solution to an optimization problem: the first K principal components of X are the columns of the matrix V that solves

\[
\underset{U, D, V}{\text{minimize}} \; \| X - U D V^\top \|_F^2, \tag{6}
\]

where D is a K × K diagonal matrix, and U and V are n × K and p × K orthogonal matrices, respectively.
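Both formulations can be checked numerically. The sketch below (our own illustration, using NumPy; all variable names are ours) evaluates the K-means objective for a given partition and recovers the PCA objective from the truncated singular value decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # n = 50 observations, p = 4 features
K = 2

# K-means objective: sum over clusters of squared distances to cluster means.
def kmeans_objective(X, labels, mus):
    return sum(np.sum((X[labels == k] - mus[k]) ** 2) for k in range(len(mus)))

# PCA objective: squared Frobenius error of the best rank-K approximation,
# obtained from the truncated SVD X ≈ U D V^T (Eckart-Young).
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = U[:, :K] * d[:K] @ Vt[:K]          # rank-K reconstruction U D V^T
pca_objective = np.sum((X - X_hat) ** 2)   # equals the sum of discarded d_k^2
assert np.isclose(pca_objective, np.sum(d[K:] ** 2))
```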
Now, suppose that some of the observations in X are outliers. A simple model for this situation is as follows. We wish to learn the underlying signal in an n × p data matrix W via an unsupervised approach, but we do not observe W; instead, we observe X = W + E, where E is an n × p matrix of errors for the observations. Most of the observations do not have errors, and so most rows of E contain only zeros. However, some subset of the observations are outliers, and hence contain errors; these correspond to nonzero rows of E. These errors might potentially be very large.
If E were known, then our task would be simple: we could just perform an unsupervised analysis on W = X − E instead of X, leading to the optimization problem

\[
\underset{\theta \in D}{\text{minimize}} \; f(X - E; \theta) \tag{7}
\]

instead of (4). Unfortunately, this is not possible because E is unknown. In particular, we do not know which observations are outliers, much less the error terms associated with these outliers.

Therefore, rather than solving (7) as written, we propose to optimize it with respect to θ and E jointly. That is, we will estimate the errors for the observations, subject to a penalty intended to encourage just a few observations to be outliers. In particular, we propose to solve
\[
\underset{\theta \in D,\; E}{\text{minimize}} \; \left\{ f(X - E; \theta) + \sum_{i=1}^{n} P(\| E_i \|_2; \lambda) \right\}, \tag{8}
\]

where λ is a nonnegative tuning parameter, and where P(·; λ) is a penalty function that encourages sparsity and that is applied to the ℓ_2 norm of the ith error term E_i, i.e. it encourages the ith error term to be zero. There are a number of possible choices for the penalty function, which may be convex or nonconvex; some possibilities are surveyed in She (2011) and Mazumder et al. (2011). In the examples presented in this paper, we will take P(‖E_i‖_2; λ) = λ‖E_i‖_2. This is a group lasso penalty (Yuan & Lin 2007). When λ → ∞, then E_{ij} = 0 for all i and j, and so (8) reduces to (4). When λ = 0, the errors can become arbitrarily large; every observation will be assigned an error, leading in general to useless results. However, for an intermediate value of λ, just a subset of the observations will be assigned nonzero errors. These observations with nonzero errors are regarded as outliers. Unsupervised learning is performed at the same time as the outliers are identified, so outliers are defined with respect to the unsupervised approach used.
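With the group lasso penalty and a squared-error loss f (e.g. the K-means objective), updating a single error row E_i with θ held fixed reduces to minimizing ‖R_i − E_i‖_2² + λ‖E_i‖_2, where R_i is the ith residual row; this admits a closed-form row-wise soft-thresholding solution. A minimal sketch (function and variable names are our own):

```python
import numpy as np

def soft_threshold_rows(R, lam):
    """Row-wise group-lasso update: E_i = R_i * max(0, 1 - lam / (2 ||R_i||_2)).

    Minimizes ||R_i - E_i||_2^2 + lam * ||E_i||_2 independently for each row.
    Rows with ||R_i||_2 <= lam / 2 are set exactly to zero (non-outliers)."""
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / (2 * np.maximum(norms, 1e-12)))
    return R * scale

R = np.array([[3.0, 4.0],    # norm 5: survives thresholding at lam = 4
              [0.3, 0.4]])   # norm 0.5: shrunk exactly to zero
E = soft_threshold_rows(R, lam=4.0)  # rows scaled by 0.6 and 0, respectively
```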
In general, we will take an iterative approach to solving the problem (8), as outlined in Algorithm 1. Each iteration of Step 2 of Algorithm 1 decreases the objective. In the next two sections, we will apply this formulation for unsupervised learning with outliers to two specific problems, K-means clustering and PCA.
Algorithm 1
A descent algorithm for unsupervised learning with outliers
1. Initialize E, an n × p matrix of errors.
2. Iterate until convergence:
   (a) Holding E fixed, solve (8) with respect to θ. This amounts to performing unsupervised learning on the matrix X − E.
   (b) Holding θ fixed, solve (8) with respect to E. This amounts to updating the estimates of which observations are outliers, given the current value of θ.
3. Perform unsupervised learning (solve (4)) on the observations that were assigned zero errors, i.e. on the observations in the set {i : ‖E_i‖_2 = 0}.
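To make Algorithm 1 concrete, here is a minimal illustrative implementation for the K-means case with the group lasso penalty (our own sketch, not the authors' code; the choice of `lam`, the fixed iteration count, and the omission of the final refit in Step 3 are simplifications):

```python
import numpy as np

def kmeans_with_outliers(X, K, lam, n_iter=50, seed=0):
    """Alternate between K-means on X - E (Step 2a) and the closed-form
    group-lasso update of the error matrix E (Step 2b)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    E = np.zeros((n, p))                      # Step 1: initialize errors to zero
    mus = X[rng.choice(n, K, replace=False)]  # initial cluster means
    for _ in range(n_iter):
        W = X - E
        # Step 2a: one K-means pass on X - E (assignment + mean update).
        d2 = ((W[:, None, :] - mus[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                mus[k] = W[labels == k].mean(0)
        # Step 2b: holding the means fixed, update E row by row via
        # group-lasso soft thresholding of the residuals.
        R = X - mus[labels]
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        E = R * np.maximum(0.0, 1.0 - lam / (2 * np.maximum(norms, 1e-12)))
    outliers = np.linalg.norm(E, axis=1) > 0  # nonzero error rows
    return labels, mus, E, outliers

# Example: 20 inlier rows at the origin plus one gross outlier (K = 1
# keeps the illustration deterministic); the final row is flagged.
X = np.vstack([np.zeros((20, 2)), [[10.0, 0.0]]])
labels, mus, E, outliers = kmeans_with_outliers(X, K=1, lam=4.0)
```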
