We formalize an algorithm for solving the L_{1}-norm best-fit hyperplane problem, derived from first principles and geometric insights about L_{1} projection and L_{1} regression. The procedure follows from a new proof of global optimality and relies on the solution of a small number of linear programs. The procedure is implemented for validation and testing. This analysis of the L_{1}-norm best-fit hyperplane problem makes the procedure accessible to applications in areas such as location theory, computer vision, and multivariate statistics.
Given points x_{i} ∈ ℝ^{m}, i = 1, …, n, consider the L_{p}-norm best-fit hyperplane problem,

minimize_{V, β, α_{i}} Σ_{i=1}^{n} ‖x_{i} − (Vα_{i} + β)‖_{p}, (1)

where ‖·‖_{p} is the L_{p}-norm of the argument, V ∈ ℝ^{m×(m−1)}, β ∈ ℝ^{m}, α_{i} ∈ ℝ^{m−1} for 1 ≤ i ≤ n, and p ≥ 1. A solution to this nonconvex mathematical program, (V^{*}, β^{*}, α_{i}^{*}), defines a hyperplane in ℝ^{m}, {x ∈ ℝ^{m} : x = V^{*}α + β^{*} for some α ∈ ℝ^{m−1}}, that minimizes the sum of the p-norm distances of the points to the hyperplane. By requiring β = 0, our results extend directly to subspaces. This representation of an affine set in terms of linear combinations of vectors in V has several specialized applications, such as providing information about the directions of dispersion in a point set with regard to the L_{p}-norm.
The case when p = 2 and β is fixed to zero is a well-studied problem dating back to Pearson [18]. The optimal solution V^{*} consists of the m − 1 eigenvectors of X^{T}X, where X^{T} = [x_{1}, x_{2}, …, x_{n}], corresponding to the m − 1 largest eigenvalues. The solution minimizes the sum of squared Euclidean distances of points to their orthogonal projections in the fitted subspace. The problem is related to the calculation of principal components in a traditional principal component analysis (PCA) [10].
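As a concrete illustration of the p = 2, β = 0 case, the following sketch (synthetic data; NumPy) recovers the best-fit subspace from the eigenvectors of X^{T}X:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # rows are the points x_i, so X^T = [x_1, ..., x_n]

# eigh returns eigenvalues in ascending order; the m - 1 = 2 eigenvectors
# with the largest eigenvalues span the best-fit subspace through the origin.
evals, evecs = np.linalg.eigh(X.T @ X)
V_star = evecs[:, 1:]
```

The columns of `V_star` are an orthonormal internal representation of the fitted subspace.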
This paper deals with the case when p = 1. The solution to this problem minimizes the sum of L_{1} distances of points to their L_{1} projections in the fitted hyperplane. Using only first principles about projection and linear programming, we establish a formal connection between L_{1} best-fit hyperplanes and L_{1} regression, which results in an efficient procedure based on the solution of a small number of linear programs (LPs). Our result consolidates and synthesizes more general results due to Mangasarian [14] about projections and Martini and Schöbel [15] about the problem of fitting hyperplanes to data using general norms in Minkowski spaces.
Gathering these previous results into an intuitive and practical algorithm based on a new proof will benefit application areas such as location theory, computer vision, and multivariate statistics. The problem appears in locating a line with respect to existing facilities [3], in image analysis using the general perspective camera model [1], and in the development of L_{1}-based principal component analysis methods [11, 12, 13]. With the exception of Agarwal et al. [1], the authors are unaware of any successful attempt at finding a globally optimal solution to (1) with p = 1 and general values of m when used in these applications. Agarwal et al. [1] reformulate the problem as a fractional program and give a branch-and-bound algorithm for finding globally optimal solutions; the algorithm has an exponential worst-case running time. Ke and Kanade [11, 12] apply a more general form of (1) with p = 1 and β = 0, where the subspace defined by V can have fewer than m − 1 dimensions. They use the formulation in the context of subspace estimation for image analysis using the affine camera model [12]. Kwak [13] treats a problem closely related to the L_{1}-norm best-fit hyperplane problem by finding successive directions of maximum variation, maximizing the L_{1} lengths of projected points along a fitted vector. The approach is the basis of an L_{1} PCA method that is applied to face recognition data. In none of these three works is there a guarantee of global optimality in polynomial time for the respective L_{1} best-fit optimization problem formulations. Ke and Kanade [11] provide an exact solution when m = 2 and a heuristic approach for generating locally optimal solutions when m > 2. Kwak [13] provides an algorithm with only local optimality of solutions guaranteed. These works would clearly benefit from a practical, effective, and globally optimal solution to the general L_{1}-norm best-fit subspace problem.
Our problem can be seen as a special case of the matrix decomposition problem described by Candès et al. [5]. They show that, under certain assumptions about X, the problem can be solved via convex optimization and the solution will be exact with surprisingly high probability. Their method contrasts with the work presented here in that we require no assumptions about X and can guarantee an exact solution.
The span of the points has dimension at least m − 1, so that there exists a matrix V^{*} of full column rank that is optimal for (1). Boldface uppercase letters represent matrices, and boldface lowercase letters represent vectors. For a vector y, y_{(j)} is the vector created by removing the j-th element. A unit direction is a direction along one of the 2m unit vectors ±u_{j}, j = 1, …, m. The external representation [19] of an affine set (a hyperplane is an (m − 1)-dimensional affine set) S ⊆ ℝ^{m} is the set {x ∈ ℝ^{m} : Ax = b} for an appropriately-defined matrix A ∈ ℝ^{q×m} and vector b ∈ ℝ^{q}. The internal representation [19] of the same affine set is {x ∈ ℝ^{m} : x = Vα + β for some α} for a matrix V ∈ ℝ^{m×q} where the columns of V span the affine set. The projection of a point x onto a set S is the set of points P such that the distance between x and points in P is minimum among all points in S; we will call elements of P projections.
Suppose we are given a point x̂ ∈ ℝ^{m}, a matrix V ∈ ℝ^{m×(m−1)} of full column rank, and β ∈ ℝ^{m}. The projection of x̂ onto S = {x ∈ ℝ^{m} : x = Vα + β for some α} can be found by solving the optimization problem,

min_{α} ‖x̂ − (Vα + β)‖_{1}. (2)

For nonnegative variables λ^{+}, λ^{−} ∈ ℝ^{m}, let λ^{+} − λ^{−} = x̂ − (Vα + β). The mathematical program in (2) can be reformulated as an LP that will lead to important geometric insights:

LP(V, β, x̂): min Σ_{j=1}^{m} (λ_{j}^{+} + λ_{j}^{−}) subject to Vα + λ^{+} − λ^{−} = x̂ − β, λ^{+}, λ^{−} ≥ 0. (3)
An optimal solution to LP(V, β, x̂) provides the magnitudes for the unit directions for an L_{1} projection of x̂ onto S. Optimal values for α are scaling factors that locate the projection in terms of a linear combination of the columns of V and an offset β. The following result states that there exists a projection of x̂ ∈ ℝ^{m} onto a hyperplane that is located along a single unit direction from x̂.
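The projection LP can be sketched directly with an off-the-shelf solver. The following is a minimal illustration using SciPy's `linprog` (the helper name `l1_projection` and the toy hyperplane are ours, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def l1_projection(V, beta, xhat):
    """Solve LP(V, beta, xhat): minimize 1'(lam_plus + lam_minus)
    subject to V alpha + lam_plus - lam_minus = xhat - beta."""
    m, k = V.shape                      # k = m - 1 for a hyperplane
    # Variable order: [alpha (free), lam_plus >= 0, lam_minus >= 0].
    c = np.concatenate([np.zeros(k), np.ones(2 * m)])
    A_eq = np.hstack([V, np.eye(m), -np.eye(m)])
    bounds = [(None, None)] * k + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=xhat - beta, bounds=bounds)
    alpha = res.x[:k]
    return V @ alpha + beta, res.fun    # L1 projection and its L1 distance

# Hyperplane x3 = 0 in R^3: V spans the first two coordinates.
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
p, dist = l1_projection(V, np.zeros(3), np.array([2.0, -1.0, 4.0]))
```

Here the projection is reached along the third unit direction alone, consistent with the single-unit-direction behavior formalized in Proposition 1 below.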
Proposition 1. Given a hyperplane S = {x ∈ ℝ^{m} : x = Vα + β for some α} and a point x̂ ∈ ℝ^{m}, x̂ ∉ S, there exists a solution to (3) with exactly one component of (λ^{+}, λ^{−}) positive.
Proof. Because the variables in α are unbounded, they will never leave the basis in a simplex pivot (see [16], p. 170). Therefore, there exists an optimal basic feasible solution with all of the variables in α and one component of (λ^{+},λ^{−}) basic.
Proposition 1 can also be proved using Corollary 2.2 in [14], after applying a correction (see [14]).
Next consider the projection of a set of points x_{i} ∈ ℝ^{m}, i = 1, …, n, onto a hyperplane. The following result establishes that the same unit direction is simultaneously optimal for projecting all points onto the hyperplane.
Proposition 2. Given a set of points x_{i} ∈ ℝ^{m}, i = 1, …, n, and a hyperplane S = {x ∈ ℝ^{m} : x = Vα + β for some α}, there exists an index j′ such that, for each i, there is an optimal solution to LP(V, β, x_{i}) with either λ_{j′}^{+} > 0 or λ_{j′}^{−} > 0, and λ_{j}^{+} = λ_{j}^{−} = 0 for j ≠ j′.
Proof. Consider an optimal solution to LP(V, β, x_{1}), (ᾱ, λ̄^{+}, λ̄^{−}), with exactly one component of (λ̄^{+}, λ̄^{−}) positive. Such a solution exists by Proposition 1. Assume without loss of generality that λ̄_{1}^{+} > 0. Consider the dual of LP(V, β, x_{1}),

LPD(V, β, x_{1}): max π^{T}(x_{1} − β) subject to V^{T}π = 0, −1 ≤ π_{j} ≤ 1, j = 1, …, m. (4)
Let π^{*} represent an optimal solution to LPD(V, β, x_{1}). Suppose x_{1} and x_{2} are on the same side of S. Construct a feasible solution to LP(V, β, x_{2}) by scaling λ̄_{1}^{+} and recalculating α. Denote the solution as (α̃, λ̃^{+}, λ̃^{−}). (Such a solution exists since x_{1} projects onto S along the first unit direction so that a ray from x_{2} along the same direction will intersect the hyperplane. The locations on S where these intersections occur may be different, requiring different values of α for their representation.) We will show that π^{*} is an optimal solution to the dual of LP(V, β, x_{2}). Dual feasibility is immediate because the constraints in the dual of LP(V, β, x_{1}) are the same as those in the dual of LP(V, β, x_{2}). Next we will establish that the complementary slackness conditions hold.
The only complementary slackness relations from LP(V, β, x_{1}) and its dual that are affected by the feasible solution to LP(V, β, x_{2}) are

α̃^{T}(V^{T}π^{*}) = 0, (5)

λ̃_{1}^{+}(1 − π_{1}^{*}) = 0. (6)

The complementarity of the relations in (5) still holds because α is unrestricted and V^{T}π^{*} = 0. The relation in (6) is also still valid because π_{1}^{*} = 1.
For the case when x_{2} is on the opposite side of S, the result follows by setting λ̃_{1}^{−} to a scaling value so that S can be reached from x_{2} along the first unit direction and negating π^{*}.
When S has an external representation, Proposition 2 follows from Theorem 2.1 in [14] and Proposition 1.
The procedure for finding the hyperplane of best fit follows from the properties given by Propositions 1 and 2. We can find an L_{1}-norm best-fit hyperplane by considering the residuals along each of the m unit directions.
L_{1} regression is a well-understood procedure for analyzing the dependence of one variable on other variables in a point set [17]. The L_{1} regression problem is to find a hyperplane that minimizes the sum of L_{1}-norm distances from the points in a point set along the unit direction corresponding to the "independent" variable. The designation of the independent variable in a general point set is effectively arbitrary. From Proposition 2, since an L_{1}-norm best-fit hyperplane for a point set will have the property that all projections occur along the same unit direction, the problem reduces to finding the best L_{1} regression in each of the m unit directions. Charnes et al. [6] and Wagner [20] show that L_{1} linear regression can be solved by finding an optimal solution to a linear program. This realization about the relationship between L_{1} projection and L_{1} regression leads directly to the procedure for solving the L_{1}-norm best-fit hyperplane problem. Theorem 1 formalizes this result.
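The LP reduction of L_{1} regression can be sketched as follows (SciPy's `linprog`; the helper name, the convention of treating coordinate j as the dependent variable, and the toy data are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(X, j):
    """R_j: the best L1 fit when coordinate j is treated as dependent.
    Variables are [coefficients (free), intercept (free), residual parts
    lam_plus, lam_minus >= 0], in the spirit of Charnes et al. [6]."""
    n, m = X.shape
    Xj = np.delete(X, j, axis=1)                        # the "independent" coordinates
    c = np.concatenate([np.zeros(m), np.ones(2 * n)])   # m = (m - 1) coeffs + intercept
    A_eq = np.hstack([Xj, np.ones((n, 1)), np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * m + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=X[:, j], bounds=bounds)
    return res.fun, res.x[:m - 1], res.x[m - 1]         # R_j, coefficients, intercept

# Points lying exactly on the plane x3 = 2 x1 + 3 x2 + 1: the fit is exact.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([pts, 2 * pts[:, 0] + 3 * pts[:, 1] + 1])
R2, coef, intercept = l1_regression(X, 2)
```

Because the points interpolate a plane, the optimal objective R_2 is zero and the recovered coefficients match the generating plane.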
Theorem 1. Given a set of points x_{i} ∈ ℝ^{m}, i = 1, …, n, let R_{j}(x_{1}, …, x_{n}) be the objective function value of an LP based on these data, and let

j^{*} ∈ argmin_{j = 1, …, m} R_{j}(x_{1}, …, x_{n}),

where

R_{j}(x_{1}, …, x_{n}) = min Σ_{i=1}^{n} (λ_{i}^{+} + λ_{i}^{−})

subject to

a^{T}x_{i} + b + λ_{i}^{+} − λ_{i}^{−} = 0, i = 1, …, n,
a_{j} = 1,
λ_{i}^{+}, λ_{i}^{−} ≥ 0, i = 1, …, n.

Then an optimal solution to the L_{1}-norm best-fit hyperplane problem is the hyperplane given by {x ∈ ℝ^{m} : a^{*T}x + b^{*} = 0}, where (a^{*}, b^{*}) is an optimal solution to the LP defining R_{j*}(x_{1}, …, x_{n}).
Proof. Suppose that, for a point set, a different hyperplane attains a better L_{1} fit. By Proposition 2, all points project onto this hyperplane along a single unit direction, corresponding to some index j′. If j′ = j^{*}, the contradiction is immediate by the optimality of (a^{*}, b^{*}). If j′ ≠ j^{*}, a contradiction also follows, because R_{j*} would not have been minimal.
Theorem 1 is a special case of a more general result about hyperplane fitting using general norms in Minkowski spaces [15]. The idea for the L_{1} case is suggested by Zemel [21] but no formal proof is provided. Neither of these works implement and test a procedure based on this result.
The proof of Theorem 1 implies that there exists a projection into S that has all of the properties of an optimal L_{1} regression hyperplane. Some of these properties are summarized in the following corollary.
Corollary 1. Given a set of points x_{i} ∈ ℝ^{m}, i = 1, …, n, there exists a projection into a hyperplane S such that all points project along a single unit direction and, in particular, S contains at least m of the points, as with an optimal L_{1} regression hyperplane [17].
Problem (1) is stated in terms of an internal representation of an affine set. Theorem 1 provides an externally-defined best-fit hyperplane. In order to satisfy the original requirements of the problem, we must calculate m − 1 linearly independent vectors that span the optimal hyperplane, together with the offset vector. We can find an optimal matrix V and offset β by applying an orthogonalization procedure to {x ∈ ℝ^{m} : a^{*T}x + b^{*} = 0}.
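One way to carry out this conversion is a QR-based orthogonalization; the following NumPy sketch (the function name and the particular choice of β are ours) turns an external representation into an internal one:

```python
import numpy as np

def external_to_internal(a, b):
    """Convert {x : a'x + b = 0} to {x = V alpha + beta}: the columns of V
    are an orthonormal basis of the orthogonal complement of a, and beta is
    a particular point on the hyperplane."""
    a = np.asarray(a, dtype=float)
    m = a.size
    # Full QR of a as an m x 1 matrix: the trailing m - 1 columns of Q are
    # orthogonal to a and span the hyperplane's direction space.
    Q, _ = np.linalg.qr(a.reshape(m, 1), mode='complete')
    V = Q[:, 1:]
    beta = -b * a / (a @ a)             # satisfies a'beta + b = 0
    return V, beta

a = np.array([1.0, 2.0, 3.0])
b = 4.0
V, beta = external_to_internal(a, b)
```

The QR factorization of an m × 1 matrix costs O(m²), comfortably within the orthogonalization complexity cited for Step 4 of the algorithm.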
Theorem 1 motivates Algorithm ProjEl1 for calculating the L_{1} hyperplane of best fit.
Given points x_{i} ∈ ℝ^{m}, i = 1, …, n.

Step 1. Set R^{*} ← ∞.
Step 2. For j = 1, …, m: solve the LP defining R_{j}(x_{1}, …, x_{n}); if R_{j}(x_{1}, …, x_{n}) < R^{*}, set R^{*} ← R_{j}(x_{1}, …, x_{n}), j^{*} ← j, and (a^{*}, b^{*}) ← the optimal solution.
Step 3. Compute the L_{1} projections of the points onto the hyperplane {x ∈ ℝ^{m} : a^{*T}x + b^{*} = 0} along unit direction j^{*}.
Step 4. Apply an orthogonalization procedure to obtain a spanning matrix V and offset β.
The input to Algorithm ProjEl1 is n observations of m dimensions. The main loop in Step 2 solves m LPs, each of which has 2n + m + 1 variables. Step 3 involves n(m − 1) multiplications. In Step 4, the matrix V can be found by performing an orthogonalization procedure, which has complexity O(m^{2}n + m^{3}) [7]. Therefore, since LPs can be solved in polynomial time, the overall complexity of Algorithm ProjEl1 is polynomial.
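The core of the procedure, solving one L_{1} regression LP per unit direction and keeping the best, can be sketched end-to-end as follows. This is a minimal illustration assuming SciPy's `linprog` as the LP solver (not the authors' CPLEX implementation), and it omits the Step 3–4 projection and orthogonalization:

```python
import numpy as np
from scipy.optimize import linprog

def projel1(X):
    """Solve one L1 regression LP per unit direction and keep the hyperplane
    a'x + b = 0 with the smallest total L1 residual (a has a 1 in the chosen
    coordinate), mirroring Steps 1-2 of Algorithm ProjEl1."""
    n, m = X.shape
    best_R, best_a, best_b = np.inf, None, None
    for j in range(m):
        Xj = np.delete(X, j, axis=1)
        # Variables: [coefficients (m - 1, free), intercept (free), lam+ (n), lam- (n)].
        c = np.concatenate([np.zeros(m), np.ones(2 * n)])
        A_eq = np.hstack([Xj, np.ones((n, 1)), np.eye(n), -np.eye(n)])
        bounds = [(None, None)] * m + [(0, None)] * (2 * n)
        res = linprog(c, A_eq=A_eq, b_eq=X[:, j], bounds=bounds)
        if res.fun < best_R:
            # Rewrite x_j = c'x_(j) + b as a'x + b' = 0 with a_j = 1.
            best_a = np.insert(-res.x[:m - 1], j, 1.0)
            best_R, best_b = res.fun, -res.x[m - 1]
    return best_R, best_a, best_b

# Points lying exactly on the plane x3 = 2 x1 + 3 x2 + 1: the fit is exact.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.column_stack([pts, 2 * pts[:, 0] + 3 * pts[:, 1] + 1])
R, a_star, b_star = projel1(X)
```

For interpolable data like this, the minimal residual R is zero and every point satisfies a^{*T}x + b^{*} = 0.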
Algorithm ProjEl1 is a complete formal description of a procedure for finding the L_{1}-norm best-fit hyperplane to a set of points in ℝ^{m}. It is derived and motivated directly from fundamental geometric insights about L_{1} projections. This approach to solving the problem will permit applications that previously relied on heuristics or inefficient methods to take advantage of easy-to-implement globally optimal solutions.
Algorithm ProjEl1 is validated by comparing the results obtained for four instances to the results obtained with an industry-standard nonlinear programming solver. The formulation in (1) for p = 1 is recast as a mathematical program with a linear objective and 2mn nonlinear constraints and is submitted to KNITRO [4] via an algebraic modeling language (AMPL). The four point sets have dimensions m = 3, 5, 10, 20 with n = 10, 25, 50, and 25000 observations, respectively. The point sets are available online at http://www.people.vcu.edu/~jpbrooks/ProjEl/index.html.
Algorithm ProjEl1 is implemented using the ILOG CPLEX 11.1 Callable Library [8] for the solution of LPs. The instances are solved on a machine with 3.2 GHz Intel Pentium D processors and 4 GB RAM. The instances for KNITRO are solved using the NEOS Server [9].
Algorithm ProjEl1 and KNITRO are applied to four different point sets to verify that the procedure in Algorithm ProjEl1 produces a solution with the same objective function value as an optimal solution to the original nonlinear bestfit problem formulated directly from expression (1). Because the problem is nonconvex, we allow 200 random starting points for KNITRO. The procedures obtain solutions with identical objective function values for the first three point sets. KNITRO was unable to solve the problem for the fourth point set due to insufficient memory available at the host site. Algorithm ProjEl1 solves the fourth problem in 3127.2 seconds.
There is a gap between hyperplane fit theory and application areas. This gap has denied the latter a practical solution of the L_{1}-norm best-fit hyperplane problem. With knowledge about L_{1} projection and L_{1} regression, we show that the problem has an intuitive solution and that the resulting algorithm is practical to implement. Two key insights into the geometry of L_{1} projections onto a hyperplane allow for a direct and intuitive proof that the best-fit hyperplane can be found using L_{1} regression: 1) L_{1} projection occurs along a single unit direction, and 2) the direction of projection is independent of the location of the point. This immediately suggests the algorithm for solving the problem: calculate the L_{1} regression hyperplanes for each of the m dimensions in which the points reside, and select the one that minimizes the sum of the L_{1} distances. The algorithm is implemented and numerically validated. Our computational testing shows that large-scale instances arising from multivariate statistics and computer vision are accessible.
The ideas in this paper are limited by the fact that they only apply to best-fit hyperplanes. Our method cannot be applied directly to fitting affine sets with fewer dimensions, as is often required by applications. However, access to the L_{1}-norm best-fit hyperplane obtained by solving (1) when p = 1 is the first step towards a future pure L_{1}-based PCA procedure. The procedure will project points down one dimension at each iteration until the projected points lie in a one-dimensional subspace. Each iteration will provide the direction of least dispersion in the respective subspace. This procedure will benefit from well-known L_{1} properties such as robustness to outliers.