Appl Math Lett. Author manuscript; available in PMC 2013 April 10.
Published in final edited form as:
Appl Math Lett. 2012 January 1; 26(1): 51–56.
Published online 2012 April 10.
PMCID: PMC3459998
NIHMSID: NIHMS369787

# The L1-norm best-fit hyperplane problem

## Abstract

We formalize an algorithm for solving the L1-norm best-fit hyperplane problem derived using first principles and geometric insights about L1 projection and L1 regression. The procedure follows from a new proof of global optimality and relies on the solution of a small number of linear programs. The procedure is implemented for validation and testing. This analysis of the L1-norm best-fit hyperplane problem makes the procedure accessible to applications in areas such as location theory, computer vision, and multivariate statistics.

Keywords: L1 norm, L1 regression, linear programming, subspace fitting

## 1. Introduction

Given points xi m, i = 1, …, n, consider the Lp-norm best-fit hyperplane problem,

(1)

where ‖·‖p is the Lp-norm of the argument, V m×m−1, β m, αi m−1 for 1 ≤ in, and p ≥ 1. A solution to this nonconvex mathematical program, , defines a hyperplane in m, {x m|x = V*α + β* for some α m−1}, that minimizes the sum of the p-norm distances of the points to the hyperplane. By requiring β = 0, our results extend directly to subspaces. This representation of an affine set in terms of linear combinations of vectors in V has several specialized applications such as in providing information about the directions of dispersion in a point set with regard to the Lp-norm.

The case when p = 2 and β is fixed to zero is a well-studied problem dating back to Pearson [18]. The optimal solution V* consists of the m − 1 eigenvectors of XTX, where XT = [x1, x2, …, xn], corresponding to the m − 1 largest eigenvalues. The solution minimizes the sum of Euclidean distances of points to their orthogonal projections in the fitted subspace. The problem is related to the calculation of principal components in a traditional principal component analysis (PCA) [10].

This paper deals with the case when p = 1. The solution to this problem minimizes the sum of L1 distances of points to their L1 projections in the fitted hyperplane. Using only first principles about projection and linear programming, we establish a formal connection between L1 best-fit hyperplanes and L1 regression which results in an efficient procedure based on the solution of a small number of linear programs (LPs). Our result consolidates and synthesizes more general results due to Mangasarian [14] about projections and Martini and Schöbel [15] about the problem of fitting hyperplanes to data using general norms in Minkowski spaces.

Gathering these previous results into an intuitive and practical algorithm based on a new proof will be of benefit to application areas such as location theory, computer vision, and multivariate statistics. The problem appears in locating a line with respect to existing facilities [3], in image analysis using the general perspective camera model [1], and in the development of L1-based principal component analysis methods [11, 12, 13]. With the exception of Agarwal et al. [1], the authors are unaware of any successful attempt at finding a globally optimal solution to (1) with p = 1 and general values of m when used in these applications. Agarwal et al. [1] reformulate the problem as a fractional program and give a branch-and-bound algorithm for finding globally optimal solutions; the algorithm has an exponential worst-case running time. Ke and Kanade [11, 12] apply a more general form of (1) with p = 1 and β = 0, where the subspace defined by V can have fewer than m − 1 dimensions. They use the formulation in the context of subspace estimation for image analysis using the affine camera model [12]. Kwak [13] treats a problem closely related to the L1-norm best-fit hyperplane problem by finding successive directions of maximum variation by maximizing the L1 lengths of projected points along a fitted vector. The approach is the basis of an L1 PCA method that is applied to face recognition data. In each of these three works, there is no guarantee of global optimality in polynomial time to the respective L1 best-fit optimization problem formulations. Ke and Kanade [11] provide an exact solution when m = 2, and a heuristic approach for generating locally optimal solutions when m > 2. Kwak [13] provides an algorithm with only local optimality of solutions guaranteed. These works would clearly benefit from a practical, effective, and globally optimal solution to the general L1-norm best-fit subspace problem.

Our problem can be seen as a special case of the matrix decomposition problem described by Candés et al. [5]. They show that under certain assumptions about X, the problem can be solved via convex optimization and this solution will be exact with a surprisingly high probability. Their method contrasts with the work presented here in that we require no assumptions about X and can guarantee an exact solution.

## 2. Notation, Assumptions, Definitions

The span of the points has dimension at least m − 1, so that there exists a matrix V* of full column rank that is optimal for (1). Boldface uppercase letters represent matrices, and boldface lowercase letters represent vectors. For a vector y, y(j) is the vector created by removing the jth element. A unit direction is a direction along one of the 2m unit vectors ±uj, j = 1, …, m. The external representation [19] of an affine set (a hyperplane is an m − 1-dimensional affine set) S m is the set {x m|Ax = b} for an appropriately-defined matrix A q×m and vector b q. The internal representation [19] of the same affine set is {x m|x = Vα + β for some α} for a matrix V m×q where the columns of V span the affine set. The projection of a point x onto a set S is the set of points P such that the distance between x and points in P is minimum among all points in S; we will call elements of P projections.

## 3. L1 Projection onto a Specified Hyperplane

Suppose we are given a point $\stackrel{̂}{x}$ m, a matrix V m×m−1 of full column rank, and β m. The projection of $\stackrel{̂}{x}$ onto S = {x m|x = Vα + β for some α} can be found by solving the optimization problem,

(2)

For non-negative variables , let λ+λ = $\stackrel{̂}{x}$ − (Vα + β). The mathematical program in (2) can be reformulated as an LP that will lead to important geometric insights.

(3)

An optimal solution to LP(V, β, $\stackrel{̂}{x}$) provides the magnitudes for the unit directions for an L1 projection of $\stackrel{̂}{x}$ onto S. Optimal values for α are scaling factors that locate the projection in terms of a linear combination of the columns of V and an offset β. The following result states that there exists a projection of $\stackrel{̂}{x}$ m onto a hyperplane that is located along a single unit direction from $\stackrel{̂}{x}$.

Proposition 1. Given a hyperplane S = {x m|x = Vα + β for some α} and a point $\stackrel{̂}{x}$ m, $\stackrel{̂}{x}$ S, there exists a solution to (3) with exactly one component from (λ+,λ) positive.

Proof. Because the variables in α are unbounded, they will never leave the basis in a simplex pivot (see [16], p. 170). Therefore, there exists an optimal basic feasible solution with all of the variables in α and one component of (λ+,λ) basic.

Proposition 1 can also be proved using Corollary 2.2 in [14], after applying a correction ( is replaced with , see [14]).

Next consider the projection of a set of points xi m, i = 1, … n onto a hyperplane. The following result establishes that the same unit direction is simultaneously optimal for projecting all points onto the hyperplane.

Proposition 2. Given a set of points xi m, i = 1, … n and a hyperplane S = {x m|x = Vα + β for some α}, there exists an optimal solution to LP(V, β, xi) with either for some j′, and for jj′.

Proof. Consider an optimal solution to LP(V, β, x1), (, +, ), with exactly one component of (+, ) positive. Such a solution exists by Proposition 1. Assume without loss of generality that . Consider the dual of LP(V, β, x1),

(4)

Let π* represent an optimal solution to LP-D(V, β, x1). Suppose x1 and x2 are on the same side of S. Construct a feasible solution to LP(V, β, x2) by scaling and recalculating α. Denote the solution as (, +, ). (Such a solution exists since x1 projects onto S along the first unit direction so that a ray from x2 along the same direction will intersect the hyperplane. The locations on S where these intersections occur may be different, requiring different values of α for their representation.) We will show that π* is an optimal solution to the dual of LP(V, β, x2). Dual feasibility is immediate because the constraints in the dual system for LP(V, β, x1) are the same as those in LP(V, β, x2). Next we will establish that the complementary slackness conditions hold.

The only complementary relations from LP(V, β, x1) and its dual that are affected by the feasible solution to LP(V, β, x2) are

(5)

(6)

The complementarity of the relations in (5) still holds because α is unrestricted and V = 0. The relation in (6) is also still valid because .

For the case when x2 is on the opposite side of S, the result follows by setting to a scaling value so that S can be reached from x2 using the first unit direction and negating π*.

When S has an external representation, Proposition 2 follows from Theorem 2.1 in [14] and Proposition 1.

The procedure for finding the hyperplane of best fit follows from the properties given by Propositions 1 and 2. We can find an L1-norm best-fit hyperplane by considering the residuals along each of the m unit directions.

## 4. Best-Fit Hyperplane and L1 Regression

L1 regression is a well-understood procedure for analyzing the dependence of one variable on other variables in a point set [17]. The L1 regression problem is to find a hyperplane that minimizes the sum of L1-norm distances from the points in a point set along the unit direction corresponding to the “independent” variable. The designation of the independent variable in a general point set is effectively arbitrary. From Proposition 2, since an L1-norm best-fit hyperplane for a point set will have the property that all projections occur along the same unit direction, the problem reduces to finding the best L1 regression in each of the m unit directions. Charnes et al. [6] and Wagner [20] show that L1 linear regression can be solved by finding an optimal solution to a linear program. This realization about the relationship between L1 projection and L1 regression leads directly to the procedure for solving the L1-norm best-fit hyperplane problem. Theorem 1 formalizes this result.

Theorem 1. Given a set of points xi m, i = 1, …, n, let Rj(x1, …, xn) be the objective function value of an LP based on this data, and let

(7)

where

(8)

subject to

Then an optimal solution to the L1-norm best-fit hyperplane problem is the hyperplane given by {x Rm|a*Tx + b* = 0} where .

Proof. Suppose that for a point set a different hyperplane attains a better L1 fit. By Proposition 2, we know that all points will project onto this hyperplane along a single unit direction corresponding to j′. The contradiction if j′ = j* is immediate by the optimality of (a*, b*). Similarly, j′ ≠ j* leads to a contradiction because Rj* would not have been minimal.

Theorem 1 is a special case of a more general result about hyperplane fitting using general norms in Minkowski spaces [15]. The idea for the L1 case is suggested by Zemel [21] but no formal proof is provided. Neither of these works implement and test a procedure based on this result.

The proof of Theorem 1 implies that there exists a projection into S that has all of the properties of an optimal L1 regression hyperplane. Some of these properties are summarized in the following corollary.

Corollary 1. Given a set of points xi m, i = 1, …, n, there exists a projection into a hyperplane S such that

1. the sum of L1 distances of points to S is minimized among all m − 1-dimensional hyperplanes,
2. at least m of the points lie in S [2],
3. the difference between the number of points on either side of S is at most m [2], and
4. the projection of points into S is maximum likelihood for errors following a joint distribution of m independent, identically distributed Laplace random variables [1].

Problem (1) is stated in terms of an internal representation of an affine set. Theorem 1 provides an externally-defined best-fit hyperplane. In order to satisfy the original requirements of the problem, we must calculate m − 1 linearly independent vectors that span the optimal hyperplane and the offset vector. We can find an optimal matrix V and offset β by applying an orthogonalization procedure to {x m|a*Tx + b*}.

## 5. Optimal L1 Projection Procedure

Theorem 1 motivates Algorithm ProjEl-1 for calculating the L1 hyperplane of best fit.

### Algorithm ProjEl-1

Calculating the L1-norm best-fit hyperplane.

 Given points xi m, i = 1, …, n. Let R* = ∞for j in 1, …, m do Solve the LP in (8) to find the L1 regression hyperplane with the jth column representing the dependent variable and the remaining columns representing the independent variables. The optimal hyperplane has coefficients (a, b) and error Rj. if Rj < R* then R* = Rj, j* = j, a* = a, b* = b, For each xi, the optimal projection onto S is given by zi, i = 1, …, n, where zij = xij for j ≠ j* and .S is defined by {x|Vα + β = x for some α}, where the columns of V are vectors that span the zi’s, and β is any point on the hyperplane.

The input to Algorithm ProjEl-1 are n observations of m dimensions. The main loop in Step 2 solves m LPs each of which has 2n + m + 1 variables. Step 3 involves n(m − 1) multiplications. In Step 4, the matrix V can be found by performing an orthogonalization procedure, which has complexity O(m2n + m3) [7]. Therefore, since LPs can be solved in polynomial time, the overall complexity of Algorithm ProjEl-1 is polynomial.

Algorithm ProjEl-1 is a complete formal description of a procedure for finding the L1-norm best-fit hyperplane to a set of points in m. It is derived and motivated directly from fundamental geometric insights about L1 projections. This approach to solving this problem will permit applications which previously relied on heuristics or inefficient methods to take advantage of easy-to-implement globally optimal solutions.

## 6. Numerical Validation

Algorithm ProjEl-1 is validated by comparing the results obtained for four instances to the results obtained with an industry-standard nonlinear programming solver. The formulation in (1) for p = 1 is recast as a mathematical program with a linear objective and 2mn nonlinear constraints and is submitted to KNITRO [4] via an algebraic modeling language (AMPL). The four point sets have dimensions m = 3, 5, 10, 20 with n = 10, 25, 50, and 25000 observations, respectively. The point sets are available online at http://www.people.vcu.edu/\~{}jpbrooks/ProjEl/index.html.

Algorithm ProjEl-1 is implemented using the ILOG CPLEX 11.1 Callable Library [8] for the solution of LPs. The instances are solved on a machine with 3.2 GHz Intel Pentium D processors and 4 GB RAM. The instances for KNITRO are solved using the NEOS Server [9].

Algorithm ProjEl-1 and KNITRO are applied to four different point sets to verify that the procedure in Algorithm ProjEl-1 produces a solution with the same objective function value as an optimal solution to the original nonlinear best-fit problem formulated directly from expression (1). Because the problem is non-convex, we allow 200 random starting points for KNITRO. The procedures obtain solutions with identical objective function values for the first three point sets. KNITRO was unable to solve the problem for the fourth point set due to insufficient memory available at the host site. Algorithm ProjEl-1 solves the fourth problem in 3127.2 seconds.

## 7. Conclusion

There is a gap between hyperplane fit theory and application areas. This gap has denied the latter a practical solution of the L1-norm best-fit hyperplane problem. With knowledge about L1 projection and L1 regression, we show that the problem has an intuitive solution and the resulting algorithm is practical to implement. Two key insights into the geometry of L1 projections onto a hyperplane allow for a direct and intuitive proof that the best-fit hyperplane can be found using L1 regression: 1) L1 projection occurs along a single unit direction, and 2) the direction of projection is independent of the location of the point. This suggests immediately the algorithm for solving the problem: calculate the L1 regression hyperplanes for each of the m dimensions in which the points reside, and proceed to select the one that minimizes the sum of the L1 distances. The algorithm is implemented and numerically validated. Our computational testing shows that large-scale instances arising from multivariate statistics and computer vision are accessible.

The ideas in this paper are limited by the fact that they only apply to best-fit hyperplanes. Our method cannot be applied directly to fitting affine sets with fewer dimensions, as is often required by applications. However, access to the L1-norm best-fit hyperplane obtained by solving (1) when p = 1 is the first step towards a future pure L1-based PCA procedure. The procedure will project points down one dimension at each iteration until the projected points lie in a one-dimensional subspace. Each iteration will provide the direction of least dispersion in the respective subspace. This procedure will benefit from well-known L1 properties such as robustness to outliers.

## Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

## References

1. Agarwal S, Chandraker M, Kahl F, Kriegman D, Belongie S. Practical global optimization for multiview geometry. Lecture Notes in Computer Science. 2006;3951:592–605.
2. Appa G, Smith C. On L1 and Chebyshev estimation. Mathematical Programming. 1973;5:73–87.
3. Brimberg J, Juel H, Schöbel A. Properties of three-dimensional median line location models. Annals of Operations Research. 2003;122:71–85.
4. Byrd R, Nocedal J, Waltz R. In: Large-Scale Nonlinear Optimization. di Pillo G, Roma M, editors. Springer Verlag; 2006. pp. 35–59. Large-scale nonlinear optimization, eds. g. di pillo and m. roma,
5. Candès E, Li X, Ma Y, Wright J. Robust principal component analysis? Journal of the ACM. 2011;58 Article 11.
6. Charnes A, Cooper W, Ferguson R. Optimal estimation of executive compensation by linear programming. Management Science. 1955;1:138–150.
7. Golub G, Loan CV. Matrix Computations. Baltimore, MD: Johns Hopkins University Press; 1983.
8. ILOG, ILOG CPLEX Division. Nevada: 889 Alder Avenue, Incline Village; 2009.
9. Czyzyk MMJ, Moré J. The NEOS server. IEEE Journal on Computational Science and Engineering. 1998;5:68–75.
10. Jolliffe I. Principal Component Analysis. 2nd edition. Springer; 2002.
11. Ke Q, Kanade T. Robust subspace computation using L1 norm, Technical Report CMU-CS-03-172. Pittsburgh, PA: Carnegie Mellon University; 2003.
12. Ke Q, Kanade T. Robust l1 norm factorization in the presence of outliers and missing data by alternative convex programming; IEEE Conference on Computer Vision and Pattern Recognition.
13. Kwak N. Principal component analysis based on L1-norm maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2008;30:1672–1680. [PubMed]
14. Mangasarian O. Arbitrary-norm separating plane. Operations Research Letters. 1999;24:15–23.
15. Martini H, Schöbel A. Median hyperplanes in normed spaces - a survey. Discrete Applied Mathematics. 1998;89:181–195.
16. Murty K. Linear Programming. Wiley; 1983.
17. Narula S, Wellington J. The minimum sum of absolute errors regression: A state of the art survey. International Statistical Review. 1982;50:317–326.
18. Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901;2:559–572.
19. Rockafellar R. Convex analysis. Princeton University Press; 1970.
20. Wagner W. Linear programming techniques for regression analysis. Journal of the American Statistical Association. 1959;54:206–212.
21. Zemel E. An O(n) algorithm for the linear multiple choice knapsack problem and related problems. Information Processing Letters. 1984;18:123–128.

 PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers.