Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC 2010 July 7.
Published in final edited form as:
J Am Stat Assoc. 2009 June 1; 104(486): 682–693.
doi:  10.1198/jasa.2009.0121
PMCID: PMC2898454

On Consistency and Sparsity for Principal Components Analysis in High Dimensions


Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n→0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) [dbl greater than] n.

Keywords: Eigenvector estimation, Reduction of dimension, Regularization, Thresholding, Variable selection


Suppose {xi, i = 1, …, n} is a dataset of n observations on p variables. Standard principal components analysis (PCA) looks for vectors ξ that maximize


If ξ1, …, ξk have already been found by this optimization, then the maximum defining ξk+1 is taken over vectors ξ orthogonal to ξ1, …, ξk.

Our interest lies in situations in which each xi is a realization of a possibly high-dimensional signal, so that p is comparable in magnitude with n, or may even be larger. In addition, we have in mind settings in which the signals xi contain localized features, so that the principal modes of variation sought by PCA may be localized as well.

These issues are familiar in signal and image processing application areas in which each sample has many variablesapixels, frequencies, genes, stocks, and so forth. In applications, it is common to combine the use of transform domains and feature selection to achieve an effective reduction of dimensionality. For example, one might transform the data into a suitable orthogonal basis (e.g., wavelets), select coordinates with highest variance, and then do PCA on the reduced set of variables.

A notable example occurs in the work of Wickerhauser (1994a, b), in which the orthobasis itself was chosen from a library of (wavelet packet) bases. Applications to face and fingerprint classification were given. A selection of later examples (by no means exhaustive) would include Feng, Yuen, and Dai (2000) in face recognition; and Kaewpijit, Le Moigne, and El-Ghazawi (2002) and Du and Fowler (2008) for hyper-spectral images. For some further discussion, see Cherkassky and Mulier (1998). A recent approach to variable selection followed by dimensionality reduction that emphasizes sparsity is described by Wolf and Shashua (2005) and Wolf and Bileschi (2005).

The purpose of this article is to contribute some theoretical analysis of PCA in these burgeoning high-dimensional settings. In a simple class of models of factor analysis type, we (a) describe inconsistency results to emphasize that when p is comparable with n, some reduction in dimensionality is desirable before applying any PCA-type search for principal modes; and (b) establish consistency results to illustrate that the reduction in dimensionality can be effected by working in a basis in which the signals have a sparse representation. We will support these assertions with arguments based on statistical performance and computational cost.

We begin, however, with an illustration of our results on a simple constructed example. Consider a single component (or single factor) model, in which, when viewed as p-dimensional column vectors


in which ρRp is the single component to be estimated, υi ~ N(0, 1) are iid Gaussian random effects, and zi ~ Np(0, I) are independent noise vectors.

Figure 1a shows an example of ρ with p = 2,048 and the vector ρl = f(l/p), l [set membership] {1, …, p}, where f(t) is a mixture of beta densities on [0, 1], scaled so that ρ=(1pρl2)12=10. Figure 1b shows a sample case from model (2): The random effect υiρ is hard to discern in individual cases. Figure 1c shows the result of standard PCA applied to n = 1,024 observations from (2) with σ = 1, normalized to the same length as ‖ρ‖. The effect of the noise remains clearly visible in the estimated principal eigenvector.

Figure 1
True principal component, the “three-peak” curve. (a) The single component ρl = f(l/p) where f(t) = C{0.7B(1,500, 3,000) + 0.5B(1,200, 900) + 0.5B(600, 160)} and B(a, b)(t) = [Γ(a + b)/(Γ(a)Γ(b))]ta•1 ...

For functional data of this type, a regularized approach to PCA has been proposed by Rice and Silverman (1991) and Silverman (1996), (see also Ramsay and Silverman (1997) and references therein). Although smoothing can be incorporated in various ways, we illustrate the method discussed also in Ramsay and Silverman (1997, chap. 7), which replaces (1) with


where D2ξ is the (p − 2) × 1 vector of second differences of ξ, and λ [set membership] (0, ∞) is the regularization parameter.

Figure 1d shows the estimated first principal component vector found by maximizing (3) with λ = 10−12 and λ = 10−6, respectively. Neither is really satisfactory as an estimate. The first recovers the original peak heights, but fails fully to suppress the remaining baseline noise, whereas the second grossly oversmooths the peaks in an effort to remove all trace of noise. Further investigation with other choices of λ confirms the impression already conveyed here: No single choice of λ succeeds both in preserving peak heights and in removing baseline noise.

Figures 1e and f show the result of the adaptive sparse PCA algorithm to be introduced later, respectively without and with a final thresholding step. Both goals are accomplished quite satisfactorily after thresholding in this example.

This article is organized as follows. Section 2 reviews the inconsistency result Theorem 1. Section 3 sets out the sparsity assumptions and the consistency result (Theorem 2). Section 4 gives an illustrative algorithm, demonstrated on simulated and real data in Section 5. Proofs and their preliminaries are deferred to Section 6 and the Appendix.


A basic element of our sparse PCA proposal is initial selection of a relatively small subset of the initial p variables before any PCA is attempted. In this section, we formulate some (in)consistency results that motivate this initial step.

Consider first the single component model (2). The presence of noise means that the sample covariance matrix S=n1i=1nxixiT will typically have min(n, p) nonzero eigenvalues. Let ρ^ be the eigenvector associated with the largest sample eigenvalue, with probability one it is uniquely determined up to sign.

One natural measure of the closeness of ρ^ to ρ uses the “overlap” R(ρ^,ρ), defined as the inner product between the vectors after normalization to unit length:


Equivalently, R(ρ^,ρ) is the cosine of the angle between ρ^ and ρ


and we may also write this in terms of a distance metric


For asymptotic results, we will assume that there is a sequence of models (2) indexed by n. Thus, we allow pn and ρn to depend on n, although the dependence will usually not be shown explicitly. (Of course, σ might also be allowed to vary with n, but for simplicity it is assumed fixed.)

Our first interest is whether the estimate ρ^ is consistent as n → ∞. This turns out to depend crucially on the limiting value


One setting in which this last assumption may be reasonable is when pn grows by adding finer scale wavelet coefficients of a fixed function as n increases.

We will also assume that the limiting “signal-to-noise ratio”


Theorem 1. Assume that there are n observations drawn from the p-dimensional model (2). Assume that pn/nc and that ||ρn||22 → ω > 0. Then almost surely


In particular, R(ω, c) < 1 if and only if c > 0, and so ρ^ is a consistent estimator of ρ if and only if p/n → 0.

The situation is even worse if ω2c—that is, if


because ρ^ and ρ are asymptotically orthogonal, and ρ^ ultimately contains no information at all regarding ρ.

In short, ρ^ is a consistent estimate of ρ if and only if p/n → 0. The noise does not average out if there are too many dimensions p relative to sample size n.

Theorem 1 is stated without proof, and we include some bibliographic remarks about its history. The inconsistency phenomenon was first observed in a number of papers in the learning theory literature in physics (see, for example, Biehl and Mietzner 1994; Watkin and Nadal 1994; Reimann, Van den Broeck, and Bex 1996; Hoyle and Rattray 2004). The limiting overlap formula of Theorem 1, derived in a related classification setting, appears in Biehl and Mietzner (1994). These works all use the nonrigorous “replica method” of statistical mechanics.

The first rigorous proof of inconsistency was given, in model (2), by the second author in his Ph.D. dissertation (Lu 2002) and in the initial version of this article (Johnstone and Lu 2004), available at While this article was in publication review, subsequent rigorous proofs were published, along with other results, by Paul (2007) and Nadler (2008).

Remark. Paul (2007) also includes extensions to a considerably more general “multicomponent” or “spiked” covariance model that has attracted interest in the literature. Assume that we have n data vectors xi, observed at p time points. Viewed as p-dimensional column vectors, this model assumes that


The mean function μ is assumed known. The vectors ρj, j = 1, …, mp are unknown and mutually orthogonal, with norms ρj(n) = ||ρj|| assumed decreasing: ||ρ1|| > ||ρ2|| ≥ … ≥ ||ρm||. The multipliers υij~N(0,1) are all independent over j = 1, …,m and i = 1, …, n, and the noise vectors zi ~ Np(0, I) are independent among themselves and also of the random effects {υij}. The population covariance matrix of the iid vectors (xi) is given by =j=1mρjρjT+σ2I. The vectors ρj are the ordered principal component eigenvectors of the population covariance Σ. The asymptotics assume pn, mn, and ρnj to be functions of n, and as n → ∞,


Paul (2007) shows that it continues to be true in the multicomponent model that ρ^1 is consistent if and only if pn/n → 0. Of course, the inconsistency extends to cases with pn/n → ∞ because these models are even larger.


3.1 Sparsity

The inconsistency Theorem 1 asserts that ordinary PCA becomes confused in the presence of too many variables each with equal independent noise. Ideally, we might wish to reduce the dimensionality from p to a smaller number of variables k before beginning PCA. For this to succeed, the population principal components—ρ, in our model—should be essentially concentrated in the smaller number of dimensions, in a manner that can be discovered from the data.

To quantify this, assume that the data and the population principal components are represented, perhaps after transformation, in a fixed orthonormal basis {eν}:


The index ν will always indicate the transform domain. In many situations, including all examples in this article, the data xi are initially collected in the time domain—for example in [0, 1], with xi = {xi(tl)}, where tl = l/p, l = 1,… , p. In such cases, the basis vectors eν are also time domain functions eν(tl).

The idea of concentration in a small number of variables can be captured by considering the ordered coefficient magnitudes |ρ|(1) ≥ |ρ|(2)(...). The intuitive idea of sparse representation is that, for relatively small k, the “energy” in the largest k coordinates i=1kρ(i)2 is close to the total energy ρ2=ν=1pρν2. This can only be true if the magnitudes |ρ|ν decay rather quickly. Thus, we assume for some 0 < q < 2 and C >0 that


The condition q < 2 forces rapid decay—clearly, the more rapid if q is smaller. This notion of “weak [ell]q decay” is actually equivalent to the concentration of energy in the sums i=1kρ(i)2 just mentioned (see Donoho (1993) or Johnstone (2003, chap. 15)), but is more convenient for the results given here.

The choice of orthonormal basis {eν} for sparse representation will depend on the dataset and problem domain, and thus is beyond the scope of this article. We remark, however, that for certain signal processing settings, wavelet bases can be natural for uncovering sparsity. When one-dimensional signals are smooth or have isolated discontinuities (either in the signal or its derivatives), then it can be shown (e.g., Mallat 1999) that the wavelet coefficients decay rapidly with frequency octave away from the discontinuities. In such cases, assumptions (9) are natural, as is shown in detail, for example, in the references cited earlier. We have therefore used wavelet bases for the examples in this article, but hasten to emphasize that our results apply equally to representations in other bases that might be better suited to, say, economic or genomic data.

3.2 Consistency

If the principal components have a sparse representation in basis {eν}, then selection of an appropriate subset of variables should overcome the inconsistency problem described by Theorem 1. In this direction, we establish a consistency result for sparse PCA. For simplicity, we use the single component model (2), and assume σ2 is known—although this latter assumption could be removed by estimating σ2 using (14), presented later.

We again assume a sequence of models (2) indexed by n. The unknown principal components ρ = ρn should satisfy a “uniform sparsity condition”: For some q [set membership] (0, 2) and C < ∞ independent of n, each ρn satisfies decay condition (9). In addition, as in Theorem 1, the signal strength should stabilize: ||ρn ||[var rho] > 0.

On the assumption of model (2), the sample variances


Consequently, components ν with large values of ρν will typically have large sample variances. We use here a simple selection rule


with αn = α(n−1 log(n [logical or] p))1/2 and α a sufficiently large positive constant—for example, α>12 would work for the proof. (By definition, n [logical or] p = max(p, n).)

Let SI = (Sνν′ : ν and νI^) denote the sample covariance matrix of the selected variables. Applying PCA to SI yields a principal eigenvecto (ρ^ν, νI^). Let ρ^I denote the corresponding vector in the full p-dimensional space:


The sparsity assumption implies that ρ^I is a consistent estimator of ρ.

Theorem 2. Assume that the single component model (2) holds with log(p [logical or] n)/n → 0 and ||ρn||[var rho] > 0 as n → ∞ Assum for some q [set membership] (0, 2) and C < ∞, that for each n, ρn satisfies the sparsity condition (9). Then the estimated principal eigenvector ρ^I obtained via subset selection rule (11) is consistent:


Here, α is the angle between ρ^I and ρ as in (4). Converting to an estimate ρ^(t) in the time domain (as in Step 5 of Section 4, an equivalent statement of the result is that ρ^ρ^s^ρρ0 in Euclidean norm, where s^=sign(ρ^,ρ).

The proof shows that consistency holds even under weaker assumption p = o(en). The result and proof could also be extended to multicomponent systems (8).

Armed with Theorems 1 and 2, let us return briefly to Figure 1. Based on Figure 1c, one might ask if simple thresholding of the standard PCA estimate—either in the original or wavelet domain—might suffice. Although this may work for high signal-to-noise settings, Theorem 1 suggests that such an approach is doomed if ||ρ|| < p/n, because ρ^ is asymptotically orthogonal to ρ. No such constraint applies in Theorem 2—as long as lim ||ρn|| > 0—as a result of the preliminary reduction of variables.

Bibliographic Remarks

Significant extensions of these results have been obtained since this article was first written. For example, working with the multicomponent model, Paul (2007) and Paul and Johnstone (2004) have derived lower bounds to the possible quality of estimation, and optimal estimators that attain the bounds (at the level of rates of convergence) in certain cases.

In a large p, n setting of partial least squares regression, Nadler and Coifman (2005) noted the importance of prior dimension reduction and suggested the use of wavelets to exploit sparsity.

Alternative methods for “sparsifying” PCA have also been proposed, based on connections with LASSO and [ell]1 penalized regression (Jolliffe, Trendafilov, and Uddin 2003; Zou, Hastie, and Tibshirani 2006; d'Aspremont et al. 2007), and compressive sensing (Fowler 2008). The study of consistency properties for these methods and comparison with those of this article is a natural topic for further research, with significant progress recently reported in Amini and Wainwright (2009).

3.3 Correct Selection Properties

A basic issue raised by the sparse PCA algorithm is whether the selected subset I^ in fact correctly contains the largest population variances, and only those. We formulate a result, based on large deviations of chi-squared variables, to address this issue. The considerations of this section hold for coefficients in any orthogonal basis.

For this section, assume that the diagonal elements of the sample covariance matrix S=n11nxixiT have marginal chi-squared distributions—in other words,


We will not require any assumptions on the joint distribution of {σ^ν2} The use of the index ν emphasizes the fact that we work in the transform domain.

Denote the ordered population coordinate variances by σ(1)2σ(2)2, and the ordered sample coordinate variances by σ(1)2σ(2)2. A desirable property is that I^ should, for fixed k and for suitable positive αn small, (i) include all indices l in Iin={l:σl2σ(k)2(1+αn)} and (ii) exclude all indices l in Iout={l:σl2σ(k)2(1αn)}. We will show that this, in fact, occurs with high probability if αn=αn1logn, for appropriate α > 0.

We say that a “false exclusion” (FE) occurs if any variable in Iin is missed:


whereas a “false inclusion” (FI) happens if any variable in Iout is spuriously selected:


Theorem 3. Assume that the sample variances satisfy (12) and that a subset of size k of variables is sought. With αn = αn−1/2(log n)1/2, the probability of an inclusion error of either type is polynomially small:


with b(α)=[α3(4+23)]2.

The proof is in the Appendix. As an example, if α = 9, then b(α) [equals, single dot above] 4.36. As a numerical illustration based on (A.4) (seen in the Appendix), if the subset size k = 50, while p = n = 1,000, then the chance of an inclusion error corresponding to a 25% difference in standard deviations (i.e., 1+αn=1.25 when α = 9) is less than 5%.


4.1 An Algorithm

The inconsistency results summarized in Section 2 emphasize the importance of reducing the number of variables before embarking on PCA. The results of Section 3 show that the existence of a sparse representation allows consistency to be recovered. The proof of Theorem 2 relies on two key steps: (i) sparsity allows ρ to be approximated using a relatively small number of coefficients, and (ii) these smaller number of coefficients can be estimated by a reduced PCA.

We use these remarks as the basis for the sparse PCA algorithm to be described in general terms here. Note that the algorithm per se does not require the specification of a particular model, such as the single component system (2) or the multicomponent version (8). Given the widespread use of transform domain and feature selection techniques in the signal processing literature, as described in Section 1, we make no claims for originality; this section is included primarily to illustrate results of previous sections.

  1. Compute Basis Coefficients. Given a basis {eν} for Rp, compute coordinates xiν = (xi, eν) in this basis for each xi:
  2. Subset. Calculate the sample variances σ^ν2=var^(xiν). Let I^{1,,p} denote the set of indices ν corresponding to the largest k variances.
  3. Reduced PCA. Apply standard PCA to the reduced dataset {xiν, νI^, i = 1, …, n} on the selected k-dimensional subset, obtaining eigenvectors ρ^j=(ρ^νj), j = 1, …, k, νI^.
  4. Thresholding. Filter out noise in the estimated eigenvectors by hard thresholding
  5. Reconstruction. Return to the original signal domain, using the given basis {eν}, and set

Discussion: Steps 2 and 3

An important computational point that is implicit in Steps 2 and 3 is that we only compute the variances Sν,ν for the p transform domain variables. Off-diagonal covariance elements Sν,ν′ are only computed for ν, ν′ in the reduced set I^ of size k. The reduced PCA size k may be specified in advance or chosen based on the data (see Section 4.2).

Discussion: Step 4

Although not formally studied in the theory in the preceding section, the thresholding step is found in our examples to yield a useful further filtering of noise. For a scalar value y, hard thresholding is given, as usual, by ηH(y, δ) = yI{|y| ≥ δ}. An alternative is soft thresholding ηS(y, δ) = sgn(y) × max(|y| − δ, 0), but hard thresholding has been used here because it preserves the magnitude of retained signals.

There is considerable freedom in the choice of thresholds δj. Trial and error is always possible, of course. Further, more formal choices are suggested by analogy with the signal in Gaussian noise setting δj=τ^j2logk (compare with, for example, Donoho et al. (1995)), and, for this article, we use this choice of δj. Here, τ^j is an estimate of the noise level in {ρ^νj, νI^ }—in this article, estimate (16) is used. Another possibility is to set τ^j=MAD{ρ^νj,νI^}0.6745, where MAD denotes “median absolute deviation.”

The consistency result Theorem 2 applies to this algorithm, with subset selection rule (11), and without the thresholding Step 4. Although the thresholding step helps in the examples to follow, theoretical analysis to elucidate its specific advantages is beyond the scope of this article.


We will refer to the general procedure specified by Steps 1 through 5 as “sparse PCA.” With the specific data-based choice of k proposed as method (b) in Section 4.2, with w = 0.995, we use the term “adaptive sparse PCA” (ASPCA).

In the rest of this section, we amplify and illustrate various aspects of this algorithm. Given eigenvalue and eigenvector routines, it is not difficult to code. For example, a MATLAB package ASPCALab that includes the algorithms and files that produce the figures in this article is available at To exploit wavelet bases, it makes use of the open-source library WaveLab available at

4.2 Data-Based Choice of k

When the size k of the set of selected variables I^ is itself determined from the data, we write k^ for I^. Here are two possibilities, both based on the sample variances σ^ν2 of Step 2 presented earlier:

  • (a)
    Choose coordinates with variance exceeding the estimated noise level by a specified fraction αn:
    This choice was considered in Section 3.
  • (b)
    As motivation for the second method, recall that we hope that the selected set of variables I^ is both small in cardinality and also captures most of the variance of the population principal components, in the sense that the ratio
    is close to one for each of the leading population principal components in {ρ1, …, ρm}. Now let χ(n),α2 denote the upper α percentile of the χ(n)2 distribution. If all coordinates were pure noise, one might expect the ordered sample variances σ^(ν)2 to be close to (n1)1σ^2χ(n1),ν(p+1)2. Define the excess over these percentiles by
    and for a specified fraction w(n) [set membership] (0, 1), set
    where k^ is the smallest index k for which the inequality holds, and the somewhat sloppy notation refers to the indices ν that contribute to the left-hand sum of excesses η(ν)2. This second method has been used for the figures in this article, typically with w(n) = 0.995.

Estimation of σ

If the population principal components ρj have a sparse representation in basis {eν}, then we may expect that in most coordinates, ν, {xiν} will consist largely of noise. This suggests a simple estimate of the noise level on the assumption that the noise level is the same in all coordinates—namely,


4.3 Computational Complexity

One estimates the cost of sparse PCA by examining its main steps:

  1. This depends on the choice of basis. In the wavelet case, no more than O(np log p) operations are needed. (see, for example, Mallat (1999)).
  2. Evaluate and sort the sample variances and select I^:O(np+plogp).
  3. Compute a k × k matrix and its eigendecomposition: O(k3+k2n).
  4. Apply thresholding to each vector in I^:O(k2), and estimate σ^2 and ρ^2:O(p).
  5. Reconstruct eigenvectors in the original sample space: O(k2p).

Hence, the total cost of sparse PCA is O(np log p + k2max(p, n)). Both standard and smoothed PCA need at least O((min(p, n))3) operations. Therefore, if we can find a sparse basis such that k/p→0, then under the assumption that p/nc as n→∞, the total cost of sparse PCA is o(p3). We will see in the examples that follow that the savings can be substantial.


5.1 Simulated Examples

The two examples in this section are both motivated by functional data with localized features. The first is a three-peak principal component depicted in Figure 1, and already discussed in Section 1. The second example, Figure 2, has an underlying first principal component composed of step functions. For both examples, the dimension of data vectors is p = 2,048, the number of observations n = 1,024, and the noise level σ = 1. However, the amplitudes of ρ differ, with ||ρ|| = 10 for the “three-peak” function and ||ρ|| = 25 for the “step” function. The corresponding square root signal-to-noise ratios ω = [var rho]/σ (Theorem 1) are 10 and ~25 respectively.

Figure 2
Comparison of the sample principal components for a step function. (a) True principal component ρl. (b) a sample case drawn from model (2) with σ = 1, p = 2048. (c) Sample principal component by standard PCA. (d) Sample principal component ...

Figure 1c and Figure 2c, respectively, show the sample principal components obtained by using standard PCA. Although standard PCA does capture the peaks and steps, it retains significant noise in the flat regions of the function. Corresponding Figure 1d and Figure 2d show results from smooth PCA with the indicated values of the smoothing parameter. Just as for the three-peak curve discussed earlier, in the case of the step function, none of the three estimates simultaneously captures both jumps and flat regions well.

Figures 1e, f and Figures 2e, f present the principal components obtained by sparse PCA without and with the thresholding step, respectively. The WaveLab wavelet bases Symmlet and Haar are used for the “three-peak” and “step” functions respectively. Using method (b) of Section 4.2 with w = 99.5%, the subset step selects k = 142 and 361 for the “three-peak” curve and “step” function, respectively. The sample principal component in Figure 1f is clearly superior to the other sample principal components in Figure 1. Although the principal component function in the step case appears to be only slightly better than the solid, red, smooth PCA estimate, we will see shortly that its squared error is reduced by about 75%.

Table 1 compares the accuracy of the three PCA algorithms, using average squared error (ASE) defined as ASE=p1ρ^ρ. (ρ^ was first normalized to have length ||ρ|| before computing ASE.) The running time is average CPU time over 50 iterations, used by MATLAB on an Intel Core Duo CPU T2400 at 1.83 GHz.

Table 1
Accuracy and efficiency comparison

Figure 3 presents boxplots of ASE for the 50 iterations. Overall, sparse PCA with thresholding always gives the best result for the “step” curve. For the “three-peak” function, in only a few iterations (~15%) does sparse PCA generate larger error than smoothed PCA with λ = 10−12. On average, ASE using sparse PCA is superior to the other methods by a large margin. Overall, Table 1 and Figure 3 show that sparse PCA leads to the most accurate principal component (within the techniques considered) while using much less CPU time than other PCA algorithms.

Figure 3
Side-by-side boxplots of ASE from 50 iterations using different algorithms for the “three-peak” function (a) and for the “step” function (b).

5.2 Noise Level in the Single Component Model

Anderson (1963) obtained the asymptotic distribution of n(ρ^ρ) for fixed p—in particular,


as n→∞. Here, p increases with n, but one can nevertheless use (15) as a heuristic for estimating the variance τ^ needed for thresholding. Because the effect of thresholding is to remove noise in small coefficients, setting ρν to 0 in (15) suggests


Neither ||ρ||2 nor σ2 in (16) is known, but they can be estimated by using the information contained in the sample covariance matrix S of (10). Hence, in the single component model,


If ρν is a sparse representation of ρ, then most coefficients will be small, suggesting the estimate (14) for σ2. In turn, this suggests as an estimate


5.3 ECG Example

We offer a brief illustration of sparse PCA as applied to some electrocardiogram (ECG) data kindly provided by Jeffrey Froning and Victor Froelicher in the cardiology group at Palo Alto Veterans Affairs Hospital. Beat sequences—typically about 60 cycles in length—were obtained from some 15 healthy patients; we selected two for the preliminary illustrations here. Individual beats are notable for features such as the sharp spike (“QRS complex”) and the subsequent lower peak (“T wave”), seen, for example, in Figures 5a and d. The presence of these local features, of differing spatial scales, suggests the use of wavelet bases for efficient representation. Traditional ECG analysis focuses on averages of a series of beats. If one were to look instead at beat-to-beat “variation,” one might expect these local features to play a significant role in the principal component eigenvectors.

Figure 5
ECG examples. (Note: Colors refer to online version; “vertical offset” to monochrome print version for clarity.) (a) Mean curve for ECG sample 1, n = 66, in blue, along with x+2ρ^ (green) and x2ρ ...

Considerable preprocessing is routinely done on ECG signals before the beat averages are produced for physician use. Here we summarize certain steps taken with our data, in collaboration with Jeff Froning, preparatory to the PCA analysis. The most important feature of an ECG signal is the Q-R-S complex: The maximum occurs at the R wave, as seen, for example, in Figure 5a. Therefore, we define the length of one cycle as the gap between two adjacent maxima of R waves.

  1. “Baseline wander” is observed in many ECG datasets (compare with Figure 4, with a caption that summarizes the adjustment used).
    Figure 4
    Baseline wander is observed in many ECG datasets. One common remedy for this problem is to deduct a piecewise linear baseline from the signal, the linear segment (dashed line) between two beats being determined from two adjacent onset points. The onset ...
  2. Because pulse rates vary even on short timescales, the duration of each heartbeat cycle may vary as well. We use linear interpolation to equalize the duration of each cycle, and for convenience in using wavelet software, discretize to 512 = 29 sample points in each cycle.
  3. Because of the importance of the R wave, the horizontal positions of the maxima are registered at the 150th position in each cycle.
  4. The ECG data vector is converted into an n × p data matrix, where n is the number of observed cycles and p = 512.

5.4 PCA Analysis

Figures 5a and d show the mean curves for two ECG samples in blue. The number of observations n (i.e., number of heartbeats recorded) are 66 and 61, respectively. The first sample principal components for these two sample sets (“patients”) are plotted in Figures 5c and f, with the upper/red curves from standard PCA and the lower/blue curves from sparse PCA, with thresholds chosen subjectively. In both cases, there are two sharp peaks in the vicinity of the QRS complex. The first peak occurs shortly before the 150th position, where all the maxima of R waves are aligned, and the second peak, which has an opposite sign, occurs shortly thereafter. (The lower/blue curves have been offset vertically by −0.4 and −0.2 in Figures 5c and f, respectively, for legibility in monochrome.)

The standard PCA curve in Figure 5c (upper/red) is less noisy than that in Figure 5f (upper/red), even allowing for the difference in vertical scales. Using (14), σ^12=24.59 and σ^22=80.77, whereas the magnitudes of the two mean sample curves are very similar.

The sparse PCA curves (lower/blue) are smoother than the standard PCA ones (upper/red), especially in Figure 5f, where the signal-to-noise ratio is lower. On the other hand, the upper/red and lower/blue curves match quite well at the two main peaks. Sparse PCA has reduced noise in the sample principal component in the baseline while keeping the main features.

There is a notable difference between the estimated principal components for the two patients. In the first case, the principal component is concentrated around the R-wave maximum, and the effect is to accelerate or decelerate the rise (and fall) of this peak from baseline in a given cycle. This is more easily seen by comparing plots of x+2ρ^ (green) with x2ρ^ (red), shown over a magnified part of the cycle in Figure 5b. In the second patient, the bulk of the energy of the principal component is concentrated in a level shift in the part of the cycle starting with the ST segment. This can be interpreted as beat-to-beat fluctuation in baseline; because each beat is anchored at 0 at the onset point, there is less fluctuation on the left side of the peak. This is particularly evident in Figure 5e. There is, again, a slight acceleration/deceleration in the rise to the R-wave peak—less pronounced in the first case, and also less evident in the fall.

Obvious questions raised by this illustrative example include the nature of effects that may have been introduced by the preprocessing steps—notably, the baseline removal anchored at onset points and the alignment of R-wave maxima. Clearly, some conventions must be adopted to create rectangular data matrices for principal component analysis, but detailed analysis of these issues must await future work.

To summarize, sparse PCA has reduced noise in the sample principal component in the baseline while keeping the main features, and in addition, sparse PCA uses less than 10% of the computing time used by standard PCA in these examples.


We first establish some notation and recall some pertinent matrix results (e.g., Golub and Van Loan 1996). Norms on vectors are always Euclidean 2-norms: x=(νxν2)12. Define the 2-norm of a rectangular matrix by ‖A‖ = sup{‖Ax‖ : ‖x‖ = 1. If A is real and symmetric, then ‖A‖ = λmax(A). If Ap×p is partitioned,


where b is (p − 1) × 1, then by setting x = (1, 0T)T, one finds that ‖b‖ ≤ ‖A‖. The matrix B = ρuT + uρT has at most two nonzero eigenvalues, given by


6.1 Perturbation Bounds

Suppose that a symmetric matrix Ap×p has unit eigenvector q1. We wish to bound the effect of a “symmetric” perturbation Ep×p on q1. The following result (Golub and Van Loan 1996, theorem 8.1.10) constructs a unit eigenvector q^1 of A + E and bounds its distance from q1 in terms of ‖E‖.

Let Qp×p = [q1 Q2] be an orthogonal matrix containing q1 in the first column, and partition conformally


where D22 and E22 are both (p − 1) × (p − 1).

Suppose that λ is separated from the set of eigenvalues of D22, denoted λ(D22); set


If ‖E‖ ≤ δ/5, then there exists rRp1 satisfying


such that q^1=(1+rTr)12(q1+Q2r) is a unit eigenvector ofA + E. Moreover, with d as in (5),


Let us remark that because ||e|| ≤ ||E||, we have ||r|| ≤ 1 and


Suppose now that q1 is the eigenvector of A associated with the “principal” eigenvalue λ1(A). Here and later, λi(A) denotes the ith largest eigenvalue of A. We verify that, using the preceding conditions, q^1 is also the principal eigenvector of A + E. In other words, if (A+E)q^1=λq^1, then, in fact, λ* = λ1(A + E).

To show this, we verify that λ* > λ2(A + E). Take inner products with q1 in the eigenequation for q^1:


Because A is symmetric, q1TA=λ1(A)q1T. Trivially, we have q1TEq^1E. Combine these remarks with (20) to get λλ1(A)2E.

Now δ = λ1(A) – λ2(A) and, because from the minimax characterization of eigenvalues (e.g., Golub and Van Loan 1996, p. 396), λ2(A + E) ≤ λ2(A) + ||E||, we have


6.2 Some Limit Theorems

Collect the noise vectors into a matrix Zp×n = [z1 ... zn]. We turn to properties of the noise matrix Z. The cross-products matrix ZZT has a standard p-dimensional Wishart Wp(n, I) distribution with n degrees of freedom and identity covariance matrix (see, for example, Muirhead 1982, p. 82). Thus, the matrix C = σ2(n−1ZZTIp) is simply a scaled and recentered Wishart matrix.

Geman (1980) and Silverstein (1985), respectively, established almost sure limits for the largest and smallest eigenvalues of a Wp(n, I) matrix as p/nc [set membership] [0, ∞), from which follows:


(Although the results in the articles cited are for c [set membership] (0, ∞), the results are easily extended to c = 0 by simple coupling arguments.)

Suppose, in addition, that υ is an n × 1 vector with independent N(0, 1) entries, which are also independent of Z. Conditioned on υ, the vector Zυ is distributed as Np(0,||υ||2I). Because Z is independent of υ, we conclude that


where χ(n)2 and χ(p)2 denote chi-squared variables the Up a vector uniform on the surface of the (p – 1)-dimensional unit sphere Sp–1 in Rp, and all three variables χ(n), χ(p), and Up are independent.

Now let u = σn−1Zυ. From (23) we have, as p/nc [set membership] [0, ∞),


6.3 Proof of Theorem 2


Recall from (11) that, given αn = α(n−1 log(n [logical or] p))1/2, the selected subset of variables I^ is defined by I^={ν:σ^ν2σ2(1+αn)} and that the estimated principal eigenvector based on I^ written ρ^I. We define a vector ρI with coordinates (ρI,ν) by selecting coordinates from ρ = (ρν) according to membership in I^:


We will use the triangle inequality d(ρ^I,ρ)d(ρ^I,ρI)+d(ρI,ρ) to show that ρ^Iρ. There are three main steps.

(i) Construct deterministic sets of indices In±={ν:ρν2σ2aαn} with constants a[minus-or-plus sign] to be determined later, which bracket I^ almost surely as n → ∞:


(ii) The uniform sparsity, combined with I^c(In)c, is used to show that d(ρI,ρ)a.s.0.

(iii) The containment I^In+, along with In+=o(n) shows that d(ρI,ρI)a.s.0.


Step (i)

We first obtain a bound on the cardinality of In± using the sparsity condition (9). Using (9), |ρ|(ν)Cν−1/q, and so


Let σν2=σ2+ρν2. Turning to the bracketing relations (25), we first remark that σ^ν2=Dσν2χ(n)2n, and when νIn±,


Using the definitions of I^ and writing Mn for a random variable with the distribution of χ(n)2n we have


We apply (A.2) from the Appendix with [sm epsilon]n = (a+ - 1)αn/(1 + a+αn) and for n large and α′ slightly smaller than α2,nn2>(a+1)2αlog(np), so that


with α+=(α+1)2α4. If α12, then α+3 for suitable a+ > 2.

The argument for the other inclusion is analogous, using (A.3) in place of (A.2):


with α_=3(1a_)2α16 so long as α′ is now slightly less than α2 and n is large enough. If α12, then α_>2 for suitable a_<189.

By a Borel-Cantelli argument, (25) follows from the bounds on Pn and Pn+.

Step (ii)

We first remark that one may easily show that d(ρ + u, ρ) ≤ ||u||/(||ρ|| – ||u||), so that norm convergence implies d-convergence. So, for n > n(ω) we have InI^ and so


When ν(In)c, we have by definition


say, while the uniform sparsity condition entails ρ(ν)2C2ν2q.

Putting these together, and defining s1 = s1(n) as the solution of the equation Cs−1/q = [sm epsilon]n, and writing ab for min(a, b), we obtain, as n → ∞,


Step (iii)

We adopt the abbreviations


and similarly for EI. We consider SI=SIσ2Ik^=ρIρIT+EI and note that the perturbation term has the decomposition


so that


Consider the first term on the right side. Because ρIρa.s.0 from Step (ii), it follows that ρIa.s.ρ. Because υsa.s.0, the first term is asymptotically negligible.

Let ZI+=(zνi:νIn+,i=1,,n) and uI+=(uν:νIn+). On the event Ωn={I^In+}, we have || uI || ≤ || uI+ || and setting k+=In+, by the same arguments as led to (24), we have


because k+ = o(n) from Step (i).

Finally, because on the event Ωn, the matrix ZI+ contains ZI, along with some additional rows, it follows that λmax(n1ZIZITI)λmax(n1ZI+ZI+TI)a.s.0 by (22), again because k+ = o(n). Combining previous bounds, we conclude that ||EI|| → 0.

The separation δn = ||ρI||2 → ||ρ||2 > 0 and so, by the perturbation bound (19),



In models with observational noise such as (2), in which the number of variables p grows with the number of cases n, we have reviewed results that show that standard PCA yields consistent estimates of the principal eigenvectors if and only if p/n → 0.

If the leading population principal eigenvector has a sparse representation in a given basis, Theorem 2 shows that it can be consistently estimated by selecting a subset of variables with variances above a threshold and then by restricting the PCA to this selected set. Incorporating a threshold is found empirically to be helpful. Future theoretical work might explore the tradeoff between variable selection and thresholding.

In summary, sparse PCA as described here may be of practical benefit in high-dimensional settings with substantial observational noise in which variation between individuals resides mainly in a subset of the coordinates in which the data are represented (perhaps after transformation).


This work was supported in part by grants NSF DMS 0072661, 0505303, and NIH EB R01 EB001988.


A.1 Large Deviation Inequalities

If X=n11nXi is the average of iid variates with moment-generating function exp{∧(λ)} = E exp{λX1}, then Cramer's theorem (see, for example, Dembo and Zeitouni 1993, sections 2.2.2 and 2.2.12) says that for x > EX1,


where the conjugate function ∧*(x) = supλx – ∧(λ)}. The same bound holds for P{X<x} when x < EX1. When applied to the χ(n)2 distribution, with X1 = z2 and z ~ N(0, 1), the moment-generating function


and the conjugate function


The bounds


(the latter following, for example, from (47) in Johnstone (2001)) yield



A.2 Proof of Theorem 3

Assume, without loss of generality, that σ12σ22σp2.

False Inclusion

For any fixed constant t, and l [set membership] Iout,


This threshold device leads to bounds on error probabilities using only marginal distributions. For example, consider false inclusion of variable l:


Write Mn for a χ(n)2n variate, and note from (12) that σ^ν2~σν2Mn. Set t=σk2(1n) for a value of [sm epsilon]n to be determined. Because σi2σk2 and σl2σk2(1αn), we arrive at


using large deviation bound (A.2). With the choice n=3αn(2+3) both exponents are bounded above by –b(α)log(n [logical or] p), and so P{FI} ≤ p(k + 1)(n [logical or] p)b(α).

False Exclusion

The argument is similar, starting with the remark that for any fixed t and l [set membership] Iin,


Consequently, if we set t=σk2(1+εn) and use σ12σk2(1+αn), we get


this time using (A.3). The bound P{FE} ≤ pk(n [logical or] p)b(α) ke–b(α)(1 – 2an)log(n[logical or]p) follows on setting n=2αn(2+3) and noting that (1 + αn)−2 ≥ 1 –2αn.

For numerical bounds, setting Ln = log(n [logical or] p), we may collect the preceding bounds in the form



  • Amini AA, Wainwright MJ. The Annals of Statistics. 2009. High-Dimensional Analysis of Semidefinite Relaxations for Sparse Principal Components.
  • Anderson TW. Asymptotic Theory for Principal Component Analysis. Annals of Mathematical Statistics. 1963;34:122–148.
  • Biehl M, Mietzner A. Statistical Mechanics of Unsupervised Structure Recognition. Journal of Physics A: Mathematical and General. 1994;27:1885–1897.
  • Cherkassky V, Mulier F. Learning from Data. Wiley; New York: 1998.
  • d'Aspremont A, El Ghaoui L, Jordan M, Lanckriet G. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review. 2007;49:434–448.
  • Dembo A, Zeitouni O. Large Deviations Techniques and Applications. Jones and Bartlett; Boston: 1993.
  • Donoho D. Unconditional Bases Are Optimal Bases for Data Compression and Statistical Estimation. Applied and Computational Harmonic Analysis. 1993;1:100–115.
  • Donoho DL, Johnstone IM, Kerkyacharian G, Picard D. “Wavelet Shrinkage: Asymptopia” (with discussion) Journal of the Royal Statistical Society, Ser. B. 1995;57:301–369.
  • Du Q, Fowler JE. Low-Complexity Principal Component Analysis for Hyperspectral Image Compression. International Journal of High Performance Computing Applications. 2008;22:438–448.
  • Feng GC, Yuen PC, Dai DQ. Human Face Recognition Using PCA on Wavelet Subband. Journal of Electronic Imaging. 2000;9:226–233.
  • Fowler JE. Compressive–Projection Principal Component Analysis for the Compression of Hyperspectral Signatures. In: Storer JA, Marcellin MW, editors. Data Compression Conference, 2008. DCC 2008. IEEE; Snowbird, UT: 2008. pp. 83–92.
  • Geman S. A Limit Theorem for the Norm of Random Matrices. Annals of Probability. 1980;8:252–261.
  • Golub GH, Van Loan CF. Matrix Computations. 3rd ed Johns Hopkins University Press; Baltimore: 1996.
  • Hoyle DC, Rattra y, M. Principal-Component-Analysis Eigenvalue Spectra from Data with Symmetry Breaking Structure. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics. 2004;69:026124. [PubMed]
  • Johnstone IM. Chi Square Oracle Inequalities. In: de Gunst M, Klaassen C, van der Waart A, editors. Festschrift for Willem R. van Zwet (vol. 36 of IMS Lecture Notes—Monographs) Institute of Mathematical Statistics; Beachwood, OH: 2001. pp. 399–418.
  • Johnstone IM. Function Estimation and Gaussian Sequence Models. Draft of a monograph. 2003. Available at
  • Johnstone IM, Lu AY. Technical Report. Stanford University, Dept. of Statistics; 2004. Sparse Principal Components Analysis. Available at as e-print 0901.4392.
  • Jolliffe IT, Trendafilov NT, Uddin M. A Modified Principal Component Technique Based on the LASSO. Journal of Computational and Graphical Statistics. 2003;12:531–547.
  • Kaewpijit S, Le Moigne J, El-Ghazawi T. A Wavelet-Based PCA Reduction for Hyperspectral Imagery. Geoscience and Remote Sensing Symposium, 2002.IGARSS'02. 2002 IEEE International; Washington, DC: IEEE; 2002. pp. 2581–2583.
  • Lu AY. Ph.D. dissertation. Stanford University, Dept. of Statistics; 2002. Sparse Principal Components Analysis for Functional Data.
  • Mallat S. A Wavelet Tour of Signal Processing. 2nd ed. Academic Press; New York: 1999.
  • Muirhead RJ. Aspects of Multivariate Statistical Theory. Wiley; NewYork: 1982.
  • Nadler B. Finite Sample Approximation Results for Principal Component Analysis: A Matrix Perturbation Approach. The Annals of Statistics. 2008;36:2791–2817.
  • Nadler B, Coifman R. The Prediction Error in CLS and PLS: The Importance of Feature Selection Prior to Multivariate Calibration. Journal of Chemometrics. 2005;19:107–118.
  • Paul D. Asymptotics of Sample Eigenstructure for a Large Dimensional Spiked Covariance Model. Statistica Sinica. 2007;17:1617–1642.
  • Paul D, Johnstone I. Technical Report. Stanford University, Dept. of Statistics; 2004. Estimation of Principal Components through Coordinate Selection.
  • Ramsay JO, Silverman BW. Functional Data Analysis. Springer; Berlin: 1997.
  • Reimann P, Van den Broeck C, Bex GJ. A Gaussian Scenario for Unsupervised Learning. Journal of Physics A: Mathematical and General. 1996;29:3521–3535.
  • Rice JA, Silverman BW. Estimating the Mean and Covariance Structure Nonparametrically When the Data Are Curves. Journal of the Royal Statistical Society, Ser. B. 1991;53:233–243.
  • Silverman BW. Smoothed Functional Principal Components Analysis by Choice of Norm. The Annals of Statistics. 1996;24:1–24.
  • Silverstein JW. The Smallest Eigenvalue of a Large Dimensional Wishart Matrix. Annals of Probability. 1985;13:1364–1368.
  • Watkin TLH, Nadal J-P. Optimal Unsupervised Learning. Journal of Physics A: Mathematical and General. 1994;27:1899–1915.
  • Wickerhauser MV. Large-Rank Approximate Principal Component Analysis with Wavelets for Signal Feature Discrimination and the Inversion of Complicated Maps. Journal of Chemical Information and Computer Sciences. 1994a;34:1036–1046.
  • Wickerhauser MV. Two Fast Approximate Wavelet Algorithms for Image Processing, Classification, and Recognition. Optical Engineering. 1994b;33:2225–2235. (Special issue on Adapted Wavelet Analysis.)
  • Wolf L, Bileschi S. Computer Vision and Pattern Recognition. vol. 2. IEEE Computer Society; Washington, DC: 2005. Combining Variable Selection wit Dimensionality Reduction; pp. 801–806. 2005.
  • Wolf L, Shashua A. Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weighted-Based Approach. Journal of Machine Learning Research. 2005;6:1855–1887.
  • Zou H, Hastie T, Tibshirani R. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics. 2006;15:265–286.