|Home | About | Journals | Submit | Contact Us | Français|
Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance. Finding such new variables, the principal components, reduces to solving an eigenvalue/eigenvector problem, and the new variables are defined by the dataset at hand, not a priori, hence making PCA an adaptive data analysis technique. It is adaptive in another sense too, since variants of the technique have been developed that are tailored to various different data types and structures. This article will begin by introducing the basic ideas of PCA, discussing what it can and cannot do. It will then describe some variants of PCA and their application.
Large datasets are increasingly widespread in many disciplines. In order to interpret such datasets, methods are required to drastically reduce their dimensionality in an interpretable way, such that most of the information in the data is preserved. Many techniques have been developed for this purpose, but principal component analysis (PCA) is one of the oldest and most widely used. Its idea is simple—reduce the dimensionality of a dataset, while preserving as much ‘variability’ (i.e. statistical information) as possible.
Although it is used, and has sometimes been reinvented, in many different disciplines it is, at heart, a statistical technique and hence much of its development has been by statisticians.
This means that ‘preserving as much variability as possible’ translates into finding new variables that are linear functions of those in the original dataset, that successively maximize variance and that are uncorrelated with each other. Finding such new variables, the principal components (PCs), reduces to solving an eigenvalue/eigenvector problem. The earliest literature on PCA dates from Pearson  and Hotelling , but it was not until electronic computers became widely available decades later that it was computationally feasible to use it on datasets that were not trivially small. Since then its use has burgeoned and a large number of variants have been developed in many different disciplines. Substantial books have been written on the subject [3,4] and there are even whole books on variants of PCA for special types of data [5,6]. In §2, the formal definition of PCA will be given, in a standard context, together with a derivation showing that it can be obtained as the solution to an eigenproblem or, alternatively, from the singular value decomposition (SVD) of the (centred) data matrix. PCA can be based on either the covariance matrix or the correlation matrix. The choice between these analyses will be discussed. In either case, the new variables (the PCs) depend on the dataset, rather than being pre-defined basis functions, and so are adaptive in the broad sense. The main uses of PCA are descriptive, rather than inferential; an example will illustrate this.
Although for inferential purposes a multivariate normal (Gaussian) distribution of the dataset is usually assumed, PCA as a descriptive tool needs no distributional assumptions and, as such, is very much an adaptive exploratory method which can be used on numerical data of various types. Indeed, many adaptations of the basic methodology for different data types and structures have been developed, two of which will be described in §3a,d. Some techniques give simplified versions of PCs, in order to aid interpretation. Two of these are briefly described in §3b, which also includes an example of PCA, together with a simplified version, in atmospheric science, illustrating the adaptive potential of PCA in a specific context. Section 3c discusses one of the extensions of PCA that has been most active in recent years, namely robust PCA (RPCA). The explosion in very large datasets in areas such as image analysis or the analysis of Web data has brought about important methodological advances in data analysis which often find their roots in PCA. Each of §3a–d gives references to recent work. Some concluding remarks, emphasizing the breadth of application of PCA and its numerous adaptations, are made in §4.
The standard context for PCA as an exploratory data analysis tool involves a dataset with observations on p numerical variables, for each of n entities or individuals. These data values define p n-dimensional vectors x1,…,xp or, equivalently, an n×p data matrix X, whose jth column is the vector xj of observations on the jth variable. We seek a linear combination of the columns of matrix X with maximum variance. Such linear combinations are given by , where a is a vector of constants a1,a2,…,ap. The variance of any such linear combination is given by var(Xa)=a′Sa, where S is the sample covariance matrix associated with the dataset and ′ denotes transpose. Hence, identifying the linear combination with maximum variance is equivalent to obtaining a p-dimensional vector a which maximizes the quadratic form a′Sa. For this problem to have a well-defined solution, an additional restriction must be imposed and the most common restriction involves working with unit-norm vectors, i.e. requiring a′a=1. The problem is equivalent to maximizing a′Sa−λ(a′a−1), where λ is a Lagrange multiplier. Differentiating with respect to the vector a, and equating to the null vector, produces the equation
Thus, a must be a (unit-norm) eigenvector, and λ the corresponding eigenvalue, of the covariance matrix S. In particular, we are interested in the largest eigenvalue, λ1 (and corresponding eigenvector a1), since the eigenvalues are the variances of the linear combinations defined by the corresponding eigenvector a: var(Xa)=a′Sa=λa′a=λ. Equation (2.1) remains valid if the eigenvectors are multiplied by −1, and so the signs of all loadings (and scores) are arbitrary and only their relative magnitudes and sign patterns are meaningful.
Any p×p real symmetric matrix, such as a covariance matrix S, has exactly p real eigenvalues, λk (k=1,…,p), and their corresponding eigenvectors can be defined to form an orthonormal set of vectors, i.e. a′kak′=1 if k=k′ and zero otherwise. A Lagrange multipliers approach, with the added restrictions of orthogonality of different coefficient vectors, can also be used to show that the full set of eigenvectors of S are the solutions to the problem of obtaining up to p new linear combinations , which successively maximize variance, subject to uncorrelatedness with previous linear combinations . Uncorrelatedness results from the fact that the covariance between two such linear combinations, Xak and Xak′, is given by a′k′Sak=λka′k′ak=0 if k′≠k.
It is these linear combinations Xak that are called the principal components of the dataset, although some authors confusingly also use the term ‘principal components’ when referring to the eigenvectors ak. In standard PCA terminology, the elements of the eigenvectors ak are commonly called the PC loadings, whereas the elements of the linear combinations Xak are called the PC scores, as they are the values that each individual would score on a given PC.
It is common, in the standard approach, to define PCs as the linear combinations of the centred variables x*j, with generic element , where denotes the mean value of the observations on variable j. This convention does not change the solution (other than centring), since the covariance matrix of a set of centred or uncentred variables is the same, but it has the advantage of providing a direct connection to an alternative, more geometric approach to PCA.
Denoting by X* the n×p matrix whose columns are the centred variables x*j, we have
Equation (2.2) links up the eigendecomposition of the covariance matrix S with the singular value decomposition of the column-centred data matrix X*. Any arbitrary matrix Y of dimension n×p and rank r (necessarily, ) can be written (e.g. ) as
where U,A are n×r and p×r matrices with orthonormal columns (U′U=Ir=A′A, with Ir the r×r identity matrix) and L is an r×r diagonal matrix. The columns of A are called the right singular vectors of Y and are the eigenvectors of the p×p matrix Y′Y associated with its non-zero eigenvalues. The columns of U are called the left singular vectors of Y and are the eigenvectors of the n×n matrix YY′ that correspond to its non-zero eigenvalues. The diagonal elements of matrix L are called the singular values of Y and are the non-negative square roots of the (common) non-zero eigenvalues of both matrix Y′Y and matrix YY′. We assume that the diagonal elements of L are in decreasing order, and this uniquely defines the order of the columns of U and A (except for the case of equal singular values ). Hence, taking Y=X*, the right singular vectors of the column-centred data matrix X* are the vectors ak of PC loadings. Due to the orthogonality of the columns of A, the columns of the matrix product X*A=ULA′A=UL are the PCs of X*. The variances of these PCs are given by the squares of the singular values of X*, divided by n−1. Equivalently, and given (2.2) and the above properties,
where L2 is the diagonal matrix with the squared singular values (i.e. the eigenvalues of (n−1)S). Equation (2.4) gives the spectral decomposition, or eigendecomposition, of matrix (n−1)S. Hence, PCA is equivalent to an SVD of the column-centred data matrix X*.
The properties of an SVD imply interesting geometric interpretations of a PCA. Given any rank r matrix Y of size n×p, the matrix Yq of the same size, but of rank q<r, whose elements minimize the sum of squared differences with corresponding elements of Y is given  by
where Lq is the q×q diagonal matrix with the first (largest) q diagonal elements of L and Uq, Aq are the n×q and p×q matrices obtained by retaining the q corresponding columns in U and A.
In our context, the n rows of a rank r column-centred data matrix X* define a scatterplot of n points in an r-dimensional subspace of , with the origin as the centre of gravity of the scatterplot. The above result implies that the ‘best’ n-point approximation to this scatterplot, in a q-dimensional subspace, is given by the rows of X*q, defined as in equation (2.5), where ‘best’ means that the sum of squared distances between corresponding points in each scatterplot is minimized, as in the original approach by Pearson . The system of q axes in this representation is given by the first q PCs and defines a principal subspace. Hence, PCA is at heart a dimensionality-reduction method, whereby a set of p original variables can be replaced by an optimal set of q derived variables, the PCs. When q=2 or q=3, a graphical approximation of the n-point scatterplot is possible and is frequently used for an initial visual representation of the full dataset. It is important to note that this result is incremental (hence adaptive) in its dimensions, in the sense that the best subspace of dimension q+1 is obtained by adding a further column of coordinates to those that defined the best q-dimensional solution.
The quality of any q-dimensional approximation can be measured by the variability associated with the set of retained PCs. In fact, the sum of variances of the p original variables is the trace (sum of diagonal elements) of the covariance matrix S. Using simple matrix theory results it is straightforward to show that this value is also the sum of the variances of all p PCs. Hence, the standard measure of quality of a given PC is the proportion of total variance that it accounts for,
where tr(S) denotes the trace of S. The incremental nature of PCs also means that we can speak of a proportion of total variance explained by a set of PCs (usually, but not necessarily, the first q PCs), which is often expressed as a percentage of total variance accounted for: .
It is common practice to use some predefined percentage of total variance explained to decide how many PCs should be retained (70% of total variability is a common, if subjective, cut-off point), although the requirements of graphical representation often lead to the use of just the first two or three PCs. Even in such situations, the percentage of total variance accounted for is a fundamental tool to assess the quality of these low-dimensional graphical representations of the dataset. The emphasis in PCA is almost always on the first few PCs, but there are circumstances in which the last few may be of interest, such as in outlier detection  or some applications of image analysis (see §3c).
PCs can also be introduced as the optimal solutions to numerous other problems. Optimality criteria for PCA are discussed in detail in numerous sources (see [4,8,9], among others). McCabe  uses some of these criteria to select optimal subsets of the original variables, which he calls principal variables. This is a different, computationally more complex, problem .
PCA has been applied and found useful in very many disciplines. The two examples explored here and in §3b are very different in nature. The first examines a dataset consisting of nine measurements on 88 fossil teeth from the early mammalian insectivore Kuehneotherium, while the second, in §3b, is from atmospheric science.
Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK . The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.
Of the nine variables, three measure aspects of the length of a tooth, while the other six are measurements related to height and width. A PCA was performed using the prcomp command of the R statistical software . The first two PCs account for 78.8% and 16.7%, respectively, of the total variation in the dataset, so the two-dimensional scatter-plot of the 88 teeth given by figure 1 is a very good approximation to the original scatter-plot in nine-dimensional space. It is, by definition, the best variance-preserving two-dimensional plot of the data, representing over 95% of total variation. All of the loadings in the first PC have the same sign, so it is a weighted average of all variables, representing ‘overall size’. In figure 1, large teeth are on the left and small teeth on the right. The second PC has negative loadings for the three length variables and positive loadings for the other six variables, representing an aspect of the ‘shape’ of teeth. Fossils near the top of figure 1 have smaller lengths, relative to their heights and widths, than those towards the bottom. The relatively compact cluster of points in the bottom half of figure 1 is thought to correspond to a species of Kuehneotherium, while the broader group at the top cannot be assigned to Kuehneotherium, but to some related, but as yet unidentified, animal.
So far, PCs have been presented as linear combinations of the (centred) original variables. However, the properties of PCA have some undesirable features when these variables have different units of measurement. While there is nothing inherently wrong, from a strictly mathematical point of view, with linear combinations of variables with different units of measurement (their use is widespread in, for instance, linear regression), the fact that PCA is defined by a criterion (variance) that depends on units of measurement implies that PCs based on the covariance matrix S will change if the units of measurement on one or more of the variables change (unless all p variables undergo a common change of scale, in which case the new covariance matrix is merely a scalar multiple of the old one, hence with the same eigenvectors and the same proportion of total variance explained by each PC). To overcome this undesirable feature, it is common practice to begin by standardizing the variables. Each data value xij is both centred and divided by the standard deviation sj of the n observations of variable j,
Thus, the initial data matrix X is replaced with the standardized data matrix Z, whose jth column is vector zj with the n standardized observations of variable j (2.7). Standardization is useful because most changes of scale are linear transformations of the data, which share the same set of standardized data values.
Since the covariance matrix of a standardized dataset is merely the correlation matrix R of the original dataset, a PCA on the standardized data is also known as a correlation matrix PCA. The eigenvectors ak of the correlation matrix R define the uncorrelated maximum-variance linear combinations of the standardized variables z1,…,zp. Such correlation matrix PCs are not the same as, nor are they directly related to, the covariance matrix PCs defined previously. Also, the percentage variance accounted for by each PC will differ and, quite frequently, more correlation matrix PCs than covariance matrix PCs are needed to account for the same percentage of total variance. The trace of a correlation matrix R is merely the number p of variables used in the analysis, hence the proportion of total variance accounted for by any correlation matrix PC is just the variance of that PC divided by p. The SVD approach is also valid in this context. Since (n−1)R=Z′Z, an SVD of the standardized data matrix Z amounts to a correlation matrix PCA of the dataset, along the lines described after equation (2.2).
Correlation matrix PCs are invariant to linear changes in units of measurement and are therefore the appropriate choice for datasets where different changes of scale are conceivable for each variable. Some statistical software assumes by default that a PCA means a correlation matrix PCA and, in some cases, the normalization used for the vectors of loadings ak of correlation matrix PCs is not the standard a′kak=1. In a correlation matrix PCA, the coefficient of correlation between the jth variable and the kth PC is given by (see )
Thus, if the normalization is used instead of a′ka=1, the coefficients of the new loading vectors are the correlations between each original variable and the kth PC.
In the fossil teeth data of §2b, all nine measurements are in the same units, so a covariance matrix PCA makes sense. A correlation matrix PCA produces similar results, since the variances of the original variable do not differ very much. The first two correlation matrix PCs account for 93.7% of total variance. For other datasets, differences can be more substantial.
One of the most informative graphical representations of a multivariate dataset is via a biplot , which is fundamentally connected to the SVD of a relevant data matrix, and therefore to PCA. A rank q approximation X*q of the full column-centred data matrix X*, defined by (2.5), is written as X*q=GH′, where G=Uq and H=AqLq (although other options are possible, see ). The n rows gi of matrix G define graphical markers for each individual, which are usually represented by points. The p rows hj of matrix H define markers for each variable and are usually represented by vectors. The properties of the biplot are best discussed assuming that q=p, although the biplot is defined on a low-rank approximation (usually q=2), enabling a graphical representation of the markers. When q=p the biplot has the following properties:
As stated above, these results are only exact if all q=p dimensions are used. For q<p, the results are merely approximate and the overall quality of such approximations can be measured by the percentage of variance explained by the q largest variance PCs, which were used to build the marker matrices G and H.
Figure 2 gives the biplot for the correlation matrix PCA of the fossil teeth data of §2b. The variable markers are displayed as arrows and the tooth markers as numbers. The group of three nearly horizontal and very tightly knit variable markers for two width variables and one height variable, WIDTH, HTMDT and TRIWIDTH, suggests a group of highly correlated variables, which are also strongly correlated with the first PC (represented by the horizontal axis). The very high proportion of variability explained by the two-dimensional principal subspace provides solid grounds for these conclusions. In fact, the smallest of the three true coefficients of correlation between these three variables is 0.944 (HTMDT and TRIWIDTH), and the smallest magnitude correlation between PC1 and any of these variables is 0.960 (TRIWIDTH). The sign difference in PC2 loadings between the three length variables (towards the bottom left of the plot) and the other variables is clearly visible. Projecting the marker for individual 58 onto the positive directions of all variable markers suggests that fossil tooth 58 (on the left of the biplot) is a large tooth. Inspection of the data matrix confirms that it is the largest individual on six of the nine variables, and close to largest on the remaining three. Likewise, individuals 85–88 (on the right) are small-sized teeth. Individuals whose markers are close to the origin have values close to the mean for all variables.
As was seen in §2, PCA amounts to an SVD of a column-centred data matrix. In some applications , centring the columns of the data matrix may be considered inappropriate. In such situations, it may be preferred to avoid any pre-processing of the data and to subject the uncentred data matrix to an SVD or, equivalently, to carry out the eigendecomposition of the matrix of non-centred second moments, T, whose eigenvectors define linear combinations of the uncentred variables. This is often referred to as an uncentred PCA and there has been an unfortunate tendency in some fields to equate the name SVD only with this uncentred version of PCA.
Uncentred PCs are linear combinations of the uncentred variables which successively maximize non-central second moments, subject to having their crossed non-central second moments equal to zero. Except when the vector of column means (i.e. the centre of gravity of the original n-point scatterplot in p-dimensional space) is near zero (in which case centred and uncentred moments are similar), it is not immediately intuitive that there should be similarities between both variants of PCA. Cadima & Jolliffe  have explored the relations between the standard (column-centred) PCA and uncentred PCA and found them to be closer than might be expected, in particular when the size of vector is large. It is often the case that there are great similarities between many eigenvectors and (absolute) eigenvalues of the covariance matrix S and the corresponding matrix of non-centred second moments, T.
In some applications, row centrings, or both row- and column-centring (known as double-centring) of the data matrix, have been considered appropriate. The SVDs of such matrices give rise to row-centred and doubly centred PCA, respectively.
Datasets where there are fewer observed entities than variables (n<p) are becoming increasingly frequent, thanks to the growing ease of observing variables, together with the high costs of repeating observations in some contexts (such as microarrays ). For example,  has an example in genomics in which n=59 and p=21225.
In general, the rank of an n×p data matrix is . If the data matrix has been column-centred, it is . When n<p, it is the number of observed individuals, rather than the number of variables, that usually determines the matrix rank. The rank of the column-centred data matrix X* (or its standardized counterpart Z) must equal the rank of the covariance (or correlation) matrix. The practical implication of this is that there are only r non-zero eigenvalues; hence r PCs explain all the variability in the dataset. Nothing prevents the use of PCA in such contexts, although some software, as is the case with R’s princomp (but not the prcomp) command, may balk at such datasets. PCs can be determined as usual, by either an SVD of the (centred) data matrix or the eigenvectors/values of the covariance (or correlation) matrix.
Recent research (e.g. [18,19]) has examined how well underlying ‘population’ PCs are estimated by the sample PCs in the case where np, and it is shown that in some circumstances there is little resemblance between sample and population PCs. However, the results are typically based on a model for the data which has a very small number of structured PCs, and very many noise dimensions, and which has some links with recent work in RPCA (see §3c).
The basic idea of PCA, leading to low-dimensional representations of large datasets in an adaptive and insightful way, is simple. However, the subsections in §2 have shown a number of subtleties that add some complexity. Going further, there are many ways to adapt PCA to achieve modified goals or to analyse data of different types. Because PCA is used in a large number of areas, research into modifications and adaptations is spread over literatures from many disciplines. Four such adaptations, chosen fairly arbitrarily from the many that exist, namely functional PCA, modifications of PCA to simplify interpretations, RPCA and symbolic data PCA, are described in the following subsections. Other adaptations are briefly mentioned in §4.
In some applications, such as chemical spectroscopy, observations are functional in nature, changing with some continuous variable which, for simplicity, we assume is time. The dataset is then a collection of n functions xi(t).
How to incorporate such functional features in the analysis is the goal of functional data analysis . Early work on functional PCA (e.g. ) performed a standard PCA on an n×p data matrix obtained by sampling n curves xi(t) at each of p points in time (tj, with j=1,…,p), so that the element in row i, column j, of the data matrix is xi(tj). The resulting p-dimensional vectors of loadings from a PCA of this data matrix are then viewed as sampled principal functions, which can be smoothed to recover functional form and can be interpeted as principal sources of variability in the observed curves . The above approach does not make explicit use of the functional nature of the n observations xi(t). To do so requires adapting concepts. In the standard setting, we consider linear combinations of p vectors, which produce new vectors. Each element of the new vectors is the result of an inner product of row i of the data matrix, (xi1,xi2,…,xip), with a p-dimensional vector of weights, a=(a1,…,ap): . If rows of the data matrix become functions, a functional inner product must be used instead, between a ‘loadings function’, a(t), and the ith functional observation, xi(t). The standard functional inner product is an integral of the form , on some appropriate compact interval. Likewise, the analogue of the p×p covariance matrix S is a bivariate function S(s,t) which, for any two given time instants s and t, returns the respective covariance, defined as
where is the mean function and is the ith centred function.
The analogue of the eigen-equation (2.1) involves an integral transform, which reflects the functional nature of S(s,t) and of inner products
The eigenfunctions a(t) which are the analytic solutions of equation (3.2) cannot, in general, be determined. Ramsay & Silverman  discuss approximate solutions based on numerical integration. An alternative approach, which they explore in greater detail, involves the assumption that the curves xi(t) can be written as linear combinations of a set of G basis functions ϕ1(t),…,ϕG(t), so that, for any data function i,
These basis functions can be chosen to reflect characteristics that are considered relevant in describing the observed functions. Thus, Fourier series functions may be chosen to describe periodic traits and splines for more general trends (B-splines are recommended). Other basis functions that have been used and can be considered are wavelets, exponential, power or polynomial bases. In theory, other bases, adapted to specific properties of a given set of observed functions, may be considered, although the computational problems that arise from any such choice must be kept in mind. The advantage of the basis function approach lies in the simplification of the expressions given previously. Denoting the n-dimensional vector of functions xi(t) as x(t), the G-dimensional vector of basis functions as ϕ(t) and the n×G matrix of coefficients cij as C, the n data functions in equation (3.3) can be written as a single equation x(t)=Cϕ(t). The eigenfunction a(t) can also be written in terms of the basis functions, with a(t)=ϕ(t)′b for some G-dimensional vector of coefficients b=(b1,…,bG). Assuming furthermore that x(t) and ϕ(t) are centred, the covariance function at time (s,t) becomes
where W is the G×G matrix of inner products between the basis functions. Since this equation must hold for all values of s, it reduces to
If the basis functions are orthonormal, W is the G×G identity matrix and we end up with a standard eigenvalue problem which provides the solutions a(t)=ϕ(t)′b to equation (3.2).
Ramsay & Silverman  further explore methods in which data functions xi(t) are viewed as solutions to differential equations, an approach which they call principal differential analysis, in order to highlight its close connections with PCA.
Research on functional PCA has continued apace since the publication of Ramsay and Silverman’s comprehensive text. Often this research is parallel to, or extends, similar ideas for data of non-functional form. For example, deciding how many PCs to retain is an important topic. A large number of suggestions have been made for doing so  and many selection criteria are based on intuitive or descriptive ideas, such as the obvious ‘proportion of total variance’. Other approaches are based on models for PCs. The problem of ‘how many functional PCs?’ is addressed in  using a model-based approach and criteria based on information theory.
As with other statistical techniques, it is possible that a few outlying observations may have a disproportionate effect on the results of a PCA. Numerous suggestions have been made for making PCA more robust to the presence of outliers for the usual data structure (see  and also §3c). One suggestion, using so-called S-estimators, is extended to functional PCA in .
Sometimes, as well as correlations between the p variables, there is a dependence structure between the n observations. A ‘dynamic’ version of functional PCA is proposed in , which is relevant when there are correlations between the observed curves, as well as the obvious correlation within the curves. It is based on an idea first suggested by Brillinger  for vector time series and uses frequency domain analysis.
PCA gives the best possible representation of a p-dimensional dataset in q dimensions (q<p) in the sense of maximizing variance in q dimensions. A disadvantage is, however, that the new variables that it defines are usually linear functions of all p original variables. Although it was possible to interpret the first two PCs in the fossil teeth example, it is often the case for larger p that many variables have non-trivial coefficients in the first few components, making the components difficult to interpret. A number of adaptations of PCA have been suggested that try to make interpretation of the q dimensions simpler, while minimizing the loss of variance due to not using the PCs themselves. There is a trade-off between interpretability and variance. Two such classes of adaptations are briefly described here.
Rotation. The idea of rotating PCs is borrowed from factor analysis  (a different method, which is sometimes confused with PCA—see  for a fuller discussion). Suppose, as before, that Aq is the p×q matrix, whose columns are the loadings of the first q PCs. Then XAq is the n×q matrix whose columns are the scores on the first q PCs for the n observations. Now let T be an orthogonal (q×q) matrix. Multiplication of Aq by T performs an orthogonal rotation of the axes within the space spanned by the first q PCs, so that Bq=AqT is a p×q matrix whose columns are loadings of q rotated PCs. The matrix XBq is an n×q matrix containing the corresponding rotated PC scores. Any orthogonal matrix T could be used to rotate the components, but if it is desirable to make the rotated components easy to interpret, then T is chosen to optimize some simplicity criterion. A number of such criteria have been suggested, including some that include non-orthogonal (oblique) rotation . The most popular is perhaps the varimax criterion in which an orthogonal matrix T is chosen to maximize , where bjk is the (j,k)th element of Bq.
Rotation can considerably simplify interpretation and, when viewed with respect to the q-dimensional space that is rotated, no variance is lost, as the sum of variances of the q rotated components is the same as for the unrotated components. What is lost is the successive maximization of the unrotated PCs, so that the total variance of the q components is more evenly distributed between components after rotation.
Drawbacks of rotation include the need to choose from the plethora of possible rotation criteria, though this choice often makes less difference than the choice of how many components to rotate. The rotated components can look quite different if q is increased by 1, whereas the successively defined nature of unrotated PCs means that this does not happen.
Adding a constraint. Another approach to simplification of PCs is to impose a constraint on the loadings of the new variables. Again, there are a number of variants of this approach, one of which adapts the LASSO (least absolute shrinkage and selection operator) approach from linear regression . In this approach, called SCoTLASS (simplified component technique–LASSO), components are found which successively solve the same optimization problem as PCA, but with the additional constraint , where τ is a tuning parameter. For , the constraint has no effect and PCs are obtained, but as τ decreases more and more loadings are driven to zero, thus simplifying interpretion. These simplified components necessarily account for less variance than the corresponding number of PCs, and usually several values of τ are tried to determine a good trade-off between added simplicity and loss of variance.
A difference between the rotation and constraint approaches is that the latter has the advantage for interpretation of driving some loadings in the linear functions exactly to zero, whereas rotation usually does not. Adaptations of PCA in which many coefficients are exactly zero are generally known as sparse versions of PCA, and there has been a substantial amount of research on such PCs in recent years. A good review of such work can be found in Hastie et al.  (see also §3c).
A related technique to SCoTLASS adds a penalty function to the variance criterion maximized, so that the optimization problem becomes to successively find ak, k=1,2,…,p, that maximize , subject to a′kak=1, where ψ is a tuning parameter . One of the present authors has recently reviewed a paper in which it is demonstrated that these apparently equivalent constraint and penalty approaches actually have quite distinct properties.
The original SCoTLASS optimization problem is non-convex and is also not solvable by simple iterative algorithms, although it is possible to re-express SCoTLASS as an equivalent, though still non-convex, optimization problem for which simple algorithms can be used . Another approach, due to d’Aspremont et al. , reformulates SCoTLASS in a more complex manner, but then drops one of the constraints in this new formulation in order to make the problem convex.
Achieving sparsity is important for large p and particularly when np. A number of authors have investigated versions of sparse PCA for this situation using models for the data in which the vast majority of the variables are completely unstructured noise [18,19,32]. These papers and others suggest and investigate the properties of algorithms for estimating sparse PCs when data are generated from their models. Lee et al.  use a different type of model, this time a random effects model for PC loadings, to derive an alternative penalty function to that used by SCoTLASS, giving another sparse PCA method. Additionally incorporating shrinkage of eigenvalues leads to yet another method, deemed super-sparse PCA in . Comparisons are given between their methods, SCoTLASS and the elastic net  for simulated data and a large genomic example.
One discipline in which PCA has been widely used is atmospheric science. It was first suggested in that field by Obukhov  and Lorenz  and, uniquely to that discipline, it is usually known as empirical orthogonal function (EOF) analysis. The book by Preisendorfer & Mobley  discusses many aspects of PCA in the context of meteorology and oceanography.
The format of the data in atmospheric science is different from that of most other disciplines. This example is taken from . The data consist of measurements of winter (December, January and February) monthly mean sea-level pressure (SLP) over the Northern Hemisphere north of 20°N. The dataset is available on a 2.5°×2.5° regular grid and spans the period from January 1948 to December 2000. Some preprocessing is done to adjust for the annual cycle and the different areas covered by grid squares at different latitudes. In many atmospheric science examples, the variables are measurements at grid points, and the loadings, known as EOFs, are displayed as smooth spatial patterns, as in figure 3 for the first two correlation-based EOFs for the SLP data . There are 1008 variables (grid-points) in this dataset, and the first two PCs account for 21% and 13% of the variation in these 1008 variables. Figure 3 gives a pattern which is commonly known as the Arctic Oscillation (AO). It is a measure of the north–south pressure gradient in the Atlantic Ocean and, to a lesser extent, in the Pacific Ocean and is a major source of variation in weather patterns. The second EOF is dominated by variation in the Pacific Ocean. The PCs for examples of this type are time series so the first PC, for example, will display which years have high values of the AO and which have low values.
Figure 4 shows simplified EOFs based on SCoTLASS . The main difference from the EOFs in figure 3 is for the first EOF, which is now completely dominated by the north–south pressure gradient in the Atlantic (the North Atlantic Oscillation) with exactly zero loadings for many grid-points. The simplification is paid for by a reduction in percentage of variation explained for the corresponding simplified PC (17% compared with 21%). The second simplified PC is very similar to the original second EOF, also explaining 13% of variation.
By its very nature, PCA is sensitive to the presence of outliers and therefore also to the presence of gross errors in the datasets. This has led to attempts to define robust variants of PCA and the expression RPCA has been used for different approaches to this problem. Early work by Huber [37,38] discussed robust alternatives to covariance or correlation matrices and ways in which they can be used to define robust PCs. This work was extended in [39,40]; see also .
The need for methods to deal with very large datasets in areas such as image processing, machine learning, bioinformatics or Web data analysis has generated a recent renewed interest in robust variants of PCA and has led to one of the most vigorous lines of research in PCA-related methods. A discussion of this issue can be found in . Wright et al.  defined RPCA as a decomposition of an n×p data matrix X into a sum of two n×p components: a low-rank component L and a sparse component S. More precisely, a convex optimization problem was defined as identifying the matrix components of X=L+S that minimize a linear combination of two different norms of the components:
where , the sum of the singular values of L, is the nuclear norm of L, and is the 1-norm of matrix S. The motivation for such a decomposition is the fact that, in many applications, low-rank matrices are associated with a general pattern (e.g. the ‘correct’ data in a corrupted dataset, a face in facial recognition, or a background image in video surveillance data), whereas a sparse matrix is associated with disturbances (e.g. corrupted data values, effects of light or shading in facial recognition, a moving object or person in the foreground of data surveillance images). Sparse components are also called ‘noise’, in what can be confusing terminology since in some applications it is precisely the ‘noise’ component that is of interest. Problem 3.5 has obvious points of contact with some of the discussion in §3b. Candès et al.  return to this problem, also called principal component pursuit, and give theoretical results proving that, under not very stringent conditions, it is possible to exactly recover the low-rank and sparse components with high probability and that the choice of works well in a general setting, avoiding the need to choose a tuning parameter. Results are extended to the case of data matrices with missing values. Algorithms for the identification of the components are also discussed in , an important issue given the computational complexity involved. Further variations consider more complex structures for the ‘noise’ component. Some such proposals are reviewed in , where the results of alternative algorithms in the presence of different types of ‘noise’ are compared in the context of image-processing and facial recognition problems. Their results show that classical PCA performs fairly well, when compared with these new methods, in terms of both time and the quality of low-rank solutions that are produced. A fairly recent review of work in this area can be found in .
Interval data arise when one wishes to retain a measure of underlying variability in the observations. This may occur if we wish to reflect the lack of precision of a measuring instrument or, more fundamentally, because the data are summary observations for which associated variability is considered inherent to the measurement. This is often the case when each observation corresponds to a group, rather than an individual, as would be the case with measurements on species, for which a range of values is considered part of the group value. If all p observed variables are of this type, each observation is represented by a hyper-rectangle, rather than a point, in p-dimensional space. Extensions of PCA for such data [47,49] seek PCs that are also of interval type, and which therefore also reflect ranges of values.
Another common type of symbolic data is given by histograms, which can be considered a generalization of interval-valued data where for each observation there are several intervals (the histogram bins) and associated frequencies. A recent review  covers several proposed definitions of PCA-type analyses for histogram data. Most of them require the definition of concepts such as distances between histograms (the Wasserstein distance being a common choice) or the sum and mean of histograms.
Although PCA in its standard form is a widely used and adaptive descriptive data analysis tool, it also has many adaptations of its own that make it useful to a wide variety of situations and data types in numerous disciplines. Adaptations of PCA have been proposed, among others, for binary data, ordinal data, compositional data, discrete data, symbolic data or data with special structure, such as time series  or datasets with common covariance matrices [6,40]. PCA or PCA-related approaches have also played an important direct role in other statistical methods, such as linear regression (with principal component regression ) and even simultaneous clustering of both individuals and variables . Methods such as correspondance analysis, canonical correlation analysis or linear discriminant analysis may be only loosely connected to PCA, but, insofar as they are based on factorial decompositions of certain matrices, they share a common approach with PCA. The literature on PCA is vast and spans many disciplines. Space constraints mean that it has been explored very superficially here. New adaptations and methodological results, as well as applications, are still appearing.
We thank Pamela Gill and Abdel Hannachi for helpful discussions regarding their data and results.
The fossil teeth data are available from I.T.J. The atmospheric science data were taken from the publicly accessible NCEP/NCAR reanalysis database (see  for details).
Both authors were equally involved in drafting the manuscript.
We have no competing interests.
Research by J.C. is partially supported by the Portuguese Science Foundation FCT - PEst-OE/MAT/UI0006/2014.