First, we give a simple explanation of PCA, the most basic and best-understood form of multivariate decomposition. As we mentioned before, there are many types of multivariate decompositions; we picked PCA because, in our opinion, it is the best understood of them all and computationally fast to run, with a clear ordering of the results in terms of variance accounted for. This simplicity is seen by some authors as a vice rather than a virtue, on the grounds that neuroimaging data are of such complexity that a simple algorithm like PCA cannot be adequate for illuminating analysis. While this argument is superficially appealing, it neglects the fact that complex tools and algorithms can take on a “life of their own” and introduce artifacts whose proper assessment demands rigorous pre-testing with Monte-Carlo simulations and test runs on simple real-world data sets whose variance structure is perfectly understood. Otherwise, the analyst runs the risk of unleashing poorly understood, but complex, tools on even more complex brain data with an insufficient understanding of the ensuing results. Such techniques are thus beyond the scope of this article.

Some notational conventions first: matrices are given in capital bold-face, while column vectors are given in lowercase bold-face. Row vectors are just transposed versions of column vectors, and no separate notation will be introduced for them. Scalar variables are given in italics. Furthermore, we follow the conventions of the software package Matlab for the concatenation of vectors. [**x** **y**] denotes the assembly of the two column vectors **x** and **y** into a matrix that has 2 columns and as many rows as **x** and **y**. [**x**; **y**], on the other hand, is a column vector that has twice as many rows as **x** and **y**. The dimensions of a matrix are denoted with a curly bracket; for instance, for a 40-by-2 matrix **X**, we can write

**X** } 40-by-2

Transposition is expressed as **X**^{T}, so

**X**^{T} } 2-by-40
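For readers more at home in Python, the two Matlab concatenation conventions can be mirrored in NumPy; this is a minimal sketch with illustrative toy vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# [x y]: the two column vectors side by side -> a matrix with 2 columns
# and as many rows as x and y
xy_side = np.column_stack([x, y])      # shape (3, 2)

# [x; y]: x stacked on top of y -> a column vector with twice as many rows
xy_stacked = np.concatenate([x, y])    # shape (6,)
```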

Any data array analyzed in this article is arranged as a data matrix **Y** with *R* rows, i.e., one row per image voxel (=3-dimensional pixel), and *N* columns, i.e., one column per brain image included in the data set. Usually *N* is several orders of magnitude smaller than *R*. A typical neuroimaging experiment might comprise 40 human participants who are scanned in a functional MRI experiment in 2 experimental conditions. In this case, *N* = 2 * 40 = 80, and the number of voxels *R* is usually on the order of several hundred thousand. Thus, the rank of the data matrix **Y** is *N*, and this determines the number of Principal Components (PCs) that follow from a PCA. It is customary, although not necessary, to remove the grand mean image from the data array, reducing the rank to *N* – 1. Further, we assume that all columns of **Y** have been mean-centered. These normalizations assure that the voxel-by-voxel and scan-by-scan covariance matrices are just scaled versions of **YY**^{T} and **Y**^{T}**Y**, respectively.
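As a concrete illustration of these normalizations and the resulting rank reduction, here is a sketch with toy dimensions (the random matrix stands in for real scan data):

```python
import numpy as np

rng = np.random.default_rng(1)
R, N = 1000, 80                       # toy sizes; real R is on the order of 10^5
Y = rng.standard_normal((R, N))       # stand-in for a voxels-by-scans data matrix

Y -= Y.mean(axis=1, keepdims=True)    # remove the grand mean image (rank drops to N - 1)
Y -= Y.mean(axis=0, keepdims=True)    # mean-center every column

# The scan-by-scan covariance matrix is now a scaled version of Y^T Y
C = (Y.T @ Y) / (R - 1)
```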

Next, we will perform the PCA on the data array. Since the rank of the data matrix is *N*, a direct Eigen decomposition of the *R*-by-*R* voxel-by-voxel covariance matrix **YY**^{T} is not feasible in practice: the matrix is both enormous and severely rank-deficient. Instead, we perform the Eigen decomposition on the much smaller scan-by-scan covariance matrix **Y**^{T}**Y**, and then obtain the Eigen images by projection. The Eigen equation reads:

**Y**^{T}**Y** **w**_{i} = λ_{i} **w**_{i}

where **w**_{i} are the Eigen vectors in subject space, i.e., the dual of the voxel space, and the associated Eigen values are λ_{i}. The Eigen vectors have *N* rows each; the *N* – 1 of them with non-zero Eigen values can be assembled in a matrix

**W** = [**w**_{1} **w**_{2} … **w**_{N–1}]

One can see easily how the projection into voxel space works by multiplying with **Y** from the left to obtain:

**YY**^{T} (**Y** **w**_{i}) = λ_{i} (**Y** **w**_{i})

This is the Eigen equation for the voxel-by-voxel covariance matrix **YY**^{T}, and the Eigen vectors in voxel space (=brain images or PCs) are conveniently obtained by a simple multiplication of **w**_{i} with **Y** from the left, followed by normalization to unit length:

**v**_{i} = **Y** **w**_{i} / λ_{i}^{1/2}

Again, the PCs can be assembled in a matrix according to

**V** = [**v**_{1} **v**_{2} … **v**_{N–1}]

This matrix has *R* rows and *N* – 1 columns.

With the final assembly of all Eigen values into a diagonal matrix according to

**Λ** = diag(λ_{1}, λ_{2}, …, λ_{N–1})

we can express the full data array as

**Y** = **V** **Λ**^{1/2} **W**^{T}

A few noteworthy observations can be made: the PCA achieves a decomposition of the data into one factor (**V**) that depends only on the voxel locations in the brain and one factor (**W**) that depends only on the subject index. The PCs assembled in **V** are invariant across the group and can serve as basis vectors for a coordinate system in terms of which the data **Y** can be conveniently summarized. They can be visualized as brain images and assign loadings to every voxel location in the brain. For our purposes, we will combine the square root of the Eigen value matrix and **W** into one matrix **Z**,

**Z** = **W** **Λ**^{1/2}

and rewrite the previous equation as

**Y** = **V** **Z**^{T}

We term the column vectors in **Z** subject score vectors. The normalizations for both PCs and subject score vectors are

**V**^{T}**V** = **I**, **Z**^{T}**Z** = **Λ**

where **I** denotes the identity matrix and **Λ** the diagonal matrix of Eigen values.

Let us summarize what we accomplished:

- the data matrix **Y** was expressed as a product of subject-invariant PCs in **V** and voxel-invariant subject scores in **Z**;
- PCs are mutually orthogonal, and subject score vectors are mutually orthogonal;
- the Eigen value λ_{i} indicates how much variance the associated *i*th PC accounts for in the data array **Y**; the fraction of the variance accounted for by this PC is computed through division of λ_{i} by the sum of all Eigen values.
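The entire recipe can be condensed into a short NumPy sketch; the data are a toy stand-in, and the variable names mirror the notation above:

```python
import numpy as np

rng = np.random.default_rng(0)
R, N = 500, 8                          # toy voxel and scan counts
Y = rng.standard_normal((R, N))
Y -= Y.mean(axis=1, keepdims=True)     # remove grand mean image -> rank N - 1
Y -= Y.mean(axis=0, keepdims=True)     # mean-center all columns

# Eigen decomposition of the small N-by-N matrix Y^T Y (eigh: ascending order)
lam, W = np.linalg.eigh(Y.T @ Y)
lam = lam[::-1][: N - 1]               # descending; drop the null Eigen value
W = W[:, ::-1][:, : N - 1]

V = (Y @ W) / np.sqrt(lam)             # PCs: v_i = Y w_i / sqrt(lam_i), R x (N-1)
Z = W * np.sqrt(lam)                   # subject scores: Z = W Lam^{1/2}, N x (N-1)

frac = lam / lam.sum()                 # fraction of variance per PC
```

Here `V.T @ V` recovers the identity and `V @ Z.T` reconstructs `Y`, mirroring the decomposition **Y = V Z**^{T}.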

We stress again that PCA is just *one* way of achieving a multivariate decomposition, and we chose it for its relative simplicity and transparent nature. Obviously, for the expression **Y = V Z**^{T} there is an infinity of choices for **V** and **Z**. PCA imposes orthogonality on the columns of both **V** and **Z**. Other choices like independent component analysis impose statistical independence beyond just second-order moments on either **V** or **Z** or both. Other decompositions might be reasonable and conceivable too, particularly if furnished with clear algorithmic formulations that can be executed on null-data to empirically generate the null-distribution for any test statistic of choice.

One last thing to notice is the following: we explained how PCA achieves the decomposition **Y** = **VZ**^{T}, a representation of the data matrix in terms of PCs and their subject scores. The PCs in **V** form an orthonormal basis set; this means that *any* data set **Y*** can be expressed in terms of these components with modified subject scores **Z***, plus a residual term **E** of unaccounted variance, regardless of whether **Y*** is the “derivation data set,” i.e., the original data set from which **V** was derived:

**Y*** = **V** **Z***^{T} + **E**

In an independent “replication data set,” the subject scores of the PCs assembled in **V** are easily computed according to:

**Z*** = **Y***^{T} **V**

Any brain–behavioral relationship that was discovered in the derivation data set **Y** and involves the subject scores in **Z**, can now be tested in the replication data set, using the subject scores in **Z*** and the subject variable of interest. This means that rather than relying on statistical inference in the derivation data set, one can check empirically whether the findings hold up in a replication data set—a very powerful additional validity test. We will make use of this feature of prospective application extensively in this article.
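A minimal sketch of this prospective application (the helper names `replication_scores` and `residual` are ours, not from any package):

```python
import numpy as np

def replication_scores(Y_star: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Subject scores Z* = Y*^T V of a replication data set Y_star
    (voxels x scans, same voxel grid as the derivation data) with
    respect to the orthonormal PCs in V (voxels x components)."""
    return Y_star.T @ V

def residual(Y_star: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Residual of unaccounted variance: E = Y* - V Z*^T."""
    return Y_star - V @ replication_scores(Y_star, V).T
```

Because the columns of `V` are orthonormal, the residual is orthogonal to every PC, so the scores capture everything the derivation basis can explain.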

One important caveat about PCA that needs to be brought to the practitioner’s attention is its susceptibility to outliers. Since PCA operates on the parametrically computed variance–covariance matrix, this susceptibility is not surprising, and we have observed it in PET and fMRI data numerous times in practice. Single brain images might contribute an overwhelming portion of the variance, resulting in an abnormally large variance concentration in the first PC (>90%). Essentially, one participant’s brain image contributes an overwhelming amount of variance to the data and captures the first PC all by itself. The remaining PCs can account for all remaining brain images, but not in an optimal way, since everything is predicated on being orthogonal to the unrepresentative and pathological first PC. Clearly, a better strategy would be to down-weight the contribution of the problematic brain image such that a better representation of *all* images in the sample is achieved in the first few PCs of an optimized PCA. In the field of computer vision, a large variety of just such approaches has been proposed, using iterative algorithms or exact closed-form solutions (for instance [5–8]). We will not pursue these approaches any further for the sake of brevity. However, when reviewing PCA results, the analyst should look for signs of trouble and abnormally large variance contributions. If a brain image produces its own first PC, the common-sense first line of attack is to re-run the analysis without the problematic data point.
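A quick diagnostic along these lines can be sketched as follows; the function names are ours, and the 90% threshold is merely the rule of thumb from the discussion above:

```python
import numpy as np

def first_pc_fraction(Y: np.ndarray) -> float:
    """Fraction of total variance captured by the first PC of a
    voxels-by-scans data matrix Y (mean-centering done internally)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Yc -= Yc.mean(axis=0, keepdims=True)
    lam = np.clip(np.linalg.eigvalsh(Yc.T @ Yc), 0.0, None)
    return lam[-1] / lam.sum()           # eigvalsh returns ascending order

def flag_dominant_scan(Y: np.ndarray, threshold: float = 0.9):
    """Return the index of a scan that captures the first PC all by
    itself (leave-one-out check), or None if no scan dominates."""
    if first_pc_fraction(Y) < threshold:
        return None
    fracs = [first_pc_fraction(np.delete(Y, j, axis=1))
             for j in range(Y.shape[1])]
    return int(np.argmin(fracs))         # removing the culprit drops the fraction most
```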