Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) that the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n→0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.
doi:10.1198/jasa.2009.0121
PMCID: PMC2898454
PMID: 20617121
Eigenvector estimation; Reduction of dimension; Regularization; Thresholding; Variable selection
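The selection step described in the abstract above can be illustrated with a minimal sketch. It assumes the data are already expressed in a basis where the signal is sparse, and it keeps a fixed number k of coordinates; k, the function names, and the simulated single-component model are all illustrative assumptions, not the paper's own selection rule or notation.

```python
import numpy as np

def leading_pc_after_selection(X, k):
    """Keep the k coordinates with largest sample variance, then run PCA on them.

    X is an (n, p) data matrix; k is a hypothetical tuning parameter.
    Returns a length-p unit vector (zero off the selected set) and the kept indices.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center each coordinate
    sample_var = Xc.var(axis=0, ddof=1)          # per-coordinate sample variance
    keep = np.sort(np.argsort(sample_var)[-k:])  # indices of the k largest variances
    # The first right singular vector of the reduced, centered data is the
    # leading eigenvector of its sample covariance matrix.
    _, _, vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    v = np.zeros(p)
    v[keep] = vt[0]
    return v, keep

def leading_pc_standard(X):
    """Leading principal component from standard PCA on all p coordinates."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def simulate(n=100, p=2000, k=20, seed=0):
    """Assumed toy model: x_i = s_i * rho + noise, with a sparse unit vector rho."""
    rng = np.random.default_rng(seed)
    rho = np.zeros(p)
    rho[:10] = 1.0 / np.sqrt(10.0)               # 10 nonzero coordinates
    scores = rng.normal(scale=5.0, size=n)       # random component scores
    X = np.outer(scores, rho) + rng.normal(size=(n, p))
    v_sel, _ = leading_pc_after_selection(X, k)
    v_std = leading_pc_standard(X)
    # Absolute cosine with the true direction (eigenvector sign is arbitrary).
    return abs(v_sel @ rho), abs(v_std @ rho)

if __name__ == "__main__":
    cos_selected, cos_standard = simulate()
    print(f"selected-subset PCA: {cos_selected:.3f}, standard PCA: {cos_standard:.3f}")
```

With p much larger than n, the estimate from the selected subset should align more closely with the true direction than the standard PCA estimate, which is the qualitative behavior the abstract describes.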
Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue.
doi:10.1098/rsta.2009.0159
PMCID: PMC2865881
PMID: 19805443
Bayesian analysis; classification; cluster analysis; high-dimensional data; regression; sparsity
The classical methods of multivariate analysis are based on the eigenvalues of one or two sample covariance matrices. In many applications of these methods, for example to high dimensional data, it is natural to consider alternative hypotheses which are a low rank departure from the null hypothesis. For rank one alternatives, this note provides a representation for the joint eigenvalue density in terms of a single contour integral. This will be of use for deriving approximate distributions for likelihood ratios and ‘linear’ statistics used in testing.
doi:10.1002/sta4.58
PMCID: PMC4159168
PMID: 25221357
principal components; covariance matrices; hypergeometric function; contour integral
We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the l2 loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.
doi:10.1214/12-AOS1014
PMCID: PMC4196701
PMID: 25324581
Minimax risk; high-dimensional data; principal component analysis; sparsity; spiked covariance model
We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
doi:10.1093/biostatistics/kxr031
PMCID: PMC3372940
PMID: 22003245
Differential expression; FDR; Overdispersion; Poisson log-linear model; RNA-Seq; Score statistic
In Gaussian sequence models with Gaussian priors, we develop some simple examples to illustrate three perspectives on matching of posterior and frequentist probabilities when the dimension p increases with sample size n: (i) convergence of joint posterior distributions, (ii) behavior of a non-linear functional: squared error loss, and (iii) estimation of linear functionals. The three settings are progressively less demanding in terms of conditions needed for validity of the Bernstein-von Mises theorem.
doi:10.1214/10-IMSCOLL607
PMCID: PMC2990974
PMID: 21113327
high dimensional inference; Gaussian sequence; linear functional; squared error loss; posterior distribution; frequentist
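As a small illustration of the setting in the abstract above, the sketch below computes the conjugate posterior in a Gaussian sequence model. The specific form y_i ~ N(θ_i, 1/n) with independent priors θ_i ~ N(0, τ_i²) is an assumption made here for concreteness, as are the function and parameter names; the paper's own examples may be parameterized differently.

```python
import numpy as np

def gaussian_posterior(y, n, tau2):
    """Coordinatewise posterior for theta_i given y_i ~ N(theta_i, 1/n), theta_i ~ N(0, tau2_i).

    Returns (posterior mean, posterior variance), each with the shape of y.
    """
    y = np.asarray(y, dtype=float)
    tau2 = np.asarray(tau2, dtype=float)
    shrink = tau2 / (tau2 + 1.0 / n)             # shrinkage factor in [0, 1)
    post_mean = shrink * y
    # Posterior variance (1/tau2 + n)^(-1) equals shrink / n.
    post_var = np.broadcast_to(shrink / n, y.shape)
    return post_mean, post_var
```

As the prior variances grow relative to 1/n, the shrinkage factor approaches 1 and the posterior for each coordinate approaches N(y_i, 1/n), the frequentist sampling distribution centered at the observation. How well this matching survives when the number of coordinates p grows with n is the question the abstract addresses.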
The greatest root distribution occurs everywhere in classical multivariate analysis, but even under the null hypothesis the exact distribution has required extensive tables or special purpose software. We describe a simple approximation, based on the Tracy–Widom distribution, that in many cases can be used instead of tables or software, at least for initial screening. The quality of approximation is studied, and its use illustrated in a variety of settings.
doi:10.1214/08-AOAS220
PMCID: PMC2880335
PMID: 20526465
Canonical correlation; characteristic root; equality of covariance matrices; greatest root statistic; largest eigenvalue; MANOVA; multivariate linear model; Tracy-Widom distribution
Let A and B be independent, central Wishart matrices in p variables with common covariance and having m and n degrees of freedom, respectively. The distribution of the largest eigenvalue of (A + B)^{-1} B has numerous applications in multivariate statistics, but is difficult to calculate exactly. Suppose that m and n grow in proportion to p. We show that after centering and scaling, the distribution is approximated to second-order, O(p^{-2/3}), by the Tracy–Widom law. The results are obtained for both complex and then real-valued data by using methods of random matrix theory to study the largest eigenvalue of the Jacobi unitary and orthogonal ensembles. Asymptotic approximations of Jacobi polynomials near the largest zero play a central role.
doi:10.1214/08-AOS605
PMCID: PMC2821031
PMID: 20157626
Canonical correlation analysis; characteristic roots; Fredholm determinant; Jacobi polynomials; largest root; Liouville-Green; multivariate analysis of variance; random matrix theory; Roy’s test; soft edge; Tracy-Widom distribution
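The statistic in the abstract above, the largest eigenvalue of (A + B)^{-1} B, can be simulated directly. The sketch below is only a Monte Carlo illustration for real-valued data: it takes the common covariance to be the identity (which loses no generality, since the eigenvalues are unchanged by a common linear transformation), and the function names and default sizes are assumptions. It does not implement the centering and scaling constants or the Tracy–Widom approximation itself.

```python
import numpy as np

def greatest_root(p, m, n, rng):
    """Largest eigenvalue of (A + B)^{-1} B for independent Wishart A, B.

    A = X'X with X of shape (m, p), B = Y'Y with Y of shape (n, p);
    rows are standard normal, so the common covariance is the identity.
    Requires m + n >= p so that A + B is invertible.
    """
    X = rng.normal(size=(m, p))
    Y = rng.normal(size=(n, p))
    A = X.T @ X
    B = Y.T @ Y
    # With A + B = L L', the symmetric matrix L^{-1} B L^{-T} has the same
    # eigenvalues as (A + B)^{-1} B, and eigvalsh returns them in ascending order.
    L = np.linalg.cholesky(A + B)
    W = np.linalg.solve(L, Y.T)                  # W = L^{-1} Y', so W W' = L^{-1} B L^{-T}
    return np.linalg.eigvalsh(W @ W.T)[-1]

def simulate_greatest_root(p=5, m=40, n=10, reps=200, seed=0):
    """Draw reps independent copies of the greatest root statistic."""
    rng = np.random.default_rng(seed)
    return np.array([greatest_root(p, m, n, rng) for _ in range(reps)])
```

Every draw lies strictly between 0 and 1, and an empirical quantile such as `np.quantile(simulate_greatest_root(), 0.95)` gives a simulated critical value against which an analytic approximation could be compared.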