While the origins of statistical investigation (e.g. Graunt 1662) predate even those of this eldest of extant scientific journals, the greatest development of the science of statistics occurred in the twentieth century. The theory and practice of frequentist methods, the likelihood approach and the Bayesian paradigm all flourished, and informal graphical methods, computational algorithms and careful mathematical theory grew up together.
For most of the period, the primary motivating practical problems consisted of a comparatively large number of ‘experimental units’, on which a comparatively small number of features were measured. If, informally, we let p denote the dimension of what is ‘unknown’ and let n denote the cardinality of what is ‘known’, then traditional theory, and most practice, has until recently been largely limited to the ‘small p, large n’ scenario. This scenario also naturally reflected the contemporary limitations of computers (the term meant people prior to 1950) and graphical display.
A natural mode for asymptotic approximation therefore imagines that p remains of smaller order than n, in fact usually fixed. Among the most familiar theoretical results of this type are the Laws of Large Numbers and the Central Limit Theorems. The former says that the sample mean of a random sample of size n from a population has as a limit, in a well-defined sense, the population mean, as n → ∞. The corresponding central limit theorem shows that the limiting distribution of the sample mean about the population mean (when scaled up by √n) is of the normal or Gaussian type. In statistics, such results are useful in deriving asymptotic properties of estimators of parameters, but their validity relies on there being, in theory at least, many ‘observations per parameter’.
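For concreteness, if X_1, …, X_n is a random sample from a population with mean μ and finite variance σ² (notation introduced here purely for illustration), the two results take the familiar forms (X_1 + ⋯ + X_n)/n → μ, almost surely, and √n{(X_1 + ⋯ + X_n)/n − μ} → N(0, σ²), in distribution, as n → ∞.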
In practice, n will generally correspond to the number of experimental units on which data are available; for p, however, there are at least two, albeit related, interpretations. The more basic interpretation is as the measure of complexity of the model to be fitted to the data. However, that is often determined by the dimension of the data as given by the number of items (variables) recorded for each experimental unit, and in our presentation we shall use p to represent either interpretation, as appropriate.
Over the last 20 years or so, however, the practical environment has changed dramatically, with the spectacular evolution of data acquisition technologies and computing facilities. At the same time, applications have emerged in which the number of experimental units is comparatively small but the underlying dimension is massive; illustrative examples might include image analysis, microarray analysis, document classification, astronomy and atmospheric science. Methodology has responded vigorously to these challenges, and procedures have been developed or adapted to provide practical results.
However, there is a need for consolidation in the form of a systematic and critical assessment of the new approaches as well as development of appropriate theoretical underpinning (Lindsay et al. 2004). In terms of asymptotic theory, the key scenarios to be investigated can be described as ‘large p, small n’ or in some cases as ‘large p, large n’; theory for the former scenario would assume that p goes to infinity faster than n and for the latter would assume that p and n go to infinity at the same rate.
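In symbols, and with the caveat that the precise growth conditions vary from paper to paper, the former regime might be formalized as p/n → ∞ (for example, p → ∞ with n held fixed) and the latter as p/n → c for some constant c ∈ (0, ∞), with both p and n tending to infinity.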
The practical and theoretical challenges posed by the large p/small n settings, along with the ferment of recent research, formed the backdrop to the 2008 research programme ‘Statistical Theory and Methods for Complex, High-dimensional Data’ at the Isaac Newton Institute for Mathematical Sciences, which stimulated this Theme Issue. It is important to emphasize the breadth of the research community represented in that programme and this Theme Issue. From a theoretical/methodological point of view, it is of relevance not only to statisticians but also to many in the growing population of machine-learning researchers. In addition, increasingly many areas of application generate data, the analysis of which requires the type of theory and methodology described in this issue.
Before setting the scene for the papers in this volume, we conclude this introduction to the introduction with some general remarks.
It should not, of course, be imagined that the ‘large p’ scenarios are mere alternative cases to be explored in the same spirit as their ‘small p’ forebears. A better analogy would lie in the distinction between linear and nonlinear models and methods—the unbounded variety and complexity of departures from linearity is a metaphor (and in some cases a literal model) for the scope of phenomena that can arise as the number of parameters grows without limit.
Indeed, a priori, the enterprise seems impossible. Good data-analytical practice has always held that the number of data points n should exceed the number of parameters p to be estimated by some solid margin: n/p ≥ 5 is a plausible rule of thumb, mentioned for example by Hamilton (1970) and repeated in the classic text by Huber (1981).
The large p/small n world would therefore seem to depend on a certain statistical alchemy—the computational transformation of ignorance into parameter estimates by fearless specification and fitting of high-dimensional models.
Nevertheless, as will be indicated by example in the papers in this volume, in a variety of methodological settings, as well as in numerous scientific applications, there has been notable success with large-p models and methods.
The key, of course, is that we are not always ignorant. Indeed, it seems clear that the enterprise can have hopes of success only if the actual number of influential parameters, k, say, is much smaller than the nominal number p. Thus, prior knowledge of the existence of a sparse representation, either of a hypothesized form or to be discovered by exploration, is a sine qua non.
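To make the point concrete, the following short Python sketch, which is purely illustrative and not taken from any paper in this issue, generates a coefficient vector with only k = 10 nonzero entries out of p = 1000 and then recovers its support from n = 200 noisy linear measurements by ℓ1-penalized least squares (the lasso), solved here by plain iterative soft-thresholding; the dimensions, signal strength, noise scale and penalty level lam are arbitrary choices made so that the example runs quickly.

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 1000, 10                       # many parameters, fewer observations, sparse truth
X = rng.standard_normal((n, p)) / np.sqrt(n)  # random design with roughly unit-norm columns
beta_true = np.zeros(p)
support_true = rng.choice(p, size=k, replace=False)
beta_true[support_true] = rng.choice([-3.0, 3.0], size=k)
y = X @ beta_true + 0.05 * rng.standard_normal(n)   # n noisy linear measurements

lam = 0.25                                    # l1 penalty level (illustrative choice)
step = 1.0 / np.linalg.norm(X, 2) ** 2        # reciprocal of a Lipschitz constant for the gradient
beta = np.zeros(p)
for _ in range(2000):                         # iterative soft-thresholding (ISTA) for the lasso
    z = beta - step * (X.T @ (X @ beta - y))  # gradient step on the least-squares term
    beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold shrinks and sparsifies

print("true support:     ", sorted(support_true.tolist()))
print("estimated support:", sorted(np.flatnonzero(np.abs(beta) > 0.5).tolist()))

With such favourable choices the estimated support will usually coincide with the true one, even though p is five times n; were beta_true instead dense, accurate estimation of all p coefficients from only n observations would be hopeless, which is the sense in which sparsity does the work.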
At the same time, statistical theory is challenged to provide heuristics, principles and results to help explain when sparse models can be expected to be well estimable, or alternatively when the enterprise is simply too ambitious without further reliable prior information.
One such theoretical construct that emerges in a couple of papers in this volume is the ‘phase diagram’. An asymptotic model of a large-p regression or classification problem is expressed in terms of parameters such as the data ratio n/p and the effective parameter sparsity k/n, for which the diagram depicts sharp transitions between conditions in which estimation/classification is possible and those in which it must fail entirely.
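In one common calibration, sketched here only schematically since the details depend on the particular model and method, the diagram is drawn in the unit square with horizontal coordinate δ = n/p (the undersampling ratio) and vertical coordinate ρ = k/n (the sparsity ratio); as p → ∞ with δ and ρ held fixed, there is a boundary curve ρ*(δ) below which (ρ < ρ*(δ)) recovery succeeds with probability tending to one and above which (ρ > ρ*(δ)) it fails with probability tending to one.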