Up to this point we have not discussed the notion of the “true model.” It is clear from the preceding that one's approach to model selection, and the criteria used for it, depend on the true model. In statistics “true model” has a technical definition: it is the model that generated the data. This is how the term is used throughout this article. Asymptotic and simulation results thus far have depended on the status of the true model. Does it contain large or small effects? Is it in the candidate set? Is it of finite dimension? Does its dimension increase with increasing N? These are all questions to consider before using the AIC or BIC.
Here we spend a few paragraphs explicating the notion of a true model, aware that this issue is extremely complex and likely will never be fully resolved. One often hears the adage “[A]ll models are wrong, but some are useful” (Box & Draper, 1987, p. 424). The statistician D. Cox (1995, p. 456) drives the point home more forcefully:
[I]t does not seem helpful just to say that all models are wrong… The idea that complex physical, biological, or sociological systems can be exactly described by a few formulae is patently absurd.
Some argue that the truth in principle cannot be fully modeled, and offer analytical information-theoretic arguments to support this claim (Kolmogorov, 1968; Rissanen, 1987). When one reflects on the notion of a true model, and considers the extreme complexity of psychological systems (which depend on the most complex human organ, the brain), it becomes less clear how a true model could be devised, much less that one has already been devised at this stage in our science. In this sense, if a true model must capture the entire complexity of human behavior (or some other physical system), it seems useless to speak of a true model at all.
Even if the true model could in principle be under consideration, the complexity of candidate models may change with increasing N, increased scientific knowledge, and increasingly clever research designs. Fisher (1922) stated, “More or less elaborate forms will be suitable according to the volume of the data.” That is, one expects the candidate model pool to change as N grows large. Imagine fitting a regression of order 50 to a data set of a few hundred observations versus a few million: high-order models are far more defensible with large data sets.
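As a minimal illustration of this point, consider the following sketch in Python with NumPy. The data-generating process (a polynomial with small tapering coefficients), the noise level, and the sample sizes are our own illustrative assumptions, not part of the original argument; the sketch simply shows that, as N grows, an information criterion such as the AIC tends to support a higher-order model.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_aic(y, yhat, n_params):
    # AIC = -2 log L + 2k for a Gaussian regression whose error variance
    # is profiled out; k counts the coefficients plus the error variance.
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * loglik + 2 * (n_params + 1)

def aic_selected_order(n, max_order=12):
    # Many small "tapering" polynomial effects: coefficient 0.5^j on x^j.
    x = rng.uniform(0, 1, n)
    signal = sum(0.5 ** j * x ** j for j in range(1, 9))
    y = signal + rng.normal(0, 0.2, n)
    scores = []
    for order in range(1, max_order + 1):
        X = np.vander(x, order + 1)          # columns x^order, ..., x, 1
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        scores.append(gaussian_aic(y, X @ beta, order + 1))
    return 1 + int(np.argmin(scores))

for n in (100, 1_000, 100_000):
    print(f"N = {n:>7}: AIC-selected polynomial order = {aic_selected_order(n)}")
```

With a few hundred observations the small high-order coefficients are indistinguishable from noise, so a low order is selected; with hundreds of thousands of observations the higher orders become defensible.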
We venture to suggest that no statistician or psychologist ever believes they have a true model under consideration, if by that we mean a complete listing of all relevant variables, all non-zero weightings and interactions between them, and precisely specified error distributions. Instead, psychological theories, and the statistical models they imply, possess levels of verisimilitude, with no model (or theory) having perfect verisimilitude (Meehl, 1990). Models are simplifications of reality, and the effect of unmeasured variables is presumed to be captured by the error distribution. That said, one can imagine some models, although not comprehensive models of behavior, as being, practically speaking, true.
The applicability of a practically true model depends on the type of study and the research goals. In some cases a simple model may be legitimate and justifiable (even if the error distribution is not perfectly specified). Sometimes particular parametric statistical models are entirely reasonable: imagine modeling a coin toss with the binomial distribution (analogous to any event with a dichotomous outcome, such as treatment success or failure), or the number of shots it takes to make a basket in basketball with a geometric distribution. In these cases, where the physical properties of the system are well understood and the statistical question well defined, one may have good reason to consider the true model as practically parametric. In other cases, such as modeling the covariance structure of personality via factor analysis, it may not be justifiable to assume that the model(s) under consideration are appropriate, because the physical system under study is much less well understood (at least compared with a coin toss).
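As a hedged sketch of what “practically parametric” means in these two examples, the following Python/NumPy snippet simulates coin tosses and basketball attempts and computes the maximum-likelihood estimates under the binomial and geometric models. The sample sizes and success probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A coin toss (or any dichotomous outcome, e.g., treatment success/failure):
# the Bernoulli/binomial model is, practically speaking, true.
tosses = rng.binomial(1, 0.5, size=200)   # 200 simulated fair-coin tosses
p_hat = tosses.mean()                     # MLE of the success probability

# Shots until the first made basket: a geometric model, under the working
# assumptions of independent attempts with constant success probability.
shots = rng.geometric(0.4, size=200)      # attempts needed for first success
q_hat = 1.0 / shots.mean()                # MLE of the per-shot probability

print(f"estimated P(heads) = {p_hat:.3f}, estimated P(make) = {q_hat:.3f}")
```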
When the physical system is well-defined (e.g., in a true experiment where observations are randomized), the primary objective of the analysis can often be parameter estimation and computing confidence intervals (Cox, 1977). When the physical system is largely unknown, the correct model is much less clear. Sometimes the data are too sparse to support a useful model. This does not mean that a specific candidate model cannot be scientifically useful, but it is much less clear whether the model assumptions are satisfied, and model selection might be conducted to help ensure that the model under consideration is useful for the purposes of the scientific investigation at hand.
Considering the status of the true model is relevant to choosing between the AIC and BIC. Consider non-experimental observational data, where the physical system is not understood, there are a few large effects, many small tapering effects, and a small sample size. The true model here may be non-parametric, but there may also exist a parametric model that captures the vast majority of systematic variation with only a few parameters to estimate. Pragmatically, the parametric model is the best model here, and an appropriate loss function would be MSE (not zero/one loss). Because of the large effects in the data, a researcher would be tempted to use the BIC to select it. Note that this is consistent with our simulations reported earlier, where the BIC outperformed the AIC when effects were very small (tapering effects) and when effects were very large. The BIC would ignore small (but true) effects in the data and obtain a smaller MSE in estimating the covariances. As the sample size increases, it may become reasonable to expand the model to capture the small effects, since estimation of more parameters becomes feasible with increasing N. As the sample size becomes quite large, the best approach may become practically non-parametric, because one can now confidently model some of the many small tapering effects, and the same researcher might discard the BIC in favor of the AIC.
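The following sketch mimics this scenario under our own illustrative assumptions (12 orthogonal predictors, two large coefficients, small tapering coefficients, N = 50, Gaussian errors); none of these design choices come from the text. In this setting the model selected by the BIC's heavier penalty tends to attain a smaller MSE against the true regression surface than the model selected by the AIC.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse_of_selected_fit(n, penalty):
    # Two large effects plus many small "tapering" effects 0.1/j.
    p = 12
    beta = np.concatenate(([2.0, 1.5], 0.1 / np.arange(1, p - 1)))
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    best_score, best_fit = np.inf, None
    for k in range(1, p + 1):              # nested models: first k predictors
        b, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        rss = np.sum((y - X[:, :k] @ b) ** 2)
        score = n * np.log(rss / n) + penalty * k   # Gaussian IC, constants dropped
        if score < best_score:
            best_score, best_fit = score, X[:, :k] @ b
    return np.mean((best_fit - X @ beta) ** 2)      # MSE against the true mean

n = 50
for name, penalty in [("AIC", 2.0), ("BIC", np.log(n))]:
    mses = [mse_of_selected_fit(n, penalty) for _ in range(500)]
    print(f"{name}: mean MSE of selected fit = {np.mean(mses):.4f}")
```

The intuition is the bias-variance trade-off: at N = 50 each tapering coefficient contributes less squared bias when excluded than estimation variance when included, so the BIC's parsimonious choice wins; as N grows that balance reverses for more and more of the small effects.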
Now imagine a different scenario, where there are no large effects, many moderate effects, and a few small effects. Here a non-parametric model may well be more capable of capturing the moderate effects in the data, even at small sample sizes, and the AIC would be preferred to minimize MSE. Again, this is consistent with our earlier results, where the AIC outperformed the BIC when the effects were moderate, but not when they were very large or very small. As the sample size increases, it may be that the parameters of moderate size can be estimated with greater precision, and the best model becomes, practically speaking, parametric, with the BIC as the criterion of choice.
Choosing between the AIC and BIC in these situations depends on knowledge of the true model, which is difficult to have in practice. A more recent development in model selection is to use the data to guide the selection of a model selection criterion like the AIC or BIC. For example, Liu and Yang (in press) have devised a parametricness index that converges to ∞ when the data are governed by a parametric model and to 0 when they are governed by a non-parametric model. The index provides a data-driven way to choose between the AIC and BIC for the data set at hand, and may resolve to some extent the problems associated with assuming that the true model exists (or does not). The parametricness index is an example of adaptive model selection criteria, a more recent movement in statistical theory in which the choice of a model selection criterion is itself data-driven (Kadane & Lazar, 2004).