A tacit but fundamental assumption of the Theory of Signal Detection (TSD) is that criterion placement is a noise-free process. This paper challenges that assumption on theoretical and empirical grounds and presents the Noisy Decision Theory of Signal Detection (ND-TSD). Generalized equations for the isosensitivity function and for measures of discrimination that incorporate criterion variability are derived, and the model's relationship with extant models of decision-making in discrimination tasks is examined. An experiment that evaluates recognition memory for ensembles of word stimuli reveals that criterion noise is not trivial in magnitude and contributes substantially to variance in the slope of the isosensitivity function. We discuss how ND-TSD can help explain a number of current and historical puzzles in recognition memory, including the inconsistent relationship between manipulations of learning and the slope of the isosensitivity function, the lack of invariance of the slope with manipulations of bias or payoffs, the effects of aging on the decision-making process in recognition, and the nature of responding in Remember/Know decision tasks. ND-TSD poses novel and theoretically meaningful constraints on theories of recognition and decision-making more generally, and provides a mechanism for rapprochement between theories of decision-making that employ deterministic response rules and those that postulate probabilistic response rules.
The Theory of Signal Detection (TSD1; Green & Swets, 1966; Macmillan & Creelman, 2005; Peterson, Birdsall, & Fox, 1954; Tanner & Swets, 1954) is a theory of decision-making that has been widely applied to psychological tasks involving detection, discrimination, identification, and choice, as well as to problems in engineering and control systems. Its historical development follows quite naturally from earlier theories in psychophysics (Blackwell, 1953; Fechner, 1860; Thurstone, 1927) and advances in statistics (Wald, 1950). The general framework has proven sufficiently flexible so as to allow substantive cross-fertilization with related areas in statistics and psychology, including mixture distributions (DeCarlo, 2002), theories of information integration in multidimensional spaces (Banks, 2000; Townsend & Ashby, 1982), models of group decision-making (Sorkin & Dai, 1994), models of response timing (Norman & Wickelgren, 1969; Sekuler, 1965; Thomas & Myers, 1972), and multiprocess models that combine thresholded and continuous evidence distributions (Yonelinas, 1999). It also exhibits well-characterized relationships with other prominent perspectives, such as individual choice theory (Luce, 1959) and threshold-based models (Krantz, 1969; Swets, 1986a). Indeed, it is arguably the most widely used and successful theoretical framework in psychology of the last half century.
The theoretical underpinnings of TSD can be summarized in four basic postulates:
As applied to recognition memory experiments (Banks, 1970; Egan, 1958; Lockhart & Murdock, 1970; Parks, 1966), in which subjects make individual judgments about whether a test item was previously viewed in a particular delimited study episode, the “signal” is considered to be the prior study of the item. That study event is thought to confer additional strength on the item such that studied items generally, but not always, yield greater evidence for prior study than do unstudied items. Subjects then make a decision about whether they did or did not study the item by comparing the strength yielded by the current test stimulus to a decision criterion. Analytically, TSD reparameterizes the obtained experimental statistics as estimates of discriminability and response criterion or bias. Theoretical conclusions about the mnemonic aspects of recognition performance are often drawn from the form of the isosensitivity function2, which is a plot of the theoretical hit rate against the theoretical false-alarm rate across all possible criterion values. The function is typically estimated from points derived from a confidence-rating procedure (Egan, 1958; Egan, Schulman, & Greenberg, 1959).
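As a concrete illustration of this reparameterization, the following minimal Python sketch converts a hit rate and a false-alarm rate into equal-variance estimates of discriminability and criterion, and generates points on a theoretical isosensitivity function. The function names are ours and purely illustrative. The sketch assumes normal evidence distributions with the unstudied (noise) distribution fixed at mean 0 and unit variance, and it uses the conventional midpoint-referenced criterion measure c.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for z-transforms


def dprime_and_c(hit_rate, fa_rate):
    """Equal-variance estimates of discriminability (d') and criterion (c).

    Rates must lie strictly between 0 and 1.
    """
    z_hit = _STD.inv_cdf(hit_rate)
    z_fa = _STD.inv_cdf(fa_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)


def isosensitivity_point(criterion, mu_s=1.0, sigma_s=1.0):
    """Theoretical (false-alarm rate, hit rate) for one criterion value.

    Assumption: noise evidence ~ N(0, 1); signal evidence ~ N(mu_s, sigma_s).
    """
    fa = 1.0 - _STD.cdf(criterion)
    hit = 1.0 - NormalDist(mu_s, sigma_s).cdf(criterion)
    return fa, hit


if __name__ == "__main__":
    # Sweep the criterion to trace out the isosensitivity function.
    for k in (-0.5, 0.0, 0.5, 1.0, 1.5):
        fa, hit = isosensitivity_point(k)
        d, c = dprime_and_c(hit, fa)
        print(f"criterion={k:+.1f}  FA={fa:.3f}  H={hit:.3f}  d'={d:.3f}  c={c:+.3f}")
```

In the equal-variance case every point on the function returns the same d', and only c changes with the criterion; that invariance is what the equal-variance summary measures presuppose.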
TSD has been successfully applied to recognition because it provides an articulated and intuitive description of the decision portion of the task without obliging any particular theoretical account of the relevant memory processes. In fact, theoretical interpretations derived from the application of TSD to recognition memory have been cited as major constraints on process models of recognition (e.g., McClelland & Chappell, 1998; Shiffrin & Steyvers, 1997). Recent evidence reveals an increased role of TSD in research on recognition memory: The number of citations in PsycInfo that appear in response to a joint query of “recognition memory” and “signal detection” as keywords has increased from 23 in the 1980s to 39 in the 1990s to 67 in just the first seven years of the 2000s.
The purpose of this paper is to theoretically and empirically evaluate the postulate of a noise-free criterion (Assumption , above), and to describe an extension of TSD that is sufficiently flexible to handle criterion variability. The claim is that criteria may vary from trial to trial in part because of noise inherent to the processes involved with maintaining and updating them. Although this claim does not seriously violate the theoretical structure of TSD, it does have major implications for how we draw theoretical conclusions about memory, perception, and decision processes from detection, discrimination, and recognition experiments.
As we review below, concerns about variability in the decision process are apparent in a variety of literatures, and theoretical tools have been advanced to address the problems that arise from noisy decision making (Rosner & Kochanski, 2008). However, theorizing in recognition memory has mostly advanced independently of such concerns, perhaps in part because of the difficulty associated with disentangling decision noise from representational noise (see, e.g., Ratcliff & Starns, in press). This paper considers the statistical and analytic problems that arise from postulating criterion noise in the context of detection-theoretic models, and applies a novel experimental task, the ensemble recognition paradigm, to the problem of estimating criterion variance.
Considerations similar to the ones forwarded here have been previously raised in the domains of psychoacoustics and psychophysics (Durlach & Braida, 1969; Gravetter & Lockhead, 1973), but have not been broadly considered in the domain of recognition memory. An exception is the seminal “strength theory” of Wickelgren (1968; Wickelgren & Norman, 1966; Norman & Wickelgren, 1969), on whose work our initial theoretical rationale is based. That work was applied predominately to problems in short-term memory and to the question of how absolute (yes-no) and relative (forced-choice) response tasks differed from one another. However, general analytic forms for the computation of detection statistics were not provided, nor was the work applied to the relationship between the isosensitivity function and theories of recognition memory (which were not prominent at the time).
Contemporary versions of TSD are best understood by their relation to the general class of judgment models derived from Thurstone (1927). A taxonomy of those models described by Torgerson (1958) allows various restrictions on the equality of stimulus variance and of criterial variance; current applications of TSD to recognition memory vary in whether they permit stimulus variance to differ across distributions, but they almost universally disallow criterial variance. This is a restriction that, although not unique to this field, is certainly a surprising dissimilarity with work in related areas such as detection and discrimination in psychophysical tasks (Bonnel & Miller, 1994; Durlach & Braida, 1969; Nosofsky, 1983) and classification (Ashby & Maddox, 1993; Erev, 1998; Kornbrot, 1980). The extension of TSD to ND-TSD is a relaxation of this restriction: ND-TSD permits nonzero criterial variance.
The recent explosion of work evaluating the exact form of the isosensitivity function in recognition memory under different conditions (Arndt & Reder, 2002; Glanzer, Kim, Hilford, & Adams, 1999; Gronlund & Elam, 1994; Kelley & Wixted, 2001; Matzen & Benjamin, in press; Qin, Raye, Johnson, & Mitchell, 2001; Ratcliff, Sheu, & Gronlund, 1992; Ratcliff, McKoon, & Tindall, 1994; Slotnick, Klein, Dodson, & Shimamura, 2000; Van Zandt, 2000; Yonelinas, 1994, 1997, 1999) and in different populations (Healy, Light, & Chung, 2005; Howard, Bessette-Symons, Zhang, & Hoyer, 2006; Manns, Hopkins, Reed, Kitchener, & Squire, 2003; Wixted & Squire, 2004a, 2004b; Yonelinas, Kroll, Dobbins, Lazzara, & Knight, 1998; Yonelinas, Kroll, Quamme, Lazzara, Suavé, Widaman, & Knight, 2002; Yonelinas, Quamme, Widaman, Kroll, Suavé, & Knight, 2004), as well as the prominent role those functions play in current theoretical development (Dennis & Humphreys, 2001; Glanzer, Adams, Iverson, & Kim, 1993; McClelland & Chappell, 1998; Wixted, 2007; Shiffrin & Steyvers, 1997; Yonelinas, 1999), suggests the need for a thorough reappraisal of the underlying variables that contribute to those functions. Because work in psychophysics (Krantz, 1969; Nachmias & Steinman, 1963) and, more recently, in recognition memory (Malmberg, 2002; Malmberg & Xu, 2006; Wixted & Stretch, 2004) has illustrated how aspects and suboptimalities of the decision process can influence the shape of the isosensitivity function, the goals of this article are to provide an organizing framework for the incorporation of decision noise within TSD, and to help expand the various theoretical discussions within the field of recognition memory to include a role for decision variability. 
We suggest that drawing conclusions about the theoretical components of recognition memory from the form of the isosensitivity function can be a dangerous enterprise, and show how a number of historical and current puzzles in the literature may benefit from a consideration of criterion noise.
The first part of this paper provides a short background on the assumptions of traditional TSD models, as well as evidence bearing on the validity of those assumptions. Appreciating the nature of the arguments underlying the currently influential unequal-variance version of TSD is critical to understanding the principle of criterial variance and the proposed analytic procedure for separately estimating criterial and evidence variance. In the second part of the paper, we critically evaluate the assertion of a stationary and nonvariable scalar criterion value3 from a theoretical and empirical perspective, and in the third section, provide basic derivations for the form of the isosensitivity function in the presence of nonzero criterial variability. The fourth portion of the paper provides derivations for measures of accuracy in the presence of criterial noise, and leads to the presentation, in the fifth section, of the “ensemble recognition” task, which can be used to assess criterial noise. In the sixth part of the paper, different models of that experimental task are considered and evaluated, and estimates of criterial variability are provided. In the seventh and final part of the paper, we review the implications of the findings and review some of the situations in which a consideration of criterial variability might advance our progress on a number of interesting problems in recognition memory and beyond.
It is important to note that the successes of TSD have led to many unanswered questions, and that a reconsideration of basic principles like criterion invariance may provide insight into those problems. No less of an authority than John Swets, the researcher most responsible for introducing TSD to psychology, noted that it was “unclear” why, for example, the slope of the isosensitivity line for detection of brain tumors was approximately ½ the slope of the isosensitivity line for detection of abnormal tissue cells (Swets, 1986b). Within the field of recognition memory, there is evidence that certain manipulations that lead to increased accuracy, such as increased study time, are also associated with decreased slope of the isosensitivity function (Glanzer et al., 1999; Hirshman & Hostetter, 2000), whereas other manipulations that also lead to superior performance are not (Ratcliff et al., 1994). Although there are extant theories that account for changes in slope, there is no agreed-upon mechanism by which they do so, nor is an explanation of such heterogeneous effects forthcoming.
Throughout this paper, we will make reference to the recognition decision problem, but most of the considerations presented here are relevant to other problems in detection and discrimination, and we hope that the superficial application to recognition memory will not detract from the more general message about the need to consider decision-based noise in such problems (see also Durlach & Braida, 1969; Gravetter & Lockhead, 1973; Nosofsky, 1983; Wickelgren, 1968).
The lynchpin theoretical apparatus of TSD is the probabilistic relationship between signal status and perceived evidence. The historical assumption about this relationship is that the distributions of the random variables are normal in form (Thurstone, 1927) and of equal variance, separated by some distance, d’ (Green & Swets, 1966). Whereas the former assumption has survived inquiry, the latter has been less successful.
The original (Peterson et al., 1954) and most popularly applied version of TSD assumes that signal and noise distributions are of equal variance. Although many memory researchers tacitly endorse this assumption by reporting summary measures of discrimination and criterion placement that derive from the application of the equal-variance model, such as d’ and Cj, respectively, the empirical evidence does not support the equal-variance assumption. The slope of the isosensitivity function in recognition memory is often found to be ~0.80 (Ratcliff et al., 1992), although this value may change with increasing discriminability (Glanzer et al., 1999; Heathcote, 2003; Hirshman & Hostetter, 2000). This result has been taken to imply that the evidence distribution for studied items is of greater variance than the distribution for unstudied items (Green & Swets, 1966). At issue here is not the existence of this effect but its magnitude, and whether manipulations that enhance or attenuate it are actually affecting representational variance.
The remarkable linearity of the isosensitivity function notwithstanding, it is critical for present purposes to note not only that the mean slope for recognition memory is often less than 1, but also that it varies considerably over situations and individuals (Green & Swets, 1966). It is considerably lower than 0.8 for some tasks (e.g., ~0.6 for the detection of brain tumors; Swets, Pickett, Whitehead, Getty, Schnur, Swets, & Freeman, 1979), higher than 1 for other tasks (like information retrieval; Swets, 1969), and around 1.0 for yet others (such as odor recognition; Rabin & Cain, 1984).
Nonunit and variable slopes reveal an inadequacy of the equal-variance model of Peterson et al. (1954) and undermine the validity of the measure d’. This failure can be addressed in several ways. It might be assumed that the distributions of evidence are asymmetric in form, for example, or that one or the other distribution reflects a mixture of latent distributions (DeCarlo, 2002; Yonelinas, 1994). The traditional and still predominant explanation, however, is the one described above: that the variance of the distributions is unequal (Green & Swets, 1966; Wixted, 2007). Because the slope of the isosensitivity function is equal to the ratio of the standard deviations of the noise and signal distributions in the unequal-variance TSD model, the empirical estimates of slope less than 1 have promoted the inference that the signal distribution is of greater variance than the noise distribution in recognition. However, the statistical theory of the form of the isosensitivity function that is used to understand nonunit slopes and slope variability has been only partially unified with the psychological theories that produce such behavior, via either the interactivity of continuous and thresholding mechanisms (Yonelinas, 1999) or the averaging process presumed by global matching mechanisms (Gillund & Shiffrin, 1984; Hintzman, 1986; Humphreys, Pike, Bain, & Tehan, 1989; Murdock, 1982). None of these prominent theories include a role for criterial variability, nor do they provide a comprehensive account of the shape of isosensitivity functions and of the effect of manipulations on that shape. Criterial variability can directly affect the slope of the isosensitivity function, a datum that opens up novel theoretical possibilities for psychological models of behavior underlying the isosensitivity function.
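The relationship between slope and variance can be made concrete with a short sketch. It assumes, for illustration only, that the criterion on each trial is an independent normal draw with standard deviation sigma_c around its mean, that this noise is the same at every criterion location, and that the noise distribution is standard normal. Under those assumptions the difference between evidence and criterion is itself normal, so criterial variance adds to both evidence variances. The function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def rates(mean_criterion, mu_s, sigma_s, sigma_c=0.0, sigma_n=1.0):
    """(false-alarm rate, hit rate) when the criterion itself is noisy.

    Assumption: evidence and criterion are independent normals, so
    (evidence - criterion) has variance sigma^2 + sigma_c^2.
    """
    fa = _STD.cdf((0.0 - mean_criterion) / math.hypot(sigma_n, sigma_c))
    hit = _STD.cdf((mu_s - mean_criterion) / math.hypot(sigma_s, sigma_c))
    return fa, hit


def z_slope(mu_s, sigma_s, sigma_c=0.0, criteria=(-0.5, 0.0, 0.5, 1.0, 1.5)):
    """Slope of the z-transformed isosensitivity function.

    The function is exactly linear in z-coordinates under these
    assumptions, so the two end points determine the slope.
    """
    points = [rates(c, mu_s, sigma_s, sigma_c) for c in criteria]
    z_fa = [_STD.inv_cdf(fa) for fa, _ in points]
    z_hit = [_STD.inv_cdf(hit) for _, hit in points]
    return (z_hit[-1] - z_hit[0]) / (z_fa[-1] - z_fa[0])


if __name__ == "__main__":
    # Signal SD of 1.25 yields the often-reported slope of 0.80 ...
    print(round(z_slope(mu_s=1.0, sigma_s=1.25, sigma_c=0.0), 3))
    # ... but the same evidence distributions give a slope nearer 1
    # once criterion noise is added.
    print(round(z_slope(mu_s=1.0, sigma_s=1.25, sigma_c=1.0), 3))
```

With no criterion noise the slope reduces to the familiar ratio of noise to signal standard deviations. With nonzero sigma_c the same evidence distributions produce a slope closer to 1, so an observed slope alone cannot identify the ratio of evidence variances.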
The form of the isosensitivity function has been used to test the validity of assumptions built into TSD about the nature of the evidence distributions, as well as to estimate parameters for those distributions. In that sense, TSD can be said to have bootstrapped itself into its current position of high esteem: Its validity has mostly been established by confirming its implications, rather than by systematically testing its individual assumptions. This is not intended to be a point of criticism, but it must be kept in mind that the accuracy of such estimation and testing depends fundamentally on the joint assumptions that evidence is inherently variable and criterion location is not. Allowing criterial noise to play a role raises the possibility that previous explorations of the isosensitivity function in recognition memory have conflated the contributions of stimulus and criterial noise.
As noted earlier, traditional TSD assumes that criterion placement is a noise-free and stationary process. Although there is some acknowledgment of the processes underlying criterion inconsistency (see, e.g., Macmillan & Creelman, 2005, p. 46), the apparatus of criterion placement in TSD stands in stark contrast with the central assumption of stimulus-related variability (see also Rosner & Kochanski, 2008). There are numerous reasons to doubt the validity of the idea that criteria are noise-free. First, there is evidence from detection and discrimination tasks of response autocorrelations, as well as systematic effects of experimental manipulations on response criteria. Second, maintaining the values of one or multiple criteria poses a memory burden and should thus be subject to forgetting and memory distortion. Third, comprehensive models of response time and accuracy in choice tasks suggest the need for criterial variability. Fourth, there is evidence from basic and well controlled psychophysical tasks of considerable trial-to-trial variability in the placement of criteria. Fifth, there are small but apparent differences between forced-choice response tasks and yes-no response tasks that indicate a violation of one of the most fundamental relationships predicted by TSD: the equality of the area under the isosensitivity function as estimated by the rating procedure and the proportion of correct responses in a two-alternative forced-choice task. This section will review each of these arguments more fully.
In each case, it is important to distinguish between systematic and nonsystematic sources of variability in criterion placement. This distinction is critical because only nonsystematic variability violates the actual underlying principle of a nonvariable criterion. Some scenarios violate the usual use of, but not the underlying principles of, TSD. This section identifies some sources of systematic variability and outlines the theoretical mechanisms that have been invoked to handle them. We also review evidence for nonsystematic sources of variability. Systematic sources of variability can be modeled within TSD by allowing criterion measures to vary with experimental manipulations (Benjamin, 2001; Benjamin & Bawa, 2004; Brown & Steyvers, 2005; Brown, Steyvers, & Hemmer, in press), by postulating a time-series criterion localization process contingent upon feedback (Atkinson, Carterette, & Kinchla, 1964; Atkinson & Kinchla, 1965; Friedman, Carterette, Nakatani, & Ahumada, 1968) only following errors (Kac, 1962; Thomas, 1973) or only following correct responses (Model 3 of Dorfman & Biderman, 1971), or as a combination of a long-term learning process and nonrandom momentary fluctuations (Treisman, 1987; Treisman & Williams, 1984). Criterial variance can even be modeled with a probabilistic responding mechanism (Parks, 1966; Thomas, 1975; White & Wixted, 1999), although the inclusion of such a mechanism violates much of the spirit of TSD.
When data are averaged across trials in order to compute TSD parameters, the researcher is tacitly assuming that the criterion is invariant across those trials. By extension, when parameters are computed across an entire experiment, measures of discriminability and criterion are only valid when the criterion is stationary over that entire period. Unfortunately, there is an abundance of evidence that this condition is rarely, if ever, met.
Research more than one-half century ago established the presence of longer runs of responses than would be expected under a response-independence assumption (Fernberger, 1920; Howarth & Bulmer, 1956; McGill, 1957; Shipley, 1961; Verplanck, Collier, & Cotton, 1952; Verplanck, Cotton, & Collier, 1953; Wertheimer, 1953). More recently, response autocorrelations (Gilden & Wilson, 1995; Luce, Nosofsky, Green, & Smith, 1982; Staddon, King, & Lockhead, 1980) and response time autocorrelations (Gilden, 1997; 2001; Van Orden, Holden, & Turvey, 2003) within choice tasks have been noted and evaluated in terms of long-range fractal properties (Gilden, 2001; Thornton & Gilden, in press) or short-range response dependencies (Wagenmakers, Farrell, & Ratcliff, 2004, 2005). Such dependencies have even been reported in the context of tasks eliciting confidence ratings (Mueller & Weidemann, 2008). Numerous models were proposed to account for short-range response dependencies, most of which included a mechanism for the adjustment of the response criterion on the basis of feedback of one sort or another (e.g., Kac, 1962; Thomas, 1973; Treisman, 1987; Treisman & Williams, 1984). Because criterion variance was presumed to be systematically related to aspects of the experiment and the subject's performance, however, statistical models that incorporated random criterial noise were not applied to such tasks (e.g., Durlach & Braida, 1969; Gravetter & Lockhead, 1973; Wickelgren, 1968).
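A small numerical example shows why stationarity matters. The sketch below assumes an equal-variance observer with fixed true discriminability whose criterion takes different values on different, equally sized blocks of trials; the function name and the particular values are hypothetical. Hit and false-alarm rates are pooled across blocks before d’ is computed, as is typical when data are averaged across trials.

```python
from statistics import NormalDist

_STD = NormalDist()


def pooled_dprime(true_d, criteria):
    """d' computed from hit and false-alarm rates pooled over blocks.

    Assumptions: noise ~ N(0, 1), signal ~ N(true_d, 1), and an equal
    number of trials at each criterion value in `criteria`.
    """
    hits = [_STD.cdf(true_d - c) for c in criteria]  # P(signal evidence > c)
    fas = [_STD.cdf(-c) for c in criteria]           # P(noise evidence > c)
    hit = sum(hits) / len(hits)
    fa = sum(fas) / len(fas)
    return _STD.inv_cdf(hit) - _STD.inv_cdf(fa)


if __name__ == "__main__":
    # Stationary criterion: the estimate recovers the true value.
    print(round(pooled_dprime(1.5, [0.75, 0.75]), 3))
    # Criterion shifts between blocks: the pooled estimate is too low.
    print(round(pooled_dprime(1.5, [0.25, 1.25]), 3))
```

The pooled point lies on the chord joining two points of a concave isosensitivity function and therefore falls below the function itself, which is why the estimate of discriminability is biased downward even though the observer's sensitivity never changed.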
The presence of such response correlations in experiments in which the signal value is uncorrelated across trials implies shifts in the decision regime, either in terms of signal reception or transduction, or in terms of criterion location. To illustrate this distinction, consider a typical subject in a detection experiment whose interest and attention fluctuate with surrounding conditions (did an attractive research assistant just pass by the door?) and changing internal states (increasing hunger or boredom). If these distractions cause the subject to attend less faithfully to the experiment for a period of time, it could lead to systematically biased evidence values and thus biased responses. Alternatively, if a subject's criterion fluctuates because such distraction affects their ability to maintain a stable value, it will bias responses equivalently from the decision-theoretic perspective. More importantly, fluctuating criteria can lead to response autocorrelations even when the transduction mechanism does not lead to correlated evidence values. Teasing apart these two sources of variability is the major empirical difficulty of our current enterprise.
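The second possibility can be illustrated by simulation. In the minimal sketch below, evidence values and signal status are drawn independently on every trial, and only the criterion carries over from trial to trial. The drift process (a first-order autoregressive series), its parameters, and the function names are assumptions chosen for illustration, not a model proposed in the literature reviewed here.

```python
import random


def simulate_responses(n_trials, phi, criterion_sd, seed=0):
    """Yes/no responses (1/0) from an observer with a drifting criterion.

    Assumptions: the criterion follows an AR(1) process with carry-over
    `phi` and stationary SD `criterion_sd` around a mean of 0.5; noise
    evidence ~ N(0, 1), signal evidence ~ N(1, 1), both independent
    across trials; signal and noise trials are equally likely.
    """
    rng = random.Random(seed)
    mean_c = 0.5
    innovation_sd = criterion_sd * (1.0 - phi ** 2) ** 0.5
    criterion = mean_c
    responses = []
    for _ in range(n_trials):
        criterion = mean_c + phi * (criterion - mean_c) + rng.gauss(0.0, innovation_sd)
        is_signal = rng.random() < 0.5
        evidence = rng.gauss(1.0 if is_signal else 0.0, 1.0)
        responses.append(1 if evidence > criterion else 0)
    return responses


def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation of a sequence of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


if __name__ == "__main__":
    fixed = simulate_responses(20000, phi=0.0, criterion_sd=0.0, seed=1)
    drifting = simulate_responses(20000, phi=0.9, criterion_sd=1.0, seed=1)
    print("fixed criterion:   ", round(lag1_autocorrelation(fixed), 3))
    print("drifting criterion:", round(lag1_autocorrelation(drifting), 3))
```

Under these assumptions the fixed-criterion observer produces responses that are essentially uncorrelated across trials, whereas the drifting-criterion observer produces clearly positive response autocorrelation despite receiving uncorrelated evidence.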
Stronger evidence for the lability of criteria comes from tasks in which experimental manipulations are shown to induce strategic changes. Subjects appear to modulate their criterion based on their estimated degree of learning (Hirshman, 1995) and perceived difficulty of the distractor set in recognition (Benjamin & Bawa, 2004; Brown et al., in press). Subjects even appear to dynamically shift criteria in response to item characteristics, such as idiosyncratic familiarity (Brown, Lewis, & Monk, 1977) and word frequency (Benjamin, 2003). In addition, criteria exhibit reliable individual differences as a function of personality traits (Benjamin, Wee, & Roberts, 2008), thus suggesting another unmodeled source of variability in detection tasks.
It is important to note, however, that criterion changes do not always appear when expected (e.g., Stretch & Wixted, 1998; Higham, Perfect, & Bruno, in press; Verde & Rotello, 2007) and are rarely of an optimal magnitude. It is for this reason that there is some debate over whether subject-controlled criterion movement underlies all of the effects that it has been invoked to explain (Criss, 2006), and indeed, more generally, over whether a reconceptualization of the decision variable itself provides a superior explanation to that of strategic criterion-setting (for a review in the context of “mirror effects,” see Greene, 2007). For present purposes, it is worth noting that this inconsistency may well reflect the fact that criterion maintenance imposes a nontrivial burden on the rememberer, and they may occasionally forgo strategic shifting in order to minimize the costs of allocating the resources to do so.
These many contributors to criterion variability make it likely that every memory experiment contains a certain amount of systematic but unattributed sources of variance that may affect interpretations of the isosensitivity function if not explicitly modeled. To be clear, such effects are the province of the current model only if they are undetected and unincorporated into the application of TSD to the data. The systematic variability evident in strategic criterion movement may, depending on the nature of that variability, meet the assumptions of ND-TSD and thus be accounted for validly, but we will explicitly deal with purely nonsystematic variability in our statistical model.
Given the many systematic sources of variance in criterion placement, it is unlikely that recapitulation of criterion location from trial to trial is a trivial task for the subject. The current criterion location is determined by some complex function relating past experience, implicit and explicit payoffs, and experience thus far in the test, and retrieval of the current value is likely prone to error—a fact that may explain why intervening or unexpected tasks or events that disrupt the normal pace or rhythm of the test appear to affect criterion placement (Hockley & Niewiadomski, 2001). Evidence for this memory burden is apparent when comparing the form of isosensitivity functions estimated from rating procedures with estimates from other procedures, such as payoff manipulations.
The difficulty of criterion maintenance is exacerbated in experiments in which confidence ratings are gathered because the subject is forced to maintain multiple criteria, one for each confidence boundary. Although it is unlikely that these values are maintained as independent entities (Stretch & Wixted, 1998), the burden nonetheless increases with the number of required confidence boundaries. Variability introduced by the confidence-rating procedure may explain why the isosensitivity function differs slightly when estimated with that procedure as compared to experiments that manipulate payoff matrices, and why rating-derived functions change shape slightly but unexpectedly when the prior odds of signal and noise are varied (Balakrishnan, 1998a; Markowitz & Swets, 1967; Van Zandt, 2000). These findings have been taken to indicate a fundamental failing of the basic assumptions of TSD (Balakrishnan, 1998a,b, 1999) but may simply reflect the contribution of criterion noise (Mueller & Weidemann, 2008).
A related piece of evidence comes from the comparison of isosensitivity functions from rating procedures with single points derived from a yes-no judgment. As noted by Wickelgren (1968), it is not uncommon for that yes-no point to lie slightly above the isosensitivity function (Egan, Greenberg, & Schulman, 1961; Markowitz & Swets, 1967; Schulman & Mitchell, 1966; Watson, Rilling, & Bourbon, 1964; Wickelgren & Norman, 1966) and for that effect to be somewhat larger when more confidence categories are employed. This result likely reflects the fact that the maintenance of criteria becomes more difficult with increasing numbers of criterion points. In recognition memory, Benjamin, Lee, and Diaz (2008) showed that discrimination between previously studied and unstudied words was measured to be superior when subjects made yes-no discrimination judgments than when they used a four-point response scale, and superior on the four-point response scale when compared to an eight-point response scale. This result is consistent with the idea that each criterion introduces noise to the decision process, and that, in the traditional analysis, that noise inappropriately contributes to estimates of memory for the studied materials.
A third argument in favor of criterion variance comes from sequential sampling models that explicitly account for both response time and accuracy in two-choice decisions. Specifically, the diffusion model of Ratcliff (1978, 1988; Ratcliff & Rouder, 1998) serves as a benchmark in the field of recognition memory (e.g., Ratcliff, Thapar, & McKoon, 2004) in that it successfully accounts for aspects of data, including response times, that other models do not explicitly address. It would thus seem that general, heuristic models like TSD have much to gain from analyzing the nature of the decision process in the diffusion model.
That model only provides a full account of recognition memory when two critical parameters are allowed to vary (Ratcliff & Rouder, 1998). First is a parameter that corresponds to the variability in the rate with which evidence accumulates from trial to trial. This value corresponds naturally to stimulus-based variability and resembles the parameter governing variability in the evidence distributions in TSD. The second parameter corresponds to trial-to-trial variability in the starting point for the diffusion process. When this value moves closer to a decision boundary, less evidence is required prior to a decision—thus, this value is analogous to variability in criterion placement. A recent extension of the diffusion model to the confidence rating procedure (Ratcliff & Starns, in press) has a similar mechanism. The fact that the otherwise quite powerful diffusion model fails to provide a comprehensive account of recognition memory without possessing explicit variability in criterion suggests that such variability influences performance in recognition nontrivially.
Thurstonian-type models with criterial variability have been more widely considered in psychophysics and psychoacoustics, where they have generally met with considerable success. Nosofsky (1983) found that range effects in auditory discrimination were due to both increasing representational and criterial variance with wider ranges. Bonnel and Miller (1994) found evidence of considerable criterial variance in a same/different line-length judgment task in which attention to two stimuli was manipulated by instruction. They concluded that criterial variability was greater than representational variability in their task (see their Experiment 2) and that focused attention served to decrease that variance.
One of the outstanding early successes of TSD was the proof by Green (1964; Green & Moses, 1966) that the area under the isosensitivity function as estimated by the rating-scale procedure should be equal to the proportion of correct responses in a 2-alternative forced-choice task. This result generalizes across any plausible assumption about the shape of evidence distributions, as long as they are continuous, and is thus not limited by the assumption of normality typically imposed on TSD. Empirical verification of this claim would strongly support the assumptions underlying TSD, including that of a nonvariable criterion, but the extant work on this topic is quite mixed.
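The area relationship can be checked numerically for any particular pair of distributions. The sketch below does so for the unequal-variance normal case, assuming a noise distribution of N(0, 1) and an observer in the forced-choice task who picks the alternative with the greater evidence; the function names, integration range, and parameter values are illustrative assumptions. The area is computed by sweeping a noise-free criterion and applying the trapezoid rule.

```python
import math
from statistics import NormalDist


def roc_area(mu_s, sigma_s, n_steps=4000):
    """Area under the isosensitivity function by the trapezoid rule.

    Assumption: noise ~ N(0, 1), signal ~ N(mu_s, sigma_s), and a
    noise-free criterion swept from far left to far right.
    """
    noise = NormalDist(0.0, 1.0)
    signal = NormalDist(mu_s, sigma_s)
    spread = 8.0 * max(1.0, sigma_s)
    lo, hi = min(0.0, mu_s) - spread, max(0.0, mu_s) + spread
    area = 0.0
    prev_fa, prev_hit = 1.0, 1.0  # criterion at minus infinity
    for i in range(n_steps + 1):
        c = lo + (hi - lo) * i / n_steps
        fa = 1.0 - noise.cdf(c)
        hit = 1.0 - signal.cdf(c)
        area += (prev_fa - fa) * (prev_hit + hit) / 2.0
        prev_fa, prev_hit = fa, hit
    area += prev_fa * prev_hit / 2.0  # close the curve at (0, 0)
    return area


def two_afc_proportion_correct(mu_s, sigma_s):
    """P(signal evidence > noise evidence) for independent normals."""
    return NormalDist().cdf(mu_s / math.hypot(1.0, sigma_s))


if __name__ == "__main__":
    for mu_s, sigma_s in ((1.0, 1.0), (1.0, 1.25), (2.0, 1.5)):
        print(mu_s, sigma_s,
              round(roc_area(mu_s, sigma_s), 4),
              round(two_afc_proportion_correct(mu_s, sigma_s), 4))
```

The two quantities agree here because the sweep uses a criterion with no variability; the sketch says nothing about how close the agreement remains when the rating criteria are themselves noisy.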
In perceptual tasks, this relationship appears to be approximately correct under some conditions (Emmerich, 1968; Green & Moses, 1966; Schulman & Mitchell, 1966; Shipley, 1965; Whitmore, Williams, & Ermey, 1968), but is not as strong or as consistent as one might expect (Lapsley Miller, Scurfield, Drga, Galvin, & Whitmore, 2002). Even within a generalization of Green's principle to a wide range of other decision axes and decision variables (Lapsley Miller et al., 2002), considerable observer inconsistency was noted. Such inconsistency is the province of our exploration here. In fact, a relaxation of the assumption of nonvariable criteria permits conditions in which this relationship can be violated. Wickelgren (1968) even noted that it was “quite amazing” (p. 115) that the relationship appeared to hold even approximately.
The empirical evidence regarding the correspondence between forced-choice and yes-no recognition also suggests an inadequacy in the basic model. Green and Moses (1966) reported one experiment that conformed well to the prediction (Experiment 2) and one that violated it somewhat (Experiment 1). Most recent studies have made this comparison under the equal-variance assumption reviewed earlier as inadequate for recognition memory (Deffenbacher, Leu, & Brown, 1981; Khoe, Kroll, Yonelinas, Dobbins, & Knight, 2000; Yonelinas, Hockley, & Murdock, 1992), but experiments that have relaxed this assumption have yielded mixed results: some have concluded that TSD-predicted correspondences are adequate (Smith & Duncan, 2004) and others have concluded in favor of other models (Kroll, Yonelinas, Dobbins, & Frederick, 2002). However, Smith and Duncan (2004) used rating scales for both forced-choice and yes-no recognition, making it impossible to establish whether their correspondences were good because ratings imposed no decision noise or because the criterion variance imposed by ratings was more or less equivalent on the two tasks. In addition, amnesic patients, who might be expected to have great difficulty with the maintenance of criteria, have been shown to perform relatively more poorly on yes-no than on forced-choice recognition (Freed, Corkin, & Cohen, 1987; see also Aggleton & Shaw, 1996), although this result has not been replicated (Khoe et al., 2000; Reed, Hamann, Stefanacci, & Squire, 1997). The inconsistency in this literature may reflect the fact that criterion noise accrues throughout an experiment: Bayley, Wixted, Hopkins, and Squire (2008) recently showed that, whereas amnesics do not show any disproportionate impairment on yes-no recognition on early testing trials, their performance on later trials does indeed drop relative to control subjects.
Although we shall not pursue the comparison of forced-choice and yes-no responding further in our search for evidence of criterial variability, it is noteworthy that the evidence in support of the fundamental relationship between the two tasks reported by Green has not been abundant, and that the introduction of criterial variability allows conditions under which that relationship is violated.
This section outlines the mathematical formulation of the decision task and the basic postulates of TSD, and extends that formulation by explicitly modeling criterion placement as a random variable with nonzero variability. To start, let us consider a subject's perspective on the task. Recognition requires the subject to discriminate between previously studied and unstudied stimuli. The traditional formulation of recognition presumes that test stimuli yield mnemonic evidence for studied status and that prior study affords discriminability between studied and unstudied stimuli by increasing the average amount of evidence provided by studied stimuli, and likely increasing variance as well. However, inherent variability within both unstudied and studied groups of stimuli yields overlapping distributions of evidence. This theoretical formulation is depicted in the top panel of Figure 1, in which normal probability distributions represent the evidence values (e) that previously unstudied (S0) and previously studied (S1) stimuli yield at test. If these distributions are nonzero over the full range of the evidence variable, then there is no amount of evidence that is unequivocally indicative of a particular underlying distribution (studied or unstudied). Equivalently, the likelihood ratio at criterion is finite and nonzero, 0 < β < ∞. The response is made by imposing a decision criterion (c), such that:
The indicated areas in Figure 1 corresponding to hit and false-alarm rates (HR and FAR) illustrate how the variability of the representational distributions directly implies a particular level of performance.
Consider, as a hypothetical alternative case, a system without variable stimulus encoding. In such a system, signal and noise are represented by nonvariable (and consequently nonoverlapping) distributions of evidence, and the task seems trivial. But there is, in fact, some burden on the decision-maker in this situation. First, the criterion must be placed judiciously—were it to fall anywhere outside the two evidence points, performance would be at chance levels. Thus, as reviewed previously, criterion placement must be a dynamic and feedback-driven process that takes into account aspects of the evidence distributions and the costs of different types of errors. Here we explicitly consider the possibility that there is an inherent noisiness to criterion placement in addition to such systematic effects.
The bottom panel of Figure 1 illustrates this alternative scenario, in which the decision criterion is a normal random variable with variance greater than 0, and e is a binary variable. Variability in performance in this scenario derives from variability in criterion placement from trial to trial, but yields—in the case of this example—the same performance as in the top panel (shown by the areas corresponding to HR and FAR). This model fails, of course, to conform with our intuitions and we shall see presently that it is untenable. However, the demonstration that criterial variability can yield identical outcomes as evidence variability is illustrative of the predicament we find ourselves in; namely, how to empirically distinguish between these two components of variability. The next section of this paper outlines the problem explicitly.
Let μx and σx indicate the mean and standard deviation of distribution x, and the subscripts e and c refer to evidence and criterion, respectively. If both evidence and criterial variability are assumed to be normally distributed (N) and independent of one another, as generally assumed by Thurstone (1927) and descendant models (Kornbrot, 1980; Peterson et al., 1954; Tanner & Swets, 1954), the decision variable is distributed as
Because the variances of the component distributions sum to form the variability of the decision variable, it is not possible to discriminate between evidence and criterial variability on a purely theoretical basis (see also Wickelgren & Norman, 1966). This constraint does not preclude an empirical resolution, however. In addition, reworking the Thurstone model such that criteria cannot violate order constraints yields a model in which theoretical discrimination between criterion and evidence noise may be possible (Rosner & Kochanski, 2008).
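The identifiability problem can be made concrete with a minimal sketch. It assumes, as in the text, independent normal evidence and criterion, so that the difference e − c is normal with variance equal to the sum of the two variances; the function name `p_yes` and the parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_yes(mu_e, sigma_e, mu_c, sigma_c):
    """P(e > c) when e ~ N(mu_e, sigma_e) and c ~ N(mu_c, sigma_c) are independent.

    Assumption: the decision variable e - c is normal with mean mu_e - mu_c
    and standard deviation sqrt(sigma_e**2 + sigma_c**2), which must be > 0.
    """
    total_sd = sqrt(sigma_e ** 2 + sigma_c ** 2)
    return NormalDist().cdf((mu_e - mu_c) / total_sd)

# Three ways of splitting the same total variance (1.0) between evidence and criterion.
all_evidence = p_yes(1.0, 1.0, 0.5, 0.0)   # noise-free criterion
all_criterion = p_yes(1.0, 0.0, 0.5, 1.0)  # noise-free evidence
mixed = p_yes(1.0, 0.6, 0.5, 0.8)          # 0.36 + 0.64 = 1.0
print(all_evidence, all_criterion, mixed)
```

All three calls return the same response probability, which is why a single condition cannot reveal how the total variance is divided.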
Performance in a recognition task can be related to the decision variable by defining areas over the appropriate evidence function and, as is typically done in TSD, assigning the unstudied (e0) distribution a mean of 0 and unit variance:
in which “respond S” indicates a signal response, or a “yes” in a typical recognition task. These values are easiest to work with in normal-deviate coordinates:
Substitution and rearrangement yields the general model for the isosensitivity function with both representational and criterial variability (for related derivations, see McNicol, 1972; Wickelgren, 1968):
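Written out under the stated assumptions (independent normal evidence and criterion, with the unstudied distribution fixed at mean 0 and unit variance), the chain of steps can be sketched as follows. This is a reconstruction from the surrounding definitions rather than a quotation: the notation N(mean, standard deviation), the symbols z_FAR and z_HR, and the use of Φ for the standard normal distribution function are assumptions. The first line plays the role of what the text calls Equation 1 and the last line that of Equation 2.

```latex
\begin{align*}
e - c &\sim N\!\left(\mu_e - \mu_c,\ \sqrt{\sigma_e^2 + \sigma_c^2}\right) \\[4pt]
\mathrm{FAR} &= p(\text{respond } S \mid S_0)
  = \Phi\!\left(\frac{-\mu_c}{\sqrt{1 + \sigma_c^2}}\right),
\qquad
\mathrm{HR} = p(\text{respond } S \mid S_1)
  = \Phi\!\left(\frac{\mu_1 - \mu_c}{\sqrt{\sigma_1^2 + \sigma_c^2}}\right) \\[4pt]
z_{\mathrm{FAR}} &= \frac{-\mu_c}{\sqrt{1 + \sigma_c^2}},
\qquad
z_{\mathrm{HR}} = \frac{\mu_1 - \mu_c}{\sqrt{\sigma_1^2 + \sigma_c^2}} \\[4pt]
z_{\mathrm{HR}} &= \frac{\mu_1}{\sqrt{\sigma_1^2 + \sigma_c^2}}
  + \sqrt{\frac{1 + \sigma_c^2}{\sigma_1^2 + \sigma_c^2}}\; z_{\mathrm{FAR}}
\end{align*}
```

The last line follows by solving the z_FAR expression for μ_c and substituting into z_HR. Setting σ_c = 0 recovers the unequal-variance form with slope 1/σ_1 and intercept μ_1/σ_1.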
Note that, by this formulation, the slope of the function is not simply the reciprocal of the signal standard deviation, as it is in unequal-variance TSD. Increasing evidence variance will indeed decrease the slope of the function. However, the variances of the evidence and the criterion distribution also have an interactive effect: When the signal variance is greater than 1, increasing criterion variance will increase the slope. When it is less than 1, increasing criterion variance will decrease the slope. Equivalently, criterial variance reduces the effect of stimulus variance and pushes the slope towards 1.
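The interactive effect on the slope can be illustrated with a short sketch. It assumes that the isosensitivity function has slope sqrt((1 + σc²)/(σ1² + σc²)) and intercept μ1/sqrt(σ1² + σc²), which is what the definitions above imply when the unstudied distribution has unit variance; the function name and parameter values are hypothetical.

```python
from math import sqrt

def nd_tsd_zroc(mu1, sigma1, sigma_c):
    """Slope and intercept of the zROC under the assumed ND-TSD form.

    Assumes S0 ~ N(0, 1), S1 ~ N(mu1, sigma1), and an independent
    normal criterion with standard deviation sigma_c.
    """
    denom = sqrt(sigma1 ** 2 + sigma_c ** 2)
    slope = sqrt(1.0 + sigma_c ** 2) / denom
    intercept = mu1 / denom
    return slope, intercept

for sigma1 in (0.8, 1.0, 1.25):
    slopes = [round(nd_tsd_zroc(1.0, sigma1, sc)[0], 3) for sc in (0.0, 0.5, 1.0, 2.0)]
    print(f"sigma1={sigma1}: slopes as sigma_c grows -> {slopes}")
```

With signal variance above 1 the slope starts below 1 and rises toward it as criterion variance grows; with signal variance below 1 it starts above 1 and falls toward it; with equal variances it stays at 1 throughout.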
Figure 2 depicts how isosensitivity functions vary as a function of criterial variance, and confirms the claim of previous theorists (Treisman & Faulkner, 1984; Wickelgren, 1968) and implication of Equation 2 that criterial variability generally decreases the area under the isosensitivity function. The slight convexity at the margins of the function that results from unequal variances is an exception to that generality (see also Thomas and Myers, 1972). The left panels depict increasing criterial variance for signal variance less than 1, and the right panels for signal variance greater than 1. The middle panels show that, when signal variance is equal to noise variance, criterial variance decreases the area under the curve but the slope does not change. It is worth noting that the prominent attenuating effect of criterial variance on the area under the function is generalizable across a number of plausible alternative distributions (including the logistic and gamma distributions; Thomas & Myers, 1972).
When criterial variability is zero, Equation 2 reduces to the familiar form of the unequal-variance model of TSD:
in which the slope of the function is the reciprocal of the signal standard deviation and the y-intercept is μ1/σ1. When the distributions are assumed to have equal variance, as shown in the top panel in Figure 1, the slope of this line is 1.
When stimulus variability is zero and criterion variability is nonzero, as in the bottom example of the scenario depicted in Figure 1, the isosensitivity function is:
and when stimulus variability is nonzero but equal for the two distributions, the zROC is:
In both cases, the function has a slope of 1 and is thus identical to the case in which representational variability is nonzero but does not vary with stimulus type; thus, there is no principled way of using the isosensitivity function to distinguish between the two hypothetical cases shown in Figure 1, in which either evidence but not criterial variability or criterial but not evidence variability is present. Thankfully, given the actual form of the empirical isosensitivity function—which typically reveals a nonunit slope—we can use the experimental technique presented later in this paper to disentangle these two bases.
Because empirical isosensitivity functions exhibit nonunit slope, we need to consider measures of accuracy that generalize to the case when evidence distributions are not of equal variance. This section provides the rationale and derivations for ND-TSD generalizations of two commonly used measures, da and de.
There are three basic ways of characterizing accuracy (or, variously, discriminability or sensitivity) in the detection task. First, accuracy is related to the degree to which the evidence distributions overlap, and is thus a function of the distance between them, as well as their variances. Second, accuracy is a function of the distance of the isosensitivity line from an arbitrary point on the line that represents complete overlap of the distributions (and thus chance levels of accuracy on the task). Finally, accuracy can be thought of as the amount of area below an isosensitivity line—an amount that increases to 1 when performance is perfect and drops to 0.5 when performance is at chance. Each of these perspectives has interpretive value: the distribution-overlap conceptualization is easiest to relate to the types of figures associated with TSD (like the top part of Figure 1); distance-based measures emphasize the desirable psychometric qualities of the statistic (e.g., that they are on a ratio scale; Matzen & Benjamin, in press). Area-based measures bear a direct and transparent relation with forced-choice tasks. All measures can be intuitively related to the geometry of the isosensitivity space.
To derive measures of accuracy, we shall deal with the distances from the isosensitivity line, as defined by Equation 2. Naturally, there are an infinite number of distances from a point to a line, so it is necessary to additionally restrict our definition. Here we do so by using the shortest possible distance from the origin to the line, which yields a simple linear transformation of da (Schulman & Mitchell, 1966). In Appendix A, we provide an analogous derivation for de, which is the distance from the origin to the point on the isosensitivity line that intersects with a line perpendicular to the isosensitivity line. These values also correspond to distances on the evidence axis scaled by the variance of the underlying evidence distributions: de corresponds to the distance between the distributions, scaled by the arithmetic average of the standard deviations, and da corresponds to distance in terms of the root-mean-square average of the standard deviations (Macmillan & Creelman, 2005). For the remainder of this paper, we will use da, as it is quite commonly used in the literature (e.g., Banks, 2000; Matzen & Benjamin, in press), is easily related to area-based measures of accuracy, and provides a relatively straightforward analytic form.
The generalized version of da can be derived by solving for the point at which the isosensitivity function must intersect with a line of slope (-1/m):
The intersection point of Equation 2 and this equation is
which yields a distance of
from the origin. This value is scaled by √2 in order to determine the length of the hypotenuse on a triangle with sides of length noisy d*a (Simpson & Fitter, 1973):
The area measure AZ also bears a simple relationship with d*a:
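The two routes to the generalized measure can be compared in a brief sketch. It assumes the standard relations da = sqrt(2) × (perpendicular distance from the origin to the zROC line) and AZ = Φ(da/√2), together with the slope and intercept implied by the definitions above; the closed form inside `noisy_da` is a reconstruction, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def zroc_line(mu1, sigma1, sigma_c):
    """Assumed ND-TSD zROC: returns (slope, intercept) with S0 ~ N(0, 1)."""
    denom = sqrt(sigma1 ** 2 + sigma_c ** 2)
    return sqrt(1.0 + sigma_c ** 2) / denom, mu1 / denom

def da_from_line(slope, intercept):
    """Geometric route: shortest distance from the origin to the line, scaled by sqrt(2)."""
    return sqrt(2.0) * intercept / sqrt(1.0 + slope ** 2)

def noisy_da(mu1, sigma1, sigma_c):
    """Closed form obtained by substituting the assumed slope and intercept."""
    return sqrt(2.0) * mu1 / sqrt(1.0 + sigma1 ** 2 + 2.0 * sigma_c ** 2)

def area_az(da):
    """Area under the normal-model isosensitivity function."""
    return NormalDist().cdf(da / sqrt(2.0))

slope, intercept = zroc_line(1.2, 1.25, 0.5)
print(da_from_line(slope, intercept), noisy_da(1.2, 1.25, 0.5))
print(area_az(noisy_da(1.2, 1.25, 0.0)), area_az(noisy_da(1.2, 1.25, 0.5)))
```

The geometric and closed-form values coincide, and adding criterion variance lowers both the distance measure and the area, in line with the attenuation described for Figure 2.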
Because both criterial and evidence variability affect the slope of the isosensitivity function, it is difficult to isolate the contributions of each to performance. To do so, we must find conditions over which we can make a plausible case for criterial and evidential variance being independently and differentially related to a particular experimental manipulation. We start by taking a closer look at this question.
Over what experimental factor is evidence presumed to vary? Individual study items probably vary in pre-experimental familiarity and also in the effect of a study experience. In addition, the waxing and waning of attention over the course of an experiment increases the item-related variability (see also DeCarlo, 2002).
Do these same factors influence criterial variability? By the arguments presented here, criterial variability related to item characteristics is mostly systematic in nature (see, e.g., Benjamin, 2003) and is thus independent of the variability modeled by Equation 1. We have specifically concentrated on nonsystematic variability, and have argued that it is likely a consequence of the cognitive burden of criterion maintenance. Thus, the portion of criterial variability with which we concern ourselves is trial-to-trial variability on the test. What is needed is a paradigm in which item variability can be dissociated from trial variability.
In the experiment reported here, we use a variant of a clever paradigm devised by Nosofsky (1983) to investigate range effects in the absolute identification of auditory signals. In our experiment, subjects made recognition judgments for ensembles of items that vary in size. Thus, each test stimulus included a variable number of words (1, 2, or 4), all of which were old or all of which were new. The subjects’ task was to evaluate the ensemble of items and provide an “old” or “new” judgment on the group.
The size manipulation is presumed to affect stimulus noise—because each ensemble is composed of heterogeneous stimuli and is thus subject to item-related variance—but not criterial noise, because the items are evaluated within a single trial, as a group. Naturally, this assumption might be incorrect: subjects might, in fact, evaluate each item in an ensemble independently and with heterogeneous criteria. We will examine the data closely for evidence of a violation of the assumption of criterial invariance within ensembles.
In order to use the data from ensemble recognition to separately evaluate criterial and stimulus variance, we must have a linking model of information integration within an ensemble, that is, a model of how information from multiple stimuli is evaluated jointly for the recognition decision. We shall consider two general models. The independent variability model proposes that the variance of the strength but not the criterial distribution is affected by ensemble size, as outlined in the previous section. Four submodels are considered. The first two assume that evidence is averaged across stimuli within an ensemble and differ only in whether criterial variability is permitted to be nonzero (ND-TSD) or not (TSD). The latter two assume that evidence is summed across the stimuli within an ensemble and, as before, differ in whether criterion variability is allowed to be nonzero. These models will be compared to the OR model, which proposes that subjects respond positively to an ensemble if any member within that set yields evidence greater than a criterion. This latter model embodies a failure of the assumption that the stimuli are evaluated as a group, and its success would imply that our technique for separating criterial and stimulus noise is invalid. Thus, a total of five models of information integration are considered.
For each ensemble size, five criteria had to be estimated to generate performance on a 6-point rating curve. For all models except the two summation models, a version of the model was fit in which criteria were free to vary across ensemble size (yielding 15 free parameters, and henceforth referred to as without restriction) and another version was fit in which the criteria were constrained (with restriction) to be the same across ensemble sizes (yielding only 5 free parameters). Because the scale of the mean evidence values varies with ensemble size for the summation model, only one version was fit in which there were fifteen free parameters (i.e., they were free to vary across ensemble size).
One important concern in comparing models, especially non-nested models like the OR model, is that a model may benefit from undue flexibility. That is, a model may account for a data pattern more accurately not because it is a more accurate description of the underlying generating mechanisms, but rather because its mathematical form affords it greater flexibility (Myung & Pitt, 2002). It may thus appear superior to another model by virtue of accounting for nonsystematic aspects of the data. There are several approaches we have taken to reduce concerns that ND-TSD may benefit from greater flexibility than its competitors.
First, we have adopted the traditional approach of using an index of model fit that is appropriate for nonnested models and penalizes models according to the number of their free parameters (AIC; Akaike, 1973). Second, we use a correction on the generated statistic that is appropriate for the sample sizes in use here (AICc; Burnham & Anderson, 2004). Third, we additionally report the Akaike weight metric, which, unlike the AIC or AICc, has a straightforward interpretation as the probability that a given model is the best among a set of candidate models. Fourth, in addition to reporting both AICc values and Akaike weights, we also report the number of subjects best fit by each model, ensuring that no model is either excessively penalized for failing to account for only a small number of subjects (but dramatically so) or bolstered by accounting for only a small subset of subjects considerably more effectively than the other models.
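For readers unfamiliar with these indices, the following sketch computes them from maximized log-likelihoods. It uses the standard definitions (AIC = 2k − 2 ln L, the small-sample correction 2k(k + 1)/(n − k − 1), and weights proportional to exp(−Δ/2)); the function names and the example scores are hypothetical and are not values from this study.

```python
from math import exp

def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for a model with k free parameters and n observations."""
    aic = 2.0 * k - 2.0 * log_likelihood
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Convert a list of AIC or AICc scores (lower is better) into weights that sum to 1."""
    best = min(scores)
    relative = [exp(-0.5 * (s - best)) for s in scores]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical scores for three candidate models fit to the same data.
scores = [aicc(-100.0, 7, 180), aicc(-104.0, 6, 180), aicc(-99.5, 17, 180)]
print(scores)
print(akaike_weights(scores))
```

Subtracting the best score before exponentiating leaves the weights unchanged and avoids underflow when raw scores are large.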
Finally, we report in Appendix C the results of a large series of Monte Carlo simulations evaluating the degree to which ND-TSD has an advantage over TSD in terms of accounting for failures of assumptions common to the two models. We consider cases in which the evidence distributions are of a different form than assumed by TSD, and cases in which the decision rule is different from what we propose. To summarize the results from that exercise here, ND-TSD never accrues a higher AIC score or Akaike weight than TSD unless the generating distribution is ND-TSD itself. These results indicate that a superior fit of ND-TSD to empirical data is unlikely to reflect undue model flexibility when compared to TSD.
In this experiment, we evaluate the effects of manipulating study time on recognition of word ensembles of varying sizes. By combining ND-TSD and TSD with a few simple models of information integration, we will be able to separately estimate the influence of criterial and evidence variability on recognition across those two study conditions. This experiment pits the models outlined in the previous section against one another.
Nineteen undergraduate students from the University of Illinois participated to partially fulfill course requirements for an introductory course in psychology.
Word set size (one, two or four words in each set) was manipulated within-subjects in both experiments. Each subject participated in a single study phase and a single test phase. Subjects made their recognition responses on a 6-point confidence rating scale, and the raw frequencies of each response type were fit to the models in order to evaluate performance.
All words were obtained from the English Lexicon Project (Balota, Cortese, Hutchison, Neely, Nelson, Simpson, & Treiman, 2002). We drew 909 words with a mean word length of 5.6 (range: 4 – 8 letters) and mean log HAL frequency of 10.96 (range: 5.5 – 14.5). A random subset of 420 words was selected for the test list, which consisted of 60 single-item sets, 60 double-item sets, and 60 four-item sets. A random half of the items from each ensemble-size set was assigned to the study list. All study items were presented singly, while test items were presented in sets of one, two, or four items. Words presented in a single ensemble were either all previously studied or all unstudied. This resulted in 210 study item presentations and 180 test item presentations (90 old and 90 new). Again, every test presentation included all old or all new items; there were no trials on which old and new items were mixed in an ensemble.
Subjects were tested individually in a small, well-lit room. Stimuli were presented, and subject responses were recorded, on PC-style computers programmed using the Psychophysical Toolbox for MATLAB (Brainard, 1997; Pelli, 1997). Prior to the study phase, subjects read instructions on the computer screen informing them that they were to be presented with a long series of words that they were to try to remember as well as they could. They began the study phase by pressing the space bar. During the study phase, words were presented for 1.5 seconds. There was a 333 ms inter-stimulus interval (ISI) between presentations. At the conclusion of the study phase, subjects were given instructions for the test phase. Subjects were informed that test items would be presented in sets of one, two, or four words, and that they were to determine if the word or words that they were presented had been previously studied or not. They began the test phase by pressing the space bar. There was no time limit on the test.
Table 1 shows the frequencies by test condition summed across subjects. Discriminability (da) was estimated separately for each ensemble size and study time condition by maximum-likelihood estimation (Ogilvie & Creelman, 1968), and is also displayed in Table 1. All model fitting reported below was done on the data from individual subjects because of well-known problems with fitting group data (see, e.g., Estes & Maddox, 2005) and particular problems with recognition data (Heathcote, 2003). None of the subjects or individual trials were omitted from analysis. Details of the fitting procedure are outlined in Appendix D.
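The counts in this design follow from simple arithmetic, which the sketch below spells out. It assumes that "a random half" means 30 old and 30 new ensembles at each size; the variable names are illustrative.

```python
ensemble_sizes = (1, 2, 4)   # words per test ensemble
sets_per_size = 60           # test ensembles at each size
old_sets_per_size = sets_per_size // 2   # assumed: half of the ensembles at each size are old

test_words = sum(size * sets_per_size for size in ensemble_sizes)
study_words = sum(size * old_sets_per_size for size in ensemble_sizes)
test_presentations = sets_per_size * len(ensemble_sizes)
old_presentations = old_sets_per_size * len(ensemble_sizes)

print(test_words, study_words, test_presentations, old_presentations)
```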
The subject-level response frequencies were used to evaluate the models introduced earlier. Of particular interest is the independent variability model that we use to derive separate estimates of criterial and evidence variability. That model's performance is evaluated with respect to several other models. One is a sub-model (zero criterial variance model) that is equivalent to the independent variability model but assumes no criterial variance. For both the model with criterion variability (ND-TSD) and without (TSD), two different decision rules (averaging versus summation) are tested. Another model (the OR model) assumes that each stimulus within an ensemble is evaluated independently and that the decision is made on the basis of combining those independent decisions via an OR rule. Comparison of the independent variability model with the OR model is used to evaluate the claim that the stimulus is evaluated as an ensemble, rather than as n individual items. Comparison of the independent variability model with the nested zero-criterial-variance model is used to test for the presence of criterial variability.
The averaging version of this model is based on ND-TSD and the well-known relationship between the sampling distribution of the mean and sample size, as articulated by the Central Limit Theorem. Other applications of a similar rule in psychophysical tasks (e.g., Swets & Birdsall, 1967; Swets, Shipley, McKey, & Green, 1959) have confirmed this assumption of averaging stimuli or samples, but we will evaluate it carefully here because of the novelty of applying that assumption to recognition memory.
If the probability distribution of stimulus strength has variance σ², then the probability distribution for the average of an ensemble of n stimuli drawn from that distribution has variance σ²/n. This model assumes that the distribution of strength values is affected by n, but that criterial variability is not. Thus, the isosensitivity function of the criterion-variance ensemble recognition model is:
Because we fit the frequencies directly rather than the derived estimates of distance (unlike previous work: Nosofsky, 1983), there is no need to fix any parameters (such as the distance between the distributions) a priori. The hypothesized effect of the ensemble size manipulation is shown in Figure 3, in which the variance of the stimulus distributions decreases with increasing size. For clarity, the criterion distribution is not shown.
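The logic of the ensemble manipulation can be seen in a small sketch of the averaging rule. It assumes that both evidence variances are divided by n while criterion variance is left untouched, which gives slope sqrt((1/n + σc²)/(σ1²/n + σc²)) and intercept μ1/sqrt(σ1²/n + σc²); this form is reconstructed from the description above, and the function name and parameter values are hypothetical.

```python
from math import sqrt

def averaging_zroc(mu1, sigma1, sigma_c, n):
    """Slope and intercept of the zROC for ensembles of size n under evidence averaging.

    Assumes S0 ~ N(0, 1) and S1 ~ N(mu1, sigma1) for single items, with an
    independent normal criterion (standard deviation sigma_c) unaffected by n.
    """
    denom = sqrt(sigma1 ** 2 / n + sigma_c ** 2)
    slope = sqrt(1.0 / n + sigma_c ** 2) / denom
    intercept = mu1 / denom
    return slope, intercept

for sigma_c in (0.0, 0.5):
    lines = [averaging_zroc(1.0, 1.25, sigma_c, n) for n in (1, 2, 4)]
    print(sigma_c, [(round(s, 3), round(b, 3)) for s, b in lines])
```

With no criterion noise the slope is identical at every ensemble size and only the intercept grows, by a factor of √n. With criterion noise the slope drifts toward 1 as n increases, and it is this pattern across ensemble sizes that allows the two variance components to be estimated separately.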
The fit of this model is compared to a simpler model that assumes no criterial variability:
Another possibility is that evidence is summed, rather than averaged, within an ensemble. In this case, the size of the ensemble scales both the signal mean and the stimulus variances, and the isosensitivity function assumes the form:
when criterion variance is nonzero and
We must also consider the possibility that our assumption of criterial invariance within an ensemble is wrong. If criterial variance is affected by ensemble size in the same purely statistical manner as is stimulus variance, then both stimulus and criterion variance terms are affected by n. Under these conditions, the model is:
Two aspects of this model are important. First, it can be seen that it is impossible to separately estimate the two sources of variability, because they can be combined into a single super-parameter. Second, as shown in Appendix B, this model reduces to the same form as Equation 5, and thus can fit the data no better than the zero criterial variability model. Consequently, if the zero criterial variability model is outperformed by the independent variability model, then we have supported the assumption that criterial variability is invariant across an ensemble.
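The reduction can be checked numerically with a sketch. It assumes that the model with both variance terms divided by n has slope sqrt((1 + σc²)/(σ1² + σc²)) and intercept √n μ1/sqrt(σ1² + σc²), and that the zero-criterial-variance averaging model has slope 1/σ1 and intercept √n μ1/σ1. Both forms, and the parameter mapping in `equivalent_tsd_parameters`, are reconstructions for illustration and are not taken from Appendix B.

```python
from math import sqrt

def scaled_criterion_zroc(mu1, sigma1, sigma_c, n):
    """zROC when evidence AND criterion variance are both divided by n (assumed form)."""
    denom = sqrt((sigma1 ** 2 + sigma_c ** 2) / n)
    return sqrt((1.0 + sigma_c ** 2) / n) / denom, mu1 / denom

def tsd_averaging_zroc(mu1, sigma1, n):
    """zROC for the averaging model with no criterion variance (assumed form)."""
    return 1.0 / sigma1, sqrt(n) * mu1 / sigma1

def equivalent_tsd_parameters(mu1, sigma1, sigma_c):
    """Hypothetical mapping: TSD parameters that mimic the scaled-criterion model."""
    scale = sqrt(1.0 + sigma_c ** 2)
    return mu1 / scale, sqrt(sigma1 ** 2 + sigma_c ** 2) / scale

mu_eq, sigma_eq = equivalent_tsd_parameters(1.0, 1.25, 0.5)
for n in (1, 2, 4):
    print(n, scaled_criterion_zroc(1.0, 1.25, 0.5, n), tsd_averaging_zroc(mu_eq, sigma_eq, n))
```

The two models trace the same line at every ensemble size, so the criterion term is absorbed into rescaled evidence parameters and cannot be estimated on its own.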
The OR model assumes that each stimulus within an ensemble is evaluated independently, and that subjects respond positively to a set if any one of those stimuli surpasses a criterion value (e.g., Macmillan & Creelman, 2005; Wickens, 2002). This is an important “baseline” against which to evaluate the information integration models because the interpretation of those models hinges critically on the assumption that the ensemble manipulation alters representational variability in predictable ways embodied by Equations 4 – 8. The OR model embodies a failure of this assumption: If subjects do not average or sum evidence across the stimuli in an ensemble, but rather evaluate each stimulus independently, then this multidimensional extension of the standard TSD model will provide a superior fit to the data.
The situation is simplified because the stimuli within an ensemble (and, in fact, across the entire study set) can be thought of as multiple instances of a common random variable. The advantage of this situation is apparent in Figure 4, which depicts the 2-dimensional TSD representation of the OR model applied to 2 stimuli. Here the strength distributions are shown jointly as density contours; the projection of the marginal distributions onto either axis represents the standard TSD case. Because the stimuli are represented by a common random variable, those projections are equivalent.
According to the standard TSD view, a subject provides a rating of r to a stimulus if and only if the evidence value yielded by that stimulus exceeds the criterion associated with that rating, Cr. Thus, the probability of at least one of n independent and identically distributed instances of that random variable exceeding that criterion is
The shaded portion of the figure corresponds to the bracketed term in Equation 9. Equivalently, the region of endorsement for a subject is above or to the right of the shaded area (which extends leftward and downward to -∞).
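The OR rule is simple to state computationally. The sketch below assumes normal evidence and n independent, identically distributed items, so that the probability that at least one item exceeds a criterion is one minus the probability that all n fall below it; the function name and the parameter values are hypothetical.

```python
from statistics import NormalDist

def or_rule_p_exceed(criterion, mu, sigma, n):
    """P(at least one of n i.i.d. N(mu, sigma) evidence values exceeds the criterion)."""
    p_single_below = NormalDist(mu, sigma).cdf(criterion)
    # Independence lets the joint probability that all n fall below factor into a power.
    return 1.0 - p_single_below ** n

for n in (1, 2, 4):
    far = or_rule_p_exceed(0.5, 0.0, 1.0, n)    # unstudied ensembles
    hr = or_rule_p_exceed(0.5, 1.0, 1.25, n)    # studied ensembles
    print(n, round(far, 3), round(hr, 3))
```

Because both hit and false-alarm rates rise with n under this rule at a fixed criterion, its predictions across ensemble sizes differ from those of the averaging and summation models.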
Details of the model-fitting procedure are provided in Appendix D.
The performance of the models is shown in Table 2, which indicates AICc, Akaike weights, and number of individual subjects best fit by each model. It is clear that the superior fit was provided by ND-TSD with the restriction of equivalent criteria across ensemble conditions, and with the averaging rather than the summation process. That model provided the best fit (lowest AICc score) for more than 80% of the individual subjects, and had (on average across subjects) a greater than 80% chance of being the best model in the set tested. This result is consistent with the presence of criterial noise and additionally with the suggestion that subjects have a very difficult time adjusting criteria across trials (e.g., Ratcliff & McKoon, 2000).
A depiction of the fit of the winning model is shown in the top panel of Figure 5, in which it can be seen that ND-TSD provides a quite different conceptualization of the recognition process than does standard TSD (shown in the bottom panel). In addition to criterial variance, the variance of the studied population of items is estimated to be much greater relative to the unstudied population. This suggests that the act of studying words may confer quite substantial variability, and that criterial variance acts to mask that variability. The implications of this will be considered in the next major section.
When interpreted in the context of TSD, superior performance in one condition versus another, or as exhibited by one subject over another, is attributable either to a greater distance between the means of the two probability distributions or to lesser variability of the distributions. In ND-TSD, superior performance can additionally reflect lower levels of criterial variability. In this section, we outline several current and historical problems that may benefit from an explicit consideration of criterial variability. The first two issues we consider underlie current debates about the relationship between the slope of the isosensitivity function and theoretical models of recognition and of decision-making. The third section revisits the standoff between deterministic and probabilistic response models and demonstrates how decision noise can inform that debate. The fourth, fifth, and sixth issues address the effects of aging, the consequences of fatigue, and consider the question of how subjects make introspective remember/know judgments in recognition tasks. These final points are all relevant to current theoretical and empirical debates in recognition memory.
Understanding the psychological factors underlying the slope of the isosensitivity function has proven to be somewhat of a puzzle in psychology in general and in recognition memory in particular. Different tasks appear to yield different results: for example, recognition of odors yields functions with slopes ~1 (Rabin & Cain, 1984; Swets, 1986b), whereas recognition of words typically yields considerably shallower slopes (Ratcliff et al., 1992, 1994). That latter result is particularly important because it is inconsistent with a number of prominent models of recognition memory (Eich, 1982; Murdock, 1982; Pike, 1984). The form of the isosensitivity function has even been used to explore variants of recognition memory, including memory for associative relations (Kelley & Wixted, 2001; Rotello, Macmillan, & Van Tassel, 2000) and memory for source (Healy et al., 2005; Hilford et al., 2002).
One claim about the slope of the isosensitivity function in recognition memory is the constancy-of-slopes generalization, which owes to the pioneering work of Ratcliff and his colleagues (Ratcliff et al., 1992, 1994), who found that slopes were not only consistently less than unity, but also relatively invariant with manipulations of learning. Later work showed, however, that this may not be the case (Glanzer et al., 1999; Heathcote, 2003; Hirshman & Hostetter, 2000). In most cases, it appears as though variables that increase performance decrease the slope of the isosensitivity function (for a review, see Glanzer et al., 1999). This relation holds for manipulations of normative word frequency (Glanzer & Adams, 1990; Glanzer et al., 1999; Ratcliff et al., 1994), concreteness (Glanzer & Adams, 1990), list length (Elam, 1991, as reported in Glanzer et al., 1999; Gronlund & Elam, 1994; Ratcliff et al., 1994; Yonelinas, 1994), retention interval (Wais, Wixted, Hopkins, & Squire, 2006), and study time (Glanzer et al., 1999; Hirshman & Hostetter, 2000; Ratcliff et al., 1992, 1994).
These two findings—slopes of less than 1 and decreasing slopes with increasing performance—go very much hand in hand from a measurement perspective. Consider the limiting case, in which learning has been so weak and memory thus so poor, that discrimination between the old and new items on a recognition test is nil. The isosensitivity function must have a slope of 1 in both probability and normal-deviate coordinates in that case, because any change in criterion changes the HR and FAR by the same amount. As that limiting case is approached, it is thus not surprising that slopes move towards 1. The larger question in play here is whether the decrease in performance that elicits that effect owes specifically to shifting evidence distributions, or whether criterial variance might also play a role. We tackle this question below by carefully examining the circumstances in which a manipulation of learning affects the slope and the circumstances in which it does not.
The next problem we consider is why isosensitivity functions estimated from the rating task differ from those estimated by other means and whether such differences are substantive and revealing of fundamental problems with TSD. In doing so, we consider what role decision noise might play in promoting such differences, and also whether reports of the demise of TSD (Balakrishnan, 1998a) may be premature.
The first puzzle we will consider concerns the conflicting reports on the effects of manipulations of learning on the slope of the isosensitivity function. Some studies have revealed that the slope does not change with manipulations of learning (Ratcliff et al., 1992, 1994), whereas others have supported the idea that the slope decreases with additional learning or memory strength. While some models of recognition memory predict changes in slope (Gillund & Shiffrin, 1984; Hintzman, 1986) with increasing memory strength, others either predict unit slope (Murdock, 1982) or invariant slope with memory strength. This puzzle is exacerbated by the lack of entrenched theoretical mechanisms that offer a reason why the effect should sometimes obtain and sometimes not.
To understand the way in which criterion noise might underlie this inconsistency, it is important to note the conditions under which changes in slope are robust and the conditions under which they are not. Glanzer et al. (1999) reviewed these data and their results provide an important clue. Of the four variables for which a reasonable number of data were available (≥5 independent conditions), list length and word frequency manipulations clearly demonstrated the effect of learning on slope: shorter list lengths and lower word frequency led to higher accuracy and also exhibited a lower slope (in 94% of their comparisons). In contrast, greater study time and more repetitions led to higher accuracy but revealed the effect on slope less consistently (on only 68% of the comparisons).
To explore this discrepancy, we will consider the criterion-setting strategies that subjects bring to bear in recognition, and how different manipulations of memory might interact with those strategies. There are two details about the process of criterion setting and adjustment that are informative. First, the control processes that adjust criteria are informed by an ongoing assessment of the properties of the testing regimen. This may include information based on direct feedback (Dorfman & Biderman, 1971; Kac, 1962; Thomas, 1973, 1975) or derived from a limited memory store of recent experiences (Treisman, 1987; Treisman & Williams, 1984). In either case, criterion placement is likely to be a somewhat noisy endeavor until a steady state is reached, if it ever is. From the perspective of these models, it is not surprising that support has been found for the hypothesis that subjects set a criterion as a function of the range of experienced values (Parducci, 1984), even in recognition memory (Hirshman, 1995). These theories have at their core the idea that recognizers home in on optimal criterion placement by assessing, explicitly or otherwise, the properties of quantiles of the underlying distributions. Because this process is subject to a considerable amount of irreducible noise (for example, from the particular order in which early test stimuli are received), decision variability is a natural consequence. To the degree that criterion variability is a function of the range of sampled evidence values (cf. Nosofsky, 1983), criterion noise will be greater when that range is larger.
The second relevant aspect of the criterion-setting process is that it takes advantage of the information conveyed by the individual test stimuli. A stimulus may reveal something about the degree of learning a prior exposure would have afforded it, and subjects appear to use this information in generating an appropriate criterion (Brown et al., 1977). Such a mechanism has been proposed as a basis for the mirror effect (Benjamin, 2003; Benjamin, Bjork, & Hirshman, 1998; Hirshman, 1995), and, according to such an interpretation, reveals the ability of subjects to adjust criteria on an item-by-item basis in response to idiosyncratic stimulus characteristics. It is noteworthy that within-list mirror effects are commonplace for stimulus variables, such as word frequency, meaningfulness, and word concreteness (Glanzer & Adams, 1985; 1990), but typically absent for experimental manipulations of memory strength, such as repetition (Higham et al., in press; Stretch & Wixted, 1998) or study time (Verde & Rotello, 2007). This difference has been taken to imply that recognizers are not generally willing or able to adjust criteria within a test list based on an item's perceived strength class.
In fact, the few examples of within-list mirror effects arising in response to a manipulation of strength are all ones in which the manipulation provided for a relatively straightforward assignment to strength class, including variable study-test delay (Singer, Gagnon, & Richards, 2002; Singer & Wixted, 2006) and the use of stimuli that were associatively categorized (Benjamin, 2001; Starns, Hicks, & Marsh, 2006). Similar within-list manipulations of strength tied to color (Stretch & Wixted, 1998) or list half at test (Verde & Rotello, 2007) were unsuccessful, supporting the view that the relationship between the strength manipulation and the stimulus must be extremely transparent in order to support explicit differentiation by the subject.
What does this imply for the placement and maintenance of criteria across conditions that vary in discriminability? When the burden of assigning a test stimulus to a subclass falls on the recognizer, they will often forgo that decision. In that case, they will accumulate information on a single class of “old” items as they sample from the test stimuli. However, when the task relieves the subject of this burden, either by dividing up the discriminability classes between subjects or between test lists, or by using stimuli that carry with them inherent evidence as to their appropriate class and likely discriminability, then the subject may treat as separate the estimation of range for the different classes.
The effects of these strategic differences can be seen in Figures 6 and 7. As shown in Figure 6, if increases in discriminability lead to increased stimulus variance and criterion noise is constant, the slope of the isosensitivity function should always decrease when conditions afford superior memory discrimination. This is shown in Boxes B and C. However, as criterion variance increases, the effect of stimulus variance becomes less pronounced (as can be seen by comparing the two boxes). Consider the effect of a manipulation of memory on the slope:
in which the subscripts 1 and 2 denote the two levels of the manipulated variable, with level 2 being the condition with superior performance and greater stimulus variability. Under these conditions, it is easy to see that the value of this effect must be either 0 or positive. That is, if stimulus variance increases with discriminability, then the condition with greater discriminability must have a lower slope. The inconsistency in the literature must then come from the effect of those variables on criterion variability, which can attenuate the magnitude of the difference. When test stimuli are not successfully subclassified, then criterion variance reflects the full range of the old stimuli, rather than the ranges of the individual classes.
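The effect on slope discussed above can be written out explicitly. The following is a hedged reconstruction rather than a quotation of the original display equation: it assumes that criterion variance adds to the variance of each evidence distribution, that the unstudied distribution has unit variance, and it uses m as a hypothetical symbol for the slope of the isosensitivity function in normal-deviate coordinates.

```latex
m_i = \sqrt{\frac{1 + \sigma_C^2}{\sigma_i^2 + \sigma_C^2}}, \qquad
m_1 - m_2 = \sqrt{\frac{1 + \sigma_C^2}{\sigma_1^2 + \sigma_C^2}}
          - \sqrt{\frac{1 + \sigma_C^2}{\sigma_2^2 + \sigma_C^2}} \;\ge\; 0
\quad \text{whenever } \sigma_2 \ge \sigma_1 .
```

Under these assumptions the difference shrinks toward 0 as the criterion variance grows, because both slopes then approach 1.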
Figure 7 illustrates the decision milieu that yields these differential effects. When the set of old items is heterogeneous with respect to discriminability, but subjects do not discriminate between the strength classes, then the sampled range of criterion values reflects the full range of this mixture distribution, and the variability of the criterion will be great (shown in the bottom panel as the root-mean-square average of the criterion distributions in the top panel). When subjects do discriminate between the strength classes and sampled values from each class inform a unique criterion distribution (as shown in the top panel), then those two distributions will both be of lesser variability.
In both cases, criterion variance is constant across stimulus classes, and the net effect is a decrease in slope. This occurs because increasing stimulus variability is offset by a constant amount of criterion variance. However, criterion variance serves to effectively augment or retard the magnitude of the decrease. In the top panel, in which each item class specifies a unique criterion distribution, the lesser variability that accompanies the stimulus classes translates into a lesser amount of criterial variability. In this case, that lesser variability increases the degree to which stimulus variability yields an effect on slope.
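A small numerical sketch may help make the attenuation concrete. It assumes the slope of the isosensitivity function in normal-deviate coordinates is the ratio of the new-item to old-item standard deviations after criterion variance has been added to each; the function names and the particular standard deviations are illustrative assumptions, not estimates from the data.

```python
import math

def zroc_slope(sigma_old, sigma_c, sigma_new=1.0):
    """Slope of the isosensitivity function when criterion noise adds to both distributions."""
    return math.sqrt((sigma_new ** 2 + sigma_c ** 2) / (sigma_old ** 2 + sigma_c ** 2))

def slope_effect(sigma_weak, sigma_strong, sigma_c):
    """Slope in the weaker condition minus slope in the stronger condition."""
    return zroc_slope(sigma_weak, sigma_c) - zroc_slope(sigma_strong, sigma_c)

# Hypothetical old-item standard deviations for a weak and a strong condition.
effect_low_noise = slope_effect(1.2, 1.6, sigma_c=0.5)   # criteria set separately per class
effect_high_noise = slope_effect(1.2, 1.6, sigma_c=2.0)  # one criterion for the whole mixture
```

With these illustrative values the slope difference is positive in both cases but is less than half as large when criterion noise is high, which is the pattern the argument above relies on.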
By this explanation, memory enhancing conditions that afford subclassification of the test stimuli with respect to discriminability should be more likely to yield an effect on slope than variables that are opaque with respect to discriminability. Now we are in a position to reconsider the empirically studied variables enumerated earlier. Variables that are manipulated between-subjects or between-lists require no subclassification within a test list, and should thus provide for relatively easy assignment at test. Of the four variables mentioned earlier, only list length is always studied between list (by definition). In addition, variables for which the discriminability class is inherent to the stimulus itself should also afford subclassification. Word frequency is the only member of this category from that list.
The other two variables, repetition and study time, are the paradigmatic examples of manipulations that do not routinely afford such subclassification. An encounter with a single test word reveals nothing about whether it was probably repeated or whether it was probably studied for a long duration, other than through the evidence it yields for having been studied at all. And, consistent with the explanation laid out here, these are the very variables for which the effects of discriminability on slope are less consistently observed.
To summarize, manipulations that encourage easy allotment of test items into discriminability classes are likely to promote lesser criterion variability, and are thus less likely to mask the underlying decrease in the slope of the isosensitivity function generated by increasing stimulus variability.
Another recent important result that is somewhat vexing from the standpoint of TSD is the lack of invariance in the shape of the isosensitivity function when estimated under different biasing or payoff conditions (Balakrishnan, 1998a; Van Zandt, 2000). This failure has led theorists to question some of the basic tenets of TSD, such as the assumption that confidence ratings are scaled from the evidence axis (Van Zandt, 2000) or, even more drastically, that stimulus distributions are not invariant with manipulations of bias (Balakrishnan, 1998b, 1999). Both suggestions do serious violence to the application of TSD to psychological tasks, and rating tasks in particular, but several theorists have defended the honor of the venerable theory (Rotello & Macmillan, 2008; Treisman, 2002). Of particular interest here, a recent report by Mueller and Weidemann (2008) postulates criterial noise as a source of the failed invariance. Mueller and Weidemann demonstrated that criterial noise can account for the lack of invariance under a bias manipulation using their Decision Noise Model, which is similar in spirit to (but quite different in application from) ND-TSD.
ND-TSD can also explain such effects quite simply. Figure 6 shows the joint effects of stimulus and criterial variability on the slope of the isosensitivity function. A manipulation of bias is presumed not to affect either the location or shape of the evidence distributions (cf. Balakrishnan, 1999), and should consequently have no effect on slope. The predictions of TSD are indicated by the darkest (bottom) line; any point on that line is a potential slope value, and it should not change with a manipulation of bias. However, if criterion variability changes with bias, then the slope of the function can vary along a contour of constant stimulus variance, such as shown by Box A. Such an interpretation presumes that criterial variance itself varies with the bias manipulation; why might this be?
First, criterion variance might increase with increasing distance from an unbiased criterion. This could be true because placing criteria in such locations is uncommon or unfamiliar, or simply because the location value is represented as a distance from the intersection of the distributions. A magnitude representation of distance would exhibit scalar variability and thus imply greater criterion variance with more biased criterion locations. The model of Mueller and Weidemann achieves this effect by imposing greater variability on peripheral confidence criteria than on the central yes/no criterion; such a mechanism is neither included in nor precluded by ND-TSD. Similarly, criteria may exhibit scalar variability with increasing distance from the mean of the noise distribution. This assumption is supported somewhat by results that indicate that criterion noise increases with stimulus range in absolute identification tasks (Nosofsky, 1983).
In sum, if the variance of criteria scales with the magnitude of those criteria, manipulations of bias may be incorrectly interpreted as reflecting changes in the stimulus distributions. This does not reflect a fundamental failing of TSD, but rather reveals conditions in which ND-TSD is necessary to explain the effects of decision noise on estimated isosensitivity functions.
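The following sketch illustrates this point under strong simplifying assumptions: equal-variance evidence distributions (new items with mean 0, old items with mean d', both with unit variance) and a hypothetical scalar-variability rule in which the standard deviation of each criterion is proportional to its distance from the mean of the noise distribution. The proportionality constant `k` and the set of criterion locations are arbitrary choices for illustration.

```python
import math

def z_points(criteria, d_prime, k):
    """Return (zFAR, zHR) pairs when each criterion c has SD k * |c| (hypothetical rule)."""
    points = []
    for c in criteria:
        s = math.sqrt(1.0 + (k * abs(c)) ** 2)  # evidence SD of 1 plus criterion noise
        points.append((-c / s, (d_prime - c) / s))
    return points

def ols_slope(points):
    """Least-squares slope of zHR regressed on zFAR."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

criteria = [-0.5, 0.0, 0.5, 1.0, 1.5]
slope_no_noise = ols_slope(z_points(criteria, d_prime=1.0, k=0.0))
slope_scalar = ols_slope(z_points(criteria, d_prime=1.0, k=0.5))
```

Without criterion noise the fitted slope is 1, as standard TSD requires for equal-variance distributions. With scalar criterion noise the fitted slope departs from 1 even though the stimulus distributions are unchanged, so the estimated slope depends on which criterion locations a bias manipulation happens to sample.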
In earlier flashpoints over decision rules in choice tasks, some theorists suggested that the rule may be probabilistic, rather than deterministic in form (e.g., Luce, 1959; Nachmias & Kocher, 1970; Parks, 1966; Thomas & Legge, 1970). From the perspective of TSD, the evidence value is compared to a criterion value and a decision is made based on their ordering. This strategy leads to optimal performance, either in terms of payoff maximization or maximal number of correct responses, when that criterion is based on the likelihood ratio (Green & Swets, 1966). Regardless of how the criterion is placed, and whether it is optimal or not, this is a deterministic response rule, and differs from a probabilistic response rule, by which the value of the likelihood ratio, or a transformation thereof, is continuously related to the probability of a particular response.
There was a tremendous amount of research devoted to the resolution of this question in the 1960s and 1970s, in part because TSD made such a forceful claim that the rule was deterministic. A convincing answer was not apparent, however: The strong implications of static criteria were rejected by the data reviewed above, including sequential dependencies and changes in the slope of the isosensitivity function. Improvements in sensitivity over the course of individual tasks (e.g., Gundy, 1961; Zwislocki, Marie, Feldman, & Rubin, 1958) also suggested the possibility of increasingly optimal or perhaps decreasingly variable criteria. In some tasks, the prediction of a binary cutoff in response probability that followed from deterministic theories was confirmed (Kubovy, Rapaport, & Tversky, 1971) and in other tasks that prediction was disconfirmed (Lee & Janke, 1965; Lee & Zentall, 1966). In still others, data fell in a range that was not naturally predicted by either a binary cutoff or one of the probabilistic viewpoints reviewed below (Lee & Janke, 1964). Cutoffs appeared to be steeper when discriminability was greater (Lee & Zentall, 1966), suggesting that subjects may employ cutoffs within a range of evidence and use alternate strategies when the evidence less clearly favors one choice or the other (Parducci & Sandusky, 1965; Sandusky, 1971; Ward, 1973; Ward & Lockhead, 1971).
The evidence in favor of probabilistic models was mixed as well. The most general prediction of probabilistic models of decision-making is that the probability of an endorsement varies with the evidence in favor of the presence of the to-be-endorsed stimulus. Whereas it is optimal to respond “old” to a recognition test stimulus when the evidence in favor of that stimulus actually having been studied outweighs the evidence that it was not (assuming equal priors and payoffs), probabilistic models suggest that the weight of that evidence determines the probability of an “old” response. Evidence from individual response functions in tasks that minimized sources of variability revealed sharp cutoffs (Kubovy et al., 1971). Simple probabilistic models failed to account for that result, but accounted well for performance in other tasks (Schoeffler, 1965), including recognition memory (Parks, 1966) and a wide variety of higher-level categorization tasks (Erev, 1998).
A partial reconciliation of these views came in the form of deterministic dynamic-criterion models (Biderman, Dorfman, & Simpson, 1975; Dorfman, 1973; Dorfman & Biderman, 1971; Kac, 1962, 1969), in which the criterion varied systematically from trial to trial based on the stimulus, response, and outcome. These models outperformed models with static criteria (Larkin, 1971; Dorfman & Biderman, 1971) but did not account for a relatively large amount of apparently nonsystematic variability (Dorfman, Saslow, & Simpson, 1975). Similar models were proposed with probabilistic responding (Larkin, 1971; Thomas, 1973), but were never tested against dynamic-criterion models with deterministic responding.
Many of the dynamic-criterion models made the prediction that responding would exhibit probability matching (or micromatching; Lee, 1963); that is, that the probability of a positive response would asymptotically equal the a priori probability of a to-be-endorsed stimulus being presented (Creelman & Donaldson, 1968; Parks, 1966; Thomas & Legge, 1970). Such theories also met with mixed results: although there were situations in which probability matching appeared to hold (e.g., Lee, 1971; Parks, 1966), time-series analysis revealed overly conservative response frequencies (to be reviewed in greater detail below) and poor fits to individual subjects (Dusoir, 1974; Norman, 1971). Kubovy and Healy (1977) even concluded that dynamic-criterion models that employed error correction were mostly doomed to fail because, empirically, subjects appeared to shift criteria after both correct and incorrect responses, an effect that was inconsistent with the majority of models. They also claimed that models of the “additive-operator” type—in which the direction of criterion change following a correct response combination was predicted to be constant—were wrong, because subjects appeared to be willing to shift their criterion in either direction, depending on the exact circumstances. Here we have explicitly avoided theorizing about the nature of systematic changes in criterion so as to be able to more fully examine the role of nonsystematic noise on the response function and thus on recognition performance. Yet it can be shown that criterial noise naturally and simply leads to conservative shifts of criteria in response to manipulations of base rates of signal and noise events.
An important result in tasks in which base rates are manipulated is the excessively conservative response of criteria to manipulations of the base rates of events. Overall, experiments have revealed mixed effects of base rate manipulations: although, in some tasks, subjects appear acutely sensitive to prior probabilities (Kubovy & Healy, 1977; Swets, Tanner, & Birdsall, 1961), even in recognition memory (Healy & Kubovy, 1978), those shifts typically are lesser in magnitude than predicted either under an optimal deterministic decision rule (Green & Swets, 1966) or under the more conservative prediction of probability matching (Thomas, 1975). Other data suggested that subjects in recognition memory experiments did not modulate their criteria at all when the base rates were shifted across blocks (Healy & Jones, 1975; Healy & Kubovy, 1977).
The general conservatism of criterion placement has been attributed to, variously, unwillingness to abandon sensory (or mnemonic) evidence in favor of base rates (Green & Swets, 1966), failure to appreciate the proper form of the evidence distributions (Kubovy, 1977), inaccurate estimation of prior probabilities (Galanter, 1974; Ulehla, 1966), or to probability matching (although this latter view was eventually rejected by the data discussed in the previous section). In this next subsection, we show that either probability matching or optimality perspectives can predict conservatism when criterial variability is explicitly accounted for. Likewise, we shall see that criterial variability can mimic probabilistic response selection.
The most fundamental effect of the addition of criterial noise is to change the shape of the response function—that is, the function relating evidence to response. Here we consider the form of response functions in the presence of criterial variability and evaluate the exact effect of that variability on the specific predictions of optimality views (Green & Swets, 1966) and probability matching (Parks, 1966). We show that (a) probabilistic response functions are not to be distinguished from deterministic functions with criterial variability, and that (b) conservatism in criterion shifts in response to manipulations of base rates is a natural consequence of criterial variability (for more general arguments about mimicry between deterministic and probabilistic response functions, see Marley, 1992; Townsend & Landon, 1982). The goal of these claims is to show how criterial variability can increase the range of results that fall within the explanatory purview of TSD, and to demonstrate why previously evaluated benchmarks for the rejection of deterministic models may be inappropriate. Specifically, ND-TSD naturally accounts for (apparently) suboptimal response probabilities in response to manipulations of base rate. It does so successfully because, as shown below, a deterministic response rule in the presence of criterial noise can perfectly mimic a probabilistic rule (for similar demonstrations, see Ashby & Maddox, 1993; Marley, 1992; Townsend & Landon, 1982).
The deterministic response rule is to endorse a stimulus as “old” if the subjective evidence value (E) surpasses a criterion value c:
Treating c as an instance of the previously defined random variable for criterion, the response function conditional upon E is
Example response functions are shown in the left panel of Figure 8, in which increasingly bright lines indicate increasingly variable criteria. The function is, of course, simply the cumulative normal distribution of which the step function that is the traditional implication of TSD (shown in black) is the asymptotic form as $\sigma_c^2 \to 0$. This result is not surprising, but it is revealing, especially in comparison with the right panel of Figure 8, which depicts response functions for two purely probabilistic response rules. The first (darker) depicts Schoeffler's (1965) response rule, which is:
and the second (lighter) line depicts an even simpler rule relating the height of the signal distribution at E to the sum of the heights of the two distributions at E, or:
in which ϕ indicates the normal probability density function. In each of these cases, the resultant response function is also a cumulative normal distribution, thus showing that criterial variability can make a deterministic response rule perfectly mimic a probabilistic one. To be fair, the rules chosen here are simple ones and simplifying assumptions have been made with respect to the evidence distributions (with the latter rule, the evidence distributions have been set to be of equal variance). It is not our claim that there are not probabilistic rules that may be differentiated from deterministic rules with criterial noise, nor that there are no circumstances under which even these rules can be differentiated from one another. Rather, it is to demonstrate that a parameter governing criterial variability can produce a range of response functions, including ones that perfectly replicate the predictions of probabilistic rules. This result provides a new perspective on the phenomenon of conservatism seen in criterion-setting, as we will review below.
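The deterministic half of this argument is easy to check numerically. The sketch below assumes that the criterion on each trial is an independent draw from a normal distribution with mean `mu_c` and standard deviation `sigma_c` (names chosen here for illustration), applies the deterministic rule of responding "old" whenever the evidence exceeds the sampled criterion, and compares the resulting response proportions with the cumulative normal evaluated at the evidence value. It does not implement either of the probabilistic rules.

```python
import random
from statistics import NormalDist

def p_old_analytic(evidence, mu_c, sigma_c):
    """P("old" | E) under a normally distributed criterion: the normal CDF at E."""
    return NormalDist(mu_c, sigma_c).cdf(evidence)  # requires sigma_c > 0

def p_old_simulated(evidence, mu_c, sigma_c, trials=100_000, seed=1):
    """Apply the deterministic rule E > c with a freshly sampled criterion on every trial."""
    rng = random.Random(seed)
    n_old = sum(evidence > rng.gauss(mu_c, sigma_c) for _ in range(trials))
    return n_old / trials

# Compare the two at a few evidence values for a hypothetical criterion distribution.
comparison = {
    e: (p_old_analytic(e, 0.5, 1.0), p_old_simulated(e, 0.5, 1.0))
    for e in (-0.5, 0.5, 1.5)
}
```

As `sigma_c` shrinks, the analytic function steepens toward the step function of standard TSD; as it grows, the function flattens and a fixed rule produces graded response probabilities.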
The conservatism seen in responses to manipulations of base rate has been hypothesized to reflect either suboptimal criterion placement or a failure to accurately estimate the parameters of the decision regime, including the probability distributions or the actual base rates themselves. Here we show that conservatism is a natural consequence of criterial variability and arises with both optimal criterion placement and probability matching strategies.
Green and Swets (1966) showed that the optimal bias can be defined purely in terms of the stimulus base rates:
When the evidence distributions are not of equal variance, an optimal bias leads to two criteria. This fact is reflected in the nonmonotonicity at the margins of the isosensitivity function, or, equivalently, by the nonmonotonic relationship between evidence and the likelihood ratio throughout the scale. In any case, this issue falls outside the purview of our current discussion and need not concern us here. The effect of criterial variability can be amply demonstrated under the equal-variance assumption.
In the equal-variance case, the optimal criterion placement is a function of the optimal bias and the distance between the distributions:
in which d’ represents the distance between the evidence distributions scaled by their common standard deviation.
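For reference, the standard equal-variance expressions for these two quantities can be written as follows. This is a sketch of the textbook forms rather than a quotation of the original display equations; it assumes equal payoffs, a noise distribution centered at 0 with unit variance, and P_1 as the proportion of signal trials.

```latex
\beta_{\mathrm{opt}} = \frac{1 - P_1}{P_1}, \qquad
c_{\mathrm{opt}} = \frac{\ln \beta_{\mathrm{opt}}}{d'} + \frac{d'}{2}
```

When the base rates are equal, the optimal bias is 1 and the optimal criterion lies midway between the two means, at d'/2.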
Imagine that subjects place their criterion optimally according to this analysis, but fail to account for the presence and consequence of criterial noise. To evaluate that effect, we must consider first how criterial variability affects d’. To do so, remember that d’ = zHR – zFAR. Substituting terms from Equation 1b and setting $\sigma_1^2 = \sigma_0^2 = 1$,
where d’noisy indicates d’ under conditions of criterial variability.
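A hedged reconstruction of this relationship, assuming that normally distributed criterion noise with variance σ_C² adds independently to each unit-variance evidence distribution, is:

```latex
d'_{\mathrm{noisy}} = z\mathrm{HR} - z\mathrm{FAR}
                    = \frac{d'}{\sqrt{1 + \sigma_C^2}}
```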
This relationship indicates that d’ will be overestimated in computing optimal criterion placement, and that this overestimation will worsen with increasing criterial noise. What effect does this have on the overall rate of positive responding? That relationship is shown in the left panel of Figure 9, which plots the deviation of overall “yes” rate from the predicted rate of a “semi-ideal” decision-maker—that is, one that is ideal except insofar as it fails to appreciate its own criterial noise. These values were computed by assessing the rate of positive responding (for to-be-endorsed and to-be-rejected stimuli) at the semi-ideal criterion for varying base rates (for d’ = 1), and then comparing that value to the rate of responding with added criterial noise. As noted by Thomas and Legge (1970), the effect of criterial variability is to lead to the appearance of nonoptimal criterion placement. The employed criterion is optimal from the perspective of the information available in the task, but nonoptimal in that it fails to account for its own variability. The net effect is that low signal probabilities lead to a nonoptimally high rate of responding, and high signal probabilities lead to a nonoptimally low rate of responding. This result is the hallmark of conservatism.
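The computation described in this paragraph can be sketched analytically. The code below assumes equal-variance evidence distributions (noise mean 0, signal mean d', unit variance), equal payoffs, and independent normal criterion noise with standard deviation `sigma_c`; the function names are illustrative and the value of `sigma_c` is arbitrary. It places the criterion at its optimal location while ignoring criterion noise, and then compares the overall rate of positive responding with and without that noise.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

def optimal_criterion(p_signal, d_prime):
    """Equal-variance optimal criterion for a given signal base rate (equal payoffs)."""
    beta = (1.0 - p_signal) / p_signal
    return math.log(beta) / d_prime + d_prime / 2.0

def yes_rate(criterion, p_signal, d_prime, sigma_c=0.0):
    """Overall probability of a positive response, with optional criterion noise."""
    s = math.sqrt(1.0 + sigma_c ** 2)
    hit_rate = PHI((d_prime - criterion) / s)
    false_alarm_rate = PHI(-criterion / s)
    return p_signal * hit_rate + (1.0 - p_signal) * false_alarm_rate

def conservatism(p_signal, d_prime=1.0, sigma_c=1.0):
    """Noisy yes rate minus the noise-free yes rate at the semi-ideal criterion."""
    c = optimal_criterion(p_signal, d_prime)
    return yes_rate(c, p_signal, d_prime, sigma_c) - yes_rate(c, p_signal, d_prime)

deviations = {p: conservatism(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
```

Under these assumptions the deviation is positive for low signal base rates, negative for high ones, and zero when the base rates are equal, which is the conservative pattern described above.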
According to the probability matching view, subjects aim to respond positively at the same rate as the positive signal is presented. Probability matching often predicts more conservative response behavior than does the optimality view presented above (Thomas & Legge, 1970). Let P1 be the proportion of signal trials and thus also the desired rate of positive responding (R1). Then,
Thomas and Legge (1970) pointed out that this function is not an isosensitivity function, but rather an isocriterion function: It describes the relationship between HR and FAR and that relationship's invariance with R1 as sensitivity varies. Thus, like the case above, we must assume a particular level of sensitivity in order to derive values for the HR and FAR. In addition, the relationship between R1 and μC is complex because of the integral over the normal distribution. In the simulation that follows, we selected the value for μC that minimized the deviation of the right-hand portion of Equation 11 from P1, assuming values of 1 for μ1 and σ1. Deviation from this model was estimated by simulating 1 million trials for each signal base rate from 0.1 to 0.9 (by steps of 0.1) and adding a variable amount of criterial noise on each trial. The results, shown on the right-hand side of Figure 9, indicate an effect similar to what was seen in the previous case: increasing criterial noise leads to increased conservatism.
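A comparable sketch for the probability matching case follows. It replaces the million-trial simulation with the closed-form response rate and uses bisection to find the criterion mean, so it is an analytic stand-in for the procedure described above rather than a reproduction of it. It assumes a noise distribution with mean 0 and unit variance, a signal distribution with mean and standard deviation of 1, and independent normal criterion noise; the function names are illustrative.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function

def response_rate(mu_c, p_signal, mu_1=1.0, sigma_c=0.0):
    """Overall rate of positive responding for a criterion with mean mu_c and SD sigma_c."""
    s = math.sqrt(1.0 + sigma_c ** 2)
    return p_signal * PHI((mu_1 - mu_c) / s) + (1.0 - p_signal) * PHI(-mu_c / s)

def matching_criterion(p_signal, mu_1=1.0, lo=-10.0, hi=10.0):
    """Bisect for the noise-free criterion whose response rate equals the signal base rate."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if response_rate(mid, p_signal, mu_1) > p_signal:
            lo = mid  # responding too often, so the criterion must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

def matched_rate_with_noise(p_signal, sigma_c, mu_1=1.0):
    """Response rate once criterion noise is added to the matching criterion."""
    return response_rate(matching_criterion(p_signal, mu_1), p_signal, mu_1, sigma_c)

rates = {p: matched_rate_with_noise(p, sigma_c=1.0) for p in (0.1, 0.5, 0.9)}
```

With criterion noise added, the response rate is pulled toward 0.5 from either side, and the pull grows with the amount of noise, which mirrors the increased conservatism reported for the right-hand side of Figure 9.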
Criterion noise within a deterministic decision framework (TSD) can thus account for results that have been proposed to reveal probabilistic responding. These demonstrations in and of themselves do not reveal the superiority of deterministic theories, but they do suggest that such data are not decisive for either viewpoint and, in doing so, call attention to the large additional body of evidence in support of deterministic decision theories.
Many of the current battles over the nature of the information that subserves recognition decisions are waged using data that compare age groups. For example, it has been proposed that elderly subjects specifically lack recollective ability (Jacoby, 1991; Mandler, 1980) but enjoy normal levels of familiarity. Evidence for this two-component theory of recognition comes from age-related dissociations in performance as well as differences between younger and older subjects in the shape of the isosensitivity function (Howard et al., 2007; Yonelinas, 2002). However, the role of criterial variability has never been considered.
Two general sources of differences between age groups in criterial maintenance are possible. First, those mechanisms and strategies that govern the evolution of criterion placement over time may differ between young and older subjects, perhaps leading to differences in variability of that placement over the course of the experiment. Such a finding would be fascinating in that it would provide an example of how higher-level cognitive strategic differences play out in terms of performance on very basic tests of memory (cf. Benjamin & Ross, 2008). Alternatively, it might be the case that maintenance of criterion is simply a noisier process in the elderly, perhaps attributable to one of the very problems with which it can be confused in the elderly, namely memory (Kester, Benjamin, Castel, & Craik, 2002), and that recognition suffers as a result.
Empirically, the results are as one might expect if older adults exhibit greater criterion variability. The slope of the isosensitivity function is greater for older than younger subjects on tasks of word recognition (Kapucu, Rotello, Ready, & Seidl, in press), picture recognition (Howard et al., 2006), and associative recognition (Healy et al., 2005). The wide variety of materials across which this age-related effect obtains is suggestive of a quite general effect of aging on criterion maintenance, rather than a strategic difference between the age groups. These studies have not attempted to separate the effects of criterion and stimulus variability, and these results are consistent with but not uniquely supportive of greater decision noise in the elderly. Future work is necessary to isolate these effects within older subjects.
TSD is often used to evaluate whether fatigue affects performance on a detection task over time (Galinsky, Rosa, Warm, & Dember, 1993; cf. Dobbins, Tiedmann, & Skordahl, 1961), or, conversely, whether improvements in sensitivity are evident with increasing practice (Gundy, 1961; Trehub, Schneider, Thorpe, & Judge, 1991; Zwislocki et al., 1958). Traditional interpretations of such effects attribute fatigue-related decrements to increasing stimulus noise and practice-related improvement to increasing criterion optimization, but such dramatically differing interpretations of these related effects are not compelled by the data. They reflect a tacit but intuitive belief that maintenance of criteria is not demanding and thus not subject to fatigue. The purely perceptual part of detection tasks is assumed to be similarly undemanding and thus not likely to show much improvement with practice.
Consideration of criterial variability provides an alternate theoretical rationale that can unite these findings: decrements arise with time when fatigue increases the difficulty of criterial maintenance, and improvements arise when practice decreases the effects of noise on criterion localization. Such a statement should not be confused with an articulated psychological theory of such effects, but it is an alternative theoretical mechanism that such theories might profitably take advantage of in substantively addressing these and related results.
There is currently a vigorous debate over judgments that subjects provide about the phenomenological nature of their recognition judgments and whether those judgments validly represent different sources of evidence (Gardiner & Gregg, 1997; Gardiner, Richardson-Klavehn, & Ramponi, 1998) or two criteria applied to a single continuous evidence dimension (Benjamin, 2005; Donaldson, 1996; Dunn, 2004; Hirshman & Master, 1997; Wixted & Stretch, 2004). The latter view is consistent with the received version of unidimensional TSD with multiple criteria, just as in the ratings task discussed previously at length, whereas the former view specifies additional sources of evidence beyond those captured in a single evidence dimension. Which view is correct is a major question for theorists of recognition memory, and whether these phenomenological judgments of “remember” and “know” status indicate multiple states or multiple criteria has become a major front in that battle. That debate is peripheral to the present work and will not be reviewed here. We do consider how criterial variability might influence interpretation of data relevant to that debate, however.
Some authors have cited differences between the slope of the isosensitivity function estimated from remember/know judgments and the slope estimated from confidence ratings as evidence against the unidimensional view of remember/know judgments (Rotello, Macmillan, & Reeder, 2004), whereas others have disputed this claim (Wixted & Stretch, 2004). In a large meta-analysis, Rotello et al. (2004) examined slopes for isosensitivity functions relating remember responses to overall rates of positive responses, and found a greater slope for such R-O isosensitivity functions. Wixted and Stretch (2004) explained this result thus:
“...the evidence suggests that the location of the remember criterion exhibits item-to-item variability with respect to the confidence criteria...if the remember criterion varies from item to item, the slope of the [isosensitivity function] would increase accordingly.” (p. 627)
Although not described in the same framework that we provide here, the astute reader will recognize a claim of criterial variability analogous to our earlier discussion. If the judgments in the remember/know paradigm are subject to greater variability than the judgments in a confidence rating procedure, then the slope of the isosensitivity function will be closer to 1 for the remember/know function than for the confidence function. If the slope of the confidence function is less than 1, as it typically is, then the additional criterial variability associated with remember/know judgments will increase the slope of the function. This is exactly the result reported by Rotello et al. (2004).
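The direction of this effect can be illustrated numerically. The sketch below assumes that the slope of the isosensitivity function in normal-deviate coordinates under ND-TSD is the ratio of the total (stimulus plus criterial) standard deviations of the noise and signal distributions, with the noise distribution fixed at unit variance. That form is an assumption drawn from the model as described here, and `zroc_slope` is a hypothetical name.

```python
from math import sqrt


def zroc_slope(sigma_1, sigma_c):
    """Assumed ND-TSD slope in normal-deviate coordinates (noise SD fixed at 1)."""
    return sqrt(1.0 + sigma_c ** 2) / sqrt(sigma_1 ** 2 + sigma_c ** 2)


# A signal SD above 1 gives a slope below 1 when there is no criterial noise;
# adding criterial noise moves the slope toward 1.
for sigma_c in (0.0, 0.5, 1.0, 2.0):
    print(sigma_c, round(zroc_slope(1.25, sigma_c), 3))
```

When the signal standard deviation exceeds 1, each increase in `sigma_c` raises the slope toward, but never beyond, 1.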
This interpretation is further borne out by recent studies that empirically assessed the variability in the location of the remember/know criterion. Models of remember/know judgments that allowed for variability in the criterion lying between “know” and “remember” judgments outperformed models that did not (Dougal & Rotello, 2007; Kapucu et al., in press). The heady controversy underlying the use of the R/K procedure may thus reflect the consequences of unconsidered noise in the decision process.
This paper has questioned a very basic assumption of the Theory of Signal Detection: that the criterion value is a stable, stationary value. We have put forward theoretical arguments based on the psychological burden of maintaining criteria and reviewed empirical evidence that suggests the presence of criterial noise, including comparisons of different procedures for estimating isosensitivity functions, systematic effects of experimental manipulations on criteria, and a long-standing debate over whether the response rule is probabilistic or deterministic. Criterial noise makes these two candidates indistinguishable, and naturally accounts for the conservatism in response shifts that is ubiquitous in experiments that manipulate signal base rates. In addition, we have argued that the isosensitivity function can only be used to test theories of recognition if criterial variability is presumed to be negligible.
In the second half of the paper, we used the task of ensemble recognition to tease apart the effects of criterial and stimulus noise across a manipulation of learning, and showed that criterial noise can be quite substantial. Given the empirical variability in estimates of slopes of isosensitivity functions across conditions (Swets, 1986b), and the lack of a strong theory that naturally accounts for such inconsistency, it may be useful to consider criterion noise as a meaningful contributor to the shape of the isosensitivity function, and to detection, discrimination, and recognition more generally.
We have considered at some length the psychological implications of this claim. The effects of learning on detection, discrimination, and recognition tasks have always been interpreted in terms of shifting evidence distributions. Primarily, distributions are thought to overlap less under conditions of superior memory, but hypotheses regarding the relationship of their shape to performance have also recently been discussed (DeCarlo, 2002; Hilford et al., 2002). The specifics of that shape have even been used to test the assumptions of competing models of the nature of recognition judgments (Heathcote, 2003; Yonelinas, 1999). Here we have argued that learning may also influence the variability of criteria, and that superior performance may in part reflect greater criterion stability. This explanation does not deemphasize the role of encoding and retention of stimuli as a basis for recognition performance, but allows for task-relevant expertise over the course of the test to play an additional role.
Finally, we reviewed a set of problems that the postulate of criterion noise might help provide new solutions for, including the inconsistency of manipulations of learning on the slope of the isosensitivity function, discrepancies between procedures used to estimate such functions, the effects of prior odds on shifts in response policy, the nature of remember/know judgments in recognition, the effects of fatigue on judgment tasks of vigilance, and the effects of aging on recognition. This is a small subset of areas in which decision noise is relevant, but illustrates the dilemma: accurate separation of the mnemonic aspects of recognition from the decision components of recognition relies on valid assumptions about the reliability, as well as the general nature, of the decision process. We have provided evidence that variability in this process is important, is apparent, and undermines attempts to use TSD as a general means of evaluating models of recognition. ND-TSD reconciles the powerful theoretical machinery of TSD with realistic assumptions about the fallibility of the decision process.
This work was supported by grant R01 AG026263 from the National Institutes of Health. We offer many thanks for useful commentary and suggestions to the Human Memory and Cognition Lab at UIUC (http://www.psych.uiuc.edu/~asbenjam/), John Wixted, and Gary Dell.
The value of de can be derived from the geometry of the isosensitivity space, analogously to how da is derived in the body of the paper. We consider the point at which the isosensitivity function must intersect with a line of inverse slope through the origin:
De is the point at which Equation 2 and this line meet, which is
The distance from the origin to this point is:
This value is again scaled by √2 (see text for details):
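As a sketch of the algebra behind these steps, assume that Equation 2 is the ND-TSD isosensitivity function in normal-deviate coordinates, with a noise distribution of mean 0 and unit variance, a signal distribution with mean μ1 and variance σ1², and criterial variance σC². The intersecting line is taken to be the minor diagonal. These forms are reconstructed from the definitions used in the text and should be read as assumptions, not as a transcription of the original equations.

```latex
\begin{align*}
z_{HR} &= \frac{\mu_1}{\sqrt{\sigma_1^2+\sigma_C^2}}
        + \frac{\sqrt{1+\sigma_C^2}}{\sqrt{\sigma_1^2+\sigma_C^2}}\; z_{FAR}
        && \text{(assumed form of Equation 2)} \\
z_{HR} &= -z_{FAR}
        && \text{(line through the origin)} \\
\left(z_{FAR},\, z_{HR}\right) &=
        \left(\frac{-\mu_1}{\sqrt{\sigma_1^2+\sigma_C^2}+\sqrt{1+\sigma_C^2}},\;
              \frac{\mu_1}{\sqrt{\sigma_1^2+\sigma_C^2}+\sqrt{1+\sigma_C^2}}\right)
        && \text{(point of intersection)} \\
\sqrt{z_{FAR}^2+z_{HR}^2} &=
        \frac{\sqrt{2}\,\mu_1}{\sqrt{\sigma_1^2+\sigma_C^2}+\sqrt{1+\sigma_C^2}}
        && \text{(distance from the origin)} \\
d_e &= \frac{2\,\mu_1}{\sqrt{\sigma_1^2+\sigma_C^2}+\sqrt{1+\sigma_C^2}}
        && \text{(after scaling by } \sqrt{2}\text{)}
\end{align*}
```

Under these assumptions, setting σC = 0 gives 2μ1 / (σ1 + 1), the familiar noise-free expression for de when the noise distribution has unit variance.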
In this section, we report the results of a series of simulations of the ensemble recognition task intended to assess the degree to which ND-TSD spuriously captures variability that reflects failures of basic assumptions of TSD, rather than criterion variability itself. Seven simulations are reported in which data were generated assuming a failure of distribution shape assumptions (Simulations 1 – 4), a failure of decision rule assumptions (Simulation 5) or no such failures (Simulations 6 – 7). With the exception of Simulations 5 and 7 (as described in greater detail below), criterion variability was 0. Each simulation employed 30 old (signal) stimuli and 30 new (noise) stimuli per ensemble size (just as in the experiment) and 50 sim-subjects.
It has long been known that quite substantive departures from the assumption of normal distributions still lead to roughly linear isosensitivity functions in normal-deviate coordinates (Lockhart & Murdock, 1970). Such a finding leads to concern that the parameters yielded by a model may not accurately capture the underlying generating process, and that TSD may appear to be a good explanation of the underlying decision-making process when it is not. Here we ask whether failures of such distributional assumptions benefit ND-TSD, with the implication that the validity of ND-TSD would be undermined by providing a superior fit to data generated under alternative assumptions.
The first two models consider the possibility that the generating distributions are exponential in form. In Simulation 1, only the noise distribution is assumed to be exponential, and in Simulation 2, both distributions are assumed to be exponential. For both simulations, the rate parameter for the exponential noise distribution (λ) was set to 1. The signal distribution in Simulation 1 was normal with mean 1 and unit variance, and the signal in Simulation 2 was exponential with λ = 2. Criteria were set to reside at a constant proportion of the average of the distributions' means (.25, .50, .75, 1, and 1.25). Performance in conditions with ensembles of 2 and 4 was generated using the averaging rule.
Simulations 3 and 4 used mixture distributions for the signal distributions (DeCarlo, 2002). In both cases, the mixing parameter was 0.5. The simulations differed in the placement of the distributions; in Simulation 3 the signals were 2.5 d’ units apart from one another (d’1 = 0.5 and d’2 = 3.0), and in Simulation 4 they were 0.5 d’ units apart from one another (d’1 = 0.25 and d’2 = 0.75). The criteria were again set to a constant proportion of the average d’ values (.28, .48, .7, .91, and 1.21). Again, ensemble performance was generated using the averaging rule.
Simulation 5 investigates a case in which an alternative ensemble decision rule was used to generate the data. When criterion variance is 0, the summation rule reduces to the averaging rule, as can be seen by comparing Equations 5 and 7. Thus, for this simulation, σC was set to 0.8. The noise distribution was set to the standard normal distribution, and the signal distribution was set to be normal with a mean of 1 and a standard deviation of 1.4. Criteria were set at -0.28, 0.21, 0.7, 1.19, and 1.533 and multiplied by the relevant ensemble size for the multiple-item conditions.
In the final two simulations, the ability of TSD and ND-TSD to accurately account for data generated under their own assumptions was tested. In both, the noise distribution was the standard normal distribution. In Simulation 6, the signal distribution was normal with a mean of 1 and standard deviation of 1.25, and there was no criterion variability. In Simulation 7, the signal distribution standard deviation was 1.4, and the criterion standard deviation was 0.8. Combining the two sources of noise in this simulation yields approximately the same total amount of variance as in Simulation 6. In both cases, the criteria were set to -1.2, -0.5, 0.5, 1.5, and 2.2.
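To make the generating process concrete, the following sketch simulates rating frequencies for one sim-subject under the averaging rule, using the parameter values given for Simulation 7. It is an illustration under stated assumptions and not the original simulation code: in particular, it assumes that a single draw of criterial noise shifts all five criteria together on a given trial, which the text does not specify. The function name and the fixed random seed are hypothetical.

```python
import random
from bisect import bisect_right

CRITERIA = [-1.2, -0.5, 0.5, 1.5, 2.2]  # values given for Simulations 6 and 7


def simulate_ratings(n_trials, ensemble_size, mu, sigma, sigma_c, criteria, rng):
    """Return rating frequencies (len(criteria) + 1 bins) under the averaging rule."""
    counts = [0] * (len(criteria) + 1)
    for _ in range(n_trials):
        # Averaging rule: the evidence for an ensemble is the mean over its items.
        draws = [rng.gauss(mu, sigma) for _ in range(ensemble_size)]
        evidence = sum(draws) / ensemble_size
        # Assumption: one noise draw shifts every criterion by the same amount.
        shift = rng.gauss(0.0, sigma_c)
        # The rating is the number of (shifted) criteria that the evidence exceeds.
        rating = bisect_right(criteria, evidence - shift)
        counts[rating] += 1
    return counts


rng = random.Random(2009)  # arbitrary seed, for reproducibility only
for n in (1, 2, 4):
    old = simulate_ratings(30, n, 1.0, 1.4, 0.8, CRITERIA, rng)  # signal ensembles
    new = simulate_ratings(30, n, 0.0, 1.0, 0.8, CRITERIA, rng)  # noise ensembles
    print(n, old, new)
```

Setting `sigma_c` to 0 and the signal standard deviation to 1.25 gives the corresponding sketch for Simulation 6.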
Here we consider the performance of a number of models in fitting the data generated in the simulations described above. These are roughly the same models used to fit the actual data generated in the experiment in this paper. There are eight models that represent the full combination of three factors. Each model either had a criterion variance parameter (ND-TSD) or it did not (TSD). With the exception of Simulation 5, each model had either a full set of 15 criteria, five for each of the three ensemble conditions, or only 5 criteria (restricted set of criteria). Finally, each model used either the averaging or the summation decision rule. Details of the actual model-fitting are presented in Appendix D.
The results of the simulations are summarized in Table C.1. Across all four simulations in which distributional assumptions of TSD were violated (Simulations 1 – 4), TSD was much more likely than ND-TSD to achieve a superior fit. In addition, the models using an averaging rule and a restricted set of criteria outperformed their counterparts (as is appropriate, given the generating models). The lesson of these simulations is that the extra parameter provided by ND-TSD does not benefit that model in accounting for variability that derives from distributional failures of TSD.
In Simulation 5, we considered whether ND-TSD would benefit from an incorrect specification of the decision rule. Because the summation process leads to a major rescaling across ensemble sizes, it did not make sense to have equivalent criteria across ensembles. Instead, criteria were used that were a multiplicative constant across ensemble size. Thus, two additional models were fit (in the rightmost columns for Simulation 5) in which there were only five free parameters for criteria (like the restricted criteria models), but these were multiplied by the ensemble size (leading to 15 different criterion values). This was the generating model, and, as expected, it did outperform the other models (note that the two proportional criteria models together achieve an Akaike weight of 0.59). However, it is important to note that the simulation did not effectively recover the relatively small amount of criterion variance in the simulation (the ND-TSD version of the proportional model was outperformed by the TSD version). This result suggests that, to the degree that there is any biasing of model performance, it is towards the models without criterion variance.
The final two simulations conformed to the assumptions of TSD and ND-TSD respectively (under the assumptions of averaging and restricted criteria). As can be seen, the model fitting successfully recovered the original model with quite high Akaike weight scores.
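For readers unfamiliar with Akaike weights, the sketch below shows the standard way such weights are computed from a set of AIC values. Whether the fits reported here used AIC or a small-sample correction is not stated in this section, so the choice of input is an assumption, and the AIC values in the usage lines are made up purely for illustration.

```python
from math import exp


def akaike_weights(aic_values):
    """Convert AIC values for a set of candidate models into Akaike weights."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference from the best model.
    relative = [exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical AIC values for three candidate models (not from the paper).
weights = akaike_weights([100.0, 102.0, 110.0])
print([round(w, 3) for w in weights])
```

The weights sum to 1 across the candidate set, so the combined weight of a family of models (such as the two proportional criteria models) is the sum of its members' weights.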
Table C.1 column headings: Full set of criteria (Averaging rule, Summation rule); Restricted set of criteria (Averaging rule, Summation rule); Proportional criteria.
All models were fit simultaneously to the response frequencies of individual subjects for all three ensemble sizes. Parameters were determined using maximum-likelihood estimation, as detailed below.
The model predicts that the proportion of responses above the jth criterion, cj, for old items in ensemble size n is
and for new items is:
where etotal is the total amount of evidence yielded by the ensemble, μ1 is the mean of the signal distribution, σ1² is the signal variance, σc² is the criterial variance, and n is the ensemble size. From this formula, we can derive the predicted proportion of each rating, θj, on the confidence scale for each item type:
where θ1j is the proportion of the jth rating response for old items and θ0j is the proportion of the jth rating response for novel items, for j = 1...r where r is the number of ratings and c0 = -∞ and cr = +∞. The likelihood function for a set of parameters, μ1, σ1², σc², and cj for all j, given the data, xij for all i and j, is:
where i=0,1 indicates the new and old ensembles, respectively, Ni is the total number of the ith type of item, and xij is the frequency of the jth response to the ith item type. The parameter values were found that maximized the likelihood function for all three ensemble sizes jointly. Specifically, the joint likelihood function is the product of each of the three individual likelihood functions:
where Ljoint is the joint likelihood function and Ln is the likelihood of the parameters given the data from ensemble size n. Two different sets of parameters were fit for the criterial variance model. The first had a single set of criteria, cj where j=1,...,r, that was constrained to be the same for all three ensemble sizes. The second set of parameters had 3r criteria, cjn for j=1,...,r and n=[1,2,4], such that corresponding criteria in different ensemble sizes were free to differ. The optimal parameters were found using the mle function in Matlab, which implements a version of the Simplex algorithm.
The zero criterial variance model was fit by constraining σc² to be zero and maximizing the same likelihood function.
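The fitting procedure can be sketched in code as follows. The sketch assumes the averaging rule, a noise distribution with mean 0 and unit variance, and a total variance for an ensemble of size n equal to the stimulus variance divided by n plus the criterial variance. These forms are inferred from the parameter definitions above and are not a transcription of the original equations. The multinomial coefficient is omitted because it does not depend on the parameters. All function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def rating_probs(criteria, mean, variance, n, sigma_c):
    """Predicted proportion of each rating for an ensemble of size n (averaging rule)."""
    sd = sqrt(variance / n + sigma_c ** 2)  # assumed total standard deviation
    # Proportion above each criterion, padded with c0 = -inf and cr = +inf.
    above = [1.0] + [PHI((mean - c) / sd) for c in criteria] + [0.0]
    return [above[j] - above[j + 1] for j in range(len(above) - 1)]


def log_likelihood(counts_new, counts_old, criteria, mu_1, var_1, sigma_c, n):
    """Multinomial log-likelihood (up to a constant) for one ensemble size."""
    total = 0.0
    for counts, mean, var in ((counts_new, 0.0, 1.0), (counts_old, mu_1, var_1)):
        probs = rating_probs(criteria, mean, var, n, sigma_c)
        # A small floor guards against log(0) for empty predicted categories.
        total += sum(x * log(max(p, 1e-12)) for x, p in zip(counts, probs))
    return total


def joint_log_likelihood(data, criteria, mu_1, var_1, sigma_c):
    """Sum of log-likelihoods over ensemble sizes, with criteria shared across sizes."""
    return sum(
        log_likelihood(new, old, criteria, mu_1, var_1, sigma_c, n)
        for n, (new, old) in data.items()
    )
```

The product of the three likelihoods becomes a sum on the log scale. Maximizing `joint_log_likelihood` over μ1, σ1², σc², and the criteria could be handed to any simplex-style optimizer, for example `scipy.optimize.minimize` with `method="Nelder-Mead"` applied to the negated function. Fixing `sigma_c` at 0 gives the zero criterial variance model.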
The OR model suggests participants perform a criterion comparison individually on each item in the ensemble and then endorse the ensemble if ANY of the items surpass the criterion. The model implies that the probability of endorsing an ensemble is the logical OR of the probabilities of endorsing each word. Complementarily, the probability of not endorsing an ensemble is the logical AND of not endorsing the individual words. Assuming all the words in an ensemble have the same mean and variance, on average, the logical AND of the misses (or of the correct rejections) would be the probability of a miss (or correct rejection) raised to the ensemble size power. More formally, the probability of endorsing an ensemble of a given size at a given rating level is equal to one minus the probability of none of the individual items meeting or surpassing the criterion below that rating:
Using these probabilities, the predicted proportion of each rating, θij, for the OR model can be computed by plugging these probabilities into equations C1.1 and C1.2. The likelihood equations C2.1 and C2.2 can then be used to find the maximum likelihood estimators for this model. Like the criterial variance and zero criterial variance models, the OR model was also fit both using the same set of criteria across ensemble sizes and using a unique set of criteria for each ensemble size.
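The OR rule itself reduces to a one-line computation, sketched below. The per-item probability of exceeding a criterion is taken as an input, since how it is obtained depends on the distributional assumptions; the function name is hypothetical, and items are assumed to be independent with a common per-item probability, as in the description above.

```python
def or_rule_p_above(p_item_above, n):
    """Probability that at least one of n independent items exceeds the criterion."""
    # (1 - p_item_above) ** n is the probability that every item falls below it.
    return 1.0 - (1.0 - p_item_above) ** n


for n in (1, 2, 4):
    print(n, or_rule_p_above(0.5, n))
```

For a single-item ensemble the rule returns the per-item probability unchanged, and the ensemble endorsement probability grows with ensemble size for any per-item probability between 0 and 1.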
Benjamin, A. S., Diaz, M. L., & Wee, S. (in press). Signal detection with criterion noise: Applications to recognition memory. Psychological Review.
1More commonly, this theory is referred to as the Signal Detection Theory (e.g., Swets, 1964; Treisman, 1965). Here the alternative acronym TSD is preferred (see also Atkinson, 1973; Birdsall, 1956; Lockhart & Murdock, 1970; Tanner, 1959) in that it properly emphasizes the theory's relation to, but not isomorphism with, Statistical Decision Theory (Wald, 1950).
2Following the suggestion of Luce (1963), we use the term isosensitivity function instead of the more historically relevant but somewhat unintuitive label of Receiver (or Relative) Operating Characteristic (ROC). Throughout this paper, no change in terminology is used to indicate whether the isosensitivity function is plotted in probability or normal-deviate coordinates, other than to the relevant axes in figures, unless the distinction is relevant to that discussion.
3Criterion is used throughout to refer to the location of a decision threshold in the units of the evidence dimension (i.e., in terms of the values on the abscissa in Figure 1, which are typically standard deviations of the noise distribution). Bias—a term sometimes used interchangeably with criterion—refers specifically to the value of the likelihood ratio at criterion. In the discussion here, the distinction is often not relevant, in which case we will use the term criterion.
4Signal (studied status) and Noise (unstudied status) distributions are referred to by the subscripts 1 and 0 throughout. This notation ensures more transparent generality to situations involving more than two distributions, and can be thought of either as a dummy variable or as representing the number of presentations of the stimulus during the study phase.
5For an intuitive and thorough review of the geometry underlying detection parameters, see Wickens (2002).
6Note that conservative in this context refers to a suboptimal magnitude of criterion shift with respect to changing base rates, not to a conservative (as opposed to liberal) criterion placement.