Many cognitive tasks have been developed by basic scientists to isolate and measure specific cognitive processes in healthy young adults, and these tasks have the potential to provide important information about cognitive dysfunction in psychiatric disorders, both in psychopathology research and in clinical trials. However, several practical and conceptual challenges arise in translating these tasks for patient research. Here we outline a paradigm development strategy—which involves iteratively testing modifications of the tasks in college students, in older healthy adults, and in patients—that we have used to successfully translate a large number of cognitive tasks for use in schizophrenia patients. This strategy makes the tasks patient friendly while maintaining their cognitive precision. We also outline several measurement issues that arise in these tasks, including differences in baseline performance levels and speed-accuracy trade-offs, and we provide suggestions for addressing these issues. Finally, we present 2 example experiments: one that followed our recommendations regarding measurement issues and succeeded, and one that was a painful but informative failure.
Cognitive impairment is now widely regarded as a central feature of schizophrenia and a critical treatment target. Although clinical neuropsychological methods have been very useful in demonstrating the existence of cognitive deficits in schizophrenia patients, the interpretation of this body of work remains challenging because of the cognitive complexity of nearly all clinical neuropsychological tests. In essence, the neuropsychological literature establishes that patients are impaired but does not indicate exactly which cognitive processes and neural systems are responsible for the molar deficits documented with clinical assessment methods. To address specific cognitive mechanisms, many clinical researchers are turning toward the basic cognitive psychology and cognitive neuroscience literatures in search of behavioral paradigms that can isolate and quantify specific cognitive processes and the corresponding neural systems. However, several roadblocks stand in the way of adapting cognitive paradigms that were developed in the basic science literature, and advances in basic cognitive psychology have not been widely translated into clinical research, slowing progress in understanding the nature of cognitive impairment in schizophrenia.
Over the past 7 years, our team of basic and clinical researchers has attempted to translate approximately 15 basic science paradigms for use with schizophrenia patients, and the purpose of the present article is to describe some of the lessons we have learned that may be helpful to other researchers as they attempt to adapt state-of-the-art basic science tasks for use in various patient groups. We describe 2 general types of issues that arise in adapting basic science paradigms: (a) task selection and modification issues that arise when tasks developed for college students are used in older, less educated, and lower functioning patients and control subjects and (b) measurement issues that are of great importance in patient studies but are typically irrelevant and therefore ignored in basic science studies.
One of the biggest roadblocks in adapting basic science paradigms for use in patients is that the subjects in basic cognitive psychology experiments are almost always college students. Compared with college students, patients and matched control subjects may: (a) be older and less well educated, with lower IQ and lower socioeconomic status (SES); (b) have reduced low-level sensory abilities; (c) have difficulty understanding instructions and a lower probability of asking for clarifications if they do not understand the instructions; (d) have difficulty maintaining the appropriate task set; (e) have different strategies, speed-accuracy trade-offs, motivation levels, etc.; (f) lack experience interacting with computers, video displays, keyboards, mice, etc.; and (g) have limited tolerance for long or difficult tasks. In addition, groups of control subjects and especially groups of patients are typically more heterogeneous than groups of college students. Moreover, behavioral studies in college students are relatively fast and inexpensive, and researchers can easily conduct multiple variants of a task to obtain optimal performance levels, rule out alternative explanations, etc. This converging style of experimentation, often using different samples of subjects, is rarely possible in clinical settings and is never possible in the context of clinical trials.
In the following sections, we will describe how these issues influence the selection of an appropriate task and how tasks can be modified and optimized for use in clinical studies to minimize the role of frequently encountered methodological and measurement problems. We will also discuss the task development strategy that we use to ensure that we have adequately addressed these issues.
Perhaps the most common difficulty in selecting a cognitive task for use with patients is that every task involves multiple processes in addition to the process that the task is designed to isolate. There is no such thing as “a working memory task,” “an attention task,” “an executive control task,” or “a performance monitoring task,” in which overall performance levels purely reflect these specific processes. At the simplest level, virtually all cognitive tasks have perceptual and motor components, and impairments in these relatively peripheral processes may cause overall performance to become impaired. Most tasks also involve strategic, mnemonic, and decision components that may not be obvious at first glance. As a result, the most informative cognitive tasks do not typically rely on overall performance levels to measure a given cognitive process but instead use the pattern of response speed or accuracy over different conditions.
As an example, consider the visual search task illustrated in figure 1. In this task, observers search for a target stimulus in an array of distractors, and reaction times (RTs) are measured as the number of items in the stimulus array varies. The speed with which attention shifts from item to item is then quantified as the slope of the function relating RT to the number of items in the array, and the duration of presearch and postsearch processes is reflected in the intercept of this function. Thus, overall RT is influenced by multiple cognitive processes, but the slope measure serves to isolate the contribution of visual attention. This is a common way of measuring the efficiency of attention in the basic science literature.1–3
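As a rough sketch of this computation, the slope and intercept can be recovered by ordinary least-squares regression of mean RT on set size; the set sizes and RT values below are hypothetical:

```python
# Sketch: recover search slope and intercept by least-squares regression
# of mean RT on set size. Set sizes and RTs below are hypothetical.
def search_slope_intercept(set_sizes, mean_rts):
    """Fit mean RT (ms) as a linear function of set size.

    The slope estimates the per-item cost of shifting attention;
    the intercept reflects presearch and postsearch processes.
    """
    n = len(set_sizes)
    mx = sum(set_sizes) / n
    my = sum(mean_rts) / n
    sxx = sum((x - mx) ** 2 for x in set_sizes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(set_sizes, mean_rts))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical mean RTs (ms) at set sizes 4, 8, and 12:
slope, intercept = search_slope_intercept([4, 8, 12], [620, 740, 860])
# slope = 30.0 ms/item, intercept = 500.0 ms
```

In the basic search literature, shallower slopes are taken to indicate more efficient attentional deployment, and group differences are assessed on the slope rather than on overall RT.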
To assess an impairment in a specific process, it is almost always necessary to look for an interaction between diagnostic group (eg, patient vs control) and an experimental variable (eg, the number of items in the array). Unfortunately, the statistical power of non–crossover interactions is substantially worse than that of main effects,4 and this limits the tasks that can realistically be adapted for use in patient studies. Generally speaking, if the effect of an experimental manipulation is only a 15-ms slowing of RT or a 5% decrease in accuracy in college students (which may be reliable and theoretically meaningful), it will be extremely difficult to find a significant reduction in this effect in patients (although it might be possible to see an increase in the effect for patients). In our own research, we have rejected several tasks in which the effects in healthy young adults are relatively small and in which we expected the effects to be even smaller in patients, even though the tasks had excellent construct validity and tapped processes that might be meaningfully impaired in patients (eg, object-based attention tasks5).
It is also important to consider whether a given task will isolate the same process in patients as it does in college students. Imagine, eg, that factor X and factor Y both influence the effect of an experimental manipulation in some task, but factor X is largely constant across a sample of college students. In our visual search example, factor X might be the ability to remember what the target looks like (which should be relatively constant among college students), and factor Y might be the speed with which attention can shift from object to object (which may vary among college students). A study of college students might therefore indicate that the task provides a relatively pure measure of efficiency of attentional shifting, but this may not be true when the task is applied to patients. That is, if patients and control subjects differ in the ability to remember the target, then impaired performance in the task may be attributed to a deficit in attention even though it is really caused by a deficit in working memory for the search target. There is no general solution to this problem; it can be addressed only by careful consideration of the processes involved in performing the task.
It is also important to consider whether impaired performance in a given task reflects a specific deficit in the processes thought to underlie performance of that task or a more general deficit that impacts a wide variety of processes. This is related to the question of whether the impairment is specific to the disorder of interest or would be observed in any group of patients with a broad-spectrum cognitive disorder. In other words, would anyone who has a significant brain disorder exhibit impaired performance in this task? This issue has received considerable discussion in the psychopathology literature,6–8 but it is not usually relevant in studies of healthy individuals. There are 2 particularly good ways to avoid this problem. First, one can look for double dissociations, in which one group of subjects exhibits impaired performance in task A but not in task B, whereas another group of subjects exhibits impaired performance in task B but not in task A. Unfortunately this strategy will only rarely be helpful in schizophrenia research because it is relatively unusual to find meaningful tasks in which schizophrenia patients are unimpaired. Second, one can use tasks in which an impairment in a specific process leads to improved rather than impaired performance. For example, recent research has shown that schizophrenia patients are more accurate than control subjects at judging the luminance contrast of a texture patch in the presence of a surrounding region of a different contrast, presumably because they have a deficit in integrating information across space.9 While rare, areas of paradoxically spared or supranormal performance levels may be uniquely informative.
Once an appropriate basic science task has been selected to isolate a specific cognitive process of interest, it is almost always necessary to make it more “patient friendly” by modifying various task parameters. The most common adjustment is to slow the speed of the stimuli (including exposure durations, within-trial rate of stimuli, and intertrial intervals) to account for differences in processing speed between college students and the participants in clinical studies. In addition, more trials per condition may be needed owing to greater variability within and between participants and to the relatively low statistical power of group × condition interactions. These 2 changes—slower stimulus speeds and larger numbers of trials—will increase the duration of the experiment, and this may be a problem for patients with limited tolerance for long experiments or when the task is part of a larger battery. To compensate for this, it is often necessary to reduce the number of experimental conditions in the task, which may decrease the ability to rule out alternative explanations for the pattern of results. If a task is being developed for inclusion in a large battery or in a clinical trial, it may be desirable to first run a study focusing solely on the task of interest, including all conditions that are necessary for ruling out alternative explanations of the deficit. Once these have been ruled out, it may be possible to use only the most important conditions in future studies.
In addition to decreasing the speed and increasing the number of trials per condition, it is often necessary to modify some fundamental task parameters so that the task is not too difficult. For example, working memory capacity may be lower in both patients and demographically matched control subjects compared with college students, and a lower working memory load may therefore be necessary to avoid poor performance levels. Changes of this sort can potentially interfere with the validity of the task, and it is therefore important to conduct pilot experiments demonstrating that the modified task retains its validity and sensitivity.
In our team's translational research, we typically use an iterative task development approach to find task parameters that are appropriate for our patients and control subjects and that maintain the validity and sensitivity of the task. We begin by adjusting task parameters to make the experiment more “patient friendly,” and we then test the modified task in college students, often comparing directly with the original task. This starting point has 2 main advantages. First, the properties of the task are usually well known for college students, making it easier to know what to expect from a valid task. Second, it is relatively fast and inexpensive to test large numbers of college students, especially for researchers associated with academic psychology departments. We often find that we need to try several different sets of parameters before we arrive at a set that maintains the validity and sensitivity of the original task, and so this stage is iterative.
Once the modified task has been validated with college students, we test it on healthy older adults from the surrounding community, usually aged 60–85 years. These community subjects need not be matched with the ultimate patient population in age, gender, SES, etc, because they are not used as control subjects who will be directly compared with patients. Instead, we use the community subjects to determine whether the modified task can be tolerated by, and remains valid for, individuals who are quite different from college students in their perceptual abilities, their general cognitive speed, and their familiarity with computers. These subjects can be recruited more easily and less expensively than matched control subjects, because we need not be very picky about their demographic characteristics and because they are typically retired and therefore able to come to the laboratory for testing during normal work hours. Indeed, we frequently find that they enjoy coming to the laboratory and participating in the experiments. In testing these community subjects, we are able to refine our instruction procedures to make the tasks more easily understood. We often make multiple adjustments to the task parameters at this stage. Sometimes the changes are large enough that we go back to testing with college students to make sure that we have not lost task validity.
When community subjects are recruited near universities, they may be very different in education, IQ, and SES from the ultimate patient and control populations. Indeed, a significant proportion of our community subjects have postgraduate degrees. Even though these subjects may be in their 70s or 80s, they may still be better able to understand complex tasks than our younger patients and control subjects. Thus, one must be careful when using these subjects to assess the adequacy of the task instructions, and in such cases it may be necessary to recruit community subjects from a broader demographic range. Although potentially challenging, the recruitment of average-ability community subjects may be very advantageous in terms of modeling many of the issues that are likely to arise in patient studies.
Our next step is typically to try the task in a few patients, and we often begin this stage with patients who are known to have fairly severe cognitive impairments, who might not satisfy our typical inclusion and exclusion criteria, and whose data will not be included in the main study. This allows us to make sure that our pretesting with community subjects has led to the desired result, namely a task that can be fully understood and adequately performed by even the most impaired patients. This may lead to a few more small modifications, which we may then test in community subjects before we are convinced that it is ready for application in a full-scale study. At that point, we begin the actual study.
This path from the identification of a promising basic science paradigm to the beginning of the actual study is a long one, often requiring 6–12 months of concerted effort. However, we find that the benefit fully justifies the cost. Not only do we avoid conducting large-scale patient studies that fail to produce interpretable results but we also increase the probability that our studies actually isolate the specific cognitive processes of interest.
When accuracy is the main dependent variable, it is essential to avoid psychometric artifacts and statistical problems that can result from differences between patients and control subjects in overall performance levels. Perhaps the most fundamental problem is that parameters that lead to appropriate accuracy levels in college students may lead to substantially lower accuracy levels in both patients and demographically matched control subjects in clinical studies. This may decrease statistical power (owing to floor effects or tightly compacted distributions) or make the task so difficult that subjects “give up.”
This problem can often be solved by optimizing task parameters so that they are appropriate for clinical populations, but this must be done carefully in order to maintain the validity of the paradigm. Imagine that performance in a task is governed by factors A and B and that both patients and control subjects in a clinical study are impaired in this task relative to college students because of a deficit in factor A. If the task is made easier by adjusting factor B rather than by adjusting factor A, then overall performance may move to an appropriate level, but the sensitivity of the task to patient-control differences in factor B may be reduced. For example, consider a working memory task in which both patients and control subjects perform poorly compared with college students because of reduced low-level sensory functioning. If the task is modified by decreasing the working memory load (eg, by reducing the number of to-be-remembered items), this may bring overall performance to the level observed in college students. However, this decrease in working memory load may reduce task sensitivity to the point that a modest deficit in working memory abilities will not lead to significantly reduced performance levels in patients. Thus, adjusting one factor to make up for a deficit in another factor may lead to reduced sensitivity.
The same issue arises when the patient and control groups differ in their overall performance levels, which makes it difficult to interpret group × condition interactions in accuracy. This is a consequence of the fact that the relationship between the quality of an observer's internal representations and the accuracy of behavioral performance is often highly nonlinear. As illustrated in figure 2A, increases in the quality of the internal representation generally produce sigmoidal changes in accuracy, and a given change in the quality of the internal representation therefore has a different effect on observed accuracy depending on the starting point. To make this concrete, imagine a continuous performance task experiment in which the stimuli are presented either at high visibility (condition A) or in a degraded form (condition B). Imagine further that the function relating the quality of the stimulus representation to observed accuracy is the same for patients and control subjects (which should be true for very simple tasks), but patients have a worse sensory representation than control subjects for a given stimulus. Figure 2A shows that, even if the experimental manipulation produces the same size change in the quality of the stimulus representation in the patient and control groups, the effect of this manipulation on observed accuracy may be larger for the patient group than for the control group given the sigmoidal shape of the function shown in figure 2. Thus, one might incorrectly conclude from such an experiment that the patients are more sensitive to stimulus degradation than control subjects. However, figure 2B shows that the opposite effect can be observed if the stimuli in both conditions A and B are made less visible. In this case, the control subjects show a larger change in observed accuracy than patients, which might lead to the incorrect conclusion that patients are less sensitive to sensory degradation than control subjects. 
All else being equal, the group that is closest to the steep part of the function will show a larger effect. The simplest way to solve this problem is to use a broad range of stimulus visibility levels, which will make it possible to measure the shape of the whole function that relates stimulus visibility to accuracy. Unfortunately, this requires the use of many more experimental conditions, which may make the experiment unrealistically long.
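The nonlinearity can be illustrated with a simple sketch; the logistic mapping below is an assumed stand-in for the true psychometric function of figure 2, and the quality values are arbitrary:

```python
import math

# Sketch: an assumed logistic psychometric function mapping internal
# representation quality (arbitrary units) to 2-alternative accuracy,
# scaled between chance (50%) and perfect performance.
def accuracy(quality):
    p = 1.0 / (1.0 + math.exp(-quality))
    return 0.5 + 0.5 * p

# The same 1-unit loss in quality, applied at different baselines:
delta_steep = accuracy(0.5) - accuracy(-0.5)  # group near the steep region
delta_flat = accuracy(3.5) - accuracy(2.5)    # group near the ceiling
# delta_steep is several times larger than delta_flat, so the group
# nearer the steep part of the function shows the larger observed effect.
```

Identical changes in the underlying representation thus produce very different changes in observed accuracy depending on where each group sits on the function.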
A more efficient approach is to equate baseline performance for the 2 groups by adjusting the stimuli so that each subject is at a particular performance level. That is, subjects are pretested with different stimuli, and a stimulus value is chosen for each subject that produces a specific accuracy level in that subject (eg, 85% correct). This equates baseline performance for the patient and control groups, and it may also reduce error variance and thereby increase statistical power. However, caution must be used when equating performance by varying the stimuli because it risks replacing a single confound (differences in internal quality in the baseline condition) with 2 confounds (differences in internal quality combined with differences in the stimuli).
Imagine, eg, a working memory experiment that examines performance with a 1-item memory load (condition A) and a 5-item memory load (condition B). Imagine also that the patient group performs worse than the control group with a set size of one item when that item is presented at a high level of visibility (eg, a high-contrast letter presented on a video display for 1 s). This difference in performance for highly visible stimuli is unlikely to reflect an impairment in perception but might instead reflect something like occasional lapses of attention or response confusions in the patient group. To equate performance with a memory load of 1 item, it would be possible to pretest the subjects with different levels of stimulus degradation to find a degradation level that would produce 90% accuracy in all subjects. This degradation level could then be used to compare 1-item and 5-item memory loads in the main experiment. Because accuracy would be 90% in both groups for the 1-item load, any differences between patients and controls for the load of 5 items could not be explained by the sort of psychometric artifact illustrated in figure 2.
However, the results may still be difficult to interpret. Because patients make more errors than controls even at high levels of visibility, this matching procedure would lead to the use of more degraded stimuli for the control group than for the patient group. Thus, the control subjects would be making errors due to a failure of perception, whereas the patients would be making errors due to lapses of attention. Even though accuracy has been equated, processing has not been equated. If perceptual difficulty interacts with memory load (which is quite plausible), whereas lapses of attention cause errors that are independent of the load (which is also quite plausible), then the use of different levels of stimulus degradation in the patient and control groups may cause an artifactual interaction between load and group. Alternatively, this artifactual interaction might be opposite in direction to a real interaction between load and group, leading to a cancellation of the real interaction in the observed data and to the conclusion that the groups do not differ in their sensitivity to load. Thus, adjusting the stimulus visibility to offset a difference in baseline performance that does not arise from differences in perceptual quality can make matters worse, creating 2 confounds in an attempt to eliminate one.
This is analogous to the adage that “two wrongs don't make a right.” One confound cannot be fixed by adding another confound. Instead, researchers must somehow eliminate the original confound (“right the wrong”). This is often impossible to achieve in a direct manner (eg, by equating the groups for lapses of attention), but it can often be achieved by means of parametric experimental manipulations, as will be described in a later section.
It should be stressed that adjusting stimulus parameters to equate baseline performance is actually the correct solution when used to offset an actual difference in perception between the patient and control groups. However, the most efficient procedures for finding the appropriate stimuli—adaptive staircase procedures—will not work properly if nonperceptual factors also contribute to differences in performance. In these procedures, stimulus visibility is automatically adjusted by increasing visibility following errors and by decreasing visibility following correct responses.10,11 By adjusting the size and frequency of these changes in visibility, it is possible to rapidly converge on a level of stimulus visibility that yields a desired level of observed accuracy. However, these procedures assume that all errors are a result of misperception of the stimuli rather than lapses of attention, response confusions, etc. If these nonperceptual factors occur more often in patients than in control subjects, the stimulus level chosen by the procedure to produce a given accuracy level will actually yield more accurate perception for the patients than for the control subjects. That is, for the patients to achieve the same level of observed accuracy as the control subjects, the stimuli used for the patients must lead to better perception to offset their poorer nonperceptual processing. This greater perceptibility for the patient group may make the patients appear to be less sensitive to the experimental manipulation of interest, creating the appearance of a difference when there is no impairment or perhaps masking a real impairment in the opposite direction.
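For concreteness, here is a minimal sketch of a common 1-up/2-down staircase; the 0–1 visibility scale, step size, and performance function are all hypothetical. Under the assumption that every error is perceptual, such a staircase converges near the visibility yielding roughly 70.7% correct:

```python
import random

# Sketch: a 1-up/2-down adaptive staircase. Visibility is on an arbitrary
# 0-1 scale; p_correct_at maps a visibility level to the probability of a
# correct response. Assuming all errors are perceptual, the staircase
# converges near the visibility yielding ~70.7% correct.
def run_staircase(p_correct_at, start=1.0, step=0.05, n_trials=200, seed=0):
    rng = random.Random(seed)
    level = start
    streak = 0  # consecutive correct responses
    for _ in range(n_trials):
        if rng.random() < p_correct_at(level):
            streak += 1
            if streak == 2:                     # 2 correct: make it harder
                level = max(0.0, level - step)
                streak = 0
        else:                                   # 1 error: make it easier
            level = min(1.0, level + step)
            streak = 0
    return level

# Hypothetical performance function: accuracy rises linearly from 50%
# at zero visibility to 100% at full visibility.
final_level = run_staircase(lambda v: 0.5 + 0.5 * v)
# final_level drifts toward the ~70.7%-correct visibility for this function
```

If a patient's lapses of attention also produce errors, the staircase compensates by raising visibility, which is exactly the artifact described above.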
How can this problem be solved? We suggest 2 approaches. First, if the experimental variable of interest is parametrically manipulated, and the factors that control performance are well understood, then it may not be necessary to equate baseline performance levels. An example of this approach will be described in a later section. However, this option is typically feasible only when a single task is being tested and when the subjects can tolerate a fairly long experiment. Second, if the goal is to create an efficient version of a task for use in a battery or clinical trial, then a validation study can first be run in which the experimental variable is parametrically manipulated along with the variable that is used to equate patient and control performance levels. The data from this rather long experiment can be used to determine whether the adjustment variable leads to appropriate changes in the psychometric function or to distortions of the sort illustrated in figure 2B. If the adjustment variable has the desired effects, then an adaptive staircase procedure can be used to equate baseline performance when the task is used in future studies, yielding an efficient task that can be used in batteries or clinical trials.
Although we have used examples of impaired perception and perceptual manipulations in this discussion of baseline accuracy values, the same principles apply to higher level aspects of cognition, such as language and cognitive control. In a language experiment, eg, one might test the ability of patients and control subjects to make use of sentential context to determine the meanings of polysemous vs monosemous words. Under many conditions, it would be possible to ensure that the subjects have adequate time to fully perceive the words. Nonetheless, patients might exhibit lower accuracy for both polysemous and monosemous words owing to specific problems in factors such as word recognition or general problems in factors such as attentiveness, and this would make it difficult to interpret any group × word type interactions. In such cases, it would be possible to use parametric manipulations to draw stronger conclusions about the nature of the patient impairment. For example, one could parametrically vary the degree to which the word meanings were disambiguated by the sentence context. The key is to go beyond simple binary experimental manipulations and to use tasks for which there is a solid model of the factors underlying performance.
In most experiments that focus on the accuracy of unspeeded responses, there are 2 response alternatives (eg, X or Y). This has the advantage of minimizing the working memory resources required to maintain the stimulus-response mapping. However, it may cause a large reduction in the reliability of the accuracy measures. Imagine, eg, an experiment in which observers must determine whether a masked stimulus is the letter X or the letter Y, which are presented with equal probability. Imagine also that a given observer is able to see the stimuli well enough to confidently detect the target on 50% of trials and makes a random guess on the other 50% of trials (and is therefore correct on half of the guesses). Given an infinite number of trials, that observer's accuracy would be 75% correct. Given a small number of trials, however, a given observer may have enough lucky correct guesses or unlucky incorrect guesses to deviate quite a bit from 75% correct. Increasing the number of possible response alternatives, in contrast, can dramatically reduce the influence of guessing. For example, if the masked stimulus could be any of the 26 letters of the alphabet, then observers will rarely guess correctly, and the observed accuracy rate for a given subject based on a small number of trials will tend to be closer to the accuracy observed on an infinite number of trials. If the observer can confidently detect the target on 50% of trials and guesses on the other 50%, only 1/26 of the guesses will be correct and the accuracy for an infinite number of trials would be 51.92% correct.
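The arithmetic above follows from a standard guessing correction: with detection probability p and k equally likely alternatives, expected accuracy is p + (1 − p)/k. A minimal sketch:

```python
# Sketch: expected accuracy when the target is detected with probability
# p_detect and the subject otherwise guesses among n_alternatives
# equally likely responses: p + (1 - p) / k.
def expected_accuracy(p_detect, n_alternatives):
    return p_detect + (1.0 - p_detect) / n_alternatives

two_afc = expected_accuracy(0.5, 2)         # 0.75, the 2-letter example
full_alphabet = expected_accuracy(0.5, 26)  # ~0.5192, the 26-letter example
```

With 26 alternatives, observed accuracy tracks true detection probability closely, because correct guesses contribute almost nothing.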
To demonstrate the difference in measurement error between conditions with 2 and 26 alternatives, we conducted a simple Monte Carlo simulation of an experiment with a patient group and a control group (N = 50 per group) in which each subject received 10 trials of the task. Each simulated patient was able to perceive the target on 50% of trials and guessed on the remaining 50%. Each simulated control subject was able to perceive the target on 60% of trials and guessed on the remaining 40%. We simulated 2 versions of the task, one with 2 response alternatives (eg, X vs Y) and one with 26 response alternatives (eg, any letter of the alphabet). Each of these simulated experiments was repeated 20 times, and for each repetition we calculated the observed effect size of the patient-control difference. The average effect size (Cohen d) was 0.57 in the version with 26 response alternatives but only 0.34 in the version with 2 response alternatives. Thus, the use of large numbers of response alternatives can have a large impact on effect size, especially when the number of trials per subject is small.
Using large numbers of response alternatives can have the negative side effect of imposing a large working memory load on the subject. This is not a problem, however, when the stimulus-response mapping is well learned from everyday experience (eg, saying the names of the letters of the alphabet). An example of a real experiment using this approach will be described in a later section.
Many experimental situations would seem to preclude the use of multiple response alternatives. For example, judgments of memory (eg, old/new) or semantics (eg, related/unrelated) are typically binary. However, it is usually possible to instruct subjects to make a multiple-level confidence judgment rather than a binary yes/no response in these tasks. In a recognition memory experiment, eg, subjects could be instructed to respond to each probe item with a number between 1 and 6, where 1 means that probe was definitely not in the study list and 6 means that the probe was definitely in the list. This procedure reduces measurement error by increasing the number of response alternatives, and it has the additional advantage of making it possible to construct receiver operating characteristic curves. These curves can be used to separately measure sensitivity and response bias in a manner that is much better than the procedure used in conventional yes/no tasks.12–14
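Computing ROC points from confidence ratings is straightforward: each rating level defines a criterion, and cumulative hit and false-alarm rates are computed at each criterion. A minimal sketch with hypothetical ratings from one subject:

```python
from collections import Counter

def roc_points(old_ratings, new_ratings, n_levels=6):
    """Cumulative (false alarm, hit) pairs for each confidence criterion.
    Criterion k treats ratings >= k as an 'old' response."""
    old_counts = Counter(old_ratings)
    new_counts = Counter(new_ratings)
    n_old, n_new = len(old_ratings), len(new_ratings)
    points = []
    for k in range(n_levels, 1, -1):  # strictest criterion first
        hits = sum(old_counts[r] for r in range(k, n_levels + 1)) / n_old
        fas = sum(new_counts[r] for r in range(k, n_levels + 1)) / n_new
        points.append((fas, hits))
    return points

# hypothetical ratings (1 = definitely new, 6 = definitely old)
old = [6, 6, 5, 5, 4, 3, 2, 6, 5, 4]   # studied items
new = [1, 2, 1, 3, 2, 1, 4, 2, 1, 3]   # unstudied items
for fa, hit in roc_points(old, new):
    print(f"FA = {fa:.1f}  Hit = {hit:.1f}")
```

The resulting points trace out the ROC curve, from which sensitivity and bias can be estimated without assuming a fixed yes/no criterion.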
Given the problems that may arise when accuracy is the main dependent variable, one might conclude that RT is a better measure of performance. Indeed, RT does solve 2 of the complications of accuracy. First, whereas the relationship between the quality of the internal representation of a stimulus and the observed accuracy is highly nonlinear (figure 2), the relationship between the amount of time required to make a target decision and the observed RT is typically 1:1. Second, whereas accuracy reaches a ceiling at 100%, leading to low sensitivity when accuracy is high, RT has no real upper limit.
However, any RT experiment that requires the observer to discriminate among stimulus alternatives is also an accuracy experiment, and it is important to consider how RT and accuracy are related. The standard conceptualization of speed-accuracy relationships is illustrated in figure 3. As time progresses after the onset of a stimulus, the quality of the observer's internal representation increases. When asked to make an unspeeded response, the observer waits until the internal quality reaches asymptote and then responds. When asked to make a speeded response, the observer sets a response threshold, and when the internal quality reaches this threshold, a response is triggered. Thus, the observed RT reflects both the rate at which the quality of the representation of the target increases (the rate of information acquisition) and the response threshold. When a patient group has a longer RT than a control group, it is tempting to conclude that this reflects a difference in the rate of information acquisition (as in figure 3A). However, the same RT effect will occur if the patient group has a more conservative response threshold (as in figure 3B). That is, RT differences can occur because of differences in speed-accuracy trade-offs.
If patients are slower solely because they use a more conservative response threshold, then this means that the internal quality of the target representation is greater at the time of response. Thus, a pure speed-accuracy trade-off should result in greater accuracy for patients than for control subjects. If the observed accuracy is significantly worse for the patients than for the control subjects, then slower RTs for the patients cannot be attributed to speed-accuracy trade-offs. Fortunately, this is the pattern that is often observed in schizophrenia patients. However, if the observed accuracy is approximately equal for patients and control subjects, it is difficult to rule out the possibility of speed-accuracy trade-offs because this would require accepting the null hypothesis of no difference in accuracy. This problem is especially severe when accuracy is near ceiling. As illustrated in figure 3C, small differences in response threshold can lead to large differences in RT when the thresholds are high, but the high thresholds also lead to near-ceiling accuracy levels. If both groups are near ceiling, then it will be difficult to measure accuracy well enough to determine whether the groups differ in accuracy. Thus, when performance in both the patient and control groups is near ceiling, it is extremely difficult to determine whether the response thresholds differ. In most cases, therefore, RT effects can be interpreted most easily when accuracy is well below ceiling and when patient accuracy is worse than control accuracy.
Many studies involve measurements of RT in 2 or more experimental conditions, and the differences in RT among these conditions are compared in the patient and control groups. In such studies, the problem is not that patients and control subjects might use different overall response thresholds but rather that the response thresholds may vary across conditions differently for the 2 groups. Consider, eg, an experiment in which subjects make speeded responses in a task that imposes a low memory load (eg, a 0-back task) or in a task that imposes a high memory load (eg, a 2-back task). If the patient group is slowed by 100 ms in the difficult task compared with the easy task, whereas the control group is slowed by 50 ms, then this would normally be interpreted as a slowing of the memory-related processes in the patient group. However, this pattern could also arise if the patients were less confident than control subjects in the high-load condition and therefore increased their response threshold in this condition, even if the rate of memory-related processes was the same in the patient and control groups.
Just as in the case of between-group differences in overall response thresholds, this sort of explanation can be assessed by determining whether the differences in RT are accompanied by differences in accuracy. If, eg, the patient group exhibited a smaller accuracy decline in the high-load condition than did the control subjects, then this would indicate that at least part of the RT effect was due to differences in response threshold rather than differences in the rate of the memory-related processes. Conversely, if the patients exhibited a larger accuracy decline, then this would indicate that their greater slowing cannot be explained by differences in response thresholds. Once again, accuracy can be used to rule out explanations based on speed-accuracy trade-offs, but only if accuracy is well below ceiling.
This approach is very common in the basic science literature. However, it may be more complicated in translational research because of the likelihood that the patients will be less accurate than the control subjects across conditions. If overall accuracy levels differ between groups, then comparing the size of an accuracy effect in one group with the size of an accuracy effect in another group will lead to all the problems discussed in the preceding section on accuracy. That is, if differences in overall accuracy are a problem in interpreting accuracy effects, then these differences will also be a problem in assessing differences between groups in speed-accuracy trade-offs. Thus, although RT measures might seem to solve many of the complications associated with accuracy measures, the intimate relationship between RT and accuracy means that the same complications apply to RT experiments.
It should be stressed that it is necessary to consider the accuracy of all the different response types in a given experiment to assess speed-accuracy trade-offs. In a continuous performance task, eg, one must examine both the hit rate for targets and the false alarm rate for nontargets to adequately assess accuracy. Similarly, both the word and nonword responses must be considered in lexical decision tasks, even though the word responses may have more obvious theoretical implications.
Many of these issues can be minimized or eliminated if the processes underlying task performance are well understood. As an example, consider the visual search task shown in figure 1. In most visual search experiments, the target is present on 50% of trials and absent on 50% of trials. Under typical conditions, observers can be virtually 100% certain that they have found the target on target-present trials. On target-absent trials, however, observers cannot easily determine that they have fully searched each item in the array because it is hard to keep track of which items have and have not been searched. It is therefore difficult to know with certainty that the target is absent. As a result, observers search until they either find the target and make a target-present response (with virtually 100% certainty) or until a certain amount of time has passed, at which point they make a target-absent response (with relatively low certainty). The threshold for making a target-absent response varies considerably from subject to subject and from trial to trial.15 When the target is present, an observer who uses a relatively short threshold time will often “give up” before finding the target and make an erroneous target-absent response. This leads to faster RTs and higher error rates, especially for arrays containing many items. Consequently, differences in threshold lead to differences in the estimated search rate (the slope of the function relating RT to set size; see figure 1).
This is a very serious problem. Some studies of visual search in schizophrenia patients using this kind of task have found little or no difference in search rate between the patients and the control subjects.16,17 However, accuracy declined as the set size increased, and this effect was more pronounced in the patients than in the control subjects, which implies that the patients were more likely than controls to “give up” before searching the whole array. Consequently, the actual rate of search was underestimated in the patients, possibly masking a real difference in search rates between the patient and control groups.
This problem can be solved quite easily by changing the task so that every array contains one of 2 possible targets and the observer indicates which of the 2 targets was present. In the example shown in figure 1, eg, the task might be to press a left-hand button if the target contains a gap on the left side and to press a right-hand button if the target contains a gap on the right side. Because the observers know that a target is always present, they will always search until they find the target and then respond with nearly perfect accuracy. Studies using this approach have found that the search rate is substantially slowed in schizophrenia patients compared with control subjects.18,19
Previously, we noted that speed-accuracy trade-offs are difficult to assess when accuracy is near ceiling, but in this visual search example we eliminated a source of speed-accuracy trade-offs while at the same time producing near-perfect accuracy. These apparently contradictory statements can be reconciled because the processes that underlie visual search performance are extremely well understood. When observers know that a target is always present, the slope of the function relating RT to set size reflects the time required to search the items, and differences in response threshold influence the intercept of this function rather than the slope. Thus, many measurement issues can be resolved by using tasks for which the sequence of processes is well understood.
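The slope and intercept of the function relating RT to set size are ordinary least-squares estimates. A minimal sketch with hypothetical mean RTs (the set sizes and values below are illustrative, not taken from the studies cited):

```python
def search_slope(set_sizes, mean_rts):
    """Least-squares slope (ms per item) and intercept (ms) of the
    function relating mean RT to set size."""
    n = len(set_sizes)
    mx = sum(set_sizes) / n
    my = sum(mean_rts) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(set_sizes, mean_rts))
             / sum((x - mx) ** 2 for x in set_sizes))
    return slope, my - slope * mx

# hypothetical mean RTs at set sizes of 4, 8, and 12 items
slope, intercept = search_slope([4, 8, 12], [620, 740, 860])
print(slope, intercept)   # 30.0 ms/item, 500.0 ms
```

On this account, a group difference in response threshold would shift the intercept, whereas a group difference in the rate of search would change the slope.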
In most studies, schizophrenia patients have longer RTs than do control subjects in all conditions, and this can complicate the interpretation of differences in RT effects between the 2 groups. Fortunately, however, this problem is not as severe as the problem of different levels of overall accuracy. RT is simpler in this regard because the scale for RT is linearly related to the amount of time required for processing to occur, whereas the scale for accuracy is not linearly related to the amount of information that the subject has acquired about the stimulus. Imagine, eg, a Stroop task in which patient RTs increase from 700 to 770 ms for compatible vs incompatible trials, whereas control RTs increase from 600 to 660 ms. In this case, it might be perfectly legitimate to say that the incompatibility causes a greater increase in processing time for the patients than for the control subjects.
However, there are 2 cases in which this sort of comparison may be problematic. First, the SD of RT typically increases linearly with the RT (ie, the coefficient of variation is constant), and variance will therefore be greater for the patients than for the control subjects. Second, researchers are sometimes interested in the proportional size of an effect rather than the absolute size of an effect. In the Stroop case described in the preceding paragraph, eg, both groups showed a 10% increase in RT on incompatible trials. Both these problems can be addressed by using the logarithm of the RT as the dependent variable because logarithms convert multiplicative (ie, proportional) differences into additive differences. In the Stroop example, the natural logarithms of the compatible and incompatible RT means would be 6.5511 and 6.6464 for patients and 6.3969 and 6.4922 for control subjects, with a difference between compatible and incompatible of 0.0953 in both groups. Thus, a perfectly proportional group × condition interaction turns into a pure group main effect when the data are transformed in this manner.
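The log-transform arithmetic in this example can be verified directly; a perfectly proportional effect produces identical differences on the log scale:

```python
import math

# Hypothetical Stroop RT means (ms) from the example in the text
rts = {"patient": (700, 770), "control": (600, 660)}

for group, (compatible, incompatible) in rts.items():
    # difference of logs = log of the ratio, so a 10% increase
    # yields the same value (ln 1.1) in both groups
    log_effect = math.log(incompatible) - math.log(compatible)
    print(f"{group}: log RT effect = {log_effect:.4f}")
```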
The appropriateness of absolute vs proportional differences in RT will depend on the specific experimental paradigm and the nature of the conclusions that one wishes to draw. However, care must be used when interpreting log-transformed data (or data subjected to any kind of transformation). First, if a logarithmic transformation is applied solely to eliminate the differences in variance among groups, it is important to keep in mind that this will change the meaning of any group × condition interactions. That is, an interaction in log-transformed data means that the groups differed in the size of the proportional effect of the experimental manipulation. Second, it might be tempting to assume that a proportional effect on RT reflects a proportional change in the time required for some key cognitive process that is being assessed. In our Stroop case, eg, one might be tempted to conclude that both groups showed a 10% change in the naming process. However, the naming process is only a fraction of the overall RT, so this would not be a valid conclusion. For example, the naming process might last 200 ms for both patients and control subjects on compatible trials, with the difference in RT on these trials reflecting some other process. In this case, the 70-ms slowing on incompatible trials for patients would be greater than the 60-ms slowing for control subjects in both absolute and proportional terms. Thus, this sort of transformation is typically valid only when applied to experimental effects and not to raw RTs. For example, we have used logarithmic transformation on visual search slopes.18
If patients exhibit a significant impairment in task A but no impairment in task B, it is tempting to assume that the patients must have a deficit in some process that underlies performance of task A but no deficit in the processes that underlie performance of task B. However, task A could simply be more sensitive to deficits than task B. This issue is of great importance in the clinical literature, where it has received substantial discussion.20 However, different tasks are not usually compared in the basic cognitive science and neuroscience literatures (except in basic neuropsychology), so this issue does not usually arise in these literatures. As discussed in an earlier section, problems such as this can often be solved by means of double dissociations, in which different groups of patients show different patterns of deficits. This is not always practical, however, and it is often useful to compare the pattern of deficits in a single patient group relative to a control group. Statistical procedures are available to take into account differences in measurement sensitivity in many such situations.21 Because this issue has been described in such detail elsewhere, we will not provide the details here.
Two kinds of outliers are commonly encountered when cognitive tasks are tested in patient studies: outlier trials and outlier subjects. Outlier trials are common in RT studies because the distribution of single-trial RTs is highly skewed. Figure 4 shows an example of this from a study in which schizophrenia patients and control subjects categorized stimuli as letters or digits (Luck SJ, Kappenman ES, Fuller RL, Robinson B, Gold JM, unpublished data). The RT distributions can be modeled as the convolution of a Gaussian distribution and an exponential distribution, and most cognitive manipulations influence the exponential component of the distribution.22 As a result, differences in RT between groups or conditions occur mainly in the tail of the RT distributions. This is exactly what was observed in the data shown in figure 4.
Thus, if the goal of an experiment is to detect a difference in RTs, it is important for the statistical approach to be sensitive to the tail of the RT distribution. The mean of a skewed distribution is very sensitive to the tail of the distribution, so using mean RT as the dependent variable will naturally lead to sensitivity to the tail of the distribution. The problem, however, is that a relatively small proportion of RTs contribute to the tail of the distribution and the presence or absence of a few very long RTs can dramatically change an individual subject's mean RT. This makes it difficult to obtain a reliable measure of each subject's true mean RT from a small sample of single-trial RTs. Thus, mean RT has the advantage of being sensitive to the part of the distribution in which patients and control subjects are likely to differ, but it has the disadvantage of leading to high levels of measurement error.
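The differing sensitivity of the mean and median to the exponential tail can be demonstrated by simulating ex-Gaussian RTs. The parameters below are hypothetical; the 2 groups differ only in the exponential component (tau):

```python
import random
import statistics

def ex_gaussian_rt(mu, sigma, tau, rng):
    """One RT drawn from the convolution of a Gaussian (mu, sigma)
    and an exponential distribution with mean tau."""
    return rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau)

rng = random.Random(1)
# hypothetical groups that differ only in the exponential tail
controls = [ex_gaussian_rt(400, 40, 100, rng) for _ in range(5000)]
patients = [ex_gaussian_rt(400, 40, 200, rng) for _ in range(5000)]

for label, rts in (("control", controls), ("patient", patients)):
    print(label, round(statistics.mean(rts)), round(statistics.median(rts)))
```

Because the group difference lives in the tail, the mean shows a larger patient-control difference than the median does, which is exactly the trade-off described above.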
Three approaches to this problem are commonly used in the basic science literature. First, because the median of a distribution is much less sensitive to outliers than is the mean, median RT can be used as the dependent variable instead of mean RT. The main disadvantage of this approach is that median RT may be relatively insensitive to differences between groups in the tails of the RT distributions. However, it is probably the best approach when the differences are not primarily found in the tails. A second approach is to eliminate trials with the most extreme outlier RT values. By choosing an appropriate criterion, it is possible to balance the reduction in sensitivity to the tails with the reduction in measurement error. A widely used automated outlier rejection algorithm was developed by Van Selst and Jolicoeur,23 and we have used this algorithm successfully in studies of schizophrenia.19 A third approach is to mathematically decompose each subject's RT distribution into its components so that the parameters of the components can be directly estimated. In an ideal world, this would be the best approach. Unfortunately, extremely large numbers of trials are needed to accurately characterize an individual subject's distribution of RTs. Rouder and his colleagues24 have recently developed a new estimation approach that may be effective with smaller numbers of trials, but it remains to be seen whether this method will work well in studies of patients.
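As a simplified illustration of the second approach (a fixed SD-based cutoff, not the actual Van Selst and Jolicoeur procedure, whose criterion adapts recursively to the number of trials), trial-level trimming can be sketched as follows:

```python
import statistics

def trim_outlier_rts(rts, n_sd=2.0):
    """Remove trials more than n_sd standard deviations from the mean.
    The fixed criterion here is purely illustrative."""
    m = statistics.mean(rts)
    sd = statistics.stdev(rts)
    return [rt for rt in rts if abs(rt - m) <= n_sd * sd]

rts = [420, 450, 480, 510, 430, 460, 2400]   # one extreme outlier trial
trimmed = trim_outlier_rts(rts)
print(round(statistics.mean(rts)), round(statistics.mean(trimmed)))
```

Note that a single-pass cutoff has a weakness: an extreme outlier inflates the SD used to detect it and can thereby mask itself, which is one reason the published algorithm iterates and scales its criterion with sample size.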
In addition to outlier trials, one must also consider the possibility of outlier subjects. Basic science studies are typically focused on the nature of cognition in “typical” individuals, and it is common to exclude any subjects whose overall performance is unusually poor (usually no more than a few percent of subjects, on average). When the basic science tasks are used in clinical populations, a substantially larger number of subjects with poor performance may be present in the sample. This may arise in both the patient and control groups if the task and instructions have not been optimized for older, less capable subjects. However, this problem may be more acute in the patients, especially those with severe cognitive impairments. In either case, excluding subjects will be more problematic in patient studies than in basic studies.
Outlier subjects may lead to several specific problems in patient studies. If subjects with poor performance are not excluded from the sample, then it is possible that these subjects did not understand the instructions, that they used an unusual strategy, or that their performance reflects the impact of a deficit in another cognitive process that is required by the task. In these cases, the pattern of results for these subjects may not accurately reflect the operation of the specific neurocognitive processes that the task was designed to measure. In addition, poor performers may exhibit floor effects, which may create artifactual differences between groups or mask real differences between groups. The alternative is to exclude poor performers from the sample, but this can be equally problematic. First, unless the criteria for exclusion are applied objectively, the exclusion of outlier subjects can lead to an unintentional biasing of the results. Second, the poor performers are likely to be the most impaired patients who might show the largest effects, and excluding these patients may reduce the statistical power of the study. In addition, the exclusion of poor performers may be simply incompatible with some types of research (eg, clinical trials).
There are several solutions to the problem of outlier subjects, and the best solution will depend on the nature of the study. In most cases, the problem of floor-level performance can be minimized via clear instructions, sufficient practice, and the use of a carefully piloted patient-friendly task. In addition, the use of pretesting to select stimuli that yield an appropriate performance level in each subject can be helpful, although it can be problematic for the reasons described above. If it is still necessary to exclude subjects, this must be done in an unbiased manner. The best way to achieve this is to develop strict exclusion criteria before data collection begins. This works well when a task has already been extensively used in the relevant patient group, but it is impractical for tasks that have not been well studied in the patient group. In these cases, it is best to have a member of the research team decide the criteria while looking at data that have been stripped of all subject and group information.
In this section, we present an example of an experiment in which we attempted to translate a state-of-the-art cognitive task for use in patients and failed. That is, although the question asked by the experiment was interesting and timely, we were unable to obtain meaningful data. Although it is never pleasant to publicize one's failures, this case is worth discussing because it exemplifies several of the principles described above.
The experiment examined a recently developed masking paradigm, in which the target and mask onset at the same time, but the mask outlasts the target by a variable amount of time (the mask offset delay). In healthy young adults, performance in this paradigm declines as the mask offset delay increases, and this appears to reflect the iterative, reentrant nature of perceptual processing.25–28 In this experimental context, an impairment in reentrant processing could plausibly lead to patients showing a reduced vulnerability to masking, unlike the pattern seen in more widely studied backward masking paradigms.29,30
In our version of this task, the target was a circle that either contained a line or did not contain a line (with equal probability). The target and mask were presented for 100 ms, and the mask continued to be visible for 0, 30, 60, 90, 120, or 180 ms after target offset (these were the mask offset delays). We tested 30 schizophrenia patients and 24 healthy control subjects, and each subject received 32 trials for each of the 6 mask offset delays.
Mean target discrimination accuracy is shown in figure 5A. We encountered 3 major problems in this data set. First, our control group showed a relatively small masking effect, with only a 13% drop in performance between the 0- and 180-ms mask offset delays. With such a small effect in the control group, it would be difficult to detect a reduction in masking in the patient group. For example, the patients exhibited a 10% masking effect, and we did not come close to having the statistical power necessary to show that this 10% effect in patients was significantly different from the 13% effect in controls. The pilot testing we conducted as a part of our task development strategy yielded a somewhat larger effect of 16% in both college students and older community subjects, and we do not know why the effect was only 13% in our control subjects during the actual study. In retrospect, however, it would have been worthwhile to adjust the task during the task development phase of the study to produce a larger effect.
A second problem with the data shown in figure 5A is that overall accuracy was lower in the patient group than in the control group, making it difficult to directly compare the size of the effect across groups. We attempted to address this problem by excluding all subjects whose accuracy at the 0-ms mask offset delay was less than 60% correct; this led to the removal of 5 patients and 0 control subjects from the sample. Our rationale for using this exclusion criterion was that subjects who could not perform the task reasonably accurately in this condition—in which masking should have been minimal—must have either misunderstood the task or suffered from some kind of perceptual problem that might make it impossible to assess masking. Figure 5B shows the means after this exclusion criterion was applied. Because we excluded the patients with the worst overall performance, the overall difference in accuracy between the patients and the control subjects was reduced. It is impossible to know whether the application of the exclusion criterion led us closer toward or farther away from the truth. That is, we may have eliminated the one real difference between the patient and control groups by excluding the most impaired patients or we may have improved our ability to see the “true” pattern of patient performance by eliminating patients who would otherwise obscure our ability to accurately measure the effects of masking.
The biggest problem with this experiment is shown in figure 5C, which shows the single-subject data for all 30 patients. This pattern of data is perhaps best described as spaghetti. Many subjects were near ceiling, many were near floor, and many had scores that bounced around seemingly randomly across the different mask offset delays. With such extreme variance, the probability of obtaining meaningful and statistically significant results was near zero. There are 2 likely explanations for this extreme degree of variability. First, subjects probably differed considerably in their ability to perform the basic perceptual task, even in the absence of masking. Perhaps this reflected variability in low-level visual acuity for parafoveal stimuli. Second, we used a task with 2 response alternatives (plain circle or circle with a line) along with a modest number of trials per condition (32). If we were to repeat this experiment, we would probably use the 26 letters of the alphabet as targets, providing 26 response alternatives.
From a measurement perspective, our most successful behavioral experiment so far was a study of the speed with which patients can shift attention to a cued location.31 The task is shown in figure 6A. On each trial, an array of 8 letters was presented for 1000 ms. One location was then cued by the disappearance of a box, and the subject reported the identity of the letter in the box that disappeared. The letter could be any of the 26 letters of the alphabet, and subjects responded by simply saying the letter, which minimized any problems associated with learning a complicated set of stimulus-response mappings. The experimenter typed the subject's response into the data collection computer (responses were unspeeded). The array of letters was replaced by an array of masks at a variable time after the onset of the cue (the cue-mask delay). Each subject received 30 trials at each cue-mask delay.
We assumed that subjects should be able to accurately report the identity of the target if they were able to shift attention to the cued location before the onset of the mask array. However, if the masks appeared before attention was shifted, then subjects would be correct only if they happened to guess correctly (which should be rare with 26 possible targets) or if they had stored the cued item in working memory prior to the onset of the masks. The general pattern of results for this paradigm is illustrated in figure 6A. When the mask array appears simultaneously with the onset of the cue (a cue-mask delay of 0 ms), it is impossible to shift attention to the cued location prior to the onset of the mask array, and the accuracy observed on these trials can be used to estimate the contribution of working memory and guessing to performance. When the mask onset delay is 1000 ms, subjects should be able to simply make an eye movement to the cued item and say its name before the target is masked. In most subjects, performance in this condition should reach an asymptote of 100% correct. However, if the subject has occasional lapses of attention or other such nonspecific problems, the asymptote may be less than 100% correct.
We anticipated that schizophrenia patients would be less able than control subjects to use working memory when the mask onset delay was short, and we also anticipated that their asymptotic performance might be lower than that of control subjects. We therefore developed a mathematical normalization procedure that allowed us to measure the speed of attention independently of these factors. Specifically, as shown in figure 6B, we rescaled the data for each subject so that the accuracy level with a mask onset delay of 0 ms was treated as 0% accuracy and the asymptotic accuracy was treated as 100% accuracy. More precisely, we normalized the data for each subject according to the formula Nt = (Pt − P0) / (Pasymptote − P0), where Nt was the normalized proportion of correct responses at a cue-mask delay of t ms, Pt was the observed proportion of correct responses at a cue-mask delay of t ms, Pasymptote was the asymptotic proportion correct, and P0 was the observed proportion correct at a cue-mask delay of 0 ms. This was equivalent to stretching the function vertically for each subject so that it started at a value of 0.0 for the 0-ms delay and reached 1.0 at asymptote.
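This normalization is a simple linear rescaling. A sketch with hypothetical accuracy values for one subject:

```python
def normalize_accuracy(p_t, p0, p_asymptote):
    """Rescale observed proportion correct so that performance at the
    0-ms delay maps to 0.0 and asymptotic performance maps to 1.0."""
    return (p_t - p0) / (p_asymptote - p0)

# hypothetical subject: 20% correct at the 0-ms delay, 95% at asymptote
observed = [0.20, 0.35, 0.65, 0.88, 0.95]
normalized = [normalize_accuracy(p, 0.20, 0.95) for p in observed]
print([round(n, 2) for n in normalized])
```

After rescaling, each subject's function spans the same 0-to-1 range, so the timing of the rise (rather than its baseline or asymptote) carries the group comparison.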
By converting each subject's data into the same scale, this type of normalization approach eliminates the problems described earlier of comparing subjects with different baseline accuracy values. It requires measuring accuracy over a wide range of stimulus values so that minimum and maximum values are obtained for each subject, and it is also useful to have a well-justified model of task performance (even if this model is quite simple, as in the present case). Using a wide range of stimulus values can make the experiment unrealistically long, but this problem can be mitigated by using a task with a large number of response alternatives to minimize measurement error.
Figure 6C shows the results from the 24 individual patients in this study (prior to normalization). Each subject exhibited a very systematic increase in accuracy as the mask onset delay increased, and the variability is remarkably small in comparison with the variability of the single-subject data in figure 5C, even though the number of trials per condition was nearly identical in the 2 experiments. This was presumably due, in large part, to the use of 26 response alternatives in the present experiment, which minimized measurement error and made it possible to reliably characterize the performance of each individual subject. Indeed, the measurement error is so low that it was possible to identify 4 patients who were clear outliers (see solid lines in figure 6C). There was a clear gap between these patients and the remaining patients in terms of the time at which performance began to rise up from the floor. Because the data from these subjects were still interpretable, we did not need to exclude them from the analyses. Rather, we were able to treat them as a separate subgroup.
The mean data from the main group of patients, the subgroup of outlier patients, and the control subjects are shown, after normalization, in figure 6D. The normalization achieved the desired effect of putting all subjects onto the same accuracy scale, eliminating the measurement artifacts that can arise from different baseline accuracy levels. These results make it clear that the large majority of patients can shift attention just as quickly as control subjects, a finding that was supported by a converging electrophysiological experiment.31 We are now using the lessons learned from this experiment in designing our future studies.
Many challenges must be overcome to translate cognitive paradigms from the basic science literature for use in patient populations. In their original form, tasks developed to study college students may be too difficult or too long to be tolerated by patients, and measurement issues that do not arise in within-subject basic science experiments are of paramount importance in studies that compare different groups of subjects. Consequently, clinical researchers often cannot obtain meaningful results by simply using these tasks without modification in patients. Substantial up-front effort must be devoted to optimizing tasks before they can be used in the clinic. We have found that extensive pilot testing in both young and elderly healthy subjects, coupled with limited pilot testing in patients known to be impaired, is an effective method for assessing task tolerability while maintaining construct validity. While time-intensive, this approach is needed to validate tasks for application in clinical contexts where measurement methods cannot be altered until data collection is complete, often a period of years. With attention to task optimization and measurement challenges, paradigms from the basic cognitive sciences should provide highly efficient tools for assessing specific cognitive deficits, the functioning of specific neural systems, and the modulation of these systems by treatments that target cognitive deficits.
National Institute of Mental Health (MH65034, MH06850); University of Maryland General Clinical Research Center (M01-RR-16500).