The Patient-Reported Outcomes Measurement Information System (PROMIS) aims to develop self-reported item banks for clinical research. The PROMIS pediatrics (aged 8–17) project focuses on the development of item banks across several health domains (physical function, pain, fatigue, emotional distress, social role relationships, and asthma symptoms). The psychometric properties of the anxiety and depressive symptom item banks are described.
Participants (n = 1,529) were recruited in public school settings, hospital-based outpatient and subspecialty pediatrics clinics. The anxiety (k = 18) and depressive symptoms (k = 21) items were split between two test administration forms. Hierarchical confirmatory factor-analytic models (CFA) were conducted to evaluate scale dimensionality and local dependence. IRT analyses were then used to finalize item banks and short forms.
CFA results confirmed that anxiety and depressive symptoms are separate constructs and indicative of negative affect. Items with local dependence and DIF were removed resulting in 15 anxiety and 14 depressive symptoms items. The psychometric differences between short forms and simulated computer adaptive tests are presented.
PROMIS pediatric item banks were developed to provide efficient assessment of health-related quality of life domains. This sample provides initial calibrations of anxiety and depressive symptoms item banks and creates PROMIS pediatric instruments, version 1.0.
The Patient-Reported Outcomes Measurement Information System (PROMIS) project, a National Institutes of Health Roadmap for Medical Research initiative, was developed to advance the science and application of patient-reported outcomes (PRO) in chronic diseases. One main goal of the PROMIS initiative is to develop a set of PRO item banks and computerized adaptive tests for the clinical research community. The PROMIS Pediatric project focused on the development of self-report PRO item banks across several health domains for youth aged 8–17. The generic health domains are important across a variety of illnesses, and include physical function, pain, fatigue, emotional distress, and social function. Additionally, one disease-specific item bank was developed for children with asthma to measure disease-related symptoms.
Emotional distress commonly refers to unpleasant feelings or emotions that are experienced privately and, therefore, are good candidates for assessment as PROs. Emotional distress among children is partially comprised of feelings of anxiety, depression, and anger. The emotional distress domains of anxiety and depressive symptoms are the focus of this manuscript.
Symptoms that best differentiate anxiety are those that reflect autonomic arousal and the experience of threat. Children often experience these feelings in a variety of contexts specific to their environment of home, school, and social activities. The PROMIS pediatric item bank for anxiety focuses on fear (e.g., fearfulness), anxious misery (e.g., worry), and hyperarousal (e.g., nervousness).
Depressive symptoms among children often include feelings of hopelessness, helplessness, and worthlessness. The PROMIS pediatric item bank for depressive symptoms focuses on negative mood (e.g., sadness), anhedonia (e.g., loss of interest), negative views of the self (e.g., worthlessness, low self-esteem), and negative social cognition (e.g., loneliness, interpersonal alienation). This item bank is best characterized as depressive symptoms rather than as a complete diagnostic test for depression.
PROMIS pediatric item banks were developed using a strategic item generation methodology adopted by the PROMIS Network . Six phases of item development were implemented: identification of existing items, item classification and selection, item review and revision, focus group input on domain coverage, cognitive interviews with individual items, and final revision before field testing. Identification of items refers to the systematic search for existing items in currently available pediatric scales [2, 4–6]. Items successfully screened through the process were sent to field testing. The final PROMIS pediatric item set contained 15 anxiety and 14 depressive symptom items.
A limited number of generic self-report health-related quality of life (HRQOL) instruments exist for use in pediatric populations, and most attempt to measure at least some aspect of emotional distress . The vast majority of these have utilized classical test theory and few have taken advantage of item response theory (IRT) analysis in the scale development process . PROMIS psychometric analyses focused on determining scale dimensionality and detecting sources of local dependence and also considered final item selection using IRT analyses. The primary objective of this paper is to describe the IRT analyses of the PROMIS pediatric anxiety and depressive symptoms item banks and the measurement properties of the new PROMIS pediatric anxiety and depressive symptoms scales that resulted from these IRT analyses, including investigations of scale dimensionality, sources of local dependence, and differential item functioning (DIF).
Participants from central North Carolina and Texas were recruited in hospital-based outpatient general pediatrics and subspecialty clinics and in public school settings between January 2007 and May 2008. School-based participants were recruited through elementary after-school programs as well as required health classes in middle and high schools. Parental informed consent and minor assent were obtained for all children taking the survey. A more detailed description of the study design is provided elsewhere.
The PROMIS anxiety and depressive symptoms items were randomly split between two test administration forms (Form 1 contained 9 anxiety items and 10 depressive symptom items; Form 2 contained 9 anxiety items and 11 depressive symptom items). Children were randomly assigned to complete one of the testing forms. Each of the anxiety and depressive symptoms PROMIS pediatric items was administered to at least 759 respondents. This sampling plan was developed for collecting responses to candidate items from the targeted PROMIS domains and accommodated multiple objectives including: (1) confirm the factor structure of the domains; (2) evaluate items for local dependence (LD) and DIF; and (3) calibrate the items for each domain using IRT.
All of these emotional distress items had a 7-day recall period and used standardized 5-point response options (never, almost never, sometimes, often, almost always). Table 1 shows the anxiety and depressive symptoms items administered during the testing.
Data analysis followed the sequence of procedures presented by Reeve et al.  in their description of plans for psychometric evaluation and calibration of health-related quality of life item banks for PROMIS. First, traditional descriptive statistics were computed, as a check on data entry and validity and to verify that there were no empty (zero frequency) response categories for any item. These statistics included the frequencies and proportions in each item response category and the correlation of the item scores with the total summed score.
Second, as a check on the assumptions of the unidimensional IRT model to be used, the dimensionality of individual differences on the anxiety and depressive symptoms item sets was examined using confirmatory factor analysis (CFA; e.g., bifactor analysis) of the inter-item polychoric correlation matrices. These analyses were performed using the “weighted least squares with robust standard errors, mean- and variance-adjusted” (WLSMV) algorithm as implemented in the software Mplus. Fitting additional factors, over and above those indicated by the design of the questionnaire, and residual correlations significantly greater than zero, served as indices of local dependence (LD) for pairs or small numbers of items that violate the local independence assumption of unidimensional IRT. If a pair of items exhibited LD, one item from the pair was set aside.
Third, within the sets of items for which unidimensionality had been confirmed using CFA, the items were “calibrated” by fitting Samejima’s Graded Response Model [14, 15] using the software Multilog. This model characterizes each item with a slope or discrimination parameter (a), which reflects the degree of association of the item responses with the latent construct being measured, and four threshold parameters (bk) (for five-alternative items), which indicate the level of anxiety or depressive symptoms at which a response in a particular category or higher becomes likely. This model has been selected for the PROMIS scales. The goodness of fit of the IRT model to the data was examined using Orlando and Thissen’s [17, 18] S-X² statistic as generalized by Bjorner et al. for polytomous response data. Because S-X² is a goodness-of-fit statistic, a non-significant value is the desirable outcome, indicating adequate fit of the model to the data.
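As an illustration only, the graded response model described above can be sketched in a few lines of Python. The sketch assumes the logistic form of the model without a scaling constant; the function name and the parameter values are hypothetical and are not taken from Tables 5 or 6.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a logistic graded response model.

    theta      -- latent trait level in standard units
    a          -- slope (discrimination) parameter
    thresholds -- ordered threshold parameters b_k (four values for a
                  five-category item)
    """
    # Probability of responding in category k or higher, with the
    # boundary values 1 (lowest category or higher) and 0 (above the top).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative ones.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical item: a = 2.0, thresholds at -1, 0, 1, 2 standard units.
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.0, 0.0, 1.0, 2.0])
```

With these made-up values, the five probabilities correspond to the response options never, almost never, sometimes, often, and almost always, and they sum to one at any value of theta.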
Fourth, the possibility of differential item functioning (DIF) was investigated for each item on each scale using the IRT-LR DIF detection procedure as implemented in the software IRTLRDIF. DIF indicates that the relation of the item responses with the latent variable being measured differs between two (most often demographic) groups. Such a difference implies that some other factor, related to group membership but different from the construct being measured, had an influence on the item responses, violating the IRT assumption of unidimensionality. In the present data, the only demographic background variable that divides the sample into two groups that are sufficiently large to examine DIF is gender, so the DIF analysis was done separating the data into responses from boys and girls. IRT-LR DIF detection provides a χ²-distributed test statistic; again, a non-significant value is the desirable outcome, indicating a lack of detectable DIF. We used the Benjamini–Hochberg [22, 23] procedure to control for the multiplicity of comparisons involved in checking each item for DIF using α = 0.05, and graphical methods, as suggested by Steinberg and Thissen, to evaluate effect size when DIF was detected.
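For readers unfamiliar with the Benjamini–Hochberg procedure mentioned above, the following is a minimal Python sketch of the standard step-up rule. The function name and the p-values are hypothetical; this is not the code used in the analyses reported here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a test is declared significant.

    Standard step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for ten item-level tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(example_p, alpha=0.05)
```

In this made-up example only the two smallest p-values are flagged, although five of the ten fall below an unadjusted 0.05 cutoff.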
Item pools were generated by combining the remaining items across the forms. The linking procedure used, called “common population linking” or “randomly equivalent groups,” is based on calibrating multiple test forms from a common population and is widely used in educational testing [25, 26]. This technique is appropriate in this situation because items were randomly assigned to test forms, and test forms were randomly assigned to individuals. These procedures enabled the research team to administer nearly 300 items across multiple forms and domains without exceeding 70 items on any particular form.
Fifth, after the final item pools were selected, confirmatory factor analysis (CFA) of the inter-item polychoric correlation matrix among the remaining, selected items was used to ensure that the latent variables underlying the item responses for the anxiety and depressive symptoms were unidimensional in the final item pools. These analyses were performed using the DWLS algorithm as implemented in the software LISREL.
Finally, IRT scores for the scales are based on the GRM parameters after the scales are assembled. All IRT-based scores are relative to some reference group; in this case, the reference group is the subset of the sample from North Carolina. While IRT scale scores may be based either on item response patterns or summed scores, we expect that scale scores based on summed scores will be used most often; score translation tables for that purpose are provided in the “Appendix” and Table 7.
Test forms containing anxiety and depressive symptoms PROMIS pediatric items were completed by a total of 1,529 respondents. The sample was about 52% female and 58% were children aged 8–12. Fifty-nine percent were Caucasian, 21% black, 6% multi-racial, and 14% other races (Asian/Pacific Islanders, Native Americans, and Other Races). Seventeen percent of the sample was of Hispanic ethnicity. The vast majority of the adults providing informed consent for the children were parents of the child (92%) or grandparents (4%). The educational attainment of these parents or guardians ranged from less than high school (7%) to advanced degree (15%) with 26% reporting a college degree, 33% some college, and 20% a high school diploma. Approximately 23% of the children participating in the survey had a chronic illness diagnosis during the past 6 months (Table 2).
Using CFA to examine the dimensionality of the anxiety and depressive symptoms item sets on the two item tryout forms involved fitting a number of models; the factor loadings for representative models that fit the data reasonably well are shown in Tables 3 and 4. The items in both tables are sorted to group together items with similar statistical properties. Both tables show modified bi-factor models, comprising a general factor with loadings for all items, two group-specific factors with non-zero loadings for either the anxiety or the depressive items (that much is a bi-factor model), plus a smaller group factor (in Table 3) and residual correlations. The latter components augment the bi-factor model and represent local dependence between pairs, or a triplet, of items. Indicators of goodness of fit suggest both models fit the data, using the standards suggested by Reeve et al.: For the model in Table 3, χ²(76) = 248, CFI = 0.95, TLI = 0.99, RMSEA = 0.06; for the model in Table 4, χ²(91) = 194, CFI = 0.97, TLI = 0.99, RMSEA = 0.04.
These models answer two questions that arise in the context of scale construction using these items. The first question is: “Is negative affect unidimensional, or is there distinguishable individual difference variation corresponding to the anxiety items distinct from the depressive symptoms items?” The fact that there are substantial loadings that differ significantly from zero on the group-specific factor for the depressive symptoms items in Table 3, and on that for the anxiety items in Table 4, indicates that the covariation among the item responses cannot be adequately explained with a theory that there is a single negative-affect dimension of individual differences underlying responses to all of the items. Both anxiety and depressive symptoms have their own unique components. (It is a curious fact that the general factor in Table 3 is anxiety-dominated negative affect, leaving little unique for the anxiety group-specific factor; and the general factor in Table 4 is dominated by depressive symptoms, leaving little unique for the depressive symptoms group-specific factor. However, that is simply an illustration of the fact that the composition of latent variables in factor analysis is highly dependent on the properties of the item set being analyzed.)
A second question answered by these factor-analytic results is: “Are the items conditionally independent, given the combination of the general factor and group-specific factors for anxiety and depressive symptoms?” The answer is: “Not all of them.” In Table 3, there is a cluster of items that involve being “scared” or “afraid” with responses that are more correlated than is expected given the general factor and the anxiety-specific factor, and there are four more pairs of items with significant residual correlations. In Table 4, there are two more locally dependent pairs. Items in these pairs or triplets are (in part) like “asking the same question twice” (a common sense description of LD), so we will include only one item from the triplet, or from each pair, in each final item pool so that each pool comprises locally independent items.
The model in Table 4 includes a residual correlation between one of the anxiety items, “I felt worried,” and one of the depressive symptoms items, “I felt stressed.” Because those two items are ultimately destined for the distinct anxiety and depressive symptoms item pools, that residual correlation does not constitute a violation of the assumption of local independence within either scale, so that residual correlation is ignored.
We used the results of additional CFA analyses, not shown here, to examine the possibility that the factor structure for the anxiety and depressive symptoms might be different for younger children than it is for adolescents. We divided the sample approximately in half by age and estimated parameters for models like those shown in Tables 3 and 4, and found that there were no substantial differences between the parameter estimates for younger (aged 8–12) versus older (aged 13–17) children (data not shown).
Two items that were included in the anxiety sets for the CFA, “I felt afraid or scared” and “I worry about what will happen to me,” are actually legacy items (from the Pediatric Quality of Life Inventory™ (PedsQL™) Version 4.0 Generic Core Scales). Those items were included in the CFA for reference purposes and they were set aside before item calibration for the PROMIS scales. Then we calibrated the remaining items on forms 1 and 2 separately for the anxiety and depressive symptoms dimensions. To avoid deleterious effects of locally dependent item pairs on the item parameter estimates, we computed the item parameter estimates twice for each form, including only one member of each LD pair in the item set at a time. That produced two sets of item parameters for the non-locally dependent items; to reduce capitalization on chance, we selected the set with the lower slope (a) parameter from those pairs. The values of the item parameter estimates and the S-X² item fit statistics are shown in Tables 5 and 6. In those tables, the items that remain in the final item pools are sorted in order of decreasing discrimination (a), so the generally best indicators of anxiety and depressive symptoms are near the top of the tables.
For the anxiety items, after using the Benjamini–Hochberg correction for multiplicity, none of the items exhibited significant lack of fit as indicated by the S-X² statistic. Similarly, none of the gender DIF tests (also shown in Table 5) was significant when adjusted for multiplicity. So we set aside the three items listed near the bottom of Table 5; two of those were the less-discriminating items in pairs the CFA had indicated exhibited LD. The third item, “I felt relaxed,” was set aside because it was not nearly as discriminating as the other items, and, upon reflection, even reverse-scored as it is, is probably not a particularly specific indicator of anxiety. That left the 15-item set in the upper part of Table 5 as the pediatric anxiety item pool.
Similarly, for the depressive items, after using the Benjamini–Hochberg correction for multiplicity, none of the items exhibited significant lack of fit as indicated by the S-X² statistic. However, three items listed near the bottom of Table 6 exhibited significant DIF. Unsurprisingly, two of those items involved “crying,” which usually exhibits DIF between genders on depression scales [32–34]. The reason for DIF for the item “I felt so bad that I didn’t want to do anything” is perhaps not so clear; however, the item was set aside due to the magnitude of the effect size when depicted graphically. As shown in Table 6, three additional depressive symptoms items were set aside because they were the less-discriminating members of three LD pairs (as detected with residual correlations in the CFA). Finally, the item “I was bored” was set aside because it appeared to be a poor indicator of depressive symptoms. The remaining 14 items in the upper half of Table 6 are the pediatric PROMIS depressive symptoms item pool.
To ensure that the final item pools for anxiety and depressive symptoms were unidimensional, and to estimate the correlation between those two latent variables, two-factor simple-structure CFA models were fitted to the selected anxiety and depressive symptoms items on Forms 1 and 2. In each of these models, all of the selected anxiety items loaded on the anxiety factor, with zero loadings on the depressive symptoms factor, and all of the selected depressive symptoms items loaded on the depressive symptoms factor, with zero loadings on the anxiety factor. The goodness of fit of these models was satisfactory for both forms (for Form 1: χ²(53) = 247, CFI = 0.98, TLI = 0.97, RMSEA = 0.07; for Form 2: χ²(118) = 289, CFI = 0.99, TLI = 0.99, RMSEA = 0.04). The estimated correlations between the latent variables depressive symptoms and anxiety were 0.82 (SE = 0.02) for Form 1, and 0.85 (SE = 0.02) for Form 2. Additional separate one-factor analyses of the selected items on each scale for each form for boys and girls fit well, with no indication of further local dependence.
The upper panel of Fig. 1 shows the test information function for the anxiety item pool, and for the eight-item subset of the pool that is most informative near the middle of the distribution, and the lower panel of Fig. 1 shows the corresponding curves for the depressive symptoms item pool. (The x-axes in Fig. 1 are labeled using the T-score scale on which scores from PROMIS scales are reported, with a mean of 50 and a standard deviation of 10 for the reference population.) Test information is the inverse of the variance of measurement (the squared standard error of measurement), so a value of 6.67 for information corresponds with a standard error of measurement of approximately 0.4 standard units (for standardized test scores, or 4 for T-scores), and that in turn corresponds to a reliability coefficient of approximately 0.85. For anxiety and depressive symptoms, eight items are sufficient to provide measurement with that precision for the range of scores from the low 40s to nearly 80 on the T-score scale. Thus, for many purposes, we recommend using the short eight-item forms that are listed in the Appendix, with the corresponding summed score to scale score translation tables based on the IRT model.
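The arithmetic linking test information, standard error of measurement, and reliability in the preceding paragraph can be checked with a short Python sketch. The function names are hypothetical, and the reliability formula assumes a latent variable with unit variance in the reference population.

```python
import math

def sem_from_information(info):
    """Standard error of measurement in standard units: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(info)

def reliability_from_information(info):
    """Approximate reliability, assuming unit latent variance: 1 - 1 / information."""
    return 1.0 - 1.0 / info

def t_score(theta):
    """Convert a standard-unit score to the T-score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta

info = 6.67
sem = sem_from_information(info)                   # roughly 0.39 standard units
sem_t = 10.0 * sem                                 # roughly 3.9 T-score points
reliability = reliability_from_information(info)   # roughly 0.85
```

The same functions can be applied to any other information value read from Fig. 1 to obtain the corresponding precision on the T-score scale.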
For situations that require more precision of measurement, the complete item pools are in Tables 5 and 6, with the item parameters that can be used to compute IRT response pattern scores or the scale scores for summed scores for any other (larger or smaller) sets of the items, all on a comparable scale. In addition, the item pools are available from the Assessment Center at www.nihpromis.org.
To consider whether adaptive testing might be useful, we computed the test information curves for the most informative set of items from the anxiety and depressive symptoms item pools at T-scores of 30, 40, 50, 60, and 70. These curves are shown in the upper and lower panels, respectively, of Fig. 2. These curves basically answer the questions: “How much more information can be gained by choosing a different set of ‘best items’ for different score levels (adapting), given this item pool?” and “To what extent are different sets of items ‘best’ at different levels of anxiety or depressive symptoms?” The answer to both questions, as shown in Fig. 2, is “not very much.” The items measure anxiety and depressive symptoms well between T-scores in the low 40s and 80. Within that range, to a large extent the same items are most informative, and outside that range adapting, even to the extent of using all of the questions in the pool, adds little precision. Thus, we have not further evaluated the idea of using adaptive testing with these item pools; however, the pools and item parameters are available, and the PROMIS Assessment Center software has the capability of administering these items as a CAT if that is desired.
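The comparison described above, choosing the most informative items at a given T-score, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: it uses the logistic graded response model without a scaling constant, and the function names, the item labels, and the parameter values in the small item bank are all hypothetical rather than taken from Tables 5 and 6.

```python
import math

def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded-response item at trait level theta."""
    # Cumulative probabilities of category k or higher, with boundaries 1 and 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]  # category probability
        # Derivative of the category probability with respect to theta.
        d_k = a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        if p_k > 0.0:
            info += d_k ** 2 / p_k
    return info

def most_informative_items(item_bank, t, n_items):
    """Names of the n_items items with the highest information at T-score t."""
    theta = (t - 50.0) / 10.0  # T-score back to standard units
    ranked = sorted(
        item_bank,
        key=lambda name: grm_item_information(theta, *item_bank[name]),
        reverse=True,
    )
    return ranked[:n_items]

# Hypothetical item bank: name -> (slope, thresholds).
bank = {
    "item_A": (2.5, [-0.5, 0.3, 1.2, 2.0]),
    "item_B": (1.2, [-1.0, 0.0, 1.0, 2.0]),
    "item_C": (2.0, [0.0, 0.8, 1.6, 2.4]),
}
best_at_50 = most_informative_items(bank, t=50, n_items=2)
```

Summing the item information values over a selected set gives the test information for that set, so repeating the selection at T-scores of 30, 40, 50, 60, and 70 shows whether the best set changes across the score range.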
This study led to the development of new anxiety and depressive symptoms item banks for use in measuring pediatric PROs. After determining scale dimensionality, items with local dependence and DIF were next identified and removed resulting in final item banks with 15 anxiety items and 14 depressive symptom items, allowing a variety of possible approaches to scoring that can be tailored to meet the goals of the end user.
Several generic self-report HRQOL instruments exist for use in pediatric populations and most attempt to measure at least some aspect of emotional distress. The vast majority of the generic pediatric HRQOL measures for anxiety and depressive symptoms utilized classical test theory and few have taken advantage of IRT analysis in the scale development process. PROMIS psychometric analyses focused on determining the scale dimensionality and detecting sources of local dependence and considered final item selection using IRT analyses. Like PROMIS, two of these newer instruments, KIDSCREEN and PedsQL, utilized qualitative research methods for incorporating the child’s perspective during the development process [35, 36].
One major challenge prior to applying IRT models to the measurement of emotional distress is resolving issues of dimensionality. Conventional wisdom is that emotional distress scales are less likely to fit unidimensional models. Often items are sampled from multiple domains (e.g., mood, behavior, somatic symptoms) in order to capture a comprehensive set of latent construct indications. Hence, it is common to observe higher correlations within domains than is expected under the conditional independence assumption of unidimensional IRT models. One of the initial steps for this project was to develop multidimensional conceptual frameworks that were informed by previous empirical (e.g., factor analytic) and theoretical work as well as to determine the level of resolution at which unidimensional scales could be derived from the domains [2, 4–6]. Three constructs of emotional distress were conceptualized: depressive symptoms, anxiety, and anger. These results of unidimensionality are consistent with a recent meta-analysis and other published studies [40–45].
It remains a question for comparative validity studies to determine which of these scales might be most valid for any particular use. Both the KIDSCREEN Moods and Emotions scale (7 items) and the PedsQL Emotional Functioning scale (5 items) are shorter than the PROMIS depressive symptoms or anxiety 8-item short forms and provided less reliable measurement in this item calibration sample: Coefficient alpha for the KIDSCREEN Moods and Emotions scale in this sample was 0.83, and for the PedsQL Emotional Functioning scale, it was 0.74. While the PROMIS scales provide separate scores for depressive symptoms and anxiety, the PedsQL Emotional Functioning scale includes items that indicate depression, anxiety, and anger while the KIDSCREEN Moods and Emotions scale largely measures depressive symptoms, with one item that may indicate anxiety. It also remains a question for future validity studies to determine the usefulness of separate scores for depressive symptoms and anxiety: These two constructs are highly correlated; however, it may be that either one or both are responsive to any particular treatment or that they are affected separately or together by any particular condition. The separate scores of the PROMIS pediatric measures permit study of those questions.
Utilizing IRT analysis to identify final items ultimately offers more flexibility for future users of the item banks. This approach allows researchers the opportunity to select the most useful items for their study design. We proposed 8-item short forms; however, a smaller subset of items from the item bank can also be used and scored on the same metric as the larger set.
By administering the items spread over several test forms, we are unable to perform factor analyses across each entire bank. This limitation makes it impossible to ensure that items from different forms do not exhibit local dependence. Additionally, it is possible that factor analyses for each domain would turn out differently if the items were analyzed all together. Instead, factor analysis was conducted over the subgroups of items tested on each form. Because the items were created to fill content from qualitative work and then were randomly allocated to each test form, the different test forms can be viewed as replications. By having replicated factor analyses, our impressions of multidimensionality, when repeated across forms, increased our confidence in the factor-analytic results. We are currently performing cross-sectional testing using the entire item pools to verify these results.
The PROMIS pediatric item banks were developed to provide accurate and efficient assessment of important domains of HRQOL for children including anxiety and depressive symptoms. This sample provides initial calibrations of the PROMIS pediatric anxiety and depressive symptoms item banks and the creation of the corresponding PROMIS Pediatric instruments, version 1.0.
Listed below are the item stems for the recommended eight-item short forms for the PROMIS Pediatric Anxiety and Depressive Symptoms Scales. All items use a 7-day recall period (the preface is “In the past seven days”), and a 5-point response scale with the options never (0), almost never (1), sometimes (2), often (3) and almost always (4).
Anxiety short form:
I felt scared.
I worried about what could happen to me.
I felt worried.
I felt like something awful might happen.
I worried when I went to bed at night.
I thought about scary things.
I felt nervous.
I was afraid that I would make mistakes.
Depressive symptoms short form:
I felt like I couldn’t do anything right.
I felt everything in my life went wrong.
I felt unhappy.
I felt lonely.
I felt sad.
I felt alone.
I thought that my life was bad.
I could not stop feeling sad.
Summed score to scale score translation for these short forms is in Table 7.
Debra E. Irwin, Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Brian Stucky, Department of Psychology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Michelle M. Langer, National Board of Medical Examiners, Philadelphia, PA, USA.
David Thissen, Department of Psychology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
Esi Morgan DeWitt, Department of Pediatrics, Duke University Medical Center, Durham, NC, USA.
Jin-Shei Lai, Department of Medical Social Sciences, Northwestern University Feinberg School of Medicine, Chicago, IL, USA.
James W. Varni, Department of Pediatrics, College of Medicine, Department of Landscape Architecture and Urban Planning, College of Architecture, Texas A&M University, College Station, TX, USA.
Karin Yeatts, Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Darren A. DeWalt, Division of General Medicine and Clinical Epidemiology, Cecil G. Sheps Center for Health Services Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.