Families of intensive care unit (ICU) patients are at risk for depression, and are important targets for depression-reducing interventions. Multi-item scores for evaluating such interventions should meet criteria for unidimensionality and longitudinal measurement invariance. The Patient Health Questionnaire (PHQ), widely used for measuring depression severity, provides standard nine-, eight-, and two-item scores. However, published studies often report no (or weak) evidence of these scores' unidimensionality/invariance, and no tests have evaluated them as measures of depression severity in ICU patients' families.
To identify multi-item PHQ constructs with promise for evaluating change in depression severity among family members of critically ill patients.
Structural equation models with a rigorous fit criterion (χ2 P≥0.05) tested the standard nine-, eight-, and two-item PHQ, and other item subsets, for unidimensionality and longitudinal invariance, using data from a trial evaluating an intervention to reduce depressive symptoms in family members.
Neither the standard nine-item nor eight-item PHQ construct showed longitudinal invariance, although the standard two-item construct and other item subsets did.
The longer eight- and nine-item PHQ scores appear inappropriate for assessing depression severity in this population, with constructs based on smaller subsets of items being more promising targets for future trials. The CONSORT (Consolidated Standards of Reporting Trials) requirement for pre-specified trial outcomes is problematic because unidimensionality/invariance testing must occur after trial completion. CONSORT could be strengthened by endorsing rigorous assessment of composite scores and encouraging use of the most appropriate substitute, should trial-based evidence challenge the legitimacy of pre-specified multi-item scores.
Current guidelines for palliative care in intensive care units (ICU) urge family-centered approaches (1, 2). ICU patients' families face increased risk for depressive symptoms (3-6), and several studies have employed composite scores to measure families' depression severity (7-13). Measurement experts contend that to be legitimate, such scores must be unidimensional (14-16) and show measurement invariance for the groups or times being compared (17-20). That is, the component items must measure a single underlying construct consistently. To date, no such evidence has been provided for widely used measures of depression severity in ICU patients' families.
Although insufficiently tested scores are reported for both observational studies and trial evaluations, their use in trials may be partly attributable to the Consolidated Standards of Reporting Trials (CONSORT) guidelines, which require that outcomes be specified before the trial (21). Later modification is allowed if the researcher can supply adequate reason, but the standard provides no guidance regarding acceptable reasons. Nor does CONSORT require testing of composite scores for sample-specific appropriateness, with replacement using the best available substitute when testing fails. These CONSORT guidelines (and omissions) may result in trials reporting results based on inadequately tested outcome measures.
Although sample-specific testing is needed, evidence from one sample can indicate whether a score is likely to be unidimensional/invariant in similar future samples. This potential for informing future selection of depression severity outcomes motivated the current article. We looked specifically at the Patient Health Questionnaire (PHQ), an instrument developed as a clinical tool to screen primary care patients for major depressive disorder (MDD), with subsequent clinical evaluation required for actual diagnosis. Increasingly used in research evaluating the severity of depressive symptoms (22), it covers the nine diagnostic criteria for MDD from the Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV and DSM-5 (23, 24). Three sum-scores have been developed: the PHQ-9, covering all nine criteria; the PHQ-8, which omits the suicidal ideation item; and the PHQ-2, which includes only the items assessing anhedonia and depressed mood (22). All three have shown responsiveness in monitoring depression-related outcomes (22, 25).
Numerous articles assessing dimensionality/invariance of the PHQ have based their conclusions on exploratory factor analysis, a method that often produces models with poor fit to observed data (26). Other studies, based on more rigorous confirmatory factor analysis (CFA) techniques, have evaluated model fit with approximate-fit indices, a practice methodologists have deemed problematic (27-29). In addition to urging the use of stronger criteria for assessing the dimensionality/invariance of constructs, methodologists note the need to consider whether all item-combinations function equivalently for all purposes. For example, a particular intervention might be expected to influence a narrower definition of depression, measured by fewer items. A recent article recommended that researchers use only a few indicators for each construct, selecting one to three that best represent the latent variable relevant to a given investigation (30).
During a randomized trial of an intervention to reduce depressive symptoms in family members of ICU patients, we administered the nine-item PHQ three times: at study enrollment and three and six months later. The current report sought to answer three questions: 1) Did any of the standard PHQ composite scores meet criteria for unidimensionality and longitudinal measurement invariance in this sample? 2) Did other item subsets, defining slightly different depression severity constructs, meet these criteria? and 3) Did patient/family characteristics contribute to family members' depression severity?
We used data from a randomized trial testing an intervention to improve communication between clinicians and ICU patients' families (31, 32). Patients being treated in ICUs in two Seattle-area hospitals were eligible for inclusion if they were mechanically ventilated, with estimated hospital mortality ≥30% based on mortality prediction scales (33) and diagnoses (31). Family members of eligible patients received baseline and three- and six-month follow-up questionnaires. The pre-specified test of trial efficacy was an association of the intervention with change between baseline and the two follow-up periods in family members' depression severity, as assessed by the PHQ-9.
Each time-specific PHQ included nine items measuring the frequency of depressive symptoms in the previous two weeks (0=not at all, 1=several days, 2=more than half the days, 3=nearly every day). Questionnaires also documented respondent gender, age, race/ethnicity, education, and length/type of relationship with the patient. Medical records provided information about patient gender, age, race/ethnicity, hospital length-of-stay, and mortality status at hospital discharge. Study records provided the patient's randomization condition.
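As a minimal illustration of the item coding described above, the following sketch computes the three standard sum-scores from one set of nine responses. The item positions (item 1 = anhedonia, item 2 = depressed mood, item 9 = suicidal ideation) are assumed from the usual PHQ ordering, and the helper name `sum_score` and the example respondent are hypothetical. The analyses in this report evaluate latent constructs rather than these sums.

```python
from typing import Sequence

# Assumed 1-based item positions, following the usual PHQ ordering:
# item 1 = anhedonia, item 2 = depressed mood, item 9 = suicidal ideation.
PHQ2_ITEMS = (1, 2)
PHQ8_ITEMS = tuple(range(1, 9))   # items 1-8, omitting item 9
PHQ9_ITEMS = tuple(range(1, 10))  # all nine items


def sum_score(responses: Sequence[int], items: Sequence[int]) -> int:
    """Sum the 0-3 frequency codes for the given 1-based item numbers."""
    if len(responses) != 9:
        raise ValueError("expected nine item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("responses must be coded 0-3")
    return sum(responses[i - 1] for i in items)


# Hypothetical respondent (not study data).
example = [1, 2, 0, 3, 0, 1, 0, 0, 0]
phq2 = sum_score(example, PHQ2_ITEMS)
phq8 = sum_score(example, PHQ8_ITEMS)
phq9 = sum_score(example, PHQ9_ITEMS)
```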
We used CFA (34-38) to evaluate unidimensionality of the standard PHQ-9 and PHQ-8 items and all combinations of 4-7 items at baseline. Combinations of 2-3 indicators were not separately testable for unidimensionality, but were retained, along with the unidimensional baseline combinations, for later testing.
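The candidate item-combinations can be enumerated mechanically. The sketch below assumes, as the results indicate, that the suicidal ideation item (#9) was excluded from the subsets, leaving eight items to combine; the names `ITEMS` and `subsets` are illustrative only.

```python
from itertools import combinations

# Items 1-8; item 9 (suicidal ideation) is assumed excluded from the subsets.
ITEMS = range(1, 9)


def subsets(min_size: int, max_size: int):
    """Yield every item combination with between min_size and max_size items."""
    for k in range(min_size, max_size + 1):
        yield from combinations(ITEMS, k)


# Combinations of 4-7 items, testable for unidimensionality at baseline.
cfa_candidates = list(subsets(4, 7))
# Combinations of 2-3 items, retained for the longitudinal tests only.
small_sets = list(subsets(2, 3))
```

Under this assumption the enumeration gives 162 combinations of 4-7 items and 84 combinations of 2-3 items, matching the model counts reported in the results.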
Structural equation models (SEM) subsequently tested each retained item-combination for longitudinal measurement invariance. For latent constructs to be comparable over time, they should be measured by the same set of indicators at all time points, with each indicator carrying the same weight over time, thus providing time-invariant meaning to the construct. With ordinal items, invariant models have item loadings and category thresholds constrained to equality across time (39). We constructed each model with three underlying factors, representing depression severity at the three time points, measured by identical combinations of time-specific indicators with the required equality constraints. Our determination of longitudinal invariance required that a model, thus constrained, demonstrate adequate fit to the data. Each model also included structural effects leading from baseline depression severity to three-month severity, and from three-month severity to six-month severity. An additional direct link from baseline severity to six-month severity was never statistically significant and is omitted from models presented in the results.
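One way to write the constrained model out is sketched below. The notation is ours, not taken from the trial's analysis syntax: respondent i, item j, time point t = 1, 2, 3, latent response y*, loading λ, thresholds τ, latent depression severity η, and structural coefficients β are all symbols introduced here for illustration.

```latex
\begin{align}
% Measurement model for ordinal item j at time t (responses coded c = 0,...,3),
% with \tau_{jt,0} = -\infty and \tau_{jt,4} = +\infty
y^{*}_{ijt} &= \lambda_{jt}\,\eta_{it} + \varepsilon_{ijt}, &
y_{ijt} = c &\iff \tau_{jt,c} < y^{*}_{ijt} \le \tau_{jt,c+1} \\
% Longitudinal invariance: loadings and thresholds equal across time
\lambda_{j1} &= \lambda_{j2} = \lambda_{j3}, &
\tau_{j1,c} &= \tau_{j2,c} = \tau_{j3,c}, \quad c = 1, 2, 3 \\
% Structural part: baseline -> three months -> six months
\eta_{i2} &= \beta_{21}\,\eta_{i1} + \zeta_{i2}, &
\eta_{i3} &= \beta_{32}\,\eta_{i2} + \zeta_{i3}
\end{align}
```

In this notation, a model is judged longitudinally invariant when it fits adequately with the equality constraints in the second line imposed.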
We evaluated additional evidence of departures from unidimensionality/invariance via Rasch analyses, based on Rasch-Masters Partial Credit models (40). This involved identifying items with disordered category thresholds (the latent construct's average value at an indicator threshold being greater than its average at the next higher threshold), as well as items that exhibited time-related differential item functioning (DIF).
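The ordering check for category thresholds can be sketched as follows. The function takes one item's threshold values in category order, however they were estimated, and reports each adjacent pair in which the lower threshold exceeds the next higher one. The function name and the example values are hypothetical; this is not the Winsteps procedure itself.

```python
from typing import List, Sequence, Tuple


def disordered_pairs(thresholds: Sequence[float]) -> List[Tuple[int, int]]:
    """Return 0-based index pairs (k, k+1) where threshold k exceeds threshold k+1."""
    return [
        (k, k + 1)
        for k in range(len(thresholds) - 1)
        if thresholds[k] > thresholds[k + 1]
    ]


# Hypothetical values for a four-category item (three thresholds):
# here the top two thresholds are out of order.
flagged = disordered_pairs([-1.2, 0.9, 0.4])
ordered = disordered_pairs([-1.0, 0.0, 1.0])
```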
We tested patient/family contributors to depression severity (measured with two items constituting the standard PHQ-2) with path models that included exogenous predictors of depression severity at the three time points. We hypothesized that any of the following might contribute to baseline depression severity: patient gender, age, race; respondent gender, age, race, education, length and type of relationship to patient. We further hypothesized that any of these variables, plus the patient's hospital length-of-stay, mortality status at hospital discharge, and randomization condition, might have independent effects on depression severity at follow-up. We began with a model that included all potential predictors of baseline depression severity, removing non-significant predictors in a reverse stepwise procedure until only predictors with P≤0.20 remained. We then added all potential predictors of three-month severity, and then of six-month severity, following the same procedure for removal of predictors with the highest P-values. Finally, using a stepwise procedure, we removed all remaining predictors having P≥0.05.
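The removal loop at the core of this procedure can be sketched generically. The sketch assumes a caller-supplied function `pvalues_for` that refits the model on the current predictor set and returns a P-value per predictor; that function, the predictor names, and the toy P-values are all hypothetical, and the staged addition of three-month and six-month predictors is not shown.

```python
from typing import Callable, Dict, List, Sequence


def backward_eliminate(
    predictors: Sequence[str],
    pvalues_for: Callable[[List[str]], Dict[str, float]],
    should_drop: Callable[[float], bool],
) -> List[str]:
    """Repeatedly refit and drop the predictor with the highest P-value
    for as long as that P-value satisfies should_drop."""
    kept = list(predictors)
    while kept:
        pvals = pvalues_for(kept)  # refit on the current predictor set
        worst = max(kept, key=lambda name: pvals[name])
        if not should_drop(pvals[worst]):
            break
        kept.remove(worst)
    return kept


# Toy stand-in for model fitting: fixed, made-up P-values that do not
# change on refit (a real refit would generally change them).
toy = {"a": 0.01, "b": 0.053, "c": 0.62, "d": 0.18}


def toy_fit(names: List[str]) -> Dict[str, float]:
    return {name: toy[name] for name in names}


# First pass keeps predictors with P <= 0.20; final pass removes P >= 0.05.
stage1 = backward_eliminate(toy, toy_fit, lambda p: p > 0.20)
final = backward_eliminate(stage1, toy_fit, lambda p: p >= 0.05)
```

Passing the drop rule as a predicate keeps the two cut-offs (retain P≤0.20, then remove P≥0.05) explicit about which side of the boundary is inclusive.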
We based all CFA/SEM analyses on complex single-group models, with family members clustered under patients, using a sample having complete data on all variables in the model. We defined PHQ items as ordered categorical variables and used robust weighted least squares (WLSMV) estimation. We evaluated model fit with the χ2 test of fit, rejecting all models with P<0.05. Although significant χ2 values are possible with only trivial misfit when samples are large, our sample was small enough to be relatively immune to this problem. We report unstandardized coefficients, with estimates for the indicator-loadings representing probit regression coefficients (41). We used SPSS 19.0.0 (42) for data management, Mplus 7.3 (43) for SEM analysis, and Winsteps 3.81.0 (44) for Rasch analysis.
We enrolled 232 family members of 149 critically ill patients, with 193 family members (131 patients) providing sufficient data to be included in one or more analyses for the current study. Patient and family characteristics are shown in Table 1. Family members' responses to the questions about depressive symptoms (Table 2) indicated relatively low symptom frequency at all assessments (Table 3).
Test of the PHQ-9 baseline model showed significant misfit (χ2 P=0.001). Three items were problematic: item #9 (suicidal ideation), an empirical dichotomy in this dataset (99% of all respondents indicating no problem, and all remaining respondents indicating “several days”); and #6 (low self-worth) and #7 (trouble concentrating), both of which had the top two category thresholds disordered, per Rasch analysis. The PHQ-8, omitting the suicide item, showed only a modest improvement in fit at baseline (χ2 P=0.005).
Of 162 baseline models containing 4-7 items (and excluding suicide item #9), 83 passed the baseline unidimensionality test, with 67 of these including the anhedonia and/or depressed mood indicator. We considered models that included neither anhedonia nor depressed mood to be suspect as models of depression severity, as the remaining symptom combinations could reflect conditions other than depression.
Longitudinal measurement invariance tests involved 167 models: 83 models that passed the baseline unidimensionality test and 84 models based on 2-3 indicators. Of the 167 models, 42 (including the standard PHQ-2) resulted in χ2 P≥0.05, with 34 containing the anhedonia and/or depressed mood indicator (test results in Table 4; syntax used to test PHQ-2 in Table 5, available at jpsmjournal.com). Although the 34 models were acceptable on both empirical and theoretical grounds, most included at least one item (#3, #5, or #8) with ambiguous meaning (Table 2), rendering the construct similarly ambiguous. Most of the models based on three or more indicators included the psychomotor disturbance indicator (#8), which Rasch analysis suggested was the most serious of the symptoms.
Eight additional models met the χ2 criterion but did not include either anhedonia or depressed mood. They included various combinations of sleep, energy, eating, and psychomotor disturbances that could be attributable to physical illness, anxiety, or other conditions unrelated to depression.
None of the models that met the criterion for longitudinal measurement invariance included item #6 (low self-worth). Evaluation of models containing this item showed that it exhibited DIF: low self-worth being reported at baseline primarily by respondents with high values on the depression severity construct, but at follow-up points by respondents with lower values (i.e., low self-worth was more symptomatic of the construct at baseline than at follow-up, when it frequently reflected other underlying issues). When this item was included as a depressive symptom, slightly different “varieties” of the construct were measured at baseline than at follow-up. Item #7 (trouble concentrating) also exhibited DIF, concentration problems being frequently reported at baseline by respondents with relatively low depression severity, but at follow-up primarily by respondents with high severity levels. Concentration problems, thus, were more indicative of the construct at follow-up than at baseline.
Of 31 models that showed significant departure from longitudinal measurement invariance, and that excluded items #6-#7, none provided evidence of DIF. However, 24 produced evidence suggesting that the indicators did not reflect any unidimensional construct at all three time points, much less the same construct at all time points.
We investigated the association of patient/family characteristics with the depression severity construct measured with the standard PHQ-2. Of known characteristics, only patient age predicted depression severity at baseline – family members of older patients reporting less severe symptoms (Fig. 1). Although female respondents endorsed more depressive symptoms than male respondents, the association was just short of statistical significance (P = 0.053). Baseline depression severity was a significant predictor of three-month severity. In addition, there were significant independent effects of the respondent's relationship to the patient (higher severity when the family member was the patient's spouse/partner) and the patient's mortality status at hospital discharge (higher severity when the patient had died). Depression severity at three months carried over significantly into the six-month period, but there were no other significant predictors of six-month severity, nor was there a significant direct effect of baseline severity on six-month severity. Significant unexplained variance in depression severity was present at all three time points (labeled “D” in Fig. 1), with the unexplained amount decreasing over time.
In both clinical and research settings, the PHQ is commonly used to measure depression severity via standard summated scoring of the items. Our analyses suggest that neither the eight- nor the nine-item score appropriately represents depression severity for family members of ICU patients. Neither represented a unidimensional construct at baseline and neither had consistent meaning over time.
We identified numerous subsets of items, including one based on the standard PHQ-2, that showed longitudinal measurement invariance among family respondents. This demonstrates that, at least in our sample, using a strict fit criterion did not prevent identification of empirically appropriate models. There is no guarantee that any of these models would provide acceptable fit to other family-member samples, nor would all of the constructs have equal theoretical appeal for specific studies. Identification of the best indicator-set involves both empirical assessment of fit and consideration of underlying theory. For example, the best latent construct for evaluating an intervention is the construct that most precisely matches the features hypothesized to be amenable to change by the intervention. We believe it is important for researchers to evaluate both model fit and theory in selecting an outcome, rather than automatically employing an “industry standard.”
Our sample exhibited relatively low levels of depressive symptoms, however measured. Several items were particularly problematic. Suicidal ideation was rarely endorsed. Researchers evaluating the PHQ-9 in a population-based sample of older adults in Germany also noted problems with this item, reporting its low reliability and suggesting that suicidality may be only loosely related to depression (45). A group studying psychiatric genetics contended that suicidal behavior is more appropriately regarded as an independent clinical entity than as a symptom of major psychiatric disorders (46). As an indicator of depression severity, low self-worth was stronger at baseline than at follow-up. Difficulty concentrating was stronger at follow-up than at baseline, when fatigue, worry, and uncertainty may reduce the ability to concentrate.
The fact that the models that were longitudinally invariant and theoretically tenable in our sample comprised relatively small sets of items accords well with the call by SEM methodologists for the use of small sets of indicators that most precisely capture the construct of interest (30). All models with P>0.30 contained two indicators.
This study's limitations are its small sample size and lack of geographic dispersion. These limit the extent to which the observations can be confidently generalized to other populations of family members in similar circumstances. The study also does not address whether it is appropriate to use sum-scores, rather than latent variables, as research outcomes.
Although we have abbreviated the construct of interest as “depression severity,” this is not meant to imply a clinical diagnosis, but rather the severity of a constellation of depression-related symptoms. Our objective was not to define a “best measure” for tracking depression severity in ICU patients' families nor to specify the form an ideal measure would take, but rather to provide preliminary evidence of depression severity constructs that might prove useful in similar samples, pending sample-specific tests of appropriateness. We believe our results raise a general question related to using pre-specified composite outcomes in evaluating randomized trials, in the absence of trial-based evidence supporting the composites. CONSORT guidelines (21) permit changing an outcome measure after commencement of a trial if the change is appropriately justified, but provide no guidance regarding what constitutes a justifiable basis. We believe the guidelines could be strengthened if they encouraged assessment of composite scores, and recommended employing the strongest and most appropriate alternative measure, should trial-based evidence challenge a pre-specified multi-item score.
The randomized trial providing data for this report was funded by the NIH/NINR (R01-NR05226), which had no role in study design; data collection, analysis, or interpretation; writing of this report; or the decision to submit it for publication.
Disclosures: The authors declare no conflicts of interest.