A recent special issue of the Journal of Pediatric Psychology included papers focused on evidence-based assessment across several broad domains of assessment in pediatric psychology (e.g., adherence, pediatric pain, and quality of life). In one of these papers, Holmbeck et al. (2008) reviewed strengths and limitations of existing measures of psychosocial adjustment and psychopathology, concluding that many measures lacked supporting psychometric data (e.g., basic indices of reliability and validity) that would permit a complete evaluation of these measures. Given that measure development and validation papers are frequently published in JPP (Brown, 2007), it is important that we attend to guiding psychometric principles when developing and disseminating data on new measures to be employed with pediatric populations (Nunnally & Bernstein, 1994). Thus, the purpose of this paper is to present and describe a checklist for authors to use when submitting measure development papers to JPP. This checklist is included in the Appendix and is also included at the following link on the JPP website:
Findings presented by Holmbeck et al. (2008) indicated that 34 of the 37 measures reviewed met pre-established “evidence-based assessment” (EBA) criteria for “well-established” measures (Cohen et al., 2008). To be considered “well-established,” a measure had to have been presented in at least two peer-reviewed journal articles by different investigatory teams, have demonstrated adequate levels of reliability and validity, and be accompanied by supporting information (e.g., a measure manual). Although most measures that we reviewed met these criteria, we also found that most of the 34 “well-established” measures were hampered by at least one major psychometric flaw and/or lacked important psychometric data. We concluded that a more fine-grained EBA classification system is needed.
One important distinction in this literature relates to differences between empirically supported assessment and evidence-based assessment. This type of distinction was first discussed in the literature on clinical interventions (e.g., Spring, 2007). An empirically supported intervention is one that has demonstrated efficacy in randomized clinical trials or clinic-based effectiveness trials. An evidence-based intervention has empirical support in the manner just described, but also “integrates research evidence, clinical expertise, and patient preferences and characteristics … empirically-supported treatments (ESTs) are an important component of evidence-based practice (EBP), but EBP cannot be reduced to ESTs” (Spring, 2007, p. 611). Applying these terms to the field of assessment and measure development efforts, an empirically supported assessment measure would be one that demonstrates satisfactory psychometric characteristics, broadly defined. To be evidence based, the measure should also demonstrate utility in clinical settings, be useful in making diagnoses, be sensitive to treatment effects, and/or provide incremental validity above and beyond other similar measures. Although papers in the special issue of JPP frequently referred to “evidence-based assessment” (Cohen et al., 2008), the articles included in the issue tended to evaluate the degree to which the measures were empirically supported rather than evidence based. To be “evidence-based,” our reviews would have needed to integrate an evaluation of clinical utility, diagnostic utility, and treatment sensitivity with the empirical psychometric data presented in each review. As noted, the published reviews were more likely to focus on the psychometric data than on clinical utility, diagnostic utility, and treatment sensitivity.
As suggested by Mash and Hunsley (2005), detailed EBA profiles would provide a complete evaluation of evidence across each of several psychometric and clinically relevant dimensions, including: (a) internal consistency, (b) test–retest reliability, (c) the availability of normative data, (d) content validity, (e) construct validity, (f) convergent and discriminant validity, (g) criterion-related validity, (h) incremental validity, (i) clinical utility, (j) diagnostic utility, and (k) treatment sensitivity. The focus on incremental validity and clinical and diagnostic utility raises the bar from a focus on “empirical support” (i.e., where the focus would tend to be primarily on psychometric data) to a broad focus on the “evidence base” for a measure. In developing the checklist that is the focus of this article, we attempted to provide a list of criteria relevant to establishing the evidence base (and not just empirical support) for a measure. In addition to shifting the focus from providing “empirical support” for a measure to providing an “evidence base” for our instruments, a checklist for measure development papers would permit JPP reviewers to evaluate such papers in the same way that reviewers of randomized clinical trials make use of the Consolidated Standards of Reporting Trials (CONSORT) checklist and flowchart (Altman et al., 2001). The CONSORT checklist contains reporting standards covering both the methodological features of clinical trials and the manner in which their results are reported. Moreover, authors are required to provide a flowchart that describes details of sample recruitment and attrition during the course of the study.
Thus, a checklist for measure development papers would serve two interrelated purposes: (a) it would provide guidance to authors as they embark on the measure development process and would provide a list of criteria authors can use as they develop an evidence base for their measures, and (b) it would begin to standardize the manner in which psychometric and other assessment-related data are presented in measure development papers for this journal. Before providing a more detailed overview of the checklist, it is important to note that this checklist is rather exhaustive (see Appendix). As such, it represents what would “ideally” be expected for a measure development or validation manuscript rather than minimal criteria for such papers. No one paper can provide a complete evaluation of all important psychometric and clinically relevant dimensions that will establish once-and-for-all the evidence base for a measure.
Instrument refinement is part of a measure development process that gradually builds an evidence base for a scale (see Smith & McCarthy, 1995, for suggestions on measure refinement). Indeed, the validation of any measure is a cumulative process that occurs across many different types of research studies and across research programs.
As can be seen in the Appendix, the first, and perhaps most important, criterion focuses on the degree to which the author has established a scientific need for the instrument. This is a fundamentally important criterion that should be addressed in all measure development papers. How does this measure make a contribution to the literature and/or clinical practice above and beyond other previously developed measures? How will the measure be used and by whom?
With respect to the scientific necessity of the measure, it is worth discussing one type of manuscript that is often submitted to this journal. Many authors seek to employ a given measure with a new population that differs from the population that was the basis for the original measure development research. Given that the number of “new populations” to which a measure can be applied is infinite, these types of papers benefit greatly from a clearly articulated rationale for why it is of interest to employ the measure with this particular “new population.” For example, one might discuss how the construct of interest is relevant to this population and whether there are important differences in how the construct would be perceived in this population as compared to how it would be perceived in other populations. Simply stating that this construct has never been assessed in a given population is not a sufficient justification for applying a measure to a new population.
Once the author has determined that there is a need for this measure either for research and/or clinical purposes, one typically attends to issues of content validity prior to actually developing the measure or generating items (Haynes, Nelson, & Blaine, 1999; Haynes, Richard, & Kubany, 1995). Although it is often tempting to begin developing a measure based on one's own knowledge of the construct of interest or the urgent need to develop a measure for use in a larger research project, the “content validity” stage is one of the most important parts of the measure development process. Indeed, we often receive submitted manuscripts where it is clear that items were generated by a research team that was not necessarily made up of experts with respect to the construct of interest. It is important to take the time to properly define your construct and specify dimensions that underlie the construct. Item generation can be based on a variety of factors and strategies (see Appendix), including a review of the larger research literature and consultation with experts and relevant target populations. During the item generation stage, it is important to maintain roughly equivalent numbers of items across dimensions and to generate more items than are necessary so that items that do not function in an appropriate manner psychometrically can be eliminated at later stages of the measure development process (Nunnally & Bernstein, 1994). If one starts with too few items, one may end up with small subscales of questionable psychometric quality. The reading level of all items should be assessed and measure instructions and the response scale need to be developed (see Clark & Watson, 1995, and Comrey, 1988, for information on how to generate appropriate items). Additionally, it is important to determine whether the measure would be appropriate across multiple developmental levels and for different ethnic groups (Frick, 2000). 
After an item pool has been generated, it is useful to again consult with experts and/or members of the target population to assess for item relevance and wording ambiguities.
Once content validity has been “built in” to the measure during this important initial stage of measure development, the investigator can begin to gather data on reliability indices and conduct item analyses (see Appendix for details). Problematic items can be dropped during this stage and the hypothesized dimensions of the construct can be evaluated via confirmatory factor analyses (which may result in more problematic items being dropped from the scale). If one employs an exploratory factor analysis, several difficulties often emerge. One may retain too many factors, thus yielding subscales with small numbers of items or one may “accept” an unsatisfactory solution where many items load significantly on more than one subscale.
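The reliability and item-analysis step can be illustrated with a short sketch. The code below computes coefficient alpha and corrected item-total correlations, two commonly reported indices; it is one possible illustration rather than a procedure prescribed by the checklist. The function names and the `responses` data are hypothetical, invented for this example.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(rows):
    """Coefficient alpha; `rows` holds one list of item scores per respondent."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    totals = [sum(row) for row in rows]
    result = []
    for col in zip(*rows):
        rest = [t - v for t, v in zip(totals, col)]
        result.append(pearson(list(col), rest))
    return result


# Hypothetical data: six respondents answering four 5-point items.
responses = [
    [4, 5, 4, 2],
    [2, 2, 3, 4],
    [5, 4, 5, 1],
    [1, 2, 1, 3],
    [3, 3, 4, 5],
    [4, 4, 4, 2],
]

print("alpha, all items:", round(cronbach_alpha(responses), 2))
print("item-total r:", [round(r, 2) for r in corrected_item_total(responses)])

# Re-estimate alpha after dropping the fourth item.
trimmed = [row[:3] for row in responses]
print("alpha, item 4 dropped:", round(cronbach_alpha(trimmed), 2))
```

In this made-up data the fourth item correlates negatively with the rest of the scale, and alpha rises once it is removed, which is the pattern that would flag an item as a candidate for deletion. A real item analysis would rest on a far larger sample and would weigh content coverage alongside these statistics.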
After adequate reliability and factorial integrity have been established, the investigator can develop a plan to test the validity of the measure. This process may involve multiple studies with different types of validation samples (e.g., one may want to compare scores on the measure across clinical and nonclinical samples). The measure should exhibit high correlations with other measures that tap similar constructs and it should be less highly correlated with measures that assess different constructs. Moreover, one can expect that scores on the measure will be associated with other behaviors assessed concurrently or prospectively. A difficulty that often arises at this stage of the measure development process is that the researcher will employ self-report methods for both the measure of interest and the validity indices, making it impossible to rule out common method variance interpretations for the findings. In other words, such a study has limited potential to evaluate the validity of a measure.
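The expected pattern of convergent and discriminant correlations can be sketched as follows. This is a minimal illustration only: the variable names and scores are hypothetical, and the convergent measure is assumed to come from a different informant (a parent report) so that shared self-report method variance is less of a concern.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def validity_correlations(new_scores, comparison_measures):
    """Correlate the new measure with each named comparison measure."""
    return {
        name: pearson(new_scores, scores)
        for name, scores in comparison_measures.items()
    }


# Hypothetical scores for eight children (illustrative values only).
new_measure = [10, 14, 9, 20, 16, 12, 18, 7]        # child self-report
comparisons = {
    # Similar construct, different informant: expected to correlate highly.
    "parent_report_similar": [11, 15, 8, 19, 17, 11, 16, 9],
    # Different construct: expected to correlate weakly.
    "unrelated_construct": [5, 3, 6, 4, 5, 3, 6, 4],
}

for name, r in validity_correlations(new_measure, comparisons).items():
    print(f"{name}: r = {r:.2f}")
```

Support for validity comes from the pattern across the two coefficients, a high correlation with the similar construct and a clearly lower one with the dissimilar construct, rather than from either value alone.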
Finally, the investigator may be interested in documenting the utility of the measure in assessing responsiveness to treatment or in making diagnostic decisions. Measuring responsiveness to treatment requires additional considerations, such as whether the scale can be administered repeatedly, whether it is sensitive to change over time, and how much change reflects meaningful differences in a person's functioning (Kazdin, 2005). One may also be interested in assessing the degree to which the measure is predictive of outcomes above and beyond other existing measures and the degree to which the measure is cost effective in a clinical setting (i.e., is the information provided worth the time allotted for its administration and scoring?). If relevant, one may also be interested in employing appropriate procedures for translating the measure into languages other than English (see Appendix for details regarding the translation process).
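One common way to quantify prediction "above and beyond" an existing measure is the gain in explained variance (the change in R-squared) when the new measure is added to a regression that already contains the existing one. The sketch below uses the closed-form expression for R-squared with two predictors; it is offered as an assumption about how such a test might be set up, not as the approach required by the checklist, and all names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def incremental_r2(outcome, existing, new):
    """Return (R2 with existing measure only, R2 with both, gain in R2)."""
    r_y1 = pearson(outcome, existing)
    r_y2 = pearson(outcome, new)
    r_12 = pearson(existing, new)
    if 1 - r_12 ** 2 < 1e-12:
        raise ValueError("predictors are collinear; the gain is undefined")
    r2_base = r_y1 ** 2
    # Closed-form R-squared for a regression with two predictors.
    r2_full = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return r2_base, r2_full, r2_full - r2_base


# Hypothetical data for eight participants (illustrative values only).
outcome = [12, 15, 9, 20, 14, 11, 18, 8]
existing_measure = [10, 13, 11, 17, 12, 9, 15, 10]
new_measure = [6, 7, 3, 9, 8, 6, 8, 2]

base, full, gain = incremental_r2(outcome, existing_measure, new_measure)
print(f"R2 existing only: {base:.2f}, R2 both: {full:.2f}, gain: {gain:.2f}")
```

A gain close to zero would suggest that the new measure adds little beyond what the existing one already captures. In practice the gain would be tested for statistical significance in an adequately sized sample and weighed against the time required to administer and score the measure.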
In this paper, we have described a checklist for those who seek to submit measure development and measure validation papers to JPP. In attempting to promote evidence-based assessment, we have highlighted the importance of attending to treatment sensitivity and diagnostic and clinical utility when developing a new measure. We hope that the checklist provided will be useful not only for authors but for reviewers as well. Again, we note that this checklist represents what would ideally be expected for measure development over multiple studies rather than minimal criteria for a single study.
Funding: A research grant from the National Institute of Child Health and Human Development (R01-HD048629).
Conflicts of interest: None declared.