Research on health disparities and determinants of health disparities among ethnic minorities and vulnerable older populations necessitates use of self-report measures. Most established instruments were developed on mainstream populations and may need adaptation for research with diverse populations. Although information is increasingly available on problems with using these measures in diverse groups, there is little guidance on how to modify the measures. We provide a framework of issues to consider when modifying measures for diverse populations.
We describe reasons for considering modifications, the types of information that can be used as a basis for making modifications, and the types of modifications researchers have made. We recommend testing modified measures to ensure they are appropriate, and we suggest how to report modifications in publications that use the measures.
These issues open a dialogue about appropriate guidelines for researchers adapting measures in studies of ethnically diverse populations.
Eliminating health disparities among ethnic minorities and vulnerable older populations is a national priority (Smedley, Stith, & Nelson, 2003; U.S. Department of Health and Human Services, 2000). To accomplish this goal, it is essential to understand factors that contribute to these disparities (National Institute on Aging, 2011). Research on health disparities and their determinants among ethnic minorities and vulnerable older populations necessitates use of self-report measures. However, many widely used measures were developed and tested mainly on white, young and middle-aged, well-educated samples. These measures may have limitations when used with minority or lower-socioeconomic status (SES) older adults in health disparities and minority aging research.
A substantial amount of research has attempted to address these issues over the past 15 years, including systematic efforts by the Resource Centers for Minority Aging Research (RCMARs) (Stahl & Hahn, 2006); see also the preface for an overview of RCMAR contributions to measurement in diverse older populations (Teresi, Stewart, & Stahl, 2012). There are published guidelines on methods for examining the conceptual and psychometric adequacy of measures in ethnically diverse populations (Collins, 2003; Hahn & Cella, 2003; Johnson, 2006; A.M. Nápoles-Springer, Santoyo-Olsson, O'Brien, & Stewart, 2006; A.M. Nápoles-Springer & Stewart, 2006; A.L. Stewart & Nápoles-Springer, 2003; Teresi, Stewart, Morales, & Stahl, 2006). There are also systematic reviews examining the conceptual or psychometric adequacy of particular concepts and measures in diverse populations (Coates & Monteilh, 1997; A.L. Stewart & Nápoles-Springer, 2000), including older adults (Mui, Burnette, & Chen, 2001; Mutran, Reed, & Sudha, 2001).
However, when problems are found with measures in diverse population groups, there is little guidance on what to do next. Once a measure has been determined to be inappropriate for minority or lower-SES participants, researchers have three options. One is to use the measure “as is” without modification and to articulate the limitations. This follows the tenet of administering measures as published, which some believe preserves score reliability and validity (Juniper, 2009). However, if a measure is not suitable for a population, scientific inferences derived from it may be compromised.
A second option is to create a new measure de novo. Developing a new measure is fraught with challenges: it requires expertise, time, and resources that may not be available to most health disparities researchers. Methods for developing culturally sensitive new measures use a mixed-methods approach that includes concept development, item writing, pretesting, revision, field testing, and psychometric analysis to derive final measures. Although there are no general guidelines for developing new measures in minority populations, several publications provide detailed examples of these steps (Jackson, 1996; Krause, 2006; A. L. Stewart, Napoles-Springer, Gregorich, & Santoyo-Olsson, 2007).
The third option is to modify or adapt an existing measure. This requires a delicate balance between retaining the strengths of an existing measure, which may have undergone extensive development and testing but is clearly problematic in the new population, and making modifications that may or may not work. Furthermore, there are very few practical guidelines on how to make such modifications. In multi-national research, measures developed in English must be translated for use in non-English speaking countries. In guidelines for multi-national studies, adaptations are an integral part of the translation process; for example, items may be modified to achieve “semantic” equivalence (Aaronson et al., 1992; Beaton, Bombardier, Guillemin, & Ferraz, 2000; Bullinger et al., 1998). However, these guidelines give little attention to issues in modifying measures when no language translation is needed.
This paper attempts to address this gap by providing a framework for modifying measures to improve their reliability and validity in health disparities research involving ethnically and racially diverse populations. The goal of such modifications is to increase the likelihood that, in a new population group, the modified measure has meaning, reliability, and validity comparable to those of the original measure. The issues raised here pertain primarily to addressing differences from mainstream populations (on which original measures were tested) in socioeconomic status, race/ethnicity, language, and literacy. Modifications based on these types of group differences are relevant to health disparities studies of adults of all ages. Occasionally, modifications are made to adapt a measure specifically to be more appropriate for older adults (Thiamwong, Stewart, & Warahut, 2009; Vanderplas & Vanderplas, 1981). We describe three issues: 1) reasons for considering modifications, that is, why a modification would be needed; 2) the basis for modifications, that is, information that can be used to make the modifications; and 3) possible types of modifications. We conclude with recommendations for assessing modified measures and for reporting results of these assessments.
There are a number of reasons why investigators might consider modifying a measure. In health disparities research, the most common reason is that the population group(s) being studied differ substantially from the one in which the original measure was developed. Essentially, the motivation for modifying measures is the concern that racial/ethnic or generational differences might adversely affect the meaning, reliability, or validity of the original measure (information on these basic measurement concepts is available elsewhere, e.g., McDowell & Newell, 2006; Nunnally & Bernstein, 1994). Some key reasons why measures developed in one group may not be suitable include: 1) a concept or dimension may be missing from a measure; 2) the meaning of the concepts or items may differ; 3) items/phrases may not be interpreted as intended; 4) the process of responding to the questions may be complex or difficult; and 5) the study context or mode of administration may differ from the original. Any of these could result in findings that measures do not meet minimal psychometric criteria in these new groups or that the measures do not demonstrate invariance across groups (Gregorich, 2006; A.L. Stewart & Nápoles-Springer, 2000; A.L. Stewart & Nápoles-Springer, 2003).
Many concepts used in health disparities research are complex and multidimensional. As researchers delve into existing measures, it often is apparent that key dimensions relevant to a diverse group may be missing. For example, in focus groups exploring social support concepts for older Chinese and Korean immigrants, getting help with English in conducting business or in health care visits and aspects of financial support were identified as important but were not included in standard measures of social support (Wong, Yoo, & Stewart, 2005).
The meaning of the existing concepts that are the target of a measure may differ by race or ethnicity or by generation. Published measures reflect the population, place, and time in which they were developed. To the extent that a new population group may define or perceive the concept differently, the original measure may lack content validity for the new group. One example is the use of food frequency questionnaires to assess intake of foods and nutrients. The validity of food frequency instruments depends partly on the list of foods; thus, using a food frequency questionnaire developed for a general population in a study of a minority population may underestimate intake unless the questionnaire is adapted for the new population (Tucker, Bianchi, Maras, & Bermudez, 1998). Investigators aiming to determine energy and nutrient sources of older Hispanics modified a questionnaire developed by the National Cancer Institute to add southwestern regional foods such as chile rellenos and tamales (Mayer-Davis et al., 1999). Without these modifications, researchers would not have learned that chile sauces were major sources of vitamins A and C for Hispanic elderly.
A measure might include terms and phrases that may be misinterpreted because of unfamiliar language, idioms, or colloquialisms. Terms may be so general that they generate a variety of interpretations, and words may be confusing. Unfamiliar terminology is more likely when working with low-literate or non-native English-speaking populations. For example, several items in the Beck Depression Inventory (Beck, Steer, & Garbin, 1988) were misunderstood in a sample of older and low-literacy patients, leading to changes to make the items clearer (Sentell & Ratcliff-Baird, 2003). In a pretest of diabetes knowledge in low-literate Spanish-speaking patients with diabetes, the concept “blood sugar drop” was not understood; similarly, in the same study, the phrase “bothered by” in the CES-D item “I was bothered by things that usually don’t bother me” was interpreted as referring to a physical complaint (Rosal, Carbone, & Goins, 2003).
Systematic differences may exist across racial/ethnic groups in the way they respond to survey question items. For example, Blacks and Latinos are more likely than whites to choose the extreme response options rather than middle options of Likert response scales (Bachman & O'Malley, 1984; Hui & Triandis, 1989; Warnecke et al., 1997). In contrast, Asian Americans appear to be more likely than whites to avoid extremely positive options in surveys of health services (Murray-Garcia, Selby, Schmittdiel, Grumbach, & Quesenberry, 2000; Ngo-Metzger, Legedza, & Phillips, 2004; Taira et al., 1997).
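One simple way to quantify the response-style differences described above is an extreme response style (ERS) index: the share of a respondent’s answered items that fall at either endpoint of a Likert scale. The sketch below is a minimal illustration in Python, not the method used in the studies cited; the function names, the default 1 to 5 coding, and the example data are hypothetical. Comparing mean ERS across groups can suggest whether a group difference in scores might partly reflect response style rather than the construct itself.

```python
from typing import Optional, Sequence


def extreme_response_index(
    responses: Sequence[Optional[int]], low: int = 1, high: int = 5
) -> float:
    """Share of a respondent's answered items coded at either scale endpoint.

    `low` and `high` are the endpoint codes (a 1-5 scale is assumed by default).
    Items coded None (skipped) are excluded from the denominator.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("respondent answered no items")
    return sum(r in (low, high) for r in answered) / len(answered)


def mean_index_by_group(groups: dict, low: int = 1, high: int = 5) -> dict:
    """Average ERS index per group.

    `groups` maps a group label to a list of respondents, where each
    respondent is a list of item codes.
    """
    return {
        label: sum(extreme_response_index(r, low, high) for r in respondents)
        / len(respondents)
        for label, respondents in groups.items()
    }


# Hypothetical 5-point responses for two groups (illustrative only, not study data).
example = {
    "group_a": [[1, 5, 5, 3, 1], [5, 4, 5, 5, 2]],
    "group_b": [[2, 3, 4, 3, 2], [3, 3, 4, 2, 3]],
}
result = mean_index_by_group(example)
print(result)
```

A difference in mean ERS between groups would not by itself show that a measure is biased, but it could prompt a closer look at the response options, for example in cognitive interviews.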
For older adults who might have limited vision, or for those with limited English proficiency or low literacy, question formatting and the response task may affect the validity of a measure. Complex instructions and formatting may work with younger white populations but could be challenging for older minority groups or persons with low literacy (Mullin, Lohr, Bresnahan, & McNulty, 2000; Rosal et al., 2003). Studies have examined how to present questions so as to maximize respondents’ ability to answer them accurately (Stone et al., 2000).
Studies need to be sensitive to the appropriateness of different modes of data collection. For example, paper-and-pencil administration may pose no difficulties in middle-aged populations but might be inappropriate for older populations with low vision, for whom oral administration might work better. The growing use of technologies to accommodate low-literacy populations, such as audio-visual computer-based platforms (Hahn & Cella, 2003; Hahn et al., 2004), as well as other electronic modes of administration (web-based surveys, touch-screen computers, handheld computers, interactive voice response, and automated telephone assessment), has led to a call for extensive testing to ensure the comparability of electronic and paper-based measures (Coons et al., 2009).
Once a decision is made to modify an existing measure, investigators need information on which to base those modifications. Such information is often available in the same studies that identified the need for modifications. Three sources of information can be used as a basis for modifications: 1) qualitative research on a concept or measure, 2) literature reviews of the adequacy of a measure, and 3) investigator experience or judgment. We review these three information sources and provide examples of how the information source was used to make modifications.
The two most common qualitative methods for exploring concepts and measures in diverse populations include cognitive interview pretests (Collins, 2003; Drennan, 2003; A.M. Nápoles-Springer et al., 2006; Rosal et al., 2003; Willis, 2005) and focus groups (Fuller, Edwards, Vorakitphokatorn, & Sermsri, 1993; Hughes & DuMont, 1993; A. M. Nápoles-Springer, Santoyo, Houston, Perez-Stable, & Stewart, 2005; Vogt, King, & King, 2004). In addition to identifying specific problems, transcripts often include dialogue that can be used to help make modifications such as alternative phrasing. Qualitative research on the adequacy of a specific measure in a new population group often identifies items that are not culturally appropriate or relevant. The following are some examples of how qualitative research provided information by which to modify a measure:
Literature reviews of the adequacy or appropriateness of measures of particular concepts for racial/ethnic groups can be done by individual investigators or by convening an expert panel (Mutran et al., 2001). For example, literature on measures of park and recreation environments was reviewed for adequacy in studies of how parks and recreation settings contribute to physical activity in low-income communities of color. The review provided suggestions for improving measures, for example, the need to reflect the concerns and preferences of residents in these communities and to “explicitly reflect inequality in the availability and quality of parks and recreation areas” (Floyd, Taylor, & Whitt-Glover, 2009). Another example is a review of the appropriateness of physical activity questionnaires for middle-aged and older minority women (Masse et al., 1998). The authors suggested a number of alterations: clarify physical activity-related words like exercise and physical activity; improve definitions of phrases like moderate physical activity; modify instructions about walking to capture walking in multiple contexts; modify unsuitable phrases such as “leisure-time” activities; and substitute culturally relevant activities such as dancing for items such as playing golf or tennis. Other examples of how literature reviews of measures provided specific information by which to modify measures include reviews of measures of socioeconomic status for minority aging studies (Rudkin & Markides, 2001), measures of acculturation (Salant & Lauderdale, 2003), and food frequency questionnaires for minority populations (Coates & Monteilh, 1997).
Investigator experience and knowledge can provide ideas for modifications, based on long-term programs of research with diverse populations. For example, based on their experience conducting research with persons with disabilities, Meyers and Andresen (2000) recommended shortening the recall period in health-related quality of life instruments for patients with disabilities. For instance, changes may occur over very short periods of time in multiple sclerosis patients, and in patients with stroke, recall periods need to be compressed due to short-term memory loss.
The types of modifications that can be made range from simple format changes such as improving the contrast or increasing the font size to extensive changes such as adding new subscales or changing item wording. To facilitate thinking about the various types of modifications, we have classified them into three broad categories: content, context, and format. For each of these, in Table 1, we define the various specific modifications and provide examples.
Content modifications can be made at the level of dimensions, item stems, or response options, all of which can be added, dropped, modified, or replaced. Adding dimensions or items may be indicated when additional components are found to be needed. Dropping dimensions or items might be done when either is found to be unsuitable for a particular group. Replacing items might be done when an item is unsuitable and a comparable alternative has been suggested by respondents during cognitive interviews.
Context modifications are those made primarily because of study-specific differences, for example, a different referent such as nurses instead of doctors. They can also include changes to the instructions when a self-administered measure is converted to verbal or web-based administration. For example, to modify the Consumer Assessment of Healthcare Providers and Systems (CAHPS) survey for an American Indian health service setting, items were modified to substitute the term “doctor or nurse” for “doctor,” and to substitute “health professional” for “health provider” (Weidmer-Ocampo et al., 2009).
Format and presentation modifications include changes in appearance or in the way of responding that reduce respondent burden or enhance readability. For older adults, these might include audio or touch-screen administration and the addition of response aids, as well as simplified instructions and larger font sizes. Mullin and colleagues (2000) have summarized “state-of-the-art” formatting methods for self-administered questionnaires that are known to improve data quality, based on an extensive literature review. They describe a variety of formatting and presentation ideas that can reduce errors in responding, reduce burden, and enhance respondents’ motivation to complete questionnaires. For example, in a study focused on improving self-report questionnaires for older people with multiple sclerosis, inconsistent formatting was found to be distracting, so questionnaires were reformatted for consistency (e.g., response choices printed below each question) (Ploughman, Austin, Stefanelli, & Godwin, 2010).
Once a measure is modified, it is essential to assess the reliability and validity of the modified measure. For some minor modifications, such as changing the typeface, it may not be necessary to formally assess the psychometric properties of the instrument. However, for changes such as adding or deleting a number of items or modifying response categories, investigators should seriously consider a more formal assessment of the new measure.
One classification system for determining the magnitude or extensiveness of a modification, proposed by Coons and colleagues (2009), is based on the potential effects of a modification on “…the content, meaning, or interpretation of the measure’s items and/or scales” (Coons et al., 2009, pp. 422–423). The system was originally conceived for electronic data collection methods, but it is likely equally applicable to paper-and-pencil or orally administered instruments. The 3-level classification system distinguishes minor, moderate, and substantial modifications, according to how much each may alter the content, meaning, or interpretation of the measure.
One of the biggest challenges faced by health disparities researchers is the time and cost involved in a full-scale psychometric assessment of a measure that was extensively modified. As Coons and colleagues note (Coons et al., 2009), the rigor of equivalence testing probably should vary with the extent of modification. However, regardless of the level of modification, it is judicious to test the new measure for psychometric adequacy as well as for equivalence with the original measure. For minor modifications, a small-scale pretest would suffice to confirm that the changes are working as expected. For moderate modifications, a more thorough assessment of the psychometric adequacy of the measure, or of the extent to which its properties are similar to those of the original measure, should be undertaken. For substantial modifications, where there will be little prima facie evidence of measurement adequacy or equivalence, a full-scale psychometric assessment is probably needed. In many cases a detailed assessment is impractical, and investigators must weigh the constraints of a time-limited research program against the risk of using a modified measure whose reliability and validity have not been fully assessed. We offer some thoughts on this tricky situation from our own and others’ experiences.
One approach we recommend is to conduct an in-depth pretest on a small sample to determine whether the modifications are “working” before going into the field with the main study. While likely not definitive about measurement performance, this kind of information can be invaluable in detecting major problems with the modified measure before mounting a major study that includes it. Once the investigative team decides to move forward with pretesting the measure, it is possible to assess its psychometric properties. Typically these assessments would include item-scale correlations and internal consistency reliability. For example, Fongwa and colleagues conducted a field test of a modified patient satisfaction questionnaire in a sample of African Americans and whites and reported results of extensive psychometric analysis (Fongwa, Hays, Gutierrez, & Stewart, 2006). Ideally, one would evaluate the adequacy of both the original and the modified measure in the new sample (Hays, Hahn, & Marshall, 2002), but this requires administering the original measure as well as the modified items. To allow for this possibility, we strongly suggest that researchers not drop any items but instead add new or modified items to the established measure (Aroian, Hough, Templin, & Kaskiri, 2008; Hays et al., 2002); only in this way will it be possible to compare the original and modified measures. For example, Gonzalez and colleagues analyzed the construct validity of their modified Visual Analogue Pain scale in relation to the original Visual Analogue Pain scale for Hispanics recruited from several U.S. communities; the correlation of the modified scale and the original was 0.72, and the modified scale had less missing data (6% compared to 24%) (González, Stewart, Ritter, & Lorig, 1995). There are several other examples of the benefits of comparing the original and modified measures in the same study (Aroian et al., 2008; Kazis et al., 2004; Tucker et al., 1998). However, including both old and new versions in the same study may not be practical and can introduce context effects.
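The item-scale correlations and internal consistency reliability that such pretests typically examine are simple to compute. The sketch below is a minimal Python illustration, assuming complete pretest data (no missing items) stored as one row per respondent; the function names and example data are hypothetical. It computes Cronbach’s alpha, k/(k − 1) × (1 − sum of item variances / variance of the total score), and corrected item-total correlations, in which each item is correlated with the sum of the remaining items. In practice, investigators would usually rely on an established statistics package.

```python
from statistics import mean, variance
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(data: List[List[float]]) -> float:
    """Cronbach's alpha for complete data; data[r][i] is respondent r's score on item i."""
    k = len(data[0])                      # number of items
    item_columns = list(zip(*data))       # one tuple of scores per item
    totals = [sum(row) for row in data]   # scale total per respondent
    sum_item_var = sum(variance(col) for col in item_columns)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def corrected_item_total(data: List[List[float]]) -> List[float]:
    """Correlation of each item with the sum of the *other* items (item excluded)."""
    result = []
    for i, col in enumerate(zip(*data)):
        rest = [sum(row) - row[i] for row in data]
        result.append(pearson(col, rest))
    return result


# Hypothetical pretest of a 4-item modified scale, 6 respondents (illustrative only).
pretest = [
    [3, 4, 3, 2],
    [4, 5, 4, 4],
    [2, 2, 3, 1],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
print(round(cronbach_alpha(pretest), 2))
print([round(r, 2) for r in corrected_item_total(pretest)])
```

Items with low corrected item-total correlations in the new group would be natural candidates for closer review, for example in additional cognitive interviews, before the main study.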
Investigators should also consider how they might assess the validity of the modified measure. One approach is to conduct validity tests that parallel those done with the original measure, which requires administering the same validity indicators used with the original measure. The expectation is that the modified measure is an improvement over the original measure in the new context or population (Hays et al., 2002). For example, in a study adapting a patient satisfaction instrument for a lower-literacy population, two “new” versions, one with cartoon/pictorial enhancements and one delivered by computer-assisted telephone administration, were compared head-to-head with the original self-administered text instrument in a randomized trial, with each arm using the same validity indicators (Shea et al., 2008). Another approach is to use new indicators of validity to assess the performance of the modified measure. While this approach may produce strong evidence of validity, it makes it difficult to compare the validity assessment with previous validity results for the original measure.
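When both the original and the modified versions are administered to the same sample, the comparisons reported by González and colleagues (the correlation between versions and the rate of missing data) and parallel validity tests against a shared indicator can be computed together. The sketch below is a minimal Python illustration, assuming each version has already been scored per respondent, with None marking a missing score, and that one criterion variable used in the original validation work was also collected; the function names and example values are hypothetical. Correlations use complete pairs only.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence, Tuple

Score = Optional[float]  # None marks a missing (unscorable) score


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def complete_pairs(x: Sequence[Score], y: Sequence[Score]) -> Tuple[List[float], List[float]]:
    """Keep only respondents with non-missing values on both variables."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def missing_rate(scores: Sequence[Score]) -> float:
    """Proportion of respondents with a missing score."""
    return sum(s is None for s in scores) / len(scores)


def compare_versions(
    original: Sequence[Score], modified: Sequence[Score], criterion: Sequence[Score]
) -> Dict[str, float]:
    """Correlation between versions, missing-data rates, and each version's
    correlation with the same validity indicator."""
    o, m = complete_pairs(original, modified)
    report = {
        "r_original_vs_modified": pearson(o, m),
        "missing_original": missing_rate(original),
        "missing_modified": missing_rate(modified),
    }
    for label, scores in (("original", original), ("modified", modified)):
        s, c = complete_pairs(scores, criterion)
        report[f"r_{label}_vs_criterion"] = pearson(s, c)
    return report


# Hypothetical scores for six respondents (illustrative only, not study data).
original = [12, 15, None, 9, 14, None]
modified = [11, 16, 13, 8, 15, 10]
criterion = [3.1, 4.0, 3.5, 2.2, 3.8, 2.9]
print(compare_versions(original, modified, criterion))
```

Less missing data and criterion correlations at least as strong for the modified version would be consistent with the expectation that it improves on the original in the new population, although estimates from small pretest samples are imprecise.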
If assessment studies of measure modifications were routinely published, we would gain a tremendous amount of information on how various modifications affect the reliability and validity of measures in new populations, and such studies would point to new strategies and methods for measurement assessment. Currently, such reporting appears to be largely absent; for example, in a review of measures of spirituality for use in palliative care, the authors noted that changes in the content of instruments after cultural adaptation were poorly reported (Selman, Harding, Gysels, Speck, & Higginson, 2011).
One approach would be to provide details of the modification and its assessment in a separate methods paper. Investigators reported on an adaptation of the CAHPS Clinician and Group Survey for use in the American Indian Health Service, including methods for deciding to modify the measure, the types of modifications, and the psychometric characteristics (Weidmer-Ocampo et al., 2009). Similarly, in reporting modifications to the CHAMPS Physical Activity Questionnaire (A.L. Stewart et al., 2001) for a church-based lifestyle intervention for African Americans, Resnicow and colleagues (Resnicow et al., 2003) clearly described the entire process of modifications, including tests of validity. Other examples include a study adapting a patient satisfaction instrument for low literacy individuals by modifying the format (Shea et al., 2005) and one adapting a patient satisfaction survey for low literacy VA patients by adding illustrations (Weiner et al., 2004).
Another approach is to report details of the modification and assessment process within the methods section of a substantive paper. For example, Nápoles and colleagues (Napoles, Ortiz, O'Brien, Sereno, & Kaplan, 2011) reported their modifications of the Cancer Behavior Inventory in the methods section of a paper on coping resources and self-rated health among Latina breast cancer survivors, including results of psychometric tests of the modified measure.
When publishing papers that include modified measures, we recommend at a minimum reporting: 1) features of the original measure that required modification; 2) the source of information on which the modifications were based; 3) the specific type of modification made; and 4) how the modified measure was tested for psychometric adequacy, and the results. It may also be necessary to report whether permission to modify the measure was obtained, or whether such permission is granted with the published measure or on its website. Incorporating this information adds to accumulated knowledge on the particular measure as applied in diverse groups and helps our understanding of broader issues involved in adapting measures for diverse populations. These reporting guidelines may not always be practical, though, given space limitations in published papers and the typically brief treatment of measurement issues. Especially when this information would be disproportionate to the substantive portion of a paper, we encourage authors to take advantage of options to provide supplemental methodological information on journal websites. This is becoming more common in some journals that discourage extensive methodological information in articles (Tobin, 2000).
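The minimum reporting items listed above can also be tracked as a structured record while a study is under way, so that details are not lost by the time a paper is written. The sketch below is one hypothetical way to do this in Python, not a published standard: the class, field names, and example are illustrative assumptions. The allowed values come from this paper (the three information sources and the content/context/format categories), and the optional level field borrows the minor/moderate/substantial levels of Coons and colleagues (2009).

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values taken from the framework described in this paper.
BASES = {"qualitative research", "literature review", "investigator judgment"}
CATEGORIES = {"content", "context", "format"}
LEVELS = {"minor", "moderate", "substantial"}  # Coons et al. (2009) levels


@dataclass
class ModificationReport:
    """Hypothetical record of the minimum reporting items for a modified measure."""

    original_measure: str
    features_requiring_change: str   # 1) what about the original required modification
    information_basis: str           # 2) one of BASES
    category: str                    # 3) one of CATEGORIES ...
    change_made: str                 #    ... plus the specific change made
    testing_and_results: str         # 4) how adequacy was tested, and the results
    level: Optional[str] = None      # optional: one of LEVELS
    permission: Optional[str] = None  # optional: how permission to modify was obtained

    def problems(self) -> List[str]:
        """List gaps in the record; an empty list means the minimum items are present."""
        issues = []
        for name in ("original_measure", "features_requiring_change",
                     "change_made", "testing_and_results"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        if self.information_basis not in BASES:
            issues.append(f"unrecognized information_basis: {self.information_basis!r}")
        if self.category not in CATEGORIES:
            issues.append(f"unrecognized category: {self.category!r}")
        if self.level is not None and self.level not in LEVELS:
            issues.append(f"unrecognized level: {self.level!r}")
        return issues


# Hypothetical example: a context modification whose testing is not yet recorded.
report = ModificationReport(
    original_measure="Example Satisfaction Scale (hypothetical)",
    features_requiring_change="referent 'doctor' did not fit the study setting",
    information_basis="qualitative research",
    category="context",
    change_made="substituted 'doctor or nurse' for 'doctor' in item stems",
    testing_and_results="",
)
print(report.problems())  # -> ['missing testing_and_results']
```

Keeping the record alongside the study data would make it straightforward to draft either a measures-section paragraph or supplemental online material.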
We have addressed a gap in the literature on making the measures used in health disparities and minority aging studies appropriate for the population group. There are virtually no published papers that discuss methodological issues in modifying measures, and our paper is a first step toward understanding these issues. As the field of measurement in diverse populations evolves, more information on such modifications will become available. It was very difficult to find examples in the literature of modifications to existing measures for use in a diverse population group. Modifications were typically described in a brief paragraph in the measures section, and it was not possible to find them through keyword searches. Thus, by calling attention to the importance of providing details, we hope to promote better reporting in publications that use modified measures.
Many modifications are designed to improve the ability of the measure to answer specific, contemporary research questions. Thus, often modifications are made to update terminology or to reflect historical changes (Krause, 2006). In this case, modifications are an integral part of the evolution of a concept or measure, with modifications improving an existing measure for use by other researchers. The SF-36 is a good example of evolution in measures of health-related quality of life. As noted by Ware (2000), the SF-36 version 2 includes several modifications to the original SF-36. The original SF-36 in turn was a modification or subset of several longer-form Medical Outcomes Study (MOS) measures of “functioning and well-being.” The MOS measures in turn were adapted and developed based on measures of health status developed for the Health Insurance Experiment (HIE) (Stewart & Ware, 1992) and HIE measures were based on even earlier measures such as the General Psychological Well-Being Inventory (Dupuy et al., 1984). Thus, the SF-36 version 2 reflects numerous iterations of prior measurement instruments, each involving modifications and adaptations of earlier versions to improve the measures.
Other examples of how measures evolve can illustrate this process. One is the adaptation of the Mini-Mental State examination by Teng and Chui (Teng & Chui, 1987) to sample a broader variety of cognitive functions, include a broader range of difficulty levels, and improve reliability and validity. Chatters and colleagues reviewed the evolution of concepts and measures of religiosity for research on the role of religiosity in the lives of older African Americans (Chatters, Taylor, & Lincoln, 2001). They noted a progression from single measures of church attendance to multi-dimensional concepts and measures including prayer, private religious practices, daily spiritual experiences, and organizational religiousness. As the concept evolved through research on how spirituality affects health and well-being, the measures also evolved to reflect the changing concepts.
Consistent with the concept of measurement evolution is the perspective that the validity of the measure is not a property of a test or measure, but of a measure tested under a particular set of conditions (Messick, 1995; Sechrest, 2005). This perspective holds that because validity pertains to understanding the meaning of scores, construct validity can only be established incrementally based on the accumulation of evidence on how the measure relates to other measures (Sechrest, 2005, p. 1596). Well-designed modifications thus contribute to enhancing the validity of measures in new and diverse populations.
The scientific study of the adequacy of measures in diverse populations is still emerging, but research promoting understanding of how to modify measures and test those modifications is even newer. By laying out some of the issues in understanding how to modify and test measures and by describing some possible solutions, we hope to guide future research efforts to advance the field of measurement in diverse populations. By highlighting the value of modifications in the ongoing evolution of concepts and measures in studies of minority aging, we also hope to encourage authors of published measures to be flexible in imposing limitations on use of their measures.
Anita L Stewart, Institute for Health & Aging, University of California San Francisco, 3333 California St. Suite 340, San Francisco, CA 94118, Phone: 415 502-5207.
Angela D Thrasher, Department of Health Behavior and Health Education, University of North Carolina Gillings School of Global Public Health, 315 Rosenau Hall, CB #7440, Chapel Hill, NC 27599-7440, Phone: 919-843-9293.
Jack Goldberg, Vietnam Era Twin Registry, Seattle VA and the University of Washington School of Public Health, Box 359780, 1730 Minor Avenue, Suite 1760, Seattle, WA 98105-1597, Phone: 206 543-4667.
Judy A. Shea, University of Pennsylvania, School of Medicine, Division of General Internal Medicine, 1223 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, Phone: 215 573-5111, Email: sheaja@mail.med.upenn.edu.