Combining extant datasets with differing outcome measures, an economical method to generate evidence guiding older adults’ cancer care, may introduce heterogeneity leading to invalid study results. We recently conducted a study combining extant datasets from five oncology nurse-directed clinical trials (parent studies) using norm-based scoring to standardize the differing outcome measures. The purpose of this article is to describe and analyze our methods in the recently completed study. Despite addressing and controlling for heterogeneity, our analysis found statistically significant heterogeneity (p<0.0001) in temporal trends among the five parent studies. We concluded that assessing heterogeneity in combined extant datasets with differing outcome measures is important to ensure similar magnitude and direction of findings across parent studies. Future research should include investigating reasons for heterogeneity to generate hypotheses about subgroup differences or differing measurement domains that may have an impact on outcomes.
Adults ages 65 and over comprise over 60% of the cancer population, yet represent only 25% to 38% of the patients enrolled in large national cancer clinical trials (Hutchins, Unger, Crowley, Coltman, & Albain, 1999; Lewis et al., 2003; Talarico, Chen, & Pazdur, 2004). Age-associated decline in functional reserve, increase in comorbid conditions, multiple medications, and delayed diagnosis limit accrual. The under-representation of older adults in major cancer clinical trials limits data analysis and generalizability of cancer treatments in this population (Talarico et al., 2004). Implementation of strategies, such as changes in Medicare policy to cover costs associated with clinical trials, has had minimal effect on incentivizing older adults’ participation (Kimmick et al., 2005; Unger et al., 2006).
Although randomized clinical trials are the gold standard, researchers may need to use alternative study designs to increase the sample size of older adults with cancer in order to answer important research questions, detect clinically significant changes, and increase generalizability of results (Kiecolt & Nathan, 1985). One such alternative is combining extant datasets to increase the study sample. Combining extant datasets, however, has the potential to introduce heterogeneity into the study and limit the researchers’ choice of variables. Additionally, researchers may need to combine measures of concepts that are not strictly equivalent. All of these issues create the potential for imprecise measurement of concepts and study bias.
We recently completed such a study exploring the functional status of older adults after cancer surgery by combining five subsets of nurse-directed studies (Van Cleave, Egleston, & McCorkle, 2011). Although similar in design, data collection, and population, the five studies contained differing functional status measures: the Enforced Social Dependency Scale (ESDS) in four studies and SF-36® Health Survey (SF-36) in the fifth study. Based on the literature, we standardized the differing outcome measures for the analysis using norm-based scoring. The results of the study suggest that factors other than age influence recovery of function.
Therefore, the purpose of this article is to describe and examine our methods for combining extant datasets for the index study. Specifically, we will present a brief literature review and then describe the methods we used to standardize the differing outcomes. We will examine the combined dataset for heterogeneity in patterns of functional status across studies, and conclude the article by analyzing the benefits and limitations of our methods. For clarity, we will refer to the five nurse-directed clinical studies as the parent studies and refer to the recently conducted study of functional status in older adults as the index study.
Combining extant datasets with differing outcomes using standardized measures, a well-described method in the literature and basis for meta-analysis studies, has benefits and limitations (Higgins & Green, 2009; Kiecolt & Nathan, 1985; Orsi et al., 1999; Smith & Glass, 1977). Critics maintain that combining datasets with differing outcomes may lead to study bias due to the mixing of heterogeneous data. As a result, the researcher may falsely reject the null hypothesis when it represents truth (Type I error) or fail to reject the null hypothesis when the alternative hypothesis is true (Type II error) (Eysenck, 1994; Higgins & Green, 2009; Shadish, Cook, & Campbell, 2002). Proponents counter that mixing heterogeneous data is appropriate within the context of a larger contextual phenomenon. In other words, a study of “apples and oranges” can yield meaningless results when interested only in apples or oranges, but can significantly contribute to a wider question about fruit (Higgins & Green, 2009; Smith & Glass, 1977).
Researchers standardize differing outcome measures in combined extant datasets to facilitate analysis and test research questions. Two methods for standardizing differing outcome measures are mean differences and norm-based scoring. Mean differences consist of using standard deviations (SDs) together with sample sizes to compute the weight given to each study. For example, studies with small SDs are given relatively greater weight while studies with larger SDs are given relatively smaller weights. This is appropriate when differences in SD reflect real differences in the variability of outcomes in the study population (Higgins & Green, 2009). Another method, norm-based scoring, uses a linear transformation with mean=50 and SD=10. This creates the possibility of meaningfully comparing scores across studies (Ware, 2010).
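As a rough illustration of how SDs and sample sizes together determine study weights, the following minimal sketch assumes weights proportional to n / SD² (the inverse variance of a study mean). The weighting rule, the function name, and the two hypothetical studies are illustrative assumptions, not values or procedures taken from the index study.

```python
def inverse_variance_weights(sample_sizes, sds):
    """Return normalized study weights proportional to n / SD**2.

    Assumption: each study contributes a mean whose variance is SD**2 / n,
    so its inverse-variance weight is n / SD**2.
    """
    raw = [n / (sd ** 2) for n, sd in zip(sample_sizes, sds)]
    total = sum(raw)
    return [w / total for w in raw]


# Two hypothetical studies with equal sample sizes but different SDs:
# the study with the smaller SD receives the larger share of the weight.
weights = inverse_variance_weights([100, 100], [5.0, 10.0])
print(weights)
```

With equal sample sizes, halving the SD quadruples a study's raw weight, which is why studies with small SDs dominate the pooled estimate under this scheme.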
For the index study, we combined the five parent studies, conducted by the same principal investigator, to increase the population of postsurgical cancer patients for study power and generalizability. We first evaluated the five parent studies and judged them to be compatible based on data source, study design, procedures, and study participants. All studies collected data at baseline, three months, and six months, and targeted a postsurgical cancer population. The outcome measure, functional status, was conceptualized broadly across all studies as representing the total domain of function (capacity, reserve, performance, and capacity utilization) (Leidy, 1994) and defined as the individuals’ actual performance of activities and tasks associated with their current life roles (Richmond, Tang, Tulman, Fawcett, & McCorkle, 2004).
The index study was approved by the Yale School of Nursing (YSN) Human Subjects Institutional Review Board. Informed consent had been previously obtained from all patients during the parent studies, and study identification numbers were used in place of names or personal identifying data to protect the rights of the participants.
The five parent studies were conducted between 1983 and 2007 at academic cancer centers in the northwestern and northeastern United States (Table 1). Complete details of these studies have been provided elsewhere (McCorkle et al., 1989, 1994, 2000, 2009; McCorkle, Siefert, Dowd, Robinson, & Pickett, 2007). Patients were recruited during hospitalization, and baseline interviews were conducted within 1 month after discharge, except in study one. In study one, patients were recruited after discharge in the outpatient clinics and baseline interviews were conducted on average 60 days after surgery. The data collection times consisted of baseline (enrollment), and then 3 and 6 months after enrollment.
Demographic data and comorbidities were assessed by patient self-report and recorded at baseline. Cancer site, stage, and treatment were collected from treatment records and medical record audits. Functional status, symptoms, and mental health data were reported by patients and collected by research assistants at baseline, and 3 and 6 months. The types of cancers differed among parent studies. Study one consisted of patients with thoracic cancers. Studies two and three enrolled patients with heterogeneous cancers: breast, colorectal, head and neck, thoracic, prostate, gynecologic, bladder, pancreatic, esophageal, renal, and gastric. Study four targeted men with prostate cancer. The patient population for study five consisted of women who had undergone abdominal surgery for presumed gynecological cancers. Patients were recruited to this study prior to the final pathology, so this subset of records included women with a heterogeneous group of cancers including ovarian, uterine, endometrial, and metastatic breast and pancreatic cancers.
The inclusion criteria for the index study consisted of age 65 and older, enrollment in one of the five oncology nurse-directed randomized clinical studies, and no other concurrent malignancy or other medical treatment. A total of 537 patients from the five parent studies met the inclusion criteria. Participant attrition occurred because of substantial patient mortality (n=76) during the six months of each study. We therefore limited the index study to patients who had data collected at two or more time points. For analysis, we also limited the index study to patients with digestive system, thoracic, gynecologic, and genitourinary cancers treated primarily with surgery, to obtain a homogeneous population. The final index study population comprised 316 patients.
Prior to combining datasets, we mapped the selected variables across the five parent studies, using a spreadsheet, to examine and plan item compatibility. The mapping process showed that studies 1, 2, 3, and 5 used a similar functional status measure, the ESDS (Tang & McCorkle, 2002). Researchers in Study 4 used the SF-36, specifically the physical component summary measures, to measure functional status (McHorney, Ware, Lu, & Sherbourne, 1994; McHorney, Ware, & Raczek, 1993; Ware & Sherbourne, 1992).
The ESDS consists of two components: personal and social competence. Personal competence includes six activities: eating/feeding, dressing, walking, traveling, bathing, and toileting. Dependency in each activity was reported by the patient and rated by the interviewer on a 6-point scale. Scores for personal competence were summed and ranged from 6 to 36. Social competence consisted of home, work, and recreational activities, which were rated on 4-point scales, and a communication category, rated on a 3-point scale. Scores for social competence were summed and ranged from 4 to 15. Scores for personal and social competence were summed to generate a total dependency score ranging from 10 to 51, with higher scores reflecting greater dependency. The ESDS has demonstrated reliability (Cronbach’s alpha coefficient = 0.72 to 0.96) and both concurrent and predictive validity (Tang & McCorkle, 2002).
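The scoring arithmetic described above can be sketched as follows. This is a minimal illustration that assumes each item is rated starting at 1 (which is consistent with the stated ranges of 6 to 36, 4 to 15, and 10 to 51); the item keys and function names are hypothetical and are not part of the published instrument.

```python
# Hypothetical item keys; the instrument's own item labels may differ.
PERSONAL_ITEMS = ("eating", "dressing", "walking", "traveling", "bathing", "toileting")
SOCIAL_ITEMS_4PT = ("home", "work", "recreation")
SOCIAL_ITEMS_3PT = ("communication",)


def _checked_sum(ratings, items, max_rating):
    """Sum the ratings for `items`, requiring each to lie in 1..max_rating."""
    total = 0
    for item in items:
        value = ratings[item]
        if not 1 <= value <= max_rating:
            raise ValueError(f"{item} rating {value} is outside 1..{max_rating}")
        total += value
    return total


def esds_scores(ratings):
    """Return personal, social, and total dependency scores for one patient."""
    personal = _checked_sum(ratings, PERSONAL_ITEMS, 6)
    social = (_checked_sum(ratings, SOCIAL_ITEMS_4PT, 4)
              + _checked_sum(ratings, SOCIAL_ITEMS_3PT, 3))
    return {"personal": personal, "social": social, "total": personal + social}


# A patient rated at the lowest level of dependency on every item.
least_dependent = {item: 1 for item in
                   PERSONAL_ITEMS + SOCIAL_ITEMS_4PT + SOCIAL_ITEMS_3PT}
print(esds_scores(least_dependent))
```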
The SF-36 is a 36-item survey of health status used to assess eight health concepts: physical functioning, role limitations due to physical problems, bodily pain, general health perceptions, vitality, social functioning, role limitations due to emotional problems, and mental health. The SF-36 has demonstrated item-internal consistency and item-discriminant validity. Reliability coefficients have ranged from a low of 0.65 to a high of 0.94 across scales (median=0.85). The measure has also demonstrated validity in discriminating between serious and minor medical conditions as well as psychiatric illnesses (McHorney et al., 1993, 1994; Ware & Sherbourne, 1992).
Number of symptoms was measured by the Symptom Distress Scale (SDS) (McCorkle, Cooley, & Shea, 1998) to quantify the effect of each additional symptom on functional status. The SDS consists of 13 common symptoms of cancer patients: frequency of nausea, severity of nausea, appetite, insomnia, frequency of pain, severity of pain, fatigue, bowel pattern, concentration, appearance, breathing, outlook, and cough. Participants rated their distress on a scale from 1 (low distress) to 5 (high distress). Absence of a symptom was defined as values 1 and 2, and presence of a symptom was equal to values 3, 4, and 5. This categorization is consistent with previous studies using these same values to distinguish between low and high symptom distress (Cooley, Short, & Moriarty, 2003). Number of symptoms, therefore, represented the sum of present symptoms with highest score equal to 13 and lowest score equal to 0. The SDS has demonstrated content, construct, and criterion validity. The SDS has also demonstrated reliability, with reported Cronbach’s alpha coefficient ranging from 0.70 to 0.89 (McCorkle et al., 1994, 1998; Sarna, Lindsey, Dean, Brecht, & McCorkle, 1994).
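The dichotomization rule for the symptom count can be sketched in a few lines. The function name and the example ratings are hypothetical; only the 13-item length, the 1 to 5 rating scale, and the cut point (ratings of 3 or higher count as a present symptom) come from the description above.

```python
def count_symptoms(sds_ratings, present_from=3):
    """Count SDS items rated at or above `present_from` (3, 4, or 5)."""
    if len(sds_ratings) != 13:
        raise ValueError("the SDS has 13 items")
    if any(not 1 <= r <= 5 for r in sds_ratings):
        raise ValueError("SDS ratings must lie between 1 and 5")
    return sum(1 for r in sds_ratings if r >= present_from)


# Hypothetical patient: three items rated 3 or higher, ten rated 1 or 2.
example = [1, 2, 3, 4, 5, 1, 1, 2, 2, 1, 1, 2, 1]
print(count_symptoms(example))
```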
We prepared the data for combination by reverse coding the ESDS scores to attain consistent direction with the SF-36. The data were then converted to standardized norm-based scores (mean=50, SD=10) using a 0–100 metric and Z scores (Figure 1) (Ware, Kosinski, Turner-Bowker, & Gandek, 2005). SF-36 scores were converted to norm-based scores using QualityMetric Health Outcomes Scoring Software, version 2.0.
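Reverse coding can be illustrated with a short sketch. It assumes the common convention of reflecting a score about the midpoint of its range (reversed = minimum + maximum − score), applied here to the ESDS total range of 10 to 51; the exact formula used in the index study is not stated, so this rule and the function name are assumptions.

```python
ESDS_MIN, ESDS_MAX = 10, 51  # total dependency score range described in the text


def reverse_code(score, low=ESDS_MIN, high=ESDS_MAX):
    """Reflect a score so that higher values indicate better function."""
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside {low}..{high}")
    return low + high - score


# The most dependent total (51) becomes the lowest reversed score (10).
print(reverse_code(51), reverse_code(10))
```

After this step, higher values mean less dependency, which matches the direction of the SF-36, where higher scores indicate better health.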
We used norm-based scoring to transform the outcome variables and combine the datasets. By this, we mean that we standardized the outcome measures to a linear T-score having a mean of 50 and SD of 10 within each study (Figure 1). The T-score is based on Student’s T-distribution, the same distribution used for T-test comparisons of sample means. The T-distribution is essentially equivalent to the normal distribution (Z-scores) as the sample gets larger. Standardization of the outcome measures via T-score transformation eases interpretation across different measures. A 10-unit change in any standardized score is equivalent to a one SD change regardless of outcome measure. In many fields, the use of SD units to define clinically relevant changes in metrics is well accepted (Cohen, 1992). Ware et al. (1998) similarly use linear T-score transformations to make comparisons using SF-36 measures.
To combine the datasets in our study, we normalized our data using the study-specific means and SDs. By using study-specific means, we crudely removed study-specific confounding of demographic and clinical characteristics of interest with the outcomes. In using the study-specific SDs, the interpretation of effects is conditional on the SD within each study.
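A minimal sketch of standardization within each study is given here, using the linear transformation T = 50 + 10 × (x − study mean) / study SD. The use of the sample SD (n − 1 denominator), the function name, and the example scores are assumptions made for illustration; they are not taken from the parent studies.

```python
from statistics import mean, stdev


def t_scores_within_study(scores_by_study):
    """Standardize scores to mean 50 and SD 10 separately within each study.

    Assumption: the sample SD (n - 1 denominator) is used for scaling.
    """
    standardized = {}
    for study, scores in scores_by_study.items():
        study_mean = mean(scores)
        study_sd = stdev(scores)
        standardized[study] = [50 + 10 * (x - study_mean) / study_sd
                               for x in scores]
    return standardized


# Hypothetical raw scores on two different metrics.
raw = {
    "study_A": [12.0, 20.0, 31.0, 45.0],   # e.g., a reverse-coded dependency scale
    "study_B": [28.5, 40.1, 44.0, 55.3],   # e.g., a 0-100 health survey summary
}
for study, values in t_scores_within_study(raw).items():
    print(study, [round(v, 1) for v in values])
```

Because each study is centered on its own mean, between-study differences in average level are removed by construction, which is the trade-off discussed in the surrounding text.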
There are limitations to our approach. Using study-specific means to standardize variables limits the ability to detect associations of interest if the study-level effects are highly correlated with other characteristics, such as age. For example, if age is similar within studies but varies across studies, then standardization by study-specific means would hinder the ability to find age-related associations. Still, this is not a major limitation, as it would nonetheless be difficult to distinguish age-related effects from study-specific effects if participants’ ages are highly correlated with the study. In such a case, reducing the ability to detect effects would protect one from falsely concluding that age is related to an outcome of interest when indeed there were other study-specific factors driving the association.
Another limitation to the approach is that the interpretation of effects in SD units makes clinical relevance of effects more difficult to assess. Further, if the SD of the outcomes is highly variable among studies, standardizing by study-specific means can create artificial heterogeneity in effects among studies. Cummings (2004) criticized meta-analyses using standardized effects on both these measures. Despite such criticism, such standardization does not necessarily lead to substantially different overall decision rules regarding associations of variables. In Cummings’ (2004) example, pooled effects were statistically significant both before and after standardization.
Despite the limitations to this approach, the main benefit of such standardization is its ease of implementation for researchers. While there has been work on rigorous methods to combine datasets using disparate quality of life measures (Chang & Cella, 1997; Dobrez, Cella, Pickard, Lai, & Nickolov, 2007; Revicki et al., 2009), such methods generally require that the disparate measures are administered simultaneously in a sample so they can be mapped to each other. When a researcher does not have access to information regarding the appropriate mapping of two variables in a population of interest, simpler methods, such as T-score-based transformations, might be the only option available.
After performing the T-score transformations, we sought evidence concerning the validity of the approach with respect to the ESDS and SF-36 measures of interest. Although we suspected the ESDS and SF-36 measured different outcomes, we expected that the trajectory of the measures over time on average would be the same after standardization. This does not imply that the same people would have the same trajectory assessed by the two measures. People who had significant changes in one measure might not have similar changes in the other. On average, however, we would expect the population changes to be similar.
To assess whether the standardized measures did move similarly in each sample, we fit multiple linear regression models of the standardized outcomes separately for each study. We included the 3- and 6- month data collection time points as dummy variables in the model. We also included age and number of symptoms as covariates to crudely adjust for confounding by these variables among the studies. We fit the models by generalized estimating equations assuming an unstructured working correlation matrix to account for the correlation of observations within individuals over time. We then assessed whether the direction and magnitude of the regression coefficients were similar across the studies. If the direction and magnitude were similar, we would have preliminary evidence supporting the combination of datasets with ESDS and SF-36 measures via T-score standardization.
To formally test for heterogeneity in temporal trends in functional status among the studies, we examined a multiple linear regression of function in which we included the wave dummy variables (2 covariates, with the baseline time used as the reference time point), study dummy variables (4 covariates, with the first study used as the reference study), and the 8 interaction terms between the wave and study dummy variables (wave dummy variables multiplied by study dummy variables). We also included the total number of symptoms and age in the model. We again fit the model by generalized estimating equations to account for correlation within individuals. Statistically significant interaction terms would suggest that changes in functional status over time varied among the studies.
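The covariate structure of this interaction model can be sketched as follows. The sketch only builds the dummy and interaction columns for one observation; it does not fit the model, and it omits age and number of symptoms for brevity. The column names and the function are hypothetical.

```python
from itertools import product

WAVES = ("baseline", "month3", "month6")  # baseline is the reference wave
STUDIES = (1, 2, 3, 4, 5)                 # study 1 is the reference study


def design_row(wave, study):
    """Return wave dummies, study dummies, and their products for one observation."""
    wave_dummies = {f"wave_{w}": int(wave == w) for w in WAVES[1:]}
    study_dummies = {f"study_{s}": int(study == s) for s in STUDIES[1:]}
    interactions = {
        f"{wave_name}:{study_name}": wave_value * study_value
        for (wave_name, wave_value), (study_name, study_value)
        in product(wave_dummies.items(), study_dummies.items())
    }
    return {**wave_dummies, **study_dummies, **interactions}


row = design_row("month6", 3)
interaction_names = [name for name in row if ":" in name]
print(len(interaction_names))        # 2 wave dummies x 4 study dummies
print(row["wave_month6:study_3"])    # switched on only for study 3 at 6 months
```

Each interaction coefficient then measures how far a study's change from baseline at a given wave departs from the change in the reference study, which is what a joint test of the eight terms evaluates.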
In the index study, we used an inverse probability of censoring weighted estimator to account for the missing outcome data over the three data collection times. To implement this method, functional status measures were weighted inversely by the probability of staying in the study (Fitzmaurice & Laird, 2000; Scharfstein, Rotnitzky, & Robins, 1999). For the present analysis, we did not apply the weighted estimator, so that we could fully compare the functional status measures across all three data collection times. Analyses were performed using STATA, version 10.1 (StataCorp LP, College Station, Texas).
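The idea behind inverse probability of censoring weighting can be illustrated with a simple weighted mean. This sketch assumes the probability of remaining in the study is already known for each patient, whereas in practice it would be estimated from a model; the normalized weighted mean, the function name, and the numbers are illustrative assumptions rather than the estimator used in the index study.

```python
def ipcw_mean(outcomes, prob_observed):
    """Weighted mean of observed outcomes, weighting each by 1 / P(observed).

    `outcomes` uses None for patients whose outcome is missing.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for outcome, prob in zip(outcomes, prob_observed):
        if outcome is None:
            continue  # censored patients contribute only through others' weights
        if not 0 < prob <= 1:
            raise ValueError("probabilities must lie in (0, 1]")
        weight = 1.0 / prob
        weighted_sum += weight * outcome
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical data: the first patient resembles those who tend to drop out
# (probability 0.5), so that observation counts twice as much.
print(ipcw_mean([40.0, 60.0, None], [0.5, 1.0, 0.5]))
```

Patients who stayed in the study despite a low probability of doing so stand in for similar patients who were lost, which is how the weighting compensates for missing outcomes.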
The study sample consisted of 316 patients who underwent surgery for digestive, thoracic, genitourinary, and gynecological cancer (Table 2). The mean age for the total population was 71.8 (SD=5.4 years). The study population consisted primarily of patients ages 65 to 69 (44.9%), White race (76.6%), married (61.7%), who did not work (82.3%), yet had some high school education (90.8%). The study sample was balanced between men and women (49.7% versus 48.7%) and 25.3% of the population earned $40,000 or greater.
We conducted a regression analysis of the effect of time of data collection and number of symptoms (Table 3). The total study population showed statistically significant improvement of functional status at 3 months (β=6.85, 95% confidence interval [CI] = 5.76 to 7.93) and 6 months (β=8.47, 95% CI = 7.21 to 9.73) compared with baseline functional status. Both number of symptoms and age were significantly associated with decreased functional status: functional status declined with each additional symptom (β=−1.58, 95% CI = −1.79 to −1.38) and each added year of age (β=−0.18, 95% CI = −0.29 to −0.08). We next analyzed the pattern of the recovery of functional status across studies (Table 3). Time of data collection was significantly associated with functional status in study 1 (3 months: β=−2.19, 95% CI = −3.76 to −0.62; 6 months: β= −4.71, 95% CI = −8.18 to −1.24), study 2 (6 months: β=8.02, 95% CI = 4.26 to 11.78), study 3 (3 months: β=11.31, 95% CI = 10.09 to 12.54; 6 months: β=13.18, 95% CI = 11.84 to 14.51), and study 5 (3 months: β=9.32, 95% CI = 6.51 to 12.13; 6 months: β=10.91, 95% CI = 7.98 to 13.84).
A two-way line plot portrays the mean standardized functional status scores across studies to compare and contrast the trajectory of recovery of functional status over time of data collection (wave) (Figure 2). The population in Studies 2, 3, and 5 demonstrated increased mean functional status over time while those in Studies 1 and 4 experienced either a downward or non-specific trajectory.
Table 4 lists the results of our interaction model used to test for heterogeneity in functional status temporal trends among the studies. We found that a Wald test for the joint statistical significance of the 8 interaction terms was statistically significant (p<0.0001), which suggests that there is indeed heterogeneity in temporal trends.
We then examined whether the ESDS and SF-36 contained similar items that could be compared for similarity in trajectory over time. The ESDS contained the items of dressing and bathing; the SF-36 had one item titled bathe or self-dress. We again used two-way line plots to portray the mean standardized functional status across times of data collection. This also demonstrated differing trajectories across time. All three items demonstrated improved scores between baseline and 3 months. The ESDS items, however, showed continued improved mean scores at 6 months compared with decreased mean score by the SF-36 (Figure 3).
Combining extant datasets represents a potential economical solution to increase the power and generalizability of studies of older adults with cancer, thus providing evidence to guide clinicians. However, the method may introduce bias into study results due to heterogeneous data, imprecise measurement of concepts, or combining measures that are not strictly equivalent. Literature supports that broadly defining the outcome and then using standardized scores for analysis of differing outcome measures across studies may overcome these potential methods problems. Thus, we used norm-based scoring to standardize the differing functional status outcome measures because of its ease of use and interpretation. In the index study, we found a statistically significant improvement in functional status in the total population. An analysis of individual studies and comparable measurement items, however, demonstrated heterogeneity in trajectory of functional status across parent studies. Hence, the results provide evidence for the benefits and limitations of combining extant datasets with differing measures.
Results from our analysis suggest that we gained some benefit from combining extant datasets in studies of older adults with cancer. The method may have led to increased study power to generate evidence that older adults have significant recovery of functional status after cancer surgery. Additionally, we were able to expand our index study population to include older adults with differing types of cancer surgery, and thus increased the generalizability of our results.
Results from our analysis also exposed some of the potential limitations of combining extant datasets using standardized outcome measures. We may have introduced heterogeneity into the index study, as exemplified by the differing trajectories of functional status after cancer surgery across parent studies and supported by the Wald test for the joint statistical significance of the 8 interaction terms. Functional status improved over time in studies 2, 3, and 5. In contrast, functional status in studies 1 and 4 either decreased over time or remained approximately the same over time. One reason for the differences in functional status temporal trends may stem from the 25-year span of the parent studies. For example, study 1 was conducted between 1983 and 1986 with thoracic cancer patients prior to the adoption of life-extending treatments including combinations of surgery, chemotherapy, and radiation therapy for older thoracic cancer patients (Bunn & Lilenbaum, 2003). Study 4, which demonstrated no significant difference of functional status across time, was conducted from 1997 to 2000 and included men ages 65 and over with early prostate cancer undergoing surgery for potential cure.
An additional reason for differences in functional status temporal trends may be related to differing research questions across the parent studies. For example, study 2 evaluated home care for cancer patients and study 4 explored nursing’s impact on men post prostatectomy (Table 1). Variables chosen to represent study concepts ultimately provide the information that answers the study question (Hulley et al., 2001). The two measures used across studies, ESDS and SF-36, provide different information for researchers. Compared with the SF-36, the ESDS provides information on personal competence and potential targets for interventions by assessing patients’ activities of daily living (i.e., eating/feeding, dressing, walking, travel, bathing, and toileting) (Tang & McCorkle, 2002). The SF-36, in contrast, addresses limitations in self-care: physical, social and role activities, bodily pain, and frequent tiredness (Ware, 2010).
We took several steps to address and control for heterogeneity based on meta-analysis methods. For the index study, we carefully developed inclusion criteria to generate a homogeneous population, examined and judged the five parent studies compatible in study design and data collection methods, reverse coded the ESDS to attain consistent direction with the SF-36, and then transformed the outcome measures to norm-based scoring. Hence, the differing trajectories raise the issue that, despite standardization with norm-based scoring, the ESDS and SF-36 may represent different functional status domains. We were unable to test for their equivalency, though, since none of the parent studies measured functional status with both the SF-36 and ESDS.
Assessing heterogeneity in combined extant datasets with differing outcome measures is important to ensure that the magnitude and direction of findings are similar across studies. Finding a great deal of heterogeneity could call into question the validity and generalizability of the combined results. Some have argued that differences in recruitment populations among studies could be a reason for substantial heterogeneity (Elwood, 1998). Therefore, investigators should examine study-level characteristics, such as general age or gender differences among the studies, that could be creating heterogeneity. Such considerations can be helpful in generating hypotheses about subgroup differences that have an impact on outcomes.
In our study, we found that the interaction terms between study and time period were statistically significant in a regression model, which suggests that there is indeed heterogeneity in temporal trends in functional status among the studies. Future research could investigate reasons for such heterogeneity, which may be related to demographic, disease, or treatment characteristics among the studies and differing domains of functional status among measures.
Janet H. Van Cleave, New York University College of Nursing, 726 Broadway, 10th Floor, New York, NY 10003.
Brian L. Egleston, Biostatistics Facility, Fox Chase Cancer Center, Philadelphia, PA 19111.
Meg Bourbonniere, Thomas Jefferson University Hospital, Philadelphia, PA 19107.
Ruth McCorkle, Yale School of Nursing, New Haven, CT 06536.