The accuracy of self-reported data may be improved by data editing, a mechanism intended to produce accurate information by excluding inconsistent data according to a set of predetermined decision rules. We compared data editing methods in the Global Youth Tobacco Survey (GYTS) with other editing approaches and evaluated their effects on smoking prevalence estimates. We evaluated 5 approaches for handling inconsistent responses to questions regarding cigarette use: GYTS, do-nothing, gatekeeper, global, and preponderance. Compared with GYTS data edits, the do-nothing and gatekeeper approaches produced similar estimates, whereas the global approach resulted in lower estimates and the preponderance approach in higher estimates. Implications for researchers using GYTS include recognition of the survey’s data editing methods and documentation in their study methods to ensure cross-study comparability.
Accurate monitoring of cigarette smoking status among youth is important in addressing the tobacco use epidemic globally (1). However, the accuracy of self-reported health-risk behaviors in questionnaires may be compromised by difficulties in recall, social desirability, and the sensitivity of the question itself (2). Data editing is a mechanism intended to produce accurate information by excluding inconsistent data according to a set of predetermined decision rules. Research suggests that editing procedures can affect point estimates and cross-study comparability (3–5). This exploratory study compares the data editing method used in the Global Youth Tobacco Survey (GYTS) with other data editing approaches and evaluates their effect on estimates of smoking prevalence in GYTS to inform collaborators globally.
GYTS, a self-administered school-based survey, uses a 2-stage cluster sample design that is grade-based and produces representative samples of students with ages ranging from 10 to 17 years. A subset of students aged 13 to 15 years is used for comparing the data within and across World Health Organization (WHO) regions. In countries such as small islands, where all students in the selected grades are surveyed, a census is conducted instead of a 2-stage cluster sample. The survey methods are described in detail elsewhere (6,7).
Eligible countries were selected on the basis of the following inclusion criteria: a nationally representative sample, recent completion of GYTS (2007–2009), large sample size (≥3,000 participants), and GYTS data publicly released. Of 35 eligible countries that met the inclusion criteria, 1 country from each WHO region was randomly selected for this study. Data analysis was performed on a subset of participants aged 13 to 15 years (n) among all ages in the grades selected for the survey (N). The selected countries and the year GYTS was conducted (values for n and N) are as follows: Ghana, 2009 (n/N = 4,171/8,295); Guatemala, 2008 (n/N = 3,838/5,565); Saudi Arabia, 2007 (n/N = 2,574/3,829); the Philippines, 2007 (n/N = 3,278/5,919); Slovakia, 2007 (n/N = 4,176/4,696); and Thailand, 2009 (n/N = 7,649/9,963).
Some questions from the GYTS presented the opportunity for participants to contradict themselves when responding (Table 1). Self-reported cigarette smoking on 1 or more of the past 30 days was used to determine cigarette smoking status. For this series of questions, 5 approaches were taken for handling inconsistent responses to questions regarding cigarette use: GYTS, do-nothing, gatekeeper, global, and preponderance (Table 1).
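To illustrate how decision rules of this kind can change a respondent's classification, the following is a minimal sketch in Python. The exact rules are those given in Table 1, which is not reproduced here; the rule definitions below, the three yes/no items (`ever_smoked`, `smoked_past_30_days`, `cigs_per_day_gt_0`), and the function names are all hypothetical simplifications chosen for illustration. The GYTS approach itself is not modeled, and the prevalence calculation is unweighted.

```python
from typing import Optional

# Hypothetical respondent record: each item is True if the answer indicates
# smoking and False otherwise. These item names are NOT actual GYTS variables.
APPROACHES = ("do_nothing", "gatekeeper", "global", "preponderance")


def is_consistent(resp: dict) -> bool:
    """Assumed consistency check: the two past-30-day items must agree, and a
    respondent who never smoked cannot report past-30-day smoking."""
    ever = resp["ever_smoked"]
    days = resp["smoked_past_30_days"]
    amount = resp["cigs_per_day_gt_0"]
    if days != amount:
        return False
    if not ever and (days or amount):
        return False
    return True


def classify(resp: dict, approach: str) -> Optional[bool]:
    """Return True (current smoker), False (nonsmoker), or None (excluded)."""
    ever = resp["ever_smoked"]
    days = resp["smoked_past_30_days"]
    amount = resp["cigs_per_day_gt_0"]
    if approach == "do_nothing":
        # Assumed rule: take the past-30-day answer at face value.
        return days
    if approach == "gatekeeper":
        # Assumed rule: the first (gate) question overrides later answers.
        return days if ever else False
    if approach == "global":
        # Assumed rule: exclude any respondent with an inconsistency.
        return days if is_consistent(resp) else None
    if approach == "preponderance":
        # Assumed rule: the majority of the three answers decides.
        return sum([ever, days, amount]) >= 2
    raise ValueError(f"unknown approach: {approach}")


def prevalence(records: list, approach: str) -> float:
    """Unweighted percentage of current smokers among non-excluded records."""
    statuses = [classify(r, approach) for r in records]
    kept = [s for s in statuses if s is not None]
    return 100.0 * sum(kept) / len(kept) if kept else float("nan")
```

Under these assumed rules, a respondent who reports never having smoked but also reports smoking in the past 30 days would be counted as a smoker by the do-nothing and preponderance rules, as a nonsmoker by the gatekeeper rule, and excluded by the global rule.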
We used Stata 11 software (StataCorp LP, College Station, Texas) to account for the complex survey design and to calculate weighted point estimates and their standard errors (SEs). Estimates with a relative SE (ratio of the SE of the estimate to the estimate, multiplied by 100) greater than 30% were considered statistically unreliable. We used adjusted Wald tests to assess whether point estimates derived from the GYTS approach differed significantly from those derived from the 4 other data editing approaches. Significance was set at P < .05.
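The relative SE criterion described above can be sketched in a few lines of Python. This is an illustration only, not the Stata code used in the study; the function names and the example values are hypothetical.

```python
def relative_standard_error(estimate: float, standard_error: float) -> float:
    """Relative SE: the SE divided by the estimate, multiplied by 100."""
    if estimate == 0:
        raise ValueError("relative SE is undefined for an estimate of 0")
    return standard_error / estimate * 100


def is_unreliable(estimate: float, standard_error: float,
                  threshold: float = 30.0) -> bool:
    """Flag an estimate whose relative SE exceeds the threshold (30%)."""
    return relative_standard_error(estimate, standard_error) > threshold


# Hypothetical values, not taken from Table 2: a prevalence of 5.0% with an
# SE of 1.0 percentage point has a relative SE of 20%, below the threshold.
print(is_unreliable(5.0, 1.0))
```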
Overall response rates of students interviewed (calculated as the school response rate multiplied by the class and student response rates) for all 6 countries were the following: 84.0% (Ghana), 79.6% (Guatemala), 82.1% (Saudi Arabia), 80.9% (Philippines), 86.1% (Slovakia), and 93.1% (Thailand). Data edit approaches resulted in variation of prevalence estimates of cigarette use; estimates ranged from 2.3% to 5.1% in Ghana, 8.9% to 12.4% in Guatemala, 4.9% to 6.5% in Saudi Arabia, 12.3% to 17.0% in the Philippines, 21.6% to 25.0% in Slovakia, and 9.6% to 11.9% in Thailand (Table 2). The global approach resulted in lower estimates and the preponderance approach, in general, higher estimates. The do-nothing and gatekeeper approaches produced estimates similar to those of the GYTS approach. The range and magnitude of differences in estimates derived from the global and preponderance approaches compared with those of the GYTS approach were greater among girls than boys. All comparisons of GYTS estimates were significantly different (P < .05) from estimates derived with the 4 other approaches, with several exceptions (Table 2). Consistent with the overall estimates, the global approach resulted in lower estimates, the preponderance approach higher estimates, and the do-nothing and gatekeeper approaches similar estimates, by sex across all selected countries.
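As a brief illustration of the overall response rate calculation described above (school response rate multiplied by the class and student response rates), consider the following Python sketch. The function name and the component rates in the example are hypothetical; the component rates for the 6 countries are not reported here.

```python
def overall_response_rate(school_rate: float, class_rate: float,
                          student_rate: float) -> float:
    """Overall response rate (%) as the product of three component rates,
    each given as a proportion between 0 and 1."""
    for rate in (school_rate, class_rate, student_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must be a proportion between 0 and 1")
    return school_rate * class_rate * student_rate * 100


# Hypothetical component rates: all schools, 95% of classes, and 90% of
# students participating.
print(round(overall_response_rate(1.0, 0.95, 0.90), 1))
```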
We demonstrated the effect of decision rules for handling data inconsistencies in GYTS data to assist collaborators globally. Smoking prevalence estimates generated from surveys can vary with the data editing approach used. Compared with the GYTS data edits, the global approach resulted in lower estimates and the preponderance approach, higher estimates. It is noteworthy that the do-nothing and gatekeeper approaches produced estimates similar to those of the GYTS data editing method. In comparison to the GYTS approach (7 logic checks), data editing methods in the National Youth Tobacco Survey and Youth Risk Behavior Survey are more extensive (more than 30 logic checks for each), suggesting a need to provide a more comprehensive list of logic checks to account for all possible combinations of inconsistencies in GYTS data (8,9).
This study shows how different ways of removing inconsistent data influence estimates of cigarette smoking prevalence. Clearly described methods for handling inconsistent data are necessary for reproducibility and comparability of GYTS results. Multiple researchers across WHO regions use and publish GYTS data, and accurate comparisons between 2 studies can be made only if both use the same approach to handling inconsistent data. Ways to reduce data inconsistency may include piloting surveys before implementation and incorporating built-in skip patterns if electronic versions of the survey are explored in the future. A limitation of this study is that the sampled countries are not representative of all countries conducting GYTS, so the findings are not generalizable to them.
Data cleaning and management, as essential aspects of quality assurance and determinants of study validity, require transparency and proper documentation of all procedures (10). Implications for researchers using GYTS include recognition of its data editing approach and documentation in their study methods to ensure cross-study comparability.
This project received no funding. None of the authors have a commercial or other financial interest associated with the information presented in this manuscript.
The opinions expressed by authors contributing to this journal do not necessarily reflect the opinions of the U.S. Department of Health and Human Services, the Public Health Service, the Centers for Disease Control and Prevention, or the authors' affiliated institutions.
Suggested citation for this article: Lam E, Rolle I, Shin M, Lee KA. Impact of Data Editing Methods on Estimates of Smoking Prevalence, Global Youth Tobacco Survey, 2007–2009. Prev Chronic Dis 2013;10:120202. DOI: http://dx.doi.org/10.5888/pcd10.120202.