From a multi-institutional perspective, our study is among the first to examine the utility of data from a national survey completed by all medical schools in the United States (i.e., the GQ) as a possible measure of the impact of curricular change in multischool studies. Newell14 used data from the GQ as part of an assessment of curricular change in geriatrics education, but only a single institution was involved in that study. Pugnaire and colleagues15 used GQ data to assess the longitudinal stability of students' perceptions of the quality of their clerkships, and Marantz et al16 used GQ data to assess student satisfaction with curricular change; all of those studies involved single institutions. Although the GQ has also been used to assess changes in planned career settings17 and predictors of career choice,18 only one of those last two studies16 was a multi-institutional study. Our study makes a substantial contribution to the literature on what would be involved in conducting a national assessment of changes in the SBS curriculum, as measured by the GQ. To our knowledge, no other group of educational researchers has analyzed and classified the content of the GQ as we have done. We hoped that these efforts would allow our nine schools to use this measure to assess changes over time.
The findings from our study suggest that for our learning collaborative to benefit from the GQ, given the assumptions we included in our analyses, the effect size associated with the planned innovations would have to be very large (≥107%). Thus, our collaborative, with its sample size of nine schools, is likely too small to identify significant impacts of curricular change using this tool. We found that it would likely take a collaboration of more than 35 medical schools to observe moderate changes (50%) in curricular impact over time, and that a sample of 199 schools would be needed to detect a small effect size (20%), which is greater than the number of medical schools currently enrolling students in the United States and Canada. Given that the impetus for our work was an extensive IOM review, we believe it is important for the medical education community to understand the utility of existing measures and the foundational work needed to study change over time.
The analytic challenge inherent in sample size calculations involves balancing rigorous criteria, such as setting power at 0.80, using two-tailed tests, and setting appropriate alpha levels (.05, or .01 to account for multiple comparisons), against other parameters that can be expected to vary. In our case, we had a fixed sample size (n = 9) and available summary scores that allowed us to apply some assumptions regarding mean scores and standard deviations, leaving the effect size, that is, the difference we could attribute to curricular innovations, as the remaining unknown to be solved for. Although we found that our small collaborative of nine schools could likely not detect statistically significant differences associated with change, a much larger collaborative might. Effect size differences for measurable change on attitude items in the GQ are likely to be small, warranting a larger collaborative to measure such differences with adequate statistical power. We are encouraged by this work, and we hope others will undertake such studies so that larger collaboratives can form in meaningful ways.
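As a concrete illustration, the calculation can be reproduced with standard power analysis software. The sketch below is a minimal example, assuming a paired design (each school serving as its own control, analyzed as a one-sample t-test), power of 0.80, and a two-tailed alpha of .05; it approximates, but does not exactly reproduce, the figures reported above, which also depended on our assumptions about GQ means and standard deviations.

```python
# Minimal power analysis sketch, assuming a paired (one-sample) t-test
# with power = 0.80 and two-tailed alpha = .05. Illustrative only.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# With the collaborative's fixed sample size of 9 schools, solve for the
# smallest detectable standardized effect size (Cohen's d).
d_min = analysis.solve_power(nobs=9, alpha=0.05, power=0.80,
                             alternative='two-sided')
print(f"Detectable effect with 9 schools: d = {d_min:.2f}")  # ~1.07

# Conversely, solve for the number of schools needed to detect moderate
# (d = 0.50) and small (d = 0.20) effects.
for d in (0.50, 0.20):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    print(f"Schools needed to detect d = {d}: n = {n:.0f}")  # ~34 and ~199
```

Expressed as percentages of a standard deviation, these standardized effects correspond to the 107%, 50%, and 20% figures discussed above.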
This work illustrates that the GQ can enable the establishment of baseline measures that could be used to assess curricular changes in subsequent medical school classes across U.S. medical schools. Our study found that improvements could be detected for the majority of GQ measures related to SBS. Few areas would be subject to a ceiling effect, in which statistical changes cannot be detected because baseline scores are already high, provided the effect sizes were large and enough medical schools participated in the analysis. The IOM report1 (p7) indicated that existing national databases provide inadequate information on behavioral and social science content, teaching techniques, and assessment methodologies. This lack of data impedes the ability to reach conclusions about the current state and adequacy of SBS instruction in U.S. medical schools. Our study represents one approach to remedying this situation. In fact, the IOM report1 set the precedent for using the only available standardized questionnaire administered to all U.S. medical students, the GQ, by citing its use to measure medical student satisfaction with specific topics, most notably communication, cultural diversity, and socioeconomics.1
Many articles have reported on the paucity of robust evaluation and educational research conducted in the United States.19–22 It is vitally important that this change and that schools collaborate more effectively toward a unified mission of enhancing medical education. It is important to note, however, that the GQ is not a completely standardized instrument: some of the questions change over time, which would affect the use of the tool in prospective studies. Retrospective analyses would need to take these changes into account.
With nine medical schools receiving NIH Behavioral and Social Sciences K07 awards, each institution had its own vision and measurement approach. However, it became readily apparent that collaboration on common IOM domain themes could make the respective projects synergistically powerful. The OHSU administrative supplement created an evaluation core to develop plans and databases for pooled data and to undertake the relevant analyses, setting the stage for the nine-school NIH Behavioral and Social Sciences Consortium.5 Several schools have overlapping student outcome evaluation measures; however, a standardized measure that can be used across all institutions offers clear advantages. As these efforts progress through the nine schools' implementation phases, available baseline data, such as those from the GQ, serve as a logical starting point, and we plan to continue our assessments using this tool over time and, if possible, to work with the NIH to expand the collaborative.
Using the GQ core questions to track the influence of the medical schools' interventions assumes that this tool, which was not designed for this purpose, is sensitive enough to measure change. Our analytic approach essentially created a proxy measure for capturing change in the absence of an existing, psychometrically sound measure. This creates both an opportunity and a challenge. The opportunity is to use the GQ as a proxy for assessing broad, systemic curricular changes over time. However, we recognize that, as a proxy measure, the GQ may not be sensitive enough to capture the benefits, or negative consequences, of very targeted curricular change. To the extent that curricular change is school-specific, new measures will have to be developed that can more directly detect the subtleties of those efforts. The modified Delphi approach we undertook allowed us to discuss this issue at length, and we hope readers of this report will understand and appreciate the challenges involved. In research, many measures are adapted or used in ways that differ from their original purpose, because instrument development, testing, and administration are expensive undertakings. Our hope is that our collaborative will contribute to the dialogue about the use of national data sources.
With the exception of the Mind-Body Interactions in Health and Disease domain, the collaborative's modified Delphi approach achieved consensus on GQ questions corresponding to the remaining IOM domains: Patient Behavior, Physician Role and Behavior, Physician-Patient Interactions, Social and Cultural Issues in Healthcare, and Health Policy and Economics. Supplementing the GQ with additional SBS questions in the Mind-Body Interactions in Health and Disease domain could therefore be important. These results will be useful to other medical institutions as they evaluate curricular change and address both unique and common curriculum components, and the approach could be useful for measuring Liaison Committee on Medical Education topic areas across institutions.
The IOM report notes that the domains it identifies are broad, and it leaves each school to decide exactly what content to include. The local variations in the curricula offered will be important to track. Given the possible variability in what is actually implemented at each of the nine schools, the GQ core questions may be more or less sensitive to the effects of the interventions. For example, a school that offers little content regarding cultural components of care should not expect to see a change in that component of the GQ core questions; if a change is evident, it may have occurred for reasons other than the intervention. For the GQ core questions to be used to track the consequences of the medical schools' interventions, and to reach their highest potential, it is critical to document how the curricula have actually been implemented at each of the schools. With that piece in place, and with enough medical schools contributing data, the GQ would become a valuable tool for monitoring change.
A strength of our study is that nine schools collaborated and worked with an existing measurement tool to which all medical schools have access. The cost of the administrative supplement that supports the nine schools was minimal ($40,000 annually), an amount that other collaborative efforts could contribute at the institutional level if the value were explicitly recognized. The GQ has been administered to graduating medical students since 1978 to assist the AAMC and medical schools in priority setting and in program and policy development.23 In addition, as indicated, several studies have been conducted using this tool,14–18 and many of them used less formal processes than ours for reaching agreement on the classification of study variables.
An important limitation of the study is that we did not have access to data files containing individual responses from each medical student, which would have allowed us to work with the data without the assumptions required when only school-level summary data, rather than individual-level data, are available. Because of this, the sample size calculations we conducted are oversimplified, though we attempted to be as conservative as possible. We made multiple attempts to obtain individual schools' data from the AAMC directly, both via a request from the multischool core located at OHSU and via individual schools. Unfortunately, no request succeeded in obtaining a data file with individual responses. Although some schools had been able to obtain GQ data in the past, in response to the present request we were told that, because of limited resources at the AAMC, only summary data at the level of the school would be provided. Had we had access to individual-level data, our calculations could have taken into account the clustering of students within specific institutions as well as expected effect sizes based on the planned innovations. This would also have allowed us to adjust for possible confounders, such as students' age and gender. Nevertheless, the GQ is likely the best starting point for addressing these questions, given that an extensive, structured experimental design across the nine institutions is beyond the funding capability of the K07 awards. We hope that studies like ours will foster the use of national data in a way that can benefit all medical schools and students in training.
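To illustrate why the clustering of students within schools matters for such calculations, consider the standard Kish design effect. The sketch below is purely hypothetical: the intraclass correlation (ICC) and the average number of respondents per school are assumed values chosen for illustration, not quantities estimated from our data.

```python
# Hypothetical illustration of clustering, assuming an ICC and an average
# cluster size that are NOT estimated from our data.
icc = 0.05                 # assumed intraclass correlation for a GQ item
students_per_school = 120  # assumed average number of GQ respondents per school

# Kish design effect for cluster sampling: DEFF = 1 + (m - 1) * ICC
deff = 1 + (students_per_school - 1) * icc
n_individuals = 9 * students_per_school      # nine schools in the collaborative
effective_n = n_individuals / deff           # clustering shrinks the usable n

print(f"Design effect: {deff:.2f}")                 # 6.95
print(f"Effective sample size: {effective_n:.0f}")  # ~155 of 1,080 responses
```

Even a modest ICC, in other words, can sharply reduce the effective sample size, which is one reason individual-level data would have supported more realistic power estimates.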
We also did not have access to control schools, so we made some assumptions about historical cohort comparisons (using each school as its own control) and about assessments using mean national scores, assumptions that would need to be tested with a more rigorous study design. Last, in our analyses of some IOM domain areas we considered only the category of appropriate curricular time, without taking into account whether the remaining time was excessive or inadequate. The consequence is that, although we can compare measures of appropriate time with some accuracy, we do not fully address excessive or inadequate time. However, our detailed exploration of this issue indicates that the most common response was appropriate curricular time, with fewer than 20% of responses indicating inadequate curricular time (15.1%) or too much curricular time (4.6%); thus, our findings are unlikely to have been influenced to any significant extent by this approach.
Finally, we note that response rates to the GQ ranged from 48% to 100%, with a mean of 74%. There is the possibility of response bias affecting these findings, especially from the four schools with response rates below 75%. The schools used different techniques to encourage their students to complete the GQ, including doing nothing, weekly e-mail reminders from the administration, a student-run recruitment program, offering a nominal gift certificate to each student completing the survey, offering a nominal amount toward the Residency Match Day Fund, requiring completion of the survey to receive grades, and requiring completion for graduation (one school). These different approaches appear to have influenced students' response rates. Given that some schools had more difficulty than others in attaining high response rates, this variability may have introduced bias into the study, even though the overall response rate was relatively high (74%).