As educators seek confirmation of successful trainee achievement, medical education must move toward a more evidence-based approach to teaching and evaluation. Although medical training often provides physicians with a general background in biostatistics, many are not prepared to apply these skills. This can hinder clinician-educators who wish to develop, analyze and disseminate their scholarly work. This paper is intended to be a concise educational tool and guide for choosing and interpreting statistical tests in medical education assessment. It includes guidelines and examples that clinician-educators can use when analyzing and interpreting studies and when writing the methods and results sections of reports.
As accreditation bodies seek confirmation of successful trainee achievement, 1, 2 medical education must move toward a more evidence-based approach to teaching and evaluation. 3, 4 To meet these challenges, educators must have knowledge and skills in developing, analyzing and disseminating educational interventions as part of their scholarly work. Effective development and evaluation require a fundamental knowledge of study design and statistical methods. Although medical training often provides physicians with a general background in epidemiology and biostatistics, many physicians are not prepared to apply these skills. 5–7
While an effort has been made to help educators apply epidemiology to educational research, 8 we found no references that help educators understand how to use statistical tests to evaluate educational interventions. This paper is intended to be a concise educational tool and guide for choosing a statistical test during medical education assessment and for interpreting and analyzing educational studies without relying on mathematical theory. To provide a framework for understanding statistical concepts and to illustrate the decision-making process needed to choose a statistical test, we present an educational intervention detailing the hypothesis testing, data analysis, and interpretation of the results. Examples of statistical tests recently used in the educational literature are provided in Appendix 1, and statistical terms appearing in boldface are defined in Appendix 2.
Before determining which statistical test to use, one must consider study hypotheses, study design, number of study groups, whether groups are matched or paired for certain characteristics, type of outcome data, and how data are distributed in the sample. A checklist of questions addressing these areas is provided in Table 1. First, we present a sample educational intervention to illustrate the statistical concepts presented later in the text.
We developed a 1-month curriculum to improve second-year medical students' physical examination skills, interpersonal skills and confidence level. We conducted a randomized controlled trial in which half of the class received the new curriculum and the other half served as controls. We collected information regarding student age, gender, and college major. We evaluated all students' physical examination and interpersonal skills using a standardized patient exam 1 week after the intervention (Note: for simplicity, we will consider only one station of a standardized patient exam). We assessed the number of relevant physical examination maneuvers performed correctly by each student (total of 6 maneuvers), a 20-item interpersonal score rated by the standardized patient, and whether the patient would recommend the student to a friend. Each interpersonal item was rated on a 5-point Likert scale (1 =poor, 5 =excellent). We assessed each intervention student's confidence level in performing physical examination techniques before and after the curriculum using a 4-point Likert scale (1 =not very confident, 4 =very confident).
We used a Student's t-test to compare the mean number of physical examination maneuvers performed correctly and the Wilcoxon rank-sum test to compare overall interpersonal scores between groups. We used the Wilcoxon signed-rank test to compare intervention students' confidence level before and after the curriculum. To assess the relationship between student characteristics (age, gender, and college major) and the likelihood of being recommended to a friend, we performed multiple logistic regression.
With a sample size of 60 students in each group, the study had 80% power to detect a difference of 1.2 maneuvers between the intervention and control groups in the mean number of relevant physical examination maneuvers performed correctly.
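The power statement above can be checked with a short calculation. The following is a minimal sketch using a normal approximation to the two-sample comparison of means; it is not the calculation we performed for the study. The standard deviation of 2.35 maneuvers is an assumption chosen only because it roughly reproduces the stated design, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_two_means(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided comparison of two means.

    Uses the normal approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    z = NormalDist()
    se = sd * sqrt(2 / n_per_group)       # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # about 1.96 when alpha = .05
    return z.cdf(delta / se - z_crit)


def n_per_group_for_power(delta, sd, power=0.80, alpha=0.05):
    """Approximate number of students needed in each group."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total * sd / delta) ** 2)


ASSUMED_SD = 2.35  # assumption: not reported in the text

power = power_two_means(n_per_group=60, delta=1.2, sd=ASSUMED_SD)
needed = n_per_group_for_power(delta=1.2, sd=ASSUMED_SD)
print(f"Approximate power with 60 per group: {power:.2f}")
print(f"Approximate n per group for 80% power: {needed}")
```

Under these assumptions, a smaller detectable difference or a larger standard deviation would require more students per group to keep the same power.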
Statistics is the scientific use of data to describe and draw inferences about true associations or phenomena by assessing the strength of the evidence for or against a hypothesis. It is used to make predictions and comparisons about a larger population based on data collected from a smaller sample. Since we usually cannot test an entire population (e.g., all second-year medical students), we must rely on sample data to guide our understanding of the truth. How well our sample represents the larger population determines how generalizable our findings are.
Data collected in any study are subject to variation. Some variation comes from random error (chance) and some from systematic error (bias). Bias can be introduced at any stage of a study, from its development to the reporting of the results. 9 The goals of any study should include decreasing bias and minimizing error.
Studies generally have 2 variable types: the response variable (also called the outcome or dependent variable) and the explanatory variable (also called a covariate or independent variable). These variables can be quantitative or qualitative in nature. Quantitative variables are numerical and can be continuous or discrete. Continuous variables have no gaps in the values (e.g., age), whereas discrete variables have gaps (e.g., the number of study participants). Qualitative variables describe certain attributes and are either ordinal or nominal. Ordinal variables have an implicit ranking associated with them (e.g., Likert scales), whereas nominal variables are descriptive and cannot be ordered (e.g., college major). The types of dependent and independent variables used to make comparisons influence what statistical tests are needed.
The appropriate use of statistics depends upon the research question(s) being asked. These questions and study hypotheses influence the study design and should be determined before conducting a study. Two types of study designs are commonly used in research: observational and experimental. Observational studies examine groups at one or more points in time (e.g., case-control, cross-sectional, and cohort studies). Experimental studies, or controlled trials, allocate participants to one or more groups and make comparisons across groups to assess differences in outcomes. Our study was a randomized controlled trial. Random allocation involves chance in the assignment of participants to intervention and control groups. This avoids a potential bias called selection bias that may be present if group assignment is known, as is often the case in observational studies. Selection bias can produce comparison groups that are different from each other from the study onset. This can limit the interpretation and generalizability of the study results.
The study design and the type of comparison group influence the statistical analyses performed. If the study uses a pre-post design, each participant is assessed by the same instrument at different points in time. The results obtained for each individual during different measurements are more likely to be highly correlated than the results of 2 randomly selected participants. Statistical analyses in this case should be performed using paired methods such that each participant serves as his/her own comparison. Our study requires the use of paired methods to assess differences in student confidence level before and after the intervention.
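To illustrate what "each participant serves as his/her own comparison" means in practice, the sketch below computes within-student differences and applies a sign test. The sign test is a simpler paired method than the Wilcoxon signed-rank test used in our study, because it uses only the direction of each change and not its size. The confidence ratings are hypothetical and chosen for illustration only.

```python
from math import comb

# Hypothetical 4-point confidence ratings for 10 intervention students.
before = [1, 2, 2, 1, 3, 2, 2, 3, 1, 2]
after = [3, 3, 2, 2, 4, 3, 2, 4, 3, 3]

# Paired analysis: work with each student's own change.
diffs = [a - b for b, a in zip(before, after)]

# The sign test drops students whose rating did not change.
nonzero = [d for d in diffs if d != 0]
n = len(nonzero)
n_pos = sum(d > 0 for d in nonzero)

# Under the null hypothesis, increases and decreases are equally likely,
# so the number of increases follows a Binomial(n, 0.5) distribution.
k = min(n_pos, n - n_pos)
p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

print(f"{n_pos} of {n} changed ratings were increases; two-sided P = {p_value:.4f}")
```

Analyzing the same ratings as 2 independent groups would ignore the correlation between each student's before and after values.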
The first step in any analysis is to explore the data collected to ensure that they are reasonable, accurate and not affected by measurement or recording errors. Exploratory data analysis, or descriptive statistics, is a method of organizing, summarizing and displaying data. It includes calculating measures of central tendency (e.g., mean and median) along with measures of dispersion (e.g., standard deviation and interquartile range). Graphically displaying the data in histograms, stem-and-leaf plots or box-and-whisker plots will also aid in assessing patterns of dispersion and can identify potential outlying values that may influence study results. Understanding the type of data collected and how it is dispersed helps determine which types of statistical analyses can be performed.
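As a minimal sketch of exploratory data analysis, the following uses Python's standard `statistics` module to compute the measures named above. The scores are hypothetical, and the 1.5 × interquartile range rule for flagging outlying values is a common convention for box-and-whisker plots rather than something prescribed in this paper.

```python
from statistics import mean, median, quantiles, stdev

# Hypothetical number of maneuvers performed correctly (0 to 6) by 10 students.
maneuvers = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]

q1, q2, q3 = quantiles(maneuvers, n=4)  # quartiles; q2 is the median
iqr = q3 - q1

summary = {
    "mean": mean(maneuvers),
    "median": median(maneuvers),
    "sd": stdev(maneuvers),  # sample standard deviation
    "q1": q1,
    "q3": q3,
    "iqr": iqr,
}

# Flag values far outside the middle half of the data.
outliers = [x for x in maneuvers if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("potential outliers:", outliers)
```

A mean and median that differ markedly, or flagged values, would prompt a closer look at the distribution before choosing a test.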
Confirmatory data analysis, or inferential statistics, uses estimation and hypothesis testing to assess the strength of the evidence, make comparisons, make predictions and draw conclusions about a population based on the sample data. Types of inferential statistics include bivariate analyses that investigate relationships between 1 dependent and 1 independent variable, and multivariable analyses that investigate relationships between 1 dependent and multiple independent variables while controlling for the possible confounding influence of several independent variables on the dependent variable. In our example, we use bivariate analyses to compare differences in interpersonal scores between groups and multivariable analyses to quantify the association of student characteristics with the likelihood of being recommended to a friend.
The results of inferential statistics are reported according to the type of data collected and the statistical test or method used to determine the result (e.g., mean number of physical examination maneuvers performed correctly in each group using a Student's t-test). Results are also described by a level of statistical significance expressed as a P-value or estimated with a confidence interval (CI).
In hypothesis testing, the null hypothesis is a statement of no effect or no association. The null hypothesis regarding our main study goal would be: Participants and controls do not differ in the mean number of relevant physical examination maneuvers performed correctly at the end of the curriculum. The alternative hypothesis is that there is a difference.
Two types of errors can occur when making conclusions regarding the null hypothesis: Type I error and Type II error. A Type I error refers to rejecting the null hypothesis when the null hypothesis is true (false positive). A Type II error refers to failing to reject the null hypothesis when it is false (false negative). The goal is to minimize the probability of making a Type I error. Most studies set this probability, known as the significance level, at .05. In statistical tests, P-values are calculated as the probability of obtaining an outcome as extreme as or more extreme than the observed study result under the assumption that the null hypothesis is true. If the P-value is less than the significance level, the result is considered statistically significant (e.g., P <.05). When statistical significance is not observed, either the null hypothesis is true (i.e., no difference really exists) or the sample size was not large enough to detect a difference (i.e., insufficient statistical power). The relationship between sample size, effect size, and statistical power is important to consider and is described elsewhere. 10, 11
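The definition of a P-value can be made concrete with a permutation test, which estimates directly how often a difference at least as large as the observed one arises when group labels are assigned by chance. This is a sketch for illustration and not the Student's t-test used in our study; the scores, the function name and the number of permutations are assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation P-value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs_mean_diff(pooled) >= observed - 1e-12:
            extreme += 1
    # Adding 1 counts the observed arrangement itself and avoids P = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical numbers of maneuvers performed correctly.
intervention = [5, 4, 6, 5, 4, 5, 6, 4, 5, 5]
control = [3, 4, 3, 2, 4, 3, 5, 3, 4, 3]

p_value = permutation_p_value(intervention, control)
print(f"Permutation P-value: {p_value:.4f}")
print("Statistically significant at .05:", p_value < 0.05)
```

When the 2 groups have identical scores, every random relabeling is at least as extreme as the observed difference of 0, so the estimated P-value is 1.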
Although P-values are used ubiquitously in the literature, they have several limitations. P-values do not indicate the strength or direction of the association, nor do they provide a direct interpretation of the results. For this reason, a 95% confidence interval (CI) associated with the result should be used when possible. A 95% CI is constructed so that, if the study were repeated many times, about 95% of the resulting intervals would contain the true value; it is commonly read as 95% confidence that the interval contains the true value. The true value refers to the outcome that we would expect if we could test the entire population. In our example, we wanted to determine whether there was a difference in the mean number of relevant physical examination maneuvers performed correctly between groups. The 95% CI for the true difference in mean scores was 0.85 to 1.7, suggesting that the true difference lies approximately in the range of 1 to 2 maneuvers. Studies with larger sample sizes and less variation will have narrower CIs, indicating more precision in the results. Those with smaller sample sizes and higher variation will have wider CIs, indicating less precision.
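The relationship between sample size and CI width can be sketched as follows. The calculation uses a normal approximation, which is reasonable for groups of about 60 but would be replaced by the t distribution in smaller samples. The group means and standard deviations are hypothetical; they were chosen to give an interval similar to the one reported above and are not the study data.

```python
from math import sqrt
from statistics import NormalDist


def ci_difference_in_means(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Approximate CI for the difference between two independent means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)   # standard error of the difference
    diff = mean_a - mean_b
    return diff - z * se, diff + z * se


# Hypothetical summary statistics, 60 students per group.
low, high = ci_difference_in_means(4.5, 1.2, 60, 3.2, 1.2, 60)

# Same means and SDs with only 15 students per group.
low_small, high_small = ci_difference_in_means(4.5, 1.2, 15, 3.2, 1.2, 15)

print(f"n = 60 per group: 95% CI {low:.2f} to {high:.2f}")
print(f"n = 15 per group: 95% CI {low_small:.2f} to {high_small:.2f}")
```

Because neither interval includes 0, both are consistent with a real difference, but the interval from the smaller sample is roughly twice as wide.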
Before conducting a study, investigators should decide what will count as statistical significance and as clinical (practical) significance. To do this, one needs to define the magnitude of detectable difference that would provide a meaningful change in outcome. In some studies, statistical significance may be reached due to large sample size, but the practical significance of the outcome may not be noteworthy. Conversely, statistical significance may not be reached due to low sample size, but the outcome may be clinically relevant. In our example, we wished to see if the intervention improved the average number of physical exam maneuvers performed correctly by students. We needed to ascertain in advance, either from other research or practical experience, the increase in average number of exam maneuvers that would constitute a meaningful change in results, and establish a sample size that would allow statistical detection of this change.
The distribution of data assessed during exploratory data analysis helps determine whether parametric or nonparametric tests should be used to make comparisons. Parametric tests are based upon the assumption that the data are sampled from a known population distribution. (Note: we will consider only the normal, or bell-shaped, distribution for continuous outcome data and the binomial distribution for dichotomous outcomes.) If continuous outcome data in a sample are skewed toward either higher or lower values, or if the sample size is small, nonparametric tests should be used. Ordinal variables are usually analyzed using nonparametric tests; however, parametric tests can be used when values of separate variables are summed together to produce a total score which follows a normal distribution (e.g., summing each student's 20-item interpersonal ratings to obtain an overall score). Nonparametric tests use ranked observations rather than the actual values and do not assume that the shape of the distribution is known. 12 These tests are more conservative, but are important to use when parametric assumptions do not hold.
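To show how a nonparametric test works with ranks rather than actual values, the sketch below implements a basic Wilcoxon rank-sum test using a normal approximation. It omits the tie and continuity corrections that statistical software typically applies, so it should be read as an illustration of the idea rather than a replacement for such software. The interpersonal scores and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 through j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Return the rank sum of group_a, its z-score and a two-sided P-value."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])
    mean_w = n_a * (n_a + n_b + 1) / 2             # expected rank sum under the null
    sd_w = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p


# Hypothetical overall interpersonal scores (possible range 20 to 100).
intervention = [82, 90, 77, 85, 88, 91]
control = [70, 75, 68, 80, 72, 74]

w, z, p = rank_sum_test(intervention, control)
print(f"Rank sum = {w:.1f}, z = {z:.2f}, two-sided P = {p:.4f}")
```

Because only the ordering of scores matters, replacing the highest score with an extreme value would leave the result unchanged, which is why rank-based tests are robust to skewed data.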
We will use the steps outlined in Table 1 and the diagrams in Appendix 1 to illustrate how to select the appropriate statistical test for each of the 4 study hypotheses.
Participants and controls do not differ in the mean number of relevant physical examination maneuvers performed correctly at the end of the curriculum.
Participants and controls do not differ in their overall interpersonal scores at the end of the curriculum.
Participants' confidence level in performing physical examination maneuvers does not differ before and after the curriculum.
No association exists between a student's age, gender, and college major with the patient's recommendation of the student to a friend.
This paper illustrates the decision-making processes clinician-educators can use to select statistical tests for interventions with 2-group comparisons. Examples of comparisons between 3 or more groups, correlations, and different regression analyses can be found in Appendix 1. Other tests or analyses may be needed depending on the research question of interest. Studies using observer ratings should be analyzed for interrater and/or intrarater reliability to assess consistency of results. When multiple comparisons will be performed, researchers may need to adjust the significance level to a smaller value (e.g., P =.001) to decrease the probability of finding a statistically significant result by chance alone. When performing regression analyses, certain assumptions must be checked to assess whether a specific regression model is appropriate, and the potential for confounding and effect modification by certain covariates should be considered. 13
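The reason for adjusting the significance level can be shown numerically. The sketch below uses the Bonferroni correction, one common and conservative adjustment that we name here as an example; the paper does not prescribe a specific method. The P-values are hypothetical and are not results from our study.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


def bonferroni_threshold(alpha, n_comparisons):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / n_comparisons


# Hypothetical P-values for 4 comparisons.
p_values = {
    "maneuvers": 0.001,
    "interpersonal score": 0.030,
    "confidence": 0.008,
    "recommendation": 0.200,
}

alpha = 0.05
n_tests = len(p_values)
unadjusted_risk = familywise_error_rate(alpha, n_tests)
threshold = bonferroni_threshold(alpha, n_tests)
significant = [name for name, p in p_values.items() if p < threshold]

print(f"Chance of at least 1 false positive in {n_tests} tests: {unadjusted_risk:.3f}")
print(f"Bonferroni-adjusted significance level: {threshold:.4f}")
print("Significant after adjustment:", significant)
```

In this hypothetical example, a comparison with P = .03 would be called significant at the usual .05 level but not after adjustment.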
With this guide, we hope to provide educators with a tool for improving the quality of medical education research conducted and presented in the literature. To obtain appropriate advice on both statistical design and analysis, we suggest consulting a statistician early in a study. Other resources such as textbooks and references for clinical research 10, 11 may be needed to address areas not covered in this paper.