Evidence‐based practice requires that physical therapists are able to analyze and interpret scientific research. When performing or evaluating research for clinical practice, sports physical therapists must first be able to identify the appropriate study design. Research begins by identifying a specific aim or purpose; researchers should always attempt to use a methodologically superior design when performing a study. Research design is one of the most important factors to understand because:
1. Research design provides validity to the study;
2. The design must be appropriate to answer the research question; and
3. The design provides a “level of evidence” used in making clinical decisions.
Research study designs must have appropriate validity, both internally and externally. Internal validity refers to the design itself, while external validity refers to the study's applicability in the real world. While a study may have internal validity, it may not have external validity; however, a study without internal validity is not useful at all.
Most clinical research suffers from a conflict between internal and external validity. Internally valid studies are well‐controlled with appropriate designs to ensure that changes in the dependent variable result from manipulation of an independent variable. Well‐designed research provides controls for managing or addressing extraneous variables that may influence changes in the dependent variable. This is often accomplished by ensuring a homogeneous population; however, clinical populations are rarely homogeneous. An internally valid study with control of extraneous variables may not represent a more heterogeneous clinical population; therefore, clinicians should always consider the conflict between internal and external validity both when choosing a research design and when applying the results of research in order to make evidence‐based clinical decisions.
Furthermore, research can be basic or applied. Basic science research is often done on animals or in a controlled laboratory setting using tissue samples, for example. Applied research involves humans, including patient populations; therefore, applied research provides more clinical relevance and clinical application (i.e., external validity) than basic science research.
One of the most important considerations in research design for internal validity is to minimize bias. Bias represents the intentional or unintentional favoring of something in the research process. Within research designs, there are 5 important features to consider in establishing the validity of a study: sample, perspective, randomization, control, and blinding.
- Sample size and representation are very important for both internal and external validity. A larger sample increases statistical power and can also make the sample more representative of the target population. Unfortunately, some studies use a ‘convenience sample’, often consisting of college students, which may not represent a typical clinical population. A representative clinical sample can provide a higher level of external validity than a convenience sample.
- In terms of perspective, a study can be prospective (before the fact) or retrospective (after the fact). A prospective study has more validity because of more control of the variables at the beginning of and throughout the study, whereas a retrospective study has less control since it is performed after the end of an event. A prospective design provides a higher level of evidence to support cause‐and‐effect relationships, while retrospective studies are often associated with confounding variables and bias.
- Random assignment to an experimental or control group is performed so that each group represents a ‘normal distribution’ of the population. Randomization reduces selection bias to ensure one group doesn't have an advantage over the other. Sometimes groups, rather than individual subjects, are randomly assigned to an experimental or control group; this is more precisely termed cluster randomization, although it is sometimes referred to as “block randomization.” Sample bias can also occur when a “convenience sample” is used that might not be representative of the target population. This is often seen when healthy, college‐aged students are included, rather than a representative sample of the population.
- A control group helps ensure that changes in the dependent variable are due to changes in the independent variable, and not due to chance. A control group receives no intervention, while the experimental group receives some type of intervention. In some situations, a true control group is not possible or ethical; therefore, “quasi‐experimental” designs are often used in clinical research where the control group receives a “standard treatment.” Sometimes, the experimental group can be used as its “own control” by testing different conditions over time.
- Blinding (also known as “masking”) is performed to minimize bias. Ideally, both the subjects and the investigator should be blinded to group assignment and intervention. For example, a “double‐blind” study is one in which the subjects are not aware if they are receiving the experimental intervention or a placebo and, at the same time, the examiner is not aware which intervention the subjects received.
Considering these 5 features, a prospective, randomized, controlled, double‐blinded clinical outcome study with a large sample of patients would likely provide the best design to assure very high internal and external validity.
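As a minimal sketch of the random assignment described above, the following Python snippet shuffles a list of subject identifiers and splits it into two arms of equal size. The function name `randomize_subjects`, the subject IDs, and the seed are illustrative assumptions rather than part of any cited protocol; the seed is fixed only so that the example is reproducible.

```python
import random


def randomize_subjects(subject_ids, seed=None):
    """Randomly split subjects into experimental and control arms.

    Hypothetical helper for illustration: shuffle-and-split allocation with
    equal group sizes; if the count is odd, the extra subject goes to control.
    """
    rng = random.Random(seed)      # local generator, global random state untouched
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)          # every ordering of subjects is equally likely
    half = len(shuffled) // 2
    return {"experimental": shuffled[:half], "control": shuffled[half:]}


# Example: 20 hypothetical subjects numbered 1 to 20
groups = randomize_subjects(range(1, 21), seed=2024)
print(groups)
```

Because assignment depends only on the shuffle, neither the investigator nor the subject characteristics influence which arm a subject enters, which is the property that limits selection bias.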
Most research follows the “scientific method”. The scientific method progresses through four steps:
1. Identification of the question or problem;
2. Formulation of a hypothesis (or hypotheses);
3. Collection of data; and
4. Analysis and interpretation of data.
Different research designs are used to answer a question or address a problem. Different authors provide different classifications of research designs.1‐4
Within the scientific method, there are 2 main classifications of research methodology: experimental and non‐experimental. Both employ systematic collection of data. Experimental research is used to determine cause‐and‐effect relationships, while non‐experimental is used to describe observations or relationships in a systematic manner. Both experimental and non‐experimental research consist of several types and designs.
Experimental methods follow the scientific method in order to examine changes in one variable by manipulating other variables to attempt to establish cause‐and‐effect. The dependent variable is measured under controlled conditions while controlling for confounding variables. It is important to remember that statistics do not establish cause‐and‐effect; rather, the design of the study does. Experimental statistics can only reject a null hypothesis and identify variance accounted for by the independent variable. Thomas et al.4 provide three criteria to establish cause‐and‐effect:
1. Cause must precede effect in time;
2. Cause and effect must be correlated with each other; and
3. Relationship cannot be explained by another variable.
There are 3 elements of research to consider when evaluating experimental designs: groups, measures, and factors. Subjects in experimental research are generally classified into groups such as an experimental (those receiving treatment) or control group. Technically speaking, however, “groups” refers to the treatment of the data, not how the treatment is administered.2 Groups are sometimes called “treatment arms” in order to denote subjects receiving different treatments. True experimental designs generally use randomized assignment to groups, while quasi‐experimental research may not.
Next, the order of measurements and treatments should be considered. “Time” refers to the course of the study from start to finish. Observations, or measurements of the dependent variables, can be performed one or several times throughout a study. The term “repeated measures” denotes any measurement that is repeated on a group of subjects in the study. Repeated measures are often used in quasi‐experimental research when the subjects act as their own control in one group, while true experimental research can use repeated measurements of the dependent variable as a single factor (“time”).
Since experimental designs are used to identify changes in a dependent variable by manipulating an independent variable, “factors” are used. Factors are essentially the independent variables. Individual factors can also have several levels. Single‐factor designs are referred to as “one‐way” designs with one independent variable and any number of levels. One‐way designs may have multiple dependent variables (measurements), but only one independent variable (treatment). Studies involving more than one independent variable are considered “multi‐factorial” and are referred to as “two‐way” or “three‐way” (and so on) designs. Multi‐factorial designs are used to investigate interactions within and between different variables. A “mixed design” factorial study includes 2 or more independent variables with one repeated across all subjects and the other randomized to independent groups. One example is a 2‐way repeated measures design that includes a true control group.
Two‐way repeated measures experimental design to determine interactions within and between groups.
Factorial designs are denoted with numbers representing the number of levels of each factor. A two‐way factorial (2 independent variables) with 2 levels of each factor is designated by “2 × 2”. The total number of groups in a factorial design can be determined by multiplying the number of levels of each factor together; for example, a 2×2 factorial has 4 groups while a 2×3×2 factorial has 12. The differences in factorial designs can be illustrated with an example of 3 studies examining strength gains of the biceps during exercise. Each factor has multiple levels. In the 1‐way study, strength of the biceps is examined after performing flexion or extension with standard isotonic resistance. In the 2‐way study, a 3‐level factor is added by comparing different types of resistance during the same movements. In the 3‐way study, 2 different intensity levels are added to the design.
Examples of progressive factorial designs.
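The counting rule for factorial designs can be sketched in a few lines of Python. The snippet below enumerates every combination of factor levels for a 3‐way design modeled on the biceps example. Only “flexion”, “extension”, and “isotonic” come from the text; the remaining level labels and the function names are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical factor levels for the 3-way biceps example. The labels
# "elastic", "isokinetic", "low", and "high" are assumed, not from the source.
factors = {
    "movement": ["flexion", "extension"],
    "resistance": ["isotonic", "elastic", "isokinetic"],
    "intensity": ["low", "high"],
}


def design_label(factors):
    """Return the conventional design label, e.g. '2 x 3 x 2'."""
    return " x ".join(str(len(levels)) for levels in factors.values())


def design_cells(factors):
    """Enumerate every combination of factor levels (one cell per group)."""
    return list(product(*factors.values()))


cells = design_cells(factors)
print(design_label(factors), "->", len(cells), "groups")
for cell in cells[:3]:
    print(cell)
```

Each tuple returned by `design_cells` corresponds to one group in the design, so the length of the list equals the product of the numbers of levels.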
Statistical analysis of a factorial design begins by determining a main effect, which is an overall effect of a single independent variable on dependent variables. If a main effect is found, post‐hoc analysis examines the interaction between independent variables (factors) to identify the variance in the dependent variable.
As described previously, there are 2 types of experimental designs: true experimental and quasi‐experimental.
True Experimental Designs
True experimental designs are used to determine cause‐and‐effect by manipulating an independent variable and measuring its effect on a dependent variable. These designs always have at least 2 groups for comparison.
In a true experimental design, subjects are randomized into at least 2 independent, separate groups, including an experimental and “true” control. This provides the strongest internal validity to establish a cause‐and‐effect relationship within a population. A true control group consists of subjects that receive no treatment while the experimental group receives treatment. The randomized, controlled trial design is the “gold standard” in experimental designs, but may not be the best choice for every project.
Common true experimental designs include 2 independent, randomly assigned groups and a true control group. Notation is often used to illustrate research designs.
Common true experimental designs.
Clinical researchers often find it difficult to use true experimental designs with a ‘true’ control because it may be unethical and sometimes illegal to withhold treatment within a patient population. In addition, clinical trials are often affected by a conflict between internal and external validity. Internal validity requires rigorous control of variables; however, that control does not support real‐world generalizability (external validity). As previously described, clinical researchers must seek balance between internal and external validity.
Quasi‐experimental designs are those that do not include a true control group or randomization of subjects. While these types of designs may reduce the internal validity of a study, they are often used to maximize a study's external validity. Quasi‐experimental designs are used when true randomization or a true control group is unethical or difficult. For example, a ‘pseudo‐control’ group may include a group of patients receiving traditional treatment rather than a true control group receiving nothing.
Block randomization or cluster grouping, rather than individual randomization, may also be more practical when examining groups. Subjects are grouped by similar variables (age, gender, etc.) to help control for extraneous factors that may influence differences between groups. The block factor must be related to the dependent variable (i.e., the factor affecting response to treatment).
A cross‐over or counterbalanced design may also be used in a quasi‐experimental study. This design is often used when only 2 levels of an independent variable are repeated to control for order effects.3 A cross‐over study may take twice as long since both groups must undergo the intervention at different times. During the cross‐over, both groups usually go through a ‘washout’ period of no intervention to be sure prolonged effects are not a factor in the outcome.
Examples of quasi‐experimental designs can include both single and multiple groups. Quasi‐experimental designs generally do not randomize group assignment or use true control groups. (Note: One‐group pre‐post test designs are sometimes classified as “pre‐experimental” designs.)
Single‐subject designs are also considered quasi‐experimental as they draw conclusions about the effects of a treatment based on responses of single patients under controlled conditions.3 These designs are used when withholding treatment is considered unethical, when random assignment is not possible, or when it is difficult to recruit subjects, as is commonly seen in rare diseases or conditions. Single‐subject designs have 2 essential elements: design phases and repeated measures.3 Design phases include baseline and intervention phases. The baseline measure serves as a ‘pseudo‐control.’ Repeated measurement over time (for example, during each treatment session) can occur during the baseline and intervention phases. Single‐subject designs are commonly denoted by the letters ‘A’ (baseline phases) and ‘B’ (intervention phases): A‐B; A‐B‐A; and A‐B‐A‐B. Other single‐subject designs include withdrawal, multiple baselines, alternating treatment, multiple treatment, and interactive design. For more detailed descriptions of single‐subject designs, see Portney and Watkins.3
Studies involving non‐experimental methods include descriptive, exploratory, and analytic designs. These designs do not infer cause‐and‐effect by manipulating variables; rather, they are designed to describe or explain phenomena. Through systematic collection of data, non‐experimental designs help provide an early understanding of clinical conditions or situations without a full clinical study.
Descriptive designs are used to describe populations or phenomena, and can help identify groups and variables for new research questions.3 Descriptive designs can be prospective or retrospective, and may use longitudinal or cross‐sectional methods. Phenomena can be evaluated in subjects either over a period of time (longitudinal studies) or through sampling different age‐grouped subjects (cross‐sectional studies). Descriptive research designs are used to describe results of surveys, provide norms or descriptions of populations, and to describe cases. Descriptive designs generally focus on describing one group of subjects, rather than comparing different groups.
Surveys are one of the most common descriptive designs.4 They can be in the form of questionnaires or interviews. The most important component of an effective survey is to have an appropriate sample that is representative of the population of interest. There are generally 2 types of survey questions: open‐ended and closed‐ended. Open‐ended questions have no fixed answer, while closed‐ended questions have definitive answers including rank, scale, or category. Investigators should be careful not to lead answers of subjects one way or another, and to keep true to the objectives of the study. Surveys are limited by the sample and the questions asked. External validity is threatened, for example, if the sample is not representative of the population addressed by the research question.
A special type of survey is the Delphi technique, which uses expert opinions to make decisions about practices, needs, and goals.4 The Delphi technique uses a series of questionnaires in successive stages called “rounds.” The first round of the survey focuses on opinions of the respondents, and the second round of questions is based on the results of the first round, where respondents are asked to reconsider their answers in the context of others' responses. Delphi surveys are common in establishing expert guidelines where consensus around an issue is needed.
A descriptive observational study evaluates specific behaviors or variables in a specific group of subjects. The frequency and duration of the observations are noted by the researcher. An investigator observing a classroom for specific behaviors from students or teachers would use an observational design.
Normative research describes typical or standard values of characteristics within a specific population.3 These “norms” are usually determined by averaging the values of large samples and providing an acceptable range of values. For example, goniometric measures of joint range of motion are reported with an accepted range of degrees, which may be recorded as “within normal limits.” Samples for normative studies must be large, random, and representative of the population heterogeneity.3 The larger the target population, the larger the sample required to establish norms; however, sample sizes of at least 100 are often used in normative research. Normative data are extremely useful in clinical practice because they serve as a basis for determining the need for an intervention, as well as an expected outcome or goal.
Developmental research helps describe the developmental change and the sequencing of human behavior over time.3 This type of research is particularly useful in describing the natural course of human development. For example, understanding the normal developmental sequencing of motor skills can be useful in both the evaluation and treatment of young athletes. Developmental designs are classified by the method used to collect data; they can be either cross‐sectional or longitudinal.
Case designs offer thoughtful descriptions and analysis of clinical information;2 they include case reports, case studies, and case series. A case report provides an in‐depth description of a unique patient, while a case study focuses on a unique situation. These cases may involve a series of patients or situations, which is referred to as a ‘case series’ design. Case designs are often useful in developing new hypotheses and contributing to theory and practice. They also provide a springboard for moving toward quasi‐experimental or experimental designs in order to investigate cause and effect.
Research measures can also be classified as quantitative or qualitative. Quantitative measures explain differences, determine causal relationships, or describe relationships; these designs include those previously discussed. Qualitative research, on the other hand, emphasizes attempting to discern process and meaning without measuring quantity. Qualitative studies focus on analysis in trying to describe a phenomenon. Qualitative research examines beliefs, understanding, and attitudes through skillful interview and content analysis.5 These designs are used to describe specific situations, cultures, or everyday activities. A comparison between qualitative and quantitative designs is provided below.
Comparison of quantitative and qualitative designs (Adapted from Thomas et al4 and Carter et al2).
Exploratory designs establish relationships without manipulating variables while using non‐experimental methods. These designs include cohort studies, case control studies, epidemiological research, correlational studies, and methodological research. Exploratory research usually involves comparison of 2 or more groups.
A cohort is a group of subjects being studied. Cohort studies may evaluate single groups or differences between specific groups. These observations may be made in subjects one time, or over periods of time, using either cross‐sectional or longitudinal methods.
In contrast to experimental designs, non‐experimentally designed cohort studies do not manipulate the independent variable, and lack randomization and blinding. A prospective analysis of differences in cohort groups is similar to an experimental design, but the independent variable is not manipulated. For example, outcomes after 2 different surgeries in 2 different groups can be followed without randomization of subjects using a prospective cohort design.
“Outcomes Research” has been classified as a retrospective, non‐experimental cohort design, where differences in groups are evaluated ‘after the fact’ without random allocation to groups or manipulation of an independent variable. This design would include chart reviews examining outcomes of specific interventions.
Case Control Studies
Case control studies are similar to cohort studies in that they compare a group of subjects with a particular condition to a group without the condition. Both groups are observed over the same period of time, therefore requiring a shorter timeframe compared to cohort studies. Case control studies are better for investigations of rare diseases or conditions because the sample size required is smaller than that of a cohort study. The control group (injury/disease‐free) is generally matched to the injury/disease group on confounding variables such as age, gender, and ethnicity.
Case control studies sometimes use “odds ratios” in order to estimate the relative risk that would have been found had a cohort study been done.4 An odds ratio greater than 1 suggests an increased risk, while a ratio less than 1 suggests reduced risk.
Studies that evaluate the exposure, incidence rates, and risk factors for disease, injury, or mortality are descriptive studies of epidemiology. According to Thomas et al,4 epidemiological studies evaluate “naturally occurring differences in a population.” Epidemiological studies are used to identify a variety of measures in populations.
Measurement terminology used in epidemiological research.
One such measure is “relative risk” (RR), which is associated with exposure and incidence rates. Portney and Watkins3 use a “contingency table” to determine the relative risk and odds ratio. Usually, incidence rates are compared between 2 groups by dividing the incidence of one group by the other.
Contingency Table to determine risk (Adapted from Portney and Watkins3).
With these formulas, the “null value” is 1.0. A risk or odds ratio less than 1.0 suggests reduced risk or odds, while a value greater than 1.0 suggests increased risk or odds. For example, if the relative risk is 1.5 for a group, that group has 1.5 times the risk of suffering an injury compared with the reference group. Relative risk should be reported with a confidence interval, typically 95%.
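The arithmetic behind relative risk and the odds ratio can be sketched as follows. The cell layout (a, b, c, d), the function name `risk_measures`, and the counts are assumptions made for illustration, since the original table is not reproduced here. The confidence interval uses the common log‐scale approximation, which is one of several available methods.

```python
from math import exp, log, sqrt


def risk_measures(a, b, c, d, z=1.96):
    """Relative risk, odds ratio, and an approximate 95% CI for the relative risk.

    Assumed 2 x 2 layout (not taken from the original table):
        a = exposed, injured      b = exposed, not injured
        c = unexposed, injured    d = unexposed, not injured
    All four cells must be non-zero for these formulas.
    """
    risk_exposed = a / (a + b)        # incidence in the exposed group
    risk_unexposed = c / (c + d)      # incidence in the unexposed group
    rr = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)
    # Standard error of ln(RR), used to build the interval on the log scale
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    ci = (exp(log(rr) - z * se_log_rr), exp(log(rr) + z * se_log_rr))
    return rr, odds_ratio, ci


# Hypothetical counts: 30 of 100 exposed athletes injured vs. 20 of 100 unexposed
rr, odds_ratio, ci = risk_measures(30, 70, 20, 80)
print(f"RR = {rr:.2f}, OR = {odds_ratio:.2f}, 95% CI for RR = {ci[0]:.2f} to {ci[1]:.2f}")
```

With these hypothetical counts the relative risk is 1.5, yet the 95% confidence interval includes the null value of 1.0, which illustrates why the interval should be reported alongside the point estimate.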
Epidemiological studies can also be used to test a hypothesis of the effectiveness of an intervention on injury prevention by using incidence as a dependent variable. These studies help link exposures and outcomes with observations, and can include case control and cohort studies mentioned previously.
Correlational studies examine relationships among variables. Correlations are expressed using Pearson's “r” value, which can range from −1 to +1. A Pearson's “r” value of +1 indicates a perfect positive linear correlation, in which an increase in one variable is matched by a proportional increase in the other. In contrast, an “r” value of −1 indicates a perfect inverse relationship. An “r” value of 0 indicates no linear relationship between the variables. The most important thing to remember is that correlation does not imply causation; in other words, correlational studies can't be used to establish cause‐and‐effect. In addition, 2 variables may have a high correlation (r>.80), but lack statistical significance if the p‐value does not reach the chosen significance level. Finally, be aware that correlational studies must have a representative sample in order to establish external validity.
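To make the range of Pearson's r concrete, the sketch below computes it directly from its definition (covariation divided by the product of the variations). The function name `pearson_r` and the data are illustrative assumptions. The third example is included because r captures only linear association: a curved relationship can yield r = 0 even when one variable is completely determined by the other.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient computed from paired observations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)   # undefined if either variable is constant


# Hypothetical data sets
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # perfect positive line
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])           # perfect inverse line
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x squared
print(r_positive, r_inverse, r_curved)
```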
The usefulness of clinical research and decision‐making heavily depends on the validity and reliability of measurements.3 Methodological research is used to develop and test measuring instruments and methods used in practice and research. Methodological studies are important because they establish the reliability and validity of the measures that other studies rely on. First, the reliability of the rater (inter‐rater and intra‐rater reliability) must be established when administering a test in order to support the accuracy of measurements. Inter‐rater reliability supports consistent measurements between different raters, while intra‐rater reliability supports consistent measures for the same individual rater. Reliability can also be established for instruments by demonstrating consistent measurements over time. Reliability is related to the ability to control error, and is thus associated with internal validity.
Methodological studies are also used to establish validity for a measurement, which may include clinical diagnostic tests, performance batteries, or measurement devices. Measurement validity establishes the extent to which an instrument measures what it intends to measure. Different types of validity can be measured, including face validity, content validity, criterion‐related validity and construct validity.
Different types of validity in scientific research.
Sports physical therapists may also be interested in the sensitivity and specificity of clinical tests. Sensitivity refers to the ability of a test to correctly identify those with a condition, while specificity refers to the ability to correctly identify those without the condition. Unfortunately, few clinical tests possess both high sensitivity and specificity.6
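The two definitions above translate directly into proportions from a comparison of test results against a reference standard. The following sketch assumes hypothetical counts and a hypothetical function name, `sensitivity_specificity`; it is meant only to show the calculation.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from counts against a reference standard.

    sensitivity = proportion of people WITH the condition who test positive
    specificity = proportion of people WITHOUT the condition who test negative
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical clinical test: 50 patients with the condition, 100 without
sens, spec = sensitivity_specificity(true_pos=45, false_neg=5, true_neg=60, false_pos=40)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this made‐up example the test identifies most patients who have the condition (high sensitivity) but also flags many who do not (lower specificity), the kind of trade‐off noted in the text.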
Analytical research designs are not just a review or summary, but a method of evaluating the existing research to reach a conclusion. These designs provide a synthesis of the literature for empirical and theoretical conclusions.4 Analytical designs explain phenomena and analyze existing data using systematic reviews and meta‐analysis techniques. In contrast to systematic reviews, meta‐analyses include statistical analysis of data.
Systematic reviews most commonly examine the effectiveness of interventions, but may also examine the accuracy of diagnostic tools.3 Systematic reviews of randomized controlled trials provide the highest level of evidence possible.7 Systematic reviews should describe their methodology in detail, including inclusion and exclusion criteria for studies reviewed, study designs, and outcomes measures. In addition, the method of literature search should be detailed including databases, dates, and keywords used.
Systematic reviews can be extended into meta‐analysis if multiple studies contain necessary information and data. Meta‐analysis techniques are particularly useful when trying to analyze and interpret smaller studies and studies with inconsistent outcomes. Meta‐analysis of randomized controlled trials provides a high level of evidence, but may suffer in quality from heterogeneous samples, bias, outliers, and methodological differences.
Meta‐analysis converts the results of various studies into a standard metric that allows for statistical analysis to calculate effect sizes. The effect size, expressed as “Cohen's d,” is a standardized measure of the relationship between two variables. Effect size provides the magnitude and direction of the effect of a treatment, and is determined by the difference in means divided by the standard deviation (ΔM / SD). A Cohen's d value of 0.2 is considered small, 0.5 moderate, and 0.8 or greater large. Confidence intervals are then reported to provide an interval of certainty.
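The effect size calculation can be sketched as below. The text specifies only “the standard deviation” as the denominator; the pooled standard deviation used here is one common choice and is an assumption, as are the function names and the strength scores.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Difference in means divided by the pooled standard deviation.

    Using the pooled SD is an assumption; other denominators (for example,
    the control group SD) are also used in practice.
    """
    n1, n2 = len(treatment), len(control)
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def interpret(d):
    """Label the magnitude using the 0.2 / 0.5 / 0.8 thresholds given in the text."""
    size = abs(d)                  # the sign carries direction, not magnitude
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "small"
    return "below small"           # label for d < 0.2 is ours, not from the text


# Hypothetical strength scores for two groups of 5 subjects
treatment_scores = [12, 14, 16, 18, 20]
control_scores = [10, 12, 14, 16, 18]
d = cohens_d(treatment_scores, control_scores)
print(f"d = {d:.2f} ({interpret(d)})")
```

Here the means differ by 2 units and the pooled standard deviation is the square root of 10, giving d of about 0.63, a moderate effect by the thresholds above.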
Levels of Evidence
Research designs are often viewed in a hierarchy of evidence. These designs have been discussed in this paper, but bear repeating in the context of evidence‐based practice. “Levels of Evidence” have been established by the Center for Evidence‐Based Medicine in Oxford, England, as well as other research consortiums. Each level is based on controlling as many factors (variables) as possible to confidently make conclusions without bias, the highest of which is cause‐and‐effect. In addition, “grades” of evidence have been established based on the quality and number of various levels of evidence to make recommendations in reviews and guidelines. Thus, a research publication could be described and labeled using a combination of a level and a grade, such as “Level II‐A” or “Level II‐B”.
Levels of Evidence (Adapted from the Center for Evidence-Based Medicine7).
Grades of Evidence (Adapted from the Center for Evidence-Based Medicine7).