Physical therapists are often interested in studies that involve treatment interventions, the identification of risk factors for injury or disease, and the diagnosis of clinical conditions. In each of these areas, a number of different study designs can be implemented, and the design chosen can determine how much weight a study's findings carry in its field. The design that a researcher chooses should be the one that most appropriately answers the question being posed.2
However, in many cases the choice reflects the resources that researchers have at their disposal and the practicalities of undertaking the research. The resources required for a study may include physical space and equipment, expertise in data collection, administrative processing of data, statisticians for analysing data, and patient availability. In most cases, a researcher cannot secure all of these resources at the desired level, so compromises are made, and these often affect the choice of design utilised during the research process.
In studies concerning interventions, risk factors and diagnosis, the strength of an experimental paper's design is rated on a scale that has four to five levels and may be regarded as a hierarchy, with Level I being the highest. In the current paper, the hierarchies presented are based on those recommended by the National Health and Medical Research Council of Australia.3
However, there are others4
and they generally follow the same pattern, differing only in the alphanumeric nomenclature given to the levels of the hierarchy (eg: 1a or IIa). While one design may be high in the hierarchy for a particular question, it may not fare so well for a different question. For instance, while a prospective cohort study may be very effective at identifying risk factors, such a design does not provide professionals with the best evidence of a treatment's effect on a particular clinical condition. For the latter, a randomised controlled trial (RCT) would be more appropriate. Thus, it is important to recognise that different study designs have particular features that may make them advantageous for answering a certain type of research question.
If possible, always look for systematic reviews when searching the literature. A Level I rating is reserved for a systematic review of the experimental
papers. In such a paper, the quality of the designs and the findings of all the individual experimental papers are assessed in a systematic manner to provide an overall assessment or answer for a particular study question. However, it should be noted that not all systematic reviews automatically reach Level I. If the papers reviewed were primarily studies with poor designs, then the strength of evidence for answering the question posed is lower, and the systematic review, no matter how well it was conducted, will not receive Level I status.5
Thus, the experimental papers upon which the review is based should determine the validity and strength of the review's findings.
Even when a systematic review has utilised papers with the strongest possible designs, the professional needs to appreciate a number of other factors that will influence its importance. These include the number of papers reported upon and the consistency of the results across papers. One should also appreciate the degree to which the findings apply to the clinical population of interest and what the implications are with respect to applying them in clinical practice, that is, whether they could reasonably be implemented. On the above‐mentioned scale, the highest quality experimental designs are rated Level II, and lesser‐rated designs receive levels that decline towards Level IV.
For studies examining treatment interventions, randomised controlled trials (RCTs) provide Level II evidence, the strongest level of evidence below a systematic review. Not surprisingly, the two key criteria for these study designs are the incorporation of at least one control group and the randomisation of participants.6
Without a control group, it is impossible to determine how participants would have changed over time without the experimental intervention; for instance, changes may have occurred due to disease progression or spontaneous recovery. The specific conclusions that can be drawn regarding the experimental intervention depend critically on what the control group receives. For example, researchers could compare the effects of icing on acute knee pain against a control group that received no specific intervention, or they could give the control group a bag of peas at room temperature to place over the knee for the same period of time. In the first example, the only conclusion that could be drawn is that icing is more effective at reducing pain than no treatment, whereas in the second, by controlling for the effects associated with receiving a physical intervention to the knee and for the time of application, a researcher could make more specific conclusions regarding the effects of the ice itself. In terms of randomisation, the crucial criterion for an RCT is that neither the participant nor the experimenter should be able to predict which group the participant will be allocated to. Commonly accepted randomisation procedures include a coin toss, a random number generator, or drawing group allocation from a sealed envelope. While researchers may design more complex procedures to ensure that group characteristics are matched on important factors and that participant numbers are balanced between groups (see the sketch below), the final determination of group allocation for each participant should be due to chance alone.
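To make the idea of chance-based allocation concrete, the following is a minimal sketch of block randomisation using a random number generator, one way of keeping participant numbers balanced between groups while leaving each individual allocation to chance. The function name, block size, group labels and use of Python are illustrative assumptions, not a procedure described in this paper.

```python
import random

def block_randomise(n_participants, block_size=4, groups=("intervention", "control"), seed=None):
    """Allocate participants in shuffled blocks so that group sizes stay balanced."""
    rng = random.Random(seed)
    allocations = []
    while len(allocations) < n_participants:
        # Each block contains an equal number of slots for every group...
        block = list(groups) * (block_size // len(groups))
        # ...but the order of allocations within the block is left entirely to chance.
        rng.shuffle(block)
        allocations.extend(block)
    return allocations[:n_participants]

# Example: allocate 10 participants. Neither the participant nor the experimenter
# can predict the next assignment, yet group sizes remain closely balanced.
print(block_randomise(10, seed=42))
```

By contrast, a deterministic rule such as assigning odd participant numbers to one group and even numbers to the other would also balance the groups, but it would be predictable, which is precisely what separates a pseudo‐RCT from a true RCT.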
One step down from an RCT is the pseudo‐RCT, which provides Level III‐1 evidence. In these study designs there is still an appropriate control group, but group allocation is not strictly randomised; instead it is dictated by a set rule such as date of birth or participant number. These are weaker allocation procedures because the experimenter can know in advance the group to which a participant will be assigned. The ability to predict group allocation introduces bias, as this knowledge can affect the decision about whether to enter a participant into the trial, which may in turn bias the results of the trial overall.
The next level of evidence, Level III‐2, incorporates non‐randomised controlled trials and two types of observational studies. Non‐randomised controlled trials carry marked group selection bias. For example, participants may allocate themselves into groups by choosing to receive a treatment, or participants presenting to a particular treatment provider might always be allocated to the experimental intervention while those presenting to another provider receive only a control intervention. The observational designs include the cohort study, in which a group of people exposed to a particular intervention are followed over time and their health outcomes compared with those of a similar group who were not exposed. The other is the case‐control design, in which people with a selected condition are identified and their history of exposure to an intervention is compared with that of a similar group of people who do not have the condition. In all of these study designs, the researchers do not control group allocation, and thus the potential for selection bias is substantially higher than in RCTs. This selection bias means there is an inherent risk that confounding factors, that is, factors other than the intervention of interest, are influencing the results of the study. However, it is important to recognise that there are some research questions and interventions for which researchers cannot apply the principles of randomisation and assign subjects to different groups, e.g. abortion or obesity, or whether parachutes are an effective life saver. In such situations, the observational designs are the best or only alternative, and hence they can be extremely valuable.7
The final group of studies providing Level III evidence (Level III‐3) are comparative studies without a concurrent control group. These are non‐randomised studies in which a group of people receiving the intervention of interest is compared with previous or historical information, or with another group receiving a different intervention in another study. The key limitation of these studies is the lack of a concurrent control group, which makes it impossible to determine the specific effects of the intervention in the population of interest. Attempting to compensate for the missing control group by comparing with historical data or other studies is an improvement over non‐comparative studies (see case series below), but is still limited. For example, comparison with historical data on disease progression may be confounded by changes in disease management, the specific characteristics of the participants tested, or variations in the assessment of outcome measures.
The lowest level of evidence (Level IV) is provided by case series that have no comparison group. These are usually pre‐test to post‐test comparisons of outcomes following an intervention in a single group. Obviously, the lack of a control comparison severely limits the strength of the findings and the conclusions that can be drawn. These study designs will often incorporate a second pre‐test measure following a baseline control period; this control period and additional baseline measure marginally strengthen the design by enabling participants to serve “as their own control”. Case series designs are commonly used for feasibility studies to demonstrate the potential efficacy, safety, or practicality of an intervention before implementation in a larger, more robust study.8
In the intervention section above, we described observational study designs such as the prospective cohort and the case‐control study. While not the best choice of design for examining interventions where subjects can be randomised into groups, they can be very powerful in the study of risk factors associated with the development of clinical conditions.9
In the aetiology hierarchy, the strongest of the observational studies is the prospective cohort, which receives Level II. As the name suggests, it follows a group of similar individuals (eg: forestry workers) over time to examine whether a particular factor (eg: vibration from chainsaw use) influences the occurrence of an outcome (eg: osteoarthritis in the hand). A key point is that the outcome has not yet occurred at the commencement of the study. Such a design allows consistent measurement of exposure across all study participants and consistent measurement of the criteria that determine the outcome (eg: the presence of osteoarthritis in the hand). Cohort designs can be prospective or retrospective, with the latter sitting at a lower hierarchical level. The key difference is that in the retrospective design the data related to the exposure and the outcome have already been collected. In many instances, the risk factor and/or outcome of interest was not the reason for the original study.10
For example, while a prospective study may have primarily been run to examine vibration levels as a risk factor for osteoarthritis of the hand in forestry workers, data might also have been collected on specific safety procedures and injuries that occurred in this cohort. Such data can be linked retrospectively, and associations between variables can provide important findings. However, because the retrospective question was not the original intention of the study, the same degree of standardisation and precision in the data collection procedures is unlikely to have been achieved, and therefore the design is not as strong as a prospective study.
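To illustrate how exposure and outcome data gathered in such a cohort are typically summarised, the sketch below computes the incidence of the outcome in exposed and unexposed workers and their ratio (the relative risk). The counts are entirely hypothetical, and the relative risk calculation is a standard epidemiological summary offered for illustration rather than an analysis reported in this paper.

```python
def incidence(cases, total):
    """Proportion of the group that developed the outcome during follow-up."""
    return cases / total

# Hypothetical counts for illustration only: chainsaw-exposed vs unexposed forestry workers.
exposed_cases, exposed_total = 18, 120      # developed hand osteoarthritis / number followed
unexposed_cases, unexposed_total = 6, 115

risk_exposed = incidence(exposed_cases, exposed_total)
risk_unexposed = incidence(unexposed_cases, unexposed_total)

# A relative risk above 1 suggests the exposure is associated with a higher
# occurrence of the outcome; confounding must still be considered.
relative_risk = risk_exposed / risk_unexposed
print(f"Risk exposed: {risk_exposed:.2f}, unexposed: {risk_unexposed:.2f}, RR: {relative_risk:.1f}")
```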
At the next level in the hierarchy of designs for examining risk factors is the case‐control study. In this design, two groups are identified: one that has the clinical condition of interest and another that does not. For instance, a group of forestry workers with osteoarthritis of the hand would form the case group, and they would be compared with a group of forestry workers without osteoarthritis of the hand. That comparison might involve examining potential physical risk factors (e.g. tools used, tasks performed, times and volume of work) experienced by both groups over a specified period, to highlight a risk factor or set of factors that differ across the groups. This design is weaker than the cohort design, as only the outcome (osteoarthritis of the hand) has the potential to have been measured in a standardised and precise manner.10
Even then, one of the most notable criticisms of this design is that the criteria for inclusion in either the case or control group may be insufficient to accurately represent the wider population with and without the condition of interest.9
This is particularly so when the case‐control design is targeting risk factors for a rare condition. Yet characterising the risk factors associated with rare conditions is a key strength of the case‐control design: the alternative, a prospective cohort, would mean waiting for sufficient cases of the disease to accrue so that its risk factors could be characterised well, and that may never eventuate.
Cross‐sectional study designs and case series form the lowest level of the aetiology hierarchy. In the cross‐sectional design, data concerning each subject are recorded at a single point in time. For instance, a questionnaire asking about the presence of osteoarthritis in the hand might be sent to a district where forestry is a predominant industry, allowing the prevalence of the disorder to be established. Some information related to exposure might also be collected and associations might be observed, but it is difficult to be confident in the validity of these associations. Thus, information gained from the cross‐sectional study is often a starting point that provides the impetus to use a more powerful design to substantiate the initial findings.
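As a simple illustration of what such a survey yields, the sketch below estimates prevalence from hypothetical questionnaire returns; all figures are invented for the example, and the closing comment notes why any exposure‐outcome association observed this way should be treated cautiously.

```python
# Hypothetical questionnaire returns from a forestry district (illustrative numbers only).
respondents = 800
reported_hand_oa = 96          # respondents reporting osteoarthritis of the hand

# Point prevalence: proportion of respondents with the disorder at the time of the survey.
prevalence = reported_hand_oa / respondents
print(f"Point prevalence of hand OA: {prevalence:.1%}")

# An exposure question (e.g. regular chainsaw use) could be cross-tabulated against the
# outcome, but because exposure and outcome are measured at the same point in time the
# survey cannot show which came first, so any association should only prompt further study.
```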
For diagnostic studies, the basic design utilised is very similar across most studies, and the higher levels of the hierarchy are based on meeting specific methodological criteria within that design. To receive Level II strength, the design is usually a prospective cohort, and the comparison it makes between a diagnostic test and a reference standard must meet the following criteria:11
All subjects should receive the reference standard, and that standard should be the best available evidence for determining whether the condition of interest is present. For studies involving primary care, this will often be a scanning or electrophysiological procedure and might also include an anaesthetic block, while in studies involving tertiary care patients, the reference standard is often what is observed at surgery. The diagnostic test and the reference standard should also be completely independent of one another, and both must be clearly described so that others can replicate them. The persons performing the diagnostic tests should have no knowledge of the results of the reference standard, and similarly those performing the reference standard should have no knowledge of the results of the diagnostic test. Finally, the patients participating in the study must be well described, represent those with mild as well as severe levels of the condition of interest, be recruited in a consecutive manner, and all be accounted for at the end of the study.
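To show how the blinded, independent comparison between a diagnostic test and a reference standard is commonly summarised, the sketch below tallies a 2×2 table and reports sensitivity and specificity. These accuracy measures, the function, and all counts are illustrative additions based on common practice, not values or terminology drawn from this paper.

```python
def accuracy_summary(results):
    """results: list of (test_positive, reference_positive) pairs, one per patient,
    with every patient receiving both the index test and the reference standard,
    each interpreted blind to the other."""
    tp = sum(t and r for t, r in results)          # test and reference both positive
    fp = sum(t and not r for t, r in results)      # test positive, reference negative
    fn = sum(not t and r for t, r in results)      # test negative, reference positive
    tn = sum(not t and not r for t, r in results)  # both negative
    sensitivity = tp / (tp + fn)   # proportion of true cases the test detects
    specificity = tn / (tn + fp)   # proportion of non-cases the test correctly rules out
    return sensitivity, specificity

# Hypothetical data: (clinical test result, reference standard result) for 8 patients.
data = [(True, True), (True, True), (False, True), (True, False),
        (False, False), (False, False), (True, True), (False, False)]
print(accuracy_summary(data))
```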
Studies where the subjects are not consecutively recruited are assigned Level III‐1 strength. When the criteria relating to the reference standard are partially compromised, a study is regarded as Level III‐2. When a study uses a group of subjects that does not include a wide spectrum of those likely to have the condition, or does not identify specific potential sub‐groupings that might affect the results, it is assigned Level III‐3. Such studies are often case‐control designs with narrow criteria for inclusion in either the case or control groups, which can ultimately affect the generalisability of the results.12
The lowest level (IV) is reserved for those studies that lack a reference standard.