Oxman's criteria, a list of standards for systematic reviews, have been used as a guide by the Cochrane Collaboration for over 4000 systematic reviews on health care.14,15
The criteria consist of 11 items related to the quality of the systematic review in problem formulation, data collection, data synthesis, and interpretation of results.3
The table below presents these criteria, along with our evaluations of the standards as applied to the systematic reviews of Lee et al. (2008) and Wayne et al. (2007) on the efficacy of TC for bone mineral density.
Oxman et al. Criteria and T'ai Chi (TC) Bone Mineral Density Systematic Reviews
Because our evaluations for six of the criteria differ between the reviews, they are addressed in more detail below (Criteria 3, 4, 6, 8, 9, and 10). The evaluations for the remaining five are comparable and are not discussed further.
Criterion 3: Are the inclusion criteria appropriate?
The selection of articles for inclusion from the computer database searches is determined by the scoring criteria. These scoring systems have critical implications not only for the selection of studies but also for the conclusions. In their discussion of systematic reviews in complementary medicine, Linde and Willich posit that discrepancies in conclusions can stem from subtle differences in inclusion criteria of the reviews.16
Using the Jadad scoring system, Lee et al. selected three randomized controlled trials (RCTs) and one controlled clinical trial (CCT) for postmenopausal women, while for “elderly,” they selected two RCTs and one CCT. Wayne et al. included six controlled studies for postmenopausal women, with two RCTs, two cross-sectional studies, and two prospective parallel cohort studies. Three (3) postmenopausal studies were selected by both reviews.17–19
Under the Jadad rating, TC studies can never achieve a perfect score of 5. The 5th point cannot be awarded for double blinding because of the visibility of TC movements; at best, TC studies can create a partial single blind for the outcome assessor. In Lee et al., studies that were deemed “higher quality” were necessarily RCTs or CCTs. Consequently, no cross-sectional or prospective cohort studies were included. Though the omission of cross-sectional and prospective cohort studies by Lee et al. was consistent with the Jadad criteria, it effectively limited Lee et al.'s final pool of studies to only short-term TC studies. In their suggestions for future research, Lee et al. stated that current TC research needs to examine long-term benefits of TC to better understand its potential, and that prolonging TC interventions for longer than a year “might give a different picture of the effects of Tai Chi.” These longer term cross-sectional and prospective cohort studies may have value, particularly for hypothesis generation and future research directions.20
However, the Jadad criteria did not allow them to be included for review.
In contrast, Wayne and colleagues included two cross-sectional studies and one prospective parallel cohort study.21–23
The three studies selected by Wayne et al. included subjects with long-term experience with TC, comparing them to control subjects matched for age and sex. The Wayne et al. checklist also examined the methodological quality of the primary studies in more detail than the Jadad scale. Their checklist verified clear inclusion and exclusion criteria, sample size calculations, and appropriate descriptive and inferential statistics. They included two criteria specific to TC: how TC was implemented in the intervention, and the qualifications of the TC instructor. The frequency, intensity, and duration of the TC intervention and the way it was implemented can make a difference in the outcome and should be reported in studies. For instance, one can expect a lower “dosage” of TC consisting of 30 minutes per week to have less impact than 5 intensive hours of practice per week. Furthermore, highly knowledgeable and skilled instructors may provide a stronger role model and may increase compliance and effects due to more skillful home practice.24
Inclusion of these TC-specific criteria can enable more consistent replication of the intervention and can improve the quality of future systematic reviews.
Criterion 4: Is the validity of included studies adequately assessed?
Both teams of reviewers evaluated their databases of studies using the standards proposed by their quality rating tools. This prescription of method is no guarantee of internal validity, even for double-blind RCT designs.25
The nature of the control condition (active, passive, waiting list) is essential for internal validity considerations and the interpretation of the active components in the intervention; it needs to be addressed. Neither review supplied this information. In addition, inter-rater reliability is missing from both reviews, limiting statistical measurements of validity.26
Both reviews mentioned that inconsistencies leading to low inter-rater reliability were resolved by discussion. Although this forced agreement is expedient, it does not allow the independent verification that is central to inter-rater reliability. Other authors have stated that the Jadad scale has low reliability and face validity.27
For example, a study could have received a perfect rating of 5 in Jadad yet be fraught with errors that distorted data and conclusions.6
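The inter-rater reliability that neither review reported is commonly summarized with a chance-corrected agreement statistic such as Cohen's kappa. The following is a minimal sketch of that calculation, not the method of either review; the function name `cohen_kappa` and the two rating vectors are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)

    # Observed proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is complete.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical item-level judgments (1 = criterion met, 0 = not met).
# These values are invented for illustration and come from neither review.
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this invented example the raters agree on 8 of 10 items, yet kappa is only about 0.52 because much of that agreement would be expected by chance. Reporting such a figure before disagreements are resolved by discussion would preserve the independent verification that consensus ratings obscure.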
In these two reviews of TC studies on bone mineral density, there were starkly conflicting evaluations of one article used in both systematic reviews, a study by K. Chan et al. titled “A randomized, prospective study of the effects of Tai Chi Chun exercise on bone mineral density in postmenopausal women”.17
In this RCT, the study authors compared bone mineral density levels following assignment to a TC exercise group versus assignment to a sedentary control group. Using the Jadad scale, Lee and colleagues awarded 2 of 5 total points to the study.9
Although the details of their rating are not given, one may surmise that it received 1 point for its randomized design and 1 point for description of patient withdrawals and dropouts. Lee et al. referred to the study very briefly, dismissing it as low quality. In contrast, Wayne et al. rated the Chan et al. study 7 of 9 quality checks, the highest rating of all studies reviewed.10
It lost a total of 2 checks: one for the absence of outcome assessor blinding and the other for not mentioning the qualifications of the TC instructor. Wayne and colleagues10 described the Chan et al. study results in greater detail and referred to it repeatedly when they formed their overall conclusions recommending TC for widespread dissemination.10, pp. 675–678
To better understand these diverging ratings of the Chan et al. study, we conducted our own investigation of its construct validity using three other well-regarded quality rating instruments: scales by W. Chan and Bartlett, Cho and Bero, and Downs and Black.28–30
The choice of these quality-rating instruments was purposely broad, in order to sample different approaches to systematic review. The Chan and Bartlett approach was similar to Wayne et al. in that it was an ad hoc method developed for reviewing TC studies. The Cho and Bero approach was based on guidelines for systematic reviews of drug studies, and thus, ostensibly had the same goal as Jadad et al. Finally, the Downs and Black approach represented methodological and quality considerations for systematic reviews as applied to epidemiological and public health concerns; accordingly, their scale included standards for both randomized and nonrandomized studies. The raters were two PhDs (a postdoctoral fellow of integrative medicine and a professor of behavioral statistics and psychology) and an MD (a postdoctoral clinical research student in medicine).
The results, as summarized in Table 3, show that the checklist of Wayne et al. received a 78% rating, in good agreement with the three other scales (75%, 77%, and 86%). The Jadad scale, by contrast, shows a 40% rating. The number of criteria in each scale may be one of the factors explaining this isolated divergence; the Jadad scale had only 3 criteria to evaluate, while the remaining rating scales were more comprehensive, with 9–27 criteria.
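Because the scales have different maximum scores, they can only be compared after each rating is expressed as a percentage of its maximum. The sketch below illustrates that normalization using the ratings reported in the two reviews (7 of 9 checks from Wayne et al., 2 of 5 points from Lee et al.). It assumes that these raw scores correspond to the 78% and 40% figures given above, which are described as mean ratings across raters; the helper name `percent_of_maximum` is hypothetical.

```python
def percent_of_maximum(points, maximum):
    """Express a raw quality score as a percentage of the scale's maximum."""
    if maximum <= 0:
        raise ValueError("The maximum score must be positive.")
    return 100.0 * points / maximum


# Ratings of the Chan et al. study as reported in the two reviews.
wayne_pct = percent_of_maximum(7, 9)   # Wayne et al. checklist: 7 of 9 checks
jadad_pct = percent_of_maximum(2, 5)   # Jadad scale (Lee et al.): 2 of 5 points

# Percentages reported in the text for the three additional scales.
# The text does not say which figure belongs to which scale.
other_scales_pct = [75, 77, 86]

# Compare the Jadad rating with the average of the four remaining scales.
non_jadad = other_scales_pct + [wayne_pct]
mean_non_jadad = sum(non_jadad) / len(non_jadad)
gap = mean_non_jadad - jadad_pct

print(f"Wayne et al. checklist: {wayne_pct:.0f}%")
print(f"Jadad scale: {jadad_pct:.0f}%")
print(f"Mean of the other four scales: {mean_non_jadad:.1f}%")
print(f"Gap between Jadad and the others: {gap:.1f} percentage points")
```

On these figures the four non-Jadad scales fall within roughly 11 percentage points of one another, while the Jadad rating sits about 39 points below their mean.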
Table 3. Comparison of Five Quality Rating Scales: Mean Ratings of Chan et al. Article on Effects of T'ai Chi on Bone Mineral Density17
On closer review, the original study by K. Chan et al. was found to address many issues related to internal and external validity. Power calculations were performed initially to gauge sample size appropriately. Reliability measurements and validity studies were supplied on outcome measurements, helping establish their precision and suitability for inclusion. The TC intervention was described in sufficient detail for replication. Confounding factors that might complicate the interpretation were briefly described, along with measurements of relevant anthropometric, hormonal, and dependent variables. Detailed specifications of baseline and follow-up values were presented as means, standard deviations, and percentage differences across six dependent variables for the TC and control groups. Appropriate inferential statistics were used and the results of the tests were properly reported. Annual changes in bone mineral density at different anatomic sites were presented, along with adjustments for expected rates of loss due to aging. Related dependent variables such as fall rates and fractures in the TC and control groups were documented and discussed. Weaknesses of the study included the lack of outcome assessor blinding and the limited external validity of both the particular TC intervention and the sample, which restricted generalizability to other TC interventions and populations.
Criterion 6: How sensitive are the results to change in the way the review is done?
Oxman states that systematic reviews need to test the robustness of their results through “sensitivity analysis” using various means. One basic way is to examine how the results change when the inclusion criteria are modified. The discordance between the two reviews discussed above demonstrates how different inclusion criteria in the quality-rating tools dramatically affect the ratings of individual studies and the formation of the pool of quality studies.
The inclusion criteria also ultimately affect the conclusions regarding the effectiveness of TC. Lee et al. stated that the “results for post-menopausal women failed to show specific effects of Tai Chi for bone mineral density. … Overall our findings provide no convincing evidence that Tai Chi is beneficial for preventing or treating osteoporosis.”9, p. 141
On the other hand, Wayne et al. offered strikingly different conclusions: “Tai Chi may be an effective, safe and practical intervention for maintaining bone mineral density in postmenopausal women.”10, p. 673
Citing their selected studies, other systematic reviews on balance and fractures, and the economics of the intervention, they concluded that “Tai Chi may be a logical and practical response to the Surgeon General's recent call for novel exercise programs for women with low bone density.”10, pp. 677–678
Wayne et al. emphasize another factor that can dramatically change results: the way that bone mineral density is operationalized. There are several procedures to gauge bone mineral density, including dual-energy x-ray absorptiometry (DXA), quantitative computerized tomography (QCT), and broadband ultrasound attenuation; and there are multiple body sites for assessing bone mineral density. Though the Chan et al. study did not find significant change using DXA at a spinal site, they did find significant improvements at other sites using different measures. Wayne et al. argued that “QCT has the advantage of being able to quantify true volumetric density as well as partition the 2 types of bone, trabecular and cortical, which may respond differently to exercise. Moreover, it has the potential to have higher precision.”10, p. 677
Criterion 8: Are recommendations linked to the strength of the evidence?
Lee et al. stated that “the evidence is not convincing for Tai Chi in preventing or treating osteoporosis.”9
Reporting that TC studies were methodologically weak, their recommendations focused on ways to improve the design of future studies, such as RCT designs, larger patient samples, longer assessment durations, appropriate inferential tests, and more complete write-ups. They also recommended additional outcome measures, such as balance and fractures related to falls.
While agreeing with the Lee et al. contention that many TC bone mineral density studies were poor methodologically, Wayne et al. focused on the strength of six studies that formed their evidentiary pool. They also utilized conclusions from other systematic reviews and individual studies to form conclusions and recommendations, going beyond the evidence base from their selected studies to reflect a more encompassing segment of the current TC literature.
Criterion 9: Are judgments about preferences (values) explicit?
The issue that Oxman raises in this criterion is more complex than whether validity was adequately assessed according to quality-scale criteria. Tool choice may reflect the authors' attitudes toward particular methodologies. Furthermore, following a given set of evaluative steps does not guarantee adequate ratings of study quality. This is problematic for the translation of evidence, precisely because the chosen quality-rating instruments affect the analysis, interpretation, and conclusions about individual studies, as well as the overall summary of evidence. The question can be rephrased as, “Are the reviewers aware of their own value judgments, their preferences regarding the topic, or their own agenda?” In their comparison of systematic reviews on complementary medicine, Linde and Willich note that unless the outcome measure is very clear with obvious differences, authors of different reviews often interject their own hypotheses, which reflect their philosophies.16
Lee et al. focused on older traditional standards of experimental design embodied in the Jadad scale, and they did not extend any conclusions beyond the results of their screened sample. While they did not explicitly state why they found the Jadad criteria appropriate for TC systematic reviews, they encouraged future TC researchers to “utilize accepted standards of trial methodology.”9
Though Lee et al. did not explicitly explain what “accepted standards” were, one can assume that Lee et al. were referring to the Jadad scale as the “gold standard” for research studies. In contrast, Wayne et al. explicitly rejected the Jadad scale as a standard for evaluation criteria “because RCTs employing TC interventions are not amenable to double-blinding.”10
Instead, they offered their ad hoc standards for assessing methodology that reflected the context of TC. These included the checks on description of the TC intervention and the qualifications of the TC instructor. Although their qualification of TC instructors needs to be operationalized and their checklist needs validation on other TC topics, their focus is clearly more inclusive than the Lee et al. “accepted standards.”
Criterion 10: Is “no evidence of effect” interpreted as “evidence of no effect”?
Both systematic reviews generally avoided this logical error, which confuses insufficient evidence of change with no change. However, when Lee et al. discussed their findings, they stated that TC had evidence of no effect on bone mineral density. They state: “One should also note that Tai Chi is not the type of exercise that provides for much loading on weight bearing joints, which is a precondition for an effect on bone metabolism.”9
This statement is unwarranted on at least two grounds. First, it presumes that TC has no effect, rather than that the evidence of an effect is weak, a weakness that can be related to other factors such as poorly validated measures and inadequate sample size. Second, kinesthetic studies indicate that TC imposes substantial loads on weight-bearing joints.23,31
A more accurate summary of the Lee et al. review would be the restricted claim, in line with Oxman's caution, that the studies cited in their article provide no evidence of an effect.