We developed the PRISMA statement using an approach for developing reporting guidelines that has evolved over several years.178
The overall aim of PRISMA is to help ensure the clarity and transparency of reporting of systematic reviews, and recent data indicate that this reporting guidance is much needed.3
PRISMA is not intended to be a quality assessment tool and it should not be used as such.
This PRISMA explanation and elaboration document was developed to facilitate the understanding, uptake, and dissemination of the PRISMA statement and hopefully provide a pedagogical framework for those interested in conducting and reporting systematic reviews. It follows a format similar to that used in other explanatory documents.17 18 19
Following the recommendations in the PRISMA checklist may increase the word count of a systematic review report. We believe, however, that the benefit of readers being able to critically appraise a clear, complete, and transparent systematic review report outweighs the possible slight increase in the length of the report.
While the aims of PRISMA are to reduce the risk of flawed reporting of systematic reviews and improve the clarity and transparency in how reviews are conducted, we have little data to state more definitively whether this “intervention” will achieve its intended goal. A previous effort to evaluate QUOROM was not successfully completed.178
Publication of the QUOROM statement was delayed for two years while a research team attempted to evaluate its effectiveness by conducting a randomised controlled trial with the participation of eight major medical journals. Unfortunately that trial was not completed due to accrual problems (David Moher, personal communication). Other evaluation methods might be easier to conduct. At least one survey of 139 published systematic reviews in the critical care literature179 suggests that their quality improved after the publication of QUOROM.
If the PRISMA statement is endorsed by and adhered to in journals, as other reporting guidelines have been,17 18 19 180 there should be evidence of improved reporting of systematic reviews. For example, there have been several evaluations of whether the use of CONSORT improves reports of randomised controlled trials. A systematic review of these studies181 indicates that use of CONSORT is associated with improved reporting of certain items, such as allocation concealment. We aim to evaluate the benefits (that is, improved reporting) and possible adverse effects (such as increased word length) of PRISMA and we encourage others to consider doing likewise.
Even though we did not carry out a systematic literature search to produce our checklist, and this is indeed a limitation of our effort, PRISMA was developed using an evidence based approach whenever possible. Checklist items were included if there was evidence that not reporting the item was associated with increased risk of bias, or where it was clear that information was necessary to appraise the reliability of a review. Keeping PRISMA up to date and as evidence based as possible requires regular monitoring of the literature, which is growing rapidly. Currently the Cochrane Methodology Register has more than 11 000 records pertaining to the conduct and reporting of systematic reviews and other evaluations of health and social care. For some checklist items, such as reporting the abstract (item 2), we have used evidence from elsewhere in the belief that the issue applies equally well to reporting of systematic reviews. Yet for other items, evidence does not exist; for example, whether a training exercise improves the accuracy and reliability of data extraction. We hope PRISMA will act as a catalyst to help generate further evidence that can be considered when the checklist is next revised.
More than 10 years have passed between the development of the QUOROM statement and its update, the PRISMA statement. We aim to update PRISMA more frequently. We hope that the implementation of PRISMA will be better than it has been for QUOROM. There are at least two reasons to be optimistic. First, systematic reviews are increasingly used by healthcare providers to inform “best practice” patient care. Policy analysts and managers are using systematic reviews to inform healthcare decision making and to better target future research. Second, we anticipate benefits from the development of the EQUATOR Network, described below.
Developing any reporting guideline requires considerable effort, experience, and expertise. While reporting guidelines have been successful for some individual efforts,17 18 19 there are likely others who want to develop reporting guidelines who possess little time, experience, or knowledge as to how to do so appropriately. The EQUATOR (enhancing the quality and transparency of health research) Network aims to help such individuals and groups by serving as a global resource for anybody interested in developing reporting guidelines, regardless of the focus.7 180 182
The overall goal of EQUATOR is to improve the quality of reporting of all health science research through the development and translation of reporting guidelines. Beyond this aim, the network plans to develop a large web presence by developing and maintaining a resource centre of reporting tools, and other information for reporting research (www.equator-network.org/).
We encourage healthcare journals and editorial groups, such as the World Association of Medical Editors and the International Committee of Medical Journal Editors, to endorse PRISMA in much the same way as they have endorsed other reporting guidelines, such as CONSORT. We also encourage editors of healthcare journals to support PRISMA by updating their “instructions to authors” and including the PRISMA web address, and by raising awareness through specific editorial actions.
Box 1: Terminology
The terminology used to describe systematic reviews and meta-analyses has evolved over time and varies between fields. Different terms have been used by different groups, such as educators and psychologists. The conduct of a systematic review comprises several explicit and reproducible steps, such as identifying all likely relevant records, selecting eligible studies, assessing the risk of bias, extracting data, qualitative synthesis of the included studies, and possibly meta-analyses.
Initially this entire process was termed a meta-analysis and was so defined in the QUOROM statement.8 More recently, especially in healthcare research, there has been a trend towards preferring the term systematic review. If quantitative synthesis is performed, this last stage alone is referred to as a meta-analysis. The Cochrane Collaboration uses this terminology,9 under which a meta-analysis, if performed, is a component of a systematic review. Regardless of the question addressed and the complexities involved, it is always possible to complete a systematic review of existing data, but not always possible or desirable to quantitatively synthesise results, because of clinical, methodological, or statistical differences across the included studies. Conversely, with prospective accumulation of studies and datasets where the plan is eventually to combine them, the term “(prospective) meta-analysis” may make more sense than “systematic review.”
For retrospective efforts, one possibility is to use the term systematic review for the whole process up to the point when one decides whether to perform a quantitative synthesis. If a quantitative synthesis is performed, some researchers refer to this as a meta-analysis. This definition is similar to that found in the current edition of the Dictionary of Epidemiology.
While we recognise that the use of these terms is inconsistent and there is residual disagreement among the members of the panel working on PRISMA, we have adopted the definitions used by the Cochrane Collaboration.9
A systematic review attempts to collate all empirical evidence that fits pre-specified eligibility criteria to answer a specific research question. It uses explicit, systematic methods that are selected with a view to minimising bias, thus providing reliable findings from which conclusions can be drawn and decisions made.184 185
The key characteristics of a systematic review are (a) a clearly stated set of objectives with an explicit, reproducible methodology; (b) a systematic search that attempts to identify all studies that would meet the eligibility criteria; (c) an assessment of the validity of the findings of the included studies, such as through the assessment of risk of bias; and (d) systematic presentation and synthesis of the characteristics and findings of the included studies.
Meta-analysis
Meta-analysis is the use of statistical techniques to integrate and summarise the results of included studies. Many systematic reviews contain meta-analyses, but not all. By combining information from all relevant studies, meta-analyses can provide more precise estimates of the effects of health care than those derived from the individual studies included within a review.
Box 2: Helping to develop the research question(s): the PICOS approach
Formulating relevant and precise questions that can be answered in a systematic review can be complex and time consuming. A structured approach for framing questions that uses five components may help facilitate the process. This approach is commonly known by the acronym “PICOS” where each letter refers to a component: the patient population or the disease being addressed (P), the interventions or exposure (I), the comparator group (C), the outcome or endpoint (O), and the study design chosen (S).186
Issues relating to PICOS affect several PRISMA items (items 6, 8, 9, 10, 11, and 18).
- P—Providing information about the population requires a precise definition of a group of participants (often patients), such as men over the age of 65 years, their defining characteristics of interest (often disease), and possibly the setting of care considered, such as an acute care hospital.
- I—The interventions (exposures) under consideration in the systematic review need to be transparently reported. For example, if the reviewers answer a question regarding the association between a woman’s prenatal exposure to folic acid and subsequent offspring’s neural tube defects, reporting the dose, frequency, and duration of folic acid used in different studies is likely to be important for readers to interpret the review’s results and conclusions. Other interventions (exposures) might include diagnostic, preventive, or therapeutic treatments; arrangements of specific processes of care; lifestyle changes; psychosocial or educational interventions; or risk factors.
- C—Clearly reporting the comparator (control) group intervention(s)—such as usual care, drug, or placebo—is essential for readers to fully understand the selection criteria of primary studies included in the systematic review, and might be a source of heterogeneity investigators have to deal with. Comparators are often poorly described. Clearly reporting what the intervention is compared with is important and may sometimes have implications for the inclusion of studies in a review—many reviews compare with “standard care,” which is otherwise undefined; this should be properly addressed by authors.
- O—The outcomes of the intervention being assessed—such as mortality, morbidity, symptoms, or quality of life improvements—should be clearly specified as they are required to interpret the validity and generalisability of the systematic review’s results.
- S—Finally, the type of study design(s) included in the review should be reported. Some reviews include only reports of randomised trials, whereas others have broader design criteria and include randomised trials and certain types of observational studies. Still other reviews, such as those specifically answering questions related to harms, may include a wide variety of designs ranging from cohort studies to case reports. Whatever study designs are included in the review, these should be reported.
Regardless of how difficult it is to identify the components of the research question, the important point is that a structured approach is preferable, and this extends beyond systematic reviews of effectiveness. Ideally the PICOS criteria should be formulated a priori, in the systematic review’s protocol, although some revisions might be required because of the iterative nature of the review process. Authors are encouraged to report their PICOS criteria and whether any modifications were made during the review process. A useful example in this realm is the appendix of the “systematic reviews of water fluoridation” undertaken by the Centre for Reviews and Dissemination.187
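One way to make PICOS criteria explicit in a protocol is to record them as a structured object, so that empty components and later modifications are easy to spot. The sketch below is only illustrative: the class name, field names, and example values are our own assumptions and are not part of PRISMA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PicosCriteria:
    """Hypothetical record of a review question's PICOS components."""
    population: str
    interventions: List[str]
    comparators: List[str]
    outcomes: List[str]
    study_designs: List[str]
    # Changes made after the protocol was written, to be reported in the review.
    modifications: List[str] = field(default_factory=list)

    def missing_components(self) -> List[str]:
        """Return the names of PICOS components that were left empty."""
        missing = []
        if not self.population.strip():
            missing.append("population")
        for name in ("interventions", "comparators", "outcomes", "study_designs"):
            if not getattr(self, name):
                missing.append(name)
        return missing


# Illustrative values only; the comparator is deliberately left undefined.
criteria = PicosCriteria(
    population="men over the age of 65 years in an acute care hospital",
    interventions=["hypothetical drug A"],
    comparators=[],
    outcomes=["mortality", "quality of life"],
    study_designs=["randomised trial"],
)
print(criteria.missing_components())  # ['comparators']
```

A check of this kind only flags components that are absent; it cannot judge whether a stated comparator such as “standard care” has been adequately defined.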
Box 3: Identification of study reports and data extraction
Comprehensive searches usually result in a large number of identified records, a much smaller number of studies included in the systematic review, and even fewer of these studies included in any meta-analyses. Reports of systematic reviews often provide little detail as to the methods used by the review team in this process. Readers are often left with what can be described as the “X-files” phenomenon, as it is unclear what occurs between the initial set of identified records and those finally included in the review.
Sometimes, review authors simply report the number of included studies; more often they report the initial number of identified records and the number of included studies. Rarely, although this is optimal for readers, do review authors report the number of identified records, the smaller number of potentially relevant studies, and the even smaller number of included studies, by outcome. Review authors also need to differentiate between the number of reports and studies. Often there will not be a 1:1 ratio of reports to studies and this information needs to be described in the systematic review report.
Ideally, the identification of study reports should be reported as text in combination with use of the PRISMA flow diagram. While we recommend use of the flow diagram, a small number of reviews might be particularly simple and can be sufficiently described with a few brief sentences of text. More generally, review authors will need to report the process used for each step: screening the identified records; examining the full text of potentially relevant studies (and reporting the number that could not be obtained); and applying eligibility criteria to select the included studies.
Such descriptions should also detail how potentially eligible records were promoted to the next stage of the review (such as full text screening) and to the final stage of this process, the included studies. Often review teams have three response options for excluding records or promoting them to the next stage of the winnowing process: “yes,” “no,” and “maybe.”
Similarly, some detail should be reported on who participated and how such processes were completed. For example, a single person may screen the identified records while a second person independently examines a small sample of them. The entire winnowing process is one of “good bookkeeping” whereby interested readers should be able to work backwards from the included studies to come up with the same numbers of identified records.
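The “good bookkeeping” idea can be sketched as a simple arithmetic check on the counts reported at each stage. This is a minimal illustration under our own assumptions: the stage names are hypothetical, the numbers are invented, and it assumes one report per included study, which, as noted above, often does not hold.

```python
from dataclasses import dataclass


@dataclass
class FlowCounts:
    """Counts at each stage of study selection (hypothetical field names)."""
    records_identified: int
    duplicates_removed: int
    records_excluded_at_screening: int
    full_texts_not_obtained: int
    full_texts_excluded: int
    studies_included: int

    def records_screened(self) -> int:
        return self.records_identified - self.duplicates_removed

    def full_texts_sought(self) -> int:
        return self.records_screened() - self.records_excluded_at_screening

    def full_texts_assessed(self) -> int:
        return self.full_texts_sought() - self.full_texts_not_obtained

    def is_consistent(self) -> bool:
        """True if the included studies can be traced back to the identified records."""
        return (self.full_texts_assessed() - self.full_texts_excluded
                == self.studies_included)


# Invented numbers for illustration only.
counts = FlowCounts(
    records_identified=1200,
    duplicates_removed=200,
    records_excluded_at_screening=900,
    full_texts_not_obtained=5,
    full_texts_excluded=70,
    studies_included=25,
)
print(counts.records_screened(), counts.full_texts_assessed(), counts.is_consistent())
```

If the check fails, some records are unaccounted for at one of the stages, which is exactly the gap a reader cannot reconstruct from the report.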
There is often a paucity of information describing the data extraction processes in reports of systematic reviews. Authors may simply report that “relevant” data were extracted from each included study with little information about the processes used for data extraction. It may be useful for readers to know whether a systematic review’s authors developed, a priori or not, a data extraction form, whether multiple forms were used, the number of questions, whether the form was pilot tested, and who completed the extraction. For example, it is important for readers to know whether one or more people extracted data, and if so, whether this was completed independently, whether “consensus” data were used in the analyses, and if the review team completed an informal training exercise or a more formal reliability exercise.
Box 4: Study quality and risk of bias
In this paper, and elsewhere,11 we sought to use a new term for many readers, namely, risk of bias, for evaluating each included study in a systematic review. Previous papers89 188 tended to use the term “quality.” When carrying out a systematic review we believe it is important to distinguish between quality and risk of bias and to focus on evaluating and reporting the latter. Quality is often the best the authors have been able to do. For example, authors may report the results of surgical trials in which blinding of the outcome assessors was not part of the trial’s conduct. Even though this may have been the best methodology the researchers were able to use, there are still theoretical grounds for believing that the study was susceptible to (risk of) bias.
Assessing the risk of bias should be part of the conduct and reporting of any systematic review. In all situations, we encourage systematic reviewers to think ahead carefully about what risks of bias (methodological and clinical) may have a bearing on the results of their systematic reviews.
For systematic reviewers, understanding the risk of bias on the results of studies is often difficult, because the report is only a surrogate of the actual conduct of the study. There is some suggestion189 190 that the report may not be a reasonable facsimile of the study, although this view is not shared by all.88 191
There are three main ways to assess risk of bias: individual components, checklists, and scales. There are a great many scales available,192 although we caution against their use based on theoretical grounds193 and emerging empirical evidence.194 Checklists are less frequently used and potentially have the same problems as scales. We advocate using a component approach and one that is based on domains for which there is good empirical evidence and perhaps strong clinical grounds. The new Cochrane risk of bias tool11 is one such component approach.
The Cochrane risk of bias tool consists of five items for which there is empirical evidence for their biasing influence on the estimates of an intervention’s effectiveness in randomised trials (sequence generation, allocation concealment, blinding, incomplete outcome data, and selective outcome reporting) and a catch-all item called “other sources of bias”.11
There is also some consensus that these items can be applied for evaluation of studies across diverse clinical areas.93
Other risk of bias items may be topic or even study specific—that is, they may stem from some peculiarity of the research topic or some special feature of the design of a specific study. These peculiarities need to be investigated on a case-by-case basis, based on clinical and methodological acumen, and there can be no general recipe. In all situations, systematic reviewers need to think ahead carefully about what aspects of study quality may have a bearing on the results.
Box 5: Whether to combine data
Deciding whether to combine data involves statistical, clinical, and methodological considerations. The statistical decisions are perhaps the most technical and evidence-based. These are more thoroughly discussed in box 6. The clinical and methodological decisions are generally based on discussions within the review team and may be more subjective.
Clinical considerations will be influenced by the question the review is attempting to address. Broad questions might provide more “license” to combine more disparate studies, such as whether “Ritalin is effective in increasing focused attention in people diagnosed with attention deficit hyperactivity disorder (ADHD).” Here authors might elect to combine reports of studies involving children and adults. If the clinical question is more focused, such as whether “Ritalin is effective in increasing classroom attention in previously undiagnosed ADHD children who have no comorbid conditions,” it is likely that different decisions regarding synthesis of studies are taken by authors. In any case authors should describe their clinical decisions in the systematic review report.
Deciding whether to combine data also has a methodological component. Reviewers may decide not to combine studies of low risk of bias with those of high risk of bias (see items 12 and 19). For example, for subjective outcomes, systematic review authors may not wish to combine assessments that were completed under blind conditions with those that were not.
For any particular question there may not be a “right” or “wrong” choice concerning synthesis, as such decisions are likely complex. However, as the choice may be subjective, authors should be transparent as to their key decisions and describe them for readers.
Box 6: Meta-analysis and assessment of consistency (heterogeneity)
Meta-analysis: statistical combination of the results of multiple studies
If it is felt that studies should have their results combined statistically, other issues must be considered because there are many ways to conduct a meta-analysis. Different effect measures can be used for both binary and continuous outcomes (see item 13). Also, there are two commonly used statistical models for combining data in a meta-analysis.195
The fixed-effect model assumes that there is a common treatment effect for all included studies;196 it is assumed that the observed differences in results across studies reflect random variation.196
The random-effects model assumes that there is no common treatment effect for all included studies but rather that the variation of the effects across studies follows a particular distribution.197
In a random-effects model it is believed that the included studies represent a random sample from a larger population of studies addressing the question of interest.198
There is no consensus about whether to use fixed- or random-effects models, and both are in wide use. The following differences have influenced some researchers regarding their choice between them. The random-effects model gives more weight to the results of smaller trials than does the fixed-effect analysis, which may be undesirable as small trials may be inferior and most prone to publication bias. The fixed-effect model considers only within-study variability, whereas the random-effects model considers both within- and between-study variability. This is why a fixed-effect analysis tends to give narrower confidence intervals (that is, provides greater precision) than a random-effects analysis.110 196 199
In the absence of any between-study heterogeneity, the fixed- and random-effects estimates will coincide.
In addition, there are different methods for performing both types of meta-analysis.200
Common fixed-effect approaches are Mantel-Haenszel and inverse variance, whereas random-effects analyses usually use the DerSimonian and Laird approach, although other methods exist, including Bayesian meta-analysis.201
In the presence of demonstrable between-study heterogeneity (see below), some consider that the use of a fixed-effect analysis is counterintuitive because its main assumption is violated. Others argue that it is inappropriate to conduct any meta-analysis when there is unexplained variability across trial results. If the reviewers decide not to combine the data quantitatively, a danger is that eventually they may end up using quasi-quantitative rules of poor validity (such as vote counting of how many studies have nominally significant results) for interpreting the evidence. Statistical methods to combine data exist for almost any complex situation that may arise in a systematic review, but one has to be aware of their assumptions and limitations to avoid misapplying or misinterpreting these methods.
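The contrast between the two models can be illustrated with a short sketch that pools the same study results by inverse variance (fixed effect) and by the DerSimonian and Laird method (random effects). The function names and the effect estimates below are invented for illustration, and the code assumes each study supplies an effect estimate (for example, a log odds ratio) and its variance; it is not a substitute for established meta-analysis software.

```python
import math
from typing import List, Tuple


def fixed_effect(effects: List[float], variances: List[float]) -> Tuple[float, float]:
    """Inverse-variance fixed-effect pooled estimate and its standard error."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total)


def dersimonian_laird_tau2(effects: List[float], variances: List[float]) -> float:
    """Method-of-moments estimate of the between-study variance (tau squared)."""
    if len(effects) < 2:
        raise ValueError("at least two studies are needed")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled, _ = fixed_effect(effects, variances)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - df) / c)  # truncated at zero


def random_effects(effects: List[float], variances: List[float]) -> Tuple[float, float, float]:
    """Random-effects pooled estimate, its standard error, and tau squared."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, math.sqrt(1.0 / total), tau2


# Invented log odds ratios and variances for four hypothetical trials.
effects = [-0.6, -0.1, -0.4, 0.2]
variances = [0.04, 0.01, 0.09, 0.02]

fe_estimate, fe_se = fixed_effect(effects, variances)
re_estimate, re_se, tau2 = random_effects(effects, variances)
for label, est, se in (("fixed", fe_estimate, fe_se), ("random", re_estimate, re_se)):
    print(f"{label}: {est:.3f} (95% CI {est - 1.96 * se:.3f} to {est + 1.96 * se:.3f})")
```

Because the random-effects weights add the between-study variance to every study's own variance, the weights become more alike: smaller trials gain relative weight and the confidence interval widens. When the estimated between-study variance is zero, the two sets of weights are identical and the estimates coincide.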
Assessment of consistency (heterogeneity)
We expect some variation (inconsistency) in the results of different studies due to chance alone. Variability in excess of that due to chance reflects true differences in the results of the trials, and is called “heterogeneity.” The conventional statistical approach to evaluating heterogeneity is a χ² test (Cochran’s Q), but it has low power when there are few studies and excessive power when there are many studies.202 By contrast, the I² statistic quantifies the amount of variation in results across studies beyond that expected by chance and so is preferable to Q.202 203 I² represents the percentage of the total variation in estimated effects across studies that is due to heterogeneity rather than to chance; some authors consider an I² value less than 25% as low.202 However, I² also suffers from large uncertainty in the common situation where only a few studies are available,204 and reporting the uncertainty in I² (such as a 95% confidence interval) may be helpful.145 When there are few studies, inferences about heterogeneity should be cautious.
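As a rough illustration of how these quantities relate, the sketch below computes Cochran's Q from inverse-variance weights and then derives the I² percentage using the commonly cited formula (Q minus its degrees of freedom, divided by Q, truncated at zero). The function names and data are invented for illustration, and no P value or confidence interval for I² is computed here.

```python
from typing import List


def cochran_q(effects: List[float], variances: List[float]) -> float:
    """Weighted sum of squared deviations from the fixed-effect pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(effects: List[float], variances: List[float]) -> float:
    """Percentage of total variation attributed to heterogeneity rather than chance."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Invented log odds ratios and variances for four hypothetical trials.
effects = [-0.6, -0.1, -0.4, 0.2]
variances = [0.04, 0.01, 0.09, 0.02]

q = cochran_q(effects, variances)
i2 = i_squared(effects, variances)
print(f"Q = {q:.2f} on {len(effects) - 1} degrees of freedom, I2 = {i2:.1f}%")
```

With only four studies, as in this invented example, the resulting percentage would itself be very uncertain, which is why reporting an interval alongside it may help readers.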
When considerable heterogeneity is observed, it is advisable to consider possible reasons.205
In particular, the heterogeneity may be due to differences between subgroups of studies (see item 16). Also, data extraction errors are a common cause of substantial heterogeneity in results with continuous outcomes.139
Box 7: Bias caused by selective publication of studies or results within studies
Systematic reviews aim to incorporate information from all relevant studies. The absence of information from some studies may pose a serious threat to the validity of a review. Data may be incomplete because some studies were not published, or because of incomplete or inadequate reporting within a published article. These problems are often summarised as “publication bias,” although the bias arises from non-publication of full studies and selective publication of results in relation to their findings. Non-publication of research findings dependent on the actual results is an important risk of bias to a systematic review and meta-analysis.
Several empirical investigations have shown that the findings from clinical trials are more likely to be published if the results are statistically significant (P<0.05) than if they are not.125 206 207
For example, of 500 oncology trials with more than 200 participants for which preliminary results were presented at a conference of the American Society of Clinical Oncology, 81% with P<0.05 were published in full within five years compared with only 68% of those with P>0.05.208
Also, among published studies, those with statistically significant results are published sooner than those with non-significant findings.209
When some studies are missing for these reasons, the available results will be biased towards exaggerating the effect of an intervention.
In many systematic reviews only some of the eligible studies (often a minority) can be included in a meta-analysis for a specific outcome. For some studies, the outcome may not be measured or may be measured but not reported. The former will not lead to bias, but the latter could.
Evidence is accumulating that selective reporting bias is widespread and of considerable importance.42 43
In addition, data for a given outcome may be analysed in multiple ways and the choice of presentation influenced by the results obtained. In a study of 102 randomised trials, comparison of published reports with trial protocols showed that a median of 38% efficacy and 50% safety outcomes per trial, respectively, were not available for meta-analysis. Statistically significant outcomes had higher odds of being fully reported in publications when compared with non-significant outcomes for both efficacy (pooled odds ratio 2.4 (95% confidence interval 1.4 to 4.0)) and safety (4.7 (1.8 to 12)) data. Several other studies have had similar findings.210 211
Detection of missing information
Missing studies may increasingly be identified from trials registries. Evidence of missing outcomes may come from comparison with the study protocol, if available, or by careful examination of published articles.11
Study publication bias and selective outcome reporting are difficult to exclude or verify from the available results, especially when few studies are available.
If the available data are affected by either (or both) of the above biases, smaller studies would tend to show larger estimates of the effects of the intervention. Thus one possibility is to investigate the relation between effect size and sample size (or more specifically, precision of the effect estimate). Graphical methods, especially the funnel plot,212 and analytic methods (such as Egger’s test) are often used,213 214 215 although their interpretation can be problematic.216 217 Strictly speaking, such analyses investigate “small study bias”; there may be many reasons why smaller studies have systematically different effect sizes than larger studies, of which reporting bias is just one.218 Several alternative tests for bias have also been proposed, beyond the ones testing small study bias,215 219 220 but none can be considered a gold standard. Although evidence that smaller studies had larger estimated effects than large ones may suggest the possibility that the available evidence is biased, misinterpretation of such data is common.123
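The relation between effect size and precision can be examined with a regression of the kind used in Egger's test: each study's standardised effect (estimate divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept far from zero suggests funnel plot asymmetry. The sketch below is a simplified illustration with invented data and a function name of our own; it reports the intercept and its standard error but does not compute a P value, and it inherits all the interpretive cautions described above.

```python
import math
from typing import List, Tuple


def egger_regression(effects: List[float],
                     standard_errors: List[float]) -> Tuple[float, float, float]:
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns the intercept, the standard error of the intercept, and the slope.
    """
    n = len(effects)
    if n < 3:
        raise ValueError("at least three studies are needed")
    z = [y / s for y, s in zip(effects, standard_errors)]
    x = [1.0 / s for s in standard_errors]
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    residual_ss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    sigma2 = residual_ss / (n - 2)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, se_intercept, slope


# Invented data in which less precise (smaller) studies report larger effects.
effects = [0.1, 0.2, 0.35, 0.5, 0.7]
standard_errors = [0.05, 0.1, 0.2, 0.3, 0.4]

intercept, se_intercept, slope = egger_regression(effects, standard_errors)
print(f"intercept = {intercept:.2f} (SE {se_intercept:.2f}), slope = {slope:.3f}")
```

When every study estimates the same effect regardless of its size, the standardised effect is exactly proportional to precision and the intercept is zero; in the invented data above the intercept is positive, which is the pattern a “small study” effect would produce, whatever its cause.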