Systematic review and meta-analysis
Authors:
Ethan Balk, MD, MPH
Peter A L Bonis, MD
Section Editor:
Joann G Elmore, MD, MPH
Deputy Editor:
Carrie Armsby, MD, MPH
Literature review current through: Dec 2022. | This topic last updated: Oct 01, 2021.

INTRODUCTION — This topic review will provide an overview of how systematic reviews and meta-analyses are conducted and how to interpret them. In addition, it will provide a summary of methodologic terms commonly encountered in systematic reviews and meta-analyses.

A broader discussion of evidence-based medicine and a glossary of methodologic and biostatistical terms are presented separately. (See "Evidence-based medicine" and "Glossary of common biostatistical and epidemiological terms".)

KEY DEFINITIONS — The terms systematic review and meta-analysis are often used together, but they are not interchangeable. Not all systematic reviews include meta-analyses, though many do.

These terms are defined here since they are used throughout this topic. A glossary of other relevant terms is provided at the end of this topic. (See 'Glossary of terms' below.)

Systematic review — A systematic review is a comprehensive summary of all available evidence that meets predefined eligibility criteria to address a specific clinical question or range of questions. It is based upon a rigorous process that incorporates [1-3]:

Systematic identification of studies that have evaluated the specific research question(s)

Critical appraisal of the studies

Meta-analyses (not always performed) (see 'Meta-analysis' below)

Presentation of key findings

Explicit discussion of the limitations of the evidence and the review

Systematic reviews contrast with traditional "narrative" reviews and textbook chapters. Such reviews generally do not exhaustively review the literature, lack transparency in the selection and interpretation of supporting evidence, generally do not provide a quantitative synthesis of the data, and are more likely to be biased [4].

Meta-analysis — Meta-analysis, which is commonly included in systematic reviews, is the statistical method of quantitatively combining or pooling results from different studies. It can be used to provide overall pooled effect estimates [5]. For example, if a drug was evaluated in multiple placebo-controlled trials that all reported mortality, meta-analysis can be used to estimate a pooled relative risk for the drug's overall effect on mortality in all of the trials together. Meta-analysis can also be used to pool other types of data such as studies on diagnostic accuracy (ie, pooled estimates on sensitivity and specificity) and epidemiologic studies (ie, pooled incidence or prevalence rates; pooled odds ratio for strength of association). Meta-regression and network meta-analysis (NMA) are enhancements to traditional meta-analysis. (See 'Meta-regression' below and 'Network meta-analysis' below.)
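
As a simplified illustration of the pooling itself, the sketch below combines relative risks from three hypothetical trials on the log scale using inverse-variance weights; all numbers are invented, and published meta-analyses use dedicated software and, most often, a random effects model. (See 'Statistical methods for combining data' below.)

    import math

    # Hypothetical trials: (relative risk, 95% CI lower bound, 95% CI upper bound)
    trials = [(0.80, 0.60, 1.07), (0.70, 0.45, 1.09), (0.90, 0.75, 1.08)]

    weights, weighted_logs = [], []
    for rr, lo, hi in trials:
        log_rr = math.log(rr)
        se = (math.log(hi) - math.log(lo)) / (2 * 1.96)   # SE recovered from the 95% CI
        w = 1 / se**2                                      # inverse-variance weight
        weights.append(w)
        weighted_logs.append(w * log_rr)

    pooled_log = sum(weighted_logs) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    print(f"Pooled RR {math.exp(pooled_log):.2f} "
          f"(95% CI {math.exp(pooled_log - 1.96 * pooled_se):.2f}-"
          f"{math.exp(pooled_log + 1.96 * pooled_se):.2f})")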

ADVANTAGES OF SYSTEMATIC REVIEW AND META-ANALYSIS — Clinical decisions in medicine ideally should be based upon guidance from a comprehensive assessment of the body of available knowledge. A single clinical trial, even a large one, is seldom sufficient to provide a confident answer to a clinical question. Indeed, one analysis suggested that most research claims are ultimately proven to be incorrect or inaccurate when additional studies have been performed [6]. At the same time, it is well established that large randomized controlled trials do not always confirm the results of prior meta-analyses [7-9]. The "truth" needs to be understood by examining all sources of data as critically and objectively as possible.

There are several potential benefits to performing a systematic review, which may also include meta-analysis:

Unique aspects of a single randomized trial, involving the participating patient population, protocol, setting in which the trial is performed, or expertise of the involved clinicians, may limit its generalizability to other settings or individual patients. The conclusions of systematic reviews are likely to be more generalizable than those of single studies.

Combining studies in meta-analyses increases the sample size and generally produces more precise estimates of the effect size (ie, estimates that have smaller confidence intervals) than a single randomized trial. Meta-analysis may also allow exploration of heterogeneity across studies to allow conclusions beyond what can be gleaned from individual studies.

Clinicians rarely have the time or resources to critically evaluate the body of evidence relevant to a particular clinical question, and a systematic review can facilitate this investigation.

In contrast with narrative review articles, most systematic reviews focus on a narrow, clearly defined topic and include all eligible studies, not just those chosen by the author. Systematic reviews start with a clinical or research question and form conclusions based on the evidence. This is in contrast with many narrative reviews that start with a conclusion and include evidence to support that conclusion.

Systematic review and meta-analysis are methods to synthesize the available evidence using an explicit, transparent approach that considers the strengths and weaknesses of the individual studies, populations and interventions, and specific outcomes that were assessed. Individual practitioners, policymakers, and guideline developers can use well-conducted systematic reviews to determine best patient management decisions. Organizations that develop guidelines can use the results of systematic reviews and meta-analyses to provide evidence-based recommendations for care.

STEPS TO CONDUCTING A SYSTEMATIC REVIEW AND META-ANALYSIS

Overview — Several steps are essential for conducting a systematic review or meta-analysis. These include:

Formulating research questions (see 'Formulating research questions' below)

Developing a protocol (see 'Developing a protocol' below)

Searching for the evidence (see 'The literature search' below)

Assessing the quality of studies (see 'Risk of bias assessment' below)

Summarizing and displaying results (eg, using forest plots and a summary of findings table, as shown in the figure (figure 1)) (see 'Forest plot' below)

Exploring reasons for heterogeneity across studies (see 'Heterogeneity' below and 'Subgroup analyses' below and 'Sensitivity analysis' below)

The basic steps, along with limitations that should be considered, are discussed here. While this topic review focuses on meta-analysis of randomized controlled trials, many of the methods and issues apply equally to meta-analyses of other comparative studies, noncomparative (single group) and other observational studies, and studies of diagnostic tests. An overview of approaches to systematic review and meta-analysis is provided in a table (table 1).

The updated 2020 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement emphasizes that systematic reviews should provide the protocol, data, and assessments of risk of bias (RoB) from individual studies with sufficient transparency to allow the reader to verify the results [10]. It underscores the basic questions that the clinician and investigator should ask when interpreting a systematic review. The PRISMA website provides checklists for the items that should be included in a systematic review. Several "extensions" to PRISMA have been developed for specific types of systematic reviews or meta-analyses (eg, harms, network meta-analyses, meta-analyses of diagnostic tests, individual patient data analyses) [11]. In addition, readers of systematic reviews should assess the relevance to their own practice in regard to the studied populations, settings, interventions, and outcomes assessed.

In 2011, the Institute of Medicine published recommended standards for developing systematic reviews, which remain pertinent [12]. While these standards principally apply to publicly funded systematic reviews of comparative effectiveness research that focus specifically on treatments, most of the standards pertain to all systematic reviews. The United States Agency for Healthcare Research and Quality also has an ongoing series of articles that form a Methods Guide for Comparative Effectiveness Reviews for its Evidence-based Practice Center program and related reviews. This guide principally applies to large overarching systematic reviews but provides insights and recommendations for addressing different types of topics and studies.

Formulating research questions — Research questions (often referred to as "key questions") are analogous to the research hypotheses of primary research studies. They should be focused and defined clearly since they determine the scope of research the systematic review will address [13].

Broad questions that cover a range of topics may not be directly answerable and are not appropriate for systematic reviews or meta-analyses. As an example, the question "What is the best treatment for chronic hepatitis B?" would need to be broken down into several smaller well-focused questions that could be addressed in individual and complementary systematic reviews. Examples of appropriate key questions may include, "How does entecavir compare with placebo for achieving hepatitis B e antigen (HBeAg) seroconversion in patients with chronic HBeAg-positive hepatitis B?" and "What is the relationship between hepatitis B genotypes and response rates to entecavir?" These and other related questions would be addressed individually and then, ideally, considered together to answer the more general question.

Key questions for studies of the effectiveness of interventions are commonly formulated according to the "PICO" method, which fully defines the Population, Intervention, Comparator, and Outcomes of interest [13]. The acronym "PICOD" is sometimes used to indicate that investigators must also specify which study designs are appropriate to include (eg, all comparative studies versus only randomized trials). Other eligibility criteria may include the timing or setting of care. Variations of these criteria should be used for systematic reviews of other study designs, such as of cohort studies (without a comparator), studies of exposures, or studies of diagnostic tests.

Developing a protocol — A written protocol serves to minimize bias and to ensure that the review is implemented according to reproducible steps. A systematic review should describe the research questions and the review methodology, including the search strategy and approach to analyzing the data. Ideally, the protocol should be a collaborative effort that includes both clinical and methodology experts [14].

Publication of protocols can be useful to prevent unnecessary duplication of efforts and to enhance transparency of the systematic review. A voluntary registry, PROSPERO, was established in 2011. The database contains protocol details for systematic reviews that have health-related outcomes.

The literature search

Performing the search — The literature search should be systematic and comprehensive to minimize error and bias [13]. Most systematic reviews start with a search of an electronic database of the literature. PubMed [15] is almost universally used; other commonly searched databases include Embase [16] and the Cochrane Central Register of Controlled Trials (CENTRAL) [17]. Inclusion of additional databases should be considered for specialized topics such as complementary or alternative medicine, quality of care, or nursing. Electronic searches should be supplemented by searches of the bibliographies of retrieved articles and relevant review articles and by studies known to domain experts.

The research community has also recognized a need to incorporate the "grey literature" to diminish the risks of publication bias (selective publication of studies, possibly based on their results) and reporting bias (selective reporting of study results, possibly based on statistical significance) [12,18-20]. There is no standard definition of grey literature, but it generally refers to information obtained from sources other than published, peer-reviewed articles, such as conference proceedings, clinical trial registries, adverse events databases, government agency databases (eg, US Food and Drug Administration) and documents, unpublished industry data, dissertations, and online sites. Methods to incorporate other types of relevant information, particularly "real-world data" obtained from analyzing databases of patients undergoing routine care, are still being developed [21,22].

Publication and reporting bias — Reporting bias refers to bias that results from incomplete publishing or reporting of available research. This is a common concern and a potentially important limitation of systematic review since the missing data may affect the validity of systematic reviews [23]. There are two main categories of reporting bias:

Publication bias – Compared with positive studies, negative studies may take longer to be published or may not be published at all [24]. This is referred to as "publication bias."

Outcome reporting bias – "Outcome reporting bias" refers to the concern that a study may only include outcomes that are favorable and significant in the published report, while nonsignificant or unfavorable outcomes are selectively not reported.

Several methods have been developed to evaluate whether publication bias is present. However, they all involve major assumptions about possible missing studies [25]. Any evaluation of publication bias should not be considered definitive, but rather only exploratory in nature.

A commonly used method for assessing publication bias is the funnel plot, which is a scatter plot displaying the relationship between the weight of the study (eg, study size or standard error) and the observed effect size (figure 2) [26]. An asymmetric appearance, especially due to the absence of smaller negative studies, can suggest unpublished data. However, this assessment is not definitive since asymmetry could be due to factors other than unpublished negative studies (such as population heterogeneity or study quality) [23,27-29].
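
As a rough illustration of how a funnel plot is constructed (the effect estimate on the horizontal axis and a measure of study weight, such as the standard error, on the vertical axis), the following sketch uses the matplotlib plotting library and invented study data; the reference line for the pooled estimate is also assumed for illustration.

    import matplotlib.pyplot as plt

    # Hypothetical studies: (log odds ratio, standard error); values are invented
    studies = [(-0.45, 0.10), (-0.30, 0.18), (-0.60, 0.25),
               (-0.15, 0.30), (-0.70, 0.35), (-0.20, 0.40)]

    effects = [e for e, se in studies]
    ses = [se for e, se in studies]

    plt.scatter(effects, ses)
    plt.axvline(x=-0.40, linestyle="--")     # pooled estimate (assumed for illustration)
    plt.gca().invert_yaxis()                 # smaller SE (larger studies) plotted at the top
    plt.xlabel("Log odds ratio")
    plt.ylabel("Standard error")
    plt.title("Funnel plot (hypothetical data)")
    plt.show()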

Other methods to evaluate reporting bias include the "trim and fill" method [30], "modeling selection process" [31,32], and testing for an excess of significant findings. These methods are beyond the scope of this topic [33].

Risk of bias assessment — The quality of an individual study has been defined as the "confidence that the trial design, conduct, and analysis has minimized or avoided biases" [34]. The risk of bias (RoB) assessment (sometimes referred to as "quality assessment") represents the extent to which trial design and methodology prevented systematic error and can help explain differences in the results of systematic reviews.

The primary value of the RoB assessment of individual studies in the meta-analysis is to determine the degree of confidence that the pooled effect estimate reflects the "truth" as best as it can be measured. One would be more likely to have high confidence in conclusions based upon "high-quality" (ie, low RoB) studies rather than "low-quality" (ie, high RoB) studies. Differences in RoB of individual studies can also be explored to help explain heterogeneity (eg, does the effect in low RoB studies differ from that in high RoB studies?).

The process of assessing study quality is not straightforward. Several different RoB scoring systems are available. Commonly used tools, among many others, include:

The original Cochrane RoB tool for randomized controlled trials (with 7 questions [35])

A more complex revision of this tool, RoB 2 (with 5 overarching questions and 22 subquestions [36])

The ROBINS-I tool (Risk Of Bias In Non-randomized Studies of Interventions, with 7 overarching questions and 31 subquestions [37])

Different methodologists use different tools depending on available time and resources, needs and purpose of the given review, and philosophical differences among researchers about the relative importance of different "quality" factors. Importantly, the assessment of a study's RoB can be limited by the need to rely on information presented in the manuscript [38].

For randomized trials, the RoB assessment typically considers the following factors:

Randomization method – Some "randomization" methods are not truly random, which can be a source of bias. For example, a computer algorithm is generally preferred over a system based on day of the week or other nonrandom method.

Allocation concealment – Allocation is the assignment of study participants to a treatment group. It occurs between randomization and implementation of the intervention. Allocation should be adequately concealed from the study personnel. A study may be biased if allocation is not concealed. For example, if the study used unsealed envelopes corresponding to the randomization order to assign patients to each treatment arm, the study personnel could read the contents and thereby channel certain patients into the desired treatment (eg, if they believed the investigational treatment was effective, they may channel sicker patients into that arm). This would result in imbalance between the two arms of the study (ie, the intervention arm would have sicker patients while the control arm would have healthier people), resulting in the intervention appearing to be less effective than it truly is.

Blinding – Ideally, all relevant groups should be blinded to treatment assignment. This includes study participants, clinicians, data collectors, outcome assessors, and data analysts. Blinding is not always feasible. Some forms of surgery or behavioral modifications, for example, do not lend themselves to blinding of patients and providers. However, outcome assessors and data analysts can usually be blinded regardless of the type of treatment. "Double blinding" generally refers to blinding of the study participants and at least one of the study investigators, although it may not be clear who was blinded when only "double blinding" is reported. For adequate blinding, treatments with a noticeable side effect (eg, niacin) ideally should have an "active control" that mimics the side effect.

Differences between study groups – Differences in the treatment groups at baseline can lead to biased results. The goal of randomization is to balance important prognostic variables relevant to the outcome(s) of interest among the different treatment groups. However, randomization is not always successful. Baseline imbalances typically occur in trials with relatively small numbers of subjects. Researchers can attempt to adjust for baseline differences in the statistical analysis, but it is far preferable to have balanced groups at baseline.

Attrition and incomplete reporting – High rates of withdrawal of participants from a study may indicate a fundamental problem with the study design. Uneven withdrawal from different study groups can lead to bias, particularly if the reasons for withdrawal differ between, and are related to, the interventions (such as ascribing adverse events to the intervention or lack of effectiveness to the placebo). Reports should describe the reasons for patient withdrawal to allow assessment of their effect on bias and study applicability.

Early termination for benefit – Stopping a trial early for benefit will, on average, overestimate treatment effects [39]. However, the degree of overestimation varies. Small trials that are stopped early with few events can result in large overestimates. In larger trials with more events (ie, >200 to 300 events), early stopping is less likely to result in serious overestimation [40]. Early termination of a trial for harm can also introduce bias (ie, overestimation of the harm); however, it is generally considered ethically obligatory to stop the trial in such circumstances. Early termination for other reasons (eg, slow accrual) is not considered a source of bias per se, though it can sometimes indicate that there are other problems with the trial (eg, the eligibility criteria may be too strict and not reflective of the patient population seen in actual clinical practice).

Other factors that may be considered when assessing the methodologic quality of a study include the accuracy of reporting (eg, details of study methodology, patient characteristics, and study results) and the appropriateness of statistical analyses. For example, an intention to treat (ITT) analysis is appropriate for assessing efficacy of a treatment since it preserves the comparability of treatment groups achieved by randomization. In some cases, it may be appropriate to perform a per protocol analysis alongside the ITT analysis, but when performed alone, per protocol analyses can lead to biased results.

The RoB assessment involves judgement. For this reason, it should generally be performed independently by two separate reviewers and there should be a process for resolving disagreements.

Meta-analysis

Statistical methods for combining data — Meta-analysis combines results across studies to provide overall estimates and confidence intervals of treatment effects. For dichotomous outcomes (ie, outcomes with two possible states, such as death versus survival), results are summarized using an odds ratio (OR), relative risk (RR; also called risk ratio), or hazard ratio (HR). Essentially, any study metric can be meta-analyzed, including continuous variables (mean, mean difference, percent change) or proportions. However, meta-analysis is not feasible if the studies measured completely different outcomes (eg, one trial measured pain scores while the other measured functional ability).
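
To illustrate the arithmetic behind these effect measures for a single trial with a dichotomous outcome, the sketch below computes the relative risk (with its 95% confidence interval) and the odds ratio from a hypothetical 2x2 table; all counts are invented purely to show the calculations that precede pooling.

    import math

    # Hypothetical 2x2 table (counts are invented for illustration)
    events_tx, total_tx = 30, 200      # deaths and sample size in the treatment arm
    events_ct, total_ct = 45, 200      # deaths and sample size in the control arm

    risk_tx = events_tx / total_tx
    risk_ct = events_ct / total_ct
    rr = risk_tx / risk_ct                               # relative risk

    odds_tx = events_tx / (total_tx - events_tx)
    odds_ct = events_ct / (total_ct - events_ct)
    odds_ratio = odds_tx / odds_ct                       # odds ratio

    # Standard error of log RR and its 95% CI
    se_log_rr = math.sqrt(1/events_tx - 1/total_tx + 1/events_ct - 1/total_ct)
    lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
    hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
    print(f"RR {rr:.2f} (95% CI {lo:.2f}-{hi:.2f}), OR {odds_ratio:.2f}")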

There are numerous specific methodologic details of meta-analysis that are beyond the scope of this topic. The primary consideration is whether the summary effect estimate should be calculated under the assumption of a "random effects" or a "fixed effect" model [41]. For most of the medical literature, the random effects model is the more appropriate approach. These two approaches are discussed in detail below. (See 'Random effects model' below and 'Fixed effect model' below.)

When to combine studies — The decision to combine studies should be based upon both qualitative and quantitative evaluations. Important qualitative features include the degree of similarity of populations, interventions, outcomes, study objectives, and study designs that incorporate both clinical and biologic plausibility. The systematic reviewers should provide a sufficient explanation of the rationale for combining studies to allow the readers to judge for themselves whether they agree that it was appropriate to combine the individual studies.

Quantitative methods to examine heterogeneity may also be considered in determining whether it is appropriate to combine data. These typically involve the I2 index or Q statistic, which are described below. (See 'Heterogeneity' below and 'I2 index' below and 'Q statistic' below.)

These statistics, however, generally have low power and are thus prone to false negative results (eg, not detecting heterogeneity when it is present). Evidence of statistical heterogeneity does not preclude appropriate meta-analysis.

Precision — Precision refers to the extent to which the observed results would be reproduced exactly given the same interventions and study design. Precision is generally assessed by examining the confidence intervals (CIs). The narrower the CIs are, the more precise the estimate is. If the estimate is too imprecise (ie, CIs are too wide), our certainty in the finding is reduced. But how wide is too wide? As a general rule, imprecision is problematic if the clinical decision based on the result (eg, to use or not use an intervention) would be different at the upper versus lower boundary of the 95% CIs. For organizations issuing guidelines, the strength of the evidence should be downgraded for imprecision in this scenario. Specific criteria for imprecision have been developed in the context of grading for guidelines [42].

Precision is different from validity, which refers to the extent to which the results reflect the "truth." The figure illustrates a conceptual example of the difference between precision and validity (figure 3).

Problematic imprecision is often encountered when the sample size is small (particularly if there are few events). An important advantage of meta-analysis is that combining studies produces more precise estimates of the effect size (ie, estimates that have narrower CIs) due to the increased sample size. (See 'Advantages of systematic review and meta-analysis' above.)

Sensitivity analysis — A meta-analysis should test how stable the overall estimates are when different subgroups of studies are analyzed and should explore heterogeneity among the studies. Meta-regression and subgroup analyses can be used to examine the influence of study-level characteristics on the overall results. When feasible, meta-analysis of individual patient data allows the most rigorous exploration of heterogeneity. (See 'Individual patient data' below.)

Explorations such as reanalyzing the data with single studies or groups of studies (eg, high RoB studies) omitted can be used to determine the degree to which overall results are being driven by these studies. Conclusions should seldom be driven by a single study since the meta-analysis would add little additional information or confidence compared with the single study alone.

Sensitivity analyses can also be used to explore such issues as publication or reporting bias. As an example, finding that meta-analysis of the largest studies yields smaller effect sizes than meta-analysis of all trials can suggest that smaller "negative" trials may be missing [43,44].

Subgroup analyses — Another way to explore heterogeneity is subgroup analysis, which involves performing separate analyses based upon clinically relevant variables. Subgroup analysis is subject to the same limitations inherent to meta-regression, including risks associated with data dredging and ecological fallacy. To minimize the risk of drawing false conclusions, subgroup analyses in meta-analyses should be:

Specified a priori, including hypotheses for the direction of the differences (ie, they should be based upon prior evidence or knowledge)

Limited to only a few (ie, to avoid data dredging)

Analyzed by testing for interaction (eg, using meta-regression) rather than simply comparing the separate effect estimates

An example of subgroup analysis is shown in the figure (figure 4), which is from a meta-analysis examining the effect of continuous positive airway pressure on reducing depressive symptoms in patients with obstructive sleep apnea [45]. The investigators performed subgroup analyses to explore whether the effect differed in studies involving patients with baseline depression compared with studies of patients without baseline depression. In this case, the test for subgroup effect (ie, interaction) was statistically significant (p<0.001).

The approach to evaluating subgroup analyses in meta-analyses and clinical trials is discussed in greater detail separately. (See "Evidence-based medicine", section on 'Subgroup analyses'.)

Meta-regression — Regression analysis of primary studies may be used to adjust for potential confounders or explain differences in results across studies. This meta-analytic technique is commonly known as meta-regression. In this approach, the dependent variable in the regression is the estimate of treatment effect from each individual study and the independent variables (eg, covariates such as drug dose, treatment duration, or study size) are the aggregated characteristics derived from the individual studies. Instead of individual patients serving as the units of analysis, each individual study is considered to be one observation [46-48]. Meta-regression tests the statistical interaction between the subgroup variable (eg, dose) and the treatment effect (eg, relative risk of death). It can include categorical variables (including two or more categories, such as study country or study design) and continuous variables (such as dose or follow-up duration) either singly (univariable analysis) or together (multivariable analysis).
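
As a minimal sketch of this idea, assuming each study contributes one observation whose outcome is its log relative risk and whose covariate is a study-level characteristic (here, treatment duration), an inverse-variance weighted least squares fit can be computed directly; the data below are invented, and real meta-regressions typically use dedicated random effects routines.

    import numpy as np

    # Hypothetical studies: treatment duration (weeks), log relative risk, standard error
    duration = np.array([4.0, 8.0, 12.0, 24.0, 48.0])
    log_rr   = np.array([-0.05, -0.10, -0.20, -0.35, -0.50])
    se       = np.array([0.20, 0.15, 0.18, 0.12, 0.10])

    w = 1 / se**2                                        # inverse-variance weights
    X = np.column_stack([np.ones_like(duration), duration])

    # Weighted least squares: solve (X'WX) beta = X'Wy
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ log_rr)
    print(f"Intercept {beta[0]:.3f}, slope per week of treatment {beta[1]:.4f}")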

An example of a meta-regression of early trials of zidovudine monotherapy for HIV infection is shown in a figure (figure 5) [49]. The meta-regression successfully explains the heterogeneity across studies, showing an association between treatment duration and the effect of treatment on death that was not apparent within the individual trials.

There are several caveats related to the performance and interpretation of meta-regression:

Meta-regression and subgroup analyses (that rely on retrospective data from previously run trials) should be considered to yield hypothesis-generating, rather than conclusive, associations, in contrast to well-designed regressions of prospective study data.

Meta-regression is not always feasible, since covariates may not be fully reported or may not be uniformly defined.

Data dredging (analyzing every possible variable regardless of clinical relevance) can result in spurious associations [50].

It may be difficult to account properly for patient-level variables (such as age, sex, or laboratory values). Most studies, for example, report averages for such variables (eg, a mean age of 47.1 years) that do not reflect the range of values across the study population. Making an assumption about individual data based upon aggregated statistics (known as "ecological fallacy") can produce invalid results in meta-regression [51,52]. The only reliable way to address this is to analyze patient-level data.

Individual patient data — It is sometimes possible to obtain original patient-level databases to reanalyze individual patient data in a meta-analysis [14]. Pooling individual patient data is the most rigorous form of meta-analysis. While more costly and time-consuming and limited by difficulties collecting original data, there are several benefits. These include the ability to perform meta-regressions of patient-level predictors (eg, age) without the risk of ecological fallacy, to conduct time-to-event analyses, and to include unpublished, previously unanalyzed data. However, analyses of partial databases (all that may be available with proprietary data) or of selected databases are subject to selection bias or limited generalizability of results, similar to other retrospective analyses of incomplete samples.

Network meta-analysis — When multiple different interventions are compared across trials, a network of studies can be established where all the studied interventions are linked to each other by individual trials. Network meta-analysis (NMA) evaluates all studies and all interventions simultaneously to produce multiple pairwise estimates of relative effects of each intervention compared with every other intervention [53,54].

A schematic representation of a network diagram is shown in the figure (figure 6). In reality, some network diagrams in NMAs are far more complex (figure 7).

The pairwise comparisons in NMAs are based upon both direct and indirect comparisons. For example, consider two drugs (drug A and drug B) that were each evaluated in placebo-controlled trials and directly compared with one another in a separate clinical trial (figure 6). NMA can be used to estimate the relative efficacy of drug A versus drug B based upon the direct comparison (ie, from the trial directly comparing drug A to drug B) and indirect comparisons (ie, from the placebo-controlled trials). The direct and indirect estimates are then pooled together to yield an overall estimate (or "network estimate") of the relative effect. Typically, the direct, indirect, and network estimates are reported separately in NMAs. Some of the comparisons in an NMA may be based entirely on indirect data.
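
As a simplified sketch of the underlying calculation for a single indirect comparison (the Bucher approach; full NMAs estimate all comparisons jointly within one model), the indirect estimate and its variance on the log odds ratio scale are:

    \log \widehat{OR}_{AB}^{\,indirect} = \log \widehat{OR}_{\text{A vs placebo}} - \log \widehat{OR}_{\text{B vs placebo}}

    Var\!\left(\log \widehat{OR}_{AB}^{\,indirect}\right) = Var\!\left(\log \widehat{OR}_{\text{A vs placebo}}\right) + Var\!\left(\log \widehat{OR}_{\text{B vs placebo}}\right)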

When assessing the validity of an NMA, many of the same principles that are used for assessing conventional meta-analysis apply (eg, was the literature search comprehensive, were eligibility criteria for the studies clearly stated, were the individual studies assessed for RoB, how precise are the effect estimates, etc (table 2)). However, there are two concerns that are unique to NMAs [55,56]:

Intransitivity – The assumption of transitivity is fundamental to NMA because the network estimates rely upon indirect comparisons. For the transitivity assumption to hold, the individual studies must be sufficiently similar in all respects other than the treatments being compared (ie, similar participants, setting, ancillary treatments, and other relevant parameters). In the example above, if studies of drug A versus placebo are systematically different than studies of drug B versus placebo (eg, if they were conducted in an earlier era), then the indirect comparison of drug A versus drug B may be biased due to these differences (ie, the difference may be partly explained by differences in disease management over the intervening decades).

Incoherence – Incoherence (also called inconsistency) refers to differences between the direct and indirect estimates. Incoherence can be a consequence of bias due to methodologic limitations of the studies, publication bias, indirectness, or intransitivity. If the direct and indirect estimates are considerably different from each other, the network estimate may not be valid. Addressing incoherence and assessing its impact on the network estimate requires judgement [55].

Bayesian methods are commonly used to conduct NMA [57]. This approach has the advantage of allowing estimation of the probability of each intervention being best, which, in turn, allows interventions to be ranked. Such ranking, however, needs to be interpreted cautiously, as it can be unstable, depending on the network topology, and can have a substantial degree of imprecision [58].

READING AND INTERPRETING A SYSTEMATIC REVIEW — Key questions to consider when reading and interpreting a systematic review are summarized in the table (table 2). The reader should appraise the systematic review for its quality, potential sources of bias, and the extent to which the findings are applicable to their specific question. Systematic review and meta-analysis are subject to the same biases observed in all research. In addition, the value of a systematic review's conclusions may be limited by the quality and applicability of the individual studies included in the review.

Since meta-analysis is a pooling of distinct individual studies, it is important to bear in mind that the overall results, even more than individual study results, are not directly interpretable as a patient-level risk (of an outcome) and cannot make personalized predictions for patients. Results must be interpreted as an average result for a population.

GLOSSARY OF TERMS

Applicability (generalizability) — The relevance of a study (or a group of studies) to a population of interest (or an individual patient). This requires an assessment of how similar the subjects of a study are to the population of interest, the relevance of the studied interventions and outcomes, and other PICO features. (See 'PICO method (PICOD, PICOS, PICOTS, others)' below.)

Ecological fallacy (ecological inference fallacy) — An error in interpreting data where inferences are made about specific individuals based upon aggregated statistics for groups of individuals.

Fixed effect model — The central assumption of a fixed effect model is that there is a single true treatment effect and that all trials provide estimates of this one true effect. Meta-analysis thus provides a pooled estimate of the single true effect. A hypothetical model for a fixed effect model meta-analysis is shown in a figure (figure 8).

Under this assumption, estimates from each study differ solely because of random error around a common true effect; all studies are taken to represent the same population, intervention, comparator, and outcome, for which there is a single "true" effect size. Fixed effects models yield effect size estimates by assigning a weight to each individual study estimate that reflects the inherent variability in the results measured (ie, the "within-study variance" related to the standard error of the outcome).
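
In its simplest (inverse-variance) form, the fixed effect pooled estimate is a weighted average of the individual study estimates \theta_i, with weights equal to the reciprocal of each study's within-study variance \sigma_i^2:

    \hat{\theta}_{FE} = \frac{\sum_i w_i \theta_i}{\sum_i w_i}, \qquad w_i = \frac{1}{\sigma_i^2}, \qquad SE\!\left(\hat{\theta}_{FE}\right) = \sqrt{\frac{1}{\sum_i w_i}}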

There are limited instances when it is appropriate to use a fixed effects model for summarizing clinical trials. These include meta-analyses in which:

There is extreme confidence that the studies are comparable (ie, characteristics of the enrolled patients, the type of intervention, comparators and outcome measures) such that any difference across studies is just due to random variation. Such an assumption is typically difficult to justify. One example of an appropriate use of the fixed effects model is meta-analysis of repeated, identical, highly controlled trials in a uniform setting, as may be done by pharmaceutical companies during early testing.

The studies involve rare events, in which case one form of a fixed effects model (the Peto odds ratio, shown below) may be less biased than other methods of pooling data [59].
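
For context, in the Peto approach each study contributes the observed-minus-expected event count in the treatment arm (O_i − E_i) and the corresponding hypergeometric variance V_i, and the pooled odds ratio across k studies is:

    \widehat{OR}_{\text{Peto}} = \exp\!\left(\frac{\sum_{i=1}^{k}(O_i - E_i)}{\sum_{i=1}^{k} V_i}\right)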

Forest plot — A forest plot is a graphical presentation of individual studies, typically displayed as point estimates with their associated 95% CIs on an appropriate scale, next to a description of the individual studies (figure 9). The forest plot allows the reader to see the estimate and the precision of the individual studies, appreciate the heterogeneity of results, and compare the estimates of the individual studies to the overall summary estimate.

Ideally, a forest plot should provide sufficient data for the reader to make some assessment of the individual studies in the context of the overall summary (eg, to compare sample sizes, any variations in treatments such as dose, baseline values, demographic features, and study quality).

Funnel plot — A graphical technique, with related statistical tests, to examine the studies within a systematic review for the possibility of publication bias (figure 2). (See 'Publication and reporting bias' above.)

Grey literature — A term with varying and shifting meaning that indicates sources of evidence beyond the peer-reviewed, published literature available in major databases (eg, Medline). Examples include alternative databases, conference abstracts and proceedings, unpublished studies (eg, via clinicaltrials.gov), newspaper or internet citations, citation indexes, handsearching of journals or reference lists, and domain experts.

Heterogeneity

Clinical heterogeneity — Qualitative differences in study features, such as study eligibility criteria, interventions, or methods of measuring outcomes, that may preclude appropriate meta-analysis. These features can be explicit (such as different drug doses used) or implicit (such as differences in populations depending on setting or country). Clinical heterogeneity does not necessarily result in statistical heterogeneity; for example, the effect size may be similar regardless of the drug dose, the individual drug within a class of drugs, or the population studied (eg, men and women, or Japanese and American patients).

Statistical heterogeneity — Quantitative differences in study results across studies examining similar questions. Statistical heterogeneity may be due to clinical heterogeneity or to chance. Statistical heterogeneity is measured with a variety of tests, most commonly I2 and the Q statistic. Other heterogeneity measures (eg, H2, R2, tau2) have also been described but are infrequently used.

I2 index — The I2 index represents the amount of variability in the effect sizes across studies that can be explained by between-study variability. For example, an I2 value of 75 percent means that 75 percent of the variability in the measured effect sizes across studies is caused by true heterogeneity among studies. By consensus, standard thresholds for the interpretation of I2 are 25, 50, and 75 percent to represent low, medium, and high heterogeneity, respectively [60]. However, the investigators who introduced the I2 statistic noted that naïve categorization of I2 values is not appropriate in all circumstances and that "the practical impact of heterogeneity in a meta-analysis also depends on the size and direction of treatment effects" [60]. The clinical implication and interpretability of a meta-analysis with a large I2 index will be different for studies with large statistically significant effects compared with studies with smaller inconsistent effects.
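
The I2 index is calculated from the Q statistic and its degrees of freedom (k − 1 for k studies), truncated at zero:

    I^2 = \max\!\left(0,\; \frac{Q - (k-1)}{Q}\right) \times 100\%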

Key questions — Research questions that are clearly defined and form the basis for the systematic review or meta-analysis. (See 'Formulating research questions' above.)

Meta-regression — A meta-analytic technique that permits adjustment for potential confounders and analysis of different variables to help explain differences in results across studies. Equivalent to patient-level regression, except that the unit of analysis is a study instead of a person. (See 'Meta-regression' above.)

Network meta-analysis — A technique to simultaneously meta-analyze a network of studies that evaluated related, but different, specific comparisons. It permits quantitative inferences across studies that have made indirect comparisons of interventions. An example would be the comparison of two or more drugs to each other, when each was studied only in comparison to placebo. (See 'Network meta-analysis' above.)

PICO method (PICOD, PICOS, PICOTS, others) — An acronym that stands for Population, Intervention(s), Comparator(s), Outcome(s); added letters include Study Design (PICOD), Setting (PICOS), Timing and Setting (PICOTS). PICO is the basis for a systematic approach in developing a key question and research protocol. While used extensively for systematic reviews, PICO is relevant to all medical research questions. Each feature is defined explicitly and comprehensively so that it is unambiguously evident which studies are eligible for inclusion in a systematic review.

Precision — Precision refers to the extent to which the observed results would be reproduced exactly, given the same interventions and study design. The precision of an effect estimate can generally be assessed by examining the confidence intervals (ie, the narrower the confidence intervals are, the more precise the estimate is). (See "Glossary of common biostatistical and epidemiological terms", section on 'Confidence interval'.)

PRISMA statement — Preferred Reporting Items for Systematic Reviews and Meta-Analyses, an update of the QUOROM (Quality of Reporting of Meta-analyses) statement. A guideline for reporting of systematic reviews, used as a standard by many journals.

PROSPERO — An international database of prospectively registered systematic reviews in health care. PROSPERO creates a permanent record of systematic review protocols to reduce unnecessary duplication of efforts and increase transparency. Researchers should ideally enter their protocols prospectively and update them as necessary.

Publication bias — One of several related biases in the available evidence being considered for inclusion in a systematic review. Conceptually, studies that have been published are systematically different from studies that have failed to be published, due to lack of acceptance by journals, lack of interest by authors or research grantors, or, potentially, deliberate withholding by funders. Theoretically, "positive" (statistically significant) results are more likely to be published than "negative" results. Strictly, publication bias refers specifically to entire studies that remain unpublished.

Related biases include selective outcome reporting bias, where studies are published without certain outcomes; time-lag bias, where "negative" study results tend to be delayed in their publication compared with "positive" results; location bias, where "positive" or more interesting results tend to be published in journals that are more easily accessible; language bias, pertinent in certain fields, where non-English language publications differ in study results compared with those published (from the same countries or authors) in English; and multiple or duplicate publication bias, where certain studies may be overrepresented in the literature due to duplicate or overlapping publications (that may be difficult to tease apart).

Q statistic — The "Q" statistic (or chi square test for heterogeneity) tests the hypothesis that results across studies are homogeneous. Its calculation involves summing the squared deviations from the effect measured in each study from the overall effect and weighting the contribution from each study by the inverse of its variance. The Q statistic is usually interpreted to indicate heterogeneity if its P value is <0.10. A nonsignificant value suggests that the studies are homogeneous. However, the Q statistic has limited power to detect heterogeneity in meta-analyses with few studies, while it tends to over-detect heterogeneity in meta-analyses with many studies [61].

Random effects model — The central assumption of a random effects model is that each study estimate represents a random sample from a distribution of different populations [62]. For most of the medical literature, the random effects model is the more appropriate approach. A hypothetical model for a random effects model meta-analysis is shown in the figure (figure 10). The model assumes there are multiple true treatment effects related to inherent differences in different populations or other factors, and that each trial provides an estimate of its own true effect. The meta-analysis provides a pooled estimate across (or an average of) a range of true effects. Thus, the random effects model assumes that there is not necessarily one "true" effect size but rather that the studies included have provided a glimpse of a range of "true" effects. The random effects model incorporates both "between-study variance" (to capture the range of different true effects across studies) and "within-study variance" (to capture random error within each study) [41]. There are several methods for calculating the random effects model estimates. The optimal approaches continue to be debated [63].
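
In one common implementation, the DerSimonian-Laird approach [41], the between-study variance \tau^2 is estimated from the Q statistic and the fixed effect weights w_i = 1/\sigma_i^2, and is then added to each study's within-study variance when forming the random effects weights:

    \hat{\tau}^2 = \max\!\left(0,\; \frac{Q - (k-1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right), \qquad w_i^{*} = \frac{1}{\sigma_i^2 + \hat{\tau}^2}, \qquad \hat{\theta}_{RE} = \frac{\sum_i w_i^{*}\,\theta_i}{\sum_i w_i^{*}}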

Risk of bias assessment — The risk of bias (RoB) assessment (sometimes referred to as "quality assessment") represents the extent to which trial design and methodology prevented systematic error and can help explain differences in the results of systematic reviews. The primary value of the RoB assessment of individual studies in the meta-analysis is to determine the degree of confidence that the pooled effect estimate reflects the "truth" as best as it can be measured. One would be more likely to have high confidence in conclusions based upon "high-quality" (ie, low RoB) studies rather than "low-quality" (ie, high RoB) studies. (See 'Risk of bias assessment' above.)

Sensitivity analysis — A method of exploring heterogeneity in a meta-analysis by varying which studies are included to determine the effects of such changes. Used to explore how sensitive a meta-analysis finding is to inclusion of individual studies and to evaluate possible causes of heterogeneity; for example, whether exclusion of high RoB studies influences the size of the effect. (See 'Sensitivity analysis' above.)

SUMMARY

A systematic review is a comprehensive summary of all available evidence that meets predefined eligibility criteria to address a specific clinical question or range of questions. Meta-analysis, which is commonly included in systematic reviews, is a statistical method that quantitatively combines the results from different studies. It is commonly used to provide an overall pooled estimate of the benefit or harm of an intervention. (See 'Key definitions' above.)

Several steps are essential for conducting a systematic review or meta-analysis. These include:

Formulating research questions (see 'Formulating research questions' above)

Developing a protocol (see 'Developing a protocol' above)

Searching for the evidence (see 'The literature search' above)

Assessing the quality of studies (see 'Risk of bias assessment' above)

Summarizing and displaying results (eg, using forest plots and a summary of findings table, as shown in the figure (figure 1)) (see 'Forest plot' above)

Exploring reasons for heterogeneity across studies (see 'Heterogeneity' above and 'Subgroup analyses' above and 'Sensitivity analysis' above)

When reading and interpreting a systematic review, the reader should appraise the methodologic quality, assess for potential sources of bias, and consider the extent to which the findings are applicable to their specific question. Key issues to consider are summarized in the table (table 2). The value of a systematic review's conclusions may be limited by the quality and applicability of the individual studies included in the review. (See 'Reading and interpreting a systematic review' above.)

REFERENCES

  1. Systematic reviews, Chalmers I, Altman DG (Eds), BMJ Publishing Group, London 1995.
  2. Cook DJ, Mulrow CD, Haynes RB. Systematic reviews: synthesis of best evidence for clinical decisions. Ann Intern Med 1997; 126:376.
  3. Oxman AD, Cook DJ, Guyatt GH. Users' guides to the medical literature. VI. How to use an overview. Evidence-Based Medicine Working Group. JAMA 1994; 272:1367.
  4. Mulrow CD. The medical review article: state of the science. Ann Intern Med 1987; 106:485.
  5. Lau J, Ioannidis JP, Schmid CH. Quantitative synthesis in systematic reviews. Ann Intern Med 1997; 127:820.
  6. Ioannidis JP. Why most published research findings are false: author's reply to Goodman and Greenland. PLoS Med 2007; 4:e215.
  7. LeLorier J, Grégoire G, Benhaddad A, et al. Discrepancies between meta-analyses and subsequent large randomized, controlled trials. N Engl J Med 1997; 337:536.
  8. Cappelleri JC, Ioannidis JP, Schmid CH, et al. Large trials vs meta-analysis of smaller trials: how do their results compare? JAMA 1996; 276:1332.
  9. Villar J, Carroli G, Belizán JM. Predictive ability of meta-analyses of randomised controlled trials. Lancet 1995; 345:772.
  10. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021; 372:n71.
  11. www.prisma-statement.org/Extensions/Default.aspx (Accessed on April 16, 2018).
  12. Institute of Medicine. Finding what works in health care: Standards for systematic reviews. The National Academies Press, Washington, DC, 2011. Available at: http://www.iom.edu/Reports/2011/Finding-What-Works-in-Health-Care-Standards-for-Systematic-Reviews.aspx (Accessed on October 10, 2011).
  13. Counsell C. Formulating questions and locating primary studies for inclusion in systematic reviews. Ann Intern Med 1997; 127:380.
  14. Clarke MJ, Stewart LA. Principles of and procedures for systematic reviews. In: Systematic reviews in health care: meta-analysis in context, Egger M, Smith G, Altman D (Eds), BMJ Publishing Group, London 2001. p.23.
  15. http://www.ncbi.nlm.nih.gov/pubmed (Accessed on July 14, 2011).
  16. http://www.embase.com/search (Accessed on July 14, 2011).
  17. http://onlinelibrary.wiley.com/o/cochrane/cochrane_clcentral_articles_fs.html (Accessed on July 14, 2011).
  18. Dickersin K, Chalmers I. Recognising, investigating and dealing with incomplete and biased reporting of clinical research: from Francis Bacon to the World Health Organisation. James Lind Library 2010. Available at: www.jameslindlibrary.org (Accessed on October 10, 2011).
  19. Mathieu S, Boutron I, Moher D, et al. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA 2009; 302:977.
  20. Kirkham JJ, Altman DG, Williamson PR. Bias due to changes in specified outcomes during the systematic review process. PLoS One 2010; 5:e9810.
  21. Sherman RE, Anderson SA, Dal Pan GJ, et al. Real-World Evidence - What Is It and What Can It Tell Us? N Engl J Med 2016; 375:2293.
  22. Briere JB, Bowrin K, Taieb V, et al. Meta-analyses using real-world data to generate clinical and epidemiological evidence: a systematic literature review of existing recommendations. Curr Med Res Opin 2018; 34:2125.
  23. Thornton A, Lee P. Publication bias in meta-analysis: its causes and consequences. J Clin Epidemiol 2000; 53:207.
  24. Ioannidis JP. Effect of the statistical significance of results on the time to completion and publication of randomized efficacy trials. JAMA 1998; 279:281.
  25. Vevea JL, Woods CM. Publication bias in research synthesis: sensitivity analysis using a priori weight functions. Psychol Methods 2005; 10:428.
  26. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997; 315:629.
  27. Tang JL, Liu JL. Misleading funnel plot for detection of bias in meta-analysis. J Clin Epidemiol 2000; 53:477.
  28. Terrin N, Schmid CH, Lau J, Olkin I. Adjusting for publication bias in the presence of heterogeneity. Stat Med 2003; 22:2113.
  29. Terrin N, Schmid CH, Lau J. In an empirical evaluation of the funnel plot, researchers could not visually identify publication bias. J Clin Epidemiol 2005; 58:894.
  30. Duval S, Tweedie R. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 2000; 56:455.
  31. Copas J. What works?: Selectivity models and meta-analysis. Journal of the Royal Statistical Society Series A 1999; 162:95.
  32. Rosenthal R. The 'file drawer problem' and tolerance for null results. Psychol Bull 1979; 86:638.
  33. Ioannidis JP, Trikalinos TA. An exploratory test for an excess of significant findings. Clin Trials 2007; 4:245.
  34. Moher D, Jadad AR, Nichol G, et al. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 1995; 16:62.
  35. Higgins JP, Altman DG, Gøtzsche PC, et al. The Cochrane Collaboration's tool for assessing risk of bias in randomised trials. BMJ 2011; 343:d5928.
  36. Sterne JAC, Savović J, Page MJ, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ 2019; 366:l4898.
  37. Sterne JA, Hernán MA, Reeves BC, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 2016; 355:i4919.
  38. Verhagen AP, de Vet HC, de Bie RA, et al. The art of quality assessment of RCTs included in systematic reviews. J Clin Epidemiol 2001; 54:651.
  39. Bassler D, Briel M, Montori VM, et al. Stopping randomized trials early for benefit and estimation of treatment effects: systematic review and meta-regression analysis. JAMA 2010; 303:1180.
  40. Walter SD, Guyatt GH, Bassler D, et al. Randomised trials with provision for early stopping for benefit (or harm): The impact on the estimated treatment effect. Stat Med 2019; 38:2524.
  41. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986; 7:177.
  42. Guyatt GH, Oxman AD, Kunz R, et al. GRADE guidelines 6. Rating the quality of evidence--imprecision. J Clin Epidemiol 2011; 64:1283.
  43. Dechartres A, Altman DG, Trinquart L, et al. Association between analytic strategy and estimates of treatment outcomes in meta-analyses. JAMA 2014; 312:623.
  44. Berlin JA, Golub RM. Meta-analysis as evidence: building a better pyramid. JAMA 2014; 312:603.
  45. Povitz M, Bolo CE, Heitman SJ, et al. Effect of treatment of obstructive sleep apnea on depressive symptoms: systematic review and meta-analysis. PLoS Med 2014; 11:e1001762.
  46. Berkey CS, Hoaglin DC, Mosteller F, Colditz GA. A random-effects regression model for meta-analysis. Stat Med 1995; 14:395.
  47. Schmid CH. Exploring heterogeneity in randomized trials via metaanalysis. Drug Inf J 1999; 33:211.
  48. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med 2002; 21:589.
  49. Ioannidis JP, Cappelleri JC, Sacks HS, Lau J. The relationship between study design, results, and reporting of randomized clinical trials of HIV infection. Control Clin Trials 1997; 18:431.
  50. Schulz KF, Grimes DA. Multiplicity in randomised trials II: subgroup and interim analyses. Lancet 2005; 365:1657.
  51. Rothman KJ, Greenland S. Modern epidemiology, 2nd ed, Lippincott-Raven, Philadelphia 1998.
  52. Geissbühler M, Hincapié CA, Aghlmandi S, et al. Most published meta-regression analyses based on aggregate data suffer from methodological pitfalls: a meta-epidemiological study. BMC Med Res Methodol 2021; 21:123.
  53. Jansen JP, Fleurence R, Devine B, et al. Interpreting indirect treatment comparisons and network meta-analysis for health-care decision making: report of the ISPOR Task Force on Indirect Treatment Comparisons Good Research Practices: part 1. Value Health 2011; 14:417.
  54. Mills EJ, Ioannidis JP, Thorlund K, et al. How to use an article reporting a multiple treatment comparison meta-analysis. JAMA 2012; 308:1246.
  55. Brignardello-Petersen R, Mustafa RA, Siemieniuk RAC, et al. GRADE approach to rate the certainty from a network meta-analysis: addressing incoherence. J Clin Epidemiol 2019; 108:77.
  56. Brignardello-Petersen R, Bonner A, Alexander PE, et al. Advances in the GRADE approach to rate the certainty in estimates from a network meta-analysis. J Clin Epidemiol 2018; 93:36.
  57. Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Methods 2012; 3:80.
  58. Trinquart L, Attiche N, Bafeta A, et al. Uncertainty in Treatment Rankings: Reanalysis of Network Meta-analyses of Randomized Trials. Ann Intern Med 2016; 164:666.
  59. Bradburn MJ, Deeks JJ, Berlin JA, Russell Localio A. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Stat Med 2007; 26:53.
  60. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003; 327:557.
  61. Huedo-Medina TB, Sánchez-Meca J, Marín-Martínez F, Botella J. Assessing heterogeneity in meta-analysis: Q statistic or I2 index? Psychol Methods 2006; 11:193.
  62. Lau J, Ioannidis JP, Schmid CH. Summing up evidence: one answer is not always enough. Lancet 1998; 351:123.
  63. Jackson D, Law M, Stijnen T, et al. A comparison of seven random-effects models for meta-analyses that estimate the summary odds ratio. Stat Med 2018; 37:1059.
Topic 16293 Version 25.0
