Evaluating diagnostic tests

- Authors:
- Neal G Mahutte, MD
- Antoni J Duleba, MD
- Section Editor:
- Joann G Elmore, MD, MPH
- Deputy Editor:
- Carrie Armsby, MD, MPH

Literature review current through: Jan 2023. | This topic last updated: Oct 20, 2022.

INTRODUCTION — The introduction of new diagnostic tests that claim to improve screening or provide definitive diagnosis is a major dilemma for all clinicians. The decision to embrace or reject these tests is often made individually with incomplete information and without thoughtful reflection.

In this topic review, we will outline a simple seven-step process that can be used to evaluate the utility of any diagnostic test:

●Can the test be reliably performed?

●Was the test evaluated on an appropriate population?

●Was an appropriate reference standard used?

●Was an appropriate cutoff value chosen to optimize sensitivity and specificity?

●What are the positive and negative likelihood ratios?

●How well does the test perform in specific populations?

●What is the balance between cost of the disease and cost of the test?

A catalog of common biostatistical and epidemiologic terms encountered in the medical literature, an evidence-based approach to prevention, and issues regarding hypothesis testing are presented separately. (See "Glossary of common biostatistical and epidemiological terms" and "Evidence-based approach to prevention" and "Proof, p-values, and hypothesis testing".)

CAN THE TEST BE PERFORMED RELIABLY?

Accuracy and precision — To answer this question objectively, it is helpful to determine the extent to which the test is accurate, precise, and user dependent. Accuracy refers to the ability of the test to actually measure what it claims to measure and is defined as the proportion of all test results (both positive and negative) that are correct (table 1). Precision refers to the ability of the test to reproduce the same result when repeated on the same patient or sample. The two concepts are related but different. For example, a test could be precise but not accurate if on three occasions it produced roughly the same result, but that result differed greatly from the actual value determined by a reference standard. Both accuracy and precision may be presented in the form of a confidence interval (CI) or standard error (SE).
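The distinction between the two concepts can be illustrated with a minimal sketch. The measurement values and the reference value below are hypothetical, chosen only to show a test that is precise (small spread across repeats) but not accurate (large bias relative to the reference standard). The function name is likewise an illustrative assumption.

```python
from statistics import mean, stdev

def bias_and_spread(measurements, reference_value):
    """Return (bias, spread) for repeated measurements of one sample.

    bias   = mean measurement minus the reference value (reflects accuracy)
    spread = sample standard deviation of the repeats (reflects precision)
    """
    return mean(measurements) - reference_value, stdev(measurements)

# Hypothetical: three repeated results on the same sample; reference value 100.
repeats = [120.1, 119.8, 120.3]
bias, spread = bias_and_spread(repeats, 100.0)
print(f"bias = {bias:.2f}, spread = {spread:.2f}")
```

A small spread with a large bias corresponds to the "precise but not accurate" case described above.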

Expertise — One of the great challenges in evaluating a diagnostic test is determining to what extent user expertise influences accuracy and precision. Studies in the literature often originate from tertiary care centers with advanced capabilities for diagnostic equipment and personnel. Such environments may bear little resemblance to the facilities found at a local level. As an example, high "user dependence" makes it difficult to apply advances in screening ultrasonography found at specialized centers to the population at large [1]. The test may be accurate and precise in an expert's hands, but it may be imprecise, inaccurate, and unreliable when performed by a less experienced practitioner. These factors should be taken into account when determining if a given test should be implemented in a given situation.

WAS THE TEST EVALUATED ON AN APPROPRIATE POPULATION?

Population — This step examines the population from which test data were derived, a point that is often overlooked. A test should be conducted on a broad spectrum of patients with and without the disorder in question to maximize generalizability. Those with the disorder should represent all stages and manifestations of the disease. Even more importantly, individuals without the disorder should have some clinical manifestations similar to, and perhaps easily confused with, the disease in question. This is critical in demonstrating the ability of the test to distinguish among clinical entities in the differential diagnosis.

As an example, the utility of obtaining a serum CA125 concentration for detection of endometriosis depends upon studying a population that includes a range of patients with minimal, mild, moderate, and severe endometriosis. If the study population has a disproportionate number of women with severe disease, this might falsely inflate the capability of the test to identify cases. It is also essential to include a large cohort of patients without endometriosis but with similar signs or symptoms (eg, dysmenorrhea, dyspareunia, pelvic pain, infertility, adnexal mass, fibroids). Neglecting to include these patients might falsely inflate the performance of the test.

Sample size — Sample size is part of the question of population appropriateness. An adequate number of patients must be studied to encompass a broad spectrum of manifestations in diseased and nondiseased subjects. However, an overly large sample size may detect a statistically significant test difference that is not clinically meaningful, while a sample size that is too small may yield inconclusive results due to low power.

One direct way of evaluating sample size is to examine the confidence intervals (CIs) for sensitivity, specificity, and likelihood ratios reported in the study. Wide intervals indicate that the sample was too small to estimate test performance precisely. (See 'Balancing sensitivity and specificity' below and 'What are the positive and negative likelihood ratios?' below.)
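As a rough illustration of how sample size affects these intervals, the sketch below computes a 95 percent CI for an observed sensitivity of 80 percent in studies of different sizes. The Wilson score method is used here as one common choice; the source does not specify a method, and the study counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 for about 95 percent)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical studies: same observed sensitivity, different numbers of diseased patients.
for true_pos, diseased in [(8, 10), (80, 100), (800, 1000)]:
    low, high = wilson_interval(true_pos, diseased)
    print(f"{true_pos}/{diseased}: 95% CI {low:.2f} to {high:.2f}")
```

The interval narrows as the number of diseased patients grows, even though the point estimate is unchanged.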

WAS AN APPROPRIATE REFERENCE STANDARD USED?

Reference standards — Evaluation of a test necessarily involves comparison with a reference standard. Ideally, a reference standard allows unambiguous identification of diseased and nondiseased patients. However, in the real world, reference standards often involve some degree of error or user dependence.

As an example, histopathology is often used as a reference standard for the diagnosis of endometriosis; however, histopathology is not infallible. Cases can be misdiagnosed because of sampling error or individual differences among pathologists in histologic interpretation. The presence of ectopic endometrial glands, but not stroma (or vice versa), in a woman with clinical signs and symptoms of endometriosis is suggestive of this disorder but does not meet strict criteria for the disease (ie, ectopically located endometrial glands and stroma). By comparison, does an asymptomatic woman have endometriosis if a random biopsy of her normal-appearing peritoneum finds endometrial glands and stroma? These questions address issues of both disease definition and what is normal.

Real-world considerations compel us to use practical definitions. Reference standards represent "the best we have" for distinguishing normal from abnormal. The reference standard is the test that thus far has been shown to most reliably detect the disease. Therefore, any new test that may purport to have value must be compared with the reference standard if we are to minimize the chance of misdiagnosis.

Defining normal — "Normal" is a deceptive term. While it is used commonly to refer to good health or the absence of disease, defining normal can be complex and arbitrary. Many tests define normal based upon assigned cutoff values that assume a fixed prevalence of disease. Intrauterine growth restriction (IUGR), for example, may be defined as an estimated fetal weight less than the 10th percentile, less than the 5th percentile, or more than two standard deviations below the mean. Such definitions may be convenient but clearly do not reflect the true prevalence of the disease in divergent populations.

In addition, the cutoff value may not accurately reflect the diseased condition. As an example, the concept of growth restriction implies a pathologic process resulting in failure to achieve the genetically programmed size. A neonate with a birth weight at the 12th percentile who has three older brothers whose birth weights were at the 90th percentile would be categorized as normal under the standard definitions described above, even though the neonate appears not to have reached its genetic potential. In contrast, an infant whose actual weight and true genetic potential were at the 4th percentile might be mislabeled as having IUGR.

WAS AN APPROPRIATE CUTOFF VALUE CHOSEN TO OPTIMIZE SENSITIVITY AND SPECIFICITY?

Balancing sensitivity and specificity — A cutoff value must be chosen to separate normal from abnormal. Selecting this value virtually always involves balancing sensitivity and specificity, although the actual value may be arbitrary.

●Sensitivity is the probability that an individual with the disease will test positive. It is the number of patients with a positive test who have the disease (true positives) divided by the number of all patients who have the disease. A test with high sensitivity will not miss many patients who have the disease (ie, low false-negative rate).

●Specificity is the probability that an individual without the disease will test negative. It is the number of patients who have a negative test and do not have the disease (true negatives) divided by the number of patients who do not have the disease. A test with high specificity will infrequently identify patients as having a disease when they do not (ie, low false-positive rate).

Two-by-two tables — A two-by-two table (table 2) is the simplest way to calculate sensitivity and specificity. However, understanding the interrelationship among sensitivity, specificity, and cutoff values is easiest in graphic form (figure 1).

Two-by-two tables can also be used for calculating the false-positive and false-negative rates:

●The false positive rate = false positives/(false positives + true negatives). It is also equal to 1 − specificity.

●The false negative rate = false negatives/(false negatives + true positives). It is also equal to 1 − sensitivity.
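The definitions above can be sketched directly from the four cells of a two-by-two table. The class name and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class TwoByTwo:
    tp: int  # disease present, test positive
    fp: int  # disease absent, test positive
    fn: int  # disease present, test negative
    tn: int  # disease absent, test negative

    def sensitivity(self):
        return self.tp / (self.tp + self.fn)

    def specificity(self):
        return self.tn / (self.tn + self.fp)

    def false_positive_rate(self):
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self):
        return self.fn / (self.fn + self.tp)

    def accuracy(self):
        # Proportion of all results (positive and negative) that are correct.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

# Hypothetical study: 100 diseased and 200 nondiseased patients.
table = TwoByTwo(tp=90, fp=30, fn=10, tn=170)
print(table.sensitivity(), table.specificity())
print(table.false_positive_rate(), table.false_negative_rate())
```

The false-positive rate equals 1 − specificity and the false-negative rate equals 1 − sensitivity, as stated above.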

An ideal test maximizes both sensitivity and specificity, thereby minimizing the false-positive and false-negative rates.

Receiver operating characteristic curves — Receiver operating characteristic (ROC) curves allow one to identify the cutoff value that minimizes both false positives and false negatives. An ROC curve plots sensitivity on the y axis and 1 − specificity on the x axis (figure 2). Applying a variety of cutoff values to the same reference population allows one to generate the curve. The perfect test would have a cutoff value that allowed an exact split of diseased and nondiseased populations (ie, a cutoff that gives both 100 percent sensitivity and 100 percent specificity). It would plot as a right angle with the vertex in the far upper left corner (x = 0, y = 1). This case, however, is very rare. For the vast majority of cases, as one moves from left to right on the ROC curve, the sensitivity increases while the specificity decreases.

Calculation of the area under the ROC curve allows comparison of different tests. A perfect test has an area under the curve equal to 1. Therefore, the closer the area under the curve is to 1, the better the test. Similarly, if one wants to know the cutoff value for a test that minimizes both false positives and false negatives (and hence maximizes both sensitivity and specificity), one would select the point on the ROC curve closest to the far upper left corner (x = 0, y = 1).
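The construction of an ROC curve, its area, and the point closest to the upper left corner can be sketched as follows. This is a minimal illustration, not a validated implementation: the test values and disease labels are hypothetical, a result is assumed to be positive when the value is at or above the cutoff, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity, cutoff) for each candidate cutoff."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0, float("inf"))]  # cutoff above every value: no positives
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos, cutoff))
    return points

def area_under_curve(points):
    """Trapezoidal area under the curve traced by the ROC points."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def closest_to_corner(points):
    """Point nearest (x = 0, y = 1), ie, nearest a perfect test."""
    return min(points, key=lambda p: p[0] ** 2 + (1 - p[1]) ** 2)

# Hypothetical test values; label 1 = diseased, 0 = nondiseased.
scores = [10, 20, 30, 35, 40, 45, 50, 60, 70, 80]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

points = roc_points(scores, labels)
auc = area_under_curve(points)
best = closest_to_corner(points)
print(f"AUC = {auc:.2f}, cutoff closest to corner = {best[2]}")
```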

However, finding the right balance between optimal sensitivity and specificity may not involve simultaneously minimizing false positives and false negatives in all situations. For example, when screening for a deadly disease that is curable, it may be desirable to accept more false positives (lower specificity) in return for fewer false negatives (higher sensitivity). ROC curves allow for more thorough evaluation of a test and potential cutoff values, but they are not the ultimate arbiters of how to set sensitivity and specificity.
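One way to make this trade-off explicit is to weight false negatives and false positives differently when comparing cutoffs. The sketch below is an assumption-laden illustration: the candidate cutoffs, their sensitivities and specificities, the prevalence, and the relative cost weights are all hypothetical, and the expected-cost formula is one simple choice among many.

```python
def expected_cost(sensitivity, specificity, prevalence, cost_fn, cost_fp):
    """Expected misclassification cost per person tested (arbitrary units)."""
    false_neg = prevalence * (1 - sensitivity)
    false_pos = (1 - prevalence) * (1 - specificity)
    return false_neg * cost_fn + false_pos * cost_fp

def best_cutoff(candidates, prevalence, cost_fn, cost_fp):
    """Pick the (cutoff, sensitivity, specificity) triple with the lowest cost."""
    return min(
        candidates,
        key=lambda c: expected_cost(c[1], c[2], prevalence, cost_fn, cost_fp),
    )

# Hypothetical operating points: (cutoff, sensitivity, specificity).
candidates = [(60, 0.60, 0.98), (50, 0.80, 0.90), (40, 0.95, 0.70)]

equal_weights = best_cutoff(candidates, prevalence=0.10, cost_fn=1, cost_fp=1)
missed_case_costly = best_cutoff(candidates, prevalence=0.10, cost_fn=20, cost_fp=1)
print(equal_weights[0], missed_case_costly[0])
```

When a missed case is weighted much more heavily than a false alarm, the preferred cutoff shifts toward higher sensitivity at the expense of specificity.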

WHAT ARE THE POSITIVE AND NEGATIVE LIKELIHOOD RATIOS?

Epidemiologists have devised another method by which to judge diagnostic tests: positive and negative likelihood ratios, which, like sensitivity and specificity, are independent of disease prevalence.

●The positive likelihood ratio = sensitivity/(1 − specificity). This ratio divides the probability that a patient with the disease will test positive by the probability that a patient without the disease will test positive. It can also be written as the true positive rate/false positive rate. Thus, the higher the positive likelihood ratio, the better the test (a perfect test has a positive likelihood ratio equal to infinity).

●The negative likelihood ratio = (1 − sensitivity)/specificity. This ratio divides the probability that a patient with the disease will test negative by the probability that a patient without the disease will test negative. It can also be written as the false negative rate/true negative rate. Therefore, the lower the negative likelihood ratio, the better the test (a perfect test has a negative likelihood ratio of 0).

In most instances, one can evaluate likelihood ratios as shown in the table (table 3). For example, suppose you were attempting to interpret the significance of a CA125 value of 80 in a 46-year-old woman with an ovarian cyst. If 70 percent of patients with ovarian cancer have a CA125 at this level, but 35 percent of patients with benign cysts have a CA125 at the same level, then the positive likelihood ratio would only be 2 (ie, 0.70/0.35). This would be considered a poor test for the diagnosis of cancer.
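The CA125 arithmetic above can be reproduced with a short sketch. It assumes the result is treated as dichotomous at this level, so that 70 percent plays the role of sensitivity and 35 percent the role of 1 − specificity; the function name is illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR) from sensitivity and specificity."""
    if specificity >= 1:
        lr_positive = float("inf")  # no false positives
    else:
        lr_positive = sensitivity / (1 - specificity)
    if specificity <= 0:
        lr_negative = float("inf")  # no true negatives
    else:
        lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

# 70 percent of cancers and 35 percent of benign cysts test positive at this level.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.70, specificity=0.65)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```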

Although likelihood ratios are independent of disease prevalence, their direct validity is only within the original study population. They are generalizable to other populations to the extent that:

●The test can be reliably performed with minimal interobserver and intraobserver variation

●The study population(s) from which the values were derived was adequate in size and composition of normal and diseased phenotypes

●An appropriate reference standard was used

If a diagnostic test was investigated in a narrow subpopulation or the test relied heavily on user skill/interpretation, then the sensitivity, specificity, and likelihood ratios reported in the study may not be generalizable outside of the original research population. In other words, the test performance parameters may have internal validity but not external validity.

HOW WELL DOES THE TEST PERFORM IN SPECIFIC POPULATIONS?

Disease prevalence — If the sensitivity, specificity, and likelihood ratios are well defined, the penultimate factor determining the utility of a test is disease prevalence (calculator 1 and calculator 2). The usefulness of a positive test decreases as disease prevalence decreases. This concept is the basis of predictive values or post-test probabilities.

●Positive predictive value (PPV) refers to the probability that a positive test correctly identifies an individual who actually has the disease. It is computed from two-by-two tables: true positives/(true positives + false positives) (table 4).

●Negative predictive value (NPV) refers to the probability that a negative test correctly identifies an individual who does not have the disease. It is computed from two-by-two tables: true negatives/(false negatives + true negatives) (table 4).

For example, assuming a constant sensitivity and specificity, the PPV and NPV for a disease with prevalence of 10, 1, or 0.1 percent are shown in a table (table 5). This example illustrates how a positive result from the same test with near-perfect sensitivity (99 percent) and high specificity (90 percent) may have completely different significance depending upon the baseline prevalence of disease in the population. When applied to a population in which the disease is common (prevalence = 10 percent), the PPV is approximately 52 percent. By comparison, when applied to a different population in which the disease is uncommon (prevalence = 0.1 percent), the PPV is only 1 percent; thus, 99 percent of all individuals who test positive are actually free of the disease. All that the test has accomplished in this population is to upgrade the probability of disease from extremely unlikely (0.1 percent) to very unlikely (1 percent) and, in the process, subjected numerous individuals without the disease to further testing. A second example, using a different combination of sensitivity, specificity, and prevalence, is illustrated in the figure (figure 3).
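The dependence of predictive values on prevalence can be checked with a short calculation. The sketch below uses the sensitivity (99 percent), specificity (90 percent), and prevalences from the example above; the function name is illustrative, and the computed values may differ slightly from rounded figures quoted in the text or table.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) as expected proportions in a population."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

for prevalence in (0.10, 0.01, 0.001):
    ppv, npv = predictive_values(0.99, 0.90, prevalence)
    print(f"prevalence {prevalence:.1%}: PPV {ppv:.1%}, NPV {npv:.2%}")
```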

A clinical example of the importance of prevalence on test utility is in fetal fibronectin testing for the prediction of preterm delivery. A systematic review reported the overall sensitivity and specificity of this test (in symptomatic and asymptomatic patients) for delivery before 34 weeks was 52 and 85 percent, respectively [2]. If the prevalence of preterm birth in an asymptomatic low-risk population is 10 percent, then the PPV of a positive fetal fibronectin result would be 28 percent, whereas in a high-risk symptomatic population with a prevalence of preterm birth of 50 percent, the PPV would be 78 percent.
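The same result can be reached through the positive likelihood ratio, which links this step to the previous one. The sketch below uses the odds form of Bayes' theorem (post-test odds = pre-test odds × likelihood ratio), a standard formulation that the text does not spell out, applied to the fetal fibronectin figures above. The function name is illustrative.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

sensitivity, specificity = 0.52, 0.85
lr_positive = sensitivity / (1 - specificity)

results = {}
for prevalence in (0.10, 0.50):
    results[prevalence] = post_test_probability(prevalence, lr_positive)
    print(f"prevalence {prevalence:.0%}: PPV {results[prevalence]:.0%}")
```

With prevalence as the pre-test probability, the post-test probability after a positive result is the PPV.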

WHAT IS THE BALANCE BETWEEN COST OF THE DISEASE AND COST OF THE TEST?

The final judgment involved in considering the value of a test is the balance between cost of the disease and cost of the test. These costs involve charges to an individual, to an insurer, to an institution, or to society. We live in a world with finite resources but burgeoning demand for better health care, more accurate tests, and rapid diagnosis. Cost is often the determinant in deciding when, where, and how a diagnostic test is utilized.

A society and its health care payers and providers might be willing to accept low positive predictive values (PPVs) in return for saved lives for a rare disease that is universally fatal but easily curable. By comparison, an accurate but extremely expensive test might be less desirable than one of lesser quality if the consequences of misdiagnosis are not serious.

Cost analysis involves direct monetary costs as well as all of the indirect costs of disease, testing, and misdiagnosis. Unfortunately, these costs are often rough estimates, which hampers the accuracy of this type of analysis. In addition, cost studies typically suffer from poor external validity; values used in the analysis may not be readily generalizable to other areas of the country, other health systems, or other countries. Finally, because markets are never static, costs often change, with the potential to alter or entirely invalidate the thrust of the analysis.

SUMMARY AND RECOMMENDATIONS — It is not an easy task to fully evaluate the utility of a diagnostic test; many variables must be considered. This topic review attempts to provide a framework from which any test may be objectively and systematically analyzed. The seven steps outlined need not be followed in exact order. For example, one may wish to consider cost and predictive values before delving deeper into the validity of applying the test to a specific population. Nevertheless, careful consideration of all seven questions is important in making a final decision regarding the utility of a diagnostic test.

●In order to determine the reliability of a test, it is helpful to assess the extent to which the test is accurate, precise, and user dependent. (See 'Can the test be performed reliably?' above.)

●A test should be evaluated on a broad spectrum of patients with and without the disorder in question to maximize generalizability. (See 'Was the test evaluated on an appropriate population?' above.)

●A reference standard allows unambiguous identification of diseased and nondiseased patients. However, in the real world, reference standards often involve some degree of error or user dependence. (See 'Was an appropriate reference standard used?' above.)

●A cutoff value must be chosen to separate normal from abnormal. Selecting this value virtually always involves balancing sensitivity and specificity, although the actual value may be arbitrary (figure 1). (See 'Was an appropriate cutoff value chosen to optimize sensitivity and specificity?' above.)

●Positive and negative likelihood ratios, like sensitivity and specificity, are independent of disease prevalence. In most instances, likelihood ratios can be evaluated across a range of possible test values, whereas sensitivity and specificity describe only whether a result falls above or below a single cutoff (table 3). (See 'What are the positive and negative likelihood ratios?' above.)

●If the sensitivity, specificity, and likelihood ratios are well defined, the penultimate factor determining the utility of a test is disease prevalence (calculator 1 and calculator 2). The usefulness of a positive test decreases as disease prevalence decreases. This concept is the basis of predictive values or post-test probabilities. (See 'How well does the test perform in specific populations?' above.)

●The final judgment involved in considering the value of a test is the balance between cost of the disease and cost of the test. These costs involve charges to an individual, to an insurer, to an institution, or to society. Cost is often the determinant in deciding when, where, and how a diagnostic test is utilized. (See 'What is the balance between cost of the disease and cost of the test?' above.)

Topic 2769 Version 21.0