# Related Articles

Summary

Receiver operating characteristic (ROC) curves can be used to assess the accuracy of tests measured on ordinal or continuous scales. The most commonly used measure of the overall accuracy of a diagnostic test is the area under the ROC curve (AUC). A gold standard test of the true disease status is required to estimate the AUC; however, a gold standard test may sometimes be too expensive or infeasible, so in many medical research studies the true disease status of the subjects remains unknown. Under the normality assumption on test results from each disease group of subjects, we propose a maximum likelihood-based procedure, using the expectation-maximization (EM) algorithm in conjunction with a bootstrap method, for constructing confidence intervals for the difference in paired areas under ROC curves in the absence of a gold standard test. Simulation results show that the proposed interval estimation procedure yields satisfactory coverage probabilities and interval lengths. The proposed method is illustrated with two examples.

doi:10.1002/sim.3661

PMCID: PMC2812057
PMID: 19691022

Area under the ROC curve; EM algorithm; bootstrap method; gold standard test; maximum likelihood estimation
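As a point of reference for the bootstrap component of the abstract above, the sketch below computes a percentile-bootstrap confidence interval for the difference in paired empirical (Mann-Whitney) AUCs. It is a simplified illustration that assumes the true disease status is known — precisely the assumption the paper's EM-based procedure is designed to avoid — and the function names are our own, not the paper's.

```python
import numpy as np

def empirical_auc(neg, pos):
    """Mann-Whitney estimate of the area under the ROC curve."""
    neg, pos = np.asarray(neg, float), np.asarray(pos, float)
    # Count concordant (diseased > non-diseased) pairs; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(neg) * len(pos))

def paired_auc_diff_ci(neg1, pos1, neg2, pos2, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC1 - AUC2 with paired tests:
    subjects are resampled jointly so the correlation between tests is kept."""
    rng = np.random.default_rng(seed)
    neg1, pos1 = np.asarray(neg1), np.asarray(pos1)
    neg2, pos2 = np.asarray(neg2), np.asarray(pos2)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, len(neg1), len(neg1))   # resample non-diseased
        j = rng.integers(0, len(pos1), len(pos1))   # resample diseased
        diffs[b] = (empirical_auc(neg1[i], pos1[j])
                    - empirical_auc(neg2[i], pos2[j]))
    point = empirical_auc(neg1, pos1) - empirical_auc(neg2, pos2)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```

Because both tests are measured on the same subjects, resampling subject indices (rather than the two samples separately) preserves the within-subject correlation that the paired design induces.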

SUMMARY

Receiver Operating Characteristic (ROC) curves are commonly used to summarize the classification accuracy of diagnostic tests. It is not uncommon in medical practice that multiple diagnostic tests are routinely performed or multiple disease markers are available for the same individuals. When the true disease status is verified by a gold standard test, a variety of methods have been proposed to combine such potentially correlated tests to increase the accuracy of disease diagnosis. In this article, we propose a method of combining multiple diagnostic tests in the absence of a gold standard. We assume that the test values and their classification accuracies are dependent on covariates. Simulation studies are performed to examine the performance of the combination method. The proposed method is applied to data from a population-based aging study to compare the accuracy of three screening tests for kidney function and to estimate the prevalence of moderate kidney impairment.

doi:10.1002/sim.4203

PMCID: PMC3107872
PMID: 21432889

Diagnostic test; Gold standard; Markov chain Monte-Carlo; Sensitivity; Specificity

In diagnostic medicine, estimating the diagnostic accuracy of a group of raters or medical tests relative to the gold standard is often the primary goal. When a gold standard is absent, latent class models where the unknown gold standard test is treated as a latent variable are often used. However, these models have been criticized in the literature from both a conceptual and a robustness perspective. As an alternative, we propose an approach where we exploit an imperfect reference standard with unknown diagnostic accuracy and conduct sensitivity analysis by varying this accuracy over scientifically reasonable ranges. In this article, a latent class model with crossed random effects is proposed for estimating the diagnostic accuracy of regional obstetrics and gynaecology (OB/GYN) physicians in diagnosing endometriosis. To avoid the pitfalls of models without a gold standard, we exploit the diagnostic results of a group of OB/GYN physicians with an international reputation for the diagnosis of endometriosis. We construct an ordinal reference standard based on the discordance among these international experts and propose a mechanism for conducting sensitivity analysis relative to the unknown diagnostic accuracy among them. A Monte-Carlo EM algorithm is proposed for parameter estimation and a BIC-type model selection procedure is presented. Through simulations and data analysis we show that this new approach provides a useful alternative to traditional latent class modeling approaches used in this setting.

doi:10.1111/j.1541-0420.2012.01789.x

PMCID: PMC3530625
PMID: 23006010

Diagnostic error; Imperfect tests; Prevalence; Sensitivity; Specificity; Model selection

Many applications of biomedical science involve unobservable constructs, from measurement of health states to severity of complex diseases. The primary aim of measurement is to identify relevant pieces of observable information that thoroughly describe the construct of interest. Validation of the construct is often performed separately. Noting the increasing popularity of latent variable methods in biomedical research, we propose a Multiple Indicator Multiple Cause (MIMIC) latent variable model that combines item reduction and validation. Our joint latent variable model accounts for the bias that occurs in the traditional 2-stage process. The methods are motivated by an example from the Physical Activity and Lymphedema clinical trial in which the objectives were to describe lymphedema severity through self-reported Likert scale symptoms and to determine the relationship between symptom severity and a “gold standard” diagnostic measure of lymphedema. The MIMIC model identified 1 symptom as a potential candidate for removal. We present this paper as an illustration of the advantages of joint latent variable models and as an example of the applicability of these models for biomedical research.

doi:10.1093/biostatistics/kxr018

PMCID: PMC3276271
PMID: 21775486

Factor analysis; Latent variable models; Lymphedema; Multiple Indicator Multiple Cause models

In occupational case–control studies, work-related exposure assessments are often fallible measures of the true underlying exposure. In lieu of a gold standard, more than two imperfect measurements (e.g. triads) are often used to assess exposure. While methods exist to assess diagnostic accuracy in the absence of a gold standard, these methods are infrequently used to correct for measurement error in exposure–disease associations in occupational case–control studies. Here, we present a likelihood-based approach that (a) provides evidence regarding whether the misclassification of tests is differential or nondifferential; (b) provides evidence regarding whether the misclassification of tests is independent or dependent conditional on latent exposure status; and (c) estimates the measurement error–corrected exposure–disease association. This approach uses information from all imperfect assessments simultaneously in a unified manner, which in turn can provide a more accurate estimate of the exposure–disease association than one based on individual assessments. The performance of this method is investigated through simulation studies and applied to the National Occupational Hazard Survey, a case–control study assessing the association between asbestos exposure and mesothelioma.

doi:10.1093/biostatistics/kxp015

PMCID: PMC2742494
PMID: 19515637

Case–control study; Gold standard; Missing data; Occupational exposure assessment

Limmathurotsakul, Direk | Turner, Elizabeth L. | Wuthiekanun, Vanaporn | Thaipadungpanit, Janjira | Suputtamongkol, Yupin | Chierakul, Wirongrong | Smythe, Lee D. | Day, Nicholas P. J. | Cooper, Ben | Peacock, Sharon J.
We hypothesized that the gold standard for diagnosing leptospirosis is imperfect. We used Bayesian latent class models and random-effects meta-analysis to test this hypothesis and to determine the true accuracy of a range of alternative tests for leptospirosis diagnosis.

Background. We observed that some patients with clinical leptospirosis supported by positive results of rapid tests were negative for leptospirosis on the basis of our diagnostic gold standard, which involves isolation of Leptospira species from blood culture and/or a positive result of a microscopic agglutination test (MAT). We hypothesized that our reference standard was imperfect and used statistical modeling to investigate this hypothesis.

Methods. Data for 1652 patients with suspected leptospirosis recruited during three observational studies and one randomized controlled trial that described the application of culture, MAT, immunofluorescence assay (IFA), lateral flow (LF), and/or PCR targeting the 16S rRNA gene were reevaluated using Bayesian latent class models and random-effects meta-analysis.

Results. The estimated sensitivities of culture alone, MAT alone, and culture plus MAT (for which the result was considered positive if one or both tests had a positive result) were 10.5% (95% credible interval [CrI], 2.7%–27.5%), 49.8% (95% CrI, 37.6%–60.8%), and 55.5% (95% CrI, 42.9%–67.7%), respectively. These low sensitivities were present across all 4 studies. The estimated specificity of MAT alone (and of culture plus MAT) was 98.8% (95% CrI, 92.8%–100.0%). The estimated sensitivities and specificities of PCR (52.7% [95% CrI, 45.2%–60.6%] and 97.2% [95% CrI, 92.0%–99.8%], respectively), lateral flow test (85.6% [95% CrI, 77.5%–93.2%] and 96.2% [95% CrI, 87.7%–99.8%], respectively), and immunofluorescence assay (45.5% [95% CrI, 33.3%–60.9%] and 96.8% [95% CrI, 92.8%–99.8%], respectively) were considerably different from estimates in which culture plus MAT was considered a perfect gold standard test.

Conclusions. Our findings show that culture plus MAT is an imperfect gold standard against which to compare alternative tests for the diagnosis of leptospirosis. Rapid point-of-care tests for this infection would bring an important improvement in patient care, but their future evaluation will require careful consideration of the reference test(s) used and the inclusion of appropriate statistical models.

doi:10.1093/cid/cis403

PMCID: PMC3393707
PMID: 22523263
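The Bayesian latent class analysis above has a simpler frequentist cousin: a conditional-independence latent class model for several binary tests, fit by EM. The sketch below is our own illustration of that underlying model, not the paper's MCMC implementation; it estimates prevalence and per-test sensitivity and specificity from test results alone, and is identifiable for three or more tests under the classic Hui-Walter conditions.

```python
import numpy as np

def latent_class_em(X, n_iter=500, tol=1e-8, seed=0):
    """EM for a 2-class latent class model of K conditionally independent
    binary tests (rows = subjects, columns = tests). Returns estimated
    prevalence, per-test sensitivity, per-test specificity."""
    X = np.asarray(X, float)
    n, K = X.shape
    rng = np.random.default_rng(seed)
    p = 0.5
    se = rng.uniform(0.6, 0.9, K)   # start above 0.5 to discourage label switching
    sp = rng.uniform(0.6, 0.9, K)
    for _ in range(n_iter):
        # E-step: posterior probability each subject is diseased
        l1 = np.prod(se**X * (1 - se)**(1 - X), axis=1) * p
        l0 = np.prod((1 - sp)**X * sp**(1 - X), axis=1) * (1 - p)
        w = l1 / (l1 + l0)
        # M-step: weighted updates of prevalence, Se, Sp
        p_new = w.mean()
        se_new = (w[:, None] * X).sum(0) / w.sum()
        sp_new = ((1 - w)[:, None] * (1 - X)).sum(0) / (1 - w).sum()
        converged = abs(p_new - p) < tol
        p, se, sp = p_new, se_new, sp_new
        if converged:
            break
    return p, se, sp
```

The conditional-independence assumption is exactly the modeling choice the literature debates: when tests share an error mechanism, estimates from this model can be badly biased, which is one motivation for the random-effects and meta-analytic extensions used in the paper.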

SUMMARY

Sensitivity, specificity, and positive and negative predictive value are typically used to quantify the accuracy of a binary screening test. In some studies it may not be ethical or feasible to obtain definitive disease ascertainment for all subjects using a gold standard test. When a gold standard test cannot be used, an imperfect reference test that is less than 100% sensitive and specific may be used instead. In breast cancer screening, for example, follow-up for cancer diagnosis is used as an imperfect reference test for women for whom it is not possible to obtain gold standard results. This incomplete ascertainment of true disease, or differential disease verification, can result in biased estimates of accuracy. In this paper, we derive the apparent accuracy values for studies subject to differential verification. We determine how the bias is affected by the accuracy of the imperfect reference test, the proportion of subjects who receive the imperfect reference test rather than the gold standard, the prevalence of the disease, and the correlation between the results of the screening test and the imperfect reference test. It is shown that designs with differential disease verification can yield biased estimates of accuracy. Estimates of sensitivity in cancer screening trials may be substantially biased. However, careful design decisions, including selection of the imperfect reference test, can help to minimize bias. A hypothetical breast cancer screening study is used to illustrate the problem.

doi:10.1002/sim.4232

PMCID: PMC3115446
PMID: 21495059

Bias; Predictive values; Screening; Sensitivity; Specificity
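The bias mechanism described above can be illustrated in its simplest special case: every subject is judged against a single imperfect reference, and the screening test and reference err independently given true disease status. The paper's setting (differential verification, correlated errors) is more general; the helper below is a hypothetical illustration of ours, not the paper's derivation.

```python
def apparent_accuracy(se, sp, se_ref, sp_ref, prev):
    """Apparent sensitivity/specificity of a test judged against an
    imperfect reference, assuming the test and reference err independently
    given true disease status (full verification, no correlation)."""
    # Joint cell probabilities: P(T, R) summed over true disease status D.
    tp = prev * se * se_ref + (1 - prev) * (1 - sp) * (1 - sp_ref)       # T+, R+
    fn = prev * (1 - se) * se_ref + (1 - prev) * sp * (1 - sp_ref)       # T-, R+
    tn = prev * (1 - se) * (1 - se_ref) + (1 - prev) * sp * sp_ref       # T-, R-
    fp = prev * se * (1 - se_ref) + (1 - prev) * (1 - sp) * sp_ref       # T+, R-
    app_se = tp / (tp + fn)   # P(T+ | R+)
    app_sp = tn / (tn + fp)   # P(T- | R-)
    return app_se, app_sp
```

With a perfect reference (se_ref = sp_ref = 1) the apparent values reduce to the true ones; any imperfection in the reference distorts them, and the size of the distortion depends on prevalence, as the paper emphasizes.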

Summary

The goal in diagnostic medicine is often to estimate the diagnostic accuracy of multiple experimental tests relative to a gold standard reference. When a gold standard reference is not available, investigators commonly use an imperfect reference standard. This paper proposes methodology for estimating the diagnostic accuracy of multiple binary tests with an imperfect reference standard when information about the diagnostic accuracy of the imperfect test is available from external data sources. We propose alternative joint models for characterizing the dependence between the experimental tests and discuss the use of these models for estimating individual-test sensitivity and specificity as well as prevalence and multivariate post-test probabilities (predictive values). We show using analytical and simulation techniques that, as long as the sensitivity and specificity of the imperfect test are high, inferences on diagnostic accuracy are robust to misspecification of the joint model. The methodology is demonstrated with a study examining the diagnostic accuracy of various HIV-antibody tests for HIV.

doi:10.1002/sim.3514

PMCID: PMC2754820
PMID: 19101935

diagnostic error; imperfect tests; latent class models; misclassification; predictive values; prevalence; sensitivity; specificity; diagnostic accuracy

Diagnostic tests commonly are characterized by their true positive (sensitivity) and true negative (specificity) classification rates, which rely on a single decision threshold to classify a test result as positive. A more complete description of test accuracy is given by the receiver operating characteristic (ROC) curve, a graph of the false positive and true positive rates obtained as the decision threshold is varied. A generalized regression methodology, which uses a class of ordinal regression models to estimate smoothed ROC curves, has been described. Data from a multi-institutional study comparing the accuracy of magnetic resonance (MR) imaging with computed tomography (CT) in detecting liver metastases, which are ideally suited for ROC regression analysis, are described. The general regression model is introduced, and an estimate for the area under the ROC curve and its standard error based on the parameters of the ordinal regression model is given. An analysis of the liver data that highlights the utility of the methodology in parsimoniously adjusting comparisons for covariates is presented.

PMCID: PMC1566538
PMID: 7851336
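The abstract does not give the model's exact parameterization, but the standard binormal ROC model — ROC(t) = Φ(a + bΦ⁻¹(t)) — yields the kind of closed-form, parameter-based AUC estimate it describes: AUC = Φ(a/√(1+b²)). A minimal sketch of that formula (our illustration, assuming the usual binormal conventions):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def binormal_auc(a, b):
    """AUC of a binormal ROC curve ROC(t) = Phi(a + b * Phi^{-1}(t)),
    where a = (mu1 - mu0)/sigma1 and b = sigma0/sigma1."""
    return norm_cdf(a / sqrt(1 + b * b))
```

A standard error for the AUC then follows from the delta method applied to the estimated (a, b) and their covariance matrix, which is the route such regression methodologies typically take.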

A gold standard is often an imperfect diagnostic test, falling short of 100% accuracy in clinical practice. Using an imperfect gold standard without fully comprehending its limitations and biases can lead to erroneous classification of patients with and without disease, ultimately affecting treatment decisions and patient outcomes. Therefore, validation is essential prior to implementation of a reference standard in practice. Performing a comprehensive validation process is discussed along with its advantages and challenges, and the different types of validation methods are reviewed. An example from our work in developing a new reference standard for vasospasm diagnosis in aneurysmal subarachnoid hemorrhage (A-SAH) patients is provided. Employing a new reference standard may result in a definitional shift of the disease and the classification scheme of patients. It is therefore important to also assess the impact of a new reference standard on patient outcomes and its clinical effectiveness.

doi:10.1016/j.acra.2010.05.021

PMCID: PMC2919497
PMID: 20692619

Studies that evaluate the accuracy of binary classification tools are needed. Such studies provide 2 × 2 cross-classifications of test outcomes against the categories of an unquestionable reference (or gold standard). However, sometimes a reference of suboptimal reliability is employed. Several methods have been proposed to deal with studies where the observations are cross-classified with an imperfect reference; these methods require that the status of the reference, as a gold standard or as an imperfect reference, is known. In this paper we propose a procedure for determining whether it is appropriate to maintain the assumption that the reference is a gold standard or an imperfect reference. This procedure fits two nested multinomial tree models, then assesses and compares their absolute and incremental fit. Its implementation requires the availability of results from several independent studies, carried out with similar designs, that provide frequencies of cross-classification between a test and the reference under investigation. The procedure is applied in two examples with real data.

doi:10.3389/fpsyg.2013.00694

PMCID: PMC3789284
PMID: 24106484

binary classification; gold standard; multinomial tree models; imperfect reference; diagnostic accuracy

We describe a general solution to the problem of determining diagnostic accuracy without the use of a perfect reference standard and in the presence of interpreter variability. The accuracy of a diagnostic test is typically determined by comparing its outcomes with those of an established reference standard. But the accuracy of the standard itself and those of the interpreters strongly influence such assessments. We use our solution to examine the effects of the properties of the standard, the reliability of the interpreters, and the prevalence of abnormality on the measured sensitivity and specificity. Our results provide a method of systematically adjusting the measured sensitivity and specificity in order to estimate their true values. The results are validated by simulations, and their detailed application to specific cases is described.

doi:10.1371/journal.pone.0052221

PMCID: PMC3530612
PMID: 23300619

Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.

doi:10.1093/biostatistics/kxr020

PMCID: PMC3276270
PMID: 21856650

Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
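The paper stratifies the verified sample on the estimated propensity score; a closely related and simpler correction weights each verified subject by the inverse of its verification probability. The sketch below illustrates that inverse-probability variant under the MAR assumption — it is our own simplification, not the stratified estimator proposed in the paper.

```python
import numpy as np

def ipw_sensitivity_specificity(test, verified, disease, p_verify):
    """Verification-bias-corrected Se/Sp by weighting each verified subject
    by the inverse of its verification probability (the propensity score).
    `disease` is only meaningful where verified is True."""
    test = np.asarray(test, float)
    v = np.asarray(verified, bool)
    d = np.asarray(disease, float)
    w = 1.0 / np.asarray(p_verify, float)
    # Weighted 2x2 counts over the verified subsample only
    tp = np.sum(w[v] * test[v] * d[v])
    fn = np.sum(w[v] * (1 - test[v]) * d[v])
    tn = np.sum(w[v] * (1 - test[v]) * (1 - d[v]))
    fp = np.sum(w[v] * test[v] * (1 - d[v]))
    return tp / (tp + fn), tn / (tn + fp)
```

Stratifying on the propensity score, as the paper does, trades this direct weighting for grouped comparisons within strata of homogeneous verification probability, which the authors show is more robust when the disease or verification models are misspecified.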

Background. In Traditional Chinese Medicine (TCM), most algorithms are used to solve problems of syndrome diagnosis that focus on only one syndrome, that is, single-label learning. However, in clinical practice, patients may simultaneously have more than one syndrome, each with its own symptoms (signs). Methods. We employed a multilabel learning approach using the relevant feature for each label (REAL) algorithm to construct a syndrome diagnostic model for chronic gastritis (CG) in TCM. REAL combines feature selection methods to select the significant symptoms (signs) of CG. The method was tested on 919 patients using the standard scale. Results. The highest prediction accuracy was achieved when 20 features were selected. The features selected with information gain were more consistent with TCM theory. The lowest average accuracy was 54%, using multilabel neural networks (BP-MLL), whereas the highest was 82%, using REAL to construct the diagnostic model. For coverage, hamming loss, and ranking loss, the values obtained using the REAL algorithm were the lowest, at 0.160, 0.142, and 0.177, respectively. Conclusion. REAL extracts the relevant symptoms (signs) for each syndrome and improves its recognition accuracy. Moreover, this study provides a reference for constructing syndrome diagnostic models and a guide for clinical practice.

doi:10.1155/2012/135387

PMCID: PMC3376946
PMID: 22719781

This study aims to assess the diagnostic accuracy of a single-vendor, commercially available CT perfusion (CTP) software package in predicting stroke. A retrospective analysis of patients presenting with stroke-like symptoms within 6 h who underwent CTP and diffusion-weighted imaging (DWI) was performed. Lesion maps, which overlay areas of computer-detected abnormally elevated mean transit time (MTT) and decreased cerebral blood volume (CBV), were assessed from a commercially available software package and compared to qualitative interpretation of color maps. Using DWI as the gold standard, parameters of diagnostic accuracy were calculated. Point biserial correlation was performed to assess the relationship of lesion size to a true positive result. Sixty-five patients (41 females and 24 males, age range 22–92 years, mean 57) were included in the study. Twenty-two (34%) had infarcts on DWI. Sensitivity (83 vs. 70%), specificity (21 vs. 69%), negative predictive value (77 vs. 84%), and positive predictive value (29 vs. 50%) for lesion maps were contrasted with qualitative interpretation of perfusion color maps, respectively. By using the lesion maps to exclude lesions detected qualitatively on color maps, specificity improved (80%). Point biserial correlation for computer-generated lesions (Rpb = 0.46, p < 0.0001) and lesions detected qualitatively (Rpb = 0.32, p = 0.0016) demonstrated a positive correlation between size and infarction. Seventy-three percent (p = 0.018) of lesions that demonstrated an increasing size from CBV, through cerebral blood flow, to MTT/time to peak were true positives. Used in isolation, computer-generated lesion maps in CTP provide limited diagnostic utility in predicting infarct, due to their inherently low specificity. However, when used in conjunction with qualitative perfusion color map assessment, the lesion maps can help improve specificity.

doi:10.1007/s10140-012-1102-8

PMCID: PMC3661911
PMID: 23322329

CT perfusion; Stroke; Diagnostic accuracy; CT perfusion software

Background:

Selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are compared using several medical examples. We present two clustering examples with ordinal variables, which are more challenging to analyze, using the Wisconsin Breast Cancer Data (WBCD).

Ordinal-to-Interval scale conversion example:

A breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests.

Results:

The sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable.

Conclusion:

By using an appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is attained. Moreover, descriptive and inferential statistics, as well as the modeling approach, must be selected based on the scale of the variables.

PMCID: PMC3963323
PMID: 24672565

Biostatistics; breast cancer; cluster analysis; data mining; research design

Objectives

To test the diagnostic accuracy of myocardial CT perfusion (CTP) imaging using color and gray scale image analysis.

Background

Current myocardial CTP techniques have varying diagnostic accuracy and are prone to artifacts that impair detection. This study evaluated the diagnostic accuracy of color and/or gray-scale CTP and the application of artifact criteria to detect hypoperfusion.

Methods

Fifty-nine prospectively enrolled patients with abnormal single photon emission computed tomography (SPECT) studies were analyzed. True hypoperfusion was defined as SPECT hypoperfusion corresponding to obstructive coronary stenoses on CT angiography (CTA). CTP applied color and gray scale myocardial perfusion maps to resting CTA images. Criteria for identifying artifacts were also applied during interpretation.

Results

Using combined SPECT plus CTA as the diagnostic standard, abnormal myocardial CTP was present in 33 (56%) patients, 19 suggesting infarction and 14 suggesting ischemia. Patient-level color and gray scale myocardial CTP sensitivity to detect infarction was 90%, with specificity of 80% and negative and positive predictive values of 94% and 68%. To detect ischemia or infarction, CTP specificity and positive predictive value were 92%, while sensitivity was 70%. Gray scale myocardial CTP had slightly lower specificity but similar sensitivity. Myocardial CTP artifacts were present in 88% of studies and were identified using our criteria.

Conclusions

Color and gray scale myocardial CTP using resting CTA images identified myocardial infarction with high sensitivity as well as infarction or ischemia with high specificity and positive predictive value without additional testing or radiation. Color and gray scale CTP had slightly better specificity than gray scale alone.

doi:10.1016/j.jcct.2011.10.006

PMCID: PMC3246505
PMID: 22146500

Coronary CT Angiography; Myocardial CT perfusion; Cardiac CT; Cardiac CT perfusion

Gut
2004;53(11):1652-1657.
Background/Aim: Although ultrasound (US) has proved to be useful in intestinal diseases, barium enteroclysis (BE) remains the gold standard technique for assessing patients with small bowel Crohn’s disease (CD). The ingestion of anechoic non-absorbable solutions has been recently proposed in order to distend intestinal loops and improve small bowel visualisation. The authors’ aim was to evaluate the accuracy of oral contrast US in finding CD lesions, assessing their extent within the bowel, and detecting luminal complications, compared with BE and ileocolonoscopy.

Methods: 102 consecutive patients with proven CD, having undergone complete x ray and endoscopic evaluation, were enrolled in the study. Each US examination, before and after the ingestion of a polyethylene glycol (PEG) solution (500–800 ml), was performed independently by two sonographers unaware of the results of other diagnostic procedures. The accuracy of conventional and contrast enhanced US in detecting CD lesions and luminal complications, as well as the extent of bowel involvement, were determined. Interobserver agreement between sonographers with both US techniques was also estimated.

Results: After oral contrast, satisfactory distension of the intestinal lumen was obtained in all patients, with a mean time to reach the terminal ileum of 31.4 (SD 10.9) minutes. Overall sensitivity of conventional and oral contrast US in detecting CD lesions were 91.4% and 96.1%, respectively. The correlation coefficient between US and x ray extent of ileal disease was r1 = 0.83 (p<0.001) before and r2 = 0.94 (p<0.001) after PEG ingestion; r1 versus r2 p<0.01. Sensitivity in detecting strictures was 74% for conventional US and 89% for contrast US. Overall interobserver agreement for bowel wall thickness and disease location within the small bowel was already good before but significantly improved after PEG ingestion.

Conclusions: Oral contrast bowel US is comparable with BE in defining the anatomic location and extension of CD and superior to conventional US in detecting luminal complications, while also reducing interobserver variability between sonographers. It may therefore be regarded as the first imaging procedure in the diagnostic work up and follow up of small intestine CD.

doi:10.1136/gut.2004.041038

PMCID: PMC1774299
PMID: 15479688

Crohn’s disease; conventional bowel ultrasound; oral contrast bowel ultrasound; barium enteroclysis; ileocolonoscopy

Background

To compare the diagnostic accuracy of two continuous screening tests, a common approach is to test the difference between the areas under the receiver operating characteristic (ROC) curves. After study participants are screened with both screening tests, the disease status is determined as accurately as possible, either by an invasive, sensitive and specific secondary test, or by a less invasive, but less sensitive approach. For most participants, disease status is approximated through the less sensitive approach. The invasive test must be limited to the fraction of the participants whose results on either or both screening tests exceed a threshold of suspicion, or who develop signs and symptoms of the disease after the initial screening tests.

The limitations of this study design lead to a bias in the ROC curves we call paired screening trial bias. This bias reflects the synergistic effects of inappropriate reference standard bias, differential verification bias, and partial verification bias. The absence of a gold reference standard leads to inappropriate reference standard bias. When different reference standards are used to ascertain disease status, it creates differential verification bias. When only suspicious screening test scores trigger a sensitive and specific secondary test, the result is a form of partial verification bias.

Methods

For paired screening tests with bivariate normally distributed scores, we give formulas and programs to quantify the effect of paired screening trial bias on a paired comparison of areas under the curves. We fix the prevalence of disease and the chance that a diseased subject manifests signs and symptoms. We derive the formulas for the true sensitivity and specificity, and those for the sensitivity and specificity observed by the study investigator.

Results

The observed area under the ROC curves is quite different from the true area under the ROC curves. The typical direction of the bias is a strong inflation in sensitivity, paired with a concomitant slight deflation of specificity.

Conclusion

In paired trials of screening tests, when area under the ROC curve is used as the metric, bias may lead researchers to make the wrong decision as to which screening test is better.

doi:10.1186/1471-2288-9-4

PMCID: PMC2657218
PMID: 19154609

In view of the lack of a quantifiable traditional Chinese medicine (TCM) pulse diagnostic model, a novel TCM pulse diagnostic model was introduced to quantify pulse diagnosis. Content validation was performed with a panel of TCM doctors. Criterion validation was tested with essential hypertension; the gold standard was brachial blood pressure measured by a sphygmomanometer. Two hundred and sixty subjects were recruited (139 in the normotensive group and 121 in the hypertensive group). A TCM doctor palpated pulses at the left and right cun, guan, and chi points, and quantified pulse qualities according to eight elements (depth, rate, regularity, width, length, smoothness, stiffness, and strength) on a visual analog scale. An artificial neural network was used to develop a pulse diagnostic model differentiating essential hypertension from normotension. Accuracy, specificity, and sensitivity were compared among various diagnostic models. About 80% accuracy was attained across all models; their specificity and sensitivity varied, ranging from 70% to nearly 90%. This suggests that the novel TCM pulse diagnostic model is valid in terms of its content and diagnostic ability.

doi:10.1155/2012/685094

PMCID: PMC3171770
PMID: 21918652

Background

The accuracy of computer-aided diagnosis (CAD) software is best evaluated by comparison to a gold standard which represents the true status of disease. In many settings, however, knowledge of the true status of disease is not possible and accuracy is evaluated against the interpretations of an expert panel. Common statistical approaches to evaluate accuracy include receiver operating characteristic (ROC) and kappa analysis but both of these methods have significant limitations and cannot answer the question of equivalence: Is the CAD performance equivalent to that of an expert? The goal of this study is to show the strength of log-linear analysis over standard ROC and kappa statistics in evaluating the accuracy of computer-aided diagnosis of renal obstruction compared to the diagnosis provided by expert readers.

Methods

Log-linear modeling was utilized to analyze a previously published database that used ROC and kappa statistics to compare diuresis renography scan interpretations (non-obstructed, equivocal, or obstructed) generated by a renal expert system (RENEX) in 185 kidneys (95 patients) with the independent and consensus scan interpretations of three experts who were blinded to clinical information and prospectively and independently graded each kidney as obstructed, equivocal, or non-obstructed.

Results

Log-linear modeling showed that RENEX and the expert consensus had beyond-chance agreement in both non-obstructed and obstructed readings (both p < 0.0001). Moreover, pairwise agreement between experts and pairwise agreement between each expert and RENEX were not significantly different (p = 0.41, 0.95, 0.81 for the non-obstructed, equivocal, and obstructed categories, respectively). Similarly, the three-way agreement of the three experts and the three-way agreement of two experts plus RENEX were not significantly different for the non-obstructed (p = 0.79) and obstructed (p = 0.49) categories.

Conclusion

Log-linear modeling showed that RENEX was equivalent to any expert in rating kidneys, particularly in the obstructed and non-obstructed categories. This conclusion, which could not be derived from the original ROC and kappa analysis, emphasizes and illustrates the role and importance of log-linear modeling in the absence of a gold standard. The log-linear analysis also provides additional evidence that RENEX has the potential to assist in the interpretation of diuresis renography studies.

doi:10.1186/2191-219X-1-5

PMCID: PMC3175375
PMID: 21935501

Log-linear modeling; Renal obstruction; Diuresis renography

Background

Minimal hepatic encephalopathy (MHE) reduces quality of life, increases the risk of road traffic incidents and predicts progression to overt hepatic encephalopathy and death. Current psychometry-based diagnostic methods are effective, but time-consuming and a universal ‘gold standard’ test has yet to be agreed upon. Critical Flicker Frequency (CFF) is a proposed language-independent diagnostic tool for MHE, but its accuracy has yet to be confirmed.

Aim

To assess the diagnostic accuracy of CFF for MHE by performing a systematic review and meta-analysis of all studies that report on the diagnostic accuracy of this test.

Methods

A systematic literature search was performed to locate all publications reporting on the diagnostic accuracy of CFF for MHE. Data were extracted from 2 × 2 tables or calculated from reported accuracy data. Collated data were meta-analysed for sensitivity, specificity, diagnostic odds ratio (DOR) and summary receiver operating characteristic (sROC) analysis. Prespecified subgroup analysis and meta-regression were also performed.

Results

Nine studies with data for 622 patients were included. Summary sensitivity was 61% (95% CI: 55–67), specificity 79% (95% CI: 75–83) and DOR 10.9 (95% CI: 4.2–28.3). A symmetrical sROC gave an area under the receiver operating characteristic curve of 0.84 (SE = 0.06). The heterogeneity of the DOR was 74%.
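For a single study, the sensitivity, specificity, and diagnostic odds ratio that are pooled above come directly from a 2 × 2 table; a minimal sketch with illustrative counts (not data from this meta-analysis):

```python
# Per-study sensitivity, specificity, and diagnostic odds ratio (DOR)
# from a 2x2 table; counts are illustrative, not from this meta-analysis.
tp, fn = 30, 20   # MHE present: test positive / test negative
fp, tn = 10, 40   # MHE absent:  test positive / test negative

sensitivity = tp / (tp + fn)        # 30/50 = 0.6
specificity = tn / (tn + fp)        # 40/50 = 0.8
dor = (tp * tn) / (fp * fn)         # (30*40)/(10*20) = 6.0

print(sensitivity, specificity, dor)  # 0.6 0.8 6.0
```

Note that a pooled DOR is estimated across studies and generally differs from the DOR implied by plugging pooled sensitivity and specificity into this formula.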

Conclusions

Critical Flicker Frequency has a high specificity and moderate sensitivity for diagnosing minimal hepatic encephalopathy. Given its advantages of language independence and simplicity of both administration and interpretation, we suggest using critical flicker frequency as an adjunct to (but not a replacement for) psychometric testing.

doi:10.1111/apt.12199

PMCID: PMC3761188
PMID: 23293917

van’t Hoog, Anna H. | Meme, Helen K. | Laserson, Kayla F. | Agaya, Janet A. | Muchiri, Benson G. | Githui, Willie A. | Odeny, Lazarus O. | Marston, Barbara J. | Borgdorff, Martien W. | Herrmann, Jean Louis
Background

We conducted a tuberculosis (TB) prevalence survey and evaluated the screening methods used in our survey, to assess if screening in TB prevalence surveys could be simplified, and to assess the accuracy of screening algorithms that may be applicable for active case finding.

Methods

All participants with a positive screen on a symptom questionnaire, chest radiography (CXR), and/or sputum smear microscopy submitted sputum for culture. HIV status was obtained from prevalent cases. We estimated the accuracy of modified screening strategies with bacteriologically confirmed TB as the gold standard, and compared these with other survey reports. We also assessed whether sequential rather than parallel application of symptom, CXR and HIV screening would substantially reduce the number of participants requiring CXR and/or sputum culture.

Results

Presence of any abnormality on CXR had 94% (95%CI 88–98) sensitivity (92% in HIV-infected and 100% in HIV-uninfected) and 73% (95%CI 68–77) specificity. Symptom screening combinations had significantly lower sensitivity than CXR except for ‘any TB symptom’, which had 90% (95%CI 84–95) sensitivity (96% in HIV-infected and 82% in HIV-uninfected) and 32% (95%CI 30–34) specificity. Smear microscopy did not yield additional suspects; thus the combined symptom/CXR screen applied in the survey had 100% (95%CI 97–100) sensitivity. Specificity was 65% (95%CI 61–68). Sequential application of a symptom screen for ‘any symptom’ first, followed by CXR evaluation and different suspect criteria depending on HIV status, would result in the largest reduction in the need for CXR and sputum culture, approximately 36%, but would underestimate prevalence by 11%.
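The sequential triage described above can be sketched as a pair of filter functions; the flags and suspect criteria here are hypothetical simplifications for illustration, not the survey's actual algorithm:

```python
# Sketch of sequential screening triage: symptom screen first, then CXR,
# with HIV-dependent suspect criteria. Flags and rules are hypothetical.
def needs_cxr_sequential(has_any_symptom: bool) -> bool:
    # Under the sequential strategy, CXR is only performed for
    # participants reporting any TB symptom.
    return has_any_symptom

def needs_culture(has_any_symptom: bool, cxr_abnormal: bool, hiv_positive: bool) -> bool:
    # Hypothetical suspect criteria differing by HIV status: HIV-positive
    # participants are referred on symptoms alone, HIV-negative ones
    # additionally need an abnormal CXR.
    if not has_any_symptom:
        return False
    return cxr_abnormal or hiv_positive

participants = [
    {"symptom": True,  "cxr": True,  "hiv": False},
    {"symptom": True,  "cxr": False, "hiv": True},
    {"symptom": False, "cxr": True,  "hiv": False},  # missed by the sequential screen
]
cxr_needed = sum(needs_cxr_sequential(p["symptom"]) for p in participants)
cultures = sum(needs_culture(p["symptom"], p["cxr"], p["hiv"]) for p in participants)
print(cxr_needed, cultures)  # 2 2
```

The third participant illustrates how a sequential strategy saves CXRs and cultures but can miss asymptomatic, CXR-positive cases, which is why it underestimates prevalence.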

Conclusion

CXR screening alone had higher accuracy than symptom screening alone. Combined CXR and symptom screening had the highest sensitivity and remains important for suspect identification in TB prevalence surveys in settings where bacteriological sputum examination of all participants is not feasible.

doi:10.1371/journal.pone.0038691

PMCID: PMC3391193
PMID: 22792158

Summary

Covariate-specific ROC curves are often used to evaluate the classification accuracy of a medical diagnostic test or a biomarker when the accuracy of the test is associated with certain covariates. In many large-scale screening tests, the gold standard is subject to missingness due to high cost or harmfulness to the patient. In this paper, we propose a semiparametric estimation of the covariate-specific ROC curves with a partially missing gold standard. A location-scale model is constructed for the test result to model the covariates’ effect, but the residual distributions are left unspecified. Thus the baseline and link functions of the ROC curve both have flexible shapes. Under the assumption that the gold standard is missing at random (MAR), we consider weighted estimating equations for the location-scale parameters, and weighted kernel estimating equations for the residual distributions. Three ROC curve estimators are proposed and compared, namely, imputation-based, inverse probability weighted, and doubly robust estimators. We derive the asymptotic normality of the estimated ROC curve, as well as the analytical form of the standard error estimator. The proposed method is motivated by and applied to data from an Alzheimer's disease research study.
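As a toy illustration of the inverse probability weighting idea only (not the paper's semiparametric, covariate-specific estimator), an IPW version of the empirical AUC reweights each verified subject by the inverse of an assumed verification probability; all names and values below are hypothetical:

```python
# Toy inverse-probability-weighted (IPW) empirical AUC under gold standard
# missingness; illustrative only, not the paper's semiparametric estimator.
def ipw_auc(scores, disease, verified, p_verify):
    """scores: test results; disease: gold standard (None if unverified);
    verified: verification flags; p_verify: assumed P(verified) per subject."""
    num = den = 0.0
    n = len(scores)
    for i in range(n):                      # i ranges over verified cases
        if not verified[i] or disease[i] != 1:
            continue
        wi = 1.0 / p_verify[i]
        for j in range(n):                  # j ranges over verified controls
            if not verified[j] or disease[j] != 0:
                continue
            w = wi / p_verify[j]            # pair weight 1/(p_i * p_j)
            den += w
            if scores[i] > scores[j]:       # concordant pair
                num += w
            elif scores[i] == scores[j]:    # tie counts half
                num += 0.5 * w
    return num / den

scores   = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
disease  = [1,   1,   None, 0,  0,  None]
verified = [True, True, False, True, True, False]
p_verify = [0.9, 0.8, 0.5, 0.7, 0.6, 0.5]
print(ipw_auc(scores, disease, verified, p_verify))  # 1.0: every case outscores every control
```

With unweighted pairs this reduces to the usual Mann–Whitney form of the empirical AUC; the IPW weights correct for verification bias when the MAR verification model is correctly specified.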

doi:10.1111/j.1541-0420.2011.01562.x

PMCID: PMC3596883
PMID: 21361890

Alzheimer's disease; covariate-specific ROC curve; ignorable missingness; verification bias; weighted estimating equations