1.  Interval Estimation for the Difference in Paired Areas under the ROC Curves in the Absence of a Gold Standard Test 
Statistics in Medicine  2009;28(25):3108-3123.
Summary
Receiver operating characteristic (ROC) curves can be used to assess the accuracy of tests measured on ordinal or continuous scales. The most commonly used measure for the overall diagnostic accuracy of diagnostic tests is the area under the ROC curve (AUC). A gold standard test on the true disease status is required to estimate the AUC. However, a gold standard test may sometimes be too expensive or infeasible. Therefore, in many medical research studies, the true disease status of the subjects may remain unknown. Under the normality assumption on test results from each disease group of subjects, using the expectation-maximization (EM) algorithm in conjunction with a bootstrap method, we propose a maximum likelihood based procedure for construction of confidence intervals for the difference in paired areas under ROC curves in the absence of a gold standard test. Simulation results show that the proposed interval estimation procedure yields satisfactory coverage probabilities and interval lengths. The proposed method is illustrated with two examples.
doi:10.1002/sim.3661
PMCID: PMC2812057  PMID: 19691022
Area under the ROC curve; EM algorithm; bootstrap method; gold standard test; maximum likelihood estimation
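The paper's EM machinery handles the missing gold standard; as a simpler illustration of the bootstrap step alone, here is a minimal sketch (not the authors' code) that assumes disease status is known and computes a percentile bootstrap confidence interval for the difference in paired empirical AUCs. All data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(x_diseased, x_healthy):
    """Empirical AUC: P(diseased score > healthy score), ties counted as 1/2."""
    d = x_diseased[:, None] - x_healthy[None, :]
    return (d > 0).mean() + 0.5 * (d == 0).mean()

def paired_auc_diff_ci(t1_d, t1_h, t2_d, t2_h, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for AUC(test 1) - AUC(test 2); subjects are
    resampled jointly so the pairing of the two tests is preserved."""
    nd, nh = len(t1_d), len(t1_h)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, nd, nd)   # resample diseased subjects
        j = rng.integers(0, nh, nh)   # resample healthy subjects
        diffs[b] = auc(t1_d[i], t1_h[j]) - auc(t2_d[i], t2_h[j])
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

# Toy paired data: two tests measured on the same subjects
t1_d, t1_h = rng.normal(1.5, 1, 80), rng.normal(0, 1, 120)
t2_d, t2_h = rng.normal(1.0, 1, 80), rng.normal(0, 1, 120)
print(paired_auc_diff_ci(t1_d, t1_h, t2_d, t2_h))
```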
2.  Combining multiple continuous tests for the diagnosis of kidney impairment in the absence of a gold standard 
Statistics in Medicine  2011;30(14):1712-1721.
Summary
Receiver Operating Characteristic (ROC) curves are commonly used to summarize the classification accuracy of diagnostic tests. It is not uncommon in medical practice that multiple diagnostic tests are routinely performed or multiple disease markers are available for the same individuals. When the true disease status is verified by a gold standard test, a variety of methods have been proposed to combine such potentially correlated tests to increase the accuracy of disease diagnosis. In this article, we propose a method of combining multiple diagnostic tests in the absence of a gold standard. We assume that the test values and their classification accuracies are dependent on covariates. Simulation studies are performed to examine the performance of the combination method. The proposed method is applied to data from a population-based aging study to compare the accuracy of three screening tests for kidney function and to estimate the prevalence of moderate kidney impairment.
doi:10.1002/sim.4203
PMCID: PMC3107872  PMID: 21432889
Diagnostic test; Gold standard; Markov chain Monte-Carlo; Sensitivity; Specificity
3.  Estimating diagnostic accuracy of raters without a gold standard by exploiting a group of experts 
Biometrics  2012;68(4):1294-1302.
In diagnostic medicine, estimating the diagnostic accuracy of a group of raters or medical tests relative to the gold standard is often the primary goal. When a gold standard is absent, latent class models, where the unknown gold standard test is treated as a latent variable, are often used. However, these models have been criticized in the literature from both a conceptual and a robustness perspective. As an alternative, we propose an approach where we exploit an imperfect reference standard with unknown diagnostic accuracy and conduct sensitivity analysis by varying this accuracy over scientifically reasonable ranges. In this article, a latent class model with crossed random effects is proposed for estimating the diagnostic accuracy of regional obstetrics and gynaecology (OB/GYN) physicians in diagnosing endometriosis. To avoid the pitfalls of models without a gold standard, we exploit the diagnostic results of a group of OB/GYN physicians with an international reputation for the diagnosis of endometriosis. We construct an ordinal reference standard based on the discordance among these international experts and propose a mechanism for conducting sensitivity analysis relative to the unknown diagnostic accuracy among them. A Monte-Carlo EM algorithm is proposed for parameter estimation and a BIC-type model selection procedure is presented. Through simulations and data analysis, we show that this new approach provides a useful alternative to traditional latent class modeling approaches used in this setting.
doi:10.1111/j.1541-0420.2012.01789.x
PMCID: PMC3530625  PMID: 23006010
Diagnostic error; Imperfect tests; Prevalence; Sensitivity; Specificity; Model selection
4.  A joint latent variable model approach to item reduction and validation 
Many applications of biomedical science involve unobservable constructs, from measurement of health states to severity of complex diseases. The primary aim of measurement is to identify relevant pieces of observable information that thoroughly describe the construct of interest. Validation of the construct is often performed separately. Noting the increasing popularity of latent variable methods in biomedical research, we propose a Multiple Indicator Multiple Cause (MIMIC) latent variable model that combines item reduction and validation. Our joint latent variable model accounts for the bias that occurs in the traditional 2-stage process. The methods are motivated by an example from the Physical Activity and Lymphedema clinical trial in which the objectives were to describe lymphedema severity through self-reported Likert scale symptoms and to determine the relationship between symptom severity and a “gold standard” diagnostic measure of lymphedema. The MIMIC model identified 1 symptom as a potential candidate for removal. We present this paper as an illustration of the advantages of joint latent variable models and as an example of the applicability of these models for biomedical research.
doi:10.1093/biostatistics/kxr018
PMCID: PMC3276271  PMID: 21775486
Factor analysis; Latent variable models; Lymphedema; Multiple Indicator Multiple Cause models
5.  Estimation and inference for case–control studies with multiple non–gold standard exposure assessments: with an occupational health application 
Biostatistics (Oxford, England)  2009;10(4):591-602.
In occupational case–control studies, work-related exposure assessments are often fallible measures of the true underlying exposure. In lieu of a gold standard, often more than 2 imperfect measurements (e.g. triads) are used to assess exposure. While methods exist to assess the diagnostic accuracy in the absence of a gold standard, these methods are infrequently used to correct for measurement error in exposure–disease associations in occupational case–control studies. Here, we present a likelihood-based approach that (a) provides evidence regarding whether the misclassification of tests is differential or nondifferential; (b) provides evidence regarding whether the misclassification of tests is independent or dependent conditional on latent exposure status; and (c) estimates the measurement error–corrected exposure–disease association. These approaches use information from all imperfect assessments simultaneously in a unified manner, which in turn can provide a more accurate estimate of the exposure–disease association than one based on individual assessments. The performance of this method is investigated through simulation studies and applied to the National Occupational Hazard Survey, a case–control study assessing the association between asbestos exposure and mesothelioma.
doi:10.1093/biostatistics/kxp015
PMCID: PMC2742494  PMID: 19515637
Case–control study; Gold standard; Missing data; Occupational exposure assessment
6.  Fool's Gold: Why Imperfect Reference Tests Are Undermining the Evaluation of Novel Diagnostics: A Reevaluation of 5 Diagnostic Tests for Leptospirosis 
We hypothesized that the gold standard for diagnosing leptospirosis is imperfect. We used Bayesian latent class models and random-effects meta-analysis to test this hypothesis and to determine the true accuracy of a range of alternative tests for leptospirosis diagnosis.
Background. We observed that some patients with clinical leptospirosis supported by positive results of rapid tests were negative for leptospirosis on the basis of our diagnostic gold standard, which involves isolation of Leptospira species from blood culture and/or a positive result of a microscopic agglutination test (MAT). We hypothesized that our reference standard was imperfect and used statistical modeling to investigate this hypothesis.
Methods. Data for 1652 patients with suspected leptospirosis, recruited during three observational studies and one randomized controlled trial that described the application of culture, MAT, immunofluorescence assay (IFA), lateral flow (LF), and/or PCR targeting the 16S rRNA gene, were reevaluated using Bayesian latent class models and random-effects meta-analysis.
Results. The estimated sensitivities of culture alone, MAT alone, and culture plus MAT (for which the result was considered positive if one or both tests had a positive result) were 10.5% (95% credible interval [CrI], 2.7%–27.5%), 49.8% (95% CrI, 37.6%–60.8%), and 55.5% (95% CrI, 42.9%–67.7%), respectively. These low sensitivities were present across all 4 studies. The estimated specificity of MAT alone (and of culture plus MAT) was 98.8% (95% CrI, 92.8%–100.0%). The estimated sensitivities and specificities of PCR (52.7% [95% CrI, 45.2%–60.6%] and 97.2% [95% CrI, 92.0%–99.8%], respectively), lateral flow test (85.6% [95% CrI, 77.5%–93.2%] and 96.2% [95% CrI, 87.7%–99.8%], respectively), and immunofluorescence assay (45.5% [95% CrI, 33.3%–60.9%] and 96.8% [95% CrI, 92.8%–99.8%], respectively) were considerably different from estimates in which culture plus MAT was considered a perfect gold standard test.
Conclusions. Our findings show that culture plus MAT is an imperfect gold standard against which to compare alternative tests for the diagnosis of leptospirosis. Rapid point-of-care tests for this infection would bring an important improvement in patient care, but their future evaluation will require careful consideration of the reference test(s) used and the inclusion of appropriate statistical models.
doi:10.1093/cid/cis403
PMCID: PMC3393707  PMID: 22523263
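For readers unfamiliar with latent class analysis, the sketch below is a simplified frequentist analogue of the Bayesian models used here (not the authors' code): a two-class latent class model for several conditionally independent binary tests, fitted by EM on simulated data. Identifiability requires at least three tests, and results can depend on initialization (label switching).

```python
import numpy as np

def latent_class_em(X, n_iter=500, tol=1e-8, seed=1):
    """EM for a 2-class latent class model with K conditionally independent
    binary tests. X: (n, K) 0/1 array. Returns (prevalence, Se, Sp)."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    pi = 0.3                              # initial prevalence guess
    se = rng.uniform(0.6, 0.9, K)         # P(T=1 | D=1)
    sp = rng.uniform(0.6, 0.9, K)         # P(T=0 | D=0)
    for _ in range(n_iter):
        # E-step: posterior P(D=1 | test pattern) under conditional independence
        p1 = pi * np.prod(se ** X * (1 - se) ** (1 - X), axis=1)
        p0 = (1 - pi) * np.prod((1 - sp) ** X * sp ** (1 - X), axis=1)
        w = p1 / (p1 + p0)
        # M-step: weighted prevalence, sensitivities, specificities
        pi_new = w.mean()
        se = (w[:, None] * X).sum(0) / w.sum()
        sp = ((1 - w)[:, None] * (1 - X)).sum(0) / (1 - w).sum()
        if abs(pi_new - pi) < tol:
            pi = pi_new
            break
        pi = pi_new
    return pi, se, sp

# Simulated data from a known truth (three tests, prevalence 0.25)
rng = np.random.default_rng(7)
D = rng.random(2000) < 0.25
true_se, true_sp = np.array([0.55, 0.85, 0.50]), np.array([0.99, 0.96, 0.97])
X = np.where(D[:, None], rng.random((2000, 3)) < true_se,
             rng.random((2000, 3)) >= true_sp).astype(int)
print(latent_class_em(X))
```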
7.  Bias in estimating accuracy of a binary screening test with differential disease verification 
Statistics in Medicine  2011;30(15):1852-1864.
Summary
Sensitivity, specificity, and positive and negative predictive values are typically used to quantify the accuracy of a binary screening test. In some studies it may not be ethical or feasible to obtain definitive disease ascertainment for all subjects using a gold standard test. When a gold standard test cannot be used, an imperfect reference test that is less than 100% sensitive and specific may be used instead. In breast cancer screening, for example, follow-up for cancer diagnosis is used as an imperfect reference test for women for whom it is not possible to obtain gold standard results. This incomplete ascertainment of true disease, or differential disease verification, can result in biased estimates of accuracy. In this paper, we derive the apparent accuracy values for studies subject to differential verification. We determine how the bias is affected by the accuracy of the imperfect reference test, the percentage of subjects who receive the imperfect reference test rather than the gold standard, the prevalence of the disease, and the correlation between the results of the screening test and the imperfect reference test. It is shown that designs with differential disease verification can yield biased estimates of accuracy. Estimates of sensitivity in cancer screening trials may be substantially biased. However, careful design decisions, including selection of the imperfect reference test, can help to minimize bias. A hypothetical breast cancer screening study is used to illustrate the problem.
doi:10.1002/sim.4232
PMCID: PMC3115446  PMID: 21495059
Bias; Predictive values; Screening; Sensitivity; Specificity
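The direction and size of this kind of bias are easy to compute in the simplest case. Below is a minimal sketch of the classical apparent-accuracy formulas, assuming the screening test and the imperfect reference err independently given true disease status (the paper also treats correlated errors):

```python
def apparent_accuracy(se_t, sp_t, se_r, sp_r, prev):
    """Apparent sensitivity/specificity of a test T judged against an
    imperfect reference R, assuming T and R err independently given
    the true disease status."""
    # P(T+, R+) and P(R+)
    tp = prev * se_t * se_r + (1 - prev) * (1 - sp_t) * (1 - sp_r)
    rp = prev * se_r + (1 - prev) * (1 - sp_r)
    # P(T-, R-) and P(R-)
    tn = prev * (1 - se_t) * (1 - se_r) + (1 - prev) * sp_t * sp_r
    rn = 1 - rp
    return tp / rp, tn / rn  # apparent Se = P(T+|R+), apparent Sp = P(T-|R-)

# A perfect test (Se = Sp = 1) judged against a slightly imperfect reference:
# its apparent sensitivity drops to about 0.84 at 10% prevalence.
print(apparent_accuracy(1.0, 1.0, 0.95, 0.98, prev=0.10))
```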
8.  Estimating diagnostic accuracy of multiple binary tests with an imperfect reference standard
Statistics in medicine  2009;28(5):780-797.
Summary
The goal in diagnostic medicine is often to estimate the diagnostic accuracy of multiple experimental tests relative to a gold standard reference. When a gold standard reference is not available, investigators commonly use an imperfect reference standard. This paper proposes methodology for estimating the diagnostic accuracy of multiple binary tests with an imperfect reference standard when information about the diagnostic accuracy of the imperfect test is available from external data sources. We propose alternative joint models for characterizing the dependence between the experimental tests and discuss the use of these models for estimating individual-test sensitivity and specificity as well as prevalence and multivariate post-test probabilities (predictive values). We show using analytical and simulation techniques that, as long as the sensitivity and specificity of the imperfect test are high, inferences on diagnostic accuracy are robust to misspecification of the joint model. The methodology is demonstrated with a study examining the diagnostic accuracy of various HIV-antibody tests for HIV.
doi:10.1002/sim.3514
PMCID: PMC2754820  PMID: 19101935
diagnostic error; imperfect tests; latent class models; misclassification; predictive values; prevalence; sensitivity; specificity; diagnostic accuracy
9.  ROC curve regression analysis: the use of ordinal regression models for diagnostic test assessment. 
Environmental Health Perspectives  1994;102(Suppl 8):73-78.
Diagnostic tests commonly are characterized by their true positive (sensitivity) and true negative (specificity) classification rates, which rely on a single decision threshold to classify a test result as positive. A more complete description of test accuracy is given by the receiver operating characteristic (ROC) curve, a graph of the false positive and true positive rates obtained as the decision threshold is varied. A generalized regression methodology, which uses a class of ordinal regression models to estimate smoothed ROC curves, is described. Data from a multi-institutional study comparing the accuracy of magnetic resonance (MR) imaging with computed tomography (CT) in detecting liver metastases, which are ideally suited for ROC regression analysis, are described. The general regression model is introduced, and an estimate for the area under the ROC curve and its standard error based on the parameters of the ordinal regression model is given. An analysis of the liver data that highlights the utility of the methodology in parsimoniously adjusting comparisons for covariates is presented.
PMCID: PMC1566538  PMID: 7851336
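As a worked illustration of the binormal model underlying such ordinal-regression ROC methods, the sketch below evaluates a smoothed ROC curve and its AUC from a hypothetical intercept a and slope b (the standard error derivation in the paper is omitted):

```python
from math import sqrt

import numpy as np
from scipy.stats import norm

def binormal_roc(a, b, t):
    """Smoothed binormal ROC: TPR at FPR = t is Phi(a + b * Phi^-1(t))."""
    return norm.cdf(a + b * norm.ppf(t))

def binormal_auc(a, b):
    """Area under the binormal ROC curve: Phi(a / sqrt(1 + b^2))."""
    return norm.cdf(a / sqrt(1 + b ** 2))

a, b = 1.2, 0.9                      # hypothetical fitted parameters
fpr = np.linspace(0.01, 0.99, 5)
print(binormal_roc(a, b, fpr))
print(binormal_auc(a, b))
```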
10.  Developing a New Reference Standard… Is Validation Necessary? 
Academic Radiology  2010;17(9):1079-1082.
A gold standard is often an imperfect diagnostic test, falling short of achieving 100% accuracy in clinical practice. Using an imperfect gold standard without fully comprehending its limitations and biases can lead to erroneous classification of patients with and without disease. This will ultimately affect treatment decisions and patient outcomes. Therefore, validation is essential prior to implementation of the reference standard into practice. Performing a comprehensive validation process is discussed, along with its advantages and challenges. The different types of validation methods are reviewed. An example from our work in developing a new reference standard for vasospasm diagnosis in aneurysmal subarachnoid hemorrhage (A-SAH) patients is provided. Employing a new reference standard may result in a definitional shift of the disease and the classification scheme of patients. It is therefore important to also assess the impact of a new reference standard on patient outcomes and its clinical effectiveness.
doi:10.1016/j.acra.2010.05.021
PMCID: PMC2919497  PMID: 20692619
11.  Multinomial tree models for assessing the status of the reference in studies of the accuracy of tools for binary classification 
Studies that evaluate the accuracy of binary classification tools are needed. Such studies provide 2 × 2 cross-classifications of test outcomes and the categories according to an unquestionable reference (or gold standard). However, a reference of suboptimal reliability is sometimes employed. Several methods have been proposed to deal with studies where the observations are cross-classified with an imperfect reference. These methods require that the status of the reference, as a gold standard or as an imperfect reference, is known. In this paper, a procedure is proposed for determining whether it is appropriate to maintain the assumption that the reference is a gold standard or an imperfect reference. This procedure fits two nested multinomial tree models, and assesses and compares their absolute and incremental fit. Its implementation requires the availability of the results of several independent studies, which should be carried out using similar designs to provide frequencies of cross-classification between a test and the reference under investigation. The procedure is applied in two examples with real data.
doi:10.3389/fpsyg.2013.00694
PMCID: PMC3789284  PMID: 24106484
binary classification; gold standard; multinomial tree models; imperfect reference; diagnostic accuracy
12.  Assessing Diagnostic Tests: How to Correct for the Combined Effects of Interpretation and Reference Standard 
PLoS ONE  2012;7(12):e52221.
We describe a general solution to the problem of determining diagnostic accuracy without the use of a perfect reference standard and in the presence of interpreter variability. The accuracy of a diagnostic test is typically determined by comparing its outcomes with those of an established reference standard, but the accuracy of the standard itself and the reliability of the interpreters strongly influence such assessments. We use our solution to examine the effects of the properties of the standard, the reliability of the interpreters, and the prevalence of abnormality on the measured sensitivity and specificity. Our results provide a method of systematically adjusting the measured sensitivity and specificity in order to estimate their true values. The results are validated by simulations, and their detailed application to specific cases is described.
doi:10.1371/journal.pone.0052221
PMCID: PMC3530612  PMID: 23300619
13.  A robust method using propensity score stratification for correcting verification bias for binary tests 
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
doi:10.1093/biostatistics/kxr020
PMCID: PMC3276270  PMID: 21856650
Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
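A minimal sketch of the stratification idea follows (not the authors' estimator): the verification propensity is fitted with a hypothetical logistic model, separately within test-positives and test-negatives as the abstract describes, and the disease rate among verified subjects in each stratum is imputed to everyone in that stratum. It assumes each stratum contains at least one verified subject.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ps_stratified_se_sp(T, X, V, D, n_strata=5):
    """Verification-bias-corrected Se/Sp via propensity-score stratification.
    T: test result (0/1); X: (n, p) covariates; V: verified (0/1);
    D: true disease status, only meaningful where V == 1."""
    p_dis = np.empty(len(T))
    for t in (0, 1):
        m = T == t
        ps = LogisticRegression().fit(X[m], V[m]).predict_proba(X[m])[:, 1]
        # strata with (roughly) homogeneous verification propensity
        edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
        strata = np.clip(np.searchsorted(edges, ps, side="right") - 1,
                         0, n_strata - 1)
        rate = np.empty(len(ps))
        for s in range(n_strata):
            in_s = strata == s
            ver = in_s & (V[m] == 1)
            rate[in_s] = D[m][ver].mean()  # disease rate among verified here
        p_dis[m] = rate
    se = p_dis[T == 1].sum() / p_dis.sum()            # est. P(T+ | D+)
    sp = (1 - p_dis)[T == 0].sum() / (1 - p_dis).sum()  # est. P(T- | D-)
    return se, sp
```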
14.  Application of Multilabel Learning Using the Relevant Feature for Each Label in Chronic Gastritis Syndrome Diagnosis 
Background. In Traditional Chinese Medicine (TCM), most algorithms for syndrome diagnosis focus on only one syndrome, that is, single-label learning. In clinical practice, however, patients may simultaneously have more than one syndrome, each with its own symptoms (signs). Methods. We employed a multilabel learning using the relevant feature for each label (REAL) algorithm to construct a syndrome diagnostic model for chronic gastritis (CG) in TCM. REAL combines feature selection methods to select the significant symptoms (signs) of CG. The method was tested on 919 patients using the standard scale. Results. The highest prediction accuracy was achieved when 20 features were selected. The features selected with information gain were more consistent with TCM theory. The lowest average accuracy was 54% using multilabel neural networks (BP-MLL), whereas the highest was 82% using REAL to construct the diagnostic model. For coverage, hamming loss, and ranking loss, the values obtained using the REAL algorithm were the lowest at 0.160, 0.142, and 0.177, respectively. Conclusion. REAL extracts the relevant symptoms (signs) for each syndrome and improves its recognition accuracy. Moreover, this study provides a reference for constructing syndrome diagnostic models and a guide for clinical practice.
doi:10.1155/2012/135387
PMCID: PMC3376946  PMID: 22719781
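The REAL algorithm itself is specific to the paper, but the evaluation metrics it reports (hamming loss, coverage, ranking loss) are standard multilabel measures. A toy illustration with scikit-learn, using made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import hamming_loss, coverage_error, label_ranking_loss

# Toy multilabel ground truth (rows: patients, columns: syndromes)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
scores = np.array([[0.9, 0.2, 0.6],   # classifier confidence per syndrome
                   [0.1, 0.8, 0.3],
                   [0.7, 0.4, 0.6]])
Y_pred = (scores >= 0.5).astype(int)  # threshold confidences into labels

print("hamming loss:", hamming_loss(Y_true, Y_pred))
print("coverage:", coverage_error(Y_true, scores))
print("ranking loss:", label_ranking_loss(Y_true, scores))
```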
15.  Accuracy of CT cerebral perfusion in predicting infarct in the emergency department: lesion characterization on CT perfusion based on commercially available software 
Emergency Radiology  2013;20(3):203-212.
This study aims to assess the diagnostic accuracy of a single vendor's commercially available CT perfusion (CTP) software in predicting stroke. A retrospective analysis of patients presenting with stroke-like symptoms within 6 h who underwent CTP and diffusion-weighted imaging (DWI) was performed. Lesion maps, which overlay areas of computer-detected abnormally elevated mean transit time (MTT) and decreased cerebral blood volume (CBV), were assessed from a commercially available software package and compared to qualitative interpretation of color maps. Using DWI as the gold standard, parameters of diagnostic accuracy were calculated. Point biserial correlation was performed to assess the relationship of lesion size to a true-positive result. Sixty-five patients (41 females and 24 males, age range 22–92 years, mean 57) were included in the study. Twenty-two (34%) had infarcts on DWI. Sensitivity (83 vs. 70%), specificity (21 vs. 69%), negative predictive value (77 vs. 84%), and positive predictive value (29 vs. 50%) for lesion maps were contrasted with qualitative interpretation of perfusion color maps, respectively. By using the lesion maps to exclude lesions detected qualitatively on color maps, specificity improved (80%). Point biserial correlation for computer-generated lesions (Rpb = 0.46, p < 0.0001) and lesions detected qualitatively (Rpb = 0.32, p = 0.0016) demonstrated a positive correlation between size and infarction. Seventy-three percent (p = 0.018) of lesions that demonstrated increasing size from CBV, to cerebral blood flow, to MTT/time to peak were true positives. Used in isolation, computer-generated lesion maps in CTP provide limited diagnostic utility in predicting infarct, due to their inherently low specificity. However, when used in conjunction with qualitative perfusion color map assessment, the lesion maps can help improve specificity.
doi:10.1007/s10140-012-1102-8
PMCID: PMC3661911  PMID: 23322329
CT perfusion; Stroke; Diagnostic accuracy; CT perfusion software
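Point biserial correlation, used above to relate lesion size to infarction, is simply the Pearson correlation between a binary and a continuous variable; scipy provides it directly. A toy sketch with made-up data:

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(3)
infarct = rng.integers(0, 2, 65)                        # 1 = infarct on DWI (toy)
lesion_size = 10 + 8 * infarct + rng.normal(0, 5, 65)   # larger when infarcted
r_pb, p_value = pointbiserialr(infarct, lesion_size)
print(r_pb, p_value)
```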
16.  Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies 
Background:
Selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are reviewed using several medical examples. We present two clustering examples with ordinal variables, a more challenging variable type to analyze, using the Wisconsin Breast Cancer Data (WBCD).
Ordinal-to-Interval scale conversion example:
A breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed with two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests.
Results:
The sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable.
Conclusion:
By using a clustering algorithm appropriate to the measurement scale of the variables in the study, high performance can be achieved. Moreover, descriptive and inferential statistics, as well as the modeling approach, must be selected based on the scale of the variables.
PMCID: PMC3963323  PMID: 24672565
Biostatistics; breast cancer; cluster analysis; data mining; research design
17.  Myocardial Hypo-enhancement on Resting Computed Tomography Angiography Images Accurately Identifies Myocardial Hypoperfusion 
Objectives
To test the diagnostic accuracy of myocardial CT perfusion (CTP) imaging using color and gray scale image analysis.
Background
Current myocardial CTP techniques have varying diagnostic accuracy and are prone to artifacts that impair detection. This study evaluated the diagnostic accuracy of color and/or gray-scale CTP and the application of artifact criteria to detect hypoperfusion.
Methods
Fifty-nine prospectively enrolled patients with abnormal single photon emission computed tomography (SPECT) studies were analyzed. True hypoperfusion was defined as SPECT hypoperfusion corresponding to obstructive coronary stenoses on CT angiography (CTA). CTP applied color and gray scale myocardial perfusion maps to resting CTA images. Criteria for identifying artifacts were also applied during interpretation.
Results
Using combined SPECT plus CTA as the diagnostic standard, abnormal myocardial CTP was present in 33 (56%) patients, 19 suggesting infarction and 14 suggesting ischemia. Patient-level color and gray scale myocardial CTP sensitivity to detect infarction was 90%, with a specificity of 80% and negative and positive predictive values of 94% and 68%, respectively. To detect ischemia or infarction, CTP specificity and positive predictive value were 92%, while sensitivity was 70%. Gray scale myocardial CTP had slightly lower specificity but similar sensitivity. Myocardial CTP artifacts were present in 88% of studies and were identified using our criteria.
Conclusions
Color and gray scale myocardial CTP using resting CTA images identified myocardial infarction with high sensitivity as well as infarction or ischemia with high specificity and positive predictive value without additional testing or radiation. Color and gray scale CTP had slightly better specificity than gray scale alone.
doi:10.1016/j.jcct.2011.10.006
PMCID: PMC3246505  PMID: 22146500
Coronary CT Angiography; Myocardial CT perfusion; Cardiac CT; Cardiac CT perfusion
18.  Oral contrast enhanced bowel ultrasonography in the assessment of small intestine Crohn’s disease. A prospective comparison with conventional ultrasound, x ray studies, and ileocolonoscopy 
Gut  2004;53(11):1652-1657.
Background/Aim: Although ultrasound (US) has proved to be useful in intestinal diseases, barium enteroclysis (BE) remains the gold standard technique for assessing patients with small bowel Crohn’s disease (CD). The ingestion of anechoic non-absorbable solutions has been recently proposed in order to distend intestinal loops and improve small bowel visualisation. The authors’ aim was to evaluate the accuracy of oral contrast US in finding CD lesions, assessing their extent within the bowel, and detecting luminal complications, compared with BE and ileocolonoscopy.
Methods: 102 consecutive patients with proven CD, having undergone complete x ray and endoscopic evaluation, were enrolled in the study. Each US examination, before and after the ingestion of a polyethylene glycol (PEG) solution (500–800 ml), was performed independently by two sonographers unaware of the results of other diagnostic procedures. The accuracy of conventional and contrast enhanced US in detecting CD lesions and luminal complications, as well as the extent of bowel involvement, were determined. Interobserver agreement between sonographers with both US techniques was also estimated.
Results: After oral contrast, satisfactory distension of the intestinal lumen was obtained in all patients, with a mean time to reach the terminal ileum of 31.4 (SD 10.9) minutes. Overall sensitivities of conventional and oral contrast US in detecting CD lesions were 91.4% and 96.1%, respectively. The correlation coefficient between US and x ray extent of ileal disease was r1 = 0.83 (p<0.001) before and r2 = 0.94 (p<0.001) after PEG ingestion; r1 versus r2 p<0.01. Sensitivity in detecting strictures was 74% for conventional US and 89% for contrast US. Overall interobserver agreement for bowel wall thickness and disease location within the small bowel was already good before and improved significantly after PEG ingestion.
Conclusions: Oral contrast bowel US is comparable with BE in defining the anatomic location and extension of CD and superior to conventional US in detecting luminal complications, as well as in reducing interobserver variability between sonographers. It may therefore be regarded as the first imaging procedure in the diagnostic work up and follow up of small intestine CD.
doi:10.1136/gut.2004.041038
PMCID: PMC1774299  PMID: 15479688
Crohn’s disease; conventional bowel ultrasound; oral contrast bowel ultrasound; barium enteroclysis; ileocolonoscopy
19.  Bias in trials comparing paired continuous tests can cause researchers to choose the wrong screening modality 
Background
To compare the diagnostic accuracy of two continuous screening tests, a common approach is to test the difference between the areas under the receiver operating characteristic (ROC) curves. After study participants are screened with both screening tests, the disease status is determined as accurately as possible, either by an invasive, sensitive and specific secondary test, or by a less invasive, but less sensitive approach. For most participants, disease status is approximated through the less sensitive approach. The invasive test must be limited to the fraction of the participants whose results on either or both screening tests exceed a threshold of suspicion, or who develop signs and symptoms of the disease after the initial screening tests.
The limitations of this study design lead to a bias in the ROC curves we call paired screening trial bias. This bias reflects the synergistic effects of inappropriate reference standard bias, differential verification bias, and partial verification bias. The absence of a gold reference standard leads to inappropriate reference standard bias. When different reference standards are used to ascertain disease status, it creates differential verification bias. When only suspicious screening test scores trigger a sensitive and specific secondary test, the result is a form of partial verification bias.
Methods
For paired screening tests with bivariate normally distributed scores, we give formulae and programs to quantify the effect of paired screening trial bias on a paired comparison of area under the curves. We fix the prevalence of disease, and the chance a diseased subject manifests signs and symptoms. We derive the formulas for true sensitivity and specificity, and those for the sensitivity and specificity observed by the study investigator.
Results
The observed area under the ROC curves is quite different from the true area under the ROC curves. The typical direction of the bias is a strong inflation in sensitivity, paired with a concomitant slight deflation of specificity.
Conclusion
In paired trials of screening tests, when area under the ROC curve is used as the metric, bias may lead researchers to make the wrong decision as to which screening test is better.
doi:10.1186/1471-2288-9-4
PMCID: PMC2657218  PMID: 19154609
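A small simulation makes the mechanism concrete. The sketch below is a single-test simplification of the paired design, with hypothetical parameters: a perfect gold standard is applied only to suspicious scores, an imperfect follow-up reference to everyone else. It reproduces the inflation in observed sensitivity the authors describe; the specificity effect depends on the design details.

```python
import numpy as np

rng = np.random.default_rng(42)
n, prev = 100_000, 0.05
mu_d, thresh = 1.5, 1.0        # diseased score shift; suspicion threshold
fu_sens = 0.6                  # follow-up (imperfect reference) sensitivity

D = rng.random(n) < prev
x = rng.normal(0, 1, n) + mu_d * D          # screening test score

# Differential verification: suspicious scores get the gold standard,
# everyone else gets imperfect follow-up (assumed perfectly specific here)
gold = x > thresh
label = np.where(gold, D, D & (rng.random(n) < fu_sens))

true_se = (x[D] > thresh).mean()
obs_se = (x[label] > thresh).mean()          # diseased missed by follow-up
true_sp = (x[~D] <= thresh).mean()           # are mislabeled healthy
obs_sp = (x[~label] <= thresh).mean()
print(f"sensitivity: true {true_se:.3f}  observed {obs_se:.3f}")
print(f"specificity: true {true_sp:.3f}  observed {obs_sp:.3f}")
```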
20.  Validation of a Novel Traditional Chinese Medicine Pulse Diagnostic Model Using an Artificial Neural Network 
In view of the lack of a quantifiable traditional Chinese medicine (TCM) pulse diagnostic model, a novel model was introduced to quantify pulse diagnosis. Content validation was performed with a panel of TCM doctors. Criterion validation was tested with essential hypertension; the gold standard was brachial blood pressure measured by a sphygmomanometer. Two hundred and sixty subjects were recruited (139 in the normotensive group and 121 in the hypertensive group). A TCM doctor palpated pulses at the left and right cun, guan, and chi points, and quantified pulse qualities according to eight elements (depth, rate, regularity, width, length, smoothness, stiffness, and strength) on a visual analog scale. An artificial neural network was used to develop a pulse diagnostic model differentiating essential hypertension from normotension. Accuracy, specificity, and sensitivity were compared among various diagnostic models. About 80% accuracy was attained among all models; their specificity and sensitivity varied, ranging from 70% to nearly 90%. These results suggest that the novel TCM pulse diagnostic model was valid in terms of its content and diagnostic ability.
doi:10.1155/2012/685094
PMCID: PMC3171770  PMID: 21918652
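As a rough illustration of the modeling step (not the authors' network or data), a small scikit-learn multilayer perceptron trained on eight hypothetical pulse-element features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
# Toy stand-in for the 8 pulse elements (depth, rate, regularity, width,
# length, smoothness, stiffness, strength) scored on a visual analog scale
n = 260
y = (np.arange(n) < 121).astype(int)             # 1 = hypertensive group
X = rng.normal(0, 1, (n, 8)) + 0.8 * y[:, None]  # hypothetical group shift

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # accuracy, cf. ~80% reported
```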
21.  Computer-aided diagnosis of renal obstruction: utility of log-linear modeling versus standard ROC and kappa analysis 
EJNMMI Research  2011;1(5):1-8.
Background
The accuracy of computer-aided diagnosis (CAD) software is best evaluated by comparison to a gold standard which represents the true status of disease. In many settings, however, knowledge of the true status of disease is not possible and accuracy is evaluated against the interpretations of an expert panel. Common statistical approaches to evaluate accuracy include receiver operating characteristic (ROC) and kappa analysis but both of these methods have significant limitations and cannot answer the question of equivalence: Is the CAD performance equivalent to that of an expert? The goal of this study is to show the strength of log-linear analysis over standard ROC and kappa statistics in evaluating the accuracy of computer-aided diagnosis of renal obstruction compared to the diagnosis provided by expert readers.
Methods
Log-linear modeling was utilized to analyze a previously published database that used ROC and kappa statistics to compare diuresis renography scan interpretations (non-obstructed, equivocal, or obstructed) generated by a renal expert system (RENEX) in 185 kidneys (95 patients) with the independent and consensus scan interpretations of three experts who were blinded to clinical information and prospectively and independently graded each kidney as obstructed, equivocal, or non-obstructed.
Results
Log-linear modeling showed that RENEX and the expert consensus had beyond-chance agreement in both non-obstructed and obstructed readings (both p < 0.0001). Moreover, pairwise agreement between experts and pairwise agreement between each expert and RENEX were not significantly different (p = 0.41, 0.95, 0.81 for the non-obstructed, equivocal, and obstructed categories, respectively). Similarly, the three-way agreement of the three experts and three-way agreement of two experts and RENEX was not significantly different for non-obstructed (p = 0.79) and obstructed (p = 0.49) categories.
Conclusion
Log-linear modeling showed that RENEX was equivalent to any expert in rating kidneys, particularly in the obstructed and non-obstructed categories. This conclusion, which could not be derived from the original ROC and kappa analysis, emphasizes and illustrates the role and importance of log-linear modeling in the absence of a gold standard. The log-linear analysis also provides additional evidence that RENEX has the potential to assist in the interpretation of diuresis renography studies.
doi:10.1186/2191-219X-1-5
PMCID: PMC3175375  PMID: 21935501
Log-linear modeling; Renal obstruction; Diuresis renography
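A minimal sketch of an agreement-type log-linear model (quasi-independence with category-specific diagonal terms), fitted as a Poisson GLM to a hypothetical 3 × 3 RENEX-versus-expert table; the diagonal coefficients capture beyond-chance agreement per category:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical 3x3 table: RENEX rating (rows) vs expert rating (columns),
# categories: non-obstructed, equivocal, obstructed
counts = np.array([[90, 8, 2],
                   [10, 15, 6],
                   [3, 7, 44]]).ravel()

rows, cols = np.divmod(np.arange(9), 3)
X = np.column_stack([
    (rows == 1), (rows == 2),        # row main effects (category 0 baseline)
    (cols == 1), (cols == 2),        # column main effects
    (rows == 0) & (cols == 0),       # beyond-chance agreement, per category
    (rows == 1) & (cols == 1),
    (rows == 2) & (cols == 2),
]).astype(float)
X = sm.add_constant(X)

fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(fit.summary())   # positive diagonal terms indicate beyond-chance agreement
```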
22.  Meta-analysis: the diagnostic accuracy of critical flicker frequency in minimal hepatic encephalopathy
Background
Minimal hepatic encephalopathy (MHE) reduces quality of life, increases the risk of road traffic incidents, and predicts progression to overt hepatic encephalopathy and death. Current psychometry-based diagnostic methods are effective but time-consuming, and a universal ‘gold standard’ test has yet to be agreed upon. Critical Flicker Frequency (CFF) is a proposed language-independent diagnostic tool for MHE, but its accuracy has yet to be confirmed.
Aim
To assess the diagnostic accuracy of CFF for MHE by performing a systematic review and meta-analysis of all studies, which report on the diagnostic accuracy of this test.
Methods
A systematic literature search was performed to locate all publications reporting on the diagnostic accuracy of CFF for MHE. Data were extracted from 2 × 2 tables or calculated from reported accuracy data. Collated data were meta-analysed for sensitivity, specificity, diagnostic odds ratio (DOR) and summary receiver operator curve (sROC) analysis. Prespecified subgroup analysis and meta-regression were also performed.
Results
Nine studies with data for 622 patients were included. Summary sensitivity was 61% (95% CI: 55–67), specificity 79% (95% CI: 75–83) and DOR 10.9 (95% CI: 4.2–28.3). A symmetrical sROC gave an area under the receiver operator curve of 0.84 (SE = 0.06). The heterogeneity of the DOR was 74%.
Conclusions
Critical Flicker Frequency has a high specificity and moderate sensitivity for diagnosing minimal hepatic encephalopathy. Given the advantages of language independence and being both simple to perform and interpret, we suggest the use of critical flicker frequency as an adjunct (but not replacement) to psychometric testing.
doi:10.1111/apt.12199
PMCID: PMC3761188  PMID: 23293917
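The pooling step of such a meta-analysis is compact. A minimal sketch of DerSimonian-Laird random-effects pooling of log diagnostic odds ratios, with toy study counts rather than the paper's data:

```python
import numpy as np

def pooled_dor(tp, fp, fn, tn):
    """DerSimonian-Laird random-effects pooling of log diagnostic odds
    ratios across studies (0.5 continuity correction). Returns the pooled
    DOR, its 95% CI, and the I^2 heterogeneity statistic."""
    tp, fp, fn, tn = (np.asarray(a, float) + 0.5 for a in (tp, fp, fn, tn))
    y = np.log(tp * tn / (fp * fn))          # per-study log DOR
    v = 1 / tp + 1 / fp + 1 / fn + 1 / tn    # approximate variances
    w = 1 / v
    q = (w * (y - (w * y).sum() / w.sum()) ** 2).sum()   # Cochran's Q
    df = len(y) - 1
    tau2 = max(0.0, (q - df) / (w.sum() - (w ** 2).sum() / w.sum()))
    w_star = 1 / (v + tau2)                  # random-effects weights
    est = (w_star * y).sum() / w_star.sum()
    se = (1 / w_star.sum()) ** 0.5
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return np.exp(est), np.exp(est - 1.96 * se), np.exp(est + 1.96 * se), i2

# Toy data for three studies: (TP, FP, FN, TN)
print(pooled_dor([30, 45, 12], [10, 20, 8], [15, 25, 9], [80, 110, 60]))
```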
23.  Screening Strategies for Tuberculosis Prevalence Surveys: The Value of Chest Radiography and Symptoms
PLoS ONE  2012;7(7):e38691.
Background
We conducted a tuberculosis (TB) prevalence survey and evaluated the screening methods used in our survey, to assess if screening in TB prevalence surveys could be simplified, and to assess the accuracy of screening algorithms that may be applicable for active case finding.
Methods
All participants with a positive screen on either a symptom questionnaire, chest radiography (CXR) and/or sputum smear microscopy submitted sputum for culture. HIV status was obtained from prevalent cases. We estimated the accuracy of modified screening strategies with bacteriologically confirmed TB as the gold standard, and compared these with other survey reports. We also assessed whether sequential rather than parallel application of symptom, CXR and HIV screening would substantially reduce the number of participants requiring CXR and/or sputum culture.
Results
Presence of any abnormality on CXR had 94% (95%CI 88–98) sensitivity (92% in HIV-infected and 100% in HIV-uninfected) and 73% (95%CI 68–77) specificity. Symptom screening combinations had significantly lower sensitivity than CXR except for ‘any TB symptom’ which had 90% (95%CI 84–95) sensitivity (96% in HIV-infected and 82% in HIV-uninfected) and 32% (95%CI 30–34) specificity. Smear microscopy did not yield additional suspects, thus the combined symptom/CXR screen applied in the survey had 100% (95%CI 97–100) sensitivity. Specificity was 65% (95%CI 61–68). Sequential application of first a symptom screen for ‘any symptom’, followed by CXR-evaluation and different suspect criteria depending on HIV status would result in the largest reduction of the need for CXR and sputum culture, approximately 36%, but would underestimate prevalence by 11%.
Conclusion
CXR screening alone had higher accuracy compared to symptom screening alone. Combined CXR and symptom screening had the highest sensitivity and remains important for suspect identification in TB prevalence surveys in settings where bacteriological sputum examination of all participants is not feasible.
doi:10.1371/journal.pone.0038691
PMCID: PMC3391193  PMID: 22792158
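Under an idealized conditional-independence assumption (which the reported numbers suggest is violated in practice, since smear microscopy added no suspects), the accuracy of parallel and serial screen combinations follows from elementary probability. A minimal sketch using the survey's approximate single-screen values and a hypothetical prevalence:

```python
def combine_screens(se1, sp1, se2, sp2, prev):
    """Accuracy of combined screening under conditional independence.
    Parallel: positive if either screen is positive (as in the survey).
    Serial: second screen applied only to first-screen positives."""
    par = (1 - (1 - se1) * (1 - se2), sp1 * sp2)
    ser = (se1 * se2, 1 - (1 - sp1) * (1 - sp2))
    frac_second = prev * se1 + (1 - prev) * (1 - sp1)  # serial screen-2 workload
    return par, ser, frac_second

# 'Any TB symptom' (Se 0.90, Sp 0.32) combined with CXR (Se 0.94, Sp 0.73)
print(combine_screens(0.90, 0.32, 0.94, 0.73, prev=0.01))
```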
24.  Semiparametric estimation of the covariate-specific ROC curve in the presence of ignorable verification bias
Biometrics  2011;67(3):906-916.
Summary
Covariate-specific ROC curves are often used to evaluate the classification accuracy of a medical diagnostic test or a biomarker when the accuracy of the test is associated with certain covariates. In many large-scale screening tests, the gold standard is subject to missingness due to high cost or harmfulness to the patient. In this paper, we propose a semiparametric estimation of the covariate-specific ROC curves with a partially missing gold standard. A location-scale model is constructed for the test result to model the covariates’ effect, but the residual distributions are left unspecified. Thus the baseline and link functions of the ROC curve both have flexible shapes. Under the assumption that the gold standard is missing at random (MAR), we consider weighted estimating equations for the location-scale parameters, and weighted kernel estimating equations for the residual distributions. Three ROC curve estimators are proposed and compared, namely, imputation-based, inverse probability weighted, and doubly robust estimators. We derive the asymptotic normality of the estimated ROC curve, as well as the analytical form of the standard error estimator. The proposed method is motivated by and applied to data from an Alzheimer's disease research study.
doi:10.1111/j.1541-0420.2011.01562.x
PMCID: PMC3596883  PMID: 21361890
Alzheimer's disease; covariate-specific ROC curve; ignorable missingness; verification bias; weighted estimating equations
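A minimal sketch of the inverse probability weighting piece alone (the paper's doubly robust estimator is more involved): verified subjects are weighted by the inverse of a hypothetical logistic model for verification, giving an IPW empirical AUC under the MAR assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_auc(score, X, V, D):
    """IPW empirical AUC when the gold standard D is observed only for
    verified subjects (V == 1), assumed MAR given the test score and
    covariates X (shape (n, p))."""
    feats = np.column_stack([score, X])
    pi = LogisticRegression().fit(feats, V).predict_proba(feats)[:, 1]
    w = (1 / pi)[V == 1]                 # inverse verification probabilities
    s, d = score[V == 1], D[V == 1].astype(bool)
    diff = s[d][:, None] - s[~d][None, :]
    ww = w[d][:, None] * w[~d][None, :]  # pair weights (diseased x healthy)
    return (ww * ((diff > 0) + 0.5 * (diff == 0))).sum() / ww.sum()
```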
