Assessing agreement in method comparison studies depends on two fundamentally important components: validity (the between method agreement) and reproducibility (the within method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in medical literature for assessing between method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to the lack of flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data.
Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates).
We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented: (i) assuming static underlying values, analysed using both multivariate hierarchical Bayesian models, and (ii) assuming each subject's underlying value is a continually changing quantity, analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model.
These easily implemented models allow for full parameter uncertainty, simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses is also presented, provided in the freely available and currently cost free software package WinBUGS.
An ingestible telemetric temperature sensor for measuring body core temperature (Tc) was first described 45 years ago, although the method has only recently gained widespread use for exercise applications. This review aims to (1) use Bland and Altman's limits of agreement (LoA) method as a basis for quantitatively reviewing the agreement between intestinal sensor temperature (Tintestinal), oesophageal temperature (Toesophageal) and rectal temperature (Trectal) across numerous previously published validation studies; (2) review factors that may affect agreement; and (3) review the application of this technology in field‐based exercise studies. The agreement between Tintestinal and Toesophageal is suggested to meet our delimitation for an acceptable level of agreement (ie, systematic bias <0.1°C and 95% LoA within ±0.4°C). The agreement between Tintestinal and Trectal shows a significant systematic bias >0.1°C, although the 95% LoA is acceptable. Tintestinal responds less rapidly than Toesophageal at the start or cessation of exercise or to a change in exercise intensity, but more rapidly than Trectal. When using this technology, care should be taken to ensure adequate control over sensor calibration and data correction, timing of ingestion and electromagnetic interference. The ingestible sensor has been applied successfully in numerous sport and occupational applications such as the continuous measurement of Tc in deep sea saturation divers, distance runners and soldiers undertaking sustained military training exercises. It is concluded that the ingestible telemetric temperature sensor represents a valid index of Tc and shows excellent utility for ambulatory field‐based applications.
Objective(s): Reliability measures precision or the extent to which test results can be replicated. This is the first ever systematic review to identify statistical methods used to measure reliability of equipment measuring continuous variables. This study also aims to highlight inappropriate statistical methods used in reliability analysis and their implications for medical practice.
Materials and Methods: In 2010, five electronic databases were searched between 2007 and 2009 to look for reliability studies. A total of 5,795 titles were initially identified. Only 282 titles were potentially related, and finally 42 fitted the inclusion criteria.
Results: The Intra-class Correlation Coefficient (ICC) is the most popular method, with 25 (60%) studies having used it, followed by comparison of means (8, or 19%). Out of 25 studies using the ICC, only 7 (28%) reported the confidence intervals and types of ICC used. Most studies (71%) also tested the agreement of instruments.
Conclusion: This study finds that the Intra-class Correlation Coefficient is the most popular method used to assess the reliability of medical instruments measuring continuous outcomes. There are also inappropriate applications and interpretations of statistical methods in some studies. It is important for medical researchers to be aware of this issue, and be able to correctly perform analysis in reliability studies.
ICC; Intra-class correlation coefficient; Reliability; Statistical method; Validation study
Few commercially available brands of actigraphs (ACT) have been subjected to rigorous validation with infant participants. The purpose of this study was to examine the agreement between concurrent polysomnography (PSG) and one brand of ACT (AW-64, Mitter Co. Inc.) using appropriate statistical techniques among a sample of healthy infants.
Twenty-two healthy infants (14.1 ± 0.6 months) had one night of ankle ACT recording during research PSG at Kosair Children's Hospital Sleep Research Center in Louisville, Kentucky. Macroanalyses were conducted using the Bland-Altman concordance technique to assess agreement between total sleep time (TST) and wake after sleep onset (WASO) simultaneously measured by PSG and ACT, using two ACT algorithm settings. Microanalyses were also calculated to examine sensitivity, specificity, and accuracy of ACT within each PSG-identified sleep state. Correlations were calculated between PSG-identified arousals and the discrepancies between ACT and PSG.
The Bland-Altman concordance technique revealed that ACT underestimated TST by 72.25 (SD = 61.48) minutes and by ≥ 60 minutes among 54.55% of infants. Furthermore, ACT overestimated WASO by 13.85 (SD = 30.94) minutes and by ≥ 30 minutes among 40.91% of infants. Sensitivity, specificity, and accuracy analyses revealed that ACT adequately identified sleep, but poorly identified wake. PSG and ACT discrepancies were positively associated with PSG-identified arousals (r = .45).
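The epoch-by-epoch "microanalysis" measures named above can be sketched as follows. This is only an illustration, assuming the common convention that sensitivity is the share of PSG-scored sleep epochs that ACT also scores as sleep and specificity is the share of PSG-scored wake epochs that ACT also scores as wake; the function name `epoch_agreement` and the boolean encoding (True = sleep) are hypothetical and not taken from the study.

```python
import math


def epoch_agreement(psg_sleep, act_sleep):
    """Return (sensitivity, specificity, accuracy) for paired epoch labels.

    psg_sleep, act_sleep: equal-length sequences of booleans, True = sleep.
    PSG is treated as the reference (an assumption of this sketch).
    """
    if len(psg_sleep) != len(act_sleep):
        raise ValueError("both recordings must have the same number of epochs")

    pairs = list(zip(psg_sleep, act_sleep))
    true_sleep = sum(1 for ref, test in pairs if ref and test)
    missed_sleep = sum(1 for ref, test in pairs if ref and not test)
    true_wake = sum(1 for ref, test in pairs if not ref and not test)
    missed_wake = sum(1 for ref, test in pairs if not ref and test)

    def ratio(numerator, denominator):
        # Undefined when the reference contains no epochs of that state.
        return numerator / denominator if denominator else math.nan

    sensitivity = ratio(true_sleep, true_sleep + missed_sleep)
    specificity = ratio(true_wake, true_wake + missed_wake)
    accuracy = ratio(true_sleep + true_wake, len(pairs))
    return sensitivity, specificity, accuracy
```

A device can score high on sensitivity and accuracy while specificity stays low when most epochs are sleep, which is consistent with the pattern of adequate sleep detection but poor wake detection reported above.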
Improved device and/or software development is needed before the AW-64 can be considered a valid method for identifying infant sleep and wake.
actigraphy; polysomnography; infant; validation; Bland-Altman
Although various acceptable and easy-to-use devices have been used for saliva collection, cotton swabs are among the most common ones. Previous studies reported that cotton swabs yield a lower level of melatonin detection. However, this statistical method is not adequate for detecting an agreement between cotton saliva collection and passive saliva collection, and a test for bias is needed. Furthermore, the effects of cotton swabs have not been examined at lower melatonin level, a level at which melatonin is used for assessment of circadian rhythms, namely dim light melatonin onset (DLMO). In the present study, we estimated the effect of cotton swabs on the results of salivary melatonin assay using the Bland-Altman plot at lower level.
Nine healthy males were recruited and each provided four saliva samples on a single day to yield a total of 36 samples. Saliva samples were directly collected in plastic tubes using plastic straws, and subsequently pipetted onto cotton swabs (cotton saliva collection) and into clear sterile tubes (passive saliva collection). The melatonin levels were analyzed in duplicate using commercially available ELISA kits.
The mean melatonin concentration in cotton saliva collection samples was significantly lower than that in passive saliva collection samples at higher melatonin level (>6 pg/mL). The Bland-Altman plot indicated that cotton swabs cause relative and proportional biases in the assay results. For lower melatonin level (<6 pg/mL), although the Bland-Altman plots did not show proportional and relative biases, there was no significant correlation between passive and cotton saliva collection samples.
Our findings indicate an interference effect of cotton swabs on the assay result of salivary melatonin at lower melatonin level. Cotton-based collection devices might, thus, not be suitable for assessment of DLMO.
The possibility that a significant proportion of the patients attending a general health facility may have a mental disorder means that psychiatric conditions must be recognised and managed appropriately. This study sought to determine the prevalence of common psychiatric disorders in adult (aged 18 years and over) inpatients and outpatients seen in public, private and faith-based general hospitals, health centres and specialised clinics and units of general hospitals.
This was a descriptive cross-sectional study conducted in 10 health facilities. All the patients in psychiatric wards and clinics were excluded. Stratified and systematic sampling methods were used. Informed consent was obtained from all study participants. Data were collected over a 4-week period in November 2005 using various psychiatric instruments for adults. Descriptive statistics were generated using SPSS V. 11.5.
A total of 2,770 male and female inpatients and outpatients participated in the study. In all, 42% of the subjects had symptoms of mild and severe depression. Only 114 (4.1%) subjects had a file or working diagnosis of a psychiatric condition, which included bipolar mood disorder, schizophrenia, psychosis and depression.
The 4.1% clinician detection rate for mental disorders means that most psychiatric disorders in general medical facilities remain undiagnosed and thus, unmanaged. This calls for improved diagnostic practices in general medical facilities in Kenya and in other similar countries.
Validity of self-reported height and weight has not been adequately evaluated in diverse adolescent populations. In fact there are no reported validity studies conducted in Asian children and adolescents. This study aims to examine the accuracy of self-reported weight, height, and resultant BMI values in Chinese adolescents, and of the adolescents' subsequent classification into overweight categories.
Weight and height were self-reported and measured in 1761 adolescents aged 12-16 years in a cross-sectional survey in Xi'an city, China. BMI was calculated from both reported values and measured values. Bland-Altman plots with 95% limits of agreement, Pearson's correlation and Kappa statistics were calculated to assess the agreement.
The 95% limits of agreement were -11.16 and 6.46 kg for weight, -4.73 and 7.45 cm for height, and -4.93 and 2.47 kg/m2 for BMI. Pearson correlation between measured and self-reported values was 0.912 for weight, 0.935 for height and 0.809 for BMI. Weighted Kappa was 0.859 for weight, 0.906 for height and 0.754 for BMI. Sensitivity for detecting overweight (includes obese) in adolescents was 56.1%, and specificity was 98.6%. Subjects' area of residence, age and BMI were significant factors associated with the errors in self-reporting weight, height and relative BMI.
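For readers unfamiliar with how limits such as those above are obtained, the following is a minimal sketch of the standard Bland-Altman calculation: the bias is the mean of the paired differences and the 95% limits are the bias plus or minus 1.96 standard deviations of those differences. It assumes one pair of measurements per subject and roughly normally distributed differences; the function name and the direction of subtraction (first method minus second) are illustrative choices, not details reported by the study.

```python
import statistics


def bland_altman_limits(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    Differences are taken as method_a - method_b (an assumed convention).
    """
    if len(method_a) != len(method_b):
        raise ValueError("measurements must be paired")
    if len(method_a) < 2:
        raise ValueError("at least two pairs are needed for a standard deviation")

    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)
    # Sample standard deviation (n - 1 denominator) of the differences.
    spread = statistics.stdev(differences)
    return bias, bias - 1.96 * spread, bias + 1.96 * spread
```

Whether limits of a given width are acceptable is a clinical judgement that has to be fixed in advance; the calculation itself does not supply a threshold.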
Reported weight and height do not have an acceptable agreement with measured data. Therefore, we do not recommend the application of self-reported weight and height to screen for overweight adolescents in China. Alternatively, self-reported data could be considered for use, with caution, in surveillance systems and epidemiology studies.
Psychological distress is common among medical students but manifests in a variety of forms. Currently, no brief, practical tool exists to simultaneously evaluate these domains of distress among medical students. The authors describe the development of a subject-reported assessment (Medical Student Well-Being Index, MSWBI) intended to screen for medical student distress across a variety of domains and examine its preliminary psychometric properties.
Relevant domains of distress were identified, items generated, and a screening instrument formed using a process of literature review, nominal group technique, input from deans and medical students, and correlation analysis from previously administered assessments. Eleven experts judged the clarity, relevance, and representativeness of the items. A Content Validity Index (CVI) was calculated. Interrater agreement was assessed using pair-wise percent agreement adjusted for chance agreement. Data from 2248 medical students who completed the MSWBI along with validated full-length instruments assessing domains of interest was used to calculate reliability and explore internal structure validity.
Burnout (emotional exhaustion and depersonalization), depression, mental quality of life (QOL), physical QOL, stress, and fatigue were domains identified for inclusion in the MSWBI. Six of 7 items received item CVI-relevance and CVI-representativeness of ≥0.82. Overall scale CVI-relevance and CVI-representativeness was 0.94 and 0.91. Overall pair-wise percent agreement between raters was ≥85% for clarity, relevance, and representativeness. Cronbach's alpha was 0.68. Item by item percent pair-wise agreements and Phi were low, suggesting little overlap between items. The majority of MSWBI items had a ≥74% sensitivity and specificity for detecting distress within the intended domain.
The results of this study provide evidence of reliability and content-related validity of the MSWBI. Further research is needed to assess remaining psychometric properties and establish scores for which intervention is warranted.
A variety of instruments are used to measure health related quality of life. Few data exist on the performance and agreement of different instruments in a depressed population. The aim of this study was to investigate agreement between, and suitability of, the EQ-5D-3L, EQ-5D Visual Analogue Scale (EQ-5D VAS), SF-6D and SF-12 new algorithm for measuring health utility in depressed patients.
The intraclass correlation coefficient (ICC) and Bland and Altman approaches were used to assess agreement. Instrument sensitivity was analysed by: (1) plotting utility scores for the instruments against one another; (2) correlating utility scores and depressive symptoms (Beck Depression Inventory (BDI)); and (3) using Tukey’s procedure. Receiver Operating Characteristic (ROC) analysis assessed instrument responsiveness to change. Acceptability was assessed by comparing instrument completion rates.
The overall ICC was 0.57. Bland and Altman plots showed wide limits of agreement for each pair wise comparison, except between the SF-6D and SF-12 new algorithm. Plots of utility scores displayed ’ceiling effects’ in the EQ-5D-3L index and ’floor effects’ in the SF-6D and SF-12 new algorithm. All instruments showed a negative monotonic relationship with BDI, but the EQ-5D-3L index and EQ-5D VAS could not differentiate between depression severity sub-groups. The SF-based instruments were better able to detect changes in health state over time. There was no difference in completion rates of the four instruments.
There was a lack of agreement between utility scores generated by the different instruments. According to the criteria of sensitivity, responsiveness and acceptability that we applied, the SF-6D and SF-12 may be more suitable for the measurement of health related utility in a depressed population than the EQ-5D-3L, which is the instrument currently recommended by NICE.
Depression; EQ-5D; SF-6D; Health related utility; QALYs
OBJECTIVE: To validate a range of dietary assessment instruments in general practice. METHODS: Using a randomised block design, brief assessment instruments and more complex conventional dietary assessment tools were compared with an accepted "relative" standard, a seven day weighed dietary record. The standard was checked using biomarkers, and by performing test-retest reliability in additional subjects (n = 29). OUTCOMES: Agreement with weighed record. Percentage agreement with weighed record, rank correlation from scatter plot, rank correlation from Bland-Altman plot. Reliability of the weighed record. SETTING: Practice nurse treatment room in a single suburban general practice. SUBJECTS: Patients with risk factors for cardiovascular disease (n = 61) or age/sex stratified general population group (n = 50). RESULTS: Brief self completion dietary assessment tools based on food groups eaten during a week show reasonable agreement with the relative standard. For % energy from fat and saturated fat, non-starch polysaccharide, grams of fruit and vegetables and starchy foods consumed the range of agreement with the standard was: median % difference -6% to 12%, rank correlation 0.5 to 0.6. This agreement is of a similar order to the reliability of the weighed record, as good as or better than test standard agreement for more time consuming instruments, and compares favourably with research instruments validated in other settings. Under-reporting of energy intake was common (40%) and more likely if subjects were obese (body mass index (BMI) ≥ 30: 60% under-reported; BMI < 30: 29%; p < 0.001). CONCLUSION: Under-reporting of absolute energy intake is common, particularly among obese patients. Simple self assessment tools based on food groups, designed for practice nurse dietary assessment, show acceptable agreement with a standard, and suggest such tools are sufficiently accurate for clinical work, research, and possibly population dietary monitoring.
Vignette studies of medical choice and judgement have gained popularity in the medical literature. Originally developed in mathematical psychology they can be used to evaluate physicians' behaviour in the setting of diagnostic testing or treatment decisions. We provide an overview of the use, objectives and methodology of these studies in the medical field.
Systematic review. We searched electronic databases and the reference lists of included studies. We included studies that examined medical decisions of physicians, nurses or medical students using cue weightings from answers to structured vignettes. Two reviewers scrutinized abstracts and examined full text copies of potentially eligible studies. The aim of the included studies, the type of clinical decision, the number of participants, some technical aspects, and the type of statistical analysis were extracted in duplicate and discrepancies were resolved by consensus.
30 reports published between 1983 and 2005 fulfilled the inclusion criteria. 22 studies (73%) reported on treatment decisions and 27 (90%) explored the variation of decisions among experts. Nine studies (30%) described differences in decisions between groups of caregivers and ten studies (33%) described the decision behaviour of only one group. Only six studies (20%) compared decision behaviour against an empirical reference of a correct decision. The median number of considered attributes was 6.5 (IQR 4–9), the median number of vignettes was 27 (IQR 16–40). In 17 studies, decision makers had to rate the relative importance of a given vignette; in six studies they had to assign a probability to each vignette. Only ten studies (33%) applied a statistical procedure to account for correlated data.
Various studies of medical choice and judgement have been performed to depict weightings of the value of clinical information from answers to structured vignettes of care givers. We found that the design and analysis methods used in current applications vary considerably and could be improved in a large number of cases.
The purpose of this study was to systematically compare methods for standardization of blood pressure levels obtained by ambulatory blood pressure monitoring (ABPM) in a group of 111 children studied at our institution.
Blood pressure indices, blood pressure loads and standard deviation scores were calculated using the original ABPM and the modified reference standards. Bland-Altman plots and kappa statistics for the level of agreement were generated.
Overall, the agreement between the two methods was excellent; however, approximately 5% of children were classified differently by one as compared with the other method.
Depending on which version of the German Working Group’s reference standards is used for interpretation of ABPM data, the classification of the individual as having hypertension or normal blood pressure may vary.
ambulatory blood pressure monitoring; blood pressure; hypertension; reference standards
Background. Reliable ICU severity scores have been achieved by various healthcare workers but nothing is known regarding the accuracy in real life of severity scores registered by untrained nurses. Methods. In this retrospective multicentre audit, three reviewers independently reassessed 120 SAPS II scores. Correlation and agreement of the sum-scores/variables among reviewers and between nurses and the reviewers' gold standard were assessed globally and for tertiles. Bland and Altman (gold standard minus nurses) of sum scores and regression of the difference were determined. A logistic regression model identifying risk factors for erroneous assessments was calculated. Results. Correlation for sum scores among reviewers was almost perfect (mean ICC = 0.985). The mean (±SD) nurse-registered SAPS II sum score was 40.3 ± 20.2 versus 44.2 ± 24.9 of the gold standard (P < 0.002 for difference) with a lower ICC (0.81). Bland and Altman assay was +3.8 ± 27.0 with a significant regression between the difference and the gold standard, indicating overall an overestimation (underestimation) of lower (higher; >32 points) scores. The lowest agreement was found in high SAPS II tertiles for haemodynamics (k = 0.45–0.51). Conclusions. In real life, nurse-registered SAPS II scores of very ill patients are inaccurate. Accuracy of scores was not associated with nurses' characteristics.
To investigate whether (1) machine learning classifiers can help identify nonrandomized studies eligible for full-text screening by systematic reviewers; (2) classifier performance varies with optimization; and (3) the number of citations to screen can be reduced.
We used an open-source, data-mining suite to process and classify biomedical citations that point to mostly nonrandomized studies from 2 systematic reviews. We built training and test sets for citation portions and compared classifier performance by considering the value of indexing, various feature sets, and optimization. We conducted our experiments in 2 phases. The design of phase I with no optimization was: 4 classifiers × 3 feature sets × 3 citation portions. Classifiers included k-nearest neighbor, naïve Bayes, complement naïve Bayes, and evolutionary support vector machine. Feature sets included bag of words, and 2- and 3-term n-grams. Citation portions included titles, titles and abstracts, and full citations with metadata. Phase II with optimization involved a subset of the classifiers, as well as features extracted from full citations, and full citations with overweighted titles. We optimized features and classifier parameters by manually setting information gain thresholds outside of a process for iterative grid optimization with 10-fold cross-validations. We independently tested models on data reserved for that purpose and statistically compared classifier performance on 2 types of feature sets. We estimated the number of citations needed to screen by reviewers during a second pass through a reduced set of citations.
In phase I, the evolutionary support vector machine returned the best recall for bag of words extracted from full citations; the best classifier with respect to overall performance was k-nearest neighbor. No classifier attained good enough recall for this task without optimization. In phase II, we boosted performance with optimization for evolutionary support vector machine and complement naïve Bayes classifiers. Generalization performance was better for the latter in the independent tests. For evolutionary support vector machine and complement naïve Bayes classifiers, the initial retrieval set was reduced by 46% and 35%, respectively.
Machine learning classifiers can help identify nonrandomized studies eligible for full-text screening by systematic reviewers. Optimization can markedly improve performance of classifiers. However, generalizability varies with the classifier. The number of citations to screen during a second independent pass through the citations can be substantially reduced.
medical informatics; clinical research informatics; text mining; document classification; systematic reviews
Physical activity self-report instruments in the US have largely been developed for and validated in White samples. Despite calls to validate existing instruments in more diverse samples, relatively few instruments have been validated in US Blacks. Emerging evidence suggests that these instruments may have differential validity in Black populations.
This report reviews and evaluates the validity and reliability of self-reported measures of physical activity in Blacks and makes recommendations for future directions.
A systematic literature review was conducted to identify published reports with construct or criterion validity evaluated in samples that included Blacks. Studies that reported results separately for Blacks were examined.
The review identified 10 instruments validated in nine manuscripts. Criterion validity correlations tended to be low to moderate. No study has compared the validity of multiple instruments in a single sample of Blacks.
There is a need for efforts validating self-report physical activity instruments in Blacks, particularly those evaluating the relative validity of instruments in a single sample.
Cardiac output (CO) and systemic vascular resistance (SVR) are two important parameters of the cardiovascular system. The ability to measure these parameters continuously and noninvasively may assist in diagnosing and monitoring patients with suspected cardiovascular diseases, or other critical illnesses. In this study, a method is proposed to estimate both the CO and SVR of a heterogeneous cohort of intensive care unit patients (N=48).
Spectral and morphological features were extracted from the finger photoplethysmogram, and added to heart rate and mean arterial pressure as input features to a multivariate regression model to estimate CO and SVR. A stepwise feature search algorithm was employed to select statistically significant features. Leave-one-out cross validation was used to assess the generalized model performance. The degree of agreement between the estimation method and the gold standard was assessed using Bland-Altman analysis.
The Bland-Altman bias ±precision (1.96 times standard deviation) for CO was -0.01 ±2.70 L min-1 when only photoplethysmogram (PPG) features were used, and for SVR was -0.87 ±412 dyn.s.cm-5 when only one PPG variability feature was used.
These promising results indicate the feasibility of using the method described as a non-invasive preliminary diagnostic tool in supervised or unsupervised clinical settings.
Cardiac output; Systemic vascular resistance; Photoplethysmography; Power spectrum analysis; Photoplethysmogram variability; Photoplethysmogram morphology; Feature selection
Evidence-based medicine depends on the timely synthesis of research findings. An important source of synthesized evidence resides in systematic reviews. However, a bottleneck in review production involves dual screening of citations with titles and abstracts to find eligible studies. For this research, we tested the effect of various kinds of textual information (features) on performance of a machine learning classifier. Based on our findings, we propose an automated system to reduce screening burden, as well as offer quality assurance.
We built a database of citations from 5 systematic reviews that varied with respect to domain, topic, and sponsor. Consensus judgments regarding eligibility were inferred from published reports. We extracted 5 feature sets from citations: alphabetic, alphanumeric+, indexing, features mapped to concepts in systematic reviews, and topic models. To simulate a two-person team, we divided the data into random halves. We optimized the parameters of a Bayesian classifier, then trained and tested models on alternate data halves. Overall, we conducted 50 independent tests.
All tests of summary performance (mean F3) surpassed the corresponding baseline, P<0.0001. The ranks for mean F3, precision, and classification error were statistically different across feature sets averaged over reviews; P-values for Friedman's test were .045, .002, and .002, respectively. Differences in ranks for mean recall were not statistically significant. Alphanumeric+ features were associated with best performance; mean reduction in screening burden for this feature type ranged from 88% to 98% for the second pass through citations and from 38% to 48% overall.
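The summary measure reported above can be illustrated with a short sketch. It assumes that "F3" denotes the usual F-beta score with beta = 3, which weights recall more heavily than precision and so suits citation screening, where missing an eligible study is costlier than reading an extra abstract. The function name `f_beta` is hypothetical, and the source does not show its own formula.

```python
def f_beta(precision, recall, beta=3.0):
    """Return the F-beta score for a given precision and recall.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    Returns 0.0 when precision and recall are both zero.
    """
    if not 0.0 <= precision <= 1.0 or not 0.0 <= recall <= 1.0:
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")

    beta_squared = beta * beta
    denominator = beta_squared * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_squared) * precision * recall / denominator
```

With beta = 1 the same function gives the familiar balanced F1 score, which makes it easy to see how much the choice of beta shifts the ranking of classifiers.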
A computer-assisted, decision support system based on our methods could substantially reduce the burden of screening citations for systematic review teams and solo reviewers. Additionally, such a system could deliver quality assurance both by confirming concordant decisions and by naming studies associated with discordant decisions for further consideration.
To evaluate the accuracy of the swallowing kinematic analysis.
To evaluate the accuracy at various velocities of movement, we developed an instrumental model of linear and rotational movement, representing the physiologic movement of the hyoid and epiglottis, respectively. A still image of 8 objects was also used to measure the length of the objects as a basic screening, along with 18 movie files of the instrumental model, taken from videofluoroscopy at different velocities. The images and movie files were digitized and analyzed by an experienced examiner, who was blinded to the study.
The Pearson correlation coefficients between the measured and instrumental reference values were over 0.99 (p<0.001) for all of the analyses. Bland-Altman plots showed narrow ranges of the 95% confidence interval of agreement between the measured and reference values as follows: 0.14 to 0.94 mm for distances in a still image, -0.14 to 1.09 mm/s for linear velocities, and -1.02 to 3.81 degree/s for angular velocities.
Our findings demonstrate that the distance and velocity measurements obtained by swallowing kinematic analysis are highly valid in a wide range of movement velocity.
Reproducibility of results; Biomechanics; Deglutition
Background and Aims
Clinical management of polyps discovered by computed tomographic (CT) colonography depends on polyp size. However, size measured by CT colonography is an estimate, and its agreement with other measures is not well characterized. We hypothesized that size measurement by CT colonography varies substantially compared to measurement by other methods.
We performed a secondary data analysis of a multicenter study of CT colonography in comparison to colonoscopy. Polyp size was determined by CT colonography, at colonoscopy, and measurement pre-fixation with a ruler. Agreement was assessed using descriptive statistics and Bland-Altman methodology.
600 trial participants completed both tests. 95% limits of agreement indicated that estimates of size by CT colonography ranged from 52% lower to 64% higher than pre-fixation polyp size estimates. 95% limits of agreement stratified by categories of clinical importance indicated that estimates of size by CT colonography ranged from 44% lower to 84% higher for polyps ≤0.6cm, 44% lower to 44% higher for polyps 0.6 to 0.9cm, and 48% lower to 22% higher for polyps ≥0.9cm compared with pre-fixation estimates. Analysis of participants with one identified polyp in the same colon segment demonstrated that categorization based on CT colonography measurement (i.e., <0.6cm, 0.6 to 0.9cm, or >0.9cm) differed from pre-fixation measurement for 43% of participants.
Polyp size estimation by CT colonography varies from pre-fixation and colonoscopic measures of size. Future studies should clarify whether size estimation by CT colonography is sufficiently reliable as a primary factor to guide clinical management.
Few assessment instruments have examined the nutrition and physical activity environments in child care, and none are self-administered. Given the emerging focus on child care settings as a target for intervention, a valid and reliable measure of the nutrition and physical activity environment is needed.
To measure inter-rater reliability, 59 child care center directors and 109 staff completed the self-assessment concurrently, but independently. Three weeks later, a repeat self-assessment was completed by a sub-sample of 38 directors to assess test-retest reliability. To assess criterion validity, a researcher-administered environmental assessment was conducted at 69 centers and was compared to a self-assessment completed by the director. A weighted kappa test statistic and percent agreement were calculated to assess agreement for each question on the self-assessment.
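The two per-question agreement measures named above can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: responses are assumed to be ordinal codes 0..k-1, the kappa uses linear disagreement weights (the abstract does not say which weighting scheme was applied), and the function names are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of paired responses that are identical."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def weighted_kappa(rater_a, rater_b, n_categories):
    """Linear-weighted Cohen's kappa for responses coded 0..n_categories-1."""
    n, k = len(rater_a), n_categories
    if n == 0 or k < 2:
        raise ValueError("need at least one pair and two categories")
    # Observed joint proportions for each (category_a, category_b) cell
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    obs_dis = exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)  # linear disagreement weight
            obs_dis += w * observed[i][j]
            exp_dis += w * row[i] * col[j]  # expected under independence
    if exp_dis == 0:
        raise ValueError("kappa undefined: all responses in one category")
    return 1.0 - obs_dis / exp_dis


# Hypothetical responses to one four-level question (illustrative only)
director = [0, 1, 2, 3]
staff = [0, 1, 2, 2]
print(percent_agreement(director, staff))
print(round(weighted_kappa(director, staff, 4), 3))
```

Percent agreement ignores agreement expected by chance, whereas kappa corrects for it, so the two can diverge sharply when most responses fall into a single category.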
For inter-rater reliability, kappa statistics ranged from 0.20 to 1.00 across all questions. Test-retest reliability of the self-assessment yielded kappa statistics that ranged from 0.07 to 1.00. The inter-quartile kappa statistic ranges for inter-rater and test-retest reliability were 0.45 to 0.63 and 0.27 to 0.45, respectively. When percent agreement was calculated, questions ranged from 52.6% to 100% for inter-rater reliability and 34.3% to 100% for test-retest reliability. Kappa statistics for validity ranged from -0.01 to 0.79, with an inter-quartile range of 0.08 to 0.34. Percent agreement for validity ranged from 12.9% to 93.7%.
This study provides estimates of criterion validity, inter-rater reliability and test-retest reliability for an environmental nutrition and physical activity self-assessment instrument for child care. Results indicate that the self-assessment is a stable and reasonably accurate instrument for use with child care interventions. We therefore recommend the Nutrition and Physical Activity Self-Assessment for Child Care (NAP SACC) instrument to researchers and practitioners interested in conducting healthy weight intervention in child care. However, a more robust, less subjective measure would be more appropriate for researchers seeking an outcome measure to assess intervention impact.
Background and Scope
Significant progress has been made over the past two decades in the development of screening and diagnostic instruments for autism spectrum disorders (ASD). This article reviews this progress, including recent innovations, focussing on those instruments for which the strongest research data on validity exist, and then turns to addressing issues arising from their use in clinical settings.
Research studies have evaluated the ability of screens to prospectively identify cases of ASD in population-based and clinically-referred samples, as well as the accuracy with which diagnostic instruments map onto ‘gold standard’ clinical best estimate diagnosis. However, extension of the findings to clinical services must be done with caution, with a full understanding that instrument properties are sample-specific. Furthermore, we are limited by the lack of a true test for ASD, which remains a behaviourally-defined disorder. In addition, screening and diagnostic instruments help clinicians least in the cases where they are most in want of direction, since their accuracy will always be lower for marginal cases.
Instruments help clinicians to collect detailed, structured information and increase accuracy and reliability of referral for in-depth assessment and recommendations for support, but further research is needed to refine their effective use in clinical settings.
autism spectrum disorder (ASD); screening; diagnosis; sensitivity; specificity; predictive value
Neuropathology centers are expected to offer a prompt and accurate intraoperative diagnosis regarding tumor/lesion type and grade on fresh unfixed tissue. The level of diagnostic accuracy according to type and grade, and the experience at a new center, have not been reported before.
The aim of this study is to review the agreement patterns according to tumor/lesion type and grade between intraoperative and final histopathologic diagnosis in central nervous system (CNS) lesion samples received by a newly established neuropathology center at a tertiary care neuropsychiatric hospital.
Materials and Methods
Agreement between intraoperative and final histopathologic diagnosis was classified as: (I) Grade in agreement but type not in agreement; (II) grade not in agreement but type in agreement; (III) grade and type both not in agreement; (IV) grade and type both in agreement.
Confidence interval (CI) of agreements was calculated for various categories of neoplastic as well as non-neoplastic lesions. CI was also calculated for groups where n × p and n × (1 − p) were more than 5, i.e., fulfilled the requirement of the central limit theorem.
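The applicability rule described above can be sketched as a normal-approximation (Wald) interval for a proportion that is only returned when both n × p and n × (1 − p) exceed 5. The abstract does not state which interval formula was used, so the Wald form and the function names here are assumptions; the counts in the example are the agreement counts reported in this study's results, while the resulting intervals are illustrative and not reported values.

```python
from math import sqrt


def normal_approx_ok(n, p, threshold=5):
    """Check the rule of thumb n*p > 5 and n*(1 - p) > 5."""
    return n * p > threshold and n * (1 - p) > threshold


def wald_ci(successes, n, z=1.96):
    """95% Wald CI for a proportion, or None if the rule of thumb fails."""
    p = successes / n
    if not normal_approx_ok(n, p):
        return None
    half_width = z * sqrt(p * (1 - p) / n)
    # Clip to the valid range of a proportion
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Neoplastic lesions: 237 agreements out of 284 cases
print(wald_ci(237, 284))
# Non-neoplastic lesions: 46 of 49; n * (1 - p) = 3, so no interval is given
print(wald_ci(46, 49))
```

For small groups that fail the check, an exact or score-based interval would be the usual alternative; that choice is outside what the abstract describes.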
On retrospective analysis of 333 cases, 284 (85.3%) cases were categorized as neoplastic while 49 (14.7%) cases were categorized as non-neoplastic. Among the neoplastic lesions, agreement was seen in 237 (83.5%) cases while 47 (16.5%) cases showed disagreement. Similarly, in the non-neoplastic category, 46 (93.9%) cases showed agreement while 3 (6.1%) cases showed disagreement. Of the non-neoplastic lesions, one case fell into agreement category I, 2 in category III and 46 in category IV. Among neoplastic lesions, there were 21 cases in agreement category I, 17 in II, 9 in III and 237 in IV. On analyzing the accuracy of intraoperative reporting according to tumor type, the breakdown was: astrocytic: 2 (I), 16 (II), 2 (III), 86 (IV); oligodendroglial: 8 (I), 1 (II); ependymal: 2 (III), 6 (IV); embryonal: 23 (IV); cranial and spinal nerve tumors: 2 (II), 21 (IV); choroid plexus tumors: 4 (IV); meningeal tumors: 3 (I), 1 (III), 49 (IV); metastatic tumors: 3 (I), 17 (IV); cysts (tumor-like conditions): 14 (IV); neuronal and mixed neuronal glial tumors: 1 (III); malignant lymphoma: 1 (III); sellar tumors: 17 (IV); and mixed gliomas: 5 (I).
This study identifies problem areas of CNS intraoperative reporting, in a new center, with reference to tumor typing and grading. It may forewarn upcoming centers of neuropathology about the potential problem areas of intraoperative reporting.
Central nervous system lesions; intraoperative reporting; new center
The SLICC Damage Index (SDI) is a validated instrument for assessing organ damage in systemic lupus erythematosus (SLE). Trained physicians must complete it, limiting utility where this is impossible.
We developed and pilot-tested a self-assessed organ damage instrument, the Lupus Damage Index Questionnaire (LDIQ), in 37 SLE subjects and 7 physicians. After refinement, 569 English-speaking SLE subjects and 14 rheumatologists from 11 international SLE clinics participated in validation. Subjects and physicians completed instruments separately. We calculated sensitivity, specificity, Spearman correlations and agreement, using the SDI as gold standard. 605 SLE participants in the community-based National Data Bank for Rheumatic Diseases (NDB) study completed the LDIQ and we assessed correlations with outcome and disability measures.
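Item-level sensitivity, specificity and agreement against a gold standard can be computed from a 2 × 2 table, as in the sketch below. It assumes each damage item is coded 1 (present) or 0 (absent) for both the self-report and the physician-completed reference; the function name and the sample data are hypothetical and are not study data.

```python
def diagnostic_accuracy(self_report, gold_standard):
    """Return (sensitivity, specificity, agreement) for binary item data."""
    tp = fn = tn = fp = 0
    for s, g in zip(self_report, gold_standard):
        if g == 1 and s == 1:
            tp += 1  # damage present and reported
        elif g == 1:
            fn += 1  # damage present but not reported
        elif s == 0:
            tn += 1  # damage absent and not reported
        else:
            fp += 1  # damage absent but reported
    total = tp + fn + tn + fp
    # Undefined when the gold standard has no positives (or no negatives)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    agreement = (tp + tn) / total if total else None
    return sensitivity, specificity, agreement


# Hypothetical responses for one damage item (illustrative only)
gold = [1, 1, 1, 0, 0, 0, 0, 0]
self_rep = [1, 1, 0, 0, 0, 0, 0, 1]
print(diagnostic_accuracy(self_rep, gold))
```

When an item is rare in the sample, the sensitivity denominator is small, so its estimate is unstable even when specificity and overall agreement look high.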
Mean LDIQ score was 3.3 (0-16) and mean SDI score was 1.5 (0-9). LDIQ had a moderately high correlation with SDI (Spearman r=0.50, p<0.001). Specificities of individual LDIQ items were >80%, except for neuropathy. Sensitivities were variable and lowest for damage with <1% prevalence. Agreement between SDI and LDIQ was > 85% for all but neuropathy, reduced renal function, deforming arthritis and alopecia. In the NDB, LDIQ correlated well with comorbidity index (r=0.45), SF-36 physical component scale (0.43), Medical Research Council dyspnea scale (0.40), disability (0.37) and SLE Activity Questionnaire score (0.37).
The LDIQ’s metric properties are good compared to the SDI. It has construct validity and correlations with health assessments similar to the SDI. The LDIQ should allow expansion of SLE research. Its ultimate value will be determined in longitudinal studies.
systemic lupus erythematosus; questionnaire; damage; SLICC damage index; validation; self-assessed
To estimate agreement among scores on three common assessments of cognitive function.
Baseline responses on the Alzheimer's Disease Assessment Scale – Cognitive, Clinical Dementia Rating, and the Mini-Mental State Examination were obtained from two clinical trials (n = 138 and n = 351). A graphical method of examining agreement, the means-difference or Bland-Altman plot, was followed by Levene's test of the equality of variance corrected for multiple comparisons within each sample.
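One plausible reading of this step, offered only as a sketch, is to take the between-test differences from the means-difference plot, split subjects at the average impairment level, and compare the spread of the differences in the two groups with Levene's test. The code below computes the Levene W statistic in its mean-centred form; the grouping rule, the function name and the values are assumptions, not details given in the abstract.

```python
from statistics import mean


def levene_w(*groups):
    """Levene's W statistic (mean-centred) for equality of variances."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    # Absolute deviation of each value from its own group mean
    z = [[abs(x - mean(g)) for x in g] for g in groups]
    z_group = [mean(zi) for zi in z]
    z_all = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zg - z_all) ** 2 for zi, zg in zip(z, z_group))
    within = sum((v - zg) ** 2 for zi, zg in zip(z, z_group) for v in zi)
    if within == 0:
        raise ValueError("W undefined: no variation in absolute deviations")
    return (n_total - k) / (k - 1) * between / within


# Hypothetical between-test differences (illustrative only), grouped by
# whether the subject's mean score is below or above the sample average
less_impaired = [1, 2, 3, 4, 5]
more_impaired = [2, 4, 6, 8, 10]
print(round(levene_w(less_impaired, more_impaired), 3))
```

W is referred to an F distribution with k − 1 and N − k degrees of freedom; a library routine such as `scipy.stats.levene` with `center='mean'` returns the corresponding p-value, and a Bonferroni-style adjustment is one common way to correct for multiple comparisons.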
70–78% of variability was shared by one factor, suggesting that all three instruments reflect cognitive impairment. However, agreement among tests was significantly worse for individuals with greater-than-average, relative to individuals with less-than-average, cognitive impairment.
Worse agreement between tests, as a function of increasing cognitive impairment, implies that interpretation of these tests and selection of coprimary cognitive impairment outcomes may depend on impairment level.
Alzheimer's disease; Dementia; Outcomes assessment methods
Poor to moderate validity of self-reported physical activity instruments is commonly observed in young people in low- and middle-income countries. However, the reasons for such low validity have not been examined in detail. We tested the validity of a self-administered daily physical activity record in adolescents and assessed if personal characteristics or the convenience level of reporting physical activity modified the validity estimates.
The study comprised a total of 302 adolescents from an urban and a rural area in Ecuador. Validity was evaluated by comparing the record with accelerometer recordings for seven consecutive days. Test-retest reliability was examined by comparing registrations from two records administered three weeks apart. Time spent on sedentary (SED), low (LPA), moderate (MPA) and vigorous (VPA) intensity physical activity was estimated. Bland-Altman plots were used to evaluate measurement agreement. We assessed if age, sex, urban or rural setting, anthropometry and convenience of completing the record explained differences in validity estimates using a linear mixed model.
Although the record provided higher estimates for SED and VPA and lower estimates for LPA and MPA compared to the accelerometer, it showed an overall fair measurement agreement for validity. There was modest reliability for assessing physical activity in each intensity level. Validity was associated with adolescents’ personal characteristics: sex (SED: P = 0.007; LPA: P = 0.001; VPA: P = 0.009) and setting (LPA: P < 0.001; MPA: P = 0.047). Reliability was associated with the convenience of completing the physical activity record for LPA (low convenience: P = 0.014; high convenience: P = 0.045).
The physical activity record provided acceptable estimates for reliability and validity on a group level. Sex and setting were associated with validity estimates, whereas convenience of filling out the record was associated with better reliability estimates for LPA. This tendency of improved reliability estimates for adolescents reporting higher convenience merits further consideration.
Accelerometers; Convenience; Diary; Ecuador; Low- and middle-income countries; Validity