Assessing agreement in method comparison studies depends on two fundamentally important components: validity (the between-method agreement) and reproducibility (the within-method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in the medical literature for assessing between-method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to the lack of flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data.
Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates).
We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented: (i) one assuming static underlying values, analysed using both multivariate hierarchical Bayesian models, and (ii) one assuming each subject's underlying value is a continually changing quantity, analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model.
These easily implemented models allow for full parameter uncertainty and simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses is also presented, written for the freely available and currently cost-free software package WinBUGS.
An ingestible telemetric temperature sensor for measuring body core temperature (Tc) was first described 45 years ago, although the method has only recently gained widespread use for exercise applications. This review aims to (1) use Bland and Altman's limits of agreement (LoA) method as a basis for quantitatively reviewing the agreement between intestinal sensor temperature (Tintestinal), oesophageal temperature (Toesophageal) and rectal temperature (Trectal) across numerous previously published validation studies; (2) review factors that may affect agreement; and (3) review the application of this technology in field‐based exercise studies. The agreement between Tintestinal and Toesophageal is suggested to meet our delimitation for an acceptable level of agreement (ie, systematic bias <0.1°C and 95% LoA within ±0.4°C). The agreement between Tintestinal and Trectal shows a significant systematic bias >0.1°C, although the 95% LoA is acceptable. Tintestinal responds less rapidly than Toesophageal at the start or cessation of exercise or to a change in exercise intensity, but more rapidly than Trectal. When using this technology, care should be taken to ensure adequate control over sensor calibration and data correction, timing of ingestion and electromagnetic interference. The ingestible sensor has been applied successfully in numerous sport and occupational applications such as the continuous measurement of Tc in deep sea saturation divers, distance runners and soldiers undertaking sustained military training exercises. It is concluded that the ingestible telemetric temperature sensor represents a valid index of Tc and shows excellent utility for ambulatory field‐based applications.
Empathy is frequently cited as an important attribute in physicians and some groups have expressed a desire to measure empathy either at selection for medical school or during medical (or postgraduate) training. In order to do this, a reliable and valid test of empathy is required. The purpose of this systematic review is to determine the reliability and validity of existing tests for the assessment of medical empathy.
A systematic review of research papers relating to the reliability and validity of tests of empathy in medical students and doctors. Journal databases (Medline, EMBASE, and PsycINFO) were searched for English-language articles relating to the assessment of empathy and related constructs in applicants to medical school, medical students, and doctors.
From 1147 citations, we identified 50 relevant papers describing 36 different instruments of empathy measurement. As some papers assessed more than one instrument, there were 59 instrument assessments. 20 of these involved only medical students, 30 involved only practising clinicians, and three involved only medical school applicants. Four assessments involved both medical students and practising clinicians, and two studies involved both medical school applicants and students.
Eight instruments demonstrated evidence of reliability, internal consistency, and validity. Of these, six were self-rated measures, one was a patient-rated measure, and one was an observer-rated measure.
A number of empathy measures available have been psychometrically assessed for research use among medical students and practising medical doctors. No empathy measures were found with sufficient evidence of predictive validity for use as selection measures for medical school. However, measures with a sufficient evidential base to support their use as tools for investigating the role of empathy in medical training and clinical care are available.
Few commercially available brands of actigraphs (ACT) have been subjected to rigorous validation with infant participants. The purpose of this study was to examine the agreement between concurrent polysomnography (PSG) and one brand of ACT (AW-64, Mitter Co. Inc.) using appropriate statistical techniques among a sample of healthy infants.
Twenty-two healthy infants (14.1 ± 0.6 months) had one night of ankle ACT recording during research PSG at Kosair Children's Hospital Sleep Research Center in Louisville, Kentucky. Macroanalyses were conducted using the Bland-Altman concordance technique to assess agreement between total sleep time (TST) and wake after sleep onset (WASO) simultaneously measured by PSG and ACT, using two ACT algorithm settings. Microanalyses were also calculated to examine sensitivity, specificity, and accuracy of ACT within each PSG-identified sleep state. Correlations were calculated between PSG-identified arousals and the discrepancies between ACT and PSG.
The Bland-Altman concordance technique revealed that ACT underestimated TST by 72.25 (SD = 61.48) minutes and by ≥ 60 minutes among 54.55% of infants. Furthermore, ACT overestimated WASO by 13.85 (SD = 30.94) minutes and by ≥ 30 minutes among 40.91% of infants. Sensitivity, specificity, and accuracy analyses revealed that ACT adequately identified sleep, but poorly identified wake. PSG and ACT discrepancies were positively associated with PSG-identified arousals (r = .45).
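The epoch-by-epoch microanalysis can be illustrated with a minimal sketch. It assumes each epoch is scored as sleep (1) or wake (0) by both PSG (treated as the reference) and ACT; the function name and the toy data are hypothetical and are not taken from the study.

```python
def epoch_agreement(psg, act):
    """Return (sensitivity, specificity, accuracy) of ACT against PSG.

    psg, act: equal-length sequences of 1 (sleep) or 0 (wake), one per epoch.
    Sensitivity: share of PSG sleep epochs that ACT also scores as sleep.
    Specificity: share of PSG wake epochs that ACT also scores as wake.
    Assumes the PSG record contains at least one sleep and one wake epoch.
    """
    if len(psg) != len(act):
        raise ValueError("psg and act must have the same number of epochs")
    pairs = list(zip(psg, act))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # sleep scored as sleep
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # wake scored as wake
    fp = sum(1 for p, a in pairs if p == 0 and a == 1)  # wake scored as sleep
    fn = sum(1 for p, a in pairs if p == 1 and a == 0)  # sleep scored as wake
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


# Hypothetical toy record: ten epochs, ACT misses wake more often than sleep.
toy_psg = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
toy_act = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]
sensitivity, specificity, accuracy = epoch_agreement(toy_psg, toy_act)
```

In this toy record the sensitivity is higher than the specificity, the same qualitative pattern as a device that identifies sleep adequately but wake poorly.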
Improved device and/or software development is needed before the AW-64 can be considered a valid method for identifying infant sleep and wake.
actigraphy; polysomnography; infant; validation; Bland-Altman
Although various acceptable and easy-to-use devices have been used for saliva collection, cotton swabs are among the most common. Previous studies reported that cotton swabs yield lower measured melatonin levels. However, the statistical methods used in those studies are not adequate for assessing agreement between cotton saliva collection and passive saliva collection, and a test for bias is needed. Furthermore, the effects of cotton swabs have not been examined at lower melatonin levels, the range used to assess circadian rhythms via dim light melatonin onset (DLMO). In the present study, we estimated the effect of cotton swabs on the results of the salivary melatonin assay at lower melatonin levels using Bland-Altman plots.
Nine healthy males were recruited and each provided four saliva samples on a single day to yield a total of 36 samples. Saliva samples were directly collected in plastic tubes using plastic straws, and subsequently pipetted onto cotton swabs (cotton saliva collection) and into clear sterile tubes (passive saliva collection). The melatonin levels were analyzed in duplicate using commercially available ELISA kits.
The mean melatonin concentration in cotton saliva collection samples was significantly lower than that in passive saliva collection samples at higher melatonin levels (>6 pg/mL). The Bland-Altman plot indicated that cotton swabs cause relative and proportional biases in the assay results. At lower melatonin levels (<6 pg/mL), although the Bland-Altman plots did not show proportional or relative biases, there was no significant correlation between passive and cotton saliva collection samples.
Our findings indicate an interference effect of cotton swabs on the assay result of salivary melatonin at lower melatonin level. Cotton-based collection devices might, thus, not be suitable for assessment of DLMO.
Validity of self-reported height and weight has not been adequately evaluated in diverse adolescent populations. In fact there are no reported validity studies conducted in Asian children and adolescents. This study aims to examine the accuracy of self-reported weight, height, and resultant BMI values in Chinese adolescents, and of the adolescents' subsequent classification into overweight categories.
Weight and height were self-reported and measured in 1761 adolescents aged 12-16 years in a cross-sectional survey in Xi'an city, China. BMI was calculated from both reported values and measured values. Bland-Altman plots with 95% limits of agreement, Pearson's correlation and Kappa statistics were calculated to assess the agreement.
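As a minimal sketch of the Bland-Altman calculation, the 95% limits of agreement are the mean of the paired differences plus and minus 1.96 times their standard deviation. The direction of the difference (self-reported minus measured), the function name and the toy values are assumptions made for illustration only.

```python
from statistics import mean, stdev


def limits_of_agreement(reported, measured, z=1.96):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    The difference is taken as reported - measured (an assumption; the
    source does not state the direction). stdev() is the sample standard
    deviation with an n - 1 denominator.
    """
    if len(reported) != len(measured):
        raise ValueError("inputs must be paired")
    diffs = [r - m for r, m in zip(reported, measured)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - z * sd, bias + z * sd


# Hypothetical weights in kg for four adolescents.
toy_reported = [50.0, 62.0, 55.0, 70.0]
toy_measured = [52.0, 61.0, 58.0, 73.0]
bias, lower, upper = limits_of_agreement(toy_reported, toy_measured)
```

A negative bias in this convention would mean that the self-reported values are lower, on average, than the measured ones.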
The 95% limits of agreement were -11.16 and 6.46 kg for weight, -4.73 and 7.45 cm for height, and -4.93 and 2.47 kg/m² for BMI. Pearson correlation between measured and self-reported values was 0.912 for weight, 0.935 for height and 0.809 for BMI. Weighted Kappa was 0.859 for weight, 0.906 for height and 0.754 for BMI. Sensitivity for detecting overweight (includes obese) in adolescents was 56.1%, and specificity was 98.6%. Subjects' area of residence, age and BMI were significant factors associated with the errors in self-reporting weight, height and relative BMI.
Reported weight and height do not have acceptable agreement with measured data. Therefore, we do not recommend the application of self-reported weight and height to screen for overweight adolescents in China. Alternatively, self-reported data could be considered for use, with caution, in surveillance systems and epidemiology studies.
A variety of instruments are used to measure health related quality of life. Few data exist on the performance and agreement of different instruments in a depressed population. The aim of this study was to investigate agreement between, and suitability of, the EQ-5D-3L, EQ-5D Visual Analogue Scale (EQ-5D VAS), SF-6D and SF-12 new algorithm for measuring health utility in depressed patients.
The intraclass correlation coefficient (ICC) and Bland and Altman approaches were used to assess agreement. Instrument sensitivity was analysed by: (1) plotting utility scores for the instruments against one another; (2) correlating utility scores and depressive symptoms (Beck Depression Inventory (BDI)); and (3) using Tukey’s procedure. Receiver Operating Characteristic (ROC) analysis assessed instrument responsiveness to change. Acceptability was assessed by comparing instrument completion rates.
The overall ICC was 0.57. Bland and Altman plots showed wide limits of agreement for each pairwise comparison, except between the SF-6D and SF-12 new algorithm. Plots of utility scores displayed 'ceiling effects' in the EQ-5D-3L index and 'floor effects' in the SF-6D and SF-12 new algorithm. All instruments showed a negative monotonic relationship with BDI, but the EQ-5D-3L index and EQ-5D VAS could not differentiate between depression severity sub-groups. The SF-based instruments were better able to detect changes in health state over time. There was no difference in completion rates of the four instruments.
There was a lack of agreement between utility scores generated by the different instruments. According to the criteria of sensitivity, responsiveness and acceptability that we applied, the SF-6D and SF-12 may be more suitable for the measurement of health related utility in a depressed population than the EQ-5D-3L, which is the instrument currently recommended by NICE.
Depression; EQ-5D; SF-6D; Health related utility; QALYs
OBJECTIVE: To validate a range of dietary assessment instruments in general practice. METHODS: Using a randomised block design, brief assessment instruments and more complex conventional dietary assessment tools were compared with an accepted "relative" standard, a seven day weighed dietary record. The standard was checked using biomarkers, and by performing test-retest reliability in additional subjects (n = 29). OUTCOMES: Agreement with weighed record. Percentage agreement with weighed record, rank correlation from scatter plot, rank correlation from Bland-Altman plot. Reliability of the weighed record. SETTING: Practice nurse treatment room in a single suburban general practice. SUBJECTS: Patients with risk factors for cardiovascular disease (n = 61) or age/sex stratified general population group (n = 50). RESULTS: Brief self completion dietary assessment tools based on food groups eaten during a week show reasonable agreement with the relative standard. For % energy from fat and saturated fat, non-starch polysaccharide, grams of fruit and vegetables and starchy foods consumed the range of agreement with the standard was: median % difference -6% to 12%, rank correlation 0.5 to 0.6. This agreement is of a similar order to the reliability of the weighed record, as good as or better than test standard agreement for more time consuming instruments, and compares favourably with research instruments validated in other settings. Under-reporting of energy intake was common (40%) and more likely if subjects were obese (body mass index (BMI) ≥ 30: 60% under-reported; BMI < 30: 29%; p < 0.001). CONCLUSION: Under-reporting of absolute energy intake is common, particularly among obese patients. Simple self assessment tools based on food groups, designed for practice nurse dietary assessment, show acceptable agreement with a standard, and suggest such tools are sufficiently accurate for clinical work, research, and possibly population dietary monitoring.
Patients experience an increasing treatment burden related to everything they do to take care of their health: visits to the doctor, medical tests, treatment management and lifestyle changes. This treatment burden could affect treatment adherence, quality of life and outcomes. We aimed to develop and validate an instrument for measuring treatment burden for patients with multiple chronic conditions.
Items were derived from a literature review and qualitative semistructured interviews with patients. The instrument was then validated in a sample of patients with chronic conditions recruited in hospitals and general practitioner clinics in France. Factor analysis was used to examine the questionnaire structure. Construct validity was studied by the relationships between the instrument's global score, the Treatment Satisfaction Questionnaire for Medication (TSQM) scores and the complexity of treatment as assessed by patients and physicians. Agreement between patients and physicians was appraised. Reliability was determined by a test-retest method.
A sample of 502 patients completed the Treatment Burden Questionnaire (TBQ), which consisted of 7 items (2 of which had 4 subitems) defined after 22 interviews with patients. The questionnaire showed a unidimensional structure. The Cronbach's α was 0.89. The instrument's global score was negatively correlated with TSQM scores (rs = -0.41 to -0.53) and positively correlated with the complexity of treatment (rs = 0.16 to 0.40). Agreement between patients and physicians (n = 396) was weak (intraclass correlation coefficient 0.38 (95% confidence interval 0.29 to 0.47)). Reliability of the retest (n = 211 patients) was 0.76 (0.67 to 0.83).
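For readers unfamiliar with the internal consistency statistic reported here, the following is a minimal sketch of Cronbach's α computed from item-level scores, using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of total scores). The function name and the toy responses are hypothetical; they are not TBQ data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances (n - 1 denominator) for both the individual
    items and the respondents' total scores.
    """
    k = len(scores[0])
    if k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least two items and equal-length rows")
    items = list(zip(*scores))  # one tuple of scores per item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses from four respondents to a three-item scale.
toy_scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4]]
alpha = cronbach_alpha(toy_scores)
```

Values close to 1 indicate that the items vary together across respondents, which is what a unidimensional scale is expected to show.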
This study provides the first valid and reliable instrument assessing the treatment burden for patients across any disease or treatment context. This instrument could help in the development of treatment strategies that are both efficient and acceptable for patients.
chronic disease/therapy; patient participation; physician-patient relations; quality of life; questionnaires; workload
The purpose of this study was to systematically compare methods for standardization of blood pressure levels obtained by ambulatory blood pressure monitoring (ABPM) in a group of 111 children studied at our institution.
Blood pressure indices, blood pressure loads and standard deviation scores were calculated using the original ABPM and the modified reference standards. Bland-Altman plots and kappa statistics for the level of agreement were generated.
Overall, the agreement between the two methods was excellent; however, approximately 5% of children were classified differently by one as compared with the other method.
Depending on which version of the German Working Group’s reference standards is used for interpretation of ABPM data, the classification of the individual as having hypertension or normal blood pressure may vary.
ambulatory blood pressure monitoring; blood pressure; hypertension; reference standards
Background. Reliable ICU severity scores have been achieved by various healthcare workers but nothing is known regarding the accuracy in real life of severity scores registered by untrained nurses. Methods. In this retrospective multicentre audit, three reviewers independently reassessed 120 SAPS II scores. Correlation and agreement of the sum-scores/variables among reviewers and between nurses and the reviewers' gold standard were assessed globally and for tertiles. Bland and Altman analysis (gold standard minus nurses) of sum scores and regression of the difference were determined. A logistic regression model identifying risk factors for erroneous assessments was calculated. Results. Correlation for sum scores among reviewers was almost perfect (mean ICC = 0.985). The mean (±SD) nurse-registered SAPS II sum score was 40.3 ± 20.2 versus 44.2 ± 24.9 of the gold standard (P < 0.002 for difference) with a lower ICC (0.81). The Bland and Altman analysis gave +3.8 ± 27.0 with a significant regression between the difference and the gold standard, indicating overall an overestimation (underestimation) of lower (higher; >32 points) scores. The lowest agreement was found in high SAPS II tertiles for haemodynamics (k = 0.45–0.51). Conclusions. In real life, nurse-registered SAPS II scores of very ill patients are inaccurate. Accuracy of scores was not associated with nurses' characteristics.
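The regression of the difference on the gold standard can be sketched as an ordinary least-squares line fitted to (gold standard minus nurse) against the gold standard score. This is only an illustration under that assumption; the function name and toy scores are hypothetical, and the significance test reported in the abstract is not reproduced here.

```python
from statistics import mean


def difference_regression(gold, nurse):
    """Fit difference = intercept + slope * gold, with difference = gold - nurse.

    Returns (slope, intercept). A positive slope with a negative intercept
    means low scores are overestimated and high scores underestimated by
    the nurses; the fitted line crosses zero at -intercept / slope.
    """
    if len(gold) != len(nurse):
        raise ValueError("inputs must be paired")
    diffs = [g - n for g, n in zip(gold, nurse)]
    g_bar, d_bar = mean(gold), mean(diffs)
    sxx = sum((g - g_bar) ** 2 for g in gold)
    sxy = sum((g - g_bar) * (d - d_bar) for g, d in zip(gold, diffs))
    slope = sxy / sxx
    intercept = d_bar - slope * g_bar
    return slope, intercept


# Hypothetical sum scores constructed to show a proportional bias.
toy_gold = [10.0, 20.0, 30.0, 40.0, 50.0]
toy_nurse = [14.0, 22.0, 30.0, 38.0, 46.0]
slope, intercept = difference_regression(toy_gold, toy_nurse)
crossover = -intercept / slope  # score at which the fitted difference is zero
```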
To test the validity and reliability of a tool specifically developed for the evaluation of appropriateness in rehabilitation facilities and to assess the prevalence of appropriateness of the days of stay.
The tool underwent a process of cross-cultural translation, content validity, and test-retest validity. Two hospital-based rehabilitation wards providing intensive rehabilitation care located in the Region of Calabria, Southern Italy, were randomly selected. A review of medical records on a random sample of patients aged 18 or more was performed.
The process of validation resulted in modifying some of the criteria used for the evaluation of appropriateness. Test-retest reliability showed that the agreement and the k statistic for the assessment of the appropriateness of days of stay were 93.4% and 0.82, respectively. A total of 371 patient days was reviewed, and 22.9% of the days of stay in the sample were judged to be inappropriate. The most frequently selected appropriateness criterion was the evaluation of patients by rehabilitation professionals for at least 3 hours on the index day (40.8%); moreover, the most frequent primary reason accounting for the inappropriate days of stay was social and/or family environment issues (34.1%).
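Percent agreement and the k statistic for a test-retest comparison can be computed as in the sketch below, which assumes an unweighted Cohen's kappa on two ratings of the same patient days. The function name, the category labels and the toy ratings are hypothetical.

```python
from collections import Counter


def percent_agreement_and_kappa(first, second):
    """Return (observed agreement, Cohen's kappa) for two paired ratings.

    kappa = (observed - expected) / (1 - expected), where expected is the
    agreement anticipated by chance from each rating's marginal frequencies.
    """
    if len(first) != len(second):
        raise ValueError("ratings must be paired")
    n = len(first)
    observed = sum(1 for a, b in zip(first, second) if a == b) / n
    counts_first, counts_second = Counter(first), Counter(second)
    categories = set(counts_first) | set(counts_second)
    expected = sum(counts_first[c] * counts_second[c] for c in categories) / n ** 2
    return observed, (observed - expected) / (1 - expected)


# Hypothetical ratings of ten patient days: "A" appropriate, "I" inappropriate.
toy_test = ["A", "A", "A", "I", "I", "A", "A", "I", "A", "A"]
toy_retest = ["A", "A", "A", "I", "A", "A", "A", "I", "A", "A"]
agreement, kappa = percent_agreement_and_kappa(toy_test, toy_retest)
```

Kappa is lower than raw agreement because part of the observed agreement would be expected by chance when one category dominates.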
The findings showed that the tool is reliable and has adequate validity to measure the extent of appropriateness of days of stay in rehabilitation facilities, and that the prevalence of inappropriateness is limited in the investigated settings. Further research is needed to expand appropriateness evaluation to other rehabilitation settings, and to investigate more thoroughly internal and external causes of inappropriate use of rehabilitation services.
Cardiac output (CO) and systemic vascular resistance (SVR) are two important parameters of the cardiovascular system. The ability to measure these parameters continuously and noninvasively may assist in diagnosing and monitoring patients with suspected cardiovascular diseases, or other critical illnesses. In this study, a method is proposed to estimate both the CO and SVR of a heterogeneous cohort of intensive care unit patients (N=48).
Spectral and morphological features were extracted from the finger photoplethysmogram, and added to heart rate and mean arterial pressure as input features to a multivariate regression model to estimate CO and SVR. A stepwise feature search algorithm was employed to select statistically significant features. Leave-one-out cross validation was used to assess the generalized model performance. The degree of agreement between the estimation method and the gold standard was assessed using Bland-Altman analysis.
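Leave-one-out cross validation can be illustrated with a deliberately simplified sketch. It swaps the study's multivariate regression with stepwise feature selection for a single-feature least-squares line, so it shows only the resampling scheme; the function names and toy data are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def loocv_predictions(x, y):
    """Predict each observation from a model fitted to all the others.

    x, y: lists of equal length (at least three points with distinct x).
    Returns the held-out predictions in the original order.
    """
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]  # drop observation i from the fit
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        predictions.append(intercept + slope * x[i])
    return predictions


# Hypothetical feature values and reference measurements.
toy_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_reference = [3.1, 4.9, 7.2, 8.8, 11.0]
held_out = loocv_predictions(toy_feature, toy_reference)
```

The held-out predictions, rather than in-sample fits, are what would then be compared with the reference values in a Bland-Altman analysis.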
The Bland-Altman bias ± precision (1.96 times the standard deviation) for CO was -0.01 ± 2.70 L·min⁻¹ when only photoplethysmogram (PPG) features were used, and for SVR was -0.87 ± 412 dyn·s·cm⁻⁵ when only one PPG variability feature was used.
These promising results indicate the feasibility of using the method described as a non-invasive preliminary diagnostic tool in supervised or unsupervised clinical settings.
Cardiac output; Systemic vascular resistance; Photoplethysmography; Power spectrum analysis; Photoplethysmogram variability; Photoplethysmogram morphology; Feature selection
Physical activity self-report instruments in the US have largely been developed for and validated in White samples. Despite calls to validate existing instruments in more diverse samples, relatively few instruments have been validated in US Blacks. Emerging evidence suggests that these instruments may have differential validity in Black populations.
This report reviews and evaluates the validity and reliability of self-reported measures of physical activity in Blacks and makes recommendations for future directions.
A systematic literature review was conducted to identify published reports with construct or criterion validity evaluated in samples that included Blacks. Studies that reported results separately for Blacks were examined.
The review identified 10 instruments validated in nine manuscripts. Criterion validity correlations tended to be low to moderate. No study has compared the validity of multiple instruments in a single sample of Blacks.
There is a need for efforts validating self-report physical activity instruments in Blacks, particularly those evaluating the relative validity of instruments in a single sample.
To evaluate the accuracy of the swallowing kinematic analysis.
To evaluate the accuracy at various velocities of movement, we developed an instrumental model of linear and rotational movement, representing the physiologic movement of the hyoid and epiglottis, respectively. As a basic screening, a still image of 8 objects was used to measure the lengths of the objects; in addition, 18 movie files of the instrumental model moving at different velocities were recorded by videofluoroscopy. The images and movie files were digitized and analyzed by an experienced examiner, who was blinded to the study.
The Pearson correlation coefficients between the measured and instrumental reference values were over 0.99 (p<0.001) for all of the analyses. Bland-Altman plots showed narrow ranges of the 95% confidence interval of agreement between the measured and reference values as follows: 0.14 to 0.94 mm for distances in a still image, -0.14 to 1.09 mm/s for linear velocities, and -1.02 to 3.81 degree/s for angular velocities.
Our findings demonstrate that the distance and velocity measurements obtained by swallowing kinematic analysis are highly valid in a wide range of movement velocity.
Reproducibility of results; Biomechanics; Deglutition
Psychological distress is common among medical students but manifests in a variety of forms. Currently, no brief, practical tool exists to simultaneously evaluate these domains of distress among medical students. The authors describe the development of a subject-reported assessment (Medical Student Well-Being Index, MSWBI) intended to screen for medical student distress across a variety of domains and examine its preliminary psychometric properties.
Relevant domains of distress were identified, items generated, and a screening instrument formed using a process of literature review, nominal group technique, input from deans and medical students, and correlation analysis from previously administered assessments. Eleven experts judged the clarity, relevance, and representativeness of the items. A Content Validity Index (CVI) was calculated. Interrater agreement was assessed using pair-wise percent agreement adjusted for chance agreement. Data from 2248 medical students who completed the MSWBI along with validated full-length instruments assessing domains of interest was used to calculate reliability and explore internal structure validity.
Burnout (emotional exhaustion and depersonalization), depression, mental quality of life (QOL), physical QOL, stress, and fatigue were domains identified for inclusion in the MSWBI. Six of 7 items received item CVI-relevance and CVI-representativeness of ≥0.82. Overall scale CVI-relevance and CVI-representativeness was 0.94 and 0.91. Overall pair-wise percent agreement between raters was ≥85% for clarity, relevance, and representativeness. Cronbach's alpha was 0.68. Item by item percent pair-wise agreements and Phi were low, suggesting little overlap between items. The majority of MSWBI items had a ≥74% sensitivity and specificity for detecting distress within the intended domain.
The results of this study provide evidence of reliability and content-related validity of the MSWBI. Further research is needed to assess remaining psychometric properties and establish scores for which intervention is warranted.
Neuropathology centers are expected to offer a prompt and accurate intraoperative diagnosis regarding tumor/lesion type and grade on fresh unfixed tissue. The level of diagnostic accuracy according to type and grade, and the experience at a new center, have not been reported before.
The aim of this study is to review the agreement patterns according to tumor/lesion type and grade between intraoperative and final histopathologic diagnosis in central nervous system (CNS) lesion samples received by a newly established neuropathology center at a tertiary care neuropsychiatric hospital.
Materials and Methods:
Agreement between intraoperative and final histopathologic diagnosis was classified as: (I) Grade in agreement but type not in agreement; (II) grade not in agreement but type in agreement; (III) grade and type both not in agreement; (IV) grade and type both in agreement.
Confidence interval (CI) of agreements was calculated for various categories of neoplastic as well as non-neoplastic lesions. CI was also calculated for groups where n × p and n × (1 − p) were more than 5, i.e., fulfilled the requirement of the central limit theorem.
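The confidence interval rule described above can be sketched as follows. The abstract does not name the interval method, so the normal-approximation (Wald) interval used here is an assumption, as are the function name and the choice to return `None` when the n × p and n × (1 − p) condition is not met.

```python
from math import sqrt


def wald_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper), or None when n * p or n * (1 - p) is not
    greater than 5, in which case the approximation is not applied.
    """
    p = successes / n
    if n * p <= 5 or n * (1 - p) <= 5:
        return None
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Counts of agreement taken from the results for illustration only;
# the resulting intervals are not reported in the abstract.
neoplastic_ci = wald_ci(237, 284)     # both conditions exceed 5
non_neoplastic_ci = wald_ci(46, 49)   # n * (1 - p) = 3, so no interval
```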
On retrospective analysis of 333 cases, 284 (85.3%) cases were categorized as neoplastic while 49 (14.7%) cases were categorized as non-neoplastic. Among the neoplastic lesions, agreement was seen in 237 (83.5%) cases while 47 (16.5%) cases showed disagreement. Similarly, in the non-neoplastic category, 46 (93.9%) cases showed agreement while 3 (6.1%) cases showed disagreement. Of the non-neoplastic lesions, one case fell into the agreement category I, 2 in category III and 46 in IV. Among neoplastic lesions, there were 21 cases in agreement category I, 17 in II, 9 in III and 237 in IV. On analyzing the accuracy of intraoperative reporting according to tumor type, the breakdown was as follows. Astrocytic: 2 (I), 16 (II), 2 (III), 86 (IV); oligodendroglial: 8 (I), 1 (II); ependymal: 2 (III), 6 (IV); embryonal: 23 (IV); cranial and spinal nerve tumors: 2 (II), 21 (IV); choroid plexus tumors: 4 (IV); meningeal tumors: 3 (I), 1 (III), 49 (IV); metastatic tumors: 3 (I), 17 (IV); cysts (tumor-like conditions): 14 (IV); neuronal and mixed neuronal glial tumors: 1 (III); malignant lymphoma: 1 (III); sellar tumors: 17 (IV); and mixed gliomas: 5 (I).
This study identifies problem areas of CNS intraoperative reporting, in a new center, with reference to tumor typing and grading. It may forewarn upcoming centers of neuropathology about the potential problem areas of intraoperative reporting.
Central nervous system lesions; intraoperative reporting; new center
Background and Aims
Clinical management of polyps discovered by computed tomographic (CT) colonography depends on polyp size. However, size measured by CT colonography is an estimate, and its agreement with other measures is not well characterized. We hypothesized that size measurement by CT colonography varies substantially compared to measurement by other methods.
We performed a secondary data analysis of a multicenter study of CT colonography in comparison to colonoscopy. Polyp size was determined by CT colonography, at colonoscopy, and by pre-fixation measurement with a ruler. Agreement was assessed using descriptive statistics and Bland-Altman methodology.
600 trial participants completed both tests. 95% limits of agreement indicated that estimates of size by CT colonography ranged from 52% lower to 64% higher than pre-fixation polyp size estimates. 95% limits of agreement stratified by categories of clinical importance indicated that estimates of size by CT colonography ranged from 44% lower to 84% higher for polyps ≤0.6cm, from 44% lower to 44% higher for polyps 0.6 to 0.9cm, and from 48% lower to 22% higher for polyps ≥0.9cm compared with pre-fixation estimates. Analysis of participants with one identified polyp in the same colon segment demonstrated that categorization based on CT colonography measurement (i.e., <0.6cm, 0.6 to 0.9cm, or >0.9cm) differed from pre-fixation measurement for 43% of participants.
Polyp size estimation by CT colonography varies from pre-fixation and colonoscopic measures of size. Future studies should clarify whether size estimation by CT colonography is sufficiently reliable as a primary factor to guide clinical management.
Few assessment instruments have examined the nutrition and physical activity environments in child care, and none are self-administered. Given the emerging focus on child care settings as a target for intervention, a valid and reliable measure of the nutrition and physical activity environment is needed.
To measure inter-rater reliability, 59 child care center directors and 109 staff completed the self-assessment concurrently, but independently. Three weeks later, a repeat self-assessment was completed by a sub-sample of 38 directors to assess test-retest reliability. To assess criterion validity, a researcher-administered environmental assessment was conducted at 69 centers and was compared to a self-assessment completed by the director. A weighted kappa test statistic and percent agreement were calculated to assess agreement for each question on the self-assessment.
For inter-rater reliability, kappa statistics ranged from 0.20 to 1.00 across all questions. Test-retest reliability of the self-assessment yielded kappa statistics that ranged from 0.07 to 1.00. The inter-quartile kappa statistic ranges for inter-rater and test-retest reliability were 0.45 to 0.63 and 0.27 to 0.45, respectively. When percent agreement was calculated, questions ranged from 52.6% to 100% for inter-rater reliability and 34.3% to 100% for test-retest reliability. Kappa statistics for validity ranged from -0.01 to 0.79, with an inter-quartile range of 0.08 to 0.34. Percent agreement for validity ranged from 12.9% to 93.7%.
This study provides estimates of criterion validity, inter-rater reliability and test-retest reliability for an environmental nutrition and physical activity self-assessment instrument for child care. Results indicate that the self-assessment is a stable and reasonably accurate instrument for use with child care interventions. We therefore recommend the Nutrition and Physical Activity Self-Assessment for Child Care (NAP SACC) instrument to researchers and practitioners interested in conducting healthy weight interventions in child care. However, a more robust, less subjective measure would be more appropriate for researchers seeking an outcome measure to assess intervention impact.
The primary objective was to systematically review the medical literature for instruments validated for use in epidemiological and clinical research on waterpipe smoking.
We searched the following databases: MEDLINE, EMBASE, and ISI Web of Science. We selected studies using a two-stage duplicate and independent screening process. We included papers reporting on the development and/or validation of survey instruments to measure waterpipe tobacco consumption or related concepts. Two reviewers, working in duplicate and independently, used a standardized and pilot-tested data abstraction form to collect data from each eligible study. We also determined the percentage of observational studies assessing the health effects of waterpipe tobacco smoking and the percentage of studies of prevalence of waterpipe tobacco smoking that have used validated survey instruments.
We identified a total of five survey instruments. One instrument was designed to measure knowledge, attitudes, and waterpipe use among pregnant women and was shown to have internal consistency and content validity. Three instruments were designed to measure waterpipe tobacco consumption, two of which were reported to have face validity. The fifth instrument was designed to measure waterpipe dependence and was rigorously developed and validated. One of the studies of prevalence and none of the studies of health effects of waterpipe smoking used validated instruments.
A number of instruments for measuring the use of and dependence on waterpipe smoking exist. Future research should study content validity and cross cultural adaptation of these instruments.
Objective(s): Reliability measures precision, or the extent to which test results can be replicated. This is the first systematic review to identify the statistical methods used to measure the reliability of equipment measuring continuous variables. This study also aims to highlight inappropriate statistical methods used in reliability analysis and their implications for medical practice.
Materials and Methods: In 2010, five electronic databases were searched for reliability studies published between 2007 and 2009. A total of 5,795 titles were initially identified; 282 were potentially relevant, and 42 ultimately met the inclusion criteria.
Results: The Intra-class Correlation Coefficient (ICC) was the most popular method, used in 25 (60%) studies, followed by comparison of means (8 studies, 19%). Of the 25 studies using the ICC, only 7 (28%) reported the confidence intervals and the type of ICC used. Most studies (71%) also tested the agreement of instruments.
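Because the type of ICC matters for interpretation, a small worked example may help. The sketch below computes one specific form, the one-way random-effects ICC(1,1), from a subjects-by-repeats table; this choice is an assumption made for illustration, since the review does not prescribe a particular form, and the readings are invented.

```python
def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of rows, one per subject, each holding the same
    number of repeated measurements from the instrument.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # repeats per subject
    grand = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    # Mean square between subjects and mean square within subjects.
    ms_between = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(scores, subject_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical duplicate readings on three subjects.
readings = [[9, 10], [20, 22], [31, 30]]
print(icc_oneway(readings))
```

Other forms, such as two-way models with or without a consistency definition, can give different values on the same data, so a reported ICC is hard to interpret without its type and confidence interval.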
Conclusion: This study finds that the Intra-class Correlation Coefficient is the most popular method used to assess the reliability of medical instruments measuring continuous outcomes. There are also inappropriate applications and interpretations of statistical methods in some studies. It is important for medical researchers to be aware of this issue, and be able to correctly perform analysis in reliability studies.
ICC; Intra-class correlation coefficient; Reliability; Statistical method; Validation study
The SLICC Damage Index (SDI) is a validated instrument for assessing organ damage in systemic lupus erythematosus (SLE). Trained physicians must complete it, limiting utility where this is impossible.
We developed and pilot-tested a self-assessed organ damage instrument, the Lupus Damage Index Questionnaire (LDIQ), in 37 SLE subjects and 7 physicians. After refinement, 569 English-speaking SLE subjects and 14 rheumatologists from 11 international SLE clinics participated in validation. Subjects and physicians completed instruments separately. We calculated sensitivity, specificity, Spearman correlations and agreement, using the SDI as gold standard. 605 SLE participants in the community-based National Data Bank for Rheumatic Diseases (NDB) study completed the LDIQ and we assessed correlations with outcome and disability measures.
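The item-level accuracy measures mentioned above can be sketched as follows, treating the physician-completed item as the reference standard. The function name and the `sdi_item` and `ldiq_item` values are hypothetical and serve only to show the arithmetic.

```python
def diagnostic_accuracy(reference, test):
    """Sensitivity, specificity and raw agreement for one binary item.

    `reference` holds the gold-standard result (1 = damage present),
    `test` holds the self-reported result for the same subjects.
    """
    pairs = list(zip(reference, test))
    tp = sum(r == 1 and t == 1 for r, t in pairs)
    fn = sum(r == 1 and t == 0 for r, t in pairs)
    tn = sum(r == 0 and t == 0 for r, t in pairs)
    fp = sum(r == 0 and t == 1 for r, t in pairs)
    return {
        "sensitivity": tp / (tp + fn),   # damaged subjects correctly flagged
        "specificity": tn / (tn + fp),   # undamaged subjects correctly cleared
        "agreement": (tp + tn) / len(pairs),
    }


# Hypothetical data for a single damage item in eight subjects.
sdi_item = [1, 1, 1, 0, 0, 0, 0, 1]
ldiq_item = [1, 1, 0, 0, 0, 1, 0, 1]
print(diagnostic_accuracy(sdi_item, ldiq_item))
```

When an item is rare, sensitivity rests on very few reference-positive subjects, so its estimate is unstable; the sketch would also divide by zero if no subject were positive (or negative) on the reference.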
Mean LDIQ score was 3.3 (0-16) and mean SDI score was 1.5 (0-9). LDIQ had a moderately high correlation with SDI (Spearman r=0.50, p<0.001). Specificities of individual LDIQ items were >80%, except for neuropathy. Sensitivities were variable and lowest for damage with <1% prevalence. Agreement between SDI and LDIQ was > 85% for all but neuropathy, reduced renal function, deforming arthritis and alopecia. In the NDB, LDIQ correlated well with comorbidity index (r=0.45), SF-36 physical component scale (0.43), Medical Research Council dyspnea scale (0.40), disability (0.37) and SLE Activity Questionnaire score (0.37).
The LDIQ’s metric properties are good compared to the SDI. It has construct validity and correlations with health assessments similar to the SDI. The LDIQ should allow expansion of SLE research. Its ultimate value will be determined in longitudinal studies.
systemic lupus erythematosus; questionnaire; damage; SLICC damage index; validation; self-assessed
Background and Scope
Significant progress has been made over the past two decades in the development of screening and diagnostic instruments for autism spectrum disorders (ASD). This article reviews this progress, including recent innovations, focussing on those instruments for which the strongest research data on validity exists, and then turns to addressing issues arising from their use in clinical settings.
Research studies have evaluated the ability of screens to prospectively identify cases of ASD in population-based and clinically-referred samples, as well as the accuracy of diagnostic instruments to map onto ‘gold standard’ clinical best estimate diagnosis. However, extension of the findings to clinical services must be done with caution, with a full understanding that instrument properties are sample-specific. Furthermore, we are limited by the lack of a true test for ASD, which remains a behaviourally-defined disorder. In addition, screening and diagnostic instruments help clinicians least in the cases where they are most in want of direction, since their accuracy will always be lower for marginal cases.
Instruments help clinicians to collect detailed, structured information and increase accuracy and reliability of referral for in-depth assessment and recommendations for support, but further research is needed to refine their effective use in clinical settings.
autism spectrum disorder (ASD); screening; diagnosis; sensitivity; specificity; predictive value
To estimate agreement among scores on three common assessments of cognitive function.
Baseline responses on the Alzheimer's Disease Assessment Scale – Cognitive, Clinical Dementia Rating, and the Mini-Mental State Examination were obtained from two clinical trials (n = 138 and n = 351). A graphical method of examining agreement, the means-difference or Bland-Altman plot, was followed by Levene's test of the equality of variance corrected for multiple comparisons within each sample.
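The two-step approach described above can be sketched roughly as follows. The code assumes that the two scores being compared have already been placed on a common scale, which the abstract does not detail, and it computes only the Levene W statistic (the mean-centred form) without a p-value or multiplicity correction. All data values are invented.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return pairwise means, differences, bias and 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def levene_w(*groups):
    """Levene's W: one-way ANOVA F on absolute deviations from group means."""
    z = [[abs(v - mean(g)) for v in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    grand = mean([v for g in z for v in g])
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in z)
    within = sum((v - mean(g)) ** 2 for g in z for v in g)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical standardised scores from two instruments.
test_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
test_b = [1.1, 1.9, 3.2, 3.9, 5.8, 5.1, 8.2, 6.9]
means, diffs, bias, limits = bland_altman(test_a, test_b)

# Split the differences by below- versus above-average impairment level.
cut = mean(means)
low = [d for m, d in zip(means, diffs) if m < cut]
high = [d for m, d in zip(means, diffs) if m >= cut]
print(bias, limits, levene_w(low, high))
```

A larger W indicates that the spread of between-test differences differs between the two impairment groups, that is, that agreement depends on impairment level. For a p-value, `scipy.stats.levene` offers a tested implementation; note that its default centring is the median rather than the mean.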
70–78% of variability was shared by one factor, suggesting that all three instruments reflect cognitive impairment. However, agreement among tests was significantly worse for individuals with greater-than-average, relative to individuals with less-than-average, cognitive impairment.
Worse agreement between tests, as a function of increasing cognitive impairment, implies that interpretation of these tests and selection of coprimary cognitive impairment outcomes may depend on impairment level.
Alzheimer's disease; Dementia; Outcomes assessment methods
To study the relationship between waist circumference (WC) and bioelectrical impedance analysis (BIA), and the degree of agreement between anthropometric index (AI) and BIA, using BIA as a reference or ‘gold standard’. The second objective is to study the relationship between body mass index (BMI) and BIA in subjects with spinal cord injury (SCI).
Comparative cross-sectional study.
Convenience sample at outpatient clinic of spinal cord center.
Estimation of obesity was made in 23 men with motor complete paraplegia (>1 year post-injury). Bland and Altman statistics were used to define level of agreement between AI and BIA, Pearson's r to describe correlation between WC and BIA and BMI and BIA.
Agreement between BIA and AI was good, with a small systematic difference in fat mass (FM) (mean difference: −0.28%, Pearson's r: 0.91). The correlation between WC and BIA (% FM) was very high (Pearson's r: 0.83). The correlation between BMI and BIA (% FM) was just over moderate (Pearson's r: 0.51).
AI seems to be a valid proxy measure to estimate obesity in males living with SCI. Measurement of obesity in persons with SCI based on WC is promising. BMI proved not to be valid for estimating obesity in persons with SCI.
Spinal cord injuries; Obesity; Anthropometrics; Body mass index; Bioelectrical impedance analysis