Assessing agreement in method comparison studies depends on two fundamentally important components: validity (the between method agreement) and reproducibility (the within method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in medical literature for assessing between method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to the lack of flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data.
Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates).
We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented: (i) assuming static underlying values analysed using both multivariate hierarchical Bayesian models, and (ii) assuming each subject's underlying value is a continually changing quantity and analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model.
These easily implemented models allow for full parameter uncertainty, simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses is also presented, provided in the freely available and currently cost free software package WinBUGS.
Empathy is frequently cited as an important attribute in physicians and some groups have expressed a desire to measure empathy either at selection for medical school or during medical (or postgraduate) training. In order to do this, a reliable and valid test of empathy is required. The purpose of this systematic review is to determine the reliability and validity of existing tests for the assessment of medical empathy.
A systematic review of research papers relating to the reliability and validity of tests of empathy in medical students and doctors. Journal databases (Medline, EMBASE, and PsycINFO) were searched for English-language articles relating to the assessment of empathy and related constructs in applicants to medical school, medical students, and doctors.
From 1147 citations, we identified 50 relevant papers describing 36 different instruments of empathy measurement. As some papers assessed more than one instrument, there were 59 instrument assessments. 20 of these involved only medical students, 30 involved only practising clinicians, and three involved only medical school applicants. Four assessments involved both medical students and practising clinicians, and two studies involved both medical school applicants and students.
Eight instruments demonstrated evidence of reliability, internal consistency, and validity. Of these, six were self-rated measures, one was a patient-rated measure, and one was an observer-rated measure.
A number of empathy measures available have been psychometrically assessed for research use among medical students and practising medical doctors. No empathy measures were found with sufficient evidence of predictive validity for use as selection measures for medical school. However, measures with a sufficient evidential base to support their use as tools for investigating the role of empathy in medical training and clinical care are available.
An ingestible telemetric temperature sensor for measuring body core temperature (Tc) was first described 45 years ago, although the method has only recently gained widespread use for exercise applications. This review aims to (1) use Bland and Altman's limits of agreement (LoA) method as a basis for quantitatively reviewing the agreement between intestinal sensor temperature (Tintestinal), oesophageal temperature (Toesophageal) and rectal temperature (Trectal) across numerous previously published validation studies; (2) review factors that may affect agreement; and (3) review the application of this technology in field‐based exercise studies. The agreement between Tintestinal and Toesophageal is suggested to meet our delimitation for an acceptable level of agreement (ie, systematic bias <0.1°C and 95% LoA within ±0.4°C). The agreement between Tintestinal and Trectal shows a significant systematic bias >0.1°C, although the 95% LoA is acceptable. Tintestinal responds less rapidly than Toesophageal at the start or cessation of exercise or to a change in exercise intensity, but more rapidly than Trectal. When using this technology, care should be taken to ensure adequate control over sensor calibration and data correction, timing of ingestion and electromagnetic interference. The ingestible sensor has been applied successfully in numerous sport and occupational applications such as the continuous measurement of Tc in deep sea saturation divers, distance runners and soldiers undertaking sustained military training exercises. It is concluded that the ingestible telemetric temperature sensor represents a valid index of Tc and shows excellent utility for ambulatory field‐based applications.
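The acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the review's own code: the function names and the example temperatures are hypothetical, and reading "95% LoA within ±0.4°C" as both limits lying within ±0.4°C of zero is an assumption.

```python
import statistics


def limits_of_agreement(method_a, method_b):
    """Return (bias, lower, upper) for paired readings from two methods.

    bias is the mean of the paired differences (a - b); the 95% limits
    of agreement are bias +/- 1.96 * SD of those differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def meets_criterion(bias, lower, upper, max_bias=0.1, max_loa=0.4):
    """Assumed reading of the rule: |bias| < 0.1 and both limits within +/-0.4."""
    return abs(bias) < max_bias and lower >= -max_loa and upper <= max_loa


# Hypothetical paired temperatures in degrees C (illustrative only).
t_intestinal = [37.0, 37.5, 38.0, 38.4]
t_oesophageal = [37.1, 37.4, 38.0, 38.5]

bias, lower, upper = limits_of_agreement(t_intestinal, t_oesophageal)
print(f"bias={bias:.3f}, LoA=({lower:.3f}, {upper:.3f})")
print("acceptable:", meets_criterion(bias, lower, upper))
```

With real validation data the paired readings would come from simultaneous measurements in the same subjects; repeated measurements per subject would need a method that accounts for within-subject correlation.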
OBJECTIVE: To validate a range of dietary assessment instruments in general practice. METHODS: Using a randomised block design, brief assessment instruments and more complex conventional dietary assessment tools were compared with an accepted "relative" standard, a seven day weighed dietary record. The standard was checked using biomarkers, and by performing test-retest reliability in additional subjects (n = 29). OUTCOMES: Agreement with weighed record. Percentage agreement with weighed record, rank correlation from scatter plot, rank correlation from Bland-Altman plot. Reliability of the weighed record. SETTING: Practice nurse treatment room in a single suburban general practice. SUBJECTS: Patients with risk factors for cardiovascular disease (n = 61) or age/sex stratified general population group (n = 50). RESULTS: Brief self completion dietary assessment tools based on food groups eaten during a week show reasonable agreement with the relative standard. For % energy from fat and saturated fat, non-starch polysaccharide, grams of fruit and vegetables and starchy foods consumed, the range of agreement with the standard was: median % difference -6% to 12%, rank correlation 0.5 to 0.6. This agreement is of a similar order to the reliability of the weighed record, as good as or better than test standard agreement for more time consuming instruments, and compares favourably with research instruments validated in other settings. Under-reporting of energy intake was common (40%) and more likely if subjects were obese (body mass index (BMI) ≥ 30: 60% under-reported; BMI < 30: 29%; p < 0.001). CONCLUSION: Under-reporting of absolute energy intake is common, particularly among obese patients. Simple self assessment tools based on food groups, designed for practice nurse dietary assessment, show acceptable agreement with a standard, and suggest such tools are sufficiently accurate for clinical work, research, and possibly population dietary monitoring.
Few commercially available brands of actigraphs (ACT) have been subjected to rigorous validation with infant participants. The purpose of this study was to examine the agreement between concurrent polysomnography (PSG) and one brand of ACT (AW-64, Mitter Co. Inc.) using appropriate statistical techniques among a sample of healthy infants.
Twenty-two healthy infants (14.1 ± 0.6 months) had one night of ankle ACT recording during research PSG at Kosair Children's Hospital Sleep Research Center in Louisville, Kentucky. Macroanalyses were conducted using the Bland-Altman concordance technique to assess agreement between total sleep time (TST) and wake after sleep onset (WASO) simultaneously measured by PSG and ACT, using two ACT algorithm settings. Microanalyses were also calculated to examine sensitivity, specificity, and accuracy of ACT within each PSG-identified sleep state. Correlations were calculated between PSG-identified arousals and the discrepancies between ACT and PSG.
The Bland-Altman concordance technique revealed that ACT underestimated TST by 72.25 (SD = 61.48) minutes and by ≥ 60 minutes among 54.55% of infants. Furthermore, ACT overestimated WASO by 13.85 (SD = 30.94) minutes and by ≥ 30 minutes among 40.91% of infants. Sensitivity, specificity, and accuracy analyses revealed that ACT adequately identified sleep, but poorly identified wake. PSG and ACT discrepancies were positively associated with PSG-identified arousals (r = .45).
Improved device and/or software development is needed before the AW-64 can be considered a valid method for identifying infant sleep and wake.
actigraphy; polysomnography; infant; validation; Bland-Altman
Background. Reliable ICU severity scores have been achieved by various healthcare workers but nothing is known regarding the accuracy in real life of severity scores registered by untrained nurses. Methods. In this retrospective multicentre audit, three reviewers independently reassessed 120 SAPS II scores. Correlation and agreement of the sum-scores/variables among reviewers and between nurses and the reviewers' gold standard were assessed globally and for tertiles. Bland and Altman (gold standard—nurses) of sum scores and regression of the difference were determined. A logistic regression model identifying risk factors for erroneous assessments was calculated. Results. Correlation for sum scores among reviewers was almost perfect (mean ICC = 0.985). The mean (±SD) nurse-registered SAPS II sum score was 40.3 ± 20.2 versus 44.2 ± 24.9 of the gold standard (P < 0.002 for difference) with a lower ICC (0.81). Bland and Altman assay was +3.8 ± 27.0 with a significant regression between the difference and the gold standard, indicating overall an overestimation (underestimation) of lower (higher; >32 points) scores. The lowest agreement was found in high SAPS II tertiles for haemodynamics (k = 0.45–0.51). Conclusions. In real life, nurse-registered SAPS II scores of very ill patients are inaccurate. Accuracy of scores was not associated with nurses' characteristics.
Although various acceptable and easy-to-use devices have been used for saliva collection, cotton swabs are among the most common ones. Previous studies reported that cotton swabs yield a lower level of melatonin detection. However, this statistical method is not adequate for detecting an agreement between cotton saliva collection and passive saliva collection, and a test for bias is needed. Furthermore, the effects of cotton swabs have not been examined at lower melatonin level, a level at which melatonin is used for assessment of circadian rhythms, namely dim light melatonin onset (DLMO). In the present study, we estimated the effect of cotton swabs on the results of salivary melatonin assay using the Bland-Altman plot at lower level.
Nine healthy males were recruited and each provided four saliva samples on a single day to yield a total of 36 samples. Saliva samples were directly collected in plastic tubes using plastic straws, and subsequently pipetted onto cotton swabs (cotton saliva collection) and into clear sterile tubes (passive saliva collection). The melatonin levels were analyzed in duplicate using commercially available ELISA kits.
The mean melatonin concentration in cotton saliva collection samples was significantly lower than that in passive saliva collection samples at higher melatonin level (>6 pg/mL). The Bland-Altman plot indicated that cotton swabs cause relative and proportional biases in the assay results. For lower melatonin level (<6 pg/mL), although the Bland-Altman plots did not show proportional and relative biases, there was no significant correlation between passive and cotton saliva collection samples.
Our findings indicate an interference effect of cotton swabs on the assay result of salivary melatonin at lower melatonin level. Cotton-based collection devices might, thus, not be suitable for assessment of DLMO.
To test the validity and reliability of a tool specifically developed for the evaluation of appropriateness in rehabilitation facilities and to assess the prevalence of appropriateness of the days of stay.
The tool underwent a process of cross-cultural translation, content validity, and test-retest validity. Two hospital-based rehabilitation wards providing intensive rehabilitation care located in the Region of Calabria, Southern Italy, were randomly selected. A review of medical records on a random sample of patients aged 18 or more was performed.
The process of validation resulted in modifying some of the criteria used for the evaluation of appropriateness. Test-retest reliability showed that the agreement and the k statistic for the assessment of the appropriateness of days of stay were 93.4% and 0.82, respectively. A total of 371 patient days was reviewed, and 22.9% of the days of stay in the sample were judged to be inappropriate. The most frequently selected appropriateness criterion was the evaluation of patients by rehabilitation professionals for at least 3 hours on the index day (40.8%); moreover, the most frequent primary reason accounting for the inappropriate days of stay was social and/or family environment issues (34.1%).
The findings showed that the tool used is reliable and has adequate validity to measure the extent of appropriateness of days of stay in rehabilitation facilities and that the prevalence of inappropriateness is contained in the investigated settings. Further research is needed to expand appropriateness evaluation to other rehabilitation settings, and to investigate more thoroughly internal and external causes of inappropriate use of rehabilitation services.
Psychological distress is common among medical students but manifests in a variety of forms. Currently, no brief, practical tool exists to simultaneously evaluate these domains of distress among medical students. The authors describe the development of a subject-reported assessment (Medical Student Well-Being Index, MSWBI) intended to screen for medical student distress across a variety of domains and examine its preliminary psychometric properties.
Relevant domains of distress were identified, items generated, and a screening instrument formed using a process of literature review, nominal group technique, input from deans and medical students, and correlation analysis from previously administered assessments. Eleven experts judged the clarity, relevance, and representativeness of the items. A Content Validity Index (CVI) was calculated. Interrater agreement was assessed using pair-wise percent agreement adjusted for chance agreement. Data from 2248 medical students who completed the MSWBI along with validated full-length instruments assessing domains of interest was used to calculate reliability and explore internal structure validity.
Burnout (emotional exhaustion and depersonalization), depression, mental quality of life (QOL), physical QOL, stress, and fatigue were domains identified for inclusion in the MSWBI. Six of 7 items received item CVI-relevance and CVI-representativeness of ≥0.82. Overall scale CVI-relevance and CVI-representativeness was 0.94 and 0.91. Overall pair-wise percent agreement between raters was ≥85% for clarity, relevance, and representativeness. Cronbach's alpha was 0.68. Item by item percent pair-wise agreements and Phi were low, suggesting little overlap between items. The majority of MSWBI items had a ≥74% sensitivity and specificity for detecting distress within the intended domain.
The results of this study provide evidence of reliability and content-related validity of the MSWBI. Further research is needed to assess remaining psychometric properties and establish scores for which intervention is warranted.
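As a hedged illustration of the internal-consistency statistic reported above, Cronbach's alpha can be computed from item scores with the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). The function name and the toy scores below are hypothetical and are not taken from the MSWBI data.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: list of k lists, one per item, each holding one score per
    respondent (all lists the same length, same respondent order).
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(statistics.variance(scores) for scores in items)
    # Variance of each respondent's total score across items.
    totals = [sum(responses) for responses in zip(*items)]
    total_variance = statistics.variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical yes/no (1/0) answers from five respondents to three items.
example_items = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
]
print(round(cronbach_alpha(example_items), 3))
```

A modest alpha is not necessarily a flaw for a screening index whose items are meant to cover distinct domains, which is consistent with the low item-by-item overlap the abstract reports.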
Validity of self-reported height and weight has not been adequately evaluated in diverse adolescent populations. In fact there are no reported validity studies conducted in Asian children and adolescents. This study aims to examine the accuracy of self-reported weight, height, and resultant BMI values in Chinese adolescents, and of the adolescents' subsequent classification into overweight categories.
Weight and height were self-reported and measured in 1761 adolescents aged 12-16 years in a cross-sectional survey in Xi'an city, China. BMI was calculated from both reported values and measured values. Bland-Altman plots with 95% limits of agreement, Pearson's correlation and Kappa statistics were calculated to assess the agreement.
The 95% limits of agreement were -11.16 and 6.46 kg for weight, -4.73 and 7.45 cm for height, and -4.93 and 2.47 kg/m2 for BMI. Pearson correlation between measured and self-reported values was 0.912 for weight, 0.935 for height and 0.809 for BMI. Weighted Kappa was 0.859 for weight, 0.906 for height and 0.754 for BMI. Sensitivity for detecting overweight (includes obese) in adolescents was 56.1%, and specificity was 98.6%. Subjects' area of residence, age and BMI were significant factors associated with the errors in self-reporting weight, height and relative BMI.
Reported weight and height do not have an acceptable agreement with measured data. Therefore, we do not recommend the application of self-reported weight and height to screen for overweight adolescents in China. Alternatively, self-reported data could be considered for use, with caution, in surveillance systems and epidemiology studies.
To estimate agreement among scores on three common assessments of cognitive function.
Baseline responses on the Alzheimer's Disease Assessment Scale – Cognitive, Clinical Dementia Rating, and the Mini-Mental State Examination were obtained from two clinical trials (n = 138 and n = 351). A graphical method of examining agreement, the means-difference or Bland-Altman plot, was followed by Levene's test of the equality of variance corrected for multiple comparison within each sample.
70–78% of variability was shared by one factor, suggesting that all three instruments reflect cognitive impairment. However, agreement among tests was significantly worse for individuals with greater-than-average, relative to individuals with less-than-average, cognitive impairment.
Worse agreement between tests, as a function of increasing cognitive impairment, implies that interpretation of these tests and selection of coprimary cognitive impairment outcomes may depend on impairment level.
Alzheimer's disease; Dementia; Outcomes assessment methods
Accurate, inexpensive point-of-care CD4+ T cell testing technologies are needed that can deliver CD4+ T cell results at lower level health centers or community outreach voluntary counseling and testing. We sought to evaluate a point-of-care CD4+ T cell counter, the Pima CD4 Test System, a portable, battery-operated bench-top instrument that is designed to use finger stick blood samples suitable for field use in conjunction with rapid HIV testing.
Duplicate measurements were performed on both capillary and venous samples using Pima CD4 analyzers, compared to the BD FACSCalibur (reference method). The mean bias was estimated by paired Student's t-test. Bland Altman plots were used to assess agreement.
206 participants were enrolled with a median CD4 count of 396 (range: 18–1500). The finger stick PIMA had a mean bias of −66.3 cells/µL (95%CI −83.4 to −49.2, P<0.001) compared to the FACSCalibur; the bias was smaller at lower CD4 counts (0–250 cells/µL) with a mean bias of −10.8 (95%CI −27.3 to +5.6, P = 0.198), and much greater at higher CD4 cell counts (>500 cells/µL) with a mean bias of −120.6 (95%CI −162.8 to −78.4, P<0.001). The sensitivity (95%CI) of the Pima CD4 analyzer was 96.3% (79.1–99.8%) for a <250 cells/µL cut-off with a negative predictive value of 99.2% (95.1–99.9%).
The Pima CD4 finger stick test is an easy-to-use, portable, relatively fast device to test CD4+ T cell counts in the field. Issues of negatively-biased CD4 cell counts especially at higher absolute numbers will limit its utility for longitudinal immunologic response to ART. The high sensitivity and negative predictive value of the test make it an attractive option for field use to identify patients eligible for ART, thus potentially reducing delays in linkage to care and ART initiation.
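The sensitivity and negative predictive value quoted above come from cross-classifying each participant as below or above the cut-off on both methods. A minimal sketch follows, assuming "positive" means a count below the cut-off on the reference method; the function name and the example counts are hypothetical and are not the study data.

```python
def diagnostic_accuracy(reference_counts, test_counts, cutoff=250):
    """Sensitivity, specificity and NPV of a test method against a reference.

    A sample is 'positive' when its count is below the cutoff (cells/µL).
    Assumes each of the four cells needed below is non-empty; otherwise
    a ZeroDivisionError is raised.
    """
    tp = fn = tn = fp = 0
    for ref, test in zip(reference_counts, test_counts):
        ref_positive = ref < cutoff
        test_positive = test < cutoff
        if ref_positive and test_positive:
            tp += 1
        elif ref_positive:
            fn += 1
        elif test_positive:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),  # reference positives found by the test
        "specificity": tn / (tn + fp),  # reference negatives cleared by the test
        "npv": tn / (tn + fn),          # test negatives that are truly negative
    }


# Hypothetical paired CD4 counts (reference method, point-of-care method).
reference = [100, 200, 240, 300, 400, 500, 600, 700]
point_of_care = [90, 180, 260, 310, 350, 450, 500, 230]
print(diagnostic_accuracy(reference, point_of_care))
```

Because a negatively biased device pushes counts downward, it tends to classify more samples as below the cut-off, which favours sensitivity at the expense of specificity.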
The purpose of this project was to conduct a systematic review to identify instruments designed to evaluate the quality of randomized controlled trials (RCTs) of natural health products (NHPs). Instruments were examined for inclusion of items assessing methods, identity and content of the NHP, generalizability of results and instructions for use. Online databases, websites, textbooks and reference lists were searched to identify instruments. Relevance assessment and data extraction of articles were completed by two investigators and disagreements were settled by the third investigator. Data were analyzed using descriptive statistics. Of the 4442 citations identified, 29 were potentially relevant with 16 meeting the criteria for inclusion. None of the instruments stated they were validated; content in the four areas of interest varied considerably. The most common items included randomization sequence generation (100%), blinding (100%), allocation concealment (75%) and participant flow (75%). Only nine of the NHP instruments included at least one item to appraise the specific content of the NHP. The CONSORT Statement for Herbal Interventions most closely addressed the four areas of interest; however, this instrument was specific for herbs. There is a need for the development of a validated instrument for assessment of the quality of RCTs that would be useful for herbs as well as other NHPs.
Checklists; herbs; quality assessment
The purpose of this study was to systematically compare methods for standardization of blood pressure levels obtained by ambulatory blood pressure monitoring (ABPM) in a group of 111 children studied at our institution.
Blood pressure indices, blood pressure loads and standard deviation scores were calculated using the original ABPM and the modified reference standards. Bland-Altman plots and kappa statistics for the level of agreement were generated.
Overall, the agreement between the two methods was excellent; however, approximately 5% of children were classified differently by one as compared with the other method.
Depending on which version of the German Working Group’s reference standards is used for interpretation of ABPM data, the classification of the individual as having hypertension or normal blood pressure may vary.
ambulatory blood pressure monitoring; blood pressure; hypertension; reference standards
Patients experience an increasing treatment burden related to everything they do to take care of their health: visits to the doctor, medical tests, treatment management and lifestyle changes. This treatment burden could affect treatment adherence, quality of life and outcomes. We aimed to develop and validate an instrument for measuring treatment burden for patients with multiple chronic conditions.
Items were derived from a literature review and qualitative semistructured interviews with patients. The instrument was then validated in a sample of patients with chronic conditions recruited in hospitals and general practitioner clinics in France. Factor analysis was used to examine the questionnaire structure. Construct validity was studied by the relationships between the instrument's global score, the Treatment Satisfaction Questionnaire for Medication (TSQM) scores and the complexity of treatment as assessed by patients and physicians. Agreement between patients and physicians was appraised. Reliability was determined by a test-retest method.
A sample of 502 patients completed the Treatment Burden Questionnaire (TBQ), which consisted of 7 items (2 of which had 4 subitems) defined after 22 interviews with patients. The questionnaire showed a unidimensional structure. The Cronbach's α was 0.89. The instrument's global score was negatively correlated with TSQM scores (rs = -0.41 to -0.53) and positively correlated with the complexity of treatment (rs = 0.16 to 0.40). Agreement between patients and physicians (n = 396) was weak (intraclass correlation coefficient 0.38 (95% confidence interval 0.29 to 0.47)). Reliability of the retest (n = 211 patients) was 0.76 (0.67 to 0.83).
This study provides the first valid and reliable instrument assessing the treatment burden for patients across any disease or treatment context. This instrument could help in the development of treatment strategies that are both efficient and acceptable for patients.
chronic disease/therapy; patient participation; physician-patient relations; quality of life; questionnaires; workload
Rationale and Objectives
In quantifying medical images, length-based measurements are still obtained manually. Due to possible human error, a measurement protocol is required to guarantee the consistency of measurements. In this paper, we review various statistical techniques that can be used in determining measurement consistency. The focus is on detecting a possible measurement bias and determining the robustness of the procedures to outliers.
Materials and Methods
We review correlation analysis, linear regression, Bland-Altman method, paired t-test, and analysis of variance (ANOVA). These techniques were applied to measurements, obtained by two raters, of head and neck structures from magnetic resonance images (MRI).
The correlation analysis and the linear regression were shown to be insufficient for detecting measurement inconsistency. They are also very sensitive to outliers. The widely used Bland-Altman method is a visualization technique so it lacks the numerical quantification. The paired t-test tends to be sensitive to small measurement bias. On the other hand, ANOVA performs well even under small measurement bias.
In almost all cases, using only one method is insufficient and it is recommended to use several methods simultaneously. In general, ANOVA performs the best.
measurement consistency; bias; outlier; head; neck; Bland-Altman
Physical activity self-report instruments in the US have largely been developed for and validated in White samples. Despite calls to validate existing instruments in more diverse samples, relatively few instruments have been validated in US Blacks. Emerging evidence suggests that these instruments may have differential validity in Black populations.
This report reviews and evaluates the validity and reliability of self-reported measures of physical activity in Blacks and makes recommendations for future directions.
A systematic literature review was conducted to identify published reports with construct or criterion validity evaluated in samples that included Blacks. Studies that reported results separately for Blacks were examined.
The review identified 10 instruments validated in nine manuscripts. Criterion validity correlations tended to be low to moderate. No study has compared the validity of multiple instruments in a single sample of Blacks.
There is a need for efforts validating self-report physical activity instruments in Blacks, particularly those evaluating the relative validity of instruments in a single sample.
To determine the accuracy of instrumented, prone compressive leg checking.
Point measures (n=29) on single participants.
Chiropractic college research clinic.
A pair of surgical boots was modified to permit continuous measurement of leg length inequality (LLI). The accuracy of prone leg checking for a masked examiner (n = 29) was determined, against the gold standard of artificial LLI that was created by randomly inserting zero to six 1.6 mm shims in either boot. Accuracy was defined as the examiner's ability to correctly assess the change in the number and side of shims inserted, in two consecutive observations per participant. Linear regression and Bland-Altman statistics were obtained to determine the concurrent validity of compressive leg checking compared to a reference standard.
The observed and artificial LLI shared 86% of their variation (n = 29). The mean examiner error was 2.7 mm and the accuracy of dichotomous short leg determination for two shim insertions was 86.2%. The 95% confidence interval for the Bland-Altman limits-of-agreement for observed vs. artificial change in LLI was (−7.6, +5.2).
Instrumented, compressive leg checking seems highly accurate, detecting artificial changes in leg length of 2–3 mm, and thus possesses concurrent validity assessed against artificial LLI. Pre- and post leg check differences should exceed about 4–6 mm to be highly confident a real change has occurred. It is unknown whether compressive leg checking is clinically relevant.
Leg Length Inequality; Chiropractic; Validity
The primary objective was to systematically review the medical literature for instruments validated for use in epidemiological and clinical research on waterpipe smoking.
We searched the following databases: MEDLINE, EMBASE, and ISI the Web of Science. We selected studies using a two-stage duplicate and independent screening process. We included papers reporting on the development and/or validation of survey instruments to measure waterpipe tobacco consumption or related concepts. Two reviewers used a standardized and pilot tested data abstraction form to collect data from each eligible study using a duplicate and independent screening process. We also determined the percentage of observational studies assessing the health effects of waterpipe tobacco smoking and the percentage of studies of prevalence of waterpipe tobacco smoking that have used validated survey instruments.
We identified a total of five survey instruments. One instrument was designed to measure knowledge, attitudes, and waterpipe use among pregnant women and was shown to have internal consistency and content validity. Three instruments were designed to measure waterpipe tobacco consumption, two of which were reported to have face validity. The fifth instrument was designed to measure waterpipe dependence and was rigorously developed and validated. One of the studies of prevalence and none of the studies of health effects of waterpipe smoking used validated instruments.
A number of instruments for measuring the use of and dependence on waterpipe smoking exist. Future research should study content validity and cross cultural adaptation of these instruments.
Background and Aims
Clinical management of polyps discovered by computed tomographic (CT) colonography depends on polyp size. However, size measured by CT colonography is an estimate, and its agreement with other measures is not well characterized. We hypothesized that size measurement by CT colonography varies substantially compared to measurement by other methods.
We performed a secondary data analysis of a multicenter study of CT colonography in comparison to colonoscopy. Polyp size was determined by CT colonography, at colonoscopy, and measurement pre-fixation with a ruler. Agreement was assessed using descriptive statistics and Bland-Altman methodology.
600 trial participants completed both tests. 95% limits of agreement indicated that estimates of size by CT colonography ranged from 52% lower to 64% higher than pre-fixation polyp size estimates. 95% limits of agreement stratified by categories of clinical importance indicated that estimates of size by CT colonography ranged from 44% lower to 84% higher for polyps ≤0.6cm, 44% lower to 44% higher for polyps 0.6 to 0.9cm, and 48% lower to 22% higher for polyps ≥0.9cm compared with pre-fixation estimates. Analysis of participants with one identified polyp in the same colon segment demonstrated that categorization based on CT colonography measurement (i.e., <0.6cm, 0.6 to 0.9cm, or >0.9cm) differed from pre-fixation measurement for 43% of participants.
Polyp size estimation by CT colonography varies from pre-fixation and colonoscopic measures of size. Future studies should clarify whether size estimation by CT colonography is sufficiently reliable as a primary factor to guide clinical management.
Few assessment instruments have examined the nutrition and physical activity environments in child care, and none are self-administered. Given the emerging focus on child care settings as a target for intervention, a valid and reliable measure of the nutrition and physical activity environment is needed.
To measure inter-rater reliability, 59 child care center directors and 109 staff completed the self-assessment concurrently, but independently. Three weeks later, a repeat self-assessment was completed by a sub-sample of 38 directors to assess test-retest reliability. To assess criterion validity, a researcher-administered environmental assessment was conducted at 69 centers and was compared to a self-assessment completed by the director. A weighted kappa test statistic and percent agreement were calculated to assess agreement for each question on the self-assessment.
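The per-question agreement statistics described above can be sketched as follows. The source does not say which weighting scheme was used for the weighted kappa, so linear weights are an assumption here, and the function names and ratings are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of paired ratings that are identical."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's weighted kappa with linear weights (an assumed weighting scheme).

    `categories` lists the ordered response options; it needs at least two
    entries, and the ratings must not all fall in a single category.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)
    # Observed joint proportions for each pair of categories.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    obs_dis = exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            obs_dis += w * observed[i][j]
            exp_dis += w * row[i] * col[j]
    return 1.0 - obs_dis / exp_dis

# Hypothetical answers to one question from a director and a staff member
# across six centers, on a 1-4 response scale.
director = [1, 2, 3, 4, 2, 3]
staff = [1, 2, 4, 4, 1, 3]
print(percent_agreement(director, staff))
print(weighted_kappa(director, staff, [1, 2, 3, 4]))
```

Reporting both statistics is useful because percent agreement can be high by chance alone when most respondents choose the same option, which kappa corrects for.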
For inter-rater reliability, kappa statistics ranged from 0.20 to 1.00 across all questions. Test-retest reliability of the self-assessment yielded kappa statistics that ranged from 0.07 to 1.00. The inter-quartile kappa statistic ranges for inter-rater and test-retest reliability were 0.45 to 0.63 and 0.27 to 0.45, respectively. When percent agreement was calculated, questions ranged from 52.6% to 100% for inter-rater reliability and 34.3% to 100% for test-retest reliability. Kappa statistics for validity ranged from -0.01 to 0.79, with an inter-quartile range of 0.08 to 0.34. Percent agreement for validity ranged from 12.9% to 93.7%.
This study provides estimates of criterion validity, inter-rater reliability and test-retest reliability for an environmental nutrition and physical activity self-assessment instrument for child care. Results indicate that the self-assessment is a stable and reasonably accurate instrument for use with child care interventions. We therefore recommend the Nutrition and Physical Activity Self-Assessment for Child Care (NAP SACC) instrument to researchers and practitioners interested in conducting healthy weight interventions in child care. However, a more robust, less subjective measure would be more appropriate for researchers seeking an outcome measure to assess intervention impact.
Developers of health information websites aimed at consumers need methods to assess whether their website is of “high quality.” Due to the nature of complementary medicine, website information is diverse and may be of poor quality. Various methods have been used to assess the quality of websites, the two main approaches being (1) to compare the content against some gold standard, and (2) to rate various aspects of the site using an assessment tool.
We aimed to review available evaluation instruments to assess their performance when used by a researcher to evaluate websites containing information on complementary medicine and breast cancer. In particular, we wanted to see if instruments used the same criteria, agreed on the ranking of websites, were easy to use by a researcher, and if use of a single tool was sufficient to assess website quality.
Bibliographic databases, search engines, and citation searches were used to identify evaluation instruments. Instruments were included that enabled users with no subject knowledge to make an objective assessment of a website containing health information. The elements of each instrument were compared to nine main criteria defined by a previous study. Google was used to search for complementary medicine and breast cancer sites. The first six results and a purposive six from different origins (charities, sponsored, commercial) were chosen. Each website was assessed using each tool, and the percentage of criteria successfully met was recorded. The ranking of the websites by each tool was compared. The use of the instruments by others was estimated by citation analysis and Google searching.
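The scoring and ranking step described above can be sketched as follows. The source does not specify how rankings were compared, so this only shows one plausible way to turn checklist results into a rank order per tool; the site names, function names, and results are hypothetical.

```python
def percent_criteria_met(results):
    """Percentage of a tool's criteria that a website satisfied.

    `results` holds one boolean per criterion in the tool.
    """
    return 100.0 * sum(results) / len(results)

def rank_sites(scores):
    """Return site names ordered best first; ties are broken alphabetically."""
    return sorted(scores, key=lambda site: (-scores[site], site))

# Hypothetical checklist results for three websites under one tool.
tool_results = {
    "site_a": [True, True, False, False],
    "site_b": [True, True, True, False],
    "site_c": [True, False, False, False],
}
scores = {site: percent_criteria_met(r) for site, r in tool_results.items()}
print(rank_sites(scores))
```

Repeating this for each tool gives one rank order per instrument, which can then be compared across instruments.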
A total of 39 instruments were identified, 12 of which met the inclusion criteria; the instruments contained between 4 and 43 questions. When applied to 12 websites, 10 of the instruments agreed on the rank order of the sites. Instruments varied in the range of criteria they assessed and in their ease of use.
Comparing the content of websites against a gold standard is time consuming and only feasible for very specific advice. Evaluation instruments offer gateway providers a method to assess websites. The checklist approach has face validity when results are compared to the actual content of “good” and “bad” websites. Although instruments differed in the range of items assessed, there was fair agreement between most available instruments. Some were easier to use than others, but these were not necessarily the instruments most widely used to date. Combining some of the better features of instruments to provide fewer, easy-to-use methods would be beneficial to gateway providers.
Consumer Health Informatics; Internet; quality of information; complementary medicine
The SLICC Damage Index (SDI) is a validated instrument for assessing organ damage in systemic lupus erythematosus (SLE). Trained physicians must complete it, limiting utility where this is impossible.
We developed and pilot-tested a self-assessed organ damage instrument, the Lupus Damage Index Questionnaire (LDIQ), in 37 SLE subjects and 7 physicians. After refinement, 569 English-speaking SLE subjects and 14 rheumatologists from 11 international SLE clinics participated in validation. Subjects and physicians completed instruments separately. We calculated sensitivity, specificity, Spearman correlations and agreement, using the SDI as gold standard. 605 SLE participants in the community-based National Data Bank for Rheumatic Diseases (NDB) study completed the LDIQ and we assessed correlations with outcome and disability measures.
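The item-level comparison against a gold standard can be sketched as follows. This is an illustration under stated assumptions: each damage item is treated as present or absent, the physician rating plays the role of the gold standard, and the function name and data are hypothetical.

```python
def diagnostic_summary(gold, test):
    """Sensitivity, specificity and overall agreement for one binary item.

    `gold` and `test` are equal-length sequences of 0/1 (absent/present).
    Assumes the gold standard contains at least one positive and one negative.
    """
    tp = sum(1 for g, t in zip(gold, test) if g and t)
    fn = sum(1 for g, t in zip(gold, test) if g and not t)
    tn = sum(1 for g, t in zip(gold, test) if not g and not t)
    fp = sum(1 for g, t in zip(gold, test) if not g and t)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "agreement": (tp + tn) / len(gold),
    }

# Hypothetical ratings of one damage item for five subjects.
physician_rating = [1, 1, 0, 0, 0]
self_report = [1, 0, 0, 0, 1]
print(diagnostic_summary(physician_rating, self_report))
```

With very rare items, only a handful of gold-standard positives contribute to sensitivity, so that estimate becomes unstable even when specificity and overall agreement look high.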
Mean LDIQ score was 3.3 (0-16) and mean SDI score was 1.5 (0-9). LDIQ had a moderately high correlation with SDI (Spearman r=0.50, p<0.001). Specificities of individual LDIQ items were >80%, except for neuropathy. Sensitivities were variable and lowest for damage with <1% prevalence. Agreement between SDI and LDIQ was >85% for all but neuropathy, reduced renal function, deforming arthritis and alopecia. In the NDB, LDIQ correlated well with comorbidity index (r=0.45), SF-36 physical component scale (r=0.43), Medical Research Council dyspnea scale (r=0.40), disability (r=0.37) and SLE Activity Questionnaire score (r=0.37).
The LDIQ’s metric properties are good compared to the SDI. It has construct validity and correlations with health assessments similar to the SDI. The LDIQ should allow expansion of SLE research. Its ultimate value will be determined in longitudinal studies.
systemic lupus erythematosus; questionnaire; damage; SLICC damage index; validation; self-assessed
To study the relationship between waist circumference (WC) and bioelectrical impedance analysis (BIA), and the degree of agreement between an anthropometric index (AI) and BIA, using BIA as the reference or ‘gold standard’. The second objective was to study the relationship between body mass index (BMI) and BIA in subjects with spinal cord injury (SCI).
Comparative cross-sectional study.
Convenience sample at outpatient clinic of spinal cord center.
Obesity was estimated in 23 men with motor complete paraplegia (>1 year post-injury). Bland-Altman statistics were used to define the level of agreement between AI and BIA, and Pearson's r was used to describe the correlations between WC and BIA and between BMI and BIA.
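The two kinds of statistic named here answer different questions, which a short sketch can make concrete. The values below are hypothetical percent fat mass readings, not study data, and the function names are illustrative.

```python
from math import sqrt
from statistics import mean

def mean_difference(a, b):
    """Mean of the paired differences a - b (the systematic difference, or bias)."""
    return mean([x - y for x, y in zip(a, b)])

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical percent fat mass from two methods for five subjects.
method_a = [22.0, 28.5, 31.0, 35.5, 40.0]
method_b = [23.0, 28.0, 32.5, 35.0, 41.0]

print(mean_difference(method_a, method_b))
print(pearson_r(method_a, method_b))
```

Correlation measures how closely two sets of readings track each other, whereas the mean difference measures whether one method reads systematically higher: two methods that differ by a constant offset correlate perfectly yet do not agree.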
Good agreement was found between BIA and AI, with a small systematic difference in fat mass (FM) (mean difference: −0.28%, Pearson's r: 0.91). The correlation between WC and BIA (% FM) was very high (Pearson's r: 0.83). The correlation between BMI and BIA (% FM) was just over moderate (Pearson's r: 0.51).
AI seems to be a valid proxy measure for estimating obesity in males living with SCI. Measurement of obesity in persons with SCI based on WC is promising. BMI was shown not to be a valid estimate of obesity in persons with SCI.
Spinal cord injuries; Obesity; Anthropometrics; Body mass index; Bioelectrical impedance analysis
Independent mobility is a key factor in determining readiness for discharge for older patients following acute hospitalisation and has also been identified as a predictor of many important outcomes for this patient group. This review aimed to identify a physical performance instrument that is not disease-specific and has the properties required to accurately measure and monitor the mobility of older medical patients in the acute hospital setting.
Databases initially searched were Medline, Cinahl, Embase, Cochrane Database of Systematic Reviews and the Cochrane Central Register of Controlled Trials without language restriction or limits on year of publication until July 2005. After analysis of this yield, a second step was the systematic search of Medline, Cinahl and Embase until August 2005 for evidence of the clinical utility of each potentially suitable instrument. Reports were included in this review if instruments described had face validity for measuring from bed bound to independent levels of ambulation, the items were suitable for application in an acute hospital setting and the instrument required observation (rather than self-report) of physical performance. Evidence of the clinical utility of each potentially suitable instrument was considered if data on measurement properties were reported.
Three instruments, the Elderly Mobility Scale (EMS), the Hierarchical Assessment of Balance and Mobility (HABAM) and the Physical Performance Mobility Examination (PPME), were identified as potentially relevant. Clinimetric evaluation indicated that the HABAM has the most desirable properties of these three instruments. However, the HABAM has the limitation of a ceiling effect in an older acute medical patient population, and reliability and minimal clinically important difference (MCID) estimates have not been reported for the Rasch-refined HABAM. These limitations support the proposal that a new mobility instrument is required for older acute medical patients.
No existing instrument has the properties required to accurately measure and monitor mobility of older acute medical patients.