Assessing agreement in method comparison studies depends on two fundamentally important components: validity (the between-method agreement) and reproducibility (the within-method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in medical literature for assessing between-method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to the lack of flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data.
Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates).
We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented: (i) one assuming static underlying values, analysed using both multivariate hierarchical Bayesian models, and (ii) one assuming that each subject's underlying value is a continually changing quantity, analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model.
These easily implemented models allow for full parameter uncertainty and simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses is also presented, written for the freely available software package WinBUGS.
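For orientation, the classical (non-Bayesian) limits of agreement that the Bland-Altman framework starts from can be sketched in a few lines. This is not the hierarchical Bayesian model described above and is not the WinBUGS code the authors provide; it assumes one paired measurement per subject, and the function name `limits_of_agreement` and its parameters are hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower, upper) for paired measurements.

    Assumes one pair per subject (no replicates); z=1.96 gives the
    conventional 95% limits under approximate normality of differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods need the same number of measurements")
    if len(method_a) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")
    # Between-method differences, one per subject.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # systematic difference between the methods
    sd = stdev(diffs)    # sample SD of the differences
    return bias, bias - z * sd, bias + z * sd
```

With repeated measurements per subject, this simple calculation is no longer appropriate, which is the gap the models above are intended to fill.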
Empathy is frequently cited as an important attribute in physicians and some groups have expressed a desire to measure empathy either at selection for medical school or during medical (or postgraduate) training. In order to do this, a reliable and valid test of empathy is required. The purpose of this systematic review is to determine the reliability and validity of existing tests for the assessment of medical empathy.
A systematic review of research papers relating to the reliability and validity of tests of empathy in medical students and doctors. Journal databases (Medline, EMBASE, and PsycINFO) were searched for English-language articles relating to the assessment of empathy and related constructs in applicants to medical school, medical students, and doctors.
From 1147 citations, we identified 50 relevant papers describing 36 different instruments of empathy measurement. As some papers assessed more than one instrument, there were 59 instrument assessments. 20 of these involved only medical students, 30 involved only practising clinicians, and three involved only medical school applicants. Four assessments involved both medical students and practising clinicians, and two studies involved both medical school applicants and students.
Eight instruments demonstrated evidence of reliability, internal consistency, and validity. Of these, six were self-rated measures, one was a patient-rated measure, and one was an observer-rated measure.
A number of empathy measures available have been psychometrically assessed for research use among medical students and practising medical doctors. No empathy measures were found with sufficient evidence of predictive validity for use as selection measures for medical school. However, measures with a sufficient evidential base to support their use as tools for investigating the role of empathy in medical training and clinical care are available.
An ingestible telemetric temperature sensor for measuring body core temperature (Tc) was first described 45 years ago, although the method has only recently gained widespread use for exercise applications. This review aims to (1) use Bland and Altman's limits of agreement (LoA) method as a basis for quantitatively reviewing the agreement between intestinal sensor temperature (Tintestinal), oesophageal temperature (Toesophageal) and rectal temperature (Trectal) across numerous previously published validation studies; (2) review factors that may affect agreement; and (3) review the application of this technology in field‐based exercise studies. The agreement between Tintestinal and Toesophageal is suggested to meet our delimitation for an acceptable level of agreement (ie, systematic bias <0.1°C and 95% LoA within ±0.4°C). The agreement between Tintestinal and Trectal shows a significant systematic bias >0.1°C, although the 95% LoA is acceptable. Tintestinal responds less rapidly than Toesophageal at the start or cessation of exercise or to a change in exercise intensity, but more rapidly than Trectal. When using this technology, care should be taken to ensure adequate control over sensor calibration and data correction, timing of ingestion and electromagnetic interference. The ingestible sensor has been applied successfully in numerous sport and occupational applications such as the continuous measurement of Tc in deep sea saturation divers, distance runners and soldiers undertaking sustained military training exercises. It is concluded that the ingestible telemetric temperature sensor represents a valid index of Tc and shows excellent utility for ambulatory field‐based applications.
OBJECTIVE: To validate a range of dietary assessment instruments in general practice. METHODS: Using a randomised block design, brief assessment instruments and more complex conventional dietary assessment tools were compared with an accepted "relative" standard: a seven day weighed dietary record. The standard was checked using biomarkers, and by performing test-retest reliability in additional subjects (n = 29). OUTCOMES: Agreement with weighed record. Percentage agreement with weighed record, rank correlation from scatter plot, rank correlation from Bland-Altman plot. Reliability of the weighed record. SETTING: Practice nurse treatment room in a single suburban general practice. SUBJECTS: Patients with risk factors for cardiovascular disease (n = 61) or age/sex stratified general population group (n = 50). RESULTS: Brief self completion dietary assessment tools based on food groups eaten during a week show reasonable agreement with the relative standard. For % energy from fat and saturated fat, non-starch polysaccharide, grams of fruit and vegetables and starchy foods consumed, the range of agreement with the standard was: median % difference -6% to 12%, rank correlation 0.5 to 0.6. This agreement is of a similar order to the reliability of the weighed record, as good as or better than test standard agreement for more time consuming instruments, and compares favourably with research instruments validated in other settings. Under-reporting of energy intake was common (40%) and more likely if subjects were obese (body mass index (BMI) ≥ 30: 60% under-reported; BMI < 30: 29%; p < 0.001). CONCLUSION: Under-reporting of absolute energy intake is common, particularly among obese patients. Simple self assessment tools based on food groups, designed for practice nurse dietary assessment, show acceptable agreement with a standard, and suggest such tools are sufficiently accurate for clinical work, research, and possibly population dietary monitoring.
Few commercially available brands of actigraphs (ACT) have been subjected to rigorous validation with infant participants. The purpose of this study was to examine the agreement between concurrent polysomnography (PSG) and one brand of ACT (AW-64, Mitter Co. Inc.) using appropriate statistical techniques among a sample of healthy infants.
Twenty-two healthy infants (14.1 ± 0.6 months) had one night of ankle ACT recording during research PSG at Kosair Children's Hospital Sleep Research Center in Louisville, Kentucky. Macroanalyses were conducted using the Bland-Altman concordance technique to assess agreement between total sleep time (TST) and wake after sleep onset (WASO) simultaneously measured by PSG and ACT, using two ACT algorithm settings. Microanalyses were also calculated to examine sensitivity, specificity, and accuracy of ACT within each PSG-identified sleep state. Correlations were calculated between PSG-identified arousals and the discrepancies between ACT and PSG.
The Bland-Altman concordance technique revealed that ACT underestimated TST by 72.25 (SD = 61.48) minutes and by ≥ 60 minutes among 54.55% of infants. Furthermore, ACT overestimated WASO by 13.85 (SD = 30.94) minutes and by ≥ 30 minutes among 40.91% of infants. Sensitivity, specificity, and accuracy analyses revealed that ACT adequately identified sleep, but poorly identified wake. PSG and ACT discrepancies were positively associated with PSG-identified arousals (r = .45).
Improved device and/or software development is needed before the AW-64 can be considered a valid method for identifying infant sleep and wake.
actigraphy; polysomnography; infant; validation; Bland-Altman
Background. Reliable ICU severity scores have been achieved by various healthcare workers but nothing is known regarding the accuracy in real life of severity scores registered by untrained nurses. Methods. In this retrospective multicentre audit, three reviewers independently reassessed 120 SAPS II scores. Correlation and agreement of the sum-scores/variables among reviewers and between nurses and the reviewers' gold standard were assessed globally and for tertiles. Bland and Altman differences (gold standard minus nurses) of sum scores and regression of the difference were determined. A logistic regression model identifying risk factors for erroneous assessments was calculated. Results. Correlation for sum scores among reviewers was almost perfect (mean ICC = 0.985). The mean (±SD) nurse-registered SAPS II sum score was 40.3 ± 20.2 versus 44.2 ± 24.9 of the gold standard (P < 0.002 for difference) with a lower ICC (0.81). The Bland and Altman difference was +3.8 ± 27.0 with a significant regression between the difference and the gold standard, indicating overall an overestimation of lower scores and an underestimation of higher scores (>32 points). The lowest agreement was found in high SAPS II tertiles for haemodynamics (k = 0.45–0.51). Conclusions. In real life, nurse-registered SAPS II scores of very ill patients are inaccurate. Accuracy of scores was not associated with nurses' characteristics.
Although various acceptable and easy-to-use devices have been used for saliva collection, cotton swabs are among the most common ones. Previous studies reported that cotton swabs yield a lower level of melatonin detection. However, the statistical method used in those studies is not adequate for assessing agreement between cotton saliva collection and passive saliva collection, and a test for bias is needed. Furthermore, the effects of cotton swabs have not been examined at lower melatonin levels, the range in which melatonin is used for assessment of circadian rhythms, namely dim light melatonin onset (DLMO). In the present study, we estimated the effect of cotton swabs on the results of salivary melatonin assay at lower melatonin levels using the Bland-Altman plot.
Nine healthy males were recruited and each provided four saliva samples on a single day to yield a total of 36 samples. Saliva samples were directly collected in plastic tubes using plastic straws, and subsequently pipetted onto cotton swabs (cotton saliva collection) and into clear sterile tubes (passive saliva collection). The melatonin levels were analyzed in duplicate using commercially available ELISA kits.
The mean melatonin concentration in cotton saliva collection samples was significantly lower than that in passive saliva collection samples at higher melatonin levels (>6 pg/mL). The Bland-Altman plot indicated that cotton swabs cause relative and proportional biases in the assay results. At lower melatonin levels (<6 pg/mL), although the Bland-Altman plots did not show proportional and relative biases, there was no significant correlation between passive and cotton saliva collection samples.
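One common way to look for proportional bias in a Bland-Altman analysis is to regress the between-method differences on the pairwise means: a slope clearly different from zero suggests that the disagreement grows or shrinks with the measured level. The sketch below is a minimal illustration under that assumption, not the procedure reported in this study; the function name `proportional_bias_fit` is hypothetical, and a formal test of the slope is left out.

```python
from statistics import mean


def proportional_bias_fit(method_a, method_b):
    """Least-squares fit of (a - b) on (a + b) / 2.

    Returns (slope, intercept). A slope near zero is consistent with
    no proportional bias; the intercept then approximates the fixed bias.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least two pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    avgs = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    mean_x, mean_y = mean(avgs), mean(diffs)
    sxx = sum((x - mean_x) ** 2 for x in avgs)
    if sxx == 0:
        raise ValueError("all pairwise means are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(avgs, diffs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```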
Our findings indicate an interference effect of cotton swabs on the assay result of salivary melatonin at lower melatonin levels. Cotton-based collection devices might, thus, not be suitable for assessment of DLMO.
To test the validity and reliability of a tool specifically developed for the evaluation of appropriateness in rehabilitation facilities and to assess the prevalence of appropriateness of the days of stay.
The tool underwent a process of cross-cultural translation, content validity, and test-retest validity. Two hospital-based rehabilitation wards providing intensive rehabilitation care located in the Region of Calabria, Southern Italy, were randomly selected. A review of medical records on a random sample of patients aged 18 or more was performed.
The process of validation resulted in modifying some of the criteria used for the evaluation of appropriateness. Test-retest reliability showed that the agreement and the k statistic for the assessment of the appropriateness of days of stay were 93.4% and 0.82, respectively. A total of 371 patient days was reviewed, and 22.9% of the days of stay in the sample were judged to be inappropriate. The most frequently selected appropriateness criterion was the evaluation of patients by rehabilitation professionals for at least 3 hours on the index day (40.8%); moreover, the most frequent primary reason accounting for the inappropriate days of stay was social and/or family environment issues (34.1%).
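The two test-retest figures quoted above (raw agreement and the k statistic) are related: kappa rescales the observed agreement by the agreement expected from the marginal frequencies alone. A minimal sketch of that calculation for two sets of categorical judgements follows; it assumes unweighted Cohen's kappa, which the abstract does not specify, and the function name is hypothetical.

```python
from collections import Counter


def percent_agreement_and_kappa(first, second):
    """Return (observed agreement, Cohen's kappa) for two rating series."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating series of equal length")
    n = len(first)
    # Proportion of items given the same category on both occasions.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal category frequencies.
    counts_1, counts_2 = Counter(first), Counter(second)
    categories = set(counts_1) | set(counts_2)
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Here each rating could be, for example, 1 for a day judged appropriate and 0 for a day judged inappropriate at the first and second review.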
The findings showed that the tool used is reliable and has adequate validity to measure the extent of appropriateness of days of stay in rehabilitation facilities and that the prevalence of inappropriateness is contained in the investigated settings. Further research is needed to expand appropriateness evaluation to other rehabilitation settings, and to investigate more thoroughly internal and external causes of inappropriate use of rehabilitation services.
Neuropathology centers are expected to offer a prompt and accurate intraoperative diagnosis regarding tumor/lesion type and grade on fresh unfixed tissue. The level of diagnostic accuracy according to type and grade, and the experience at a new center, have not been reported before.
The aim of this study is to review the agreement patterns according to tumor/lesion type and grade between intraoperative and final histopathologic diagnosis in central nervous system (CNS) lesion samples received by a newly established neuropathology center at a tertiary care neuropsychiatric hospital.
Materials and Methods:
Agreement between intraoperative and final histopathologic diagnosis was classified as: (I) Grade in agreement but type not in agreement; (II) grade not in agreement but type in agreement; (III) grade and type both not in agreement; (IV) grade and type both in agreement.
Confidence interval (CI) of agreements was calculated for various categories of neoplastic as well as non-neoplastic lesions. CI was also calculated for groups where n × p and n × (1 − p) were more than 5, i.e., fulfilled the requirement of the central limit theorem.
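The rule stated above (compute an interval only when both n × p and n × (1 − p) exceed 5) can be sketched as follows. The abstract does not say which interval was used, so the normal-approximation (Wald) interval here is an assumption, as are the function name `wald_ci` and the `min_count` parameter.

```python
import math


def wald_ci(agreements, n, z=1.96, min_count=5):
    """Normal-approximation CI for a proportion, or None if n*p or
    n*(1-p) does not exceed min_count (approximation not trusted)."""
    if n <= 0 or not 0 <= agreements <= n:
        raise ValueError("need n > 0 and 0 <= agreements <= n")
    p = agreements / n
    if n * p <= min_count or n * (1 - p) <= min_count:
        return None
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because the Wald interval can overshoot the bounds.
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

Applied to 237 agreements out of 284 neoplastic cases, this gives an interval of roughly 0.79 to 0.88; for 46 of 49 non-neoplastic cases the function returns None because n × (1 − p) is only 3.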
On retrospective analysis of 333 cases, 284 (85.3%) cases were categorized as neoplastic while 49 (14.7%) cases were categorized as non-neoplastic. Among the neoplastic lesions, agreement was seen in 237 (83.5%) cases while 47 (16.5%) cases showed disagreement. Similarly, in the non-neoplastic category, 46 (93.9%) cases showed agreement while 3 (6.1%) cases showed disagreement. Of the non-neoplastic lesions, one case fell into the agreement category I, 2 in category III and 46 in IV. Among neoplastic lesions, there were 21 cases in agreement category I, 17 in II, 9 in III and 237 in IV. On analyzing the accuracy of intraoperative reporting according to tumor type, the breakdown was: astrocytic: 2 (I), 16 (II), 2 (III), 86 (IV); oligodendroglial: 8 (I), 1 (II); ependymal: 2 (III), 6 (IV); embryonal: 23 (IV); cranial and spinal nerve tumors: 2 (II), 21 (IV); choroid plexus tumors: 4 (IV); meningeal tumors: 3 (I), 1 (III), 49 (IV); metastatic tumors: 3 (I), 17 (IV); cysts (tumor-like conditions): 14 (IV); neuronal and mixed neuronal glial tumors: 1 (III); malignant lymphoma: 1 (III); sellar tumors: 17 (IV); and mixed gliomas: 5 (I).
This study identifies problem areas of CNS intraoperative reporting, in a new center, with reference to tumor typing and grading. It may forewarn upcoming centers of neuropathology about the potential problem areas of intraoperative reporting.
Central nervous system lesions; intraoperative reporting; new center
Psychological distress is common among medical students but manifests in a variety of forms. Currently, no brief, practical tool exists to simultaneously evaluate these domains of distress among medical students. The authors describe the development of a subject-reported assessment (Medical Student Well-Being Index, MSWBI) intended to screen for medical student distress across a variety of domains and examine its preliminary psychometric properties.
Relevant domains of distress were identified, items generated, and a screening instrument formed using a process of literature review, nominal group technique, input from deans and medical students, and correlation analysis from previously administered assessments. Eleven experts judged the clarity, relevance, and representativeness of the items. A Content Validity Index (CVI) was calculated. Interrater agreement was assessed using pair-wise percent agreement adjusted for chance agreement. Data from 2248 medical students who completed the MSWBI along with validated full-length instruments assessing domains of interest were used to calculate reliability and explore internal structure validity.
Burnout (emotional exhaustion and depersonalization), depression, mental quality of life (QOL), physical QOL, stress, and fatigue were domains identified for inclusion in the MSWBI. Six of 7 items received item CVI-relevance and CVI-representativeness of ≥0.82. Overall scale CVI-relevance and CVI-representativeness were 0.94 and 0.91, respectively. Overall pair-wise percent agreement between raters was ≥85% for clarity, relevance, and representativeness. Cronbach's alpha was 0.68. Item by item percent pair-wise agreements and Phi were low, suggesting little overlap between items. The majority of MSWBI items had a ≥74% sensitivity and specificity for detecting distress within the intended domain.
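As a reminder of what the reported Cronbach's alpha summarises, the standard formula is alpha = k/(k − 1) × (1 − sum of item variances / variance of the total score), where k is the number of items. The sketch below implements that formula directly; the data layout (one row of item scores per respondent) and the function name are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha from rows of item scores (one row per respondent)."""
    if len(item_scores) < 2:
        raise ValueError("need at least two respondents")
    k = len(item_scores[0])
    if k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("need at least two items and equally long rows")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

Because alpha rises with inter-item correlation, a moderate value is what one would expect from a brief index whose items are meant to cover distinct domains.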
The results of this study provide evidence of reliability and content-related validity of the MSWBI. Further research is needed to assess remaining psychometric properties and establish scores for which intervention is warranted.
Validity of self-reported height and weight has not been adequately evaluated in diverse adolescent populations. In fact there are no reported validity studies conducted in Asian children and adolescents. This study aims to examine the accuracy of self-reported weight, height, and resultant BMI values in Chinese adolescents, and of the adolescents' subsequent classification into overweight categories.
Weight and height were self-reported and measured in 1761 adolescents aged 12-16 years in a cross-sectional survey in Xi'an city, China. BMI was calculated from both reported values and measured values. Bland-Altman plots with 95% limits of agreement, Pearson's correlation and Kappa statistics were calculated to assess the agreement.
The 95% limits of agreement were -11.16 and 6.46 kg for weight, -4.73 and 7.45 cm for height, and -4.93 and 2.47 kg/m2 for BMI. Pearson correlation between measured and self-reported values was 0.912 for weight, 0.935 for height and 0.809 for BMI. Weighted Kappa was 0.859 for weight, 0.906 for height and 0.754 for BMI. Sensitivity for detecting overweight (includes obese) in adolescents was 56.1%, and specificity was 98.6%. Subjects' area of residence, age and BMI were significant factors associated with the errors in self-reporting weight, height and relative BMI.
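The sensitivity and specificity quoted above come from cross-classifying overweight status based on self-reported values against status based on measured values. A minimal sketch of the calculation from the four cell counts of that 2 × 2 table is given below; the function name and argument names are hypothetical, and predictive values are included only for completeness.

```python
def diagnostic_summary(true_pos, false_neg, false_pos, true_neg):
    """Sensitivity, specificity, PPV and NPV from 2x2 table counts.

    The reference classification (here, measured values) defines
    which subjects are truly positive or negative.
    """
    counts = (true_pos, false_neg, false_pos, true_neg)
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero for empty margins.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(true_pos, true_pos + false_neg),
        "specificity": ratio(true_neg, true_neg + false_pos),
        "ppv": ratio(true_pos, true_pos + false_pos),
        "npv": ratio(true_neg, true_neg + false_neg),
    }
```

A low sensitivity paired with a high specificity, as reported here, means that self-report rarely labels a normal-weight adolescent as overweight but misses a large share of those who are.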
Reported weight and height do not have an acceptable agreement with measured data. Therefore, we do not recommend the application of self-reported weight and height to screen for overweight adolescents in China. Alternatively, self-reported data could be considered for use, with caution, in surveillance systems and epidemiology studies.
To estimate agreement among scores on three common assessments of cognitive function.
Baseline responses on the Alzheimer's Disease Assessment Scale – Cognitive, Clinical Dementia Rating, and the Mini-Mental State Examination were obtained from two clinical trials (n = 138 and n = 351). A graphical method of examining agreement, the means-difference or Bland-Altman plot, was followed by Levene's test of the equality of variance corrected for multiple comparisons within each sample.
70–78% of variability was shared by one factor, suggesting that all three instruments reflect cognitive impairment. However, agreement among tests was significantly worse for individuals with greater-than-average, relative to individuals with less-than-average, cognitive impairment.
Worse agreement between tests, as a function of increasing cognitive impairment, implies that interpretation of these tests and selection of coprimary cognitive impairment outcomes may depend on impairment level.
Alzheimer's disease; Dementia; Outcomes assessment methods
Accurate, inexpensive point-of-care CD4+ T cell testing technologies are needed that can deliver CD4+ T cell results at lower level health centers or community outreach voluntary counseling and testing. We sought to evaluate a point-of-care CD4+ T cell counter, the Pima CD4 Test System, a portable, battery-operated bench-top instrument that is designed to use finger stick blood samples suitable for field use in conjunction with rapid HIV testing.
Duplicate measurements were performed on both capillary and venous samples using Pima CD4 analyzers, compared to the BD FACSCalibur (reference method). The mean bias was estimated by paired Student's t-test. Bland Altman plots were used to assess agreement.
206 participants were enrolled with a median CD4 count of 396 (range: 18–1500). The finger stick Pima had a mean bias of −66.3 cells/µL (95% CI −83.4 to −49.2, P<0.001) compared to the FACSCalibur; the bias was smaller at lower CD4 counts (0–250 cells/µL) with a mean bias of −10.8 (95% CI −27.3 to +5.6, P = 0.198), and much greater at higher CD4 cell counts (>500 cells/µL) with a mean bias of −120.6 (95% CI −162.8 to −78.4, P<0.001). The sensitivity (95% CI) of the Pima CD4 analyzer was 96.3% (79.1–99.8%) for a <250 cells/µL cut-off with a negative predictive value of 99.2% (95.1–99.9%).
The Pima CD4 finger stick test is an easy-to-use, portable, relatively fast device to test CD4+ T cell counts in the field. Issues of negatively-biased CD4 cell counts, especially at higher absolute numbers, will limit its utility for longitudinal immunologic response to ART. The high sensitivity and negative predictive value of the test make it an attractive option for field use to identify patients eligible for ART, thus potentially reducing delays in linkage to care and ART initiation.
Patients experience an increasing treatment burden related to everything they do to take care of their health: visits to the doctor, medical tests, treatment management and lifestyle changes. This treatment burden could affect treatment adherence, quality of life and outcomes. We aimed to develop and validate an instrument for measuring treatment burden for patients with multiple chronic conditions.
Items were derived from a literature review and qualitative semistructured interviews with patients. The instrument was then validated in a sample of patients with chronic conditions recruited in hospitals and general practitioner clinics in France. Factor analysis was used to examine the questionnaire structure. Construct validity was studied by the relationships between the instrument's global score, the Treatment Satisfaction Questionnaire for Medication (TSQM) scores and the complexity of treatment as assessed by patients and physicians. Agreement between patients and physicians was appraised. Reliability was determined by a test-retest method.
A sample of 502 patients completed the Treatment Burden Questionnaire (TBQ), which consisted of 7 items (2 of which had 4 subitems) defined after 22 interviews with patients. The questionnaire showed a unidimensional structure. The Cronbach's α was 0.89. The instrument's global score was negatively correlated with TSQM scores (rs = -0.41 to -0.53) and positively correlated with the complexity of treatment (rs = 0.16 to 0.40). Agreement between patients and physicians (n = 396) was weak (intraclass correlation coefficient 0.38 (95% confidence interval 0.29 to 0.47)). Reliability of the retest (n = 211 patients) was 0.76 (0.67 to 0.83).
This study provides the first valid and reliable instrument assessing the treatment burden for patients across any disease or treatment context. This instrument could help in the development of treatment strategies that are both efficient and acceptable for patients.
chronic disease/therapy; patient participation; physician-patient relations; quality of life; questionnaires; workload
The purpose of this project was to conduct a systematic review to identify instruments designed to evaluate the quality of randomized controlled trials (RCTs) of natural health products (NHPs). Instruments were examined for inclusion of items assessing methods, identity and content of the NHP, generalizability of results and instructions for use. Online databases, websites, textbooks and reference lists were searched to identify instruments. Relevance assessment and data extraction of articles were completed by two investigators and disagreements were settled by the third investigator. Data were analyzed using descriptive statistics. Of the 4442 citations identified, 29 were potentially relevant with 16 meeting the criteria for inclusion. None of the instruments stated they were validated; content in the four areas of interest varied considerably. The most common items included randomization sequence generation (100%), blinding (100%), allocation concealment (75%) and participant flow (75%). Only nine of the NHP instruments included at least one item to appraise the specific content of the NHP. The CONSORT Statement for Herbal Interventions most closely addressed the four areas of interest; however, this instrument was specific for herbs. There is a need for the development of a validated instrument for assessment of the quality of RCTs that would be useful for herbs as well as other NHPs.
Checklists; herbs; quality assessment
The purpose of this study was to systematically compare methods for standardization of blood pressure levels obtained by ambulatory blood pressure monitoring (ABPM) in a group of 111 children studied at our institution.
Blood pressure indices, blood pressure loads and standard deviation scores were calculated using the original ABPM and the modified reference standards. Bland-Altman plots and kappa statistics for the level of agreement were generated.
Overall, the agreement between the two methods was excellent; however, approximately 5% of children were classified differently by one as compared with the other method.
Depending on which version of the German Working Group’s reference standards is used for interpretation of ABPM data, the classification of the individual as having hypertension or normal blood pressure may vary.
ambulatory blood pressure monitoring; blood pressure; hypertension; reference standards
A variety of instruments are used to measure health related quality of life. Few data exist on the performance and agreement of different instruments in a depressed population. The aim of this study was to investigate agreement between, and suitability of, the EQ-5D-3L, EQ-5D Visual Analogue Scale (EQ-5D VAS), SF-6D and SF-12 new algorithm for measuring health utility in depressed patients.
The intraclass correlation coefficient (ICC) and Bland and Altman approaches were used to assess agreement. Instrument sensitivity was analysed by: (1) plotting utility scores for the instruments against one another; (2) correlating utility scores and depressive symptoms (Beck Depression Inventory (BDI)); and (3) using Tukey’s procedure. Receiver Operating Characteristic (ROC) analysis assessed instrument responsiveness to change. Acceptability was assessed by comparing instrument completion rates.
The overall ICC was 0.57. Bland and Altman plots showed wide limits of agreement for each pairwise comparison, except between the SF-6D and SF-12 new algorithm. Plots of utility scores displayed 'ceiling effects' in the EQ-5D-3L index and 'floor effects' in the SF-6D and SF-12 new algorithm. All instruments showed a negative monotonic relationship with BDI, but the EQ-5D-3L index and EQ-5D VAS could not differentiate between depression severity sub-groups. The SF-based instruments were better able to detect changes in health state over time. There was no difference in completion rates of the four instruments.
There was a lack of agreement between utility scores generated by the different instruments. According to the criteria of sensitivity, responsiveness and acceptability that we applied, the SF-6D and SF-12 may be more suitable for the measurement of health related utility in a depressed population than the EQ-5D-3L, which is the instrument currently recommended by NICE.
Depression; EQ-5D; SF-6D; Health related utility; QALYs
Over the years many scales have been designed for screening, diagnosis and assessing the severity of delirium. In this paper we review the various instruments available to screen patients for delirium, to diagnose delirium, and to assess its severity, cognitive functions, motoric subtypes, etiology and associated distress. Among the various screening instruments, the NEECHAM Confusion Scale and the Delirium Observation Scale appear to be the most suitable screening instruments for patients in general medical and surgical wards, depending on the type of rater (physician or nurse). In general, the instruments which are used for diagnosis [i.e., Confusion Assessment Method (CAM), CAM for Intensive Care Unit (CAM-ICU), Delirium Rating Scale-revised version (DRS-R-98), Memorial Delirium Assessment Scale, etc.] are based on various Diagnostic and Statistical Manual criteria and have good to excellent reliability and fair to good validity. Among the various diagnostic instruments, CAM is considered to be the most useful instrument because of its accuracy, brevity, and ease of use by clinicians and lay interviewers. In contrast, DRS-R-98 appears to be a comprehensive instrument that is useful for diagnosis and severity rating, is sensitive to change, and hence can be used for monitoring patients over a period. In the ICU setting, evidence suggests that CAM-ICU and the Nursing Delirium Screening Scale have comparable sensitivities, but CAM-ICU has higher specificity. With regard to assessment of delirium in the pediatric age group, certain instruments such as the Pediatric Anesthesia Emergence Delirium scale and the pediatric CAM-ICU have been designed and have been found to be useful.
Delirium; Screening; Diagnosis; Cognition
Cardiac output (CO) and systemic vascular resistance (SVR) are two important parameters of the cardiovascular system. The ability to measure these parameters continuously and noninvasively may assist in diagnosing and monitoring patients with suspected cardiovascular diseases, or other critical illnesses. In this study, a method is proposed to estimate both the CO and SVR of a heterogeneous cohort of intensive care unit patients (N=48).
Spectral and morphological features were extracted from the finger photoplethysmogram, and added to heart rate and mean arterial pressure as input features to a multivariate regression model to estimate CO and SVR. A stepwise feature search algorithm was employed to select statistically significant features. Leave-one-out cross validation was used to assess the generalized model performance. The degree of agreement between the estimation method and the gold standard was assessed using Bland-Altman analysis.
The Bland-Altman bias ± precision (1.96 times standard deviation) for CO was -0.01 ± 2.70 L min⁻¹ when only photoplethysmogram (PPG) features were used, and for SVR was -0.87 ± 412 dyn·s·cm⁻⁵ when only one PPG variability feature was used.
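As a rough illustration of how a bias ± precision pair of this kind is obtained, the sketch below computes the mean difference and 1.96 times the standard deviation of the differences for paired measurements. It assumes the sample (n - 1) standard deviation; the function name and data values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def bland_altman(estimates, references):
    """Return bias, precision (1.96 * SD) and the limits of agreement."""
    diffs = [e - r for e, r in zip(estimates, references)]
    bias = mean(diffs)
    precision = 1.96 * stdev(diffs)  # sample SD is an assumption here
    return bias, precision, bias - precision, bias + precision


# Hypothetical paired values (estimated vs. reference), for illustration only.
estimated = [4.8, 5.6, 6.1, 3.9, 7.2]
reference = [5.0, 5.5, 6.4, 4.1, 7.0]
bias, precision, lower, upper = bland_altman(estimated, reference)
```

A bias near zero with a wide precision band, as in the CO result above, indicates little systematic offset but substantial scatter for individual patients.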
These promising results indicate the feasibility of using the method described as a non-invasive preliminary diagnostic tool in supervised or unsupervised clinical settings.
Cardiac output; Systemic vascular resistance; Photoplethysmography; Power spectrum analysis; Photoplethysmogram variability; Photoplethysmogram morphology; Feature selection
Rationale and Objectives
In quantifying medical images, length-based measurements are still obtained manually. Due to possible human error, a measurement protocol is required to guarantee the consistency of measurements. In this paper, we review various statistical techniques that can be used in determining measurement consistency. The focus is on detecting a possible measurement bias and determining the robustness of the procedures to outliers.
Materials and Methods
We review correlation analysis, linear regression, Bland-Altman method, paired t-test, and analysis of variance (ANOVA). These techniques were applied to measurements, obtained by two raters, of head and neck structures from magnetic resonance images (MRI).
Correlation analysis and linear regression were shown to be insufficient for detecting measurement inconsistency. They are also very sensitive to outliers. The widely used Bland-Altman method is a visualization technique, so it lacks numerical quantification. The paired t-test tends to be sensitive to small measurement bias. On the other hand, ANOVA performs well even under small measurement bias.
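One way to see why correlation alone can miss a measurement bias is a small sketch with two hypothetical raters, where the second reads about 2 units higher than the first. The data and function names are assumptions made for illustration; no p-value is computed, only the paired t statistic.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def paired_t_statistic(xs, ys):
    """t statistic for the mean of the paired differences (ys - xs)."""
    diffs = [y - x for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical measurements: rater B reads roughly 2 units above rater A.
rater_a = [10.0, 12.0, 15.0, 18.0, 20.0]
rater_b = [12.1, 13.9, 17.0, 20.1, 21.9]

r = pearson_r(rater_a, rater_b)
mean_diff = mean(b - a for a, b in zip(rater_a, rater_b))
t_stat = paired_t_statistic(rater_a, rater_b)
```

In this toy case the correlation is close to 1 even though the raters disagree systematically, while the mean difference and the paired t statistic both expose the offset.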
In almost all cases, using only one method is insufficient and it is recommended to use several methods simultaneously. In general, ANOVA performs the best.
measurement consistency; bias; outlier; head; neck; Bland-Altman
Physical activity self-report instruments in the US have largely been developed for and validated in White samples. Despite calls to validate existing instruments in more diverse samples, relatively few instruments have been validated in US Blacks. Emerging evidence suggests that these instruments may have differential validity in Black populations.
This report reviews and evaluates the validity and reliability of self-reported measures of physical activity in Blacks and makes recommendations for future directions.
A systematic literature review was conducted to identify published reports with construct or criterion validity evaluated in samples that included Blacks. Studies that reported results separately for Blacks were examined.
The review identified 10 instruments validated in nine manuscripts. Criterion validity correlations tended to be low to moderate. No study has compared the validity of multiple instruments in a single sample of Blacks.
There is a need for efforts validating self-report physical activity instruments in Blacks, particularly those evaluating the relative validity of instruments in a single sample.
To evaluate the accuracy of the swallowing kinematic analysis.
To evaluate the accuracy at various velocities of movement, we developed an instrumental model of linear and rotational movement, representing the physiologic movement of the hyoid and epiglottis, respectively. As a basic screening, a still image of 8 objects was used to measure the lengths of the objects. In addition, 18 movie files of the instrumental model moving at different velocities were recorded by videofluoroscopy. The images and movie files were digitized and analyzed by an experienced examiner, who was blinded to the study.
The Pearson correlation coefficients between the measured and instrumental reference values were over 0.99 (p<0.001) for all of the analyses. Bland-Altman plots showed narrow ranges of the 95% confidence interval of agreement between the measured and reference values as follows: 0.14 to 0.94 mm for distances in a still image, -0.14 to 1.09 mm/s for linear velocities, and -1.02 to 3.81 degree/s for angular velocities.
Our findings demonstrate that the distance and velocity measurements obtained by swallowing kinematic analysis are highly valid across a wide range of movement velocities.
Reproducibility of results; Biomechanics; Deglutition
The primary objective was to systematically review the medical literature for instruments validated for use in epidemiological and clinical research on waterpipe smoking.
We searched the following databases: MEDLINE, EMBASE, and ISI Web of Science. We selected studies using a two-stage duplicate and independent screening process. We included papers reporting on the development and/or validation of survey instruments to measure waterpipe tobacco consumption or related concepts. Two reviewers, working in duplicate and independently, used a standardized and pilot-tested data abstraction form to collect data from each eligible study. We also determined the percentage of observational studies assessing the health effects of waterpipe tobacco smoking and the percentage of studies of prevalence of waterpipe tobacco smoking that have used validated survey instruments.
We identified a total of five survey instruments. One instrument was designed to measure knowledge, attitudes, and waterpipe use among pregnant women and was shown to have internal consistency and content validity. Three instruments were designed to measure waterpipe tobacco consumption, two of which were reported to have face validity. The fifth instrument was designed to measure waterpipe dependence and was rigorously developed and validated. One of the studies of prevalence and none of the studies of health effects of waterpipe smoking used validated instruments.
A number of instruments for measuring the use of and dependence on waterpipe smoking exist. Future research should study content validity and cross cultural adaptation of these instruments.
The Richards-Campbell Sleep Questionnaire (RCSQ) is a simple, validated survey instrument for measuring sleep quality in intensive care patients. Although both patients and nurses can complete the RCSQ, interrater reliability and agreement have not been fully evaluated.
To evaluate patient-nurse interrater reliability and agreement of the RCSQ in a medical intensive care unit.
The instrument included 5 RCSQ items plus a rating of nighttime noise, each scored by using a 100-mm visual analogue scale. The mean of the 5 RCSQ items comprised a total score. For 24 days, the night-shift nurses in the medical intensive care unit completed the RCSQ regarding their patients’ overnight sleep quality. Upon awakening, all conscious, nondelirious patients completed the RCSQ. Neither nurses nor patients knew the others’ ratings. Patient-nurse agreement was evaluated by using mean differences and Bland-Altman plots. Reliability was evaluated by using intraclass correlation coefficients.
Thirty-three patients had a total of 92 paired patient-nurse assessments. For all RCSQ items, nurses’ scores were higher (indicating “better” sleep) than patients’ scores, with significantly higher ratings for sleep depth (mean [SD], 67  vs 48 , P = .001), awakenings (68  vs 60 , P = .03), and total score (68  vs 57 , P = .01). The Bland-Altman plots also showed that nurses’ ratings were generally higher than patients’ ratings. Intraclass correlation coefficients of patient-nurse pairs ranged from 0.13 to 0.49 across the survey questions.
Patient-nurse interrater reliability on the RCSQ was “slight” to “moderate,” with nurses tending to overestimate patients’ perceived sleep quality.
Objective(s): Reliability measures precision, or the extent to which test results can be replicated. This is the first systematic review to identify the statistical methods used to measure the reliability of equipment measuring continuous variables. This study also aims to highlight inappropriate statistical methods used in reliability analysis and their implications for medical practice.
Materials and Methods: In 2010, five electronic databases were searched for reliability studies published between 2007 and 2009. A total of 5,795 titles were initially identified. Only 282 titles were potentially related, and finally 42 fitted the inclusion criteria.
Results: The Intra-class Correlation Coefficient (ICC) was the most popular method, used in 25 (60%) studies, followed by comparison of means (8, or 19%). Of the 25 studies using the ICC, only 7 (28%) reported the confidence intervals and the type of ICC used. Most studies (71%) also tested the agreement of instruments.
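To illustrate why reporting the type of ICC matters, the sketch below computes two common single-measurement forms from the same two-way ANOVA mean squares: an absolute-agreement form (Shrout-Fleiss ICC(2,1)) and a consistency form (ICC(3,1)). The choice of these two forms, the function name, and the data are assumptions made for illustration; the review does not specify them.

```python
from statistics import mean


def icc_single(ratings, absolute=True):
    """Single-measurement ICC from a subjects-by-raters table.

    absolute=True  -> ICC(2,1), absolute agreement
    absolute=False -> ICC(3,1), consistency
    """
    n, k = len(ratings), len(ratings[0])
    values = [v for row in ratings for v in row]
    grand = mean(values)
    row_means = [mean(row) for row in ratings]
    col_means = [mean([row[j] for row in ratings]) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = sum((v - grand) ** 2 for v in values) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_error
    if absolute:
        denom += k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denom


# Hypothetical table: the second instrument always reads 1 unit higher.
table = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc_agreement = icc_single(table, absolute=True)
icc_consistency = icc_single(table, absolute=False)
```

With a constant offset between instruments, the consistency form is unaffected while the absolute-agreement form is reduced, so the same data can yield quite different coefficients depending on the type chosen.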
Conclusion: This study finds that the Intra-class Correlation Coefficient is the most popular method used to assess the reliability of medical instruments measuring continuous outcomes. There are also inappropriate applications and interpretations of statistical methods in some studies. It is important for medical researchers to be aware of this issue, and be able to correctly perform analysis in reliability studies.
ICC; Intra-class correlation coefficient; Reliability; Statistical method; Validation study