For the purposes of this study, calibration refers to the process of moving the practitioners toward the same TCM diagnosis when interviewing the same patient. Recalibration was a follow-up exercise in which we completed a second round of patient interviews to estimate reliability among practitioners. If any drift in diagnostic styles occurred, a similar training cycle would be implemented. Reliability, however, is a separate concept and refers to the consistency of two or more practitioners producing the same diagnosis for the same patient.
In a previous study, 10 TCM practitioners diagnosed participants using an open-ended questionnaire designed to cover the major distinguishing factors of diagnoses found in TMJD patients.23
The questionnaire allows for multiple diagnostic outcomes, reflecting common practice among TCM practitioners. The form was designed to follow the usual TCM diagnostic interview process, beginning with the chief complaint and then identifying the key components of the TCM 10 Questions. The tongue, pulse, and observations are recorded, followed by the organs most affected. At the end of the form, 19 diagnoses are each given a score of 0 to 10 to indicate their severity or clinical relevance; space is provided to write in additional diagnoses. In this way the practitioners are guided from a broad perspective toward a diagnosis at the end of the form. Our form was designed to organize the vast amount of health history and current complaints into a format that would help the practitioner assess and rank the diagnoses. The accompanying figure shows the final page of the diagnostic questionnaire.
Scaled diagnoses in diagnostic questionnaire.
Two-and-a-half days were set aside for the initial practitioner calibration. The first half-day focused on familiarization with the questionnaire-based diagnostic form, the study population (noting common diagnoses and presentations from the previous study23), and the treatment protocol. There were also discussions about some of the key characteristics that practitioners use to distinguish among diagnoses, including the overall importance of tongue and pulse observations.
On the second day, practitioners were paired for the diagnostic sessions in a round-robin manner, rotating partners with each new participant. A total of 15 participants were each interviewed by four pairs of practitioners, for a total of 120 diagnoses (15 participants × 4 pairs × 2 practitioners). Women and men were recruited in Tucson through newspaper advertisements and flyers at local medical offices. Participants ranged from healthy individuals to complex cases with multiple chronic diseases.
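The round-robin pairing described above can be sketched with the standard circle method. This is only an illustration: the practitioner labels, the assumption of eight practitioners (inferred from four simultaneous pairs), and the function name `round_robin` are hypothetical and do not come from the study protocol.

```python
def round_robin(names):
    """Return a list of rounds; each round is a list of disjoint pairs.

    Uses the circle method: the first name stays fixed and the others rotate,
    so every practitioner meets every other exactly once over len(names) - 1 rounds.
    """
    if len(names) % 2:
        raise ValueError("an even number of practitioners is required")
    order = list(names)
    half = len(order) // 2
    rounds = []
    for _ in range(len(order) - 1):
        # Pair position i with its mirror position from the end of the list.
        rounds.append([(order[i], order[-1 - i]) for i in range(half)])
        # Rotate everyone except the first name by one position.
        order = [order[0], order[-1]] + order[1:-1]
    return rounds


# Hypothetical labels for eight practitioners (an assumption, not study data).
practitioners = list("ABCDEFGH")
schedule = round_robin(practitioners)

# With 15 participants, the 7-round schedule would simply be cycled.
for participant in range(1, 4):
    pairs = schedule[(participant - 1) % len(schedule)]
    print(f"participant {participant}: {pairs}")

# Diagnosis count stated in the text: 15 participants x 4 pairs x 2 practitioners.
total_diagnoses = 15 * 4 * 2
```

Cycling a fixed schedule is one simple way to ensure that partners change with each new participant; the study may have rotated pairs differently.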
For each session, one practitioner would lead the interview while the other would simply take notes on the inquiry. After the lead interviewer exhausted the inquiry portion of the diagnosis, the second practitioner would ask any additional questions felt necessary to clarify the diagnosis. Then both practitioners would assess pulse and tongue individually. Finally, they would review their own notes and score the diagnoses. The practitioners were not allowed to discuss the case until the process was finished. After each patient was interviewed, questionnaires with diagnosis scores were returned to one of the authors, who entered the data immediately so that it could be reviewed in real time.
Between waves of participants, the entered scores were shown to the whole group of practitioners, with practitioners able to see their own scoring in relation to all the others. Practitioners, led by the first author, discussed any outlying diagnoses or severity scores. These discussions were intended to help the practitioners develop convergence on future diagnoses as well as on the meaning of the severity scores.
A similar process was completed 18 months later at both sites to evaluate how well the training had been maintained. The recalibration took place in three sessions: one with practitioners in Tucson, one with practitioners in Portland, and one with 2 diagnosticians. As in the calibration exercise, all interviews were conducted in pairs, with partners rotating and the lead interviewer alternating within each pair. After two diagnostic sessions, practitioners reviewed the outcomes. The participants being diagnosed had TMJD signs or symptoms but had not yet received a diagnosis of TMJD.
The diagnosticians also completed an additional exercise in which each assigned diagnostic scores based solely on diagnostic questionnaires that practitioners had already completed for study participants. They reviewed the data from 11 questionnaires, presented without the diagnoses, as a first step toward understanding whether the form captures enough information to make a diagnosis and, if not, what additional information is needed.
The main analysis uses the Fleiss' kappa statistic,24 a measure for assessing the chance-corrected agreement between a set number of raters when assigning categorical ratings to a number of items,24,25 providing overall agreement by wave of participants. Fleiss' kappa, unlike the more familiar Cohen kappa, is appropriate when there are multiple raters. This is the most conservative approach; however, this methodology does not reflect the realities of the TCM diagnostic system.
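For readers unfamiliar with the statistic, the following is a minimal sketch of the textbook Fleiss' kappa computation, assuming each participant receives one categorical label (for example, a primary diagnosis) from the same number of raters. The function name `fleiss_kappa` and the example labels are illustrative assumptions; this is not the analysis code used in the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical labels.

    Every item must carry the same number of ratings (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items

    # Chance agreement from the overall share of each category.
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


# Illustrative labels only (not study data): three participants, four raters each.
example = [
    ["Liver Blood Xu", "Liver Blood Xu", "Liver Yin Xu", "Liver Blood Xu"],
    ["Kidney Qi Xu", "Kidney Qi Xu", "Kidney Qi Xu", "Kidney Qi Xu"],
    ["Heart Qi Xu", "Liver Yin Xu", "Heart Qi Xu", "Kidney Qi Xu"],
]
print(round(fleiss_kappa(example), 3))
```

Because every category mismatch counts as full disagreement, closely related diagnoses such as Liver Blood Xu and Liver Yin Xu lower the statistic just as much as unrelated ones.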
Most participants had multiple diagnoses; only one practitioner diagnosed a participant as having a single diagnosis in the calibration exercise. Further, within the field of TCM diagnostic classifications, a natural progression of disease is observed. For example, one practitioner might feel that a participant has severe Liver Blood Xu while the next might diagnose the participant as having Liver Yin Xu. In TCM, Blood Xu leads to Yin Xu and Qi Xu leads to Yang Xu. Therefore, a second kappa, called whole systems kappa for clarity, was calculated. It counted as agreement both those cases in which the practitioners' top two diagnoses were the same, even if they disagreed on which was the more severe, and those in which the diagnoses differed only by disease progression. Specifically, cases where the practitioners identified the same organ but differed on the substance (e.g., Liver Blood Xu and Liver Yin Xu) were considered in agreement. Likewise, Heart Blood Xu and Heart Yin Xu; Heart Qi Xu and Heart Yang Xu; and Kidney Qi Xu and Kidney Yang Xu were considered to be in agreement. This effectively reduced the number of distinct diagnoses by four.
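The agreement rule for the whole systems kappa can be sketched as follows. The four merged pairs come from the text; everything else is an assumption made for illustration, including the merged category names, the helper names `collapse`, `top_two`, and `whole_systems_agree`, the alphabetical tie-breaking, and the choice to take the top two raw scores before merging.

```python
# Progression pairs named in the text, mapped to hypothetical merged labels.
PROGRESSION_GROUPS = {
    "Liver Blood Xu": "Liver Blood/Yin Xu",
    "Liver Yin Xu": "Liver Blood/Yin Xu",
    "Heart Blood Xu": "Heart Blood/Yin Xu",
    "Heart Yin Xu": "Heart Blood/Yin Xu",
    "Heart Qi Xu": "Heart Qi/Yang Xu",
    "Heart Yang Xu": "Heart Qi/Yang Xu",
    "Kidney Qi Xu": "Kidney Qi/Yang Xu",
    "Kidney Yang Xu": "Kidney Qi/Yang Xu",
}


def collapse(diagnosis):
    """Map a diagnosis to its merged progression category, if it has one."""
    return PROGRESSION_GROUPS.get(diagnosis, diagnosis)


def top_two(scores):
    """Return the two highest-scored diagnoses as an unordered, collapsed set.

    Ties are broken alphabetically (an assumption; the text does not say).
    """
    ranked = sorted(scores, key=lambda dx: (-scores[dx], dx))
    return frozenset(collapse(dx) for dx in ranked[:2])


def whole_systems_agree(scores_a, scores_b):
    """True when two practitioners share the same collapsed top two diagnoses,
    regardless of which of the two each practitioner ranked as more severe."""
    return top_two(scores_a) == top_two(scores_b)


# Illustrative 0-10 severity scores only (not study data).
practitioner_a = {"Liver Blood Xu": 8, "Kidney Qi Xu": 6, "Heart Qi Xu": 2}
practitioner_b = {"Kidney Yang Xu": 9, "Liver Yin Xu": 7, "Heart Blood Xu": 1}
print(whole_systems_agree(practitioner_a, practitioner_b))
```

Using an unordered set captures the rule that disagreement about which of the top two diagnoses is more severe still counts as agreement.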
Finally, intraclass correlation, using a two-way mixed effects model in which practitioner effects are random and diagnosis effects are fixed, was used to compare agreement on the most common diagnoses. Only those cases where practitioners diagnosed the participants with the same primary diagnosis (the diagnosis with the highest severity rating) were considered as matching.
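As an illustration of a two-way mixed effects intraclass correlation, the sketch below computes the single-measure consistency form often written ICC(3,1). The text does not state which form was used or how the data were arranged, so the layout (rows as rated cases, columns as raters), the function name `icc_3_1`, and the example scores are all assumptions.

```python
def icc_3_1(scores):
    """Single-measure, consistency ICC from a two-way layout.

    scores: list of rows, one per rated case, each holding one score per rater.
    """
    n = len(scores)      # number of rated cases (rows)
    k = len(scores[0])   # number of raters (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_error) / denominator


# Illustrative 0-10 severity scores only (not study data):
# four rated cases, each scored by three raters.
example_scores = [
    [8, 7, 9],
    [3, 4, 3],
    [6, 6, 5],
    [1, 2, 2],
]
print(round(icc_3_1(example_scores), 3))
```

The consistency form ignores a constant offset between raters, so a rater who scores every case one point higher than a colleague does not reduce the coefficient; an absolute-agreement form would penalize that offset.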