There is a history of over-prescription of antipsychotics to individuals with intellectual disability (ID), while antidepressants may be under-prescribed. However, appropriate treatment is best supported when the diagnosis of psychosis or depression is valid and carries good predictive validity. The present authors report a study examining one aspect of validity, namely whether skilled clinicians can agree on whether an individual with an ID is psychotic or depressed.
Pairs of clinicians assessed 52 individuals. Agreement was assessed using Cohen’s kappa statistic and agreement proportion.
Overall agreement was high for both psychosis and depression. Whether the individual had mild ID or moderate/severe ID did not have a significant impact on agreement.
Experienced clinicians achieved a high level of agreement as to whether a person with ID was psychotic or depressed, similar to that found for persons without ID. The findings provide some support for treatment interventions based on diagnosis.
The presence of psychosis in individuals with intellectual disability (ID) has been the subject of considerable comment (Aman & Singh 1986; Aman 1987; Fischbacher 1987; Clarke et al. 1990; Deb & Fraser 1994). Much of this literature asserts that people with ID are often prescribed antipsychotic medications without good evidence of a psychotic illness. A number of studies have reported that the rate of prescription of antipsychotic medication far exceeds the expected prevalence of psychoses in people with ID (Wressell et al. 1990; Kiernan et al. 1995; Emerson et al. 1997; Spreat & Conroy 1998; Robertson et al. 2000). In contrast, there is a smaller body of literature suggesting that receipt of antidepressant medication is relatively low (Howland 1992; Robertson et al. 2000; Sevin et al. 2003).
Appropriate pharmacotherapy for mental disorders in persons with ID is enhanced when diagnoses within this population are both accurate and carry predictive validity. There have been a number of papers describing the difficulties of diagnosis of mental disorders in people with ID (Borthwick-Duffy & Eyman 1990; Einfeld 1992; Sturmey 1993; Gabriel 1994; Moss et al. 1997, 2000; Prosser et al. 1998; Patel et al. 2001). Atypical presentations, maladaptive behaviours, communication and cognitive limitations, different developmental trajectories and limitations in lifestyle, and reliance on diagnostic criteria developed using intellectually normal individuals can lead to under- and mis-diagnoses (Reiss et al. 1982; Menolascino et al. 1986; Sovner 1986; Borthwick-Duffy & Eyman 1990; Sturmey 1993; Crews et al. 1994; Einfeld & Tonge 1999; Moss et al. 2000). The Diagnostic and Statistical Manual of Mental Disorders (DSM) and International Classification of Diseases (ICD) diagnostic systems are used extensively in general mental health practice, but they were developed for the non-ID population. The studies which exist examining psychiatric diagnoses in persons with ID have virtually all used modified diagnostic criteria, but have lacked consideration of inter-rater reliability (Sturmey 1993; King et al. 1994).
Inter-rater reliability refers to the consistency with which a particular assessment method comes to the same conclusions when applied by different raters to the same body of information. Reliability is particularly important when there is no ‘gold standard’, as is the case in making psychiatric diagnoses. Where the result of rater judgement is a classification, two classes of measure of agreement have been used – those based only on the percentage of subjects receiving the same classification from the raters (‘percentage agreement’) and those that also incorporate a ‘correction for chance agreement’. This study reports percentage agreement, Cohen’s kappa (the most widely used of the chance-corrected measures) and variants of this statistic incorporating adjustments for ‘bias’ and ‘prevalence’. Proportion (percentage) agreement varies between 0 and 1 (0 and 100), and is readily interpretable. Kappa can take negative values, but in practice mostly also lies between 0 and 1; its interpretation is less straightforward. Byrt et al. (1993) showed that kappa may be expressed as a function of percentage agreement, the disparity between ‘yeses’ and ‘noes’ among the agreements (‘prevalence’) and the disparity in the proportions of yeses between the raters (‘bias’). Departure from a 50/50 split of yeses and noes among the agreements lowers kappa, while departure from equality of proportions of yeses between raters increases kappa. It is wise, therefore, to keep prevalence and bias in mind when interpreting kappa.
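The relationships between these measures can be sketched numerically. The following is a minimal illustration (with hypothetical 2 × 2 counts, not the study’s data) of how proportion agreement, Cohen’s kappa, bias-adjusted kappa and prevalence-and-bias-adjusted kappa behave for a pair of raters making yes/no judgements:

```python
# Minimal sketch of the agreement statistics described above, using
# hypothetical 2x2 counts (not the study's data):
#   a = both raters say 'yes', d = both say 'no',
#   b, c = the two kinds of disagreement.
def agreement_stats(a, b, c, d):
    n = a + b + c + d
    p_obs = (a + d) / n  # proportion (percentage) agreement
    # Cohen's kappa: expected chance agreement from each rater's own marginals
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Bias-adjusted kappa: marginals averaged across the two raters
    p_yes = (2 * a + b + c) / (2 * n)
    p_exp_avg = p_yes ** 2 + (1 - p_yes) ** 2
    bak = (p_obs - p_exp_avg) / (1 - p_exp_avg)
    # Prevalence-and-bias-adjusted kappa: reduces to 2 * agreement - 1
    pabak = 2 * p_obs - 1
    return p_obs, kappa, bak, pabak

# Balanced prevalence, no bias: all three kappa variants coincide.
print(agreement_stats(20, 3, 3, 20))
# Low prevalence of 'yes' agreements: kappa falls below PABAK even though
# the proportion of agreement is unchanged.
print(agreement_stats(3, 3, 3, 37))
```

With a 50/50 split of yeses and noes and no bias between raters, all three kappa variants coincide and equal twice the agreement proportion less 1, consistent with the decomposition of Byrt et al. (1993).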
In considering the inter-rater reliability of psychiatric diagnoses in persons with ID, it is salient to consider the same in persons without communication or cognitive difficulties. Presumably, this provides the upper limit of the inter-rater reliability achievable with persons with ID. Here, the present authors consider the data available relating to the diagnoses of psychosis and depression.
The DSM-IV Field Trial for Schizophrenia and Related Psychotic Disorders (American Psychiatric Association, 1992) examined the reliability and concordance of three alternative sets of options for diagnosing DSM-IV psychotic disorders plus the criteria from DSM-III, DSM-III-R and ICD-10. In the international field trials for ICD-10, inter-rater agreement on diagnoses at the ‘two character’ group level yielded kappa values of 0.82 for schizophrenic disorders and 0.66 for depressive episode (Sartorius et al. 1993, 1995). However, subsequent studies using less structured formats have reported lower kappa values. Way et al. (1998) videotaped 30 emergency department psychiatric assessments, which were then re-rated by eight different psychiatrists; a kappa value of 0.64 was reported for psychosis and 0.48 for depression.
In a review examining the accuracy of the clinical examination for diagnosing clinical depression Williams et al. (2002) found seven studies using the Structured Clinical Interview for DSM Diagnoses (Spitzer et al. 1979) in which inter-rater reliability for major depression was evaluated. Study design ranged from multiple clinicians viewing a videotaped interview, to paired clinicians conducting sequential interviews, with training varying from psychology trainees to experts in the mood disorders field. Kappa values ranged from 0.64 to 0.93. They found a further seven studies evaluating inter-rater reliability of DSM-IV diagnoses using non-standardized interviews and thus more closely simulating clinical practice as in the current study. With the exception of one study using a videotaped interview, the study designs involved paired, generally blinded, interviewers conducting joint or sequential interviews. Here, the kappa-values ranged from 0.55 to 0.74.
In a smaller study, Miller (2001) compared the inter-rater reliability of structured versus unstructured interviews and found that the traditional, unstructured diagnostic assessment gave kappa values from 0.24 to 0.43, whereas the SCID-CV computer-assisted diagnostic interview gave a kappa of 0.75. A number of other studies (Keller et al. 1995; Roy et al. 1997; Shear et al. 2000; Simpson et al. 2002) support the finding that in individuals without ID, inter-rater agreement for diagnoses of psychosis and depression gives kappas in the range 0.6–0.8, with structured assessments yielding higher values than unstructured ones.
As previously noted, there has been little research on the inter-rater reliability of either the DSM or ICD diagnostic systems when used to assess persons with an ID, and there continues to be a relative lack of suitable alternative techniques for the detection and diagnosis of psychiatric morbidity in this population. Screening instruments, which tend to lack the depth required for accurate diagnosis, include the Psychopathology Instrument for Mentally Retarded Adults (PIMRA) (Matson et al. 1984), the Reiss screen (see Sturmey et al. 1995) and the Diagnostic Assessment for the Severely Handicapped (DASH) scale (Matson et al. 1991). Moss & Goldberg (Moss et al. 1993) developed the Psychiatric Assessment Schedule for Adults with a Developmental Disability (PAS-ADD), a semi-structured interview for use with persons with an ID that is based on diagnostic criteria utilised in ICD and DSM. From the PAS-ADD item set, 13 possible symptom syndromes, such as ‘depressed mood’ or ‘situational anxiety’, can be generated by the CATEGO computer algorithm. While a detailed report of the psychometric properties of these measures is outside the scope of this report, none of them has documented sufficient validity to warrant its use as a structured assessment protocol for the diagnosis of psychosis or depression in people with ID.
We sought to add to this information by measuring the extent to which experienced clinicians agreed about a diagnosis of depression or psychosis in individuals with ID.
Subjects for this study were recruited in two ways, chosen as the most expeditious means of obtaining sufficient numbers. The first approach was to invite psychiatrists with particular experience and expertise in the assessment of individuals with ID to refer subjects whom they thought might have depression of any type or psychosis of any type. The second was to ask general practitioners who cared for the general health of persons with ID to refer subjects whom they thought to be suffering from either psychosis or depression. That is, the first group was seen by experts in mental health and ID, and the second group by non-experts. For the first group, the subjects referred by psychiatrists with experience in ID (Rater 1) were seen by a psychiatrist member of the research team (Rater 2) within 3 weeks of the referring psychiatrist’s assessment, and the assessments of the referring psychiatrist and the research psychiatrist were examined for their inter-rater agreement. The subjects referred by the general practitioners were assessed by a member of the research team (Rater 1), with that assessment simultaneously observed by another member of the team (Rater 2). The team comprised experienced psychologists and psychiatrists specializing in psychopathology in persons with ID, and the assessments of these two research team members were examined for their inter-rater agreement. Thus, although the two groups were recruited differently, inter-rater agreement was in all cases compared between experts. Each rater was blind to the diagnosis reached by the other rater, but all raters were aware that subjects were suspected of having possible depression or psychosis.
As described earlier, given the lack of validated assessment tools, the present authors used a traditional psychiatric interview approach for the inter-rater assessments. That is, they interviewed and obtained a history from persons attending with the subject and where possible from the subject themselves. In addition, they also examined the mental state of the subject. The referring psychiatrists and the research assessors also completed a DSM-IV criteria checklist for schizophrenia and major depression as a semi-structured aid in assessment. For the purpose of this study, psychosis was defined as the presence of delusions or hallucinations in the past 2 years. Depression was defined as any DSM-IV depressive disorder in the last 12 months. The raters scored the diagnoses as present, maybe present, or absent. Three of the authors (SE, BT, CM) each completed 14 interviews and another 13 specialist clinicians completed three interviews on average.
There were three rating classifications – ‘yes’, ‘no’ and ‘maybe’. The ‘maybe’ ratings represent indecision between ‘yes’ and ‘no’ rather than a third category about which agreement was being investigated. It is reasonable to regard them as temporary results which would have been determined as ‘yes’ or ‘no’ after further information or observation. The present authors present parallel analyses. The first is based only on the yes and no ratings, ignoring any person about whom there was at least one maybe rating. The second is based on ‘resolving’ the maybes according to the following rule: yes and maybe resolves to yes and yes, no and maybe to no and no, and maybe and maybe to yes and no or no and yes, equally.
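The resolution rule just described can be expressed compactly. The sketch below (names are illustrative, not from the study) resolves a list of rating pairs, with double ‘maybe’s split equally between the two directions of disagreement by alternating them:

```python
# Sketch of the 'maybe' resolution rule described above. Each element of
# `pairs` is a (rater1, rater2) tuple drawn from 'yes', 'no', 'maybe'.
def resolve_ratings(pairs):
    resolved, flip = [], False
    for r1, r2 in pairs:
        if 'maybe' not in (r1, r2):
            # Both ratings definite: keep the pair unchanged.
            resolved.append((r1, r2))
        elif r1 == r2 == 'maybe':
            # Double maybes resolve to disagreements, alternating direction
            # so they are split equally between (yes, no) and (no, yes).
            resolved.append(('no', 'yes') if flip else ('yes', 'no'))
            flip = not flip
        else:
            # One maybe: the weight of evidence follows the definite rating,
            # so the pair resolves to an agreement.
            definite = r1 if r1 != 'maybe' else r2
            resolved.append((definite, definite))
    return resolved
```

The resolved pairs can then be tabulated into the 2 × 2 counts used for the agreement statistics.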
The level of agreement between raters is measured by simple percentage or proportion of agreement and by Cohen’s kappa. Bias-adjusted and prevalence-and-bias- adjusted versions of kappa are also presented.
The gender and level of ID of the subjects is shown in Table 1. The mean age of the participants was 31 years (age range = 11–63 years).
The results of ratings for depression are shown in Table 2.
Table 3 shows parallel results for all subjects, for those subjects with mild ID and for those with moderate or severe ID, first ignoring any person receiving at least one maybe classification and then with the maybes resolved according to the rule given above. The proportion of agreements ranges from 0.81 to 0.89. Bias-adjusted kappa (Byrt et al. 1993) is identical to kappa to second-decimal-place accuracy, and kappa differs from prevalence-adjusted-bias-adjusted kappa by at most 2 units in the second decimal place.
The lack of difference between bias-adjusted and raw kappa values for depression ratings shows that the raters do not disagree about the prevalence of depression among the young people with ID. The prevalence effects on kappas for agreement about depression, measured by the differences between bias-adjusted and prevalence- adjusted-bias-adjusted kappa values, are very small. The net result is that for measuring inter-rater agreement for depression nothing is gained by using kappa, which is difficult to interpret, in preference to proportions (or percentages) of agreement, which are transparent.
Considering only those decisions about which both raters are definite, raters are in agreement in 86% of cases, and this changes marginally to 87% when ‘maybe’ ratings are resolved in a plausible way.
The agreement percentages within the mild ID and moderate-to-severe ID groups do not differ significantly either from the overall figures or from each other.
Chance correction has no influence (to two-decimal-place accuracy) on kappas for depression when maybes are resolved.
Considering only those decisions about which both raters are definite, raters are in agreement in 85% of cases, and this changes marginally to 83% when ‘maybe’ ratings are resolved.
The agreement statistics within the mild ID and moderate-to-severe ID groups do not differ significantly either from the overall figures or from each other.
There is no observable bias between raters, as was also the case for agreements about depression. Kappas are slightly lower than their prevalence-adjusted-bias-adjusted equivalents – the lower prevalence of psychosis (compared to depression) among young people with ID has resulted in a small deflationary chance correction.
Agreement figures are slightly, but not significantly, higher for mild ID than for moderate-to-severe ID.
This study is one of the few to attempt to systematically assess whether experienced specialist clinicians can agree in making psychiatric diagnoses for individuals with ID. Overall, the present authors found that percentage agreement was high and similar to what has generally been found in studies of persons without ID, particularly where structured assessment methods were used. Presumably, the difficulties in eliciting psychopathological phenomena described previously were not such as to lead to greater disagreement, at least when assessments were conducted by specialists in mental health and ID. Agreement in this study may have been even higher without heterogeneity of assessors, but this heterogeneity is in keeping with practice in service settings.
Where raters registered uncertainty by giving a ‘maybe’ response the present authors sought to resolve the uncertainties in a plausible way, to obtain a reasonable estimate of what the state of agreement between the raters might have been had they resolved their own uncertainties. For a person about whom the ratings were one definite yes or no and one maybe they saw the likely result as agreement, as the weight of evidence about the young person was towards the definite judgement. Counterbalancing this resolution, where both raters were uncertain they resolved the maybes as disagreements in both directions, there being no weight of evidence in either. This ‘minimalist’ procedure implies no change in bias and little change in prevalence (the effects of both of which were in any case very small) from the definite-judgements-only situation. The resultant changes to agreement statistics, both percentage agreement and kappa, were very marginal: an increase for depression and a decrease for psychosis. Our comparisons of unadjusted and adjusted versions of kappa show that bias (disparity between the two raters’ proportions of yeses) is virtually absent in our raters’ judgements. Further, the prevalence effect (due to disparity between the overall proportions of yeses and noes) is not in evidence except in judgements about psychosis in those with mild ID. With negligible bias and prevalence effects kappa reduces to twice the agreement proportion less 1, as Byrt et al. (1993) have shown. In these circumstances the transparent concept of percentage agreement is to be preferred over kappa, which contains no more information and does not lend itself to any clear interpretation.
The study has made us aware of the difficulty in conducting inter-rater reliability studies of psychiatric diagnoses. There are many challenges to be overcome in arranging for sufficient numbers of subjects with sufficiently high rates of suspected disorders to be interviewed by independent raters of sufficient expertise in comparable circumstances. It will take a considerable effort by researchers to amass the data required to canvass the diagnostic reliability of other psychiatric disorders.
Preliminary work towards the development of a diagnostic system specific for ID, such as the “Diagnostic criteria for psychiatric disorders for use with adults with learning disabilities” (DC-LD) (Royal College of Psychiatrists, 2001) creates the conditions necessary for the development of structured diagnostic instruments. These instruments will need to rely on observed behaviours and carer reports augmented if possible by self reflection by the person with ID. Until then, this study provides some confidence in the application, by specialist clinicians, of existing diagnostic criteria for depression and psychosis to adults with ID.
The finding that overall reliability was substantial is a necessary, though insufficient, condition for concluding that the diagnoses of depression and psychosis are valid in persons with ID. Ultimately, the clinical importance of such validity rests largely in its capacity to predict treatment outcome. Therefore, one might argue that since the advent of antidepressants and antipsychotics with fewer side-effects it is justified to undertake a therapeutic trial on the basis of a probable diagnosis. However, newer psychopharmacological agents do have troublesome, if less toxic, side-effects. A systematic approach to the usage of psychotherapeutic medications in the face of uncertain diagnostic reliability is described in detail in Einfeld (2001). Cautious prescribing for psychiatric diagnoses will continue to be needed for individuals with ID for the foreseeable future, and will thus be supported by diagnoses as valid as possible within the constraints of reliability limits.
This study was supported by NHMRC Australia Grant 113844. Doctors who contributed psychiatric assessments were Bruce Chenoweth, Michael Fairley, Terry Heins, Peter Wurth, Jim Friend, Sophie Kavanagh, Julian Davis, David Moseley, Robert Davies and Jennifer Torr.