The aim of the study was to examine the usefulness of the specific scoring algorithms for the SDQ proposed by earlier UK findings, when used as a screening test to detect mental health disorders among patients in the CAMHS North Study. Sensitivity and specificity are important to clinicians because these measures indicate how many people with, and without, disorders the SDQ can correctly identify. Our results varied according to the dichotomisation level applied in the SDQ diagnostic algorithm, and also by diagnostic category.
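For readers less familiar with these screening statistics, the two measures can be sketched as simple proportions from a 2x2 table of screening results against diagnoses. The counts below are hypothetical, chosen only for illustration; they are not data from the CAMHS North Study:

```python
# Sensitivity and specificity from a 2x2 screening table.
# All counts below are hypothetical, not CAMHS North Study data.

def sensitivity(tp: int, fn: int) -> float:
    """Proportion of children WITH a disorder whom the screen flags positive."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Proportion of children WITHOUT a disorder whom the screen leaves negative."""
    return tn / (tn + fp)

# Hypothetical counts: 80 true positives, 20 false negatives,
# 150 true negatives, 50 false positives.
print(f"sensitivity = {sensitivity(80, 20):.2f}")  # sensitivity = 0.80
print(f"specificity = {specificity(150, 50):.2f}")  # specificity = 0.75
```

A screen with high sensitivity misses few true cases; a screen with high specificity raises few false alarms, and the two typically trade off as the cut-off moves.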
For both levels of dichotomisation, emotional disorders had the lowest sensitivity. Our results for the most commonly used 'probable' dichotomisation level, which yielded a cut-off of approximately 90% in epidemiological samples, were almost identical to those reported by Mathai and colleagues [5]. Goodman and colleagues [21] also reported a lower sensitivity for emotional disorders than for the other diagnostic categories in the British sample, but not as low as in the present study. This difference may be an effect of Norwegian parents' and teachers' 'blind spot', or 'normalising' view of emotional difficulties, which was also reported by Heiervang, Goodman and Goodman [33]. The fact that parents describe emotional difficulties in the semi-structured questions (free text) without reporting the same difficulties as problematic in the structured (yes/no) part may explain why the rates of clinician-assigned DAWBA diagnoses are higher than the SDQ 'probable' screening rate for emotional disorders. This contrasts with all other categories of disorders, where, as expected, the rates of clinician-assigned DAWBA diagnoses are the lowest, a consequence of the screening cut-offs being set at approximately 80% and 90% respectively, chosen to ensure inclusion of most cases in a population with a psychiatric disorder prevalence of 7-8%. It is also generally accepted that parents are insensitive to children's emotional symptoms and that adolescents' reports of emotional problems are more valid than their parents' and teachers' reports [34]. This knowledge may have affected the assessments of the diagnosing clinicians in our study and resulted in lower sensitivity. For both hyperactivity and conduct disorders, as well as for 'any disorder', our results showed high sensitivity, ranging from 77% to 100%. Nevertheless, these values were lower than those reported by Goodman and colleagues [21] for hyperactivity and conduct disorders in their English sample, and for hyperactivity disorders in their Bangladeshi sample. Compared to Mathai and colleagues [5], our results were substantially more sensitive for hyperactivity disorders, and slightly less sensitive for conduct and emotional disorders. As expected, our results for the 'possible' dichotomisation level, which yielded a cut-off at approximately 80%, were more sensitive for psychiatric disorders.
Specificity also depended on dichotomisation level and diagnostic category. All specificity results for the 'possible' dichotomisation level were lower than those for the 'probable' level. The specificity for 'any disorder' was the lowest, regardless of the level of dichotomisation, and considerably lower than the specificity for the individual categories. All specificity results were comparable to those reported by Goodman and colleagues [21], except for conduct disorders, for which specificity was substantially higher than in the British sample. This may be due to differences between the countries, in that British parents and teachers may report more problems, whereas their Norwegian counterparts tend to report fewer. In contrast to emotional disorders, the lower SDQ questionnaire scores for conduct problems seem to reflect a real and substantially lower prevalence of conduct disorders in Norway compared to Great Britain [33]. The above-mentioned studies did not report screening efficiency statistics for the diagnostic category 'any disorder'. Overall, our sensitivity and specificity results strengthen the earlier reported usefulness of the SDQ as a screening instrument for mental health problems when used in epidemiological research. Regarding clinical use, despite differences in culture and language, the scoring algorithms worked equally well in the Norwegian CAMHS North Study as in English, Bangladeshi, and Australian clinics. With the most common cut-off at approximately 90%, the SDQ will correctly identify four out of five children with psychiatric diagnoses, except for emotional disorders, and also correctly identify most children without diagnoses, except for 'any disorder'. Unfortunately, 23-54% of positive screening results will be false positives and 6-35% of negative screening results will be false negatives, depending on the diagnostic category. A cut-off at approximately 80%, on the other hand, will correctly classify almost all children with one or more diagnoses, but only half or fewer of the children without diagnoses will be correctly classified. The range of false positives will increase to between 29 and 72%, and the false negatives will decrease to between 0 and 26%, depending on the diagnostic category. The choice of cut-off may depend on the relative importance of false positives and false negatives. For research purposes both scenarios are sufficient, but not for clinical purposes, for which the rates of false positives are not acceptable.
Sensitivity and specificity are important from a population perspective, but for patients and their clinicians the PPV, NPV, and LHR+ may be more informative, as they show the probability of a disorder given a positive or negative screening result. Compared to the findings from a Norwegian study of children with chronic physical illnesses [19], our results showed a higher PPV but a lower NPV for 'any disorder'. Our results by diagnostic category showed a high NPV and a lower PPV, very similar to the results reported by Goodman and colleagues [21]. This indicates that the SDQ functions considerably better as a tool to rule out, rather than to confirm, possible psychiatric diagnoses. The pattern may be even more pronounced when mental health problems are combined with chronic physical illness.
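The relationship between these predictive values and a 2x2 screening table can be sketched as follows; the counts are hypothetical, for illustration only, not results from the study:

```python
# PPV and NPV from a hypothetical 2x2 screening table
# (illustrative counts only; not CAMHS North Study results).

def ppv(tp: int, fp: int) -> float:
    """Probability of a disorder given a positive screening result."""
    return tp / (tp + fp)

def npv(tn: int, fn: int) -> float:
    """Probability of no disorder given a negative screening result."""
    return tn / (tn + fn)

# With 80 true and 50 false positives, 150 true and 20 false negatives:
print(f"PPV = {ppv(80, 50):.2f}")   # PPV = 0.62
print(f"NPV = {npv(150, 20):.2f}")  # NPV = 0.88
```

In this notation, 1 − PPV is the proportion of positive screens that are false positives, and 1 − NPV the proportion of negative screens that are false negatives, which is the framing used in the discussion of cut-offs above.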
To our knowledge, LHR+/- and ORD have not been reported in previous studies. Our results showed that when using the most common dichotomisation ('probable' level) at approximately 90%, none of the diagnostic categories fell within the ORD interval for potentially useful tests. This may seem surprising, since relatively high ORDs were reported (6.05-14.41), but it is mainly explained by confidence intervals too wide for the ORDs to be considered stable, high estimates. However, hyperactivity disorders, conduct disorders, and 'any disorder' fell within the LHR- interval for potentially useful tests. When the 'possible' dichotomisation level was used, all LHR+ results were worse and all LHR- results were better, yielding ORD results within the interval for potentially useful tests for the hyperactivity and conduct disorder categories. For a patient with a negative screening result this is good news, because it means that the result is almost certainly correct. However, for a clinician, and for patients with positive screening results, it is also important that the PPV and LHR+ are high, in order to reduce both the economic and emotional costs associated with unnecessary further evaluations of patients who do not have the disorder of interest.
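Assuming ORD here denotes the diagnostic odds ratio (the ratio LHR+/LHR−, a standard summary of screening accuracy), the three statistics can be derived directly from sensitivity and specificity; the input values below are hypothetical, not study estimates:

```python
# LHR+, LHR- and their ratio, computed from sensitivity and specificity.
# Assumption: ORD denotes the diagnostic odds ratio LHR+/LHR-.
# The inputs (0.80 and 0.75) are hypothetical, not study estimates.

def likelihood_ratios(sens: float, spec: float) -> tuple[float, float]:
    lhr_pos = sens / (1 - spec)   # how much a positive screen raises the odds of disorder
    lhr_neg = (1 - sens) / spec   # how much a negative screen lowers the odds of disorder
    return lhr_pos, lhr_neg

lhr_pos, lhr_neg = likelihood_ratios(0.80, 0.75)
diagnostic_or = lhr_pos / lhr_neg  # assumed ORD
print(f"LHR+ = {lhr_pos:.2f}, LHR- = {lhr_neg:.2f}, ORD = {diagnostic_or:.1f}")
# LHR+ = 3.20, LHR- = 0.27, ORD = 12.0
```

The sketch shows only the arithmetic relationship between the statistics; whether a given LHR or ORD falls within the 'potentially useful' interval depends on the thresholds adopted in the study.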
The clinical implication of our results is that the SDQ by itself is not a sufficient screening instrument for psychiatric disorders when used among patients in the CAMHS North Study in Norway. Our results showed that the SDQ is better at detecting the presence of 'any' diagnosis than of more specific diagnostic categories. Conversely, the SDQ is better at ruling out the presence of specific categories of psychiatric disorders than at ruling out the presence of 'any disorder'. Our results are in accordance with previous studies [5] that clearly showed the unsuitability of the SDQ for diagnostic purposes in a clinical setting, but contrary to these studies our results call into question the usefulness of the SDQ for identifying children who are in need of further psychiatric evaluation, as the PPV and LHR+ results are low. According to our results, the SDQ is best used to identify those children and adolescents who do not need further psychiatric evaluation. Such clinical practice is, however, problematic, since children suffering from monosymptomatic disorders (e.g. tic disorders, enuresis, eating disorders) will not be identified by screening with the SDQ.
There are some limitations to this study. One is that the diagnosing clinicians were not blinded to the SDQ predictions while assigning the clinical diagnoses based on the DAWBA. This might have affected the clinical assessment and biased the results towards better agreement between the SDQ and the clinical diagnoses. Some previous studies have blinded the clinical experts to avoid this bias [5], although others [19] have used the same procedure as reported in the present study. Another bias towards better agreement is that the SDQ and DAWBA information were collected at the same time, which rules out changes in mental health status between assessments. On the other hand, the use of multiple informants, as in our study, is often a clinical necessity, but from a research point of view this more complex and sometimes contradictory information may weaken the agreement between raters. The strength of our procedure lies in its ecological validity, as our diagnostic procedure is quite similar to ordinary day-to-day practice in Norwegian CAMHS, including the use of the original UK scoring algorithms.
Another limitation is the assumption that the clinician consensus diagnoses constitute the gold standard. As previously documented, there is poor agreement between structured interviews and clinician-assigned diagnoses, and little knowledge about which methods are the most valid [36]. There is no single objective feature that distinguishes any mental health diagnosis. Costello, Egger, and Angold [37] stated that structured interviews are the closest we can come to a gold standard for psychiatric diagnoses. Thus, diagnoses assigned by clinical experts aided by a structured interview such as the DAWBA may be considered the best available reference for comparison. Such procedures are imperfect, but nevertheless valuable as long as mental health diagnostics are based on developmental history, behavioural observations and reported difficulties in everyday life.
Further research is needed to determine whether combining the SDQ with other measures of symptoms and severity can improve the ability to detect mental health disorders among patients referred to CAMHS. More efficient case-finding strategies, as suggested by Ullebø et al. for the ADHD phenotype [38], could also optimise the potential of the SDQ as a screening instrument for Norwegian CAMHS. Another aspect that merits further research is the identification of characteristics of either the patient or the other SDQ informants that might increase the risk of false-positive or false-negative results. With a future database large enough to subdivide the overall sample, subgroup-specific algorithms could be established and reported to facilitate comparisons between different clinical samples (e.g. with respect to age, gender, and diagnostic categories) as well as the identification of protective and/or risk factors.