We have systematically reviewed 15 studies of the HDS, ten of the IHDS, and one that included both scales. Most studies in the review apply to the original intended role of the HDS and IHDS (screening rather than diagnosis), in that participants were not selected on the basis of symptoms. Summary estimates for the HDS as a test for HAD or an equivalent diagnosis (severe HAND) were: sensitivity 68%, specificity 78%, LR+ 3.1, LR− 0.41, but its accuracy appeared to be lower when analysis was limited to studies with high-quality reference standards and unselected populations. When using the HDS as a test for MND or equivalent (all symptomatic HAND), estimates of accuracy were: sensitivity 42%, specificity 91%, LR+ 4.8, LR− 0.64. Summary estimates for the IHDS as a test for severe HAND were: sensitivity 74%, specificity 55%, LR+ 1.6, LR− 0.47. When using the IHDS as a test for all symptomatic HAND, estimates of accuracy were: sensitivity 64%, specificity 66%, LR+ 1.9, LR− 0.54. These summary estimates and most individual study estimates for both scales failed to reach the levels of accuracy generally accepted as providing strong evidence for a diagnosis of HAND, confirming their unsuitability for diagnostic purposes when used alone.
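As a rough check on how the likelihood ratios relate to the summary sensitivity and specificity quoted above, the following Python sketch applies the standard definitions LR+ = sensitivity / (1 − specificity), LR− = (1 − sensitivity) / specificity, and DOR = LR+ / LR−. The function name and the dictionary of estimates are illustrative only. The inputs are the rounded percentages from the text, not the pooled meta-analytic estimates, so small differences from the quoted values (for example, for the HDS as a test for MND) are to be expected.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-, DOR) for a test with the given accuracy.

    Both arguments are proportions strictly between 0 and 1.
    """
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    diagnostic_odds_ratio = lr_positive / lr_negative
    return lr_positive, lr_negative, diagnostic_odds_ratio


# Rounded summary estimates quoted in the text: (sensitivity, specificity).
# These are illustrative inputs, not the pooled model output.
SUMMARY_ESTIMATES = {
    "HDS, severe HAND": (0.68, 0.78),
    "HDS, all symptomatic HAND": (0.42, 0.91),
    "IHDS, severe HAND": (0.74, 0.55),
    "IHDS, all symptomatic HAND": (0.64, 0.66),
}

if __name__ == "__main__":
    for label, (sens, spec) in SUMMARY_ESTIMATES.items():
        lr_pos, lr_neg, dor = likelihood_ratios(sens, spec)
        print(f"{label}: LR+ {lr_pos:.1f}, LR- {lr_neg:.2f}, DOR {dor:.1f}")
```

A DOR computed this way from rounded point estimates should be read only as an approximation to the pooled DOR used to compare the two scales.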
Comparing the two scales, the HDS had a higher DOR and LR+ than the IHDS, but the only direct comparison of both scales within the same sample failed to find a difference between the two, and was limited by its small sample size. Furthermore, the two scales were studied in different settings, with most of the HDS studies conducted in North America, and most of the IHDS studies conducted in Africa. Unfortunately, while the IHDS was developed with resource-limited settings in mind, it is not free from culturo-linguistic effects. The four-word recall task (in both tests) must be modified for different languages, and in an Indian population IHDS score was associated with education but not with HIV status. The two scales were also studied in different years, and considerable changes in our understanding of HIV pathogenesis and treatment occurred in the decade between the publication of the HDS in 1995 and the IHDS in 2005.
Estimates of screening accuracy showed wide variation between studies, particularly for the HDS. We did not find strong evidence of a diagnostic threshold effect. However, tests of correlation used to demonstrate this effect are known to have low statistical power, and the reference diagnosis of HAD is complex and subject to variations of interpretation. It is therefore plausible that differences between reference standards contributed to the varying accuracy of these well-standardised diagnostic tools.
Regarding other sources of variability, an increased DOR and lower LR− were seen in two studies assessing the HDS in patients with more advanced immunodeficiency. Spectrum bias is a form of selection bias that may occur when the study population is sampled from a limited or specialised clinical setting and therefore has a narrow spectrum of disease. This form of bias could have increased sensitivity in samples of more severely impaired patients, such as those recruited in Africa, in the pre-HAART era, or from hospital wards. Spectrum bias could also reduce specificity in those in whom it was difficult to exclude competing diagnoses, such as in resource-limited settings, or conversely increase specificity in samples with fewer competing diagnoses. Non-random, non-consecutive sampling strategies are known to lead to over-estimation of accuracy.
There were a number of methodological limitations to this review. First, the literature search and data extraction were carried out by a single author (LJH). Second, the literature search could have missed studies not cited in the target data sources, or articles in which it was not clear from the abstract that neurocognitive testing was done. To minimise this, researchers in the field were asked about the existence of unpublished datasets. Third, it was not always possible to generate two-by-two tables from available data, usually because HDS and IHDS scores were reported as continuous variables. In a few studies, the estimated values were not consistent with other information in the same article, suggesting other unknown errors in the results. This was despite requests for reconfigured data directly from researchers.
More importantly, the review is limited by the lack of a clinical gold standard for neurocognitive impairment in HIV, whether this be neurological criteria, neuroimaging findings, biomarkers in cerebrospinal fluid, or histopathology. The Frascati criteria are relatively detailed, objective, and appropriate for a research definition, so the analysis in this review provides the best available estimates of the accuracy of the HDS and IHDS when used as screening tools for MND or HAD. However, current data do not clearly inform clinicians of the natural history or appropriate treatment of these conditions, particularly milder impairment, and this limits our ability to predict the effects of screening.
British HIV Association (BHIVA) guidelines do not comment on screening for HAND, whereas the European AIDS Clinical Society (EACS) guidelines recommend a brief symptom questionnaire in all patients at regular intervals, and a recent review made similar recommendations but did not support one screening test over another. The general rule that one should minimise false positives if the confirmatory test is expensive or invasive favours the HDS over the IHDS, and the penalty for missing an asymptomatic case of HAND is arguably not high, so the lower-sensitivity test is acceptable. The prevalence of HAD was 2–4% of HIV-positive individuals in recent surveys in the US and Switzerland, lower than the prevalence in most studies included in this review. At this low prior probability, one might confidently exclude the diagnosis with a negative HDS, but the posterior probability would be less than 15% after a positive HDS. In comparison, when used as a test for MND, a positive HDS result would give a posterior probability of 56% in the presence of a prior probability of 20%.
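The posterior probabilities discussed above follow from the odds form of Bayes' theorem: posterior odds = prior odds × likelihood ratio. The sketch below is a minimal illustration using the rounded summary LRs quoted earlier in this section; the function name is an assumption made for the example. Because the inputs are rounded, the results should be expected to agree with the figures in the text only to within about a percentage point.

```python
def posterior_probability(prior, likelihood_ratio):
    """Convert a prior probability to a posterior probability.

    Uses the odds form of Bayes' theorem:
    posterior odds = prior odds * likelihood ratio.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # HDS as a test for HAD at the upper end of the 2-4% prevalence range,
    # using the rounded summary LR+ of 3.1 and LR- of 0.41.
    print(f"HAD, positive HDS: {posterior_probability(0.04, 3.1):.1%}")
    print(f"HAD, negative HDS: {posterior_probability(0.04, 0.41):.1%}")

    # HDS as a test for MND with a prior probability of 20%,
    # using the rounded summary LR+ of 4.8.
    print(f"MND, positive HDS: {posterior_probability(0.20, 4.8):.1%}")
```

Working in odds rather than probabilities keeps the update to a single multiplication, which is why likelihood ratios are convenient for comparing the two scales across settings with different prevalence.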
A screening test is an intervention that should be subject to interventional research like any other, and for it to be routinely used in clinical practice, the evidence base should address the next steps in the clinical pathway. For example, we need to evaluate how to investigate patients further, how to predict their outcome, and how to modify medical therapy in the light of a positive or negative screening test. On the tests themselves, studies are needed to determine their repeatability, intra-subject variation, and learning effects, and to understand the causes of false positive and false negative results (not explored in the studies reviewed). Further studies of the HDS and IHDS should adhere to STARD guidelines. Specific settings of interest are the use of the HDS in an African or other resource-limited setting, or the IHDS in a North American or European setting with high ART coverage and relatively preserved immune function. There may be a role for studying the scales specifically in older adults, given the growing proportion of HIV+ individuals over the age of 50 and their greater risk of HAND, although the ability of the scales to distinguish between HAND and non-HIV causes of neurocognitive impairment has not been assessed. One could also model theoretical screening programmes for neurocognitive impairment within HIV-positive populations of known prevalence.
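As a starting point for the kind of modelling suggested above, the following sketch tabulates the expected outcomes of a single round of screening in a hypothetical cohort. The cohort size of 1,000, the prevalence of 4%, and the function and key names are assumptions chosen for illustration; the sensitivity and specificity are the rounded summary estimates for the HDS as a test for HAD. The model ignores uncertainty in the estimates, repeat testing, and costs.

```python
def screening_outcomes(cohort_size, prevalence, sensitivity, specificity):
    """Expected outcomes of screening a cohort once with a single test.

    Returns expected counts of true/false positives and negatives,
    together with the positive and negative predictive values.
    """
    affected = cohort_size * prevalence
    unaffected = cohort_size - affected

    true_positives = affected * sensitivity
    false_negatives = affected - true_positives
    true_negatives = unaffected * specificity
    false_positives = unaffected - true_negatives

    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "true_negatives": true_negatives,
        "ppv": true_positives / (true_positives + false_positives),
        "npv": true_negatives / (true_negatives + false_negatives),
    }


if __name__ == "__main__":
    # Hypothetical cohort: 1,000 people, 4% prevalence of HAD, screened
    # with the HDS (rounded summary sensitivity 68%, specificity 78%).
    result = screening_outcomes(1000, 0.04, 0.68, 0.78)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions most positive screens are false positives, which illustrates why the downstream confirmatory pathway matters as much as the accuracy of the screening test itself.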
In conclusion, in current clinical practice, interpretation of the results of assessment with the HDS or IHDS requires an appreciation of their limited accuracy, the lack of generalisability of existing research, and the heterogeneity of estimates. The HDS appears to be more accurate overall and its higher specificity probably makes it the preferred test for detecting asymptomatic HAND, although the IHDS may be preferred in situations where sensitivity is most important, at the expense of specificity. Having reviewed the evidence, we advise against their further use as diagnostic tests for HAND in symptomatic patients, even in resource-limited settings, and believe that studies reporting their use should acknowledge their limited validity.