We identified and reviewed forty different studies of twenty-two unique scoring systems or diagnostic criteria that were developed from five original scoring systems and five original diagnostic criteria. These diagnostic approaches varied in the types of clinical signs and symptoms included in the criteria, the inclusion or exclusion of laboratory testing, and even their diagnostic focus (i.e., pulmonary TB alone or pulmonary and extrapulmonary TB). Studies designed to validate the various diagnostic systems varied significantly in the gold standard chosen for comparison. Because the publication dates of the articles range over the last fifty years, some criteria were developed and evaluated prior to the HIV epidemic, while other studies focused specifically on coinfected children.
The gold standards chosen to evaluate the validity of these diagnostic strategies also varied widely. Cultures can be difficult to obtain in children. Because tuberculous disease in children is often pauci-bacillary, the diagnostic yield of cultures in children is often poor. Although one study used culture as the gold standard [25], others used positive response to treatment [13], CXR [35], or previous scoring criteria [30]. The most common gold standard was clinical diagnosis. Interestingly, in a study of the ALS assay for diagnosing active TB disease, the assay actually correlated better with clinical diagnosis than with either the Kenneth Jones or Keith Edwards scoring criteria [49]. Unfortunately, clinical diagnosis is likely to depend strongly upon the experience and knowledge base of the clinician and thus may be less reliable in settings where clinicians have less training. To allow for comparison of criteria across different studies and settings, future studies need to employ a more consistent gold standard. Ideally, this would be culture-based, as this is a standard for validation that could be reliably replicated across settings. However, because cultures are difficult to obtain in resource-limited settings and can lead to a delay in treatment, performing studies with culture as the gold standard can be difficult.
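To illustrate why the choice of gold standard matters for cross-study comparison, the following minimal sketch computes sensitivity and specificity for the same scoring-system results against two different reference standards. The eight-child dataset, the variable names, and the function `sensitivity_specificity` are hypothetical and are not drawn from any of the reviewed studies; they serve only to show the arithmetic.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    index_positive: Sequence[bool],
    reference_positive: Sequence[bool],
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of an index test against a reference standard."""
    if len(index_positive) != len(reference_positive):
        raise ValueError("index and reference results must have the same length")

    pairs = list(zip(index_positive, reference_positive))
    tp = sum(1 for i, r in pairs if i and r)          # score positive, reference positive
    fn = sum(1 for i, r in pairs if not i and r)      # score negative, reference positive
    tn = sum(1 for i, r in pairs if not i and not r)  # score negative, reference negative
    fp = sum(1 for i, r in pairs if i and not r)      # score positive, reference negative

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference standard must contain both positives and negatives")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical results for eight children (illustrative only, not study data).
score_positive = [True, True, True, False, True, False, False, False]
culture_positive = [True, True, False, False, False, False, False, False]
clinical_diagnosis = [True, True, True, True, True, False, False, False]

vs_culture = sensitivity_specificity(score_positive, culture_positive)
vs_clinical = sensitivity_specificity(score_positive, clinical_diagnosis)

print(f"vs culture:  sensitivity={vs_culture[0]:.2f}, specificity={vs_culture[1]:.2f}")
print(f"vs clinical: sensitivity={vs_clinical[0]:.2f}, specificity={vs_clinical[1]:.2f}")
```

In this made-up example the same score results yield a sensitivity of 1.00 and a specificity of 0.67 against culture, but a sensitivity of 0.80 and a specificity of 1.00 against clinical diagnosis, which is why estimates obtained with different gold standards cannot be compared directly.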
In addition to using a variety of gold standards, the various studies often included very different sample populations. Some studies did not clearly describe the characteristics of the patient population or how the patients were selected. Many were retrospective, often utilizing chart review. Ideally, prospective studies of diagnostic systems would evaluate a clearly defined sample of participants with a spectrum of disease that is representative of the patients to whom the criteria would be applied in clinical practice. It is essential that researchers clearly describe the sample selection process and inclusion criteria in such studies to allow for more accurate comparisons of criteria across different populations or settings and to promote the utility of these systems in clinical practice.
Another challenge in prospective studies of TB diagnosis is the bias introduced when, as found in some of these studies, the inclusion or screening criteria for participants include clinical features similar to those in the diagnostic systems being evaluated. For example, Pedrozo et al. used history of contact, CXR, and TST result as part of the criteria for inclusion in the study. Chest X-ray and TST were also used as part of their diagnostic gold standard to differentiate among latent TB, no TB, and active TB disease. All three inclusion criteria are also used in the Brazil MOH scoring system being evaluated in this study [25]. This makes it difficult to interpret the accuracy of a diagnostic system and its ability to predict a diagnosis of TB in a particular patient or patient population. This overlap also makes it difficult to determine the relative importance of particular signs or symptoms within the diagnostic system.
The largest shift in the newer diagnostic systems as compared to Kenneth Jones and Keith Edwards is the focus on pulmonary tuberculosis alone. Diagnostic systems focusing solely on pulmonary TB, such as the Brazil MOH and Marais criteria, have demonstrated higher sensitivities and specificities than those developed to diagnose both extrapulmonary and pulmonary TB. Because children have a higher incidence of extrapulmonary TB [50], using diagnostic systems targeted at pulmonary TB addresses only part of the diagnostic challenge. On the other hand, because TB presents with varied signs and symptoms depending on the site of disease, it is difficult to conceive of a single diagnostic system that could diagnose with high sensitivity and specificity the various types of tuberculosis infections (e.g., vertebral, abdominal, and pulmonary TB). Furthermore, many children with extrapulmonary TB also have pulmonary disease [51]. A new system of classification, focusing on the severity of the disease rather than location, has recently been published and may also be a more reliable and reproducible method. If this is well validated in different settings, it may allow various diagnostic systems to be better compared than is currently possible [52].
At this time, the Brazil MOH scoring system has the most studies evaluating its validity, with consistently high sensitivities and specificities. In each of the three studies of this system, the scoring system was tested against a slightly different gold standard, ranging from clinical criteria [23, 25] to culture-proven disease [24]. Although the lack of a consistent gold standard may make some comparisons difficult, the fact that the scoring system holds up fairly well when tested in different ways actually strengthens the evidence for its validity. Though it has not been tested outside of Brazil, it has been tested in both inpatient [24] and outpatient settings [23, 25]. The performance of the scoring system has also been evaluated in HIV-infected patients. These coinfected children still scored significantly above the cutoff for a diagnosis of TB [23]. All of these evaluations point favorably toward the validity of this scoring system. Evaluating the Brazil MOH scoring system in additional settings worldwide should be an important next step.
The findings of this systematic review are limited by the design and quality of the studies included. The lack of consistent, and sometimes of clearly defined, inclusion criteria among the studies makes it difficult to compare sensitivity and specificity across the different diagnostic systems. Most of the diagnostic systems have only been evaluated in specific geographic locations or single populations; few studies evaluate a particular diagnostic system in multiple geographic regions or patient populations. Even fewer studies have compared the diagnostic yield of multiple criteria in the same patient population. Finally, the increase in the prevalence of HIV during the publication range of these studies makes it difficult to compare studies from thirty years ago to those more recently published. Although this paper includes more than twenty new studies published since the review by Hesseling et al. in 2002 [21], the number of articles assessing the validity of each diagnostic system is still relatively small. The paper also did not include unpublished data or non-English publications.