shows the flow of studies through the review. Sixty-one publications met the inclusion criteria, 21 of which were earlier reports of included studies and were not extracted.w1-w43 Forty publications reporting the results of 29 studies (some reported results for different magnetic resonance imaging criteria, for imaging of the spine rather than the brain, or for patient subgroups) were included. Sample sizes were generally small (median 70), ranging from 15 to 1500 patients. The proportion of dropouts ranged from 0 to 58% (median 4%) and increased with length of follow-up.
Table 1 provides details of the 29 publications reporting the results of 18 cohort studies. Most of these studies used clinical follow-up as the reference standard. Most used the Poser criteria,16 although some used the McDonald 1977 criteria.17 The McDonald 1977 criteria, based on clinical information alone, are not the same as the McDonald 2001 criteria, which incorporate magnetic resonance imaging.4
provides details of the 11 studies of other designs. The studies differed in population, quality, magnetic resonance imaging protocol, and criteria used to define a positive test result. Cohort studies varied in their inclusion criteria; some included only patients presenting with a particular clinically isolated syndrome (for example, optic neuritis or a spinal cord syndrome), whereas others included all patients being evaluated for possible multiple sclerosis. Publication dates ranged from 1986 to 2003. Magnetic resonance imaging technology improved over this period, and this is reflected in differences in scanning protocols (see table A on bmj.com).
Table 1. Study details and results of cohort studies recruiting patients with suspected multiple sclerosis (MS)
summarises the results of the quality assessment (see table B on bmj.com for results of individual studies). Study quality was generally poor: only four QUADAS items were met by over 70% of studies (avoidance of partial verification bias, avoidance of differential verification bias, reporting of uninterpretable results, and reporting of withdrawals). Studies scored badly on three items: blinding, the use of an appropriate reference standard, and the availability of clinical information. Four publications, reporting results from three cohort studies, were susceptible to incorporation bias because magnetic resonance imaging contributed to the final diagnosis.18-21 Three of these used a combination of clinical follow-up and paraclinical tests as the reference standard;19-21 the other relied on paraclinical tests alone.18 All other cohort studies used clinical follow-up alone as the reference standard.
shows that cohort studies produced lower estimated sensitivity and specificity than studies of other designs. The pooled diagnostic odds ratio was 9 (95% confidence interval 5 to 16) for cohort studies and 213 (85 to 535) for studies of other designs (P < 0.001, permutation test). Further analysis was restricted to the 15 cohort studies that used a diagnosis of clinically definite multiple sclerosis, arrived at by clinical information alone, as the reference standard.
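For readers less familiar with the diagnostic odds ratio, the following minimal sketch shows how it is obtained from a single 2×2 table of test results against the reference standard. The function name and the counts are hypothetical and are not taken from any study in the review; the pooled values reported above come from a meta-analytic model across studies, not from a single-table calculation of this kind.

```python
def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Return the diagnostic odds ratio (TP * TN) / (FP * FN) for one 2x2 table."""
    if fp == 0 or fn == 0:
        # The ratio is undefined with an empty FP or FN cell; this sketch
        # simply refuses rather than applying a continuity correction.
        raise ValueError("diagnostic odds ratio is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)


# Hypothetical counts (assumption for illustration only):
# 30 true positives, 10 false positives, 10 false negatives, 30 true negatives.
example_dor = diagnostic_odds_ratio(tp=30, fp=10, fn=10, tn=30)
```

With these illustrative counts the odds of a positive test are nine times higher in patients who have the condition than in those who do not.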
The average duration of follow-up ranged from seven months to 14 years. The only criteria for which sufficient data were available to investigate the effects of duration of follow-up were presence of one or more lesions and presence of one or more non-clinical lesions. is a receiver operating characteristic plot for these criteria, with numbers showing the duration of follow-up in years. There was weak evidence (P = 0.074 from hierarchical summary receiver operating characteristic analysis) that studies with longer follow-up produced higher estimated specificity and lower estimated sensitivity.
The longest average duration of follow-up was three years in studies that assessed the Barkhof, Fazekas, and McDonald 2001 criteria, and six years in studies that assessed the Paty criteria. Conclusions about the ability of these criteria to predict the development of multiple sclerosis can therefore be drawn only over these relatively short periods. shows the receiver operating characteristic plots for these criteria. The study that developed the Barkhof criteria22 showed higher estimated sensitivity and specificity than did the other studies of these criteria. The negative likelihood ratios for the Barkhof, Fazekas, and Paty criteria ranged from 0.2 to 0.5, suggesting that a negative result on magnetic resonance imaging on the basis of these criteria is of limited utility for ruling out the development of multiple sclerosis within three to six years. Positive likelihood ratios were below 5, so these criteria are also of limited utility in predicting the development of multiple sclerosis within three to six years. Positive likelihood ratios for the McDonald 2001 criteria ranged from 2.7 to 8.7, suggesting that they have more potential for predicting the development of multiple sclerosis within three years than any of the criteria based on magnetic resonance imaging alone.23-26 Negative likelihood ratios were 0.1 in one study and 0.2 to 0.5 in three studies, suggesting that the McDonald 2001 criteria are of limited utility for ruling out the development of multiple sclerosis within three years.
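The likelihood ratios discussed above follow directly from sensitivity and specificity. The sketch below shows the standard definitions; the function name and the input values are assumptions chosen for illustration and do not correspond to any included study.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (positive LR, negative LR) from sensitivity and specificity.

    Positive LR = sensitivity / (1 - specificity)
    Negative LR = (1 - sensitivity) / specificity
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        # Values of exactly 0 or 1 would give a zero or infinite ratio.
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical test with sensitivity 0.80 and specificity 0.80 (illustrative only).
lr_pos, lr_neg = likelihood_ratios(0.80, 0.80)
```

A test with these illustrative values has a positive likelihood ratio of about 4 and a negative likelihood ratio of about 0.25, which falls within the ranges described above as being of limited utility.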
Only two studies, one from the United States2 and one from England,3 followed patients for more than 10 years, long enough to be reasonably confident that almost all patients who would ever be diagnosed as having multiple sclerosis had been diagnosed. Both studies fulfilled all but one QUADAS criterion (the availability of clinical information), and in the US study it was unclear whether review bias had been avoided (see bmj.com). The US study included 351 patients with optic neuritis; follow-up of more than 10 years was available for 302 (86%) of these. The study used survival analysis to estimate the cumulative proportions of patients diagnosed, with patients who did not receive a diagnosis of multiple sclerosis censored at the time of their last clinical follow-up. The English study included 135 patients with a range of presenting symptoms, of whom 71 (53%) were included in the final evaluation. Both studies evaluated thresholds based on the number of non-clinical T2 lesions present on magnetic resonance imaging of the brain.
shows the estimates of sensitivity and specificity, with confidence intervals, for each of the thresholds evaluated in these two studies. Sensitivity and specificity varied according to the number of lesions used to define a positive result on magnetic resonance imaging: sensitivity was higher with fewer lesions but specificity was lower. Estimates of specificity were similar for the two studies, but the English study tended to produce higher estimates of sensitivity. Comparison of areas under the curves suggested better accuracy in the English study than in the US study (P = 0.045). Estimates of the positive likelihood ratios for the presence of various numbers of lesions ranged from 2.0 to 3.4. Assuming a pretest probability of multiple sclerosis of 60%, these ratios correspond to a post-test probability of 75%-84%, suggesting that magnetic resonance imaging is of limited utility for ruling in multiple sclerosis at any threshold. Estimates of the negative likelihood ratio ranged from 0.1 to 0.9 but were greater than 0.5 for all but one of the thresholds in the English study. Negative likelihood ratios of 0.5 to 0.9 modify a pretest probability of 60% to a post-test probability of multiple sclerosis of 43%-57%, suggesting that magnetic resonance imaging is also of limited utility in ruling out a diagnosis of multiple sclerosis.
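The post-test probabilities quoted above can be reproduced by converting the pretest probability to odds, multiplying by the likelihood ratio, and converting back. The sketch below does this for the 60% pretest probability and the likelihood ratios given in the text; the function name is an assumption introduced for illustration.

```python
def post_test_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a pretest probability via the odds scale."""
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


PRETEST = 0.60  # pretest probability assumed in the text

# Likelihood ratios quoted in the text: 2.0 and 3.4 (positive), 0.5 and 0.9 (negative).
post_test_percentages = {
    lr: round(100 * post_test_probability(PRETEST, lr))
    for lr in (2.0, 3.4, 0.5, 0.9)
}
```

The rounded percentages for likelihood ratios of 2.0, 3.4, 0.5, and 0.9 are 75, 84, 43, and 57, matching the ranges reported above.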