For CEA, an estimate of expected (mean) survival time is needed, therefore reported or pooled HRs are clearly insufficient. A reliable analysis can only be conducted if access to IPD for each source of efficacy evidence is available. This allows investigation of a range of parametric models and selection of an appropriate model for the underlying distribution and for the treatment effect.

Unfortunately, IPD is only rarely available, particularly if the secondary analysis is to be a meta-analysis of several trials or where several different treatments have been examined. If a meta-analysis of time-to-event-outcomes is being considered for use in a CEA, this should be an IPD meta-analysis.

There is a strong case to be made that the same statistical model should be used for both efficacy and CEA analyses [

13]. But even if the secondary analysis is focussed purely on efficacy, it is noteworthy that IPD meta-analysis of time-to-event outcomes has been described by many authors as the gold standard [

5,

6,

11,

14]. One reason for this is that it allows a more informative analysis of time-dependent data [

15]. First, aggregated results from time-to-event trials may be reported as medians or as HRs, but it is hard to combine trials reporting medians with trials reporting HRs without making distributional assumptions. Secondly, if HRs are reported one is obliged to accept the proportional hazards assumption, even though this is seldom checked [

13] in the primary analysis. Third, although it is often believed that PH provides an approximate "average" HR in cases where PH does not hold, such estimates are clearly vulnerable to bias. If, for example, the HR is diminishing over time, the procedure will over-estimate the HR whenever the trials differ in follow-up time. Finally, even in the best possible case for meta-analysis based on HRs, where every study reports the HR

*and *every study tests the null hypothesis of PH and fails to reject it, it should be remembered that trials are not powered to detect departure from PH [

16], and a superior test of PH can always be achieved by an IPD meta-analysis of the entire set of trials.

The algorithm suggested here allows investigators to re-create KM data, allowing them to explore the proportional hazards assumption, and freeing them to fit a richer class of survival distributions to the reconstructed data. Access to the IPD, or to reconstructed data, allows a huge liberalization in modelling survival. Multiple parameter distributions can be implemented [

17-

19], or flexible spline approaches [

20]. Methods to adjust for "cross-over" of treatment can also be applied [

21]. In the context of synthesis, treatment effects can be put on shape and scale parameters [

8] or fractional polynomials can be used [

9].

The validation exercise established that reproducibility and accuracy of reconstructed statistics was excellent, especially for median survival and probability of survival. In addition these reconstructed statistics were relatively insensitive to deterioration in the level of information used. Reproducibility and accuracy of reconstructed Hazard Ratios was less good but certainly adequate with complete information, but became increasingly inaccurate with less information, and frankly unusable when neither numbers at risk nor number of events were available.

The reason why the HR is generally reconstructed with less accuracy and more vulnerable to the level of information is because it is in effect a weighted average of ratios along the entire risk period, while survival probabilities and medians are simple point estimates. The reconstruction algorithm must make assumptions about the degree of censoring within each segment and these assumptions must affect the relative weighting of different portions of the curve. As the level of information is reduced, the assumptions become increasingly unrealistic. As it has been noted previously, though in a slightly different context [

22], we might anticipate that reconstructed HRs will be less accurate in cases where the data departs more from proportional hazards. It should be emphasised that our purpose in reconstructing the KM data is not to obtain an estimate of the HR, when it is not reported, but instead to allow investigators access to a good approximation of the Kaplan-Meier statistics.

Previous work using published KM curves in secondary analysis have approached the issue in several different ways [

4-

11]. In most cases, it is clear that the information has been extracted from the curves [

6-

11], but only a few authors [

8-

10] report that they carry out the data extraction using digitizing software. Dear [

4] extracts the survival probabilities and their variance and estimates their covariance using normal approximations under the assumption of no censoring. Arends [

5] adopts a similar strategy but uses a complementary log log link to model the probabilities. Earle [

10] provides a comparison of several methods [

4,

23-

25] by extracting the survival probabilities and estimating the number of patients at risk and number of events in successive time intervals by using the actuarial equations, using information on censoring if reported and ignoring censoring otherwise. Williamson [

11] uses the survival probabilities and the numbers at risk provided below the curves to estimate the number of events on the intervals defined by the numbers at risk provided, using the actuarial method. Parmar's method [

7], frequently cited and used in the meta-analysis literature, uses information on the minimum and maximum of follow-up to inform the censoring pattern, in addition to the extracted survival probabilities, to estimate numbers of events and numbers at risk in successive time intervals, and then produce estimates of hazard ratios in cases where these are not available in the published papers. Fiocco [

6] assumes a Poisson distribution and uses log-linear modelling based on estimated data using the same approach as Parmar. Finally, Ouwens [

8] and Jansen [

9] digitise the KM graphs at particular time points and use the number at risk at the beginning of the interval, if this is reported, or conservatively assume no censoring if it is not. The life-table data is then reconstructed as a series of conditionally independent binomial distributions.

The primary objective of this previous work was neither the reconstruction of Kaplan Meier data nor the reconstruction of life-table data, but was a necessary step that had to be taken in order for the authors to illustrate methods for combining survival data from several studies. All these previous attempts reconstructed the data in the form of life-table data at a limited number of time points, whereas we have tried to reconstruct the original KM intervals. Published survival curves are almost always based on the KM method justifying our approach of using inverted KM equations instead of life-table equations and solving at the same time the problem of pre-specification of intervals. For the 'all information' and 'no total events' cases, the censoring pattern varied by numbers at risk published intervals as in Williamson [

11]. For the 'no number at risk' case, the censoring pattern is assumed constant over the interval and for the 'neither' case, no censoring is assumed. Except for the 'neither' case where there is no information on number at risk or total number of events, we believe that our method represents an improvement over the other approaches described above. This is because we used an iterative numerical approach to solve the inverted KM equations, not described elsewhere, which is required to ensure consistency between the successive estimates of numbers at risk and reported values, and/or estimate of total number of events and reported value. Without such a procedure, it is not possible to make the change in survival probabilities over an interval and number of events and censorings within the interval consistent with the number at risk at the start of the next interval. We therefore believe that the methods proposed here are likely to be the most accurate of the methods proposed so far, when at least numbers at risk or total events are reported.

Very few authors [

7,

10] reported any assessment of their reconstruction of data. Earle [

10] evaluated the reproducibility of using digitized software to extract the data by reporting an intraclass correlation coefficient. Parmar [

7] calculated the ME on 48 studies between reconstructed HR using published survival curves and reconstructed HR using more direct estimates from either the Cox model or the logrank test results. He found no systematic bias for the HR and a slight systematic underestimation for their variance. We have provided a more complete evaluation of our method by reporting not only good reproducibility and lack of systematic error, but also information on MAE which tells us about the accuracy of the reconstruction that can be expected in a given instance.

Limitations of the new method should be mentioned. To the extent that published KM curves tend to pool data over different covariates that might affect survival, the method is still not quite the same as having true IPD. The inability to derive separate KM curves for different subgroups or to model the joint effects of covariates and treatment can impact on the estimated treatment effect due to aggregation bias [

26]. Even if the arms are well balanced on a covariate, and even if the covariate is not an effect-modifier, aggregation over the covariate will tend to bias the treatment effect towards the null, and the extent of the bias increases with the strength of the covariate effect. However, this is an issue for all meta-analysis where a covariate adjustment could not been performed.

A second limitation is that the reliability of the reconstructed data depends on two related elements: the quality of the initial input and the level of information provided by the publication. The figure extracted from the .pdf should not be of low quality (for instance, blurry figure and/or poor numerical axis scale); otherwise the user may struggle to extract accurate data via the digitising software. Furthermore, the extracted data needed to run the algorithm should be consistent and sufficient. We encourage users to undertake the initial digitization and pre-processing with scrupulous care. In our experience, although little training is required, it takes about half an hour to obtain the initial input for one curve. If incorrect data are entered this may trap the algorithm. The algorithm has been developed to be used for the large amount of RCTs reporting KM curves. However, as we have seen in the results section, if the total number of events and the numbers at risk other than at time zero are not provided, the algorithm may produce poor results. We recommend that all the information available is used, and wherever possible the reconstructed KM data should be compared with all results (survival probabilities, medians, hazard ratio) reported in the original publication. We recommend that all RCTs with time-to-event outcomes publish KM curves together with information on numbers at risk and total number of events.