Surgeons routinely evaluate and modify their surgical technique in order to improve patient outcome. It is also common for surgeons to analyze results before and after a change in technique to determine whether the change did indeed lead to better results. A simple comparison of results before and after a surgical modification may be confounded by the surgical learning curve. Here, we aim to develop a statistical method applicable to the analysis of before / after studies in surgery.
We used simulation studies to compare different statistical analyses of before / after studies. We evaluated a simple two-group comparison of results before and after the modification using a chi-squared test, and a novel bootstrap method that adjusts for the surgical learning curve.
In the presence of the learning curve, a simple two-group comparison almost always found an ineffective surgical modification to be of benefit. If the surgical modification was harmful, leading to a 10% reduction in success rates, a two-group comparison nonetheless reported a statistically significant improvement in outcome about 80% of the time. The bootstrap method had only moderate power, but did not find ineffective surgical modifications of benefit more than would be expected by chance.
Simplistic approaches to the analysis of before / after studies in surgery can lead to grossly erroneous results under a surgical learning curve. A straightforward alternative statistical method allows investigators to separate the effects of the learning curve from those of the surgical modification.
Surgeons routinely evaluate and modify their surgical technique in order to improve patient outcome. It is also common for surgeons to analyze results before and after a change in technique to determine whether the change did indeed lead to better results. For example, Poulakis et al. report on a series of 182 men with high-risk prostate cancer treated by laparoscopic radical prostatectomy. Several modifications to surgical technique were implemented after the 71st patient was treated, and the authors report that the rate of positive surgical margins fell from 28% before the change to 10% afterwards (p<0.001).1 Similarly, Chuang and colleagues introduced two changes to radical prostatectomy – early release of the neurovascular bundle, and early release under magnification – and reported that potency rates rose from 40.5% for unmodified surgery to 54.8% and 66.1% after the two successive modifications were introduced2.
These “before / after” studies do not involve randomization of patients to the different types of operation and are therefore prone to the biases known to be associated with non-randomized trials. These include differences in case mix and in medical treatments adjunctive to surgery. A problem unique to before / after studies in surgery is the surgical learning curve. It is widely assumed that a surgeon’s results improve with increasing experience with a particular procedure4, and there is now compelling empirical evidence that this is indeed the case. Numerous studies have documented that technical aspects of a procedure, such as operating time5, blood loss6 or rates of conversion to open surgery7, improve with greater surgical experience. Data are also emerging that clinical outcomes improve as well. For example, we recently reported a learning curve for cancer recurrence after radical prostatectomy for prostate cancer, with dramatically decreased recurrence rates for experienced surgeons3. Figure 1 shows the learning curve specifically for patients with organ-confined disease8.
If a surgeon’s results are continually improving as cases accrue, results will differ either side of any randomly picked point in time. If that point in time corresponds to when a modification was made to surgical technique, it will appear as though the change in surgical technique resulted in a change in outcome, even if it had no effect.
Figure 2 shows a hypothetical learning curve for a surgeon, with the number of cases on the x axis and the success rate on the y axis (the outcome could be either a patient outcome or a process measure, such as an operating time under 180 minutes). The success rate was 35% for patients treated before the surgeon’s 150th case, and 50% for patients treated afterwards (absolute difference 15%, p=0.03). Now imagine that the surgeon made a change in technique after 150 cases, and that in truth the change did not affect outcome. We would still observe an improvement in success rates of 15% due to the learning curve. Without proper adjustment for the learning curve, this improvement would be erroneously attributed to the surgical modification.
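The two-group comparison described above amounts to a chi-squared test on a 2x2 table. As a minimal sketch, the counts below are illustrative assumptions chosen to match the hypothetical 35% versus 50% success rates, with 150 cases on either side of the change point; they are not data from the paper.

```python
from scipy.stats import chi2_contingency

# Illustrative 2x2 table: 150 cases either side of the change point
#                  successes, failures
table = [[53, 97],   # before: 53/150 ≈ 35% success
         [75, 75]]   # after:  75/150 = 50% success
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, p = {p:.3f}")
```

The test declares a significant difference, even though, under the learning-curve scenario, nothing about the surgical technique changed at case 150.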
The motivating example for this paper comes from urologic oncology. A surgeon (JE) had modified his approach to radical prostatectomy in an effort to ameliorate erectile dysfunction, a common side-effect of this procedure. He wanted to know whether his modification had led to improvement in outcome.
Figure 3 shows the surgeon’s results up to the time where he made the modification. The x axis gives the number of consecutive cases; the y axis gives the potency rates at 6 months after adjustment. The line was calculated using logistic regression with potency as the outcome, number of cases since the start of the cohort as the predictor, and age and nerve sparing (a strong predictor of postoperative erectile function) as covariates. Results are clearly improving over time, suggesting that the surgeon is indeed experiencing a learning curve. The dashed line in figure 3 shows how we predict potency rates would have changed had the surgeon not modified his technique.
The potency rate before the modification (n=275) was 38%. Now if the modification was totally ineffective, we predict on the basis of the learning curve that the potency rate in the 97 patients treated after the modification would be around 50%. A comparison of 38% in 275 patients to 50% in 97 patients is statistically significant (p<0.005). As a result, a before / after comparison would often conclude that an ineffective surgical modification improved outcome.
One approach to adjusting for the learning curve in before / after studies would be to fit a logistic regression model with an indicator of surgical modification as the predictor, and the number of cases since the start of the cohort as a covariate (along with other variables known to be associated with outcome). The immediate problem with this analysis is that the predictor and the covariate are correlated: for example, if the modification occurs near the middle of the series, the correlation between the modification indicator and the number of cases since the start of the cohort will be around 0.85. When two variables are highly correlated, it is very difficult to differentiate between their effects statistically.
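The collinearity is easy to verify directly. In this sketch (a hypothetical series of 400 cases with the modification at the mid-point; the series length is an arbitrary assumption), the correlation between the case number and the modification indicator works out to about 0.87, close to the figure quoted above:

```python
import numpy as np

# Hypothetical series of 400 consecutive cases, with the surgical
# modification introduced at the mid-point of the series
case = np.arange(1, 401)
modified = (case > 200).astype(float)  # indicator of post-modification cases
r = np.corrcoef(case, modified)[0, 1]
print(round(r, 2))  # → 0.87
```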
Given the surgical learning curve, our question is not “Do results improve after the change in surgical technique?” but “Are results after the change better than expected?”. Figure 4 shows the actual results after the change in surgical technique. To obtain this line, we applied a logistic regression to data obtained after the surgical modification was implemented, again using potency rates as the dependent variable and the number of cases since the modification, age and nerve sparing as predictors. The actual outcomes appear superior to those expected on the basis of the learning curve.
Before we can conclude that results really do improve, we need to calculate a p value; we would also like 95% C.I. around our estimate of how much the technique improves outcome. To obtain these statistics, we used a statistical technique called bootstrapping. To illustrate bootstrapping, imagine that we had sampled 1000 medical school applicants, taking data on age and gender. We then create a new, artificial group of 1000 by randomly sampling from this data set. Because an applicant can be sampled more than once (or not at all) this is known as “bootstrap resampling with replacement”. The gender ratio and average age of the new data set is likely to be close to that of the original sample. This new “bootstrap sample” gives us an example of what our results might have been had we repeated the study. If we take a large number of bootstrap samples (2,000 – 10,000 is typical) we get a range of possible study results. These might be analyzed to conclude that, for example, although in our study, 58% of our applicants were women, “if we had run the study a large number of times, 95% of the time, the proportion of women would have been between 55% and 61%”. As it turns out, we do not have to run a bootstrap to obtain these numbers because there is a simple formula for the confidence interval around a proportion. A bootstrap is helpful when the formula for a particular statistic is difficult to specify.
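The medical-school-applicant illustration above can be sketched in a few lines of code. The sample below is constructed to have the 58% proportion of women mentioned in the text; the seed and number of resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 580 women (coded 1) among 1000 applicants (58%)
sample = np.array([1] * 580 + [0] * 420)

# Draw 2,000 bootstrap samples with replacement, recording the
# proportion of women in each resampled data set
boot_props = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
lo, hi = np.percentile(boot_props, [2.5, 97.5])
print(f"95% bootstrap CI for the proportion of women: {lo:.2f} to {hi:.2f}")
```

The percentile interval lands close to the 55%–61% range quoted above, agreeing with the simple formula-based confidence interval for a proportion.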
For our before / after study in surgery, we use bootstrapping to obtain the p value and 95% C.I. as follows. First, we define two groups: patients treated before the modification (group 1) and those treated afterwards (group 2). Cases are then numbered consecutively in date order from the very first case in the cohort. We then proceed through the following steps.
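The broad approach just outlined, fitting a learning-curve model to the pre-modification cases, predicting expected outcomes for the post-modification cases, and bootstrapping to quantify uncertainty, might be sketched as follows. Everything here is an illustrative assumption: the simulated data, the linear-in-logit learning curve, the scaling of the case number, and the resampling scheme are not the authors' exact algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

def fit_logistic(X, y, iters=25):
    """Logistic regression by Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        W = p * (1 - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n1, n2 = 275, 97                        # cases before / after the modification
x = np.arange(1, n1 + n2 + 1) / 100.0   # scaled case number
y = (rng.random(x.size) < sigmoid(-1.2 + 0.4 * x)).astype(float)  # simulated outcomes

# Step 1: model the learning curve using pre-modification cases only
X1 = np.column_stack([np.ones(n1), x[:n1]])
beta = fit_logistic(X1, y[:n1])

# Step 2: expected success rate for post-modification cases, had nothing changed
X2 = np.column_stack([np.ones(n2), x[n1:]])
expected = sigmoid(X2 @ beta).mean()
observed = y[n1:].mean()

# Step 3: bootstrap the post-modification cases for a 95% CI on observed - expected
diffs = np.array([rng.choice(y[n1:], n2, replace=True).mean() - expected
                  for _ in range(2000)])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"observed - expected = {observed - expected:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

A real analysis would also include patient covariates (such as age and nerve sparing) in both the learning-curve model and the predictions, as described in the text.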
To evaluate our statistical method further, we applied it to the hypothetical data set shown in figure 2. We assumed that the surgeon had changed surgical technique at the mid-point of the series and that doing so had either: a) no effect; b) immediately reduced success rates by 10%; or c) immediately improved success rates by 10% or 20%. We simulated data sets, applied one or other of these effects, and then applied our bootstrap method.
The results for 2,000 replications of the simulated data set are given in table 1 alongside the results for a simple before and after comparison by a chi squared test. The results for the first row – where the modification is ineffective – give the false positive rate or “size” of the test. For a one-sided test (i.e. statistically significant improvement), this is expected to be 2.5%. The results for the second and third rows give the power of the test.
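The mechanism behind the inflated size of the naive test can be illustrated with a small simulation. This is not the authors' simulation (their learning curve yields a false-positive rate near 100%); the flatter illustrative curve assumed below produces a smaller, but still grossly inflated, rate for an entirely ineffective modification.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
reps, n = 500, 300
false_positives = 0
for _ in range(reps):
    case = np.arange(n)
    # Smooth assumed learning curve: success probability rises from 0.30
    # to 0.55 over the series; the "modification" at the mid-point has
    # no effect whatsoever on outcome
    p_true = 0.30 + 0.25 * case / (n - 1)
    y = rng.random(n) < p_true
    before, after = y[: n // 2], y[n // 2:]
    table = [[before.sum(), before.size - before.sum()],
             [after.sum(), after.size - after.sum()]]
    # Naive before/after comparison: chi-squared test on the two groups
    if chi2_contingency(table)[1] < 0.05 and after.mean() > before.mean():
        false_positives += 1
print(f"ineffective modification declared beneficial in "
      f"{false_positives / reps:.0%} of runs")
```

Even with this gentle learning curve, the naive comparison declares the inert modification "significantly beneficial" far more often than the nominal 2.5% one-sided rate.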
As can be seen, the size of the bootstrap test is reasonable: when the surgical modification has no effect, one would conclude that the modification significantly improved potency only 2.4% of the time, close to the ideal. In contrast, the simple before / after comparison has a size of 99.9%; in other words, nearly 100% of analyses conducted using the simple before / after comparison would conclude that an ineffective modification was effective. Table 1 also suggests that a simple before / after comparison would give an 80% probability of concluding that a harmful modification was in fact beneficial. The power of the bootstrap test is moderate (12% and 40% for modifications improving success rates by 10% and 20%, respectively), but it will not fool us into concluding that an ineffective or harmful modification improves outcomes.
In our view, the problem with power is not a matter of the test, but of the data: it is inherently difficult to use a single surgeon series to demonstrate that a surgical modification leads to an improvement in outcome. This is analogous to clinical trials, where studies attempting to identify small but worthwhile effects – as would be the case with a surgical modification – generally involve large numbers of patients at multiple institutions. We repeated the simulation studies including three surgeons. The learning curve for each surgeon was modeled separately; predictions made for each patient treated after the modification were calculated using the unique learning curve of the patient’s surgeon. As can be seen from the table, power is improved.
We have described a method for the analysis of before / after studies in surgery, where a surgeon wants to know whether a change in technique has led to a change in outcome. That said, whatever the level of statistical sophistication that we apply to this type of study, it remains a non-randomized design, and thus subject to several potential biases. First, although the method can control for slowly developing trends over time, it cannot adjust for rapid changes in medical practice occurring after the surgical modification. In our potency study for example, had the surgeon modified technique in 1997, the introduction of Viagra in 1998 would have biased results in favor of the modification. Second, only randomization can adequately address the possibility of baseline differences between groups. In our potency example we adjusted for age, but no such statistical adjustment is perfect, and age is only a moderate predictor of potency.
Third, since the shape of the learning curve can vary, the test is sensitive to how well the true learning curve is modeled. Consider the scenario depicted in Figure 5. The solid black line shows the true learning curve before the surgical modification. This learning curve is non-linear: it is steeper early in the series and starts to flatten out past 200 cases. With the introduction of an effective surgical modification at 200 prior cases, we would expect an improvement in outcome similar to that depicted by the dashed black line. However, if the bootstrap test were implemented by modeling the learning curve as linear rather than non-linear, the expected outcome past 200 cases would be higher than that observed. In this case, because the true learning curve was not correctly modeled, one might falsely conclude that the modification was ineffective – or perhaps even harmful.
Fourth, the size (false positive rate) of the bootstrap test is sensitive to the steepness of the learning curve. We conducted additional simulations to examine this phenomenon: the size of the bootstrap test was lower than our main analysis with a steeper learning curve (0.8%), and was higher with a flatter learning curve (4.1%). Therefore, the size of the bootstrap test may be anticonservative with a flat learning curve. We consider this to be a small cost compared to the gross errors associated with the simple before/after comparison, which had a size >99%.
Randomized trials comparing different approaches to surgery are all too rare. As such, we would like to re-emphasize the importance of randomized trials, and encourage the surgical community to conduct more such studies. Nonetheless, randomized studies can be difficult and expensive, and generally require preliminary data. Accordingly, surgeons will continue to conduct studies comparing results before and after a change in surgical technique.
We have shown that if surgeons’ results improve with experience, a simple comparison of results before and after a change in surgical technique would likely tell us that an ineffective modification, or even a harmful one, was of benefit. A straightforward alternative is to use regression modeling to compare observed with expected results. This method can help control for the learning curve and thus provide more accurate estimates of the effects of changing surgical technique.
Supported in part by funds from David H. Koch provided through the Prostate Cancer Foundation, the Sidney Kimmel Center for Prostate and Urologic Cancers and P50-CA92629 SPORE grant from the National Cancer Institute to Dr. P. T. Scardino.