Surrogate endpoints offer the hope of smaller or shorter cancer trials. It is, however, important to realize that they come at the cost of an unverifiable extrapolation that could lead to misleading conclusions. With cancer prevention, the focus is on hypothesis testing in small surrogate endpoint trials before deciding whether to proceed to a large prevention trial. However, it is not generally appreciated that a small surrogate endpoint trial is highly sensitive to a deviation from the key Prentice criterion needed for the hypothesis-testing extrapolation. With cancer treatment, the focus is on estimation using historical trials with both surrogate and true endpoints to predict treatment effect based on the surrogate endpoint in a new trial. Successively leaving out one historical trial and computing the predicted treatment effect in the left-out trial yields a standard error multiplier that summarizes the increased uncertainty in estimation extrapolation. If this increased uncertainty is acceptable, three additional extrapolation issues (biological mechanism, treatment following observation of the surrogate endpoint, and side effects following observation of the surrogate endpoint) need to be considered. In summary, when using surrogate endpoint analyses, an appreciation of the problems of extrapolation is crucial.
The definitive evaluation of treatment to prevent a chronic disease with low incidence in middle age, such as cancer or cardiovascular disease, requires a trial with a large sample size of perhaps 20,000 or more. To help decide whether to implement a large true endpoint trial, investigators first typically estimate the effect of treatment on a surrogate endpoint in a trial with a greatly reduced sample size of perhaps 200 subjects. If investigators reject the null hypothesis of no treatment effect in the surrogate endpoint trial, they implicitly assume they would likely correctly reject the null hypothesis of no treatment effect for the true endpoint. Surrogate endpoint trials are generally designed with adequate power to detect an effect of treatment on the surrogate endpoint. However, we show that a small surrogate endpoint trial is more likely than a large surrogate endpoint trial to give a misleading conclusion about the beneficial effect of treatment on the true endpoint, which can lead to a faulty (and costly) decision about implementing a large true endpoint prevention trial. If a small surrogate endpoint trial rejects the null hypothesis of no treatment effect, an intermediate-sized surrogate endpoint trial could be a useful next step in the decision-making process for launching a large true endpoint prevention trial.
Cancer prevention; Cardiovascular disease; Prentice criterion; Principal stratification; Sample size calculation; Surrogate endpoint
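The sample-size contrast described above can be made concrete with a standard two-proportion calculation. The sketch below uses the textbook normal-approximation formula with hypothetical event rates (a common, large surrogate effect versus a small, rare true-endpoint effect); it is not the design calculation of any particular trial.

```python
from math import ceil, sqrt
from statistics import NormalDist

def two_proportion_n_per_arm(p0, p1, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for a two-sided two-proportion
    z-test (standard normal-approximation formula; illustrative only)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pbar = (p0 + p1) / 2
    n = ((z_a * sqrt(2 * pbar * (1 - pbar))
          + z_b * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
         / (p1 - p0) ** 2)
    return ceil(n)

# Hypothetical rates: a large effect on a common surrogate endpoint
# versus a small effect on a rare true endpoint.
n_surrogate = two_proportion_n_per_arm(0.40, 0.20)   # ~100 per arm
n_true = two_proportion_n_per_arm(0.020, 0.015)      # >10,000 per arm
```

Under these illustrative rates, the surrogate endpoint trial needs on the order of a hundred subjects per arm while the true endpoint trial needs tens of thousands in total, matching the orders of magnitude in the abstract.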
Causal inference from observational studies is a fundamental topic in biostatistics. The causal graph literature typically views probability theory as insufficient to express causal concepts in observational studies. In contrast, the view here is that probability theory is a desirable and sufficient basis for many topics in causal inference for the following two reasons. First, probability theory is generally more flexible than causal graphs: besides explaining such causal graph topics as M-bias (adjusting for a collider) and bias amplification and attenuation (when adjusting for an instrumental variable), probability theory is also the foundation of the paired availability design for historical controls, which does not fit into a causal graph framework. Second, probability theory is the basis for insightful graphical displays including the BK-Plot for understanding Simpson’s paradox with a binary confounder, the BK2-Plot for understanding bias amplification and attenuation in the presence of an unobserved binary confounder, and the PAD-Plot for understanding the principal stratification component of the paired availability design.
BK-Plot; causal graph; confounder; instrumental variable; observational study; Simpson’s paradox
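The Simpson's-paradox setting that the BK-Plot visualizes can be illustrated numerically with the widely reproduced kidney-stone data (Charig et al., 1986), where a binary confounder (stone size) reverses the aggregate comparison of two treatments:

```python
# Classic example of Simpson's paradox with a binary confounder.
# Within each stratum treatment A has the higher success proportion,
# yet aggregation over strata reverses the comparison, because A was
# given mostly to the harder (large-stone) cases.
data = {
    # (treatment, stratum): (successes, total)
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(trt, stratum=None):
    """Success proportion for a treatment, within a stratum or overall."""
    if stratum is not None:
        s, n = data[(trt, stratum)]
        return s / n
    s = sum(v[0] for (t, _), v in data.items() if t == trt)
    n = sum(v[1] for (t, _), v in data.items() if t == trt)
    return s / n

# A beats B in every stratum...
assert rate("A", "small") > rate("B", "small")
assert rate("A", "large") > rate("B", "large")
# ...but B beats A overall.
assert rate("A") < rate("B")
```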
Many cancer screening trials involve a screening programme of one or more screenings with follow-up after the last screening. Usually a maximum follow-up time is selected in advance. However, during the follow-up period there is an opportunity to report the results of the trial sooner than planned. Early reporting of results from a randomized screening trial is important because obtaining a valid result sooner translates into health benefits reaching the general population sooner. The health benefits are reduction in cancer deaths if screening is found to be beneficial and more screening is recommended, or avoidance of unnecessary biopsies, work-ups and morbidity if screening is not found to be beneficial and the rate of screening drops.
Our proposed method for deciding if results from a cancer screening trial should be reported earlier in the follow-up period is based on considerations involving postscreening noise. Postscreening noise (sometimes called dilution) refers to cancer deaths in the follow-up period that could not have been prevented by screening: (1) cancer deaths in the screened group that occurred after the last screening in subjects whose cancers were not detected during the screening programme and (2) cancer deaths in the control group that occurred after the time of the last screening and whose cancers would not have been detected during the screening programme had they been randomized to screening (the number of which is unobserved). Because postscreening noise increases with follow-up after the last screening, we propose early reporting at the time during the follow-up period when postscreening noise first starts to overwhelm the estimated effect of screening as measured by a z-statistic. This leads to a confidence interval, adjusted for postscreening noise, that would not change substantially with additional follow-up. Details of the early reporting rule were refined by simulation, which also accounts for multiple looks.
For the re-analysis of the Health Insurance Plan trial for breast cancer screening and the Mayo Lung Project for lung cancer screening, estimates and confidence intervals for the effect of screening on cancer mortality were similar on early reporting and later.
The proposed early reporting rule for a cancer screening trial with post-screening follow-up is a promising method for making results from the trial available sooner, which translates into health benefits (reduction in cancer deaths or avoidance of unnecessary morbidity) reaching the population sooner.
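A minimal sketch of how postscreening noise dilutes the screening effect, assuming equal-sized randomization arms and a Poisson approximation for cancer-death counts; this illustrates only the dilution phenomenon, not the paper's refined reporting rule or its simulation-based adjustments.

```python
from math import sqrt

def z_screening(deaths_control, deaths_screened):
    """Poisson-approximation z-statistic for the difference in cancer
    deaths between equal-sized randomized arms (illustrative sketch)."""
    diff = deaths_control - deaths_screened
    return diff / sqrt(deaths_control + deaths_screened)

# Hypothetical counts: 60 vs 40 deaths affected by the screening programme.
z_no_noise = z_screening(60, 40)
# Postscreening noise: 80 additional deaths per arm that screening could
# not have prevented leave the absolute difference unchanged but inflate
# the variance, shrinking the z-statistic.
z_with_noise = z_screening(60 + 80, 40 + 80)
assert z_with_noise < z_no_noise
```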
For the evaluation and comparison of markers and risk prediction models, various novel measures have recently been introduced as alternatives to the commonly used difference in the area under the ROC curve (ΔAUC). The Net Reclassification Improvement (NRI) is increasingly popular to compare predictions with one or more risk thresholds, but decision-analytic approaches have also been proposed.
We aimed to identify the mathematical relationships between novel performance measures for the situation in which a single risk threshold T is used to classify patients as having the outcome or not.
We considered the NRI and three utility-based measures that take misclassification costs into account: difference in Net Benefit (ΔNB), difference in Relative Utility (ΔRU), and weighted NRI (wNRI). We illustrate the behavior of these measures in 1938 women suspected of having ovarian cancer (prevalence 28%).
The three utility-based measures appear to be transformations of each other, and hence always lead to consistent conclusions. On the other hand, conclusions may differ when using the standard NRI, depending on the adopted risk threshold T, prevalence P and the obtained differences in sensitivity and specificity of the two models that are compared. In the case study, adding the CA-125 tumor marker to a baseline set of covariates yielded a negative NRI yet a positive value for the utility-based measures.
The decision-analytic measures are each appropriate to indicate the clinical usefulness of an added marker or compare prediction models, since these measures each reflect misclassification costs. This is of practical importance as these measures may thus adjust conclusions based on purely statistical measures. A range of risk thresholds should be considered in applying these measures.
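The possible disagreement between the single-threshold NRI and ΔNB can be sketched with hypothetical operating characteristics. The formulas below are the standard single-threshold definitions; the sensitivities and specificities are invented for illustration and are not the case-study values, though the prevalence matches the abstract.

```python
def net_benefit(sens, spec, prevalence, threshold):
    """Net Benefit at risk threshold T: true-positive fraction minus
    false-positive fraction weighted by the odds of the threshold."""
    w = threshold / (1 - threshold)
    return sens * prevalence - w * (1 - spec) * (1 - prevalence)

def nri_single_threshold(sens0, spec0, sens1, spec1):
    """Standard NRI with a single threshold: change in sensitivity
    plus change in specificity."""
    return (sens1 - sens0) + (spec1 - spec0)

# Hypothetical numbers (prevalence 28% as in the case study, T = 0.10):
# the added marker raises sensitivity but lowers specificity.
P, T = 0.28, 0.10
sens0, spec0 = 0.90, 0.30   # baseline model
sens1, spec1 = 0.95, 0.20   # baseline model + marker
nri = nri_single_threshold(sens0, spec0, sens1, spec1)
delta_nb = net_benefit(sens1, spec1, P, T) - net_benefit(sens0, spec0, P, T)
# nri is negative while delta_nb is positive: the measures disagree
# because NB down-weights false positives at a low risk threshold.
```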
Systems biology uses systems of mathematical rules and formulas to study complex biological phenomena. In cancer research there are three distinct threads in systems biology research: modeling biology or biophysics with the goal of establishing plausibility or obtaining insights, modeling based on statistics, bioinformatics, and reverse engineering with the goal of better characterizing the system, and modeling with the goal of clinical predictions. Using illustrative examples we discuss these threads in the context of cancer research.
bioinformatics; microarrays; reverse engineering; receiver operating characteristic curves
We define personalized medicine as the administration of treatment to only persons thought most likely to benefit, typically those at high risk for mortality or another detrimental outcome. To evaluate personalized medicine, we propose a new design for a randomized trial that makes efficient use of high-throughput data (such as gene expression microarrays) and clinical data (such as tumor stage) collected at baseline from all participants. Under this design for a randomized trial involving experimental and control arms with a survival outcome, investigators first estimate the risk of mortality in the control arm based on the high-throughput and clinical data. Then investigators use data from both randomization arms to estimate both the effect of treatment among all participants and among participants in the highest prespecified category of risk. This design requires only an 18.1% increase in sample size compared with a standard randomized trial. A trial based on this design that has 90% power to detect a realistic increase in survival from 70% to 80% among all participants would also have 90% power to detect an increase in survival from 50% to 73% in the highest quintile of risk.
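The overall-versus-quintile power trade-off can be sketched with a binomial approximation; the survival proportions are taken from the abstract, but the per-arm sample size below is our assumption and the trial's actual calculation is survival-based, so this is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p0, p1, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test
    (normal approximation; an illustrative sketch)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    se = sqrt(p0 * (1 - p0) / n_per_arm + p1 * (1 - p1) / n_per_arm)
    return nd.cdf(abs(p1 - p0) / se - z_a)

n_per_arm = 389   # hypothetical size giving roughly 90% power overall
power_overall = power_two_proportions(0.70, 0.80, n_per_arm)
# The highest risk quintile has one fifth of the participants but a
# much larger hypothesized treatment effect (50% vs 73% survival).
power_quintile = power_two_proportions(0.50, 0.73, n_per_arm // 5)
```

The larger effect size in the high-risk quintile compensates for the five-fold smaller sample, which is the intuition behind the design.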
Risk prediction models based on medical history or results of tests are increasingly common in the cancer literature. An important use of these models is to make treatment decisions on the basis of estimated risk. The relative utility curve is a simple method for evaluating risk prediction in a medical decision-making framework. Relative utility curves have three attractive features for the evaluation of risk prediction models. First, they put risk prediction into perspective because relative utility is the fraction of the expected utility of perfect prediction obtained by the risk prediction model at the optimal cut point. Second, they do not require precise specification of harms and benefits because relative utility is plotted against a summary measure of harms and benefits (ie, the risk threshold). Third, they are easy to compute from standard tables of data found in many articles on risk prediction. An important use of relative utility curves is to evaluate the addition of a risk factor to the risk prediction model. To illustrate an application of relative utility curves, an analysis was performed on previously published data involving the addition of breast density to a risk prediction model for invasive breast cancer.
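A minimal computation of relative utility at a single risk threshold, using one common formulation: the net benefit of the model as a fraction of the net benefit of perfect prediction, both measured against the better of the treat-all and treat-none defaults. The authors' definition may differ in detail, and the numbers below are hypothetical.

```python
def net_benefit(sens, spec, prevalence, threshold):
    """Net benefit per patient at risk threshold T."""
    w = threshold / (1 - threshold)
    return sens * prevalence - w * (1 - spec) * (1 - prevalence)

def relative_utility(sens, spec, prevalence, threshold):
    """Fraction of the net benefit of perfect prediction achieved by
    the model, relative to the better default strategy (one common
    formulation; illustrative only)."""
    nb_model = net_benefit(sens, spec, prevalence, threshold)
    nb_all = net_benefit(1.0, 0.0, prevalence, threshold)   # treat everyone
    nb_none = 0.0                                           # treat no one
    nb_default = max(nb_all, nb_none)
    nb_perfect = net_benefit(1.0, 1.0, prevalence, threshold)  # = prevalence
    return (nb_model - nb_default) / (nb_perfect - nb_default)

# Hypothetical model: sensitivity 0.8, specificity 0.7, prevalence 10%,
# risk threshold 15%; since T exceeds the prevalence, treat-none is the
# relevant default strategy.
ru = relative_utility(0.8, 0.7, 0.10, 0.15)
```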
The biomarker pipeline to develop and evaluate cancer screening tests has three stages: identification of promising biomarkers for the early detection of cancer, initial evaluation of biomarkers for cancer screening, and definitive evaluation of biomarkers for cancer screening. Statistical and biological issues to improve this pipeline are discussed. Although various recommendations, such as identifying cases based on clinical symptoms, keeping biomarker tests simple, and adjusting for postscreening noise, have been made previously, they are not widely known. New recommendations include more frequent specimen collection to help identify promising biomarkers and the use of the paired availability design with interval cases (symptomatic cancers detected in the interval after screening) for initial evaluation of biomarkers for cancer screening.
Because many medical decisions are based on risk prediction models constructed from medical history and results of tests, the evaluation of these prediction models is important. This paper makes five contributions to this evaluation: (1) the relative utility curve, which gauges the potential for better prediction in terms of utilities, without the need for a reference level for one utility, while providing a sensitivity analysis for misspecification of utilities, (2) the relevant region, which is the set of values of prediction performance consistent with the recommended treatment status in the absence of prediction, (3) the test threshold, which is the minimum number of tests that would be traded for a true positive in order for the expected utility to be non-negative, (4) the evaluation of two-stage predictions that reduce test costs, and (5) connections among various measures of prediction performance. An application involving the risk of cardiovascular disease is discussed.
decision analysis; decision curve; receiver operating characteristic curve; utility
External validation of existing lung cancer risk prediction models is limited. Using such models in clinical practice to guide the referral of patients for computed tomography (CT) screening for lung cancer depends on external validation and evidence of predicted clinical benefit.
To evaluate the discrimination of the Liverpool Lung Project (LLP) risk model and demonstrate its predicted benefit for stratifying patients for CT screening by using data from 3 independent studies from Europe and North America.
Case–control and prospective cohort study.
Europe and North America.
Participants in the European Early Lung Cancer (EUELC) and Harvard case–control studies and the LLP population-based prospective cohort (LLPC) study.
5-year absolute risks for lung cancer predicted by the LLP model.
The LLP risk model had good discrimination in both the Harvard (area under the receiver-operating characteristic curve [AUC], 0.76 [95% CI, 0.75 to 0.78]) and the LLPC (AUC, 0.82 [CI, 0.80 to 0.85]) studies and modest discrimination in the EUELC (AUC, 0.67 [CI, 0.64 to 0.69]) study. The decision utility analysis, which incorporates the harms and benefit of using a risk model to make clinical decisions, indicates that the LLP risk model performed better than smoking duration or family history alone in stratifying high-risk patients for lung cancer CT screening.
The model cannot assess whether including other risk factors, such as lung function or genetic markers, would improve accuracy. Lack of information on asbestos exposure in the LLPC limited the ability to validate the complete LLP risk model.
Validation of the LLP risk model in 3 independent external data sets demonstrated good discrimination and evidence of predicted benefits for stratifying patients for lung cancer CT screening. Further studies are needed to prospectively evaluate model performance and evaluate the optimal population risk thresholds for initiating lung cancer screening.
Primary Funding Source
Roy Castle Lung Cancer Foundation.
We provide a general framework for describing various roles for biomarkers in cancer prevention research (early detection, surrogate endpoint, and cohort identification for primary prevention) and the phases in their evaluation.
Using multiple historical trials with surrogate and true endpoints, we consider various models to predict the effect of treatment on a true endpoint in a target trial in which only a surrogate endpoint is observed. This predicted result is computed using (1) a prediction model (mixture, linear, or principal stratification) estimated from historical trials and the surrogate endpoint of the target trial and (2) a random extrapolation error estimated from successively leaving out each trial among the historical trials. The method applies to either binary outcomes or survival to a particular time that is computed from censored survival data. We compute a 95% confidence interval for the predicted result and validate its coverage using simulation. To summarize the additional uncertainty from using a predicted instead of a true result for the estimated treatment effect, we compute its standard error multiplier. Software is available for download.
Randomized trials; Reproducibility; Principal stratification
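The leave-one-trial-out idea can be sketched with a simple least-squares prediction model: fit a line predicting the true-endpoint effect from the surrogate-endpoint effect on all historical trials but one, predict the left-out trial, and collect the prediction errors, whose spread feeds the standard error multiplier. The effect estimates below are hypothetical, and the paper's mixture and principal stratification models are not implemented here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical treatment-effect estimates from five historical trials.
surrogate = [0.10, 0.25, 0.40, 0.15, 0.30]
true_eff = [0.05, 0.12, 0.22, 0.09, 0.16]

errors = []
for i in range(len(surrogate)):
    # Leave out trial i, fit on the rest, predict the left-out trial.
    xs = [x for j, x in enumerate(surrogate) if j != i]
    ys = [y for j, y in enumerate(true_eff) if j != i]
    a, b = fit_line(xs, ys)
    errors.append(true_eff[i] - (a + b * surrogate[i]))

# Empirical spread of the leave-one-out extrapolation errors.
extrapolation_sd = (sum(e * e for e in errors) / len(errors)) ** 0.5
```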
The paired availability design for historical controls postulated four classes corresponding to the treatment (old or new) a participant would receive if arrival occurred during either of two time periods associated with different availabilities of treatment. These classes were later extended to other settings and called principal strata. Judea Pearl asks if principal stratification is a goal or a tool and lists four interpretations of principal stratification. In the case of the paired availability design, principal stratification is a tool that falls squarely into Pearl's interpretation of principal stratification as “an approximation to research questions concerning population averages.” We describe the paired availability design and the important role played by principal stratification in estimating the effect of receipt of treatment in a population using data on changes in availability of treatment. We discuss the assumptions and their plausibility. We also introduce the extrapolated estimate to make the generalizability assumption more plausible. By showing why the assumptions are plausible we show why the paired availability design, which includes principal stratification as a key component, is useful for estimating the effect of receipt of treatment in a population. Thus, for our application, we answer Pearl's challenge to clearly demonstrate the value of principal stratification.
principal stratification; causal inference; paired availability design
Recently Cheng (Biometrics, 2009) proposed a model for the causal effect of receiving treatment when there is all-or-none compliance in one randomization group, with maximum likelihood estimation based on convex programming. We discuss an alternative approach that involves a model for all-or-none compliance in two randomization groups and estimation via a perfect fit or an EM algorithm for count data. We believe this approach is easier to implement, which would facilitate the reproduction of calculations.
All-or-none compliance; Causal effect; Multinomial outcomes; Noncompliance; Perfect fit; Principal stratification; Randomized trials
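For intuition, the simplest special case of estimating the effect of receiving treatment under all-or-none compliance is the textbook instrumental-variable ratio; the paper's multinomial models, perfect-fit estimation, and EM algorithm are more general, so the sketch below is only a related simple case with hypothetical numbers.

```python
def cace_estimate(y_mean_z1, y_mean_z0, treated_frac_z1, treated_frac_z0):
    """Textbook instrumental-variable estimate of the effect of
    receiving treatment under all-or-none compliance: the
    intention-to-treat effect divided by the difference in receipt
    proportions between randomization groups."""
    return (y_mean_z1 - y_mean_z0) / (treated_frac_z1 - treated_frac_z0)

# Hypothetical trial: outcome rates 0.30 vs 0.24 by randomization group,
# treatment receipt 80% vs 10%; the estimated effect among compliers is
# the ITT effect scaled up by the 70-point difference in receipt.
effect = cace_estimate(0.30, 0.24, 0.8, 0.1)
```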
With the analysis of complex, messy data sets, the statistics community has recently focused attention on “reproducible research,” namely research that can be readily replicated by others. One standard that has been proposed is the availability of data sets and computer code. However, in some situations, raw data cannot be disseminated for reasons of confidentiality or because the data are so messy as to make dissemination impractical. For one such situation, we propose 2 steps for reproducible research: (i) presentation of a table of data and (ii) presentation of a formula to estimate key quantities from the table of data. We illustrate this strategy in the analysis of data from the Prostate Cancer Prevention Trial, which investigated the effect of the drug finasteride versus placebo on the period prevalence of prostate cancer. With such an important result at stake, a transparent analysis was important.
Categorical data; Maximum likelihood; Missing data; Multinomial–Poisson transformation; Propensity-to-be-missing score; Randomized trials
A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations.
A simple modification of diagonal discriminant analysis yields smooth highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when using a pooled variance to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis.
The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.
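The distinction between pooled-variance (linear, Ripples-like) and class-specific-variance (nonlinear, Swirls-like) diagonal discriminant analysis can be sketched with the standard diagonal Gaussian score; the exact forms of Swirls and Ripples are not reproduced here, and the data are invented.

```python
from math import log

def dda_score(x, means, variances):
    """Diagonal Gaussian discriminant score for one class
    (negative log-likelihood up to a constant; smaller is better)."""
    return sum((xi - m) ** 2 / v + log(v)
               for xi, m, v in zip(x, means, variances))

def classify(x, class_means, class_vars):
    scores = {c: dda_score(x, class_means[c], class_vars[c])
              for c in class_means}
    return min(scores, key=scores.get)

# Two classes with equal means but very different variances: with
# class-specific variances the boundary curves around the low-variance
# class (qualitatively like Swirls); forcing a pooled variance would
# give the linear boundaries of Ripples and could not separate them.
means = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
vars_ = {"A": [0.5, 0.5], "B": [4.0, 4.0]}
far_label = classify([3.0, 3.0], means, vars_)    # assigned to "B"
near_label = classify([0.2, 0.1], means, vars_)   # assigned to "A"
```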
Microarrays represent a potentially powerful tool for better understanding the role of the microenvironment on tumor biology. To make the best use of microarray data and avoid incorrect or unsubstantiated conclusions, care must be taken in the statistical analysis. To illustrate the statistical issues involved we discuss three microarray studies related to the microenvironment and tumor biology involving: (i) prostatic stroma cells in cancer and non-cancer tissues; (ii) breast stroma and epithelial cells in breast cancer patients and non-cancer patients; and (iii) serum associated with wound response and stroma in cancer patients. Using these examples we critically discuss three types of analyses: differential gene expression, cluster analysis, and class prediction. We also discuss design issues.
Bonferroni; Class prediction; Cluster analysis; Differential expression; False discovery rate; Sample size
There is experimental evidence from animal models favoring the notion that the disruption of interactions between stroma and epithelium plays an important role in the initiation of carcinogenesis. These disrupted interactions are hypothesized to be mediated by molecules, termed morphostats, which diffuse through the tissue to determine cell phenotype and maintain tissue architecture.
We developed a computer simulation based on simple properties of cell renewal and morphostats.
Under the computer simulation, the disruption of the morphostat gradient in the stroma generated epithelial precursors of cancer without any mutation in the epithelium.
The model is consistent with the possibility that the accumulation of genetic and epigenetic changes found in tumors could arise after the formation of a founder population of aberrant cells, defined as cells that are created by low or insufficient morphostat levels and that no longer respond to morphostat concentrations. Because the model is biologically plausible, we hope that these results will stimulate further experiments.
The prevailing paradigm in cancer research is the somatic mutation theory that posits that cancer begins with a single mutation in a somatic cell followed by successive mutations. Much cancer research involves refining the somatic mutation theory with an ever-increasing catalog of genetic changes. The problem is that such research may miss paradoxical aspects of carcinogenesis for which there is no likely explanation under the somatic mutation theory. These paradoxical aspects offer opportunities for new research directions that should not be ignored.
Various paradoxes related to the somatic mutation theory of carcinogenesis are discussed: (1) the presence of large numbers of spatially distinct precancerous lesions at the onset of promotion, (2) the large number of genetic instabilities found in hyperplastic polyps not considered cancer, (3) spontaneous regression, (4) higher incidence of cancer in patients with xeroderma pigmentosum but not in patients with other comparable defects in DNA repair, (5) lower incidence of many cancers except leukemia and testicular cancer in patients with Down's syndrome, (6) cancer developing after normal tissue is transplanted to other parts of the body or next to stroma previously exposed to carcinogens, (7) the lack of tumors when epithelial cells exposed to a carcinogen were transplanted next to normal stroma, (8) the development of cancers when Millipore filters of various pore sizes were inserted under the skin of rats, but only if the holes were sufficiently small. For the latter paradox, a microarray experiment is proposed to try to better understand the phenomena.
The famous physicist Niels Bohr said "How wonderful that we have met with a paradox. Now we have some hope of making progress." The same viewpoint should apply to cancer research. It is easy to ignore this piece of wisdom about the means to advance knowledge, but we do so at our peril.
A key aspect of randomized trial design is the choice of risk group. Some trials include patients from the entire at-risk population, others accrue only patients deemed to be at increased risk. We present a simple statistical approach for choosing between these approaches. The method is easily adapted to determine which of several competing definitions of high risk is optimal.
We treat eligibility criteria for a trial, such as a smoking history, as a prediction rule associated with a certain sensitivity (the number of patients who have the event and who are classified as high risk divided by the total number of patients who have an event) and specificity (the number of patients who do not have an event and who do not meet criteria for high risk divided by the total number of patients who do not have an event). We then derive simple formulae to determine the proportion of patients receiving intervention, and the proportion who experience an event, where either all patients or only those at high risk are treated. We assume that the relative risk associated with intervention is the same over all choices of risk group. The proportion of events and interventions are combined using a net benefit approach and net benefit compared between strategies.
We applied our method to design a trial of adjuvant therapy after prostatectomy. We were able to demonstrate that treating a high-risk group was superior to treating all patients, to choose the optimal definition of high risk, and to test the robustness of our results by sensitivity analysis. Our results had a ready clinical interpretation that could immediately aid trial design.
The choice of risk group in randomized trials is usually based on rather informal methods. Our simple method demonstrates that this decision can be informed by simple statistical analyses.
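The comparison of treating all patients versus only a high-risk group can be sketched as follows. The notation (relative risk r of intervention, harm weight w per intervention in event-equivalents) is ours, and the numbers are hypothetical; the paper's formulae may differ in detail.

```python
def net_benefit_strategy(sens, spec, prevalence, rel_risk, harm_weight):
    """Net benefit of treating the high-risk group, in events avoided
    per patient at risk: events avoided by intervention minus the
    proportion treated weighted by the harm of intervention
    (an illustrative sketch)."""
    events_avoided = (1 - rel_risk) * prevalence * sens
    treated = sens * prevalence + (1 - spec) * (1 - prevalence)
    return events_avoided - harm_weight * treated

# Treating all patients corresponds to sensitivity 1 and specificity 0.
P, r, w = 0.3, 0.6, 0.05          # hypothetical values
nb_all = net_benefit_strategy(1.0, 0.0, P, r, w)
nb_high_risk = net_benefit_strategy(0.8, 0.7, P, r, w)
# Here the high-risk strategy wins: it forgoes some avoidable events
# but spares many patients an intervention with nonzero harm.
```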
The goal of most microarray studies is either the identification of genes that are most differentially expressed or the creation of a good classification rule. The disadvantage of the former is that it ignores the importance of gene interactions; the disadvantage of the latter is that it often does not provide a sufficient focus for further investigation because many genes may be included by chance. Our strategy is to search for classification rules that perform well with few genes and, if they are found, identify genes that occur relatively frequently under multiple random validation (random splits into training and test samples).
We analyzed data from four published studies related to cancer. For classification we used a filter with a nearest centroid rule that is easy to implement and has been previously shown to perform well. To comprehensively measure classification performance we used receiver operating characteristic curves. In the three data sets with good classification performance, the classification rules for 5 genes were only slightly worse than for 20 or 50 genes and somewhat better than for 1 gene. In two of these data sets, one or two genes had relatively high frequencies not noticeable with rules involving 20 or 50 genes: desmin for classifying colon cancer versus normal tissue; and zyxin and secretory granule proteoglycan genes for classifying two types of leukemia.
Using multiple random validation, investigators should look for classification rules that perform well with few genes and select, for further study, genes with relatively high frequencies of occurrence in these classification rules.
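The strategy of multiple random validation with a gene filter and a nearest centroid rule can be sketched on synthetic data; the gene count, effect sizes, and split sizes below are invented for illustration, and the filter here is a simple difference in class means rather than the published one.

```python
import random

random.seed(0)

# Synthetic data: 40 samples, 30 genes; only genes 0 and 1 carry signal.
n_samples, n_genes, k = 40, 30, 5
labels = [i % 2 for i in range(n_samples)]
data = [[random.gauss(1.5 * labels[i], 1.0) if g < 2 else random.gauss(0.0, 1.0)
         for g in range(n_genes)] for i in range(n_samples)]

def top_genes(train_idx, k):
    """Filter step: rank genes by absolute difference in class means."""
    scores = []
    for g in range(n_genes):
        m = [sum(data[i][g] for i in train_idx if labels[i] == c)
             / sum(1 for i in train_idx if labels[i] == c)
             for c in (0, 1)]
        scores.append((abs(m[1] - m[0]), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def nearest_centroid_accuracy(train_idx, test_idx, genes):
    """Classify test samples by the nearest class centroid."""
    cent = {c: [sum(data[i][g] for i in train_idx if labels[i] == c)
                / sum(1 for i in train_idx if labels[i] == c) for g in genes]
            for c in (0, 1)}
    correct = 0
    for i in test_idx:
        dist = {c: sum((data[i][g] - cent[c][t]) ** 2
                       for t, g in enumerate(genes)) for c in (0, 1)}
        correct += int(min(dist, key=dist.get) == labels[i])
    return correct / len(test_idx)

# Multiple random validation: repeat random training/test splits,
# recording how often each gene is selected and the test accuracy.
freq = {g: 0 for g in range(n_genes)}
accs = []
for _ in range(50):
    idx = list(range(n_samples))
    random.shuffle(idx)
    train, test = idx[:30], idx[30:]
    genes = top_genes(train, k)
    for g in genes:
        freq[g] += 1
    accs.append(nearest_centroid_accuracy(train, test, genes))
```

Genes selected with relatively high frequency across splits (here, the two informative genes) are the candidates for further study.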