Systems biology uses systems of mathematical rules and formulas to study complex biological phenomena. In cancer research there are three distinct threads in systems biology research: (1) modeling biology or biophysics with the goal of establishing plausibility or obtaining insights; (2) modeling based on statistics, bioinformatics, and reverse engineering with the goal of better characterizing the system; and (3) modeling with the goal of making clinical predictions. Using illustrative examples, we discuss these threads in the context of cancer research.
bioinformatics; microarrays; reverse engineering; receiver operating characteristic curves
Many cancer screening trials involve a screening programme of one or more screenings with follow-up after the last screening. Usually a maximum follow-up time is selected in advance. However, during the follow-up period there is an opportunity to report the results of the trial sooner than planned. Early reporting of results from a randomized screening trial is important because obtaining a valid result sooner translates into health benefits reaching the general population sooner. The health benefits are reduction in cancer deaths if screening is found to be beneficial and more screening is recommended, or avoidance of unnecessary biopsies, work-ups and morbidity if screening is not found to be beneficial and the rate of screening drops.
Our proposed method for deciding if results from a cancer screening trial should be reported earlier in the follow-up period is based on considerations involving postscreening noise. Postscreening noise (sometimes called dilution) refers to cancer deaths in the follow-up period that could not have been prevented by screening: (1) cancer deaths in the screened group that occurred after the last screening in subjects whose cancers were not detected during the screening programme and (2) cancer deaths in the control group that occurred after the time of the last screening in subjects whose cancers would not have been detected during the screening programme had they been randomized to screening (the number of which is unobserved). Because postscreening noise increases with follow-up after the last screening, we propose early reporting at the time during the follow-up period when postscreening noise first starts to overwhelm the estimated effect of screening as measured by a z-statistic. This leads to a confidence interval, adjusted for postscreening noise, that would not change substantially with additional follow-up. Details of the early reporting rule were refined by simulation, which also accounted for multiple looks.
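To make the dilution effect concrete, here is a minimal numerical sketch (not the exact rule from the paper, which is refined by simulation): a simple difference-in-proportions z-statistic is computed at successive follow-up times, once from all cancer deaths and once after removing hypothetical postscreening deaths. All counts are invented for illustration, and in a real trial the control-arm noise is unobserved and must be estimated.

```python
import math

def z_statistic(d_screen, d_control, n_screen, n_control):
    """Two-sample z-statistic for the difference in cumulative cancer-death
    proportions between the control and screened arms (pooled variance)."""
    p1, p0 = d_screen / n_screen, d_control / n_control
    p = (d_screen + d_control) / (n_screen + n_control)
    se = math.sqrt(p * (1 - p) * (1 / n_screen + 1 / n_control))
    return (p0 - p1) / se

# Hypothetical cumulative deaths by year of follow-up after the last screening.
# The "noise" columns count deaths from cancers diagnosed after the screening
# programme, i.e. deaths that screening could not have prevented.
followup = [
    # (year, deaths_screened, deaths_control, noise_screened, noise_control)
    (2, 40, 60, 5, 6),
    (4, 55, 80, 20, 22),
    (6, 75, 100, 45, 48),
    (8, 100, 125, 75, 78),
]
n_s = n_c = 30_000

for year, d_s, d_c, noise_s, noise_c in followup:
    z_all = z_statistic(d_s, d_c, n_s, n_c)                     # diluted by noise
    z_adj = z_statistic(d_s - noise_s, d_c - noise_c, n_s, n_c) # noise removed
    print(f"year {year}: unadjusted z = {z_all:.2f}, noise-adjusted z = {z_adj:.2f}")
```

In this toy example the unadjusted z-statistic shrinks as postscreening noise accumulates while the adjusted version stabilizes, which is the behavior the early reporting rule exploits.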
For the re-analysis of the Health Insurance Plan trial for breast cancer screening and the Mayo Lung Project for lung cancer screening, estimates and confidence intervals for the effect of screening on cancer mortality were similar at early reporting and at later follow-up.
The proposed early reporting rule for a cancer screening trial with postscreening follow-up is a promising method for making results from the trial available sooner, which translates into health benefits (reduction in cancer deaths or avoidance of unnecessary morbidity) reaching the population sooner.
Using multiple historical trials with surrogate and true endpoints, we consider various models to predict the effect of treatment on a true endpoint in a target trial in which only a surrogate endpoint is observed. This predicted result is computed using (1) a prediction model (mixture, linear, or principal stratification) estimated from historical trials and the surrogate endpoint of the target trial and (2) a random extrapolation error estimated from successively leaving out each trial among the historical trials. The method applies to either binary outcomes or survival to a particular time that is computed from censored survival data. We compute a 95% confidence interval for the predicted result and validate its coverage using simulation. To summarize the additional uncertainty from using a predicted instead of true result for the estimated treatment effect, we compute its multiplier of standard error. Software is available for download.
Randomized trials; Reproducibility; Principal stratification
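As a rough sketch of the leave-one-trial-out idea (using an assumed linear prediction model and invented trial-level effects; the published method also handles mixture and principal stratification models and propagates within-trial sampling error, which are omitted here):

```python
import numpy as np

# Hypothetical treatment effects (e.g., differences in survival probability)
# on the surrogate (x) and true (y) endpoints in several historical trials.
surrogate = np.array([0.05, 0.08, 0.02, 0.10, 0.06, 0.04])
true_end  = np.array([0.03, 0.06, 0.01, 0.07, 0.05, 0.02])

def linear_prediction(x_train, y_train, x_new):
    """Least-squares line fitted to the historical trials."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return intercept + slope * x_new

# Leave each historical trial out in turn and record the prediction error,
# which estimates the extrapolation error of the prediction model.
errors = []
for i in range(len(surrogate)):
    keep = np.arange(len(surrogate)) != i
    pred = linear_prediction(surrogate[keep], true_end[keep], surrogate[i])
    errors.append(true_end[i] - pred)
errors = np.array(errors)

# Predicted true-endpoint effect in a target trial where only the surrogate
# endpoint (here 0.07) is observed, with extrapolation error in the interval.
x_target = 0.07
point = linear_prediction(surrogate, true_end, x_target)
se_extrap = errors.std(ddof=1)
print(f"predicted effect = {point:.3f}, 95% CI = "
      f"({point - 1.96 * se_extrap:.3f}, {point + 1.96 * se_extrap:.3f})")
```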
We define personalized medicine as the administration of treatment to only persons thought most likely to benefit, typically those at high risk for mortality or another detrimental outcome. To evaluate personalized medicine, we propose a new design for a randomized trial that makes efficient use of high-throughput data (such as gene expression microarrays) and clinical data (such as tumor stage) collected at baseline from all participants. Under this design for a randomized trial involving experimental and control arms with a survival outcome, investigators first estimate the risk of mortality in the control arm based on the high-throughput and clinical data. Then investigators use data from both randomization arms to estimate the effect of treatment both among all participants and among participants in the highest prespecified category of risk. This design requires only an 18.1% increase in sample size compared with a standard randomized trial. A trial based on this design with 90% power to detect a realistic increase in survival from 70% to 80% among all participants would also have 90% power to detect an increase in survival from 50% to 73% in the highest quintile of risk.
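The flavor of the sample-size argument can be reproduced with a crude normal-approximation calculation for two proportions; this is only an illustration and will not reproduce the exact 18.1% figure, which comes from the survival-based calculations in the paper.

```python
from scipy.stats import norm

def n_per_arm(p0, p1, alpha=0.05, power=0.90):
    """Approximate per-arm sample size for detecting a difference between two
    survival proportions with a two-sided level-alpha z-test."""
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p0 * (1 - p0) + p1 * (1 - p1)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p0) ** 2

n_all = n_per_arm(0.70, 0.80)        # effect among all participants
n_top = n_per_arm(0.50, 0.73)        # effect in the highest risk quintile
print(f"per-arm n for 70% vs 80% survival (all participants): {n_all:.0f}")
print(f"per-arm n for 50% vs 73% survival (top risk quintile): {n_top:.0f}")
# Only one fifth of participants fall in the highest risk quintile, so the
# subgroup comparison implies an overall per-arm size of about 5 * n_top;
# the extra sample size relative to n_all is the price of the dual objective.
print(f"overall per-arm n implied by the subgroup objective: {5 * n_top:.0f}")
```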
Risk prediction models based on medical history or results of tests are increasingly common in the cancer literature. An important use of these models is to make treatment decisions on the basis of estimated risk. The relative utility curve is a simple method for evaluating risk prediction in a medical decision-making framework. Relative utility curves have three attractive features for the evaluation of risk prediction models. First, they put risk prediction into perspective because relative utility is the fraction of the expected utility of perfect prediction obtained by the risk prediction model at the optimal cut point. Second, they do not require precise specification of harms and benefits because relative utility is plotted against a summary measure of harms and benefits (ie, the risk threshold). Third, they are easy to compute from standard tables of data found in many articles on risk prediction. An important use of relative utility curves is to evaluate the addition of a risk factor to the risk prediction model. To illustrate an application of relative utility curves, an analysis was performed on previously published data involving the addition of breast density to a risk prediction model for invasive breast cancer.
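For readers who want to compute relative utility directly from such tables, the sketch below uses the standard net-benefit formulation under the assumption that the risk threshold exceeds the outcome prevalence (so that treating no one is the best strategy in the absence of prediction); the prevalence and operating points are hypothetical, not the breast density data.

```python
def relative_utility(tpr, fpr, prevalence, risk_threshold):
    """Relative utility of risk prediction at a given risk threshold, assuming
    the threshold exceeds prevalence so that treating no one is the best
    strategy in the absence of prediction."""
    threshold_odds = risk_threshold / (1 - risk_threshold)
    return tpr - ((1 - prevalence) / prevalence) * threshold_odds * fpr

# Hypothetical operating points (sensitivity, 1 - specificity) at a risk cut
# point, for a model with and without an added risk factor.
prevalence = 0.04
for label, tpr, fpr in [("baseline model", 0.40, 0.10),
                        ("model with added risk factor", 0.45, 0.10)]:
    ru = relative_utility(tpr, fpr, prevalence, risk_threshold=0.05)
    print(f"{label}: relative utility at risk threshold 0.05 = {ru:.3f}")
```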
The biomarker pipeline to develop and evaluate cancer screening tests has three stages: identification of promising biomarkers for the early detection of cancer, initial evaluation of biomarkers for cancer screening, and definitive evaluation of biomarkers for cancer screening. Statistical and biological issues to improve this pipeline are discussed. Although various recommendations, such as identifying cases based on clinical symptoms, keeping biomarker tests simple, and adjusting for postscreening noise, have been made previously, they are not widely known. New recommendations include more frequent specimen collection to help identify promising biomarkers and the use of the paired availability design with interval cases (symptomatic cancers detected in the interval after screening) for initial evaluation of biomarkers for cancer screening.
Because many medical decisions are based on risk prediction models constructed from medical history and results of tests, the evaluation of these prediction models is important. This paper makes five contributions to this evaluation: (1) the relative utility curve, which gauges the potential for better prediction in terms of utilities, without the need for a reference level for one utility, while providing a sensitivity analysis for misspecification of utilities; (2) the relevant region, which is the set of values of prediction performance consistent with the recommended treatment status in the absence of prediction; (3) the test threshold, which is the minimum number of tests that would be traded for a true positive in order for the expected utility to be non-negative; (4) the evaluation of two-stage predictions that reduce test costs; and (5) connections among various measures of prediction performance. An application involving the risk of cardiovascular disease is discussed.
decision analysis; decision curve; receiver operating characteristic curve; utility
The paired availability design for historical controls postulated four classes corresponding to the treatment (old or new) a participant would receive if arrival occurred during either of two time periods associated with different availabilities of treatment. These classes were later extended to other settings and called principal strata. Judea Pearl asks if principal stratification is a goal or a tool and lists four interpretations of principal stratification. In the case of the paired availability design, principal stratification is a tool that falls squarely into Pearl's interpretation of principal stratification as “an approximation to research questions concerning population averages.” We describe the paired availability design and the important role played by principal stratification in estimating the effect of receipt of treatment in a population using data on changes in availability of treatment. We discuss the assumptions and their plausibility. We also introduce the extrapolated estimate to make the generalizability assumption more plausible. By showing why the assumptions are plausible we show why the paired availability design, which includes principal stratification as a key component, is useful for estimating the effect of receipt of treatment in a population. Thus, for our application, we answer Pearl's challenge to clearly demonstrate the value of principal stratification.
principal stratification; causal inference; paired availability design
Recently Cheng (Biometrics, 2009) proposed a model for the causal effect of receiving treatment when there is all-or-none compliance in one randomization group, with maximum likelihood estimation based on convex programming. We discuss an alternative approach that involves a model for all-or-none compliance in two randomization groups and estimation via a perfect fit or an EM algorithm for count data. We believe this approach is easier to implement, which would facilitate the reproduction of calculations.
All-or-none compliance; Causal effect; Multinomial outcomes; Noncompliance; Perfect fit; Principal stratification; Randomized trials
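A moment-based (instrumental-variable-type) estimate of the effect of receipt of treatment conveys the basic idea; the sketch below uses invented counts and does not reproduce the perfect-fit or EM details of the letter.

```python
# Hypothetical counts from a randomized trial with all-or-none compliance in
# both randomization groups: within each arm, subjects are cross-classified by
# whether they received the new treatment and whether the outcome was a success.
arms = {
    "randomized to new treatment": {"received": (120, 80), "not received": (30, 70)},
    "randomized to control":       {"received": (15, 10),  "not received": (110, 165)},
}

def summarize_arm(arm):
    """Return (proportion who received treatment, proportion with success)."""
    n = sum(successes + failures for successes, failures in arm.values())
    received = sum(arm["received"])
    successes = sum(s for s, _ in arm.values())
    return received / n, successes / n

receipt1, success1 = summarize_arm(arms["randomized to new treatment"])
receipt0, success0 = summarize_arm(arms["randomized to control"])

# Moment estimate of the effect of receiving treatment in the principal stratum
# whose receipt is changed by randomization: the between-arm difference in
# success rates divided by the between-arm difference in receipt rates.
effect_of_receipt = (success1 - success0) / (receipt1 - receipt0)
print(f"estimated effect of receipt of treatment: {effect_of_receipt:.3f}")
```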
With the analysis of complex, messy data sets, the statistics community has recently focused attention on “reproducible research,” namely research that can be readily replicated by others. One standard that has been proposed is the availability of data sets and computer code. However, in some situations, raw data cannot be disseminated for reasons of confidentiality or because the data are so messy as to make dissemination impractical. For one such situation, we propose 2 steps for reproducible research: (i) presentation of a table of data and (ii) presentation of a formula to estimate key quantities from the table of data. We illustrate this strategy in the analysis of data from the Prostate Cancer Prevention Trial, which investigated the effect of the drug finasteride versus placebo on the period prevalence of prostate cancer. With such an important result at stake, a transparent analysis was important.
Categorical data; Maximum likelihood; Missing data; Multinomial–Poisson transformation; Propensity-to-be-missing score; Randomized trials
A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations.
A simple modification of diagonal discriminant analysis yields smooth, highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperform Ripples when using a pooled variance to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis.
The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.
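As a toy contrast of how the treatment of variances changes the classification boundary (this is plain diagonal discriminant analysis, not the Swirls rule itself), the sketch below compares a pooled variance with class-specific variances on normally distributed data whose variances differ by class.

```python
import numpy as np

def diagonal_discriminant(x, means, variances, priors):
    """Assign a sample to the class with the largest diagonal-normal
    log-likelihood; a single pooled variance vector gives linear boundaries,
    class-specific variance vectors give curved ones."""
    scores = []
    for k in range(len(means)):
        var = variances[k] if variances.ndim == 2 else variances
        log_lik = -0.5 * np.sum((x - means[k]) ** 2 / var + np.log(var))
        scores.append(log_lik + np.log(priors[k]))
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
# Two classes and two "genes", normally distributed with different variances.
train0 = rng.normal(0.0, 1.0, size=(100, 2))
train1 = rng.normal(0.5, 2.5, size=(100, 2))
means = np.vstack([train0.mean(axis=0), train1.mean(axis=0)])
pooled_var = (train0.var(axis=0) + train1.var(axis=0)) / 2   # equal class sizes
class_var = np.vstack([train0.var(axis=0), train1.var(axis=0)])
priors = np.array([0.5, 0.5])

test = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                  rng.normal(0.5, 2.5, size=(200, 2))])
labels = np.repeat([0, 1], 200)
for name, var in [("pooled variance", pooled_var),
                  ("class-specific variances", class_var)]:
    predictions = np.array([diagonal_discriminant(x, means, var, priors) for x in test])
    print(f"{name}: test accuracy = {(predictions == labels).mean():.2f}")
```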
Microarrays represent a potentially powerful tool for better understanding the role of the microenvironment on tumor biology. To make the best use of microarray data and avoid incorrect or unsubstantiated conclusions, care must be taken in the statistical analysis. To illustrate the statistical issues involved we discuss three microarray studies related to the microenvironment and tumor biology involving: (i) prostatic stroma cells in cancer and non-cancer tissues; (ii) breast stroma and epithelial cells in breast cancer patients and non-cancer patients; and (iii) serum associated with wound response and stroma in cancer patients. Using these examples we critically discuss three types of analyses: differential gene expression, cluster analysis, and class prediction. We also discuss design issues.
Bonferroni; Class prediction; Cluster analysis; Differential expression; False discovery rate; Sample size
There is experimental evidence from animal models favoring the notion that the disruption of interactions between stroma and epithelium plays an important role in the initiation of carcinogenesis. These disrupted interactions are hypothesized to be mediated by molecules, termed morphostats, which diffuse through the tissue to determine cell phenotype and maintain tissue architecture.
We developed a computer simulation based on simple properties of cell renewal and morphostats.
Under the computer simulation, the disruption of the morphostat gradient in the stroma generated epithelial precursors of cancer without any mutation in the epithelium.
The model is consistent with the possibility that the accumulation of genetic and epigenetic changes found in tumors could arise after the formation of a founder population of aberrant cells, defined as cells that are created by low or insufficient morphostat levels and that no longer respond to morphostat concentrations. Because the model is biologically plausible, we hope that these results will stimulate further experiments.
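A one-dimensional caricature of the simulation (all quantities invented) conveys the mechanism: a morphostat diffusing from the stroma decays with distance from its source, and weakening the stromal source pushes distant epithelial cells below the level needed to maintain a normal phenotype, without any mutation.

```python
import numpy as np

def morphostat_profile(n_cells=30, stromal_source=1.0, decay=0.9, disrupted=False):
    """Toy steady-state morphostat levels along a one-dimensional column of
    epithelial cells: the morphostat originates in the stroma (position 0)
    and decays with distance; disruption halves the stromal source."""
    source = stromal_source * (0.5 if disrupted else 1.0)
    return source * decay ** np.arange(n_cells)

def phenotypes(morphostat, threshold=0.04):
    """Cells whose morphostat level falls below the threshold lose the normal
    phenotype (become aberrant) even though no mutation has occurred."""
    return np.where(morphostat < threshold, "aberrant", "normal")

for disrupted in (False, True):
    levels = morphostat_profile(disrupted=disrupted)
    n_aberrant = int((phenotypes(levels) == "aberrant").sum())
    print(f"stromal gradient disrupted = {disrupted}: "
          f"{n_aberrant} of {levels.size} epithelial cells aberrant")
```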
The prevailing paradigm in cancer research is the somatic mutation theory that posits that cancer begins with a single mutation in a somatic cell followed by successive mutations. Much cancer research involves refining the somatic mutation theory with an ever increasing catalog of genetic changes. The problem is that such research may miss paradoxical aspects of carcinogenesis for which there is no likely explanation under the somatic mutation theory. These paradoxical aspects offer opportunities for new research directions that should not be ignored.
Various paradoxes related to the somatic mutation theory of carcinogenesis are discussed: (1) the presence of large numbers of spatially distinct precancerous lesions at the onset of promotion, (2) the large number of genetic instabilities found in hyperplastic polyps not considered cancer, (3) spontaneous regression, (4) higher incidence of cancer in patients with xeroderma pigmentosum but not in patients with other comparable defects in DNA repair, (5) lower incidence of many cancers except leukemia and testicular cancer in patients with Down's syndrome, (6) cancer developing after normal tissue is transplanted to other parts of the body or next to stroma previously exposed to carcinogens, (7) the lack of tumors when epithelial cells exposed to a carcinogen were transplanted next to normal stroma, and (8) the development of cancers when Millipore filters of various pore sizes were inserted under the skin of rats, but only if the holes were sufficiently small. For the latter paradox, a microarray experiment is proposed to try to better understand this phenomenon.
The famous physicist Niels Bohr said "How wonderful that we have met with a paradox. Now we have some hope of making progress." The same viewpoint should apply to cancer research. It is easy to ignore this piece of wisdom about the means to advance knowledge, but we do so at our peril.
A key aspect of randomized trial design is the choice of risk group. Some trials include patients from the entire at-risk population, while others accrue only patients deemed to be at increased risk. We present a simple statistical approach for choosing between these two strategies. The method is easily adapted to determine which of several competing definitions of high risk is optimal.
We treat the eligibility criteria for a trial, such as a smoking history, as a prediction rule associated with a certain sensitivity (the number of patients who have the event and who are classified as high risk divided by the total number of patients who have an event) and specificity (the number of patients who do not have an event and who do not meet the criteria for high risk divided by the total number of patients who do not have an event). We then derive simple formulae to determine the proportion of patients receiving the intervention and the proportion who experience an event when either all patients or only those at high risk are treated. We assume that the relative risk associated with the intervention is the same over all choices of risk group. The proportions of events and interventions are combined using a net benefit approach, and net benefit is compared between strategies.
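A sketch of the calculation with hypothetical inputs is given below; the harm of the intervention is expressed through an assumed exchange rate (events per treatment), and the constant-relative-risk assumption described above is built in.

```python
def net_benefit_treat_all(event_rate, relative_risk, harm_per_treatment):
    """Net benefit (events avoided minus treatment harm, per patient) when
    every eligible patient receives the intervention."""
    return event_rate * (1 - relative_risk) - harm_per_treatment

def net_benefit_treat_high_risk(event_rate, relative_risk, harm_per_treatment,
                                sensitivity, specificity):
    """Net benefit when only patients meeting the high-risk criteria are treated;
    assumes the relative risk of intervention is the same in every risk group."""
    proportion_treated = (event_rate * sensitivity
                          + (1 - event_rate) * (1 - specificity))
    events_avoided = event_rate * sensitivity * (1 - relative_risk)
    return events_avoided - harm_per_treatment * proportion_treated

# Hypothetical inputs: 20% event rate, intervention relative risk 0.7, and a
# harm weight equivalent to one event per 20 patients treated.
p, rr, harm = 0.20, 0.7, 1 / 20
print(f"treat all: net benefit = {net_benefit_treat_all(p, rr, harm):.4f}")
for label, sens, spec in [("high-risk definition A", 0.80, 0.70),
                          ("high-risk definition B", 0.60, 0.90)]:
    nb = net_benefit_treat_high_risk(p, rr, harm, sens, spec)
    print(f"{label}: net benefit = {nb:.4f}")
```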
We applied our method to design a trial of adjuvant therapy after prostatectomy. We were able to demonstrate that treating a high-risk group was superior to treating all patients, to choose the optimal definition of high risk, and to test the robustness of our results by sensitivity analysis. Our results had a ready clinical interpretation that could immediately aid trial design.
The choice of risk group in randomized trials is usually based on rather informal methods. Our simple method demonstrates that this decision can be informed by simple statistical analyses.
The goal of most microarray studies is either the identification of genes that are most differentially expressed or the creation of a good classification rule. The disadvantage of the former is that it ignores the importance of gene interactions; the disadvantage of the latter is that it often does not provide a sufficient focus for further investigation because many genes may be included by chance. Our strategy is to search for classification rules that perform well with few genes and, if they are found, identify genes that occur relatively frequently under multiple random validation (random splits into training and test samples).
We analyzed data from four published studies related to cancer. For classification we used a filter with a nearest centroid rule that is easy to implement and has been previously shown to perform well. To comprehensively measure classification performance we used receiver operating characteristic curves. In the three data sets with good classification performance, the classification rules for 5 genes were only slightly worse than for 20 or 50 genes and somewhat better than for 1 gene. In two of these data sets, one or two genes had relatively high frequencies not noticeable with rules involving 20 or 50 genes: desmin for classifying colon cancer versus normal tissue; and zyxin and secretory granule proteoglycan genes for classifying two types of leukemia.
Using multiple random validation, investigators should look for classification rules that perform well with few genes and select, for further study, genes with relatively high frequencies of occurrence in these classification rules.
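A condensed sketch of multiple random validation on simulated data is shown below; the mean-difference filter and L1 distance are simplifications of the filter and distance measures used in the paper.

```python
import numpy as np

def nearest_centroid_error(x_train, y_train, x_test, y_test, genes):
    """Classify each test sample to the class with the nearer centroid,
    using only the selected genes, and return the test error rate."""
    centroid0 = x_train[y_train == 0][:, genes].mean(axis=0)
    centroid1 = x_train[y_train == 1][:, genes].mean(axis=0)
    d0 = np.abs(x_test[:, genes] - centroid0).sum(axis=1)
    d1 = np.abs(x_test[:, genes] - centroid1).sum(axis=1)
    return np.mean((d1 < d0).astype(int) != y_test)

def multiple_random_validation(x, y, n_genes=5, n_splits=200, seed=0):
    """Repeated random splits into training and test samples; on each split a
    simple filter (absolute difference in class means) selects n_genes, and
    the frequency with which each gene is selected is recorded."""
    rng = np.random.default_rng(seed)
    frequency = np.zeros(x.shape[1])
    errors = []
    for _ in range(n_splits):
        order = rng.permutation(len(y))
        train, test = order[: len(y) // 2], order[len(y) // 2:]
        score = np.abs(x[train][y[train] == 1].mean(axis=0)
                       - x[train][y[train] == 0].mean(axis=0))
        genes = np.argsort(score)[-n_genes:]
        frequency[genes] += 1
        errors.append(nearest_centroid_error(x[train], y[train],
                                             x[test], y[test], genes))
    return frequency / n_splits, float(np.mean(errors))

# Toy data: 60 samples, 500 genes, the first three genes truly differential.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
x = rng.normal(size=(60, 500))
x[y == 1, :3] += 1.5
frequency, error = multiple_random_validation(x, y)
print("mean test error over splits:", round(error, 3))
print("most frequently selected genes:", np.argsort(frequency)[-5:][::-1])
```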
The human genome map has started a hunt to find common genes that are associated with cancer. But new research questions the likelihood of success.
There is a common belief that most cancer prevention trials should be restricted to high-risk subjects in order to increase statistical power. This strategy is appropriate if the ultimate target population is subjects at the same high risk. However, if the target population is the general population, three assumptions may underlie the decision to enroll high-risk subjects instead of average-risk subjects from the general population: higher statistical power for the same sample size, lower costs for the same power and type I error, and a correct ratio of benefits to harms. We critically investigate the plausibility of these assumptions.
We considered each assumption in the context of a simple example. We investigated statistical power for a fixed sample size when the investigators assume that the relative risk is invariant over risk groups but, in reality, the risk difference is invariant over risk groups. We investigated possible costs when a trial of high-risk subjects has the same power and type I error as a larger trial of average-risk subjects from the general population. We investigated the ratio of benefits to harms when extrapolating from high-risk to average-risk subjects.
Appearances here are misleading. First, the increase in statistical power with a trial of high-risk subjects rather than the same number of average-risk subjects from the general population assumes that the relative risk is the same for high-risk and average-risk subjects. However, if the absolute risk difference rather than the relative risk were the same, the power could be less with the high-risk subjects. In the analysis of data from a cancer prevention trial, we found that invariance of the absolute risk difference over risk groups was nearly as plausible as invariance of the relative risk over risk groups. Therefore, a priori assumptions of constant relative risk across risk groups are not robust, limiting extrapolation of estimates of benefit to the general population. Second, a trial of high-risk subjects may cost more than a larger trial of average-risk subjects with the same power and type I error because of the additional recruitment and diagnostic testing needed to identify high-risk subjects. Third, the ratio of benefits to harms may be more favorable in high-risk persons than in average-risk persons in the general population, which means that extrapolating this ratio to the general population would be misleading. Thus there is no free lunch when using a trial of high-risk subjects to extrapolate results to the general population.
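The power reversal described above can be illustrated with a crude two-proportion power calculation (hypothetical risks and sample size); the risk difference is chosen so that the two invariance assumptions coincide in the average-risk group and diverge only when extrapolated to the high-risk group.

```python
from scipy.stats import norm

def power_two_proportions(p0, p1, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two event proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)
    se = ((p0 * (1 - p0) + p1 * (1 - p1)) / n_per_arm) ** 0.5
    return norm.cdf(abs(p0 - p1) / se - z_alpha)

n = 2000                          # subjects per arm (hypothetical)
p_average, p_high = 0.02, 0.10    # control-arm event risks
relative_risk = 0.7
# Risk difference chosen so both assumptions agree in the average-risk group.
risk_difference = p_average * (relative_risk - 1)   # -0.006

for label, p0 in [("average-risk trial", p_average), ("high-risk trial", p_high)]:
    power_rr = power_two_proportions(p0, p0 * relative_risk, n)
    power_rd = power_two_proportions(p0, p0 + risk_difference, n)
    print(f"{label}: power if relative risk is invariant = {power_rr:.2f}, "
          f"power if risk difference is invariant = {power_rd:.2f}")
```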
Unless the intervention is targeted to only high-risk subjects, cancer prevention trials should be implemented in the general population.
In recent years there has been increased interest in evaluating breast cancer screening using data from before-and-after studies in multiple geographic regions. One approach, not previously mentioned, is the paired availability design. The paired availability design was developed to evaluate the effect of medical interventions by comparing changes in outcomes before and after a change in the availability of an intervention in various locations. A simple potential outcomes model yields estimates of efficacy, the effect of receiving the intervention, as opposed to effectiveness, the effect of changing the availability of the intervention. By combining estimates of efficacy rather than effectiveness, the paired availability design avoids confounding due to different fractions of subjects receiving the interventions at different locations. The original formulation involved short-term outcomes; the challenge here is accommodating long-term outcomes.
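In its simplest (short-term outcome) form, the efficacy calculation looks like the sketch below, with invented numbers and an unweighted average standing in for the variance-weighted combination and the long-term adjustments developed in the paper.

```python
# Hypothetical before/after data for several locations (e.g., counties).
# rate = outcome rate per 100,000; receipt = proportion receiving screening.
locations = [
    {"rate_before": 60.0, "rate_after": 52.0, "receipt_before": 0.05, "receipt_after": 0.70},
    {"rate_before": 55.0, "rate_after": 50.0, "receipt_before": 0.10, "receipt_after": 0.65},
    {"rate_before": 62.0, "rate_after": 53.0, "receipt_before": 0.02, "receipt_after": 0.80},
]

def efficacy(loc):
    """Effect of *receiving* the intervention at one location: the change in
    the outcome rate divided by the change in the fraction receiving it."""
    return ((loc["rate_after"] - loc["rate_before"])
            / (loc["receipt_after"] - loc["receipt_before"]))

# Combine efficacy (not effectiveness) across locations, which avoids
# confounding due to different fractions receiving the intervention.
estimates = [efficacy(loc) for loc in locations]
print("per-location efficacy:", [round(e, 1) for e in estimates])
print("combined estimate:", round(sum(estimates) / len(estimates), 1))
```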
The outcome is incident breast cancer deaths in a time period, that is, deaths from breast cancers that were diagnosed in the same time period. We considered the plausibility of the five basic assumptions of the paired availability design and proposed a novel analysis to accommodate likely violations of the assumption of stable screening effects.
We applied the paired availability design to data on breast cancer screening from six counties in Sweden. The estimated yearly change in incident breast cancer deaths per 100,000 persons ages 40–69 (in most counties) due to receipt of screening (among the relevant type of subject in the potential outcomes model) was -9 with 95% confidence interval (-14, -4) or (-14, -5), depending on the sensitivity analysis.
In a realistic application, the extended paired availability design yielded reasonably precise confidence intervals for the effect of receiving screening on the rate of incident breast cancer death. Although the assumption of stable preferences may be questionable, its impact will be small if there is little screening in the first time period. However, estimates may be substantially confounded by improvements in systemic therapy over time. Therefore the results should be interpreted with care.
There is a common belief among some medical researchers that if a potential surrogate endpoint is highly correlated with a true endpoint, then a positive (or negative) difference in potential surrogate endpoints between randomization groups would imply a positive (or negative) difference in unobserved true endpoints between randomization groups. We investigate this belief when the potential surrogate and unobserved true endpoints are perfectly correlated within each randomization group.
We use a graphical approach. The vertical axis is the unobserved true endpoint and the horizontal axis is the potential surrogate endpoint. Perfect correlation within each randomization group implies that, for each randomization group, potential surrogate and true endpoints are related by a straight line. In this scenario the investigator does not know the slopes or intercepts. We consider a plausible example where the slope of the line is higher for the experimental group than for the control group.
In our example with unknown lines, a decrease in mean potential surrogate endpoints from control to experimental groups corresponds to an increase in mean true endpoints from control to experimental groups. Thus the potential surrogate endpoints give the wrong inference. Similar results hold for binary potential surrogate and true outcomes (although the notion of correlation does not apply). The potential surrogate endpoint would give the correct inference if either (i) the unknown lines for the two groups coincided, which means that the distribution of the true endpoint conditional on the potential surrogate endpoint does not depend on treatment group, which is called the Prentice Criterion, or (ii) one could accurately predict the lines based on data from prior studies.
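A short numerical version of the graphical argument, with made-up lines and group means, makes the reversal explicit:

```python
# Hypothetical perfectly correlated surrogate (s) and true (t) endpoints:
# within each group t is an exact linear function of s, but the lines differ.
def true_endpoint_control(s):       # control-group line (assumed)
    return 1.0 + 0.5 * s

def true_endpoint_experimental(s):  # experimental-group line: steeper slope
    return 0.2 + 2.0 * s

mean_s_control, mean_s_experimental = 2.0, 1.8   # surrogate mean DECREASES

mean_t_control = true_endpoint_control(mean_s_control)
mean_t_experimental = true_endpoint_experimental(mean_s_experimental)
print(f"surrogate difference (exp - control): {mean_s_experimental - mean_s_control:+.2f}")
print(f"true-endpoint difference (exp - control): {mean_t_experimental - mean_t_control:+.2f}")
# The surrogate difference is -0.20 while the true difference is +1.80, so the
# surrogate points in the wrong direction despite perfect within-group correlation.
```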
Perfect correlation between potential surrogate and unobserved true outcomes within randomized groups does not guarantee correct inference based on a potential surrogate endpoint. Even in early phase trials, investigators should not base conclusions on potential surrogate endpoints in which the only validation is high correlation with the true endpoint within a group.
When evaluating cancer screening it is important to estimate the cumulative risk of false positives from periodic screening. Because the data typically come from studies in which the number of screenings varies by subject, estimation must take into account dropouts. A previous approach to estimate the probability of at least one false positive in n screenings unrealistically assumed that the probability of dropout does not depend on prior false positives.
By redefining the random variables, we obviate the unrealistic dropout assumption. We also propose a relatively simple logistic regression and extend estimation to the expected number of false positives in n screenings.
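A minimal sketch of how the two summaries follow from a fitted logistic model is shown below; the coefficients are invented, and dropout and the other covariates handled in the actual analysis are omitted.

```python
import math

def logistic(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical logistic-regression coefficients for the conditional probability
# of a false positive at a given screening: an intercept plus an effect of
# having had a previous false positive.
INTERCEPT, PREV_FP_EFFECT = -3.0, 0.8

def prob_false_positive(previous_fp):
    return logistic(INTERCEPT + PREV_FP_EFFECT * previous_fp)

def summarize(n_screenings):
    """Probability of at least one false positive, and the expected number of
    false positives, over n screenings attended by the same woman."""
    p_no_fp = 1.0            # P(no false positive before the current screening)
    expected = 0.0
    for _ in range(n_screenings):
        # marginal probability of a false positive at this screening
        expected += (p_no_fp * prob_false_positive(0)
                     + (1 - p_no_fp) * prob_false_positive(1))
        p_no_fp *= 1 - prob_false_positive(0)
    return 1 - p_no_fp, expected

for n in (4, 10):
    p_any, mean_fp = summarize(n)
    print(f"{n} screenings: P(at least one false positive) = {p_any:.2f}, "
          f"expected number of false positives = {mean_fp:.2f}")
```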
We illustrate our methodology using data from women ages 40 to 64 who received up to four annual breast cancer screenings in the Health Insurance Plan of Greater New York study, which began in 1963. Covariates were age, time since previous screening, screening number, and whether or not a previous false positive occurred. Defining a false positive as an unnecessary biopsy, the only statistically significant covariate was whether or not a previous false positive occurred. Because the effect of screening number was not statistically significant, extrapolation beyond 4 screenings was reasonable. The estimated mean number of unnecessary biopsies in 10 years per woman screened is 0.11 with a 95% confidence interval of (0.10, 0.12). Defining a false positive as an unnecessary work-up, all the covariates were statistically significant, and the estimated mean number of unnecessary work-ups in 4 years per woman screened is 0.34 with a 95% confidence interval of (0.32, 0.36).
Using data from multiple cancer screenings with dropouts, and allowing dropout to depend on the previous history of false positives, we propose a logistic regression model to estimate both the probability of at least one false positive and the expected number of false positives associated with n cancer screenings. The methodology can be used both for informed decision making at the individual level and for planning of health services.