The most rigorous and valid approach to evaluating cancer screening modalities is the randomized controlled trial, or RCT. RCTs are major undertakings and the intricacies of trial design, operations, and management are generally underappreciated by the typical researcher. The purpose of this chapter is to inform the reader of the “nuts and bolts” of designing and conducting cancer screening RCTs. Following a brief introduction as to why RCTs are critical in evaluating screening modalities, we discuss design considerations, including the choice of design type and duration of follow-up. We next present an approach to sample-size calculations. We then discuss aspects of trial implementation, including recruitment, randomization, and data management. A discussion of commonly employed data analyses comes next, and includes methods for the primary analysis, comparison between the screened and control arms of cause-specific mortality rates for the cancer of interest, as well as for secondary endpoints such as sensitivity. We follow with a discussion of sequential monitoring and interim analysis techniques, which are used to examine the primary outcome while the trial is ongoing. We close with thoughts on lessons learned from past cancer screening RCTs and provide recommendations for future trials. Throughout the chapter we illustrate topics with examples from completed or ongoing RCTs, including the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and the National Lung Screening Trial (NLST).
Screening for the early detection of cancer is regarded by many as an obvious intervention strategy, one that will help alleviate the burden of cancer in populations. What is not sufficiently recognized, however, is that cancer screening interventions are not automatically beneficial, that harms are associated with screening, and that it is important to carefully and rigorously evaluate a screening exam before it is introduced into a population, to ensure that benefits outweigh the harms. It is the purpose of this chapter to briefly discuss aspects of the most rigorous and valid approach to screening evaluation, the randomized controlled trial (RCT), by describing issues in the design, implementation, and analysis of such studies. Examples are provided.
Several designs have been used for cancer screening RCTs. Most studies have employed the traditional two-arm design [1–24], which aims to determine whether the screening intervention results in benefit, that is, a reduction in cause-specific mortality. In the two-arm design, participants in one arm receive the screening exam of interest while those in the other arm serve as controls, receiving either no screening exam as part of the trial or an exam routinely used in the population as a screening modality. Similar designs have been used to address questions about screening frequency, age-specific effectiveness, and the effect of adding one screening modality to another [27–31]. One trial, the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial, is testing multiple regimens by using two study arms. Participants in the intervention arm received screening exams for three cancers: all participants received lung (chest x-ray) and colorectal (flexible sigmoidoscopy) cancer screens, women also received ovarian cancer screens (transvaginal ultrasound and CA-125 evaluation), and men also received prostate cancer screens (digital rectal exam, or DRE, and PSA level evaluation). Participants in the control arm received no trial exams. PLCO provides an excellent example of many of the issues confronted when designing and conducting an RCT.
A major design issue for PLCO involved deciding whether to conduct one trial or four separate trials, one for each cancer site. A comparison of costs and logistics revealed that evaluation of screening modalities for the four cancers using one trial was less costly and more efficient, as one administrative structure and coordinating center could be used. Another design issue involved the evaluation of multiple tests for a single cancer site. Rather than evaluating each test for the same cancer site with a different arm, it was decided that combinations of screening tests would be evaluated. So, for example, DRE and PSA testing were administered together at each screening round, rather than conducting a separate trial for each. The primary reasons for combining modalities were cost constraints and the desire to evaluate the regimen of combined interventions. If a combined regimen does not work, testing the individual procedures would not be warranted. If the combined regimen does work, each test could be independently evaluated in separate RCTs.
Several designs were considered for PLCO. The two primary competitors were the reciprocal control design and the all-versus-none design. The reciprocal control design would have had three arms: one devoted to screening for prostate or ovarian cancer, one to colorectal cancer screening, and one to lung cancer screening. Since screening would have been undertaken for only one cancer site (per gender in the case of the prostate/ovarian arm) in any given arm, participants in the other two arms would have served as controls. This design was ultimately deemed infeasible because of the cost of screening all participants. Furthermore, the reciprocal control design was expected to result in substantial levels of contamination: because all participants were being screened, they would be aware of the other trial-administered screening tests, which they might then seek out. The all-versus-none design, with participants randomized to one of two arms, was therefore chosen. In the spirit of a multiphasic screening endeavor, one arm served as a control, while screens for all cancer sites of interest were administered in the other arm. Use of the all-versus-none design reasonably assumes that the screening tests for each cancer do not detect any of the other cancers of interest, and that the endpoints (death from each of the four cancers) are not related. It was further decided to employ the so-called “stop screen” approach, in which screening is performed for a fixed number of years or screening rounds and then stopped, but follow-up continues for endpoint ascertainment. This approach was chosen because it had been used successfully in breast and colorectal cancer screening trials, and because it is the only design that allows direct assessment of overdiagnosis. Overdiagnosis is the identification through screening of cancers that never would have surfaced clinically in the absence of screening. This phenomenon has been observed repeatedly [35–40] and must now be considered the rule rather than the exception in cancer screening.
In addition to allowing for assessment of overdiagnosis, a stop-screen design is often necessary because resources typically are not available to screen throughout the life of the study. Therefore, the number of screening rounds also must be decided upon at the design stage in this type of trial. However, changes can be and often are made as the study progresses. In PLCO, the initial regimen of four annual screens for PSA and CA-125, later expanded to six screens, was a trade-off between the expected number of screens necessary to produce an effect (should one exist) and available and anticipated resources. In the early years of PLCO, sigmoidoscopy was administered at the first and fourth annual screens, although the fourth annual screen eventually was replaced with a screen at the sixth annual visit to reflect emerging clinical practice. An annual interval between screens is typically chosen in RCTs, as it is the interval most likely to be used once mass screening programs are established in the community. Compared to less-frequent screening, an annual interval increases the likelihood of detection of a broad spectrum of preclinical conditions, thus providing a better representation of the natural history of the cancers under study. A longer interval is less desirable in most instances, as it might allow some rapidly growing lesions, lesions likely to be lethal but which could be cured if found early, to escape detection. An exception to the use of an annual interval is found in endoscopic screening for colorectal cancer: our current understanding of colorectal cancer progression resulted in a five-year screening interval in PLCO.
Duration of follow-up, the time from randomization to cessation of event ascertainment, is another important design parameter. In PLCO, a minimum of 10 years of follow-up was initially chosen to allow for sufficient time for mortality reductions, should any exist, to emerge. Although follow-up intervals of at least 7 years were typical in breast cancer screening trials [1, 41], it was assumed that the longer natural history of prostate cancer, and perhaps other cancers under study, warranted a longer follow-up period. In the National Lung Screening Trial (NLST), modeling of the disease and screening processes resulted in the decision to capture endpoint events over an approximately 7-year period. It must be recognized that these and other design parameters were chosen using the best information available at the time of design and may be subject to change as a result of data gathered during the trial and other information. In the Minnesota FOBT trial, the screening and follow-up periods chosen at the beginning of the trial were ultimately both extended, which gave a mortality reduction the opportunity to emerge.
A crucial element of trial design is calculation of the necessary number of participants, or sample size. Calculation of sample size involves identification and estimation of key design parameters, including hypothesized mortality reduction, expected compliance and contamination, number of screening rounds, and length of follow-up. The approach used for PLCO provides an example of how sample size can be calculated for a cancer screening RCT. For a trial designed to detect a (1 - r) × 100% reduction (0 < r < 1) in the cumulative disease-specific death rate over its duration, the following are defined:
The total number of disease-specific deaths needed for a one-sided α-level significance test with power 1 - β is given by:
The number of participants in the control arm is given by:
It is important that the design parameters requiring estimation be estimated as accurately as possible to ensure the validity of the trial. Such estimates can come from previous studies, special investigations, modeling, or data from cancer registries [19, 33, 42]. In situations where the screening test is available in the community, it is crucial to estimate the likely compliance in both arms. In addition, the underlying event rates in the control arm must account for the likelihood of substantial healthy volunteer bias [43, 44]. Adjustment for this bias when calculating expected event rates is particularly important, as failing to do so will likely result in an undersized and therefore statistically underpowered trial.
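To illustrate how the design parameters above drive trial size, the following is a minimal sketch of a generic calculation. It is not the PLCO formula. It assumes two equal-sized arms, Poisson-distributed deaths, a one-sided test that conditions on the total number of deaths, and a simple linear dilution of the effect by non-compliance and contamination. All function and parameter names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def deaths_required(r, alpha=0.05, power=0.90):
    """Total disease-specific deaths (both arms) needed to detect a rate ratio r.

    Assumes equal-sized arms and conditions on the total number of deaths D,
    so deaths in the screened arm are Binomial(D, r / (1 + r)) under the
    alternative and Binomial(D, 1/2) under the null (one-sided test).
    """
    if not 0 < r < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha * (1 + r) + 2 * z_beta * sqrt(r)) ** 2 / (1 - r) ** 2


def diluted_ratio(r, compliance, contamination):
    """Rate ratio observable after non-compliance and contamination.

    Simplifying assumption: the (1 - r) reduction applies only to the net
    fraction of additional people screened, compliance - contamination.
    """
    return 1 - (1 - r) * (compliance - contamination)


def control_arm_size(r, annual_death_rate, years, compliance=1.0,
                     contamination=0.0, alpha=0.05, power=0.90):
    """Participants needed in the control arm (screened arm assumed equal)."""
    r_obs = diluted_ratio(r, compliance, contamination)
    deaths = deaths_required(r_obs, alpha, power)
    # Expected deaths across both arms: N * years * rate * (1 + r_obs).
    return ceil(deaths / (years * annual_death_rate * (1 + r_obs)))


# Hypothetical inputs: 20% mortality reduction, 5 deaths per 10,000 person-years,
# 10 years of follow-up, 85% compliance, 20% contamination.
n_ideal = control_arm_size(0.8, 0.0005, 10)
n_diluted = control_arm_size(0.8, 0.0005, 10, compliance=0.85, contamination=0.20)
print(n_ideal, n_diluted)
```

Comparing the two printed values shows how quickly imperfect compliance and contamination inflate the required enrollment, which is why these parameters need careful estimation at the design stage.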
Recruitment to cancer screening RCTs is often the most challenging task in such endeavors, as these trials need to enroll tens of thousands of participants during a relatively short time period. The process is simplified when population registers exist (such as in some European countries), as individuals can be recruited from these lists. Such lists generally do not exist in the US, so other approaches are needed. In recent large US trials, the most successful method has been direct mail. With direct mail, rosters of names and addresses from for-profit, not-for-profit, and government organizations are acquired and recruitment materials are mailed to individuals on the rosters. This method was used in the PLCO trial, with a 1% enrollment yield. Other strategies are often necessary to recruit persons from minority populations. Community outreach, which involves the support of community-based organizations such as churches to lend credibility to a study, was used successfully in PLCO.
Individuals who meet the eligibility criteria are randomized to a trial arm. A formal randomization scheme should be used, and may involve stratification by key risk factors. Random assignment is best accomplished using computerized systems as opposed to sealed envelopes, as a computerized system helps to ensure blinded allocation. In PLCO, randomization software and encrypted files were loaded on desktop microcomputers at each screening center. As each participant was successfully randomized into the trial, his or her name, gender, date of birth, and assigned arm were automatically stored in encrypted data tables so data could not be altered. At the same time, a second protected set of synchronized tables, stored on a backup device, was also updated.
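A formal, stratified randomization scheme of the kind described above can be sketched as permuted-block randomization within strata. The block below is an illustrative sketch only and does not reproduce the PLCO software. The class name, the arm labels, the block size, and the stratification factors are all assumptions.

```python
import random
from collections import defaultdict


class StratifiedBlockRandomizer:
    """Permuted-block randomization carried out separately within each stratum."""

    def __init__(self, arms=("screened", "control"), block_size=4, seed=None):
        if block_size % len(arms) != 0:
            raise ValueError("block_size must be a multiple of the number of arms")
        self.arms = tuple(arms)
        self.block_size = block_size
        self._rng = random.Random(seed)
        # stratum -> assignments still unused in the current block
        self._pending = defaultdict(list)

    def assign(self, stratum):
        """Return the arm for the next participant in the given stratum."""
        block = self._pending[stratum]
        if not block:
            # Start a new block with equal numbers of each arm, in random order.
            block.extend(self.arms * (self.block_size // len(self.arms)))
            self._rng.shuffle(block)
        return block.pop()


# Hypothetical stratification by screening center, gender, and age group.
randomizer = StratifiedBlockRandomizer(block_size=4, seed=2024)
stratum = ("center_01", "female", "60-64")
assignments = [randomizer.assign(stratum) for _ in range(8)]
print(assignments)
```

Blocking keeps the arms balanced within each stratum as enrollment proceeds. In practice the assignment sequence would be generated and stored where recruiting staff cannot see upcoming allocations.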
Cancer screening RCTs are typically large, long-term projects that involve collection of vast amounts of data over many years. An automated data management system should be established at the outset, recognizing that hardware and software will need to be updated periodically. In PLCO, the data management system allows for the following: the ability for the trial leadership and the Coordinating Center (CC) to access screening center systems and data remotely, synchronization of databases on multiple platforms, preparation of high-quality analysis datasets, secure backup and archiving of data, and robust configuration management. Data are entered using a distributed data entry system and are transmitted regularly to the CC.
In PLCO, numerous forms were developed for collection of crucial information. These forms included eligibility and consent forms, baseline and mid-trial questionnaires that sought demographic and risk factor information, examination forms that documented the outcome of trial screening exams, diagnostic evaluation and treatment forms, and questionnaires for annual follow-up of participants. Additional forms were developed as data collection needs grew during the course of the trial. Scannable data collection forms were used to prevent typing errors, and quality control checks were built into systems as well to ensure data integrity [33, 48]. With the increasing availability of electronic medical records systems in the US, future trials may have the opportunity to transmit this type of data directly into study systems.
Data elements that should be collected include:
A pilot phase can be very informative when conducted prior to full implementation of large screening trials. A pilot phase provides the opportunity to test trial procedures and assess design assumptions. Pilot phases were conducted with great success in both the PLCO and NLST trials [19, 33]. Long-term commitments to these trials were based on the success of these pilot phase studies, as they demonstrated that each trial was in fact feasible.
In PLCO, specific activities were planned for the pilot phase, which amounted to the first two years of the trial. Major activities involved refinement of protocol components including eligibility criteria, development of mechanisms for notifying participants of screening results and encouraging them to seek diagnostic evaluation for suspicious or positive results, collection of data on diagnostic work-up of participants with suspicious or positive screens, development of procedures for quality control of screening examinations, and development of follow-up procedures to determine cancer incidence and ascertain cause of death. All 10 PLCO screening centers participated in the pilot phase, allowing each screening center (SC) to gain invaluable experience. Furthermore, the pilot phase allowed study leaders to identify procedures that worked, and those that did not.
The major activities carried out by all PLCO SCs during the pilot phase were as follows:
During PLCO’s pilot phase, each SC identified sources from which to recruit participants and implemented recruitment strategies appropriate for its particular population. Trial investigators met to discuss how the trial was progressing and any need for procedural changes. The CC developed trial forms, established data entry and editing systems, and prepared a manual of operations and procedures. Efforts were made to standardize all screening examinations to the extent possible in a study that had multiple screening centers. In addition, performance at each SC was monitored and any deficiencies corrected so as to maintain adequate recruitment and ensure correct administration of activities during the main phase of the trial. Participants randomized during the pilot phase were treated as a vanguard group and included in the trial.
NLST’s pilot phase was called the Lung Screening Study (LSS) Feasibility Phase [49, 50]. It was conducted to address assertions by critics that it would be impossible to conduct an RCT of helical computed tomography (CT) screening for lung cancer because high-risk individuals would not agree to be randomized. This study did in fact randomize an ample number of participants: 3318 within a two-and-a-half-month period at six PLCO screening centers. Participants were randomized to one of two screening modalities: helical CT or single-view chest X-ray. Screens were performed at baseline and one year later. The LSS Feasibility Phase demonstrated convincingly that an RCT of helical CT was feasible. In addition to meeting accrual targets, crossover contamination (in particular, use of screening CT by participants randomized to the chest X-ray arm) was measured and found to be low. The experience from the LSS Feasibility Phase yielded valuable input for the design and conduct of NLST. Unlike the PLCO pilot participants, participants from the LSS Feasibility Phase were not included in NLST.
A data analysis plan should be prepared at the beginning of any screening trial. Of course, this plan is likely to be augmented as new secondary questions and analytic methods arise during the trial. Trial data are analyzed on an ongoing basis guided by the analysis plan, with more detailed analyses typically occurring for special meetings of trial investigators or the trial’s Data and Safety Monitoring Board (DSMB). Commonly used analytical methods include standard descriptive statistics and techniques such as regression, analysis of rates and proportions, contingency table methods, and analysis of survival data. New methods of analysis or modeling are devised as needed.
Typical data analyses include description of characteristics of the enrolled population and comparison of those characteristics across trial arms; estimation of screening test operating characteristics including sensitivity, specificity, predictive value, and false positive rate; calculation of prevalence and incidence of the cancer of interest; description of cancer characteristics including stage, histology, treatment, survival, and method of detection (screen-detected versus interval); identification of prognostic variables; estimation and modeling of lead time; calculation of potential surrogate endpoints such as incidence rates of interval cancers and advanced stage disease; enumeration of complications resulting from screening exams or diagnostic evaluation; and calculation of cause-specific and all-cause mortality. Other so-called “ancillary studies” could investigate other topics of interest, such as quality of life or the cost-benefit ratio of the screening process.
The primary and most important analysis in any cancer screening trial is the comparison of cause-specific mortality rates for the cancer of interest between the screened and control arms on an intent-to-screen basis. In addition, the death rates from other causes and all-cause mortality should be scrutinized to assess the comparability of the randomized populations. Mortality rates are compared using Poisson methods or the log-rank test. These methods assume that mortality rates in the screened and control arms are proportional over time; however, this may not always hold. In such an instance, other statistical methods are used.
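As a simple illustration of a Poisson-based comparison, the sketch below computes the cause-specific mortality rate ratio with a Wald confidence interval on the log scale. This is one common textbook approach and not necessarily the method used in any particular trial. The function name and the input counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def mortality_rate_ratio(deaths_screened, py_screened, deaths_control, py_control,
                         confidence=0.95):
    """Rate ratio (screened / control) with a Wald interval on the log scale.

    Deaths are treated as Poisson counts. Person-years are counted by
    randomized arm regardless of screening actually received (intent-to-screen).
    """
    if deaths_screened <= 0 or deaths_control <= 0:
        raise ValueError("the log-scale interval needs at least one death per arm")
    ratio = (deaths_screened / py_screened) / (deaths_control / py_control)
    se_log = sqrt(1 / deaths_screened + 1 / deaths_control)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = exp(log(ratio) - z * se_log)
    upper = exp(log(ratio) + z * se_log)
    # One-sided p-value for the alternative that screening lowers mortality.
    p_one_sided = NormalDist().cdf(log(ratio) / se_log)
    return ratio, (lower, upper), p_one_sided


# Hypothetical counts: 160 vs 200 deaths over 400,000 person-years per arm.
ratio, (lower, upper), p_value = mortality_rate_ratio(160, 400_000, 200, 400_000)
print(round(ratio, 2), round(lower, 2), round(upper, 2), round(p_value, 3))
```

Because the comparison is by randomized arm, non-compliance and contamination dilute the estimate toward one but do not bias the test of the null hypothesis.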
Secondary variables worthy of special mention include sensitivity, therapy, and harms. Sensitivity is commonly calculated as the ratio of screen-detected cancers to all cancers found in the screened arm for some interval after a screen, typically one year. This approach, however, is biased and will yield an inflated estimate when overdiagnosed cancers are present among the screen-detected cancers. A correct estimate is obtained as one minus the ratio of the interval cancer incidence to the control arm incidence, and can be adjusted for non-compliance in the screened arm if it is believed that screening attendance was correlated with the risk of disease. With regard to therapy, data should be collected on treatment received for persons diagnosed with the cancer of interest in both the screened and control arms. These data then can be examined within each stage of disease to determine whether therapy is comparable between randomized groups, thereby assessing the possibility of a confounding effect of therapy on the impact of the screening.
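The two sensitivity estimates described above can be compared directly. The sketch below uses hypothetical counts and function names, and it omits the adjustment for non-compliance.

```python
def detection_ratio_sensitivity(screen_detected, interval_cancers):
    """Screen-detected cancers as a share of all cancers found within the interval."""
    return screen_detected / (screen_detected + interval_cancers)


def incidence_method_sensitivity(interval_cancers, screened_person_years,
                                 control_cancers, control_person_years):
    """One minus interval-cancer incidence relative to control-arm incidence."""
    interval_rate = interval_cancers / screened_person_years
    control_rate = control_cancers / control_person_years
    return 1 - interval_rate / control_rate


# Hypothetical one-year data: 90 screen-detected and 30 interval cancers in the
# screened arm, 100 cancers in the control arm, 50,000 person-years in each arm.
naive = detection_ratio_sensitivity(90, 30)                       # 0.75
corrected = incidence_method_sensitivity(30, 50_000, 100, 50_000)  # about 0.70
print(naive, corrected)
```

In this made-up example the screened arm yields 120 cancers against 100 in the control arm, so some screen-detected cases may be overdiagnosed. The detection ratio counts them as successes, whereas the incidence method depends only on the interval cancers that screening failed to prevent from surfacing.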
Harms of screening include overdiagnosis, false positive screening exams, complications of the screening procedures administered to trial participants, and complications of diagnostic and treatment procedures administered to trial participants as a result of positive screens and screen-detected cancer, respectively. Complications include any adverse medical event and any deaths potentially related to or occurring as a downstream consequence of trial procedures. False positive exams are quite common, as most individuals with a positive screening test will not be diagnosed with cancer after diagnostic workup. False positive rates therefore can be substantial. Cancer incidence is tracked after the cessation of screening to alert investigators to the possibility of substantial overdiagnosis, as evidenced by a persistent excess of cases in the screened arm relative to the control arm.
A process for regular, formal monitoring of safety issues and evaluation of accumulating data should be established early in the course of a cancer screening trial. This is best accomplished by the creation of a DSMB, a committee composed of experts not associated with the trial who therefore can provide an independent assessment of trial progress. DSMBs were not employed in many early cancer screening trials, but more recent trials have included such a committee [19, 33, 55, 56].
Sequential monitoring is an integral part of the PLCO and NLST trials [19, 33]. In both trials, statistical monitoring guidelines were established for the DSMB to use when examining emerging data. The accruing mortality data and secondary endpoints are examined at least annually to determine if and when a protocol change is warranted, one that would result in an early decision about the efficacy of the screening intervention. The data are analyzed from two perspectives, one addressing the prospect of early termination due to a significantly large effect, and one addressing the prospect of early termination due to a negligible effect. For PLCO, this is done separately for the four cancers of interest.
Statistical group sequential procedures are used in PLCO. To assess whether early termination due to a large effect should occur, the O’Brien-Fleming boundary and the Lan-DeMets sequential procedure are used. The test statistic in PLCO is a weighted log-rank statistic with weights linear in and proportional to the cumulative mortality, where the weights are estimated from the empirical survival function based on the data from both arms. Choice of this combination of boundary and weights was based on power computations conducted using simulation methods. For early termination due to a negligible effect, a stochastic curtailment procedure is used. A similar approach is used for NLST.
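Two quantities commonly used in this kind of monitoring can be sketched briefly: the cumulative type I error released by a Lan-DeMets spending function of the O’Brien-Fleming type, and conditional power for stochastic curtailment. The sketch below uses standard textbook forms and is not the PLCO or NLST implementation. The specific spending function, the Brownian-motion approximation, the convention that positive Z favors screening, and all function names are assumptions.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()


def obf_type_alpha_spent(info_fraction, alpha=0.05):
    """Cumulative type I error spent by information fraction t (0 < t <= 1).

    Lan-DeMets spending function approximating an O'Brien-Fleming boundary:
    alpha(t) = 2 - 2 * Phi(z_{alpha/2} / sqrt(t)).
    """
    if not 0 < info_fraction <= 1:
        raise ValueError("info_fraction must be in (0, 1]")
    z = _N.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _N.cdf(z / sqrt(info_fraction)))


def conditional_power(z_interim, info_fraction, z_final_critical, drift=None):
    """Probability of crossing the final critical value given the interim result.

    Uses the B-value approximation: B(t) = Z(t) * sqrt(t), and
    B(1) - B(t) ~ Normal(drift * (1 - t), 1 - t). If drift is None, the
    current trend B(t) / t is assumed to continue.
    """
    if not 0 < info_fraction < 1:
        raise ValueError("info_fraction must be in (0, 1)")
    t = info_fraction
    b = z_interim * sqrt(t)
    if drift is None:
        drift = b / t
    return 1 - _N.cdf((z_final_critical - b - drift * (1 - t)) / sqrt(1 - t))


# Hypothetical interim looks at 25%, 50%, 75%, and 100% of expected deaths.
for t in (0.25, 0.50, 0.75, 1.00):
    print(t, obf_type_alpha_spent(t))

# Hypothetical interim result: Z = 0.2 at half the information.
print(conditional_power(0.2, 0.5, 1.96))
```

The spending function releases very little error at early looks, so only an extreme early effect would trigger termination for benefit. A low conditional power under the current trend is the kind of evidence a DSMB might weigh when considering termination for a negligible effect.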
A weighted rank statistic was chosen for these and other recent screening and prevention trials [24, 60] to account for anticipated non-proportional cause-specific hazard rates between the trial arms. Non-proportionality can arise early in the trial due to a lag time before the screening benefit, should one exist, emerges. A weighted log-rank statistic down-weights early events, thus placing more emphasis on later events, but preserves the randomized, intent-to-screen comparison. Another potential source of non-proportionality is dilution of the screening effect due to inclusion in the mortality comparison of deaths among cancers that are diagnosed, in both arms, after screening stops in the intervention arm. Methods to account for dilution have been proposed [61, 62] but further research is warranted.
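The down-weighting of early events can be illustrated with a weighted log-rank statistic whose weights equal the pooled cumulative mortality just before each death time (Fleming-Harrington-style weights, 1 minus the pooled Kaplan-Meier estimate). This is a sketch under that assumption and may differ in detail from the weights used in PLCO. The function name and the toy data are hypothetical.

```python
from math import sqrt


def weighted_logrank_z(times, events, in_screened_arm):
    """Weighted log-rank Z with weights equal to pooled cumulative mortality.

    times: follow-up time per participant; events: 1 if death from the cancer
    of interest, 0 if censored; in_screened_arm: 1 for screened, 0 for control.
    Negative Z means fewer deaths than expected in the screened arm.
    """
    data = sorted(zip(times, events, in_screened_arm))
    n_total = len(data)
    n_screened = sum(arm for _, _, arm in data)
    pooled_survival = 1.0
    score = variance = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = deaths_screened = leaving = leaving_screened = 0
        # Collect everyone whose follow-up ends at time t (deaths and censorings).
        while i < len(data) and data[i][0] == t:
            _, event, arm = data[i]
            deaths += event
            deaths_screened += event * arm
            leaving += 1
            leaving_screened += arm
            i += 1
        if deaths:
            weight = 1 - pooled_survival  # cumulative mortality just before t
            share = n_screened / n_total  # fraction at risk in the screened arm
            score += weight * (deaths_screened - deaths * share)
            if n_total > 1:
                variance += (weight ** 2 * deaths * share * (1 - share)
                             * (n_total - deaths) / (n_total - 1))
            # Update the pooled Kaplan-Meier estimate after using its left limit.
            pooled_survival *= 1 - deaths / n_total
        n_total -= leaving
        n_screened -= leaving_screened
    return score / sqrt(variance) if variance > 0 else 0.0


# Toy data: control deaths at years 1-4; screened participants censored at years 5-8.
z = weighted_logrank_z(
    times=[5, 6, 7, 8, 1, 2, 3, 4],
    events=[0, 0, 0, 0, 1, 1, 1, 1],
    in_screened_arm=[1, 1, 1, 1, 0, 0, 0, 0],
)
print(z)
```

Because the weight is zero at the first death time and grows as deaths accumulate, differences that emerge late in follow-up contribute most to the statistic, while the comparison remains by randomized arm.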
Much has been learned from completed and ongoing cancer screening trials. Hopefully this knowledge will lead to improved trial design and analysis. The following is a list of thoughts and recommendations for future endeavors:
A cancer screening RCT is a major endeavor, requiring a long-term commitment by participants, investigators, and funding organizations. If a decision is made to conduct a trial, necessary resources must be committed for the duration of the study. To accomplish this in today’s climate of intense resource competition is difficult at best. One strategy is full commitment sequentially to the primary phases (pilot, recruitment, screening, and follow-up), with funding for each successive phase contingent on successful completion of the previous phase. Regardless of these difficulties, researchers should strive to establish such trials when a new screening modality appears promising, as the alternative, community acceptance of an unproven modality, could result in wasted resources and an overall detriment to public health.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.