Excellent recommendations exist for studying therapeutic and diagnostic questions. We observe that good guidelines on assessment of evidence for screening questions are currently lacking. Guidelines for diagnostic research (STARD), involving systematic application of the reference test (gold standard) to all subjects of large study populations, are not pertinent in situations of screening for disease that is currently not yet present. A five-step framework is proposed for assessing the potential use of a biomarker as a screening tool for cervical cancer: 1) correlation studies establishing a trend between the rate of biomarker expression and severity of neoplasia; 2) diagnostic studies in a clinical setting where all women are submitted to verification by the reference standard; 3) biobank-based studies with assessment in archived cytology samples of the biomarker in cervical cancer cases and controls; 4) prospective cohort studies with baseline assessment of the biomarker and monitoring of disease; 5) randomised intervention trials aiming to observe reduced incidence of cancer (or its surrogate, severe dysplasia) in the experimental arm at subsequent screening rounds.
The five-phase framework should guide researchers and test developers in planning the assessment of new biomarkers, and should protect clinicians and stakeholders against premature claims for insufficiently evaluated products.
The rationale of cervical cancer cytological screening is to identify and treat high-grade cervical intraepithelial neoplasiaa (CIN) (precancerous lesion) to prevent its progression to invasive cancer4. Programme sensitivity is a convenient metric for assessing cancer reduction and population effectiveness, although it does not account for the impact of false positives on cost-effectiveness, the negative consequences of over-screening, or the occurrence of side effects 5. Programme sensitivity depends on the sensitivity of the chosen screening test, the compliance with further follow-up, the sensitivity of triage and diagnostic work-up, the natural history of the disease, and the screening policy (the target age group, screening interval, clinical thresholds for follow-up and treatment) 6. The essential elements in the natural evolution of the disease are the rates of onset of precursor lesions, the progression and regression rates of these precursor lesions and the distribution of their sojourn times. The mean sojourn time (the period from detectability of a lesion until it develops into a clinically manifest cancer) is generally believed to be of the order of 10 years or more with cytology, and the probability of detection increases as the preclinical phase progresses 7,8. Sojourn times of cancer precursors are usually not observable because of treatment and can therefore only be estimated by modelling. A unique (unethical) experience in New Zealand, where CIN3 lesions were left untreated, allowed observation of the natural history. The 30-year cumulative incidence of invasive cancer was 30% among women with CIN3 and 50% among women with persistent CIN3. Because of the long natural history of precursors, repetition of a moderately sensitive screening test, such as the Pap smear, can achieve high programme sensitivity and thereby reduce incidence of and mortality from cervical cancer to a low residual level9.
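The argument that repeating a moderately sensitive test yields high programme sensitivity can be illustrated numerically. The sketch below is not part of the original analysis. It assumes that a lesion stays detectable throughout its sojourn time and that test errors are independent between rounds, so that the chance of detection in at least one of k rounds is 1 - (1 - s)^k. The function name and the single-test sensitivity of 0.5 are illustrative assumptions.

```python
def programme_sensitivity(test_sensitivity: float, rounds: int) -> float:
    """Probability that a persistent lesion is detected in at least one round.

    Simplifying assumptions (not from the source): test errors are independent
    between rounds and the lesion stays detectable for the whole period.
    """
    if not 0.0 <= test_sensitivity <= 1.0:
        raise ValueError("test_sensitivity must lie in [0, 1]")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    # Complement of "missed at every round".
    return 1.0 - (1.0 - test_sensitivity) ** rounds


if __name__ == "__main__":
    # Hypothetical single-test sensitivity of 50% over one to three rounds.
    for k in (1, 2, 3):
        print(k, round(programme_sensitivity(0.5, k), 3))
```

In practice errors in successive rounds are unlikely to be fully independent, so the real gain from repetition may be smaller than this idealised calculation suggests.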
The International Agency for Research on Cancer estimated that well-organised cytological screening for cervical cancer precursors every 3–5 years between the ages of 35 and 64 years reduces the incidence of cervical cancer by 80% or more among the women screened 8,10,11b. The success of screening depends essentially on the participation of the target population, the quality of the screening test, the compliance with follow-up and the efficacy of treatment of screen-detected lesions. The efficiency of screening decreases in subsequent rounds because successive sensitive screening followed by appropriate therapy reduces the prevalence of precursors over time. The lesions still found are smaller lesions with less invasive potential.
The cross-sectional test accuracy of cervical cytology is highly variable since it depends on the availability of an adequately collected and prepared sample taken from the transformation zone and well-trained and motivated cyto-technicians for microscopic interpretation of the morphologic changes. By good quality assurance, a reasonably high sensitivity for high-grade CIN can be reached (>70%) but low sensitivity values (<50%) are not exceptional 12.
Because of the low sensitivity reported in several settings, alternative screening methods have been developed. We can distinguish four new methods of screening: a) alternative forms of cytology, e.g. liquid-based cytology [see ref13 for a systematic review] and automated or computer-assisted cytology; b) molecular detection of DNA or RNA of high-risk types of human papillomavirus (HPV), the virus causing cervical cancer14; c) biomarkers associated with a progressive HPV infection, such as immuno-staining of certain cell cycle regulating proteins whose expression has been altered or, possibly in the future, proteomic, transcriptomic or methylomic signatures of transforming HPV infections 15,16; and d) biophysical changes identifiable by spectroscopy17,18. In the rest of the paper, we discuss how such new techniques should be evaluated using, where possible, established methods to assess evidence of efficacy.
We first propose a methodology to rank evidence from published studies already performed. Subsequently, we propose a comprehensive framework for setting up new studies through which evaluation of biomarkers should pass to generate evidence on their potential application as a cancer screening test.
A list of indicators for screening effectiveness, assessed by different study methods, is enumerated in Table 1 and ranked from high to low according to the level of evidence that such studies provide.
Randomised clinical trials (RCTs) designed to demonstrate a reduction in invasive cervical cancer provide the highest level of evidence of efficacy of screening. Observation of a lower incidence of cervical cancer in the trial arm where a new screening test is applied provides the proof that the new method (including the management of screen positives) is more effective than the control method. Nevertheless, conducting such studies requires enormous financial resources and huge study populations followed for many years, and carries a high risk of contamination between the experimental and control armsc. Moreover, during the lengthy interval needed to validate the new method, it may cease to be available or become obsolete. Therefore, it is often proposed to study intermediate or surrogate outcomes (for instance outcomes 4 to 6 in Table 1) and to simulate the most likely outcomes relevant to public health using mathematical models. CIN3 is the direct precursor of invasive cancer and therefore, reduced incidence of CIN3+ is considered an acceptable proxy outcome of trials evaluating new preventive strategies19,20. Prospective cohort studies do not yield results more rapidly than randomised trials and suffer from several potential biases. Retrospective evaluation of previously identified cohorts can speed up evaluation but does not reduce bias. Case-control studies comparing screening histories in women with and without cervical cancer are appropriate to evaluate effectiveness retrospectively but are also prone to several selection and information biases. Changes over time (secular trends) or geographical differences in incidence or mortality can be interpreted as screening effects but can only be accepted as an indication of screening effectiveness when no other factors can plausibly explain the observed changes.
It must be stressed that the aim of screening is to prevent cervical cancer, not simply detect pre-invasive lesions. A new screen test allowing detection of more high-grade CIN does not necessarily result in more pronounced reduction of cancer incidence since just additional non-progressive lesions might be detected.
For screening, an accurate test is needed21: this means that it is positive when CIN2/3 is present and negative when CIN2/3 is not present. In other words, a screen test must have a good clinical sensitivity and specificity. The severity of CIN must be explicitly defined when assessing the accuracy of a test. CIN1 is the histopathologic manifestation of a carcinogenic or non-carcinogenic HPV infection that rarely progresses on a per event basis to cancer 22,23. Its detection is not clinically useful, possibly leading to over-treatment, and should not be targeted by any screening test. On the other hand, CIN2 and especially CIN3 indicate a considerable risk of developing cancer and should therefore not be missed by a screen test. CIN2 is an intermediate condition, which contains overcalled CIN1 (caused by both carcinogenic and non-carcinogenic HPV), and under-called CIN3 24–28. CIN2 is a more regressive29 and less reproducible histological diagnosis than CIN328. Thus, while a CIN2 diagnosis is typically the clinical threshold for triggering excisional or ablative treatment, its inclusion as an endpoint for evaluation of a screening test may exaggerate the overall impact of a screening test. The observation that a new screen test is more sensitive than the conventional test in detecting CIN3 provides more convincing evidence that its use in screening will result in a higher reduction in cancer incidence than the detection of CIN2/3, which can be artificially elevated due to the detection of low-risk CIN2 destined to regress (over-diagnosis). Whether detection of more CIN2 with a new method corresponds (at least partly) with either progressive or regressive disease, cannot be assessed from cross-sectional studies. However, observing, at the second screening round among women with a negative first screen test, less CIN3+, in the experimental compared to the control arm of a trial, indicates that at least a part of the additionally detected CIN2 was not regressive. 
The excess of CIN2+ cases in the experimental arm over the conventional arm, cumulated over the first and second screening rounds, represents a measure of over-diagnosis, not a measure of efficacy.
Therefore, we recommend that future authors report cross-sectional accuracy separately for CIN2+ and CIN3+.
The most comprehensive design for evaluating the cross-sectional accuracy of screen tests is the independent application of all the tests to a screening population followed by verification in all study subjects, irrespective of the screen test results, using a valid gold standard assessed without prior knowledge of the screen test results. Under these conditions unbiased estimation of the test sensitivity and specificity is possible. We invite readers to consult STARD guidelines30 for good diagnostic research and QUADAS guidelines for evaluation of the quality of individual studies included in systematic reviews of diagnostic studies31.
Often, even in a research context (because of cost and/or ethical concerns), only women with positive screen tests and none or only a few with negative screen tests are verified. This situation results in verification bias, yielding inflated sensitivity and underestimated specificity. Nevertheless, if multiple tests are evaluated and at least one test is very sensitive, the extent of verification bias is reduced, because virtually all women with CIN2/3 or CIN3 undergo diagnostic evaluation. Verification bias can be adjusted for if a random fraction of screen-negatives is referred for the application of the gold standard32–36. Long-term follow-up can also be used to capture missed disease29.
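The direction of verification bias, and the adjustment based on a random sample of screen-negatives, can be sketched as follows. This is a minimal illustration with hypothetical counts, not a method taken from the cited references: it assumes that all screen-positives are verified and that screen-negatives are sampled completely at random with a known probability, so that each verified screen-negative can be weighted by the inverse of that probability. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class VerifiedCounts:
    """Gold-standard results among verified women (hypothetical data)."""
    tp: int  # screen-positive, disease confirmed
    fp: int  # screen-positive, disease excluded
    fn: int  # verified screen-negative, disease found
    tn: int  # verified screen-negative, disease excluded


def naive_accuracy(c: VerifiedCounts) -> tuple[float, float]:
    """Sensitivity and specificity computed from verified women only."""
    return c.tp / (c.tp + c.fn), c.tn / (c.tn + c.fp)


def corrected_accuracy(c: VerifiedCounts, neg_fraction: float) -> tuple[float, float]:
    """Inverse-probability weighting of verified screen-negatives.

    Assumes every screen-positive is verified and screen-negatives are
    verified at random with probability neg_fraction.
    """
    if not 0.0 < neg_fraction <= 1.0:
        raise ValueError("neg_fraction must lie in (0, 1]")
    weight = 1.0 / neg_fraction
    fn, tn = c.fn * weight, c.tn * weight
    return c.tp / (c.tp + fn), tn / (tn + c.fp)


if __name__ == "__main__":
    # Hypothetical study: all 560 screen-positives and a random 10% of
    # screen-negatives receive the gold standard.
    counts = VerifiedCounts(tp=60, fp=500, fn=4, tn=940)
    print("naive:    ", naive_accuracy(counts))
    print("corrected:", corrected_accuracy(counts, neg_fraction=0.10))
```

With these invented counts the naive sensitivity is far higher, and the naive specificity far lower, than the weighted estimates, which is the pattern described above.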
When 2 screen tests are applied to the same study subjects and all subjects, positive for one or both tests, are verified with an acceptable gold standard, unbiased estimation of the test positive predictive value, the relative sensitivity and detection rate of true positives is possible37,38d,e. Thus, while the true absolute sensitivity cannot be determined, test performance can be ranked in an unbiased fashion. The same is true for randomised clinical trials, where different tests are applied to subjects in two or more study arms. For this reason, we believe that the Cochrane Collaboration should consider including such studies in systematic reviews (see further below). The reader should be warned that correction for verification bias by additional verification of test negative cases can yield erroneous results (sometimes even more biased than the original verification bias) if subjects are not selected at random, see ref39 for an example.
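Why the relative sensitivity remains estimable can be seen from a small calculation: both sensitivities share the same unknown denominator (all diseased women, including those missed by both tests), so it cancels in the ratio. The sketch below uses hypothetical counts and illustrative names, and assumes that every woman positive on at least one test is verified.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Verified women with one pattern of screen test results."""
    diseased: int      # gold standard positive
    not_diseased: int  # gold standard negative


def compare_tests(n_screened: int, both: Cell, a_only: Cell, b_only: Cell) -> dict[str, float]:
    """Relative sensitivity, PPV and detection rate for tests A and B.

    Assumes every woman positive on at least one test is verified.
    """
    tp_a = both.diseased + a_only.diseased
    tp_b = both.diseased + b_only.diseased
    pos_a = tp_a + both.not_diseased + a_only.not_diseased
    pos_b = tp_b + both.not_diseased + b_only.not_diseased
    return {
        # The unknown total number of diseased women cancels in this ratio.
        "relative_sensitivity_a_vs_b": tp_a / tp_b,
        "ppv_a": tp_a / pos_a,
        "ppv_b": tp_b / pos_b,
        "detection_rate_a": tp_a / n_screened,
        "detection_rate_b": tp_b / n_screened,
    }


if __name__ == "__main__":
    # Hypothetical counts for 10,000 screened women.
    result = compare_tests(
        n_screened=10_000,
        both=Cell(diseased=50, not_diseased=100),
        a_only=Cell(diseased=30, not_diseased=400),
        b_only=Cell(diseased=10, not_diseased=200),
    )
    for name, value in result.items():
        print(f"{name}: {value:.4f}")
```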
When the prevalence of disease is low (which always is the case in a screening setting) and only test-positive cases are verified, an approximated test specificity can be computed, (see formula).
This approximated test specificity does not suffer from verification bias.
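The formula referred to above is not reproduced in this text. One approximation that fits the description, offered here as an assumption rather than as the authors' exact expression, divides the number of test-negative women by the number of screened women who are not confirmed true positives: (N - T+) / (N - TP). Every term is observed even when only test-positives are verified, and when prevalence is low the unobserved false negatives barely change the ratio. The function name and the counts are hypothetical.

```python
def approximate_specificity(n_screened: int, n_test_positive: int, n_true_positive: int) -> float:
    """Approximate specificity when only test-positives are verified.

    Assumed form (not confirmed by the source): test negatives divided by
    all screened women minus the confirmed true positives. The unknown
    false negatives are neglected, which is reasonable only at low prevalence.
    """
    if not 0 <= n_true_positive <= n_test_positive <= n_screened:
        raise ValueError("expected 0 <= true positives <= test positives <= screened")
    if n_screened == n_true_positive:
        raise ValueError("no women without confirmed disease")
    return (n_screened - n_test_positive) / (n_screened - n_true_positive)


if __name__ == "__main__":
    # Hypothetical study: 10,000 screened, 560 test-positive, 60 confirmed.
    print(f"{approximate_specificity(10_000, 560, 60):.4f}")
```

With these invented counts, and supposing 40 cases were missed by the test, the true specificity would be 9400/9900, which differs from the approximation only in the fourth decimal place.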
The reliability or reproducibility of a test, including intra-batch and inter-batch reproducibility as well as intra-laboratory and inter-laboratory reproducibility, expresses the capacity to obtain the same test result (correct or not) when the screening test is repeated on the same individual. The reliability depends on the definition of distinct test criteria that can be applied by skilled personnel. Poor reproducibility automatically yields low average sensitivity and specificity. Reproducibility can be enhanced by training. Evaluation of new screening tests requires reproducibility experiments, preferably under field conditions.
Assessment of the gold standard with knowledge of the screen test result carries a serious risk of overestimating both sensitivity and specificity. Therefore, in diagnostic research, where the objective is to evaluate the cross-sectional accuracy of a screen test, verification should be performed independently. This can be difficult when the screen test and the gold standard are based on the same principle, for instance in the case of VIA screening (visual inspection of the cervix after application of acetic acid), validated using colposcopy 40–42.
It is usually assumed that histological examination of material obtained by colposcopically directed biopsy, loop excision or endocervical curettage, and – in absence of biopsy - a negative colposcopic impression provide a valid ascertainment of the true disease status. Recent data indicate that this assumption might not always be true 43,44.
Colposcopy performance has been challenged by results from prospective studies suggesting that up to 50% of prevalent precancers may be missed during colposcopy 45. The visual assessment of the cervix in colposcopy has a high inter-observer variability 46,47. It has been demonstrated that the sensitivity of colposcopy is not related to the experience of the colposcopist, but to the number of biopsies taken43. Substantial disease has been identified in random biopsies from normal-appearing regions of the cervix48. Again, follow-up can be used to compensate partially for the lack of sensitivity of colposcopy. As a consequence, one-time colposcopically directed biopsy as it has been practised should be considered an imperfect reference standard.
Currently, studies are underway that aim at analyzing better colposcopic procedures and at determining how many biopsies are necessary to improve disease ascertainment. Meanwhile, a combined endpoint including histology and cytology results can improve the disease ascertainment 49.
Once again, it must be repeated that the observation of increased cross-sectional sensitivity of a new test for histologically confirmed CIN2/3 or CIN3 does not necessarily imply that its inclusion in a screening programme will yield a reduction in incidence of lethal cervical cancer with respect to conventional cytological screeningf. Nevertheless, when biological and epidemiological arguments justify the assumption that the lesions detected in excess by the new method have a substantial chance of progression (acceptable longitudinal positive predictive value) and that screen negatives have a substantially lower chance of developing cancer in the future (higher longitudinal negative predictive value), evaluation of the new test in a randomised population-based trial in an organised setting can be considered50. Audits of screening effectiveness, including linkages with screening and cancer registries that allow picking up missed disease detected beyond the timelines of studies, are a particularly useful tool of evaluation51,52. Finally, simulation models should help in identifying the best choices and in pointing to the most influential issues to be addressed in future studies.
So far we have essentially considered programme effectiveness, stressing test sensitivity. Cervical cancer screening involves large populations and can therefore be extremely costly. Costs are mostly determined by the test cost and specificity. An overview of the cost components attributed to screening is presented in Table 2.
Since the prevalence of progressive cervical precursors is very low, the number of false-positive cases results from the false-positive rate applied to nearly the entire target population. Therefore, even a small decrease in specificity can have serious consequences for costs if the next step involves a complicated or invasive procedure. Nevertheless, the loss in specificity of a screen test can be limited by lengthening the screening interval, by raising the age at onset of screening and by raising the cut-off for test positivity. Mathematical models can be used to estimate the final outcome per unit of cost, but rely on accurate estimates of the screening performance, which are not always available.
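The effect of a small loss of specificity on the number of women referred unnecessarily can be made concrete with simple arithmetic. The sketch below uses invented figures (100,000 women screened, a prevalence of 0.5%, specificities of 97% and 95%); none of these values comes from the studies discussed here, and the function name is illustrative.

```python
def expected_false_positives(n_screened: int, prevalence: float, specificity: float) -> float:
    """Expected number of disease-free women with a positive screen test."""
    if not (0.0 <= prevalence <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("prevalence and specificity must lie in [0, 1]")
    disease_free = n_screened * (1.0 - prevalence)
    return disease_free * (1.0 - specificity)


if __name__ == "__main__":
    # Hypothetical population and test characteristics.
    n, prev = 100_000, 0.005
    fp_97 = expected_false_positives(n, prev, 0.97)
    fp_95 = expected_false_positives(n, prev, 0.95)
    print(f"specificity 97%: {fp_97:.0f} false positives")
    print(f"specificity 95%: {fp_95:.0f} false positives")
    print(f"additional referrals: {fp_95 - fp_97:.0f}")
```

Under these assumptions a two-point drop in specificity adds roughly 2,000 false-positive results per 100,000 women screened, against about 500 women who actually have the disease.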
The Cochrane Collaboration is a world-wide not-for-profit and independent organisation, dedicated to making up-to-date, accurate information about the effects of healthcare readily available worldwide. It produces and disseminates systematic reviews of healthcare interventions and promotes the search for evidence in the form of clinical trials and other studies of interventions. The Cochrane Collaboration essentially addresses therapeutic questions or effects of interventions, assessed by randomised clinical trials (conducted following the rules of good research practice: CONSORT guideline53), and has developed a rigorous method for assessing and pooling such trials (based on the QUOROM guidelines)54. In 2007, at the Cochrane Colloquium in São Paulo, the Cochrane Diagnostic Test Accuracy Working Group officially launched the implementation of systematic reviews of diagnostic test accuracy in its Library. The original studies should involve testing subjects for the presence of a target disease with two (or more) tests (for instance a conventional and a new test) and, subsequently, submitting all tested subjects to verification with a valid gold standard method (STARD guideline)30. All tests should be applied independently and nearly simultaneously, in a setting representative of the situation where the tests will be used. The hierarchical summary ROC curve analysis is an adequate statistical tool that allows summarising accuracy estimates while accounting for the intrinsic negative correlation between sensitivity and specificity corresponding with different test cut-offs55. In the evaluation of a new biomarker as a potential screening method, it is often infeasible, impractical and even unethical to apply the gold standard (for instance excision biopsies). Moreover, such ‘gold standard’ verification may be unreliable when the target disease is not yet detectable or when the procedure detects lesions which have a high chance of spontaneous regression (over-diagnosis).
We agree that strict application of the Cochrane methodology for reviewing and the STARD guidelines30 for original diagnostic studies will result in tremendous improvements of the quality of the research on diagnosis for current clinical disease. Nevertheless, more appropriate methods and longitudinal study designs are needed for screening studies aimed at identifying cancer precursors, where the target disease is not yet developed and where management is restricted to screen-positive subjects. The conceptual five-step evaluation process (see Table 3, below) will be of guidance as a paradigm for screen test evaluation56. In particular, biobank-based case-control studies exploring presence of biomarkers in samples, collected years to decades before the outcome, can provide a powerful research tool, but still require investigations with respect to feasibility. We refer readers to a more extensive discussion of the use of stored cervical cytology samples as a resource for molecular epidemiology 57.
It is the intention of the authors to work out this conceptual model for cervical cancer screening including triage of screen-positive women. A major outcome would be a concept and guideline for the design and conduct of biobank case-control studies as also proposed recently by Pepe et al 58. This concept will require thorough discussion and levels of approval by international methodologists.
As one example, the triage of LSIL (and its equivalent, hr HPV-positive ASCUS) offers an interesting opportunity to evaluate the capacity of biomarkers to distinguish between regressing and progressing abnormalities using a biobank-based design. High-risk (hr) HPV testing is considered insufficiently specific59,60. One could select prior cases of LSIL archived in the biobank and follow these up with repeat testing and registration (different algorithms are possible). After two or more years certain cases will have progressed and others regressed. Subsequently, one can retrieve the stored original LSIL samples from cases that progressed to high-grade CIN and from matched disease-free controls and apply one or more biomarker assays. When the new biomarker assay requires fresh samples, such biobank-based studies must be designed prospectively with concealed testing at baseline 58.
Cervical cancer screening using detection of DNA of hrHPV types passed through all phases of evaluation (as listed in Table 3), although some RCTs are still running. It was already known for many years that hrHPV testing is more sensitive but less specific than cervical cytology 61. More recently, randomised population-based trials have demonstrated that hrHPV-negative women older than 30–34 years, are at 47–71% lower risk of developing CIN3 or worse (CIN3+) than women who have a negative Pap smear over the next 5 years 62–64. This reduction in the CIN3+ burden can be regarded as a proxy for reduced incidence of invasive cancer14. A large RCT, conducted in India, demonstrated lower incidence of and mortality from cervical cancer in women testing HPV-negative compared to not-screened women, in contrast to women screened with visual inspection or cytology65.
HPV infection is common but usually transient. Reaching high sensitivity for detection of underlying high-grade CIN requires inclusion of all high-risk types in the assays, which inevitably reduces specificity because it includes weaker carcinogenic HPV genotypes 25. Therefore, when HPV-based screening for cervical cancer is considered, the challenge will be to identify appropriate triage algorithms that limit the burden of hrHPV positive women needing follow-up. Cytology triage is one possibility 66,67. Biomarkers which are widely expressed in transforming infections could also fulfil this role 68,69. Biomarkers can also be used to triage low-grade or borderline cytology60,70, when cytology is used for primary screening.
A recent meta-analysis (including mainly phase 1 studies) summarised the correlation between p16INK4a (abbreviated as p16) over-expression and the severity of squamous cytological lesions, and demonstrated a high variation in the proportion of p16 positives (ranging between 10% and 100% in ASCUS [atypical squamous cells of undetermined significance] and between 24% and 86% in LSIL [low-grade squamous intraepithelial lesions]), underlining the lack of standardisation in immuno-staining, interpretation and reporting 71. Nevertheless, in experienced hands and using clearly defined criteria, p16 immuno-staining has shown excellent results, with sensitivities for CIN2+ similar to hrHPV testing60, remarkably lower positivity rates (27% in ASCUS, 24% in LSIL) and consequently substantially higher specificities (84% and 81% in ASCUS and LSIL, respectively) (one phase 2 study)72.
Currently, we must acknowledge the lack of good triage studies comparing p16 with currently used alternative strategies to triage equivocal cytological results. Concerning triage of hrHPV positive women, we note only one recent Italian study where hrHPV testing followed by p16-enhanced cytology showed a higher sensitivity for high-grade CIN and similar referral rate to colposcopy compared to primary screening by non-stained conventional cytology73. Pepe did not include triage studies in the framework of ranking evidence for efficacy of screening (Table 3). We propose to consider triage studies as providing evidence of level 2, if designed as a diagnostic study with concurrent gold standard assessment. Randomisation of two or more triage options including longitudinal outcome assessment (via screening and cancer registries, or via systematic gold standard assessment 2–3 years after triage testing) should be classified at a superior level (2+ level).
The question whether sufficient evidence exists to recommend p16-immunostaining as an alternative primary cervical cancer screening method must be answered negatively (many phase 1 studies71, a small number of pending phase 2 studies [C. Bergeron, personal communication], and one trial targeting p16-triage of HPV positive women [phase 2+]73). Yet these promising results warrant further evaluation by more powerful and well-designed studies (of higher phases).
In order to explore the potential of p16 over-expression as a progression marker in triage, we propose to set up an international workshop to standardise issues of sample processing and to define clear criteria for categorising levels of positivity60. In Table 4, we propose a comprehensive set of studies that are needed to demonstrate the performance of p16 testing in screening.
This question intrigues not only the developers of new assays but also the public and health policy makers who wish to avoid dependence on a single manufacturer. It is agreed that lower-level evidence can be accepted for systems similar to those for which sufficient evidence of efficacy is already available.
Liquid-based cytology and/or automated cytology could be accepted as an alternative to conventional cytology if at least equal sensitivity and/or specificity or, preferably, superior sensitivity and equal specificity, or equal sensitivity and superior specificity, using CIN2+ as outcome, can be demonstrated in a screening population. This can be achieved through a cross-sectional study with double testing (conventional and new assay), blind interpretation of both assays and blind verification of subjects with cytological abnormality according to standard follow-up algorithms. A preferred alternative is the randomised trial, where colposcopists and histologists are blinded to the type of screen test. Examples are the RCT currently being conducted in the Netherlands, comparing liquid-based and conventional cytology 74, and that conducted in Italy 75. In case of comparable accuracy, other elements, such as the proportion of unsatisfactory preparations, reading time, possibility of ancillary testing and costs should be considered, which can be done through a decision analysis.
Accepting that screening using HC2 or GP5/6+ PCR significantly reduces the prevalence of CIN3+ 14,64g, experts recently proposed that a new high-risk HPV test should reach a minimum relative sensitivity of at least 0.90 and a relative specificity of at least 0.98, using HC2 as comparator test and CIN2+ as threshold for disease. Moreover the new test should be highly reproducible (agreement>87%, minimum 500 samples)76.
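The thresholds quoted above can be written as a simple check. The sketch below compares point estimates only and is an illustration, not the procedure described in the cited guideline: a formal assessment would also have to account for sampling uncertainty and for the minimum number of samples. The function name, constant names and input values are hypothetical.

```python
MIN_RELATIVE_SENSITIVITY = 0.90  # new test vs comparator, for CIN2+
MIN_RELATIVE_SPECIFICITY = 0.98  # new test vs comparator, for CIN2+
MIN_AGREEMENT = 0.87             # reproducibility must exceed this value


def meets_hpv_test_criteria(
    sens_new: float,
    sens_ref: float,
    spec_new: float,
    spec_ref: float,
    agreement: float,
) -> bool:
    """Point-estimate check of a new hrHPV test against a comparator test.

    Illustrative only: ignores confidence intervals and sample size.
    """
    relative_sensitivity = sens_new / sens_ref
    relative_specificity = spec_new / spec_ref
    return (
        relative_sensitivity >= MIN_RELATIVE_SENSITIVITY
        and relative_specificity >= MIN_RELATIVE_SPECIFICITY
        and agreement > MIN_AGREEMENT
    )


if __name__ == "__main__":
    # Hypothetical accuracy estimates for CIN2+.
    print(meets_hpv_test_criteria(0.93, 0.96, 0.90, 0.91, agreement=0.92))
    print(meets_hpv_test_criteria(0.80, 0.96, 0.90, 0.91, agreement=0.92))
```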
Research on other new markers, based on molecular processes associated with carcinogenesis, should undergo all phases of evaluation. Possible applications of p16 immuno-cytochemistry, mRNA testing and HPV genotyping to secondary cervical cancer prevention are passing through the hierarchical path of generating evidence, unfortunately not always following the logical framework outlined in Table 3. Triage of women with LSIL is a particularly pertinent research field for molecular biomarkers since neither hrHPV testing nor repeated cytology appears to be sufficiently discriminatory to find underlying or incipient relevant disease77.
The expected reduction in background risk of several cancers brought about by future HPV vaccination will be an additional dimension that must be integrated in the search for screening methods with an acceptably high predictive value78,79. In fact, screen and follow-up strategies with high positive predictive value are also needed in well-screened populations, where over time, prevalent, large CIN3 with significant invasive potential will be preferentially detected and eliminated, leaving fewer CIN3 that have lower invasive potential. It is the intention of the authors to assist the research community by offering advice on future straightforward study designs. The environment of the Cochrane Review Collaboration, involving cooperation with methodologists in diagnostic research, clinicians and clinical epidemiologists, could offer a fruitful forum to realise the ambition of assessing current and future evidence for cervical cancer prevention strategies.
Financial support was received from: (1) The Belgian Foundation Against Cancer, Brussels, Belgium; (2) the Gynaecological Cancer Cochrane Review Collaboration (Bath, United Kingdom); (3) the European Commission (Directorate of SANCO, Luxembourg, Grand-Duchy of Luxembourg) through the ECCG (European Cooperation on development and implementation of Cancer screening and prevention Guidelines, IARC, Lyon, France) and the European Research EUROCOURSE (Optimisation of the Use of Registries for Scientific Excellence in research) Network, funded by the 7th Framework programme through the Comprehensive Cancer Centre South (Eindhoven, The Netherlands); (4) IWT (Institute for the Promotion of Innovation by Science and Technology in Flanders) through the Unit of Health Economics and Modelling Infectious Diseases, Vaccine & Infectious Disease Institute, University of Antwerp (project number 060081).
aIn this paper “CIN” (cervical intraepithelial neoplasia) is used for histologically confirmed lesions, while the SIL (Bethesda) terminology is used to describe cytological findings, as recommended in recent international guidelines 1–3.
bIt must be remarked that this estimate implies 100% compliance of screened women and that cancer occurring in women with lesions when screening starts are excluded from the estimate of 80% reduction.
cContamination means that study subjects enrolled in a trial arm do not follow the procedures foreseen in the study protocol. For instance, women randomised to screening with cytology are screened with an HPV test in the context of opportunistic screening.
dThe same is true when different tests are studied in different populations as long as the prevalence of disease can be assumed to be the same (e.g. in randomised trials) 4.
eWhen not all screen-positives are verified and the selection of verified positive cases is not random, verification bias still can occur at the level of the PPV, detection rate and relative sensitivity.
fIt is important to distinguish cross-sectional and longitudinal accuracy parameters. Increased detection with a new test of CIN2 that will largely regress, will result in a higher cross-sectional sensitivity which is clinically not useful (over-diagnosis). In contrast, a screen-positive woman who, currently, does not have colposcopically visible CIN can develop a high-grade CIN2 in the future. Such a case may initially be classified as false-positive, only to be re-classified subsequently as a true-positive with longitudinal surveillance.
gLevel of evidence (see Table 1): outcome: reduction of CIN3+ (level 3); study type: RCT (level 1).