Histopathologic diagnosis of cervical biopsies determines clinical management of patients with an abnormal cervical cancer-screening test yet is prone to poor inter-observer reproducibility. Immunohistochemical staining for biomarkers related to the different stages of cervical carcinogenesis may provide objective standards to reduce diagnostic variability of cervical biopsy evaluations but systematic, rigorous evaluations of their potential clinical utility are lacking. To address diagnostic utility of HPV L1, p16INK4a, and Ki-67 immunohistochemical staining for improving diagnostic accuracy, we conducted a community- and population-based evaluation using 1455 consecutive cervical biopsies submitted to the Department of Pathology at the University of Virginia during a period of 14 months. Thin-sections of each biopsy from 1451 of 1455 (99.7%) biopsies underwent evaluation of immunohistochemical stains for three biomarkers, masked to the original diagnosis, and the results were compared to an adjudicated, consensus diagnosis by 3 pathologists. p16INK4a immunostaining, using the strongest staining as the cutpoint, was 86.7% sensitive and 82.8% specific for cervical intraepithelial neoplasia grade 2 or more severe (CIN2+) diagnoses. The p16INK4a performance was more sensitive (p < 0.001), less specific (p < 0.001), and of similar overall accuracy for CIN2+ compared to the combined performance of all pathologist reviews in routine clinical diagnostic service (sensitivity = 68.9%, specificity = 97.2%). Ki-67 immunostaining was also strongly associated with a CIN2+ diagnosis but its performance at all staining intensities was inferior to p16INK4a immunostaining, and did not increase the accuracy of CIN2+ diagnosis when combined with p16INK4a immunostaining compared to p16INK4a immunostaining alone. We found no utility for L1 immunostaining in distinguishing between CIN and non-CIN. 
In conclusion, with a rigorous evaluation, we found immunohistochemical staining for p16INK4a to be a useful and reliable diagnostic adjunct for distinguishing biopsies with and without CIN2+.
Cervical cancer prevention programs in the U.S. and other high-resource settings have traditionally relied on the repeated application of a three-stage intervention: 1) screening by Pap tests/cervical cytology; 2) colposcopic evaluation (magnified visualization of the cervix after the application of dilute acetic acid) of screen positives and directed biopsy of abnormal-appearing cervical tissue for diagnosis; and 3) excisional or ablative treatment of the cervical tissue in women diagnosed with precancerous lesions. This program of screening, diagnosis, and treatment of precancerous lesions has effectively reduced the burden and mortality due to cervical cancer where it has been effectively implemented.1, 15
Since most screen-positive women do not have clinically important disease, the goal of managing women who screen positive is distinguishing between those women who have precancerous lesions that have malignant potential and need treatment from those who have benign human papillomavirus (HPV)-associated lesions, which are most likely to regress spontaneously and can be monitored without immediate intervention. However, making such distinctions accurately (sensitively and specifically) has proven difficult.
In the U.S., Canada, and some European countries, cervical histopathological diagnoses are currently graded according to the cervical intraepithelial neoplasia (CIN) system: Normal, CIN1, CIN2, CIN3, and cervical cancer. The CIN nomenclature is primarily based on a subjective estimate of the proportion of the epithelial thickness, measured from the stroma towards the surface, in which differentiating epithelial cells are replaced by minimally differentiating or proliferating epithelial cells. At least two-thirds of the epithelium is replaced in CIN3, between one-third and two-thirds in CIN2, and one-third or less in CIN1. This classification system was developed before it was known that these intraepithelial lesions, while representing a spectrum of histological changes caused by HPV, are actually two distinct and non-continuous biological processes, one benign and one precancerous.29
In this context, CIN1 is the histopathological manifestation of infection by carcinogenic and non-carcinogenic HPV genotypes. CIN2 and CIN3 (sometimes called carcinoma in situ) are considered precancerous diagnoses. However, while CIN3 is considered a definite precancer and is the best surrogate for cancer risk, with 30% of CIN3 in older women becoming invasive over 30 years19, CIN2 is now considered an equivocal diagnosis of cervical precancer and includes both CIN1/HPV effects as well as some precancerous lesions.8 The clinical challenge is to define which is which, i.e., which of the equivocal diagnoses merit treatment as a precancer.
Histologic assessment of cervical dysplasia is complicated by inter-observer variability equaling that of cytologic interpretation.25 The two key interpretive issues include the distinction of normal from dysplasia (CIN) of any grade; and then benign, mostly transient dysplasia (CIN1) from precancer (CIN3 and some CIN2). The first issue, separating normal from any CIN, is most commonly the result of misclassifying a normal biopsy as CIN1.25 While this error might be viewed as trivial, an erroneous diagnosis of CIN1 provides false assurance to the colposcopist. Specifically, if the diagnosis of CIN1 is rendered and the histologic diagnosis is truly normal, the colposcopist thinks a lesion has been found when in fact normal epithelium has been biopsied, and thus the ability to make accurate colposcopic and pathologic correlation is severely compromised. Furthermore, while not recommended by current management guidelines28, women diagnosed with CIN1 are sometimes aggressively treated.
Less often, truly precancerous or high-grade lesions are misclassified as negative for dysplasia particularly in the setting of reactive or metaplastic changes or when a biopsy is fragmented or poorly samples the underlying small lesion. Conversely, diagnostic errors resulting in the overcalling of negative as high-grade disease also occur.5
Once dysplasia is recognized, the second diagnostic issue is to determine whether the lesion has malignant potential justifying a therapeutic excision, while attempting to minimize the over-treatment of lesions likely to regress. This is an important balance between safety and over-treatment because treatment increases the risk of negative reproductive outcomes for subsequent pregnancies.2 Current protocols place this treatment threshold at CIN2 (ref. 28), but CIN2 has been shown to be the least reproducible diagnosis4, 8, 25 and has much greater regressive potential than CIN3.6, 26 In our view, although CIN2 is a heterogeneous cohort8, containing both precancerous and transient lesions, its treatment provides a necessary margin of safety against the risk of cancer.
We have previously proposed a method for reducing inter-observer variability in cervical biopsy interpretations, particularly CIN2, by using fractions attributable to HPV16 (ref. 11); however, widespread genotyping is not yet available, and while applicable as a quality control metric for diagnostic tendencies, it is not helpful in the interpretation of any individual biopsy.
Other biomarkers have been proposed for use in triaging women with cervical dysplasia to increase diagnostic accuracy, akin to incorporating HPV testing into clinical practice as a triage of women with equivocal cytologic abnormalities to determine which women are in need of immediate colposcopy.18, 23 Several candidate biomarkers, such as HPV L1, p16INK4a, and Ki-67 as well as others, have been proposed for use on cytologic or histologic specimens based on promising preliminary results.9, 13, 16, 17, 20, 30
To more rigorously evaluate the utility of HPV L1, p16INK4a, and Ki-67 immunohistochemical staining for improving diagnostic accuracy, we conducted a community- and population-based evaluation in almost 1,500 consecutive cervical biopsies. Our goals reflected the diagnostic challenges described above: distinguishing dysplastic from normal tissue and delineating the grade of the lesion among women with dysplasia. This large study has assessed and minimized inter-observer variation by performing a centralized adjudicated review of histology. We also negated any auto-correction of stain interpretations by reviewing hematoxylin and eosin stain (H&E) slides for standard histologic diagnosis and grading the immunohistochemistry results in a masked fashion, thereby permitting an unbiased evaluation of each biomarker’s independent contribution to the adjudicated histopathologic assessment.
Routine clinical H&E-stained slides of all cervical biopsies accessioned for 14 consecutive months (n = 1,455), excluding benign cervical polyps, were prospectively collected with documentation of the routine clinical diagnosis rendered by one of the on-service general anatomic pathologists (“clinical diagnosis”). The median age, mean age, and age range of this population were 26 years, 29 years, and 18-71 years, respectively. The use of the tissues for this study was approved by the institutional review board of the University of Virginia, and the analysis of the de-identified data by P.E.C. was determined not to be human subjects research by the NIH Office of Human Subjects Research.
Slides underwent a quality control pathology review blinded to any clinical information except for patient age. A series of biopsies from a single patient were interpreted as separate biopsies. Diagnostic agreement between any two reviewing pathologists provided a “consensus diagnosis”. Slides first underwent independent reviews by two pathologists (M.H.S., K.A.A.). If there was diagnostic agreement between the first two reviewers, consensus diagnosis was achieved and no additional reviews were conducted. For cases with diagnostic disagreement between the first two reviewers, an independent third review was conducted by another pathologist (M.T.G.). If there was diagnostic agreement on two of the three reviews, no additional reviews were performed. Cases with three-way independent disagreement were re-reviewed together, masked to the previous diagnoses, by three reviewing pathologists at a multi-headed microscope until a consensus diagnosis (two-way or three-way agreement) was reached. This consensus was reached by H&E stains only; no immunohistochemical results were used.
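The review sequence described above can be summarized as a short decision procedure. The following is a minimal sketch only; the function names, the status strings, and the use of plain text labels for diagnoses are illustrative assumptions and were not part of the study protocol.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_diagnosis(reviews: Sequence[str]) -> Optional[str]:
    """Return the diagnosis shared by at least two reviews, else None."""
    diagnosis, count = Counter(reviews).most_common(1)[0]
    return diagnosis if count >= 2 else None


def adjudicate(first: str, second: str,
               third: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Apply the staged review: two readers, then a third, then joint re-review.

    Returns (consensus diagnosis or None, status). The status strings are
    hypothetical labels used only for this illustration.
    """
    if first == second:
        return first, "two-reviewer agreement"
    if third is None:
        return None, "third review needed"
    agreed = majority_diagnosis([first, second, third])
    if agreed is not None:
        return agreed, "two-of-three agreement"
    # Three-way disagreement: the case goes to joint re-review.
    return None, "joint re-review at multi-headed microscope"


print(adjudicate("CIN1", "CIN1"))
print(adjudicate("CIN1", "CIN2", "CIN2"))
print(adjudicate("Negative", "CIN1", "CIN2"))
```

A case that returns no consensus after three reviews would, as in the text, be re-read jointly until two-way or three-way agreement is reached; that step is a human process and is not modeled here.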
All excisional follow-up diagnoses (hysterectomy, cold knife conization, or LEEP) were recorded. More certain diagnostic annotations, such as “favor” or “consistent with”, were considered informative (for example: “squamous dysplasia, favor moderate” was coded as moderate squamous dysplasia). Less certain diagnostic annotations, such as “cannot exclude”, were not considered informative (for example “mild squamous atypia, cannot exclude mild squamous dysplasia” was coded as negative). Biopsies documented as insufficient for diagnosis or diagnosed as benign cervical polyps were excluded from further review.
Results of the preceding Pap tests (the most proximate, available in-house Pap test within 3 years of the colposcopic examination) and Pap tests done at the time of colposcopy (“synchronous Pap test”) were tabulated when available. A subset of 1,359 (93.7%) women had data on either a preceding Pap (n = 790, 54.4%) or a synchronous Pap (n = 980, 67.5%) (411, 28.3%, had both). Any accompanying HPV testing (Hybrid Capture 2; Qiagen, Gaithersburg, MD) was also recorded.
All colposcopically directed cervical biopsies were fixed in 0.25% zinc, neutral-buffered, 10% formalin (Richard-Allan Scientific, Kalamazoo, MI, USA) and embedded in paraffin per routine. These were then cut at 4 μm for 10 sequential levels and placed on negatively-charged, sialinated slides (Superfrost™, Thermo Fisher Scientific, Inc., Waltham, MA, USA). Slides 1, 2 and 10 were selected for hematoxylin and eosin staining and submitted to the on-service clinical faculty admixed with other routine non-cervical surgical cases. Slides 3-9 were retained for additional analyses, including immunohistochemical staining.
For immunohistochemistry, unstained slides were placed in a 60°C oven for 1 hour, cooled, de-paraffinized, and rehydrated through xylenes and graded ethanol solutions to water. All slides were then treated for 5 minutes in a 3% hydrogen peroxide solution in water to quench endogenous peroxidase. Antigen retrieval for all immunohistochemistry was performed using Target Retrieval Solution (Dako, Carpinteria, CA, USA; product code S1699) in a pressure cooker per the manufacturer’s protocol. Slides were then placed on a Dako Autostainer immunostaining system for use with immunohistochemistry. Non-specific antibody binding was inhibited by incubating sections with serum-free protein block (Dako; product code X0909) for 10 minutes. Primary antibodies against p16INK4a (mtm laboratories AG, Heidelberg, Germany; antibody used per package insert), Ki-67 (Dako; antibody diluted 1:300), and L1 capsid protein of all known HPV types (Cytoactiv Screening Set, Cytoimmun Diagnostics, Pirmasens, Germany; prediluted antibody) were applied for one hour at room temperature. Following washing and incubation with the relevant secondary antibody linked to horseradish peroxidase, also for one hour at room temperature, sections were developed by adding DAB+ (3,3′-Diaminobenzidine) chromogen (Dako; product code K3468) for 5 minutes. Slides were then counterstained in Richard-Allan hematoxylin, dehydrated through graded ethanol solutions, cleared with xylene, and cover-slipped.
A fourth pathologist (W.K.B.) reviewed and scored all immunohistochemical stains in the epithelium of each biopsy without clinical information or corresponding H&E-stained slides. The scoring of p16 generally included both nuclear and cytoplasmic staining, and was graded as 0 (no staining), 1 (rare singly dispersed cells staining), 2 (patchy but strong staining, often not continuous from basement membrane), and 3 (strong and diffuse staining, usually continuous staining from basement membrane and extending upward in proportion to lesion grade). The scoring of Ki-67 included nuclear staining only, and was scored as 0 (no staining), 1 (1-2 layers of basal/parabasal staining), 2 (diffuse staining confined to the bottom third or superficial staining but with skip areas usually between parabasal and upper zones), and 3 (continuous staining of greater than the lower third of the epithelium). L1 was scored as positive if at least one epithelial cell had discrete nuclear staining. See Figures 1-5 for examples of typical as well as problematic cases.
A total of 1,451 of 1,455 (99.7%) individual cervical biopsies had both diagnoses and all immunohistochemical staining results. Histologic diagnoses were categorized as follows: Negative, CIN1, CIN2, CIN3/adenocarcinoma in situ (AIS), and cancer. We also categorized diagnoses as no CIN (negative) versus CIN (CIN1 or more severe [CIN1+]). To examine the association of the biomarkers with the severity of CIN among women with any CIN, we grouped the small number of cancers with the CIN3/AIS (CIN3/AIS/Cancer).
As measures of diagnostic agreement between the clinical and consensus diagnoses, we calculated the percent total (raw) agreement with binomial 95% confidence intervals (95%CI) and Kappa values with 95%CI, and tested for a tendency for one or the other group to diagnose more severely using the symmetry χ2 test. Raw agreement, kappa values, and (if appropriate) linearly-weighted kappa values were calculated for the subset of immunohistochemical stained slides read by a second reviewer (K.A.A.) to assess specifically the relative reproducibility of the adjunctive stains.
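For readers less familiar with these agreement statistics, the following minimal sketch shows how unweighted and linearly weighted kappa can be computed for two readers who score the same slides on an ordered scale. It is an illustration only: the function name and the toy scores are assumptions, and the study's own values were computed in Stata.

```python
from typing import List, Sequence


def cohen_kappa(r1: Sequence[int], r2: Sequence[int],
                categories: Sequence[int], weighted: bool = False) -> float:
    """Cohen's kappa for two raters; linear weights if weighted=True."""
    n = len(r1)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions of (rater 1 score, rater 2 score).
    obs: List[List[float]] = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1.0 / n

    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        if weighted:
            # Linear weights: partial credit shrinks with score distance.
            return 1.0 - abs(i - j) / (k - 1)
        return 1.0 if i == j else 0.0

    p_obs = sum(weight(i, j) * obs[i][j] for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy data (hypothetical): staining scores 0-3 from two readers on four slides.
reader_a = [0, 1, 2, 3]
reader_b = [1, 1, 2, 3]
print(cohen_kappa(reader_a, reader_b, [0, 1, 2, 3]))
print(cohen_kappa(reader_a, reader_b, [0, 1, 2, 3], weighted=True))
```

Linear weighting gives partial credit for near misses (for example, a score of 2 versus 3), which is why it suits an ordered four-level staining scale better than the unweighted statistic.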
A Mantel-Haenszel test for trend was used to assess trends in staining positivity or intensity with the severity of the consensus diagnosis. Sensitivity, specificity, and Youden’s Index (YI) (YI = sensitivity + specificity − 1), as a metric of accuracy, were calculated for consensus diagnoses of CIN3 and more severe (CIN3+), CIN2+, and CIN1+. McNemar’s chi-square was used to test for differences in sensitivity and specificity.
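As a worked illustration of the accuracy metric, the sketch below computes Youden’s Index from the sensitivity and specificity values reported in the abstract for CIN2+. The function names are illustrative assumptions; only the percentages come from this report.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of diseased biopsies that test positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of non-diseased biopsies that test negative."""
    return true_neg / (true_neg + false_pos)


def youden_index(sens: float, spec: float) -> float:
    """YI = sensitivity + specificity - 1 (0 = uninformative, 1 = perfect)."""
    return sens + spec - 1.0


# Reported CIN2+ performance: p16INK4a at cutpoint 3 versus all pathologist reviews.
yi_p16 = youden_index(0.867, 0.828)
yi_pathologists = youden_index(0.689, 0.972)
print(round(yi_p16, 3), round(yi_pathologists, 3))
```

The two indices are close, which matches the statement that p16INK4a was more sensitive, less specific, and of similar overall accuracy for CIN2+.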
Logistic regression was used to calculate odds ratios (OR) with 95%CI for the association of biomarker immunohistochemical staining with having CIN versus having no CIN, and multinomial logistic regression was used to calculate OR with 95%CI for the association of biomarker immunohistochemical staining with the severity of diagnosis among women with any CIN (CIN3 or CIN2 versus CIN1).
Two-sided Fisher exact tests were used to examine discrepancies in consensus diagnosis and p16INK4a immunohistochemical staining intensity (e.g. <CIN2 and score of 3).
P values of less than 0.05 were considered statistically significant. Stata 8 (College Station, Texas, USA) was used for statistical analyses.
The pair-wise clinical and consensus diagnoses are shown in Table 1. As rendered by academic surgical pathologists in routine general sign-out, there were 755 negative (52.0%), 451 CIN1 (31.1%), 147 CIN2 (10.1%), 92 CIN3/AIS (6.3%), and 6 cancer (0.4%) diagnoses. The consensus panel review of these biopsies resulted in 748 negative (51.6%), 394 CIN1 (27.2%), 177 CIN2 (12.2%), 127 CIN3/AIS (8.8%), and 5 cancer (0.3%) diagnoses. Raw agreement between the clinical diagnosis and consensus diagnosis was 74.2% (95%CI = 71.8-76.4%) and the Kappa value was 0.59 (95%CI = 0.55-0.62). There was a 76.8% agreement for negative histology, 49.3% for CIN1, 27.6% for CIN2, 46.0% for CIN3, and 83.3% for cancer diagnoses. The consensus review had a tendency to render a more severe diagnosis than the clinical diagnosis (p < 0.001); notably the consensus review was more likely to call the clinical diagnosis of CIN2 as CIN3/AIS than the converse (48 versus 20) and to call a community diagnosis of CIN1 as CIN2 than the converse (77 versus 23).
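To illustrate how such asymmetry in discordant pairs is evaluated, the sketch below applies a McNemar-type chi-square (1 degree of freedom, no continuity correction) to the two pairs of discordant counts quoted above. This is a simplified, per-pair illustration and an assumption on our part; the reported p value came from the symmetry test across the full table of diagnoses, not from this calculation.

```python
import math
from typing import Tuple


def mcnemar_chi2(b: int, c: int) -> Tuple[float, float]:
    """McNemar chi-square for discordant counts b and c, with its p value.

    For 1 degree of freedom the chi-square upper tail equals
    erfc(sqrt(chi2 / 2)), so no statistics package is required.
    """
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Clinical CIN2 upgraded to CIN3/AIS (48) versus the converse (20).
print(mcnemar_chi2(48, 20))
# Clinical CIN1 upgraded to CIN2 (77) versus the converse (23).
print(mcnemar_chi2(77, 23))
```

Both pairs show far more upgrades than downgrades, consistent with the overall tendency of the consensus review to render the more severe diagnosis.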
Immunohistochemical staining results for HPV L1, p16INK4a, and Ki-67 were associated with each other. Increasing intensity of p16INK4a and Ki-67 immunohistochemical staining were strongly and directly correlated (ptrend < 0.001), while HPV L1 positivity decreased with increasing intensity of p16INK4a (ptrend < 0.001) and Ki-67 (ptrend < 0.001).
Shown in Table 2 are the results of the individual immunohistochemical stains versus the consensus diagnosis. Increasing intensity of p16INK4a (ptrend < 0.001) and Ki-67 (ptrend < 0.001) immunohistochemical staining was associated with the increasing severity of the consensus diagnosis. Statistically, immunohistochemical staining of HPV L1 (ptrend < 0.001) was negatively associated with the increasing severity of the consensus diagnosis. As would be predicted by the natural history of HPV and cervical carcinogenesis, negative histology (96.7%) and cancer (100%) were the most likely to test negative for L1 capsid protein.
We next examined the clinical performance (sensitivity, specificity, and YI) of different positive cutpoints for p16INK4a and Ki-67 staining, and the two markers combined, in relationship to the consensus diagnoses of CIN3+, CIN2+, and CIN1+ (Table 3). Increasing the positive cutpoint for p16INK4a staining increased its specificity and accuracy for CIN3+ and CIN2+ with only a minor decrement in sensitivity due to one case of CIN3/AIS having a staining intensity of 1 (discussed below). By comparison, increasing the positive cutpoint for Ki-67 staining increased its specificity and accuracy for CIN3+ and CIN2+ but there was a greater decrement in sensitivity. Combining the staining results for two biomarkers in general did little to enhance the clinical performance for CIN3+ compared to p16INK4a staining alone: the most accurate combination of p16INK4a (cutpoint = 3) and Ki-67 (cutpoint = 2) was only slightly less sensitive (98.5% vs. 99.2%), more specific (78.1% vs. 74.8%), and had a higher YI (76.6% vs. 74.0%) than p16INK4a (cutpoint = 3) staining alone.
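The cutpoint logic can be sketched as follows. The rule for combining the two markers is not spelled out in the text; the sketch assumes that a biopsy is called positive only when both markers meet their cutpoints, which is consistent with the combination being slightly less sensitive and more specific than p16INK4a alone. The function name and default arguments are illustrative.

```python
from typing import Optional


def is_positive(p16_score: int, ki67_score: Optional[int] = None,
                p16_cut: int = 3, ki67_cut: int = 2) -> bool:
    """Dichotomize 0-3 staining scores at the chosen positive cutpoints.

    ASSUMPTION: when a Ki-67 score is supplied, both markers must reach
    their cutpoints (logical AND). The source does not state the rule.
    """
    p16_positive = p16_score >= p16_cut
    if ki67_score is None:
        return p16_positive
    return p16_positive and ki67_score >= ki67_cut


print(is_positive(3))                 # p16INK4a alone at cutpoint 3
print(is_positive(3, ki67_score=1))   # combined call under the AND assumption
print(is_positive(3, ki67_score=2))
```

Under an AND rule the combined test can only remove positives relative to p16INK4a alone, so sensitivity can fall and specificity can rise, but never the reverse.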
We used logistic regression models to examine the association of each biomarker with having CIN (versus not) and the severity of CIN among those with any CIN (with CIN1 being the reference), mutually adjusting for the other biomarkers. In the models, we used a positive cutpoint of 3 for p16INK4a and Ki-67 staining. HPV L1 (OR = 8.1, 95%CI= 5.1-13.1), p16INK4a (OR = 16, 95%CI = 11-23), and Ki-67 (OR = 18, 95%CI = 4.3-77) staining were associated with having CIN. p16INK4a (OR = 4.7, 95%CI = 3.1-7.3) and Ki-67 (OR = 5.4, 95%CI = 2.9-9.9) staining were associated, and HPV L1 (OR = 0.62, 95%CI = 0.40-0.95) was negatively associated with, having a CIN2+ diagnosis. p16INK4a (OR = 140, 95%CI = 18-1000) and Ki-67 (OR = 22, 95%CI = 11-42) staining were associated, and HPV L1 (OR = 0.18, 95%CI = 0.098-0.35) was negatively associated with having a CIN3+ diagnosis.
Among women with a negative or CIN1 consensus diagnosis, women whose biopsy had p16INK4a immunohistochemical staining score of 3 (see examples in Figure 4) compared to those who had a score of <3 were: 1) more likely to have community diagnosis of CIN1+ (69.6% vs. 25.9%; p < 0.001) and CIN2+ (9.7% vs. 1.4%, p < 0.001) and 2) more likely to have a preceding or synchronous HSIL Pap (24.6% vs. 16.3%; p < 0.001). Among women with a CIN2 consensus diagnosis, women whose biopsy had p16INK4a immunohistochemical staining score of 3 compared to those who had a score of <3 were: 1) more likely to have community diagnosis of CIN1+ (98.5% vs. 80.0%; p < 0.001) and CIN2+ (54.0% vs. 40.0%, p = 0.2); 2) more likely to have a preceding or synchronous HSIL Pap (86.2% vs. 68.0%; p = 0.005); and 3) more likely to have a diagnosis of CIN2+ on their excised tissue (89.3% vs. 70.3%, p = 0.03). The one case of a CIN3 consensus diagnosis with immunohistochemical staining score less than 3 also had a CIN3 clinical diagnosis, a preceding Pap of atypical squamous cells, was positive for HPV L1, and Ki-67 immunohistochemical staining score of 2 (see Figure 5). The subsequent excision was diagnosed as CIN3. This most likely represents a technical failure of the p16 immunostain.
Table 4 compares the sensitivity, specificity, and Youden’s Index of p16INK4a with or without Ki-67 immunostaining with individual pathologists’ H&E interpretations for the detection of consensus CIN3+ and CIN2+. For an endpoint of CIN3+, p16INK4a with or without Ki-67 immunostaining was an equally sensitive but slightly less specific and less accurate (lower YI) diagnostic method compared to the combined performance of all pathologist reviews (using a diagnostic cutpoint of CIN2 or worse) (Sensitivity, p = 0.01; Specificity, p < 0.001) and to most of the individual performances of each pathologist singly. Similarly, for an endpoint of CIN2+, p16INK4a with or without Ki-67 immunostaining was a more sensitive but less specific diagnostic method compared to the combined performance of all pathologist reviews (Sensitivity, p < 0.001; Specificity, p < 0.001) and to most of the individual performances of each pathologist singly. Ranked according to YI, p16INK4a with or without Ki-67 immunostaining was near the middle of diagnostic accuracy for CIN2+ for the 12 surgical pathologists.
A random subset of immunohistochemical stained slides underwent a second review to assess the reproducibility of grading of the immunochemical stain. The raw agreement and Kappa for HPV L1 immunostaining (n = 159) were 96.9% and 0.88, respectively. The raw agreement, Kappa, and linear-weighted Kappa for Ki-67 immunostaining (n = 162) were 73.6%, 0.55, and 0.67, respectively. The raw agreement, Kappa, and linear-weighted Kappa for p16INK4a immunostaining (n = 162) were 76.5%, 0.64, and 0.80, respectively. Using the p16INK4a immunohistochemical staining score of 3 as the positive cutpoint, the raw agreement was 95.1% and kappa was 0.87. Using the p16INK4a immunostaining score of 3 and a Ki-67 immunostaining of 2+ as the positive cutpoint, the raw agreement was 95.0% and kappa was 0.86. By comparison, the ranges in the raw agreements and kappa values between the three pathologists conducting the consensus reviews were 88.3%-91.9% and 0.67-0.72, respectively.
There is little doubt that the pathology community agrees on the utility of adjunctive stains to increase the accuracy of clinical histologic interpretations. Many studies 9, 12, 13, 16, 17, 20, 30 have provided evidence that Ki-67 and p16INK4a immunohistochemistry are valuable adjunctive aids in the diagnosis of difficult cervical biopsies. However, virtually none of these studies take into account the realities of diagnostic variation and the confounding effect of autocorrelation. Furthermore, interpretation of immunohistochemistry can also be subjective, and there are no widely accepted criteria for the exact cutpoint separating a positive from a negative immunohistochemical stain. Taken together, these variables make it difficult to obtain an accurate assessment of a biomarker’s performance for its stated purpose.
Patient management is highly contingent on histopathologic diagnosis. While widely used in clinical studies as the determinative end point for clinical trials, adjudicated histology is not the standard of care. Furthermore, biomarker studies such as the immunochemical stains evaluated in this paper are used adjunctively at the discretion of pathologists. Indeed, part of the decision of whether to apply a special stain or not to a given case is completely dependent on whether the pathologist thinks they have a problem. Given the documented inter-observer variation of histologic diagnosis, together with the fact that these studies have documented that the biggest errors in interpretation are at the thresholds of CIN vs. non-CIN and, even more critically, CIN1 vs. CIN2+, one could legitimately ask whether biomarkers could be used to override or replace H&E diagnosis. To accomplish this goal, a clear index of performance of each potential marker, alone or in combination, must be established relative to an accepted diagnostic gold standard. This study was designed to address all these issues and the limitations of the existing literature.
We documented again the diagnostic variation in histologic diagnosis as has been previously observed.4, 8, 25 While findings in this interdepartmental study, given its relative tendency to under-call CIN2, are slightly different than ALTS8, 25, the overall impression of diagnostic variation and trends are still markedly evident. This was all true despite the fact that many of the faculty have worked together continuously for more than a decade to harmonize diagnoses in order to achieve more consistent management of patients. Indeed, these data from an optimal yet real-world setting highlight the difficulty in achieving reliable diagnostic performance without objective standards, and the circularity of the problem of evaluating biomarkers to address this issue in the absence of a clear gold standard. In other words, these data, like ALTS, likely underestimate severity of the diagnostic challenge. Furthermore, the literature on p16INK4a and Ki-67 is large, but in most cases the sensitivity and specificity estimates provided in that literature are confounded by this lack of a clear gold standard.
In this study, we went to great effort to try to create, if not a gold standard, an excellent reference standard against which we could reliably evaluate these biomarkers. While our reference standard was not perfect, rigorous review and consensus diagnoses do significantly reduce the misclassification of disease endpoints, thereby improving the association of a biomarker to the endpoint of interest.7
We demonstrated the following: 1) p16INK4a staining in a strong and diffuse block pattern is highly sensitive for CIN3+ as well as CIN2+ but not CIN1. Thus, p16INK4a immunohistochemical staining is useful in distinguishing high-grade CIN from ≤CIN1 but probably not useful for distinguishing CIN1 from non-CIN (negative); 2) Ki-67 immunohistochemical staining was equally sensitive but less specific for CIN3+ and CIN2+ compared to p16INK4a. The addition of Ki-67 immunostaining to p16INK4a immunostaining, at least on sequential slides, did not appreciably improve the diagnostic accuracy for CIN3+ and CIN2+ compared to p16INK4a immunostaining alone; and 3) L1 protein detection, which should be highly correlated with a productive viral infection, was neither sensitive nor specific for any class of cervical neoplasia, most likely because of the complexity of the temporal evolution of the HPV virion production that may be quite transient. Thus our study’s original hope that L1 would serve as a reliable marker to distinguish CIN vs. non-CIN was not realized. However, as an aside, finding L1 at the surface of 30% of CIN3 might be evidence of the progression of low-grade lesions to precancer.3, 24
The L1-positive cases that were negative by consensus diagnosis were selectively reviewed (data not shown) and commonly had at least one reviewer diagnosis of CIN1. These clearly represent the difficulty of the negative vs. CIN1 cutpoint that we had hoped to resolve. While L1 may highlight true CIN1 cases missed by our consensus diagnosis, we were disappointed by the overall sensitivity of L1 for consensus CIN1. In our opinion, only a broad-spectrum HPV ISH could possibly provide a better gold standard, and as the probe kits improve, we intend to further evaluate these cases with retained sandwiched slides.
Thus, p16INK4a immunohistochemical staining may serve multiple purposes in diagnosing cervical biopsies in routine clinical practice. First, it might be used as a training tool for newly trained pathologists to calibrate their own “receiver-operator characteristic (ROC)” curve to achieve good diagnostic performance for CIN2+. Second, in the same way that HPV DNA testing can be used as an objective QC standard for cytology21, 22, p16INK4a may likewise be used for cervical histopathology. Finally, where there is no or limited pathology review, as in some resource-constrained countries that have implemented a screening, diagnosis, and treatment program for cervical cancer prevention, easily interpreted, reproducible, and strong (3+) p16INK4a immunohistochemical staining might be used in lieu of a pathologist’s review.
While not yet definitively proven and likely to be controversial, given the repeated documented diagnostic variation in H&E interpretation of cervical biopsies, one could envision a reasonable tradeoff in biological and clinical accuracy whereby p16INK4a immunostaining would be a substitute for pathology review at the threshold for high-grade disease requiring treatment. If one considered our consensus diagnosis of CIN2+ to reflect true disease, treatment based on a p16 score of 3 would appropriately refer 87% (268 of 309) of patients whereas treatment based on the community pathology diagnosis would appropriately refer 69% (213 of 309). Yet, by the same token, using a p16 score of 3 would possibly “over-treat” 39% (155 of 394) of women with consensus CIN1 whereas treatment based on the clinical diagnosis would only over-treat 6% (25 of 394). Thus, compared to a community diagnosis of CIN2+, using p16INK4a immunostaining would gain sensitivity at the cost of specificity for treatment of precancerous disease. As noted above, these are conservative estimates of benefits, given the expertise and experience of the pathologists participating in the study.
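The arithmetic behind this tradeoff can be laid out explicitly. The sketch below uses only the counts quoted in this paragraph; the function and key names are illustrative assumptions. Because the paragraph gives over-treatment counts only for consensus CIN1, women with negative consensus histology are not included.

```python
from typing import Dict


def referral_summary(treated_cin2_plus: int, total_cin2_plus: int,
                     treated_cin1: int, total_cin1: int) -> Dict[str, float]:
    """Fractions appropriately referred (CIN2+) and over-treated (CIN1)."""
    return {
        "appropriately_referred": treated_cin2_plus / total_cin2_plus,
        "over_treated": treated_cin1 / total_cin1,
    }


p16_rule = referral_summary(268, 309, 155, 394)       # p16 score of 3
clinical_rule = referral_summary(213, 309, 25, 394)   # community diagnosis of CIN2+

extra_cin2_plus_treated = 268 - 213   # additional women with CIN2+ referred
extra_cin1_treated = 155 - 25         # additional women with CIN1 treated

print(p16_rule, clinical_rule)
print(extra_cin2_plus_treated, extra_cin1_treated)
```

In these data, switching from the clinical diagnosis to a p16 score of 3 would refer 55 additional women with consensus CIN2+ while treating 130 additional women with consensus CIN1.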
It is also worth noting that p16INK4a-positive CIN1 may represent a group of women who are at high risk of having or developing CIN2+ for the following reasons. First, we know that colposcopy is inaccurate14 and some CIN1 may be poorly sampled CIN2+. Second, despite our efforts, there will still be some misclassification of disease by adjudicated, consensus review. While our study demonstrated the value of p16 staining independent from H&E interpretation (specifically avoiding autocorrection), it is reasonable to conclude that for the time being, interpreting the staining result in the context of histology would provide the best diagnosis. Finally, p16INK4a-positive CIN1 may be at significant risk of subsequently developing into CIN2+ (ref. 27), although larger studies to quantify this risk are needed.
Based on our data, treatment based on p16INK4a-positive tissue may be a reasonable tradeoff of sensitivity and specificity for CIN2+, even versus standard H&E interpretation by expert pathologists. Based on our study design it is clear that strong p16 staining is easily interpreted and is highly correlated with the diagnoses based on the adjudicated review. Thus, newly trained or relatively inexperienced practitioners can rely on adjunctive p16 staining for the differential diagnosis of confounding lesions like repair or immature squamous metaplasia. Furthermore, in lower resource settings, where experienced pathologists may not be readily available, the reliable performance of p16INK4a staining may be preferred over H&E interpretation for determining which screen positives need treatment.
In conclusion, in a large community- and population-based series of biopsies that underwent rigorous pathology review, we found immunohistochemical staining for p16INK4a to be a useful and reliable diagnostic adjunct for distinguishing biopsies with and without CIN2+. Ki-67 immunostaining was inferior to p16INK4a and its inclusion with p16INK4a showed no marked improvement in clinical performance over p16INK4a alone. Thus, we achieved one of two pre-specified goals. Additional research is needed to identify biomarkers useful in distinguishing CIN from non-CIN. In our opinion, the most plausible biomarker, given that all CIN (squamous or glandular) are HPV DNA positive, would be sensitive and specific HPV DNA detection by in situ hybridization.
Dr. Castle was supported by the Intramural Research Program of the NIH, National Cancer Institute. Antibodies for the detection of p16INK4a were donated by mtm laboratories AG (Heidelberg, Germany) and for the detection of L1 capsid protein by Cytoimmun Diagnostics (Pirmasens, Germany). Dr. Stoler serves as consultant for mtm laboratories. In addition, Dr. Stoler has been a consultant in clinical trial and HPV DNA test development for Third Wave, Hologic, Qiagen, Roche Molecular Systems and Gen-Probe.