With 2,601 specimens reviewed, this study is, to our knowledge, the largest blinded review to date of the reproducibility of the 1994 WHO endometrial hyperplasia classification scheme. Our overall agreement (kappa = 0.71) was similar to that of other studies, which report kappa values ranging from 0.2 to 0.7 (Table 7).1, 10, 19, 24
Our agreement was lowest for the diagnosis of simple hyperplasia (kappa = 0.16) and highest for “not hyperplastic” endometrium (kappa = 0.76). Because a high percentage of specimens had a final diagnosis of “no hyperplasia,” our overall agreement was inflated by the high agreement for non-hyperplastic specimens. Indeed, when “no hyperplasia” cases were excluded, overall agreement was significantly lower (kappa = 0.36, specific agreement 80.6%). However, one might argue that a case mix in which many non-hyperplastic specimens are admixed with a variety of hyperplastic specimens more closely resembles actual clinical practice.
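The way a large, well-agreed “no hyperplasia” category can inflate overall agreement can be illustrated with a short sketch. It assumes unweighted Cohen's kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's marginal totals; the text above does not state which kappa statistic was used. The counts are hypothetical and are not data from this study, and the function names are our own.

```python
from typing import List


def cohens_kappa(matrix: List[List[int]]) -> float:
    """Unweighted Cohen's kappa for two raters from a square k x k table.

    matrix[i][j] = number of specimens rater A placed in category i
    and rater B placed in category j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    if n == 0:
        raise ValueError("empty agreement table")
    # Observed agreement: proportion of specimens on the diagonal.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Chance agreement from each rater's marginal (row and column) totals.
    row_tot = [sum(matrix[i]) for i in range(k)]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def drop_category(matrix: List[List[int]], idx: int) -> List[List[int]]:
    """Remove one diagnostic category (its row and column) from the table."""
    return [
        [v for j, v in enumerate(row) if j != idx]
        for i, row in enumerate(matrix)
        if i != idx
    ]


# Hypothetical counts (NOT study data). Categories:
# 0 = no hyperplasia, 1 = non-atypical hyperplasia, 2 = atypical hyperplasia.
toy = [
    [80, 4, 0],
    [4, 4, 2],
    [0, 2, 4],
]
print(round(cohens_kappa(toy), 2))                    # 0.57, all categories
print(round(cohens_kappa(drop_category(toy, 0)), 2))  # 0.33, "no hyperplasia" excluded
```

With these toy counts, most of the observed agreement sits in the “no hyperplasia” category, so removing it lowers kappa from about 0.57 to about 0.33 even though agreement on the hyperplastic categories is unchanged.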
Although previous studies are not directly comparable because of differences in initial case selection and review panels, our agreement for the categories of non-atypical hyperplasia, atypical hyperplasia and adenocarcinoma was similar to that in both Zaino’s 2006 prospective GOG study and Bergeron’s 1999 European multi-institution study. With kappa values for atypical hyperplasia, a key clinical decision point, as low as 0.2-0.3 among “expert” gynecologic pathologists, there is clearly a problem with the reproducibility of the current WHO diagnostic scheme.
With specialists having so much trouble agreeing, it is not surprising that there is frequent disagreement between specialists and community pathologists. In Zaino’s GOG study the majority review panel diagnosis supported the referring institution diagnosis in only 38% of cases submitted as atypical hyperplasia.24
In our study, the final review panel diagnosis and the initial outside diagnostic categorization were not directly comparable, because the initial categorization was designed to maximize the number of possible cases of complex and atypical hyperplasia selected for review (see Methods). Because this approach biased the initial diagnosis toward higher-grade categories, it is not surprising that the review panel’s final diagnosis tended to downgrade the initial diagnostic category (data not shown). For these reasons, however, agreement with the “original diagnosis” was not considered meaningful in this study.
What are the factors that cause diagnostic disagreement? Our study is the first of its kind to systematically investigate the contribution of sample adequacy, interpretation of key histologic features and the presence of complicating features (polyps and metaplasias) to diagnostic disagreement.
Inadequate sampling is clearly a problem beyond its effect on diagnostic agreement: other studies have reported rates of “concurrent” carcinoma as high as 19% in hysterectomy specimens with a review panel diagnosis of normal endometrium or non-atypical hyperplasia.24
But for a given specimen, how do the total amount of tissue available for evaluation and the amount of hyperplastic tissue affect diagnostic agreement? In our study, specimens categorized by either pathologist as “scant” (< 0.5 cc) or as having a “low volume” of hyperplastic tissue were significantly more likely to have diagnostic disagreement (p < 0.013 and p < 0.0001, respectively). This implies that specimens with either a minimal total amount of diagnostic tissue (bordering on inadequate) or only a very focal amount of hyperplastic tissue in otherwise normal endometrium should be interpreted with caution. Reports on such samples may warrant a comment noting the small amount of diagnostic tissue and the resulting diagnostic uncertainty, along with a request for additional sampling. Zaino et al describe similar findings related to sample adequacy, with greater diagnostic reproducibility for dilation and curettage specimens than for office biopsy or curettage specimens.24
Interestingly, we did not find a statistically significant difference in diagnostic agreement between hysterectomy and non-hysterectomy specimens, possibly because even hysterectomies can have very low volumes of hyperplastic tissue, which can decrease diagnostic agreement.
While sampling can be addressed by recommending additional tissue, the lack of objectivity in applying multiple diagnostic criteria to establish a diagnosis is more challenging. The histologic features the WHO describes as useful in establishing a diagnosis include “architectural changes,” “shift in the gland to stroma ratio,” and “cytologic atypia.”21
However, strict definitions of these features, and of the criteria used to establish a specific WHO diagnosis, are not spelled out in detail. Architectural changes said to be characteristic of complex hyperplasia include “irregular epithelial budding” and “increased gland complexity,” the latter of which is not further defined. The WHO also notes a “shift in gland to stroma ratio in favor of the glands” as a feature of complex hyperplasia, but sets no threshold for it. The endometrial intraepithelial neoplasia (EIN) scheme used by George Mutter and colleagues is more specific, using a volume percentage stroma of less than 55% (area of glands > stroma) as one of the diagnostic criteria for EIN.15
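For reference, the volume percentage stroma (VPS) used in the EIN architectural criterion can be written as the share of the evaluated lesional area occupied by stroma. The notation below (A for measured areas) is ours rather than Mutter's, and is offered only as a sketch of the definition:

```latex
\mathrm{VPS} = 100\% \times \frac{A_{\text{stroma}}}{A_{\text{stroma}} + A_{\text{glands}}},
\qquad \text{EIN architectural criterion: } \mathrm{VPS} < 55\%
```

Because glands and stroma together make up the whole area, VPS < 55% is equivalent to glands occupying more than 45% of it; a strict “glands exceed stroma” rule would correspond to VPS < 50%.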
But it is the subjective interpretation of the presence of cytologic “atypia” in the WHO scheme that appears to be most problematic. In fact, the WHO specifically states that “definitions of cytologic atypia are difficult to apply in the endometrium because nuclear cytological changes occur frequently in hormonal imbalance, benign regeneration and metaplasia.”21
The WHO describes nuclear rounding, loss of polarity, prominent nucleoli, irregular nuclear membranes and cleared or dense chromatin as features of cytologic atypia, but acknowledges that atypia may be best appreciated by comparison with adjacent normal glands. The EIN scheme avoids a descriptive definition of cytologic atypia and instead requires that the cytology of the architecturally crowded focus differ from that of the background endometrium.15
Given the fairly loosely defined WHO diagnostic criteria, we were interested in determining if disagreement about the presence of key histologic features was a major factor in whether there was agreement about a specific WHO diagnosis.
Other studies have evaluated which histologic features best aid recognition of cytologic atypia or architectural complexity. Kendall et al found “gland crowding” significantly associated with a diagnosis of complex hyperplasia, while the presence of nucleoli was the only feature significantly associated with a pathologist calling a case atypical.10
Bergeron et al also found the presence of “gland crowding” most significantly associated with a diagnosis of hyperplasia while “nuclear pleomorphism” was most significantly associated with classification as atypical.1
However, previous studies have not investigated how concordance on the presence of specific histologic features affects diagnostic agreement. Because our study did not include outcomes, we did not aim to define which features best predict the risk of carcinoma, but only to investigate whether we could agree on the presence of defined features and whether disagreement about their presence affected agreement on the final diagnosis.
In our study, cases with diagnostic disagreement were also more likely than cases with diagnostic agreement to disagree on key histologic features such as “architectural complexity,” “glandular crowding” and “cytologic atypia,” indicating variable application of “defined” histologic features in formulating a diagnosis. Cytologic atypia was the feature most often disagreed on and showed the largest difference between the two groups (47.3% of cases with diagnostic disagreement also disagreed on the presence of cytologic atypia, versus 16.1% of cases with diagnostic agreement). Although our study reflects agreement between only two pathologists, the poor reproducibility of atypical hyperplasia in previous studies supports these findings.1, 24
Given that the presence of atypia is currently considered the best predictor of outcome in the WHO scheme, this finding calls into question the reliability of using “atypia” as it is currently defined (and variably interpreted) as a breakpoint for diagnostic categories.
The final factors we investigated as possible causes of diagnostic disagreement were “complicating” histologic features: the presence of features of a polyp or of metaplasia. Because polyps tend to be less hormone responsive, they can have irregularly distributed glands with areas of crowding as well as various metaplastic cytologic changes, which can make it difficult to distinguish a “normal” polyp from a polyp with areas of hyperplasia. We did find a significant association between diagnostic disagreement and specimens in which either pathologist noted features of a polyp (N = 230, p < 0.0001). Better criteria are needed to identify changes within polyps that should be considered higher-risk neoplastic lesions. In addition, when crowding or cytologic changes are limited to a polyp, a comment on the unclear significance of the changes may be warranted.
Metaplastic changes, or “epithelial cytoplasmic change,” in the endometrium range from squamous to tubal to repair-associated eosinophilic syncytial change. Extensive metaplasia can complicate the diagnosis of hyperplasia by making glands look more crowded (especially with extensive squamous metaplasia) or cytologically atypical. To complicate matters further, metaplastic changes are often associated with hyperplasia. We did note greater diagnostic disagreement when either pathologist recorded metaplasia; however, this difference did not reach statistical significance (p = 0.083), possibly because noting metaplasia was an optional part of the scoring form. Of the 10 cases called “cannot rule out hyperplasia,” the most common reason given was extensive metaplastic change. Metaplasia is therefore a possible pitfall to be aware of when diagnosing endometrial hyperplasia, but although it was commonly associated with a diagnosis of “cannot rule out hyperplasia,” it was not a major cause of diagnostic disagreement in this study.
Additional studies to establish more reproducible criteria for endometrial hyperplasia that are also predictive of progression to carcinoma are needed. Various alternate diagnostic schemes have been proposed. Bergeron et al proposed combining simple and complex hyperplasia into a single “hyperplasia” group and combining atypical hyperplasia with a subset of well-differentiated carcinomas into an “endometrial neoplasia” group.1
This scheme has the advantage of grouping lesions that are treated similarly, but its diagnostic utility has not been investigated. George Mutter and colleagues have more thoroughly investigated another scheme, developed from molecular, histomorphometric and outcome data, that separates precancerous neoplastic “endometrial intraepithelial neoplasia” from “benign endometrial hyperplasia,” the latter attributed to the presumed influence of unopposed estrogen.7, 13-17
However, while this system has the advantage of strong correlation of EIN with clonal populations, it is still unclear if this broad category can be further refined into high risk neoplasms that are likely to persist or progress to invasive carcinomas despite treatment with progestins versus lower risk neoplastic populations that may be spontaneously shed or regress with progestin therapy.
In conclusion, our study, the largest of its kind to date, confirms previous findings of poor reproducibility of the current WHO endometrial hyperplasia classification system. Our findings further suggest that diagnostic disagreement arises both from the inability to agree on the presence of key histologic features and from the amount of diagnostic tissue present. We suggest that, in the clinical setting, specimens with limited amounts of diagnostic tissue (either low volumes of hyperplastic tissue or overall scant specimens) be interpreted with caution and that a recommendation for additional tissue be considered. Setting a minimum threshold for the amount of diagnostic tissue required for a definitive diagnosis could improve diagnostic agreement and decrease the rate of immediate hysterectomy for this usually low-grade neoplastic process. Given the poor reproducibility of the diagnosis of endometrial hyperplasia, outcome studies in this field may also want to limit their cases to those with diagnostic agreement among reviewing pathologists, or at least to those with a minimum amount of diagnostic tissue. Using stricter criteria for outcome studies would give a clearer picture of the natural history of endometrial hyperplasia and perhaps shed more light on which lesions are truly higher risk.