IRB approval and consenting process
Biopsy specimens for the test set cases were identified and obtained from the Breast Cancer Surveillance Consortium (BCSC) registries in Vermont and New Hampshire [9]. The BCSC is a collaborative network of five geographically distinct mammography registries with linkages to breast pathology and/or tumor registries [10]. BCSC procedures are Health Insurance Portability and Accountability Act (HIPAA) compliant and all registries have a Federal Certificate of Confidentiality and other protection for the identities of research subjects and the physicians and facilities that contribute data to the BCSC [12]. Women enrolled in BCSC registries provided prior consent to BCSC investigators allowing their archived tissue samples to be used for research [10]. Thus, the research subjects did not need to be re-consented for the development of the test sets created for the current study.
Institutional Review Boards at the University of Washington, Dartmouth College, the University of Vermont, Fred Hutchinson Cancer Research Center, and Providence Health & Services of Oregon approved all test set study activities. A study-specific Certificate of Confidentiality (NCI 11–049) was also obtained to protect the study findings from forced disclosure of identifiable information.
Test set case identification and selection
All biopsies used for the test set cases were performed between January 1, 2000 and December 31, 2008. Only excisional and core needle biopsies were used. Total mastectomy cases and fine needle aspiration specimens were excluded. Only one biopsy per woman was selected. When multiple biopsies were available from a single woman, we randomly selected one biopsy within a hierarchical classification of cases based on the available clinical history. We prioritized biopsies in which the woman’s hormone therapy (HT) status at the time of biopsy was known. If HT status was unknown, cases were selected by availability of information on family history. If both HT status and family history were unknown, cases were selected by known race.
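The hierarchical selection described above can be illustrated with a minimal Python sketch. It is not the study's actual code. The record fields (`woman_id`, `biopsy_id`, `ht_status`, `family_history`, `race`), the use of `None` for unknown values, and the fallback tier for biopsies with no known clinical history are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def priority_tier(biopsy):
    """Return the selection tier of a biopsy; lower means higher priority."""
    if biopsy.get("ht_status") is not None:
        return 0  # hormone therapy status known
    if biopsy.get("family_history") is not None:
        return 1  # HT unknown, family history known
    if biopsy.get("race") is not None:
        return 2  # HT and family history unknown, race known
    return 3  # assumption: fallback tier, not described in the text


def select_one_per_woman(biopsies, seed=0):
    """Pick one biopsy per woman, at random within her best available tier."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_woman = defaultdict(list)
    for biopsy in biopsies:
        by_woman[biopsy["woman_id"]].append(biopsy)

    selected = {}
    for woman_id, group in by_woman.items():
        best = min(priority_tier(b) for b in group)
        candidates = [b for b in group if priority_tier(b) == best]
        selected[woman_id] = rng.choice(candidates)
    return selected
```

For a woman with two biopsies, one with known HT status and one with only family history recorded, the sketch always returns the biopsy with known HT status. Random choice applies only among biopsies in the same tier.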
All women with a previous history of breast cancer were excluded. A family history of breast cancer was defined as having a first degree relative (i.e., mother, daughter or sister) with a breast cancer diagnosis. Breast cancer history was assessed through a yes/no response to the following questions, depending on the BCSC site, on a risk factor questionnaire: “Have you ever been diagnosed with breast cancer?” or “Has the patient ever had breast cancer?”. A total of 19,498 biopsies obtained from 13,677 distinct women met our eligibility requirements.
Test set cases were selected using stratified random sampling based on the age of the woman (40–49 vs. ≥50), breast density (low vs. high), and the final diagnostic interpretation of the original BCSC contributing pathologist who reviewed the woman’s biopsy for clinical treatment and management. Contributing BCSC pathologists work in a variety of practice settings, ranging from private practices in small hospitals to large university-affiliated academic practices in tertiary medical centers. For all potential test set cases, we categorized the diagnostic interpretations of the BCSC pathologist into one of five diagnostic classifications (non-proliferative changes, proliferative changes without atypia, atypical ductal hyperplasia (ADH), ductal carcinoma in situ (DCIS), and invasive breast carcinoma). This resulted in 20 possible combinations of test set cases (5 diagnostic categories × 2 age groups × 2 breast density groups = 20 combinations). Low versus high breast density was defined as ≤50% fibroglandular (BI-RADS categories 1 and 2) or ≥51% fibroglandular (BI-RADS categories 3 and 4), respectively. Information on breast density was obtained from mammography exams in the three years prior to biopsy. For women with multiple mammograms within the three-year period, we used data from the mammogram most proximal to the biopsy.
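As a rough illustration of the stratification scheme, the Python sketch below enumerates the 20 sampling cells and maps age and BI-RADS density category onto the two-level groups described above. The function names, the string labels, the treatment of ages under 40 as out of range, the 3 × 365 day approximation of the three-year window, and the inclusion of an exam on the biopsy date itself are assumptions, not details taken from the study.

```python
from datetime import date, timedelta
from itertools import product

DIAGNOSES = [
    "non-proliferative",
    "proliferative without atypia",
    "ADH",
    "DCIS",
    "invasive carcinoma",
]
AGE_GROUPS = ["40-49", ">=50"]
DENSITY_GROUPS = ["low", "high"]

# 5 diagnostic categories x 2 age groups x 2 density groups = 20 cells
STRATA = list(product(DIAGNOSES, AGE_GROUPS, DENSITY_GROUPS))


def age_group(age):
    """Map age at biopsy to the two age strata."""
    if 40 <= age <= 49:
        return "40-49"
    if age >= 50:
        return ">=50"
    raise ValueError("assumption: ages under 40 fall outside the strata")


def density_group(birads_category):
    """BI-RADS 1-2 count as low density, 3-4 as high density."""
    if birads_category in (1, 2):
        return "low"
    if birads_category in (3, 4):
        return "high"
    raise ValueError("unknown BI-RADS density category")


def most_proximal_birads(mammograms, biopsy_date, window_days=3 * 365):
    """Return the BI-RADS category from the latest exam in the window.

    mammograms is a list of (exam_date, birads_category) tuples.
    Returns None when no exam falls inside the window.
    """
    window_start = biopsy_date - timedelta(days=window_days)
    eligible = [m for m in mammograms if window_start <= m[0] <= biopsy_date]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m[0])[1]
```

Each candidate case then belongs to exactly one cell, for example `("ADH", age_group(45), density_group(3))`.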
We oversampled cases of ADH and DCIS compared to national estimates of biopsy outcomes in the U.S. [13] to increase statistical confidence in, and raw rates of, inter-observer variation in areas of breast pathology that may be more challenging to interpret, that are lower-frequency diagnoses, and where disagreements would affect treatment. Population-based adjustments according to disease prevalence will be made during statistical analysis.
Previous studies have shown misclassification rates among pathologists of over 50% for ADH and 17% for DCIS [14]. Women in their 40s and women with dense breast tissue were also oversampled because age and breast density are known risk factors for both benign breast disease and breast cancer [15], and because our a priori hypotheses include that there is more diagnostic variability in biopsies from women aged 40–49 years and women with dense breast tissue. By design, half of the test set cases were from women aged 40–49 at the time of the biopsy and half were from women aged 50 and older, with no upper age limit. Also by design, half of the cases were from women with high-density breast tissue and half were from women with low-density breast tissue.
A listing of candidate cases was randomly identified from the Vermont and New Hampshire BCSC registries (data not shown). Tissue blocks and original slides were requested from clinical facilities in Vermont (n=8) and New Hampshire (n=8) for review by our expert pathology panel. If a facility did not send the material after three written requests and two phone requests, we removed that case from our selection and requested the next case on the list within the 20 selection categories until we met our target accrual of 425 cases (Figure ). Not all cases were available at the time of request.
Reasons for a case being unavailable included insufficient residual tissue in the paraffin blocks, a facility requirement for additional consent procedures, and prior disposal of the tissue.
Flow chart describing test set development.
Initial review of biopsy material
An expert pathology panel composed of three internationally recognized breast pathologists reviewed all selected cases. One expert panel member conducted an initial assessment of all original slides associated with the biopsy received from the clinical facility. The woman’s age and biopsy type (core needle or excisional) were the only clinical history provided for each case. This expert was blinded to the original diagnosis made by the contributing pathologist. After the initial review was complete, new study-specific glass slides were created from each case’s appropriate paraffin-embedded tissue block(s) to ensure consistent staining and image quality. The newly created slides were used for the full expert panel review.
Standardized data collection
We developed a standardized histology data collection form, called the Breast Pathology Assessment Tool Hierarchy (B-PATH) form, which the expert pathologists used to record detailed diagnostic information about each case during their review. The B-PATH form included the same five diagnostic categories that were used for case selection (non-proliferative changes, proliferative changes without atypia, ADH, DCIS, and invasive breast carcinoma). The expert pathologists were asked to indicate if the case was borderline between two diagnostic categories and whether they would have requested a second diagnostic opinion for the case had they seen it in clinical practice. Finally, the B-PATH form asked pathologists to rate the level of diagnostic difficulty and their level of confidence in the assessment of each case using a 6-point Likert scale, with 1 representing “very easy” or “very confident” and 6 representing “very challenging” or “not confident”, respectively.
Independent and consensus review by expert panel
The 3-member expert pathologist panel (including the expert who conducted the initial review of the original tissue slides) performed blinded independent assessments on each slide using the standardized B-PATH form.
We used a modified Delphi approach [16] to establish the final reference standard diagnosis for each slide. This involved compiling the independent reviews for each case, providing the three experts with their initial interpretations of the slides, and then re-reviewing the slide(s) at a multi-headed microscope during a facilitated discussion of the features and diagnostic criteria in areas where the experts had disagreed. The facilitated discussion continued until all three expert pathologists reached a final consensus on the case interpretation. When more than one slide was available for a case, the panel selected the slide that was most representative of the diagnosis.
Sample size calculations and random assignment of cases into four diagnostic test sets
The number of cases per test set and the number of participating study pathologists were chosen to provide sufficient power to address the study aims. Using conservative assumptions about diagnostic variability among pathologists [14], we determined that 60 cases per test set of glass slides interpreted by 100 participating pathologists would, for example, yield 90% power to detect an effect of patient age (40–49 vs. ≥50 years) on a misclassification rate difference as small as 4.8% when interpreting cases with atypia and DCIS.
After the three expert pathologists reached final diagnostic consensus, and each case had been mapped into one of five primary diagnostic categories (Table ), 240 unique patient cases were randomly selected from a total of 336 cases reviewed by the expert panel (Figure ). Selection was performed within cells defined by three stratification factors in order to obtain the desired distribution of cases across factors. The factors were 1) case diagnosis (using the five diagnostic categories on the B-PATH data collection form), 2) patient age (40–49 vs. ≥50), and 3) breast density (low vs. high). A permuted block randomization method with a block size of four was used to assign cases to the four test sets. Blocks were defined within strata by similarity of case difficulty score (i.e., the mean Likert rating on the difficulty level assigned to each case by the three expert pathologists). For strata in which the cell total was not evenly divisible by four, random permutations of the relevant sets were assigned to the remaining partial block. Four final test sets were developed, each of which contained 60 unique patient cases.
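The permuted block assignment can be sketched in Python as follows. This is an illustrative reading of the procedure, not the study's implementation. The test set labels, the record fields (`case_id`, `stratum`, `difficulty`), and the handling of partial blocks (taking the first entries of a random permutation of all four sets) are assumptions. The sketch does not enforce the final balance of exactly 60 cases per test set.

```python
import random
from collections import defaultdict

TEST_SETS = ["A", "B", "C", "D"]  # hypothetical labels for the four test sets


def assign_test_sets(cases, seed=0):
    """Assign cases to test sets by permuted blocks of four within strata.

    Each case is a dict with 'case_id', 'stratum', and 'difficulty', where
    difficulty is the mean of the three experts' Likert difficulty ratings.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for case in cases:
        strata[case["stratum"]].append(case)

    assignment = {}
    block_size = len(TEST_SETS)
    for stratum in sorted(strata):
        # Sorting by difficulty puts cases of similar difficulty in one block.
        ordered = sorted(strata[stratum], key=lambda c: c["difficulty"])
        for start in range(0, len(ordered), block_size):
            block = ordered[start:start + block_size]
            # A fresh random permutation of the four sets for every block;
            # a partial block uses only the first len(block) labels.
            labels = rng.sample(TEST_SETS, k=block_size)
            for case, label in zip(block, labels):
                assignment[case["case_id"]] = label
    return assignment
```

Within each full block, every test set receives exactly one case. This spreads cases of comparable difficulty evenly across the four sets within a stratum.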
Diagnostic Breast Pathology Assessment Tool Hierarchy (B-PATH) mapping categories for test set cases
We aimed to create a test set of slides that, as closely as possible, mirrored the quality and variety of cases observed in everyday clinical practice. We recognize that there is a wide range of diagnostic quality of source material; our statistical sampling and selection methods were designed to eliminate or minimize selection bias. Cases were deemed ineligible only when there was slide preparation artifact or insufficient tissue present to interpret the slide (n=40), tissue other than breast tissue was present on the slide due to a contributing facility supplying an incorrect block (n=5), the tissue block was unavailable (n=1), male breast tissue was present (n=1), or atypical lactation changes were present (n=1). The final set of eligible cases may not necessarily be considered easy or difficult to interpret, or ideal for teaching purposes, because the selection process was designed to circumvent the type of selection bias that may exist in typical continuing medical education conferences and courses.
The results of the inter-rater agreement of the test sets and a description of how the test sets will be used to assess agreement in pathologists’ diagnostic interpretation of breast tissue will be reported elsewhere. In brief, approximately 200 study pathologists will be invited to independently review the test set cases and provide their diagnostic interpretation. These interpretations will be compared to the reference standard diagnoses as determined by the expert pathology panel and to the interpretations of other pathologists in the study.