A major challenge in studies of etiologic heterogeneity in breast cancer has been the limited throughput, accuracy and reproducibility of measuring tissue markers. Computerized image analysis systems may help address these concerns but published reports of their use are limited. We assessed agreement between automated and pathologist scores of a diverse set of immunohistochemical (IHC) assays performed on breast cancer TMAs.
TMAs of 440 breast cancers previously stained for ER-α, PR, HER-2, ER-β and aromatase were independently scored by two pathologists and three automated systems (TMALab II, TMAx, Ariol). Agreement between automated and pathologist scores of negative/positive was measured using the area under the receiver operating characteristic curve (AUC), and agreement for categorical scores was measured using weighted kappa statistics (κ). We also investigated the correlation between IHC scores and mRNA expression levels.
Agreement between pathologist and automated negative/positive and categorical scores was excellent for ER-α and PR (AUC range =0.98-0.99; κ range =0.86-0.91). Lower levels of agreement were seen for ER-β categorical scores (AUC=0.99-1.0; κ=0.80-0.86) and both negative/positive and categorical scores for aromatase (AUC=0.85-0.96; κ=0.41-0.67) and HER2 (AUC=0.94-0.97; κ=0.53-0.72). For ER-α and PR, there was strong correlation between mRNA levels and automated (ρ=0.67-0.74) and pathologist IHC scores (ρ=0.67-0.77). HER2 mRNA levels were more strongly correlated with pathologist (ρ=0.63) than automated IHC scores (ρ=0.41-0.49).
Automated analysis of IHC markers is a promising approach for scoring large numbers of breast cancer tissues in epidemiologic investigations. This would facilitate studies of etiologic heterogeneity which ultimately may allow improved risk prediction and better prevention approaches.
There is increasing evidence that risk factor associations for breast cancer vary by tumor subgroups defined by morphology and immunohistochemical (IHC) expression of tumor markers (1-4), but our knowledge of these relationships is incomplete. Refining our understanding of etiologic heterogeneity may permit improved risk assessment and allow better prevention and screening approaches. Performing risk factor analyses by tumor subgroups requires the study of a large number of cases, creating a need for reproducible, high-throughput methods for scoring IHC stains. The development of tissue microarray (TMA) technology has partly addressed this need by providing a platform for performing standardized, rapid IHC staining of many tumors. However, optimizing methods for scoring IHC stains is challenging. Interpretation of IHC stains by a pathologist remains the current standard but is limited by suboptimal inter-observer agreement (5), reliance on a semi-quantitative scoring metric and a taxing workload.
Automated image analysis systems offer a potential solution by providing objective, rapid, reproducible, quantitative measurements of IHC stains. Several commercially available systems are in use, but published reports of their performance are limited (6, 7). Given that the performance of these systems may vary by tissue type and marker, comprehensive validation is required before applying these methods to large-scale epidemiological investigations. In this report, we assess the performance of three automated systems for IHC scoring, TMAx (Beecher Instruments, Sun Prairie, WI), Ariol (Applied Imaging, Grand Rapids, MI) and TMALab II (Aperio, Vista, CA). These were applied to a diverse set of IHC stains with relevance to breast cancer: estrogen receptors α and β (ER-α, ER-β), progesterone receptor (PR), aromatase and human epidermal growth factor receptor 2 (HER2). These antigens were chosen to capture a diversity of staining patterns (nuclear, membrane and cytoplasmic) and challenges in interpretation.
In the absence of a true “gold-standard” for quantifying IHC staining, we employed two complementary approaches to assess the performance of these automated scoring systems. First, we measured the agreement between automated and pathologists' scores, comparing it to the level of inter- and intra-pathologist agreement. Second, we investigated differences in the strength of the association between IHC scores derived with each method and mRNA expression levels determined in frozen tumor tissues available from a subset of patients.
The Polish Breast Cancer Study included women between 20 and 74 years of age who resided in Warsaw or Lodz, Poland from 2000 to 2003 (4). Breast cancer cases were identified through a rapid identification system organized at five participating hospitals and through cancer registries. A total of 2,386 cases agreed to participate and provided informed consent under a protocol approved at the National Cancer Institute and local Institutional Review Boards in Poland. This report includes a subset of 440 invasive carcinomas with available formalin-fixed, paraffin-embedded tumor blocks that had been previously prepared as a TMA.
Routinely prepared paraffin-embedded formalin-fixed blocks of 440 cases with invasive breast cancer were used to construct TMA blocks with 2-fold representation as 0.6-mm diameter cores (Beecher Instruments). Methods for performing IHC stains for ER-α, ER-β, PR, HER2 and aromatase have been detailed elsewhere (8).
TMA slides were digitized via whole slide scanning with 20x objective using two systems, the Aperio T2 scanner (Aperio Technologies) and the Ariol SL-50 scanner (Genetix). The digital images of IHC stained slides generated with the Aperio system were independently scored by two pathologists (MES and MAD). The percentage (0, 1, 5, 10, 20…100%) of tumor cells with positive staining and the average staining intensity (0=negative, 1=weak, 2=intermediate, and 3=strong) were recorded for each marker. Stains for aromatase were either negative or diffusely positive; therefore we only assessed intensity for this marker. Combined scores based on the percentage of cells stained times intensity (ranging from 0-300) were generated for each pathologist and then averaged. To measure the repeatability of visual scores, one pathologist (MAD) re-scored a random sample of 10% of the spots masked to her previous scores. Stains for ER-α, PR and ER-β were considered positive if the average combined score was >=10.5. Positive stains for aromatase and HER-2 were defined as having an average intensity score of >=1.5 (i.e. at least one of the two tissue cores with 2+ score). Semi-quantitative categories (negative, low, moderate and strong staining) for the average pathologist scores were also created. The combined score cut-points for these categories for ER-α, PR and ER-β are as follows: none (<0.5), low (0.5-10.5), moderate (10.5-100.4) and strong (>=100.5). The intensity score cut-points for HER-2 and aromatase are as follows: none (<0.5), low (0.5-1.5), moderate (1.5-2.5) and strong (2.5-3). Based on the pathologists' assessments of core quality, we excluded missing tissue cores or cores that were un-interpretable secondary to artifacts (25%). For cases with tumors with two satisfactory cores, the results were averaged (59%); for cases with tumors with one poor quality spot, results were based on the interpretable core (28%). 
The remaining cases with two tumor cores that could not be interpreted were excluded (13%), leaving a total of 339 cases available for analyses, on average, for each stain.
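The visual scoring rules for the nuclear markers can be illustrated with a short sketch. This is not the software used in the study; the function names and the representation of a core as a `(percent, intensity)` pair (or `None` when uninterpretable) are assumptions made for illustration. The stated cut-points share boundary values (for example, 10.5 appears in both "low" and "moderate"), so the sketch assumes each category includes its lower bound, which matches the positivity rule of >=10.5.

```python
from statistics import mean

# Hypothetical helper names; thresholds are those stated in the text
# for ER-alpha, PR and ER-beta combined scores (percent x intensity, 0-300).

def case_score(cores):
    """Average the combined score over interpretable cores of one case.

    cores: list of (percent, intensity) tuples, or None for a core that
    was missing or uninterpretable. Returns None if no core is usable.
    """
    usable = [c for c in cores if c is not None]
    if not usable:
        return None  # case excluded from analysis
    return mean([percent * intensity for percent, intensity in usable])

def categorize_combined(score):
    """Map a combined score to none/low/moderate/strong.

    Assumption: each category includes its lower cut-point.
    """
    if score < 0.5:
        return "none"
    if score < 10.5:
        return "low"
    if score < 100.5:
        return "moderate"
    return "strong"

def is_positive_nuclear(score):
    """Positive if the average combined score is at least 10.5."""
    return score >= 10.5

if __name__ == "__main__":
    two_cores = case_score([(20, 2), (10, 1)])  # both cores usable
    one_core = case_score([None, (5, 1)])       # one core uninterpretable
    print(two_cores, categorize_combined(two_cores), is_positive_nuclear(two_cores))
    print(one_core, categorize_combined(one_core), is_positive_nuclear(one_core))
```

For HER-2 and aromatase the same pattern would apply to the average intensity score, with the intensity cut-points given above in place of the combined-score cut-points.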
Aperio-derived images were scored using two systems, TMAx and TMALab II. The Ariol system was used for both scanning and scoring. We excluded cores with blurred images (2%). Prior to analysis, we adjusted automated scoring algorithms for size and shape parameters in an effort to limit analysis to carcinoma cells within each spot and adjusted intensity thresholds to distinguish positive IHC reactions from background counterstaining. For the TMAx system, algorithm tuning and analysis were performed by the vendor. TMALab II algorithms were initially set by the vendor and then refined independently by two pathologists (PL and SMH). The Ariol system algorithm was tuned by an image analysis expert (WH) with the support of a pathologist (MES). Algorithms for nuclear stains were used to score ER-α, ER-β, and PR and cell membrane algorithms were used to score HER2. At the time of the analysis, refined automated scoring algorithms for cytoplasmic markers were not available and aromatase staining was quantified as the average positive staining intensity of the entire spot. For ER-α, ER-β and PR, the systems calculated the percent of cells stained (1-100%) and average positive stain intensity as a continuous measure. For HER2 and aromatase, only the continuous intensity measure was used. As in the pathologists' scores, the product of percentage and intensity was used as the main staining measure for ER-α, ER-β and PR.
Among the breast cancer cases included in this analysis, samples of 84 tumors had been snap frozen, stored in liquid nitrogen (−196°C), and subsequently profiled for mRNA expression. Briefly, approximately 30 mg of frozen tissue was processed to isolate RNA with TRIzol reagent (Invitrogen, Carlsbad, CA) and the resulting RNA was purified with Qiagen RNAeasy Mini columns. Aliquots of 250 ng of input RNA were amplified and labeled using the Illumina TotalPrep RNA Amplification kit (Applied Biosystems/Ambion, Austin, TX), according to the manufacturer's protocol. The biotin-labeled cRNAs were quantitated using RiboGreen RNA Quantitation reagent (Molecular Probes, Eugene, OR) and 750 ng was hybridized to Illumina HumanRef-8 v2 Expression Beadchip microarrays (Illumina, San Diego, CA).
We assessed agreement between automated IHC scores and pathologists' scores of negative vs. positive and categorical strength of staining. To quantify the agreement between automated scores and pathologists' negative/positive scores, we evaluated the area under the curve (AUC) of the receiver operating characteristics (ROC) graphs for each instrument, considering the pathologists' result of negative or positive as the reference. The ROC curve plots the true versus false positive fraction for each possible cut-off point that could have been used to define negative versus positive tumors. The AUC of the ROC graph represents the probability that an automated score will be higher for a randomly chosen true positive sample (defined by the pathologists) than for a randomly chosen true negative sample. An AUC of 1.0 would represent perfect discrimination of the pathologists' negative/positive categorization by an automated instrument and an AUC of 0.5 would correspond to no discriminatory accuracy. The AUCs for the three automated systems were compared using a non-parametric method (9).
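The probabilistic interpretation of the AUC given above can be computed directly by comparing every pathologist-positive sample with every pathologist-negative sample. The following is a minimal sketch, not the analysis code used in the study. The function name is hypothetical, and counting tied scores as half a "win" is a common convention that is assumed here rather than taken from the text.

```python
def auc_pairwise(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    scores: automated continuous scores.
    labels: 1 if the pathologists called the sample positive, else 0.
    Ties contribute 0.5 (assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Perfect separation of negatives and positives
    print(auc_pairwise([5, 40, 120, 200], [0, 0, 1, 1]))
    # One positive scores below one negative
    print(auc_pairwise([10, 20, 30, 40], [0, 1, 0, 1]))
```

This pairwise form is quadratic in the number of samples, which is acceptable for a few hundred tumors; rank-based formulas give the same value more efficiently.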
To assess the agreement between the continuous scores of the automated instruments and the categorical scores of the pathologists for strength of staining, we converted the automated scores into the four categories used by the pathologists (negative, low, moderate and strong staining). This was done by aligning the distributions of the pathologist semi-quantitative scores with the automated scores. For example, if 30% of samples for a marker were categorized as negative by the pathologists, then the automated scores corresponding to the lowest 30th percentile of the automated results were categorized as negative. We measured the agreement using a weighted kappa statistic which represents the agreement exceeding that expected by chance. For comparison, we calculated intra-observer agreement for one pathologist (MAD) and inter-observer agreement between the two pathologists (MES and MAD). We interpreted kappa values of 0.8-1.0 as almost perfect agreement, 0.6-0.8 as substantial agreement, 0.4-0.6 as moderate agreement, 0.2-0.4 as fair agreement and 0.0-0.2 as slight agreement (10). In order to assess the impact of removing spots of poor quality from our main analyses, we compared the agreement between automated systems in spots of adequate quality to the agreement in spots of low quality.
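The distribution-alignment step and the weighted kappa can be sketched as follows. This is an illustration under stated assumptions, not the study's code: categories are coded 0 to 3 (negative, low, moderate, strong), tied automated scores are assigned in sort order, and linear disagreement weights are used by default because the text does not say whether linear or quadratic weights were applied. The function names are hypothetical.

```python
from collections import Counter

def align_to_categories(auto_scores, path_cats, n_cats=4):
    """Categorize automated scores so that category frequencies match
    those of the pathologist categories (percentile alignment)."""
    if len(auto_scores) != len(path_cats):
        raise ValueError("inputs must have the same length")
    counts = Counter(path_cats)
    # Sample indices ordered from lowest to highest automated score
    order = sorted(range(len(auto_scores)), key=lambda i: auto_scores[i])
    aligned = [None] * len(auto_scores)
    start = 0
    for cat in range(n_cats):
        size = counts.get(cat, 0)
        for i in order[start:start + size]:
            aligned[i] = cat
        start += size
    return aligned

def weighted_kappa(a, b, n_cats=4, power=1):
    """Weighted kappa; power=1 gives linear weights, power=2 quadratic.

    Undefined (division by zero) if all ratings fall in a single category.
    """
    n = len(a)
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_cats)]
    col = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    observed = expected = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = (abs(i - j) / (n_cats - 1)) ** power  # disagreement weight
            observed += w * obs[i][j]
            expected += w * row[i] * col[j]
    return 1.0 - observed / expected

if __name__ == "__main__":
    path = [0, 3, 0, 2, 1]
    auto = [5.0, 250.0, 0.0, 90.0, 30.0]
    aligned = align_to_categories(auto, path)
    print(aligned, weighted_kappa(path, aligned))
```

Because the alignment forces the marginal distributions to match, any disagreement measured by kappa reflects differences in how samples are ranked rather than differences in where thresholds are placed.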
Standard Illumina pre-processing was applied to the mRNA expression data. Specifically, the variance stabilization transformation was used followed by quantile normalization. We estimated correlations between IHC scores and mRNA levels by marker and IHC scoring method using Spearman's rank correlation test and used the Fisher r-to-z transformation to generate confidence intervals (11).
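The correlation step can be sketched as follows: Spearman's ρ is the Pearson correlation of the ranks, and the Fisher transformation gives an approximate confidence interval. This is a minimal sketch rather than the study's code. The function names are hypothetical, and the standard error 1/sqrt(n-3) is the usual choice for the Fisher transformation, assumed here because the text does not state the variance formula used.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_with_ci(x, y, z_crit=1.959964):
    """Return (rho, lower, upper) with an approximate 95% CI.

    Requires n > 3. Assumed standard error: 1 / sqrt(n - 3).
    """
    n = len(x)
    rho = pearson(average_ranks(x), average_ranks(y))
    rho = max(-1.0, min(1.0, rho))  # guard against rounding beyond +/-1
    if abs(rho) == 1.0:
        return rho, rho, rho        # atanh is undefined at +/-1
    z = math.atanh(rho)             # Fisher r-to-z transformation
    se = 1.0 / math.sqrt(n - 3)
    return rho, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

if __name__ == "__main__":
    ihc = [1, 2, 3, 4, 5]
    mrna = [2, 1, 4, 3, 5]
    print(spearman_with_ci(ihc, mrna))
```

With 84 profiled tumors the interval would be much narrower than in this five-sample illustration, since the standard error shrinks with the square root of n-3.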
The AUC for all the three automated systems was roughly 0.99 for discrimination of pathologist positive/negative categorization of ER-α, PR and ER-β staining (Table 1). There was some difference in the capacity of the automated systems to discriminate between pathologist negative/positive categories for HER-2 and aromatase. In the case of HER-2 staining, TMALab II (Aperio) showed less agreement with pathologists' scores (AUC=0.93) than the other systems (AUC=0.96-0.97). Ariol showed lower agreement with pathologist scores for the aromatase stain (AUC=0.85) compared to the other systems (AUC=0.96). Results for the percent-only scores as compared with the combined percent-intensity scores for the nuclear markers (ER-alpha, ER-beta and PR) yielded similar results for ROC analyses, apart from a slightly lower agreement for the TMALab II (Aperio) system with pathologists for ER-beta (AUC=0.90).
Agreement between pathologists' semi-quantitative and automated scores for ER-α and PR was almost perfect for all systems (κ=0.86-0.91) (Figure 1). The intra-pathologist reproducibility (κ=0.86) and inter-pathologist reproducibility (κ=0.91-0.93) were also high. There was substantial agreement between automated and pathologist interpretations of HER-2 staining levels for TMAx and Ariol (κ=0.69-0.72) but less (κ=0.53) for TMALab II (Aperio). Inter-pathologist (κ=0.86) and intra-pathologist agreement (κ=0.95) for HER-2 staining levels were excellent. When we restricted the analysis to 296 tumors scored as negative (0+) or strongly positive (3+), agreement improved: TMAx and Ariol (κ=0.83-0.90), Aperio (κ=0.69), inter-pathologist (κ=0.98). The agreement between automated scores and pathologists' scores for ER-β was excellent with minimal differences among instruments (κ=0.80-0.86). Intra-observer agreement (κ=0.84) was substantially higher than inter-observer concordance for this marker (κ=0.69). Out of the five markers, aromatase automated scores showed the poorest agreement with pathologist scores, with Ariol having lower agreement (κ=0.41) than TMALab II (Aperio) and TMAx (κ=0.65-0.67). The inter-pathologist (κ=0.30) and intra-pathologist agreement (κ=0.70) were also lowest for this marker.
In general, across all five markers, disagreement between pathologists and automated systems was limited to one-category discordances (e.g. weak vs. moderate staining). When we examined the instances of extreme discordance between pathologists and the Ariol system, a wide variety of issues appeared to be driving the disagreement, including misclassification of normal cells as tumor cells, staining artifacts, and equivocal staining (Supplementary Figure 2). Agreement between pathologist semi-quantitative scores for percent of positive staining cells did not differ substantially from the agreement levels seen for the combined score for the nuclear markers ER-alpha, PR and ER-beta. The two exceptions to this were a decrease in the agreement between TMALab II (Aperio) and pathologist ER-beta percent scores (κ=0.51) and an increase in the inter-pathologist agreement (κ=0.82) when using percent-only ER-beta scores.
We saw minimal differences in the agreement between pathologist and TMALab II (Aperio) staining categories generated by the three different users across all five markers (results not shown). The agreement between automated systems exhibited similar patterns to the agreement between pathologists and automated systems, with ER-α and PR showing the highest (κ=0.85-0.91) and aromatase showing the lowest agreement (κ=0.44-0.74). In general, there was a marked improvement in the agreement between automated systems in spots of adequate quality compared to spots of poor quality (Figure 2). ER-β and aromatase had the highest proportion of spots which were difficult to interpret by the pathologists. ER-β, while a nuclear marker, displayed a variable amount of cytoplasmic and background staining which made evaluation difficult. Aromatase generally showed weak diffuse cytoplasmic staining with frequent concurrent staining of the stroma, which was also difficult to visually interpret (Supplementary Figure 1).
IHC and mRNA levels for ER-α and PR were highly correlated (Table 2) based on both automated systems (ρ=0.67-0.74) and visual reads by pathologists (ρ=0.67-0.77). HER2 IHC and mRNA levels were also correlated, but the association was stronger for pathologists' scores (ρ=0.63) than automated methods (ρ=0.41-0.49). IHC staining and mRNA levels for ER-β and aromatase were not correlated by any method. We investigated whether ER-β mRNA expression was correlated with IHC staining intensity or the percent of cells staining positively but did not see an association for either staining parameter (results not shown).
This study demonstrates that using automated systems to assess IHC stains performed on TMAs in epidemiologic studies of breast cancer is a promising approach. The three commercial systems that we evaluated performed similarly well; however, performance was less encouraging for some markers (HER2, aromatase and ER-beta), irrespective of mode of assessment.
ROC analysis performed to assess agreement between automated instruments and pathologists at the level of negative vs. positive for the nuclear markers ER-α and PR showed nearly perfect agreement for all systems. In addition, agreement between automated systems and pathologists for strength of ER-α and PR staining was excellent. Previous studies have also demonstrated close agreement between automated assessments and pathologists' scores for ER-α and PR IHC stains in breast cancer (7, 12-16). Rexhepaj et al. (15) report an AUC of 0.85 for ER-α and 0.74 for PR comparing pathologist and automated scores using an in-house image analysis system. Turbin et al. (13) report kappa statistics of 0.88-0.90 when comparing pathologist and Ariol scores of positive/negative staining for ER-α, and Diaz et al. (12) report a kappa of 0.84 using the QCA image analysis system.
After dichotomizing automated HER2 scores, we demonstrated strong agreement with pathologists' scores of negative and positive. Agreement was good but not as strong for comparisons of the strength of HER2 staining. Concordance between TMALab II (Aperio) and pathologists was minimally less for HER2 staining levels than for other instruments. Recalibration of the tuning parameters by multiple users failed to improve agreement, suggesting that the result was operator independent. The lower level of agreement for HER2 staining categories could reflect difficulties in visual quantification of intermediate staining levels, a problem which is well documented (17-19). Indeed, when we restricted the analysis to unequivocal results of 0 or 3+, the agreement between automated and pathologists' scores improved across all instruments. Previously, Joshi et al. (20) reported somewhat higher levels of agreement between automated and pathologists' scores of 0, 1+, 2+ and 3+ (κ= 0.80-0.91) using an in-house image analysis system. One study using the Ariol system (6) reported κ = 0.84 between automated and pathologist scores of 0/1+, 2+ and 3+. We achieved almost identical results when we re-analyzed our Ariol data using this grouping (κ=0.82).
Our findings regarding the use of automated scoring systems for ER-β and aromatase are less clear. While, in general, there was good agreement between the automated systems and pathologists' scores, we saw lower levels of inter-pathologist agreement, particularly for aromatase. These data are consistent with our experience that measurement of nuclear-specific markers is quite reproducible, whereas measurement of markers in other cell compartments (cell membrane for HER2; cytoplasm for aromatase) or markers that show staining of multiple compartments (i.e. ER-β) is more challenging. We do not believe that difficulty in scoring antibodies for ER-β and aromatase is specific to the antibodies used in our study, but rather is exemplary of the general challenges related to assessing non-nuclear or multi-compartment stains. Reports suggest that ER-β antibodies may show both nuclear and cytoplasmic staining, which, though difficult to score, may have prognostic significance (21). Similarly, the aromatase antibody we employed shows diffuse cytoplasmic staining, which presents scoring challenges, but is typical of a valid pattern of staining that one may wish to assess with automated systems if possible (22). Overall, our results for inter-observer agreement and automated scoring emphasize the need to validate automated methods for specific assays, demonstrate the range of agreement that is obtained for visual and automated scoring of different markers, and provide impetus for further methodologic development of both immunohistochemical assays and automated scoring techniques.
We found substantially better agreement between scores for markers of adequate vs. poor quality, indicating that triage of spots is an important quality assurance measure for automated analysis. It is generally recommended that TMAs include two to four cores from each sample in order to minimize the impact of tissue misrepresentation and missing results (23-25). Redundant representation of tumors in TMAs when performing automated image analysis is particularly useful because many artifacts that are not limiting for pathologists may interfere with automated reads.
In the absence of protein quantification or other means of assessing accuracy, we assessed whether automated and pathologists' IHC scores produced similar correlations with mRNA levels. Correlations for ER-α, PR and HER2 were of particular interest because of their clinical and epidemiological relevance. For ER-α and PR, all measures were highly correlated, suggesting equally accurate representation of gene expression. For HER2, mRNA levels were more strongly correlated with pathologist than with automated IHC scores, although results for the latter were also highly significant. In contrast, none of the IHC scores were correlated with mRNA for ER-β and aromatase. Given that expression at the mRNA and protein levels is not necessarily correlated, this result does not necessarily indicate poor IHC scoring, especially since it was seen uniformly across automated and pathologist scores.
The strengths of our study include evaluation of multiple markers and instruments, using tissues from a population-based study, the use of multiple pathologists' interpretations and exclusion of poor images for automated analysis. There are several limitations of our study. First, we performed the image analysis in a fully automated mode even though the Ariol and TMALab II systems are designed to be run on tumor rich regions of cores marked by a trained reviewer. By including some benign tissue in the scoring, this could have negatively impacted the performance of these two systems. Cores that contained benign epithelium or abundant stroma seemed to diminish agreement between the automated systems and pathologists. Indeed, spots showing a high level of discordance between pathologists and automated systems often appeared to be caused by the misclassification of normal cells as tumor cells. However, we assumed that gating on cell features of tumor cells and the preparation of TMAs using cores removed from tumor-rich areas of tissue minimized the impact of this approach on most samples. In addition, the use of different scanners and different procedures for tuning algorithms may have reduced the validity of direct comparisons between the three systems. However, we did not notice any systematic differences in the performance of the systems using Aperio-derived images (TMALab II, TMAx) compared to the Ariol system suggesting that the scanner variation had a minimal effect. Although our data suggest there is minimal effect from scanner variation, this subject has not been rigorously tested to date. Appropriate investigation of these issues requires cross-platform and cross-instrument comparisons to provide a systematic understanding of any potential sources of bias. While experience with a system may enhance performance, we attempted to optimize all automated analyses and observed that performance of automated analyses was similar for different users.
Our study shows that fully automated image analysis systems can provide results that agree well with pathologist scores, particularly for robust nuclear markers such as ER-α and PR. Automated image analysis systems can greatly facilitate large-scale multi-center epidemiologic studies by providing standardized, quantitative measures of IHC staining. While reducing misclassification of TMA scoring is crucial, it is not the only factor impacting IHC data. Other aspects influencing IHC results include delays in the time to formalin fixation (26), variation in the adequacy of formalin fixation (27) and improper storage of cut and unstained slides. Our group (28) and others (29) have previously evaluated issues related to slide storage but the difficult issue of the effect of tissue fixation on IHC data has yet to be completely addressed. Tackling each of these steps will be needed to realize the full potential of tissue-based epidemiologic research.
Representative digital images from the Aperio T2 scanner of a) ER-alpha b) PR c) HER2 d) ER-beta and e) Aromatase stained TMA spots.
Digital images from the Ariol scanner of spots which showed large discordance between pathologist and Ariol interpretation of a) ER-alpha, b) PR, c-d) ER-beta, e-f) HER2, g-h) Aromatase.
Discordances between automated and pathologists' scores: a) both pathologists scored case as negative while Ariol scored the case as strong positive, secondary to detected staining of entrapped benign epithelium; b) average pathologists' score was 15% while Ariol scored this as 50%, probably due to a lower automated threshold for detecting light brown as positive stain; c) dark stain is recognized in vascular endothelial cells and entrapped clusters of epithelial cells within dense fibrous tissue, raising disagreements about the validity of the staining pattern, d) pathologists scored as negative, whereas Ariol scored as 90% of cells stained because the instrument misclassified benign stromal cells as tumor cells; e-f) pathologists interpreted as negative, Ariol scored as strong positive related to artifact; g-h) pathologists scored as negative, whereas Ariol scored as positive, probably reflecting machine misclassification of heavy counterstain (blue) as positive (brown).
Note: Yellow dots denote nuclei which Ariol interpreted to be tumor nuclei showing positive staining and pink dots denote nuclei which Ariol interpreted to be tumor nuclei showing no positive staining.
KLB, MGC, RP, RY, SMH, RY, RC, SLC, PM, SD, PL, JF and MES were funded by the National Cancer Institute Intramural Research Program.
MAD was funded by University of Calgary.
PDPP and WJH were funded by Cancer Research UK.