|Home | About | Journals | Submit | Contact Us | Français|
High-throughput evaluation of tissue biomarkers in oncology has been greatly accelerated by the widespread use of tissue microarrays (TMAs) and immunohistochemistry. Although TMAs have the potential to facilitate protein expression profiling on a scale to rival experiments of tumour transcriptomes, the bottleneck and imprecision of manually scoring TMAs has impeded progress.
We report image analysis algorithms adapted from astronomy for the precise automated analysis of IHC in all subcellular compartments. The power of this technique is demonstrated using over 2000 breast tumours and comparing quantitative automated scores against manual assessment by pathologists.
All continuous automated scores showed good correlation with their corresponding ordinal manual scores. For oestrogen receptor (ER), the correlation was 0.82, P<0.0001, for BCL2 0.72, P<0.0001 and for HER2 0.62, P<0.0001. Automated scores showed excellent concordance with manual scores for the unsupervised assignment of cases to ‘positive' or ‘negative' categories with agreement rates of up to 96%.
The adaptation of astronomical algorithms coupled with their application to large annotated study cohorts, constitutes a powerful tool for the realisation of the enormous potential of digital pathology.
Immunohistochemistry (IHC) is the most widely used method for the assessment of protein expression in tissues in both the clinical and research setting. The advantages of IHC, which include preserved tissue morphology, quick turnaround time and ability to assay small amounts of tissue such as core biopsies, have established it as the principal ancillary study in diagnostic pathology. The coupling of IHC and tissue microarray (TMA) technology has enabled researchers to screen for candidate biomarkers in large study cohorts including clinical trials. However, this process continues to rely heavily on manual assessment of staining resulting in laboriously acquired semi-quantitative readouts of protein expression. In addition, TMAs and IHC have enabled the investigation of high-dimensional relationships between proteins expressed in cancers in a manner analogous to expression profiling using cDNA microarrays (Callagy et al, 2003; Makretsov et al, 2004; Abd El-Rehim et al, 2005; Jacquemier et al, 2005; Ali et al, 2011). However, these efforts are seriously limited by the bottleneck of manually assessing immunostains for tens of proteins across thousands of cases and the pathologist's ability to discriminate between small staining differences on this scale.
Astronomers have long been faced with the problem of automatically deriving objective, reproducible and continuous information from complex telescopic images of the sky. Driven by the large volume of data, image analysis in the field of astronomy has matured into a sophisticated, robust discipline. We therefore investigated the adaptation of algorithms used in astronomy to immunostained microscopic images of breast cancer in order to produce comparable measures of protein expression (Walton et al, 2010). We describe three algorithms developed for oestrogen receptor (ER), B-cell lymphoma protein 2 (BCL2) and human epidermal growth factor receptor 2 (HER2) representing examples of nuclear, cytoplasmic and membranous staining patterns, respectively. Our method includes a technique for dividing the study population into ‘positive' and ‘negative' subgroups in an unsupervised manner. The algorithms were tested in a cohort of over 2000 breast tumours represented in TMAs and compared with manual scores produced by pathologists.
This utilisation of digital pathology results in the production of continuous readouts of protein expression more typical of genomic experiments while retaining tissue morphology. Genomic research has been enormously advanced by the existence of public repositories of gene expression data. In the interest of transparency and in order to encourage continuing development, we have made all TMA images (over 6000 images) and algorithms used in this study available in a public repository. We hope that this resource will act as a hub for the collaborative development of image analysis algorithms by innovative researchers from diverse disciplines.
The large population-based breast study SEARCH (studies of epidemiology and risk factors in cancer heredity) was used for this work. This study includes women diagnosed with breast cancer from the East Anglia region. Details of this study have been published previously (Lesueur et al, 2005). IHC data from 2258 patients were included in this study. Characteristics of the study cohort are detailed in Table 1. The SEARCH study is approved by the Cambridgeshire 4 Research Ethics Committee (02/5/42); all study participants provided written informed consent.
TMAs were constructed as previously described (Kononen et al, 1998). One 0.6mm tissue core was used to represent each tumour. Following dewaxing in xylene and rehydration through graded alcohols, TMA sections were immunostained using a BondMax Autoimmunostainer (Leica, Bucks, UK). Details of antibodies and staining protocols are presented in Table 2. Bound primary antibody was detected using a polymer-conjugated secondary antibody as part of the Bond Polymer detection kit (Leica, Bucks, UK) and signal was developed using 3′-3′-diaminobenzidine (DAB) producing a brown stain. TMA slides were digitised using the Ariol platform (Genetix Ltd, Hampshire, UK) and images were subsequently extracted uncompressed (lossless) as .jpegs for downstream analysis. Scanned TMA images were manually scored by a pathologist using the Ariol user interface and blinded to patient or tumour characteristics; details of scoring systems are shown in Table 2.
We first converted stained TMA images into a format compatible with astronomy processing techniques since they are based on positive going fluxes relative to some positive sky background. The flexible image transport system (FITS) (Wells et al, 1981) was used since the uncompressed JPEG colour images are equivalent to three channels (Red Green Blue (RGB)); the conversion extracts the three image planes and inverts the intensities.
We used a top-level image processing approach consisting of forming a reference image by averaging the R+G channels and using this to form a difference image with respect to the B channels, that is, B−(R+G)/2 (Figure 1C and D). Estimates of the overall background level and random pixel noise in both reference and difference images were made. We used an iteratively clipped median for the level, and the median of the absolute deviations from the median (MAD) as the basis of the noise estimator (Hoaglin et al, 1983). A threshold k-sigma above the overall background was applied, in order to identify all significantly visible regions in the reference image and only those that were significantly stained in the difference image. The automated score was defined by two components: the proportion of pixels picked out in the difference image relative to the reference and the overall intensity (median) of these pixels in the difference image. Figure 1F illustrates how analysis of a histogram of the automated scores can be used to set a ‘blind' threshold for positivity, where the threshold was set at the 95% confidence point that staining was present based on the scatter of the unstained ensemble.
Immunostained tumour nuclei within complex tissue sections showed many similarities to astronomical images where small discrete objects, stars and distant galaxies, are superposed against a varying sky background. In astronomy this image segmentation problem has been well-studied (Irwin, 1985) and is composed of three stages: background estimation and tracking; detailed segmentation, that is, object detection using thresholded pixel connectivity to define objects; and finally object parameterisation, that is, generating measures such as position, shape, and intensity for each object.
A reference image was produced using the average of all three channels, with the intention of maximising overall signal-to-noise. Overall background variation in the reference image was tracked and removed to simplify image segmentation. Regions of contiguous connected pixels above some noise threshold were identified. These included isolated nuclei and clusters of closely packed nuclei, so a further step equivalent to ‘watershedding' (Tuominen et al, 2010) was required to segment individual nuclei (Figure 2C).
Object descriptors were computed for each nucleus. Figure 2D and E show diagnostic plots where the detected objects satisfied size limits and ellipticity/circularity constraints. The degree of staining for each nucleus (y-axis) was recorded as the ratio of the B channel intensity to the average of the R+G channels. This latter measure is shown on the x-axis revealing subtleties of the variation of the ratio as a function of the overall degree of staining. Figure 2D shows an ER+ example, while Figure 2E illustrates an ER− example. The vertical dashed boundaries are a minimum signal-to-noise limit requirement for inclusion in the final score, while the horizontal dashed boundary denotes the border between stained (above) and unstained (below) nuclei. The histogram of the distribution of automated scores was used to set a ‘blind' threshold for positivitiy. Nuclei that did not satisfy the selection requirements were flagged as ‘unscored'. We tried locating the locus (ratio) of unstained nuclei, and measuring the spread about this locus, to set a boundary independently for every image (tissue core). However, in some cases insufficient nuclei or complete lack of unstained nuclei led to dramatic differences in boundary location between individual tissue cores. Instead, we considered 172 cores of a single TMA slide as an ensemble, defining a single boundary for the set hence evading systematic variation due to small-number statistics. This also yields an overall quality check on the fidelity of the staining of a particular slide. This is illustrated in Figure 3A–D and Supplementary Figure S1. Figure 2F illustrates the distribution of manual Allred scores by the intensity and proportion components of the automated score. The final proportion statistic for each core is defined as the proportion of nuclei lying above the boundary compared with the total number of points on the plot, and the intensity as the ratio of the difference between the median ordinate values of the points above the boundary (stained) compared with below (unstained). This difference is then normalised by the median ordinate value of the unstained points to minimise dependency on image contrast.
A hybrid approach based on the top-level fragmentation of an image from the nuclear analysis was chosen, where the segmentation was halted at the level of groups of contiguous connected pixels.
The top-level segmentation was based on a background-corrected reference image composed of the average of the R+G channels, to avoid introducing a bias against unstained regions. Due to the complexity of the shapes involved, segments were retained for analysis based on a size criterion (number of connected contiguous pixels) (Figure 4C). Each remaining pixel was coded with the ratio of the background-corrected B channel flux to the average of the background-corrected R+G channels. This method reduced the impact of varying degrees of contrast, while the background correction reduced the sensitivity to overall background pollution. The final score was based on the proportion of segmented pixels with a flux ratio >1 compared with the total number of segmented pixels and the median value for the ratio of fluxes, labelled as the intensity statistic in Figure 4D. The funnel-like appearance of the scatter plot of automated scores (Figure 4D) arises as a consequence of the method. The neck at coordinates (1.0, 0.5) is a result of using the median ratio as the intensity score, by definition for a proportion of 0.5 the median ratio must be unity. The split between ‘positive' and ‘negative' scores was defined by the ‘neck' point at (1.0, 0.5).
Spearman's correlation coefficient was used to assess correlation between continuous automated scores and ordinal manual scores. All automated scores were between −1 and 1 where the 95% confidence point defining the presence of staining was 0. The agreement between automated and manual scores in assigning a ‘positive' or ‘negative' status was assessed using a receiver-operating characteristic (ROC) analysis where the manual score was used as the reference variable, providing a measure of sensitivity, specificity and proportion of cases concordantly classified. Associations with breast cancer-specific survival (BCSS) at 10 years were compared between manual and automated scores using a Cox proportional-hazards model providing a hazard ratio (HR) and 95% confidence interval (95% CI). Known violations of the proportional-hazards assumption (Blows et al, 2010) were accounted for by extending the model to include a coefficient, which was allowed to vary as a function of log time where if the log of the coefficient (T) is <1 hazard falls with time, while if it is >1 hazard increases with time. All statistical analyses were conducted in Intercooled Stata version 11.1 (StataCorp, College Station, TX, USA).
We used the molecular pathology arm of the large breast study SEARCH for this work (Lesueur et al, 2005; Ali et al, 2011). This is a population based study of women from the east of England with breast cancer. We included 2258 breast tumours and have made digital images for all three markers and reported algorithms freely accessible at: https://www.cri.cam.ac.uk/data/cclab/; username: cclabpub; password: uwzuhq8n.
As part of the nuclear staining analysis, the distribution of all objects (nuclei) for ER was illustrated as a scatter plot according to staining intensity for each TMA slide (Figure 3A and B; Supplementary Figure S1). These plots were inspected in order to identify slides where stained nuclei were not clearly distinguishable from unstained nuclei owing for example, to non-specific staining or excessive counterstain. This in effect provides a visual gauge of signal-to-noise. Although there was considerable variation in signal-to-noise, a population of clearly distinguishable stained objects was identifiable for every TMA slide included in the study; hence, in this instance no slides were excluded on the basis of staining quality. These plots also reflect the overall proportion of stained nuclei. This is illustrated in Figure 3 where plots summarising slides containing substantially different proportions of ER-positive cores as determined by manual scoring, have distinct appearances. The slide summarised in Figure 3A contained 64% ER-positive cores compared with the slide summarised in Figure 3B which contained 79% ER-positive cores. Since the quality of staining was consistently high for ER, we selected TMAs previously stained for the nuclear marker DNA replication licensing factor MCM2 (MCM2) with variable staining quality to demonstrate differences in signal-to-noise detectable by the nuclear algorithm. Figure 3C shows a summary plot with example tissue cores for a TMA slide stained for MCM2 together with examples of tissue cores where an intense counterstain diminishes the signal of positive nuclei. Figure 3D shows a summary plot with example tissue cores for another slide stained for MCM2 with a weak counterstain and background cytoplasmic staining. These plots provide an objective diagnostic of staining quality highlighting slides for further investigation.
Continuous automated scores and ordinal manual scores were highly correlated. TMAs stained for ER, BCL2 and HER2 had previously been scored by visual inspection of the digital images using standard ordinal scoring systems (Table 2). The distributions of manual and automated scores are illustrated as histograms in Figure 5. Spearman's correlation coefficients for all automated and manual scores are detailed in Table 3. The correlation between the automated score and eight-category Allred ordinal score for ER was the strongest at 0.82, P<0.0001. The histogram of the BCL2 manual H-score shows that although the range of the score is large (0–300) the majority of cases are clustered around the highest and lowest scores while cases with intermediate scores are relatively sparse. This contrasts with the appearance of the histogram for the automated score, which shows a more even distribution of cases through the gradation of staining with a similar cluster of cases at higher scores. This disparity in distribution highlights the ability of automated analysis to distinguish cases with more subtle differences in staining. The BCL2 automated score showed good correlation with the manual modified H-score at 0.73, P<0.0001. Although the distributions of the automated and manual scores for HER2 were the most similar of the three immunostains (Figure 5C), they showed the weakest correlation at 0.64, P<0.0001. This may, in part, be attributable to the relative scarcity of HER2-positive cases (185 cases (11%)). Correlation between automated scores is illustrated as a scatter matrix in Figure 5D. Oestrogen receptor and BCL2 are known to show a strong positive correlation (Dawson et al, 2010). The correlations between the manual and automated scores for ER and BCL2 were very similar at 0.58, P<0.0001 and 0.56, P<0.0001, respectively. Similarly, BCL2 and HER2 showed a negative correlation of −0.24, P<0.0001 by manual scores and −0.009, P<0.0001 by automated scores. Oestrogen receptor and HER2 manual scores were negatively correlated (Spearman's correlation coefficient=−0.19, P<0.0001), but this relationship was not reproduced between the automated scores (Spearman's correlation coefficient=−0.03, P=0.27). However, when restricted to the HER2-positive population as defined by automated analysis, we also find a significant negative correlation with the automated ER score (Spearman's rank correlation −0.27, P<0.0001).
In order to assign patients to ‘positive' or ‘negative' categories using the automated score, the population was divided at the level of the 95% confidence point that there was staining present as defined against the scatter of unstained objects; notably this is an unsupervised method and was not influenced by the dichotomous manual score. There was excellent concordance between the automated and manual scores in assigning cases to ‘positive' and ‘negative' categories. Receiver-operating characteristic analysis is detailed in Table 4. Cross-tabulations of dichotomous scores by marker are shown in Table 5. HER2 showed the best agreement between manual and automated dichotomised scores with 96% of cases classified concordantly at a sensitivity of 98.4% and a specificity of 95.7%. Dichotomisation of automated scores for ER also performed well with 93.2% of cases classified concordantly. The assignment of cases as BCL2+ or BCL2− using the automated method concordantly classified 87.3%. This unsupervised assignment of cases to ‘positive' and ‘negative' categories highlights the potential for our automated analysis to act as an unbiased classifier avoiding many of the pitfalls associated with manual scoring.
These patterns of concordance between dichotomous manual and automated scores were reflected in estimates of association with BCSS (Table 6; Figure 6). While both ER and HER2 showed near identical estimates between manual and automated scores, estimates for BCL2 manual (HR, 0.12; 95% CI, 0.06–0.25; P<0.001; T, 2.3 (1.4–3.8); P=0.001) and automated (HR, 0.24; 95% CI, 0.12–0.49; P<0.001; T, 1.7; 95% CI, 1.0–2.7; P=0.036) scores were slightly different. This disparity in survival prediction is consistent with observations that the method for analysis of cytoplasmic stains performed least well in terms of concordance with manual scores.
In order to investigate the reasons for discordance of dichotomous scores between automated and manual assessment, discordant cases were reviewed by two pathologists (HRA and BM-A). Each case was re-assigned as ‘positive' or ‘negative' according to a consensus decision and cases were also scored for the number of tumour cells present (more or less than 50 cells), presence of contaminating normal breast epithelium (absent or present) and lymphocytic infiltration (absent, sparse, marked). The results are detailed in Supplementary Table S2. Of the 184 cases stained for ER and discordantly scored between methods, 15 were reassigned following review to concordant categories. Similarly, 21 cases stained for BCL2 were reassigned to concordant categories following review, of 213 originally discordant cases. Notably, a large proportion of cases classified as ‘positive' by the automated method and ‘negative' by manual assessment (63 (48%)) contained an inflammatory infiltrate which is a probable cause of misclassification since B-lymphocytes express BCL2. Review of discordant cases stained for HER2 resulted in the reclassification of four cases to concordant categories of a total of 66 discordant cases. The reasons for discordance between methods for ER and HER2 arise as a result of the different thresholds used for positivity since the cutpoint at which the automated score was dichotomised was not optimised against the manual dichotomous score. For example, of cases classified as ER negative by automated analysis and ER positive by manual assessment, 56 (79%) were attributed an Allred score of 3 or 4 with just 2 (3%) with scores of 7 and 8.
The utility of IHC in assaying expression and localisation of proteins in tissues has led to its integration in both cancer research and clinical practice. However the subjective and semi-quantitative nature of IHC continues to limit its utility. For the first time, our approach to the problem of objectively interpreting complex microscopic images takes full advantage of existing robust, validated algorithms in the field of astronomy. We have described methods for the automated analysis of immunostains encompassing all three subcellular compartments. These algorithms produce objective continuous data which is highly correlated with manual scores produced by visual inspection. Moreover, we described an unsupervised method for assigning a cutpoint in order to classify ‘positive' and ‘negative' cases. This method showed excellent concordance with classification according to manual scores and very similar associations with survival.
Methods for the automated analysis of in situ protein expression have been previously described and are commercially available (Camp et al, 2002; Cordon-Cardo et al, 2007; Donovan et al, 2008; Rexhepaj et al, 2008; Turbin et al, 2008; Faratian et al, 2009; Turashvili et al, 2009; Bolton et al, 2010; Tuominen et al, 2010; Brugmann et al, 2012). These methods use different assays and different techniques for image analysis. Quantitative immunofluorescence offers the advantage of a larger dynamic range than IHC, however the detection of protein expression in different subcellular compartments is reliant on the simultaneous detection of a protein known to localise to the compartment of interest (Camp et al, 2002). This can limit the potential flexibility of the assay since multiple reactions are conducted on the same tissue section, necessitating the same antigen retrieval conditions for all proteins of interest and antibodies raised in different species in order to avoid cross-reaction (Camp et al, 2002). Techniques previously described for the automated analysis of IHC have been shown to perform well; however, these tend to be limited to stains localising to the nucleus for which commercial methods have also been shown to produce results concordant with manual scores (Rexhepaj et al, 2008; Turbin et al, 2008; Faratian et al, 2009; Tuominen et al, 2010, 2012). Unlike some other methods, our algorithm accounts for staining variability by adjusting for the differences between stained and unstained nuclei. This adjustment is especially important for cases with a more intense counterstain which can otherwise obscure weakly stained nuclei. In addition, by inspecting plots depicting the scatter of stained and unstained nuclei for each slide (Figure 3A–D; Supplementary Figure S1), our method enables the identification of slides with potentially poor-quality or artefactual staining for further consideration.
The phenomenon of bimodality in manual score distribution has been discussed previously with respect to ER (Rimm et al, 2007; Schnitt, 2006). Here, we corroborate the contention that a bimodal distribution of scores is an artefact of human interpretation of subtly different images rather than a true distribution. The histograms presented in Figure 5 illustrate relatively bimodal distributions for BCL2 and ER staining compared with the automated scores which show a more continuous pattern. This illustrates the potential for automated analysis of IHC to better reflect true differences in protein abundance between tumours, hence facilitating improved outcome prediction.
These methods have some limitations. First, the performance of the methods for each subcellular compartment differed significantly. Overall, the cytoplasmic method performed least well of the three in terms of concordantly classified cases and survival prediction compared to manual methods. This is in large part attributable to the misclassification of BCL2-expressing lymphocytes as tumour cells. This represents an area for on-going development and highlights the need for enduring collaboration. It also demonstrates the extent to which particular phenomena may be stain-specific and the advantage of making methodological adjustments as the need arises. The adaptation of existing astronomical algorithms makes this iterative process more flexible and efficient. Second, our attention has focussed on the use of these high-throughput methods for use with TMAs as part of large translational studies. It is in the context of research that these methods are most likely to make an impact. Their potential clinical utility including application to whole-tissue sections has not been evaluated. Indeed, the proportion of cases discordantly classified is greater than would be acceptable in a clinical context. However, for research purposes these techniques have substantial advantages over manual methods including the provision of quantitative information which may uncover novel associations which ultimately influence clinical practice.
Digital pathology represents an important adjunct to genomic data by enabling us to link data across platforms accounting for the cellular heterogeneity of tumours. The progress of genomic research has been substantially facilitated by the existence of public repositories of genomic data. In the same vein, we have made all TMA images and associated algorithms available for public access. We hope that this resource will enable other researchers to contribute to the development of digital pathology and to learn from our experience thus far.
In summary, we have developed a series of algorithms adapted from astronomy for the automated assessment of immunostains localising to the nucleus, cytoplasm and membrane. We find that automated scores show excellent correlation with scores based on visual inspection and can effectively divide the population into ‘positive' and ‘negative' groups significantly associated with outcome in an unsupervised manner. These methods constitute a high-throughput pipeline for the generation of objective, reproducible and continuous IHC data (Walton et al, 2010). This study takes advantage of a unique digital pathology resource by bringing together the expertise of researchers from diverse disciplines in order to develop a true systems pathology approach to cancer medicine.
We are very grateful to the participants of the SEARCH breast study who permitted the use of their tissue for research and to the many individuals who have made this work possible. We acknowledge the SEARCH team, the Eastern Cancer Registration and Information Centre, and the Histopathology Core Facility at the CRUK Cambridge Research Institute for immunohistochemical staining and digital image acquisition.
HRA, MI and CC designed the study. CC, JDB, NW and PDP provided study oversight. MI generated and conducted all image analysis algorithms and contributed to the manuscript. HRA provided histopathological support during algorithm generation, scored TMAs, conducted all statistical analyses comparing methods and wrote the manuscript with CC. LM provided infrastructural and database support. SJD and EP scored TMAs from the SEARCH study. FMB constructed TMAs and managed the clinical database for the SEARCH study. BM-A and HRA conducted histopathological review of cases discordant between automated and manual methods. JDB and NAW conceived of the collaboration and provided strategic advice during its progress. PDP and CC are the project leaders for molecular pathology studies in SEARCH. All authors read and approved the final manuscript.
The authors declare no conflict of interest.
Supplementary Information accompanies this paper on British Journal of Cancer website (http://www.nature.com/bjc)
This work is published under the standard license to publish agreement. After 12 months the work will become freely available and the license terms will switch to a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported License.