Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3441061

Formats

Article sections

- Abstract
- 1 Introduction
- 2 The TACOMA algorithm
- 3 Co-training with RF
- 4 Applications on TMA images
- 5 Discussion
- References

Authors

Related links

Ann Appl Stat. Author manuscript; available in PMC 2012 September 13.

Published in final edited form as:

Ann Appl Stat. 2012 September; 6(3): 1280–1305.

PMCID: PMC3441061

NIHMSID: NIHMS387049

Donghui Yan,^{1} Pei Wang,^{1} Beatrice S. Knudsen,^{2} Michael Linden,^{3} and Timothy W. Randolph^{1}

See other articles in PMC that cite the published article.

Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, the manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm—Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, is absent of sensitive tuning parameters and has the ability to report salient pixels in an image that contribute to its score. Pathologists’ input via informative training patches is an important aspect of the algorithm that allows the training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., with size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high dimensional setting when there is “sufficient” redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists’ performance in terms of accuracy and repeatability.

Tissue microarray (TMA) technology was first described by Wan et al [50] in 1987 and substantially improved by Kononen et al [34] in 1998 as a high-throughput technology for the assessment of protein expression in tissue samples. As shown in the top panel of Figure 1, the construction of a TMA begins with cylindrical cores extracted from a donor block of formalin-fixed and paraffin-embedded tissues. The cores are transferred to the grid of the recipient block. This grid is generated by punching cylindrical holes at equal distance into a precast rectangular mold of solid paraffin wax. Once all the holes are filled with donor cores, the block is heated to fuse the cores to the wax of the block. Normally, recipient blocks contain 360 to 480 tissue cores from donor blocks, often in triplate samples from each block and are thus called tissue micro arrays (TMA). They are sectioned transversely and each section is captured on a glass slide, such that slides display a crossection of each core in a grid-like fashion. More than 100 slides can be generated from each TMA block for analysis with a separate probe. This standardizes the hybridization process of the probe across hundreds of tissue samples. The use of TMAs in cancer biology has increased dramatically in recent years [9, 23, 27, 47] for the rapid evaluation of DNA, RNA and protein expressions on large numbers of clinical tissue samples; they remain the most efficient method for validating proteomics data and tissue biomarkers. We limit our discussion to high-density immunohistochemistry (IHC) staining, a method used for the measurement of protein expression, as the most common method for subcellular localization on a per cell basis.

The evaluation of protein expression requires the quantification, or scoring, of a TMA image. The scores can be used for the validation of biomarkers, assessment of therapeutic targets, analysis of clinical outcome etc [27]. The bottom panel of Figure 1 gives an example of several TMA images with scores assigned at a 4-point scale (see Section 4 for details).

Although the construction of TMAs has been automated for large-scale interrogation of markers in tissue samples, several factors limit the use of the TMA as a high-throughput assay. These include the variability, subjectivity and time-intensive effort inherent in the visual scoring of staining patterns [9, 48]. Indeed, a pathologist’s score relies on subjective judgments about colors, textures, intensities, densities and spatial relationships. As noted in [23], however, the human eye cannot provide an objective quantification that can be normalized to a reference. In general, problems stemming from the subjective and inconsistent scoring by pathologists are well known and have been highlighted by several studies [3, 4, 19, 33, 44, 49]. Thus, as study size grows, the value of TMAs in a rigorous statistical analysis may actually decrease without a consistent and objective scoring process.

These concerns have motivated the recent development of a variety of tools for automated scoring, ranging from sophisticated image enhancement tools, tissue segmentation to computer-assisted pathologist-based scoring. Many are focused on a particular cellular pattern, with HER2 (exhibiting nuclear staining) being the most commonly targeted marker; see, e.g., [25, 31, 36, 42, 43]. For a survey of commercial systems, we refer to [37] or [41], and also the review by Cregger et al. [16] which acknowledges that, given the rapid changes in this field, this information may become outdated as devices are abandoned, improved or newly developed. A property of most automated TMA scoring algorithms is that they rely on various forms of background subtraction, feature segmentation and thresholds for pixel intensity. Tuning of these algorithms can be difficult and may result in models sensitive to several variables including staining quality, background antibody binding, counterstain intensity, and the color and hue of chromogenic reaction products used to detect antibody binding. Moreover, such algorithms typically require tuning from the vendors with parameters specific to the markers’ staining pattern (e.g., nuclear versus cytoplasmic), or even require a dedicated person for such a system.

To address the further need for scoring of TMAs in large biomarker studies, we propose a framework—called Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—that is trainable to any staining pattern or tissue type. By seeking texture-based patterns invariant in the images, TACOMA does not rely on intensity thresholds, color filters, image segmentation or shape recognition. It recognizes specific staining patterns based on expert input via a preliminary set of image patches. In addition to providing a score or categorization, TACOMA allows to see which pixels in an image contribute to its score. This clearly enhances interpretability and confidence in the results.

It should be noted that TACOMA is not designed for clinical diagnosis but rather a tool for use in large clinical studies that involve a range of potential biomarkers. Since many thousands of samples may be required, the cost and time required for pathologist-based scoring may be prohibitive and so an efficient automated alternative to human scoring can be essential. TACOMA is a framework for such a purpose.

An important concern in biomedical studies is that of the limited training sample size^{1}. The size of the training set may necessarily be small due to the cost, time or human efforts required to obtain them. We adopt co-training [53, 6] in the context of TACOMA to substantially reduce the training sample size. We explore the thinning of the feature set for co-training when a ‘natural’ split is not readily available but the features are fairly redundant, and this is supported by our theory that a thinned slice carries about the same classification power as the whole feature set under some conditions.

The organization of the remainder of this paper is as follows. We describe the TACOMA algorithm in Section 2, this is followed by a discussion on co-training to reduce the training sample size with some theoretical insights on the thinning scheme in Section 3. Then in Section 4, we present our experimental results. We conclude with a discussion in Section 5.

The primary challenge TACOMA addresses is the lack of easily-quantified criteria for scoring: features of interest are not localized in position or size. Moreover, within any region of relevance—one containing primarily cancer cells—there is no well-defined (quantifiable) shape that characterizes a pattern of staining (see, for example, the bottom panel of Figure 1 for an illustration). The key insight that underlies TACOMA is that, despite the heterogeneity of TMA images, they exhibit strong statistical regularity in the form of visually observable textures or staining pattern (see, for example, Figure 2(b)). And, with the guidance of pathologists, TACOMA can be trained for this pattern regardless of the cancer cell type (breast, prostate, etc.) or marker type (e.g., nucleus, cytoplasmic, etc).

TACOMA captures the texture patterns exhibited by TMA images through a matrix of counting statistics, the Gray Level Co-occurrence Matrix (GLCM). Through a small number of representative image patches, TACOMA constructs a feature mask so that the algorithm will focus on those biologically relevant features (i.e., a subset of GLCM entries). Besides scoring, TACOMA also reports salient image pixels (i.e., those contribute to the scoring of an image) which will be useful for the purpose of training, comparison of multiple TMA images, estimation of staining intensity etc. For the rest of this section, we will briefly discuss these individual building blocks of TACOMA followed by an algorithmic description of TACOMA.

The GLCM was originally proposed by Haralick [26] and has proven successful in a variety of remote-sensing applications [51]. The GLCM, of an image, is a matrix whose entries count the frequency of transitions between pixel intensities across neighboring pixels with a particular spatial relationship; see Figure 2. The description here is essentially adopted from [51]. We start by defining the spatial relationship between a pair of pixels in an image.

A spatial relationship has two elements, the direction and the distance of interaction. The set of all possible spatial relationships is defined as

$$\begin{array}{ccc}\Re \hfill & =\hfill & D\otimes L\hfill \\ \hfill & =\hfill & \{\nearrow ,\searrow ,\nwarrow ,\swarrow ,\downarrow ,\uparrow ,\to ,\leftarrow \}\otimes \{1,\dots ,d\}\hfill \end{array}$$

where *D* is the set of potential directions and *L* is the distance of interaction between the pair of pixels involved in a spatial relationship. The distance of interaction is the minimal number of steps required to move from one pixel to the other along a given direction. The particular spatial relationships used in our application are (, 3), (, 1) and (, 1). Details about the choice of these spatial relationships can be seen in Section 4.

Although the definition of spatial relationships can be extended to involve more pixels [51], we have focused on pairwise relationships which appear to be sufficient. Next we define the GLCM.

Let *N _{g}* be the number of gray levels in an image

A

N×_{g}Nmatrix such that its (_{g}a, b)-entry counts the number of pairs of pixels, with gray valuesaandb, respectively, having a spatial relationship ~, fora, b{1, 2, …,N}._{g}

This definition is illustrated in Figure 2(a) with more realistic examples in Figure 2(b). Figure 2(b) gives a clear indication as to how the GLCM distinguishes between TMA images having different staining patterns.

Our use of the GLCM is nonstandard in that we do not use any of the common scaler-valued summaries of a GLCM (see [26] and [13]), but instead employ the entire matrix (with masking) in a classification algorithm (see also [51]). A GLCM may have a large number of entries, typically thousands, however, the exceptional capability of Random Forests [7] in feature selection allows us to directly use all (or a masked subset of) GLCM entries to determine a final score or classification.

In order to incorporate prior knowledge about the staining pattern, we mask the GLCM matrix so that the scoring will focus on biologically pertinent features. This is realized by first choosing a set of image patches representing regions that consist predominantly of cancer cells and are chosen to represent the staining patterns; see Figure 3. The collection of GLCMs from these patches are then used to define a template of “significant entries” (c.f., TACOMA algorithm in Section 2.5) for all future GLCMs: when the GLCM of a new image is formed, only the entries that correspond to this template are retained. This masking step enforces the idea that features used in a classifier should not be based on stromal, arterial or other non-pertinent tissue which may exhibit non-specific or background staining. Note that only one small set of image patches is required; these images patches are used to produce a common feature mask to be used throughout for all images in both training and scoring.

In this manner, feature selection is initiated by expert biological knowledge. This involves little human effort but gains substantially in both interpretability and accuracy. The underlying philosophy is that no machine learning algorithms beat domain knowledge. Since by using image patches we do not indicate which features to select but instead specify their effect, we achieve the benefits of a manual-based feature selection but avoid its difficulty. This is a novel form of nonparametric, or implicit, feature selection which is applicable to settings beyond TMAs.

TACOMA uses Random Forests (RF) [7] as the underlying classifier. RF was proposed by Breiman and is considered one of the best classifiers in high dimensional setting [10]. In our experience, RF achieves significantly better performance than SVM and Boosting on the TMA images we use (see Table 1). Additionally Holmes et al [29] argue that RF is superior to others in dealing with tissue images. The fundamental building block of RF is a tree-based classifier which can be non-stable and sensitive to noise. RF takes advantage of such instability and creates a strong ensemble by bagging a large number of trees [7]. Each individual tree is grown on a bootstrap sample from the training set. For the splitting of tree nodes, RF randomly selects a number of candidate features or linear combinations of features and splits the tree node with the one that achieves the most reduction in the node impurity as defined by the Gini index (or other measures such as the out of bag (oob) estimates of generalization error) defined as follows.

$$\varphi \left(p\right)=\sum _{i=1}^{c}{p}_{i}(1-{p}_{i})$$

(1)

where ** p** = (

Performance comparison of RF, SVM, and Boosting of a naive Bayes classifier. The result for RF is adopted from Section 4.1. For SVM, we vary the choice of the kernel from {Gaussian, polynomial, sigmoid} with the best tuning parameters for N_{g} **...**

To test a future example *X*, let *X* fall from each tree for which *X* receives a vote for the class of the terminal node it reaches. The final class membership of *X* is obtained by a majority vote over the number of votes it receives for each class. The features are ranked by their respective reduction of node impurity as measured by the Gini index. Alternatives include the permutation-based measure, that is, permute variables one at a time and then rank according to the respective amount of decrease in accuracy (as estimated on oob observations).

A valuable property of TACOMA is its ability to report salient pixels in an image that determine its score (see Figure 8). This is based on a correspondence between the position of pixels in an image and entries in its GLCM, and made possible by the remarkable variable-ranking capability of RF. Here we use the importance measure (Gini index-based) provided by RF to rank the variables (i.e., entries of the GLCM) and then collect relevant image pixels associated with the important entries.

Since each entry of a GLCM is a counting statistic involving pairs of pixels, we can associate the (*a, b*)-entry of a GLCM with those pixels that make up this GLCM entry. The set of image pixels that are associated with the (*a, b*)-entry of a GLCM is formally represented as

$${\mathcal{G}}_{a,b}=\{x,y:x~y,I(x)=a,I(y)=b\}.$$

In the above representation, *x* and *y* represent the position of image pixels and we treat an image *I* as a map from the position of an image pixel to its gray value. Note that not all pairs of pixels with *x* ~ *y* such that *I*(*x*) = *a, I*(*y*) = *b* correspond to salient spots in a TMA image. However, if the (*a, b*)-feature is ‘important’ (e.g., as determined by RF) then typically most pixels in the set ${\mathcal{G}}_{a,b}$ are relevant.

Whereas RF appears to be a black box—taking a large number of GLCM features and producing a score—salient pixels provide a quick peek into its internals. Effectively, RF works in roughly the same manner as a pathologist, that is, they both use salient pixels to score the TMA images; the seemingly mysterious image features are merely a way of representation that is to be understood by a computer algorithm.

Denote the training sample by (*I*_{1}, *Y*_{1}), …, (*I _{n}, Y_{n}*) where

In the above description, *τ _{i}* is chosen as the median of entries of matrix

1: for i = 1 to L do |

2: compute the GLCM for the i image patch and denote by ^{th}Z;_{i} |

3: M ← the index set of _{i}Z that survive thresholding at level _{i}τ;_{i} |

4: end for |

5: $M\leftarrow {\cup}_{j=1}^{L}{M}_{j}$; |

6: for k = 1 to n do |

7: compute the GLCM of image I and keep only entries in index set _{k}M; |

8: denote the resulting matrix by X;_{k} |

9: end for |

10: Feed ${\cup}_{l=1}^{n}\left\{\right({X}_{l},{Y}_{l}\left)\right\}$ into the RF classifier and obtain a classification rule. |

The sample size is an important issue in the scoring of TMA images, mainly because of the high cost and human efforts involved in obtaining a large sample of high quality labels. For instance, it may take several hours for a well-trained pathologist to score 100 TMA images. Unfortunately, it is often the case that the classification performance drops sharply when the training sample size is reduced. For example, Figure 6 shows the error rate of TACOMA when the sample size varies. Our aim is to achieve reasonable accuracy for small sample size and co-training is adopted for this purpose.

Co-training was proposed in the landmark papers by Yarowsky [53] and Blum and Mitchell [6]. It is an effective way in training a model with an extremely small labeled sample and has been successfully applied in many applications [6, 38]. The idea is to train two separate classifiers (called coupling classifiers) each on a different set of features using a small number of labeled examples. Then the two classifiers iteratively transfer those confidently classified examples, along with the assigned label, to the labeled set. This process is repeated until all unlabeled examples have been labeled. For an illustration of the idea of co-training, see Figure 7. Co-training is relevant here due to the natural redundancy that exists among features that are based on GLCMs corresponding to different spatial relationships.

A learning mode that is closely related to co-training is self-learning [38] where a single classifier is used in the ‘*label* → *transfer →label*’ loop (c.f. Figure 7). However, empirical studies have shown that co-training is often superior [38]; the intuition is that, co-training allows the two coupling classifiers to progressively expand the ‘knowledge boundary’ of each other which is absent in self-learning.

Previous works in co-training use almost exclusively Expectation Maximization or Naive Bayes based classifiers where the posterior probability serves as the “confidence” required by co-training. Here we use RF [7] where the margin (to be defined shortly) provided by RF is used as a “natural” proxy for the “confidence”. The margin is defined through the votes received by an observation. For an observation *x* in the test set, let the number of votes it receives for the *i ^{th}* class be denoted by

$$\underset{i\in \{1,\dots ,C\}}{\mathrm{max}}{N}_{i}\left(x\right)-\underset{j\in \{1,\dots ,C\}}{\text{second}}{N}_{j}\left(x\right)$$

where *second* in the above indicates the second-largest element in a list.

To give an algorithmic description of co-training, let the two subsets of features be denoted by ${\mathcal{F}}_{1}$ and ${\mathcal{F}}_{1}$, respectively. Let the set of labeled and unlabeled examples be denoted by $\mathcal{L}$ and $\mathcal{U}$, respectively. The co-training process proceeds as Algorithm 2. The final error rate and class membership are determined by a fixed coupling classifier, say *f*_{1}. We set *m*_{1} = *m*_{2} = 2 in our experiments according to [6].

Co-training requires two subsets of features (or a feature split). However, co-training algorithms rarely provide a recipe for obtaining these feature splits. There are several possibilities one can explore.

The first is called a “natural” split, often resulting from an understanding of the problem structure. A rule of thumb as to what constitutes a natural

1: while the set $\mathcal{U}$ is not empty do |

2: for k = 1, 2 do |

3: Train RF classifier f on labeled examples from $\mathcal{L}$ using feature set ${\mathcal{F}}_{k}$;_{k} |

4: Classify examples in the set $\mathcal{U}$ with f;_{k} |

5: Under f, calculate the margin for each observation in $\mathcal{U}$;_{k} |

6: pick m observations, ${x}_{1}^{\left(k\right)},\dots ,{x}_{{m}_{k}}^{\left(k\right)}$, with the largest margins;_{k} |

7: end for |

8: $\mathcal{L}\leftarrow \mathcal{L}\cup \{{x}_{1}^{\left(1\right)},\dots ,{x}_{{m}_{1}}^{\left(1\right)},{x}_{1}^{\left(2\right)},\dots ,{x}_{{m}_{2}}^{\left(2\right)}\}$ |

9: $\mathcal{U}\leftarrow \mathcal{U}\setminus \{{x}_{1}^{\left(1\right)},\dots ,{x}_{{m}_{1}}^{\left(1\right)},{x}_{1}^{\left(2\right)},\dots ,{x}_{{m}_{2}}^{\left(2\right)}\}$ |

10: end while |

split is that each of the two feature subsets alone allows one to construct an acceptable classifier and that the two subsets somehow complement each other (e.g., conditional independence given the labels). Fortunately, TMA images represented in GLCM’s naturally have such properties. For a given problem, often there exist several *spatial relationships* (for example, (, 3) and (, 1) for TMA images studied in this work) with each inducing a GLCM sufficient to construct a classifier while the “dependence” among the induced GLCM’s is usually low. Thus it is ideal to apply co-training on TMA images using such natural splits.

When there is no natural split readily available, one has to find two proper subsets of features. One way is via random splitting. Co-training via a random split of features was initially considered by Nigam and Ghani [38] but has since been largely overlooked in the machine learning literature. Here we extend the idea of random splits to “thinning”, which is more flexible and potentially may lead to a better co-training performance. Specifically, rather than randomly splitting the original feature set $\mathcal{F}=\{1,\dots ,p\}$ into two halves, select two disjoint subsets of $\mathcal{F}$ with size not necessary equal but non-vanishing compared to *p*. This leads to feature subsets smaller than $\mathcal{F}$, hence the name “thinning”. One concrete implementation of this is to divide $\mathcal{F}$ into a number of, say *J*, equal-sized partitions (each partition is also called a thinned slice of $\mathcal{F}$). In the following discussion, unless otherwise stated, thinning always refers to this concrete implementation. It is clear that this includes random splits as a special case. Thinning allows one to construct self-learning classifier (the features are taken from one of the *J* partitions), co-training (randomly pick 2 out of *J* partitions), and so on. For a given problem, one can explore various alternatives associated with thinning but here we shall focus on co-training.

The extension of random split to thinning may lead to improved co-training performance, as thinning may make features from different partitions less dependent and meanwhile well preserves the classification power in a high-dimensional setting when there is sufficient redundancy among features (see Section 3.2). The optimal number of partitions can be selected by heuristics such as the kernel independence test [2, 24], which we leave for future work.

According to Blum and Mitchell [6], one essential ingredient of co-training is the “high” confidence of the two coupling classifiers in labeling the unlabeled examples. This is closely related to the strength of the two coupling classifiers which is in turn determined by the feature subsets involved. In this section, we study how much a thinned slice of the feature set $\mathcal{F}$ preserves its classification power. This provides insight into the nature of thinning and is interesting at its own right due to its close connection to several lines of interesting work [28, 17] in machine learning (see Section 3.2.2). We present our theoretical analysis in Section 3.2.1 and list related work in Section 3.2.2. In Appendix 6.3, we provide additional simulation results related to our theoretical analysis.

Our theoretical model is the Gaussian mixture specified as

$$\Pi \mathcal{N}({\mu}_{1},\Sigma )+(1-\Pi )\mathcal{N}({\mu}_{2},\Sigma ),$$

(2)

where Π {0, 1} indicates the label of an observation such that $\mathbb{P}(\Pi =1)=\pi $, and $\mathcal{N}(\mu ,\Sigma )$ stands for Gaussian distribution with mean $\mu \in {\mathbb{R}}^{p}$ and covariance matrix ∑. For simplicity, we consider $\pi =\frac{1}{2}$ and the 0-1 loss. We will define the ratio of separation as a measure of the fraction of “information” carried by the subset of features due to thinning with respect to that of the original feature set and show that this quantity is “preserved” upon thinning. For simplicity, we take *J* = 2 (i.e., random splits of $\mathcal{F}$) and similar discussion applies to *J* > 2.

Let the feature set $\mathcal{F}$ be decomposed as

$$\mathcal{F}={\mathcal{F}}_{1}\cup {\mathcal{F}}_{2},\phantom{\rule{thinmathspace}{0ex}}\text{such that}{\mathcal{F}}_{1}\cup {\mathcal{F}}_{2}=\varnothing \phantom{\rule{thinmathspace}{0ex}}\text{and}\mid {\mathcal{F}}_{1}\mid =\frac{p}{2}\triangleq m.$$

(3)

We will show that each of the two subsets of features, ${\mathcal{F}}_{1}$ and ${\mathcal{F}}_{2}$, carries a substantial fraction of the “information” contained in the original data when *p* is large, assuming the data is generated from Gaussian mixture (2).

A quantity that is crucial in our inquiry is

$${S}_{\mathcal{F}}={u}_{\mathcal{F}}^{T}{\Sigma}_{\mathcal{F}}^{-1}{u}_{\mathcal{F}}$$

(4)

where ${u}_{\mathcal{F}}={({\mu}_{1}-{\mu}_{2})}_{\mathcal{F}}={({U}_{1},{U}_{2},\dots ,{U}_{p})}_{\mathcal{F}}$. We call ${S}_{\mathcal{F}}$ the separation of the Gaussian mixture induced by the feature set $\mathcal{F}$. The separation is closely related to the Bayes error rate for classification through a well-known result in multivariate statistics.

*For Gaussian mixture*
(2)
*and 0-1 loss, the Bayes error rate is given by*
$\Phi (-\frac{1}{2}{\left({u}_{\mathcal{F}}^{T}{\Sigma}^{-1}{u}_{\mathcal{F}}\right)}^{1\u22152})$
*where* Φ(·) *is defined as*
$\Phi \left(x\right)={\int}_{-\infty}^{x}\frac{1}{\sqrt[]{2\pi}}{e}^{-\frac{{z}^{2}}{c}}dz$.

Let the covariance matrix ∑ be written as

$$\Sigma =\left[\begin{array}{cc}\hfill A\hfill & \hfill {B}^{T}\hfill \\ \hfill B\hfill & \hfill D\hfill \end{array}\right]$$

where we assume block *A* corresponds to features in ${\mathcal{F}}_{1}$ after a permutation of rows and columns. Accordingly, write ** u** as ${\mathit{u}}_{\mathcal{F}}=({\mathit{u}}_{{\mathcal{F}}_{1}},{\mathit{u}}_{{\mathcal{F}}_{2}})$ and defined ${S}_{{\mathcal{F}}_{1}}$ (called the separation induced by ${\mathcal{F}}_{1}$) similarly as (4). Now we can define the

$$\gamma =\frac{{S}_{{\mathcal{F}}_{1}}}{{S}_{\mathcal{F}}}.$$

(5)

To see why definition (5) is useful, we give here a numerical example. Assume there is a Gaussian mixture defined by (2) such that ∑_{100×100} is a tri-diagonal matrix with diagonals being all 1 and off-diagonals being 0.6, ${\mathit{u}}_{\mathcal{F}}={(1,\dots ,1)}^{T}$. Suppose one picks the first 50 variables and form a new Gaussian mixture with covariance matrix *A* and mixture center distance ${\mathit{u}}_{{\mathcal{F}}_{1}}$. We wish to see how much is affected in terms of the Bayes error rate. We have

$${S}_{\mathcal{F}}=45.87,\Phi \left(-\frac{1}{2}({u}_{\mathcal{F}}^{T}{\Sigma}^{-1}u{\mathcal{F})}^{\frac{1}{2}}\right)=3.54\times {10}^{-4}$$

$$S{\mathcal{F}}_{1}=23.32,\Phi \left(-\frac{1}{2}{\left(u{{\mathcal{F}}_{1}}^{T}{A}^{-1}u{\mathcal{F}}_{1}\right)}^{\frac{1}{2}}\right)=7.87\times {10}^{-3}$$

and *γ* = 0.5084. Here the difference between feature set ${\mathcal{F}}_{1}$ and $\mathcal{F}$ is very small in terms of their classification power. In general, if the dimension is sufficiently high and *γ* is non-vanishing, then using a subset of features will not incur much loss in classification power. In Theorem 2, we will show that, under certain conditions, *γ* does not vanish (i.e., *γ > c* for some positive constant *c*) so a feature subset is as good as the whole feature set in terms of classification power.

Our main assumption is actually a technical one, which is related to the “local” dependency among components of ** u** after some variable transformation. The exact context will become clear later in the proof of Theorem 2. For now, let ∑ have a Cholesky decomposition ∑ =

Our main result is the following theorem.

Assume the data are generated from Gaussian mixture (2). *Further assume the smallest eigenvalue of* ∑^{−1}, *denoted by λ _{min}*(∑

$$\frac{{S}_{{\mathcal{F}}_{1}}}{S\mathcal{F}}\ge {\left(\frac{1}{2}\right)}^{-}$$

*in probability as p* → ∞ *where* (*a*)^{−}
*indicates any constant smaller than a*. *When the number of partitions J* > 2, *the right-hand side is replaced by* 1*/J*.

See appendix for proof.

There are mainly two lines of work closely related to ours. One is the Johnson-Lindenstrauss lemma and related [30, 17]. The Johnson-Lindenstrauss (or J-L) lemma states that, for Gaussian mixtures in high-dimensional space, upon a random projection to a low-dimensional subspace, the separation between the mixture centers in the projected space is “comparable” to that in the original space with high probability. The difference is that the random projection in J-L is carried out via a nontrivial linear transformation and the separation is defined in terms of the Euclidean distance whereas, in our work, random projection is performed coordinate-wise in the original space and we define the separation with the Mahalanobis distance.

The other is the random subspace method [28], an early variant of the RF classifier ensemble algorithm that is comparable to bagging and Adaboost in terms of empirical performance. The random subspace method grows a tree by randomly selecting half of the features and then constructs a tree-based classifier. However, beyond simulations there has been no formal justification for the random selection of half of the features. Our result provides support on this aspect. In high dimensional data setting where the features are “redundant”, our result shows that a randomly selected half of the features can lead to a tree comparable, in terms of classification power, to a classifier that uses all the features; meanwhile the random nature of the set of features used in each tree makes the correlation between trees small, so good performance can be expected.

Our theoretical result, when used in co-training, can be viewed as a manifestation of the “blessings of the dimensionality” [20]. For high dimensional data analysis, the conventional wisdom is to do dimension reduction or projection pursuit. As a result, the “redundancy” among the features is typically not used and, in many cases, even becomes the nuisance one strives to get rid of. This is clearly a waste. When the “redundancy” among features is complementary, such redundancy actually allows one to construct two coupling learners from which co-training can be carried out. It should be emphasized that the splitting of the feature set works because of redundancy. We believe the exploration of this type of redundancy will have important impact in high dimensional data analysis.

To assess the performance of TACOMA, we evaluate a collection of TMA images from the Stanford Tissue Microarray Database, or STMAD (see [35] and http://tma.stanford.edu/). TMAs corresponding to the potential expression of the estrogen receptor (ER) protein in breast cancer tissue are used since ER is a histologically well-studied marker that is expressed in the cell nucleus. An example of TMA images can be seen in Figure 2. There are 641 TMA images in this set and each image has been assigned a score from {0, 1, 2, 3}. The scoring criteria are: ‘0’ representing a definite negative (no staining of cancer cells), ‘3’ a definitive positive (a majority of cancer cells show dark nucleus staining) and ‘2’ for positive (a minority of cancer cells show nucleus staining or a majority show weak nucleus staining). The score ‘1’ indicates ambiguous weak staining in a minority of cancer cells. The class distribution of the scores is (65.90%, 2.90%, 7.00%, 24.20%). Such an unbalanced class distribution makes the scoring task even more challenging. In our experiments, we assess the accuracy by the proportion of images in the test sample that receives the same score by a scoring algorithm as that given by the reference set (e.g., STMAD). We split the images into a training and a test set of sizes 313 and 328, respectively. In the following, we will first describe our choice of various design options and parameters, and then report performance of TACOMA on large training set (i.e., a set of 313 TMA images) and small training set (i.e., a set of 30 TMA images) with co-training in Section 4.1 and Section 4.2, respectively.

We use three spatial relationships, (, 3), (, 1) and (, 1) in our experiments. In particular, (, 3) is used in our experiment on TACOMA for large training set while (, 1) is used along with (, 3) in our co-training experiment and (, 1) in additional experiment (see Table 3). Often (, 1) is the default choice in applications, we use (, 3) here to reflect the granularity of the visual pattern seen in the images. Indeed, as the staining patterns in TMA images for ER markers occur only in nucleus, (, 3) leads to a slight better scoring performance than (, 1) according to our experiment on TACOMA; moreover, no significant difference is observed for TACOMA when simply concatenating features derived from (, 3) and (, 1). (, 1) is used in co-training in hoping that it is less correlated to features derived from (, 3) than others as these two are on the orthogonal directions.

Accuracy of TACOMA on TMA images corresponding to protein markers CD117, CD34, NMB and ER. Except for ER (which has a fixed training and test set), we use (, 1) and 80% of the instances for training and the rest for test; this is repeated for **...**

For a good balance of computational efficiency, discriminative power, as well as ease of implementation, we take *N _{g}* = 51 (our experiments are not particularly sensitive to the choice of

We use the R package ‘randomForest’ ^{3} in this work. There are two important parameters, the number of trees in the ensemble and the number of features to explore at each node split. These are searched through {50, 100, 200, 500} and $\{0.5\sqrt[]{p},\sqrt[]{p},2\sqrt[]{p}\}$ (the default value is $\sqrt[]{p}$ as suggested by the R package for *p* the number of features fed to RF), respectively, in this work and the best test set error rates are reported. More information on RF can be seen in [7].

The full set of 313 TMA images in the training set are used in this case. We run TACOMA on the training set (scores given by STMAD) and apply the trained classifier to the test set images to obtain TACOMA scores. Then, we blind STMAD scores in the test set of 328 images (100 of which are duplicated so totally there are 428 images) and have them re-evaluated by two experienced pathologists from two different institutions. The 100 duplicates allow us to evaluate the self-consistency of the pathologists.

Although the scores from STMAD do not necessarily represent ground truth, they serve as a fixed standard with respect to which the topics of accuracy and reproducibility can be examined. On the test set of 328 TMA images, TACOMA achieves a classification accuracy of 78.57% (accuracy defined as the proportion of images receiving the same score as STMAD). We argue this is close to the optimal. The Bayes rate is estimated for this particular data example (represented as GLCMs) with a simulation using a 1-nearest neighbor (1*NN*) classifier. The Bayes rate refers to the theoretically best classification rate given the data distribution. With the same training and test sets as RF classification, the accuracy achieved by 1*NN* is around 60%. According to a celebrated theorem of Cover and Hart [15], the error rate by 1*NN* is at most twice that of the Bayes rule. This implies an estimate of the Bayes rate at around 80% subject to small sample variation (the estimated Bayes rate on the original image or its quantized version are all bounded above by this number according to our simulation). Thus TACOMA is close to optimal.

In the above experiments, we use 4 image patches. To see if TACOMA is sensitive to the choice of image patches, we conduct experiments over a range of different patch sets and achieve an average accuracy at 78.66 ± 0.52%. This indicates that TACOMA is robust to the choices of image patches.

The ability of TACOMA to detect salient pixels is demonstrated in Figure 8 where image pixels are highlighted in white if they are associated with a significant scoring feature. These highlighted pixels are verified by the pathologists to be indicative. With relatively few exceptions, these locations correspond to areas of stained nuclei in cancer cells. We emphasize that these highlighted pixels indicate features most important for classification as opposed to identifying every property indicative of ER status. The highlighted pixels facilitate interpretation and the comparison of images by pathologists.

The superior classification performance of TACOMA is also demonstrated by scores provided by the two pathologists. These two copies of scores, along with STMAD, provide three independent pathologist-based scores. Among these, 142 images receive a unanimous score. Consequently, these may be viewed as a reference set of “true” scores against which the accuracy of TACOMA might be evaluated (accuracy being defined as the proportion of images receiving the same score as the reference set). Here, TACOMA achieves an accuracy of 90.14%; see Figure 9.

Scores provided by the two pathologists are also used to assess their self-consistency. Here self-consistency is defined as the proportion of repeated images receiving an identical score by the same pathologist. Consensus among different pathologists is an issue of concern [40, 49]. In order to obtain information about the self-consistency of pathologist-based scores, 100 images are selected from the set of 328 images. These 100 images are rotated and/or inverted, and then mixed at random with the 328 images to avoid recognition (so each pathologist actually scored a total of 428 TMA images). The self-consistency rates of the two pathologists are found to be in the range 75-84%. Of course, one desirable feature of any automated algorithm such as TACOMA is its complete (i.e., 100%) self-consistency.

We choose to use RF in TACOMA. Some popular alternatives include support vector machines (SVM) [14], Boosting [22] and Bayesian network [39] etc. Using the same training and test set as that for RF, we conduct experiments with SVM as well as Boosting of a naive Bayes classifier. The input features for both SVM and Boosting are the entries of the GLCM. We use the Libsvm [11] software for SVM. The naive Bayes classifier is adopted from [51]. The idea is to find the class that maximize the posterior probability for a new observation, denoted by

$$\mathrm{arg}\underset{k\in \{0,1,2,3\}}{\mathrm{max}}\mathrm{Prob}\{k\mid I(1,1)={a}_{1,1},\dots ,I({N}_{g},{N}_{g})={a}_{{N}_{g},{N}_{g}}\}$$

$${Y}_{i}=\sum _{j=1}^{p}{h}_{i,j}{U}_{j}$$

(6)

under the assumption that (*I*(1, 1), …, *I*(*N _{g}, N_{g}*)) follows a multinomial distribution. More details can be found in [51]. The results are shown in Table 1. We can see that RF outperforms SVM and Boosting by a large margin. This is consistent with observations made by Holmes et al [29].

We conduct experiments on co-training with natural splits and thinning. For natural splits, we use GLCM’s corresponding to two spatial relationships, (, 3) and (, 1), as features. For thinning, we combine features corresponding to (, 3) and (, 1) and then split this combined feature set.

The number of labeled examples is fixed at 30 (compared to 313 in experiments with large training set). This choice is to make it easy to get a nonempty class 1 (which carries only about 2.90% of the cases). We suspect this number can be further reduced without suffering much in learning accuracy. The test set is the same as that in Section 4.1. The result is shown in Table 2. One interesting observation is that, co-training by thinning achieves an accuracy very close to that by natural splits. Additionally, Table 2 also lists error rates given by RF on features corresponding to (, 3) (, 1) and its thinned subsets. Here thinning of the feature set does not cause much loss in RF performance, consistent with our discussion in Section 3.2.

This study focuses on the ER marker for which the staining is nuclear. However, the TACOMA algorithm can be applied with equal ease to markers that exhibit cell surface, cytoplasm or other staining patterns. Additional experiments are conducted on the Stanford TMA images corresponding to three additional protein markers: CD117, CD34 and NMB. These three sets of TMA images are selected for their large sample size and relatively few missing scores (excluded from experiment). The results are shown in Table 3. In contrast, the automated scoring of cytoplasmic markers is often viewed as more difficult and refined commercial algorithms for these were reportedly not available in a recent evaluation [9] of commercial scoring methods.

We have presented a new algorithm that automatically scores TMA images in an objective, efficient, and reproducible manner. Our contributions include: 1) the use of co-occurrence counting statistics to capture the spatial regularity inherent in a heterogeneous and irregular set of TMA images; 2) the ability to report salient pixels in an image that determine its score; 3) the incorporation of pathologists’ input via informative training patches so that our algorithm can easily adapt to specific markers and cell types; 4) a very small training sample is achievable with co-training and we have provided some theoretical insights into co-training via thinning of the feature set. Our experiments show that TACOMA can achieve performance comparable to well-trained pathologists. It uses the similar set of pixels for scoring as that would be used by a pathologist and is not adversely sensitive to the choice of image patches. The theory we have developed on the thinning scheme in co-training gives insights on why thinning may rival the performance of a natural split in co-training; a thinned slice may be as good as the whole feature set in terms of classification power hence thinning can lead to two strong coupling classifiers that will be used in co-training and this is what a natural split may achieve.

The utility of TACOMA lies in large population-based studies that seek to evaluate potential markers using IHC in large cohorts. Such a study may be compromised by a scoring process that is protracted, prohibitively expensive or poorly reproducible. Indeed, a manual scoring for such a study could require hundreds of hours of pathologists’ time without achieving a reproducibly consistent set of scores. Experiments with several IHC markers demonstrates that our approach has the potential to be as accurate as manual scoring while providing a fast, objective, inexpensive and highly reproducible alternative. Even more generally, TACOMA may be adopted to other types of textured images such as those appearing in remote sensing applications. These properties provide obvious advantages for any subsequent statistical analysis in determining the validity or clinical utility of a potential marker. Regarding reproducibility, we note that the scores provided by two pathologists in our informal pilot study revealed an intra-observer agreement of around 80% and an accuracy only in the range of 70%, as defined by the STMAD reference set (excluding all images deemed unscorable by the pathologists). This low inter-observer agreement may be attributed to a variety of factors including a lack of a subjective criteria used for scoring or the lack of training against an established standard. This performance could surely be improved upon, but it highlights the inherent subjectivity and variability of human-based scoring.

In summary, TACOMA provides a transparent scoring process that can be evaluated with clarity and confidence. It is also flexible with respect to marker patterns of cellular localization: although the ER marker is characterized by staining of the cell nucleus, TACOMA applies with comparable ease and success to cytoplasmic or other marker staining patterns (see Table 2 in Section 4.1).

A software implementation of TACOMA is available upon request and the associated R package will be made available to R project.

This appendix consists of four parts, corresponding to the Assumption ${\mathcal{A}}_{1}$, proofs for Theorem 2, simulation results for the ratio of separation, and additional large TMA images for salient pixel detection, respectively.

Let ${H}^{-1}={\left({h}_{i,j}\right)}_{i,j=1}^{p}$. It is known that *H*^{−1} is also a lower triangular matrix. The components of ** b** can be expressed as

$$\sum _{j=1}^{i}{h}_{i,j}^{2},\phantom{\rule{1em}{0ex}}\sum _{j,k=1}^{i}{h}_{i,j},{h}_{i,k}$$

(7)

for *i* = 1, …, *p*. The local dependence we are looking for is that the *Y _{i}*’s have the same distribution and, for any pair of (

We can express our assumption as follows ${\mathcal{A}}_{1}$. For each *i* = 1, …, *p*, *h _{i;j}* = 0 for $j-i>\mathcal{B}$ for some constant $\mathcal{B}$ (referred to as the bandwidth). Further, we require the following be constant (possibly excluding the first and last few) across

$$\underset{i=1,\dots ,p}{\mathrm{sup}}\sum _{j=i-\mathcal{B}}^{i}{h}_{i,j}^{4}\le M.$$

(8)

Moreover, there exists a universal constant *M* > 0 such that

$${T}_{\mathcal{I}}=\frac{1}{\mid \mathcal{I}\mid}\sum _{i\in \mathcal{I}}{Y}_{i}^{2}=\frac{1}{\mid \mathcal{I}\mid}\sum _{i\in \mathcal{I}}{\left(\sum _{j=1}^{i}{h}_{i,j}{U}_{j}\right)}^{2}$$

An important part for the proof of Theorem 2 concerns the sum of many “locally” dependent quadratic terms. We prove a useful tool for this first. Define

$${T}_{\mathcal{I}}\to \mathbb{E}{Y}_{1}^{2}$$

We will show that ${T}_{\mathcal{I}}$ is highly concentrated around its mean in the following lemma.

Suppose assumption ${\mathcal{A}}_{1}$ is true for the covariance matrix ∑ under all possible permutations of rows and columns for some universal constants $\mathcal{B}$ and M. Further assume U_{1} has bounded fourth moment. Then, for any instance of $\mathcal{I}$ defined by a random split,

$$Pr(\mid {T}_{m}-\mathbb{E}{T}_{m}\mid \ge a)\phantom{\rule{1em}{0ex}}\le \frac{1}{{a}^{2}{m}^{2}}\left[\sum _{i=1}^{m}Var\left({Y}_{i}^{2}\right)+\sum _{i\ne j}Cov({Y}_{i}^{2},{Y}_{j}^{2})\right].$$

(9)

in probability as $\mid \mathcal{I}\mid \to \infty $.

The proof is based on Chebychev’s inequality [12]. For simplicity details are omitted for the handling of the finite number (there are $\mathcal{B}-1$ of them) of *Y _{i}*’s that has less than $\mathcal{B}$ non-zero terms in expression (6) (the sum these of terms tends to 0 in probability). W.L.O.G., let $\mathcal{I}=\{1,\dots ,m\}$. Fix α > 0, then

$$\frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}Var\left({y}_{i}^{2}\right)=\frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}Var{\left[\sum _{j=1}^{p}{h}_{i,j}{U}_{j}\right]}^{2}\le \frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}\mathbb{E}\left[{\mathcal{B}}^{3}\sum _{j=i-\mathcal{B}}^{i}{h}_{i,j}^{4}{U}_{j}^{4}\right]=\frac{{\mathcal{B}}^{3}\mathbb{E}\left({U}_{1}^{4}\right)}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}\sum _{j=i-\mathcal{B}}^{i}{h}_{i,j}^{4}\le \frac{{\mathcal{B}}^{3}M\mathbb{E}\left({U}_{1}^{4}\right)}{{a}^{2}m}\phantom{\rule{thinmathspace}{0ex}}$$

We will show that both of the two terms in (9) vanish as *m* grows. We have

$$\frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}\mathrm{Var}\left({Y}_{i}^{2}\right)=\frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}\mathrm{Var}{\left[\sum _{j=1}^{p}{h}_{i,j}{U}_{j}\right]}^{2}\le \frac{1}{{a}^{2}{m}^{2}}\sum _{i=1}^{m}\mathbb{E}\left[{\mathcal{B}}^{3}\sum _{j=i-\mathcal{B}}^{i}{h}_{i,j}^{4}{U}_{j}^{4}\right]=\frac{{\mathcal{B}}^{3}\mathbb{E}\left({U}_{1}^{4}\right)}{{a}^{2}{m}^{2}}\sum _{i=j}^{m}\sum _{j=i-\mathcal{B}}^{i}{h}_{1,j}^{4}\le \frac{{\mathcal{B}}^{3}M\mathbb{E}\left({U}_{1}^{4}\right)}{{a}^{2}m},$$

which clearly vanishes as *m* grows assuming $\mathbb{E}{U}_{1}^{4}$ is bounded. Next we bound the covariance terms.

$$\frac{1}{{a}^{2}{m}^{2}}\sum _{\mid i-j\mid \le \mathcal{B}}Cov({Y}_{i}^{2},{Y}_{j}^{2})\le \frac{{C}_{2}}{{a}^{2}{m}^{2}}\sum _{\mid i-j\mid \le \mathcal{B}}{\left[\mathbb{E}{Y}_{i}^{4}\mathbb{E}{Y}_{j}^{4}\right]}^{1\u22152}\le \frac{1}{{a}^{2}{m}^{2}}\sum _{\mid i-j\mid \le \mathcal{B}}{\left[\left({\mathcal{B}}^{3}\mathbb{E}{U}_{1}^{4}\sum _{k=i-\mathcal{B}}^{i}{h}_{i,k}^{4}\right)\cdot \left({\mathcal{B}}^{3}\mathbb{E}{U}_{1}^{4}\sum _{k=j-\mathcal{B}}^{j}{h}_{j,k}^{4}\right)\right]}^{1\u22152}=\frac{{\mathcal{B}}^{3}\mathbb{E}{U}_{1}^{4}}{{a}^{2}{m}^{2}}\sum _{\mid i-j\mid \le \mathcal{B}}{\left[\left(\sum _{k=i-\mathcal{B}}^{i}{h}_{i,k}^{4}\right)\cdot \left(\sum _{k=i-\mathcal{B}}^{i}{h}_{i,k}^{4}\right)\right]}^{1\u22152}\le \frac{{\mathcal{B}}^{4}M\mathbb{E}{U}_{1}^{4}}{{a}^{2}m}.$$

Thus

$$Pr(\mid {T}_{m}-\mathbb{E}{T}_{m}\mid \ge a)\le \frac{{\mathcal{B}}^{3}(1+\mathcal{B})M\mathbb{E}{U}_{1}^{4}}{{a}^{2}m}$$

and the conclusion follows.

- An interesting special case of ${\mathcal{A}}_{1}$ is when
*H*^{−1}is a banded Toeplitz matrix, a subject of extensive study in time series, numerical analysis and covariance matrix estimation (see [45, 46, 18, 5, 8] and references therein). In this case,*H*^{−1}has constant sub-diagonals so ${\mathcal{A}}_{1}$ holds. Clearly if we permutate the entries in each row of*H*^{−1}(the resulting matrix is then a permutated Toeplitz matrix) such that*H*^{−1}is still lower triangular and banded (the bandwidth $\mathcal{B}$ can grow sub-linearly with*p*), then it is easy to see that ${\mathcal{A}}_{1}$ still holds. - From the proof of Lemma 3, we see that similar conclusions follow when we allow $\mathcal{B}$,
*M*, or both, to grow sub-linearly (the exact rate can be determined by inspecting the proof of Lemma 3) with*p*. Under such conditions, since $\mathbb{E}{Y}_{1}^{2}$ may be infinite, we can state the result in terms of ${T}_{\mathcal{I}}\u2215{T}_{\mathcal{F}}$. - The assumptions required by Lemma 3 may be too restricted, it seems possible to relax by requiring the resulting covariance matrix under permutations be approximable by (
*H*+ Δ)(*H*+ Δ)with ||Δ|| ≤^{T}*ζ*||*H*|| for some small constant*ζ*where ||·|| denotes the operator norm. Accordingly, the lower bound becomes*c*for constant*c*s.t. 0 <*c*< 0.5 (see Section 6.3).

Let ** u** be written as

$$\stackrel{~}{\Sigma}=\left[\begin{array}{cc}\hfill A\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \stackrel{~}{D}\hfill \end{array}\right]$$

where $\stackrel{~}{D}=\text{diag}({d}_{p},\dots ,{d}_{p})$ is a diagonal matrix with entries to be determined. We will use $\stackrel{~}{\Sigma}$ in place of *A*. Since

$${\stackrel{~}{\Sigma}}^{-1}=\left[\begin{array}{cc}\hfill {A}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\stackrel{~}{D}}^{-1}\hfill \end{array}\right]$$

if we can make *d _{p}* sufficiently large, then effectively we would have ${\mathit{u}}^{T}{\stackrel{~}{\Sigma}}^{-1}\mathit{u}\approx {\mathit{u}}_{1}^{T}{A}^{-1}{\mathit{u}}_{1}$. The exact order of growth for

We wish to obtain a lower bound for

$$\gamma =\frac{{\mathit{u}}_{1}^{T}{A}^{-1}{\mathit{u}}_{1}}{{\mathit{u}}^{T}{\Sigma}^{-1}\mathit{u}}=\frac{{\mathit{u}}^{T}{\stackrel{~}{\Sigma}}^{-1}\mathit{u}}{{\mathit{u}}^{T}{\Sigma}^{-1}\mathit{u}}-\frac{{\mathit{u}}_{2}^{T}{\stackrel{~}{D}}^{-1}{\mathit{u}}_{2}}{{\mathit{u}}^{T}{\Sigma}^{-1}\mathit{u}}={\gamma}_{1}-{\gamma}_{2}.$$

We have

$${\gamma}_{2}=\frac{1}{{d}_{p}}\frac{{\mathit{u}}_{2}^{T}{\mathit{u}}_{2}}{{\mathit{u}}^{T}{\Sigma}^{-1}\mathit{u}}\le \frac{1}{{d}_{p}}\frac{{\mathit{u}}^{T}\mathit{u}}{{\mathit{u}}^{T}{\Sigma}^{-1}\mathit{u}}$$

which vanishes as *d _{p}* → ∞. To obtain a bound for

$$\frac{{\mathit{y}}^{T}{H}^{T}{\stackrel{~}{\Sigma}}^{-1}{H}_{\mathit{y}}}{{\mathit{y}}^{T}\mathit{y}}$$

(10)

where lower triangular matrix *H* is defined in the Cholesky decomposition ∑ = *HH ^{T}*. Write

$$H=\left[\begin{array}{cc}\hfill {H}_{1}\hfill & \hfill 0\hfill \\ \hfill {H}_{2}\hfill & \hfill {H}_{4}\hfill \end{array}\right].$$

Then $A={H}_{1}{H}_{1}^{T}$. We can compute the following

$${H}^{T}{\stackrel{~}{\Sigma}}^{-1}H=\left[\begin{array}{cc}\hfill {H}_{1}^{T}\hfill & \hfill {H}_{2}^{T}\hfill \\ \hfill 0\hfill & \hfill {H}_{4}^{T}\hfill \end{array}\right].\left[\begin{array}{cc}\hfill {A}^{-1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {\stackrel{~}{D}}^{-1}\hfill \end{array}\right].\left[\begin{array}{cc}\hfill {H}_{1}\hfill & \hfill 0\hfill \\ \hfill {H}_{2}\hfill & \hfill {H}_{4}\hfill \end{array}\right]=\left[\begin{array}{cc}\hfill {H}_{1}^{T}{A}^{-1}\hfill & \hfill {H}_{2}^{T}{\stackrel{~}{D}}^{-1}\hfill \\ \hfill 0\hfill & \hfill {H}_{4}^{T}{\stackrel{~}{D}}^{-1}\hfill \end{array}\right].\left[\begin{array}{cc}\hfill {H}_{1}\hfill & \hfill 0\hfill \\ \hfill {H}_{2}\hfill & \hfill {H}_{4}\hfill \end{array}\right]=\left[\begin{array}{cc}\hfill {I}_{m\times m}+{H}_{2}^{T}{\stackrel{~}{D}}^{-1}{H}_{2}\hfill & \hfill {H}_{2}^{T}{\stackrel{~}{D}}^{-1}{H}_{4}\hfill \\ \hfill {H}_{4}^{T}{\stackrel{~}{D}}^{-1}{H}_{2}\hfill & \hfill {H}_{4}^{T}{\stackrel{~}{D}}^{-1}{H}_{4}\hfill \end{array}\right].$$

It follows that

$$\begin{array}{ccc}\hfill \hfill & \hfill \hfill & \left({\mathit{y}}_{1}^{T}{\mathit{y}}_{2}^{T}\right){H}^{T}{\stackrel{~}{\Sigma}}^{-1}H\left[\begin{array}{c}\hfill {\mathit{y}}_{1}\hfill \\ \hfill {\mathit{y}}_{2}\hfill \end{array}\right]\hfill \\ \hfill \hfill & \hfill =\hfill & {\mathit{y}}_{1}^{T}{\mathit{y}}_{1}+{\mathit{y}}_{1}^{T}{H}_{2}^{T}{\stackrel{~}{D}}^{-1}{H}_{2}{\mathit{y}}_{1}+{\mathit{y}}_{2}^{T}{H}_{4}^{T}{\stackrel{~}{D}}^{-1}{H}_{2}{\mathit{y}}_{1}+{\mathit{y}}_{1}^{T}{H}_{2}^{T}{\stackrel{~}{D}}^{-1}{H}_{4}{\mathit{y}}_{2}+{\mathit{y}}_{2}^{T}{H}_{4}^{T}{\stackrel{~}{D}}^{-1}{H}_{4}{\mathit{y}}_{2}\hfill \\ \hfill \hfill & \hfill =\hfill & {\mathit{y}}_{1}^{T}{\mathit{y}}_{1}+Q\left(\mathit{y}\right)\hfill \end{array}.$$

Repeated application of the matrix norm inequality yields the following

$$Q\left(\mathit{y}\right)\le \Vert {\stackrel{~}{D}}^{-1}\Vert \cdot (\Vert {H}_{2}{\Vert}^{2}+\Vert {H}_{4}\Vert \cdot \Vert {H}_{2}\Vert +\Vert {H}_{2}\Vert \cdot \Vert {H}_{4}\Vert +\Vert {H}_{4}{\Vert}^{2})\cdot \Vert \mathit{y}{\Vert}^{2}\le \frac{4\Vert H{\Vert}_{F}^{2}}{{d}_{p}}\Vert \mathit{y}{\Vert}^{2}$$

where ||.||* _{F}* denotes the Frobenius norm. Simply setting ${d}_{p}=O(p\Vert H{\Vert}_{F}^{2})$ make the term

$${\gamma}_{1}=\frac{{\mathit{y}}^{T}{H}^{T}{\stackrel{~}{\Sigma}}^{-1}H\mathit{y}}{{\mathit{y}}^{T}\mathit{y}}\ge \frac{{\mathit{y}}_{1}^{T}{\mathit{y}}_{1}}{{\mathit{y}}^{T}\mathit{y}}-o\left(1\right).$$

Let ** y** be written as

$$\frac{{{S}_{\mathcal{F}}}_{1}}{{S}_{\mathcal{F}}}\ge {\gamma}_{1}-\frac{1}{{d}_{p}^{2}}\frac{\mathit{u}{\mathit{u}}^{T}}{\mathit{u}{\Sigma}^{-1}{\mathit{u}}^{T}}=\frac{1}{2}-{o}_{p}\left(1\right)$$

as *p* grows.

We conduct simulations on Gaussian mixtures ${\mathcal{G}}_{1-4}$ corresponding to different types of covariance matrices. The aim is to examine the generality of Theorem 2 when there is a departure from assumption ${\mathcal{A}}_{1}$. Note that here we are not trying to model the TMA images with a Gaussian mixture rather we choose it for simplicity.

In all cases, the covariance matrix ∑ has dimension 2000. The components of ** u** are generated i.i.d. uniform from [0, 1]. ∑ is defined as follows.

- ${\mathcal{G}}_{1}$: ∑ is banded with bandwidth 200 and its nonzero (
*i, j*)entry is defined by ∑ =^{th}*ρ*^{|i−j|}for*ρ*{0.1, 0.3, 0.6}. - ${\mathcal{G}}_{2}$: ∑ is banded with $\mathcal{B}\in \{50,100,200\}$ and the nonzero (
*i, j*)entry defined by ∑^{th}= /(|_{ij}*i*−*j*| + 0.5). - ${\mathcal{G}}_{3}$: ∑ is in the form of (
*X*+*X*)/2 with entries of matrix^{T}*X*generated i.i.d. uniform from [0,*μ*] for*μ*{0.1, 0.5, 1.0}, except that the diagonals are the smallest number such that*λ*(∑_{min}^{−1}) ≥ 10^{−5}. - ${\mathcal{G}}_{4}$: ∑ is in the form of (
*X*+*X*)/2 with entries of matrix^{T}*X*generated i.i.d. from $\mathcal{N}(0,{\sigma}^{2})$ for*σ*{0.1, 0.5, 1.0}, except that the diagonals are the smallest number such that*λ*(∑_{min}^{−1}) ≥ 10^{−5}.

Table 4 shows the results where, in all cases, the ratio of separation is greater than 0.3 and fairly close to 0.5 (the ratio of separation when assumption ${\mathcal{A}}_{1}$ is true). As the Bayes rate of the 2–class classification problem on Gaussian mixtures with the original feature set is close to 1, so is that on a thinned slice of the feature set.

^{1}The additional issue of label noise was studied elsewhere [52].

^{1}Often before the computing of GLCM, the gray level of each pixel in an image is scaled linearly from [1; N_{o}] to [1; N_{g}] for *N _{o}* the prede ned number, typically 256, of gray levels in the image.

^{3}Originally written in Fortran by Leo Breiman and Adele Cutler, and later ported to R by Andy Liaw and Matthew Wiener.

[1] Andersen TW. An introduction to multivariate statistical analysis. John Wiley; New York: 1958.

[2] Bach F, Jordan MI. Kernel independent component analysis. Journal of Machine Learning Research. 2003;3:1–48.

[3] Bentzen S, Buffa F, Wilson G. Multiple biomarker tissue microarrays: bioinformatics and practical approaches. Cancer and Metastasis Reviews. 2008;27(3):481–494. [PubMed]

[4] Berger A, Davis D, Tellez C, Prieto V, Gershenwald J, Johnson M, Rimm D, Bar-Eli M. Automated quantitative analysis of activator protein-2 *α* subcellular expression in melanoma tissue microarrays correlates with survival prediction. Cancer research. 2005;65(23):11185. [PubMed]

[5] Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Annals of Statistics. 2008;36(1):199–227.

[6] Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. Proceedings of the eleventh annual conference on Computational learning theory.1998. pp. 92–100.

[7] Breiman L. Random Forests. Machine Learning. 2001;45(1):5–32.

[8] Cai T, Zhang C-H, Zhou H. Optimal rates of convergence for covariance matrix estimation. Annals of Statistics. 2010;38(4):2118–2144.

[9] Camp R, Neumeister V, Rimm D. A decade of tissue microarrays: progress in the discovery and validation of cancer biomarkers. Journal of Clinical Oncology. 2008;26(34):5630–5637. [PubMed]

[10] Caruana R, Karampatziakis N, Yessenalina A. An empirical evaluation of supervised learning in high dimensions. Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML).2008. pp. 96–103.

[11] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. 2001 Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm.

[12] Chung KL. A Course in Probability Theory. second edition Academic Press; 1974.

[13] Conners R, Harlow C. A theoretical comparison of texture algorithms. IEEE Transactions on Pattern Analyses and Machine Intelligence. 1980;2:204–222. [PubMed]

[14] Cortes C, Vapnik V. Support vector networks. Machine Learning. 1995;20:273–297.

[15] Cover T, Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory. 1967;13(1):21–27.

[16] Cregger M, Berger A, Rimm D. Immunohistochemistry and quantitative analysis of protein expression. Archives of pathology & laboratory medicine. 2006;130(7):1026. [PubMed]

[17] Dasgupta S, Gupta A. An elementary proof of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms. 2002;22(1):60–65.

[18] Dickinson BW. Efficient solution of linear equations with banded Toeplitz matrices. IEEE Transactions on Acoustics, Speech and Signal Processing. 1979;ASSP-27:421–423.

[19] DiVito K, Camp R. Tissue microarrays - automated analysis and future directions. Breast Cancer Online. 2005;8(07)

[20] Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. American Math Society on Math Challenges of the 21st century. 2000.

[21] Fan J, Fan Y, Lv J. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics. 2008;147:186–197.

[22] Freund Y, Schapire RE. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning (ICML).1996. pp. 148–156.

[23] Giltnane J, Rimm D. Technology insight: identification of biomarkers with tissue microarray technology. Nature Clinical Practice Oncology. 2004;1(2):104–111. [PubMed]

[24] Gretton A, Fukumizu K, Teo CH, Song L, Schölkopf B, Smola AJ. Advances in Neural Information Processing Systems (NIPS) 2007. A kernel statistical test of independence; pp. 585–592.

[25] Hall B, Ianosi-Irimie M, Javidian P, Chen W, Ganesan S, Foran D. Computer-assisted assessment of the human epidermal growth factor receptor 2 immunohistochemical assay in imaged histologic sections using a membrane isolation algorithm and quantitative analysis of positive controls. BMC Medical Imaging. 2008;8(1):11. [PMC free article] [PubMed]

[26] Haralick R. Statistical and structural approaches to texture. Proceedings of IEEE. 1979;67(5):786–803.

[27] Hassan S, Ferrario C, Mamo A, Basik M. Tissue microarrays: emerging standard for biomarker validation. Current Opinion in Biotechnology. 2008;19(1):19–25. [PubMed]

[28] Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20:832–844.

[29] Holmes S, Kapelner A, Lee P. An interactive Java statistical image segmentation system: Gemident. Journal of Statistical Software. 2009;30(10):1–20. [PMC free article] [PubMed]

[30] Johnson W, Lindenstrauss J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics. 1984;26:189–206.

[31] Joshi AS, Sharangpani GM, Porter K, Keyhani S, Morrison C, Basu AS, Gholap GA, Gholap AS, Barsky SH. Semi-automated imaging system to quantitate Her-2/neu membrane receptor immunoreactivity in human breast cancer. Cytometry Part A. 2007;71(5):273–285. [PubMed]

[32] El Karoui N. Spectrum estimation for large dimensional covariance matrices using random matrix theory. 2006. arXiv:math/0609418v1.

[33] Kirkegaard T, Edwards J, Tovey S, McGlynn LM, Krishna SN, Mukherjee R, Tam L, Munro AF, Dunne B, Bartlett J. Observer variation in immunohistochemical analysis of protein expression, time for a change? Histopathology. 2006;48(7):787–794. [PubMed]

[34] Kononen J, Bubendorf L, Kallionimeni A, Bärlund M, Schraml P, Leighton S, Torhorst J, Mihatsch M, Sauter G, Kallionimeni O. Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nature Medicine. 1998;4(7):844–847. [PubMed]

[35] Marinelli R, Montgomery K, Liu C, Shah N, Prapong W, Nitzberg M, Zachariah Z, Sherlock G, Natkunam Y, West R, et al. The Stanford tissue microarray database. Nucleic Acids Research. 2007;36:D871–D877. [PMC free article] [PubMed]

[36] Masmoudi H, Hewitt S, Petrick N, Myers K, Gavrielides M. Automated Quantitative Assessment of HER-2/neu Immunohistochemical Expression in Breast Cancer. IEEE transactions on medical imaging. 2009 [PubMed]

[37] Mulrane L, Rexhepaj E, Penney S, Callanan JJ, Gallagher WM. Automated image analysis in histopathology: a valuable tool in medical diagnostics. Expert Review of Molecular Diagnostics. 2008;8(6):707–725. [PubMed]

[38] Nigam K, Ghani R. Analyzing the effectiveness and applicability of co-training. Proceedings of the ninth international conference on Information and knowledge management.2000. pp. 86–93.

[39] Pearl J. Bayesian networks: A model of self-activated memory for evidential reasoning. Proceedings of the seventh Conference of the Cognitive Science Society.1985. pp. 329–334.

[40] Penna A, Grilli R, Filardo G, Mainini F, Zola P, Mantovani L, Liberati A. Do different physicians’ panels reach similar conclusions? A case study on practice guidelines for limited surgery in breast cancer. European Journal of Public Health. 1997;7:436–440.

[41] Rojo MG, Bueno G, Slodkowska J. Review of imaging solutions for integrated quantitative immunohistochemistry in the pathology daily practice. Folia Histochemica et Cytobiologica. 2009;47(3):349–354. [PubMed]

[42] Skaland I, Øvestad I, Janssen EAM, Klos J, Kjellevold KH, Helliesen T, Baak J. Digital image analysis improves the quality of subjective HER-2 expression scoring in breast cancer. Applied Immunohisto-chemistry & Molecular Morphology. 2008;16(2):185. [PubMed]

[43] Tawfik OW, Kimler BF, Davis M, Donahue JK, Persons DL, Fan F, Hagemeister S, Thomas P, Connor C, Jewell W, et al. Comparison of immunohistochemistry by automated cellular imaging system (ACIS) versus fluorescence in-situ hybridization in the evaluation of HER-2/neu expression in primary breast carcinoma. Histopathology. 2005;48(3):258–267. [PubMed]

[44] Thomson T, Hayes M, Spinelli J, Hilland E, Sawrenko C, Phillips D, Dupuis B, Parker R. HER-2/neu in breast cancer: interobserver variability and performance of immunohistochemistry with 4 antibodies compared with fluorescent in situ hybridization. Modern Pathology. 2001;14(11):1079–1086. [PubMed]

[45] Trench WF. Weighting coefficients for the prediction of stationary time series from the finite past. SIAM Journal on Applied Mathematics. 1967;15:1502–1510.

[46] Trench WF. Inversion of Toeplitz band matrices. Mathematics of Computation. 1974;28:1089–1095.

[47] Voduc D, Kenney C, Nielsen T. Tissue microarrays in clinical oncology. Seminars in radiation oncology. 2008;18(2):89–97. [PMC free article] [PubMed]

[48] Vrolijk H, Sloos W, Mesker W, Franken P, Fodde R, Morreau H, Tanke H. Automated Acquisition of Stained Tissue Microarrays for High-Throughput Evaluation of Molecular Targets. Journal of Molecular Diagnostics. 2003;5(3):160–167. [PubMed]

[49] Walker R. Quantification of immunohistochemistry - issues concerning methods, utility and semiquantitative assessment I. Histopathology. 2006;49(4):406–410. [PubMed]

[50] Wan WH, Fortuna MB, Furmanski P. A rapid and efficient method for testing immunohistochemical reactivity of monoclonal antibodies against multiple tissue samples simultaneously. Journal of Immunological Methods. 1987;103:121–129. [PubMed]

[51] Yan D, Bickel PJ, Gong P. A discrete log density expansion based approach to Ikonos image classification. American Society for Photogrammetry and Remote Sensing Fall Speciality Conference.2006.

[52] Yan D, Gong P, Chen A, Zhong L. Classification under data contamination with application to image mis-registration in remote sensing. 2011. arXiv:1101.3594.

[53] Yarowsky D. Unsupervised word sense disambiguation rivaling supervised methods. in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.1995. pp. 189–196.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |