
Biomark Insights. 2010; 5: 69–78.

Published online 2010 August 5.

PMCID: PMC2918352

Corresponding author email: ude.agu@5211pyw

Copyright © 2010 the author(s), publisher and licensee Libertas Academica Ltd.

This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.


Detection of differential gene expression using microarray technology has received considerable interest in cancer research studies. Recently, many researchers discovered that oncogenes may be activated in some but not all samples in a given disease group. The existing statistical tools for detecting differentially expressed genes in a subset of the disease group mainly include cancer outlier profile analysis (COPA), outlier sum (OS), outlier robust *t*-statistic (ORT) and maximum ordered subset *t*-statistics (MOST). In this study, another approach named Least Sum of Ordered Subset Square *t*-statistic (LSOSS) is proposed. The results of our simulation studies indicated that LSOSS often has more power than previous statistical methods. When applied to real human breast and prostate cancer data sets, LSOSS was competitive in terms of the biological relevance of top ranked genes. Furthermore, a modified hierarchical clustering method was developed to classify the heterogeneous gene activation patterns of human breast cancer samples based on the significant genes detected by LSOSS. Three classes of gene activation patterns, which correspond to estrogen receptor (ER)+, ER− and a mixture of ER+ and ER−, were detected and each class was assigned a different gene signature.

The most widely used method for detecting differential gene expression in comparative microarray studies is the two-sample *t*-statistic. A gene is determined to be significant if the absolute *t*-value exceeds a certain threshold *c*, which is usually determined by its corresponding *P*-value or false discovery rate. Recently, Tomlins et al^{1} introduced the cancer outlier profile analysis (COPA) method for detecting cancer genes which are differentially expressed in a subset of disease samples. Heterogeneous patterns of oncogene activation were observed in the majority of cancer types considered in their studies. Thereafter, several further studies in this direction have been proposed. Tibshirani and Hastie^{2} introduced the outlier sums (OS) method, Wu^{3} proposed the outlier robust *t*-statistic (ORT), and Lian^{4} introduced the maximum ordered subset *t*-statistics (MOST) procedure.

In this study, a simple statistical test named Least Sum of Ordered Subset Square *t*-statistic (LSOSS) is proposed for detecting cancer outlier differential gene expression. The performance of LSOSS was compared to existing procedures using both simulated and real data sets. Furthermore, we extended previous studies by classifying heterogeneous gene activation patterns of human breast cancer.

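Assume that case-control microarray data were generated for detecting differentially expressed genes, consisting of *n* samples from a normal group and *m* samples from a cancer group. Let $x_{ij}$ be the expression value for gene $i$ in normal sample $j$ ($1\le j\le n$), and let $y_{ij}$ be the expression value for gene $i$ in cancer sample $j$ ($1\le j\le m$).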

The two-condition *t*-statistic for gene *i* is defined by:

$${t}_{i}=\frac{{\overline{y}}_{i}-{\overline{x}}_{i}}{{s}_{i}}$$

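where ${\overline{y}}_{i}$ is the mean expression value in cancer samples, ${\overline{x}}_{i}$ is the mean expression value in normal samples, and ${s}_{i}$ is the pooled standard error estimate given by:

$${s}_{i}^{2}=\frac{\sum _{1\le j\le n}{({x}_{ij}-{\overline{x}}_{i})}^{2}+\sum _{1\le j\le m}{({y}_{ij}-{\overline{y}}_{i})}^{2}}{n+m-2}.$$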

The *t*-statistic is powerful when most cancer samples are activated.
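As a minimal sketch of the definition above, the statistic for a single gene can be computed as follows. The function name `pooled_t` and the use of NumPy are assumptions made for illustration. The sketch follows the formula exactly as written here, so it does not include the usual $\sqrt{1/n+1/m}$ scaling factor.

```python
import numpy as np

def pooled_t(x, y):
    """t-statistic for one gene as defined in the text.

    x: expression values in the n normal samples
    y: expression values in the m cancer samples
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    # Pooled sum of squares around each group's own mean
    ss = np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)
    s = np.sqrt(ss / (n + m - 2))
    return (y.mean() - x.mean()) / s
```

For example, `pooled_t([0, 1, 2], [3, 4, 5])` has a mean difference of 3 and a pooled estimate of 1.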

Tomlins et al^{1} define the COPA statistic as

$${\mathit{\text{copa}}}_{i}=\frac{{q}_{r}(\{{y}_{ij}:1\le j\le m\})-{\mathit{\text{med}}}_{i}}{{\mathit{\text{mad}}}_{i}}$$

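where ${q}_{r}(\cdot)$ is the $r$th percentile of the data and ${\mathit{\text{med}}}_{i}$ is the median of expression values in all samples, given by: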

$${\mathit{\text{med}}}_{i}=\mathit{\text{median}}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\}),$$

and ${\mathit{\text{mad}}}_{i}$ is the median absolute deviation of expression values in all samples and is given by:

$${\mathit{\text{mad}}}_{i}=1.4826\times \mathit{\text{median}}(\{|{x}_{ij}-{\mathit{\text{med}}}_{i}|:1\le j\le n\},\{|{y}_{ij}-{\mathit{\text{med}}}_{i}|:1\le j\le m\}).$$
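The COPA definition can be sketched for one gene as below. The function name `copa` and the default percentile `r=90` are illustrative assumptions, since the text only says that the percentile is chosen by the user. The median and the median absolute deviation are taken over the pooled normal and cancer samples, as in the formulas above.

```python
import numpy as np

def copa(x, y, r=90):
    """COPA statistic for one gene (r is the user-chosen percentile)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    med = np.median(pooled)
    # Scaled median absolute deviation over all samples
    mad = 1.4826 * np.median(np.abs(pooled - med))
    # r-th percentile of the cancer samples, centred and scaled
    return (np.percentile(y, r) - med) / mad
```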

The COPA statistic uses a fixed *rth* sample percentile, which is determined by users. This limitation was overcome by the OS statistic^{2} defined by:

$${\mathit{\text{OS}}}_{i}=\frac{{\sum}_{{y}_{ij}\in {R}_{i}}({y}_{ij}-{\mathit{\text{med}}}_{i})}{{\mathit{\text{mad}}}_{i}}$$

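where ${R}_{i}$ is the set of outlier cancer samples for gene $i$,

$${R}_{i}=\{{y}_{ij}:{y}_{ij}>{q}_{75}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\})+\mathit{\text{IQR}}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\})\},$$

and IQR is the interquartile range of expression values in all samples:

$$\mathit{\text{IQR}}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\})={q}_{75}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\})-{q}_{25}(\{{x}_{ij}:1\le j\le n\},\{{y}_{ij}:1\le j\le m\}).$$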
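A minimal sketch of the OS statistic for one gene follows. It assumes that the outlier set contains the cancer values above the pooled 75th percentile plus the pooled IQR, and that the median and MAD are computed over all samples as for COPA. The function name `outlier_sum` is hypothetical.

```python
import numpy as np

def outlier_sum(x, y):
    """OS statistic for one gene, with outliers defined on pooled samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    med = np.median(pooled)
    mad = 1.4826 * np.median(np.abs(pooled - med))
    q25, q75 = np.percentile(pooled, [25, 75])
    # Cancer samples above q75 + IQR form the outlier set R_i
    outliers = y[y > q75 + (q75 - q25)]
    return np.sum(outliers - med) / mad
```

Unlike COPA, no percentile has to be fixed in advance: the number of cancer samples entering the sum is determined by the data.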

Wu^{3} modified the OS statistic by proposing the ORT statistic which consists mainly in changing the definition of *R _{i}* as:

$$\begin{array}{c}{R}_{i}=\{{y}_{ij}:{y}_{ij}>{q}_{75}(\{{x}_{ij}:1\le j\le n\})\\ +IQR(\{{x}_{ij}:1\le j\le n\})\}.\end{array}$$

and replacing ${\mathit{\text{med}}}_{i}$ in OS by ${\mathit{\text{med}}}_{ix}$, the median expression value in normal samples, and ${\mathit{\text{mad}}}_{i}$ by

$${\mathit{\text{mad}}}_{i}^{\prime}=1.4826\times \mathit{\text{median}}(\{|{x}_{ij}-{\mathit{\text{med}}}_{ix}|:1\le j\le n\},\{|{y}_{ij}-{\mathit{\text{med}}}_{iy}|:1\le j\le m\}),$$

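where ${\mathit{\text{med}}}_{iy}$ is the median expression value in cancer samples. Lian^{4} proposed the MOST statistic, in which the expression values in cancer samples are sorted in descending order and the statistic is defined by: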

$${\mathit{\text{MOST}}}_{i}=\underset{1\le k\le m}{\text{max}}\left[\frac{\sum _{1\le j\le k}({y}_{ij}-{\mathit{\text{med}}}_{ix})}{{\mathit{\text{mad}}}_{i}^{\prime}}-{\mu}_{k}\right]/{\delta}_{k}.$$

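${\mu}_{k}$ and ${\delta}_{k}$ are obtained from the order statistics of $m$ samples drawn from a standard normal distribution, and are used to make the statistic comparable across different values of $k$.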
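The ORT and MOST definitions can be sketched together, since both centre on the normal-group median and share the same scale estimate. The sketch below rests on several assumptions: absolute deviations are used in the scale estimate, the cancer values are sorted in descending order for MOST, and ${\mu}_{k}$ and ${\delta}_{k}$ are taken to be the mean and standard deviation of the sum of the $k$ largest of $m$ standard normal draws, estimated here by Monte Carlo. All function names and the `n_draws` and `seed` parameters are hypothetical.

```python
import numpy as np

def _center_and_scale(x, y):
    """Normal-group median and the scale estimate shared by ORT and MOST."""
    med_x, med_y = np.median(x), np.median(y)
    dev = np.concatenate([np.abs(x - med_x), np.abs(y - med_y)])
    return med_x, 1.4826 * np.median(dev)

def ort(x, y):
    """ORT statistic: outliers are defined from the normal samples only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    med_x, mad = _center_and_scale(x, y)
    q25, q75 = np.percentile(x, [25, 75])
    outliers = y[y > q75 + (q75 - q25)]
    return np.sum(outliers - med_x) / mad

def order_stat_constants(m, n_draws=10000, seed=0):
    """Monte Carlo estimates of mu_k and delta_k for k = 1..m."""
    rng = np.random.default_rng(seed)
    z = np.sort(rng.standard_normal((n_draws, m)), axis=1)[:, ::-1]
    partial = np.cumsum(z, axis=1)  # sum of the k largest draws
    return partial.mean(axis=0), partial.std(axis=0, ddof=1)

def most(x, y, mu, delta):
    """MOST statistic: maximum standardized partial sum over k."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    med_x, mad = _center_and_scale(x, y)
    y_desc = np.sort(y)[::-1]
    partial = np.cumsum(y_desc - med_x) / mad
    return np.max((partial - mu) / delta)
```

The constants depend only on $m$, so `order_stat_constants` needs to be called once per dataset and its output reused for every gene.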

In our proposed method, least sum of ordered subset square *t*-statistic (LSOSS), mean expression values in normal and cancer samples were considered instead of median expression values. Our hypothesis was that if outliers are present among cancer samples, the distribution of gene expression values in cancer samples will have two peaks. The higher peak corresponds to activated samples while the lower peak indicates inactivated samples. Consequently, this outlier issue can be addressed through the idea of detecting a “change point” or “break point” in the ordered gene expression values of the cancer group. A model related to fitting least squares should be effective for this goal. For each gene, an optimal change point in its expression can be detected and could be used to investigate potential outliers in cancer samples. To this end, we propose the Least Sum of Ordered Subset Square *t*-statistic (LSOSS). The general idea of LSOSS is to use the sums of squares of two ordered subsets of cancer samples in place of the cancer-group sum of squares in the *t*-statistic, and to use the mean of the subset of interest in place of the overall mean of the cancer samples.

The proposed LSOSS method involves the following steps:

- For each gene *i*, the expression levels in cancer samples are sorted in descending order and then divided into two subsets:

  $$\begin{array}{l}{S}_{ik1}=\{{y}_{ij}:1\le j\le k\},\\ {S}_{ik2}=\{{y}_{ij}:k+1\le j\le m\}.\end{array}$$

- For the two subsets, the mean and sum of squares for each gene *i* are calculated:

  $$\begin{array}{l}{\overline{y}}_{{S}_{ik1}}=\mathit{\text{mean}}(\{{y}_{ij}:1\le j\le k\}),\hfill \\ {\overline{y}}_{{S}_{ik2}}=\mathit{\text{mean}}(\{{y}_{ij}:k+1\le j\le m\}),\hfill \\ {\mathit{\text{SS}}}_{{S}_{ik1}}=\sum _{1\le j\le k}{({y}_{ij}-{\overline{y}}_{{S}_{ik1}})}^{2},\hfill \\ {\mathit{\text{SS}}}_{{S}_{ik2}}=\sum _{k+1\le j\le m}{({y}_{ij}-{\overline{y}}_{{S}_{ik2}})}^{2}.\hfill \end{array}$$

  The only issue left to be solved is the value *k* that divides the two subsets. For that purpose an exhaustive search was implemented over all possible values ranging from 1 to *m* − 1. The optimum value of *k* is obtained by minimizing the pooled sum of squares for cancer samples:

  $$\text{arg}\hspace{0.17em}\underset{1\le k\le m-1}{\text{min}}({\mathit{\text{SS}}}_{{S}_{ik1}}+{\mathit{\text{SS}}}_{{S}_{ik2}}).$$

  Let ${s}_{ix}^{2}$ be the sum of squares for normal samples given by:

  $${s}_{ix}^{2}=\sum _{1\le j\le n}{({x}_{ij}-{\overline{x}}_{i})}^{2}.$$

  The pooled standard error estimate for gene *i* is defined by

  $${s}_{i}^{2}=\frac{{s}_{ix}^{2}+{\mathit{\text{SS}}}_{{S}_{ik1}}+{\mathit{\text{SS}}}_{{S}_{ik2}}}{n+m-2}.$$

- The LSOSS statistic for declaring a gene *i* with outlier differential expression in case samples is computed as:

  $${\mathit{\text{LSSV}}}_{i}=k\frac{{\overline{y}}_{{S}_{ik1}}-{\overline{x}}_{i}}{{s}_{i}}$$

  (${\mathit{\text{LSSV}}}_{i}=(m-k)({\overline{y}}_{{S}_{ik2}}-{\overline{x}}_{i})/{s}_{i}$, if repressed gene expression is of interest), where *k* can be interpreted as the number of outlier samples for gene *i*.
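The steps above can be sketched for a single gene as follows. This is an illustrative Python version written from the formulas in this section, not the authors' code; the function name `lsoss` and the choice to return the change point alongside the statistic are assumptions. Only the activated-expression form of the statistic is shown.

```python
import numpy as np

def lsoss(x, y):
    """LSOSS statistic and change point k for one gene.

    x: expression values in the n normal samples
    y: expression values in the m cancer samples
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = x.size, y.size
    y_desc = np.sort(y)[::-1]  # cancer samples in descending order

    # Exhaustive search over k = 1..m-1 for the split that minimizes
    # the pooled within-subset sum of squares
    best_k, best_ss = None, np.inf
    for k in range(1, m):
        top, rest = y_desc[:k], y_desc[k:]
        ss = np.sum((top - top.mean()) ** 2) + np.sum((rest - rest.mean()) ** 2)
        if ss < best_ss:
            best_k, best_ss = k, ss

    # Pooled standard error using the two-subset sum of squares
    ss_x = np.sum((x - x.mean()) ** 2)
    s = np.sqrt((ss_x + best_ss) / (n + m - 2))
    stat = best_k * (y_desc[:best_k].mean() - x.mean()) / s
    return stat, best_k
```

With `x = [0, 1, 2, 1]` and `y = [1, 0, 10, 11, 2]`, the search selects `k = 2`, separating the two high values from the remaining three. The loop recomputes each sum of squares from scratch, which costs on the order of $m^2$ operations per gene; running sums would reduce this, but the direct form stays closer to the formulas.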

We developed a modified hierarchical clustering method for classification of heterogeneous gene activation patterns of human breast cancer samples. One hundred permutations were conducted to assign a *P*-value to each gene. The top *d* genes detected by LSOSS at *P*-value < 0.05 were selected for further analysis. For each gene *i*, the cancer samples that were selected as outliers were marked by 1 and the rest were marked by 0:

$$\begin{array}{l}{y}_{iw}^{\prime}=\{\begin{array}{l}1,\hspace{0.17em}\text{if}\hspace{0.17em}\text{gene}\hspace{0.17em}i\hspace{0.17em}\text{has}\hspace{0.17em}\text{an}\hspace{0.17em}\text{outlier}\hspace{0.17em}\text{in}\hspace{0.17em}\text{sample}\hspace{0.17em}w\hfill \\ 0,\hspace{0.17em}\text{otherwise},\hfill \end{array}\\ \hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}1\le w\le m.\end{array}$$

Thus, each cancer sample *w* can be represented by a vector of length *d* consisting of 0s and 1s:

$${z}_{w}=({y}_{iw}^{\prime},1\le i\le d).$$

For each cancer sample, the number of 1’s indicates the number of genes with outlier expression in that sample compared to other case samples. The similarity between any two cancer samples *w* and *v* was measured by the number of outlier genes they share, which can be obtained by counting the number of 1’s common to both samples, computed as the dot product $\mathbf{z}_{w}\cdot \mathbf{z}_{v}$.
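The binary encoding and the similarity measure can be illustrated with a short sketch. It assumes that the outliers of gene *i* are the samples with the `k` highest expression values, where `k` is the change point found by LSOSS for that gene. The function names are hypothetical, and the hierarchical clustering step itself is not shown.

```python
import numpy as np

def outlier_indicator(cancer, ks):
    """Mark the outlier samples of each gene with 1 and the rest with 0.

    cancer: (d, m) expression matrix of the d selected genes
    ks:     number of outlier samples for each of the d genes
    """
    cancer = np.asarray(cancer, dtype=float)
    marks = np.zeros(cancer.shape, dtype=int)
    for i, k in enumerate(ks):
        top = np.argsort(cancer[i])[::-1][:k]  # indices of the k largest values
        marks[i, top] = 1
    return marks

def shared_outlier_counts(marks):
    """(m, m) matrix whose entry (w, v) is the dot product of z_w and z_v."""
    return marks.T @ marks
```

Column *w* of `marks` is the vector for sample *w*, so the diagonal of the returned matrix holds the number of outlier genes in each sample.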

Simulation studies were conducted to compare the performance of LSOSS with those of MOST, ORT, OS, COPA and the *t*-statistic. To this end, the R source code from Lian^{4} was used. The simulation assumed equal numbers of normal and cancer samples (n = m = 20), and the expression data were generated from a standard normal distribution. Expression values for 2000 genes were simulated, of which 1000 genes were assumed to be differentially expressed; their data were generated by adding a constant, u, to their expression in the first k cancer samples.
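The simulation design described above can be sketched as follows. The study used R code from Lian^{4}; this Python version, the function name `simulate`, and the `seed` parameter are assumptions made for illustration only.

```python
import numpy as np

def simulate(n=20, m=20, n_genes=2000, n_de=1000, k=10, u=2.0, seed=0):
    """Generate one simulated dataset.

    Returns (normal, cancer, is_de): two expression matrices with genes
    in rows, and a boolean vector flagging the differentially expressed genes.
    """
    rng = np.random.default_rng(seed)
    normal = rng.standard_normal((n_genes, n))
    cancer = rng.standard_normal((n_genes, m))
    # Shift the first k cancer samples of the differentially expressed genes
    cancer[:n_de, :k] += u
    is_de = np.zeros(n_genes, dtype=bool)
    is_de[:n_de] = True
    return normal, cancer, is_de
```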

The receiver operating characteristic (ROC) curve was used for evaluating the performance of the different statistical methods. Figure 1 shows the ROC curves for different combinations of k and u. When k = 10 and u = 2, LSOSS clearly outperformed the other methods, and it was second best when k = 5 or 15 and u = 2. When k = 20 and u = 2, LSOSS was comparable to ORT and better than OS and COPA. When u was decreased to 1 with k = 10, LSOSS was the only method comparable to the *t*-statistic. LSOSS showed low sensitivity when k = 2. However, the case where only one or two samples are activated within a large number of cancer samples may be less realistic. Overall, the performance of LSOSS is appealing in terms of detection power.
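An ROC curve for one method can be computed from the per-gene statistics and the known status of each simulated gene. The sketch below is a plain NumPy illustration with hypothetical function names; it ranks genes by decreasing statistic and does not treat tied scores specially.

```python
import numpy as np

def roc_curve(scores, is_de):
    """False and true positive rates as the cutoff moves down the ranking."""
    scores = np.asarray(scores, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    order = np.argsort(-scores, kind="stable")  # largest statistic first
    hits = is_de[order]
    tpr = np.concatenate([[0.0], np.cumsum(hits) / hits.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(~hits) / (~hits).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

An area of 1 means every differentially expressed gene ranks above every null gene, and an area near 0.5 corresponds to a random ranking.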

The breast cancer microarray data from West et al^{5} is available at http://data.cgt.duke.edu/west.php. The data were normalized by the quantile method^{6} and the log-transformed expression values were used for the following analysis. The dataset contains 7129 genes and 49 tumor samples. Among them, 25 tumor samples have negative lymph nodes (LN−) and 24 tumor samples have positive lymph nodes (LN+). We treated the negative LN samples as the control group and the positive LN samples as the cancer group. Genes with expression below a certain threshold (log(10)) in at least 20 samples were removed from the analysis. When evaluating LSOSS on the human breast cancer data, we counted how many of the top 25 genes selected by each statistical approach showed biological relevance in the literature. The numbers of breast cancer related genes identified by existing methods (Table 1) were 8, 8, 4, 3, and 2 for MOST, ORT, OS, the *t*-statistic, and COPA, respectively. Our proposed method (LSOSS) identified 9 breast cancer related genes: KCNH2,^{7} NEO1,^{8} MAGEA3,^{9} ENG,^{10} GABRG2,^{11} ATM,^{12} NUP88,^{13} CYP3A7^{14} and PMP22.^{15} Although it should not be treated as a gold standard for evaluating different statistical tools, this type of analysis generally validates the statistical results and highlights their biological relevance.

To further assess the performance of LSOSS on real data, we downloaded a human prostate cancer dataset.^{16} This dataset, generated by the Affymetrix HG-U95av2 chip, consists of 52 prostate tumor samples and 50 normal adjacent samples. The raw data were converted to expression values using a robust multi-array average (RMA) approach.^{17} Different statistical methods were run on this dataset and their performance was evaluated by the number of genes, among the top 25 genes selected by each approach, known to have biological relevance according to the National Cancer Institute Cancer Gene Index, available at https://cabig.nci.nih.gov/inventory/data-resources/cancer-gene-index/. The comparison of these different statistical approaches is summarized in Table 2. LSOSS, which identified 5 prostate cancer related genes (RB1,^{18} UBE2E3,^{19} BMI1,^{20} BTG2^{21} and ELF1^{22}), was the best approach with this dataset.

Breast cancer is a heterogeneous disease.^{23}^{,}^{24} Although a number of candidate cancer outliers were identified by existing tools, the heterogeneous gene activation patterns of cancer samples were not addressed after the usage of such methods. LSOSS was applied to the human breast cancer data set from West et al.^{5} At a *P-*value cutoff of 0.05, 228 genes were selected for further analysis. The hierarchical clustering method described in the Methods section was then implemented. Three main classes of heterogeneous activation patterns of human breast cancer were observed (Fig. 2). The samples and common outliers in each class are shown in Table 3. Interestingly, we found that the first class consists of 6 ER+ samples, the second class consists of 5 ER− samples, and the third class is a mixture of 4 ER+ and 1 ER− samples. The common outlier genes in each class are regarded as its genetic signature. It is worth noting that although some genes may be part of the genetic signature of different classes of cancer samples, each class has a unique gene signature. For the remaining 8 cancer samples without significant common outliers, their classes were assigned according to their coverage of the gene signatures for different classes (Table 4). Among them, 6 were classified into the mixture group and two others were classified into ER+ and ER− groups.

Unraveling the heterogeneous patterns of cancer samples is an important goal in medical research, especially for clinical diagnosis and the molecular understanding of cancer mechanisms. The heterogeneous patterns of oncogene activation have been well studied and several useful statistical tools have been proposed. LSOSS is a reasonable model to detect cancer outlier differential gene expression. For each gene, LSOSS tries to find an optimal “change point” in the ordered expression values of cancer samples. If a gene is expressed heterogeneously in cancer samples, the variance of gene expression values in cancer samples is overestimated by the *t*-statistic while LSOSS gives an appropriate estimate. Furthermore, LSOSS uses the mean value of the subset of interest instead of the overall mean value of the cancer samples. Thus, LSOSS detects cancer outliers more easily. If a gene is expressed homogeneously in cancer samples, LSOSS still works well: it behaves similarly to the *t*-statistic because the mean values of the two subsets are expected to be very close in this case.
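The variance argument can be made explicit with the standard decomposition of a sum of squares into within-subset and between-subset parts. This is a textbook identity added here for illustration, written in the notation of the Methods section. For any split point $k$,

$$\sum _{1\le j\le m}{({y}_{ij}-{\overline{y}}_{i})}^{2}={\mathit{\text{SS}}}_{{S}_{ik1}}+{\mathit{\text{SS}}}_{{S}_{ik2}}+\frac{k(m-k)}{m}{({\overline{y}}_{{S}_{ik1}}-{\overline{y}}_{{S}_{ik2}})}^{2}.$$

The cancer-group sum of squares used by the *t*-statistic therefore exceeds the one used by LSOSS by a non-negative term. That term grows with the squared gap between the two subset means, so it is large when a subset of samples is strongly activated, and it shrinks toward zero as the two subset means approach each other.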

However, a single oncogene with heterogeneous expression cannot fully account for the heterogeneous gene activation patterns of cancer samples, as the synergistic and epistatic effects among multiple oncogenes should not be neglected. Thus, it is necessary to classify cancer samples and assign each class a specific gene signature. This goal, if achieved, will facilitate the understanding of different underlying pathologies and genetics for heterogeneous cancers. Our proposed scheme could be a useful tool toward this goal. Three classes of heterogeneous gene activation patterns of human breast cancer were detected with specific gene signatures. In addition, these heterogeneous gene activation patterns may be regarded as the signatures for subtypes of human breast cancer. Thus, the procedure presented could also be useful in detecting and classifying breast cancer subtypes. The classification of breast cancer subtypes has been well discussed.^{25}^{–}^{28} Our approach, however, differs from previous studies mainly in that the classification is based on different combinational activation patterns of candidate genes instead of clustering their expression values. The detection of specific gene interactions accounting for heterogeneous gene activation patterns of cancers is our next goal in this direction.

We thank Jamie Williams for critical reading of the manuscript. This study was supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.

**Disclosure**

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

1. Tomlins SA, Rhodes DR, Perner S, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–8.

2. Tibshirani R, Hastie T. Outlier sums for differential gene expression analysis. Biostatistics. 2007;8:2–8.

3. Wu B. Cancer outlier differential gene expression detection. Biostatistics. 2007;8:566–75.

4. Lian H. MOST: detecting cancer differential gene expression. Biostatistics. 2008;9:411–8.

5. West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A. 2001;98:11462–7.

6. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003;19:185–93.

7. Kuznetsova EB, Kekeeva TV, Larin SS, et al. Novel methylation and expression markers associated with breast cancer. Mol Biol (Mosk). 2007;41:624–33.

8. Lee JE, Kim HJ, Bae JY, et al. Neogenin expression may be inversely correlated to the tumorigenicity of human breast cancer. BMC Cancer. 2005;5:154.

9. Gaugler B, van den Eynde B, et al. Human gene MAGE-3 codes for an antigen recognized on a melanoma by autologous cytolytic T lymphocytes. J Exp Med. 1994;179:921–30.

10. Gómez-Esquer F, Agudo D, Martínez-Arribas F, Nunez-Villar MJ, Schneider J. mRNA expression of the angiogenesis markers VEGF and CD105 (endoglin) in human breast cancer. Anticancer Res. 2004;24:1581–5.

11. Garib V, Lang K, Niggemann B, Zänker KS, Brandt L, Dittmar T. Propofol-induced calcium signalling and actin reorganization within breast carcinoma cells. Eur J Anaesthesiol. 2005;22:609–15.

12. Ye C, Cai Q, Dai Q, et al. Expression patterns of the ATM gene in mammary tissues and their associations with breast cancer survival. Cancer. 2007;109:1729–35.

13. Schneider J, Linares R, Martínez-Arribas F, et al. Developing chick embryos express a protein which shares homology with the nuclear pore complex protein Nup88 present in human tumors. Int J Dev Biol. 2004;48:339–42.

14. Calaf GM, Roy D. Human drug metabolism genes in parathion- and estrogen-treated breast cells. Int J Mol Med. 2007;20:875–81.

15. Kunz-Schughart LA, Heyder P, Schroeder J, Knuechel R. A heterologous 3-D coculture model of breast tumor cells and fibroblasts to study tumor-associated fibroblast differentiation. Exp Cell Res. 2001;266:74–86.

16. Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–9.

17. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–64.

18. Cooney KA, Wetzel JC, Merajver SD, Macoska JA, Singleton TP, Wojno KJ. Distinct regions of allelic loss on 13q in prostate cancer. Cancer Res. 1996;56:1142–5.

19. Bull JH, Ellison G, Patel A, et al. Identification of potential diagnostic markers of prostate cancer and prostatic intraepithelial neoplasia using cDNA microarray. Br J Cancer. 2001;84:1512–9.

20. Berezovska OP, Glinskii AB, Yang Z, Li XM, Hoffman RM, Glinsky GV. Essential role for activation of the Polycomb group (PcG) protein chromatin silencing pathway in metastatic prostate cancer. Cell Cycle. 2006;5:1886–901.

21. Ficazzola MA, Fraiman M, Gitlin J, et al. Antiproliferative B cell translocation gene 2 protein is down-regulated post-transcriptionally as an early event in prostate carcinogenesis. Carcinogenesis. 2001;22:1271–9.

22. Takai N, Miyazaki T, Nishida M, Nasu K, Miyakawa I. The significance of Elf-1 expression in epithelial ovarian carcinoma. Int J Mol Med. 2003;12:349–54.

23. Bertucci F, Birnbaum D. Reasons for breast cancer heterogeneity. J Biol. 2008;7:6.

24. Anderson WF, Matsuno R. Breast cancer heterogeneity: a mixture of at least two main types? J Natl Cancer Inst. 2006;98:948–51.

25. Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–52.

26. Sørlie T, Perou CM, Tibshirani R, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001;98:10869–74.

27. Sørlie T, Tibshirani R, Parker J, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003;100:8418–23.

28. Kapp AV, Jeffrey SS, Langerød A, et al. Discovery and validation of breast cancer subtypes. BMC Genomics. 2006;7:231.
