Biomark Insights. 2010; 5: 69–78.
Published online 2010 August 5.
PMCID: PMC2918352

LSOSS: Detection of Cancer Outlier Differential Gene Expression


Detection of differential gene expression using microarray technology has received considerable interest in cancer research studies. Recently, many researchers discovered that oncogenes may be activated in some but not all samples in a given disease group. The existing statistical tools for detecting differentially expressed genes in a subset of the disease group mainly include cancer outlier profile analysis (COPA), outlier sum (OS), outlier robust t-statistic (ORT) and maximum ordered subset t-statistics (MOST). In this study, another approach named Least Sum of Ordered Subset Square t-statistic (LSOSS) is proposed. The results of our simulation studies indicated that LSOSS often has more power than previous statistical methods. When applied to real human breast and prostate cancer data sets, LSOSS was competitive in terms of the biological relevance of top ranked genes. Furthermore, a modified hierarchical clustering method was developed to classify the heterogeneous gene activation patterns of human breast cancer samples based on the significant genes detected by LSOSS. Three classes of gene activation patterns, which correspond to estrogen receptor (ER)+, ER− and a mixture of ER+ and ER−, were detected and each class was assigned a different gene signature.

Keywords: differential gene expression, cancer, outlier


The most widely used method for detecting differential gene expression in comparative microarray studies is the two-sample t-statistic. A gene is declared significant if its absolute t-value exceeds a threshold c, which is usually determined by the corresponding P-value or false discovery rate. Recently, Tomlins et al. [1] introduced the cancer outlier profile analysis (COPA) method for detecting cancer genes that are differentially expressed in a subset of disease samples. Heterogeneous patterns of oncogene activation were observed in the majority of cancer types considered in their studies. Several further methods in this direction have since been proposed: Tibshirani and Hastie [2] introduced the outlier sums (OS) method, Wu [3] proposed the outlier robust t-statistic (ORT), and Lian [4] introduced the maximum ordered subset t-statistics (MOST) procedure.

In this study, a simple statistical test named Least Sum of Ordered Subset Square t-statistic (LSOSS) is proposed for detecting cancer outlier differential gene expression. The performance of LSOSS was compared to existing procedures using both simulated and real data sets. Furthermore, we extended previous studies by classifying heterogeneous gene activation patterns of human breast cancer.

Existing statistical methods

Assume that case-control microarray data were generated to detect differentially expressed genes, with n samples from a normal group and m samples from a cancer group. Let $x_{ij}$ be the expression value for gene i (i = 1, 2, …, p) and sample j (j = 1, 2, …, n) in the normal group, and let $y_{ij}$ be the expression value for gene i (i = 1, 2, …, p) and sample j (j = 1, 2, …, m) in the cancer group. In this study, and without loss of generality, we consider only one-sided tests in which the activated genes in cancer samples are over-expressed (up-regulated).

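The two-sample t-statistic for gene i is defined by:

$$t_i = \frac{\bar{y}_i - \bar{x}_i}{s_i}$$

where $\bar{y}_i$ is the mean expression value in cancer samples, $\bar{x}_i$ is the mean expression value in normal samples for gene i, and $s_i$ is the pooled standard error estimate given by:

$$s_i^2 = \frac{\sum_{1 \le j \le n} (x_{ij} - \bar{x}_i)^2 + \sum_{1 \le j \le m} (y_{ij} - \bar{y}_i)^2}{n + m - 2}$$

The t-statistic is powerful when most cancer samples are activated.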
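As a minimal illustration of this baseline, the sketch below computes the statistic for a single gene in Python. It assumes that the pooled estimate is the square root of the combined within-group sums of squares divided by n + m − 2; the function name `t_statistic` and the toy data are hypothetical and are not taken from the original analysis.

```python
from math import sqrt
from statistics import mean


def t_statistic(x, y):
    """Two-sample statistic for one gene: (mean(y) - mean(x)) / s.

    x holds the normal samples and y the cancer samples. The pooled
    estimate s is an assumption of this sketch (see the lead-in).
    """
    n, m = len(x), len(y)
    x_bar, y_bar = mean(x), mean(y)
    # Within-group sums of squares, pooled over both groups.
    ss = sum((v - x_bar) ** 2 for v in x) + sum((v - y_bar) ** 2 for v in y)
    s = sqrt(ss / (n + m - 2))
    return (y_bar - x_bar) / s


normal = [0.0, 1.0, 2.0]
cancer = [2.0, 3.0, 4.0]
print(t_statistic(normal, cancer))  # 2.0
```

Because both group means and both sums of squares use every sample, a handful of activated cancer samples shifts the numerator only slightly while inflating the denominator, which is the weakness the outlier-oriented statistics below try to address.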

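Tomlins et al. [1] define the COPA statistic as

$$\mathrm{COPA}_i = \frac{q_r(\{y_{ij} : 1 \le j \le m\}) - \mathrm{med}_i}{\mathrm{mad}_i}$$

where $q_r(\cdot)$ is the rth percentile of the expression data, and $\mathrm{med}_i$ is the median expression value for all samples,

$$\mathrm{med}_i = \mathrm{median}(\{x_{ij} : 1 \le j \le n\}, \{y_{ij} : 1 \le j \le m\})$$

and $\mathrm{mad}_i$ is the median absolute deviation of expression values in all samples, given by:

$$\mathrm{mad}_i = 1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_i| : 1 \le j \le n\}, \{|y_{ij} - \mathrm{med}_i| : 1 \le j \le m\})$$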


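The COPA statistic uses a fixed rth sample percentile, which is chosen by the user. This limitation was overcome by the OS statistic [2], defined by:

$$\mathrm{OS}_i = \frac{\sum_{y_{ij} \in R_i} (y_{ij} - \mathrm{med}_i)}{\mathrm{mad}_i}$$

where $R_i = \{y_{ij} : y_{ij} > q_{75}(\{x_{ij} : 1 \le j \le n\}, \{y_{ij} : 1 \le j \le m\}) + \mathrm{IQR}(\{x_{ij} : 1 \le j \le n\}, \{y_{ij} : 1 \le j \le m\})\}$ and $\mathrm{IQR}(\cdot) = q_{75}(\cdot) - q_{25}(\cdot)$ is the interquartile range of the expression data.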
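The COPA and OS statistics can be sketched for a single gene as follows. This is an illustrative Python version under stated assumptions, not the code used in the cited papers: percentiles are computed by linear interpolation between order statistics, the median absolute deviation is scaled by the usual normal-consistency constant 1.4826, and the helper names (`percentile`, `med_mad`, `copa`, `outlier_sum`) are hypothetical.

```python
from statistics import median


def percentile(values, r):
    """r-th percentile (0 to 100) by linear interpolation (an assumption)."""
    s = sorted(values)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def med_mad(x, y):
    """Median and scaled MAD over all samples (normal and cancer pooled)."""
    pooled = list(x) + list(y)
    med = median(pooled)
    mad = 1.4826 * median([abs(v - med) for v in pooled])
    return med, mad


def copa(x, y, r=90):
    """COPA: a fixed percentile of the cancer samples, centred and scaled."""
    med, mad = med_mad(x, y)
    return (percentile(y, r) - med) / mad


def outlier_sum(x, y):
    """OS: sum of standardized cancer values above q75 + IQR of all samples."""
    pooled = list(x) + list(y)
    med, mad = med_mad(x, y)
    q25, q75 = percentile(pooled, 25), percentile(pooled, 75)
    cutoff = q75 + (q75 - q25)
    return sum((v - med) / mad for v in y if v > cutoff)


normal = [0.0, 1.0, 2.0, 3.0, 4.0]
cancer = [1.0, 2.0, 3.0, 4.0, 20.0]  # one strongly activated sample
print(copa(normal, cancer), outlier_sum(normal, cancer))
```

In this toy gene only the value 20 exceeds the cutoff, so OS reduces to a single standardized term, whereas COPA depends on which percentile r the user picks.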


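Wu [3] modified the OS statistic into the ORT statistic, mainly by changing the definition of $R_i$ to:

$$R_i = \{y_{ij} : y_{ij} > q_{75}(\{x_{ij} : 1 \le j \le n\}) + \mathrm{IQR}(\{x_{ij} : 1 \le j \le n\})\}$$

and replacing $\mathrm{med}_i$ in OS by $\mathrm{med}_{ix}$, the median expression value in normal samples. Further, $\mathrm{mad}_i$ was replaced by

$$\mathrm{mad}_i = 1.4826 \times \mathrm{median}(\{|x_{ij} - \mathrm{med}_{ix}| : 1 \le j \le n\}, \{|y_{ij} - \mathrm{med}_{iy}| : 1 \le j \le m\})$$

where $\mathrm{med}_{iy}$ is the median expression value in cancer samples.

Lian [4] argued that the OS and ORT statistics use arbitrary outlier thresholds and proposed the MOST statistic, which considers all possible values of the outlier threshold. The MOST procedure requires the cancer sample expression data to be sorted in descending order and the following statistic to be calculated:

$$\mathrm{MOST}_i = \max_{1 \le k \le m} \frac{\sum_{1 \le j \le k} (y_{ij} - \mathrm{med}_{ix}) / \mathrm{mad}_i - \mu_k}{\delta_k}$$

$\mu_k$ and $\delta_k$ are obtained from the order statistics of m samples generated from a standard normal distribution and are used to make values of the statistic comparable across different values of k.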
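The ORT and MOST statistics can be sketched in the same way. The Python below is an illustration under assumptions rather than the published implementation: percentiles use linear interpolation, the MAD is scaled by 1.4826, and $\mu_k$ and $\delta_k$ are approximated by Monte Carlo as the mean and standard deviation of the sum of the k largest of m standard normal draws. All function names and the `draws` and `seed` parameters are hypothetical.

```python
import random
from statistics import mean, median, stdev


def percentile(values, r):
    """r-th percentile (0 to 100) by linear interpolation (an assumption)."""
    s = sorted(values)
    pos = (len(s) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def group_mad(x, y):
    """Scaled MAD with each group centred on its own median (ORT style)."""
    med_x, med_y = median(x), median(y)
    dev = [abs(v - med_x) for v in x] + [abs(v - med_y) for v in y]
    return 1.4826 * median(dev)


def ort(x, y):
    """ORT: outlier cutoff and centring taken from the normal samples only."""
    med_x, mad = median(x), group_mad(x, y)
    q25, q75 = percentile(x, 25), percentile(x, 75)
    cutoff = q75 + (q75 - q25)
    return sum((v - med_x) / mad for v in y if v > cutoff)


def order_stat_moments(m, draws=2000, seed=0):
    """Monte Carlo estimates of mu_k and delta_k for k = 1..m."""
    rng = random.Random(seed)
    sums = [[] for _ in range(m)]
    for _ in range(draws):
        z = sorted((rng.gauss(0, 1) for _ in range(m)), reverse=True)
        total = 0.0
        for k in range(m):
            total += z[k]          # running sum of the k + 1 largest draws
            sums[k].append(total)
    return [mean(s) for s in sums], [stdev(s) for s in sums]


def most(x, y, mu, delta):
    """MOST: maximum over k of the standardized sum of the top k samples."""
    med_x, mad = median(x), group_mad(x, y)
    z = sorted(((v - med_x) / mad for v in y), reverse=True)
    best, total = float("-inf"), 0.0
    for k, v in enumerate(z):
        total += v
        best = max(best, (total - mu[k]) / delta[k])
    return best


normal = [0.0, 1.0, 2.0, 3.0, 4.0]
cancer = [1.0, 2.0, 3.0, 4.0, 20.0]
mu, delta = order_stat_moments(len(cancer), draws=500, seed=1)
print(ort(normal, cancer), most(normal, cancer, mu, delta))
```

The moments depend only on m, so in practice they would be computed once and reused for every gene.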


The least sum of ordered subset square t-statistic

In our proposed method, the Least Sum of Ordered Subset Square t-statistic (LSOSS), mean expression values in normal and cancer samples are used instead of median expression values. Our hypothesis is that if outliers are present among cancer samples, the distribution of gene expression values in cancer samples will have two peaks: the higher peak corresponds to activated samples and the lower peak to inactivated samples. Consequently, the outlier problem can be addressed by detecting a “change point” (or “break point”) in the ordered gene expression values of the cancer group, and a least-squares model is a natural fit for this goal. For each gene, an optimal change point in its expression can be detected and used to identify potential outliers among the cancer samples. The general idea of LSOSS is to use the sums of squares of two ordered subsets of cancer samples to estimate the sum of squares in the t-statistic, and to use the mean of the subset of interest, rather than the mean of all cancer samples, as the cancer-group mean in the t-statistic.

The proposed LSOSS method involves the following steps:

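  1. For each gene i, the expression levels in cancer samples are sorted in descending order and then divided into two subsets:
    $$S_{ik1} = \{y_{ij} : 1 \le j \le k\}, \quad S_{ik2} = \{y_{ij} : k + 1 \le j \le m\}$$
  2. For the two subsets, the mean and sum of squares for each gene i are calculated:
    $$\bar{y}_{S_{ik1}} = \frac{1}{k} \sum_{1 \le j \le k} y_{ij}, \quad SS_{ik1} = \sum_{1 \le j \le k} (y_{ij} - \bar{y}_{S_{ik1}})^2$$
    $$\bar{y}_{S_{ik2}} = \frac{1}{m - k} \sum_{k + 1 \le j \le m} y_{ij}, \quad SS_{ik2} = \sum_{k + 1 \le j \le m} (y_{ij} - \bar{y}_{S_{ik2}})^2$$
    The only remaining issue is the value of k that divides the two subsets. An exhaustive search is carried out over all possible values from 1 to m − 1, and the optimal k is the one that minimizes the pooled sum of squares for cancer samples:
    $$s_{iy}^2 = \min_{1 \le k \le m - 1} (SS_{ik1} + SS_{ik2})$$
    Let $s_{ix}^2$ be the sum of squares for normal samples, given by:
    $$s_{ix}^2 = \sum_{1 \le j \le n} (x_{ij} - \bar{x}_i)^2$$
    The pooled standard error estimate for gene i is defined by
    $$S_i^2 = \frac{s_{ix}^2 + s_{iy}^2}{n + m - 2}$$
  3. The LSOSS statistic for declaring a gene i with outlier differential expression in case samples is computed as:
    $$\mathrm{LSOSS}_i = \frac{k (\bar{y}_{S_{ik1}} - \bar{x}_i)}{S_i}$$
    ($\mathrm{LSOSS}_i = (m - k)(\bar{y}_{S_{ik2}} - \bar{x}_i)/S_i$ if repressed gene expression is of interest), where k can be interpreted as the number of outlier samples for gene i.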
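The steps above can be sketched for a single gene as follows. This Python version is a minimal illustration for the up-regulated case, assuming that the pooled estimate divides the combined sums of squares by n + m − 2 and that at least two cancer samples are available; the names `sum_sq` and `lsoss` and the toy data are hypothetical and this is not the authors' own code.

```python
from math import sqrt
from statistics import mean


def sum_sq(values):
    """Sum of squared deviations from the subset mean."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)


def lsoss(x, y):
    """Return (statistic, k) for one gene; x = normal, y = cancer samples.

    k is the size of the high-expression subset at the change point that
    minimizes the pooled within-subset sum of squares.
    """
    n, m = len(x), len(y)
    ys = sorted(y, reverse=True)          # step 1: descending order
    best_k, best_ss = None, float("inf")
    for k in range(1, m):                 # exhaustive search, k = 1..m-1
        ss = sum_sq(ys[:k]) + sum_sq(ys[k:])
        if ss < best_ss:
            best_k, best_ss = k, ss
    # Pooled estimate from the normal group and the two cancer subsets.
    s = sqrt((sum_sq(x) + best_ss) / (n + m - 2))
    stat = best_k * (mean(ys[:best_k]) - mean(x)) / s
    return stat, best_k


normal = [0.0, 1.0, 2.0, 1.0, 1.0]
cancer = [1.0, 0.0, 2.0, 9.0, 10.0, 1.0]   # two activated samples
stat, k = lsoss(normal, cancer)
print(k)  # 2
```

Each candidate split recomputes its sums of squares from scratch, which is quadratic in m per gene; for the sample sizes discussed here that is unlikely to matter, and running sums could be used if it did.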

A modified hierarchical clustering method for classification of heterogeneous gene activation patterns of human breast cancer

We developed a modified hierarchical clustering method for classification of heterogeneous gene activation patterns of human breast cancer samples. 100 permutations were conducted to assign a P-value to each gene. The d genes detected by LSOSS at a P-value < 0.05 were selected for further analysis. For each gene i, the cancer samples that were selected as outliers were marked by 1 and the rest by 0:

$$z_{iw} = \begin{cases} 1, & \text{if cancer sample } w \text{ is an outlier for gene } i \\ 0, & \text{otherwise} \end{cases}$$

Thus, each cancer sample w can be represented by a vector of length d consisting of 0s and 1s:

$$z_w = (z_{1w}, z_{2w}, \ldots, z_{dw})$$

For each cancer sample, the number of 1’s indicates the number of genes with outlier expression in that sample compared to other case samples. The similarity between any two cancer samples w and v was defined as the number of common outlier genes, obtained by counting the shared 1’s via $z_w \cdot z_v^T$. A hierarchical clustering method was then used to cluster the cancer samples. A bootstrap re-sampling method with 5000 replicates was used to assign a P-value to each sub-tree of the clustering, and the common outliers in a sub-tree with a P-value < 0.05 were highlighted. The cancer samples were then re-ordered according to the proposed clustering, and the vectors of the re-ordered samples formed a d × m two-dimensional array, which we displayed as a color image.
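The encoding and the similarity measure can be illustrated with a short Python sketch. It assumes that the outliers of a gene are its k highest-expressed cancer samples, with k supplied by the caller, and it covers only the indicator vectors and the pairwise similarity, not the clustering or the bootstrap step. The function names are hypothetical.

```python
def outlier_indicators(cancer, k):
    """Mark the k highest-expressed cancer samples of one gene with 1."""
    order = sorted(range(len(cancer)), key=lambda j: cancer[j], reverse=True)
    z = [0] * len(cancer)
    for j in order[:k]:
        z[j] = 1
    return z


def similarity_matrix(indicators):
    """indicators[i][w] is the 0/1 mark for gene i and cancer sample w.

    Returns S with S[w][v] = number of genes for which samples w and v
    are both outliers, i.e. the dot product of their indicator vectors.
    """
    m = len(indicators[0])
    return [
        [sum(row[w] * row[v] for row in indicators) for v in range(m)]
        for w in range(m)
    ]


# Three genes (rows) by four cancer samples (columns).
Z = [
    outlier_indicators([5.0, 4.0, 0.5, 0.2], 2),
    outlier_indicators([7.0, 6.5, 1.0, 0.9], 2),
    outlier_indicators([0.1, 0.3, 3.0, 2.8], 2),
]
S = similarity_matrix(Z)
print(S[0][1], S[0][2])  # 2 0
```

A similarity matrix of this kind can be converted to a dissimilarity (for example, by subtracting each entry from the largest similarity) before it is passed to a standard hierarchical clustering routine; that conversion is a design choice of this sketch rather than something specified in the text.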


Simulation studies

Simulation studies were conducted to compare the performance of LSOSS with that of MOST, ORT, OS, COPA and the t-statistic. To this end, the R source code from Lian [4] was used. The simulation assumed equal numbers of normal and cancer samples (n = m = 20), and the expression data were generated from a standard normal distribution. Expression values for 2000 genes were simulated, of which 1000 genes were assumed to be differentially expressed; their data were generated by adding a constant, u, to their expression in the first k cancer samples.
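The data-generating scheme described above can be sketched in Python as follows. This is an illustration of the stated design only, not the R code from Lian that was actually used; the function name `simulate`, its parameters and the choice to place the differentially expressed genes first are assumptions of this sketch.

```python
import random


def simulate(n=20, m=20, genes=2000, de_genes=1000, k=10, u=2.0, seed=0):
    """Simulate expression matrices (genes by samples) for both groups.

    All values are standard normal; for the first de_genes genes, the
    constant u is added to the first k cancer samples.
    """
    rng = random.Random(seed)
    normal = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(genes)]
    cancer = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(genes)]
    for i in range(de_genes):
        for j in range(k):
            cancer[i][j] += u
    # 1 marks a truly differentially expressed gene, 0 a null gene.
    labels = [1] * de_genes + [0] * (genes - de_genes)
    return normal, cancer, labels


normal, cancer, labels = simulate(k=10, u=2.0)
print(len(normal), len(cancer[0]), sum(labels))  # 2000 20 1000
```

The returned labels provide the ground truth against which any gene-level statistic can be scored.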

The receiver operating characteristic (ROC) curve was used to evaluate the performance of the different statistical methods. Figure 1 shows the ROC curves for different combinations of k and u. When k = 10 and u = 2, LSOSS clearly outperformed the other methods, and it was second best when k = 5 or 15 and u = 2. When k = 20 and u = 2, LSOSS was comparable to ORT and better than OS and COPA. When u was decreased to 1 with k = 10, LSOSS was the only method comparable to the t-statistic. LSOSS showed low sensitivity when k = 2. However, the case where only one or two samples are activated within a large number of cancer samples may be less realistic. Overall, the performance of LSOSS is appealing in terms of detection power.
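For readers who want to reproduce this kind of comparison, an ROC curve can be built from per-gene scores and true labels as in the Python sketch below. It is a generic construction with trapezoidal area, not the plotting code behind Figure 1; tied scores are not given special treatment, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    Genes are visited from the highest score to the lowest; each step
    lowers the threshold by one gene. labels are 1 (truly differentially
    expressed) or 0 (null). Both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
print(auc(roc_points(scores, labels)))  # 1.0
```

Applying the same two functions to the scores produced by each statistic on one simulated data set gives curves that can be compared directly.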

Figure 1.
ROC curves comparing different statistical methods.

Application to human breast cancer data

The breast cancer microarray data from West et al. [5] are available online. The data were normalized by the quantile method [6], and the log-transformed expression values were used in the following analysis. The dataset contains 7129 genes and 49 tumor samples, of which 25 have negative lymph nodes (LN−) and 24 have positive lymph nodes (LN+). We treated the LN− samples as the control group and the LN+ samples as the cancer group. Genes with expression below a threshold (log(10)) in at least 20 samples were removed from the analysis. To evaluate LSOSS on these data, we counted how many of the top 25 genes selected by each statistical approach had documented biological relevance in the literature. The numbers of breast cancer related genes identified by the existing methods (Table 1) were 8, 8, 4, 3, and 2 for MOST, ORT, OS, the t-statistic, and COPA, respectively, whereas LSOSS identified 9: KCNH2 [7], NEO1 [8], MAGEA3 [9], ENG [10], GABRG2 [11], ATM [12], NUP88 [13], CYP3A7 [14] and PMP22 [15]. Although this should not be treated as a gold standard for evaluating different statistical tools, this type of analysis generally supports the statistical results and highlights their biological relevance.
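The quantile normalization step mentioned above can be illustrated with a simplified Python sketch. It follows the basic idea of the method (every sample is forced to share the same empirical distribution, taken as the mean of the sorted samples) and breaks ties by position rather than averaging them, which is a simplification of this sketch. The function name and the toy data are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples; columns[s] holds all gene values of sample s.

    Each sample's values are replaced by the mean of the values that
    share the same rank across all samples.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(sc[r] for sc in sorted_cols) / len(columns) for r in range(n_genes)
    ]
    normalized = []
    for c in columns:
        ranks = sorted(range(n_genes), key=lambda g: c[g])
        out = [0.0] * n_genes
        for r, g in enumerate(ranks):
            out[g] = reference[r]
        normalized.append(out)
    return normalized


samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(samples))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step every sample contains the same set of values, so differences between groups reflect changes in gene rank rather than array-wide intensity shifts.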

Table 1.
Genes confirmed to be associated with breast cancer that are ranked on the top 25 identified using different cancer outlier detection approaches.

Application to human prostate cancer data

To further assess the performance of LSOSS on real data, we downloaded a human prostate cancer dataset [16]. This dataset, generated with the Affymetrix HG-U95av2 chip, consists of 52 prostate tumor samples and 50 normal adjacent samples. The raw data were converted to expression values using the robust multi-array average (RMA) approach [17]. The different statistical methods were run on this dataset, and their performance was evaluated by the number of genes among the top 25 selected by each approach that are known to have biological relevance according to the National Cancer Institute Cancer Gene Index. The comparison of these statistical approaches is summarized in Table 2. LSOSS, which identified 5 prostate cancer related genes, RB1 [18], UBE2E3 [19], BMI1 [20], BTG2 [21] and ELF1 [22], was the best approach on this dataset.

Table 2.
Genes confirmed to be associated with prostate cancer that are ranked on the top 25 identified using different cancer outlier detection approaches.

Classification of heterogeneous gene activation patterns of human breast cancer

Breast cancer is a heterogeneous disease [23,24]. Although a number of candidate cancer outliers have been identified by existing tools, the heterogeneous gene activation patterns of cancer samples were not addressed by such methods. LSOSS was applied to the human breast cancer data set from West et al. [5]. At a P-value cutoff of 0.05, 228 genes were selected for further analysis. The hierarchical clustering method described in the Methods section was then applied. Three main classes of heterogeneous activation patterns of human breast cancer were observed (Fig. 2). The samples and common outliers in each class are shown in Table 3. Interestingly, the first class consists of 6 ER+ samples, the second class consists of 5 ER− samples, and the third class is a mixture of 4 ER+ samples and 1 ER− sample. The common outlier genes in each class are regarded as its genetic signature. It is worth noting that although some genes may be part of the genetic signature of more than one class, each class has a unique gene signature. The remaining 8 cancer samples, which had no significant common outliers, were assigned to classes according to their coverage of the gene signatures of the different classes (Table 4). Among them, 6 were classified into the mixture group and the other two into the ER+ and ER− groups.

Figure 2.
Color image for classification of heterogeneous gene activation patterns of human breast cancer.
Table 3.
Classes and biomarkers of heterogeneous gene activation patterns of human breast cancer.
Table 4.
Classification of the cancer samples lacking significant common outliers.

Discussion and Conclusions

Unraveling the heterogeneous patterns of cancer samples is an important goal in medical research, especially for clinical diagnosis and the molecular understanding of cancer mechanisms. The heterogeneous patterns of oncogene activation have been well studied and several useful statistical tools have been proposed. LSOSS is a reasonable model for detecting cancer outlier differential gene expression. For each gene, LSOSS tries to find an optimal “change point” in the ordered expression values of cancer samples. If a gene is expressed heterogeneously in cancer samples, the variance of its expression values in cancer samples is overestimated by the t-statistic, whereas LSOSS gives an appropriate estimate. Furthermore, LSOSS uses the mean of the subset of interest instead of the overall mean of the cancer samples. Thus, LSOSS detects cancer outliers more easily. If a gene is expressed homogeneously in cancer samples, LSOSS still works well: the mean values of the two subsets are expected to be very close in this case, so LSOSS behaves similarly to the t-statistic.

However, a single oncogene with heterogeneous expression cannot fully account for the heterogeneous gene activation patterns of cancer samples, as the synergistic and epistatic effects among multiple oncogenes should not be neglected. Thus, it is necessary to classify cancer samples and assign each class a specific gene signature. This goal, if achieved, would facilitate the understanding of the different underlying pathologies and genetics of heterogeneous cancers. Our proposed scheme could be a useful tool toward this goal. Three classes of heterogeneous gene activation patterns of human breast cancer were detected, each with a specific gene signature. In addition, these heterogeneous gene activation patterns may be regarded as signatures for subtypes of human breast cancer. Thus, the procedure presented could also be useful in detecting and classifying breast cancer subtypes. The classification of breast cancer subtypes has been well discussed [25–28]. Our approach, however, differs from previous studies mainly in that the classification is based on different combinational activation patterns of candidate genes instead of clustering their expression values. The detection of specific gene interactions accounting for heterogeneous gene activation patterns of cancers is our next goal in this direction.


We thank Jamie Williams for critical reading of the manuscript. This study was supported in part by resources and technical expertise from the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.



This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.


1. Tomlins SA, Rhodes DR, Perner S, et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–8.
2. Tibshirani R, Hastie T. Outlier sums for differential gene expression analysis. Biostatistics. 2007;8:2–8.
3. Wu B. Cancer outlier differential gene expression detection. Biostatistics. 2007;8:566–75.
4. Lian H. MOST: detecting cancer differential gene expression. Biostatistics. 2008;9:411–8.
5. West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A. 2001;98:11462–7.
6. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics. 2003;19:185–93.
7. Kuznetsova EB, Kekeeva TV, Larin SS, et al. Novel methylation and expression markers associated with breast cancer. Mol Biol (Mosk). 2007;41:624–33.
8. Lee JE, Kim HJ, Bae JY, et al. Neogenin expression may be inversely correlated to the tumorigenicity of human breast cancer. BMC Cancer. 2005;5:154.
9. Gaugler B, van den Eynde B, et al. Human gene MAGE-3 codes for an antigen recognized on a melanoma by autologous cytolytic T lymphocytes. J Exp Med. 1994;179:921–30.
10. Gómez-Esquer F, Agudo D, Martínez-Arribas F, Nunez-Villar MJ, Schneider J. mRNA expression of the angiogenesis markers VEGF and CD105 (endoglin) in human breast cancer. Anticancer Res. 2004;24:1581–5.
11. Garib V, Lang K, Niggemann B, Zänker KS, Brandt L, Dittmar T. Propofol-induced calcium signalling and actin reorganization within breast carcinoma cells. Eur J Anaesthesiol. 2005;22:609–15.
12. Ye C, Cai Q, Dai Q, et al. Expression patterns of the ATM gene in mammary tissues and their associations with breast cancer survival. Cancer. 2007;109:1729–35.
13. Schneider J, Linares R, Martínez-Arribas F, et al. Developing chick embryos express a protein which shares homology with the nuclear pore complex protein Nup88 present in human tumors. Int J Dev Biol. 2004;48:339–42.
14. Calaf GM, Roy D. Human drug metabolism genes in parathion- and estrogen-treated breast cells. Int J Mol Med. 2007;20:875–81.
15. Kunz-Schughart LA, Heyder P, Schroeder J, Knuechel R. A heterologous 3-D coculture model of breast tumor cells and fibroblasts to study tumor-associated fibroblast differentiation. Exp Cell Res. 2001;266:74–86.
16. Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1:203–9.
17. Irizarry RA, Hobbs B, Collin F, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–64.
18. Cooney KA, Wetzel JC, Merajver SD, Macoska JA, Singleton TP, Wojno KJ. Distinct regions of allelic loss on 13q in prostate cancer. Cancer Res. 1996;56:1142–5.
19. Bull JH, Ellison G, Patel A, et al. Identification of potential diagnostic markers of prostate cancer and prostatic intraepithelial neoplasia using cDNA microarray. Br J Cancer. 2001;84:1512–9.
20. Berezovska OP, Glinskii AB, Yang Z, Li XM, Hoffman RM, Glinsky GV. Essential role for activation of the Polycomb group (PcG) protein chromatin silencing pathway in metastatic prostate cancer. Cell Cycle. 2006;5:1886–901.
21. Ficazzola MA, Fraiman M, Gitlin J, et al. Antiproliferative B cell translocation gene 2 protein is down-regulated post-transcriptionally as an early event in prostate carcinogenesis. Carcinogenesis. 2001;22:1271–9.
22. Takai N, Miyazaki T, Nishida M, Nasu K, Miyakawa I. The significance of Elf-1 expression in epithelial ovarian carcinoma. Int J Mol Med. 2003;12:349–54.
23. Bertucci F, Birnbaum D. Reasons for breast cancer heterogeneity. J Biol. 2008;7:6.
24. Anderson WF, Matsuno R. Breast cancer heterogeneity: a mixture of at least two main types? J Natl Cancer Inst. 2006;98:948–51.
25. Perou CM, Sørlie T, Eisen MB, et al. Molecular portraits of human breast tumours. Nature. 2000;406:747–52.
26. Sørlie T, Perou CM, Tibshirani R, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001;98:10869–74.
27. Sørlie T, Tibshirani R, Parker J, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003;100:8418–23.
28. Kapp AV, Jeffrey SS, Langerød A, et al. Discovery and validation of breast cancer subtypes. BMC Genomics. 2006;7:231.

Articles from Biomarker Insights are provided here courtesy of SAGE Publications