With the advent of high-throughput gene assay technologies, scientists are now able to measure genome-wide mRNA expression levels in a variety of settings using DNA microarrays (Schena, 2000). One of the major tasks in studies involving these technologies is to find genes that are differentially expressed between 2 experimental conditions. The simplest example is to find genes that are up- or downregulated in cancerous tissue relative to healthy tissue. Typically in these experiments, the number of genes, represented as spots on the biochip, is much larger than the number of independent samples in the study. Consequently, assessing differential expression in this setting requires performing several thousand hypothesis tests, which raises the problem of multiple comparisons.
There is an enormous literature on the statistical assessment of differential expression in genomic studies (e.g. Efron and others, 2001; Dudoit and others, 2002), along with multiple comparisons procedures for controlling an appropriate error rate, such as the familywise type-I error rate (FWER) (Shaffer, 1995) or the false discovery rate (FDR) (Benjamini and Hochberg, 1995). However, in most of these studies, differential expression is assessed either by testing for a difference in mean expression or by testing that the entire distribution functions for gene expression in the 2 conditions are the same. For the former scenario, the most commonly used procedure is the 2-sample t-test, while for the latter, the Wilcoxon rank sum test is used.
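To make the multiple comparisons step concrete, the following is a minimal sketch of the Benjamini and Hochberg (1995) step-up rule as it is usually stated: with m ordered p-values, reject the k smallest, where k is the largest rank satisfying p_(k) <= kq/m. The function name `benjamini_hochberg` and the p-values in the usage note are hypothetical, chosen only for illustration; this is not the procedure developed in this article.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up rule at nominal FDR level q.

    Illustrative sketch only; assumes one p-value per gene.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical per-gene p-values, e.g. from 2-sample t-tests.
    p = [0.039, 0.001, 0.041, 0.008, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, q=0.05))
```

Because the rule is step-up, a p-value that fails its own threshold can still be rejected if a larger p-value passes a later threshold; the loop therefore scans all ranks rather than stopping at the first failure.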
A more interesting differential expression pattern was observed by Tomlins and others (2005). They noticed that for certain genes, only a fraction of samples in one group were overexpressed relative to those in the other group; the remaining samples showed no evidence of differential expression. Tomlins and others (2005) developed a ranking method known as cancer outlier profile analysis (COPA) for calculating outlier scores using gene expression data. Their score was purely descriptive; they did not attempt to assign any measure of significance to the gene scores. More recently, Tibshirani and Hastie (2007) and Wu (2007) have shown that significance can be assigned using modifications of 2-sample t-tests. In addition, a nonparametric methodology proposed by Lyons-Weiler and others (2004) could be applied to this problem as well. We discuss all these proposals in Section 3.2.
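As a rough illustration of what an outlier score of this kind looks like, the sketch below computes a COPA-style score for a single gene under assumptions that are ours rather than taken from this section: expression values are centered at the pooled median, scaled by the pooled median absolute deviation (MAD), and the score is an upper percentile of the scaled values in the tumor group. The function name `copa_style_score`, the nearest-rank percentile rule, the default of the 90th percentile, and the data are all hypothetical.

```python
import math
from statistics import median


def copa_style_score(normal, tumor, pct=90):
    """COPA-style outlier score for one gene (illustrative sketch).

    Assumptions: robust centering and scaling use the pooled samples,
    and the score is the nearest-rank `pct` percentile of the scaled
    tumor values.
    """
    pooled = list(normal) + list(tumor)
    med = median(pooled)
    mad = median([abs(x - med) for x in pooled])
    if mad == 0:
        raise ValueError("MAD is zero; the score is undefined for this gene")

    # Robustly standardized tumor values, sorted ascending.
    scaled = sorted((x - med) / mad for x in tumor)

    # Nearest-rank percentile: smallest rank covering pct% of the values.
    rank = max(1, math.ceil(pct / 100 * len(scaled)))
    return scaled[rank - 1]


if __name__ == "__main__":
    normal = [1.0, 1.1, 0.9, 1.0, 1.05]
    # Only 2 of 6 tumor samples are overexpressed.
    tumor_outlier = [1.0, 0.95, 1.1, 5.0, 6.0, 1.0]
    tumor_flat = [1.0, 0.95, 1.1, 1.05, 0.9, 1.0]
    print(copa_style_score(normal, tumor_outlier))
    print(copa_style_score(normal, tumor_flat))
```

Using the median and MAD instead of the mean and standard deviation keeps the few overexpressed samples from inflating the scale, so a gene with outliers in only a subset of tumors still receives a large score.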
We should mention that while the term “outlier” has a pejorative meaning in statistics, it is a very meaningful concept in a biological sense. As noted by Lyons-Weiler and others (2004) and subsequently by Tomlins and others (2005), the biology of oncogenesis allows distinct sets of genes to be involved in tumor development in different patients. While statistical outliers refer to measurements that exceed the expected variation in a set of data, the oncogenetic outliers we seek to find will be putatively related to cancer processes.
The goal of this article is to describe a relatively general statistical model for the outlier approach of Tomlins and others (2005). By formulating the probabilistic model, we can clarify various issues in outlier profile analysis that have not been previously addressed and better situate the proposals of prior authors. In particular, their proposals are parametric in nature; we develop alternative nonparametric procedures for outlier analysis with genomic data. As a by-product of our methods, we link multiple testing procedures with outlier detection.
The paper is structured as follows. In Section 2, we describe the data setup and formulate the statistical model for outlier profile analysis in the case of a single gene. Doing this allows us to establish results about identifiability as well as develop a sample-specific hypothesis of interest. We also develop the proposed nonparametric estimation procedure and link it with multiple testing methodology. In Section 3, we describe the general procedure for genome-wide expression data sets and relate it to the prior proposals in the literature. In Section 4, we describe the application of the proposed methodology to simulated data. Finally, we conclude with some discussion in Section 5.