PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of biostsLink to Publisher's site
 
Biostatistics. 2009 January; 10(1): 60–69.
Published online 2008 June 6. doi:  10.1093/biostatistics/kxn015
PMCID: PMC2605210
HHMIMSID: HHMIMS58593

Genomic outlier profile analysis: mixture models, null hypotheses, and nonparametric estimation

Debashis Ghosh*
Department of Statistics and Department of Public Health Sciences, Pennsylvania State University, University Park, PA 16802, USA, ghoshd/at/psu.edu

Abstract

In most analyses of large-scale genomic data sets, differential expression analysis is typically assessed by testing for differences in the mean of the distributions between 2 groups. A recent finding by Tomlins and others (2005) is of a different type of pattern of differential expression in which a fraction of samples in one group have overexpression relative to samples in the other group. In this work, we describe a general mixture model framework for the assessment of this type of expression, called outlier profile analysis. We start by considering the single-gene situation and establishing results on identifiability. We propose 2 nonparametric estimation procedures that have natural links to familiar multiple testing procedures. We then develop multivariate extensions of this methodology to handle genome-wide measurements. The proposed methodologies are compared using simulation studies as well as data from a prostate cancer gene expression study.

Keywords: Bonferroni correction, DNA microarray, False discovery rate, Goodness of fit, Multiple comparisons, Uniform distribution

1. INTRODUCTION

With the advent of high-throughput gene assay technologies, scientists are now able to measure genome-wide mRNA expression levels in a variety of settings using DNA microarrays (Schena, 2000). One of the major tasks in studies involving these technologies is to find genes that are differentially expressed between 2 experimental conditions. The simplest example is to find genes that are up- or downregulated in cancerous tissue relative to healthy tissue. Typically in these experiments, the number of genes, represented as spots on the biochip, is much larger than the number of independent samples in the study. Consequently, assessing differential expression in this setting leads to performing several thousand hypothesis tests, which leads to the problem of multiple comparisons.

There has been an enormous literature on statistical assessment of differential expression in genomic studies (e.g. Efron and others, 2001; Dudoit and others, 2002), along with multiple comparisons procedures for controlling a proper error rate, such as the familywise type-I error (FWER) (Shaffer, 1995) or the false discovery rate (FDR) (Benjamini and Hochberg, 1995). However, in most of these studies, differential expression is tested using a test for difference in mean expression or testing that the entire distribution functions for gene expression in the 2 conditions are the same. For the former scenario, the most commonly used procedure is the 2-sample t-test, while for the latter, the Wilcoxon rank sum test is used.

A more interesting differential expression pattern was observed by Tomlins and others (2005). They noticed that for certain genes, only a fraction of samples in one group were overexpressed relative to those in the other group; the remaining samples showed no evidence of differential expression. Tomlins and others (2005) developed a ranking method known as cancer outlier profile analysis (COPA) for calculating outlier scores using gene expression data. Their score was purely descriptive; they did not attempt to assign any measure of significance to the gene scores. More recently, Tibshirani and Hastie (2007) and Wu (2007) have shown that significance can be assigned using modifications of 2-sample t-tests. In addition, a nonparametric methodology proposed by Lyons-Weiler and others (2004) could be applied to this problem as well. We discuss all these proposals in Section 3.2.

We should mention that while the term “outlier” has a pejorative meaning in statistics, it is a very meaningful concept in a biological sense. As noted by Lyons-Weiler and others (2004) and subsequently by Tomlins and others (2005), the biology of oncogenesis permits that unique sets of genes may be involved in tumor development across patients. While statistical outliers refer to measurements that exceed the expected variation in a set of data, the oncogenetic outliers we seek to find will be putatively related to cancer processes.

The goal of this article is to describe a relatively general statistical model for the outlier approach of Tomlins and others (2005). By formulating the probabilistic model, we can clarify various issues in outlier profile analysis that have not been previously addressed and better situate the proposals of prior authors. In particular, their proposals are parametric in nature; we come up with alternative nonparametric procedures for outlier analysis with genomic data. As a by-product of our methods, we link multiple testing procedures with outlier detection. The paper is structured as follows: In Section 2, we describe the data setup and formulate the statistical model for outlier profile analysis in the case of a single gene. Doing this allows us to establish results about identifiability as well as develop a sample-specific hypothesis of interest. We also develop the proposed nonparametric estimation procedure and link it with multiple testing methodology. In Section 3, we describe the general procedure with genome-wide expression data sets and relate the prior proposals in the literature. In Section 4, we describe application of the proposed methodology to simulated data. Finally, we conclude with some discussion in Section 5.

2. OUTLIER PROFILE ANALYSIS: SINGLE-GENE CASE

2.1. Data, inference, and proposed methodology

The data consist of (Ygi, Zi), where Ygi is the gene expression measurement on the gth gene for the ith subject and Zi is a binary indicator taking values 0 and 1, g = 1,…, m, i = 1,…,n. We will refer to the group with Z= 0 as nondiseased samples and Z = 1 as diseased samples. We will use the notation Yg· to denote the gene expression profile of the gth gene across all subjects and Y·i to represent the m-dimensional expression profile for the ith individual. We will assume that there are n0 samples with Z = 0 and n1 samples with Z = 1 so n = n0 + n1. Without loss of generality, we will assume that the first n0 samples come from the undiseased samples.

We first consider a simple situation of G = 1 gene. Then, a simple model for modeling Yi [equivalent] Ygi conditional on Zi is the following:

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx1_ht.jpg
(2.1)

where F0(y) is an unspecified distribution function, π0 is a fraction of samples that do not differentially express the gene between the 2 groups, and F1i is a family of distribution functions. The following lemma provides conditions under which such a scenario can be tested.

LEMMA 2.1

  • (a) If π0 ≠ 1 and F1iF0, then model (2.1) can be tested given the observed data.
  • (b) If Z1,…,Zn are not observed, then model (2.1) is not identifiable based on Y1,…,Yn.
Proof.

The proof of (a) follows from arguments in Section 3.1 of Genovese and Wasserman (2004). For (b), we will not be able to distinguish between F0 and F1i without information on Z. □

There are several points we wish to make at this stage. First, we have statistical independence because the model is for 1 gene across the n samples and the samples are independent. Second, it is obvious that if π0 = 0 and F1i does not depend on i, then we are reduced to a usual 2-sample problem. For that scenario, a common hypothesis to test is H0* : F0 = F1. Third, and perhaps, most importantly, model (2.1) is also a model for outliers in that those observations with Zi = 1 that come from F1i represent the outliers. For the given gene, one can thus potentially test the hypothesis H0i: the ith sample (i = 1,…,n) is not an outlier versus H1i: the ith sample is an outlier. We can actually test for a more specific hypothesis than has been discussed previously in the literature on outlier profile analysis, namely that for a given gene and given sample, the sample represents an outlier. Furthermore, the only assumption we need on F1i is that it does not equal F0. Note that the hypothesis being described here is more focused than that tested by Tibshirani and Hastie (2007) and Wu (2007). We will return to discussion of the hypothesis they test in Section 3.

We now develop the proposed procedure for our situation. At the first stage, we estimate F0 using the gene expression measurements with Zi = 0. This yields an empirical distribution function An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg(y) = (n0)−1An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx3_ht.jpg(Yi ≤ y, Zi = 0). Next, we transform the gene expression measurements with Zi = 1 using An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg, which generates new variables An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx4_ht.jpg= 1 − An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg(Yi), i = n0 + 1,…,n. If F0 were known, then for i = 1,…,n1,

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx5_ht.jpg

where FU(u) = u is the cumulative distribution function (cdf) of a uniform(0,1) distribution and FWi(u) = F0 о{F1i−1(u)}. We propose 2 algorithms for selecting outliers. Here is the first, referred to as the Bonferroni algorithm:

  1. Set an error level α.
  2. Reject H0i for H1i for the ith sample (i.e. declare the ith sample to be an outlier) if and only if
    An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx6_ht.jpg

We call this the Bonferroni algorithm because the rule in Step 2 of the algorithm is very similar to the Bonferroni correction for p-values in multiple testing. Here, the number of tests being performed is equal to the number of diseased samples in the data set. This is why we adjust the significance level by n1 in Step 2.

The second algorithm we propose is to use the Benjamini–Hochberg (BH) (1995) algorithm for outlier detection. It proceeds by first sorting the An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx4_ht.jpgs in increasing order, An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg(1)An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg(2)(...)An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg(n1), and then selecting outliers using the following 2-step algorithm:

  1. Set an error rate α.
  2. Take as outliers An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg(1),…,An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg(An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx8_ht.jpg), where An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx8_ht.jpg = max{1 ≤ in1:An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx4_ht.jpgiα/n1}. If no such An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx8_ht.jpg exists, conclude that there are no outliers.

We have been assuming that F1i(y) ≤ F0(y) ∀ y in (2.1). More generally, we could allow F1i(y) ≠ F0(y). However, then we would have to look for outliers that have both small and large values of Ui. Observe that F1i(y) ≤ F0(y) in (2.1) corresponds to gene expression being stochastically larger in diseased samples relative to nondiseased samples and F1i(y) > F0(y) the opposite is true. In practice, we recommend running the procedure twice, one assuming that F1i(y) ≤ F0(y) to find outlying samples with overexpressed genes, the second time assuming the opposite.

2.2. Outlier detection and multiple testing

The outlier detection algorithms we have proposed have a very natural connection with multiple testing procedures. Since we can use (2.1) as a model for outliers, we can decide whether or not each sample is an outlier using a hypothesis test; this yields a total of n1 tests of hypotheses. We can then cross-classify samples into the table based on their “true” outlier status versus what we declare: this is shown in Table 1. Note that in the table, the only observed quantities are (n1, R, Q). Everything else about the table is unobserved.

There is a direct correspondence between Table 1 and testing multiple hypotheses. Based on the table, we can construct appropriate error measures to control. By an error, we mean that we declare a sample to be an outlier when it is not an outlier in truth. Two popular error measures to control are the FWER (Shaffer, 1995) and the FDR (Benjamini and Hochberg, 1995). In words, the FWER is the probability of making at least 1 false declaration of a sample being an outlier, while the FDR is the average number of false outliers among the samples declared to be outliers. Using the notation of Table 1, FWER equals Pr(X ≥ 1), while the FDR is E[X/R|R > 0]Pr (R > 0). Assume that F0 is known. It is easy to then show the following results:

  • (a) The Bonferroni algorithm controls FWER and FDR at level α.
  • (b) The BH algorithm controls FDR at level α.
Table 1.

Outcomes of n1 tests of hypotheses regarding outlying samples

These are exact results for finite samples; one can invoke the theoretical results of Genovese and Wasserman (2004) in order to study the asymptotic properties of the parameter estimators in the model. It becomes more difficult to prove results about error control with the proposed procedure because it involves An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg rather than F0. Since we normalize the gene expression measurements for the Z = 1 group by An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg, the transformed observations are not statistically independent. Using the notation of Genovese and Wasserman, define T as a mapping from [0, 1]n1 into [0, 1]; we can then define the Bonferroni and BH algorithms as

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx9_ht.jpg

and

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx10_ht.jpg

where An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg= (An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg1,…,An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpgn1) and An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx11_ht.jpgn1(t) = (n1)−1An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx12_ht.jpg≤ t). However, the results of Genovese and Wasserman (2004) do not directly apply to this problem because of the dependence in the transformed observations An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpg. Assume that the densities corresponding to F0 and F1, f0 and f1, are continuous and that f1 is strictly positive on {y:0 < F1(y) < 1}. Then, this is sufficient to guarantee the convergence of An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg to F0; by the continuous mapping theorem (Van der Vaart and Wellner, 1996), this implies that the Bonferroni procedure will asymptotically control the FWER. For the BH procedure, we make the additional 2 assumptions. First, we assume that π is identifiable; conditions guaranteeing this are given in Genovese and Wasserman (2004, Section 3.1). Second, we assume that the range of F0 о F1−1 is [0, 1]. This guarantees the uniform convergence of An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx11_ht.jpgn1(t) to its population limit. By another application of the argmax continuous mapping theorem, we have that TBH controls the FDR.

3. OUTLIER PROFILE ANALYSIS: GENOME-WIDE CASE

3.1. Multivariate extensions: model and methodology

Now, we wish to consider the outlier profile analysis problem for genome-scale data such as data generated by a gene expression microarray experiment. Then, model (2.1) becomes the following:

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx13_ht.jpg
(3.1)
An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx14_ht.jpg
(3.2)

where An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx15_ht.jpg is an arbitrary distribution function. Note that we are leaving the structure of F0g and F1gi unspecified. We will return to this point later in Section 3.3. Now, suppose that in (3.2), the π 0g (g = 1,…,G) are a mixture themselves of a point mass at 1 and alternative distribution function so that

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx16_ht.jpg
(3.3)

Now, if π0g comes from the first mixture component, then there is no differential expression for the gth gene. To make the model sensible, we need that smaller values of π0g correspond to an increased likelihood of coming from the distribution function FP.

What most previous authors have tested within this model (Lyons-Weiler and others, 2004; Tibshirani and Hastie, 2007; Wu, 2007) is H0g: π0g = 1, g = 1,…,G. In contrast to the hypothesis described in Section 2, which is a sample-specific hypothesis involving outliers, the null hypothesis H0G here is a gene-specific one. When we think about assessing significance now, any multiple testing adjustment needs to account for the multiplicity of genes in the study and not the number of samples.

Our approach is to have the following class of scores:

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx17_ht.jpg
(3.4)

where An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpggi is the gene-specific analog of An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx7_ht.jpgi from Section 2.2, Wgi is a weight function, and cig is a critical value depending on the particular procedure being used (Bonferroni 1, Bonferroni 2, or BH). Two natural choices for W are Wgi = 1 and Wgi = Ygi. Using the first weight function will make the statistic Sg fairly discrete, while using the second weight function will make the statistic Sg be more continuous. In this paper, we use the second one.

To derive the null distribution of (3.4), we permute the class labels (Z) between the cases and the controls. During this permutation, we recalculate An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx2_ht.jpg. Based on this, we can then perform the usual multiple testing adjustments controlling either the FWER or the FDR. Note that for this situation, we must adjust for the number of genes since the number of hypotheses being tested is equal to the number of genes on the microarray. Based on the permutations, we can then adjust the p-values for multiple testing. A variety of procedures for doing so based on FWER can be found in Dudoit and others (2002). For presenting scientists with a list of genes calibrated in evidence for outlierness, we use the q-value approach of Storey and Tibshirani (2003). In words, the q-value is approximately the smallest FDR at which we would reject the null hypothesis that there is no outlying expression for the gth gene in diseased relative to nondiseased samples. We then rank genes based on the q-value.

3.2. Comparison with previous methods

It is instructive to consider the difference between the proposed methodology versus those previous authors have constructed. We first start with the approach of Tomlins and others (2005). They standardize the data across all samples and create new measurements

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx18_ht.jpg

where median(Y) is the median of the vector Y and MAD(Y) denotes the median absolute deviation of Y. The COPA score of Tomlins and others (2005) for the gth gene is the following:

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx19_ht.jpg
(3.5)

where qr(Y) denotes the rth percentile of the vector Y. In words, Tomlins and others (2005) use the rth percentile of Yg·* for all samples; in practice, they consider r to be 75, 90, and 99. What Tibshirani and Hastie (2007) propose is a modified t-test for finding this pattern; they use as their statistic

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx20_ht.jpg
(3.6)

where IQR(Y) denotes the interquartile range of a vector Y. Tibshirani and Hastie (2007) argue that the outlier sum (OS) score (3.6) is potentially more efficient than the COPA score because it sums over all outlying disease samples.

Wu (2007) develops an approach called the outlier robust t-statistic (ORT). He seeks to separate the diseased and undiseased populations as much as possible because he argues that it is possible for the distributions of gene expression measurements for the 2 groups to be different. His statistic is

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx21_ht.jpg
(3.7)

There is a procedure proposed by Lyons-Weiler and others (2004), called the permutation percentile separability test (PPST), that could also be applied to this problem. Their statistic is

An external file that holds a picture, illustration, etc.
Object name is biostskxn015fx22_ht.jpg
(3.8)

For the last 3 formulae (3.6–3.8), the authors derive null distributions using permutation of the nondiseased and diseased samples. In comparing (3.5–3.8), we highlight several points. First, the original COPA measure (3.5) of Tomlins and others (2005) did not attempt to ascribe any measure of significance and defines the threshold based on all samples. The other approaches all attempt to use more statistically motivated criteria for assessing outliers, and they all sum over all n1 samples in the diseased population. The OS approach (3.6) uses all samples for ranking, but the other 2 approaches (3.7) and (3.8) only construct a cutoff using the samples in the nondiseased category. The latter 3 approaches (3.6–3.8) all test the hypothesis H0g : πg = 1.

Our approach in (3.4) differs in one major respect from the scores (3.6–3.8). We seek to control an error rate measure, which none of the other proposals do. This has the effect of creating a data-dependent threshold that incorporates variability in a more flexible way than by use of the interquartile range such as in (3.6) and (3.7).

4. NUMERICAL EXAMPLES

4.1. Simulation studies

To assess the performance of the methodology, we first conducted some simulation studies. In particular, we generated gene expression measurements for 1000 genes and allowed for 50 genes to have a differential expression pattern different between 2 groups, each with n = 20 samples. We considered differential expression in k = 5, 10, and 15 samples. For each simulation scenario, 100 data sets were generated. We took the baseline distribution of gene expression to be exponential with mean 1 and the differential expression to be exponential with mean 2. We compared the performance of the proposed methodology to the methods discussed in Section 3.2 using receiver operating characteristic (ROC) curves, averaged over the simulations. In terms of performance, ROC curves close to the diagonal indicate poor performance, while those closer to the upper left-hand corner indicate better performance. For the Bonferroni and BH methods, we took α = 0.05. The simulation results are indicated in Figure 1. We did not use the original COPA method of Tomlins and others (2005).

Fig. 1.

Average ROC curves of various outlier detection procedures using first simulation scenario. Solid line indicates OS method of Tibshirani and Hastie (2007). Dotted line indicates percentile-specific method (PPST) of Lyons-Weiler and others (2004). Dashed ...

Fig. 2.

Average ROC curves of various outlier detection procedures in second simulation scenario. See Figure 1 for methods corresponding to different symbols.

Based on the curves, we find that for small values of the false-positive rate, the proposed methodology using the BH procedure performs the best among all methods, while that using the OS method performs the worst. One point of note is that the PPST method (3.8), which has not been previously explored in the literature, tends to perform better than the OS and the outlier robust t-statistic and actually does better as k increases. The proposed methods are always competitive in these situations.

We next performed a simulation that mimicked a setup used by Wu (2007). We took the baseline distribution of genes to be standard normal with mean zero and variance one; in a fraction of samples, 50 genes had a normal distribution with mean 2 and variance 1. The simulation results are given in Figure 2. For small k, the BH procedure tends to perform the best, while for larger k, the PPST and proposed Bonferroni methods tend to perform much better.

4.2. Prostate cancer data set

We now apply the proposed methodology to data from a gene expression study in prostate cancer. There is a total of 101 samples in the study: 22 noncancerous samples and 79 cancerous samples; the samples were profiled using 2-color (red/green) microarrays. There were a total of 9984 genes on the original microarray; the following preprocessing steps were applied before using the methodology:

  1. Genes with more than 50% missing values across all samples were removed from the study.
  2. Missing values were imputed using a nearest neighbors algorithm (Troyanskaya and others, 2001), where the number of nearest neighbors is set to 10.

This left a total of 9272 genes for analysis.

As discussed in Section 2, one of the advantages of the proposed methodology here is that we can determine which are specific samples that show evidence of outlying expression with respect to a particular gene. As an example, we take ERG (v-ets erythroblastosis virus E26 oncogene homolog), which was found to be part of a gene fusion product that appears to be quite common in prostate cancer (Tomlins and others, 2005). If we apply the methods from Section 2 to ERG, we find that there are 40 samples that show evidence of outlying expression in the cancerous samples relative to the noncancerous samples using the Bonferroni method with α = 0.05. We get the same answer using the BH procedure with α = 0.05; in fact, the set of samples called outliers is the same using either method. When we apply the method, switching the 2 groups of samples, there are no samples in the noncancerous group that show evidence of outlying expression relative to the cancerous samples.

Next, we applied the methods from Section 3 in order to do a more global search of genes that show evidence for outlying expression. Here, we only focus on genes that show evidence of overexpression in the cancerous samples relative to the noncancerous samples. A comparison of correlation between the ranks of the genes based on the outlier score methods is given in Table 2. Based on Table 2, we find that the proposed methods give highly concordant results. There is less concordance with the other 3 methods.

Table 2.

Correlation between outlier scores using methods in Section 3

Next, we performed the q-value analysis of Storey and Tibshirani (2003). This was performed after calculating the permutation distribution using 10 000 samples. Interestingly, while the estimated π0 was one using the PPST, ORT, and OS methods, it was 0.19 for the Bonferroni procedure and 0.29 for the BH procedure. This leads to many more genes being called significantly differentially expressed using the latter 2 procedures versus the existing methods at any q-value cutoff.

5. DISCUSSION

In this article, we have placed a very formal statistical framework for outlier detection using genomic data. By formulating the problem using mixture models, we are able to clarify what hypotheses can be tested. Doing this also allows us to clarify the statistical contributions of previous work on this subject.

Another theme in this work is the relative utility of nonparametric methods. While much of the previous literature on outlier detection has used modified t-statistics, the empirical cdf-based approach proposed here tends to give very good performance in the simulation settings considered. While the t-statistic methods will be powerful in cases where the data are Gaussian, they will be less so in non-Gaussian settings. By contrast, the performance of the proposed nonparametric methods will be more robust to the choice of the data-generating mechanism.

One of the other facts noted by Tomlins and others (2005) was that there was a particular expression pattern to the ERG–ETV1 gene pair. In a fraction of samples, one of these genes would be overexpressed, while the other would not show any expression. This also suggests another type of gene expression pattern to search for; it is bivariate in nature. While a threshold-based method for assessing significance was proposed by MacDonald and Ghosh (2006), it would be desirable to extend the approach here to that problem as well. This is currently under investigation.

FUNDING

The Huck Institute of Life Sciences and the National Institutes of Health (R01-GM72007).

Acknowledgments

Conflict of Interest: None declared.

References

  • Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B. 1995;57:289–300.
  • Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica. 2002;12:111–140.
  • Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96:1151–1160.
  • Genovese CR, Wasserman L. A stochastic process approach to false discovery control. Annals of Statistics. 2004;32:1035–1061.
  • Lyons-Weiler J, Patel S, Becich MJ, Godfrey TE. Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics. 2004;125:110. [PMC free article] [PubMed]
  • MacDonald JW, Ghosh D. COPA–cancer outlier profile analysis. Bioinformatics. 2006;22:2950–2951. [PubMed]
  • Schena M. Microarray Biochip Technology. Sunnyvale, CA: Eaton; 2000.
  • Shaffer J. Multiple hypothesis testing. Annual Reviews of Psychology. 1995;46:561–584.
  • Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America. 2003;100:9440–9445. [PubMed]
  • Tibshirani R, Hastie T. Outlier sums for differential gene expression analysis. Biostatistics. 2007;8:2–8. [PubMed]
  • Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R. and others. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science. 2005;310:644–648. [PubMed]
  • Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17:520–525. [PubMed]
  • van der Vaart A, Wellner J. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
  • Wu B. Cancer outlier differential gene expression detection. Biostatistics. 2007;8:566–575. [PubMed]

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press